VQGAN + CLIP

Published: 10/14/2021

This project uses a fork of VQGAN+CLIP by Katherine Crowson, available here. This fork allows users to generate images and video using sequential text prompts.

If you're interested in exploring VQGAN+CLIP, there are some excellent tutorials available, like this one.

I started with an excerpt from one of my favorite books, The Hobbit. I also used the book's word count, 95356, as the initial numeric seed.

Far over the misty mountains cold || To dungeons deep and caverns old || We must away ere break of day || To seek the pale enchanted gold. || The dwarves of yore made mighty spells, || While hammers fell like ringing bells || In places deep, where dark things sleep, || In hollow halls beneath the fells. || For ancient king and elvish lord || There many a gleaming golden hoard || They shaped and wrought, and light they caught || To hide in gems on hilt of sword. || On silver necklaces they strung || The flowering stars, on crowns they hung || The dragon-fire, in twisted wire || They meshed the light of moon and sun. || Far over the misty mountains cold || To dungeons deep and caverns old || We must away, ere break of day, || To claim our long-forgotten gold. || Goblets they carved there for themselves || And harps of gold; where no man delves || There lay they long, and many a song || Was sung unheard by men or elves. || The pines were roaring on the height, || The winds were moaning in the night. || The fire was red, it flaming spread; || The trees like torches blazed with light. || The bells were ringing in the dale || And men looked up with faces pale; || The dragon's ire more fierce than fire || Laid low their towers and houses frail. || The mountain smoked beneath the moon; || The dwarves, they heard the tramp of doom. || They fled their hall to dying fall || Beneath his feet, beneath the moon. || Far over the misty mountains grim || To dungeons deep and caverns dim || We must away, ere break of day, || To win our harps and gold from him!

The || serves as a delimiter in the sequence of prompts. Each || resets the model, using the final image from the previous phase as the input for the next. The S-FLCKR model provides the initial basis, and the output is 512x512 pixels, with 72 images generated per prompt.
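For anyone curious how a delimited string like this turns into a sequence of prompts, here is a minimal sketch in Python. It only illustrates the splitting step; `split_prompts` is a hypothetical helper I made up for this post, not a function from the fork, and the fork may well parse its input differently.

```python
# Illustrative only: split a "||"-delimited string into sequential prompts.
# `split_prompts` is a hypothetical helper, not part of the VQGAN+CLIP fork.
def split_prompts(text, delimiter="||"):
    """Return the non-empty prompts in order, with surrounding whitespace removed."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]


# The first four lines of the excerpt, in the same format as above.
excerpt = (
    "Far over the misty mountains cold || To dungeons deep and caverns old || "
    "We must away ere break of day || To seek the pale enchanted gold."
)

prompts = split_prompts(excerpt)
for index, prompt in enumerate(prompts, start=1):
    print(index, prompt)
```

Each entry in `prompts` corresponds to one phase of generation, run in order.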

With my current configuration and tier of Google Colab, each prompt takes ~6 minutes to process.
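A quick back-of-the-envelope estimate, assuming the numbers above hold steady for the whole run: the excerpt splits into 40 prompts (10 verses of 4 lines each), at 72 images and roughly 6 minutes per prompt. The function name below is just for illustration.

```python
# Rough estimate of the full run, using the figures quoted in this post.
# `estimate_run` is a hypothetical helper; the timing is approximate.
IMAGES_PER_PROMPT = 72
MINUTES_PER_PROMPT = 6  # approximate, on my Colab tier


def estimate_run(prompt_count,
                 images_per_prompt=IMAGES_PER_PROMPT,
                 minutes_per_prompt=MINUTES_PER_PROMPT):
    """Return (total_images, total_minutes) for a sequence of prompts."""
    total_images = prompt_count * images_per_prompt
    total_minutes = prompt_count * minutes_per_prompt
    return total_images, total_minutes


images, minutes = estimate_run(40)
print(f"{images} images in about {minutes / 60:.1f} hours")
# prints "2880 images in about 4.0 hours"
```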

Initial Output

Far over the misty mountains cold


This was a lovely surprise. Not sure what I expected, but still, these blew me away.

And then it got weird.

The Second Prompt

To dungeons deep and caverns old


The Third Prompt

We must away ere break of day


The Fourth Prompt

To seek the pale enchanted gold


Is that a cat? I think that's a cat. We got 4 prompts in before the damn thing dreamed up a cat.

And that's just the tip of the iceberg. Things were off the rails at this point, and I was curious to see what the end result would be. It took roughly 4 hours, when all was said and done. The entire result is below. Enjoy!

The Result
