VQGAN + CLIP

This post was published 2021/10/14 & last updated 2023/11/03

ai, gans, machinelearning

This project uses a fork of VQGAN+CLIP by Katherine Crowson. This fork allows users to generate images and video using sequential text prompts.

If you're interested in exploring VQGAN+CLIP, there are some excellent tutorials available, like this one.


I started with an excerpt from one of my favorite books, The Hobbit. I also used the book's word count, 95356, as the initial numeric seed.

Far over the misty mountains cold || To dungeons deep and caverns old || We must away ere break of day || To seek the pale enchanted gold. || The dwarves of yore made mighty spells, || While hammers fell like ringing bells || In places deep, where dark things sleep, || In hollow halls beneath the fells. || For ancient king and elvish lord || There many a gleaming golden hoard || They shaped and wrought, and light they caught || To hide in gems on hilt of sword. || On silver necklaces they strung || The flowering stars, on crowns they hung || The dragon-fire, in twisted wire || They meshed the light of moon and sun. || Far over the misty mountains cold || To dungeons deep and caverns old || We must away, ere break of day, || To claim our long-forgotten gold. || Goblets they carved there for themselves || And harps of gold; where no man delves || There lay they long, and many a song || Was sung unheard by men or elves. || The pines were roaring on the height, || The winds were moaning in the night. || The fire was red, it flaming spread; || The trees like torches blazed with light. || The bells were ringing in the dale || And men looked up with faces pale; || The dragons ire more fierce than fire || Laid low their towers and houses frail. || The mountain smoked beneath the moon; || The dwarves, they heard the tramp of doom. || They fled their hall to dying fall || Beneath his feet, beneath the moon. || Far over the misty mountains grim || To dungeons deep and caverns dim || We must away, ere break of day, || To win our harps and gold from him!

The || serves as a delimiter in the sequence of prompts. At each ||, the model resets and takes the final image from the previous phase as its input. The S-FLCKR model provides the initial basis, and the output is 512x512, with 72 images generated per prompt.
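To make the delimiter behavior concrete, here is a minimal sketch of how a prompt string could be split into phases. This is not the fork's actual code: the `Phase` class and `plan_phases` function are hypothetical names I made up for illustration, and the only values taken from the run above are the || delimiter and the 72 images per prompt.

```python
from dataclasses import dataclass

FRAMES_PER_PROMPT = 72  # images generated per prompt in this run


@dataclass
class Phase:
    prompt: str
    first_frame: int
    last_frame: int
    starts_from_previous: bool  # True when seeded by the prior phase's final image


def plan_phases(text, delimiter="||", frames_per_prompt=FRAMES_PER_PROMPT):
    """Split the text on the delimiter and assign each prompt a frame range."""
    prompts = [part.strip() for part in text.split(delimiter) if part.strip()]
    phases = []
    for index, prompt in enumerate(prompts):
        first = index * frames_per_prompt
        last = first + frames_per_prompt - 1
        # Only the first phase starts from the model's initial basis;
        # every later phase continues from the previous phase's last image.
        phases.append(Phase(prompt, first, last, index > 0))
    return phases


excerpt = (
    "Far over the misty mountains cold || To dungeons deep and caverns old "
    "|| We must away ere break of day || To seek the pale enchanted gold."
)

phases = plan_phases(excerpt)
for phase in phases:
    print(phase.first_frame, phase.last_frame, phase.prompt)
```

Run on the full excerpt instead of just the first stanza, the same split yields 40 prompts, one per line of the song.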

With the current configuration and tier of Google Colab, each prompt takes ~6 minutes to process.

Things were off the rails at this point, and I was curious to see what the end result would be. It took roughly 4 hours, all said and done.
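As a rough sanity check on that total, here is a back-of-the-envelope sketch. It assumes the excerpt splits into 40 prompts (ten four-line stanzas, one prompt per line) and treats the ~6 minutes per prompt as exact, which it isn't. The variable names are only illustrative.

```python
PROMPT_COUNT = 40        # ten four-line stanzas, one prompt per line
MINUTES_PER_PROMPT = 6   # approximate figure on Google Colab
FRAMES_PER_PROMPT = 72   # images generated per prompt

total_minutes = PROMPT_COUNT * MINUTES_PER_PROMPT
total_frames = PROMPT_COUNT * FRAMES_PER_PROMPT

print(f"{total_minutes / 60:.1f} hours, {total_frames} frames")
# prints "4.0 hours, 2880 frames"
```

That lines up with the roughly 4 hours the run actually took.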

https://vimeo.com/632591202
