Ship made of flowers sailing on clouds
These images were created by the AI code | Code model name: VQGAN-CLIP
The Colorful World of AI Art
Generative artificial intelligence algorithms have produced striking results in visual art. Several AI-based image generation tools now let anyone experiment with image generation by writing a description of the image they want. Tools like Midjourney, DALL-E 2, DreamStudio, NightCafe, and DALL-E mini have gained mainstream popularity, and it’s now common to see people posting AI-generated art on social media. As people grow more curious about what AI will generate, they are designing increasingly creative prompts, making AI-generated art a new art form.
Burning and Melting Ice Castle at Night
These images were created by the AI code | Code model name: VQGAN-CLIP
How AI Uses Text to Generate Art
The idea behind generating art from text is to use a generative model that produces images while being conditioned on a representation of the text. Commonly used generative models include Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), and Diffusion models. A GAN pairs a Generator neural network with a Discriminator. The Generator learns to produce images that the Discriminator could mistake for real examples, while the Discriminator simultaneously learns to tell real images from generated ones. A VAE learns to encode an image into a latent space representation from which its Decoder can reconstruct the original image well. New images are generated by sampling from the latent space distribution and decoding the sample. Diffusion models are inspired by non-equilibrium thermodynamics: a forward process gradually adds noise to an image until little of the original remains, and a model is trained to reverse that process. Denoising Diffusion models then generate images by starting from a noisy representation and iteratively removing noise, filling in the details step by step.
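To make the noise-adding half of a Diffusion model concrete, here is a minimal sketch of the forward process in plain Python. It assumes a DDPM-style linear noise schedule and treats a list of four numbers as a toy "image". The function names and schedule values are illustrative assumptions, not taken from any of the tools mentioned above.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed DDPM-style values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all steps s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Sample a noisy image: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * p + sigma * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
image = [0.5, -0.25, 1.0, 0.0]                         # toy "image" of four pixel values
slightly_noisy = add_noise(image, alpha_bars[9], rng)  # early step: mostly signal
mostly_noise = add_noise(image, alpha_bars[-1], rng)   # last step: almost pure noise
```

A denoising model is trained to undo these steps. Generation then runs the process in reverse, starting from noise.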
OpenAI’s Contrastive Language-Image Pre-training (CLIP) model is widely used for text-conditioned image synthesis. In CLIP, a Vision Transformer-based image encoder is trained jointly with a text encoder on image-text pairs, so that the two learn a shared multi-modal embedding space in which matching images and text lie closer together. These CLIP embeddings can be used for zero-shot reasoning in new tasks by comparing images with text. CLIP is also used for conditioning and guiding different generative models.
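The zero-shot comparison can be sketched without loading CLIP at all. The snippet below assumes the image and text embeddings have already been computed and stands in hand-written three-dimensional vectors for them. The embeddings, labels, and temperature value are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probabilities(image_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled similarities between one image and many texts."""
    logits = {
        label: cosine_similarity(image_embedding, embedding) / temperature
        for label, embedding in text_embeddings.items()
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Toy stand-ins for CLIP text and image embeddings (assumed values).
text_embeddings = {
    "a photo of a flower": [0.9, 0.1, 0.0],
    "a photo of a ship": [0.1, 0.9, 0.1],
    "a photo of a castle": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

probabilities = zero_shot_probabilities(image_embedding, text_embeddings)
best_label = max(probabilities, key=probabilities.get)
```

With real CLIP embeddings the same comparison picks the caption that best describes an image, even for categories the model was never explicitly trained to classify.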
In the original DALL-E model, a Transformer-based decoder was trained to generate images by modeling text and image tokens as a single stream, with a discrete VAE providing the image tokens. CLIP was then used to re-rank the generated images and choose the best ones. In DALL-E 2, whose architecture is called unCLIP, a prior model produces a CLIP image embedding from the text, and a Diffusion decoder reconstructs an image from that embedding by filling in details. In several models, such as VQGAN-CLIP and CLIP-guided diffusion, CLIP is used to steer generative models like GANs and Diffusion models towards output that matches the given text. This is done by computing the similarity between the CLIP embedding of the text and that of the generated image, and using it as the loss function for optimization.
DreamStudio is based on the Stable Diffusion model, which is a Latent Diffusion Model (LDM). In an LDM, an Auto-Encoder compresses images into a lower-dimensional latent space, which makes generating high-resolution images more practical. A U-Net based Diffusion model works in that latent space and includes a cross-attention conditioning mechanism. The diffusion is conditioned on text embeddings from CLIP.
Google Brain’s Imagen generates photorealistic images. It uses a large Transformer to produce text embeddings, followed by U-Net based Diffusion models and an improved sampling technique named dynamic thresholding.
Steampunk Spaceship in thunder storm
These images were created by the AI code | Code model name: VQGAN-CLIP
Going Beyond Text Prompts and Pre-trained Models Using CLIP
Training text-conditioned image synthesis models is hard and usually requires very large computational resources and long training times. For this article, I designed an approach that generates AI art guided by text as well as by the user’s own scalable collection of art images, without requiring any model training. The system uses the zero-shot reasoning capabilities of CLIP for image search and combines them with the smaller and more flexible VQGAN-CLIP model. VQGAN combines a Convolutional Neural Network (CNN) with a Transformer for generating images, and uses vector quantization (VQ) to learn good image representations. In VQGAN-CLIP, the image produced by VQGAN is altered iteratively, and the similarity between its CLIP embedding and the CLIP embedding of the input text serves as the loss function that guides the process. Images can guide the process in the same way, through the CLIP embeddings of target images. The alteration process can also start from a given initial image instead of noise. The initial image has more influence on the overall structure and content of the output, while the target images have relatively more influence on style and details.
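The guidance loop can be illustrated with a deliberately simplified sketch. In the real system the loss is backpropagated through CLIP and VQGAN to update the VQGAN latent. The toy version below skips both networks and optimizes a three-dimensional "image embedding" directly, using a hand-derived gradient. The vectors, weights, and function names are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def guidance_loss(embedding, targets):
    """Weighted sum of (1 - cosine similarity) over (target, weight) pairs."""
    return sum(w * (1.0 - cosine_similarity(embedding, t)) for t, w in targets)

def guidance_gradient(embedding, targets):
    """Analytic gradient of guidance_loss with respect to the embedding."""
    z_norm = math.sqrt(dot(embedding, embedding))
    grad = [0.0] * len(embedding)
    for target, weight in targets:
        t_norm = math.sqrt(dot(target, target))
        cos = dot(embedding, target) / (z_norm * t_norm)
        for i, (z_i, t_i) in enumerate(zip(embedding, target)):
            d_cos = t_i / (z_norm * t_norm) - cos * z_i / (z_norm ** 2)
            grad[i] -= weight * d_cos  # loss falls as similarity rises
    return grad

def optimize(initial, targets, steps=200, learning_rate=0.1):
    """Plain gradient descent starting from the initial image's embedding."""
    z = list(initial)
    for _ in range(steps):
        grad = guidance_gradient(z, targets)
        z = [z_i - learning_rate * g_i for z_i, g_i in zip(z, grad)]
    return z

initial_image = [1.0, 0.0, 0.0]      # stand-in for the starting image
text_prompt = [0.0, 1.0, 0.0]        # stand-in for the text embedding
target_image = [0.0, 0.6, 0.8]       # stand-in for a guiding image embedding
targets = [(text_prompt, 1.0), (target_image, 0.5)]

result = optimize(initial_image, targets)
```

Because the text carries twice the weight of the target image here, the result ends up closer to the text embedding while still leaning towards the target image.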
The main idea in my approach for creating the images in this article is to use CLIP, together with a set of rules, to search and rank the images in the user’s dataset and find the most suitable ones to guide generation at different steps. Suitability is based on the similarity between the CLIP embedding of an image and that of the input text describing the required output. As a result, the guiding images have content similar to what the output requires, e.g. flower images from the dataset are used while drawing flowers in the output. The user can also organize the image folders by artist and concept, and name them in the input to give them more weight during generation. Images of new concepts and things can also be specified through the folder name. Information is extracted from the user input with a few Natural Language Processing (NLP) steps, and a set of rules is used for sampling images, with CLIP providing the similarity scores. Some randomness is added to the sampling for variety, and embeddings are pre-computed for speed. The models, dataset, and approach used here focus on more abstract art rather than photorealism.
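The search-and-sample step can be sketched roughly as follows. This is not the code behind the images in this article: the folder names, the additive folder boost, the pool size, and the toy embeddings are hypothetical choices, and the pre-computed CLIP embeddings are replaced by short hand-written vectors.

```python
import math
import random
from pathlib import PurePosixPath

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_images(text_embedding, image_embeddings, preferred_folders=(), folder_boost=0.1):
    """Score every image against the text and boost folders named by the user."""
    scored = []
    for path, embedding in image_embeddings.items():
        score = cosine_similarity(text_embedding, embedding)
        if PurePosixPath(path).parent.name in preferred_folders:
            score += folder_boost  # assumed rule: flat bonus for requested folders
        scored.append((score, path))
    return sorted(scored, reverse=True)

def sample_guides(ranked, count, pool_size, rng):
    """Pick guide images at random from the top of the ranking, for variety."""
    pool = [path for _, path in ranked[:pool_size]]
    return rng.sample(pool, min(count, len(pool)))

# Pre-computed embeddings, keyed by "folder/file" (toy values).
image_embeddings = {
    "monet/water_lilies.jpg": [0.9, 0.1, 0.1],
    "monet/bridge.jpg": [0.2, 0.9, 0.1],
    "photos/rose.jpg": [0.8, 0.2, 0.0],
    "photos/harbor.jpg": [0.1, 0.3, 0.9],
}
flowers_text = [1.0, 0.1, 0.0]  # stand-in for the embedding of a flower prompt

ranked = rank_images(flowers_text, image_embeddings)
ranked_with_preference = rank_images(flowers_text, image_embeddings, preferred_folders={"photos"})
guides = sample_guides(ranked, count=1, pool_size=2, rng=random.Random(0))
```

Without a folder preference the water lilies image ranks first. Naming the `photos` folder in the input moves the rose photo to the top instead.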
Trees of gems on a bridge
These images were created by the AI code | Code model name: VQGAN-CLIP
Adding Some More Style and Making it Big with NST and Super-Resolution
Neural Style Transfer (NST) applies the style of one image to the content of another. In NST, the content and style of images are extracted with a CNN and combined through an optimization process based on a content loss and a style loss. In the algorithm used for this article, one image is chosen as the initial image, while several images are chosen as target images and as style images. The style images are used to apply NST to the output generated by VQGAN-CLIP. Some of the target images’ style is already captured during generation, but NST adds style more effectively. The image produced after NST has a fairly low resolution, especially if a powerful GPU is not available, so Image Super-Resolution (ISR) was used to upscale it by a factor of 3. In ISR, the upscaling is done with CNN-based interpolation, which gives higher quality than simple interpolation such as linear or cubic. An Enhanced Deep Residual Network (EDSR), which modifies ResNet for upscaling, is used for ISR.
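The style loss in NST is usually built on Gram matrices of CNN feature maps, following Gatys et al. The sketch below assumes the feature maps have already been extracted and represents each channel as a flat list of activations. The numbers are toy values, and the normalization follows the formula in that paper.

```python
def gram_matrix(features):
    """Channel-by-channel correlations: G[i][j] = sum over positions of F[i] * F[j]."""
    return [
        [sum(a * b for a, b in zip(channel_i, channel_j)) for channel_j in features]
        for channel_i in features
    ]

def style_loss(generated, style):
    """Squared Gram matrix difference, scaled by 1 / (4 * N^2 * M^2)."""
    num_channels = len(generated)
    num_positions = len(generated[0])
    gram_generated = gram_matrix(generated)
    gram_style = gram_matrix(style)
    total = sum(
        (g - s) ** 2
        for row_g, row_s in zip(gram_generated, gram_style)
        for g, s in zip(row_g, row_s)
    )
    return total / (4.0 * num_channels ** 2 * num_positions ** 2)

def content_loss(generated, content):
    """Half the summed squared difference between feature maps."""
    return 0.5 * sum(
        (g - c) ** 2
        for row_g, row_c in zip(generated, content)
        for g, c in zip(row_g, row_c)
    )

# Two channels, four spatial positions each (toy activations).
features = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
shuffled = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0]]   # same values, positions reversed
different = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0]]
```

Reordering the spatial positions leaves the Gram matrix unchanged, which is why this loss captures texture and color statistics rather than layout. The content loss, by contrast, compares feature maps position by position.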
So there’s a lot to look forward to when we talk about AI becoming an artist. And the more we learn, the more AI will learn.
References:
[1] OpenAI. “CLIP: Connecting Text and Images”. 2021.
[2] K. Crowson, S. Biderman, et al. “VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance”. 2022.
[3] L. Gatys, A. Ecker, M. Bethge. “A Neural Algorithm of Artistic Style”. Journal of Vision, 2016.
[4] B. Lim, S. Son, et al. “Enhanced Deep Residual Networks for Single Image Super-Resolution”. CVPRW 2017.
[5] A. Ramesh, P. Dhariwal, A. Nichol, et al. “Hierarchical Text-Conditional Image Generation with CLIP Latents”. 2022.
[6] OpenAI. “DALL·E: Creating Images from Text”. 2021.
[7] P. Esser, R. Rombach, B. Ommer. “Taming Transformers for High-Resolution Image Synthesis”. 2021.
[8] R. Rombach, A. Blattmann, et al. “High-Resolution Image Synthesis with Latent Diffusion Models”. CVPR 2022.
[9] C. Saharia, W. Chan, et al. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”. NeurIPS 2022.