Stable Diffusion Case Study
Author: Borbála Kökény <borikokeny>

Introduction
Stable Diffusion is an open-source, Python-based text-to-image generator: a deep learning system and a deep generative artificial neural network built on latent diffusion techniques. It was released by the CompVis Group, Runway and Stability AI in 2022. It was trained mainly on the LAION dataset, using PyTorch Lightning, with around 5 billion image-caption pairs. The model is available via an API provided by Fireworks AI. It has 10 million global users and has created more than 400 million images since its launch. Its community of 240,000 members meets on Discord.
Stable Diffusion is primarily used to generate detailed images conditioned on text descriptions, while it is also capable of other tasks such as inpainting, outpainting, and image-to-image translation guided by a text prompt.
Brief History
- 2019: Stability AI’s mission begins
- 2022: release of Stable Diffusion 1.1
- 2022: release of DreamStudio
- 2023: release of SDXL and SDXL Turbo
- 2023: release of SDXL-refiner-1.0
- 2024: early preview of Stable Diffusion 3
- 2024: release of Stable Video 3D
Main Features
Stable Diffusion is primarily used to generate detailed images conditioned on text descriptions, and its main architectural components include a variational autoencoder, forward and reverse diffusion, a noise predictor, and text conditioning.
| Feature | Description |
| --- | --- |
| SD XL | Released with a larger UNet backbone, a larger cross-attention context, two text encoders instead of one, and trained on multiple aspect ratios. Generates at 1024x1024 resolution with improved rendering of limbs and text. |
| SDXL Turbo | A distilled version of SDXL 1.0, designed for rapid generation of 512x512-pixel images. It uses the same text conditioning models as SDXL 1.0 and has 3.5 billion parameters. SDXL Turbo can produce images in just a single step. |
| SDXL Refiner | An advanced refinement tool designed for use in text-to-image generative models. It plays a critical role in enhancing the quality of the images these models generate. |
| Stable Diffusion 3 | Works with a Rectified Flow Transformer, in which the text encoding affects the image encoding and vice versa. |
| img2img | This script generates a new image based on an original image that also features elements provided in the text prompt. |
| Negative prompt | Allows the user to specify prompts that the model should avoid during image generation. |
| Open source | Repository: https://github.com/Stability-AI/generative-models |
| Easy to access | It can run on most consumer hardware equipped with a GPU with at least 4 GB of VRAM. |
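To make the architectural components listed above concrete, here is a minimal, illustrative sketch of a latent-diffusion sampling loop in Python. The tiny modules are stand-ins invented for this example rather than the real Stable Diffusion networks, and the denoising update is deliberately crude; the structure simply mirrors the named components: a text conditioner, a noise predictor in the UNet role, and a VAE decoder mapping latents back to pixels.

```python
# Illustrative stand-ins for the components of a latent diffusion model.
# None of these are the real Stable Diffusion networks; shapes and the
# update rule are simplified to show the overall flow only.
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Plays the role of the CLIP text encoder(s)."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, token_ids):
        return self.embed(token_ids).mean(dim=1)          # (batch, dim)

class NoisePredictor(nn.Module):
    """Plays the role of the UNet backbone (the noise predictor)."""
    def __init__(self, latent_dim=16, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 128),
                                 nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, z, t, cond):
        t_feat = torch.full((z.shape[0], 1), float(t))     # timestep feature
        return self.net(torch.cat([z, cond, t_feat], dim=-1))

class VAEDecoder(nn.Module):
    """Plays the role of the variational autoencoder's decoder."""
    def __init__(self, latent_dim=16, image_dim=3 * 8 * 8):
        super().__init__()
        self.net = nn.Linear(latent_dim, image_dim)
    def forward(self, z):
        return self.net(z).view(-1, 3, 8, 8)

def sample(prompt_tokens, steps=50, latent_dim=16):
    cond = TextConditioner()(prompt_tokens)               # text conditioning
    z = torch.randn(prompt_tokens.shape[0], latent_dim)   # start from pure noise
    eps_model, decoder = NoisePredictor(), VAEDecoder()
    for t in reversed(range(steps)):                      # reverse diffusion
        eps = eps_model(z, t, cond)                       # predict the noise
        z = z - eps / steps                               # crude denoising step
    return decoder(z)                                     # latents -> pixels

print(sample(torch.randint(0, 1000, (1, 6))).shape)       # torch.Size([1, 3, 8, 8])
```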
Market Comparison
There is huge interest in AI image generation nowadays, and there are many competitors on the market. The best and biggest providers are Image Creator from Microsoft Designer, DALL·E 3, ImageFX by Google, Dream by WOMBO, Craiyon, Midjourney, Adobe Firefly, Generative AI by Getty Images and, of course, DreamStudio by Stability AI.
All of them work with proper training and at reasonable speed; the differences lie mostly in price, accessibility and features. Four of the nine, ImageFX, Image Creator, Craiyon and Adobe Firefly, offer a completely free service; all the others require credits or a subscription.
DALL·E 3 is the only one with a mobile app, though only in the Apple App Store, and Stable Diffusion stands apart as the only AI image generation model that is fully open source.
DALL·E 3 produces a much better picture than Midjourney or Stable Diffusion when there are more than two characters, though its output tends to look more abstract or computer-generated. Midjourney, on the other hand, is a tool best known for its artistic style; its results look more like a painting than a photograph.
Stable Diffusion has a relatively good understanding of contemporary artistic illustration and can produce highly detailed artwork; it is excellent for creative, photorealistic illustrations but falls short when creating general images such as logos, faces or limbs. While its open-source nature is a great advantage, accessibility for individual developers can also be a problem when running models on consumer hardware, as the training process for e.g. waifu-diffusion requires a minimum of 30 GB of VRAM.
Getting Started
There are several ways to access the Stable Diffusion models.
If you are interested in its open-source code, you can download the latest version of Stable Diffusion, install it on your own computer, and even train it on your own data via the https://stability.ai/stable-image page, where you can choose between Stable Diffusion 3, SDXL Turbo and Stable Diffusion XL. The page provides a Stability API key, which can be used both on the Google Colab GitHub page and on Fireworks AI, which powers the hosted model. On Fireworks, the user can give prompts for both text-to-image and image-to-image creation. There is also an option for negative prompts, which allows users to indicate which terms should not appear in the image. On https://github.com/Stability-AI/generative-models, example training configs are provided.
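As a concrete illustration of using the hosted model with a Stability API key, the sketch below sends a text-to-image request with a positive and a negative prompt. It reflects typical REST usage rather than anything prescribed by this case study; the exact endpoint path, engine id and field names are assumptions and should be verified against the official Stability AI documentation.

```python
# Hedged sketch of a text-to-image request against the hosted Stability API.
# Endpoint path, engine id and field names are assumptions; check the docs.
import base64
import os
import requests

API_KEY = os.environ["STABILITY_API_KEY"]   # key obtained from the account page
URL = ("https://api.stability.ai/v1/generation/"
       "stable-diffusion-xl-1024-v1-0/text-to-image")

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"},
    json={
        "text_prompts": [
            {"text": "a lighthouse on a cliff at sunset", "weight": 1.0},
            {"text": "people, boats", "weight": -1.0},    # negative prompt
        ],
        "width": 1024,
        "height": 1024,
        "samples": 1,
    },
    timeout=120,
)
response.raise_for_status()

# The response carries base64-encoded image artifacts; decode and save them.
for i, artifact in enumerate(response.json()["artifacts"]):
    with open(f"txt2img_{i}.png", "wb") as f:
        f.write(base64.b64decode(artifact["base64"]))
```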
You can also use Stability AI's own application DreamStudio (https://beta.dreamstudio.ai/generate) or, in a more basic form, Clipdrop (https://clipdrop.co/). While DreamStudio offers only the image generation features with some adjustment options, Clipdrop offers a wide range of image manipulation tools with its own pricing; Stability AI users can spend their credits on both.
After depleting the initial 25 free credits on Stability AI, additional credits can be purchased via the account page. Credits are priced at $10 per 1,000 credits, which is enough for roughly 5,000 SDXL 1.0 images.
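For a rough sense of what those numbers imply per image, a small illustrative calculation based only on the figures quoted above:

```python
# Implied per-image cost, derived from the quoted pricing.
dollars_per_credit = 10 / 1_000            # $10 buys 1,000 credits
credits_per_image = 1_000 / 5_000          # 1,000 credits ~ 5,000 SDXL 1.0 images
print(f"{credits_per_image:.1f} credits per image")            # 0.2
print(f"${credits_per_image * dollars_per_credit:.3f} each")   # $0.002
```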
As Stable Diffusion became a popular model, more and more applications built on it have appeared. To mention just a few: DiffusionBee, NovelAI, NightCafé, Charl-e, Draw Things, Lensa, Sinkin AI and many more.
How it actually works: you write a command in the prompt field, describing the image you want as accurately as you need. Descriptive words at the start of the prompt carry more weight during generation than those at the end. There is a negative prompt field, where you can specify which elements should not appear in the picture. Not only the elements of the image can be requested but also the style, such as photographic, anime, pixel art, origami, etc. The aspect ratio and the number of generated images per prompt can also be adjusted.
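The same workflow can be reproduced locally. Below is a hedged sketch using Hugging Face's `diffusers` library, which is one common way to run the open weights but is not prescribed by this case study; the model id and parameter values are illustrative assumptions.

```python
# Local text-to-image generation with the public SDXL weights via diffusers.
# Requires a GPU; model id and settings are illustrative assumptions.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

images = pipe(
    prompt="origami crane on a wooden desk, soft morning light, photographic",
    negative_prompt="blurry, text, watermark",   # elements to keep out of the image
    width=1024, height=1024,                     # aspect ratio of the output
    num_images_per_prompt=2,                     # number of images per prompt
    num_inference_steps=30,
).images

for i, img in enumerate(images):
    img.save(f"result_{i}.png")
```

Note that words placed early in the prompt string dominate the result, mirroring the behaviour described above for the web interfaces.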
Conclusion
While it is great fun to generate images with AI through prompts, with or without any drawing or photography skills, the general disadvantages and limitations of this story must also be discussed. As the model was trained on images with English descriptions, “the generated images reinforce social biases and are from a western perspective, as the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts that are written in English in comparison to those written in other languages, with western or white cultures often being the default representation” (wiki).
Stable Diffusion and many of its competitors were trained on images that were not filtered by human input, which means that large amounts of private and sensitive information appear in the training data. The training also infringed the rights of millions of artists, since the AI tools were trained on five billion images scraped from the web without the consent of the original artists.
With the development of AI products, it will become more and more difficult to tell original images from AI-generated ones, which may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors.
References
- Stability AI about its own product
- Wikipedia about Stable Diffusion
- Intro to Latent Diffusion Models by deeplizard (https://www.youtube.com/watch?v=IxHXQpk5kkg)
- Stable Diffusion & Midjourney comparison
- Stable Diffusion & DALL·E comparison
- Stable Diffusion & DALL·E & Midjourney comparison
- Best AI image generators