London-based startup Stability AI announced on August 22 the public release of Stable Diffusion, a text-to-image model similar to OpenAI's DALL-E 2 or Google's Imagen. The open-source model is the result of a collaboration between Stability AI, RunwayML, the Machine Vision & Learning research group at LMU Munich (formerly the CompVis lab at Heidelberg University), EleutherAI, and LAION.
Prior to this public release, Stability AI had announced on August 10 that Stable Diffusion would be made available to about 1,000 researchers; the large-scale model had previously been tested by more than 10,000 beta testers via a Discord server.
The Stable Diffusion system
The model itself builds on the latent diffusion work of the CompVis and Runway teams, combined with insights from the conditional diffusion models of Katherine Crowson, lead generative AI developer at Stability AI, as well as OpenAI's DALL-E 2, Google Brain's Imagen, and other models.
The model was trained on LAION-Aesthetics, a subset of LAION-5B created with a new CLIP-based model that filters LAION-5B by the predicted "beauty" of each image, using ratings collected from Stable Diffusion's alpha testers. Training ran during June of this year on the 4,000-A100 Ezra-1 AI ultracluster, and this model will be the first in a series exploring this and other approaches.
Stable Diffusion can generate 512×512-pixel images in seconds, using about 6.9 GB of VRAM on consumer GPUs.
Stability AI worked with Hugging Face's legal, ethics, and technology teams to release the model under the Creative ML OpenRAIL-M license, a permissive license focused on ethical use that allows both commercial and non-commercial applications.
Users who register with the DreamStudio interface receive 200 free credits; after that, generations cost about 1 euro per 100 images.
LAION-Aesthetics, a subset of LAION-5B
LAION (Large-scale Artificial Intelligence Open Network) is a non-profit organization with a mission to make machine learning models, datasets, and code available to the public. It built LAION-5B, a CLIP-filtered dataset of 5.85 billion image-text pairs, 14 times larger than LAION-400M, previously the largest openly accessible image-text dataset in the world.
The organization then created LAION-Aesthetics, which combines several subsets of LAION-5B and was used to train Stable Diffusion.
Use for pornographic purposes
LAION's disclaimer for the LAION-5B dataset points out that the data was crawled from the Internet and is uncurated; the organization advises using the demo links with caution and cannot entirely rule out that harmful content remains even in safe mode.
Its predecessor, LAION-400M, was known to contain pornographic and racist text and images. Stability AI therefore developed a safety classifier, included and enabled by default in the Stable Diffusion package, to detect and block offensive or unwanted images; users can, however, disable it.
The model leaked on the Internet before its public release and was quickly put to use, notably on the discussion forum 4chan, to generate images of nude celebrities and pornographic scenes.
DALL-E 2 includes a filter that prevents it from generating images of public figures; Stable Diffusion has no such restriction and could therefore be used to create deepfakes.