Andrew Gallimore
[This document is the text-only version of the original slide presentation Andrew created as part of a Cal Poly Humboldt Library Makerspace internship. To view the slides with visual aids, visit the full slide deck for Learn how AI image generation works!]
You will learn how AI image generation works.
Generate this image in your head: "Draw an image of an astronaut riding a horse on Mars."
What might you come up with?
Awesome, but how did you envision it? There has never been an astronaut riding a horse on Mars to use as a reference.
For human brains: we have learned what an astronaut looks like and what someone riding a horse looks like, and we simply place them on Mars.
An AI image model does something similar. Rather than searching the web for similar images, or stitching together an astronaut + a horse + Mars, it has learned what each word means and how to draw it.
Under the hood, a mathematical function translates the text of a prompt into somewhere around 21,000 semantically meaningful numbers, so that each word or bit of text is assigned numbers that indicate its meaning. The figure of 21,000 refers to the number of contextual/semantic values used by GPTs (Generative Pre-trained Transformers) such as GPT-3, and, as far as I know, something similar is likely used by Stable Diffusion XL 1.0 base as well.
Each semantic value can be thought of as an attribute, style, or object. The model refers back to these 21,000 values over and over as it iteratively builds your image (more on that later).
A good example appears when you graph the values of words: the model may assign values that place "aunt" and "uncle" opposite each other, and do the same for "male" and "female." At the same time, it can assign values that tie "aunt" to "female" and "uncle" to "male," as sketched below.
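If "semantically meaningful numbers" sounds abstract, here is a tiny made-up sketch in Python. It uses only three invented numbers per word (a real text encoder uses thousands), and it is not the actual Stable Diffusion text encoder, but it shows how relationships like aunt/uncle and female/male can show up as simple arithmetic on those numbers:

    # A toy illustration, NOT the real Stable Diffusion text encoder.
    # Each word gets three made-up "semantic" numbers: [gender, generation, family-relation].
    import numpy as np

    words = {
        "male":   np.array([ 1.0, 0.0, 0.0]),
        "female": np.array([-1.0, 0.0, 0.0]),
        "uncle":  np.array([ 1.0, 1.0, 1.0]),
        "aunt":   np.array([-1.0, 1.0, 1.0]),
    }

    # "aunt" and "uncle" differ only along the gender direction...
    print(words["aunt"] - words["uncle"])                    # [-2.  0.  0.]
    # ...which is exactly the direction separating "female" from "male".
    print(words["female"] - words["male"])                   # [-2.  0.  0.]

    # So shifting "aunt" by the female-to-male direction lands on "uncle".
    print(words["aunt"] - words["female"] + words["male"])   # [ 1.  1.  1.]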
Try it out! Use this interactive AI image generation webpage to put in a prompt and see what an AI model called Stable Diffusion creates! Stable Diffusion can generate a wide variety of image types.
What is Stable Diffusion? Simply put, it is a computer model/program that generates images from random noise, guided by a text prompt (it's the one you used earlier). Built and trained by computer scientists, the model and its code are free for anyone to use.
At the beginning of the process (step 1), we see an image of pure noise: a set of pixels in a seemingly random arrangement of colors that, taken together, looks somewhat gray. Steps 2 and 3 show noise being gradually removed. By step 4, the final image appears: a photorealistic picture of a man wearing a billed hat.
A "classified image" is simply an image with attached text describing what is in it. For example, an image of a horse standing in a field or meadow might have the accompanying text "horse standing in a field," and an image of a cabin in the wilderness might have the text "cabin in the woods next to a lake."
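In code, a set of classified images is often nothing more than a list of image-and-caption pairs. The file names below are made up purely for illustration:

    # Hypothetical examples of classified images: each entry pairs image data
    # (represented here only by a file name) with the text that describes it.
    training_examples = [
        ("horse_0001.png", "horse standing in a field"),
        ("cabin_0007.png", "cabin in the woods next to a lake"),
    ]

    for image_file, caption in training_examples:
        print(image_file, "->", caption)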
You can add noise to an image, and in Stable Diffusion the amount of noise in an image is described by its step.
NOTE: The number of steps needed to reach full noise can vary; we will use 20 in these examples.
Looking at a series of images from step 0 to step 20, the original image gradually disappears into noise: step 0 is the original image of the astronaut riding a horse on Mars, more noise is introduced at each following step, and by step 20 the only visual content left is noise.
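Here is a rough sketch of what "adding a step of noise" could look like, written as a simple blend between the image and random noise. The blend formula is an assumption made for illustration; the real Stable Diffusion noise schedule is more sophisticated:

    # A rough sketch of adding noise step by step. This simple linear blend is
    # an illustration only; real diffusion models use a carefully tuned schedule.
    import numpy as np

    rng = np.random.default_rng(seed=0)
    TOTAL_STEPS = 20

    def add_noise(image, step):
        """Return the image as it would appear after `step` of TOTAL_STEPS steps of noise."""
        noise = rng.normal(0.0, 1.0, size=image.shape)
        mix = step / TOTAL_STEPS               # 0.0 = original image, 1.0 = pure noise
        return (1.0 - mix) * image + mix * noise

    original = rng.random((64, 64, 3))         # stand-in for the astronaut-on-Mars image
    step_0   = add_noise(original, 0)          # step 0: identical to the original
    step_10  = add_noise(original, 10)         # step 10: half image, half noise
    step_20  = add_noise(original, 20)         # step 20: pure noise, nothing recognizable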
During training, we add two amounts of noise to a classified image: one version with a certain step's worth of noise, and one with a single extra step of noise on top of that.
For example, at steps 4 and 5, a small amount of noise has been added to the image of the man with the hat: the step 4 image looks less defined but is still clearly recognizable as a man with a hat, while the step 5 image, with more added noise, is still somewhat recognizable but is edging toward losing its definition under the pixelated noise.
Training works with pre-classified images; when we use the model after training, it instead starts with a full-noise image, and our prompt acts as the classifier.
Instead of trying to remove all the noise at once – which is hard for a computer – we have it remove only one step of noise at a time, repeated over many steps. Twenty steps is fairly standard for getting a good-quality image without spending too much time. Each step takes a non-zero amount of time, so we can't throw as many steps as we want at the problem without the generation taking a while.
During training, we teach the neural network to remove one step of noise. For example, if we are generating an image of a man wearing a hat, between step 5 and step 4 the neural network removes some noise so that the man and the hat become more clearly defined. Training then compares what the network generated for step 4 with the known step 4 image. The network's result might look remarkably similar but have small differences, such as larger teeth and slightly offset eyes, that are fairly unnoticeable at a glance under that much noise, which still counts as a satisfactory result for training.
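As a very rough sketch, one training comparison could look like the code below. It continues the add_noise() example from earlier, and `network` is a hypothetical stand-in for the real neural network; actual training is handled by a deep-learning framework and uses a more involved objective than this simple pixel difference:

    import numpy as np   # continues the add_noise() sketch above

    def training_step(network, clean_image, prompt_values, step):
        # e.g. step = 4: make the step 5 (noisier) and step 4 (expected) versions.
        noisier  = add_noise(clean_image, step + 1)
        expected = add_noise(clean_image, step)
        # Ask the network to remove one step of noise from the noisier image.
        predicted = network(noisier, prompt_values, step + 1)
        # Measure how far the guess is from the known step 4 image (mean squared error).
        loss = np.mean((predicted - expected) ** 2)
        return loss   # the framework uses this number to nudge the network's weights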
When we use the model, the neural network trained above removes noise one step at a time, starting from full noise. Our man with a hat might start at step 20 as an image that is almost entirely noise (pixelated colors forming static that, viewed from far away, looks gray), begin to reveal the silhouette of a man's head around step 19, and finally resolve into a photorealistic headshot of a man wearing a hat.
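Once trained, using the network is just a loop that starts from pure noise and removes one step at a time. Again, `network` is a hypothetical stand-in, and the image size and step count are illustrative:

    # A simplified generation loop (rng comes from the add_noise() sketch above).
    def generate(network, prompt_values, shape=(64, 64, 3), total_steps=20):
        image = rng.normal(0.0, 1.0, size=shape)           # step 20: pure noise
        for step in range(total_steps, 0, -1):              # 20, 19, ..., 1
            image = network(image, prompt_values, step)     # produce the previous step's image
        return image                                        # step 0: the finished picture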
A neural network imitates how human brains think, processing millions of interconnected inputs and producing millions of outputs.
The neural network does this work between each step of noise removal. The millions of inputs are the pixels of the image at the current step plus the roughly 21,000 semantic values from the prompt. The network takes all of this information, processes it, and outputs the next step in the development of the image.
Stable Diffusion has been trained to make images that we deem good. But who are "we"? And what counts as a "good image"? It's important to note that different Stable Diffusion models carry bias: most are biased toward American culture and were trained on English prompts paired with images from the public domain. There may be other biases as well.