What is a diffusion model — why AI images are born from noise

Here's a counterintuitive thing: when AI draws an image from your description, it doesn't move a brush or lay it out pixel by pixel. It starts with a screen of pure "TV static" — and step by step removes what doesn't belong, until a cat in a spacesuit emerges from the grain. The image isn't drawn. It's developed, like a photo in a tray. That's a diffusion model.
What it is, in one line
A diffusion model is a neural network trained to turn random noise into an image by gradually removing that noise. Most modern image generators work exactly this way.
Think of a sculptor and a block of marble. They don't "add" the statue, they chip away what's extra until it emerges. A diffusion model chips away noise the same way: formless grain at the start, a crisp image at the finish. Except it doesn't chip at random — it chips knowing what should appear.
How it works, step by step
There's a clever twist: to teach the model to remove noise, we first add noise to real images ourselves.
- Training in reverse. Take millions of real images and gradually "ruin" each one: pour in more and more noise until only pure grain is left. The model is shown these versions at every stage, so it learns what an image with a little noise looks like and what one with a lot looks like.
- The model learns to predict noise. Its core skill is to look at a noisy image and guess which part of it is noise. Guess right, and it can subtract that noise and make the image a little cleaner.
- Generation — running it backwards. Now hand it pure noise. It predicts what's "extra," removes a bit, looks again, removes more — dozens of times over. With each step the grain turns into a picture.
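To make the three steps above concrete, here is a minimal Python sketch, not how any real generator is implemented. It assumes an "image" is just a short list of numbers, and it replaces the trained neural network with a hypothetical stand-in, toy_predict_noise, that cheats by already knowing the target image. The function names and the schedule numbers are made up for illustration. The update inside generate follows the deterministic style used by DDIM-type samplers.

```python
import math
import random

def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """For each step t, the share of the original image's signal that is still left."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta          # every step keeps a bit less of the image
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, t, alpha_bars, rng):
    """Training side: 'ruin' a clean image up to noise level t."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    keep = math.sqrt(alpha_bars[t])
    grain = math.sqrt(1.0 - alpha_bars[t])
    noisy = [keep * p + grain * n for p, n in zip(image, noise)]
    return noisy, noise

def toy_predict_noise(noisy, t, alpha_bars, known_image):
    """Hypothetical stand-in for the trained network.

    A real model has to learn this guess from data; this toy version
    cheats by using the one image it is supposed to produce.
    """
    keep = math.sqrt(alpha_bars[t])
    grain = math.sqrt(1.0 - alpha_bars[t])
    return [(x - keep * p) / grain for x, p in zip(noisy, known_image)]

def generate(known_image, alpha_bars, rng):
    """Generation: start from pure noise and remove a bit of it at every step."""
    x = [rng.gauss(0.0, 1.0) for _ in known_image]
    for t in range(len(alpha_bars) - 1, -1, -1):
        predicted = toy_predict_noise(x, t, alpha_bars, known_image)
        keep = math.sqrt(alpha_bars[t])
        grain = math.sqrt(1.0 - alpha_bars[t])
        # Current best guess of the clean image: subtract the predicted noise.
        clean_guess = [(xi - grain * n) / keep for xi, n in zip(x, predicted)]
        # Step to the previous, slightly less noisy level (no noise left after t = 0).
        prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = [math.sqrt(prev) * c + math.sqrt(1.0 - prev) * n
             for c, n in zip(clean_guess, predicted)]
    return x

# Usage: a four-"pixel" image, developed back out of random grain.
target = [0.0, 0.5, 1.0, 0.25]
schedule = make_alpha_bars()
result = generate(target, schedule, random.Random(1))
print([round(v, 3) for v in result])
```

With the cheating predictor, the loop lands back on the known image. A real model has no answer key, only what it learned from its training images, which is why its results are plausible rather than exact.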
So where's your prompt? The text "cat in a spacesuit" is the steering wheel. It directs which noise to remove at each step, so a cat emerges and not a dog. A separate part handles understanding the text — often a transformer, the same kind of model that powers chatbots.
Why it matters to you
Understand the mechanism and you stop being surprised by generators' quirks — and start steering them.
- Why it's slow and heats your GPU. An image isn't one pass but dozens of "remove the noise" steps, and each step is a full run of the network. Fewer steps are faster but rougher; more steps are slower but cleaner.
- Why the result differs every time. The start is random noise. A different grain (set by a seed number) means a different image for the same prompt. Fix the seed and you get a repeatable result.
- Where the mangled hands and extra fingers come from. The model develops plausible texture, it doesn't compute anatomy. So fine logic (fingers, text on a sign) is harder for it than the overall picture.
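The point about seeds can be illustrated with Python's standard random module. This is only a sketch: starting_noise is a hypothetical helper, and real generators create their noise with their own libraries, but the idea of a seed fixing the starting grain is the same.

```python
import random

def starting_noise(seed, num_pixels=8):
    """Return the random 'grain' a generation run would start from."""
    rng = random.Random(seed)  # the seed fully determines the sequence of random numbers
    return [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]

same_a = starting_noise(seed=42)
same_b = starting_noise(seed=42)
other = starting_noise(seed=43)

print(same_a == same_b)  # same seed: identical starting grain
print(same_a == other)   # different seed: different starting grain
```

Since every later step works on that starting grain, the same seed with the same prompt and settings leads to the same picture, and a new seed leads to a new one.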
Where you'll run into it
Anywhere AI makes images from text: generators of pictures, avatars, icons, backgrounds. Diffusion is currently the most widely used approach to image generation, and it's part of a bigger theme, multimodality, where a model works not only with text but with images, audio, and video. The same "out of noise" principle is now being tried for video, and even for generating text.
Question: how is a diffusion model different from a transformer?
They're about different things and often work together. Strictly speaking, a transformer is a type of network architecture, while diffusion is a method of generating by removing noise. In a typical image generator, a transformer handles the text: it reads your "cat in a spacesuit" and turns it into guidance. Diffusion handles the image: it develops a picture out of noise under that guidance. Not competitors, but different tools.
Question: why does the same phrase give different pictures?
Because each run starts from random noise. Change the starting grain and the result changes, even with the same text. That's not a bug, it's a feature: it lets you cycle through variations until you like one. And if you need the exact same result — set a fixed seed, and the start stops being random.