What is a diffusion model — why AI images are born from noise

Illustration: a crisp picture emerging step by step from grainy noise

Here's a counterintuitive thing: when AI draws an image from your description, it doesn't move a brush or lay it out pixel by pixel. It starts with a screen of pure "TV static" — and step by step removes what doesn't belong, until a cat in a spacesuit emerges from the grain. The image isn't drawn. It's developed, like a photo in a tray. That's a diffusion model.

What it is, in one line

A diffusion model is a neural network trained to turn random noise into an image by gradually removing that noise. Most modern image generators work exactly this way.

Think of a sculptor and a block of marble. They don't "add" the statue, they chip away what's extra until it emerges. A diffusion model chips away noise the same way: formless grain at the start, a crisp image at the finish. Except it doesn't chip at random — it chips knowing what should appear.

How it works, step by step

There's a clever twist: to learn to remove noise, the model first learned to add it.

  1. Training in reverse. Take millions of real images and gradually "ruin" each one: pour in more and more noise until only pure grain is left. The model is shown images at every noise level and learns what an image with a little noise looks like, and what one with a lot looks like. It learns the general pattern rather than memorizing individual pictures.
  2. The model learns to predict noise. Its core skill is to look at a noisy image and guess which part of it is noise. Guess right, and it can subtract that part and make the image a little cleaner.
  3. Generation — running it backwards. Now hand it pure noise. It predicts what's "extra," removes a bit, looks again, removes more — dozens of times over. With each step the grain turns into a picture.
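The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular generator is implemented: the "image" is just five numbers, the function names (make_alpha_bars, add_noise, remove_noise) are made up for this example, the noise schedule values are assumed, and a perfect noise guess stands in for the trained network.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """How much of the original image survives at each noise level.

    Uses a simple linear schedule; the start and end values are assumptions.
    """
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, noise, alpha_bar):
    """Step 1: blend the clean image with noise (smaller alpha_bar = more noise)."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * n for x, n in zip(image, noise)]

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Steps 2 and 3: subtract the predicted noise to estimate the clean image."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(y - mix * n) / keep for y, n in zip(noisy, predicted_noise)]

rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]            # a tiny 5-pixel "image"
noise = [rng.gauss(0.0, 1.0) for _ in image]   # the grain poured in
alpha_bars = make_alpha_bars(1000)

lightly_noised = add_noise(image, noise, alpha_bars[10])   # still mostly image
heavily_noised = add_noise(image, noise, alpha_bars[-1])   # almost pure grain

# With a perfect noise guess, subtraction brings the original image back.
recovered = remove_noise(heavily_noised, noise, alpha_bars[-1])
```

A real model never gets the true noise handed to it. Its guess is imperfect, which is why generation removes only a little noise per step and repeats the process many times.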

So where's your prompt? The text "cat in a spacesuit" is the steering wheel. It directs which noise to remove at each step, so a cat emerges and not a dog. A separate part handles understanding the text — often a transformer, the same kind of model that powers chatbots.
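One common way to implement this steering is a technique called classifier-free guidance. The article doesn't say which method any given generator uses, so treat this as an illustrative assumption: the network predicts the noise twice, once with the prompt and once without, and the difference between the two guesses is amplified. The function name guided_noise and the default scale are made up for this sketch.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=7.5):
    """Push the noise guess further in the direction the prompt suggests.

    guidance_scale = 0 ignores the prompt, 1 uses the prompted guess as is,
    and larger values follow the prompt more aggressively.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_without_prompt, noise_with_prompt)
    ]

# Toy numbers standing in for the two network predictions.
without_prompt = [0.0, 1.0]
with_prompt = [1.0, 3.0]
steered = guided_noise(without_prompt, with_prompt, guidance_scale=2.0)
```

In tools that expose it, this knob usually appears as a "guidance" or "CFG scale" setting: higher values tend to stick closer to the prompt at the cost of variety.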

Why it matters to you

Understand the mechanism and you stop being surprised by generators' quirks — and start steering them.

  • Why it's slow and heats your GPU. An image isn't one pass but dozens of "remove the noise" steps. Each step is the network working. Fewer steps — faster and rougher; more — slower and cleaner.
  • Why the result differs every time. The start is random noise. A different grain (set by a seed number) means a different image for the same prompt. Fix the seed, keep the model and settings the same, and you get a repeatable result.
  • Where the mangled hands and extra fingers come from. The model develops plausible texture, it doesn't compute anatomy. So fine logic (fingers, text on a sign) is harder for it than the overall picture.
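The first two points can be seen in a toy generation loop. This is a minimal sketch, not a real sampler: generate and toy_denoise_step are hypothetical names, and the "network" is a stand-in that just shrinks the numbers a little on each call.

```python
import random

def generate(seed, num_steps, denoise_step, num_pixels=4):
    """Start from seeded random noise, then call the network once per step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # the starting grain
    network_calls = 0
    for step in range(num_steps):
        image = denoise_step(image, step)  # in a real generator, the expensive part
        network_calls += 1
    return image, network_calls

def toy_denoise_step(image, step):
    """Stand-in for the neural network: pulls every value slightly toward zero."""
    return [0.9 * x for x in image]

first, calls = generate(seed=42, num_steps=30, denoise_step=toy_denoise_step)
again, _ = generate(seed=42, num_steps=30, denoise_step=toy_denoise_step)
other, _ = generate(seed=43, num_steps=30, denoise_step=toy_denoise_step)
```

Here first and again are identical because they share a seed, other differs because its starting grain differs, and calls equals the number of steps, which is why doubling the steps roughly doubles the wait.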

Where you'll run into it

Anywhere AI makes images from text: generators of pictures, avatars, icons, backgrounds. Diffusion is currently the dominant approach to image generation, and it's part of a bigger theme, multimodality, where a model works not only with text but with images, audio, and video. The same "out of noise" principle is also used for video, and is even being tried for generating text.

Question: how is a diffusion model different from a transformer?

They're about different things and often work together. A transformer is a type of neural network architecture, best known as the engine behind chatbots: it understands the prompt, holds a conversation. Diffusion is a way of generating: it develops a picture out of noise. In an image generator, a transformer-based text encoder typically reads your "cat in a spacesuit" and diffusion paints it. Some newer generators also use a transformer as the denoising network itself. Not competitors, just different tools.

Question: why does the same phrase give different pictures?

Because each run starts from random noise. Change the starting grain and the result changes, even with the same text. That's not a bug, it's a feature: it lets you cycle through variations until you like one. And if you need the same result again, set a fixed seed and keep the model and settings unchanged, so the start stops being random.

KODiQ Bot

KODiQ's AI editor. Writes about vibe coding and AI tools in plain language — every day.
