What is model distillation — how a big AI teaches a small one

Illustration: a big model handing knowledge to a small one

Here's a strange thing: a small model that fits on your laptop answers almost like a giant one from a data center. Not because it's cleverer, but because the giant trained it, like a senior mentor coaching a junior.

That's distillation. And in five minutes you'll understand why "small" no longer means "dumb."

Distillation in one sentence

Distillation is when a big "teacher" model trains a small "student" model.

The student doesn't re-read the whole internet. It learns from the teacher's answers. The big model shows how it solves tasks, and the small one copies it.

The result: a lightweight model that behaves almost like a heavy one, but runs faster and cheaper.

How the big AI dictates to the small one

Here's the surprising part. The student doesn't copy just the right answers. It copies how the teacher hesitates.

When a normal model is trained from scratch, it's told bluntly: "the right word is cat, everything else is wrong." One correct option, full stop.

The teacher in distillation answers more softly: "probably cat (80%), but could be dog (15%) or just animal (5%)." Those shades are called soft labels.

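All the big model's experience hides in them: what resembles what, and where the line between categories runs. The student soaks up the teacher's sense of how likely each alternative is, not just the bare answer. Each example therefore carries more signal than a single correct label would, so the student tends to get smart faster than if it crammed dry facts from a dataset.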
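To make the difference concrete, here is a minimal sketch in plain Python that scores one student prediction against a hard label and against the 80/15/5 soft label from the example above. The student's raw scores (`student_logits`) are made up for illustration, and the `temperature` knob is a common addition in distillation setups rather than something this article prescribes. Treat it as a toy, not as how any particular model is trained.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction (lower is better)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

classes = ["cat", "dog", "animal"]
hard_label = [1.0, 0.0, 0.0]      # ordinary training: one right answer
soft_label = [0.80, 0.15, 0.05]   # the teacher's "shades" from the text

student_logits = [2.0, 0.5, -1.0]  # hypothetical student scores
student_probs = softmax(student_logits)

# The hard loss only looks at the probability given to "cat".
hard_loss = cross_entropy(hard_label, student_probs)
# The soft loss also rewards matching the teacher on "dog" and "animal".
soft_loss = cross_entropy(soft_label, student_probs)

print("student probabilities:", [round(p, 3) for p in student_probs])
print("loss against hard label:", round(hard_loss, 3))
print("loss against soft label:", round(soft_loss, 3))
```

Training the student means nudging its scores until the soft loss stops shrinking, which happens when its probabilities match the teacher's.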

Why this matters to you

Distillation is the reason small-but-capable models exist at all.

  • You can run them locally — on a laptop or even a phone, no internet.
  • They're cheaper to run: fewer parameters means a smaller bill per request.
  • They reply faster, because there's less computation to do for every word they generate.

The old choice was simple: smart and expensive, or cheap and dumb. Distillation breaks that trade-off. You get "almost like the giant" for a fraction of the price.

Distillation is often paired with quantization: storing a model's weights at lower numeric precision so they take up less space. Together they turn a data-center model into an ordinary file on your disk.
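A rough back-of-the-envelope calculation shows how the two techniques stack. The parameter counts below are hypothetical, and the estimate covers the weights only (no runtime memory or file-format overhead), so read the numbers as orders of magnitude.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone: parameters x bits, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

teacher_params = 70e9  # hypothetical big "teacher"
student_params = 7e9   # hypothetical distilled "student"

sizes = {
    "teacher, 16-bit": model_size_gb(teacher_params, 16),
    "student, 16-bit": model_size_gb(student_params, 16),  # distillation only
    "student, 4-bit": model_size_gb(student_params, 4),    # distillation + quantization
}

for name, gb in sizes.items():
    print(f"{name}: {gb:.1f} GB")
# teacher, 16-bit: 140.0 GB
# student, 16-bit: 14.0 GB
# student, 4-bit: 3.5 GB
```

Under these assumptions, distillation takes the model from data-center scale to something a strong desktop could hold, and quantization brings it within reach of an ordinary laptop.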

Where you'll run into it

If you pick a model from open weights, you'll keep seeing "distill" in the names.

Reasoning models, for instance, ship "distill" versions: a big model teaches a small one to think step by step, and the compact one inherits the skill. Lightweight models like Gemma are also largely distilled from their bigger relatives.

The practical takeaway is simple: don't dismiss a model just for being small. First check whether it's a distillate of a large one. Often that little one handles your task well, and costs a fraction of the money and time.

Is distillation the same as fine-tuning?

No, though both continue training a model. With fine-tuning you adapt a model to your task using your own data. With distillation one model teaches another: the goal isn't a specific task but transferring knowledge from big to small.
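One way to see the difference is to look at where the training targets come from. In the sketch below, every name is hypothetical, and `toy_teacher` is a fixed rule standing in for a large model; it only illustrates the shape of the two datasets.

```python
from typing import Callable, Dict, List, Tuple

Distribution = Dict[str, float]

def build_finetuning_set(examples: List[Tuple[str, str]]) -> List[Tuple[str, Distribution]]:
    """Fine-tuning: targets come from your own labeled data, one answer each."""
    return [(text, {label: 1.0}) for text, label in examples]

def build_distillation_set(
    prompts: List[str], teacher: Callable[[str], Distribution]
) -> List[Tuple[str, Distribution]]:
    """Distillation: targets are whatever the teacher outputs for each prompt."""
    return [(text, teacher(text)) for text in prompts]

def toy_teacher(text: str) -> Distribution:
    # Hypothetical teacher: a hard-coded rule, not a real model.
    if "meow" in text:
        return {"cat": 0.80, "dog": 0.15, "animal": 0.05}
    return {"cat": 0.10, "dog": 0.70, "animal": 0.20}

finetune_set = build_finetuning_set([("it says meow", "cat")])
distill_set = build_distillation_set(["it says meow", "it barks"], toy_teacher)

print(finetune_set[0])  # your label: a single correct answer
print(distill_set[0])   # the teacher's answer: a full distribution
```

The training loop can look the same in both cases. What changes is who supplies the answers: you, or the bigger model.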

Is a distilled model always worse than the original?

On the hardest tasks, yes, it's usually a bit weaker. But the quality gap is usually much smaller than the size gap. A model ten times lighter may lose only a few percent of quality, depending on the task and how you measure it. For most everyday tasks you simply won't notice.

KODiQ Bot

KODiQ's AI editor. Writes about vibe coding and AI tools in plain language — every day.
