What is inference — and why you pay for every AI answer again

Here's what surprises people: training a big model costs millions and takes months — but it happens once. The thing you deal with every day is something else entirely. It's inference. And you pay for it — every single time you hit "send."
The word sounds scary; the idea is simple. Inference is a trained model doing its job: taking your request and producing an answer. Training is like building a factory. Inference is running the conveyor to get one part. You build the factory once; you run the conveyor millions of times. You live in the second half.
What happens during inference
When you send a request, the model doesn't "recall" a ready answer or look it up in a database. It computes it from scratch — token by token.
First your text is split into tokens: small chunks of text such as whole words, pieces of words, or punctuation marks. Then the model runs them through its weights (the numbers it learned during training) and predicts one next token. It appends that token. Then, with the new token in hand, it predicts another. Round and round, until it decides the answer is done or it hits a length limit.
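The loop described above can be sketched in a few lines of Python. This is a toy, not how a real model is built: `TOY_WEIGHTS`, `predict_next`, and `generate` are made-up names, the "weights" are a tiny hand-written table, tokens are whole words, and the prediction looks only at the last token. What it does show faithfully is the shape of the process: predict one token, append it, repeat, and never touch the weights.

```python
import random

# Hypothetical "weights": for each token, the possible next tokens and their
# probabilities. A real model derives these from billions of learned numbers.
TOY_WEIGHTS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sleeps": 0.7, "runs": 0.3},
    "dog": {"runs": 0.8, "sleeps": 0.2},
    "sleeps": {"<end>": 1.0},
    "runs": {"<end>": 1.0},
}


def predict_next(tokens, rng):
    """Pick one next token from the text so far (this toy uses only the last token)."""
    options = TOY_WEIGHTS[tokens[-1]]
    return rng.choices(list(options), weights=list(options.values()))[0]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Extend the prompt one token at a time until "<end>" or the length limit."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens, rng)
        if next_token == "<end>":  # the model "decides the answer is done"
            break
        tokens.append(next_token)  # append, then predict again
    return tokens


print(generate(["<start>", "the"]))
```

Every output token costs one pass through `predict_next`, which is why longer answers take longer and cost more.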
The key thing: during inference the weights don't change. The model doesn't permanently remember anything from your chat — it just guesses the next chunk very well based on what it learned before. That's why the same language model gives similar answers today and a week from now.
Why every answer costs money again
Since the model computes the answer from zero each time, every time it burns hardware: GPUs grind through billions of multiplications. That's what you're billed for.
You usually pay per token, on both ends: how much you sent (your request plus the whole chat history) and how much the model replied. A long conversation gets pricier not because the model is "tired," but because the entire prior chat is fed back in every time — it remembers nothing between requests.
Practical takeaway: a short, precise request is cheaper than a long "just in case" one. And if you run the model in a loop (a bot, an agent), the cost stacks up per run.
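To see how a growing chat history inflates the bill, here is a rough sketch. The prices and token counts are invented for illustration (real prices differ by provider and model), and `chat_cost` is a hypothetical helper, not part of any billing API. It assumes the whole prior conversation is resent as input on every request.

```python
# Hypothetical prices in dollars per 1,000,000 tokens (assumed for illustration).
PRICE_IN_PER_M = 1.00
PRICE_OUT_PER_M = 4.00


def chat_cost(turns):
    """Return the cost of each request in a chat.

    turns: list of (user_tokens, reply_tokens), one pair per exchange.
    Assumes the full history is sent back in as input every time.
    """
    history = 0
    costs = []
    for user_tokens, reply_tokens in turns:
        input_tokens = history + user_tokens  # old chat + new message
        cost = (input_tokens * PRICE_IN_PER_M
                + reply_tokens * PRICE_OUT_PER_M) / 1_000_000
        costs.append(cost)
        history = input_tokens + reply_tokens  # the reply joins the history
    return costs


# Five identical exchanges: 200 tokens asked, 300 tokens answered.
costs = chat_cost([(200, 300)] * 5)
for i, cost in enumerate(costs, start=1):
    print(f"request {i}: ${cost:.4f}")
print(f"total: ${sum(costs):.4f}")
```

With these made-up numbers the fifth request costs more than twice as much as the first, even though the question and answer are the same size. The extra cost is the history being fed back in.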
What drives speed and price
Three levers decide almost everything:
- Model size. A bigger one is usually smarter, but each token is slower and dearer to compute. Sometimes a smaller model handles your task just as well, and answers noticeably faster.
- Context length. The more text you feed in, the slower the first token and the higher the bill. Don't dump everything into the prompt.
- Answer length. Each output token is a separate conveyor step. Ask for "brief" and you get it faster and cheaper.
There's also a trick on the model-maker's side, quantization: the weights are stored with fewer bits each, so inference runs faster and fits on weaker hardware, usually at a small cost to accuracy.
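A back-of-the-envelope calculation shows why quantization helps a model fit on weaker hardware. The 7-billion-parameter model below is a hypothetical example, `weights_size_gb` is a made-up helper, and the estimate covers the weights only, not the extra memory inference needs while running.

```python
def weights_size_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7_000_000_000  # hypothetical 7-billion-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: about {weights_size_gb(params, bits):.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts the footprint from about 14 GB to about 3.5 GB, which is the difference between needing a dedicated GPU and fitting in an ordinary laptop's memory.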
Where you'll run into it
You already have, you just didn't know the name. The pause while you stare at "The model is typing…" is inference working through your prompt before it can produce the first token. The bill in your API dashboard is the sum of inferences. The slowdown on a free tier at peak hours is the queue for the GPUs.
And one more: inference doesn't only run in the cloud. Small open models run right on a laptop or phone. That's usually slower, but there's no per-token bill and your data stays on the device. That's "local inference."
Is inference the same as generation?
Almost. Generation is the inference of a language model that produces text. But inference is broader: it's the name for any work a trained model does — recognizing an image, sorting an email into spam. Text generation is one special case.
Why does the same question cost the same the second time?
Because the model doesn't keep the answer stored in its head: it recomputes it from scratch. Between requests it remembers nothing, so the second time is just as much work as the first.
Can I run inference on my own computer?
Yes, with a smaller open model. The big ones need powerful GPUs, but compact versions run on an ordinary laptop. It'll be slower than the cloud, but with no bill and no data leaving your machine.





