What is streaming — why AI types its answer word by word
Ever notice how ChatGPT seems to type its answer live — word by word, instead of dropping the whole thing at once? Here's the surprise: that's not a cosmetic animation. The model genuinely doesn't know the end of the sentence when it starts. It invents the next chunk right now and shows it to you immediately. That's called streaming — and in a couple of minutes you'll get what's happening under the hood.
The model thinks one token at a time
An answer isn't built all at once — it's built in tiny chunks called tokens. A token is a piece of a word, sometimes a whole word, sometimes a couple of characters.
The mechanism is simple, almost unnervingly so:
1. The model looks at everything so far (your question + what it's already written).
2. It predicts one next token.
3. It glues it on and goes back to step 1.
This repeats hundreds of times, until the model predicts a special end-of-answer token or hits a length limit. Each predicted token can either be stored up and delivered all at once at the end, or handed over the moment it's ready. The second option is streaming.
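The loop above can be sketched in a few lines of Python. Treat it as a toy under stated assumptions: `predict_next_token` is a hypothetical stand-in that replays a canned answer instead of running a real model, and `END` is a made-up end-of-answer marker. It only shows the shape of the idea: one generation loop, consumed either token by token or all at once.

```python
from typing import Iterator, List

END = "<end>"  # hypothetical end-of-answer marker, not a real model's token

# Canned "answer" so the sketch runs without a model.
_CANNED = ["Stream", "ing", " shows", " tokens", " early", ".", END]


def predict_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a model: replays the canned answer in order."""
    generated = len(context) - 1  # context[0] is the prompt
    return _CANNED[generated]


def generate(prompt: str) -> Iterator[str]:
    """Look at everything so far, predict one token, append it, repeat."""
    context = [prompt]
    while True:
        token = predict_next_token(context)
        if token == END:
            return
        context.append(token)
        yield token  # handed over the moment it is ready


def buffered(prompt: str) -> str:
    """Non-streaming delivery: store every token, return the whole answer at the end."""
    return "".join(generate(prompt))


if __name__ == "__main__":
    # Streaming delivery: print each token as soon as it exists.
    for tok in generate("What is streaming?"):
        print(tok, end="", flush=True)
    print()
    # Buffered delivery: the same tokens, delivered in one piece.
    print(buffered("What is streaming?"))
```

Both calls produce the same text. The only difference is when the caller gets to see it.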
Why hand it over piece by piece
- You start reading immediately. The first words land in half a second instead of after a ten-second wait for the whole answer. There's even a name for it — "time to first token."
- It feels like a live conversation. Text that appears gradually reads like a chat, not like a file dumped on you all at once.
- You can cut it off. See the model going the wrong way? Hit "stop" instead of waiting for the end. You save time and skip the tokens that would have come next.
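Two of these points, time to first token and stopping early, can be made concrete with a small sketch. The token source here is simulated: `slow_tokens` is a hypothetical generator that sleeps briefly before each token, standing in for a real model.

```python
import time
from typing import Iterator, List, Tuple


def slow_tokens(tokens: List[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical token source: each token takes delay_s to 'generate'."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok


def read_stream(stream: Iterator[str], stop_after: int) -> Tuple[str, float, int]:
    """Consume a stream, record time to first token, stop after stop_after (>= 1) tokens."""
    start = time.perf_counter()
    ttft = 0.0
    received: List[str] = []
    for tok in stream:
        if not received:
            ttft = time.perf_counter() - start  # time to first token
        received.append(tok)
        if len(received) >= stop_after:
            break  # the user hit "stop"; later tokens are never produced
    return "".join(received), ttft, len(received)


answer = ["The", " answer", " is", " long", "."]
text, ttft, count = read_stream(slow_tokens(answer), stop_after=2)
print(repr(text), count)
```

Because the source is a generator, breaking out of the loop means the remaining tokens are never produced. The two tokens that did arrive were still generated.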
What this means for you
A few takeaways you can use right away:
- Generating the answer is called inference, and the words appear at roughly the speed it runs. You can't "skip to the end": the end doesn't exist yet.
- If you cut an answer off mid-way, you still paid for the tokens already generated. Stopping saves time, but it doesn't un-generate what's done.
- A reasoning model runs a "thinking" phase before streaming the answer — sometimes hidden. That's why the pause before the first word can be longer: the model reasons privately first, then starts typing.
- In your own app, streaming is usually a single flag in the request (stream: true). It turns "the app froze for 8 seconds" into "the app answers live."
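On the receiving side, a streamed HTTP response commonly arrives as server-sent events: text lines starting with `data:`, one per chunk. Here is a minimal sketch of reading them. The payload shape (`{"text": ...}`) and the `[DONE]` end marker are assumptions for illustration; each provider defines its own fields, so check the API you actually use.

```python
import json
from typing import Iterable, Iterator


def iter_text_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each chunk from server-sent-event style lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            return
        yield json.loads(payload)["text"]  # "text" is an assumed field name


# Simulated response body, as the lines might arrive over the network.
simulated = [
    'data: {"text": "Hel"}',
    "",
    'data: {"text": "lo"}',
    "",
    "data: [DONE]",
]

for chunk in iter_text_chunks(simulated):
    print(chunk, end="", flush=True)  # the UI updates as each chunk lands
print()
```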
Once you hold the image of "the model produces text piece by piece," the question "why does it type instead of showing it all at once?" answers itself. It types because, in that moment, it's genuinely still working out what comes next.
Why does an answer sometimes cut off mid-way?
Usually a dropped connection or the model hitting a length limit. The stream just stops; whatever arrived stays on screen. After a dropped connection, retrying the request is usually enough. After a length limit, ask the model to continue or raise the limit.
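A client can tell these two cases apart while still keeping the partial text. The sketch below is a simulation, not any real library's behavior: `flaky_stream` is a hypothetical stream that drops mid-way, and the reason labels (`done`, `length`, `dropped`) are made up for this example.

```python
from typing import Iterator, List, Tuple


def flaky_stream(tokens: List[str], fail_at: int) -> Iterator[str]:
    """Hypothetical stream that drops the connection before token index fail_at."""
    for i, tok in enumerate(tokens):
        if i == fail_at:
            raise ConnectionError("connection dropped")
        yield tok


def collect(stream: Iterator[str], max_tokens: int) -> Tuple[str, str]:
    """Return (text so far, reason); reason is 'done', 'length' or 'dropped'."""
    parts: List[str] = []
    try:
        for tok in stream:
            if len(parts) >= max_tokens:
                return "".join(parts), "length"  # hit the length limit
            parts.append(tok)
    except ConnectionError:
        return "".join(parts), "dropped"  # whatever arrived is kept
    return "".join(parts), "done"


tokens = ["a", "b", "c", "d"]
print(collect(flaky_stream(tokens, fail_at=2), max_tokens=10))
print(collect(iter(tokens), max_tokens=3))
print(collect(iter(tokens), max_tokens=10))
```

A retry makes sense for `dropped`. For `length`, sending the same request again would run into the same limit.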
Does streaming make the answer faster?
No. The total generation time is about the same. But you start reading from the first words, so it subjectively feels much faster.
Can you turn streaming off?
Yes. In an API it's a flag you can disable — then you get the whole answer at once. Handy when you need a complete, finished text (say, to parse JSON) rather than a live type-out.
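The JSON case is easy to see in a quick sketch. The chunks below are invented for illustration: they mimic a JSON answer arriving in three streamed pieces, and only the complete text parses.

```python
import json

# A JSON answer split into streamed chunks (made-up data for illustration).
chunks = ['{"city": "Par', 'is", "tem', 'p_c": 21}']

partial = ""
parse_ok = []
for chunk in chunks:
    partial += chunk
    try:
        json.loads(partial)
        parse_ok.append(True)
    except json.JSONDecodeError:
        parse_ok.append(False)  # incomplete JSON cannot be parsed yet

result = json.loads(partial)  # the finished text parses fine
print(parse_ok)
print(result["city"], result["temp_c"])
```

With streaming off you skip the accumulate-and-retry step: the response is already complete when it arrives.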


