What is multimodality — how AI 'sees' an image when it has no eyes

You send a model a photo of your fridge and ask for a recipe — and it answers. It looks like it "saw" the food the way you do. But the model has no eyes, and it doesn't see the image in our sense. It does something cleverer: it turns your photo into the same numbers it uses for text. That ability to work with image, sound and text in one language is what's called multimodality.
What multimodality is — in one line
A "modality" is just a kind of data: text, image, sound, video. A multimodal model is one that handles several kinds of data at once, not just text. Drop in a photo and a question in words, and it takes both into account in a single answer.
It used to be that each kind lived apart: one program for text, another for recognizing images. Multimodality folded them into one model.
How it works: it all comes down to numbers
This is the heart of it. Inside, a language model doesn't understand letters or pixels; it understands numbers. It first cuts text into tokens (words or pieces of words) and turns each token into numbers. With an image it does essentially the same: a dedicated part of the model, often called an image encoder, splits the picture into small fragments (patches) and turns each into a set of numbers.
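To make the "cut it up and turn it into numbers" step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the three-word vocabulary, the whitespace splitting, the 4x4 grayscale "image" and the function names are all invented, and real models use learned tokenizers and learned image encoders rather than hand-written rules like these.

```python
# Toy illustration only: the vocabulary, the image and both helpers are
# hypothetical stand-ins for what a real model learns during training.

def text_to_token_ids(text, vocab):
    """Split text on whitespace and map each word to an integer id."""
    unknown_id = vocab["<unk>"]
    return [vocab.get(word, unknown_id) for word in text.lower().split()]


def image_to_patches(image, patch_size):
    """Cut a 2D grid of pixel values into square patches.

    Each patch is flattened into one list of numbers, row by row.
    Assumes the image height and width are multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Made-up vocabulary: word -> id.
vocab = {"<unk>": 0, "ginger": 1, "cat": 2}
token_ids = text_to_token_ids("Ginger cat", vocab)

# Made-up 4x4 grayscale image: one number per pixel.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = image_to_patches(image, patch_size=2)

print(token_ids)  # [1, 2]
print(patches)    # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

The point is only the shape of the result: the text ends up as a list of numbers, and the picture ends up as a list of small lists of numbers, one per fragment.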
And here's the trick: after that, image and text become "one substance" — numbers in a shared space. The model then processes them in a single stream. That's why it can reason about a photo in words: for it, these aren't two different tasks but one stream of numbers, where the "ginger cat" from the text and the ginger cat from the photo lie side by side.
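The "single stream" idea can be sketched the same way. In this toy example, which is an illustration under stated assumptions and not how any particular model is implemented, token ids are looked up in a small embedding table and image patches are multiplied by a small projection matrix, so that both come out as vectors of the same width. The table, the matrix and all the numbers are hand-picked and hypothetical; in a real model they are learned.

```python
# Toy illustration only: hand-picked numbers stand in for learned weights.

SHARED_DIM = 2  # hypothetical width of the shared space


def project(vector, weights):
    """Multiply a vector by a weight matrix (one row per output number)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


# Hypothetical embedding table: token id -> vector of width SHARED_DIM.
token_embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0]}

# Hypothetical projection: 4 pixel values per patch -> SHARED_DIM numbers.
patch_projection = [
    [0.25, 0.25, 0.25, 0.25],  # average brightness of the patch
    [-0.5, 0.5, -0.5, 0.5],    # right column minus left column, halved
]

token_ids = [1, 2]                       # e.g. "ginger cat"
patches = [[0, 1, 4, 5], [2, 3, 6, 7]]   # two flattened 2x2 patches

text_vectors = [token_embeddings[t] for t in token_ids]
image_vectors = [project(p, patch_projection) for p in patches]

# One stream: image vectors followed by text vectors, all the same width.
sequence = image_vectors + text_vectors

print(sequence)  # [[2.5, 1.0], [4.5, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Once everything is in `sequence`, nothing marks an entry as "came from a photo" or "came from a word" except its position and its values, which is the sense in which the two become one stream of numbers.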
An analogy: picture a translator who turns speech, gestures and images into one common language — and from then on thinks only in it. It doesn't matter what came in; inside, everything has become one.
What it makes possible — things that weren't before
Multimodality opened up a pile of things that were science fiction for a single model just a couple of years ago:
- snap a receipt — get a table of expenses;
- show a screenshot of an error — the model reads the text on screen and suggests a fix;
- point your camera at a menu in another language — get a translation and what to order;
- drop in a chart — ask it to explain what's on it.
Notice: in every example the input is not text, yet you still talk in words. That's the power of multimodality — the line between "show" and "tell" dissolves.
Where its limits are
Multimodality is impressive, but it's not magic. The model often stumbles on fine print, dense tables, and exact counts of objects in a photo (ask "how many people are here?" and it can slip). And "saw it" doesn't mean "understood it right": it can make up a detail from an image just as confidently as it does from text. So check a photo-based result the same way you'd check any AI answer.
Q: Is a multimodal model a different model from a regular one?
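Often it's one system trained to handle text and images together. Some are built by connecting an image encoder to a language model, but the parts are then trained to work as a whole, so the result behaves less like "a text model bolted onto a recognizer" and more like a single model whose input can be text, a photo or sound. That's why it answers coherently, not in two separate pieces.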
Q: Does it really "see," or just guess?
Depends what you count as "seeing." It has no eyes or human vision; what it does is turn pixels into numbers and find patterns in them. In practice that's enough to describe a scene, read text off a photo, or find an object. It's not human sight, but it's not empty guessing either.
Q: How is multimodality different from image generation?
Two sides of one coin. Multimodality, in the sense used here, is about input: the model understands an image you send. Generation is the model drawing a new image from text. People mix them up because both are "about images," but the direction is opposite: one reads, the other creates.