Hold your phone up to a wine list in a language you do not read, ask out loud which bottle goes with fish, and get a spoken answer in under a second. Three things happened there that used to need three separate systems: a model read pixels, a model heard speech, and a model reasoned about food. In 2024 those stopped being three systems. They became one.
That shift has a name. A natively multimodal model is trained from the start on text, images, audio, and video together, in a single network, rather than being a language model with other senses bolted on afterward. The word "natively" is carrying real weight. It marks the difference between a model that was taught to see and a model that was assembled from a seeing part and a talking part. This post is about why that difference exists, what it changes for real work, and where it still falls down.
Origin: from a stack of models to one model
For most of the deep learning era, handling more than one kind of data meant wiring separate specialists together. You wanted a system that could answer questions about a photo, so you took a vision model that turned the photo into a description, and you fed that description into a language model. The language model never saw the photo. It saw a translation of the photo, produced by a different network with its own blind spots. This is the stitched pipeline, and for years it was simply how multimodal AI was built.
It works up to a point, and that point is wherever the task needs vision and language tightly integrated. The vision model decides in advance what is worth describing. If it does not mention the small print in the corner of a contract, the language model cannot ask about it, because as far as the language model knows, the small print was never there. Detail gets thrown away at the handoff, and no cleverness downstream gets it back.
The intellectual groundwork for closing that seam was laid in January 2021, when OpenAI released CLIP. CLIP trained an image encoder and a text encoder together on 400 million image and caption pairs scraped from the web, with one objective: put a picture and its matching words in the same place (CLIP on Wikipedia). Not a similar place. The same coordinate space, so that the vector for a photo of a dog and the vector for the words "a photo of a dog" land next to each other. That idea, a shared representation space across modalities, is the seed of everything that followed. If pictures and words can be points in one space, a single model can reason over both at once. (If the word "vector" needs grounding, our piece on what an embedding is covers it from first principles.)
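A toy sketch of what a shared space buys you. The three-dimensional vectors below are invented for illustration; real CLIP embeddings come from trained encoders and have hundreds of dimensions. The file names and the `best_caption` helper are hypothetical. The point is only that once pictures and captions live in one space, matching them is a nearest-neighbor lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Made-up embeddings standing in for image-encoder and text-encoder outputs.
image_vectors = {
    "dog_photo.jpg": [0.9, 0.1, 0.2],
    "cat_photo.jpg": [0.1, 0.9, 0.2],
}
text_vectors = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.2, 0.8, 0.1],
}


def best_caption(image_name):
    """Return the caption whose vector lies closest to the image's vector."""
    image_vec = image_vectors[image_name]
    return max(text_vectors, key=lambda caption: cosine(image_vec, text_vectors[caption]))


print(best_caption("dog_photo.jpg"))  # a photo of a dog
```

Nothing in `best_caption` knows which vector came from pixels and which from words, and that indifference is the property later models build on.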
CLIP still had two encoders. The next move was to stop treating vision and language as separate networks at all, and that is where native multimodality begins. The field now distinguishes two ways to combine senses (Apple research on early fusion). Late fusion keeps a dedicated vision encoder and merges its output into the language model near the end. Early fusion, the basis of native multimodality, sends image and text into the same transformer from the first layer and processes them jointly all the way down. A 2025 Apple study trained both from scratch under matched compute and found early fusion at least as strong, more efficient at lower compute budgets, and needing fewer parameters to hit the same quality (scaling laws for native multimodal models). The stitched pipeline was not just inelegant. It was leaving capability on the table.
Present: everything is a token
The trick that makes one network handle four kinds of input is simpler than it sounds, and it leans on the same architecture as text models. (For how that architecture works, see the transformer explained in plain English.) A transformer does not actually consume words. It consumes tokens, which are just numbered chunks, each turned into a vector. Nothing in the design says those chunks have to be text.
So an image gets cut into a grid of small patches, say sixteen pixels on a side, and each patch becomes a token, the same kind of object a word becomes (understanding multimodal LLMs). Audio gets sliced into short time windows and each window becomes a token. Video is images plus audio across time, so it becomes a long sequence of both. Once a photo, a sound, and a sentence are all sequences of tokens in one shared space, the transformer does not need to know which modality a given token came from. It runs attention across all of them together, so a token from the corner of a chart can directly influence how the model reads the question typed beneath it. That direct cross-modal attention, with nothing thrown away at a handoff, is the whole payoff.
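The patch step can be sketched in a few lines. This is a minimal illustration, not any particular model's preprocessing: the image is a plain nested list of grayscale values, and `image_to_patches` is a hypothetical helper. In a real model each flattened patch would then pass through a learned projection to become an embedding vector, which is omitted here.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size square into a single list.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 64x64 grayscale "image" whose pixel value encodes its position.
image = [[row * 64 + col for col in range(64)] for row in range(64)]
patches = image_to_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # prints: 16 256
```

A 64 by 64 image with sixteen-pixel patches yields a 4 by 4 grid, so sixteen tokens, each carrying 256 raw values before projection.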
In practice models still differ in how the senses meet. Some concatenate image tokens straight into the text stream, the unified-embedding approach used by models like Mistral's Pixtral. Others, including Meta's Llama 3.2 vision models, keep a vision encoder and let text attend to image features through added cross-attention layers (two approaches to multimodal LLMs). The cleanest mental model is still the first one: turn everything into tokens, drop them in the same space, let attention do the rest.
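The two designs differ mainly in what a text token is allowed to attend over. The sketch below uses invented two-dimensional vectors and computes plain scaled dot-product attention weights. It leaves out the learned query, key, and value projections, multiple heads, and masking that real models use, so treat it as a picture of the data flow rather than either model's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical token vectors, already in one shared space.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two image patches
text_tokens = [[1.0, 0.2], [0.1, 0.1]]   # two words

# Unified-embedding style: one sequence, so a word attends over patches and words alike.
sequence = image_tokens + text_tokens
unified = attention_weights(text_tokens[0], sequence)

# Cross-attention style: a word's query attends over image features only.
cross = attention_weights(text_tokens[0], image_tokens)

print(len(unified), len(cross))  # prints: 4 2
```

In both cases the first word puts more weight on the first patch than the second, because their vectors point in similar directions. The difference is whether image and text compete in one attention pass or meet in a separate added layer.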
The models you have heard of are built this way. OpenAI's GPT-4o, released on May 13, 2024, was the consumer turning point. The "o" stands for omni, and unlike GPT-4, which routed audio through separate speech models, GPT-4o processes text, vision, and audio in one network (GPT-4o on Wikipedia). That is why it answers a spoken question in around 320 milliseconds, close to human conversational tempo, where the older pipeline took seconds: there is no handoff between a transcription model, a text model, and a voice model, because there is one model (IBM on GPT-4o). In March 2025 OpenAI extended the same model to generate images directly, building them token by token rather than calling a separate image generator, which is why it can finally render legible text inside a picture (4o image generation).
Google built the Gemini family to be multimodal from the start, designed to process text, images, audio, video, and code. The first models launched in December 2023, and Gemini 1.5 followed in February 2024 with a context window of up to one million tokens, enough to hold hours of video in a single prompt (Gemini on Wikipedia, Gemini 1.5 technical report). Anthropic's Claude 3 family, released March 2024, added vision across all three sizes, with strength in document and diagram reasoning (Claude 3 family, Claude vision docs). The open-weight tier has closed much of the gap: Alibaba's Qwen2.5-VL flagship is reported to match GPT-4o and Claude on document understanding, while Pixtral, Molmo, Phi-4 multimodal, and Gemma give teams capable vision-language models they can self-host (open vision-language models 2026, Qwen2.5-VL technical report).
What this changes for real work is the part the demos undersell. Four examples:
- Document understanding. A stitched pipeline ran a PDF through optical character recognition, lost the layout, and handed a wall of text to the model. A native model sees the page. It reads a number in a table and the column header above it and the footnote below it as one visual object, which is why multimodal models now drive invoice extraction, claims processing, and contract review. McKinsey estimates that digital and AI-driven claims handling can cut claims expenses by up to 30 percent once documents are read this way (McKinsey on the future of insurance claims).
- Voice interfaces. Because one model hears and answers, voice assistants can handle being interrupted, catch tone, and reply fast enough to feel like conversation rather than a walkie-talkie exchange. The voice AI market reflects the pull: the AI voice generator segment was worth roughly 3 billion dollars in 2024 and is forecast to pass 20 billion by 2031 (AI voice generator market).
- Video analysis. A model that ingests frames and audio together can answer questions about a two-hour recording: where a topic was discussed, what was on screen when a claim was made, which clip matches a description. Search inside video stops meaning search of a human-written transcript.
- Agents that see a screen. Computer-use agents work by taking a screenshot, reading it with a multimodal model, deciding where to click, and repeating. This only works because the model genuinely sees the interface. Claude Sonnet 4.6, released in February 2026, reached 72.5 percent on the OSWorld benchmark of 369 real desktop tasks, edging past the 72.36 percent recorded for a human baseline (computer use agents 2026).
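The screenshot, read, decide, click loop from the last example can be sketched as follows. Everything here is a stand-in: `FakeDesktop` simulates a screen with one dialog, and `fake_model` is a placeholder for a real multimodal model call that would receive pixels rather than a dictionary. No vendor's computer-use API is being reproduced.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str  # "click" or "done"
    x: int = 0
    y: int = 0


class FakeDesktop:
    """Simulated screen: a dialog that closes when its OK button is clicked."""

    def __init__(self):
        self.dialog_open = True
        self.clicks = []

    def screenshot(self):
        # A real agent would capture pixels; this returns a structured stand-in.
        return {"dialog_open": self.dialog_open, "ok_button": (120, 80)}

    def click(self, x, y):
        self.clicks.append((x, y))
        if (x, y) == (120, 80):
            self.dialog_open = False


def fake_model(screenshot, goal):
    """Placeholder for the multimodal model: look at the screen, pick an action."""
    if screenshot["dialog_open"]:
        x, y = screenshot["ok_button"]
        return Action("click", x, y)
    return Action("done")


def run_agent(desktop, goal, max_steps=10):
    """Observe, decide, act, repeat. Returns the number of actions taken."""
    for step in range(max_steps):
        action = fake_model(desktop.screenshot(), goal)
        if action.kind == "done":
            return step
        desktop.click(action.x, action.y)
    raise RuntimeError("step budget exhausted before the goal was reached")


desktop = FakeDesktop()
print(run_agent(desktop, "close the dialog"))  # prints: 1
```

The `max_steps` cap is a deliberate design choice: an agent that misreads the screen can otherwise click forever, so the loop fails loudly instead.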
Where it still breaks
Native multimodality is real progress and a genuinely leaky abstraction. The honest version has caveats.
Visual hallucination is the big one. A multimodal model will confidently describe an object that is not in the image, assign the wrong color, or invent a spatial relationship, and a survey of the problem shows it is distinct from text hallucination and not fixed by borrowing text-only solutions (survey of multimodal hallucination). Part of the cause is imbalance: the language side of these models is trained on far more data than the vision side, so the model often leans on what is statistically likely in text rather than what is actually in the pixels. A wrong answer can still read as authoritative.
Precise visual tasks stay weak. Counting objects in a crowded scene, reading an exact value off an unlabeled chart, reasoning about which shape is left of which: these are where multimodal models still slip, because attention over patches captures the gist of an image better than its precise geometry. Audio and video are less mature than still images, and the picture gets worse in languages with thin training data.
There is also a cost the token framing hides. A high-resolution image becomes a great many tokens, and video becomes far more, so multimodal inputs are expensive to process and they consume the context window fast. The convenience of dropping in a video is real. So is the bill.
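A back-of-envelope calculation makes the bill concrete. The assumptions are loud ones: one token per sixteen-pixel patch, one sampled frame per second, no audio tokens, and no tiling, pooling, or compression. Production models use their own schemes and will give different numbers, so read this as an order-of-magnitude sketch with hypothetical helper names.

```python
def image_tokens(width, height, patch_size=16):
    """Tokens for one image, assuming one token per patch (ceiling division)."""
    cols = -(-width // patch_size)
    rows = -(-height // patch_size)
    return cols * rows


def video_tokens(width, height, seconds, frames_per_second=1, patch_size=16):
    """Tokens for a video, assuming every sampled frame is tokenized in full."""
    return image_tokens(width, height, patch_size) * seconds * frames_per_second


print(image_tokens(224, 224))        # prints: 196
print(image_tokens(1024, 1024))      # prints: 4096
print(video_tokens(224, 224, 3600))  # prints: 705600
```

Under these assumptions a single 1024 by 1024 image costs about twenty times what a 224 by 224 thumbnail does, and one hour of small-frame video already runs to roughly 700,000 tokens.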
Future and impact
The direction is toward output, not just input. Through 2024 and 2025 most of these models could read four modalities but mainly write text. That is ending. At Google I/O on May 19, 2026, Google introduced Gemini Omni, a family meant to take text, images, audio, and video as input and produce video, edited images, and audio as output, with CEO Sundar Pichai framing the goal as creating anything from any input (Gemini Omni). A model that both perceives and renders across modalities is the next plateau, and the market expects it: multimodal AI was valued at roughly 1.6 billion dollars in 2024 and is forecast to grow above 30 percent a year through the following decade (multimodal AI market).
The more interesting consequence is for agents. An agent that can see a screen, hear a user, read a document, and watch a video is not limited to tasks already digitized into clean text. It can work the messy interfaces and mixed media that most real jobs run on. The hard part stops being perception and becomes reliability: a model that hallucinates an object in an image, inside an agent that then acts on that object, compounds a small error into a wrong action. The engineering that matters is the verification and the guardrails around the model, not the model alone. That gap, between a multimodal model that impresses in a demo and a multimodal agent that holds up in production, is where an implementation partner like Perform Digital does its work.
Treat native multimodality as a real capability gain with sharp edges. It removes the seam that used to throw away detail between the eye and the mind of an AI system. It does not remove the need to check what the system claims to see.
Council summary
The council judged this post publishable. It makes the right central argument: native multimodality is an architectural break, not a marketing label, because training one network on text, images, audio, and video removes the lossy handoff that crippled the old stitched pipelines. The mechanism is explained without math, the lineage from CLIP through GPT-4o, Gemini, Claude vision, and the open-weight tier is accurate and current to mid-2026, and the payoff is grounded in named applications rather than adjectives. The treatment of limits is the strongest part: it refuses to oversell, naming visual hallucination, weak counting, immature audio and video, and the token cost of images. A reader leaves able to explain what "natively multimodal" means, why it beats a bolted-on pipeline, and exactly where to stay skeptical.