Ask a search engine for "ways to cut my electricity bill" and it can hand back a page titled "lowering household energy costs" that shares not one word with your query. Ask a music app for something like the song you just played and it finds a track by an artist you have never heard of. Neither system understands you. Both are doing the same small piece of arithmetic underneath, and that arithmetic is the most quietly important idea in applied AI. It is called an embedding, and once you see it, you see it everywhere: in semantic search, in the retrieval step of every RAG system, in recommendations, in spam filters, in the memory of an AI agent. The flashy part of AI is the model that writes. The part that makes most of it useful at scale is the embedding.
An embedding turns a thing (a word, a sentence, a photo, a product, a user) into a list of numbers called a vector. That list might be 384 numbers long, or 1,536, or 3,072. The numbers themselves are meaningless to a human. What matters is where the list lands. Each embedding is a coordinate, a position in a space with hundreds or thousands of dimensions, and the whole point of the exercise is that things with similar meaning land in similar places. Meaning becomes position. Similarity becomes distance. That single move is what this post is about.
Origin: meaning as company kept
The idea is older than the neural networks that made it famous. In 1954 the linguist Zellig Harris argued that you could analyze a word purely by the patterns of text it appears in. Three years later J.R. Firth gave the idea its memorable line: you shall know a word by the company it keeps (ACL Wiki on the distributional hypothesis). The claim is that "coffee" and "tea" mean similar things because they show up in similar company, near "drink," "hot," "cup," "morning." You never need a dictionary. You just need enough text and a way to count.
For decades that was a theory without a good machine behind it. The first computational attempts arrived in the 1990s. Latent semantic analysis built a giant table of which words appeared in which documents, then used linear algebra to squeeze it down into a few hundred dimensions per word (distributional semantics). It worked, slowly. The breakthrough that reached everyone was word2vec, released by a Google team led by Tomas Mikolov in 2013. It trained word vectors on billions of words fast, and it produced an effect nobody had designed for. The vectors did arithmetic.
That is the famous "king minus man plus woman" result, and it is worth being precise about, because it is often half-explained. Take the vector for "king," subtract the vector for "man," add the vector for "woman," and the nearest vector to the result is "queen." It works because a consistent relationship, in this case gender, turns out to be a consistent direction in the space. The step from "man" to "woman" is roughly the same move, same direction and distance, as the step from "king" to "queen," or "actor" to "actress," or "uncle" to "aunt." Relationships became geometry. One honest caveat: most implementations only get "queen" because they exclude the three input words from the answer. The literal closest vector to the result is usually "king" itself, and "queen" is the nearest word that was not part of the sum (why king minus man plus woman works). The analogy is real. It is also a tidy demo that hides some mess, and the same arithmetic happily returns "doctor minus man plus woman equals nurse" because it learned that bias straight from the text.
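The caveat about excluding the input words is easy to see in a few lines. The sketch below uses five hand-made three-dimensional vectors, invented purely for illustration and not taken from word2vec or any real model, chosen so that "queen" sits near the result without landing exactly on it. The function name `analogy` and its `exclude_inputs` flag are hypothetical.

```python
import math

# Toy 3-d vectors invented for this example (not from a real model).
VECTORS = {
    "king":  [0.90, 0.50, 0.30],
    "queen": [0.85, 1.10, 0.35],
    "man":   [0.10, 0.50, 0.80],
    "woman": [0.10, 0.70, 0.80],
    "apple": [0.00, 0.10, 1.00],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, exclude_inputs):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [word for word in VECTORS if word not in skip]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))

# With these toy numbers the literal nearest neighbor is an input word.
literal_nearest = analogy("king", "man", "woman", exclude_inputs=False)   # "king"
# Once the three inputs are filtered out, the expected answer surfaces.
filtered_nearest = analogy("king", "man", "woman", exclude_inputs=True)   # "queen"
```

The filter is doing real work here: without it the demo returns the word it started from.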
Word2vec and its Stanford counterpart GloVe had one hard limit. Every word got exactly one vector. "Bank" had a single position whether you meant a river or a mortgage. The fix came with the transformer in 2017 and with BERT in 2018: contextual embeddings, where the vector for a word is computed fresh for the sentence it sits in (transformer vs word2vec embeddings). "River bank" and "savings bank" now land in different places. That is the version of embeddings the modern stack runs on.
Present: the layer almost everything sits on
Today an embedding is produced by a dedicated neural network, an embedding model, trained on a specific job: place similar things near each other and push different things apart. The training method has a name, contrastive learning. Show the model millions of pairs that should be close (a question and its correct answer, an image and its caption) and millions that should be far apart, then adjust the model's weights until the geometry matches (how CLIP learns multimodal embeddings). OpenAI's CLIP did this in 2021 across 400 million image and text pairs, and the striking result was a single shared space holding both. The photo of a dog and the words "a dog" land near each other, which is why you can search a photo library by typing.
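To make "adjust until the geometry matches" concrete, here is a minimal sketch of the kind of score a contrastive trainer tries to push down. It assumes an InfoNCE-style loss, one common formulation, in which every other item in the batch serves as a negative; the source does not say which loss any particular model uses, and the function name, the temperature value, and the two-dimensional vectors are all made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(queries, positives, temperature=0.1):
    """Average InfoNCE-style loss over a batch.

    queries[i] should match positives[i]; every other positive in the
    batch is treated as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += -(logits[i] - log_denominator)
    return total / len(queries)

# Made-up vectors: two queries and their matching partners.
queries = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]      # each partner points the same way as its query
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # partners swapped, so the geometry is wrong

good_loss = contrastive_loss(queries, matched)
bad_loss = contrastive_loss(queries, mismatched)
```

The loss is near zero when each pair already points the same way and large when the pairs are swapped. Training is the process of nudging the model's weights so that real data looks like the first case.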
To compare two embeddings you need a measure of closeness, and the standard one is cosine similarity. Skip the formula and keep the picture. Every embedding is an arrow from the origin to a point. Cosine similarity looks only at the angle between two arrows, not their length. Arrows pointing the same way score 1. At a right angle they score 0. Pointing in opposite directions they score -1 (Pinecone on vector similarity). Ignoring length is the useful part: a three-word query and a thousand-word document can still register as a near-perfect match if they point the same way, because direction carries the meaning and length mostly carries how much text there was (IBM on cosine similarity). It is not a perfect tool. Recent work shows cosine scores can be inflated by how a model was trained rather than by real similarity (is cosine similarity really about similarity). It is still the workhorse metric, because it is cheap and it mostly works.
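For readers who do want the arithmetic behind the picture, it is short. This is a plain-Python sketch with made-up two-dimensional arrows; production code would normally use a vectorized library rather than loops.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

same_direction = cosine_similarity([1.0, 0.0], [3.0, 0.0])    # same way, different length
right_angle = cosine_similarity([1.0, 0.0], [0.0, 2.0])       # perpendicular arrows
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])         # pointing apart

# Scaling an arrow changes its length but not its angle, so the score holds
# (up to floating-point rounding).
short_arrow = [1.0, 2.0, 3.0]
long_arrow = [10.0, 20.0, 30.0]
scaled = cosine_similarity(short_arrow, long_arrow)
```

The last case is the three-word query against the thousand-word document: ten times the length, the same direction, and a score that still comes out at essentially 1.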
Stack those two ideas, meaning as position and distance as similarity, and a long list of products falls out of the same mechanism:
- Semantic search. Embed every document once. Embed the query. Return the documents whose vectors sit closest. This is the "electricity bill" example, and it is why search stopped depending on shared keywords.
- Retrieval-augmented generation. Before a language model answers, the system embeds the question, retrieves the closest chunks from a private knowledge base, and pastes them into the prompt. The embedding is the retrieval engine. See our companion piece, Retrieval-Augmented Generation: From Vector Search to Agentic RAG.
- Recommendation. Netflix, Spotify, and Amazon embed both users and items into one space, then recommend the items sitting nearest a user (embeddings powering search and RAG). More on the family of methods in How Recommendation Algorithms Actually Work.
- Classification and clustering. To sort feedback into themes, embed every comment and group the nearby ones. To tag email as spam, embed it and check whether it sits closer to known spam examples or to known non-spam examples. The embedding model itself needs no retraining.
- Agent memory. When an agent needs to recall a past conversation, it embeds the current moment and retrieves the closest stored memories. Covered in Agent Memory Explained.
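Several of the items above reduce to the same few lines: rank stored vectors by closeness to a query vector. The sketch below shows semantic search and nearest-example classification side by side. Every vector is a hand-written stand-in for what an embedding model would return, and the document titles, the function names `search` and `classify`, and the labels are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embeddings computed once, ahead of time (invented numbers).
DOC_VECTORS = {
    "lowering household energy costs": [0.9, 0.1, 0.0],
    "best hiking trails near Denver":  [0.0, 0.2, 0.9],
    "how to read your utility meter":  [0.7, 0.4, 0.1],
}

def search(query_vector, doc_vectors, k=2):
    """Return the k document titles whose vectors sit closest to the query."""
    ranked = sorted(
        doc_vectors,
        key=lambda title: cosine(query_vector, doc_vectors[title]),
        reverse=True,
    )
    return ranked[:k]

def classify(vector, labeled_examples):
    """Return the label whose example vector is closest to the input."""
    return max(labeled_examples, key=lambda label: cosine(vector, labeled_examples[label]))

# Stand-in for the embedding of "ways to cut my electricity bill".
query = [0.8, 0.2, 0.0]
top_hits = search(query, DOC_VECTORS)

# One example vector per label; a real filter would use many.
label = classify([0.1, 0.9], {"spam": [0.0, 1.0], "not spam": [1.0, 0.0]})
```

Retrieval for RAG and recall for agent memory follow the same shape: the only things that change are what gets embedded and what happens to the top hits afterward.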
All of this needs somewhere to keep millions or billions of vectors and search them in milliseconds. That is the job of the vector database. Pinecone is the managed, speed-to-production option. Qdrant, written in Rust, closed a 50 million dollar Series B in March 2026 and tops its own published benchmarks for query latency and throughput against rivals (Qdrant Series B, Qdrant benchmarks). Weaviate, Milvus for billion-vector scale, and Chroma for fast prototyping fill out the field (vector database comparison 2026). They all lean on the same trick to stay fast: approximate nearest neighbor search. Comparing a query against every vector is too slow at scale, so an algorithm called HNSW builds a layered graph of shortcuts through the space and hops toward the closest matches, trading a sliver of accuracy for an enormous speed gain (how HNSW works).
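The "hop toward the closest matches" idea can be sketched on a single flat graph. This is a deliberate simplification and not HNSW itself: real HNSW stacks several layers with long-range links on top and tracks a list of candidates rather than a single current node. The function names and the ten points on a line are invented for illustration.

```python
import math

def build_neighbor_graph(points, k=2):
    """Link each point to its k nearest points (brute force, fine for a toy set)."""
    graph = {}
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        graph[i] = others[:k]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Hop to the neighbor closest to the query until no neighbor improves.

    Returns (index of the point where the walk stopped, number of hops).
    """
    current = entry
    hops = 0
    while True:
        best = min(graph[current], key=lambda j: math.dist(query, points[j]))
        if math.dist(query, points[best]) >= math.dist(query, points[current]):
            return current, hops
        current = best
        hops += 1

# Ten made-up points spaced along a line.
points = [(float(i), 0.0) for i in range(10)]
graph = build_neighbor_graph(points)

query = (7.2, 0.0)
found, hops = greedy_search(points, graph, query)

# Exhaustive scan for comparison: checks every point.
exact = min(range(len(points)), key=lambda j: math.dist(query, points[j]))
```

On this tidy example the walk ends at the true nearest point. On real data a greedy walk can stop at a point that is merely better than its neighbors, which is where the "approximate" in approximate nearest neighbor comes from, and why the extra layers and candidate lists exist.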
A practical note on size. OpenAI's text-embedding-3-small outputs 1,536 numbers by default, text-embedding-3-large outputs 3,072, both capped at 8,192 input tokens (OpenAI embeddings guide). More dimensions can hold more nuance, but every dimension costs storage, memory, and retrieval time. For most retrieval work the quality gain flattens out past roughly 768 dimensions. Bigger is not free, and it is often not better.
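The storage side of that trade-off is simple multiplication. The sketch assumes vectors are stored as 32-bit floats (4 bytes per number) and ignores index overhead and metadata, so treat the results as a floor rather than a bill. The one-million-vector corpus is a made-up round number.

```python
def float32_storage_bytes(num_vectors, dimensions):
    """Raw bytes needed to store the vectors as 32-bit floats (4 bytes per number)."""
    return num_vectors * dimensions * 4

ONE_MILLION = 1_000_000

small_model = float32_storage_bytes(ONE_MILLION, 1536)   # 6_144_000_000 bytes, about 6.1 GB
large_model = float32_storage_bytes(ONE_MILLION, 3072)   # 12_288_000_000 bytes, about 12.3 GB
compact = float32_storage_bytes(ONE_MILLION, 768)        # 3_072_000_000 bytes, about 3.1 GB
```

Doubling the dimensions doubles the raw footprint, and the same multiplier applies to the memory an in-RAM index needs.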
Future and impact: smaller, faster, and harder to fool
The clearest trend is shrinking the vector without losing what it knows. Matryoshka representation learning, named for the nesting dolls, trains a model to pack the most important information into the earliest numbers of the vector. You can then chop a long embedding down to 256 or even 128 numbers and keep most of the retrieval quality (Matryoshka embedding models). Voyage AI's voyage-4 family, released in January 2026, ships this alongside binary quantization, an option that stores each number as a single bit instead of a 32-bit float (Voyage 4 announcement). Binary quantization alone shrinks raw vector storage by a factor of 32, and stacking it with Matryoshka truncation has been measured to cut total vector storage costs by roughly 80 percent at a small accuracy price (quantization and Matryoshka cost study). The emerging production pattern is two-stage: tiny embeddings to grab a wide set of candidates fast, then full-size embeddings to rank the finalists carefully.
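The three moves in that paragraph, truncation, binary quantization, and the two-stage pattern, each fit in a short function. The sketch below makes several assumptions: the vectors are invented, the sign-based binarization is one simple scheme rather than any vendor's actual method, and every function name is hypothetical. Truncation only preserves quality when the model was trained the Matryoshka way.

```python
import math

def truncate(vector, dims):
    """Keep the first `dims` numbers and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # avoid dividing by zero
    return [x / norm for x in head]

def binarize(vector):
    """One bit per number: 1 if the number is positive, otherwise 0."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    """Number of positions where two bit lists differ; smaller means closer."""
    return sum(x != y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query, docs, shortlist_size, k):
    """Stage 1: cheap Hamming ranking on bits. Stage 2: cosine rerank on full vectors."""
    query_bits = binarize(query)
    shortlist = sorted(docs, key=lambda name: hamming(query_bits, binarize(docs[name])))
    shortlist = shortlist[:shortlist_size]
    return sorted(shortlist, key=lambda name: cosine(query, docs[name]), reverse=True)[:k]

# Made-up 8-d vector: the first four numbers carry most of the magnitude.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.02]
short = truncate(full, 4)
bits = binarize(full)

# 32 bits per number versus 1 bit per number, at the same dimension count.
shrink_factor = (1024 * 32) / (1024 * 1)

docs = {
    "a": [0.9, 0.1, -0.2, 0.3],
    "b": [0.8, 0.3, -0.1, 0.4],
    "c": [-0.7, 0.2, 0.6, -0.1],
}
best = two_stage_search([1.0, 0.2, -0.3, 0.2], docs, shortlist_size=2, k=1)
```

In the toy search the bit-level pass cannot tell "a" from "b" but cheaply discards "c". The full-precision pass then separates the two finalists, which is the division of labor the two-stage pattern relies on.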
The second shift is away from text alone. The same space that holds words is starting to hold images, audio, and video together, so a single query can reach across all of them. The third is sharper specialization. General models are giving ground to embeddings tuned for code, law, and medicine, where domain training adds several points of retrieval accuracy over a generic model.
Be clear-eyed about the limits. An embedding compresses meaning, and compression loses things, which is why teams now pair embedding search with keyword search and a reranking step rather than trusting vectors alone. The arithmetic also has no conscience: a model trained on biased text encodes that bias as geometry, and "doctor minus man plus woman equals nurse" is the field's standing reminder. Embeddings are a measurement of the text they were fed, not a measurement of truth.
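Pairing vector search with keyword search means merging two ranked lists into one. The sketch below uses reciprocal rank fusion, a common and simple way to do that; it is an assumption here, since the text above does not say which fusion method teams use. The document IDs are invented, and `k=60` is the constant conventionally used with this method. A reranking model would then reorder the top of the fused list.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over all lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# Documents that both retrievers found rise above those only one of them found.
```

The method needs only ranks, not raw scores, which matters because cosine similarities and keyword scores are not on comparable scales.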
For any team building agentic systems, this is not a detail to delegate. The quality of an agent's retrieval, its memory, and its grounding is decided by the embedding model and the chunking around it, long before the language model speaks. Picking that layer well is a large part of the gap between a demo that impresses and a system that holds up. It is the kind of unglamorous engineering choice Perform Digital tends to find at the root of the difference.
Council summary
The post argues that the embedding, not the language model, is the component doing most of the heavy lifting in applied AI: it turns meaning into position so that similarity becomes measurable distance, and that one move quietly powers semantic search, RAG retrieval, recommendation, classification, and agent memory. It traces a clean arc from the distributional hypothesis of Harris and Firth in the 1950s through word2vec's arithmetic to today's contextual, contrastively trained models, and on to the near-term direction of smaller quantized vectors, multimodal spaces, and domain-specific models. It is honest about the limits: embeddings compress and lose detail, they encode the bias of their training text, and serious systems now pair them with keyword search and reranking. The reader's takeaway is concrete: the embedding model and the chunking around it are an architecture decision, not an implementation footnote, and getting that layer right is much of the distance between an AI demo and a system that holds up in production.