Open the product page for any item on a large marketplace and look at the URL. Somewhere in it sits a string like B0040JHNQG. That is the item's identifier, and it is arbitrary. The item next to it on the shelf has a string that is no closer in any meaningful way. The string is a name with nothing behind it, a label assigned at random when the item entered the catalog.
That choice, harmless as it looks, sets the hardest limit on a recommendation system. A model that learns from random IDs has to learn every single item from scratch, purely from how people interact with it. For a popular item that works. For everything else it does not. Semantic IDs are the fix that the largest recommenders have spent the last three years adopting, and they amount to a change in what an item's name is. If you have read our recommendation algorithms arc, this is the representation shift that piece ended on, examined up close.
Origin: the cost of a meaningless name
To see why a random ID hurts, you have to know what a recommender does with one. Almost every modern system represents each item as an embedding, a vector of learned numbers that places the item in a space where nearby vectors mean similar items. (Our explainer on what an embedding is covers the idea in full.) The model keeps a giant lookup table, one row per item, and each row holds that item's embedding.
Here is the catch. When the item ID is random, the embedding table is the only place item knowledge can live. The ID itself tells the model nothing, so the row has to be filled in entirely from interaction data, click by click, purchase by purchase. Three problems follow directly, and they are not edge cases.
The first is cold start. A brand new item arrives with a fresh random ID and a blank, randomly initialized row in the table. The model has never seen anyone interact with it, the ID carries no hint of what it is, so the system has nothing to go on. A new product can sit nearly invisible until enough behavior accumulates to teach its row. On a platform where fresh items appear constantly, that delay is a real cost.
The second is the long tail. Most catalogs follow a brutal distribution: a small head of popular items gets the vast majority of interactions, and a long tail of rare items gets almost none. Random IDs give every tail item a row that stays badly trained forever, because the data to train it never arrives. The model effectively cannot recommend the long tail well, which is the part of the catalog where discovery would matter most.
The third is size. One row per item means the embedding table grows in lockstep with the catalog. A marketplace with hundreds of millions of items carries an embedding table with hundreds of millions of rows, often the largest single component of the model by parameter count. The catalog also never sits still. Items are added and retired daily, so the table is a moving target. As one practitioner overview puts it, random IDs are good at memorization and lookup and bad at generalization, and generalization is exactly what new and rare items need.
The root cause sits underneath all three. A random ID has no structure to share. Item 84412 and item 84413 might be near-identical products, but the model has no way to know that from the IDs, so whatever it learns about one transfers nothing to the other. Every item is an island.
Present: an ID that carries meaning
A semantic ID throws out the single random identifier and replaces it with a short sequence of discrete codes derived from the item's own content. Instead of B0040JHNQG, an item gets something like (12, 153, 87, 21). Because the codes come from content rather than chance, two similar items end up sharing leading codes, the way two files in the same project share the start of their folder path. The ID itself now carries meaning. As one technical walkthrough frames it, the model can see that two items are related before a single person has interacted with either.
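The folder-path analogy can be made concrete with a few lines of Python. This is only an illustrative sketch: the three IDs and the item names are invented here, and `shared_prefix_len` is a hypothetical helper, not part of any library.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading codes two semantic IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical semantic IDs, invented for illustration only.
running_shoe_a = (12, 153, 87, 21)
running_shoe_b = (12, 153, 40, 5)
coffee_grinder = (201, 9, 87, 21)

print(shared_prefix_len(running_shoe_a, running_shoe_b))  # 2
print(shared_prefix_len(running_shoe_a, coffee_grinder))  # 0
```

Note that the grinder matches the first shoe on its last two codes and still scores zero. Only leading codes count, because a later code is only meaningful inside the bucket chosen by the codes before it.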
How is such a code made? The pipeline has two stages, and neither is exotic.
Stage one is a content embedding. Take whatever describes the item, the title, the description, the brand, the category, the images, and run it through a pretrained encoder to get a dense vector. A clothing item and a similar clothing item land close together in that vector space, because their descriptions are close. This is ordinary content understanding, the same machinery behind semantic search.
Stage two turns that continuous vector into a short list of discrete codes. The method that made this practical is called RQ-VAE, for residual-quantized variational autoencoder, and the idea worth keeping is the residual part. Quantizing means snapping a precise vector to the nearest entry in a fixed codebook, a small set of representative vectors. One snap is crude and loses detail. Residual quantization fixes that by going in rounds. Round one snaps the vector to its nearest codebook entry and records that code. Then it measures the residual, the gap between the true vector and the entry it snapped to, and round two quantizes that leftover gap against a second codebook. Round three quantizes what is still left over, and so on. Each round captures a finer slice of the item than the last.
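The rounds described above can be sketched in plain Python. This shows only the residual quantization loop, not a full RQ-VAE: a real system learns its codebooks jointly with an encoder and decoder, whereas the tiny two-dimensional codebooks here are fixed by hand purely for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    def dist2(entry: Vector) -> float:
        return sum((e - x) ** 2 for e, x in zip(entry, v))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def residual_quantize(
    v: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize v in rounds; each round encodes what the previous one missed."""
    codes: List[int] = []
    residual = list(v)
    for codebook in codebooks:
        idx = nearest(codebook, residual)   # snap to the nearest entry
        codes.append(idx)                   # record that round's code
        residual = [r - e for r, e in zip(residual, codebook[idx])]  # leftover gap
    return codes, residual


# Hand-made toy codebooks (an assumption): coarse, medium, fine.
codebooks = [
    [(0, 0), (10, 0), (0, 10), (10, 10)],
    [(-2, -2), (-2, 2), (2, -2), (2, 2)],
    [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)],
]

codes_a, leftover_a = residual_quantize((11.6, 8.3), codebooks)
codes_b, _ = residual_quantize((9.2, 11.9), codebooks)
print(codes_a)  # [3, 2, 1]
print(codes_b)  # [3, 1, 2]
```

Both vectors sit near (10, 10), so they share the first code and diverge only at the finer levels. What remains in `leftover_a` after three rounds is the detail the code cannot express, which is why quantization is lossy.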
The result is a coarse-to-fine code. The first code is a broad bucket, something like the item's general category, and later codes refine within that bucket. The Medium analysis "Unveiling the Secret of Semantic IDs" showed this directly through perturbation tests: the first code tracked broad category, while later codes captured detail attributes like size, color and material. Because two similar items get snapped to the same coarse bucket first, they share that first code, and often the second. The shared prefix is not a happy accident; it is the design.
This also explains a quiet efficiency win. A residual scheme with three codebooks of 256 entries each can describe 256 times 256 times 256 distinct items, roughly 16.7 million, while storing only 768 codebook vectors. The model no longer needs a unique trained row per item. It needs trained embeddings for the codes, and the codes are reused across the whole catalog. In one worked practitioner case study, a setup with 1,024 code embeddings stood in for what would otherwise have been 66,000 separate item embeddings. The representation stops growing with the catalog.
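The arithmetic behind the three-codebook example is short enough to check directly. The variable names below are just labels chosen for this sketch.

```python
levels = 3           # number of codebooks, one per quantization round
codebook_size = 256  # entries in each codebook

# Every combination of one code per level is a distinct semantic ID.
addressable_ids = codebook_size ** levels

# Only the codebook entries themselves need stored, trained vectors.
stored_vectors = codebook_size * levels

print(addressable_ids)  # 16777216
print(stored_vectors)   # 768
```

Capacity grows exponentially with the number of levels while storage grows linearly, which is the whole efficiency argument in two lines.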
Present: why generative recommenders need this
Semantic IDs would be a tidy trick on their own. What makes them important is what they let a transformer do.
A transformer generates text one token at a time, each token drawn from a fixed vocabulary, typically tens of thousands to a few hundred thousand entries. If an item is a single random ID, you cannot generate it that way, because the vocabulary of item IDs is the whole catalog, hundreds of millions of entries, and it changes every day. But if an item is a short sequence of codes from small codebooks, it looks exactly like a short sequence of tokens. A transformer can read a user's history and generate the next item the way it generates the next phrase, one code at a time.
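To see how a history of items becomes a token sequence, here is a minimal sketch of one plausible convention: give each codebook level its own block of token ids, so that code 12 at level 0 and code 12 at level 1 stay distinct. The offset scheme, the function names and the sample history are assumptions made for illustration, not a description of any particular production system.

```python
from typing import Iterable, List, Sequence

CODEBOOK_SIZE = 256  # entries per level, matching the running example


def to_tokens(semantic_id: Sequence[int], codebook_size: int = CODEBOOK_SIZE) -> List[int]:
    """Map each (level, code) pair to a unique token id via a per-level offset."""
    tokens = []
    for level, code in enumerate(semantic_id):
        if not 0 <= code < codebook_size:
            raise ValueError(f"code {code} at level {level} is out of range")
        tokens.append(level * codebook_size + code)
    return tokens


def history_to_tokens(history: Iterable[Sequence[int]]) -> List[int]:
    """Flatten a user's item history into one token sequence for the model."""
    tokens: List[int] = []
    for semantic_id in history:
        tokens.extend(to_tokens(semantic_id))
    return tokens


# Hypothetical history of two items, each a three-code semantic ID.
history = [(12, 153, 87), (12, 153, 40)]
print(history_to_tokens(history))  # [12, 409, 599, 12, 409, 552]
```

Under this convention the full item vocabulary is 3 x 256 = 768 tokens, and it stays that size no matter how many items enter or leave the catalog.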
This is the architecture behind generative retrieval, and the paper that defined it is TIGER, short for Transformer Index for Generative Recommenders, published by a Google team at NeurIPS 2023. TIGER encodes each item's text features with a Sentence-T5 model, quantizes that embedding with a three-level RQ-VAE of 256 codes per level, and trains a sequence-to-sequence transformer to generate the semantic ID of the next item directly. No candidate list, no nearest-neighbor lookup over an embedding table. The model writes the answer.
The payoff is twofold. First, retrieval and ranking, two pipeline stages the field kept separate for a decade, collapse into a single generative act. Second, and this is the part that matters most, generalization improves. On the standard Amazon Reviews benchmark datasets, TIGER reported Recall@5 of 0.0454 on the Beauty category against 0.0387 for the strong SASRec baseline, with similar gains on the Sports and Toys datasets. More telling was a cold-start test. The authors removed 5 percent of test items from training entirely, simulating items the model had never seen, and TIGER could still generate them, because a new item's semantic ID is computed from its content and naturally shares a prefix with similar items already known. A random-ID model has no equivalent move.
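For readers unfamiliar with the metric, Recall@K in a next-item setting can be sketched as follows. The sketch assumes one held-out target item per user, in which case Recall@K is simply the share of users whose target appears in the model's top K. The function name and the toy data are invented for illustration and are unrelated to the reported benchmark numbers.

```python
from typing import Hashable, Sequence


def recall_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int,
) -> float:
    """Share of users whose single held-out item appears in their top-k list."""
    if len(ranked_lists) != len(targets) or not targets:
        raise ValueError("need one non-empty ranked list per target")
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy predictions for four users (hypothetical data).
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
targets = ["b", "x", "i", "j"]

print(recall_at_k(ranked, targets, k=2))  # 0.5
print(recall_at_k(ranked, targets, k=3))  # 0.75
```

Read this way, a Recall@5 of 0.0454 means the true next item landed in the top five for roughly 4.5 percent of test users, which shows how hard the task is even for the best model in the comparison.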
Industry did not wait long. A 2025 survey of generative recommendation lists production or near-production systems at Meta, Kuaishou, Alibaba, Meituan, ByteDance, Netflix and others, and semantic IDs sit at the core of most of them. The ranking stage adopted the idea too, not just retrieval: Google researchers ran a case study in YouTube's ranking model that swapped random video IDs for RQ-VAE semantic IDs and improved quality on new and long-tail videos without hurting overall performance. Kuaishou's OneRec, described in a February 2025 paper, is a fully generative recommender that tokenizes each short video into three semantic codes and adds 24,576 code tokens to the model vocabulary. Kuaishou reported a 1.6 percent watch-time gain in its main recommendation scene from OneRec, and a later account cited a roughly 21 percent GMV lift when the approach was applied to its local-services marketplace, alongside a large drop in serving cost. Spotify Research published work in September 2025 using semantic IDs to unify search and recommendation in one model, finding that a single semantic ID scheme has to be trained on both tasks at once or it helps one and hurts the other. The direction is consistent across very different platforms.
Future and impact: the honest tradeoffs
Semantic IDs are not free, and the costs are worth stating plainly.
You now have a quantizer to build and maintain. The RQ-VAE is a second model with its own training, its own loss terms balancing reconstruction against codebook use, and its own tuning. Worse, if the content encoder is upgraded or the catalog drifts, the codes can go stale, and re-quantizing the catalog means the downstream recommender may have to relearn what the codes mean. A 2025 practitioner handbook on building these systems treats this maintenance burden as a first-order concern, not a footnote.
Then there are collisions. Quantization is lossy by design, so two genuinely different items can land on the identical sequence of codes. In the TIGER work, a three-level scheme left a meaningful share of items non-unique, and the authors handled it by appending an extra ordinal token, a fourth code, to break ties. In one independent build, a three-level RQ-VAE reached only 89 percent uniqueness before that tiebreaker was added. Collisions are not always pure harm. When a brand new item collides with an existing one, the shared ID is precisely what lets the model recommend the newcomer on day one. But uncontrolled collisions blur distinct items together and have to be managed.
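The tiebreaker idea is easy to sketch: count how many earlier items already hold the same code sequence and append that count as an extra code. The code below is a simplified illustration under stated assumptions. The item names and IDs are made up, `distinct_ratio` is one simple way to measure uniqueness (distinct IDs divided by items), and the ordinal depends on the order in which items are processed.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def distinct_ratio(item_codes: Dict[Hashable, Sequence[int]]) -> float:
    """Distinct code sequences divided by number of items (1.0 means no collisions)."""
    return len({tuple(codes) for codes in item_codes.values()}) / len(item_codes)


def add_tiebreak_code(
    item_codes: Dict[Hashable, Sequence[int]]
) -> Dict[Hashable, Tuple[int, ...]]:
    """Append an ordinal code so items sharing a code sequence become unique."""
    seen: Dict[Tuple[int, ...], int] = defaultdict(int)
    result = {}
    for item, codes in item_codes.items():  # dicts keep insertion order
        key = tuple(codes)
        result[item] = key + (seen[key],)   # 0 for the first holder, 1 for the next
        seen[key] += 1
    return result


# Hypothetical catalog in which two items collide on all three codes.
catalog = {
    "red_mug": (5, 17, 200),
    "blue_mug": (5, 17, 200),
    "teapot": (5, 40, 3),
    "laptop": (90, 1, 1),
}

print(distinct_ratio(catalog))                     # 0.75
print(add_tiebreak_code(catalog)["blue_mug"])      # (5, 17, 200, 1)
print(distinct_ratio(add_tiebreak_code(catalog)))  # 1.0
```

The two mugs still share their first three codes after the fix, so the model keeps treating them as close relatives while being able to name each one.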
Last is system complexity. A classic recommender is an embedding table plus a nearest-neighbor index, both well understood. A semantic-ID generative recommender adds a content encoder, a quantizer, a token vocabulary and an autoregressive decoder, and autoregressive decoding has its own failure modes, including drifting toward popular codes or generating a code sequence that maps to nothing real. None of that is unsolvable. All of it is more moving parts.
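One of those failure modes, a generated sequence that maps to no real item, can be guarded against by letting the decoder pick only codes that extend a prefix present in the catalog. The sketch below shows one way to build such a lookup. It is an assumption offered for illustration rather than something the systems above are confirmed to use, and the function names and catalog are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple

PrefixIndex = Dict[Tuple[int, ...], Set[int]]


def build_prefix_index(valid_ids: Iterable[Sequence[int]]) -> PrefixIndex:
    """For every prefix of every real semantic ID, record which codes may follow."""
    index: PrefixIndex = defaultdict(set)
    for semantic_id in valid_ids:
        for i in range(len(semantic_id)):
            index[tuple(semantic_id[:i])].add(semantic_id[i])
    return index


def allowed_next(index: PrefixIndex, prefix: Sequence[int]) -> Set[int]:
    """Codes the decoder may emit next; empty if the prefix matches no real item."""
    return index.get(tuple(prefix), set())


# Hypothetical catalog of four items.
catalog_ids = [(12, 153, 87), (12, 153, 40), (12, 7, 3), (201, 9, 87)]
index = build_prefix_index(catalog_ids)

print(sorted(allowed_next(index, ())))         # [12, 201]
print(sorted(allowed_next(index, (12, 153))))  # [40, 87]
print(sorted(allowed_next(index, (12, 99))))   # []
```

At each decoding step, scores for codes outside `allowed_next` would be masked out, so every completed sequence corresponds to a real item. This adds yet another structure to keep in sync with the catalog, which is the complexity cost in miniature.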
The trade is still worth making at scale, which is why the largest platforms made it. A random ID forces a model to learn every item from interactions alone, and that fails exactly where it hurts most, on new items and on the long tail. A semantic ID puts meaning into the name itself, so similar items share structure, the representation stops ballooning with the catalog, and a transformer can generate recommendations the way it generates language. The marketplaces moving to this are not chasing a trend. They are removing a constraint that has shaped recommendation since the first systems shipped.
Council summary
This post argues that the random ID at the heart of every classic recommender is a hidden design flaw, and that semantic IDs fix it by making an item's name carry meaning. The explanation holds together: it shows why random IDs cripple cold start, the long tail and table size, then walks the RQ-VAE pipeline round by round so the coarse-to-fine code structure is genuinely intuitive rather than asserted. It is honest about the price, a quantizer to maintain, code collisions, and an autoregressive decoder with its own failure modes. The reader should leave understanding not just what a semantic ID is but why TIGER, OneRec and the YouTube ranking work all converged on the same move, and why that move only pays off at catalog scale. The verdict: a clear, well-sourced explainer that teaches the concept rather than name-dropping it.