For about thirty years, a recommender system did one job. It looked at a user and an item, and it predicted a number: how many stars would this person give this movie, how likely is this click. Every famous recommendation engine, from the first research prototype to the system that decided your Netflix homepage, was a machine for predicting that number well. Rank the items by their predicted scores, show the top few, done.
That job is now being retired. The newest large-scale recommenders do not predict a score for a candidate you hand them. They generate the recommendation directly, producing the identifier of the next item the way a language model produces the next word. It is a genuine change in what the algorithm is, not a faster version of the old thing. To see why it happened, and why it happened now, it helps to walk the full arc. Each step on it solved a specific limitation of the step before.
Origin: from similar items to similar people
The earliest idea was the obvious one. If a person liked something, recommend more things like it. This is content-based filtering, and it borrowed its machinery wholesale from information retrieval: describe each item by its features, describe the user by the features of items they liked, and recommend by similarity. A user who reads spy novels gets more spy novels. It works, it needs no other users, and it handles a brand new item fine because an item's features exist the moment the item does.
It also has a ceiling you hit fast. A content-based system can only recommend more of the same. It cannot tell you that people who buy this particular camera tend to love a specific unrelated lens, because nothing in the camera's text says so. It is trapped inside the features it was given.
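The feature matching described above can be sketched in a few lines. This is a minimal illustration, not any production system: the three-item catalog, the feature names and the weights are invented, and cosine similarity is one common choice among several.

```python
import math

# Hypothetical toy catalog: each item is a bag of hand-assigned feature weights.
ITEMS = {
    "spy_novel_a": {"spy": 1.0, "thriller": 0.8},
    "spy_novel_b": {"spy": 0.9, "cold_war": 0.7},
    "cookbook": {"cooking": 1.0, "baking": 0.5},
}

def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def user_profile(liked):
    """Describe the user as the summed features of the items they liked."""
    profile = {}
    for item in liked:
        for f, w in ITEMS[item].items():
            profile[f] = profile.get(f, 0.0) + w
    return profile

def recommend(liked, k=2):
    """Rank unseen items by similarity to the user's profile."""
    profile = user_profile(liked)
    scored = [(cosine(profile, feats), name)
              for name, feats in ITEMS.items() if name not in liked]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```

A reader of spy_novel_a gets spy_novel_b first, and the cookbook scores exactly zero because it shares no features with the profile. That zero is the ceiling in miniature: nothing outside the feature overlap can ever surface.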
The breakthrough was to stop looking at the item and start looking at other people. The term collaborative filtering was coined at Xerox PARC for Tapestry, an experimental mail system described by Goldberg, Nichols, Oki and Terry in a 1992 Communications of the ACM paper. Tapestry still needed humans to annotate documents by hand. The leap to automation came from GroupLens, a research project started at the University of Minnesota in 1992, which built a system that predicted how interesting a Usenet article would be to you based on ratings from people whose past ratings matched yours. Around the same time, the Ringo system at MIT did the same for music. No one had to describe an item at all. The algorithm found people similar to you and recommended what they liked. That single move, from item similarity to user similarity, is the foundation everything else is built on.
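To make the move from item similarity to user similarity concrete, here is a small sketch of neighborhood-style prediction. It is an illustration under assumptions, not the GroupLens or Ringo formula: the ratings are invented, and plain cosine similarity over co-rated items stands in for whatever similarity measure a real system would use.

```python
import math

# Hypothetical ratings: user -> {item: stars}.
RATINGS = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "c": 1, "d": 5},
    "cat": {"a": 1, "b": 2, "c": 5, "d": 1},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(RATINGS[u]) & set(RATINGS[v])
    if not shared:
        return 0.0
    dot = sum(RATINGS[u][i] * RATINGS[v][i] for i in shared)
    norm_u = math.sqrt(sum(RATINGS[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(RATINGS[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, ratings in RATINGS.items():
        if other != user and item in ratings:
            s = similarity(user, other)
            num += s * ratings[item]
            den += s
    return num / den if den else None
```

Ann has never rated item d. Her past ratings track Bob's far more closely than Cat's, so predict("ann", "d") is pulled toward Bob's five stars. Note that no feature of item d appears anywhere in the code.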
Collaborative filtering had a scaling problem, though, and it was Amazon that solved it in a way that stuck. Comparing every customer to every other customer in real time was, in Amazon's own words, prohibitively slow at their catalog size. In a 2003 IEEE Internet Computing paper, Amazon engineers Greg Linden, Brent Smith and Jeremy York described item-to-item collaborative filtering: instead of finding users like you, find items that tend to be bought by the same people, then recommend items related to what you already touched. The heavy computation moves offline into an item-to-item similarity table, and the live request is a cheap lookup. Amazon had run the algorithm in production since 1998, five years before the paper appeared. It powered the "customers who bought this also bought" row for a generation, and in 2017 the journal's editorial board, looking back over twenty years of publication, named that paper the one that had best withstood the test of time.
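The offline-table, online-lookup split can be sketched as follows. This is a toy under stated assumptions, not Amazon's implementation: the baskets are invented, and the similarity used here (shared buyers, normalized cosine-style) is one reasonable choice for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase history: customer -> set of items bought.
PURCHASES = {
    "u1": {"camera", "lens", "tripod"},
    "u2": {"camera", "lens"},
    "u3": {"camera", "bag"},
    "u4": {"novel"},
}

def build_similarity_table(purchases):
    """Offline step: score each item pair by how many buyers they share."""
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    table = defaultdict(dict)
    for a, b in combinations(sorted(buyers), 2):
        common = len(buyers[a] & buyers[b])
        if common:
            sim = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
            table[a][b] = table[b][a] = sim
    return table

def also_bought(table, item, k=3):
    """Online step: look up one row of the table and sort its neighbors."""
    neighbors = table.get(item, {})
    return sorted(neighbors, key=neighbors.get, reverse=True)[:k]

TABLE = build_similarity_table(PURCHASES)
```

Calling also_bought(TABLE, "camera") puts the lens first, because two of the three camera buyers also bought it. The request never touches the customer list; all of that work happened when the table was built.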
Origin continued: the Netflix Prize and latent factors
The next shift came with a competition. In October 2006 Netflix launched the Netflix Prize: one million dollars to any team that could beat its in-house Cinematch algorithm by 10 percent on rating prediction, measured as root mean squared error. The target was to push the error on the held-out test set from Cinematch's 0.9525 down to 0.8572. The contest ran for almost three years and was won in September 2009 by the team BellKor's Pragmatic Chaos, at an RMSE of 0.8567, a 10.06 percent improvement.
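For readers who have not met the yardstick before, root mean squared error and the percent improvement over a baseline are both one-liners. The ratings below are invented toy numbers, not Netflix Prize data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which the new error undercuts the baseline error."""
    return (baseline_rmse - new_rmse) / baseline_rmse

actual = [4, 3, 5, 1]
baseline = [3, 3, 3, 3]   # always guess the middle of the scale
better = [4, 3, 4, 2]     # off by one star on two of the four ratings
```

Because the errors are squared before averaging, one prediction that misses by two stars costs as much as four that miss by one. That is why shaving even a few hundredths off RMSE across millions of ratings was hard.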
The Prize mattered less for who won than for what it crowned as the dominant technique: matrix factorization. The idea is elegant. Lay out every rating as a giant, mostly empty grid, users down one side, movies across the top. Matrix factorization assumes that grid can be approximated by multiplying two much smaller matrices together, one describing users, one describing items. Each user becomes a short vector of learned numbers, each item becomes a short vector, and a predicted rating is the dot product of the two.
What makes this powerful is that the numbers in those vectors are not designed, they are learned. The algorithm decides, on its own, what dimensions matter. One might end up tracking how action-heavy a film is, another how mainstream, another something with no clean name at all. Nobody labels these. They are latent factors, discovered because they help explain the ratings that do exist. This was the first time a recommender represented users and items as dense learned vectors, and that representation, the embedding, became the unit of currency for everything after. One footnote worth keeping: Netflix never deployed the full prize-winning ensemble, judging the engineering cost too high against the gain. The technique outlived the contest anyway.
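The factorization described above can be sketched with plain stochastic gradient descent. Treat this as a minimal illustration rather than any prize entry: the ratings, the vector length k, the learning rate and the regularization strength are all assumptions chosen to keep the example small.

```python
import random

# Hypothetical observed ratings as (user, item, stars); most of the grid is empty.
RATINGS = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 5),
]

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def factorize(ratings, n_users, n_items, k=2, epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Learn one k-dimensional vector per user and per item by SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])          # error on one known rating
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def mse(P, Q, ratings):
    """Mean squared error of the dot-product predictions on known ratings."""
    return sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings) / len(ratings)
```

After training, dot(P[u], Q[i]) gives a prediction for any cell, including ones that were never rated, such as user 0 on item 3. Nothing in the code says what either of the two dimensions means; whatever structure they pick up comes from the ratings alone.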
Present: deep learning, two towers, and sequences
Matrix factorization scores a user and an item with a dot product, and a dot product is linear: each latent dimension adds its contribution to the score independently of the others. On its own it cannot capture an interaction like "this user likes horror, and likes comedy, but not horror-comedy." So researchers swapped the dot product for a neural network. The 2017 paper Neural Collaborative Filtering by He and colleagues kept the learned user and item embeddings but fed them through layers that could model curved, non-linear relationships. The embedding idea survived. The scoring function got an upgrade.
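The horror-comedy case is small enough to work by hand. The sketch below is not the Neural Collaborative Filtering architecture; it scores an item for one user from two genre flags, with hand-picked rather than learned weights, purely to show what a single hidden layer buys.

```python
def relu(x):
    return max(0.0, x)

def linear_score(horror, comedy, w_h=1.0, w_c=1.0):
    """Linear scorer: if each genre adds to the score, the blend must score highest."""
    return w_h * horror + w_c * comedy

def mlp_score(horror, comedy):
    """One hidden ReLU layer with hand-picked (not learned) weights."""
    h1 = relu(horror)
    h2 = relu(comedy)
    h3 = relu(horror + comedy - 1.0)   # fires only when both flags are on
    return h1 + h2 - 2.0 * h3          # the third unit cancels the blend
```

With genre flags of 0 or 1, mlp_score gives 1.0 for pure horror, 1.0 for pure comedy and 0.0 for the blend. No choice of w_h and w_c can do that in linear_score: two positive weights always sum to a positive score for the blend.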
The bigger structural change was forced by scale. A platform with millions of items cannot score every one of them for every request. So the modern pipeline splits in two. First retrieval: cut the catalog down to a few hundred plausible candidates, fast. Then ranking: score those candidates carefully. Google's 2016 paper on deep neural networks for YouTube recommendations set this two-stage pattern as the industry default.
Retrieval is where the two-tower model lives. One neural network, the user tower, turns the user and their context into an embedding. A separate network, the item tower, turns each item into an embedding in the same space. Because the towers are independent, every item embedding can be computed in advance and stored. At request time you embed only the user, then use approximate nearest neighbor search to grab the closest item vectors in milliseconds. It is matrix factorization's geometry, two vectors and a distance, rebuilt with deep networks and made to scale.
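The offline and online halves of that design can be sketched as follows. Everything here is an assumption made for illustration: the towers are fixed linear maps standing in for trained networks, the features are invented, and an exhaustive sort stands in for approximate nearest neighbor search.

```python
def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical "towers": fixed linear maps in place of trained deep networks.
ITEM_W = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2]]   # 3 item features -> 2-d embedding
USER_W = [[1.0, 0.0], [0.0, 1.0]]             # 2 user features -> 2-d embedding

# Hypothetical item features: [action, comedy, popularity].
ITEM_FEATURES = {
    "heist_movie": [1.0, 0.0, 0.5],
    "sitcom": [0.0, 1.0, 0.5],
    "action_comedy": [0.6, 0.6, 0.5],
}

# Offline: run the item tower once per item and store the result.
ITEM_INDEX = {name: matvec(ITEM_W, f) for name, f in ITEM_FEATURES.items()}

def retrieve(user_features, k=2):
    """Online: embed only the user, then take the k closest stored item vectors."""
    u = matvec(USER_W, user_features)
    ranked = sorted(ITEM_INDEX, key=lambda n: dot(u, ITEM_INDEX[n]), reverse=True)
    return ranked[:k]
```

For a user whose features lean heavily toward action, retrieve([0.9, 0.1]) returns the heist movie and then the action comedy. The item tower never runs at request time, which is the property that lets the real thing scale to millions of items.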
The other present-day shift treats your history as a sequence. Earlier models saw a user as a bag of past items with no order. But order carries meaning: someone who just bought a phone is in a different state than someone who bought one a year ago. Sequence models import the transformer from language modeling and read your interaction history the way a language model reads a sentence. SASRec, from Kang and McAuley in 2018, used attention to predict the next item from the items before it. BERT4Rec masked items in the middle of a history and trained the model to fill them in, reading context from both directions. The user stopped being a static profile and became a trajectory.
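What "reading a history like a sentence" means mechanically is attention with a causal mask. The sketch below is a stripped-down illustration, not SASRec itself: it uses a single head, skips the learned query, key and value projections, and takes invented two-dimensional item embeddings as input.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(seq):
    """Single-head attention where position t may only look at positions <= t.

    Queries, keys and values are the raw embeddings (no learned projections).
    """
    d = len(seq[0])
    out = []
    for t, query in enumerate(seq):
        scores = [sum(q * k for q, k in zip(query, seq[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * seq[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out

# Hypothetical history of three item embeddings, oldest first.
HISTORY = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

The last output vector summarizes the whole history as seen from the most recent item, and a next-item model would score candidates against it. Because of the mask, changing the final item leaves every earlier position's output untouched. A BERT4Rec-style model drops the mask so each position can see both directions.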
Future and impact: generative recommenders
Every system so far, even the transformer-based ones, ends the same way. It produces embeddings, then matches or scores them. The candidate has to exist as a vector before the model can rate it. The current research frontier removes that step.
The pivot is how an item is named. Traditionally an item ID is an arbitrary number, item 84412, meaningless, carrying no relationship to item 84413. The new approach gives items semantic IDs: a short sequence of discrete codes derived from the item's own content, so that similar items share leading codes, the way nearby files share a folder path. A 2023 Google paper, Recommender Systems with Generative Retrieval, did exactly this with a model it called TIGER, for Transformer Index for Generative Recommenders. It quantized each item's content embedding into a tuple of codewords, then trained a sequence-to-sequence transformer to read a user's history and generate the semantic ID of the next item, one code at a time. (For the underlying ideas, see our explainers on what an embedding is and semantic IDs.)
That is the real break. There is no candidate set, no nearest neighbor lookup, no separate scoring pass. The model generates the answer directly, the same autoregressive move a language model makes when it writes the next token. Retrieval and ranking, two stages the field treated as separate for almost a decade, collapse into one generative act. And because similar items share code prefixes, a brand new item with almost no interaction history can still be generated, which softens the cold-start problem that has dogged collaborative filtering since 1992.
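Both halves of that idea, naming items by quantized codes and then generating a name code by code, fit in a short sketch. It is an illustration under heavy assumptions, not the TIGER implementation: the codebooks are hand-set rather than learned, the embeddings are invented, and next_code_scores is a hypothetical stub standing in for a trained transformer.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def semantic_id(embedding, codebooks):
    """Residual quantization: pick a codeword, subtract it, repeat on the remainder."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

# Hypothetical hand-set codebooks: a coarse level, then a fine level.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.2, 0.0], [-0.2, 0.0], [0.0, 0.2], [0.0, -0.2]],
]
# Hypothetical content embeddings.
EMBEDDINGS = {"camera_a": [1.2, 0.0], "camera_b": [0.8, 0.05], "novel": [0.0, 1.1]}
CATALOG = {semantic_id(e, CODEBOOKS): name for name, e in EMBEDDINGS.items()}

def next_code_scores(history, prefix):
    """Stub for a trained model: one score per candidate code at this position."""
    last, pos = history[-1], len(prefix)
    n_codes = len(CODEBOOKS[pos])
    target = last[pos] if pos == 0 else (last[pos] + 1) % n_codes
    return [1.0 if code == target else 0.0 for code in range(n_codes)]

def generate(history):
    """Greedy autoregressive decoding: emit the best code, append it, repeat."""
    prefix = []
    for _ in CODEBOOKS:
        scores = next_code_scores(history, prefix)
        prefix.append(max(range(len(scores)), key=scores.__getitem__))
    return tuple(prefix)
```

The two cameras come out as (0, 0) and (0, 1), sharing their leading code, while the novel lands under a different one. Decoding from a history that ends in camera_a yields (0, 1), which CATALOG maps to camera_b. Decoding from a history that ends in the novel yields (1, 3), an ID that no item owns, so a lookup with CATALOG.get returns None. Nothing in generate checks an output against the catalog, which is why a generated ID can point at nothing.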
This is not academic only. Meta's 2024 paper Actions Speak Louder than Words reframed recommendation as a generative sequential task with an architecture it calls HSTU, for Hierarchical Sequential Transduction Units. Meta reported a 12.4 percent metric lift in online A/B tests and showed recommendation quality climbing as a power law of compute, the same scaling behavior that drives large language models. YouTube has been adapting Gemini into large recommender models that tokenize videos as semantic IDs. Pinterest deployed PinRec, a generative retrieval model, and reported sitewide gains. Netflix described a foundation model that ingests hundreds of billions of interactions as tokens and aims to replace a sprawl of specialized models with one. The framing is consistent across all of them: recommendation is converging on the techniques of language modeling.
The honest caveats matter. Generative serving is expensive, and YouTube's own account describes needing very large cost reductions to make it viable at production scale. Autoregressive decoding can drift toward popular items, and it can produce an ID that matches no real item, a failure that retrieval over a fixed index cannot produce. Latency budgets for a live feed are unforgiving. None of this is solved.
But the direction is set, and the strategic point is larger than any one model. For most of its history, recommendation was its own discipline with its own tricks. It is now becoming an application of the same sequence-modeling stack that powers chat assistants, sharing scaling laws, tokenization and architecture. For any team that builds personalization, that convergence is the planning assumption: the recommender of the next few years will look less like a scoring function and more like a model that generates. Whether to ride a foundation model rather than tune yet another bespoke ranker is now a live question, and the learning-to-rank methods that defined the ranking stage are themselves being absorbed into it.
The stakes are worth stating plainly. McKinsey's 2013 retail analysis put 35 percent of what people buy on Amazon down to recommendations, and Netflix's own engineers reported in 2012 that roughly 75 percent of viewing started with one. Those two numbers, still the most cited in the field, show how much rides on the systems now being rebuilt. The arc that started with a 1992 prototype reading Usenet ratings now runs through a transformer that writes an item ID. The job changed from predicting a rating to generating a recommendation, and the field is being rebuilt to match.
Council summary
This post argues that recommendation has stopped being a scoring problem and become a generation problem, and it earns that claim by walking the thirty-year arc one limitation at a time: content-based filtering trapped inside item features, collaborative filtering trapped by its compute cost, matrix factorization trapped by the straight line of a dot product, and every embedding-based system trapped by needing the candidate to exist before it can be ranked. Generative recommenders break the last of those traps by giving items semantic IDs and letting a transformer write the next item directly, which collapses retrieval and ranking into a single autoregressive step and softens cold start as a side effect. The reader's takeaway is concrete and current: the same sequence-modeling stack behind chat assistants, with its scaling laws and tokenization, is now the planning assumption for personalization, so the real decision is whether to ride a foundation model or keep tuning a bespoke ranker. The post is honest that generative serving is expensive, latency-bound, and prone to drift, none of it solved. Verdict: accurate, well sourced, and genuinely instructive.