In June 2017, eight researchers at Google published a paper with a title that was half a joke. They borrowed the cadence of a Beatles song and called it "Attention Is All You Need." Inside was an architecture they named the transformer, picked, by one author's account, simply because the word sounded good. Nine years later that paper has been cited more than 173,000 times, which puts it among the ten most-cited scientific papers of this century. Every model you have spoken to, ChatGPT, Claude, Gemini, Copilot, runs on the idea it proposed.
This is the story of that idea. Not the equations, the idea. What machines used to do with language, why it held them back, what the transformer changed, and where, in 2026, the architecture is starting to creak.
The thirty-year bottleneck
Before 2017, a machine that read a sentence read it the way you might read a fax coming off the roll: one word at a time, in order, with no way to see what was coming. The dominant tool was the recurrent neural network, the RNN, and later a refined version called the LSTM, short for long short-term memory.
The mechanics matter, so here is the plain version. An RNN keeps a single running summary, a kind of mental note, and updates that note as each new word arrives. Word one updates the note. Word two updates it again. Word twenty updates it once more. By the end of the sentence, everything the model knows is squeezed into that one note.
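The running note can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real RNN: the note is a single number rather than a vector, and the weights w_note and w_word are hypothetical values picked for illustration, where a trained network would learn them.

```python
import math

def read_in_order(word_signals, w_note=0.5, w_word=1.0):
    """Fold a sequence into one running note, one word at a time."""
    note = 0.0
    notes_so_far = []
    for signal in word_signals:
        # Each update needs the previous note, so the steps cannot be reordered.
        note = math.tanh(w_note * note + w_word * signal)
        notes_so_far.append(note)
    return note, notes_so_far

# Four made-up numbers standing in for four words.
final_note, notes_so_far = read_in_order([0.2, -0.4, 0.9, 0.1])
print(len(notes_so_far), round(final_note, 3))
```

Whatever the sentence length, the function hands back one number. That fixed-size result is the "one note" the paragraph above describes.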
Two problems came straight out of this design, and neither was small.
The first was memory. The whole sentence has to fit inside one fixed summary. Early words get overwritten by later ones, the way the start of a long voicemail fades from your head by the time it ends. Researchers measured it: an LSTM's useful grip on natural language ran to roughly 400 words before the thread frayed. Ask it to connect a pronoun to a name forty sentences back and the connection was simply gone.
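The fading can be made visible with a toy recurrence. The sketch below changes only the first word of a sequence and measures how far the final note moves. The weight 0.5 and the input values are assumptions chosen for illustration; a real LSTM adds gates specifically to slow this decay, so the numbers here show the tendency, not the measured behaviour of any trained model.

```python
import math

def final_note(word_signals, w_note=0.5):
    """Run a one-number recurrent note over the whole sequence."""
    note = 0.0
    for signal in word_signals:
        note = math.tanh(w_note * note + signal)
    return note

def first_word_influence(length):
    """How far the final note moves when only the first word changes."""
    rest = [0.1] * (length - 1)
    return abs(final_note([1.0] + rest) - final_note([0.0] + rest))

short_influence = first_word_influence(5)
long_influence = first_word_influence(50)
print(short_influence, long_influence)
```

In this toy, the first word still nudges the result after five words and leaves essentially no trace after fifty.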
The second problem was speed, and it turned out to be the one that mattered most. Because each word's update depends on the update before it, the work cannot be split up. Word twenty cannot be processed until word nineteen is done. A hundred-word sentence means a hundred steps in strict single file. This was a brutal fit for the hardware of the era. A GPU is a chip built to do thousands of calculations at the same instant, and a strictly sequential model leaves nearly all of that power idle. You could buy a bigger machine and it would barely help.
There was a partial fix already in circulation. In 2014, Dzmitry Bahdanau and his co-authors had introduced something they called an attention mechanism. Instead of forcing the decoder to lean on one final summary, their model let it glance back at every word of the input and decide which ones mattered for the word it was producing now. It worked well. But it was bolted onto a recurrent network, so the sequential bottleneck stayed. Attention was a passenger. The 2017 team asked a sharper question: what if you threw out the recurrent network entirely and kept only the attention?
Hence the title. Attention, the Google team claimed, was all you needed. Recurrence could go.
What attention actually does
Picture a sentence laid out flat, every word visible at once, no roll, no fax. Now take one word, say "it" in "the trophy did not fit in the suitcase because it was too small." For a human, "it" obviously points to the suitcase. For a machine, that is a real decision. Self-attention is how a transformer makes it.
Here is the mechanism without the linear algebra. Every word issues a small query, a question about what it is looking for. Every word also carries a key, a label advertising what it offers, and a value, the actual content it can pass along. The model compares one word's query against the key of every word in the sentence, its own included, and scores each match. Strong matches mean strong attention. The word then pulls in a blend of all the values, weighted by those scores, so the highest-scoring words contribute most. "It" sends out its query, "suitcase" answers most loudly, and "it" absorbs the meaning of "suitcase." Do this for every word against every other word, all at once, and each word's representation is now shaded by the precise context it sits in.
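For readers who do want to see it run, here is a minimal sketch of one word attending to a sentence. The two-number keys, values, and query are hand-picked, hypothetical stand-ins so that "it" matches "suitcase"; a real model learns these vectors and uses hundreds of dimensions. Dividing the scores by the square root of the vector size follows the 2017 paper's scaled dot-product attention.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score one query against every key, then blend the values."""
    size = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(size)
        for key in keys
    ]
    weights = softmax(scores)
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended

words = ["trophy", "suitcase", "it"]
keys = [[1.0, 0.0], [0.0, 2.0], [0.2, 0.2]]    # hypothetical labels
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical content
query_it = [0.0, 2.0]                          # what "it" is looking for

weights, blended = attend(query_it, keys, values)
strongest = words[weights.index(max(weights))]
print(strongest, [round(w, 2) for w in weights])
```

With these made-up vectors the largest weight lands on "suitcase", and the blended result leans toward that word's value. A full layer runs the same calculation for every word's query in one batch.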
That phrase, all at once, is the whole revolution. There is no running summary to corrupt and no fixed order to obey. The link between "it" and "suitcase" is one direct comparison, and it costs exactly the same whether the two words are three apart or three hundred apart. The thirty-year memory bottleneck did not get improved. It got deleted.
The 2017 paper added two refinements worth naming. The first is multi-head attention: run several of these comparison passes side by side, each free to track a different kind of relationship, one watching grammar, another watching subject and object, another watching tone. The second is positional encoding. Because the model now sees every word simultaneously, it has lost the natural sense of order, so the transformer stamps each word with a marker of where it sits. Order becomes data the model reads, not a constraint it obeys.
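One way to build that position marker is the sinusoidal scheme the 2017 paper used: each position gets a pattern of sine and cosine values at different wavelengths. The sketch below is illustrative, with a tiny vector size chosen for readability, and many later models use learned or rotary position schemes instead.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position stamp: sine on even slots, cosine on odd slots."""
    encoding = []
    for i in range(0, d_model, 2):
        # Wavelengths grow geometrically across the vector.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

stamp_first = positional_encoding(0, 4)
stamp_second = positional_encoding(1, 4)
print(stamp_first)
print([round(x, 3) for x in stamp_second])
```

The stamp is added to each word's vector before attention runs, so two copies of the same word at different positions no longer look identical to the model.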
Why parallel training changed the industry
The cleverness of attention is the headline. The reason it took over the industry is quieter, and it is about money.
Because every word is processed at the same time, transformer training maps almost perfectly onto a GPU. The core operation is one enormous batch of multiplications, which is the exact thing the chip exists to do at speed. Training that crawled on an RNN now ran at full tilt. The numbers in the paper were stark: the transformer's base model trained on eight GPUs in about twelve hours and beat the previous best score on English-to-German translation, while the larger variant set a new record on English-to-French in three and a half days, a fraction of the compute the older models had needed.
This handed the field a recipe with an unusual property: it kept paying off when you simply made it bigger. Add more data, add more parameters, add more compute, and the model got better in a way that was smooth and, to a useful degree, predictable. Researchers later wrote this regularity down and called it a scaling law. With a recurrent network there had been little point spending ten times the money, because the architecture itself capped the return. With a transformer the return kept coming. That single fact set off the spending race that defines AI economics to this day.
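The shape of that regularity is a power law, and a few lines show why it made spending predictable. The constants below (scale and exponent) are hypothetical, chosen only to illustrate the curve; they are not fitted values from any published scaling study.

```python
def predicted_loss(parameters, scale=10.0, exponent=0.1):
    """Toy power law: loss = scale * parameters ** (-exponent)."""
    return scale * parameters ** (-exponent)

sizes = [1e8, 1e9, 1e10, 1e11]
losses = [predicted_loss(n) for n in sizes]

# Loss of each model divided by the loss of the model ten times smaller.
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]
print([round(r, 3) for r in ratios])
```

Under a power law, every tenfold increase in size multiplies the loss by the same factor. The improvement per step shrinks in absolute terms, but it does not stop, which is what made the next, larger training run a calculable bet.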
From one paper to the whole field
The recipe was public, and it did not stay quiet for long.
In 2018, OpenAI took the decoder half of the transformer, trained it to do one plain thing, predict the next word, across a large pile of text, and called it GPT, for generative pre-trained transformer. The first version held 117 million parameters. Months later Google took the encoder half, trained it to fill in blanked-out words by reading both directions at once, and called it BERT. For years afterward, BERT quietly powered a large share of Google Search.
Then came the part that surprised even the people building it. OpenAI kept the architecture almost untouched and kept enlarging it. GPT-2 in 2019 reached 1.5 billion parameters and wrote fluent paragraphs. GPT-3 in 2020 reached 175 billion and could do tasks nobody had trained it for, translation, summary, simple code, from a few examples in the prompt. The design barely changed between them. Scale did the work.
ChatGPT arrived in November 2022 and reached a million users in five days. Everything since, GPT-5, Claude, Gemini, Llama, Grok, longer context, images and audio, the reasoning models that pause to think, is a variation on the 2017 template, not a replacement for it. The authors themselves scattered. All eight have left Google, and six started their own AI companies, among them Cohere, Character.AI, Sakana AI, and Essential AI, ventures together valued in the billions. The eight-page paper turned into an industry.
Where the architecture strains
A foundational explainer that ended on triumph would be lying by omission. In 2026 the transformer is under visible pressure on three fronts, and the people who built it say so out loud.
The first is cost, and it is wired into the design. Self-attention compares every word with every other word, so doubling the input does not double the work; it roughly quadruples it. Computer scientists call this quadratic scaling. It is the reason a very long prompt gets expensive fast, and the reason context windows have stopped racing upward. The thing that made the transformer powerful, every token touching every other token, is the same thing that makes it costly at length.
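The arithmetic is short enough to check by hand. This sketch counts query-key comparisons for a single attention pass, assuming each token is scored against every token including itself; it ignores the number of heads and layers, which multiply the total but do not change the growth rate.

```python
def attention_comparisons(num_tokens):
    """Every token is scored against every token, itself included."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_comparisons(n))

# Doubling the input length multiplies the comparison count by four.
growth = attention_comparisons(2_000) / attention_comparisons(1_000)
print(growth)
```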
The second is what practitioners now call context rot. The marketing promised million-token context windows, and the windows are real. The recall is not. Independent testing across frontier models has found that accuracy falls as the input grows, and that facts buried in the middle of a long document are the ones most often missed, a pattern researchers named "lost in the middle." In one set of retrieval tests, strong models dropped from near-perfect accuracy on a short input to well below half on a long one. A bigger window is not the same as a better memory.
The third is the data wall. The scaling recipe is hungry for human-written text, and the supply is finite. Researchers at Epoch AI estimate the usable stock of public text could be largely consumed somewhere between 2026 and 2032, and some forecasts land at the early end of that range. Ilya Sutskever, an OpenAI co-founder, has put it bluntly: the field has reached peak data. You cannot scale on a resource you have already spent.
None of this has dethroned the transformer, but it has reopened a question that looked closed. Serious alternatives now exist. State-space models, of which Mamba is the best known, drop the every-word-to-every-word comparison and process a sequence in a way that scales linearly rather than quadratically, which makes very long inputs far cheaper. The competitive answer for now is hybrid: architectures such as Jamba interleave a few transformer attention layers with cheaper state-space layers, keeping attention's quality where it counts and shedding its cost where it does not. At the 2024 Nvidia GTC conference, several of the original paper's authors said plainly that the world needs something better than the transformer. The architecture that beat recurrence may, in time, be beaten the same way.
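To see where the linear cost comes from, consider the simplest possible state-space recurrence. This is a toy with one number of state and fixed, hypothetical coefficients a, b, and c; Mamba's actual layers are considerably more elaborate, so treat this as a sketch of the scaling behaviour only.

```python
def linear_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Toy scalar state-space recurrence: h = a*h + b*x, y = c*h."""
    state = 0.0
    outputs = []
    steps = 0
    for x in inputs:
        state = a * state + b * x   # one fixed-cost update per token
        outputs.append(c * state)
        steps += 1
    return outputs, steps

outputs, steps = linear_scan([1.0, 0.0, 0.0])
print(outputs, steps)
```

The work is one update per token, so doubling the input doubles the cost. The price is that context is carried in a compact state again, which is part of why hybrids keep some attention layers.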
For now it stands. Almost every model in production in 2026 is a transformer or a transformer hybrid. One paper, half-named after a Beatles song, written to make machine translation a little faster, threw out the way machines had read language for thirty years and quietly became the foundation the entire field is built on. Knowing how it works is no longer specialist knowledge. It is the floor.
Council summary
This post argues that one 2017 idea, attention, explains nearly everything about modern AI: it deleted the sequential bottleneck that had capped language models for three decades, and in doing so it turned model quality into something you could buy with compute. The explanation of self-attention through queries, keys, and values, and the "all at once" insight, does the hard teaching job without a single equation, which is the piece most explainers get wrong. Its real value is the honest third act: rather than ending on triumph, it names the three cracks now showing, quadratic cost, context rot, and the data wall, and treats hybrids like Jamba and state-space models like Mamba as live competition rather than hype. The reader should leave with a working mental model of why transformers won, why scaling worked, and why the same logic that beat recurrence could one day unseat the transformer itself. Knowing this is, as the post puts it, no longer specialist knowledge but the floor.