Teaching Machines Language: NLP Before the Transformer

For sixty years, every attempt to make a machine read hit the same wall. The story of those failures is the only way to see what the transformer actually fixed.

In June 2017 a paper called "Attention Is All You Need" reset the field of language AI. That story gets told a lot. The story underneath it gets told less: the transformer replaced something, and that something was sixty years of work. Four distinct eras of it, each with a clever idea, each running into a wall. The walls are the interesting part, because they line up. Every method here failed in a way that pointed at the same flaw, and the transformer was the first design to face it head on. Our companion piece, "The Transformer, Explained Without the Math," covers what the 2017 paper proposed. This one covers what it was answering.

The era of rules: machines that were told everything

The first attempt to make a computer read was also the first to overpromise. On 7 January 1954, IBM and Georgetown University ran a demonstration in IBM's New York head office: an IBM 701 mainframe translated more than sixty Russian sentences into English. The researchers predicted machine translation would be solved within three to five years. The reality was thinner. The system ran on six grammar rules and a 250-word vocabulary, and the sentences had been hand-picked to fit them. It was a dictionary lookup with a little reordering, not a reading machine. (Georgetown-IBM experiment)

That set the template. A human linguist wrote down rules, the computer followed them, and if the rules covered the sentence you got an answer. If they did not, you got nonsense.

Two famous programs show the charm and the ceiling. Joseph Weizenbaum built ELIZA at MIT between 1964 and 1966. Its DOCTOR script imitated a Rogerian psychotherapist, the kind who reflects your statements back at you. Tell ELIZA "you are very helpful" and it flipped the pronouns into "what makes you think I am very helpful?" The persona needed no knowledge of the world, only the ability to bounce sentences back. It worked well enough that Weizenbaum's own secretary asked him to leave the room so she could talk to it in private. ELIZA understood nothing: it kept almost no memory of what had been said and had no model of what any word meant. (ELIZA)
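The pronoun flip is easy to sketch. What follows is a toy illustration, not Weizenbaum's DOCTOR script: the single "you are ..." pattern, the REFLECTIONS table, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical pronoun swaps; the real DOCTOR script was far larger.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my", "are": "am"}

def reflect(fragment: str) -> str:
    """Swap pronouns word by word, leaving unknown words untouched."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(w, w) for w in words)

def respond(sentence: str) -> str:
    """Apply one hand-written rule; fall back to a stock phrase otherwise."""
    cleaned = sentence.strip().rstrip(".!").lower()
    match = re.match(r"you are (.*)", cleaned)
    if match:
        return f"What makes you think I am {reflect(match.group(1))}?"
    return "Please go on."

reply = respond("You are very helpful")
fallback = respond("The weather is nice.")
```

Any sentence the one rule does not cover drops straight to the stock phrase, which is the ceiling of the rule-based approach in miniature.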

Terry Winograd's SHRDLU, written at MIT between 1968 and 1970, went the other way. It lived inside a simulated tabletop of colored blocks and could genuinely follow instructions: pick up the red pyramid, put it on the green box, answer questions about what it had just done. Inside that tiny world it looked like real comprehension. The catch was the world. SHRDLU knew blocks and nothing else, and nobody could grow the blocks world into the real one, because the real world has no finite list of objects.

The reckoning came in 1966. A US government committee called ALPAC reviewed a decade of machine translation funding and concluded the field had not delivered. Government money for machine translation in the United States dried up for roughly twenty years, a freeze often counted as the first AI winter. (ALPAC) The reason rules failed is simple: language has effectively infinite variety, and you cannot write a rule for every sentence a person might produce. The field needed a method that learned the rules itself.

The statistical turn: counting instead of decreeing

Through the late 1980s the approach flipped. Instead of a linguist dictating rules, the computer would read a large pile of text and work out the patterns from raw frequency. The workhorse was the n-gram language model.

The idea is plain. To guess the next word, look at the few words before it and ask what usually came next. A bigram model looks at one previous word, a trigram at two. The probability is a division: to score the chance "saw" follows "I", count how often "I saw" appeared and divide by how often "I" appeared. (n-gram language model) This rests on a deliberate simplification, the Markov assumption: pretend a word depends only on the last few words. It is false about real language, and still good enough to power the speech recognition and machine translation systems of the 1990s and 2000s. IBM's statistical translation work, learning word alignments straight from bilingual text, ran on this counting logic and pushed translation away from hand-written rules for good.
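The division described above fits in a few lines. This is a minimal sketch of a maximum-likelihood bigram estimate; the tiny corpus and the function names are made up for illustration.

```python
from collections import Counter

def train_bigram(tokens):
    """Count single words and adjacent word pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev word) / count(prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus (assumption): three short sentences, already tokenized.
corpus = "i saw the cat . i saw the dog . i like the cat .".split()
unigrams, bigrams = train_bigram(corpus)

p_saw = bigram_prob("i", "saw", unigrams, bigrams)      # "i saw" twice, "i" three times
p_unseen = bigram_prob("cat", "saw", unigrams, bigrams)  # pair never observed
```

Here "saw" follows "i" in two of its three occurrences, so the estimate is 2/3, while a pair that never occurred gets exactly zero.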

Two flaws never went away. The first is data sparsity. Most word sequences are rare, and a model that has never seen a particular trigram assigns it a probability of zero, which poisons the calculation. Researchers built a toolkit of smoothing techniques to patch the holes, from add-one smoothing to Kneser-Ney. The patches helped but cured nothing, because the problem grows exponentially with context: a vocabulary of 100,000 words has 100,000 to the power of three possible trigrams, and almost none will appear in any training set. (Bengio et al., 2003) The second flaw is short memory. A trigram sees two words back, full stop. In "the keys to the old wooden cabinet in the hallway were missing," the verb "were" must agree with "keys," nine words earlier, and no trigram can reach that far. Nor can an n-gram model tell that "car" and "automobile" mean nearly the same thing, because it treats every word as a distinct symbol with a count. The statistical era traded brittleness for blurriness.
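Add-one smoothing, the simplest of those patches, can be sketched as follows. The example uses bigrams rather than trigrams to stay short, and the toy sentence and names are assumptions; it also computes the size of the trigram space for the 100,000-word vocabulary mentioned above.

```python
from collections import Counter

def add_one_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) estimate: (count(prev word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy training text (assumption), already tokenized.
tokens = "the keys were missing . the cabinet was old .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)  # 8 distinct tokens

seen = add_one_prob("the", "keys", unigrams, bigrams, vocab_size)   # pair observed once
unseen = add_one_prob("the", "old", unigrams, bigrams, vocab_size)  # pair never observed

# Number of distinct trigrams a 100,000-word vocabulary allows.
possible_trigrams = 100_000 ** 3
```

The unseen pair no longer scores zero, but it gets its probability by taking mass away from pairs that were actually observed. With 10^15 possible trigrams, nearly all of the mass ends up spread over sequences no corpus will ever contain, which is why smoothing eased the symptom without removing the cause.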

Words as vectors: the embedding breakthrough

The fix to that blurriness was an old idea waiting for computing power. The linguist J.R. Firth put it in 1957: you shall know a word by the company it keeps. "Coffee" and "tea" both turn up near "drink," "hot," and "cup." Track the company each word keeps, and similarity falls out of the statistics.
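Firth's principle can be tried directly by counting neighbors. The sketch below is a hedged illustration with a made-up five-sentence corpus and a window of two words; it is plain co-occurrence counting, not any published embedding method.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (assumption).
sentences = [
    "drink hot coffee",
    "drink hot tea",
    "a cup of coffee",
    "a cup of tea",
    "park the car outside",
]
vectors = cooccurrence_vectors(sentences)
coffee_tea = cosine(vectors["coffee"], vectors["tea"])
coffee_car = cosine(vectors["coffee"], vectors["car"])
```

In this corpus "coffee" and "tea" keep identical company, so their similarity is at the maximum, while "coffee" and "car" share no neighbors at all.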

Yoshua Bengio and colleagues built the first strong neural version in 2003. Their neural probabilistic language model represented each word as a distributed vector, a list of numbers, learned alongside the prediction task. The argument was precise: if "cat" and "dog" land near each other in vector space, a sentence seen with "cat" quietly teaches the model about "dog" too. One training sentence informs an exponential number of similar sentences, which is how you fight sparsity. (Bengio et al., 2003)

The idea reached everyone in 2013, when a team led by Tomas Mikolov at Google released word2vec. It could train high-quality word vectors on billions of words quickly, using one of two setups: CBOW predicts a word from its surrounding words, skip-gram does the reverse. (word2vec) What caught the field's attention was an effect nobody designed for: the vectors did arithmetic. Take "king," subtract "man," add "woman," and the nearest vector is "queen." The offset that separated man from woman also separated king from queen, and it kept working, with Paris minus France plus Italy landing near Rome. Mikolov, Wen-tau Yih, and Geoffrey Zweig documented this in a 2013 paper, "Linguistic Regularities in Continuous Space Word Representations." Relationships had become directions in space. (Mikolov, Yih and Zweig, 2013)
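The arithmetic itself is simple once the vectors exist. The sketch below uses hand-made two-dimensional vectors, which are an assumption for illustration only: real word2vec vectors are learned from text and have hundreds of dimensions. As in standard analogy evaluation, the three input words are excluded from the candidates.

```python
import math

# Hand-made 2-D vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king":  (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man":   (0.1, 0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.8, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

result = analogy("king", "man", "woman")
```

Subtracting "man" and adding "woman" moves the point along the gender axis while leaving the royalty axis alone, which is what "relationships as directions" means in practice.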

A year later, Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford released GloVe, short for Global Vectors. Where word2vec learned from local windows of text, GloVe built one big table of how often every word appears near every other across the whole corpus, then factored it into vectors. Both reached a similar destination by different roads. (GloVe) Embeddings are still everywhere today, in search, recommendation, and the retrieval step of modern AI systems. But they left their own flaw: every word got exactly one vector. "Bank" had a single representation whether the sentence was about a river or a mortgage. Solving that needed a model that read sequences, not just words.

The present before the transformer: RNN, LSTM, and seq2seq

By the mid-2010s, embeddings had become the input layer to a new generation of models. The state of the art for any task involving a sequence of words was the recurrent neural network.

An RNN reads a sentence one word at a time and carries a hidden state, a running summary, from each step to the next. Word one updates it, word two updates it again, and the summary holds everything read so far. In principle it can carry information across any distance, exactly what n-grams could not. (recurrent neural network) In practice plain RNNs choked on long sentences, undone by the vanishing gradient: the learning signal shrank toward zero as it traveled backward through many steps, so the network never learned to connect distant words. Sepp Hochreiter and Jürgen Schmidhuber had designed the fix back in 1997, the Long Short-Term Memory network. An LSTM adds a cell state and a set of gates, small learned controls that decide what to keep, overwrite, and read out. The forget gate, added by Felix Gers and colleagues around 2000, let the network clear its own memory. (Long short-term memory) LSTMs were not merely academic: Google moved speech recognition onto them around 2015 and cut transcription errors sharply.
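Both the running summary and the vanishing gradient show up even in a one-number RNN. This is a minimal sketch under stated assumptions: a scalar hidden state, a tanh update, and fixed weights w and u chosen for illustration rather than learned.

```python
import math

def run_rnn(inputs, w=0.5, u=1.0):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t). Returns every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * h + u * x)  # each step depends on the previous one
        states.append(h)
    return states

def gradient_to_first_state(states, w=0.5):
    """d h_T / d h_1 by the chain rule: the product of w * (1 - h_t^2) for t = 2..T."""
    grad = 1.0
    for h in states[1:]:
        grad *= w * (1.0 - h * h)
    return grad

short_grad = gradient_to_first_state(run_rnn([1.0] * 3))
long_grad = gradient_to_first_state(run_rnn([1.0] * 30))
```

Every factor in the product is at most 0.5 here, so after thirty steps the signal reaching the first word is smaller than 0.5 to the power of 29. The LSTM's cell state and gates were designed to give that signal a path that does not shrink at every step.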

The biggest application was translation, through a design called sequence-to-sequence. Two 2014 papers set it up. Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google used a multilayer LSTM as an encoder to read the whole source sentence, then a second LSTM as a decoder to write the translation word by word. On the standard English-to-French benchmark their system scored 34.8 BLEU, ahead of the 33.3 scored by a phrase-based statistical baseline. (Sutskever, Vinyals and Le, 2014) A second 2014 paper, from Kyunghyun Cho's group, built the same encoder-decoder shape with a related unit called the GRU. (seq2seq)

The shape had a flaw you can see by looking at it. The encoder had to compress the entire source sentence, every word and clause, into one fixed-length vector before the decoder saw a single word. For a long sentence that vector became a funnel, and translation quality fell as sentences grew. The repair came the same year. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio published "Neural Machine Translation by Jointly Learning to Align and Translate" in September 2014, and named the funnel directly: the fixed-length vector is a bottleneck. Their fix was an attention mechanism. Instead of relying on one summary vector, the decoder could look back at every encoder state and, for each word it produced, decide which source words mattered most right now. Plotted out, the attention weights lined up with how a human translator aligns two sentences, and quality on long sentences recovered. (Bahdanau, Cho and Bengio, 2014) In 2016 Google replaced the engine behind Google Translate with a deep LSTM seq2seq system, eight encoder layers and eight decoder layers with attention, cutting translation errors by roughly 60 percent against the old production system. (Google Research blog, 2016)
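The "look back and weigh" step can be sketched in a few lines. One loud assumption: Bahdanau and colleagues scored each source position with a small learned feed-forward network, whereas this sketch substitutes a plain dot product to stay short. The encoder states and decoder state are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score every encoder state against the decoder state, then average by weight."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Made-up encoder states for a three-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([4.0, 0.0], encoder_states)
```

The context vector is rebuilt for every output word, so no single fixed-length summary has to carry the whole sentence. The weights are what the alignment plots in the paper visualize.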

The shared bottleneck, and the door it left open

Step back and the four eras fail in a connected way. Rule systems were brittle, because a human had to anticipate every sentence. N-grams learned from data but saw only a few words back and treated every word as an unrelated symbol. Embeddings gave words real meaning as vectors yet pinned each word to one fixed vector regardless of context. Recurrent networks and LSTMs finally read whole sequences in context, and attention let them reach across long ones.

But the recurrent design carried a cost no gate could remove. An RNN reads strictly in order: word twenty cannot be processed until word nineteen is finished, because each step depends on the hidden state before it. That sequential chain was a poor match for the GPU, a chip built to run thousands of calculations at the same instant. A strictly sequential model leaves most of that capacity idle, so training a large recurrent model was slow and a bigger machine barely helped.

The late pre-transformer state of the art held two things at once. It had attention, which had quietly proven the most useful idea in the stack, the part that let the model weigh every source word afresh for each word it produced. And it had recurrence, the sequential reading order, which was now the thing holding everything back. The 2014 attention mechanism was a passenger, bolted onto a recurrent network that still set the pace. The obvious question, once someone framed it plainly: what if you throw out the recurrent network and keep only the attention? That is what "Attention Is All You Need" answered in 2017, and it is the subject of "The Transformer, Explained Without the Math."

One thing carried across the divide. Word embeddings, the idea that meaning can live as a vector, were not replaced by the transformer. They were absorbed: every modern language model still turns words into vectors at its input. The embedding era did not end. It became the foundation everything else was built on.

Council summary

The post argues that the transformer was not a bolt from the blue but the resolution of a sixty-year search. It walks four eras, rule systems, n-gram statistics, word embeddings, and recurrent networks, and shows each one solving a real problem while leaving a sharper one behind: brittleness, then short memory, then context-blind word vectors, then a sequential reading order that no amount of attention could speed up. The reader's takeaway is that the field's failures were cumulative rather than random, and that "Attention Is All You Need" won by keeping the one component that kept proving its worth and discarding the recurrence that throttled it. The piece also leaves the reader with the one idea that survived every transition intact: meaning stored as a vector, still the input layer of every model in use today. Read alongside the companion transformer explainer, it gives a non-specialist the full before-and-after of modern language AI.
