For most of the last decade, building a better AI model was almost embarrassingly simple. Take a bigger transformer, feed it more text, run it on more chips, and capability would climb in a way that was smooth and, to a useful degree, predictable. Researchers wrote that regularity down and called it a scaling law. It held for GPT-2, then GPT-3, then GPT-4, and it turned model quality into something a company could buy with a budget.
The recipe has three ingredients: compute, parameters, and data. Two you can keep buying, by building more data centres and designing larger models. The third you cannot manufacture out of money, because it was written by people, and people have written a finite amount. That is the data wall, and in 2026 it is the constraint frontier labs talk about most. This is the story of why a finite resource ended up at the centre of the most expensive race in technology, and why the field's main answer, having models write their own training data, is both genuinely promising and quietly dangerous.
How data became the bottleneck
The data wall follows directly from the scaling laws, so it helps to be precise about what those laws say. The cleanest version, from a 2022 DeepMind paper on a model called Chinchilla, found that for a given compute budget, models had been built too large and trained on too little text. The compute-optimal balance was roughly twenty tokens of training data per model parameter. The lesson the industry took was blunt: to keep scaling, you need proportionally more data, not just more compute.
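To make the twenty-to-one rule concrete, here is a rough back-of-the-envelope sketch. It leans on an assumption this article does not state: the common rule of thumb that training compute is about 6 FLOPs per parameter per token. The function name is illustrative. With tokens fixed at 20 per parameter, compute is roughly 120 times the square of the parameter count, so the parameter count is the square root of compute divided by 120.

```python
import math

TOKENS_PER_PARAM = 20      # rough Chinchilla ratio quoted above
FLOPS_PER_PARAM_TOKEN = 6  # assumption: compute ~ 6 * params * tokens (rule of thumb)

def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget at 20 tokens per parameter."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

for budget in (1e22, 1e24, 1e26):
    n, d = compute_optimal(budget)
    print(f"{budget:.0e} FLOPs -> {n:.2e} params, {d:.2e} tokens")
```

Under these assumptions, a hundredfold increase in compute multiplies both the parameter count and the token count by ten. Every step up in compute brings a matching step up in data demand.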
That made text a strategic input. Epoch AI, a research group that tracks the inputs to AI, found the text fed into frontier models has grown about 2.5 times a year. Meta's Llama 3 was trained on up to 15 trillion tokens, a corpus a human reading without sleep would take tens of thousands of years to finish. An exponential does not care how large the starting pile is. It eats it. The transformer explainer covers the 2017 architecture that made the recipe possible. The short version: it works, it keeps working when you make it bigger, and making it bigger means it needs to be fed.
What "peak data" means
The phrase that crystallised the problem came from one of the recipe's inventors. In December 2024, at the NeurIPS conference, Ilya Sutskever, a co-founder and former chief scientist of OpenAI, said plainly that pre-training as we know it will end. Compute keeps growing, through better chips and bigger clusters. Data does not, because we have but one internet. He called data the fossil fuel of AI: it took a long time to form, there is a fixed amount, and you can burn through it.
"Peak data" is the moment the supply of usable human-written text stops being the thing you can lean on. It does not mean the internet is empty. It means the high-quality text, the kind that actually makes a model better, has largely been collected and used.
Epoch AI put numbers on it. Their estimate of the effective stock of human-generated public text, adjusted for quality and reasonable reuse, is around 300 trillion tokens, with wide uncertainty. Their projection: frontier labs will have fully used that stock somewhere between 2026 and 2032, median near 2028. The exact year depends on overtraining, the practice of feeding a model far more data than the compute-optimal balance suggests because it makes the finished model cheaper to run. The more the industry overtrains, the sooner the supply runs dry. Epoch's Tamay Besiroglu compared it to a literal gold rush that depletes a finite resource.
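A toy calculation shows why the window is so short. This is not Epoch's methodology, only a sketch under stated assumptions: the largest training corpus starts at the 15 trillion tokens cited for Llama 3, keeps growing 2.5 times a year, and runs into a fixed stock of 300 trillion tokens. The function name is illustrative.

```python
import math

def years_until_exhausted(stock_tokens: float, current_tokens: float,
                          growth_per_year: float) -> float:
    """Years until a geometrically growing corpus equals the total stock."""
    if current_tokens >= stock_tokens:
        return 0.0
    return math.log(stock_tokens / current_tokens) / math.log(growth_per_year)

years = years_until_exhausted(300e12, 15e12, 2.5)
print(f"{years:.1f} years")  # about 3.3 under these assumptions
```

Doubling the assumed stock to 600 trillion tokens adds only about three quarters of a year, since ln 2 / ln 2.5 is roughly 0.76. At this growth rate, the size of the pile matters much less than the rate at which it is consumed.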
Why you cannot just scrape more
The obvious objection: the internet is enormous and grows every day, so surely there is always more to scrape. There is more text. There is not more of the useful kind, and three problems block simply scraping harder.
The first is quality. Epoch's 300 trillion token figure is already the generous number, adjusted upward to count text reused several times. The raw web is mostly low value for training: spam, boilerplate, machine-translated pages, near-duplicates, search engine filler. The high-signal sources, edited books, reference works, code repositories, the better end of forums and journalism, are a small slice, and that slice has been heavily mined.
The second is legal. The era of scraping anything without consequence is closing. In September 2025, Anthropic agreed to pay 1.5 billion dollars to settle a lawsuit from authors whose books had been used in training, the largest publicly reported copyright recovery on record. The New York Times case against OpenAI and Microsoft is still live. Publishers, social platforms, and forums have signed paid licensing deals or locked their archives behind anti-scraping defences. Text that used to be free to take is now licensed or fenced off.
The third is contamination, and it is the strangest. Since ChatGPT, a fast-growing share of new web text is itself written by AI. One 2025 analysis by Ahrefs of nearly a million pages published that April found that around 74 percent contained detectable AI-generated or AI-assisted text, and a separate study by the SEO firm Graphite found newly published AI articles had briefly drawn level with human-written ones. The open web is no longer a clean record of what humans wrote. It is increasingly a record of what models wrote. Scraping it harder just collects more model output, which, as we will see, is exactly the thing to be careful about.
The field's answer, and the AlphaGo precedent
If you cannot find enough human text, make the text. Synthetic data, training data generated by AI models rather than collected from people, has gone from a niche trick to a central pre-training research agenda.
A precedent makes this less mad than it sounds. In 2017, DeepMind built AlphaGo Zero. The original AlphaGo had learned Go partly from a database of human expert matches. AlphaGo Zero used no human game data at all. It played millions of games against itself, generating its own training data, and surpassed every previous version. In that closed, checkable world, self-generated data was not a poor substitute for human data. It was better, because it held more useful examples and was not capped by human skill.
The catch is "closed and checkable." Go has fixed rules and a clear win condition, so a generated game can be verified. Language has neither. That is why synthetic data for language models works unevenly: strong where output can be checked, weak where it cannot.
In practice, synthetic data already does real work across the training pipeline. Distillation, where a large capable teacher model generates examples to train a smaller student, is now standard, and it works because the student copies a verified-good source rather than inventing from nothing. Microsoft's Phi family of small models was built largely on textbook-style synthetic text, and its authors argued the approach helps most where raw web text gives diminishing returns. NVIDIA's Nemotron pipeline rewrites trillions of tokens of messy web text into cleaner synthetic versions. Reasoning models add another angle: they generate step-by-step solutions, keep only the chains that reach a correct, checkable answer, and train on the filtered set, an approach the 2022 Self-Taught Reasoner work formalised. It is the same logic behind how reasoning models think: when a model can verify its own output, that output becomes a training signal.
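The generate-then-filter loop can be sketched in a few lines. This is a toy illustration, not the Self-Taught Reasoner implementation. `noisy_solver` is a hypothetical stand-in for a model that sometimes gets arithmetic wrong, and the verifier is an exact check that only works because addition is checkable.

```python
import random

def noisy_solver(a: int, b: int, rng: random.Random) -> tuple[str, int]:
    """Hypothetical stand-in for a model: wrong about 30% of the time."""
    answer = a + b
    if rng.random() < 0.3:
        answer += rng.choice([-2, -1, 1, 2])  # inject an error
    return f"{a} plus {b} equals {answer}", answer

def build_verified_set(problems, samples_per_problem: int, seed: int = 0):
    """Sample several attempts per problem and keep only those that check out."""
    rng = random.Random(seed)
    kept = []
    for a, b in problems:
        for _ in range(samples_per_problem):
            reasoning, answer = noisy_solver(a, b, rng)
            if answer == a + b:  # the verifier: an exact, cheap check
                kept.append({"a": a, "b": b,
                             "reasoning": reasoning, "answer": answer})
    return kept

problems = [(2, 3), (10, 7), (41, 1)]
dataset = build_verified_set(problems, samples_per_problem=5)
print(len(dataset), "verified examples kept out of", len(problems) * 5)
```

The filter is only as good as the check. Swap the addition problems for open-ended prose and there is no line of code that can play the role of `answer == a + b`, which is the uneven coverage described above.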
Model collapse: the real risk
Synthetic data has a failure mode with a name, and it is the reason "just generate more" is not a free lunch.
In 2024, researchers from Oxford, Cambridge, and other institutions published a paper in Nature titled, bluntly, AI models collapse when trained on recursively generated data. Model collapse is what happens when a model is trained on the output of a previous model, whose output trained the one before, and so on. Errors compound across generations. It runs in two stages. Early collapse: the model loses the tails of the distribution, the rare and unusual cases, while average performance still looks fine. Late collapse: the model converges toward bland, repetitive output and loses most of its variety. Co-author Nicolas Papernot gave the analogy: training on AI-generated data is like photocopying a photocopy. Each copy loses a little detail. Do it enough times and the page is grey mush.
This is a genuine risk, made worse by web contamination, since a model scraping the open web now ingests other models' output whether the lab intends it or not. But the finding is often quoted as a doomsday verdict and it is not one. The Nature experiment used pure recursion: each generation trained only on the previous model's output, human data thrown away. That is not how labs work. A 2024 follow-up, Is Model Collapse Inevitable?, showed the distinction that matters is replace versus accumulate. Replace human data with synthetic data each generation and collapse happens. Let synthetic data accumulate alongside a stable base of real human data and collapse is avoided, across model sizes and data types. Filtered, verified, human-anchored synthetic data is a tool. Indiscriminate recursive training is the poison. They are not the same activity.
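The replace-versus-accumulate distinction can be seen in a deliberately tiny simulation. This is an illustrative sketch, not the setup of either paper. The "model" is just a Gaussian fitted to its training data, and the sample size, generation count, and function names are assumptions chosen to make the effect visible.

```python
import math
import random

def fit(data):
    """'Train' the toy model: estimate mean and standard deviation."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)

def run(mode: str, generations: int = 300, n: int = 10, seed: int = 0) -> float:
    """Return the fitted standard deviation after repeated self-training."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the "human" data
    for _ in range(generations):
        mu, sigma = fit(data)
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        if mode == "replace":
            data = synthetic         # human data thrown away each generation
        else:
            data = data + synthetic  # synthetic piles up beside the original
    return fit(data)[1]

print("replace:   ", run("replace"))
print("accumulate:", run("accumulate"))
```

In replace mode the fitted spread shrinks toward zero, the toy version of losing the tails and then the variety. In accumulate mode the original samples stay in the pool and the spread stays in the neighbourhood of where it started.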
The other escape route: stop scaling pre-training
There is a second response to the data wall, and it is arguably the bigger story. If feeding the pre-trained model is hitting a limit, move the gains somewhere else.
This is already happening. The industry has shifted weight from pre-training toward two other stages. Post-training, including reinforcement learning, improves a pre-trained model with a comparatively small amount of high-value feedback. Test-time compute, the engine behind reasoning models, spends extra computation when you ask a question, letting the model work through a problem rather than answer in one pass. NVIDIA now describes three scaling regimes, not one: pre-training, post-training, and test-time. The second and third do not depend on an ever-larger pile of human text. They depend on compute and on cleverness.
The clearest public sign of the shift came from OpenAI itself. According to Fortune, a model developed under the codename Orion, intended at one point to become GPT-5, struggled to deliver the leap over GPT-4 that pre-training scaling had reliably produced before. After two long training runs that fell short, OpenAI shipped that model in February 2025 as GPT-4.5, its last model in the old non-reasoning mould, and pivoted. GPT-5, released in August 2025, was not a bigger pre-trained model in the GPT-4 lineage. It was a routed system that leans on reasoning and test-time compute. When the firm that wrote the scaling playbook changes the playbook, that is the signal worth watching.
So is the data wall a ceiling or a detour?
The honest read is two-sided.
The case that it is a real ceiling: pre-training scaling on human text is hitting a hard limit, the easy era of buying capability with scraped data is over, and synthetic data is not a clean substitute because language cannot be verified the way a Go game can. The naive recipe has run its course.
The case that it is a detour: the same Epoch AI that raised the alarm published a careful 2024 analysis, Can AI scaling continue through 2030?, concluding that data scarcity is less likely to halt scaling than critics expect. Multimodal data, training on images, audio, and video, plausibly multiplies the pool. Verified, human-anchored synthetic data expands it further. Data efficiency keeps improving, so each token teaches more. Epoch's view is that power and chip supply will bite before text does. And a recent systematic study found synthetic data in pre-training helps when it is handled carefully and hurts when it is not, which is a tuning problem, not a wall.
Both are true, and they describe different things. The data wall is a real ceiling for one specific recipe: get smarter purely by pre-training a bigger model on more scraped human text. That recipe is near its end. It is a detour for the broader goal of more capable AI, because the field has already rerouted through post-training, test-time compute, multimodal data, and verified synthetic data. Progress did not stop. It changed shape.
For anyone building real systems on these models, the practical consequence is the useful part. The years when each new model was simply, dramatically better than the last are giving way to slower, narrower gains at the frontier. The advantage moves to what you build around the model: retrieval, memory, tool use, evaluation, and the orchestration that turns a capable model into a reliable system. That is the unglamorous engineering an implementation partner like Perform Digital exists to do. The data wall does not end AI. It ends the era when scale alone was the strategy.
Council summary
This post argues that the data wall is real but narrow: pre-training a bigger model on more scraped human text has nearly exhausted its fuel, yet the broader pursuit of more capable AI has simply changed route. It walks the reader from the Chinchilla scaling law that made data a strategic input, through Epoch AI's roughly 300 trillion token estimate of usable human text and the problems that block scraping more, to synthetic data and its real failure mode, model collapse. The distinction to carry away is replace versus accumulate: recursive training on model output degrades a model, while filtered, verified, human-anchored synthetic data is a working tool. The takeaway for practitioners is that effortless capability gains from scale are ending, and the advantage now shifts to the engineering built around the model: retrieval, memory, tool use, evaluation, and orchestration.