Ask a standard language model a hard question and it starts typing immediately. There is no pause, no visible deliberation. The model predicts the first word of its answer, then the next, then the next, and the answer is whatever falls out of that single left-to-right run. It is fast, and for most questions it is fine. For a competition math problem or a subtle piece of code, it is often confidently wrong.
A reasoning model does something different. Ask it the same hard question and it stops. Behind the scenes it generates a long stretch of working text, a private scratchpad where it sets up the problem, tries an approach, notices the approach is failing, backs up, tries another, checks the result. Only then does it write the answer you see. That hidden work is the entire idea, and everything else here is a consequence of it.
Where the pause came from
The first clue was a prompt, not a model. In January 2022 a team at Google, led by Jason Wei, published Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The finding was almost embarrassingly simple. Show a large model a couple of worked examples before a multi-step problem, and it writes out the intermediate steps and gets the answer right far more often. A follow-up paper a few months later found that simply adding the words "let's think step by step" had a similar effect. The model already had the capability. It just needed permission to spend words reaching the answer instead of guessing it in one shot.
That raised an obvious question. If letting a model write its reasoning helps, why leave it to the user to remember a magic phrase? Why not train the model to always do it, and to do it well?
OpenAI answered first. In September 2024 it released a preview of o1, the first widely used model trained specifically to reason before answering. o1 was not prompted into thinking. It was trained to generate a long internal chain of thought as a built-in habit, using reinforcement learning. The results were not a small bump. The earlier GPT-4o solved about 12 percent of problems on the AIME mathematics competition. o1 solved 74 percent on a single attempt, and with more sampling crossed into the top tier of human contestants (Wikipedia on o1). It exceeded human PhD-level accuracy on a graduate science benchmark and ranked in the 89th percentile on competitive programming. No exotic new architecture was involved. The model was simply allowed, and trained, to think first.
What "thinking" actually means: test-time compute
Here is the concept that makes reasoning models click. It has two names that mean the same thing: test-time compute and inference-time compute. Both describe how much computation a model spends at the moment you use it, as opposed to the computation spent once, up front, to train it.
For a decade the field improved models by scaling training: bigger models, more data, more training compute. That worked until it stopped working cheaply, because the supply of training data and the budgets both have ceilings. Test-time compute is a second dial. Instead of a one-pass answer, the model is allowed to generate thousands of intermediate tokens, explore several lines of attack, and check itself before it commits. More thinking tokens mean more computation spent per question.
The striking part is that this dial genuinely substitutes for model size on hard problems. A 2024 DeepMind study, Scaling LLM Test-Time Compute Optimally, found that a smaller model given a well-chosen amount of thinking time could match a model fourteen times its size on problems where it had a real chance to begin with. Hugging Face later reproduced the effect with tiny open models, showing a 3-billion-parameter model edging past a 70-billion one on a math benchmark when allowed to think. A small brain with time can beat a big brain in a hurry. That is the sentence to keep.
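One of the simplest ways to turn this dial can be sketched in a few lines: sample several independent attempts and keep the most common final answer. This is a toy illustration only, not the search procedure studied in the DeepMind paper and not what a reasoning model does inside its scratchpad. The name `sample_answer` is a hypothetical stand-in for one call to a model, and the canned outputs are made up for the example.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the answer that appears most often among the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def answer_with_more_compute(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Spend more inference compute by sampling n_samples attempts and voting.

    sample_answer is a hypothetical stand-in for one call to a model.
    """
    return majority_vote([sample_answer() for _ in range(n_samples)])


# Canned outputs standing in for five attempts by a small, unreliable model.
attempts = iter(["42", "17", "42", "8", "42"])
print(answer_with_more_compute(lambda: next(attempts), n_samples=5))  # prints 42
```

Any single attempt here is wrong two times in five, yet the vote lands on 42. The price is five calls instead of one, which is the trade the dial represents.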
The training trick: rewarding the answer, not the steps
Why does a reasoning model produce a good chain of thought when a standard model wanders? The answer is the training method, and it is worth getting right because it is the real engine of this whole generation.
You cannot teach reasoning by showing a model millions of human-written solutions. Humans skip steps, and there are not enough expert solutions in the world. You also cannot easily score a chain of thought for quality, since judging whether reasoning is good is itself a hard, subjective task. So the labs stopped trying to grade the reasoning.
Instead they grade the outcome. The method is called reinforcement learning with verifiable rewards, or RLVR. The model attempts a problem whose answer can be checked automatically: a math problem with a known result, a coding task that either passes its tests or does not. If the final answer is correct, the whole attempt gets a positive reward. If it is wrong, it does not. The reasoning in between is never directly judged. The model is simply pushed, over millions of attempts, toward whatever internal process tends to land on correct answers.
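The grading rule is simple enough to sketch. The version below is a minimal illustration under stated assumptions, not any lab's actual reward code: it assumes the model ends its output with a line of the form `Answer: ...` (a hypothetical convention), and it compares that line to a known result by exact string match. Everything before the final answer is ignored, which is the point.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text of the last 'Answer: ...' line, or None if there is none.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"^Answer:[ \t]*(.+?)[ \t]*$", completion, flags=re.MULTILINE)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, expected: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0.

    The reasoning text before the final answer is never inspected.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == expected.strip() else 0.0


messy_but_right = "Try x = 3. That fails.\nBack up, try x = 4. Checks out.\nAnswer: 4"
tidy_but_wrong = "Clearly x = 3.\nAnswer: 3"

print(verifiable_reward(messy_but_right, "4"))  # prints 1.0
print(verifiable_reward(tidy_but_wrong, "4"))   # prints 0.0
```

For a coding task the same shape applies, with "run the test suite" in place of the string comparison.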
The clearest public demonstration came from the Chinese lab DeepSeek. In January 2025 it released DeepSeek-R1, with a full paper and open weights, and it matched o1 on the headline benchmarks: 79.8 percent on AIME 2024 against o1's 79.2, and 97.3 percent on the MATH-500 set. An experimental sibling, R1-Zero, was trained with reinforcement learning and nothing else, no supervised warm-up at all. Left to optimise for correct answers, the model taught itself to reason. Its chains of thought grew longer on harder problems. It began re-checking its own steps and, in the researchers' description, having an "aha moment" where it caught a mistake and corrected course. Nobody scripted that behaviour. It emerged because the behaviour produced right answers, and right answers got rewarded.
DeepSeek-R1 also broke an assumption about cost. DeepSeek later put the reinforcement learning stage at around 294,000 dollars, and offered the model through its API at roughly a thirtieth of o1's price. When markets opened on 27 January 2025, Nvidia lost close to 600 billion dollars of value in a single day. A frontier reasoning model was no longer something only a handful of companies could build.
Where reasoning models stand in 2026
The pattern held and spread. OpenAI followed o1 with o3. Even its low-compute setting roughly tripled o1's score on the ARC-AGI abstract-reasoning benchmark, and a high-compute run reached a breakthrough 87.5 percent, edging past the 85 percent human baseline, though that run burned an enormous amount of compute. By 2026 the line between "reasoning model" and "model" has mostly dissolved. The current frontier systems from OpenAI, Anthropic, Google, and DeepSeek are hybrids: they answer simple requests directly and switch into extended thinking when the problem warrants it. Most expose a dial. You set an effort level, or a thinking-token budget, and trade speed and cost against depth. DeepSeek's V4, previewed in April 2026, continues the same arc of closing the gap to the closed frontier at a fraction of the price.
So a model that thinks can clear problems a larger model cannot. That is real, and it is the reason the technique took over. But 2026 brought a more honest second finding, and it is the part most coverage skips.
Thinking longer is not thinking better
The early framing was seductive: thinking is good, so more thinking is more good. Push the budget up, watch accuracy climb. It does climb, for a while. Then it stops, and on some problems it reverses.
The diminishing return is the gentle version. Each extra thousand tokens of reasoning buys less than the last thousand. A study on overthinking in test-time compute scaling found that beyond a moderate budget the marginal gains nearly vanish, and that the right amount of thinking depends on the difficulty of the question. Easy questions reach the point of no further benefit early. Spend more after that and you are paying for tokens that change nothing.
The sharper version is inverse scaling, where more thinking actively lowers accuracy. A 2025 paper, Inverse Scaling in Test-Time Compute, built tasks where longer reasoning made leading models worse, and named the failure modes. Some models drift toward irrelevant detail the longer they run. Some lock onto the framing of a question and miss its substance. On a deliberately simple counting question wrapped in planted distractors, where the correct answer was 2, the study found that both Claude and DeepSeek-R1 grew less accurate the longer they reasoned, because the extra tokens were spent trying to fold the irrelevant numbers into the sum. A separate line of work identified the opposite glitch, underthinking, where a model abandons a promising approach too soon and flits between half-explored ideas. And Apple's much-discussed paper The Illusion of Thinking showed that on controlled puzzles, reasoning models do not degrade gracefully as difficulty rises. Past a complexity threshold their accuracy collapses to zero, and stranger still, they often spend fewer reasoning tokens on the hardest problems, as if quietly giving up. That paper drew sharp rebuttals over its puzzle design, which is a fair fight to have, but the broader result that more tokens are not free has held up across many independent studies.
Two more cracks. A reasoning model's chain of thought is not a reliable transcript of why it answered as it did. Anthropic tested this directly: when a model was fed a hint and used it, it mentioned the hint in its visible reasoning only a minority of the time. The scratchpad is useful, but treating it as a confession is a mistake. And the cost is concrete. Reasoning tokens are the most expensive tokens a model produces, and a hard task can run an order of magnitude more of them than a direct answer. One 2026 analysis of chain-of-thought economics put the typical inflation at two to five times the tokens and five to fifteen extra seconds per call, for tasks like classification where the thinking buys nothing. A model that thinks about everything is a model with a budget problem.
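The two-to-five-times figure is easier to feel as a bill. The arithmetic below is a back-of-the-envelope sketch: the call volume, the 200-token direct answer, and the price of 10 dollars per million output tokens are all assumptions chosen for round numbers, not quotes from any provider.

```python
def output_cost(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Total output-token cost in dollars for a batch of calls."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


CALLS = 1_000_000      # assumed monthly volume of classification calls
DIRECT_TOKENS = 200    # assumed output tokens for a direct answer
PRICE = 10.0           # assumed dollars per million output tokens

direct = output_cost(CALLS, DIRECT_TOKENS, PRICE)
low = output_cost(CALLS, DIRECT_TOKENS * 2, PRICE)   # 2x token inflation
high = output_cost(CALLS, DIRECT_TOKENS * 5, PRICE)  # 5x token inflation

print(f"direct: ${direct:,.0f}  with thinking: ${low:,.0f} to ${high:,.0f}")
# prints: direct: $2,000  with thinking: $4,000 to $10,000
```

Under these assumptions the extra 2,000 to 8,000 dollars buys no accuracy at all on a task where the thinking changes nothing.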
Where this is heading
The first era of reasoning models asked how to make a model think. The next era is about teaching it how much to think, and when not to bother. That is the right problem. The valuable skill is not unlimited deliberation. It is calibration: a quick answer for a quick question, a long and careful one for a genuinely hard one, and the judgement to tell the two apart.
Expect three moves. The dial gets automatic, so the model itself estimates how hard a question is and allocates effort, rather than leaning on the user to pick a setting. Routing gets normal, so a cheap fast model handles the easy majority of requests and a deep reasoning model is reserved for the cases that earn it, which keeps the bill sane. And research keeps pressing on quality over quantity: training that rewards reaching the answer in fewer steps, and methods that cut a reasoning trace down without losing the result.
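Routing with a capped thinking budget can be sketched as a small decision function. Everything named here is hypothetical: the model names, the task labels, the 0.3 difficulty threshold, and the budget formula are placeholders for illustration, and the difficulty score is assumed to come from some upstream estimator that this sketch does not implement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # hypothetical model name
    thinking_budget: int  # max reasoning tokens; 0 means answer directly


# Hypothetical task labels where extended thinking rarely pays for itself.
DIRECT_TASKS = {"extraction", "formatting", "classification", "routing"}


def choose_route(task_type: str, difficulty: float, max_budget: int = 8000) -> Route:
    """Pick a model and a capped thinking budget for one request.

    difficulty is an assumed score in [0, 1] from an upstream estimator.
    """
    difficulty = max(0.0, min(1.0, difficulty))
    if task_type in DIRECT_TASKS or difficulty < 0.3:
        return Route("fast-model", 0)
    # Scale the budget with difficulty, but never exceed the hard cap.
    budget = min(max_budget, int(1000 + difficulty * 10000))
    return Route("reasoning-model", budget)


print(choose_route("extraction", 0.9))  # fast model, no thinking
print(choose_route("math", 0.5))        # reasoning model, 6000-token budget
print(choose_route("math", 1.0))        # reasoning model, capped at 8000
```

The cap matters as much as the routing: even a request that earns the reasoning model gets a ceiling on how long it may think.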
For anyone building on these models, the practical takeaway is plain. A reasoning model is a specialised, expensive tool, not a default. It is worth its cost on multi-step problems where being right matters more than being fast. It is wasted on extraction, formatting, and routing, where it adds latency, adds spend, and occasionally talks itself out of a correct answer. Picking the right model for each step of a workflow, and capping how long any of them is allowed to think, is exactly the kind of unglamorous engineering decision that separates an agent that works in production from one that only worked in the demo. It is a large part of what an implementation partner like Perform Digital is actually paid to get right.
The deeper point is that the field found a second way to make models smarter, one that does not depend on training an ever-larger model on an ever-larger internet. That matters, because both of those were running short. It connects to the collapse in inference cost, which makes spending extra computation per question affordable in the first place, and to the shift in how models learn, from human preference to verifiable reward. A reasoning model is what you get when those two threads meet. The honest 2026 lesson is just that the pause has a sweet spot, and the next round of progress is about finding it.
Council summary
This post argues that a reasoning model is just a standard model given permission and training to work a problem on a hidden scratchpad before it answers, and that the engine behind it is reinforcement learning that rewards correct outcomes rather than graded steps. That second compute dial, spent at inference time, genuinely substitutes for raw model size on hard problems, which is why the technique took over the frontier. But the honest 2026 finding is that thinking longer is not thinking better: past a point the gains diminish, and on some tasks more reasoning actively lowers accuracy. The reader's takeaway is that a reasoning model is a specialised, expensive tool, not a default, and that the real skill, for labs and for anyone building on these models, is calibration: matching the depth of thought to the difficulty of the question.