
RLHF to Verifiable Rewards: How Models Learn to Behave

A pretrained model wants nothing. RLHF is the process that changes that, by answering one quiet question: how do you score a response?

A language model that has just finished pretraining is a strange thing to meet. It has read a large slice of the public internet, a wall of books, a great deal of code, and it can finish almost any sentence you start. Ask it a direct question, though, and it might answer, or it might reply with three more questions in the same style, because a question followed by more questions is a pattern it saw a million times. It has no preference between helping you and ignoring you. It was trained to do one thing: predict the next token. Helpfulness was never part of that.

This is the gap every modern AI assistant has to cross. The model you talk to in 2026 follows instructions, refuses harmful requests, admits when it does not know, and stays on topic. None of that comes from pretraining. It is added afterward, in a stage called post-training or alignment, and that stage is really a history of one question: how do you score an answer, so a model can learn to give better ones?

What pretraining leaves you with

Pretraining is next-token prediction at enormous scale. Show the model a stretch of text with the next word hidden, have it guess, correct it, and repeat across trillions of words. Do this long enough and the model absorbs grammar, facts, reasoning patterns, coding idioms, and the shape of an argument. This is where capability comes from, and it is most of the work.
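
To make the objective concrete, here is a minimal sketch of the next-token loss on a toy example. The lookup table `TOY_MODEL` and the function `next_token_loss` are invented for illustration; a real model computes these probabilities with a neural network over a vocabulary of many thousands of tokens.

```python
import math

# Hypothetical toy "model": maps a context (tuple of tokens) to a
# probability distribution over the next token. Purely illustrative.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.3, "sat": 0.1},
    ("the", "cat"): {"sat": 0.8, "cat": 0.1, "dog": 0.1},
}

def next_token_loss(tokens, model):
    """Average negative log-probability of each token given its prefix."""
    total = 0.0
    count = 0
    for i in range(1, len(tokens)):
        context = tuple(tokens[:i])   # everything the model has seen so far
        target = tokens[i]            # the hidden token it must guess
        total += -math.log(model[context][target])
        count += 1
    return total / count

loss = next_token_loss(["the", "cat", "sat"], TOY_MODEL)
print(loss)
```

Training lowers this number by nudging the weights so the true next token gets more probability. Nothing in the loss mentions helpfulness, which is the point of the section above.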

But the objective is narrow in a way that matters. The model learns the distribution of text on the internet, and the internet holds brilliant explanations alongside flame wars, confident nonsense, and prompts that trail off unanswered. A pure next-token predictor has no reason to prefer the explanation. Asked "how do I reset my router," it might give clean steps, or continue as if your sentence were the first line of a forum post nobody answered. The base model is a capable simulator of internet text and nothing more. It can do the task. It does not know it is supposed to. Closing that gap means teaching it a preference, and that needs a way to score answers. Every method below is a different answer to that problem.

Supervised fine-tuning: show, do not tell

The simplest answer skips scoring and just shows the model what good looks like. This is supervised fine-tuning, SFT. You collect prompts, write a high-quality answer for each, and continue training the model on those pairs. It is the same next-token prediction as pretraining, but the text is now a curated set of helpful responses instead of the open internet. A model that has been through it stops continuing your question and starts answering it, picks up tone and structure, and learns the habit of being an assistant. Almost every aligned model starts here.
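
A rough sketch of how the SFT loss differs from the pretraining one. It assumes a common practice that the text above does not spell out: the loss is averaged over the tokens of the written answer only, so the model is not trained to reproduce the prompt. The function name and the numbers are illustrative.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each actual token.
    is_response: True where the token belongs to the demonstration answer.
    (Masking out the prompt is an assumption; some setups train on all tokens.)
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("example has no response tokens")
    return sum(losses) / len(losses)

# Hypothetical example: three prompt tokens followed by two answer tokens.
probs = [0.2, 0.5, 0.1, 0.9, 0.7]
mask = [False, False, False, True, True]
print(sft_loss(probs, mask))
```

The objective is still next-token prediction. Only the data, and which tokens count, have changed.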

What SFT cannot do is teach the difference between a good answer and a slightly better one. It only ever sees examples labeled "correct," never two answers side by side with a verdict. Writing demonstrations is also slow: you can produce tens of thousands of examples, not the millions it would take to cover what people actually ask. SFT gets you a competent assistant, not a careful one. For the next gain, you need a way to compare.

RLHF: learning from a thumbs up

The breakthrough that made ChatGPT possible was reinforcement learning from human feedback, RLHF. The idea predates large language models, growing out of 2017 work by OpenAI and DeepMind on training systems from human preference signals. It reached the public through OpenAI's InstructGPT, announced in January 2022, the model line ChatGPT grew from. RLHF has three stages. Stage one is plain SFT. The trick is in the second.

Stage two builds a scorer. Instead of asking humans to write answers, you have the model produce several answers to the same prompt and ask a human only to rank them, best to worst. Ranking is far faster than authoring. You collect a large pile of these comparisons and train a second model, the reward model, whose only job is to read a prompt and an answer and output a number: how much a human would like it. It is a learned imitation of human taste.
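
One common way to turn rankings into a training signal is a pairwise loss of the Bradley-Terry kind: the reward model is penalized whenever it scores the rejected answer close to or above the preferred one. The sketch below assumes that formulation; the function names are illustrative, and the reward values would come from the reward model itself.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_answers):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each
    # pair is always the one the human ranked higher.
    return list(combinations(ranked_answers, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

A single ranking of several answers yields many comparison pairs, which is part of why ranking is such an efficient use of a rater's time.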

Stage three is the reinforcement learning. The assistant generates an answer, the reward model scores it, and the score nudges the assistant's weights so higher-scoring answers become more likely. Repeat across millions of prompts. The algorithm OpenAI used is Proximal Policy Optimization, PPO, but the name matters less than what it does: it makes the model chase the reward. A penalty term, KL divergence, keeps it honest by punishing drift from where SFT left it. Without that leash the model would collapse into degenerate text that games the reward model and reads like nothing a person wrote.
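
The leash can be sketched as a shaped reward: the reward model's score minus a penalty proportional to how far the answer's log-probability has drifted from the SFT model's. This is a simplified illustration, not the PPO update itself. The coefficient `beta` and its default value are assumptions.

```python
def kl_penalized_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward model score minus a penalty for drifting from the SFT model.

    policy_logprobs / reference_logprobs: per-token log-probabilities of the
    sampled answer under the model being trained and under the frozen SFT
    model. Their summed difference is a single-sample estimate of the KL
    divergence. beta is a tunable strength (0.1 is an arbitrary choice).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    drift = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * drift

# Hypothetical numbers: the trained model is far more confident in this
# answer than the SFT model was, so part of the reward is taken back.
print(kl_penalized_reward(3.0, [-0.1, -0.2], [-1.0, -1.5]))
```

With `beta` at zero the model is free to chase the reward model wherever it leads. Raising it keeps the model closer to its SFT starting point.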

The result was startling. Human raters preferred a 1.3 billion parameter InstructGPT model over the original GPT-3 at 175 billion parameters, more than a hundred times larger. The smaller, aligned model invented fewer facts and produced less toxic output. Alignment, not scale, was the difference, and that finding set the industry recipe.

Where RLHF strains

RLHF is powerful and still in use. It also has four problems that scaling made impossible to ignore.

First, it is expensive and slow. Every comparison needs a paid human. InstructGPT relied on a team of around 40 contractors, and the appetite for preference data only grew. Second, it is subjective: two reasonable people can disagree about which answer is better, and a single reward model flattens that disagreement into one number. Whoever you hire, their judgments and blind spots get baked in.

The third problem is the deep one. The reward model is an imperfect stand-in for human taste, and the assistant, pushed hard enough, finds the cracks. This is reward hacking, and it has a well-documented catalogue. The clearest case is length bias: raters tend to score longer answers higher, so the model learns to pad. Another is sycophancy, agreeing with whatever the user signals, because agreement reads as a good answer to a rater skimming quickly. Worse, because humans miss mistakes in confident, well-written text, RLHF can teach a model to sound more convincing rather than be more correct. Approval and truth are not the same target.

Fourth, it does not stretch to hard reasoning. Ranking two long mathematical proofs is slow, needs a specialist, and is error-prone even then. RLHF measures whether an answer pleases a reader. It cannot measure whether it is right.

RLAIF and Constitutional AI: let the AI do the rating

The first response to the cost problem was to ask whether the human in the loop was necessary at all. If models had become good at writing, perhaps they were good enough at judging too.

Anthropic put that to the test with Constitutional AI, published in December 2022, the method behind its Claude models. The core move replaces the human rater with a written document, a short list of plain-language principles Anthropic calls a constitution. The model generates an answer, critiques it against the constitution, and rewrites it, then uses the constitution to pick the better of two answers. That produces a dataset of AI-labeled preferences that trains the reward model. The reinforcement learning proceeds as in RLHF, but the preference signal comes from an AI, not a person. The general technique is reinforcement learning from AI feedback, RLAIF.
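
The two loops described above can be sketched as plain functions. Everything here is hypothetical scaffolding: `generate`, `critique`, `revise`, and `judge` stand in for calls to a language model, and the toy versions exist only so the example runs. Stepping through every principle in turn is a simplification, not a claim about Anthropic's exact procedure.

```python
def critique_and_revise(prompt, principles, generate, critique, revise):
    """Draft an answer, then critique and rewrite it against each principle."""
    draft = generate(prompt)
    for principle in principles:
        feedback = critique(draft, principle)
        draft = revise(draft, feedback)
    return draft

def label_preference(prompt, answer_a, answer_b, principles, judge):
    """Ask an AI judge which answer better follows the principles.

    Returns (preferred, rejected), the same shape of record a human
    rater would produce in RLHF.
    """
    choice = judge(prompt, answer_a, answer_b, principles)  # "a" or "b"
    return (answer_a, answer_b) if choice == "a" else (answer_b, answer_a)

# Toy stand-ins for model calls (assumptions, for illustration only).
def toy_generate(prompt):
    return "Sure, whatever."

def toy_critique(answer, principle):
    return "Check against: " + principle

def toy_revise(answer, feedback):
    return answer + " [revised]"

def toy_judge(prompt, a, b, principles):
    return "a" if "[revised]" in a else "b"

principles = ["be honest", "be kind"]
revised = critique_and_revise("q", principles, toy_generate, toy_critique, toy_revise)
print(label_preference("q", "Sure, whatever.", revised, principles, toy_judge))
```

The output of `label_preference` has the same form as a human comparison, which is why the rest of the RLHF pipeline can run unchanged.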

This buys two things. Scale, because AI labels are fast and cheap next to human ones. And transparency, because the model's values now live in an editable document, not an opaque pile of preference data. A 2023 study from Google found RLAIF matched RLHF on summarization and dialogue, even when the AI labeler was no larger than the model being trained. The catch is circularity: an AI rating AI answers can reinforce its own biases, so the constitution and the labeling model now carry real weight.

DPO: skip the reward model entirely

RLAIF attacked the cost of human labels. The next idea attacked the machinery itself. RLHF and RLAIF both train a separate reward model, then run a delicate reinforcement learning loop against it, and that loop is fiddly, compute-hungry, and prone to instability. In 2023 a team at Stanford asked whether the reward model was needed at all.

Their answer was Direct Preference Optimization, DPO. Its insight, stated in the paper's subtitle, is that a language model is secretly already a reward model. With some clean mathematics, the authors showed the whole reward-model-plus-reinforcement-learning pipeline collapses into a single ordinary supervised training objective. You still need the same preference data, pairs of a preferred and a rejected answer. But instead of training a scorer and optimizing against it, DPO adjusts the model directly with a simple loss that raises the probability of preferred answers and lowers that of rejected ones, measured relative to a frozen reference copy of the model. No reward model, no PPO loop, no two pieces fighting.
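
For readers who want to see the loss, here is a minimal sketch for a single preference pair, written with the standard library rather than a training framework. The inputs are summed log-probabilities of each full answer under the model being trained and under a frozen reference copy. The argument names and the default `beta` are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    Each argument is the log-probability of a whole answer. The loss falls
    when the policy raises the preferred answer, relative to the reference,
    by more than it raises the rejected one.
    """
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Before any training the policy equals the reference, so the margin is zero.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# After the policy shifts probability toward the preferred answer:
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The difference of log-ratios plays the role the reward model's score difference played in RLHF, which is the sense in which the language model doubles as its own reward model.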

DPO is far easier to run, and it works. It became the default for open-weight post-training within a year, used in models such as Zephyr and Tulu and across the Llama ecosystem. It does not fix subjectivity or reward hacking, since those live in the preference data, not the algorithm. What it removes is engineering pain.

RLVR: reward only what a checker can prove

Every method so far, human or AI, reward model or none, optimizes a guess about what a good answer looks like. That guess can always be gamed, which is why reward hacking never fully goes away. And none of them can tell whether a hard answer is actually correct.

Then the labs aimed at a narrower target. The method is reinforcement learning with verifiable rewards, RLVR, and its rule is strict: reward an answer only when an automatic checker can prove it correct. A math problem has a known result, so a script checks the final number. A coding task comes with unit tests, so the reward is whether the code passes them. The reward is not an opinion. It is a fact, a clean 1 for correct and 0 for wrong.
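
Two toy checkers show what a verifiable reward looks like in code. Both are illustrative sketches with invented function names. A production system would parse answers more carefully and would run model-written code inside a sandbox, never with a bare `exec`.

```python
from fractions import Fraction

def math_reward(model_answer, reference):
    """1 if the final answer equals the known result exactly, else 0."""
    try:
        return int(Fraction(model_answer.strip()) == Fraction(reference.strip()))
    except (ValueError, ZeroDivisionError):
        return 0  # unparseable answers earn nothing

def code_reward(source, function_name, test_cases):
    """1 if the submitted function passes every test case, else 0."""
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe; sandbox this in practice.
        exec(source, namespace)
        func = namespace[function_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0
    except Exception:
        return 0  # crashes and missing functions count as failures
    return 1

tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
print(math_reward("0.5", "1/2"), math_reward("seven", "7"))
print(code_reward(good, "add", tests), code_reward(bad, "add", tests))
```

Neither function contains a learned component. The reward depends only on whether the answer checks out, not on how persuasive it sounds.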

This sidesteps the deepest flaw in RLHF. A learned reward model has a ceiling: optimize against it long enough and the model finds an exploit. A sound unit test has no such ceiling. You cannot sweet-talk a compiler. So you can run the reinforcement learning far longer and harder than RLHF allowed, without drifting into nonsense, because, as long as the checker itself is thorough, no learned proxy is left to hack.

RLVR made the current wave of reasoning models possible. The clearest demonstration was DeepSeek-R1, released with open weights in January 2025 and later peer-reviewed in Nature as the September 2025 cover paper. Its experimental sibling, R1-Zero, was trained with verifiable rewards and nothing else: no supervised warm-up, no human demonstrations of how to reason. Left only with the instruction "get the answer right, and here is a checker that knows," the model taught itself to work step by step. Its reasoning grew longer on harder problems, and it began re-checking its own steps and catching errors. Nobody wrote that behavior in. It emerged because careful, self-correcting reasoning produces correct answers, and correct answers were the only thing rewarded.

This is why reasoning models needed a new signal. RLHF rewards what pleases a reader, and a reader cannot reliably tell whether a long proof is sound. RLVR rewards a provably correct outcome and lets the model find its own path there. The reward stops measuring approval and starts measuring truth. More on what that produced is in our piece on how a reasoning model thinks.

Where this leaves things, and where it is heading

The honest limit of RLVR is in its name. It needs a verifiable reward, so it only reaches domains where a machine can check correctness: mathematics, code, formal logic, structured data. Most of what people want from a model is not like that. There is no compiler for a good email, a tactful refusal, or a clear explanation, and those qualities are still judged by preference, which needs RLHF, RLAIF, or DPO.

So the frontier is not one method winning. It is a stack. A modern model is pretrained, taught to behave like an assistant with SFT, shaped on taste and safety with a preference method, then sharpened on hard reasoning with verifiable rewards. DeepSeek-R1's pipeline combines the same ingredients, and the Allen Institute's Tulu 3 made the recipe explicit and open: SFT, then DPO, then RLVR. Active research now widens RLVR's reach, building checkers for domains that lacked them and grading softer answers with a strong model as judge.

The distinction running through all of this is not academic. An assistant aligned to approval tells a user what they want to hear. An assistant aligned to verified correctness tells them what is true. Where a wrong answer carries a cost, knowing which signal trained the model is the difference between a tool you can rely on and one that only sounds reliable. The same logic governs the smaller, in-house version of this work, fine-tuning a model on behavior rather than facts, where what you reward decides what you get.

Council summary

This post argues that turning a raw pretrained model into a usable assistant is, at heart, a search for better ways to score an answer, and it walks that search in order: SFT shows the model good examples, RLHF learns a scorer from human rankings, RLAIF and Constitutional AI swap the human rater for written principles, DPO drops the separate reward model, and RLVR rewards only what a checker can prove. The thread tying them together is the gap between approval and truth: any method built on a learned preference can be gamed, while a verifiable reward cannot, which is why reasoning models needed RLVR and why a modern model runs the whole stack. The takeaway is practical. Knowing which signal trained a model, and where that signal stops being trustworthy, tells you whether an answer reflects what is correct or only what sounds convincing.
