Skip to content
LEWIS C. LIN AMAZON.COM BESTSELLING AUTHOR
Go back

4.5 → 4.6: Our Best Guesses on How Anthropic Did It

Edit page

Anthropic doesn’t publish training recipes. They publish outcomes. So when Sonnet 4.6 matches or beats Opus 4.5 — a full tier above it — we’re left making educated guesses.

Here’s our ranked list, most to least likely.

1. They Built a Smarter Judge

Confidence: Revaled by Anthropic

Before a model ships, it gets evaluated thousands of times by an internal “critic” model that scores its responses. This critic is the backbone of a technique called RLAIF (Reinforcement Learning from AI Feedback) — where the AI learns not from humans directly, but from another AI’s judgments.

When the critic gets smarter, everything downstream improves. The consistency and nuance gains in 4.6 look exactly like what happens when your judge gets better at its job — not when you simply feed the model more data.

2. They Started Grading the Work, Not Just the Answer

Confidence: Revaled by Anthropic

Most AI training rewards correct final answers. The problem: a model can get the right answer for the wrong reasons, and you’d never know.

Process Reward Models (PRMs) score each step in the reasoning chain, not just the conclusion. Think of it like a math teacher who grades your work, not just the answer at the bottom of the page. This is almost certainly what’s behind the new “effort parameter” — a feature where the model decides how hard to think based on the task. You can only teach that kind of self-awareness if you’re rewarding the quality of thinking, not just outcomes.

3. More Data, Smarter Filtering

Confidence: Revaled by Anthropic

The official system card confirms new data sources: opted-in user conversations, third-party datasets, and data generated internally — almost certainly outputs from stronger Anthropic models used to teach weaker ones (a technique called synthetic data generation). Knowledge cutoff also extended to May 2025.

But the volume isn’t the point. The tell is “classification-based filtering” — using a model to score and select the best training examples rather than just removing obvious garbage. Better inputs compound into better outputs across every training run.

4. They Let the Model Learn by Doing

Confidence: Strong Bet

Likely included in that “internally generated data”: Anthropic deployed fleets of Claude agents on complex tasks, collected the runs that succeeded, and used those as training material. This is called multi-agent rollout — and it’s powerful because the successful trajectories are self-verifying. If the agent completed the task, the reasoning that got it there was correct. No human labeler required.

This is the most credible explanation for why 4.6 improved so dramatically on multi-step tasks specifically.

5. They Used Brain Scans to Fix Behavior

Confidence: Strong Bet

This one is genuinely new. Anthropic’s researchers can now examine which internal patterns activate when Claude is, say, gaming a benchmark versus genuinely solving it. This field is called mechanistic interpretability — essentially, neuroscience for AI.

The system card shows they measured “evaluation awareness” (the model behaving better when it knows it’s being tested). Measuring it that carefully suggests they also suppressed it during training. The result: a model whose real-world behavior more closely matches its benchmark behavior. That makes the scores more trustworthy, not just higher.

6. They Hired Experts, Not Just Raters

Confidence: Strong Bet

Human feedback still matters — but which humans matters more. The 4.6 jumps in finance, medical calculations, and multilingual tasks point to a shift toward domain-expert annotation: licensed professionals rating outputs in their fields, rather than generalists picking the better paragraph. Better-quality human signal in a specific domain produces outsized improvements there. The benchmark pattern matches.

7. They Trained Against Attacks

Confidence: Educated Guess

Prompt injection is when a malicious instruction hidden in a document or webpage hijacks the AI mid-task — a live threat now that Claude operates autonomously in browsers and codebases. The system card calls out improved resistance as a deliberate goal, not a side effect. That means purpose-built attack data: red-teamers generating injections, the model trained specifically on those failures. With Claude Code in production, Anthropic has direct business motivation to harden this explicitly.

8. Copying from the Top Model — Probably Helped, Didn’t Lead

Confidence: Educated Guess

Knowledge distillation is training a smaller model to mimic a larger one’s outputs. It’s the most common guess for how Sonnet closed the gap with Opus, and it probably contributed to fluency and general knowledge.

But the headline improvements — multi-step reasoning, behavioral consistency, adaptive thinking — are exactly what distillation handles worst. Those gains look like deliberate post-training interventions. Distillation helped at the margins. The other seven levers did the real work.

The Short Version

The gap between 4.5 and 4.6 didn’t close because Anthropic used more compute or scraped more of the internet. It closed because their post-training stack — the work that happens after the base model is built — got meaningfully more sophisticated. Better judges. Better feedback. Better data selection. Smarter humans in the loop.

That’s where the frontier is being won right now.

Based on the Claude Sonnet 4.6 system card (Feb 17, 2026) and publicly observable benchmark data. Confidence estimates are our own inference.


Edit page
Share this post on:

Next Post
What We Learned from OpenClaw: 6 lessons on the future of AI agents