The Thinking Lever: Controlling Claude's Effort

Why More Tokens Mean Better Answers

Scaling a model’s parameters at training time improves performance — but so does letting a model spend more tokens at inference time. Anthropic calls this test-time compute, and the results mirror what happens with bigger models: across reasoning benchmarks, coding evals, computer use, and PhD-level question sets, performance rises as token spend rises.

The three types of tokens Claude can spend at runtime are:

Thinking — Claude’s internal scratch pad before it responds
Tool calls — Claude’s interface with the world (web search, MCP servers, file writes)
Text output — the response the user or system actually receives

Every additional token has a direct cost in time and money, so controlling how many tokens Claude spends on a task is a practical skill.
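To make the cost point concrete, here is a minimal sketch that adds up the three token types for one run and prices them. The `TokenUsage` class, the token counts, and the flat price of $10 per million tokens are all illustrative assumptions, not real pricing or a real API shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Hypothetical per-run token counts for the three token types."""
    thinking: int
    tool_calls: int
    text_output: int

    @property
    def total(self) -> int:
        return self.thinking + self.tool_calls + self.text_output


def estimate_cost_usd(usage: TokenUsage, usd_per_million_tokens: float) -> float:
    """Cost of one run at a single flat price (a simplifying assumption)."""
    return usage.total * usd_per_million_tokens / 1_000_000


# Made-up numbers for illustration only.
usage = TokenUsage(thinking=6_000, tool_calls=1_500, text_output=500)
cost = estimate_cost_usd(usage, usd_per_million_tokens=10.0)
print(f"{usage.total} tokens, about ${cost:.2f} per run")
```

In this made-up example, thinking accounts for most of the spend, which is why the rest of this lesson focuses on controlling it.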

Figure: Three types of test-time compute and the effort dial that controls them

Adaptive Thinking: Thinking as a Tool, Not a Toggle

Early reasoning models had a fixed pipeline: think → execute tool calls → respond. Then Anthropic introduced interleaved thinking, where Claude could think after each tool call. The current generation goes further with adaptive thinking: Claude decides whether to think, when to think, and how long to think, in any order it likes.

This mirrors how people actually work. Answering “what is 10 + 10?” requires no pause. Working through a complex debugging problem requires thinking at almost every step. And sometimes you just act — hit the tennis ball, then think about tactics at the baseline.

Crucially, the old “extended thinking toggle” was not a fine-grained effort control. Turning it off removed a capability: it stripped out the thinking tool entirely. Adaptive thinking keeps the tool available and lets the model reason about when to use it, just as it reasons about when to call a web search.

Choosing an Effort Level

Effort levels are the primary practical lever. Here are rules of thumb when you can’t run a formal eval:

Level	When to use
Low	Latency-sensitive, low-intelligence tasks: classification, summarization, data extraction
Medium	Routine tasks where some reasoning helps but speed matters
High	Good balance of speed, tokens, and intelligence for most business logic
X-High	Anthropic’s default for Claude Code and claude.ai — best Pareto trade-off
Max	Only when the task is genuinely hard and you know you need every bit of intelligence

There are diminishing marginal returns at the top. Max effort often uses 10× more tokens than high for only a marginal accuracy bump. Start at x-high and go up only if evals show a meaningful gain.
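The rules of thumb in the table can be encoded as a small lookup, sketched below. The task-category names and the `pick_effort` helper are hypothetical, and the effort strings are just the labels used in this lesson, not confirmed API parameter values.

```python
# Effort labels as used in this lesson, ordered from cheapest to most expensive.
EFFORT_LEVELS = ["low", "medium", "high", "x-high", "max"]

# Hypothetical task categories mapped to the table's suggestions.
RULES_OF_THUMB = {
    "classification": "low",
    "summarization": "low",
    "data_extraction": "low",
    "routine": "medium",
    "business_logic": "high",
}

DEFAULT_EFFORT = "x-high"  # the lesson's suggested starting point


def pick_effort(task_kind: str, evals_show_max_helps: bool = False) -> str:
    """Return a starting effort level; go to max only when evals justify it."""
    if evals_show_max_helps:
        return "max"
    return RULES_OF_THUMB.get(task_kind, DEFAULT_EFFORT)


print(pick_effort("classification"))
print(pick_effort("open_ended_research"))
```

Unknown task kinds fall back to x-high, and max is never chosen by default, which matches the advice to go up only when evals show a meaningful gain.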

The Claude Plays Pokémon example

When Anthropic put Claude on low effort in Pokémon Red, an interesting thing happened: instead of exploring methodically, Claude found creative escape routes. It used repels to avoid random encounters and escape ropes to exit caves instantly, and it ran from battles wherever it could, completing the game faster through avoidance rather than confrontation.

Low effort can surface unique attractor states precisely because it constrains how much the model overthinks a path. For speed-sensitive pipelines where the task is simple, this is a feature, not a bug.

Model Size vs. Effort Level

When should you use a small model at high effort versus a big model at low effort?

The Haiku-vs-Opus simulation result from Anthropic’s research is instructive: Haiku at x-high effort spent a comparable number of tokens to Opus at low effort but produced visibly worse output. The conclusion: if the task requires any intelligence at all, the larger model wins, even at low effort.

Use smaller models when:

The output is so predictable that correctness is almost guaranteed
You’re doing classification, routing, or extraction with a small, well-defined label set
Latency is critical and you have evals proving the smaller model matches quality

Use larger models (Sonnet, Opus) at lower effort when speed and cost matter but the task still requires judgment.
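One way to read the guidance above is as a three-way decision, sketched here. The `Config` type, the `"small"`/`"large"` tier names, and the boolean inputs are hypothetical simplifications; pairing the small model with low effort follows the table's suggestion for extraction-style tasks and is an assumption.

```python
from typing import NamedTuple


class Config(NamedTuple):
    model_tier: str  # hypothetical tier name: "small" or "large"
    effort: str      # effort label as used in this lesson


def choose_config(
    needs_judgment: bool,
    cost_sensitive: bool,
    small_model_passes_evals: bool,
) -> Config:
    """Pick a model tier and effort level from three yes/no questions."""
    # Small model only for predictable tasks where evals prove it matches quality.
    if not needs_judgment and small_model_passes_evals:
        return Config("small", "low")
    # Judgment needed but speed and cost matter: larger model, lower effort.
    if cost_sensitive:
        return Config("large", "low")
    # Otherwise start from the lesson's default.
    return Config("large", "x-high")


print(choose_config(needs_judgment=True, cost_sensitive=True,
                    small_model_passes_evals=False))
```

Note that the small model is never selected without eval evidence, even for simple tasks.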

Practical Guidance

Enable thinking whenever possible. Give Claude the scratch pad; let it decide when to use it.
Run evals before tuning effort. Build a test set of your hardest representative cases and benchmark across effort levels — that’s the only reliable way to find the knee in the curve.
Default to x-high. It’s Anthropic’s chosen default for their own products and an excellent starting point.
Set budgets, not toggles. As Claude gains longer-horizon autonomy, the preferred pattern is: give Claude a time or token budget and let it allocate compute internally, rather than micro-managing thinking on/off.
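Finding "the knee in the curve" can be sketched as a scan over eval results ordered from lowest to highest effort. The `find_knee` helper, the 0.02 threshold, and the accuracy numbers below are all made up for illustration; real results would come from your own test set.

```python
from typing import Sequence, Tuple


def find_knee(results: Sequence[Tuple[str, float]], min_gain: float = 0.02) -> str:
    """Return the first effort level whose next step up gains less than min_gain.

    `results` must be ordered from lowest to highest effort as
    (effort_label, accuracy) pairs.
    """
    if not results:
        raise ValueError("results must contain at least one measurement")
    for (level, acc), (_, next_acc) in zip(results, results[1:]):
        if next_acc - acc < min_gain:
            return level
    # Every step was worth it, so the highest measured level wins.
    return results[-1][0]


# Made-up accuracies for illustration only.
measured = [
    ("low", 0.70),
    ("medium", 0.78),
    ("high", 0.84),
    ("x-high", 0.88),
    ("max", 0.89),
]
print(find_knee(measured))  # prints "x-high"
```

This sketch stops at the first small gain, so a noisy eval can trigger it early; a larger test set or a few repeated runs per level would make the result more trustworthy.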

Check your understanding

1. What does the 'thinking toggle' approach actually do when turned off?

Turning extended thinking off removes the thinking tool from Claude's repertoire entirely. It does not simply reduce effort; it eliminates the capability, similar to removing web search rather than using it less.

2. What is the recommended default effort level for most production use cases?

Anthropic set x-high as the default for Claude Code and claude.ai after finding it to be the Pareto-efficient point. Max shows diminishing marginal returns for most tasks and costs disproportionately more tokens.

3. How does adaptive thinking differ from interleaved thinking?

Interleaved thinking let Claude think after each tool call. Adaptive thinking treats thinking as just another tool and lets Claude choose to call it, or not, at any point, mirroring how humans actually work through problems.

4. When is a smaller model at higher effort preferred over a larger model at lower effort?

Anthropic's simulation experiments showed the larger model wins whenever intelligence is required, even at low effort. Smaller models at higher effort are appropriate only for low-intelligence tasks where the correct output is almost guaranteed regardless.

Build it yourself

Follow these exact steps to reproduce it yourself

Try it yourself: effort-level comparison

Pick a task you currently run in production or in your own workflow — ideally something that requires some reasoning, not just extraction.
Run it at three effort levels: low, x-high, and max, with the same model (Sonnet or Opus).
Compare: output quality, token count (check usage in the API response), and wall-clock time.
If max is better than x-high by a meaningful margin for your task, moving from x-high to max is justified. If not, stop at x-high.
Now try the same task at x-high on a one-size-smaller model (e.g., Haiku if you were using Sonnet). Does quality hold? If yes, the smaller model is a cost win.

This mini-eval takes an hour and gives you a defensible effort strategy for your use case.
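The steps above can be sketched as a small harness that records tokens and wall-clock time per effort level. Everything model-specific is hidden behind a `run_task` callable, which is a hypothetical placeholder: you would replace `fake_runner` with a function that calls your model at the given effort level and returns the output text plus the token count from the response's usage data. The token numbers in `fake_runner` are invented.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: run_task(prompt, effort) -> (output_text, total_tokens)
Runner = Callable[[str, str], Tuple[str, int]]


@dataclass
class RunResult:
    effort: str
    output: str
    total_tokens: int
    seconds: float


def compare_efforts(
    run_task: Runner,
    prompt: str,
    efforts: Iterable[str] = ("low", "x-high", "max"),
) -> List[RunResult]:
    """Run the same prompt at each effort level and record tokens and time."""
    results = []
    for effort in efforts:
        start = time.perf_counter()
        output, tokens = run_task(prompt, effort)
        elapsed = time.perf_counter() - start
        results.append(RunResult(effort, output, tokens, elapsed))
    return results


def fake_runner(prompt: str, effort: str) -> Tuple[str, int]:
    """Stand-in for a real model call; token counts are invented."""
    tokens = {"low": 200, "x-high": 2_000, "max": 20_000}[effort]
    return f"[{effort}] answer to: {prompt}", tokens


for r in compare_efforts(fake_runner, "Summarize the incident report"):
    print(f"{r.effort:>6}  {r.total_tokens:>6} tokens  {r.seconds:.4f}s")
```

Output quality still has to be judged separately, by hand or with a grader of your own, since the harness only captures cost and latency.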
