intermediate ⏱️ 8 min read · 🎬 ~20 min video

The Thinking Lever: Controlling Claude's Effort

Master effort levels and adaptive thinking to get the best intelligence-speed-cost trade-off from Claude on any task.

This lesson is original educational writing based on this video by Anthropic (published May 20, 2026). All credit for the original content goes to the creators.

Why More Tokens Mean Better Answers

Scaling a model’s parameters at training time improves performance — but so does letting a model spend more tokens at inference time. Anthropic calls this test-time compute, and the results mirror what happens with bigger models: across reasoning benchmarks, coding evals, computer use, and PhD-level question sets, performance rises as token spend rises.

The three types of tokens Claude can spend at runtime are:

  1. Thinking — Claude’s internal scratch pad before it responds
  2. Tool calls — Claude’s interface with the world (web search, MCP servers, file writes)
  3. Text output — the response the user or system actually receives

Every additional token has a direct cost in time and money, so controlling how many tokens Claude spends on a task is a practical skill.
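To make the cost side concrete, here is a minimal sketch that adds up the three token types and prices them. The names `TokenSpend` and `estimate_cost` are illustrative, and the price passed in is whatever your model actually charges; the sketch also assumes all three token types are billed at one output rate, which you should verify against current pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSpend:
    """Tokens Claude spent at runtime, split by the three types above."""
    thinking_tokens: int   # internal scratch pad
    tool_call_tokens: int  # tokens written to invoke tools
    text_tokens: int       # final response text

    @property
    def total(self) -> int:
        return self.thinking_tokens + self.tool_call_tokens + self.text_tokens


def estimate_cost(spend: TokenSpend, price_per_million: float) -> float:
    """Estimated dollar cost, assuming one flat rate for all three token types."""
    return spend.total / 1_000_000 * price_per_million
```

Calling `estimate_cost(TokenSpend(2000, 500, 300), price_per_million=15.0)` prices 2,800 tokens at a hypothetical $15 per million. Splitting the spend by type makes it easy to see which of the three is driving the bill.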

[Figure: Three types of test-time compute (thinking, tool calls, text output) and the effort dial that controls them (Low, Medium, High, X-High, Max). More effort means more tokens and better intelligence, with diminishing returns.]

Adaptive Thinking: Thinking as a Tool, Not a Toggle

Early reasoning models had a fixed pipeline: think → execute tool calls → respond. Then Anthropic introduced interleaved thinking, where Claude could think after each tool call. The current generation goes further with adaptive thinking: Claude decides whether to think, when to think, and how long to think, in any order it likes.

This mirrors how people actually work. Answering “what is 10 + 10?” requires no pause. Working through a complex debugging problem requires thinking at almost every step. And sometimes you just act — hit the tennis ball, then think about tactics at the baseline.

Crucially, the old “extended thinking toggle” was not a fine-grained effort control — it was removing a capability. When you turned it off, you stripped out the thinking tool entirely. Adaptive thinking keeps the tool available and lets the model reason about when to use it, just as it reasons about when to call a web search.

Choosing an Effort Level

Effort levels are the primary practical lever. Here are rules of thumb when you can’t run a formal eval:

Level   | When to use
Low     | Latency-sensitive, low-intelligence tasks: classification, summarization, data extraction
Medium  | Routine tasks where some reasoning helps but speed matters
High    | Good balance of speed, tokens, and intelligence for most business logic
X-High  | Anthropic's default for Claude Code and claude.ai; best Pareto trade-off
Max     | Only when the task is genuinely hard and you know you need every bit of intelligence

There are diminishing marginal returns at the top. Max effort often uses 10× more tokens than high for only a marginal accuracy bump. Start at x-high and go up only if evals show a meaningful gain.
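One way to apply the "go up only if evals show a meaningful gain" rule is to start at x-high and climb only while each step adds enough accuracy. The sketch below assumes you already have per-level eval results in a dictionary; the function names, the dictionary shape, and the 2-point threshold are illustrative assumptions rather than anything prescribed in the video.

```python
EFFORT_ORDER = ["low", "medium", "high", "x-high", "max"]


def highest_justified_level(results, start="x-high", min_gain=0.02):
    """Climb from `start` while each step up gains at least `min_gain` accuracy.

    `results` maps an effort level to {"accuracy": float, "tokens": int}.
    """
    levels = [lvl for lvl in EFFORT_ORDER if lvl in results]
    if start not in levels:
        raise ValueError(f"no eval results for start level {start!r}")
    chosen = start
    for candidate in levels[levels.index(start) + 1:]:
        gain = results[candidate]["accuracy"] - results[chosen]["accuracy"]
        if gain < min_gain:
            break  # diminishing returns: stop climbing
        chosen = candidate
    return chosen


def token_multiplier(results, lower, higher):
    """How many times more tokens `higher` spends than `lower`."""
    return results[higher]["tokens"] / results[lower]["tokens"]
```

Feed it numbers from your own eval runs. If max improves on x-high by less than the threshold, the function stays at x-high, and `token_multiplier` shows what the extra step would have cost.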

The Claude Plays Pokémon example

When Anthropic put Claude on low effort in Pokémon Red, an interesting thing happened: instead of exploring methodically, Claude found creative escape routes. It used repels to avoid random encounters and escape ropes to exit caves instantly, and it ran from battles wherever it could, completing the game faster through avoidance rather than confrontation.

Low effort can surface unique attractor states precisely because it constrains how much the model overthinks a path. For speed-sensitive pipelines where the task is simple, this is a feature, not a bug.

Model Size vs. Effort Level

When should you use a small model at high effort versus a big model at low effort?

The Haiku-vs-Opus simulation result from Anthropic’s research is instructive: Haiku at x-high effort spent comparable tokens to Opus at low effort, but produced visibly worse output. The conclusion: if the task requires any intelligence at all, the larger model wins — even at low effort.

Use smaller models when:

  • The output is so predictable that correctness is almost guaranteed
  • You’re doing classification, routing, or extraction with a small, well-defined label set
  • Latency is critical and you have evals proving the smaller model matches quality

Use larger models (Sonnet, Opus) at lower effort when speed and cost matter but the task still requires judgment.
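The rules above can be written down as a small routing function. This is only a sketch of the heuristic: `TaskProfile`, its fields, and the "small"/"large" tier labels are hypothetical, and pairing the small model with low effort is an assumption borrowed from the effort table, not something the video states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    needs_judgment: bool                    # does the task need real intelligence?
    output_predictable: bool = False        # correctness is almost guaranteed
    small_label_set: bool = False           # classification, routing, extraction
    latency_critical: bool = False
    small_model_passes_evals: bool = False  # evals show the small model matches quality
    speed_or_cost_sensitive: bool = False


def choose_model_and_effort(task: TaskProfile) -> tuple[str, str]:
    """Return a (model tier, effort level) pair following the rules of thumb."""
    # Latency-critical work where evals prove the small model holds up.
    if task.latency_critical and task.small_model_passes_evals:
        return ("small", "low")
    # Predictable or small-label-set work that needs no judgment.
    if not task.needs_judgment and (task.output_predictable or task.small_label_set):
        return ("small", "low")
    # Judgment is needed but speed and cost matter: larger model, lower effort.
    if task.speed_or_cost_sensitive:
        return ("large", "low")
    # Otherwise fall back to the recommended default.
    return ("large", "x-high")
```

The ordering matters: the small model is chosen only when one of the listed conditions holds, so any task that needs judgment and lacks supporting evals lands on the larger model.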

Practical Guidance

  1. Enable thinking whenever possible. Give Claude the scratch pad; let it decide when to use it.
  2. Run evals before tuning effort. Build a test set of your hardest representative cases and benchmark across effort levels — that’s the only reliable way to find the knee in the curve.
  3. Default to x-high. It’s Anthropic’s chosen default for their own products and an excellent starting point.
  4. Set budgets, not toggles. As Claude gains longer-horizon autonomy, the preferred pattern is: give Claude a time or token budget and let it allocate compute internally, rather than micro-managing thinking on/off.
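As an illustration of the budget pattern in item 4, the sketch below tracks a token and time allowance on the harness side and produces a status line that could be passed to the model each turn. The `Budget` class and its methods are hypothetical, and how a budget is actually communicated to Claude depends on your setup; this only shows the bookkeeping.

```python
import time


class Budget:
    """Tracks a combined token and wall-clock budget for one task."""

    def __init__(self, max_tokens, max_seconds, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the class can be tested without sleeping
        self.tokens_used = 0
        self.started_at = clock()

    def charge(self, tokens):
        """Record tokens spent on the latest step."""
        self.tokens_used += tokens

    def remaining_tokens(self):
        return max(0, self.max_tokens - self.tokens_used)

    def remaining_seconds(self):
        elapsed = self.clock() - self.started_at
        return max(0.0, self.max_seconds - elapsed)

    def exhausted(self):
        return self.remaining_tokens() == 0 or self.remaining_seconds() == 0.0

    def status_line(self):
        """A short summary the harness could show the model."""
        return (
            f"Budget remaining: {self.remaining_tokens()} tokens, "
            f"{self.remaining_seconds():.0f} s"
        )
```

An agent loop would call `charge` after every model response and stop once `exhausted()` returns true, leaving the decision of where to spend the allowance to the model.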

Check your understanding

  1. What does the 'thinking toggle' approach actually do when turned off?
  2. What is the recommended default effort level for most production use cases?
  3. How does adaptive thinking differ from interleaved thinking?
  4. When is a smaller model at higher effort preferred over a larger model at lower effort?

Build it yourself

Try it yourself: effort-level comparison

  1. Pick a task you currently run in production or in your own workflow — ideally something that requires some reasoning, not just extraction.
  2. Run it at three effort levels: low, x-high, and max, with the same model (Sonnet or Opus).
  3. Compare: output quality, token count (check usage in the API response), and wall-clock time.
  4. If max beats x-high by a meaningful margin on your task, moving from x-high to max is justified. If not, stop at x-high.
  5. Now try the same task at x-high on a one-size-smaller model (e.g., Haiku if you were using Sonnet). Does quality hold? If yes, the smaller model is a cost win.

This mini-eval takes an hour and gives you a defensible effort strategy for your use case.
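The comparison in steps 2 and 3 can be scripted with a small harness like the one below. It is a sketch under stated assumptions: `run_task` is a function you supply that takes an effort level, calls your model however your setup requires, and returns the output text together with the token count from the response's usage data. Nothing here calls a real API.

```python
import time
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    effort: str
    output: str
    output_tokens: int
    seconds: float


def compare_efforts(
    run_task: Callable[[str], tuple[str, int]],
    efforts=("low", "x-high", "max"),
) -> list[RunResult]:
    """Run the same task once per effort level and record tokens and wall-clock time."""
    results = []
    for effort in efforts:
        start = time.perf_counter()
        output, output_tokens = run_task(effort)  # user-supplied model call
        elapsed = time.perf_counter() - start
        results.append(RunResult(effort, output, output_tokens, elapsed))
    return results


def format_report(results: list[RunResult]) -> str:
    """One line per effort level, for side-by-side comparison."""
    return "\n".join(
        f"{r.effort:>7}: {r.output_tokens} tokens, {r.seconds:.2f} s"
        for r in results
    )
```

Output quality still has to be judged separately, by hand or with your own grader; the harness only collects the token and timing numbers so the three runs are measured the same way.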
