Reasoning Budgets in Production: How Teams Are Spending Them

Anthropic shipped reasoning budgets in late March. Six weeks of production data shows the feature pays for itself when teams set the right ceilings. It does not when they leave it on default.

TheAICommand · 5 min read

Reasoning budgets are the most useful Claude feature most teams have not configured.

Anthropic shipped configurable reasoning budgets for Claude Sonnet 4.6 and Opus 4.7 on 27 March 2026 (Anthropic release notes, accessed April 2026). The feature lets developers cap how many tokens a model can spend on extended thinking before producing a final answer. Six weeks in, the production picture is clear. Teams that tuned the ceiling are saving real money. Teams on defaults are not.

This is what the data looks like, what teams are doing well, and where the trap is.

What reasoning budgets actually do

Extended thinking is the chain-of-thought stage where Claude works through a problem before answering. On Sonnet 4.6 and Opus 4.7, this stage was previously bounded by the model's own internal heuristics. The new feature exposes that bound to the developer as an explicit ceiling, set per request.

Three settings matter. A maximum reasoning token count. A "minimum effort" floor for tasks where you want the model to think at least a bit. And a per-tier default that applies if you say nothing.

The default ceiling is currently 16,000 reasoning tokens on Sonnet and 32,000 on Opus. For most production tasks, both numbers are too high.

What we are seeing in production

We pulled data from four engineering teams running Claude in production, ranging from a 200-person SaaS company to a federal agency RAG pipeline. The pattern is consistent.

Most tasks do not need extended thinking at all. Correctly tagged prompts can be routed between zero-budget mode (no extended thinking: the fastest, cheapest path) and full-budget mode (extended thinking enabled). One team measured that 78 per cent of their production traffic could safely run with zero reasoning budget, saving roughly 40 per cent of total spend with no measurable quality drop.
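The arithmetic behind figures like these is worth making explicit. If a fraction of traffic runs at zero budget, the spend you save is that fraction times the share of those requests' cost that was going to reasoning tokens. The cost-share value below is back-derived from the team's reported numbers, not measured, so treat it as illustrative.

```python
def estimated_savings(zero_budget_fraction: float,
                      reasoning_cost_share: float) -> float:
    """Fraction of total spend saved by routing a share of traffic to
    zero reasoning budget, assuming reasoning tokens account for
    `reasoning_cost_share` of the cost of those requests."""
    return zero_budget_fraction * reasoning_cost_share

# The article's 78% / ~40% figures are consistent with reasoning tokens
# making up roughly half the cost of the rerouted requests.
saving = estimated_savings(0.78, 0.51)
```

Plug in your own traffic fraction and cost share before trusting any projection.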

For the tasks that do need it, the ceiling matters more than the floor. Teams that left the ceiling at default reported reasoning tokens averaging 4,000 to 6,000 per request, well below the 16,000 cap. The cap was not biting. But for the 5 per cent of requests that ran long, the model was spending 12,000 to 14,000 reasoning tokens on tasks where 4,000 was sufficient. Tightening the cap to 5,000 cut tail-end cost by 30 to 60 per cent depending on workload.

The minimum-effort floor is a niche tool. Useful for evals and quality benchmarking. Rarely useful in production. Most teams left it at zero.

What it actually means

Reasoning budgets turn an opaque cost lever into an explicit one. That is the headline. The deeper change is that they make it possible to design workflows around cost-quality trade-offs that previously required a model swap.

Three patterns are emerging from the teams getting the most value.

First, prompt routing by complexity. A lightweight classifier sits in front of the main model and tags each request as low, medium or high complexity. Low complexity runs Sonnet with zero reasoning budget. Medium runs Sonnet with a tight 4,000 token ceiling. High runs Opus with a 12,000 ceiling. Total cost falls 35 to 50 per cent versus running everything on a default budget.
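The routing table described above is simple enough to sketch. The tier names and ceilings come from the article; the function shape and tag vocabulary are assumptions, and in a real system the tag would come from the lightweight classifier, not a caller.

```python
# Illustrative router for the complexity-routing pattern. Ceilings are
# the article's figures; everything else is an assumption.
ROUTES = {
    "low":    ("sonnet", 0),       # no extended thinking
    "medium": ("sonnet", 4_000),   # tight ceiling
    "high":   ("opus",   12_000),  # stronger tier, generous ceiling
}

def route(complexity: str) -> tuple[str, int]:
    """Map a classifier tag to a (model tier, reasoning ceiling) pair."""
    if complexity not in ROUTES:
        raise ValueError(f"unknown complexity tag: {complexity!r}")
    return ROUTES[complexity]
```

Failing loudly on an unknown tag is deliberate: a silent fallback to the default budget is exactly the cost leak the pattern exists to close.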

Second, budget shaping for batch jobs. For asynchronous tasks where latency does not matter, teams are setting tight ceilings and accepting that some requests will produce slightly worse outputs, then rerunning failures on a higher budget. Cost falls without breaching SLA.
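The tight-first, rerun-on-failure shape of that batch pattern looks roughly like this. Both callables are stand-ins: `call` for whatever invokes the model at a given ceiling, `quality_ok` for your eval check. The specific budget values are placeholders.

```python
def run_with_retry(items, call, quality_ok,
                   tight=2_000, generous=12_000):
    """Run every item at a tight ceiling first; rerun only the failures
    at a generous ceiling. `call(item, budget)` and `quality_ok(output)`
    are caller-supplied stand-ins for the real API call and eval check."""
    results = {}
    for item in items:
        out = call(item, tight)
        if not quality_ok(out):
            out = call(item, generous)  # second pass, higher budget
        results[item] = out
    return results
```

Because latency does not matter for these jobs, paying for two passes on the hard minority is still cheaper than running everything at the generous ceiling.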

Third, eval-driven ceiling tuning. Rather than guessing the right cap, teams are running their own eval set across a range of ceilings and picking the lowest that produces no measurable quality regression. This is the discipline that separates teams getting value from teams just hoping the default is right.
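The selection rule in that third pattern reduces to a few lines: sweep ceilings, keep the lowest one whose eval score stays within tolerance of an unconstrained baseline. A minimal sketch, assuming you already have mean eval scores per ceiling:

```python
def pick_ceiling(scores_by_ceiling, baseline, tolerance=0.0):
    """Return the lowest ceiling whose mean eval score is within
    `tolerance` of the unconstrained baseline score.
    `scores_by_ceiling` maps ceiling -> mean eval score."""
    for ceiling in sorted(scores_by_ceiling):
        if scores_by_ceiling[ceiling] >= baseline - tolerance:
            return ceiling
    # No ceiling matched: keep the loosest rather than guess lower.
    return max(scores_by_ceiling)
```

A non-zero tolerance encodes a deliberate cost-quality trade; zero tolerance encodes "no measurable regression", which is the discipline the article describes.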

Who should care

If you build with Claude in production, this is the most cost-effective feature of the past quarter. Not configuring it is leaving money on the table. The work is not large, but it requires evals.

If you sponsor an AI program, your AI cost line is partly determined by reasoning ceilings you may not know exist. Ask your engineering team what ceilings are set per workflow and what evidence supports them.

If you sit in finance or procurement, reasoning budgets should be a line item in your AI vendor risk register. Default budgets are the equivalent of an unbounded invoice. Specifying ceilings as part of operational governance is now standard practice.

Hype check

The launch framing positioned reasoning budgets as a "thinking dial". This is broadly correct but slightly misleading. The dial does not adjust how hard the model thinks; it caps how long the model can think. Tighter budgets do not produce dumber answers on routine work. On hard work, where the model genuinely needs more time, a cap that bites cuts the thinking short and the model falls back to a non-extended response, which is usually worse.

The practical implication is that for routine workloads, tightening the cap is almost free quality-wise. For the 5 to 10 per cent of genuinely hard tasks, you need to identify them upfront and route them to a higher budget. This is real work and the launch did not emphasise it.

What to do this week

If you have Claude in production, run a one-day experiment. Pull last week's traffic, tag a sample of 100 requests by perceived complexity, and re-run them with three ceiling settings: 0, 4,000, 12,000. Compare quality scores. The right ceiling for your workload will reveal itself.
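The one-day experiment above is a small harness. This sketch assumes two stand-ins you would supply yourself: `rerun(request, ceiling)` for replaying a request through your pipeline at a given ceiling, and `score(output)` for your quality metric.

```python
from collections import defaultdict
from statistics import mean

CEILINGS = (0, 4_000, 12_000)  # the three settings suggested above

def run_experiment(sample, rerun, score):
    """sample: list of (request, complexity_tag) pairs.
    rerun(request, ceiling) and score(output) are stand-ins for your
    pipeline and eval. Returns {(tag, ceiling): mean score} so you can
    eyeball where quality actually starts to drop."""
    buckets = defaultdict(list)
    for request, tag in sample:
        for ceiling in CEILINGS:
            buckets[(tag, ceiling)].append(score(rerun(request, ceiling)))
    return {key: mean(vals) for key, vals in buckets.items()}
```

Grouping results by the perceived-complexity tag matters: a flat average across all 100 requests will hide the handful of hard tasks that genuinely need the higher ceiling.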

If you are still on default budgets, set an explicit ceiling today even if you do not yet know the right number. A cap of 8,000 is a reasonable starting point for Sonnet and 16,000 for Opus. Both numbers are below the current default. Both will save money. Tune from there.

If you sponsor the program, ask your team for a one-page summary of reasoning budget settings by workflow before your next operational review. If they cannot produce it, that is the gap.

Reasoning budgets are not a magic feature. They are an explicit cost lever. The teams treating them as such are saving real money. The teams treating them as a default are not.

TheAICommand. Intelligence, At Your Command.

Tags

Anthropic · Claude · Reasoning · Cost Control · Workflows

General information and education only. Not legal, compliance, financial, or professional advice. Always consult a qualified professional for advice specific to your circumstances.