TL;DR
If your Claude Max 5h window is depleting in 60-90 minutes when it used to last all day, you're not imagining it and it's not your workflow. Five things are stacking on top of each other right now, only one of which Anthropic has publicly acknowledged. Most of the burn rate is fixable from your side without waiting on Anthropic.
This post: what's actually happening, why every cause matters, and the four concrete fixes ordered by how much they'll move the needle. The first two are client-side and free, the third is a config change in your agent framework, the fourth is dario.
What you're seeing
The complaints are everywhere right now:
Single-prompt jumps from 21% to 100%. Fresh sessions burning to zero in under 90 minutes. Workloads that worked fine last month suddenly impossible. Multiple Max 20x users (the most expensive tier) reporting the same shape of failure.
This isn't one bug. It's five things stacking.
The five compounding causes
1. Anthropic genuinely tightened the 5h window during peak hours (acknowledged)
Anthropic confirmed in late March 2026 that they tightened 5h session limits during peak hours — weekdays roughly 8am–2pm ET / 5am–11am PT / 1pm–7pm GMT. Their framing: ~7% of users will hit caps they wouldn't have before, weekly limits unchanged, the per-session distribution is what's tightened. If your bursts land in those windows, your 5h budget is smaller than it used to be.
You can't fix this directly, but you can work around it by shifting heavy workloads outside peak hours and by reducing per-request token volume (the next four causes).
2. Opus 4.6 adaptive thinking generates 2K-20K reasoning tokens per response
Opus 4.6 introduced adaptive thinking. Every response generates internal thinking-type blocks at output token pricing. With effort: high (or unset, which defaults to high in many clients), a single response can burn 20K+ thinking tokens before writing one line of code. Ten of those = 200K output tokens consumed by reasoning you never see.
The "Opus 4.6 feels worse than 4.5" complaints are the same root cause from a different angle — see Discussion #9. It's not a model regression. It's the same model producing more output tokens by default.
3. Thinking blocks accumulate in your conversation history and bill as input on every subsequent request
This is the multiplier nobody talks about. By default, when an agent framework appends an assistant turn to history, the full assistant turn gets appended — including the thinking blocks. Those blocks are 2K-20K tokens each. By turn 15 of a multi-turn session, you have 100K+ tokens of stale thinking sitting in conversation history.
Every subsequent request sends that entire history. Anthropic counts input tokens before applying any server-side context edit. So even if you set context_management: { edits: [{ type: "clear_thinking_20251015" }] }, the server bills you for the input tokens before it strips them. Server-side stripping does not save you tokens — it only saves the model from seeing the stale thinking. Token billing is unchanged.
The only thing that actually saves tokens is stripping thinking blocks client-side before sending the next request. Claude Code does this. Most third-party agent frameworks do not.
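To see why this compounds, a rough back-of-envelope model helps. The sketch below assumes every assistant turn leaves exactly one thinking block of a fixed size in history and that nothing is ever stripped. The function name and the 8K-per-block figure are illustrative, picked from inside the 2K-20K range above, not measured.

```typescript
// Rough model: every assistant turn leaves one thinking block in history,
// and each later request re-sends all of the blocks produced so far.
function resentThinkingTokens(turns: number, tokensPerBlock: number): number {
  let total = 0;
  for (let request = 1; request <= turns; request++) {
    // Request k carries the thinking blocks from the k - 1 earlier turns.
    total += (request - 1) * tokensPerBlock;
  }
  return total;
}

// Illustrative figures only: 15 turns at 8K thinking tokens per turn.
const staleOnRequest15 = 14 * 8000; // 112,000 tokens riding along on request 15
const resentOverSession = resentThinkingTokens(15, 8000); // 840,000 tokens in total
```

Under these assumptions the stale thinking on any single request grows linearly with the turn count, but the total re-sent over the session grows quadratically, which is why long sessions hurt so much more than short ones.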
4. Tool definitions are invisible context bloat
Every MCP server you have configured injects its tool schema into the system prompt of every request. Each MCP server adds roughly 18K tokens of tool definitions. If you have three MCP servers, that's 54K tokens sent on every single message — before your actual conversation has even started. Ten messages = 540K tokens of tool definitions alone.
Tool definitions also count against your input bill, and unlike thinking blocks they cache (under 1h ephemeral cache) — so the cost is mostly on the first request of a session, but it still rapidly burns the 5h window if you start fresh sessions often.
5. Billing reclassification on third-party clients (Cursor, Aider, OpenClaw, Hermes, Continue, etc.)
If you're using anything other than the Claude Code binary itself, Anthropic's behavioral classifier may reclassify your requests from your subscription (five_hour claim) to pay-per-token (overage claim). When this fires, your subscription doesn't get drained; it gets bypassed entirely, and your overage bill climbs instead. If you have overage disabled, you get a hard 402 instead.
The classifier looks at request-shape and session-shape signals: tool names, max_tokens, effort, system prompt structure, JSON field order, billing tag, plus cumulative session-level signals like token throughput and conversation depth. We documented eight of the request-shape signals in Discussion #13 after a week of MITM captures, binary RE, and controlled A/B tests with a real Max 5x subscriber.
Most third-party clients trip at least 3-5 of these signals. The bill flips silently — the only way to detect it is by reading the anthropic-ratelimit-unified-representative-claim response header, which most clients don't surface.
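If your client gives you access to the raw response, checking the claim is a one-line header read. A minimal sketch: the header name and the two claim values come from the text above, while the type and function names are made up for illustration, and any other value the header might carry is simply reported as unknown.

```typescript
// Header name and claim values are taken from the post; the rest is illustrative.
const CLAIM_HEADER = "anthropic-ratelimit-unified-representative-claim";

type BillingClaim = "five_hour" | "overage" | "unknown";

// Accepts anything with a get() method, such as the `headers` of a fetch Response.
function readBillingClaim(headers: { get(name: string): string | null }): BillingClaim {
  const raw = headers.get(CLAIM_HEADER);
  if (raw === "five_hour" || raw === "overage") return raw;
  return "unknown"; // header missing, or a value this sketch does not know about
}
```

With `fetch`, that would be called as `readBillingClaim(response.headers)` after each request, logging or alerting whenever the result is `"overage"`.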
The four fixes (ordered by impact)
Fix 1 — Strip thinking blocks from conversation history client-side (free, biggest win)
If you're hand-rolling an agent loop or using a framework you control, this is the highest-leverage change you can make. Before sending each new request, walk the prior assistant messages and remove blocks where type === "thinking".
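A minimal sketch of that pass, assuming your history is an array of Messages-API-style objects whose `content` is either a string or an array of typed blocks. The `ContentBlock` and `ChatMessage` types and the `stripThinking` name are illustrative, not taken from any particular framework.

```typescript
// Minimal shapes for this sketch; real Messages API blocks carry more fields.
type ContentBlock = { type: string; [key: string]: unknown };
type ChatMessage = { role: "user" | "assistant"; content: string | ContentBlock[] };

// Returns a copy of the conversation with thinking blocks removed from
// assistant turns. The input array and its messages are left untouched.
function stripThinking(messages: ChatMessage[]): ChatMessage[] {
  const cleaned: ChatMessage[] = [];
  for (const message of messages) {
    if (message.role !== "assistant" || typeof message.content === "string") {
      cleaned.push(message);
      continue;
    }
    const kept = message.content.filter((block) => block.type !== "thinking");
    // Assumption: an assistant turn that held only thinking is dropped
    // rather than sent with an empty content array.
    if (kept.length > 0) {
      cleaned.push({ ...message, content: kept });
    }
  }
  return cleaned;
}
```

Call it on the stored history right before building each request, for example `messages: stripThinking(conversation)`. If your loop is in the middle of a tool-use exchange, check whether the API expects the most recent assistant turn's thinking block to be passed back before stripping that one.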
This alone can cut your input token consumption by 40-70% on a multi-turn session. It does not affect output quality — Claude Code does exactly this and the model performs fine.
Do not rely on context_management: { edits: [{ type: "clear_thinking_20251015" }] } to save you tokens. The server applies that edit after counting input tokens. It improves model focus by reducing context noise, but your bill is calculated on what you sent, not what the model saw.
Fix 2 — Use effort: medium and max_tokens: 64000
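Both of these cut token burn directly and also happen to match Claude Code's defaults (which means they also help avoid the billing reclassification described in cause 5, which Fix 4 addresses).
{ "output_config": { "effort": "medium" }, "max_tokens": 64000, "thinking": { "type": "adaptive" } }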
effort: high is 2-3x more expensive than medium for marginal quality gains on most workloads. max_tokens: 128000 (a common "give me the most" default) burns the 5h window in 10-15 requests instead of 50+. thinking: { type: "adaptive" } is the right setting — { type: "enabled", budget_tokens: ... } forces thinking on every response and bills harder.
If your framework doesn't expose these knobs, find where it builds the request and patch them in. They're standard Anthropic Messages API parameters.
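One way to patch them in is a small function applied to the request body just before it is serialized. This is a sketch under two assumptions: that your framework exposes the body as a plain object, and that `effort` is nested under `output_config`. The `RequestBody` type and `applyLeanDefaults` name are hypothetical.

```typescript
// Hypothetical helper: the type and function names are illustrative.
type RequestBody = { [key: string]: unknown };

// Returns a copy of the body with the three settings from this section forced,
// leaving every other field (model, messages, tools, ...) as the client sent it.
function applyLeanDefaults(body: RequestBody): RequestBody {
  const existing = body["output_config"];
  const outputConfig =
    typeof existing === "object" && existing !== null ? existing : {};
  return {
    ...body,
    output_config: { ...outputConfig, effort: "medium" },
    max_tokens: 64000,
    thinking: { type: "adaptive" },
  };
}
```

Overwriting rather than capping is a deliberate simplification here; if some of your requests legitimately need a smaller `max_tokens`, take the minimum of the client's value and 64000 instead.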
Fix 3 — Reduce your MCP server count
Each MCP server is ~18K tokens of tool definitions in every request. The cost is real. Audit which servers you actually need for this session and unload the rest. If you have a filesystem MCP, a github MCP, a slack MCP, and a postgres MCP all loaded, but you're only doing code review work right now, kill three of them. Your token bill drops by ~54K per request.
Claude Code's tool set has roughly 25 native tools and runs comfortably. Most third-party agent setups end up with 40-60 tools through MCP server stacking, and most of those tools are never called in a given session.
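The arithmetic behind that ~54K figure, as a quick sketch. The 18K-per-server constant is the estimate used in this post, not a measured value for any specific server, and the names are illustrative.

```typescript
// ~18K tokens per MCP server is the post's estimate, not a measured constant.
const TOKENS_PER_MCP_SERVER = 18000;

function toolDefinitionTokens(serverCount: number): number {
  return serverCount * TOKENS_PER_MCP_SERVER;
}

// Four servers loaded versus one kept for a code-review session.
const beforeAudit = toolDefinitionTokens(4); // 72,000 tokens of tool schemas
const afterAudit = toolDefinitionTokens(1); // 18,000 tokens of tool schemas
const savedPerRequest = beforeAudit - afterAudit; // 54,000 tokens
```

Real servers vary widely in schema size, so it is worth measuring the servers you actually run before deciding which ones to unload.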
Fix 4 — Use dario (or any subscription-routing proxy) to consolidate the above
Doing fixes 1-3 by hand on every framework you use is tedious and easy to forget. dario automates most of it:
Strips thinking blocks from conversation history client-side before forwarding to Anthropic (the Fix 1 you'd otherwise hand-roll). v3.10.1 also handles the trailing-empty-turn edge case (#36).
Forces effort: medium, max_tokens: 64000, thinking: adaptive on every request regardless of what your client sends.
Template-replays the entire request shape to match Claude Code exactly, which avoids billing reclassification on most third-party clients (Cursor, Aider, OpenClaw, Hermes, Continue, etc.). See Discussion #14 for the architecture and Discussion #8 for the eight detection signals it neutralizes.
Surfaces the actual billing claim (five_hour vs overage) in verbose log output so you can see when reclassification fires and respond to it, instead of finding out from your credit card.
Then point your tool at http://localhost:3456 instead of https://api.anthropic.com. Three of the four fixes apply automatically, and the fourth (MCP server count) is a manual choice you still have to make in your client config.
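In most clients this is a single base-URL setting. For a hand-rolled loop, the change might look like the sketch below. It assumes the proxy serves the same `/v1/messages` path as the upstream API, which is worth confirming against dario's own documentation; the constant and function names are illustrative.

```typescript
// Assumption: the proxy serves the same /v1/messages path as the upstream API.
const ANTHROPIC_BASE_URL = "https://api.anthropic.com";
const DARIO_BASE_URL = "http://localhost:3456"; // address given in the post

// Builds the Messages endpoint for whichever base URL the client is pointed at.
function messagesEndpoint(baseUrl: string): string {
  return new URL("/v1/messages", baseUrl).toString();
}
```

Keeping the base URL in one place makes it easy to switch back to the upstream API when you want to compare behavior with and without the proxy.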
What's not a fix
A few things that get suggested in these threads that don't actually help:
"Switch to Sonnet for simple tasks." Sonnet uses adaptive thinking too on 4.6. Same root cause, smaller magnitude. You'll still burn the 5h window faster than you used to — just slower than Opus. Worth doing for cost but not a structural fix.
"Disable extra usage." This protects you from surprise overage charges, but it does not fix the root problem (subscription burn) and turns the failure mode into a hard 402 instead of a soft cost increase. Useful as a guardrail, not as a solution.
"Wait for Anthropic to fix it." They acknowledged the peak-hours change but explicitly said weekly limits are unchanged and the new per-session distribution is intended. Causes 2-5 above are independent of that change and Anthropic has no incentive to fix them — they're how third-party traffic gets monetized.
"Use a cheaper API key instead of OAuth." Pay-per-token API keys have no rate limits but are 5-20x more expensive than the equivalent subscription burn. This is solving cost by paying more cost.
Further reading
The reverse-engineering work that backs the claims above lives in these discussions:
If you're on a Max 20x and burning to zero in 90 minutes, the order of operations is: strip thinking client-side (Fix 1), then drop your MCP server count (Fix 3), then check whether your framework is on effort: high (Fix 2), then if you're on a third-party client check representative-claim headers to see if you're being silently reclassified (Fix 4). Most users get back to a sustainable burn rate after the first two.
Comments and counter-data welcome. If you've measured a different cause or found a fix that worked for a workload we haven't covered, post it below — this discussion is the one we'll keep updated as Anthropic's behavior continues to shift.