Your Claude Max usage is burning in minutes — what's actually happening, and the four fixes that work #39

askalf (Maintainer) · 2026-04-15T14:14:16Z

TL;DR

If your Claude Max 5h window is depleting in 60-90 minutes when it used to last all day, you're not imagining it and it's not your workflow. Five things are stacking on top of each other right now, only one of which Anthropic has publicly acknowledged. Most of the burn rate is fixable from your side without waiting on Anthropic.

This post: what's actually happening, why every cause matters, and the four concrete fixes ordered by how much they'll move the needle. The first two are client-side and free, the third is a config change in your agent framework, the fourth is dario.

What you're seeing

The complaints are everywhere right now:

[BUG] Instantly hitting usage limits with Max subscription — anthropics/claude-code#16157
[BUG] Claude Max plan session limits exhausted abnormally fast since March 23, 2026 — anthropics/claude-code#38335
Max 20 plan: rate limit 100% exhausted within ~70 minutes after reset — anthropics/claude-code#41788
Opus 4.6 token consumption significantly higher than 4.5 — anthropics/claude-code#23706
Claude Code Users Report Rapid Rate Limit Drain, Suspect Bug — MacRumors
Anthropic admits Claude Code quotas running out too fast — The Register
Claude is limiting usage more aggressively during peak hours — TechRadar

Single-prompt jumps from 21% to 100%. Fresh sessions burning to zero in under 90 minutes. Workloads that worked fine last month suddenly impossible. Multiple Max 20x users — the most expensive tier — reporting the same shape of failure.

This isn't one bug. It's five things stacking.

The five compounding causes

1. Anthropic genuinely tightened the 5h window during peak hours (acknowledged)

Anthropic confirmed in late March 2026 that they tightened 5h session limits during peak hours — weekdays roughly 8am–2pm ET / 5am–11am PT / 1pm–7pm GMT. Their framing: ~7% of users will hit caps they wouldn't have before, weekly limits unchanged, the per-session distribution is what's tightened. If your bursts land in those windows, your 5h budget is smaller than it used to be.

You can't fix this directly, but you can work around it by shifting heavy workloads outside peak hours and by reducing per-request token volume (the next four causes).
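
To decide whether a heavy job should wait, a small helper can check whether a given moment falls inside the peak window quoted above. This is only a sketch: the weekday 8am-2pm Eastern window is taken from the paragraph above and is described there as approximate, and the function names `to_eastern` and `in_peak_window` are hypothetical.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Window taken from the post (weekdays, roughly 8am-2pm US Eastern).
# The real boundaries are approximate, so treat these constants as assumptions.
PEAK_START = time(8, 0)
PEAK_END = time(14, 0)


def to_eastern(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to US Eastern wall-clock time."""
    return moment.astimezone(ZoneInfo("America/New_York"))


def in_peak_window(eastern: datetime) -> bool:
    """Return True if an Eastern-time datetime is on a weekday between 8am and 2pm."""
    is_weekday = eastern.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and PEAK_START <= eastern.time() < PEAK_END
```

A scheduler could call `in_peak_window(to_eastern(now))` with a timezone-aware `now` and defer long agent runs while it returns `True`.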

2. Opus 4.6 adaptive thinking generates 2K-20K reasoning tokens per response

Opus 4.6 introduced adaptive thinking. Every response generates internal thinking-type blocks at output token pricing. With effort: high (or unset, which defaults to high in many clients), a single response can burn 20K+ thinking tokens before writing one line of code. Ten of those = 200K output tokens consumed by reasoning you never see.

The "Opus 4.6 feels worse than 4.5" complaints are the same root cause from a different angle — see Discussion #9. It's not a model regression. It's the same model producing more output tokens by default.

3. Thinking blocks accumulate in your conversation history and bill as input on every subsequent request

This is the multiplier nobody talks about. By default, when an agent framework appends an assistant turn to history, the full assistant turn gets appended — including the thinking blocks. Those blocks are 2K-20K tokens each. By turn 15 of a multi-turn session, you have 100K+ tokens of stale thinking sitting in conversation history.

Every subsequent request sends that entire history. Anthropic counts input tokens before applying any server-side context edit. So even if you set context_management: { edits: [{ type: "clear_thinking_20251015" }] }, the server bills you for the input tokens before it strips them. Server-side stripping does not save you tokens — it only saves the model from seeing the stale thinking. Token billing is unchanged.

The only thing that actually saves tokens is stripping thinking blocks client-side before sending the next request. Claude Code does this. Most third-party agent frameworks do not.
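
The compounding effect is easier to see with numbers. The sketch below totals the input tokens sent across a session when the full history is resent on every request, with and without thinking blocks left in. The per-turn token counts are illustrative assumptions, not measurements, and the model ignores caching and tool definitions.

```python
def cumulative_input_tokens(turns: int, user: int, visible: int,
                            thinking: int, strip: bool) -> int:
    """Total input tokens sent over `turns` requests when history is resent each time.

    user / visible / thinking are assumed average token counts per turn for the
    user message, the visible assistant reply, and the thinking block.
    """
    per_turn_history = user + visible + (0 if strip else thinking)
    total = 0
    for completed_turns in range(turns):
        # Each request carries every completed turn plus the new user message.
        total += completed_turns * per_turn_history + user
    return total


# Illustrative session: 15 turns, 1K user, 4K visible reply, 7K thinking per turn.
kept = cumulative_input_tokens(15, 1_000, 4_000, 7_000, strip=False)
stripped = cumulative_input_tokens(15, 1_000, 4_000, 7_000, strip=True)
print(kept, stripped, round(1 - stripped / kept, 2))
```

With these assumed numbers the session sends 1,275,000 input tokens with thinking left in and 540,000 with it stripped, a reduction of roughly 58%. The growth is quadratic in the number of turns, which is why long sessions are hit hardest.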

4. Tool definitions are invisible context bloat

Every MCP server you have configured injects its tool schema into the system prompt of every request. Each MCP server adds roughly 18K tokens of tool definitions. If you have three MCP servers, that's 54K tokens sent on every single message — before your actual conversation has even started. Ten messages = 540K tokens of tool definitions alone.

The 540K figure is the raw volume sent. Tool definitions count against your input bill, but unlike thinking blocks they cache (under the 1h ephemeral cache), so most of the cost lands on the first request of a session. That still burns the 5h window quickly if you start fresh sessions often.
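
A rough estimator for this overhead, as a sketch. The 18K-per-server figure comes from the text above. The `cache_read_weight` parameter is a hypothetical knob for how much a cached repeat counts relative to a fresh send: 1.0 means no discount, and the real weight applied to subscription usage is not stated in this post.

```python
def tool_overhead_tokens(servers: int, messages: int,
                         tokens_per_server: int = 18_000,
                         cache_read_weight: float = 1.0) -> float:
    """Estimate tool-definition tokens counted over one session.

    The first request pays full price; later requests are scaled by
    cache_read_weight (an assumed value, see the note above).
    """
    if messages <= 0:
        return 0.0
    per_request = servers * tokens_per_server
    return per_request + (messages - 1) * per_request * cache_read_weight


# Raw volume for three servers over ten messages, no cache discount.
print(tool_overhead_tokens(3, 10))
```

With the default weight of 1.0 this reproduces the 540K raw figure for three servers and ten messages. Lowering the weight shows how much of the cost concentrates in the first request of each session.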

5. Billing reclassification on third-party clients (Cursor, Aider, OpenClaw, Hermes, Continue, etc.)

If you're using anything other than the Claude Code binary itself, Anthropic's behavioral classifier may reclassify your requests from your subscription (five_hour claim) to pay-per-token (overage claim). When this fires, your subscription doesn't get drained — it gets bypassed entirely, and your overage bill climbs instead. If you have overage disabled, you get a hard 402 instead.

The classifier looks at request-shape and session-shape signals: tool names, max_tokens, effort, system prompt structure, JSON field order, billing tag, plus cumulative session-level signals like token throughput and conversation depth. We documented eight of the request-shape signals in Discussion #13 after a week of MITM captures, binary RE, and controlled A/B tests with a real Max 5x subscriber.

Most third-party clients trip at least 3-5 of these signals. The bill flips silently — the only way to detect it is by reading the anthropic-ratelimit-unified-representative-claim response header, which most clients don't surface.
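
If your client gives you access to raw response headers, checking the claim takes a few lines. This is a minimal sketch: the header name and the `five_hour` / `overage` values are the ones named in this post, while the helper names and the idea of passing headers as a plain mapping are assumptions about your client.

```python
from typing import Mapping, Optional

# Header name as given in the post; matched case-insensitively.
CLAIM_HEADER = "anthropic-ratelimit-unified-representative-claim"


def billing_claim(headers: Mapping[str, str]) -> Optional[str]:
    """Return the billing claim from response headers, or None if absent."""
    for name, value in headers.items():
        if name.lower() == CLAIM_HEADER:
            return value.strip()
    return None


def is_overage(headers: Mapping[str, str]) -> bool:
    """True when the response was billed against overage rather than the subscription."""
    return billing_claim(headers) == "overage"
```

Logging `billing_claim(response_headers)` on every request is enough to spot the moment a session flips from `five_hour` to `overage`.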

The four fixes (ordered by impact)

Fix 1 — Strip thinking blocks from conversation history client-side (free, biggest win)

If you're hand-rolling an agent loop or using a framework you control, this is the highest-leverage change you can make. Before sending each new request, walk the prior assistant messages and remove blocks where type === "thinking":

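```python
def strip_thinking(messages):
    """Remove thinking blocks from assistant turns (mutates messages in place)."""
    for msg in messages:
        # Only assistant turns with list-style content can carry thinking blocks.
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), list):
            msg["content"] = [b for b in msg["content"] if b.get("type") != "thinking"]
    return messages
```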

This alone can cut your input token consumption by 40-70% on a multi-turn session. It does not affect output quality — Claude Code does exactly this and the model performs fine.

Do not rely on context_management: { edits: [{ type: "clear_thinking_20251015" }] } to save you tokens. The server applies that edit after counting input tokens. It improves model focus by reducing context noise, but your bill is calculated on what you sent, not what the model saw.
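
If you would rather keep your stored history untouched and strip only the copy you send, a non-mutating variant is one option. This is a sketch under the same assumptions as the function above (assistant content is a list of dict blocks with a `type` field). The name `strip_thinking_copy` and the removed-block counter are hypothetical additions for logging.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def strip_thinking_copy(messages: List[Message]) -> Tuple[List[Message], int]:
    """Return (new message list without thinking blocks, number of blocks removed)."""
    cleaned: List[Message] = []
    removed = 0
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "assistant" and isinstance(content, list):
            kept = [b for b in content
                    if not (isinstance(b, dict) and b.get("type") == "thinking")]
            removed += len(content) - len(kept)
            # Shallow-copy the message so the caller's history is left as-is.
            # A turn that held only thinking ends up with empty content and
            # may need separate handling before sending.
            msg = {**msg, "content": kept}
        cleaned.append(msg)
    return cleaned, removed
```

Call it right before building each request and send the returned list. The counter gives you a cheap signal for how much stale thinking a session is accumulating.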

Fix 2 — Use `effort: medium` and `max_tokens: 64000`

Both of these cut token burn directly and also match Claude Code's defaults, which means they help avoid the billing reclassification described in cause 5 (the problem Fix 4 also addresses).

```json
{
  "output_config": { "effort": "medium" },
  "max_tokens": 64000,
  "thinking": { "type": "adaptive" }
}
```

effort: high is 2-3x more expensive than medium for marginal quality gains on most workloads. max_tokens: 128000 (a common "give me the most" default) burns the 5h window in 10-15 requests instead of 50+. thinking: { type: "adaptive" } is the right setting, because { type: "enabled", budget_tokens: ... } forces thinking on every response and so spends more output tokens.

If your framework doesn't expose these knobs, find where it builds the request and patch them in. They're standard Anthropic Messages API parameters.
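
One way to patch them in is a small function applied to the request body just before it is sent. This is a sketch, assuming your framework hands you the request as a plain dict. The function name `apply_fix2_settings` is hypothetical, and the three values are the ones recommended above.

```python
import copy
from typing import Any, Dict


def apply_fix2_settings(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a Messages API request body with the Fix 2 settings forced."""
    patched = copy.deepcopy(payload)  # leave the caller's dict untouched
    patched.setdefault("output_config", {})["effort"] = "medium"
    patched["max_tokens"] = 64000
    patched["thinking"] = {"type": "adaptive"}
    return patched
```

Overwriting the values instead of only filling in missing ones is deliberate here: the aim is to stop a framework default such as `max_tokens: 128000` from slipping through.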

Fix 3 — Reduce your MCP server count

Each MCP server is ~18K tokens of tool definitions in every request. The cost is real. Audit which servers you actually need for this session and unload the rest. If you have a filesystem MCP, a github MCP, a slack MCP, and a postgres MCP all loaded, but you're only doing code review work right now, kill three of them. Your token bill drops by ~54K per request.

Claude Code's tool set has roughly 25 native tools and runs comfortably. Most third-party agent setups end up with 40-60 tools through MCP server stacking, and most of those tools are never called in a given session.
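
A quick audit can tell you which servers are safe to unload. The sketch below assumes you can list each server's tool names and collect the names of tools actually called in a session. Both data shapes, the tool names in the example, and the function name `unused_servers` are hypothetical.

```python
from typing import Dict, List, Set


def unused_servers(server_tools: Dict[str, List[str]],
                   called_tools: Set[str]) -> List[str]:
    """Return the servers (sorted by name) none of whose tools were called."""
    return sorted(
        name for name, tools in server_tools.items()
        if not called_tools.intersection(tools)
    )


# Hypothetical setup: four servers loaded, but only GitHub tools used in a review session.
loaded = {
    "filesystem": ["read_file", "write_file"],
    "github": ["get_pull_request", "list_comments"],
    "slack": ["post_message"],
    "postgres": ["run_query"],
}
print(unused_servers(loaded, {"get_pull_request", "list_comments"}))
```

Servers that show up as unused across several sessions of the same kind of work are the first candidates to remove from that workflow's config.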

Fix 4 — Use dario (or any subscription-routing proxy) to consolidate the above

Doing these fixes by hand on every framework you use is tedious and easy to forget. dario does the following four things automatically:

Strips thinking blocks from conversation history client-side before forwarding to Anthropic (the Fix 1 you'd otherwise hand-roll). v3.10.1 also handles the trailing-empty-turn edge case (#36).
Forces effort: medium, max_tokens: 64000, thinking: adaptive on every request regardless of what your client sends.
Template-replays the entire request shape to match Claude Code exactly, which avoids billing reclassification on most third-party clients (Cursor, Aider, OpenClaw, Hermes, Continue, etc.). See Discussion #14 for the architecture and Discussion #8 for the eight detection signals it neutralizes.
Surfaces the actual billing claim (five_hour vs overage) in verbose log output so you can see when reclassification fires and respond to it, instead of finding out from your credit card.

```shell
npm install -g @askalf/dario@latest
dario login
dario proxy --hybrid-tools --port 3456
```

Then point your tool at http://localhost:3456 instead of https://api.anthropic.com. Fixes 1, 2, and 4 then apply automatically. Fix 3 (MCP server count) is a manual choice you still have to make in your client config.
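
For a hand-rolled client, pointing at the proxy usually means swapping the base URL. The sketch below resolves the endpoint from an explicit argument, then the `ANTHROPIC_BASE_URL` environment variable, then the public API. It assumes the proxy serves the same `/v1/messages` path as the Anthropic Messages API, which this post does not spell out, and the function name `messages_url` is hypothetical.

```python
import os
from typing import Optional

DEFAULT_BASE_URL = "https://api.anthropic.com"


def messages_url(base_url: Optional[str] = None) -> str:
    """Build the Messages endpoint URL, preferring an explicit or env-provided base."""
    base = base_url or os.environ.get("ANTHROPIC_BASE_URL") or DEFAULT_BASE_URL
    # Assumption: the proxy mirrors the upstream /v1/messages path.
    return base.rstrip("/") + "/v1/messages"


print(messages_url("http://localhost:3456"))
```

Keeping the base URL in one place makes it easy to switch a session between the proxy and the upstream API when you are comparing billing behavior.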

What's not a fix

A few suggestions that come up in these threads but don't actually help:

"Switch to Sonnet for simple tasks." Sonnet uses adaptive thinking too on 4.6. Same root cause, smaller magnitude. You'll still burn the 5h window faster than you used to, just slower than on Opus. Worth doing for cost but not a structural fix.
"Disable extra usage." This protects you from surprise overage charges, but it does not fix the root problem (subscription burn) and turns the failure mode into a hard 402 instead of a soft cost increase. Useful as a guardrail, not as a solution.
"Wait for Anthropic to fix it." They acknowledged the peak-hours change but explicitly said weekly limits are unchanged and the new per-session distribution is intended. Causes 2-5 above are independent of that change and Anthropic has no incentive to fix them, since they're how third-party traffic gets monetized.
"Use a cheaper API key instead of OAuth." Pay-per-token API keys aren't bound by the 5h subscription window, but they are 5-20x more expensive than the equivalent subscription burn. This is solving cost by paying more.
