---
file: failures.md
purpose: When X breaks, where does the symptom appear, and what is the resolution path
audience: AI
last_verified: 2026-05-11
last_verified_commit: HEAD
single_owner: yes — failure-mode propagation lives here. Reads like a debug runbook.
see_also: flows.md (the pipelines whose disruption causes these failures), lifecycles.md (which transitions don't fire), probes.md (what to call to diagnose)
verify:
- name: M1 — claude-code idle timeout is 600000ms (NOT 120000)
cmd: python3 -c 'import subprocess,json; r=subprocess.run(["openclaw","gateway","call","debug.session.config","--params",json.dumps({"provider":"claude-code"})],capture_output=True,text=True); assert "\"resolvedRequestTimeoutMs\": 600000" in r.stdout, r.stdout[-500:]'
- name: M3 fork RPC alive — config.openExternalFile
cmd: python3 -c 'import subprocess,json; r=subprocess.run(["openclaw","gateway","call","config.openExternalFile","--params",json.dumps({"path":"/dev/null"})],capture_output=True,text=True); assert "\"ok\"" in r.stdout, r.stdout[-500:]'
- name: M3 fork RPC alive — files.resolveBareName
cmd: python3 -c 'import subprocess,json; r=subprocess.run(["openclaw","gateway","call","files.resolveBareName","--params",json.dumps({"name":"BRIEFING.md"})],capture_output=True,text=True); assert "matches" in r.stdout, r.stdout[-500:]'
- name: M3 fork RPC alive — briefing.resolve
cmd: python3 -c 'import subprocess; r=subprocess.run(["openclaw","gateway","call","briefing.resolve"],capture_output=True,text=True); assert ("\"path\"" in r.stdout) or ("\"content\"" in r.stdout), r.stdout[-500:]'
- name: M3 fork RPC alive — debug.dumpUiSnapshot
cmd: python3 -c 'import subprocess; r=subprocess.run(["openclaw","gateway","call","debug.dumpUiSnapshot"],capture_output=True,text=True); assert "\"ok\"" in r.stdout, r.stdout[-500:]'
---
Each row is one failure mode: where it originates → what envelope/symptom it produces → which channel it surfaces in → user-visible signal → resolution path.
The `__ERR_ENV__:` envelope (bible §5.69) is the single wire-level format for surfacing failures across channels. Categories and icons:
| Category | Icon | Fatal? | Triggered by |
|---|---|---|---|
| `subscription` | 💳 | yes | OAuth subscription exhausted |
| `billing` | 💸 | yes | metered API key billing failure |
| `auth` | 🔐 | yes | 401, OAuth invalid |
| `rate_limit` | 🚦 | no | 429 |
| `overload` | 🌊 | no | 500 / provider overload |
| `network` | 📡 | no | DNS, ECONNRESET, etc. |
| `timeout` | ⏱️ | no | LLM idle watchdog, request timeout |
| `lane_busy` | 🔄 | no | previous run still shutting down (and: gateway restart chip uses this with fatal=false) |
| `reply_run_already_active` | ⏳ | no | duplicate idempotency |
| `incomplete_turn` | 🫥 | no | stream ended without final |
| `tool_error` | 🔧 | no | exec tool failure surfaced |
| `compaction_error` | 🧹 | no | compaction pipeline failure |
| `generic` | | yes | unclassified |
Generation: from src/fork/error-envelope.ts. Update this table when categories are added.
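
For callers that only need the retry decision, the category table can be mirrored as a lookup. This is a minimal Python sketch that hand-copies the table; it is not generated from `src/fork/error-envelope.ts`, and the `Category` tuple, the `is_fatal` helper, and the fall-back of unknown names to `generic` are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    icon: Optional[str]  # None where the table lists no icon
    fatal: bool


# Hand-copied from the category table; keep in sync when categories are added.
CATEGORIES = {
    "subscription": Category("💳", True),
    "billing": Category("💸", True),
    "auth": Category("🔐", True),
    "rate_limit": Category("🚦", False),
    "overload": Category("🌊", False),
    "network": Category("📡", False),
    "timeout": Category("⏱️", False),
    "lane_busy": Category("🔄", False),
    "reply_run_already_active": Category("⏳", False),
    "incomplete_turn": Category("🫥", False),
    "tool_error": Category("🔧", False),
    "compaction_error": Category("🧹", False),
    "generic": Category(None, True),
}


def is_fatal(category: str) -> bool:
    """Assumption: an unrecognized category is treated as `generic`, which is fatal."""
    return CATEGORIES.get(category, CATEGORIES["generic"]).fatal
```

A non-fatal category (for example `rate_limit`) is a candidate for retry; a fatal one needs user action first.
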
- diagnose_with: `debug.session.config({provider:"claude-code"})` → assert `resolvedRequestTimeoutMs=600000`. `gateway.stuckSessions({thresholdMs:120000})` → if a session is past the old 120s threshold but `resolvedRequestTimeoutMs=600000`, the watchdog is correctly relaxed; otherwise the overlay broke. Journal grep `[idle-timeout-diag]` confirms per-turn resolution.
- Origin: `streamWithIdleTimeout` in `src/agents/pi-embedded-runner/run/llm-idle-timeout.ts`. Idle timer racing `streamIterator.next()`. cc-bridge intentionally does NOT emit `stream` events during tool work (see `tool-loop.md`), so heavy turns can starve the timer.
- Propagation: idle timeout rejects → `idleTimeoutTrigger(error)` → `abortRun(true, error)` aborts `runAbortController` → cc-bridge worker receives signal → `worker.kill("SIGTERM")` → claude-cli exits → `assistant-failover.ts` classifies as `surface_error reason=timeout`.
- Envelope: category `timeout` (icon ⏱️) or `lane_busy` if classified that way. Text: `"🤖 ⚠️ Something went wrong while processing your request."`
- Surface:
  - WhatsApp: envelope delivered as chunked text via `deliverWebReply`. User sees the chip.
  - Tinker UI: pre-2026-05-10 → spinner stuck on `sending...` (lifecycle event dropped). Post-fix → backstop `broadcastChatFinal` fires; chip renders, spinner clears.
- Resolution: ensure `timeoutSeconds` is correctly resolved (see config-shape.md M1 ↔ that file). Architectural fix LIVE 2026-05-11: cc-bridge now emits an empty-delta heartbeat every 25s during a turn so the watchdog resets without re-executing tools. The 600s overlay is now belt-and-suspenders rather than load-bearing. See `tool-loop.md`.
- Detection probe: journal grep `[llm-idle-timeout]` lines, plus the `[idle-timeout-diag]` log shows the resolved timeout. Heavy turns hitting 138s/279s without `[idle-timeout-diag] idleTimeoutMs=600000` means the overlay path broke.
- Bug history: 2026-05-05 catalog `timeoutSeconds:600` was dead code; 2026-05-10 fixed via plugin overlay (bible §11.6d, §11.6e).
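
The race between the idle timer and the stream iterator, and why an empty-delta heartbeat keeps a silent tool-heavy turn alive, can be sketched in Python. This is an illustration only: the real watchdog is TypeScript in `llm-idle-timeout.ts`, and `stream_with_idle_timeout`, `turn`, `collect`, and the sub-second durations here are hypothetical stand-ins chosen so the example runs quickly.

```python
import asyncio
from typing import AsyncIterator, Optional


class IdleTimeout(Exception):
    """Raised when no stream event arrives within the idle window."""


async def stream_with_idle_timeout(
    source: AsyncIterator[str], idle_s: float
) -> AsyncIterator[str]:
    it = source.__aiter__()
    while True:
        try:
            # The idle timer races each next(); any event, even an empty
            # delta, restarts the window on the following iteration.
            event = await asyncio.wait_for(it.__anext__(), timeout=idle_s)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            raise IdleTimeout(f"no stream event for {idle_s}s") from None
        yield event


async def turn(tool_work_s: float, heartbeat_s: Optional[float]) -> AsyncIterator[str]:
    """Hypothetical turn: silent tool work, optionally with empty-delta heartbeats."""
    remaining = tool_work_s
    while remaining > 0:
        step = remaining if heartbeat_s is None else min(heartbeat_s, remaining)
        await asyncio.sleep(step)
        remaining -= step
        if heartbeat_s is not None and remaining > 0:
            yield ""  # empty delta: carries no text, only resets the watchdog
    yield "final answer"


async def collect(tool_work_s: float, heartbeat_s: Optional[float], idle_s: float):
    """Return the list of events, or the string "timeout" if the watchdog fired."""
    try:
        stream = stream_with_idle_timeout(turn(tool_work_s, heartbeat_s), idle_s)
        return [event async for event in stream]
    except IdleTimeout:
        return "timeout"
```

With `asyncio.run(collect(0.6, None, 0.3))` the silent turn outlasts the idle window and the watchdog fires; with `asyncio.run(collect(0.6, 0.05, 0.3))` the heartbeats keep resetting it and the final event is delivered. The production values from this section are a 25s heartbeat against a 600s resolved timeout.
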
- diagnose_with: `debug.dumpUiSnapshot()` then `Read(~/.openclaw/data/tinker-ui-snapshot.html)` and grep for `thinking-pending`. Cross-check with `debug.session.state({sessionKey})`: if `activeRunIds=[]` but a `thinking-pending` chip is in the snapshot, this is M2.
- Origin: `chat.send`'s `.then()` previously only emitted `broadcastChatFinal` when `!agentRunStarted`. On surface_error timeouts the agent did start; the lifecycle event from `server-chat.ts:emitChatFinal` was the only path to `state="final"`. If that path dropped (e.g. `isControlUiVisible=false`), TUI received NO final event.
- Propagation: lifecycle event dropped → no `state="final"` broadcast → TUI's `thinking-pending` indicator never clears.
- Surface: Tinker UI spinner stays on `sending...` indefinitely.
- Resolution: backstop in `chat.ts` `.then()` agentRunStarted branch always emits `broadcastChatFinal` with `deliveredReplies` content (FORK 2026-05-10, bible §11.6e). Idempotent vs lifecycle path because `broadcastChatFinal` deletes `agentRunSeq[runId]`.
- Detection: read `~/.openclaw/data/tinker-ui-snapshot.html` and grep for `thinking-pending`. If a session shows pending without an active run in the journal, this is the failure.
- Bug history: regression C from 2026-05-10.
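
The idempotency argument in the Resolution bullet can be illustrated with a small model. This sketch assumes that a missing `agentRunSeq[runId]` entry is what makes a second final a no-op; the source only states that `broadcastChatFinal` deletes the entry, so the gating check, the class name, and the method names are hypothetical.

```python
class ChatFinalBroadcaster:
    """Toy model of two paths (lifecycle event and backstop) emitting one final."""

    def __init__(self):
        self.agent_run_seq = {}  # runId -> last sequence number
        self.sent = []           # (runId, text) finals delivered to the UI

    def start_run(self, run_id):
        self.agent_run_seq[run_id] = 0

    def broadcast_chat_final(self, run_id, text):
        # Assumption: no agent_run_seq entry means a final already went out.
        if run_id not in self.agent_run_seq:
            return False
        del self.agent_run_seq[run_id]
        self.sent.append((run_id, text))
        return True
```

Whichever path calls `broadcast_chat_final` first delivers the final; the other call returns `False` and the UI sees exactly one `state="final"`.
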
- diagnose_with: each fork RPC has its own probe in this file's `verify` block: `config.openExternalFile`, `files.resolveBareName`, `briefing.resolve`, `debug.dumpUiSnapshot`. If any returns `unknown method`, M3 fired during the last merge. The full set is enumerated under "Detection probe" below.
- Origin: weekly merge of upstream/main rebases `src/gateway/server-methods.ts`; conflicts that look "safe" can drop fork-added imports / spreads.
- Propagation: handler missing from `coreGatewayHandlers` → any RPC call to it returns `unknown method: <name>` (INVALID_REQUEST).
- Surface: the call site fails. For UI features (e.g., `.fs-link` click), the click silently does nothing or shows a console error.
- Resolution:
  - Short-term: restore the handler import + spread.
  - Long-term: J15 merge gate, where a failing `verify` command for any fork-added RPC fails the merge.
- Detection probe: for each fork-added handler, `openclaw gateway call <method> --params '{...}'`. List of fork RPCs that need verify commands:
  - `config.openExternalFile` (FORK §5.68)
  - `briefing.resolve` (FORK §11.6b)
  - `files.resolveBareName` (FORK §11.6c)
  - `debug.dumpUiSnapshot` (FORK §11.6c)
  - `fork.subagents.spawn`
  - `fork.prefrontal.state.*`
- Bug history: regression A from 2026-05-10 (`config.openExternalFile` wiped 2026-04-29, surfaced 2026-05-09).
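
The per-RPC probe can be looped over the fork handlers. The sketch below covers only the four RPCs whose probe parameters appear in this file's frontmatter; `fork.subagents.spawn` and `fork.prefrontal.state.*` are left out because their parameters are not given here. `gateway_call` and `wiped_rpcs` are hypothetical helper names, and checking both stdout and stderr for `unknown method` is an assumption about where the CLI prints the error.

```python
import json
import subprocess
from typing import Callable, Dict, List, Optional

# Probe params copied from this file's verify block.
FORK_RPCS: Dict[str, Optional[dict]] = {
    "config.openExternalFile": {"path": "/dev/null"},
    "files.resolveBareName": {"name": "BRIEFING.md"},
    "briefing.resolve": None,
    "debug.dumpUiSnapshot": None,
}


def gateway_call(method: str, params: Optional[dict]) -> str:
    """Run `openclaw gateway call <method>` and return stdout plus stderr."""
    cmd = ["openclaw", "gateway", "call", method]
    if params is not None:
        cmd += ["--params", json.dumps(params)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


def wiped_rpcs(call: Callable[[str, Optional[dict]], str] = gateway_call) -> List[str]:
    """Return the fork RPCs whose probe output reports `unknown method`."""
    return [m for m, p in FORK_RPCS.items() if "unknown method" in call(m, p)]
```

A non-empty result from `wiped_rpcs()` after a merge names the handlers whose import or spread needs restoring. The `call` parameter is injectable so the logic can be exercised without a running gateway.
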
- diagnose_with: `Read(~/.openclaw/cc-bridge/session-map.json)` and assert each WA channel + TUI channel produces a distinct `openclawSessionId`. If two channels share an `openclawSessionId`, the channel-isolation invariant has actually been violated (would be a real M4, not the suspected one). The 2026-05-09 evidence: WA=`a87a4e61`, TUI=`bf76b61f` (separate).
- Origin: suspected when `/new` was typed in TUI and a WhatsApp reply showed unexpected content.
- Investigation result: NOT bleed at cc-bridge layer. The two channels have distinct `openclawSessionId` (e.g. WA=`a87a4e61`, TUI=`bf76b61f`); session-map's `getLatestResumeSessionIdByOpenclawSessionId` only resolves WITHIN one openclaw session.
- Actual cause of the symptom: M1 (cc-bridge SIGTERM) on the WA turn + concurrent `/new` turn timing; the "something went wrong" envelope on WA was timing-correlated with the `/new` in TUI, leading to misattribution.
- Resolution: none required at cc-bridge layer. Document that channel isolation is invariant: `openclawSessionId` is canonical, lookup priority is openclawSessionId-first.
- Don't regress: in `worker-pool.getOrCreate`, `openclawSessionId` lookup MUST come BEFORE `sessionKey` lookup. Reversing brings back stale-entry-wins behavior.
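
The channel-isolation check from the diagnose step reduces to finding any `openclawSessionId` claimed by more than one channel. The layout of `session-map.json` is not described in this file, so this sketch takes an already-extracted channel-to-session mapping as input; `shared_session_ids` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def shared_session_ids(channel_to_session: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each openclawSessionId used by more than one channel to those channels."""
    by_session = defaultdict(list)
    for channel, session_id in channel_to_session.items():
        by_session[session_id].append(channel)
    return {sid: sorted(chs) for sid, chs in by_session.items() if len(chs) > 1}
```

For the 2026-05-09 evidence, `shared_session_ids({"WA": "a87a4e61", "TUI": "bf76b61f"})` is empty, which matches the finding that there was no bleed. Any non-empty result would be a real M4.
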
- diagnose_with: `plugin.boot.status({status:"error"})` returns every plugin that failed to load with `error`, `failurePhase`, and `failedAt`. The 2026-05-11 example we hit: `tinkerclaw-round-table` and `tinkerclaw-total-recall` both with `Cannot find module '@sinclair/typebox'`. Probe replaces the journal grep that used to be the only path.
- manifest_via: `debug.simulate.pluginLoadFail({pluginId:"__simulated-test", failurePhase:"load"})` (admin-scope) injects a fake plugin record with `status:"error"` directly into the in-memory registry; calling `plugin.boot.status --params '{"status":"error"}'` immediately after must include it. Cleanup with `debug.simulate.pluginLoadFail --params '{"action":"clear"}'`. Round-trip-tests the diagnose_with claim above.
- Origin: `pnpm.onlyBuiltDependencies` in `package.json` is wiped on upstream merges. After a merge, `better-sqlite3`, `opusscript`, `@discordjs/opus` are no longer pre-built.
- Propagation: plugin import fails with `Cannot find module '@sinclair/typebox'` or similar native-binding errors.
- Surface: gateway boot warning, plugin disabled, features silently missing. Today's example: `tinkerclaw-round-table` and `tinkerclaw-total-recall` fail to load.
- Resolution:
  - Add the deps back to `pnpm.onlyBuiltDependencies` in package.json.
  - Run `pnpm rebuild better-sqlite3 opusscript @discordjs/opus`.
  - Restart gateway with `--full`.
- Detection probe: parse boot journal for `failed to load plugin` lines.
- Long-term fix: a post-merge hook that asserts the deps are present.
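
The long-term fix (a post-merge hook asserting the deps are present) could look roughly like the following. No such hook exists in the source; `missing_built_deps` and `REQUIRED_BUILT_DEPS` are hypothetical names, and the required list is just the three packages named in the Origin bullet.

```python
import json
from typing import Iterable, List

# The native deps this section says are wiped on upstream merges.
REQUIRED_BUILT_DEPS = ("better-sqlite3", "opusscript", "@discordjs/opus")


def missing_built_deps(
    package_json_text: str, required: Iterable[str] = REQUIRED_BUILT_DEPS
) -> List[str]:
    """Return required deps absent from pnpm.onlyBuiltDependencies."""
    pkg = json.loads(package_json_text)
    present = set(pkg.get("pnpm", {}).get("onlyBuiltDependencies", []))
    return [dep for dep in required if dep not in present]
```

A hook would read `package.json`, call `missing_built_deps` on its contents, and exit non-zero when the list is non-empty, so the wipe is caught at merge time instead of at gateway boot.
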
- diagnose_with: `plugin.boot.status({status:"error"})`: if a plugin's `failurePhase` is `"validation"`, the manifest itself is invalid (M6 territory); `"load"` means import-time crash (M5 territory); `"register"` means the plugin loaded but its `register()` hook threw. The phase distinguishes the cascading-config-validation failure (gateway refuses to start at all) from a single-plugin native-deps issue.
- Origin: since 2026-03-05, upstream requires `configSchema` field in every `openclaw.plugin.json`. Forgetting it = config validation error loop blocks ALL plugins (cascading).
- Surface: gateway boot fails entirely.
- Resolution: add `configSchema: {}` (at minimum) to every fork plugin manifest. See bible's "Plugin manifests" rule.
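
The phase-to-failure-mode mapping in the diagnose step can be written as a small triage function. The record shape is an assumption: `failurePhase` is named in this file, but the `pluginId` key on each returned record and the `triage` helper are hypothetical.

```python
from typing import Dict, List

PHASE_TO_MODE = {
    "validation": "M6: manifest invalid (check configSchema)",
    "load": "M5: import-time crash (check native deps)",
    "register": "register() hook threw",
}


def triage(records: List[dict]) -> Dict[str, List[str]]:
    """Group failed plugin records by the failure mode their phase implies."""
    grouped: Dict[str, List[str]] = {}
    for rec in records:
        mode = PHASE_TO_MODE.get(rec.get("failurePhase"), "unknown phase")
        # `pluginId` as the record key is an assumption about the RPC response.
        grouped.setdefault(mode, []).append(rec.get("pluginId"))
    return grouped
```

Feeding it the parsed output of `plugin.boot.status({status:"error"})` separates manifest problems from native-deps problems in one pass.
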
- diagnose_with: `wa.recentOutbound({n:20})`: scan for outbounds to LID chats that are NOT the owner's self-LID. Cross-reference with the trigger-gate unit tests (`extensions/tinkerclaw-whatsapp/src/auto-reply/monitor/decide-trigger.test.ts` covers the post-rescue gate). Also: `decide-trigger.test.ts` case (b) is a regression guard for this exact class.
- Origin: `inbound/monitor.ts` LID rescue used to accept ANY `@lid` chat with `fromMe=true` as self-DM. When the owner DMed a family contact via that contact's LID, rescue rewrote the chat to owner self-DM, trigger fired without prefix, reply leaked to the contact's DM.
- Propagation: rescue branch sets `from=lidString` and prompts pass; but the wrong recipient is computed for the outbound.
- Surface: unintended outbound to a non-owner chat.
- Resolution (2026-05-04): rescue is gated on `self.lid===remoteJid` OR (`remoteJid ∈ noPrefixChats ∧ allowFrom`). Anything looser = bug.
- Resolution (2026-05-12): `self.lid` is now populated from the whatsmeow SQLite store. `auth-store.ts:readWebSelfIdentity` falls back to `identity-whatsmeow-db.ts:readWhatsmeowDeviceIdentity` when `creds.json` is absent or its `me.lid` is null (which is always the case on whatsmeow-backed accounts, since whatsmeow doesn't write a JSON creds file). The `whatsmeow_device.lid` column is read read-only via `better-sqlite3`. Closes the open follow-up: path (a) `self.lid === remoteJid` is now the primary signal on every whatsmeow account.
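
The rescue gate from the 2026-05-04 resolution is a two-path predicate. A minimal Python sketch follows; the real gate is TypeScript, `rescue_allowed` is a hypothetical name, and modelling `allowFrom` as an already-evaluated boolean is an assumption, since this file does not say how it is computed.

```python
from typing import Optional, Set


def rescue_allowed(
    self_lid: Optional[str],
    remote_jid: str,
    no_prefix_chats: Set[str],
    allow_from: bool,
) -> bool:
    """Gate: self.lid === remoteJid OR (remoteJid in noPrefixChats AND allowFrom)."""
    # Path (a): the chat is the owner's own LID. An unknown self LID never matches.
    if self_lid is not None and self_lid == remote_jid:
        return True
    # Path (b): an explicitly configured chat that also passes allowFrom.
    return remote_jid in no_prefix_chats and allow_from
```

The original bug corresponds to returning `True` for any `@lid` chat with `fromMe=true`. Under this gate a contact's LID passes only if it was deliberately listed in `noPrefixChats` and also passes `allowFrom`.
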
- diagnose_with: `cron.lastRun({jobId:"morning-briefing"})` returns the receipt path. `Read(memory/morning-briefings/<date>.md)` shows the cron pass content. If the user-pass-1 output omits items present in the cron pass receipt, M8 is firing (delta-mode against an unread audit artifact).
- Origin: cron runs `morning-briefing` at 07:00 and writes `memory/morning-briefings/YYYY-MM-DD.md` as an audit artifact (`deliver:false`). User typing `/new` later expects a USER pass, not a delta on top of the cron pass.
- Propagation: Jarvis reads the cron pass, treats `/new` as Pass 2, renders "delta-only" → user is confused because they never saw Pass 1.
- Surface: user `/new` returns sparse content with no priorities visible.
- Resolution: the cron pass is an audit; `/new` is always user-pass-1 → render FULL content. Always ENUMERATE preflight reds/yellows by name (counts alone are useless). See memory note 2026-05-11.
- diagnose_with: `pnpm bible:invariants` is the canonical post-merge probe. Every `verify:` in the bible files is a contract; the runner is what enforces them. `cron.lastRun({jobId:"daily-fork-sync"})` confirms whether the auto-merge ran. While the cron is disabled (2026-05-09), running the runner manually after every merge is the workaround.
- Origin: the daily-fork-sync cron (currently DISABLED 2026-05-09) merges upstream and ships if `pnpm build` passes. Build-passing ≠ behavior-preserving.
- Propagation: fork RPC wiped (M3), fork patch reverted, plugin manifest field dropped, native-deps array wiped (M5), etc.
- Resolution: J15 merge gate (paper §5): run `pnpm test:invariants` after build; refuse merge if any `verify` command newly fails. Today this is proposed, not implemented. The merge cron stays disabled until the gate ships.
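
The "newly fails" rule of the proposed gate compares verify results from before and after the merge. Since the gate is not implemented, this is only a sketch of the decision logic; `newly_failing` and `gate_allows_merge` are hypothetical names, and treating a verify command that disappears after the merge as failing is an assumption.

```python
from typing import Dict, List


def newly_failing(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Verify names that passed before the merge and fail (or vanish) after it."""
    return sorted(
        name for name, ok in before.items() if ok and not after.get(name, False)
    )


def gate_allows_merge(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Refuse the merge if any verify command newly fails."""
    return not newly_failing(before, after)
```

Comparing against a pre-merge baseline means a probe that was already red does not block every future merge, while any regression introduced by the merge does.
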
- diagnose_with: `gateway.stuckSessions({thresholdMs:60000})` returns processing sessions older than 60s, sorted by age. `debug.session.state({sessionKey})` then returns the persisted `sessions.json` entry plus the live `activeRunIds`: if the entry shows `status:running` but `activeRunIds=[]`, the session is stuck in M10. `gateway.observability.snapshot` includes the stuck-session example list as one section of the single-call dashboard.
- manifest_via: `debug.simulate.stuckSession({ageMs:120000})` (admin-scope) injects a fake processing session aged 2 minutes; calling `gateway.stuckSessions` immediately after must include it in the returned list. Round-trip-tests the diagnose_with claim above. Cleanup with `debug.simulate.stuckSession --params '{"action":"clear"}'`.
- Resolution (2026-05-12): synchronous session-status transition on surface_error is now wired. `src/agents/pi-embedded-runner/run.ts` calls `forkAttemptHooks.markFailedOnSurfaceError({sessionKey, reason})` at the `promptFailoverDecision.action === "surface_error"` throw site. The hook (in `src/fork/attempt-hooks.ts`) scans every agent's `sessions.json` and transitions the entry from `status:"running"` → `status:"failed"` with `abortedLastRun:true`. Best-effort: it never blocks or masks the original throw. The boot-time `markRunningMainSessionsAsInterrupted` recovery is now belt-and-suspenders rather than load-bearing.
- Origin: L1 lifecycle (lifecycles.md): on surface_error or timeout, the session.status sometimes stays `running` in `sessions.json`.
- Propagation: next message on the same session may behave oddly; recovery code catches it on next reboot via `markRunningMainSessionsAsInterrupted`.
- Surface: the symptom is bounded (recovery cleans it up) but causes cosmetic confusion.
- Resolution: open follow-up: pi-agent-core should transition status synchronously on surface_error. For now, restart picks it up.
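
The stuck-session test and the status transition described above can be sketched as two pure functions. This models a single `sessions.json` entry as a plain dict; the real hook is TypeScript in `src/fork/attempt-hooks.ts` and scans files on disk, so the Python names and the copy-on-write style here are illustrative assumptions.

```python
from typing import Dict, List


def is_stuck(entry: Dict, active_run_ids: List[str]) -> bool:
    """M10 signature: persisted status is running but no run is live."""
    return entry.get("status") == "running" and not active_run_ids


def mark_failed_on_surface_error(entry: Dict) -> Dict:
    """Return a copy transitioned running -> failed; other statuses are unchanged."""
    if entry.get("status") != "running":
        return dict(entry)
    return {**entry, "status": "failed", "abortedLastRun": True}
```

Once an entry has gone through `mark_failed_on_surface_error`, `is_stuck` returns `False` for it, which is the state the diagnose step expects after the 2026-05-12 fix.
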
verify:
- cmd: openclaw gateway call config.openExternalFile --params '{"path":"/dev/null"}'
  expect: ".ok != null" # M3 probe for this specific fork RPC
- cmd: openclaw gateway call files.resolveBareName --params '{"name":"BRIEFING.md"}'
  expect: ".matches | length >= 1"
- cmd: openclaw gateway call briefing.resolve
  expect: ".path != null"

Wire every M-row's "Resolution" column to a probe + verify cmd. Today: zero of these are wired into a merge gate. Future: J15 §5.