Open
Problem
The gateway can enter a state in which agent sessions hang indefinitely, delaying message processing and causing timeouts. Recent logs showed:
- Sessions stuck for 400+ seconds in 'processing' state
- Discord message listener taking 40-289 seconds (should be instant)
- Agent runs hitting 10-minute timeout
- Multiple diagnostic warnings about stuck sessions
Example from logs:
[diagnostic] stuck session: sessionId=460e2072-f74a-4f44-b373-0753700c74db state=processing age=515s queueDepth=0
[discord] Slow listener detected: DiscordMessageListener took 289 seconds for event MESSAGE_CREATE
[agent/embedded] embedded run timeout: runId=91f22b6a-bf15-4af6-9d8a-41447434dc52 timeoutMs=600000
Restarting the gateway clears the stuck sessions, but someone has to notice the problem and do it by hand.
Proposed Solution
Add built-in stuck session detection and auto-recovery:
1. Session Timeout Config
  {
    "gateway": {
      "sessionTimeout": 120000,      // Max 2 min per session
      "autoRestart": {
        "onStuckSessions": true,
        "threshold": 3               // Auto-restart if 3+ sessions stuck
      }
    }
  }
2. Detection Logic
- If a session is stuck for longer than sessionTimeout (2 minutes in the example above) → kill it
- If the number of stuck sessions reaches the configured threshold → auto-restart the gateway
- Log all auto-restarts with reason for debugging
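The detection rules above could be sketched roughly as follows. This is only an illustration: `SessionSnapshot`, `WatchdogConfig`, `WatchdogDecision`, and `evaluateSessions` are hypothetical names, not existing gateway APIs, and the config fields simply mirror the proposed `sessionTimeout` and `autoRestart` keys.

```typescript
// Hypothetical shapes for illustration; none of these names come from the gateway codebase.
interface SessionSnapshot {
  sessionId: string;
  state: "idle" | "processing";
  startedAtMs: number; // when the session entered its current state
}

interface WatchdogConfig {
  sessionTimeoutMs: number;    // mirrors the proposed gateway.sessionTimeout
  autoRestartEnabled: boolean; // mirrors gateway.autoRestart.onStuckSessions
  restartThreshold: number;    // mirrors gateway.autoRestart.threshold
}

interface WatchdogDecision {
  kill: string[];   // IDs of sessions that exceeded the timeout
  restart: boolean; // true when the stuck count reaches the threshold
  reason: string;   // human-readable reason, intended for the log
}

function evaluateSessions(
  sessions: SessionSnapshot[],
  config: WatchdogConfig,
  nowMs: number,
): WatchdogDecision {
  // A session counts as stuck if it has been processing longer than the timeout.
  const stuck = sessions.filter(
    (s) => s.state === "processing" && nowMs - s.startedAtMs > config.sessionTimeoutMs,
  );
  const ids = stuck.map((s) => s.sessionId);
  const restart = config.autoRestartEnabled && stuck.length >= config.restartThreshold;
  const reason =
    stuck.length === 0
      ? "no stuck sessions"
      : `${stuck.length} session(s) stuck longer than ${config.sessionTimeoutMs}ms: ${ids.join(", ")}`;
  return { kill: ids, restart, reason };
}
```

Keeping the decision a pure function of a session snapshot and a clock value means it can be run on a timer inside the gateway and unit-tested without real sessions. The caller would still be responsible for actually killing sessions, restarting, and logging `reason`.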
3. Rate Limit Handling
- When one auth profile times out/rate-limits, fail over faster to next profile
- Current behavior: waits for full timeout before trying next profile
- Desired: Detect rate limit errors and immediately try next profile
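A minimal sketch of the desired failover behavior, under stated assumptions: `AuthProfile`, `runWithFailover`, `isRateLimit`, and the `PROFILE_TIMEOUT` code are hypothetical names, and the rate-limit check assumes the provider error exposes an HTTP-style `status` of 429. The real gateway's error and profile types may look quite different.

```typescript
// Hypothetical names throughout; not taken from the gateway codebase.
interface AuthProfile {
  name: string;
}

// Assumption: rate-limit errors carry an HTTP-style status of 429.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Timeouts raised by this helper are tagged with a hypothetical code.
function isTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: string }).code === "PROFILE_TIMEOUT";
}

async function runWithFailover<T>(
  profiles: AuthProfile[],
  attempt: (profile: AuthProfile) => Promise<T>,
  perProfileTimeoutMs: number,
): Promise<T> {
  let lastError: unknown = new Error("no auth profiles configured");
  for (const profile of profiles) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(Object.assign(new Error(`profile ${profile.name} timed out`), { code: "PROFILE_TIMEOUT" })),
        perProfileTimeoutMs,
      );
    });
    try {
      // A rate-limit rejection from attempt() settles the race right away,
      // so the loop moves on without waiting for the timer.
      return await Promise.race([attempt(profile), timeout]);
    } catch (err) {
      lastError = err;
      // Only rate limits and timeouts trigger failover; other errors propagate.
      if (!isRateLimit(err) && !isTimeout(err)) throw err;
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

The per-profile timeout stays as a backstop, but a profile that reports a rate limit is abandoned as soon as the error arrives. Note that racing against a timer does not cancel the underlying request; a real implementation would probably also need to abort it.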
Benefits
- Self-healing: Gateway recovers automatically without manual intervention
- Better UX: Users don't experience multi-minute delays waiting for stuck sessions to time out
- Reliability: Prevents cascading failures from stuck sessions blocking the queue
Workaround (Current)
Manual cron watchdog:
  */5 * * * * if grep -q "stuck session" ~/.clawdbot/logs/gateway.err.log; then gateway restart --reason "Stuck session detected"; fi
But this is reactive and has 5-minute granularity. Native gateway support would be better.