[Feature]: Auto-restart gateway on stuck sessions #4410

@itsahedge

Description

Problem

Gateway can get into a state where agent sessions hang indefinitely, causing message processing delays and timeouts. Recent logs showed:

  • Sessions stuck for 400+ seconds in 'processing' state
  • Discord message listener taking 40-289 seconds (should be instant)
  • Agent runs hitting 10-minute timeout
  • Multiple diagnostic warnings about stuck sessions

Example from logs:

[diagnostic] stuck session: sessionId=460e2072-f74a-4f44-b373-0753700c74db state=processing age=515s queueDepth=0
[discord] Slow listener detected: DiscordMessageListener took 289 seconds for event MESSAGE_CREATE
[agent/embedded] embedded run timeout: runId=91f22b6a-bf15-4af6-9d8a-41447434dc52 timeoutMs=600000

Restarting the gateway clears the stuck sessions, but someone has to notice the problem and do it by hand.

Proposed Solution

Add built-in stuck session detection and auto-recovery:

1. Session Timeout Config

{
  "gateway": {
    "sessionTimeout": 120000,  // Max 2 min per session
    "autoRestart": {
      "onStuckSessions": true,
      "threshold": 3  // Auto-restart if 3+ sessions stuck
    }
  }
}
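
A minimal sketch of how these options could be typed, defaulted, and validated. Only the field names come from the config above; the `GatewayConfig` type, the `resolveGatewayConfig` helper, and the default values (including auto-restart being off unless enabled) are assumptions for illustration, not existing gateway code.

```typescript
// Hypothetical types: only the field names are taken from the proposed config.
interface AutoRestartConfig {
  onStuckSessions: boolean;
  threshold: number; // number of stuck sessions that triggers a restart
}

interface GatewayConfig {
  sessionTimeout: number; // milliseconds
  autoRestart: AutoRestartConfig;
}

type PartialGatewayConfig = {
  sessionTimeout?: number;
  autoRestart?: Partial<AutoRestartConfig>;
};

// Assumed defaults: the example values above, with auto-restart opt-in.
const DEFAULTS: GatewayConfig = {
  sessionTimeout: 120_000,
  autoRestart: { onStuckSessions: false, threshold: 3 },
};

// Merge user input over the defaults and reject values that make no sense.
function resolveGatewayConfig(input: PartialGatewayConfig = {}): GatewayConfig {
  const config: GatewayConfig = {
    sessionTimeout: input.sessionTimeout ?? DEFAULTS.sessionTimeout,
    autoRestart: { ...DEFAULTS.autoRestart, ...input.autoRestart },
  };
  if (!Number.isFinite(config.sessionTimeout) || config.sessionTimeout <= 0) {
    throw new RangeError("sessionTimeout must be a positive number of milliseconds");
  }
  const { threshold } = config.autoRestart;
  if (!Number.isInteger(threshold) || threshold < 1) {
    throw new RangeError("autoRestart.threshold must be an integer >= 1");
  }
  return config;
}
```

Validating at load time means a typo such as a zero timeout fails loudly at startup instead of killing every session immediately.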

2. Detection Logic

  • If a session is stuck for longer than sessionTimeout (2 minutes in the example config) → kill it
  • If multiple sessions stuck (configurable threshold) → auto-restart gateway
  • Log all auto-restarts with reason for debugging
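
The detection rules above could be sketched as a pure decision function. This is an illustration under assumptions: `SessionSnapshot`, `WatchdogDecision`, and `evaluateSessions` are hypothetical names, and "stuck" is taken to mean a session in the `processing` state whose age exceeds `sessionTimeout`, based on the diagnostic log line quoted earlier.

```typescript
// Hypothetical snapshot of one session, modelled on the diagnostic log fields.
type SessionSnapshot = { sessionId: string; state: string; ageMs: number };

interface WatchdogOptions {
  sessionTimeout: number; // milliseconds
  autoRestart: { onStuckSessions: boolean; threshold: number };
}

type WatchdogDecision =
  | { action: "none" }
  | { action: "kill"; sessionIds: string[] }
  | { action: "restart"; reason: string };

// Decide what to do; the caller performs the kill/restart and logs the reason.
function evaluateSessions(
  sessions: readonly SessionSnapshot[],
  opts: WatchdogOptions,
): WatchdogDecision {
  const stuck = sessions.filter(
    (s) => s.state === "processing" && s.ageMs > opts.sessionTimeout,
  );
  if (stuck.length === 0) return { action: "none" };

  const { onStuckSessions, threshold } = opts.autoRestart;
  if (onStuckSessions && stuck.length >= threshold) {
    return {
      action: "restart",
      reason: `${stuck.length} sessions stuck longer than ${opts.sessionTimeout}ms`,
    };
  }
  // Below the restart threshold: kill only the individual stuck sessions.
  return { action: "kill", sessionIds: stuck.map((s) => s.sessionId) };
}
```

Keeping the decision separate from the side effects makes the thresholds easy to unit test without a running gateway.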

3. Rate Limit Handling

  • When one auth profile times out/rate-limits, fail over faster to next profile
  • Current behavior: waits for full timeout before trying next profile
  • Desired: Detect rate limit errors and immediately try next profile
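
A rough sketch of the desired failover behavior, covering only the rate-limit case. `callWithFailover` and `isRateLimitError` are hypothetical helpers, and the check assumes the provider error exposes an HTTP 429 status as a `status` field; how the gateway actually surfaces these errors is not stated in this issue.

```typescript
// Hypothetical check: assumes rate limits arrive as errors with `status === 429`.
function isRateLimitError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

// Try each auth profile in order, moving on immediately when one is rate-limited.
async function callWithFailover<T>(
  profiles: readonly string[],
  call: (profile: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no auth profiles configured");
  for (const profile of profiles) {
    try {
      return await call(profile);
    } catch (err) {
      // Unrelated failures are rethrown so they are not masked by failover.
      if (!isRateLimitError(err)) throw err;
      lastError = err; // rate-limited: fall through to the next profile
    }
  }
  // Every profile was rate-limited (or none were configured).
  throw lastError;
}
```

Timeouts would need a separate, shorter per-profile deadline; this sketch does not attempt that.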

Benefits

  • Self-healing: Gateway recovers automatically without manual intervention
  • Better UX: Users don't experience multi-minute delays waiting for stuck sessions to time out
  • Reliability: Prevents cascading failures from stuck sessions blocking the queue

Workaround (Current)

Manual cron watchdog:

*/5 * * * * if grep -q "stuck session" ~/.clawdbot/logs/gateway.err.log; then gateway restart --reason "Stuck session detected"; fi

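But this is reactive and has 5-minute granularity. Because grep scans the whole file, it also keeps matching old entries and will trigger a restart on every run until the log is rotated or truncated. Native gateway support would be better.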
