
@steipete
Contributor

Problem

In containers (Docker, Fly.io, etc.), PIDs are recycled quickly after restarts. When a container restarts, a different process might get the same PID as the previous gateway, causing the lock check to incorrectly think the old gateway is still running.

This causes a boot loop: the gateway refuses to start because it treats a stale lock as live, when the PID recorded in the lock now belongs to an unrelated process.

Solution

Add isGatewayProcess(), which verifies on Linux that the PID actually belongs to a clawdbot gateway by checking /proc/PID/cmdline. If the cmdline contains neither 'clawdbot' nor 'gateway', we assume the lock is stale and remove it.

On non-Linux platforms (macOS, Windows), we fall back to the existing behavior since PID recycling is less of an issue outside containers.
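
A minimal sketch of the check described above, not the code from this PR. The helper name cmdlineLooksLikeGateway, the optional platform parameter, and the behavior when /proc cannot be read are all assumptions made for illustration.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper (not from the PR): inspect a raw /proc/<pid>/cmdline
// string, whose arguments are separated by NUL bytes.
function cmdlineLooksLikeGateway(rawCmdline: string): boolean {
  const args = rawCmdline.split("\0").filter((arg) => arg.length > 0);
  return args.some((arg) => arg.includes("clawdbot") || arg.includes("gateway"));
}

// Sketch of the check described above. The signature is an assumption.
function isGatewayProcess(pid: number, platform: string = process.platform): boolean {
  // Non-Linux platforms keep the existing behavior: trust the plain PID check.
  if (platform !== "linux") return true;
  try {
    return cmdlineLooksLikeGateway(readFileSync(`/proc/${pid}/cmdline`, "utf8"));
  } catch {
    // Assumption: if /proc cannot be read we cannot prove the lock is stale,
    // so we keep treating the PID as a gateway.
    return true;
  }
}
```

Splitting the string match from the file read keeps the matching logic testable on any platform, since only isGatewayProcess touches /proc.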

Testing

  • Tested by deploying to Fly.io where the boot-loop was occurring
  • On macOS, behavior is unchanged (falls back to existing PID check)

🦞 Fix discovered while deploying Flawd to Fly.io

@cursor

cursor bot commented Jan 24, 2026

PR Summary

Fixes gateway startup lock loops in containers and adds Fly.io deployment docs.

  • Gateway lock (Linux): writes startTime into lock payload; on acquisition, reads /proc/<pid>/stat to compare start time (or falls back to /proc/<pid>/cmdline) to distinguish recycled PIDs; retains locks unless proven dead or stale; improved stale checks; adds targeted tests for recycled PIDs and proc access failures
  • Docs/Config: new platforms/fly guide, nav entry, and sample fly.toml for deploying the Gateway on Fly.io
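
The start-time comparison in the summary above could look roughly like the following. This is a sketch under stated assumptions, not the merged implementation: the names parseProcStartTime, checkLockOwner, and readProcStat are hypothetical, and the only documented fact relied on is that field 22 of /proc/<pid>/stat is the process start time in clock ticks since boot.

```typescript
import { readFileSync } from "node:fs";

type OwnerCheck = "same-process" | "recycled-pid" | "unknown";

// Extract starttime (field 22, clock ticks since boot) from /proc/<pid>/stat content.
function parseProcStartTime(statLine: string): number | null {
  // comm (field 2) is parenthesized and may itself contain spaces or ")",
  // so only the text after the last ")" is split on whitespace.
  const close = statLine.lastIndexOf(")");
  if (close === -1) return null;
  const rest = statLine.slice(close + 1).trim().split(/\s+/);
  // rest[0] is field 3 (state), so field 22 sits at index 19.
  const raw = rest[19];
  if (raw === undefined) return null;
  const value = Number(raw);
  return Number.isFinite(value) ? value : null;
}

// Compare the start time stored in the lock payload with the live process.
function checkLockOwner(lockStartTime: number | undefined, statLine: string | null): OwnerCheck {
  if (lockStartTime === undefined || statLine === null) return "unknown";
  const current = parseProcStartTime(statLine);
  if (current === null) return "unknown";
  return current === lockStartTime ? "same-process" : "recycled-pid";
}

// Hypothetical reader: returns null when /proc is unavailable or the PID is gone.
function readProcStat(pid: number): string | null {
  try {
    return readFileSync(`/proc/${pid}/stat`, "utf8");
  } catch {
    return null;
  }
}
```

In this sketch an "unknown" result is where a caller would fall back to the /proc/<pid>/cmdline check, and only "recycled-pid" counts as proof that the lock is stale.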

Written by Cursor Bugbot for commit 8bf6e63.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5bd0326bb7

CLAWDBOT_PREFER_PNPM = "1"

[http_service]
internal_port = 3000

P2: Point Fly internal_port at the gateway listener

With the current internal_port = 3000, Fly will route traffic to port 3000, but the gateway HTTP/WS server binds to CLAWDBOT_GATEWAY_PORT which defaults to 18789 (see DEFAULT_GATEWAY_PORT in src/config/paths.ts). In a default deploy that follows this config and doesn’t override the port, Fly health checks and fly open will fail because nothing listens on 3000. Set internal_port to 18789 or explicitly configure the gateway to bind 3000 so routing matches the service.
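
To make the mismatch concrete, here is a hedged sketch of how the listening port might be resolved and compared with Fly's internal_port. The functions resolveGatewayPort and flyPortMatchesGateway are hypothetical and are not taken from src/config/paths.ts; only the variable name CLAWDBOT_GATEWAY_PORT and the default 18789 come from the review comment.

```typescript
// Default taken from the review comment above.
const DEFAULT_GATEWAY_PORT = 18789;

// Hypothetical resolver: use CLAWDBOT_GATEWAY_PORT when it is a valid port,
// otherwise fall back to the default.
function resolveGatewayPort(env: Record<string, string | undefined>): number {
  const raw = env.CLAWDBOT_GATEWAY_PORT;
  const parsed = raw === undefined ? Number.NaN : Number.parseInt(raw, 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed <= 65535;
  return valid ? parsed : DEFAULT_GATEWAY_PORT;
}

// Fly routes traffic to internal_port, so it must equal the port the gateway binds.
function flyPortMatchesGateway(internalPort: number, env: Record<string, string | undefined>): boolean {
  return internalPort === resolveGatewayPort(env);
}
```

With no override in the environment, internal_port = 3000 does not match the resolved port, which is the failure the review describes. Either setting internal_port to 18789 or setting CLAWDBOT_GATEWAY_PORT to "3000" makes the two agree.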

- Add fly.toml configuration for Fly.io deployment
- Add docs/platforms/fly.md with deployment guide
- Uses London (lhr) region by default
- Includes persistent volume for data storage

In containers, PIDs can be recycled quickly after restarts. When a container
restarts, a different process might get the same PID as the previous gateway,
causing the lock check to incorrectly think the old gateway is still running.

This fix adds isGatewayProcess() which verifies on Linux that the PID actually
belongs to a clawdbot gateway by checking /proc/PID/cmdline. If the cmdline
doesn't contain 'clawdbot' or 'gateway', we assume the lock is stale.

Fixes gateway boot-loop in Docker/Fly.io deployments.

Based on actual Flawd deployment experience:
- Proper fly.toml configuration with all required settings
- Step-by-step guide following exe.dev doc format
- Troubleshooting section with common issues and fixes
- Config file creation via SSH
- Cost estimates
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@steipete steipete force-pushed the fix/container-pid-lock branch from 6a38b83 to 8bf6e63 Compare January 24, 2026 08:14
@steipete steipete merged commit 3fff943 into main Jan 24, 2026
40 of 43 checks passed
@steipete steipete deleted the fix/container-pid-lock branch January 24, 2026 08:15
@steipete
Contributor Author

Landed via temp rebase onto main.

- Gate: pnpm lint && pnpm build && pnpm test
- Land commit: 8bf6e63
- Merge commit: 3fff943

Thanks @steipete!
