fix: handle PID recycling in container gateway lock #1572
PR Summary
Fixes gateway startup lock loops in containers and adds Fly.io deployment docs.
Written by Cursor Bugbot for commit 8bf6e63. This will update automatically on new commits.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5bd0326bb7
CLAWDBOT_PREFER_PNPM = "1"

[http_service]
internal_port = 3000
Point Fly internal_port at the gateway listener
With the current internal_port = 3000, Fly will route traffic to port 3000, but the gateway HTTP/WS server binds to CLAWDBOT_GATEWAY_PORT, which defaults to 18789 (see DEFAULT_GATEWAY_PORT in src/config/paths.ts). In a default deploy that follows this config and doesn’t override the port, Fly health checks and fly open will fail because nothing listens on 3000. Set internal_port to 18789, or explicitly configure the gateway to bind to 3000, so that routing matches the service.
- Add fly.toml configuration for Fly.io deployment
- Add docs/platforms/fly.md with deployment guide
- Uses London (lhr) region by default
- Includes persistent volume for data storage
In containers, PIDs can be recycled quickly after restarts. When a container restarts, a different process might get the same PID as the previous gateway, causing the lock check to incorrectly think the old gateway is still running.

This fix adds isGatewayProcess(), which verifies on Linux that the PID actually belongs to a clawdbot gateway by checking /proc/PID/cmdline. If the cmdline doesn't contain 'clawdbot' or 'gateway', we assume the lock is stale.

Fixes gateway boot-loop in Docker/Fly.io deployments.
Based on actual Flawd deployment experience:
- Proper fly.toml configuration with all required settings
- Step-by-step guide following exe.dev doc format
- Troubleshooting section with common issues and fixes
- Config file creation via SSH
- Cost estimates
Cursor Bugbot has reviewed your changes and found 2 potential issues.
6a38b83 to 8bf6e63
Problem
In containers (Docker, Fly.io, etc.), PIDs are recycled quickly after restarts. When a container restarts, a different process might get the same PID as the previous gateway, causing the lock check to incorrectly think the old gateway is still running.
This causes a boot-loop: the gateway refuses to start because the PID recorded in the lock is alive, so the lock looks active even though it is stale and that PID now belongs to a different process.
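To make the failure mode concrete, here is a minimal sketch of a PID-only liveness check. It is an assumption about how such a lock check typically works, not the project's actual code, and isPidAlive is a hypothetical name. The check can only say that some process holds the PID, not which one, so a recycled PID passes it.

```typescript
// Hypothetical PID-only liveness check (illustrative; not the project's lock code).
function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 delivers nothing; it only checks that the PID exists
    // and that we are allowed to signal it.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat the PID as not running.
    return (err as { code?: string }).code === "EPERM";
  }
}

// After a container restart, the PID stored in the lock file may now belong
// to an unrelated process. This check still returns true for it.
console.log(isPidAlive(process.pid));
```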
Solution
Add isGatewayProcess(), which verifies on Linux that the PID actually belongs to a clawdbot gateway by checking /proc/PID/cmdline. If the cmdline doesn't contain 'clawdbot' or 'gateway', we assume the lock is stale and remove it.

On non-Linux platforms (macOS, Windows), we fall back to the existing behavior since PID recycling is less of an issue outside containers.
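The check described above could be sketched roughly as follows. This is not the PR's implementation: only the name isGatewayProcess() and the /proc/PID/cmdline approach come from the description. The helper cmdlineLooksLikeGateway is hypothetical, and two details are assumptions: the lock counts as stale when neither substring appears, and an unreadable cmdline file is treated as "not a gateway".

```typescript
import * as fs from "node:fs";

// Hypothetical helper: decide from the raw contents of /proc/<pid>/cmdline.
// On Linux, the arguments in that file are separated by NUL bytes.
function cmdlineLooksLikeGateway(raw: string): boolean {
  const args = raw.split("\0").filter((arg) => arg.length > 0);
  const joined = args.join(" ").toLowerCase();
  // Assumption: the process counts as a gateway if either marker appears.
  return joined.includes("clawdbot") || joined.includes("gateway");
}

function isGatewayProcess(pid: number): boolean {
  // Outside Linux there is no /proc to inspect, so keep the existing
  // behavior and trust the PID-based check.
  if (process.platform !== "linux") return true;
  try {
    const raw = fs.readFileSync(`/proc/${pid}/cmdline`, "utf8");
    return cmdlineLooksLikeGateway(raw);
  } catch {
    // Assumption: if the file cannot be read (for example, the process is
    // gone), the lock does not belong to a live gateway.
    return false;
  }
}

// A recycled PID running an unrelated command is rejected.
console.log(cmdlineLooksLikeGateway("sh\0-c\0sleep 60\0"));
// A command line that mentions the gateway is accepted.
console.log(cmdlineLooksLikeGateway("node\0/app/dist/index.js\0gateway\0"));
```

Keeping the string test separate from the file read makes the matching rule easy to unit-test without a real /proc filesystem.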
Testing
🦞 Fix discovered while deploying Flawd to Fly.io