fix: prevent self-reinforcing error loop in deployment creation #3860

Open
vcode-sh wants to merge 2 commits into Dokploy:canary from vcode-sh:fix/deployment-error-loop
@vcode-sh vcode-sh commented Mar 2, 2026

Summary

Fixes #3752

A transient failure during removeLastTenDeployments() (called at the start of every createDeployment()) creates a permanent, self-reinforcing error loop that prevents all future deployments for the affected application.

Root Cause

  1. createDeployment() calls removeLastTenDeployments() first
  2. If any old deployment removal fails (e.g. transient DB/network error), the catch block inserts an error deployment record with logPath: ""
  3. path.join("") returns ".", so the logPath !== "." guard is bypassed
  4. On the next deploy attempt, removeLastTenDeployments() tries to clean up this poisoned record
  5. removeDeployment() runs rm -f . (or rm -f with empty string) which fails
  6. The failure triggers the catch block again, inserting another error record with logPath: ""
  7. The error count grows by 1+ on every attempt, and the application can never deploy again
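The mismatch in step 3 can be reproduced in isolation. The following is a minimal sketch, not the project's code: `naiveShouldRemove` is a hypothetical guard that inspects only the raw stored value, which is one way an empty `logPath` could slip past a `!== "."` check and still normalize to `"."` afterwards.

```typescript
import * as path from "node:path";

// Hypothetical guard (assumption): compares the raw stored value to ".".
function naiveShouldRemove(storedLogPath: string): boolean {
  return storedLogPath !== ".";
}

// Node's path.join ignores zero-length segments and returns "." when the
// joined result would otherwise be an empty string.
const stored = "";
const normalized = path.join(stored);

// The guard approves the removal, yet the path handed to the shell is ".".
console.log(naiveShouldRemove(stored), JSON.stringify(normalized));
```

Checking the value after normalization, or rejecting empty strings up front, closes the gap.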

Changes

1. Make removeDeployment() idempotent

  • Return undefined instead of throwing when a deployment is already deleted (race condition / concurrent cleanup)
  • Guard against empty/invalid logPath before running shell rm -f command
  • Fix copy-paste error: "Error creating the deployment" → "Error removing the deployment"
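The idempotency change can be illustrated with a small sketch. It assumes a hypothetical in-memory `table` standing in for the deployments table and a trimmed-down `Deployment` shape. The real function is asynchronous and talks to the database, so this only shows the return-instead-of-throw behavior.

```typescript
// Hypothetical record shape (assumption); the real Deployment type has more fields.
interface Deployment {
  deploymentId: string;
  logPath: string;
}

// In-memory stand-in for the deployments table, for illustration only.
const table = new Map<string, Deployment>();

// Returns the removed record, or undefined when it is already gone.
function removeDeployment(deploymentId: string): Deployment | undefined {
  const deployment = table.get(deploymentId);
  if (!deployment) {
    // Already deleted (for example by a concurrent cleanup): not an error.
    return undefined;
  }
  table.delete(deploymentId);
  return deployment;
}
```

Calling `removeDeployment` twice with the same ID returns the record the first time and `undefined` the second time, without throwing.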

2. Make removeLastTenDeployments() resilient

  • Wrap each individual deployment removal in its own try-catch
  • Log errors with console.error (including deployment ID) but don't propagate
  • Guard execAsyncRemote against empty command strings
  • Cleanup of old records should never block creation of new deployments
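The per-item try-catch pattern might look like the sketch below. `cleanupOldDeployments` and the `remove` callback are hypothetical names, and returning the list of failures is an addition for illustration. The PR itself only describes logging the error and moving on.

```typescript
// Hypothetical failure record (assumption), used only to make the sketch testable.
interface CleanupFailure {
  id: string;
  error: unknown;
}

// Removes each deployment independently; one failure never stops the loop
// and is never rethrown to the caller.
async function cleanupOldDeployments(
  ids: string[],
  remove: (id: string) => Promise<void>,
): Promise<CleanupFailure[]> {
  const failures: CleanupFailure[] = [];
  for (const id of ids) {
    try {
      await remove(id);
    } catch (error) {
      // Log with the deployment ID so the failure can be traced later.
      console.error(`Failed to remove deployment ${id}:`, error);
      failures.push({ id, error });
    }
  }
  return failures;
}
```

Because the function resolves even when every removal fails, a caller such as `createDeployment()` can continue to insert the new record.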

3. Make removeLastFiveDeployments() resilient

  • Same try-catch pattern for server deployment cleanup
  • Add logPath !== "." and logPath !== "none" guards (consistent with removeLastTenDeployments)

4. Fix poisoned logPath: "" in error records

  • Change logPath: "" to logPath: "none" in all catch blocks of:
    • createDeployment()
    • createDeploymentPreview()
    • createDeploymentCompose()
    • createDeploymentBackup()
    • createDeploymentSchedule()
    • createDeploymentVolumeBackup()
  • Update removeDeployment() logPath guard to also skip "none"
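A sentinel-aware guard combined with the empty-command check could be sketched as follows. `NO_LOG_PATH`, `isRemovableLogPath`, `shellQuote`, and `buildRemoveCommand` are hypothetical helpers rather than names from the PR, and the guard does not try to cover every degenerate path.

```typescript
import * as path from "node:path";

// Sentinel value described in this PR for error records without a log file.
const NO_LOG_PATH = "none";

// Rejects empty strings, the sentinel, and anything that normalizes to ".".
function isRemovableLogPath(logPath: string): boolean {
  const trimmed = logPath.trim();
  if (trimmed === "" || trimmed === NO_LOG_PATH) return false;
  return path.join(trimmed) !== ".";
}

// Wraps a value in single quotes for a POSIX shell, escaping embedded quotes.
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, "'\\''")}'`;
}

// Builds one `rm -f` per removable path; returns "" when nothing qualifies,
// so the caller can skip executing a command at all.
function buildRemoveCommand(logPaths: string[]): string {
  return logPaths
    .filter(isRemovableLogPath)
    .map((p) => `rm -f ${shellQuote(p)};`)
    .join(" ");
}
```

A caller would run the result only when it is non-empty, which corresponds to the guard on empty command strings listed under change 2.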

How to Reproduce

  1. Deploy an application with >10 deployment records
  2. Simulate a transient failure during removeLastTenDeployments() (e.g. DB connection timeout)
  3. Observe error deployment record created with logPath: ""
  4. Attempt to deploy again — it will fail permanently with "Error creating the deployment"
  5. Each retry adds more error records, making recovery progressively harder

Verification

  • pnpm check passes (Biome formatting/linting)
  • pnpm --filter=server typecheck passes (TypeScript)
  • removeDeployment return type changes from Deployment to Deployment | undefined — no callers use the return value from cleanup paths

Test Plan

  • Verify pnpm check passes
  • Verify pnpm --filter=server typecheck passes
  • Manual test: deploy an application normally (no regression)
  • Manual test: insert a deployment record with logPath: "" or logPath: "none", verify next deployment succeeds instead of looping
  • Review: confirm no callers depend on removeDeployment throwing when already deleted

Greptile Summary

Fixes a critical self-reinforcing error loop that permanently blocked deployments after a transient failure.

Key Changes:

  • Changed error deployment records from logPath: "" to logPath: "none" to prevent path.join("") returning ".", which bypassed safety guards and caused rm -f . commands to fail
  • Made removeDeployment() idempotent by returning undefined instead of throwing when deployment already deleted (handles race conditions in concurrent cleanup)
  • Added resilient error handling in removeLastTenDeployments() and removeLastFiveDeployments() - each deployment removal is wrapped in try-catch, logged on failure, but doesn't block cleanup of other deployments
  • Added guards to prevent file operations on empty, ".", or "none" paths before executing shell commands
  • Added safety check to skip execAsyncRemote() when command string is empty

Impact:
The fix prevents a scenario where a single transient failure during deployment cleanup creates a poisoned database record that causes all subsequent deployments to fail, with the error count growing on each attempt. The changes make the system self-healing by ensuring cleanup operations never block new deployments.

API Note:
The deployment.removeDeployment tRPC endpoint now returns Deployment | undefined instead of throwing when the deployment doesn't exist. This makes the endpoint idempotent and the frontend handles this correctly.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it fixes a critical production bug with well-designed defensive changes
  • Score reflects: (1) clear identification and fix of root cause, (2) consistent implementation across all affected functions with proper guards, (3) improved error isolation prevents cascading failures, (4) type changes are backward compatible in practice, (5) changes are focused and don't introduce new complexity
  • No files require special attention - the single modified file contains focused, well-structured changes

Last reviewed commit: bf1f1cf


vcode-sh and others added 2 commits March 2, 2026 09:33
When deployment creation fails (e.g. transient DB/network error during
removeLastTenDeployments), the catch block writes an error deployment
record with logPath: "". On the next deploy attempt, cleanup tries to
delete this record, runs `rm -f ` with an empty path, fails, and the
cycle repeats — permanently blocking deployments.

Changes:
- Make removeDeployment() idempotent (return undefined instead of
  throwing when already deleted, guard against empty/invalid logPath)
- Fix copy-paste error message "Error creating" → "Error removing"
- Wrap each deployment removal in removeLastTenDeployments() and
  removeLastFiveDeployments() in individual try-catch so cleanup of
  old records never blocks creation of new deployments
- Use logPath: "none" instead of "" in error deployment records to
  prevent path.join() producing "." which bypasses guards

Fixes Dokploy#3752

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@vcode-sh vcode-sh requested a review from Siumauricio as a code owner March 2, 2026 08:35
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and bug (Something isn't working) labels Mar 2, 2026

dosubot bot commented Mar 2, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.



@greptile-apps greptile-apps bot left a comment


1 file reviewed, no comments



Development

Successfully merging this pull request may close these issues.

An error have occured: Deployment not found
