feat: add zombie DAG run detection to scheduler#1163

Merged
yottahmd merged 6 commits into main from 1130-zombie-detector
Aug 6, 2025

@yottahmd
Collaborator

@yottahmd yottahmd commented Aug 5, 2025

Overview
Implements automatic detection and cleanup of zombie DAG runs: runs that are still marked as running although their underlying process is no longer alive.

Feedback-by: @jonasban
Issue: #1130

Changes

  • Add ZombieDetector that periodically checks running DAG runs
  • Configure via zombieDetectionInterval (default 45s, 0 to disable)
  • Support environment variable DAGU_SCHEDULER_ZOMBIE_DETECTION_INTERVAL
  • Update zombie DAG runs to Error status with appropriate node error messages
  • Add panic recovery and comprehensive unit tests
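
As a rough illustration of the changes listed above, here is a minimal sketch of a periodic zombie sweep. It is not the code from this PR: `RunInfo`, `sweep`, `detect`, and `intervalFromEnv` are hypothetical names, the liveness check is reduced to a boolean field, the error message is made up, and it assumes the environment variable holds a Go duration string such as `45s`.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// RunInfo is a hypothetical, simplified record of a DAG run.
type RunInfo struct {
	ID      string
	Status  string // "running" or "error" in this sketch
	Alive   bool   // stand-in for a real process/heartbeat liveness check
	Message string
}

// intervalFromEnv reads the detection interval from the environment and
// falls back to def when the variable is unset or invalid.
// Assumption: the value is a Go duration string such as "45s".
func intervalFromEnv(def time.Duration) time.Duration {
	v := os.Getenv("DAGU_SCHEDULER_ZOMBIE_DETECTION_INTERVAL")
	if v == "" {
		return def
	}
	d, err := time.ParseDuration(v)
	if err != nil || d < 0 {
		return def
	}
	return d
}

// sweep marks every run that claims to be running but has no live process
// as failed, and returns how many runs it updated.
func sweep(runs []*RunInfo) (n int) {
	// Recover so that a panic during one sweep does not kill the detector.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("zombie sweep panicked:", r)
		}
	}()
	for _, r := range runs {
		if r.Status == "running" && !r.Alive {
			r.Status = "error"
			r.Message = "process terminated unexpectedly" // illustrative text
			n++
		}
	}
	return n
}

// detect runs sweep once per interval until ctx is cancelled.
// An interval of 0 disables detection.
func detect(ctx context.Context, interval time.Duration, runs []*RunInfo) {
	if interval <= 0 {
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sweep(runs)
		}
	}
}

func main() {
	fmt.Println("configured interval:", intervalFromEnv(45*time.Second))

	runs := []*RunInfo{
		{ID: "a", Status: "running", Alive: true},
		{ID: "b", Status: "running", Alive: false}, // the zombie
	}
	// Use a short interval here so the demo finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	detect(ctx, 10*time.Millisecond, runs)

	for _, r := range runs {
		fmt.Println(r.ID, r.Status, r.Message)
	}
}
```

In this sketch the recovery sits inside `sweep`, so one failed pass is logged and the ticker loop keeps running; where the actual implementation places its recovery is not shown on this page.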

@jonasban

jonasban commented Aug 5, 2025

Jesus, I have not seen an open source project move so fast. I am truly blown away. ❤️

@jonasban

jonasban commented Aug 5, 2025

One question: how expensive is this ZombieDetector process? 45s seems a bit long to wait and I would set it to a value within 2-5s to get faster feedback in the UI. Would this be a bad idea?

@yottahmd
Collaborator Author

yottahmd commented Aug 6, 2025

@jonasban It probably won't be too expensive when there are only a few DAG runs (say, 10) on file-based storage, though that can vary with the hardware. The heartbeat timeout is currently 45s, so that's how long it would take to catch a zombie run anyway. Under normal circumstances zombies are relatively rare, so I'm not sure it's worth lowering the interval to something like 2-5s. What do you think?

Still, it would be nice to make them configurable if users need more control.
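
To put rough numbers on that trade-off, here is a small sketch. It assumes a simplified model that is not confirmed by this PR: a run only counts as a zombie once its last heartbeat is older than the heartbeat timeout, and the detector polls once per interval, so the worst-case delay is about the timeout plus one interval. `worstCaseDelay` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseDelay estimates the longest time between a process dying and the
// detector noticing, under the assumed model: the heartbeat must first go
// stale (heartbeatTimeout), then the next poll can be up to one interval away.
func worstCaseDelay(heartbeatTimeout, interval time.Duration) time.Duration {
	return heartbeatTimeout + interval
}

func main() {
	timeout := 45 * time.Second // heartbeat timeout mentioned above
	for _, interval := range []time.Duration{
		2 * time.Second,
		5 * time.Second,
		45 * time.Second,
	} {
		fmt.Printf("interval=%v worst-case=%v\n",
			interval, worstCaseDelay(timeout, interval))
	}
}
```

Under this model a 2-5s interval trims the worst case from roughly 90s to roughly 47-50s, and the 45s heartbeat timeout remains the dominant term either way.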

@yottahmd yottahmd merged commit 6f02ac2 into main Aug 6, 2025
4 checks passed
@yottahmd yottahmd deleted the 1130-zombie-detector branch August 6, 2025 12:47
@codecov

codecov bot commented Aug 6, 2025

Codecov Report

❌ Patch coverage is 76.74419% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.42%. Comparing base (209f015) to head (833265b).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/scheduler/zombie_detector.go | 81.05% | 13 Missing and 5 partials ⚠️ |
| internal/dagrun/manager.go | 25.00% | 4 Missing and 2 partials ⚠️ |
| internal/scheduler/scheduler.go | 62.50% | 5 Missing and 1 partial ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
+ Coverage   65.37%   65.42%   +0.05%     
==========================================
  Files         126      127       +1     
  Lines       18769    18893     +124     
==========================================
+ Hits        12270    12361      +91     
- Misses       5516     5540      +24     
- Partials      983      992       +9     
| Files with missing lines | Coverage Δ |
|---|---|
| internal/config/config.go | 74.46% <ø> (ø) |
| internal/config/loader.go | 87.88% <100.00%> (+0.22%) ⬆️ |
| internal/dagrun/manager.go | 42.16% <25.00%> (-0.27%) ⬇️ |
| internal/scheduler/scheduler.go | 53.80% <62.50%> (+0.35%) ⬆️ |
| internal/scheduler/zombie_detector.go | 81.05% <81.05%> (ø) |

... and 3 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 209f070...833265b.
