ilyam8 commented Dec 11, 2025

The manager previously called job.Stop() while holding the runningJobs mutex.
Stop() is a blocking call: it waits until the job's goroutine exits. As a result, when a job was stuck in runOnce(), the manager kept the mutex locked for an extended time.

Summary by cubic

Moved job Stop() out of the runningJobs mutex to avoid blocking the manager when stopping slow or stuck jobs. This keeps other jobs responsive during stop operations.

  • Bug Fixes
    • stopRunningJob now removes the job under lock, unlocks, then calls Stop().
    • startRunningJob calls stopRunningJob before locking, then starts and registers the new job.
    • Prevents global stalls caused by a slow Stop() while holding the mutex.

Written for commit eaf7bd4.

cubic-dev-ai bot left a comment

No issues found across 1 file

Copilot AI left a comment

Pull request overview

This PR fixes a mutex stall in the job manager: calling job.Stop() while holding the runningJobs mutex could block all job operations if a job was stuck in runOnce().

Key Changes:

  • Modified startRunningJob to call stopRunningJob before acquiring the mutex, preventing the lock from being held during the blocking stop operation
  • Refactored stopRunningJob to remove the job from the map and explicitly release the mutex before calling the blocking Stop() method
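
One way to check this property in a regression test is to probe the mutex from inside the stop call. The sketch below is an assumption-laden illustration rather than a test from the PR: `registry`, `remove`, and `lockFreeDuringStop` are hypothetical names, and it relies on `sync.Mutex.TryLock`, which is available from Go 1.18.

```go
package main

import "sync"

// registry is a hypothetical stand-in for the manager's runningJobs map.
// Each entry maps a job name to its (possibly blocking) stop function.
type registry struct {
	mux  sync.Mutex
	jobs map[string]func()
}

// remove follows the refactored order: delete under the lock, unlock,
// then run the stop function.
func (r *registry) remove(name string) {
	r.mux.Lock()
	stop, ok := r.jobs[name]
	delete(r.jobs, name)
	r.mux.Unlock()

	if ok {
		stop()
	}
}

// lockFreeDuringStop reports whether the registry mutex could be
// acquired while the stop function was executing.
func lockFreeDuringStop() bool {
	r := &registry{jobs: map[string]func(){}}
	free := false
	r.jobs["probe"] = func() {
		// TryLock succeeds only if nobody currently holds the mutex.
		if r.mux.TryLock() {
			free = true
			r.mux.Unlock()
		}
	}
	r.remove("probe")
	return free
}
```

If `remove` were changed back to call `stop()` before unlocking, `TryLock` would fail and `lockFreeDuringStop` would return false, which makes the probe usable as a timing-independent check.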

ilyam8 enabled auto-merge (squash) December 11, 2025 12:40
ilyam8 merged commit 510f134 into netdata:master Dec 11, 2025
133 of 134 checks passed
ilyam8 deleted the fix-go.d-jobmgr-slow-stop branch December 11, 2025 12:49
stelfrag pushed a commit to stelfrag/netdata that referenced this pull request Dec 11, 2025
stelfrag mentioned this pull request Dec 11, 2025
Ferroin pushed a commit that referenced this pull request Dec 15, 2025
Labels

area/collectors (Everything related to data collection), area/go, collectors/go.d

2 participants