Make plugin metric registration thread safe #6532
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6532 +/- ##
==========================================
+ Coverage 58.66% 58.67% +0.01%
==========================================
Files 938 938
Lines 71466 71485 +19
==========================================
+ Hits 41928 41947 +19
+ Misses 26351 26349 -2
- Partials 3187 3189 +2
Signed-off-by: Jason Parraga <[email protected]>
    }

    // Acquire read lock for fast read, this is the happy case
    t.taskMetricsMapMutex.RLock()
After startup there will likely be no writes to this map, so taking a shared read lock first is a minor optimization that avoids lock contention on the common path.
Signed-off-by: Jason Parraga <[email protected]>
machichima
left a comment
LGTM
* Make plugin metric registration thread safe
Signed-off-by: Jason Parraga <[email protected]>
* Address feedback
Signed-off-by: Jason Parraga <[email protected]>
---------
Signed-off-by: Jason Parraga <[email protected]>
Tracking issue
Closes #4046
Why are the changes needed?
Propeller workers sometimes panic on startup, which kills the affected workers (goroutines) for the remainder of the process.
Previously, this code could panic due to concurrent map access or duplicate Prometheus metrics being registered. The issue occurs when the Flyte Propeller workers exercise a plugin for the first time on startup: if several workers execute the same plugin in parallel, they can race while registering its metrics.
What changes were proposed in this pull request?
This pull request makes the plugin metric registration thread safe by guarding operations on the map with a read/write mutex. Lookups take a shared read lock on the happy path, and the read-modify-write operation is guarded by a write lock.
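As a rough illustration of the locking pattern described above (not the actual Propeller code), here is a minimal, self-contained sketch. The names `metricsCache`, `taskMetrics`, `getOrRegister`, and `countRegistrations` are hypothetical, and the `register` callback only stands in for Prometheus metric registration; the one identifier taken from the diff is the idea of a read/write mutex guarding the task metrics map.

```go
package main

import (
	"fmt"
	"sync"
)

// taskMetrics is a hypothetical stand-in for a plugin's registered metrics.
type taskMetrics struct {
	name string
}

// metricsCache is a hypothetical stand-in for the struct that owns the map.
type metricsCache struct {
	mu      sync.RWMutex
	metrics map[string]*taskMetrics
	// register simulates metric registration, which must run once per key.
	register func(name string) *taskMetrics
}

// getOrRegister returns the metrics for name, registering them at most once.
func (c *metricsCache) getOrRegister(name string) *taskMetrics {
	// Happy path: a shared read lock, so steady-state lookups do not contend.
	c.mu.RLock()
	m, ok := c.metrics[name]
	c.mu.RUnlock()
	if ok {
		return m
	}

	// Slow path: the read-modify-write runs under the exclusive write lock.
	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check, because another goroutine may have registered the metrics
	// between releasing the read lock and acquiring the write lock.
	if existing, found := c.metrics[name]; found {
		return existing
	}
	m = c.register(name)
	c.metrics[name] = m
	return m
}

// countRegistrations starts the given number of goroutines that all request
// the same key and reports how many times registration actually ran.
func countRegistrations(workers int) int {
	calls := 0
	c := &metricsCache{metrics: map[string]*taskMetrics{}}
	c.register = func(name string) *taskMetrics {
		calls++ // only ever invoked while the write lock is held
		return &taskMetrics{name: name}
	}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.getOrRegister("plugin-a")
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println(countRegistrations(64)) // prints 1
}
```

The second lookup under the write lock is what prevents duplicate registration: without it, two workers that both miss on the read path would each register the same metrics.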
How was this patch tested?
None yet, but will be tested in our production environment in the next few days.
Summary by Bito
This pull request resolves a critical issue and enhances thread safety in Flyte Propeller workers by implementing read/write mutex locks for task metrics registration. This prevents race conditions and duplicate metric registrations, improving system reliability and stability in concurrent plugin environments.