
Conversation

@Sovietaced
Member

@Sovietaced Sovietaced commented Jul 13, 2025

Tracking issue

Closes #4046

Why are the changes needed?

Propeller workers sometimes panic on startup, which kills the affected worker goroutines for the rest of the process's lifetime.

Previously, this code could panic due to concurrent map access as well as duplicate Prometheus metric registration. The issue occurs when the Flyte Propeller workers exercise a plugin for the first time on startup: if several workers execute the same plugin in parallel, they can race with each other.

What changes were proposed in this pull request?

This pull request makes plugin metric registration thread-safe by guarding operations on the map with a read/write mutex. Specifically, it guards the read-modify-write operation with a write lock.
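For readers who want to see the shape of the pattern, here is a minimal sketch of a read/write-mutex-guarded "get or register" on a map. It is not the code from handler.go: `metricsRegistry`, `taskMetrics`, `getOrRegister`, `registerConcurrently`, and the plugin name `"spark"` are all hypothetical names chosen for illustration, and the `registers` counter stands in for the real Prometheus registration call.

```go
package main

import (
	"fmt"
	"sync"
)

// taskMetrics is a hypothetical stand-in for a plugin's metric set.
type taskMetrics struct {
	name string
}

// metricsRegistry is a hypothetical type; only the idea of an
// RWMutex-guarded map comes from the PR description.
type metricsRegistry struct {
	mu        sync.RWMutex
	metrics   map[string]*taskMetrics
	registers int // how many times a metric set was actually created
}

func newMetricsRegistry() *metricsRegistry {
	return &metricsRegistry{metrics: make(map[string]*taskMetrics)}
}

// getOrRegister returns the metrics for a plugin, creating them at most once.
func (r *metricsRegistry) getOrRegister(plugin string) *taskMetrics {
	// Fast path: shared read lock, expected to hit after startup.
	r.mu.RLock()
	m, ok := r.metrics[plugin]
	r.mu.RUnlock()
	if ok {
		return m
	}

	// Slow path: exclusive lock around the whole read-modify-write.
	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check: another goroutine may have registered between the two locks.
	if m, ok := r.metrics[plugin]; ok {
		return m
	}
	m = &taskMetrics{name: plugin}
	r.registers++ // stands in for the one-time metric registration
	r.metrics[plugin] = m
	return m
}

// registerConcurrently simulates workers hitting the same plugin at startup.
func registerConcurrently(r *metricsRegistry, plugin string, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.getOrRegister(plugin)
		}()
	}
	wg.Wait()
}

func main() {
	r := newMetricsRegistry()
	registerConcurrently(r, "spark", 32)
	fmt.Println("registrations:", r.registers) // prints "registrations: 1"
}
```

The second lookup under the write lock is what prevents a duplicate registration when two goroutines both miss on the fast path.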

How was this patch tested?

None yet, but will be tested in our production environment in the next few days.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This pull request resolves a critical issue and enhances thread safety in Flyte Propeller workers by implementing read/write mutex locks for task metrics registration. This prevents race conditions and duplicate metric registrations, improving system reliability and stability in concurrent plugin environments.

@Sovietaced Sovietaced added the fixed and review-needed labels Jul 13, 2025
@flyte-bot
Collaborator

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

@codecov

codecov bot commented Jul 13, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 58.67%. Comparing base (b200b5d) to head (f890ce1).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...lytepropeller/pkg/controller/nodes/task/handler.go | 80.00% | 3 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6532      +/-   ##
==========================================
+ Coverage   58.66%   58.67%   +0.01%     
==========================================
  Files         938      938              
  Lines       71466    71485      +19     
==========================================
+ Hits        41928    41947      +19     
+ Misses      26351    26349       -2     
- Partials     3187     3189       +2     

| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 59.03% <ø> (ø) |
| unittests-flyteadmin | 56.22% <ø> (+0.02%) ⬆️ |
| unittests-flytecopilot | 39.56% <ø> (ø) |
| unittests-flytectl | 64.72% <ø> (ø) |
| unittests-flyteidl | 76.12% <ø> (ø) |
| unittests-flyteplugins | 61.14% <ø> (+<0.01%) ⬆️ |
| unittests-flytepropeller | 54.84% <80.00%> (+<0.01%) ⬆️ |
| unittests-flytestdlib | 64.04% <ø> (ø) |

@Sovietaced Sovietaced marked this pull request as ready for review July 13, 2025 17:27
}

// Acquire read lock for fast read, this is the happy case
t.taskMetricsMapMutex.RLock()
Member Author

After startup there will likely be no writes to this map, so taking a shared read lock is a minor optimization that avoids lock contention.

Signed-off-by: Jason Parraga <[email protected]>
@Sovietaced Sovietaced requested a review from machichima July 18, 2025 21:06
Member

@machichima machichima left a comment

LGTM

@Sovietaced Sovietaced merged commit fd40b61 into flyteorg:master Jul 19, 2025
53 of 54 checks passed
@Sovietaced Sovietaced deleted the issue-4046 branch July 19, 2025 00:02
Sovietaced added a commit to Sovietaced/flyte that referenced this pull request Jul 29, 2025
* Make plugin metric registration thread safe

Signed-off-by: Jason Parraga <[email protected]>

* Address feedback

Signed-off-by: Jason Parraga <[email protected]>

---------

Signed-off-by: Jason Parraga <[email protected]>

Labels

fixed, review-needed

Development

Successfully merging this pull request may close these issues.

[BUG] Flyte propeller fatal error: concurrent map writes

3 participants