fix: FIFO buffer channel for status events to prevent slow status API blocking #7522
Conversation
If a status API is slow to respond, it can cause OPA to block while writing to an unbuffered channel. This fixes it by using a buffered channel that never blocks but drops the oldest status update when full. Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
LGTM 👍
Just some nits.
v1/plugins/status/plugin_test.go (Outdated)

```go
// make sure the lastBundleStatuses has been written so the trigger sends the expected status
// otherwise there could be a race condition before the bundle status is written
time.Sleep(10 * time.Millisecond)
```
This could fail in CI because of an extremely slow agent. Might not be an issue, though.
If it becomes one, we could revisit and wait for the right condition rather than sleeping, e.g. by looping in the goroutine below until `len(fixture.plugin.bulkBundleCh) >= 1` (with an extremely short sleep per iteration) before calling `Trigger()` (see e.g. `test.Eventually()`).
Unfortunately the race detector doesn't like `test.Eventually` also checking the `lastBundleStatuses` variable, and I'm not sure how to resolve that without adding a lock 🤔 We could have `p.loop` accept a function that sets the channels, so the test can swap in one guarded by a mutex while production code keeps one that isn't.
I just pushed a commit with a possible alternative: 5afa765. It removes the need for `p.loop` and performs the same steps within the test itself, so we don't have to worry about the uncertainty of another goroutine. I think it overall still tests the same thing? I split the tests because only the second half needs this. Not having a sleep in a test would help me sleep better at night haha
Co-authored-by: Johan Fylling <[email protected]> Signed-off-by: Sebastian Spaink <[email protected]>
Nice! 👍
```diff
 go func() {
-	_ = fixture.plugin.Trigger(ctx)
+	_ = fixture.plugin.Trigger(context.Background())
 }()
```
I might not see the full picture here, but I don't understand why we need this trigger goroutine instead of simply calling `fixture.plugin.oneShot(context.Background())` in the other routine. I think the test asserts what it's supposed to assert, though, so I won't hold this PR up over this nit.
Why are the changes in this PR needed?

Resolves: #7492

As described in the issue above, if a status API is slow to respond, it can cause OPA to block while writing to the `bulkBundleCh`.

What are the changes in this PR?

@mjungsbluth thank you for the suggested change 🥳

Changed the channels to buffered channels. Reusing the same logic as the decision log event buffer, the `UpdateBundleStatus` and `BulkUpdateBundleStatus` methods now add to the buffered channel without ever blocking. If the buffered channel is full, the oldest status update is dropped. In case that freed spot is taken by another concurrent call to `BulkUpdateBundleStatus`, it retries up to 1000 times before dropping the incoming status event instead. It should never block.

The limit is configurable with a new option, `buffer_status_limit`, which defaults to 10. Not sure if there is a better default; if the status API is slow, it's probably better not to overwhelm it by default?