fix: FIFO buffer channel for status events to prevent slow status API blocking #7522
Conversation
If a status API is slow to respond, it can cause OPA to block while writing to an unbuffered channel. This fixes it by using a buffered channel that never blocks but drops the oldest status update when full. Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
LGTM 👍
Just some nits.
v1/plugins/status/plugin_test.go (Outdated)

```go
// make sure the lastBundleStatuses has been written so the trigger sends the expected status
// otherwise there could be a race condition before the bundle status is written
time.Sleep(10 * time.Millisecond)
```
This could fail in CI because of an extremely slow agent. Might not be an issue, though.
If it becomes one, we could revisit and wait for the right condition rather than sleeping, e.g. by looping in the goroutine below until `len(fixture.plugin.bulkBundleCh) >= 1` (with an extremely short sleep per iteration) before calling `Trigger()` (see e.g. `test.Eventually()`).
Unfortunately the race detector doesn't like `test.Eventually` also checking the `lastBundleStatuses` variable, and I'm not sure how to resolve that without adding a lock 🤔 We could have `p.loop` accept a function that sets the channels, so the test can swap in one guarded by a mutex while production code keeps one that isn't.
I just pushed a commit with a possible alternative: 5afa765. It removes the need for `p.loop` and performs the same steps within the test itself, so we don't have to worry about the uncertainty of another goroutine. I think it overall still tests the same thing? I split the tests because only the second half needs this. Not having a sleep in a test would help me sleep better at night haha
Co-authored-by: Johan Fylling <[email protected]> Signed-off-by: Sebastian Spaink <[email protected]>
Nice! 👍
```diff
 go func() {
-	_ = fixture.plugin.Trigger(ctx)
+	_ = fixture.plugin.Trigger(context.Background())
 }()
```
I might not see the full picture here, but I don't understand why we need this trigger goroutine instead of simply calling `fixture.plugin.oneShot(context.Background())` in the other routine. I think the test asserts what it's supposed to assert, though, so I won't hold this PR up over this nit.
Why are the changes in this PR needed?

Resolves: #7492

As described in the issue above, if a status API is slow to respond, it can cause OPA to block while writing to the `bulkBundleCh`.

What are the changes in this PR?

@mjungsbluth thank you for the suggested change 🥳

Changed the channels to buffered channels. Reusing the same logic as the decision log event buffer, the `UpdateBundleStatus` and `BulkUpdateBundleStatus` methods now add to the buffered channel without ever blocking. If the buffered channel is full, the oldest status update is dropped. In case that freed spot is taken by another concurrent call to `BulkUpdateBundleStatus`, it retries up to 1000 times before dropping the incoming status event instead. It should never block.

The limit is configurable with a new option, `buffer_status_limit`, which defaults to 10. Not sure if there is a better default; if the status API is slow, it's probably better not to overwhelm it by default?