Conversation

@sspaink
Member

@sspaink sspaink commented Mar 12, 2025

Why the changes in this PR are needed?

resolves: #5724

The existing buffer used by the decision logs plugin suffers from performance issues at large scale. The heavy use of locks and the point at which events are compressed are the culprits for this slowdown.

What are the changes in this PR?

This PR introduces a new buffer implementation for the Decisions Log Plugin with the goal of improving performance.

Two new configuration options are introduced:

  • decision_logs.reporting.buffer_type - selects the buffer implementation: "event" enables the new buffer; the default, "size", keeps the current implementation
  • decision_logs.reporting.buffer_size_limit_events - sets the maximum number of events the buffer can hold; must be greater than zero and defaults to 100
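
To illustrate what an event-count limit means in practice, here is a minimal sketch of a bounded buffer built on a Go channel. It is not the code from this PR: the names Event, EventBuffer, NewEventBuffer, Push, Drain and Dropped are hypothetical, and the drop-oldest policy when the buffer is full is an assumption chosen for illustration. The only details taken from the description above are that the limit counts events and must be above zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Event is a hypothetical stand-in for a decision log entry.
type Event struct {
	DecisionID string
}

// EventBuffer is a hypothetical bounded buffer; the names are illustrative only.
type EventBuffer struct {
	ch      chan Event
	dropped atomic.Int64
}

// NewEventBuffer rejects limits that are not above zero.
func NewEventBuffer(limit int) (*EventBuffer, error) {
	if limit <= 0 {
		return nil, fmt.Errorf("buffer size limit must be above zero, got %d", limit)
	}
	return &EventBuffer{ch: make(chan Event, limit)}, nil
}

// Push never blocks. If the buffer is full, it discards the oldest
// event (an assumed policy) and retries the send.
func (b *EventBuffer) Push(e Event) {
	for {
		select {
		case b.ch <- e:
			return
		default:
		}
		select {
		case <-b.ch:
			b.dropped.Add(1)
		default:
			// Another goroutine drained the buffer in the meantime; retry.
		}
	}
}

// Drain removes and returns everything currently buffered.
func (b *EventBuffer) Drain() []Event {
	var out []Event
	for {
		select {
		case e := <-b.ch:
			out = append(out, e)
		default:
			return out
		}
	}
}

// Dropped reports how many events were discarded to make room.
func (b *EventBuffer) Dropped() int64 { return b.dropped.Load() }

func main() {
	buf, err := NewEventBuffer(2)
	if err != nil {
		panic(err)
	}
	for i := 1; i <= 3; i++ {
		buf.Push(Event{DecisionID: fmt.Sprintf("d%d", i)})
	}
	fmt.Println(buf.Drain(), buf.Dropped()) // prints "[{d2} {d3}] 1"
}
```

Because the channel handles synchronization, concurrent writers do not need a shared mutex around the buffer, which is the general idea behind reducing lock contention.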

Simple performance comparison

The following setup was used for both the old and new buffer implementation:

example.rego:

package example

allow if {
    true
}

Run the OPA server

$ ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego

Run a local server to consume logs (code runs a server and prints out decisions)

$ go run main.go

Config (opa-conf.yaml):

services:
  logeater:
    url: http://localhost:8080

decision_logs:
  service: logeater
  reporting:
    buffer_type: <changed per test>
    min_delay_seconds: 5
    max_delay_seconds: 10

Results using the "event" buffer type:

buffer_type: event
buffer_size_limit_events: 1000

$ echo 'POST http://localhost:8181/v1/data/example/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 671.375µs
Latencies     [min, mean, 50, 90, 95, 99, max]  101.834µs, 681.405µs, 673.425µs, 797.284µs, 856.462µs, 1.41ms, 6.776ms

Results using the "size" buffer type:

buffer_type: size

$ echo 'POST http://localhost:8181/v1/data/example/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 572.833µs
Latencies     [min, mean, 50, 90, 95, 99, max]  96.042µs, 930.795µs, 585.86µs, 807.216µs, 884.121µs, 6.514ms, 80.595ms
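
To make the two vegeta reports easier to compare, the small program below computes size-to-event latency ratios from the figures above. It is only a convenience sketch: the ratio helper is hypothetical, and a single run per buffer type is not a rigorous benchmark. A ratio above 1 means the "size" buffer was slower for that statistic.

```go
package main

import (
	"fmt"
	"time"
)

// ratio returns sizeBuf / eventBuf for two duration strings.
func ratio(sizeBuf, eventBuf string) (float64, error) {
	s, err := time.ParseDuration(sizeBuf)
	if err != nil {
		return 0, err
	}
	e, err := time.ParseDuration(eventBuf)
	if err != nil {
		return 0, err
	}
	if e == 0 {
		return 0, fmt.Errorf("event buffer duration is zero")
	}
	return float64(s) / float64(e), nil
}

func main() {
	// Values copied from the vegeta reports above.
	rows := []struct{ name, size, event string }{
		{"mean", "930.795µs", "681.405µs"},
		{"p50", "585.86µs", "673.425µs"},
		{"p99", "6.514ms", "1.41ms"},
		{"max", "80.595ms", "6.776ms"},
	}
	for _, row := range rows {
		r, err := ratio(row.size, row.event)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-4s size/event = %.2f\n", row.name, r)
	}
}
```

In this run the median is slightly higher with the "event" buffer, while the mean, p99 and max are lower (roughly 1.4x, 4.6x and 11.9x respectively), so the gain shows up mostly in the tail.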

Notes to assist PR review:

Please see the included README.md for a comparison between the current and this new buffer implementation.

@sspaink sspaink requested a review from johanfylling March 12, 2025 23:59
@netlify

netlify bot commented Mar 13, 2025

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit fa26f44
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/67e1c658f8be850008fa68bb
😎 Deploy Preview https://deploy-preview-7446--openpolicyagent.netlify.app

@sspaink sspaink force-pushed the improvelogbuffer branch 2 times, most recently from 61267a3 to 8d17daa Compare March 13, 2025 04:30
Contributor

@srenatus srenatus left a comment

This looks great, especially the numbers. I've gone through the code and commented a bit, adding a few questions for my understanding. I hope you don't mind 😃

Contributor

@srenatus srenatus left a comment

LGTM, thanks for bearing with me 😃

Contributor

@johanfylling johanfylling left a comment

Looking good 👍 😃

Some questions/comments.

To be continued ..

Contributor

@johanfylling johanfylling left a comment

Some additional notes.

Do we have some sort of comparative analysis on drop behavior between the two buffer types? E.g. is one buffer more prone to dropping events than the other while under the same pressure?

Contributor

@johanfylling johanfylling left a comment

Making good progress 😃

sspaink and others added 13 commits March 24, 2025 15:54
This new event-based buffer provides a performance improvement over
the existing buffer by reducing locks and allowing concurrent writes.

Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
…opping of ND cache only happens when event is read from buffer.

Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
Signed-off-by: sspaink <[email protected]>
@johanfylling
Contributor

Some recent measurements (500 concurrent requesting clients):

Screenshot 2025-03-26 at 15 38 54

Blue is the old "size" type buffer, green is the new "event" type buffer. The blue spikes correlate with uploads to the log server.

Old size type buffer:

Global metrics:
Requests: 200
Requests total: 100000
Duration: 1m19.336304917s
Max concurrency: 500
Average req/s: 1260.4569888226824
timer_server_handler_ns: Min: 81.042µs, Max: 145.218709ms, Mean: 2.116985ms, P50: 263.708µs, P75: 423.947µs, P90: 1.20862ms, P95: 10.839443ms, P99: 34.751751ms, P99.9: 145.114025ms, P99.99: 145.218709ms
timer_rego_external_resolve_ns: Min: 41ns, Max: 6.75µs, Mean: 208ns, P50: 167ns, P75: 250ns, P90: 292ns, P95: 416ns, P99: 988ns, P99.9: 6.611µs, P99.99: 6.75µs
timer_rego_query_compile_ns: Min: 10.875µs, Max: 3.265709ms, Mean: 53.166µs, P50: 35.146µs, P75: 45.864µs, P90: 70.816µs, P95: 108.539µs, P99: 362.924µs, P99.9: 3.247513ms, P99.99: 3.265709ms
timer_rego_query_eval_ns: Min: 5.875µs, Max: 685.167µs, Mean: 39.172µs, P50: 20.521µs, P75: 45.458µs, P90: 72.959µs, P95: 102.108µs, P99: 366.857µs, P99.9: 683.887µs, P99.99: 685.167µs
duration: Min: 193.666µs, Max: 163.735333ms, Mean: 3.120354ms, P50: 570.374µs, P75: 1.413905ms, P90: 5.221721ms, P95: 9.14081ms, P99: 39.370902ms, P99.9: 163.383476ms, P99.99: 163.735333ms
Peaks:
duration: 168.657916ms
timer_server_handler_ns: 164.581375ms
timer_rego_external_resolve_ns: 569.792µs
timer_rego_query_compile_ns: 17.858625ms
timer_rego_query_eval_ns: 17.963041ms
Peak duration: 168.657916ms

New event type buffer:

Global metrics:
Requests: 200
Requests total: 100000
Duration: 1m18.066683334s
Max concurrency: 500
Average req/s: 1280.9561739950016
timer_rego_query_eval_ns: Min: 5.709µs, Max: 2.339125ms, Mean: 43.701µs, P50: 24.084µs, P75: 46.916µs, P90: 76.249µs, P95: 105.041µs, P99: 295.527µs, P99.9: 2.309766ms, P99.99: 2.339125ms
timer_server_handler_ns: Min: 65.417µs, Max: 9.999292ms, Mean: 365.524µs, P50: 245.292µs, P75: 363.969µs, P90: 534.25µs, P95: 861.422µs, P99: 3.090057ms, P99.9: 9.927875ms, P99.99: 9.999292ms
timer_rego_external_resolve_ns: Min: 41ns, Max: 2.459µs, Mean: 241ns, P50: 209ns, P75: 292ns, P90: 417ns, P95: 500ns, P99: 833ns, P99.9: 2.428µs, P99.99: 2.459µs
duration: Min: 185.042µs, Max: 19.097167ms, Mean: 1.403245ms, P50: 552.854µs, P75: 1.244437ms, P90: 3.413012ms, P95: 5.84851ms, P99: 12.394373ms, P99.9: 19.062707ms, P99.99: 19.097167ms
timer_rego_query_compile_ns: Min: 11.834µs, Max: 1.517209ms, Mean: 53.961µs, P50: 40.729µs, P75: 57.5µs, P90: 84.883µs, P95: 112.292µs, P99: 273.027µs, P99.9: 1.499569ms, P99.99: 1.517209ms
Peaks:
timer_rego_query_compile_ns: 22.58375ms
timer_rego_query_eval_ns: 23.331ms
timer_server_handler_ns: 25.787417ms
timer_rego_external_resolve_ns: 757.167µs
duration: 50.463917ms
Peak duration: 50.463917ms

Contributor

@johanfylling johanfylling left a comment

Looking great 👍 .
Only two remaining concerns.

Contributor

@johanfylling johanfylling left a comment

Awesome!
Thank you for bearing with me :)

@sspaink sspaink merged commit cd66fa3 into open-policy-agent:main Mar 26, 2025
28 checks passed
@anderseknert
Member

142 comments (143 once I post this, I suppose) on a PR must be a new record for OPA!
Had it been me I'd probably have closed my laptop for good to pursue farming or something.

Amazing work @sspaink 👍

Development

Successfully merging this pull request may close these issues.

Make latency impact of decision logs predictable

5 participants