New atomic-based implementation squeezed into int64 #85
Conversation
Codecov Report
@@ Coverage Diff @@
## main #85 +/- ##
===========================================
- Coverage 100.00% 98.98% -1.02%
===========================================
Files 3 4 +1
Lines 70 99 +29
===========================================
+ Hits 70 98 +28
- Misses 0 1 +1
Regarding the lowered coverage:
Thanks, this looks really cool! I took a brief look and left some minor comments; I will need to take another pass to re-learn the old code. I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64. Also, do you know why the overhead for {1,2,3} goroutines is larger than for 4+? This seems counterintuitive.
.PHONY: bench
bench:
	go test -bench=. ./...
bench: bin/benchstat bin/benchart
Would you be open to putting up the benchmarking changes as a separate PR, to be landed before this one?
I'd love to keep the history clean/separate.
Yup: #86
Nice, thanks. I'll wait for you to rebase.
Rebased
Wait, this is not rebased, right? I'm seeing a bunch of changes that are already on main.
I'm not intimately familiar with GitHub workflows, but I'm assuming there's a way to:
- rebase your changes on top of main
- potentially squash the commit
I wonder if I can figure out how to do this myself.
Hey, I squashed all commits into one.
I don't understand GitHub then - I expected it to be possible to have this PR ONLY list the things we're actually changing here. I expected all the Makefile, go.sum, etc. changes to be gone/not visible.
Is that not possible? If so, do we need a new PR rebased on the changes already in main?
Okay, I think I clicked the right button now.
Lemme take one last look later today, I'll merge it in after.
limiter_atomic_int64.go
Outdated
	state int64 // unix nanoseconds of the next permissions issue.

	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	postpadding [56]byte // cache line size - state pointer size = 64 - 8; created to avoid false sharing.
Could you explain this one a bit more? Since we have other fields of the struct below, how does postpadding work here exactly?
Every time we perform a CAS on the state field, we mark the whole cache line as "dirty". So when some other goroutine wants to read perRequest, maxSlack or clock, the cache coherence protocol will force it to reload the whole cache line, even though perRequest, maxSlack and clock are constants in our case. So we introduce postpadding to move perRequest, maxSlack and clock onto a separate cache line. More about it here: https://en.wikipedia.org/wiki/False_sharing
The benchmarks that show perf degradation when we remove postpadding are the ones checking that atomic_int64_no_padding and atomic_int64_no_sched_no_padding are slower than their padded counterparts.
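To make the layout concrete, here is a rough sketch of how the fields are arranged (the read-only field names come from the discussion above; the Clock interface shown is just a stand-in for the package's clock, not necessarily its exact definition):

package ratelimit

import "time"

// Clock stands in for the package's time source interface.
type Clock interface {
	Now() time.Time
	Sleep(time.Duration)
}

// Sketch of the layout: state sits at the start of a cache line, and postpadding
// keeps the read-mostly fields (perRequest, maxSlack, clock) off that line, so a
// CAS on state does not invalidate the line other goroutines read them from.
type atomicInt64Limiter struct {
	state int64 // unix nanoseconds of the next permission issue

	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	postpadding [56]byte // cache line size - state size = 64 - 8

	perRequest time.Duration // read-only after construction
	maxSlack   time.Duration // read-only after construction
	clock      Clock         // read-only after construction
}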
Fascinating. This sounds like an even more advanced case of https://pkg.go.dev/golang.org/x/tools/go/analysis/passes/fieldalignment, where we want to separate read/write variables.
Questions:
(1) do you know if there's any runtime variable that defines the cache line size and int64 size? Perhaps we could avoid the hardcoded ints, and thus make the code self-documenting.
(2) The comment needs updating, it's no longer "statePointerSize"
Unrelated:
(3) Are you using the same padding techniques in some other projects? Could you paste some links? I'd be curious to see how often, and to what effect, these are used.
Do you know if there's any runtime variable that defines the cache line size and int64 size?
I didn't know about it before you asked, but apparently the golang.org/x/sys/cpu package has a CacheLinePad struct for it: https://pkg.go.dev/golang.org/x/sys/cpu#CacheLinePad
I can use that, but I'd prefer to add it in a separate PR if you don't mind. I wanted to keep this one as is, so I don't have to remeasure everything from scratch.
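For reference, a sketch of what using cpu.CacheLinePad could look like (not what this PR does; just to show how the hardcoded sizes could go away - the struct and field names here are illustrative):

package ratelimit

import (
	"time"

	"golang.org/x/sys/cpu"
)

// Sketch only: cpu.CacheLinePad is sized per target architecture by the x/sys/cpu
// package, so the hand-counted "64 - 8" byte arrays would no longer be needed.
type paddedLimiter struct {
	_     cpu.CacheLinePad // keep state away from memory allocated before this struct
	state int64            // unix nanoseconds of the next permission issue
	_     cpu.CacheLinePad // keep the read-mostly fields below off state's cache line

	perRequest time.Duration
	maxSlack   time.Duration
}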
The comment needs updating, it's no longer "statePointerSize"
Done
Are you using the same padding techniques in some other projects? Could you paste some links?
Yes, it is a widely used technique. I can find some good examples and post them here if you want. In some languages - Java, for example - AtomicReference or AtomicInteger already have padding built in.
limiter_atomic_int64.go
Outdated
		break
	}
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
-	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
+	// This technique was originally described in this https://arxiv.org/abs/1305.5800
limiter_atomic_int64.go
Outdated
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
	// and showed great results in benchmark tests
-	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
-	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
-	// and showed great results in benchmark tests
+	// yield, like above.
limiter_atomic_int64.go
Outdated
	if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
		break
	}
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
Did you happen to run a test with and without the yielding?
Naively thinking, the fact that we failed to load means that another go-routine has updated the value. N-1 go-routines will then try to re-try, each of them yielding ... meaning the go-routine that has successfully written before, is more likely to write again?
This seems like it would be affecting the "fairness" of the rate-limiting. While we don't explicitly say we want to be fair, I wonder if this can be unexpected in some cases.
Good that you asked about it, because I basically copied the current atomic implementation, added yielding, measured that it improves things, and didn't go back to re-check after switching to the int64-based implementation. So, to compare the different techniques used in this branch, I created a separate branch.
On that graph we can see that yields improve the performance of the atomic-pointer-based rate limiter; I suspect it is because the cost of contention and retry is very high in that implementation - we basically allocate a new state for every try.
But in the case of the int64-based rate limiter the cost of a retry is much lower, so it is OK to retry immediately after failure.
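For reference, a minimal sketch of the yield-on-contention pattern being discussed (runtime.Gosched in a CAS retry loop; the state and update function here are simplified stand-ins, not the PR's exact code):

package ratelimit

import (
	"runtime"
	"sync/atomic"
)

// casWithYield sketches the pattern: on a failed CAS we yield the processor so
// other goroutines can run before we retry. Whether this helps depends on how
// expensive a retry is; for a cheap int64 retry, retrying immediately can be fine.
func casWithYield(state *int64, next func(old int64) int64) int64 {
	for {
		old := atomic.LoadInt64(state)
		updated := next(old)
		if atomic.CompareAndSwapInt64(state, old, updated) {
			return updated
		}
		runtime.Gosched() // contention: another goroutine won the CAS; yield before retrying
	}
}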
OK, according to these benchmarks I'll change atomicInt64Limiter to remove yielding.
Yes, the main overhead is due to allocation. We basically need to allocate a state struct on the heap and then set a pointer to it using CAS. There are no allocations when using only an int64, and the GC is not bothered with all those small objects, etc.
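To make the allocation point concrete, a rough sketch of the old pointer-based shape (names and fields here are illustrative, not the repo's exact code): every retry builds a fresh state value on the heap and tries to CAS the pointer, whereas the int64 version retries with nothing but a register-sized compare-and-swap.

package ratelimit

import (
	"sync/atomic"
	"time"
	"unsafe"
)

// limiterState is an illustrative stand-in for the old pointer-based state.
type limiterState struct {
	last     time.Time
	sleepFor time.Duration
}

// takePointer sketches the old approach: each attempt allocates a new state
// struct and CASes the pointer, so contention multiplies allocations and GC work.
func takePointer(state *unsafe.Pointer, build func(prev *limiterState) *limiterState) {
	for {
		prev := (*limiterState)(atomic.LoadPointer(state))
		next := build(prev) // heap allocation on every attempt
		if atomic.CompareAndSwapPointer(state, unsafe.Pointer(prev), unsafe.Pointer(next)) {
			return
		}
	}
}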
This is just an artifact of how we measure time per operation. What we actually do is take … So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through … To demonstrate it, I implemented a bit different way to measure time, where I measured not the time of how quickly goroutines together finish …
limiter_atomic_int64.go
Outdated
	if timeOfNextPermissionIssue == 0 {
		newTimeOfNextPermissionIssue = now
		if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
			break
		}
		continue
	}

	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
		// a lot of nanoseconds passed since the last Take call
		// we will limit max accumulated time to maxSlack
		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
	} else {
		// calculate the time at which our permission was issued
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}
-	if timeOfNextPermissionIssue == 0 {
-		newTimeOfNextPermissionIssue = now
-		if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
-			break
-		}
-		continue
-	}
-	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
-		// a lot of nanoseconds passed since the last Take call
-		// we will limit max accumulated time to maxSlack
-		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
-	} else {
-		// calculate the time at which our permission was issued
-		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
-	}
+	switch {
+	case timeOfNextPermissionIssue == 0:
+		newTimeOfNextPermissionIssue = now
+	case now-timeOfNextPermissionIssue > int64(t.maxSlack):
+		// a lot of nanoseconds passed since the last Take call
+		// we will limit max accumulated time to maxSlack
+		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
+	default:
+		// calculate the time at which our permission was issued
+		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
+	}
I think this is equivalent (?), and I like that we only have a single CompareAndSwapInt64 - WDYT?
Separately, I was thinking about changing the initialization of the limiter so that it always inits with a valid now. This way we don't have to handle the 0 case in the "hot" path.
This would change the startup behavior - we'd probably start with "maxSlack" accumulated requests on first request, where now we don't. The one "if/else" didn't seem worth it though.
I'll try to rewrite it that way
If you do, please just do the switch.
For the second part (the initialization change) let's keep it separate - I'd like to keep the initialization consistent across all implementations. I was just asking for your opinion about it.
@rabbbit
Hey, I made the following implementation in a separate branch; it is a bit simpler, but also 3.3 ns slower: 27.4 ns vs 24.1 ns. I think it is because of using time.Time.Sub() or time.Time.Add(), which do some additional checks and handle internal state that int64 math doesn't.
But this implementation handles the zero state better. What do you think if I use that approach for state handling, but leave the int64 math around?
We can also merge this PR as is and deal with refactoring in subsequent PR.
I think we're conflating 3 different changes here, right?
- initial state change
- single vs multiple CAS
- int64 vs time types
I'm not okay with changing (1) (sorry, I left a note on this above too) - if we decide to change the initial state, we should change it in all implementations, and potentially add a test to formalize the startup behavior. I would definitely do this separately, or perhaps not at all.
I think I'd like us to decide on (2) in the current PR. As of my current understanding, there's no performance benefit in the original version, and the code seems clearer. So this seems like something we should just do.
I'm okay with leaving (3) as a follow-up. It seems like a tradeoff between a tiny bit of speed (2ns-3ns) vs developer experience (compiler help). There's also a question of correctness, but I'm not sure right now how the serialization to int64 plays with the monotonic clock. We can potentially iterate on that separately.
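For context, the monotonic-clock point refers to this property of the time package (a tiny standalone sketch, not code from this PR): converting a time.Time to int64 nanoseconds keeps only the wall clock.

package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()            // carries a monotonic clock reading
	startNanos := start.UnixNano() // int64 wall-clock nanoseconds; the monotonic reading is dropped

	later := time.Now()
	// Subtraction on time.Time values uses the monotonic reading; subtraction on
	// the int64 values is pure wall-clock math, so a wall-clock step (NTP, manual
	// clock change) would be visible to an int64-based limiter.
	fmt.Println(later.Sub(start), later.UnixNano()-startNanos)
}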
Sorry for the amount of questions/comments :)
I agree on all 3 points. I'll quickly double-check that the switch has OK perf and make this change. The 2 other things we can discuss later, in issues maybe.
Results are definitely strange, but switch is slower:
I'm sorry about the number of comments. If this is getting annoying, let me know, and I can take over.
I'm okay with landing the switch version, and then iterating. For now:
- I prefer the readability over 5ns
- we're now significantly faster than the previous iteration anyway.
However, ^ made me realize that I don't necessarily care about the switch being present, but I'd like to have a single CAS operation. This means we could probably do something like:
	if timeOfNextPermissionIssue == 0 {
		// If this is our first request, then we allow it.
		newTimeOfNextPermissionIssue = now - int64(t.perRequest)
	}

	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
		// a lot of nanoseconds passed since the last Take call
		// we will limit max accumulated time to maxSlack
		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
	} else {
		// calculate the time at which our permission was issued
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}

	if atomic.CompareAndSwapInt64(&t.state, timeOfNextPermissionIssue, newTimeOfNextPermissionIssue) {
		break
	}

Which I think is largely equivalent to your original version, but easier to parse? We're doing exactly one more subtraction in the startup phase, but that shouldn't matter.
But, again, I'm okay with the switch version if you'd like to land that. And we can iterate from that.
Yes, let's land the switch for now and then iterate. It could be that having only one CAS is what's causing the slowdown, or maybe we won't need the handling of timeOfNextPermissionIssue == 0 in the future. We will see.
Done
Yeah, we've seen that in the past, but didn't debug (we don't make many changes to this repo). It seems to be happening mostly, but not exclusively, during codecov generation.
Another surprising part was that for your i7 the 1, 2, 4 goroutine cases were slower, but it was reversed for the M1 machine. Magic.
Nice. The performance of the limiter seems to be significantly slower with more goroutines here - presumably because time calculation is more expensive here? Thanks for doing this, and sorry for all the nits/questions :) Separately, this PR probably resolves #22 - we see that the mutex was performing better than the old atomic version.
It was not the case on go 1.16 or go 1.15, but it is now, I think.
M1 has 4 slow cores and 4 fast cores. I think the 4 fast cores start working first, and when the 4 slow ones join the party they only get in the way. If you want, I can run the techniques_comp benches on M1 to see the whole picture.
@rabbbit It looks like our clock mocking library is super old and archived by the author.
type atomicInt64Limiter struct {
	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	prepadding [64]byte // cache line size = 64; created to avoid false sharing.
thinking about this more, I think this can be [56] as well?
Prepadding is usually used to avoid accidental colocation with other contended objects on the heap; since we don't know their size, we just use the full cache line size.
Am I mistaken thinking that [56] achieves the same target?
It seems to me that a prepadding of 56, a size of 8, and a post-padding of 56 guarantee that the object in question will always be "alone" in its cache line?
	case timeOfNextPermissionIssue == 0:
		// If this is our first request, then we allow it.
		newTimeOfNextPermissionIssue = now
	case now-timeOfNextPermissionIssue > int64(t.maxSlack):
I think we can reduce the number of subtractions here potentially too.

	maxTimeOfNextPermissionIssue := now - int64(t.maxSlack)
	if timeOfNextPermissionIssue < maxTimeOfNextPermissionIssue {
		newTimeOfNextPermissionIssue = maxTimeOfNextPermissionIssue
	} else {
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}

We can micro-optimize this separately though, in follow-ups.
This limiter was introduced and merged in the following PR: #85. Later, @twelsh-aw found an issue with this implementation: #90. So @rabbbit reverted this change in #91. Our tests did not detect this issue, so we have a separate PR, #93, that enhances our testing approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug. Right after that, we'll open a subsequent PR to fix this bug.
Hi everyone,
It's been a few years since we introduced the atomic-based rate limiter. Since then, after numerous changes to the runtime scheduler, the mutex-based implementation has become much more stable.
So I found a way to improve the atomic-based implementation further and squeeze its state into one int64. The new implementation is much faster, stable under contention, and has zero allocations.

Benchmarks on an 8-core Intel i7 machine:

Benchmarks on an 8-core M1 machine:
