
Conversation

@storozhukBM
Contributor

Hi everyone,
It's been a few years since we introduced the atomic-based rate limiter. Since then, after numerous changes to the runtime scheduler, the mutex-based implementation has become much more stable.
I found a way to improve the atomic-based implementation further and squeeze its state into a single int64. The new implementation is much faster, stable under contention, and has zero allocations.
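To make the idea concrete, here is a minimal sketch of the approach (simplified: the real code in this PR uses a mockable clock and cache-line padding, both omitted here, and as the discussion below shows, the details are subtle):

package ratelimit

import (
	"sync/atomic"
	"time"
)

// Sketch: the limiter's entire mutable state is one int64 holding the
// unix-nanosecond timestamp at which the next permission is issued.
type int64LimiterSketch struct {
	state      int64         // unix nanoseconds of the next permission issue
	perRequest time.Duration // interval between permissions (1s / rate)
	maxSlack   time.Duration // cap on accumulated burst credit
}

// Take blocks until the caller may proceed.
func (t *int64LimiterSketch) Take() time.Time {
	var newTimeOfNextPermissionIssue int64
	for {
		now := time.Now().UnixNano()
		timeOfNextPermissionIssue := atomic.LoadInt64(&t.state)
		switch {
		case timeOfNextPermissionIssue == 0:
			// first request ever: allow it immediately
			newTimeOfNextPermissionIssue = now
		case now-timeOfNextPermissionIssue > int64(t.maxSlack):
			// a long pause since the last Take: cap the credit at maxSlack
			newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
		default:
			// schedule this permission one interval after the previous one
			newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
		}
		// a single CAS publishes the new state; on contention we retry
		if atomic.CompareAndSwapInt64(&t.state, timeOfNextPermissionIssue, newTimeOfNextPermissionIssue) {
			break
		}
	}
	if sleep := time.Duration(newTimeOfNextPermissionIssue - time.Now().UnixNano()); sleep > 0 {
		time.Sleep(sleep)
	}
	return time.Unix(0, newTimeOfNextPermissionIssue)
}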

Benchmarks on an 8-core Intel i7 machine:
[chart 1]

Benchmarks on an 8-core M1 machine:
[chart 2]

@CLAassistant

CLAassistant commented Jun 3, 2022

CLA assistant check
All committers have signed the CLA.

@codecov

codecov bot commented Jun 3, 2022

Codecov Report

Merging #85 (bb12424) into main (bca0419) will decrease coverage by 1.01%.
The diff coverage is 96.66%.

❗ Current head bb12424 differs from pull request most recent head 5d22744. Consider uploading reports for the commit 5d22744 to get more accurate results

@@             Coverage Diff             @@
##              main      #85      +/-   ##
===========================================
- Coverage   100.00%   98.98%   -1.02%     
===========================================
  Files            3        4       +1     
  Lines           70       99      +29     
===========================================
+ Hits            70       98      +28     
- Misses           0        1       +1     
Impacted Files            Coverage Δ
limiter_atomic_int64.go   96.55% <96.55%> (ø)
ratelimit.go              100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bca0419...5d22744. Read the comment docs.

@storozhukBM
Contributor Author

@abhinav @rabbbit

Regarding lowered coverage:
The few lines in the new implementation that are not covered by tests require very high contention to be triggered, only help the scheduler a bit, and introduce no behavioral change. So I think it is OK to leave them uncovered. What do you think?

@rabbbit
Contributor

rabbbit commented Jun 5, 2022

Thanks, this looks really cool!

I took a brief look, left some minor comments, will need to take another pass to re-learn the old code.

I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64.
Is it just the overhead of loading an int vs a struct?

Also, do you know why the overhead for {1,2,3} goroutines is larger than 4+? This seems counterintuitive.

.PHONY: bench
bench: bin/benchstat bin/benchart
	go test -bench=. ./...
Contributor

Would you be open to putting up the benchmarking changes as a separate PR, to be landed before this one?

I'd love to keep the history clean/separate.

Contributor Author

Yup: #86

Contributor

Nice. thanks. I'll wait for you to rebase.

Contributor Author

Rebased

Contributor

Wait, this is not rebased, right? I'm seeing a bunch of changes that are already on main.

I'm not intimately familiar with github workflows, but I'm assuming there's a way to:

  • rebase your changes on top of main
  • potentially squash the commit

I wonder if I can figure out how to do this myself.

Contributor Author

Hey, I squashed all commits into one.

Contributor

I don't understand github then - I expected it's possible to have this PR ONLY list the things we're actually changing here. I expected all the makefile, go.sum, etc changes to be gone/not visible.

Is that not possible? If so, do we need a new PR rebased on the changes already in main?

Contributor

Okay, I think I clicked the right button now.

Lemme take one last look later today, I'll merge it in after.

state int64 // unix nanoseconds of the next permissions issue.
//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
// of this rate limiter in case of collocation with other frequently accessed memory.
postpadding [56]byte // cache line size - state pointer size = 64 - 8; created to avoid false sharing.
Contributor

Could you explain this one a bit more? Since we have other fields of the struct below, how does postpadding work here exactly?

Contributor Author

Every time we perform a CAS on the state field, we mark the whole cache line as "dirty". So when some other goroutine wants to read perRequest, maxSlack, or clock, the cache coherence protocol will force it to reload the whole cache line, even though perRequest, maxSlack, and clock are constants in our case. So we introduce postpadding to move perRequest, maxSlack, and clock onto a separate cache line. More about it here: https://en.wikipedia.org/wiki/False_sharing
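Schematically, the layout looks like this (a simplified sketch; the padding sizes follow the actual struct in this PR, other details are omitted):

package ratelimit

import "time"

// Sketch of the padded layout: state lives on its own cache line, so a
// CAS on it never invalidates the line holding the read-only fields.
type paddedLimiterSketch struct {
	//lint:ignore U1000 padding is unused on purpose
	prepadding [64]byte // full cache line: isolates state from preceding memory
	state      int64    // the only field written on the hot path (via CAS)
	//lint:ignore U1000 padding is unused on purpose
	postpadding [56]byte // 64 (cache line) - 8 (int64) = 56

	// effectively constants after construction; only ever read in Take
	perRequest time.Duration
	maxSlack   time.Duration
}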

Benchmarks show the performance degradation when we remove postpadding: atomic_int64_no_padding and atomic_int64_no_sched_no_padding are slower than their padded counterparts.

[screenshot: benchmark comparison charts, 2022-06-05 16:13]

Contributor

Fascinating. This sounds like an even more advanced case of https://pkg.go.dev/golang.org/x/tools/go/analysis/passes/fieldalignment, where we want to separate read/write variables.

Questions:
(1) do you know if there's any runtime variable that defines the cache line size and int64 size? Perhaps we could avoid the hardcoded ints, and thus make the code self-documenting.
(2) The comment needs updating, it's no longer "statePointerSize"

Unrelated:
(3) Are you using the same padding techniques in some other projects? Could you paste some links? I'd be curious to see how often, and to what effect, these are used.

Contributor Author

Do you know if there's any runtime variable that defines the cache line size and int64 size?

I didn't know about it before you asked, but apparently the golang.org/x/sys/cpu package has a CacheLinePad struct for this: https://pkg.go.dev/golang.org/x/sys/cpu#CacheLinePad
I can use that, but I'd prefer to add it in a separate PR if you don't mind. I want to keep this one as is, so I don't have to remeasure everything from scratch.
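For reference, that switch might look roughly like this (a sketch, not the change in this PR; cpu.CacheLinePad is sized per architecture by the x/sys/cpu package):

package ratelimit

import (
	"time"

	"golang.org/x/sys/cpu"
)

// Sketch using cpu.CacheLinePad instead of hand-sized byte arrays. Note
// that the postpadding here is a full cache line rather than the exact
// "line size - 8 bytes", which wastes a few bytes but self-documents.
type cacheLinePaddedSketch struct {
	_          cpu.CacheLinePad // prepadding
	state      int64
	_          cpu.CacheLinePad // postpadding
	perRequest time.Duration
	maxSlack   time.Duration
}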

Contributor Author

The comment needs updating, it's no longer "statePointerSize"

Done

Contributor Author

Are you using the same padding techniques in some other projects? Could you paste some links?

Yes, it is a widely used technique. I can find some good examples and post them here if you want. In some languages, like Java, for example, AtomicReference or AtomicInteger already have padding built-in.

	break
}
// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
Contributor

Suggested change
// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
// This technique was originally described in this https://arxiv.org/abs/1305.5800

Comment on lines 94 to 96
// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
// and showed great results in benchmark tests
Contributor

Suggested change
// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
// and showed great results in benchmark tests
// yield, like above.

if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
	break
}
// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
Contributor

Did you happen to run a test with and without the yielding?

Naively thinking, the fact that we failed the CAS means that another goroutine has updated the value. The N-1 remaining goroutines will then retry, each of them yielding ... meaning the goroutine that successfully wrote before is more likely to write again?

This seems like it would affect the "fairness" of the rate-limiting. While we don't explicitly say we want to be fair, I wonder if this can be unexpected in some cases.

Contributor Author

Good that you asked about it. When I started, I basically copied the current atomic implementation, added yielding, and measured that it improved things, but I didn't go back to re-check after switching to the int64-based implementation. So, to compare the different techniques used in this branch, I created a separate branch.

[screenshot: benchmark comparison charts, 2022-06-05 16:13]

On that graph we can see that yields improve the performance of the atomic-pointer-based rate limiter; I suspect because the cost of contention and retry is very high in that implementation, since we basically allocate new state on every try.

But in the case of the int64-based rate limiter the cost of a retry is much lower, so it is OK to retry immediately after a failure.
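For illustration, a minimal sketch of the yield-on-contention pattern being discussed (the casWithYield helper is made up for this example):

package ratelimit

import (
	"runtime"
	"sync/atomic"
)

// casWithYield retries a CAS until it wins, yielding the processor on
// each failed attempt so other goroutines can make progress first.
func casWithYield(state *int64, compute func(old int64) int64) int64 {
	for {
		old := atomic.LoadInt64(state)
		next := compute(old)
		if atomic.CompareAndSwapInt64(state, old, next) {
			return next
		}
		// We faced contention: ask the runtime to yield the processor
		// (https://arxiv.org/abs/1305.5800) before retrying.
		runtime.Gosched()
	}
}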

Contributor Author (storozhukBM, Jun 5, 2022)

OK, according to these benchmarks I'll change atomicInt64Limiter to remove the yielding.

@storozhukBM
Contributor Author

@rabbbit

I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64.
Is it just the overhead of loading an int vs a struct?

Yes, the main overhead is due to allocation. We basically need to allocate a state struct on the heap and then set a pointer to it using CAS. With only an int64 there are no allocations, and the GC is not bothered with all those small objects.
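For contrast, the old pointer-based loop has roughly this shape (a simplified sketch loosely following the existing limiter_atomic.go, not its exact code; it assumes state was seeded at construction):

package ratelimit

import (
	"sync/atomic"
	"time"
	"unsafe"
)

// Simplified shape of the old pointer-based limiter: every attempt
// builds a new state value whose address escapes into the CAS, so each
// Take costs at least one heap allocation.
type pointerState struct {
	last     time.Time
	sleepFor time.Duration
}

type pointerLimiterSketch struct {
	state      unsafe.Pointer // *pointerState, seeded at construction
	perRequest time.Duration
}

func (t *pointerLimiterSketch) take(now time.Time) time.Time {
	for {
		prev := atomic.LoadPointer(&t.state)
		old := (*pointerState)(prev)
		// newState escapes to the heap because we publish its address.
		newState := pointerState{
			last:     now,
			sleepFor: old.sleepFor + t.perRequest - now.Sub(old.last),
		}
		if atomic.CompareAndSwapPointer(&t.state, prev, unsafe.Pointer(&newState)) {
			return newState.last
		}
	}
}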

Also, do you know why the overhead for {1,2,3} goroutines is larger than 4+? This seems counterintuitive.

This is just an artifact of how we measure time per operation. What we actually do is take testing.B.N as the number of permissions we should get from the rate limiter, then distribute this amount across ng (the number of goroutines).
In the end, we measure how long it took to Take all testing.B.N permissions, then divide this measured time.Duration by testing.B.N to get ns/op.

So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through testing.B.N quickly enough, so their combined throughput is smaller than with 8 goroutines.

To demonstrate it, I implemented a slightly different way to measure time, where I measured not how quickly the goroutines together finish testing.B.N, but the average time of Take(); on that graph you don't see such artifacts:

[screenshot: per-Take latency chart, 2022-06-05 16:07]
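A sketch of that measurement scheme (simplified; it assumes the package's Limiter interface with a Take() method, and the real benchmarks in this repo differ in details):

package ratelimit

import (
	"sync"
	"testing"
)

// benchmarkTake distributes b.N Take calls across ng goroutines; the
// framework then reports total wall time divided by b.N as ns/op, which
// is why low goroutine counts can look "slower" per operation.
func benchmarkTake(b *testing.B, rl Limiter, ng int) {
	var wg sync.WaitGroup
	wg.Add(ng)
	b.ResetTimer()
	for g := 0; g < ng; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < b.N/ng; i++ {
				rl.Take()
			}
		}()
	}
	wg.Wait()
}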

@storozhukBM
Contributor Author

@abhinav @rabbbit
For some reason, tests fail on CI for the atomic rate limiter implementation that is currently in master; this also happened on the other PR #86, but I don't get such failures locally. Can you check on your machines?

rabbbit pushed a commit that referenced this pull request Jun 5, 2022
* Setup benchmarks stats and graphs

For more details look at: #85 (comment)
storozhukBM added a commit to storozhukBM/ratelimit that referenced this pull request Jun 5, 2022
* Setup benchmarks stats and graphs

For more details look at: uber-go#85 (comment)
Comment on lines 68 to 79
if timeOfNextPermissionIssue == 0 {
	newTimeOfNextPermissionIssue = now
	if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
		break
	}
	continue
}

if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
	// a lot of nanoseconds passed since the last Take call
	// we will limit max accumulated time to maxSlack
	newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
} else {
	// calculate the time at which our permission was issued
	newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
}
Contributor (rabbbit, Jun 5, 2022)

Suggested change

if timeOfNextPermissionIssue == 0 {
	newTimeOfNextPermissionIssue = now
	if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
		break
	}
	continue
}

if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
	// a lot of nanoseconds passed since the last Take call
	// we will limit max accumulated time to maxSlack
	newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
} else {
	// calculate the time at which our permission was issued
	newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
}

becomes:

switch {
case timeOfNextPermissionIssue == 0:
	newTimeOfNextPermissionIssue = now
case now-timeOfNextPermissionIssue > int64(t.maxSlack):
	// a lot of nanoseconds passed since the last Take call
	// we will limit max accumulated time to maxSlack
	newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
default:
	// calculate the time at which our permission was issued
	newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
}

I think this is equivalent (?), and I like that we only have a single CompareAndSwapInt64 - WDYT?

Separately, I was thinking about changing the initialization of the limiter so that it always starts with a valid now. This way we don't have to handle the 0 case in the "hot" path.
This would change the startup behavior - we'd probably start with "maxSlack" accumulated requests on the first request, where now we don't. The one "if/else" didn't seem worth it though.
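A hypothetical sketch of that initialization idea, reusing the int64LimiterSketch type from the sketch at the top of this thread (the constructor name and slack value are illustrative only, not from this PR):

package ratelimit

import "time"

// newInt64LimiterSeeded seeds state with a valid timestamp at
// construction, so Take never needs a special case for state == 0.
// Variants (e.g. seeding with now minus maxSlack) pick different
// startup behavior, which is the trade-off discussed above.
func newInt64LimiterSeeded(rate int) *int64LimiterSketch {
	perRequest := time.Second / time.Duration(rate)
	return &int64LimiterSketch{
		state:      time.Now().UnixNano(),
		perRequest: perRequest,
		maxSlack:   10 * perRequest, // illustrative slack value
	}
}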

Contributor Author

I'll try to rewrite it that way.

Contributor

If you do, please just do the switch.

For the second part (the initialization change), let's keep it separate - I'd like to keep the initialization consistent across all implementations. I was just asking for your opinion about that.

Contributor Author

@rabbbit
Hey, I made the following implementation in a separate branch. It is a bit simpler, but also 3.3ns slower: 27.4ns vs 24.1ns. I think it is because time.Time.Sub() and time.Time.Add() make some additional checks and handle internal state that plain int64 math doesn't.

[screenshot: benchmark comparison, 2022-06-05 21:25]

But this implementation handles the zero state better. What do you think if I use that approach for state handling, but keep the int64 math?

We can also merge this PR as is and deal with refactoring in a subsequent PR.

Contributor

I think we're conflating 3 different changes here, right?

  1. initial state change
  2. single vs multiple CAS
  3. int64 vs time types

I'm not okay with changing (1) (sorry, I left a note on this above too) - if we decide to change the initial state, we should change it in all implementations, and potentially add a test to formalize the startup behavior. I would definitely do this separately, or perhaps not at all.

I think I'd like us to decide on (2) in the current PR. As of my current understanding, there's no performance benefit to the original version, and the switch code seems clearer. So this seems like something we should just do.

I'm okay with leaving (3) as a follow-up. It seems like a tradeoff between a tiny bit of speed (2-3ns) and developer experience (compiler help). There's also a question of correctness, but right now I'm not sure how the serialization to int64 plays with the monotonic clock. We can potentially iterate on that separately.

Sorry for the amount of questions/comments :)

Contributor Author

I agree on all 3 points. I'll quickly double-check that the switch has OK performance and make this change. The 2 other things we can discuss later, in issues maybe.

Contributor Author

Results are definitely strange, but the switch is slower:

[screenshot: benchmark results, 2022-06-06 00:29]

Contributor

I'm sorry about the number of comments. If this is getting annoying, let me know, and I can take over.

I'm okay with landing the switch version, and then iterating. For now:

  • I prefer the readability over 5ns
  • we're now significantly faster than the previous iteration anyway.

However, ^ made me realize that I don't necessarily care about the switch being present; what I'd like is a single CAS operation. This means we could probably do something like:

if timeOfNextPermissionIssue == 0 {
	// If this is our first request, then we allow it.
	newTimeOfNextPermissionIssue = now - int64(t.perRequest)
}

if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
	// a lot of nanoseconds passed since the last Take call
	// we will limit max accumulated time to maxSlack
	newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
} else {
	// calculate the time at which our permission was issued
	newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
}

if atomic.CompareAndSwapInt64(&t.state, timeOfNextPermissionIssue, newTimeOfNextPermissionIssue) {
	break
}

Which I think is largely equivalent to your original version, but easier to parse? We're doing exactly one extra subtraction in the startup phase, but that shouldn't matter.

But, again, I'm okay with the switch version if you'd like to land that. And we can iterate from that.

Contributor Author

Yes, let's land the switch for now and then iterate. It may be that having only one CAS is what's causing the slowdown, or maybe we won't need to handle timeOfNextPermissionIssue == 0 in the future. We will see.

Contributor Author

Done

@rabbbit
Contributor

rabbbit commented Jun 5, 2022

@abhinav @rabbbit For some reason, tests fail on CI for the atomic rate limiter implementation that is currently in master; this also happened on the other PR #86, but I don't get such failures locally. Can you check on your machines?

Yeah, we've seen that in the past, but didn't debug (we don't make many changes to this repo). It seems to be happening mostly, but not exclusively, during codecov generation.

So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through testing.B.N quickly enough, so their combined throughput is smaller than with 8 goroutines.

Another surprising part was that for your i7 the 1, 2, and 4 goroutine cases were slower, but it was reversed on the M1 machine. Magic.

To demonstrate it, I implemented a slightly different way to measure time, where I measured not how quickly the goroutines together finish testing.B.N, but the average time of Take(); on that graph you don't see such artifacts.

Nice. The performance of the limiter seems to be significantly slower with more goroutines here - presumably because the time calculation is more expensive?

Thanks for doing this, and sorry for all the nits/questions :)

Separately, this PR probably resolves #22 - we see that the mutex was performing better than the old atomic version.

@storozhukBM
Contributor Author

Separately, this PR probably resolves #22 - we see that the mutex was performing better than the old atomic version.

It was not the case on Go 1.16 or Go 1.15, but I think it is now.

@storozhukBM
Contributor Author

Another surprising part was that for your i7 the 1, 2, and 4 goroutine cases were slower, but it was reversed on the M1 machine. Magic.

The M1 has 4 slow cores and 4 fast cores. I think the 4 fast cores start working first, and when the 4 slow ones join the party they only get in the way. If you want, I can run the techniques_comp benches on M1 to see the whole picture.

storozhukBM added a commit to storozhukBM/ratelimit that referenced this pull request Jun 5, 2022
* Setup benchmarks stats and graphs

For more details look at: uber-go#85 (comment)
@storozhukBM
Contributor Author

@rabbbit It looks like our clock mocking library is super old and archived by its author.
What do you say if I replace it with a drop-in replacement that is actively maintained: https://github.com/benbjohnson/clock
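For reference, a sketch of how that library's mock clock would drive a test deterministically (based on its documented API — clock.NewMock and Mock.Add — not on code from this PR):

package ratelimit

import (
	"time"

	"github.com/benbjohnson/clock"
)

// Sketch: clock.NewMock() satisfies the same interface as clock.New(),
// so tests can advance virtual time instead of sleeping for real.
func exampleMockClock() time.Duration {
	mock := clock.NewMock()
	start := mock.Now()
	mock.Add(100 * time.Millisecond) // advance virtual time; fires timers
	return mock.Now().Sub(start)     // == 100ms, deterministically
}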

@storozhukBM storozhukBM force-pushed the int64_rl branch 2 times, most recently from 011e559 to 2e7a9fa on June 7, 2022 at 20:17
type atomicInt64Limiter struct {
	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	prepadding [64]byte // cache line size = 64; created to avoid false sharing.
Contributor

Thinking about this more, I think this can be [56] as well?

Contributor Author

Prepadding is usually used to avoid accidental colocation with other contended objects on the heap; since we don't know their size, we just use the full cache line size.

Contributor

Am I mistaken in thinking that [56] achieves the same goal?

It seems to me that prepadding of 56, a size of 8, and postpadding of 56 guarantee that the object in question will always be "alone" in its cache line?
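A quick standalone check of that reasoning (a hypothetical snippet, not part of this PR):

package main

import (
	"fmt"
	"unsafe"
)

const cacheLineSize = 64

type padded struct {
	_     [cacheLineSize - 8]byte // 56-byte prepadding
	state int64
	_     [cacheLineSize - 8]byte // 56-byte postpadding
}

func main() {
	var p padded
	// state is 8-byte aligned with 56 bytes of padding on each side, so
	// the 64-byte cache line holding it lies entirely inside p: no
	// neighboring object can share that line.
	fmt.Println(unsafe.Offsetof(p.state), unsafe.Sizeof(p)) // 56 120
}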

case timeOfNextPermissionIssue == 0:
	// If this is our first request, then we allow it.
	newTimeOfNextPermissionIssue = now
case now-timeOfNextPermissionIssue > int64(t.maxSlack):
Contributor

I think we can potentially reduce the number of subtractions here too:

maxTimeOfNextPermissionIssue := now - int64(t.maxSlack)
if timeOfNextPermissionIssue < maxTimeOfNextPermissionIssue {
	newTimeOfNextPermissionIssue = maxTimeOfNextPermissionIssue
} else {
	newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
}

We can micro-optimize this separately though, in follow-ups.

@rabbbit rabbbit merged commit f04376c into uber-go:main Jun 7, 2022
rabbbit added a commit that referenced this pull request Jun 9, 2022
rabbbit added a commit that referenced this pull request Jun 9, 2022
rabbbit pushed a commit that referenced this pull request Jul 2, 2022
This limiter was introduced and merged in the following PR: #85
Later @twelsh-aw found an issue with this implementation: #90
So @rabbbit reverted this change in #91

Our tests did not detect this issue, so we have a separate PR, #93, that enhances our testing approach to detect potential errors better.
With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug.
Right after that, we'll open a subsequent PR to fix this bug.
storozhukBM added a commit to storozhukBM/ratelimit that referenced this pull request Jul 2, 2022
This limiter was introduced and merged in the following PR: uber-go#85
Later @twelsh-aw found an issue with this implementation: uber-go#90
So @rabbbit reverted this change in uber-go#91

Our tests did not detect this issue, so we have a separate PR, uber-go#93, that enhances our testing approach to detect potential errors better.
With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that uber-go#93 will detect the bug.
Right after that, we'll open a subsequent PR to fix this bug.
rabbbit pushed a commit that referenced this pull request Oct 31, 2022
* Fix return timestamp discrepancy between regular atomic limiter and int64 based one
* Make int64 based atomic limiter default

Long story: this was added in #85, but reverted in #91 due to #90. #95 fixed the issue, so we're moving forward with the new implementation.
touridev pushed a commit to touridev/limit-go that referenced this pull request Oct 7, 2025
* Setup benchmarks stats and graphs

For more details look at: uber-go/ratelimit#85 (comment)