New atomic-based implementation squeezed into int64 #85
Conversation
Codecov Report
@@ Coverage Diff @@
## main #85 +/- ##
===========================================
- Coverage 100.00% 98.98% -1.02%
===========================================
Files 3 4 +1
Lines 70 99 +29
===========================================
+ Hits 70 98 +28
- Misses 0 1 +1
Regarding the lowered coverage:
Thanks, this looks really cool! I took a brief look and left some minor comments; I will need to take another pass to re-learn the old code. I don't actually know/understand - could you explain why/how the new implementation is so much better than the previous atomic-based one? I imagined that storing/loading a pointer atomically would be approximately as fast as storing an int64. Also, do you know why the overhead for {1,2,3} goroutines is larger than for 4+? This seems counterintuitive.
.PHONY: bench
bench:
	go test -bench=. ./...
bench: bin/benchstat bin/benchart
Would you be open to putting up the benchmarking changes as a separate PR, to be landed before this one?
I'd love to keep the history clean/separate.
Yup: #86
Nice, thanks. I'll wait for you to rebase.
Rebased
Wait, this is not rebased, right? I'm seeing a bunch of changes that are already on main.
I'm not intimately familiar with GitHub workflows, but I'm assuming there's a way to:
- rebase your changes on top of main
- potentially squash the commit
I wonder if I can figure out how to do this myself.
Hey, I squashed all commits into one.
I don't understand GitHub then - I expected it to be possible to have this PR ONLY list the things we're actually changing here. I expected all the Makefile, go.sum, etc. changes to be gone/not visible.
Is that not possible? If so, do we need a new PR rebased on the changes already in main?
Okay, I think I clicked the right button now.
Lemme take one last look later today, I'll merge it in after.
limiter_atomic_int64.go
Outdated
	state int64 // unix nanoseconds of the next permissions issue.

	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	postpadding [56]byte // cache line size - state pointer size = 64 - 8; created to avoid false sharing.
Could you explain this one a bit more? Since we have other fields of the struct below, how does postpadding work here exactly?
Every time we perform a CAS on the state field, we mark the whole cache line as "dirty". So when some other goroutine wants to read perRequest, maxSlack or clock, the cache coherence protocol will force it to reload the whole cache line, even though perRequest, maxSlack and clock are constants in our case. So we introduce postpadding to move perRequest, maxSlack and clock onto a separate cache line. More about it here: https://en.wikipedia.org/wiki/False_sharing
The benchmarks that show perf degradation when we remove postpadding are the ones checking that atomic_int64_no_padding and atomic_int64_no_sched_no_padding are slower than their padded counterparts.
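To make the layout concrete, here is a rough sketch of how the fields are arranged (the read-only field names come from the discussion above; the Clock interface shown is just a stand-in for the package's clock, not necessarily its exact definition):

package ratelimit

import "time"

// Clock stands in for the package's time source interface.
type Clock interface {
	Now() time.Time
	Sleep(time.Duration)
}

// Sketch of the layout: state sits at the start of a cache line, and postpadding
// keeps the read-mostly fields (perRequest, maxSlack, clock) off that line, so a
// CAS on state does not invalidate the line other goroutines read them from.
type atomicInt64Limiter struct {
	state int64 // unix nanoseconds of the next permission issue

	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	postpadding [56]byte // cache line size - state size = 64 - 8

	perRequest time.Duration // read-only after construction
	maxSlack   time.Duration // read-only after construction
	clock      Clock         // read-only after construction
}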
Fascinating. This sounds like an even more advanced case of https://pkg.go.dev/golang.org/x/tools/go/analysis/passes/fieldalignment, where we want to separate read/write variables.
Questions:
(1) do you know if there's any runtime variable that defines the cache line size and int64 size? Perhaps we could avoid the hardcoded ints, and thus make the code self-documenting.
(2) The comment needs updating, it's no longer "statePointerSize"
Unrelated:
(3) Are you using the same padding techniques in some other projects? Could you paste some links? I'd be curious to see how often, and to what effect, these are used.
Do you know if there's any runtime variable that defines the cache line size and int64 size?
I didn't know about it before you asked, but apparently the golang.org/x/sys/cpu package has a CacheLinePad struct for it: https://pkg.go.dev/golang.org/x/sys/cpu#CacheLinePad
I can use that, but I'd prefer to add it in a separate PR if you don't mind. I wanted to keep this one as is, so I don't have to remeasure everything from scratch.
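For reference, a sketch of what using cpu.CacheLinePad could look like (not what this PR does; just to show how the hardcoded sizes could go away - the struct and field names here are illustrative):

package ratelimit

import (
	"time"

	"golang.org/x/sys/cpu"
)

// Sketch only: cpu.CacheLinePad is sized per target architecture by the x/sys/cpu
// package, so the hand-counted "64 - 8" byte arrays would no longer be needed.
type paddedLimiter struct {
	_     cpu.CacheLinePad // keep state away from memory allocated before this struct
	state int64            // unix nanoseconds of the next permission issue
	_     cpu.CacheLinePad // keep the read-mostly fields below off state's cache line

	perRequest time.Duration
	maxSlack   time.Duration
}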
The comment needs updating, it's no longer "statePointerSize"
Done
Are you using the same padding techniques in some other projects? Could you paste some links?
Yes, it is a widely used technique. I can find some good examples and post them here if you want. In some languages - Java, for example - AtomicReference or AtomicInteger already have padding built in.
limiter_atomic_int64.go
Outdated
		break
	}
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
-	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
+	// This technique was originally described in this https://arxiv.org/abs/1305.5800
limiter_atomic_int64.go
Outdated
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
	// and showed great results in benchmark tests
-	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
-	// This technique was originally described in thishttps://arxiv.org/abs/1305.5800
-	// and showed great results in benchmark tests
+	// yield, like above.
limiter_atomic_int64.go
Outdated
	if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
		break
	}
	// We faced contention, so we call runtime to yield the processor, allowing other goroutines to run
Did you happen to run a test with and without the yielding?
Naively thinking, the fact that we failed to load means that another go-routine has updated the value. N-1 go-routines will then try to re-try, each of them yielding ... meaning the go-routine that has successfully written before, is more likely to write again?
This seems like it would be affecting the "fairness" of the rate-limiting. While we don't explicitly say we want to be fair, I wonder if this can be unexpected in some cases.
Good that you asked about it, because I basically copied the current atomic implementation, added yielding, measured that it improves things, and didn't go back to re-check after switching to the int64-based implementation. So, to compare the different techniques used in this branch, I created a separate branch.
On that graph we can see that yields improve the performance of the atomic-pointer-based rate limiter; I suspect it is because the cost of contention and retry is very high in that implementation - we basically allocate a new state for every try.
But in the case of the int64-based rate limiter the cost of a retry is much lower, so it is OK to retry immediately after failure.
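For reference, a minimal sketch of the yield-on-contention pattern being discussed (runtime.Gosched in a CAS retry loop; the state and update function here are simplified stand-ins, not the PR's exact code):

package ratelimit

import (
	"runtime"
	"sync/atomic"
)

// casWithYield sketches the pattern: on a failed CAS we yield the processor so
// other goroutines can run before we retry. Whether this helps depends on how
// expensive a retry is; for a cheap int64 retry, retrying immediately can be fine.
func casWithYield(state *int64, next func(old int64) int64) int64 {
	for {
		old := atomic.LoadInt64(state)
		updated := next(old)
		if atomic.CompareAndSwapInt64(state, old, updated) {
			return updated
		}
		runtime.Gosched() // contention: another goroutine won the CAS; yield before retrying
	}
}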
OK, according to these benchmarks I'll change atomicInt64Limiter to remove yielding.
Yes, the main overhead is due to allocation. We basically need to allocate a state struct on the heap and then set a pointer to it using CAS. There are no allocations when using only an int64, and the GC is not bothered with all those small objects, etc.
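To make the allocation point concrete, a rough sketch of the old pointer-based shape (names and fields here are illustrative, not the repo's exact code): every retry builds a fresh state value on the heap and tries to CAS the pointer, whereas the int64 version retries with nothing but a register-sized compare-and-swap.

package ratelimit

import (
	"sync/atomic"
	"time"
	"unsafe"
)

// limiterState is an illustrative stand-in for the old pointer-based state.
type limiterState struct {
	last     time.Time
	sleepFor time.Duration
}

// takePointer sketches the old approach: each attempt allocates a new state
// struct and CASes the pointer, so contention multiplies allocations and GC work.
func takePointer(state *unsafe.Pointer, build func(prev *limiterState) *limiterState) {
	for {
		prev := (*limiterState)(atomic.LoadPointer(state))
		next := build(prev) // heap allocation on every attempt
		if atomic.CompareAndSwapPointer(state, unsafe.Pointer(prev), unsafe.Pointer(next)) {
			return
		}
	}
}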
This is just an artifact of how we measure time per operation. What we actually do is take … So with atomic_int64 the overhead of synchronization is so small that 1, 2, or 4 goroutines basically can't progress through … To demonstrate it, I implemented a bit different way to measure time, where I measured not the time of how quickly goroutines together finish …
limiter_atomic_int64.go
Outdated
	if timeOfNextPermissionIssue == 0 {
		newTimeOfNextPermissionIssue = now
		if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
			break
		}
		continue
	}

	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
		// a lot of nanoseconds passed since the last Take call
		// we will limit max accumulated time to maxSlack
		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
	} else {
		// calculate the time at which our permission was issued
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}
-	if timeOfNextPermissionIssue == 0 {
-		newTimeOfNextPermissionIssue = now
-		if atomic.CompareAndSwapInt64(&t.state, 0, newTimeOfNextPermissionIssue) {
-			break
-		}
-		continue
-	}
-	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
-		// a lot of nanoseconds passed since the last Take call
-		// we will limit max accumulated time to maxSlack
-		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
-	} else {
-		// calculate the time at which our permission was issued
-		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
-	}
+	switch {
+	case timeOfNextPermissionIssue == 0:
+		newTimeOfNextPermissionIssue = now
+	case now-timeOfNextPermissionIssue > int64(t.maxSlack):
+		// a lot of nanoseconds passed since the last Take call
+		// we will limit max accumulated time to maxSlack
+		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
+	default:
+		// calculate the time at which our permission was issued
+		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
+	}
I think this is equivalent (?), and I like that we only have a single CompareAndSwapInt64 - WDYT?
Separately, I was thinking about changing the initialization of the limiter so that it always inits with a valid now. This way we don't have to handle the 0 case in the "hot" path.
This would change the startup behavior - we'd probably start with "maxSlack" accumulated requests on first request, where now we don't. The one "if/else" didn't seem worth it though.
I'll try to rewrite it that way
If you do, please just do the switch.
For the second part (the initialization change) let's keep it separate - I'd like to keep the initialization consistent across all implementations. I was just asking for your opinion about it.
@rabbbit
Hey, I made the following implementation in a separate branch; it is a bit simpler, but also 3.3 ns slower: 27.4 ns vs 24.1 ns. I think it is because of using time.Time.Sub() or time.Time.Add(), which do some additional checks and handle internal state that int64 math doesn't.
But this implementation handles the zero state better. What do you think if I use that approach for state handling, but leave the int64 math around?
We can also merge this PR as is and deal with refactoring in subsequent PR.
I think we're conflating 3 different changes here, right?
- initial state change
- single vs multiple CAS
- int64 vs time types
I'm not okay with changing (1) (sorry, I left a note on this above too) - if we decide to change the initial state, we should change it in all implementations, and potentially add a test to formalize the startup behavior. I would definitely do this separately, or perhaps not at all.
I think I'd like us to decide on (2) in the current PR. As of my current understanding, there's no performance benefit in the original version, and the code seems clearer. So this seems like something we should just do.
I'm okay with leaving (3) as a follow-up. It seems like a tradeoff between a tiny bit of speed (2ns-3ns) vs developer experience (compiler help). There's also a question of correctness, but I'm not sure right now how the serialization to int64 plays with the monotonic clock. We can potentially iterate on that separately.
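For context, the monotonic-clock point refers to this property of the time package (a tiny standalone sketch, not code from this PR): converting a time.Time to int64 nanoseconds keeps only the wall clock.

package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()            // carries a monotonic clock reading
	startNanos := start.UnixNano() // int64 wall-clock nanoseconds; the monotonic reading is dropped

	later := time.Now()
	// Subtraction on time.Time values uses the monotonic reading; subtraction on
	// the int64 values is pure wall-clock math, so a wall-clock step (NTP, manual
	// clock change) would be visible to an int64-based limiter.
	fmt.Println(later.Sub(start), later.UnixNano()-startNanos)
}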
Sorry for the amount of questions/comments :)
I agree on all 3 points. I'll quickly double-check that the switch has OK perf and make this change. The 2 other things we can discuss later, in issues maybe.
Results are definitely strange, but switch is slower:
I'm sorry about the number of comments. If this is getting annoying, let me know, and I can take over.
I'm okay with landing the switch version, and then iterating. For now:
- I prefer the readability over 5ns
- we're now significantly faster than the previous iteration anyway.
However, ^ made me realize that I don't necessarily care about the switch being present, but I'd like to have a single CAS operation. This means we could probably do something like:
	if timeOfNextPermissionIssue == 0 {
		// If this is our first request, then we allow it.
		newTimeOfNextPermissionIssue = now - int64(t.perRequest)
	}

	if now-timeOfNextPermissionIssue > int64(t.maxSlack) {
		// a lot of nanoseconds passed since the last Take call
		// we will limit max accumulated time to maxSlack
		newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
	} else {
		// calculate the time at which our permission was issued
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}

	if atomic.CompareAndSwapInt64(&t.state, timeOfNextPermissionIssue, newTimeOfNextPermissionIssue) {
		break
	}

Which I think is largely equivalent to your original version, but easier to parse? We're doing exactly one more subtraction in the startup phase, but that shouldn't matter.
But, again, I'm okay with the switch version if you'd like to land that. And we can iterate from that.
Yes, let's land the switch for now and then iterate. It could be that having only one CAS is what's causing the slowdown, or maybe we won't need the handling of timeOfNextPermissionIssue == 0 in the future. We will see.
Done
Yeah, we've seen that in the past, but didn't debug (we don't make many changes to this repo). It seems to be happening mostly, but not exclusively, during codecov generation.
Another surprising part was that for your i7 the 1, 2, 4 goroutine cases were slower, but it was reversed for the M1 machine. Magic.
Nice. The performance of the limiter seems to be significantly slower with more goroutines here - presumably because time calculation is more expensive here? Thanks for doing this, and sorry for all the nits/questions :) Separately, this PR probably resolves #22 - we see that the mutex was performing better than the old atomic version.
It was not the case on go 1.16 or go 1.15, but it is now, I think.
M1 has 4 slow cores and 4 fast cores. I think the 4 fast cores start working first, and when the 4 slow ones join the party they only get in the way. If you want, I can run the techniques_comp benches on M1 to see the whole picture.
@rabbbit It looks like our clock mocking library is super old and archived by the author.
type atomicInt64Limiter struct {
	//lint:ignore U1000 Padding is unused but it is crucial to maintain performance
	// of this rate limiter in case of collocation with other frequently accessed memory.
	prepadding [64]byte // cache line size = 64; created to avoid false sharing.
thinking about this more, I think this can be [56] as well?
Prepadding is usually used to avoid accidental colocation with other contended objects on the heap; since we don't know their size, we just use the full cache line size.
Am I mistaken thinking that [56] achieves the same target?
It seems to me that a prepadding of 56, a size of 8, and a post-padding of 56 guarantee that the object in question will always be "alone" in its cache line?
	case timeOfNextPermissionIssue == 0:
		// If this is our first request, then we allow it.
		newTimeOfNextPermissionIssue = now
	case now-timeOfNextPermissionIssue > int64(t.maxSlack):
I think we can reduce the number of subtractions here potentially too.

	maxTimeOfNextPermissionIssue := now - int64(t.maxSlack)
	if timeOfNextPermissionIssue < maxTimeOfNextPermissionIssue {
		newTimeOfNextPermissionIssue = maxTimeOfNextPermissionIssue
	} else {
		newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
	}

We can micro-optimize this separately though, in follow-ups.
This limiter was introduced and merged in the following PR: #85. Later, @twelsh-aw found an issue with this implementation: #90. So @rabbbit reverted this change in #91. Our tests did not detect this issue, so we have a separate PR, #93, that enhances our testing approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug. Right after that, we'll open a subsequent PR to fix this bug.
Hi everyone,
It's been a few years since we introduced the atomic-based rate limiter. Since then, after numerous changes to the runtime scheduler, the mutex-based implementation has become much more stable.
So I found a way to improve the atomic-based implementation further and squeeze its state into one int64. The new implementation is much faster, stable under contention, and has zero allocations.

Benchmarks on an 8-core Intel i7 machine:

Benchmarks on an 8-core M1 machine:
