Conversation

@adamgfraser (Contributor) commented Jan 23, 2021

Operations on the RingBuffer take place in three phases:

  1. Determination of whether a space is available based on its seq value
  2. Reservation of the space by compare and set on the head or tail
  3. Publishing the changes by setting the seq

Unfortunately, given the concurrent nature of the RingBuffer, there is no way to determine whether a group of spaces is available other than checking each of their seq values. However, we can still take advantage of bulk operations by doing the compare and swap once to reserve a whole block of spaces after we determine that they are available. This reduces the number of compare and swap operations from n to 1. The disadvantage is that there is more time between reading the initial state and doing the compare and swap, since we have to check the seq values in between.
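To make the three phases concrete, here is a minimal sketch of the bulk reservation idea on a simplified, JCTools-style ring buffer. This is not the actual ZIO RingBuffer code; the field names, the posToIdx helper, and the use of a plain List are illustrative only, and only the offer side is shown.

import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray}

final class BulkRingBuffer[A](capacity: Int) {
  private val buf  = new Array[AnyRef](capacity)
  private val seq  = new AtomicLongArray(capacity)
  private val tail = new AtomicLong(0L)
  (0 until capacity).foreach(i => seq.set(i, i.toLong))

  private def posToIdx(pos: Long): Int = (pos % capacity).toInt

  // Offers as many elements as currently fit and returns the leftovers.
  def offerAll(as: List[A]): List[A] = {
    val wanted = as.size
    var result = as
    var done   = false
    while (!done) {
      val curTail = tail.get()
      // Phase 1: check the seq value of each candidate slot to see how many
      // of the next `wanted` positions are actually free for enqueue.
      var free = 0
      while (free < wanted && seq.get(posToIdx(curTail + free)) == curTail + free)
        free += 1
      if (free == 0) done = true
      // Phase 2: one compare-and-set on the tail reserves the whole block.
      else if (tail.compareAndSet(curTail, curTail + free)) {
        // Phase 3: write each element and publish its slot by bumping seq.
        var i         = 0
        var remaining = as
        while (i < free) {
          val idx = posToIdx(curTail + i)
          buf(idx) = remaining.head.asInstanceOf[AnyRef]
          seq.set(idx, curTail + i + 1)
          remaining = remaining.tail
          i += 1
        }
        result = remaining
        done = true
      }
      // else: another producer moved the tail; retry with fresh state,
      // which is where contention costs show up.
    }
    result
  }
}

In this sketch, a successful tail CAS from curTail also validates the earlier seq checks: no other producer can have reserved those positions in between, because producers only write after winning the tail.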

As a baseline, here is the performance of the existing ZQueue benchmarks on master, run on my local machine with 1,000 elements and no chunking. Note that this is with my latest optimizations to takeBetween.

[info] Benchmark                           Mode  Cnt     Score    Error  Units
[info] QueueParallelBenchmark.zioQueue    thrpt    5  5043.535 ± 173.277  ops/s
[info] QueueSequentialBenchmark.zioQueue  thrpt    5  9357.960 ± 61.826  ops/s

Here is the performance, also on master, of a new benchmark that is exactly the same except that it uses offerAll and takeBetween with a chunk size of 10.

[info] Benchmark                                Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.zioQueueParallel    thrpt    5   6288.556 ± 255.551  ops/s
[info] QueueChunkBenchmark.zioQueueSequential  thrpt    5  20674.860 ±  86.972  ops/s

Here is the performance of the same benchmark on this branch.

[info] Benchmark                                Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.zioQueueParallel    thrpt   45  13610.646 ± 116.241  ops/s
[info] QueueChunkBenchmark.zioQueueSequential  thrpt   45  28048.018 ± 186.584  ops/s

One risk with this strategy is that under very high contention we could get stuck repeatedly retrying. It could be interesting to explore tracking how many times we have retried a bulk operation and either reverting to the single-element case or doing exponential backoff on the chunk size we attempt to offer.
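As a rough illustration of that idea (hypothetical, not part of this PR): the retry loop could halve the block size each time a reservation CAS is lost, degrading toward the single-element case under heavy contention. The tryOfferBlock parameter below stands in for one pass of phases 1-3.

import scala.annotation.tailrec

// Hypothetical sketch: `tryOfferBlock` attempts one block reservation
// (phases 1-3) and reports whether its compare-and-set won.
@tailrec
def offerAllWithBackoff[A](as: List[A], blockSize: Int)(tryOfferBlock: List[A] => Boolean): Unit =
  if (as.nonEmpty) {
    val size          = math.max(1, blockSize)
    val (block, rest) = as.splitAt(size)
    if (tryOfferBlock(block)) offerAllWithBackoff(rest, size)(tryOfferBlock) // keep the block size
    else offerAllWithBackoff(as, size / 2)(tryOfferBlock)                    // back off and retry
  }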

@adamgfraser adamgfraser requested review from iravid and jdegoes January 24, 2021 00:06
def zioQueueParallel(): Int = {

  val io = for {
    offers <- IO.forkAll {

Member:

forkAll_ since you don't want to measure the overhead of the join concatenation.

Contributor Author:

I don't think we can use forkAll_ here because we need a fiber that we can await to ensure that the work was completed.
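For context, a small sketch assuming ZIO 1.x (offerEffects and takeEffects are placeholders for the real per-fiber loops): forkAll hands back a single fiber the benchmark can await or join to know all forked work finished, whereas forkAll_ discards the fibers.

import zio._

def run(offerEffects: List[UIO[Unit]], takeEffects: List[UIO[Unit]]): UIO[Unit] =
  for {
    offers <- ZIO.forkAll(offerEffects) // one fiber wrapping all of the offer fibers
    takes  <- ZIO.forkAll(takeEffects)
    _      <- offers.await              // wait until every offer loop has completed
    _      <- takes.await               // wait until every take loop has completed
  } yield ()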


  val io = for {
    offers <- IO.forkAll {
      List.fill(parallelism) {

Member:

Probably best to fill this list outside of the benchmark.

Contributor Author:

Done.
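A minimal sketch of that suggestion in JMH terms (names here are illustrative, not the actual benchmark code): anything that can be precomputed, such as the chunk being offered or the per-fiber effect list, is built once in a @Setup method rather than inside the measured method.

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class ChunkSetupSketch {
  @Param(Array("10"))
  var chunkSize: Int = _

  var chunk: List[Int] = _ // built once, reused by every benchmark invocation

  @Setup(Level.Trial)
  def createChunk(): Unit =
    chunk = List.fill(chunkSize)(0)
}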

        repeat(totalSize * 1 / chunkSize * 1 / parallelism)(zioQ.offerAll(chunk).unit)
      }
    }
    takes <- IO.forkAll {

Member:

Ditto.

      }
    }
    takes <- IO.forkAll {
      List.fill(parallelism) {

Member:

Ditto.

    else task.flatMap(_ => repeat(task, max - 1))

  val io = for {
    _ <- repeat(zioQ.offerAll(chunk).unit, totalSize / chunkSize)

Member:

Consider ZIO#repeatN.

Contributor Author:

This skews the benchmark pretty severely because repeatN inserts a yield between every repetition.
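For reference, the diff only shows the recursive branch of this helper; a plausible full version (the base case here is a guess) simply chains the effect with flatMap, so no yield is inserted between repetitions.

import zio._

def repeat[R, E, A](task: ZIO[R, E, A], max: Int): ZIO[R, E, A] =
  if (max <= 1) task                              // assumed base case: the last repetition
  else task.flatMap(_ => repeat(task, max - 1))   // no yield between repetitions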

      }
    }
    takes <- IO.forkAll {
      List.fill(parallelism) {

Member:

Ditto.

    while (enqHead < enqTail) {
      val a = iterator.next()
      curIdx = posToIdx(enqHead, capacity)
      buf(curIdx) = a.asInstanceOf[AnyRef]

Member:

This is tricky logic to manually verify. I suggest we make a stress test, which can bombard a ring buffer with concurrent operations (of the newly-added variety) and verify invariants. This can be a pretty robust way to catch race conditions in concurrent code.

Contributor Author:

Added.
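For readers, here is the general shape such a stress test can take (a sketch in zio-test 1.x style; the object name and some details, such as the exact collection type returned by takeBetween, are assumed rather than taken from this PR): several producers offer disjoint ranges through offerAll while the test drains the queue with takeBetween, and every offered element must come out exactly once.

import zio._
import zio.test._
import zio.test.Assertion._

object QueueStressSketchSpec extends DefaultRunnableSpec {

  val producerCount = 8
  val perProducer   = 1000

  // Keep taking until `remaining` elements have been seen.
  def drain(queue: Queue[Int], remaining: Int, acc: List[Int]): UIO[List[Int]] =
    if (remaining <= 0) UIO.succeed(acc)
    else
      queue
        .takeBetween(1, remaining)
        .flatMap(taken => drain(queue, remaining - taken.size, acc ++ taken))

  def spec = suite("RingBuffer stress")(
    testM("offerAll and takeBetween lose no elements under contention") {
      val total    = producerCount * perProducer
      val expected = (0 until total).toList
      for {
        queue     <- Queue.bounded[Int](64)
        producers <- ZIO.forkAll {
                       List.tabulate(producerCount) { p =>
                         queue.offerAll(p * perProducer until (p + 1) * perProducer).unit
                       }
                     }
        taken     <- drain(queue, total, Nil)
        _         <- producers.join
      } yield assert(taken.sorted)(equalTo(expected))
    }
  )
}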

@jdegoes (Member) left a comment:

Overall, looks great! Love the fact it does not involve changes to existing code.

I had a few suggestions for improving the benchmark. My main suggestion is that I think we need at least one stress test that can verify invariants hold under concurrent load to the newly-added operations. Then will be good to merge!

@adamgfraser (Contributor Author) commented:

@jdegoes I added concurrency stress tests and refined the benchmarks. Results are similar.

Here is the current performance on master. Chunking is marginally faster for parallel operations and significantly faster for sequential operations.

[info] Benchmark                               Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.parallel           thrpt    5   5267.461 ± 479.507  ops/s
[info] QueueChunkBenchmark.parallelChunked    thrpt    5   6381.639 ± 184.475  ops/s
[info] QueueChunkBenchmark.sequential         thrpt    5  10983.624 ± 112.716  ops/s
[info] QueueChunkBenchmark.sequentialChunked  thrpt    5  18385.563 ± 104.438  ops/s

Here are the same benchmarks on this branch. The benchmarks without chunking are the same because that code has not changed. With chunking, performance increases dramatically in both the parallel and sequential cases.

[info] Benchmark                               Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.parallel           thrpt    5   5264.719 ±  58.599  ops/s
[info] QueueChunkBenchmark.parallelChunked    thrpt    5  13599.361 ± 161.081  ops/s
[info] QueueChunkBenchmark.sequential         thrpt    5  10832.904 ± 145.367  ops/s
[info] QueueChunkBenchmark.sequentialChunked  thrpt    5  32397.516 ± 545.071  ops/s

@adamgfraser adamgfraser requested a review from jdegoes January 26, 2021 23:12
@jdegoes (Member) left a comment:

Excellent work! 💪

@jdegoes jdegoes merged commit ba0992f into zio:master Jan 27, 2021
@adamgfraser adamgfraser deleted the ringbuffer branch January 27, 2021 00:03