Conversation

@adamgfraser (Contributor) commented Jan 23, 2021

Operations on the RingBuffer take place in three phases:

  1. Determination of whether a space is available based on its seq value
  2. Reservation of the space by compare and set on the head or tail
  3. Publishing the changes by setting the seq

Unfortunately, given the concurrent nature of the RingBuffer, there is no way to determine whether a group of spaces is available other than checking each of their seq values. However, we can still take advantage of bulk operations by doing the compare and swap once to reserve a whole block of spaces after we determine that they are available. This reduces the number of compare and swap operations from n to 1. The disadvantage is that there is more time between reading the initial state and doing the compare and swap, since we have to check the seq values in between.
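To make the three phases concrete, here is a minimal sketch of the bulk reservation idea on a simplified, JCTools-style ring buffer. This is not the actual ZIO RingBuffer code; the field names, the posToIdx helper, and the use of a plain List are illustrative only, and only the offer side is shown.

import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray}

final class BulkRingBuffer[A](capacity: Int) {
  private val buf  = new Array[AnyRef](capacity)
  private val seq  = new AtomicLongArray(capacity)
  private val tail = new AtomicLong(0L)
  (0 until capacity).foreach(i => seq.set(i, i.toLong))

  private def posToIdx(pos: Long): Int = (pos % capacity).toInt

  // Offers as many elements as currently fit and returns the leftovers.
  def offerAll(as: List[A]): List[A] = {
    val wanted = as.size
    var result = as
    var done   = false
    while (!done) {
      val curTail = tail.get()
      // Phase 1: check the seq value of each candidate slot to see how many
      // of the next `wanted` positions are actually free for enqueue.
      var free = 0
      while (free < wanted && seq.get(posToIdx(curTail + free)) == curTail + free)
        free += 1
      if (free == 0) done = true
      // Phase 2: one compare-and-set on the tail reserves the whole block.
      else if (tail.compareAndSet(curTail, curTail + free)) {
        // Phase 3: write each element and publish its slot by bumping seq.
        var i         = 0
        var remaining = as
        while (i < free) {
          val idx = posToIdx(curTail + i)
          buf(idx) = remaining.head.asInstanceOf[AnyRef]
          seq.set(idx, curTail + i + 1)
          remaining = remaining.tail
          i += 1
        }
        result = remaining
        done = true
      }
      // else: another producer moved the tail; retry with fresh state,
      // which is where contention costs show up.
    }
    result
  }
}

In this sketch, a successful tail CAS from curTail also validates the earlier seq checks: no other producer can have reserved those positions in between, because producers only write after winning the tail.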

As a baseline, here is the performance of the existing ZQueue benchmarks on master, run on my local machine with 1,000 elements and no chunking. Note that this is with my latest optimizations to takeBetween.

[info] Benchmark                           Mode  Cnt     Score    Error  Units
[info] QueueParallelBenchmark.zioQueue    thrpt    5  5043.535 ± 173.277  ops/s
[info] QueueSequentialBenchmark.zioQueue  thrpt    5  9357.960 ± 61.826  ops/s

Here is the performance, also on master, of a new benchmark that is exactly the same except that it uses offerAll and takeBetween with a chunk size of 10.

[info] Benchmark                                Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.zioQueueParallel    thrpt    5   6288.556 ± 255.551  ops/s
[info] QueueChunkBenchmark.zioQueueSequential  thrpt    5  20674.860 ±  86.972  ops/s

Here is the performance of the same benchmark on this branch.

[info] Benchmark                                Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.zioQueueParallel    thrpt   45  13610.646 ± 116.241  ops/s
[info] QueueChunkBenchmark.zioQueueSequential  thrpt   45  28048.018 ± 186.584  ops/s

One risk with this strategy is that under very high contention we could get stuck repeatedly retrying. It could be interesting to explore tracking how many times we have retried a bulk operation and either reverting to the single-element case or doing exponential backoff on the chunk size we attempt to offer.
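As a rough illustration of that idea (hypothetical, not part of this PR): the retry loop could halve the block size each time a reservation CAS is lost, degrading toward the single-element case under heavy contention. The tryOfferBlock parameter below stands in for one pass of phases 1-3.

import scala.annotation.tailrec

// Hypothetical sketch: `tryOfferBlock` attempts one block reservation
// (phases 1-3) and reports whether its compare-and-set won.
@tailrec
def offerAllWithBackoff[A](as: List[A], blockSize: Int)(tryOfferBlock: List[A] => Boolean): Unit =
  if (as.nonEmpty) {
    val size          = math.max(1, blockSize)
    val (block, rest) = as.splitAt(size)
    if (tryOfferBlock(block)) offerAllWithBackoff(rest, size)(tryOfferBlock) // keep the block size
    else offerAllWithBackoff(as, size / 2)(tryOfferBlock)                    // back off and retry
  }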

@adamgfraser adamgfraser requested review from iravid and jdegoes January 24, 2021 00:06
def zioQueueParallel(): Int = {

  val io = for {
    offers <- IO.forkAll {

Member:

forkAll_ since you don't want to measure the overhead of the join concatenation.

Contributor Author:

I don't think we can use forkAll_ here because we need a fiber that we can await to ensure that the work was completed.
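For context, a small sketch assuming ZIO 1.x (offerEffects and takeEffects are placeholders for the real per-fiber loops): forkAll hands back a single fiber the benchmark can await or join to know all forked work finished, whereas forkAll_ discards the fibers.

import zio._

def run(offerEffects: List[UIO[Unit]], takeEffects: List[UIO[Unit]]): UIO[Unit] =
  for {
    offers <- ZIO.forkAll(offerEffects) // one fiber wrapping all of the offer fibers
    takes  <- ZIO.forkAll(takeEffects)
    _      <- offers.await              // wait until every offer loop has completed
    _      <- takes.await               // wait until every take loop has completed
  } yield ()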


  val io = for {
    offers <- IO.forkAll {
      List.fill(parallelism) {

Member:

Probably best to fill this list outside of the benchmark.

Contributor Author:

Done.
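A minimal sketch of that suggestion in JMH terms (names here are illustrative, not the actual benchmark code): anything that can be precomputed, such as the chunk being offered or the per-fiber effect list, is built once in a @Setup method rather than inside the measured method.

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class ChunkSetupSketch {
  @Param(Array("10"))
  var chunkSize: Int = _

  var chunk: List[Int] = _ // built once, reused by every benchmark invocation

  @Setup(Level.Trial)
  def createChunk(): Unit =
    chunk = List.fill(chunkSize)(0)
}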

        repeat(totalSize * 1 / chunkSize * 1 / parallelism)(zioQ.offerAll(chunk).unit)
      }
    }
    takes <- IO.forkAll {

Member:

Ditto.

      }
    }
    takes <- IO.forkAll {
      List.fill(parallelism) {

Member:

Ditto.

    else task.flatMap(_ => repeat(task, max - 1))

  val io = for {
    _ <- repeat(zioQ.offerAll(chunk).unit, totalSize / chunkSize)

Member:

Consider ZIO#repeatN.

Contributor Author:

This skews the benchmark pretty severely because repeatN inserts a yield between every repetition.
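For reference, the diff only shows the recursive branch of this helper; a plausible full version (the base case here is a guess) simply chains the effect with flatMap, so no yield is inserted between repetitions.

import zio._

def repeat[R, E, A](task: ZIO[R, E, A], max: Int): ZIO[R, E, A] =
  if (max <= 1) task                              // assumed base case: the last repetition
  else task.flatMap(_ => repeat(task, max - 1))   // no yield between repetitions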

      }
    }
    takes <- IO.forkAll {
      List.fill(parallelism) {

Member:

Ditto.

    while (enqHead < enqTail) {
      val a = iterator.next()
      curIdx = posToIdx(enqHead, capacity)
      buf(curIdx) = a.asInstanceOf[AnyRef]

Member:

This is tricky logic to manually verify. I suggest we make a stress test, which can bombard a ring buffer with concurrent operations (of the newly-added variety) and verify invariants. This can be a pretty robust way to catch race conditions in concurrent code.

Contributor Author:

Added.
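For readers, here is the general shape such a stress test can take (a sketch in zio-test 1.x style; the object name and some details, such as the exact collection type returned by takeBetween, are assumed rather than taken from this PR): several producers offer disjoint ranges through offerAll while the test drains the queue with takeBetween, and every offered element must come out exactly once.

import zio._
import zio.test._
import zio.test.Assertion._

object QueueStressSketchSpec extends DefaultRunnableSpec {

  val producerCount = 8
  val perProducer   = 1000

  // Keep taking until `remaining` elements have been seen.
  def drain(queue: Queue[Int], remaining: Int, acc: List[Int]): UIO[List[Int]] =
    if (remaining <= 0) UIO.succeed(acc)
    else
      queue
        .takeBetween(1, remaining)
        .flatMap(taken => drain(queue, remaining - taken.size, acc ++ taken))

  def spec = suite("RingBuffer stress")(
    testM("offerAll and takeBetween lose no elements under contention") {
      val total    = producerCount * perProducer
      val expected = (0 until total).toList
      for {
        queue     <- Queue.bounded[Int](64)
        producers <- ZIO.forkAll {
                       List.tabulate(producerCount) { p =>
                         queue.offerAll(p * perProducer until (p + 1) * perProducer).unit
                       }
                     }
        taken     <- drain(queue, total, Nil)
        _         <- producers.join
      } yield assert(taken.sorted)(equalTo(expected))
    }
  )
}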

@jdegoes (Member) left a comment:

Overall, looks great! Love the fact it does not involve changes to existing code.

I had a few suggestions for improving the benchmark. My main suggestion is that I think we need at least one stress test that can verify invariants hold under concurrent load to the newly-added operations. Then will be good to merge!

@adamgfraser (Contributor Author) commented:

@jdegoes I added concurrency stress tests and refined the benchmarks. Results are similar.

Here is the current performance on master. Chunking is marginally faster for parallel operations and significantly faster for sequential operations.

[info] Benchmark                               Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.parallel           thrpt    5   5267.461 ± 479.507  ops/s
[info] QueueChunkBenchmark.parallelChunked    thrpt    5   6381.639 ± 184.475  ops/s
[info] QueueChunkBenchmark.sequential         thrpt    5  10983.624 ± 112.716  ops/s
[info] QueueChunkBenchmark.sequentialChunked  thrpt    5  18385.563 ± 104.438  ops/s

Here are the same benchmarks on this branch. The benchmarks without chunking are the same because that code has not changed. With chunking, performance increases dramatically in both the parallel and sequential cases.

[info] Benchmark                               Mode  Cnt      Score     Error  Units
[info] QueueChunkBenchmark.parallel           thrpt    5   5264.719 ±  58.599  ops/s
[info] QueueChunkBenchmark.parallelChunked    thrpt    5  13599.361 ± 161.081  ops/s
[info] QueueChunkBenchmark.sequential         thrpt    5  10832.904 ± 145.367  ops/s
[info] QueueChunkBenchmark.sequentialChunked  thrpt    5  32397.516 ± 545.071  ops/s

@adamgfraser adamgfraser requested a review from jdegoes January 26, 2021 23:12
@jdegoes (Member) left a comment:

Excellent work! 💪

@jdegoes jdegoes merged commit ba0992f into zio:master Jan 27, 2021
@adamgfraser adamgfraser deleted the ringbuffer branch January 27, 2021 00:03