
Conversation

@kyri-petrou
Contributor

@kyri-petrou kyri-petrou commented Apr 13, 2024

Leaving this as a draft for the time being: I still want to add tests for the newly added queues, and we need to decide whether the approach for generating FiberIds is solid (more below).

/claim #8611
/fixes #8611
/split @ghostdogpr

@jdegoes assuming you're happy to reward the bounty for this PR, I'd like to share it with @ghostdogpr as we worked on this together. Is there something I need to do to indicate to the Algora bot to split the payment?

Now, let's dig into the changes in this PR

Avoiding global resource contention

Anyone who had a go at resolving this issue (myself included) probably kept running into the same loop: optimizing something and seeing no improvement in the benchmarks. The reason is that there are multiple points where threads read/write the same globally initialised resource, effectively limiting the overall throughput of the ZIO runtime to the throughput of that individual object. So unless all of the points of contention are resolved, the benchmarks won't show any improvement - even if an individual optimization is solid. The biggest culprits:

  1. ZScheduler#globalQueue
  2. WeakConcurrentBag#nursery
  3. FiberRef.make
  4. ZScheduler#submittedLocations

For (1) and (2), the solution I came up with was to use a "partitioned" queue consisting of multiple sub-queues, so that when threads offer to / poll from them the chance of contention is reduced. Interestingly, when I first thought of this approach and shared an initial implementation with @ghostdogpr, he pointed out it was scarily similar to the cats-effect global queue. At that point we knew we were on the right track and did what any software engineer worth their buck would do: "ported" (i.e., stole) every improvement we could find in their implementation into ours.
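To make the idea concrete, here's a minimal sketch of what such a partitioned queue can look like. The names, the thread-based partition selection, and the work-stealing fallback are illustrative only and not the actual ZScheduler internals:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal sketch of a partitioned queue: N sub-queues, with the partition
// picked from the current thread so that concurrent producers rarely hit
// the same sub-queue. Poll falls back to "stealing" from other partitions.
final class PartitionedQueue[A <: AnyRef](partitions: Int) {
  private[this] val queues: Array[ConcurrentLinkedQueue[A]] =
    Array.fill(partitions)(new ConcurrentLinkedQueue[A]())

  private[this] def partitionIndex(): Int =
    (Thread.currentThread().getId % partitions).toInt

  def offer(a: A): Boolean =
    queues(partitionIndex()).offer(a)

  def poll(): A = {
    val start  = partitionIndex()
    var i      = 0
    var result = null.asInstanceOf[A]
    while ((result eq null) && i < partitions) {
      result = queues((start + i) % partitions).poll()
      i += 1
    }
    result
  }
}
```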

Now (3), this is something I'm very conflicted about, and I would appreciate feedback on it. The current implementation uses an AtomicInteger and increments it whenever a new FiberId is created. The problem with this approach is that when multiple threads are creating fibers, there is contention on updating the AtomicInteger, which acts as a big bottleneck limiting the overall scalability of the ZIO runtime. The solution in this PR is to allocate a random Int instead (see the sketch after the list below), but there are 2 issues with this approach:

  1. It's extremely unlikely but technically possible that 2 fibers created with the same startTimeMilli and location will have the same id. One way to reduce the chances even further is to use a random Long as the id.
  2. Previously, Fiber.Runtime objects were ordered based on (startTimeMilli, id), and with these changes we lose the ability to order based on id in case of a startTimeMilli collision. Having said that, the ordering is only used in zio.test macros, so this might not be a big issue?
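For reference, a rough sketch contrasting the two id-generation strategies. The real FiberId constructor and field names may differ; this only illustrates the contention trade-off:

```scala
import java.util.concurrent.ThreadLocalRandom
import java.util.concurrent.atomic.AtomicInteger

object FiberIdGen {
  // Before: every forking thread bumps the same AtomicInteger, so all
  // threads serialize on a single shared counter under heavy fork load.
  private val counter = new AtomicInteger(0)
  def sequentialId(): Int = counter.incrementAndGet()

  // After (this PR): each thread draws from its own ThreadLocalRandom, so
  // there is no shared state at all. Collisions are possible, but they only
  // matter if two fibers also share the same startTimeMilli and location.
  def randomId(): Int = ThreadLocalRandom.current().nextInt()
}
```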

Finally, (4) was a bit easier to solve. Instead of having a globally-defined map to store the submittedLocations, the way to avoid contention on a single resource was to have each ZScheduler.Worker record its own locations in its own Map. The ZScheduler.Supervisor (the only place that reads from submittedLocations) then aggregates the submitted locations from all workers. Since the supervisor only needs to read these locations when it suspects a worker thread is blocking unnecessarily (which is never if users properly wrap blocking code in ZIO.blocking), this is much cheaper than having all workers write to the same globally shared Map.
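A bare-bones sketch of this "write locally, aggregate on read" pattern. This is not the actual ZScheduler code; among other things, the real implementation also has to take care of memory visibility between the workers and the supervisor:

```scala
import scala.collection.mutable

// Each worker writes submission counts only to its own map (single writer),
// so the hot path never touches shared state.
final class Worker {
  val submittedLocations: mutable.Map[String, Int] = mutable.Map.empty

  def recordSubmission(location: String): Unit =
    submittedLocations.update(location, submittedLocations.getOrElse(location, 0) + 1)
}

// The supervisor merges the per-worker maps only when it actually needs them,
// i.e., when it suspects a worker is blocked. That read is rare, so its cost
// is irrelevant compared to the per-submission write path.
final class Supervisor(workers: Array[Worker]) {
  def aggregatedLocations(): Map[String, Int] =
    workers.foldLeft(Map.empty[String, Int]) { (acc, worker) =>
      worker.submittedLocations.foldLeft(acc) { case (m, (loc, n)) =>
        m.updated(loc, m.getOrElse(loc, 0) + n)
      }
    }
}
```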

Improved thread scheduling(?)

Now this is an interesting one. When a thread is spawning multiple child fibers (e.g., in the case of ZIO.foreachPar or when using .fork in a loop), we found that it's much more performant if the thread yields after every X number of forks. This brings a huge improvement because the forking thread periodically pauses enqueueing more runnables into the global queues, giving the newly spawned workers a chance to complete the work in their local queues before yielding themselves (in the case of async jobs). We found that yielding every ~100 forks is a good enough tradeoff between the added overhead of yielding and the improved thread scheduling.
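In sketch form, the bookkeeping amounts to something like the following; the threshold and where the counter actually lives in the runtime may differ:

```scala
// Per-fiber (or per-thread) fork counter: after every `yieldEvery` forks the
// parent yields, letting the freshly spawned workers drain their local queues.
final class ForkCounter(yieldEvery: Int = 100) {
  private[this] var forksSinceYield = 0

  // Called after each child fiber is enqueued; returns true when the parent
  // should yield to the scheduler before forking more children.
  def shouldYield(): Boolean = {
    forksSinceYield += 1
    if (forksSinceYield >= yieldEvery) {
      forksSinceYield = 0
      true
    } else false
  }
}
```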

CPU hot-path optimizations

Besides resolving global resource contention and improving thread scheduling, this PR also includes the following optimizations, which were found to improve performance fairly substantially (a sketch of (1) and (2) follows the list):

  1. The values added to WeakConcurrentBag are now wrapped in WeakReference. This means that fibers can be GC'd much earlier
  2. Only a single thread is allowed to perform GC in WeakConcurrentBag under "auto gc" conditions
  3. Made multiple optimizations to FiberRefs to avoid boxing of Ints as much as possible and avoid calling transform on the Map if we know it won't transform any of the entries
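A minimal sketch of what (1) and (2) boil down to - not the actual WeakConcurrentBag implementation, just the WeakReference wrapping and the single-GC-thread gate:

```scala
import java.lang.ref.WeakReference
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean

final class WeakBag[A <: AnyRef] {
  // (1) Entries are held via WeakReference, so a fiber becomes collectable as
  // soon as nothing else references it, without waiting for an explicit gc().
  private[this] val entries      = new ConcurrentLinkedQueue[WeakReference[A]]()
  // (2) Only the thread that wins this CAS performs the "auto gc" sweep;
  // everyone else skips it instead of piling up on the same work.
  private[this] val gcInProgress = new AtomicBoolean(false)

  def add(a: A): Unit = {
    entries.add(new WeakReference(a))
    ()
  }

  def gc(): Unit =
    if (gcInProgress.compareAndSet(false, true)) {
      try {
        entries.removeIf(_.get() eq null)
        ()
      } finally gcInProgress.set(false)
    }
}
```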

Benchmarking results

TLDR:

  • ZScheduler-based runtime is ~6.5x faster with FiberRoots enabled
  • ZScheduler-based runtime is ~15x faster with FiberRoots disabled
  • Loom-based runtime is ~2x faster. The bottleneck at this point is the Loom executor's scheduler, which unfortunately we can't optimize

cats-effect (baseline)

[info] Benchmark                        (n)   Mode  Cnt    Score   Error  Units
[info] ForkJoinBenchmark.catsForkJoin             10000  thrpt   10  2906.462 ±  14.976  ops/s

series/2.x (Note: this is the same with / without FiberRoots enabled)

JDK 17 - ZScheduler
[info] Benchmark                        (n)   Mode  Cnt    Score   Error  Units
[info] ForkJoinBenchmark.zioForkJoin  10000  thrpt   10  299.090 ± 5.982  ops/s

JDK21 - Loom
[info] Benchmark                        (n)   Mode  Cnt    Score    Error  Units
[info] ForkJoinBenchmark.zioForkJoin  10000  thrpt   10  448.071 ± 28.781  ops/s

PR

JDK 17 - ZScheduler
[info] Benchmark                                    (n)   Mode  Cnt     Score     Error  Units
[info] ForkJoinBenchmark.zioForkJoin              10000  thrpt   10  1931.372 ±  31.926  ops/s
[info] ForkJoinBenchmark.zioForkJoinNoFiberRoots  10000  thrpt   10  4812.394 ± 282.183  ops/s

JDK 21 - Loom
[info] Benchmark                                    (n)   Mode  Cnt    Score   Error  Units
[info] ForkJoinBenchmark.zioForkJoin              10000  thrpt   10  660.001 ± 5.386  ops/s
[info] ForkJoinBenchmark.zioForkJoinNoFiberRoots  10000  thrpt   10  758.261 ± 6.501  ops/s

@jdegoes
Member

jdegoes commented Apr 13, 2024

@kyri-petrou Fantastic work! I will do a more detailed review in the next couple of days.

I think there is a way to split using Algora, but I am not sure how. I will see if I can find out.

@kyri-petrou
Contributor Author

@jdegoes in the unlikely case that you already started reviewing this PR, I just pushed a commit that further optimizes the WeakConcurrentBag. I updated the benchmark results in the PR description to the new ones, but in short the throughput of the zioForkJoin benchmark increased from ~1400 to ~1900.

In addition, after understanding more how ZScheduler.Supervisor works, I'm extremely sceptical whether this should be enabled by default for a couple of reasons:

  1. In the case of very heavy CPU-bound work, it's not unlikely that an effect will take more than 100ms to execute. Marking that location as "blocking" and shifting it to the blocking threadpool can be really bad, as we'll then have CPU-bound workloads running on an unbounded threadpool
  2. There is no real correlation between how long an effect takes to execute and whether that effect is actually blocking (unless you're making an HTTP call to some server at the other side of the world). I'm struggling to see how using time as a proxy to identify blocking code can be correct
  3. In servers with very low resource allocation (1-2 CPUs), having the supervisor wake up every 100ms and perform GC on the _rootFibers can lead to CPU starvation during the time that the Supervisor is unparked. I think that, at the very least, Fiber._roots.graduate() should be called at a much lower frequency (perhaps every 1 - 10 seconds? see the sketch after this list)
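To make (3) concrete, here's a hedged sketch of the kind of time-based throttling I have in mind. The 5-second interval and the wrapper class are illustrative only, not existing ZIO internals:

```scala
// Hypothetical wrapper that only runs the GC pass once every `minIntervalMillis`,
// even if the supervisor itself wakes up every 100ms. It would be called from
// the supervisor loop, e.g. `throttle.maybeRun(() => Fiber._roots.graduate())`.
final class GraduateThrottle(minIntervalMillis: Long = 5000L) {
  private[this] var lastRun = 0L

  def maybeRun(graduate: () => Unit): Unit = {
    val now = System.currentTimeMillis()
    if (now - lastRun >= minIntervalMillis) {
      lastRun = now
      graduate()
    }
  }
}
```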

I initially became aware of the issues with the ZScheduler.Supervisor when I noticed that running the benchmarks caused additional workers to be spawned (even with no blocking code in sight). The changes I made in the latest commit to how the time between parkings is computed somewhat fixed it, but the underlying issue remains.

@ghostdogpr
Member

ghostdogpr commented Apr 14, 2024

In addition, after understanding more how ZScheduler.Supervisor works, I'm extremely sceptical whether this should be enabled by default for a couple of reasons:

I had some serious issues with it, detailed here: #8371 (see also #7074). Now at least we can disable it 😅

@ghostdogpr
Member

Another question: with ZScheduler now being much faster than Loom on JDK21, shouldn't we just keep using it?

@kyri-petrou
Contributor Author

@jdegoes I think the PR is ready for re-review. The main comment that is not yet fully addressed is this one: #8745 (comment).

Let me know what you think!

I also updated the benchmarking results to reflect the recent changes. With FiberRoots disabled, we're starting to look at a ~20x performance increase over v2.1-RC1 🚀

@kyri-petrou
Contributor Author

Marking the PR as Ready for Review, as I've finished writing the tests that I wanted to add.

@kyri-petrou kyri-petrou marked this pull request as ready for review April 18, 2024 12:52
@ghostdogpr
Member

@kyri-petrou I fixed the build website CI issue, if you rebase hopefully the CI will be green.

@kyri-petrou
Contributor Author

@kyri-petrou I fixed the build website CI issue, if you rebase hopefully the CI will be green.

Done :)

@sideeffffect
Member

ZScheduler is much faster than the Loom executor on JDK21, so does it still make sense to automatically use Loom?

If I may ask, does the Loom executor imply an essentially 1:1 correspondence between JDK threads and ZIO Fibers?

If yes, then the Loom executor may have benefits from a usability point of view: easier debugging, better interoperability with 3rd-party tools made for Java/the JDK, such as logging/telemetry libraries (which are in turn designed to work with Threads), etc.

guizmaii added a commit that referenced this pull request Apr 8, 2025
…performances of `Queue.unbounded`

See also:
- #8784
- #8745
guizmaii added a commit that referenced this pull request Apr 9, 2025
…performances of `Queue.unbounded` (#9762)

* Bring back the `addMetrics` optimization of `LinkedQueue` to improve performances of  `Queue.unbounded`

See also:
- #8784
- #8745

* Review: prefer `<field> ne null` over using `addMetrics`


Development

Successfully merging this pull request may close these issues.

Improve performance of fork/join by one order of magnitude as measured by ForkJoinBenchmark
