Optimize fork-join performance #8745
Conversation
@kyri-petrou Fantastic work! I will do a more detailed review in the next couple of days. I think there is a way to split using Algora, but I am not sure how. I will see if I can find out.
@jdegoes in the unlikely case that you already started reviewing this PR, I just pushed a commit that further optimizes the … In addition, after understanding more how …

I initially became aware of the issues with the ZScheduler.Supervisor when I noticed that running the benchmarks caused additional workers to be spawned (even with no blocking code in sight). The changes I made in the latest commit w.r.t. how the time between parking works somewhat fixed it, but the underlying issue remains.
Another question: with ZScheduler now being much faster than Loom on JDK 21, shouldn't we just keep using it?
# Conflicts:
#   core/shared/src/main/scala/zio/FiberId.scala
#   core/shared/src/main/scala/zio/internal/FiberRuntime.scala
@jdegoes I think the PR is ready for re-review. The main comment that is not currently fully addressed is this one: #8745 (comment). Let me know what you think! I also updated the benchmarking results to reflect the recent changes. With FiberRoots disabled, we're starting to look at a ~20x performance increase over …
Marking the PR as Ready for Review as I've finished writing the tests that I wanted to add.
@kyri-petrou I fixed the build-website CI issue; if you rebase, hopefully the CI will be green.
Done :)
If I may ask, does the Loom executor imply an essentially 1:1 correspondence between JDK threads and ZIO fibers? If yes, then the Loom executor may have benefits from a usability point of view: easier debugging, better interoperability with 3rd-party tools made for Java/the JDK, like logging/telemetry tools (which are in turn designed to work with threads), etc.
Leaving this as draft for the time being as I want to add tests for the newly added queues and until we decide whether the approach for generating `FiberId`s is solid (more below).

/claim #8611
/fixes #8611
/split @ghostdogpr
@jdegoes assuming you're happy to award the bounty for this PR, I'd like to share it with @ghostdogpr as we worked on this together. Is there something I need to do to indicate to the Algora bot to split the payment?
Now, let's dig into the changes in this PR.
Avoiding global resource contention
Anyone who gave it a go at resolving this issue (myself included) probably kept running into situations where optimizing something didn't show any improvement in the benchmarks. The reason for this is that there are multiple points where threads attempt to read/write from the same globally initialised resource, effectively limiting the overall throughput of the ZIO runtime to the throughput of that individual object. So unless all of the places of contention are resolved, the benchmarks wouldn't show any improvement, even if the individual optimization was solid. The biggest culprits:
1. `ZScheduler#globalQueue`
2. `WeakConcurrentBag#nursery`
3. `FiberId.make`
4. `ZScheduler#submittedLocations`

For (1) and (2), the solution I came up with was to use a "partitioned" queue consisting of multiple sub-queues, so that when threads are offering / polling from them the chance of contention is reduced. Interestingly, when I first thought of this approach and shared an initial implementation with @ghostdogpr, he pointed out that it was scarily similar to the cats-effect global queue. At this point we knew we were on the right track and did what any software engineer worth their buck would do: "ported" (i.e., stole) any improvement we could see in their implementation into ours.
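As a rough illustration of the partitioning idea (a hypothetical sketch, not the actual queue implementations added in this PR; all names below are made up), each thread derives a "home" sub-queue from its identity, and polling scans the partitions starting from that home index, so concurrent threads rarely collide on the same underlying queue:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical sketch of a partitioned MPMC queue. Producers and consumers
// start from a partition derived from their thread id, so under load most
// threads touch different sub-queues and contention on any single queue drops.
final class PartitionedQueueSketch[A](partitions: Int) {
  private[this] val queues =
    Array.fill(partitions)(new ConcurrentLinkedQueue[A]())

  // Map the current thread to a "home" partition.
  private[this] def homeIndex(): Int =
    (Thread.currentThread().getId % partitions).toInt

  def offer(a: A): Unit = {
    queues(homeIndex()).offer(a)
    ()
  }

  // Poll the home partition first, then scan the remaining partitions so that
  // items offered by threads that have since gone idle are still picked up.
  def poll(): Option[A] = {
    val start = homeIndex()
    var i     = 0
    while (i < partitions) {
      val a = queues((start + i) % partitions).poll()
      if (a != null) return Some(a)
      i += 1
    }
    None
  }
}
```

The real implementations also need to handle capacity, size tracking and randomized starting offsets, but the contention-spreading principle is the same.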
Now (3) is something I'm very conflicted about, and would appreciate feedback on. The current implementation uses an `AtomicInteger` and increments it whenever a new `FiberId` is created. The problem with this approach is that when multiple threads are creating fibers, there is contention on updating the `AtomicInteger`. This acts as a big bottleneck that limits the overall scalability of the ZIO runtime. The solution in this PR is to allocate a random `Int` instead, but there are 2 issues with this approach:

1. There is a (small) chance that two fibers with the same `startTimeMilli` and `location` will have the same `id`. One way to reduce the chances even more is to use a random `Long` as the `id`.
2. Previously, `Fiber.Runtime` objects were ordered based on `(startTimeMilli, id)`, and with these changes we lose the ability to order based on `id` in case of a `startTimeMilli` collision. Having said that, the ordering is only used in `zio.test` macros, so this might not be a big issue?
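To make the trade-off concrete, here is a simplified sketch (illustrative only; the field names and methods below are not the exact shape of `zio.FiberId`): the counter-based approach serializes every fiber creation on a single `AtomicInteger`, while the random approach has no shared state but admits a small collision probability for fibers that share the same timestamp and location:

```scala
import java.util.concurrent.ThreadLocalRandom
import java.util.concurrent.atomic.AtomicInteger

// Simplified stand-in for a fiber id; names are illustrative.
final case class SketchFiberId(id: Int, startTimeMillis: Long, location: String)

object SketchFiberId {
  private val counter = new AtomicInteger(0)

  // Counter-based generation: every fiber creation, on every thread, contends
  // on the same AtomicInteger.
  def makeCounted(location: String): SketchFiberId =
    SketchFiberId(counter.getAndIncrement(), System.currentTimeMillis(), location)

  // Random generation (the approach taken in this PR, sketched): no shared
  // state, but two fibers with the same timestamp and location can (rarely)
  // end up with the same id. A random Long would shrink that probability.
  def makeRandom(location: String): SketchFiberId =
    SketchFiberId(ThreadLocalRandom.current().nextInt(), System.currentTimeMillis(), location)
}
```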
Finally, (4) was a bit easier to solve. Instead of having a globally-defined map to store the `submittedLocations`, the way to avoid the single-resource contention was to have each `ZScheduler.Worker` submit its own locations into its own Map. The `ZScheduler.Supervisor` (the only place that reads from `submittedLocations`) then aggregates the submitted locations from all workers. Since the supervisor only needs to read these locations when it suspects a worker thread is blocking unnecessarily (which is never if users properly wrap blocking code in `ZIO.blocking`), this is much cheaper than having all workers write to the same globally shared Map.
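A hypothetical sketch of this per-worker bookkeeping (names below are made up for illustration, not the real `ZScheduler` internals): each worker writes only to its own map, and the supervisor merges the per-worker maps only when it actually needs them:

```scala
import scala.collection.concurrent.TrieMap

object SubmittedLocationsSketch {
  type Location = String // stand-in for the Trace values the scheduler records

  // Each worker owns its own map; only that worker ever writes to it, so
  // recording a submission never contends with other workers. A concurrent
  // map is used only so the supervisor can read it safely from another thread.
  final class WorkerLocations {
    private val submitted = TrieMap.empty[Location, Int]

    def recordSubmission(location: Location): Unit =
      submitted.update(location, submitted.getOrElse(location, 0) + 1)

    def snapshot(): Map[Location, Int] = submitted.readOnlySnapshot().toMap
  }

  // The supervisor aggregates across workers only when it suspects a worker is
  // blocked, keeping the expensive merge off the hot path.
  def aggregate(workers: Seq[WorkerLocations]): Map[Location, Int] =
    workers.foldLeft(Map.empty[Location, Int]) { (acc, worker) =>
      worker.snapshot().foldLeft(acc) { case (m, (loc, n)) =>
        m.updated(loc, m.getOrElse(loc, 0) + n)
      }
    }
}
```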
Improved thread scheduling(?)
Now this is an interesting one. When a thread is spawning multiple child fibers (e.g., in the case of `ZIO.foreachPar` or when using `.fork` in a loop), we found that it's much more performant if the thread yields after every X number of forks. This brings a huge improvement because the forking thread occasionally yields instead of continuously enqueueing more runnables into the global queues, giving the newly spawned workers a chance to complete the tasks in their local queues before yielding themselves (in the case of async jobs). We found that yielding every ~100 forks is a good enough trade-off between the added overhead of yielding and the improved thread scheduling.
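The PR applies this heuristic inside the runtime itself, but the effect is easy to picture with plain ZIO user code. A sketch (the name `forkAllBatched` and the constant 100 are illustrative, not part of the PR's API):

```scala
import zio._

object ForkInBatches {
  // Fork a batch of effects, cooperatively yielding every `yieldEvery` forks so
  // that newly spawned workers get a chance to drain their local queues instead
  // of the forking thread spending its entire time slice enqueueing runnables.
  def forkAllBatched[R, E, A](
    effects: List[ZIO[R, E, A]],
    yieldEvery: Int = 100 // ~100 was found to be a reasonable trade-off
  ): URIO[R, List[Fiber.Runtime[E, A]]] =
    ZIO.foreach(effects.zipWithIndex) { case (effect, i) =>
      val forked = effect.fork
      if ((i + 1) % yieldEvery == 0) forked <* ZIO.yieldNow else forked
    }
}
```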
CPU hot-path optimizations
Besides resolving global resource contention and thread scheduling, this PR also optimizes the following, which were found to improve performance fairly substantially:

- Fibers in the `WeakConcurrentBag` are now wrapped in `WeakReference`. This means that fibers can be GC'd much earlier
- Optimizations to the `WeakConcurrentBag` under "auto gc" conditions
- Optimized `FiberRefs` to avoid boxing of `Int`s as much as possible and to avoid calling `transform` on the Map if we know it won't transform any of the entries
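As an illustration of the `WeakReference` point (a hypothetical sketch, not the actual `WeakConcurrentBag`): holding values only weakly means a completed fiber that nothing else references becomes collectable immediately, and cleared references can be pruned lazily:

```scala
import java.lang.ref.WeakReference
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical sketch: the bag holds only weak references, so values that are
// no longer strongly referenced elsewhere can be GC'd without waiting for the
// bag itself to be trimmed.
final class WeakBagSketch[A <: AnyRef] {
  private[this] val entries = new ConcurrentLinkedQueue[WeakReference[A]]()

  def add(value: A): Unit = {
    entries.add(new WeakReference(value))
    ()
  }

  // Prune cleared references and return the values that are still alive.
  def gcAndSnapshot(): List[A] = {
    val it   = entries.iterator()
    var live = List.empty[A]
    while (it.hasNext) {
      val value = it.next().get()
      if (value == null) it.remove() else live = value :: live
    }
    live
  }
}
```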
Benchmarking results
TLDR:

- ZScheduler-based runtime is ~6.5x faster with `FiberRoots` enabled
- ZScheduler-based runtime is ~15x faster without `FiberRoots` enabled

cats-effect (baseline)

series/2.x (Note: this is the same with / without `FiberRoots` enabled)

PR
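For reference, the comparisons above come from fork-join-style JMH benchmarks. A rough sketch of that kind of benchmark (class name and sizes here are illustrative, not the exact benchmark in the repo):

```scala
import org.openjdk.jmh.annotations._
import zio._

// Illustrative fork-join benchmark: fork N trivial fibers and join them all.
// This mirrors the shape of the benchmarks referenced above, not their exact code.
@State(Scope.Thread)
@BenchmarkMode(Array(Mode.Throughput))
class ForkJoinSketchBenchmark {
  private val runtime = Runtime.default
  private val n       = 10000

  @Benchmark
  def forkJoin(): Unit = {
    val program =
      ZIO
        .foreach((1 to n).toList)(_ => ZIO.unit.forkDaemon)
        .flatMap(fibers => ZIO.foreach(fibers)(_.join))
        .unit

    Unsafe.unsafe { implicit unsafe =>
      runtime.unsafe.run(program).getOrThrowFiberFailure()
    }
  }
}
```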