-
Notifications
You must be signed in to change notification settings - Fork 5k
Block layout of partition loop not ideal #108794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
You are right about the code produced when tiering disabled, however, if we run the tiering and let the JIT collect profile data, we make better decision of pushing out the cold block outside of loop. If you see below, from profile, we find out that Diffs comparison of As for the compactness/reuse portion of loop, that is something we can investigate further. |
Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up. Thanks |
@dotnet/jit-contrib |
If you have any examples handy, I'd like to take a look. We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon. Our block layout plans are quite reliant on profile data, though the JIT frontend has some tricks to determine which blocks are important in the absence of profile data (for example, see |
Don't have anything on hand right now but will keep an eye out next time I open Ghidra 🙂. Some of them look like this one: #93536 except torn loops or "why does this even have jump threading" cases appear under different conditions now. In any case, thanks for looking into this. |
This is good to see - I didn't realise profile data was being used this way. However, this shouldn't need profile data to optimise.
Ok, great. If I see anything else, I'll make sure to raise issues or comment here. |
With a mainline JIT, I'm still getting the same layout with tiering disabled as above. The block layout strategy we've implemented in .NET 10 relies on block weights to model the quality of a particular layout, and we've reworked upstream phases to propagate profile data changes as we modify the flowgraph. This primarily benefits PGO scenarios; for non-PGO scenarios, we still rely on block weight heuristics that prove to be subpar in cases like the above. We want to unify our approach such that we always run profile synthesis early in the frontend, and when we don't have profile data, we use heuristics-based edge likelihoods instead. Then, changes to the flowgraph will always include profile updates, regardless of whether the profile is PGO-based or entirely synthetic. For reference, this is what the current block layout strategy gives us if we turn on edge likelihood heuristics (the BBnums have since changed, so look at the BBids to compare):
The heuristics get the preferred successor of |
Consider:
All entries less than then
input[0]
go intoleft
, everything else goes intoright
.This is the main routine inside a quicksort.
On Arm64 this runs at half the performance of the equivalent C++ version compiled by GCC.
I suspect X86 has similar issues but I haven't tried yet.
The C# disassembly looks reasonable:
However the C++ version is a little more cunning:
Due to the block ordering there are fewer branches.
In coreclr, there are always two jumps per loop iteration.
In gcc, there are zero to two jumps per loop iteration.
This difference in block order has a huge impact on performance, halving the performance of the coreclr version
The text was updated successfully, but these errors were encountered: