Block layout of partition loop not ideal #108794

a74nh · 2024-10-11T16:49:24Z

Consider:

    public static unsafe ulong PartitionScalar(ref uint* input, ref uint* left, ref uint* right, int length)
    {
      long i = 0;

      // Position within the output arrays.
      ulong index_left = 0;
      ulong index_right = 0;

      uint first = input[0];

      for (i = 0; i < length; i++)
      {
        if (input[i] < first)
        {
          left[index_left] = input[i];
          index_left++;
        }
        else
        {
          right[index_right] = input[i];
          index_right++;
        }
      }

      return index_right;
    }

All entries less than input[0] go into left; everything else goes into right.
This is the main routine inside a quicksort.

On Arm64 this runs at half the performance of the equivalent C++ version compiled by GCC.
I suspect X86 has similar issues but I haven't tried yet.
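
The C++ source used for the comparison is not shown in this issue. For reference, here is a minimal sketch of what it might look like. The parameter types are those implied by the mangled symbol `_Z15partitionScalarPjS_S_j` in the GCC listing further down (three `unsigned int*` and one `unsigned int`). The return type and the variable names are assumptions.

```cpp
#include <cstdint>

// Hypothetical reconstruction, not the code that was actually benchmarked.
// Like the C# version, it reads input[0] unconditionally, so `input` must
// point to at least one element even when length is 0.
std::uint64_t partitionScalar(unsigned int* input, unsigned int* left,
                              unsigned int* right, unsigned int length)
{
    // Positions within the output arrays.
    std::uint64_t index_left = 0;
    std::uint64_t index_right = 0;

    const unsigned int first = input[0];

    for (unsigned int i = 0; i < length; i++)
    {
        if (input[i] < first)
            left[index_left++] = input[i];
        else
            right[index_right++] = input[i];
    }

    // Same return value as the C# version: the number of elements in `right`.
    return index_right;
}
```

Calling it with `{5, 3, 8, 1, 5, 9}` uses 5 as the pivot, so `3` and `1` go to `left` and the remaining four values go to `right`.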

The C# disassembly looks reasonable:

; Assembly listing for method sveincsharp.Partition:TestPartitionScalar(System.Span`1[uint],System.Span`1[uint],System.Span`1[uint],int):ulong (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x30]!
            mov     fp, sp
            stp     xzr, xzr, [fp, #0x20]
            str     xzr, [fp, #0x18]

G_M000_IG02:                ;; offset=0x0010
            stp     x2, x0, [fp, #0x20]
            str     x4, [fp, #0x18]
            mov     x1, xzr
            mov     x3, xzr
            ldr     w5, [x0]
            sxtw    x6, w6
            cmp     x6, #0
            ble     G_M000_IG08

G_M000_IG03:                ;; offset=0x0030
            align   [0 bytes for IG04]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]

G_M000_IG04:                ;; offset=0x0030
            ldr     w7, [x0]
            cmp     w7, w5
            blo     G_M000_IG06

G_M000_IG05:                ;; offset=0x003C
            str     w7, [x4, x3, LSL #2]
            add     x3, x3, #1
            b       G_M000_IG07

G_M000_IG06:                ;; offset=0x0048
            str     w7, [x2, x1, LSL #2]
            add     x1, x1, #1

G_M000_IG07:                ;; offset=0x0050
            add     x0, x0, #4
            sub     x6, x6, #1
            cbnz    x6, G_M000_IG04

G_M000_IG08:                ;; offset=0x005C
            mov     x0, x3

G_M000_IG09:                ;; offset=0x0060
            ldp     fp, lr, [sp], #0x30
            ret     lr

; Total bytes of code 104

However the C++ version is a little more cunning:

0000000000000ba0 <_Z15partitionScalarPjS_S_j>:
 ba0:	b9400007 	ldr	w7, [x0]
 ba4:	aa0003e6 	mov	x6, x0
 ba8:	2a0303e8 	mov	w8, w3
 bac:	34000283 	cbz	w3, bfc <_Z15partitionScalarPjS_S_j+0x5c>
 bb0:	2a0703e4 	mov	w4, w7
 bb4:	d2800000 	mov	x0, #0x0                   	// #0
 bb8:	d2800005 	mov	x5, #0x0                   	// #0
 bbc:	d2800003 	mov	x3, #0x0                   	// #0
 bc0:	14000007 	b	bdc <_Z15partitionScalarPjS_S_j+0x3c>

# "Then" case
 bc4:	91000463 	add	x3, x3, #0x1
 bc8:	b8257824 	str	w4, [x1, x5, lsl #2]
 bcc:	910004a5 	add	x5, x5, #0x1
 bd0:	eb08007f 	cmp	x3, x8
 bd4:	54000120 	b.eq	bf8 <_Z15partitionScalarPjS_S_j+0x58>  // b.none

# Start of the loop
 bd8:	b86378c4 	ldr	w4, [x6, x3, lsl #2]
 bdc:	6b0400ff 	cmp	w7, w4
 be0:	54ffff28 	b.hi	bc4 <_Z15partitionScalarPjS_S_j+0x24>  // b.pmore

# "Else" case
 be4:	91000463 	add	x3, x3, #0x1
 be8:	b8207844 	str	w4, [x2, x0, lsl #2]
 bec:	91000400 	add	x0, x0, #0x1
 bf0:	eb08007f 	cmp	x3, x8
 bf4:	54ffff21 	b.ne	bd8 <_Z15partitionScalarPjS_S_j+0x38>  // b.any

 bf8:	d65f03c0 	ret
 bfc:	d2800000 	mov	x0, #0x0                   	// #0
 c00:	d65f03c0 	ret

Due to the block ordering there are fewer branches.

In coreclr, there are always two jumps per loop iteration.

In gcc, there are zero to two jumps per loop iteration.

This difference in block order has a huge impact, halving the performance of the coreclr version.
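
To make the branch counts concrete, here is a rough sketch that walks an input and counts the branches that are actually taken under each layout, as read from the two listings above. It is a simplified model, not a measurement: it only looks at the loop body, ignores the one-off entry and exit code, and the names `TakenBranches` and `countTakenBranches` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct TakenBranches
{
    std::uint64_t coreclr; // layout from the C# listing
    std::uint64_t gcc;     // layout from the GCC listing
};

// Models only the loop body. Requires a non-empty input, because the
// pivot is input[0], as in the original routine.
TakenBranches countTakenBranches(const std::vector<unsigned int>& input)
{
    TakenBranches taken{0, 0};
    const unsigned int first = input[0];
    const std::size_t n = input.size();

    for (std::size_t i = 0; i < n; i++)
    {
        const bool less = input[i] < first;
        const bool last = (i + 1 == n);

        // coreclr: "blo IG06" is taken on the less-than path; otherwise
        // IG05 ends with an unconditional "b IG07". One taken branch either way.
        taken.coreclr += 1;
        // coreclr: "cbnz x6, IG04" jumps back on every iteration but the last.
        if (!last) taken.coreclr += 1;

        if (less)
        {
            // gcc: "b.hi" jumps to the "Then" block, which falls through
            // into the loop start unless "b.eq" exits on the last iteration.
            taken.gcc += 1;
            if (last) taken.gcc += 1;
        }
        else
        {
            // gcc: "b.hi" falls through into the "Else" block, whose
            // "b.ne" jumps back on every iteration but the last.
            if (!last) taken.gcc += 1;
        }
    }
    return taken;
}
```

Under this model an input of n elements costs 2n - 1 taken branches in the coreclr layout. The GCC layout costs n - 1 or n + 1, depending on which side the final element lands on.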


dotnet-policy-service · 2024-10-11T16:49:56Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

kunalspathak · 2024-10-11T18:02:06Z

You are right about the code produced when tiering is disabled. However, if we run with tiering and let the JIT collect profile data, we make a better decision and push the cold block out of the loop.

As the diff below shows, the profile tells us that BB07 is less likely, so we push it towards the bottom, which removes the extra unconditional branch.

(Image: diff comparison of FullOpts vs. Tier 1.)

As for the compactness/reuse portion of loop, that is something we can investigate further.

neon-sunset · 2024-10-11T18:48:43Z

when tiering is disabled

Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up.

Thanks

kunalspathak · 2024-10-11T18:55:37Z

@dotnet/jit-contrib

amanasifkhalid · 2024-10-11T19:51:25Z

Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up.

If you have any examples handy, I'd like to take a look. We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon. Our block layout plans are quite reliant on profile data, though the JIT frontend has some tricks to determine which blocks are important in the absence of profile data (for example, see Compiler::optSetBlockWeights, though this phase needs quite a bit of work, too).

neon-sunset · 2024-10-11T20:37:28Z

If you have any examples handy, I'd like to take a look.

Don't have anything on hand right now but will keep an eye out next time I open Ghidra 🙂. Some of them look like this one: #93536 except torn loops or "why does this even have jump threading" cases appear under different conditions now. In any case, thanks for looking into this.

a74nh · 2024-10-14T08:20:23Z

You are right about the code produced when tiering is disabled. However, if we run with tiering and let the JIT collect profile data, we make a better decision and push the cold block out of the loop.

This is good to see - I didn't realise profile data was being used this way.

However, this shouldn't need profile data to optimise.

We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon.

Ok, great. If I see anything else, I'll make sure to raise issues or comment here.

amanasifkhalid · 2025-04-30T17:12:09Z

With a mainline JIT, I'm still getting the same layout with tiering disabled as above. The block layout strategy we've implemented in .NET 10 relies on block weights to model the quality of a particular layout, and we've reworked upstream phases to propagate profile data changes as we modify the flowgraph. This primarily benefits PGO scenarios; for non-PGO scenarios, we still rely on block weight heuristics that prove to be subpar in cases like the above. We want to unify our approach such that we always run profile synthesis early in the frontend, and when we don't have profile data, we use heuristics-based edge likelihoods instead. Then, changes to the flowgraph will always include profile updates, regardless of whether the profile is PGO-based or entirely synthetic. For reference, this is what the current block layout strategy gives us if we turn on edge likelihood heuristics (the BBnums have since changed, so look at the BBids to compare):

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..012)-> BB07(0.232),BB02(0.768) ( cond )                   i LIR IBC
BB02 [0008]  1       BB01                  0.77  77 [012..???)-> BB03(1)                 (always)                   LIR IBC internal
BB03 [0001]  2       BB06,BB02             2.23 223 [012..01E)-> BB05(0.48),BB04(0.52)   ( cond )                   i LIR IBC loophead bwd bwd-target
BB04 [0002]  1       BB03                  1.16 116 [01E..037)-> BB06(1)                 (always)                   i LIR IBC bwd
BB06 [0004]  2       BB04,BB05             2.23 223 [04E..058)-> BB03(0.656),BB07(0.344) ( cond )                   i LIR IBC bwd
BB07 [0009]  2       BB01,BB06             1    100 [058..05A)                           (return)                   i LIR IBC
BB05 [0003]  1       BB03                  1.07 107 [037..04E)-> BB06(1)                 (always)                   i LIR IBC bwd
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The heuristics get the preferred successor of BB03 wrong, but we now get a fallthrough path through the loop. Enabling and iterating upon the likelihood heuristics wouldn't be technically difficult, but we currently lack the infrastructure to see if such changes would be a net improvement: Our microbenchmarks currently only measure PGO scenarios. I don't see us enabling heuristics + synthesis for non-PGO scenarios in .NET 10, but I do plan on pursuing this in the next release.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 11, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 11, 2024

JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Oct 11, 2024

JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 21, 2024

JulieLeeMSFT added this to the 10.0.0 milestone Nov 21, 2024

JulieLeeMSFT assigned amanasifkhalid Nov 21, 2024

amanasifkhalid modified the milestones: 10.0.0, Future Apr 30, 2025
