Conversation

@EgorBo (Member) commented Oct 3, 2023

Adjust the rules for when we can use unaligned stores for merged stores. Also, enable 2xLONG/REF -> SIMD and 2xSIMD -> wider SIMD.

Merging into wider scalar primitives, for naturally aligned primitives (>1B):

Target memory              | Crosses cache-line boundary? | x64* | arm64
Global memory              | Yes                          | 🚫   | 🚫
Global memory              | No                           | ✅   | 🚫
Local memory (not exposed) | Yes                          | ✅   | ✅
Local memory (not exposed) | No                           | ✅   | ✅

Merging into SIMD, for naturally aligned primitives (>1B):

Target memory              | Known alignment  | x64*      | arm64
Global memory              | 1B (aka unknown) | 🚫        | 🚫
Global memory              | 8B (most common) | 🚫        | ✅
Global memory              | 16B (rare**)     | ✅ (AVX+)  | ✅
Local memory (not exposed) | 1B               | ✅        | ✅
Local memory (not exposed) | 8B               | ✅        | ✅
Local memory (not exposed) | 16B              | ✅        | ✅

* both Intel and AMD
** it's very unlikely the JIT can assume 16-byte alignment currently anyhow
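
For illustration, a minimal C# sketch (hypothetical type and method names, not code from this PR) of the scalar pattern the tables above describe: two adjacent, naturally aligned 4-byte stores that the merged-stores phase may combine into a single 8-byte store when the rules above allow it.

```csharp
using System.Runtime.CompilerServices;

struct Point
{
    public int X; // offset 0, naturally (4-byte) aligned
    public int Y; // offset 4, naturally (4-byte) aligned
}

static class ScalarMergeExample
{
    // Two consecutive 4-byte stores to adjacent, naturally aligned fields.
    // Depending on the rules above (local vs. global memory, and whether the
    // combined 8-byte store can cross a cache line), the JIT may merge them
    // into a single 8-byte store.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Reset(ref Point p)
    {
        p.X = 0;
        p.Y = 0;
    }
}
```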

PS: Merged stores are conservatively disabled on LA64 and RISC-V

Per the "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory
  are treated as a pair of single-copy atomic 64-bit writes.

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far, that seems to be the only thing x64 can guarantee us.

Related issues: #76503, #51638.

@ghost ghost assigned EgorBo Oct 3, 2023
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 3, 2023
@ghost commented Oct 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Merge consecutive SIMD stores, e.g. 2x Vector256 into 1x Vector512.
This is safe to do, since we take existing SIMD stores that make no atomicity guarantees and convert them into a bigger SIMD store.
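
As a rough illustration (hypothetical helper, not code from this change), the kind of pattern this covers: two consecutive Vector256 stores to adjacent offsets, which may be recognized as a single Vector512 store on capable hardware.

```csharp
using System.Runtime.Intrinsics;

static class SimdMergeExample
{
    // Two adjacent 32-byte stores. Since the original Vector256 stores make
    // no atomicity promises, merging them into one 64-byte (Vector512) store
    // on capable hardware gives up nothing.
    static unsafe void Clear64Bytes(byte* dst)
    {
        Vector256<byte>.Zero.Store(dst);
        Vector256<byte>.Zero.Store(dst + 32);
    }
}
```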

But I am still trying to build a mental model for the case with "multiple scalar stores -> SIMD store" (we currently don't do it).
I came up with this (only for 64-bit, for simplicity):

Target memory | Known alignment  | x64 | arm64
Heap          | 1B (aka unknown) | 🚫  | 🚫
Heap          | 8B               | 🚫  | ✅
Stack         | 1B               | ✅* | ✅*
Stack         | 8B               | ✅* | ✅*
Unmanaged     | 1B               |     |
Unmanaged     | 8B               |     |

* only if the target (e.g. a struct) is known not to contain GC handles

So far, it seems that x86/AMD64 doesn't officially offer any kind of atomicity guarantee here (even per component).
At the same time, per the "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory
  are treated as a pair of single-copy atomic 64-bit writes.
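
To illustrate why that matters here, a minimal sketch (hypothetical helper, assuming 'dst' is 8-byte aligned): two naturally aligned long stores expressed as one 128-bit SIMD store, which the quoted rule treats as a pair of single-copy atomic 64-bit writes.

```csharp
using System.Runtime.Intrinsics;

static class Arm64PairStoreExample
{
    // Two naturally aligned 8-byte values written with one 128-bit SIMD store.
    // If 'dst' is 64-bit aligned, arm64 treats this as a pair of single-copy
    // atomic 64-bit writes, so merging two scalar long stores into this form
    // keeps each element's write atomic.
    static unsafe void StorePair(long* dst, long a, long b)
    {
        Vector128.Create(a, b).Store(dst);
    }
}
```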

Related issues: #76503, #51638.

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -

@EgorBo EgorBo force-pushed the merge-stores-simd branch from 4250c12 to 1a31d1a on October 3, 2023 15:46
@tannergooding (Member) commented:

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far, that seems to be the only thing x64 can guarantee us.

Note: this is from "9.1.1 Guaranteed Atomic Operations" in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide.

@tannergooding (Member) commented:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte.
• Reading or writing a word aligned on a 16-bit boundary.
• Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary.
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee
that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking
disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be
atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and
P6 family processors provide bus control signals that permit external memory subsystems to make split accesses
atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be
avoided.

Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
4.10.4.4), such page faults may occur even if all accesses are to the same page.
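
For reference, a minimal C# sketch (hypothetical helper, not part of this PR) of the kind of aligned 16-byte store the quoted guarantee covers; it assumes 'dst' is 16-byte aligned and that the processor enumerates AVX.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AvxAlignedStoreExample
{
    // Emits an aligned 16-byte store (MOVDQA / VMOVDQA). Per the quoted text,
    // this store is atomic on AVX-capable processors when 'dst' is 16-byte
    // aligned; an unaligned address makes the instruction fault instead.
    static unsafe void StoreAligned16(byte* dst, Vector128<byte> value)
    {
        if (Avx.IsSupported)
        {
            Sse2.StoreAligned(dst, value);
        }
    }
}
```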

EgorBo and others added 2 commits October 3, 2023 23:24
@EgorBo EgorBo marked this pull request as ready for review October 3, 2023 21:54
@EgorBo EgorBo changed the title from "JIT: Merge SIMD stores into wider SIMDs" to "Merged stores: Fix alignment-related issues and enable SIMD where possible" on Oct 4, 2023
@EgorBo EgorBo mentioned this pull request Oct 4, 2023
@EgorBo (Member, Author) commented Oct 4, 2023

@jakobbotsch @dotnet/jit-contrib PTAL, Diffs (a regression is expected because this made the whole #92852 algorithm more conservative, but the initial diffs were -400kb, so most wins are expected to remain; obviously, most base addresses are TYP_REF, like Jakob predicted).

Wins on ARM64 are due to better SIMD guarantees.

@EgorBo EgorBo requested a review from jakobbotsch October 4, 2023 13:11
@EgorBo (Member, Author) commented Oct 5, 2023

Improved Diffs on arm64

@kunalspathak (Contributor) commented:

Improved Diffs on arm64

seems there are more regressions on linux/windows x64. Do we know why?

[image]

@EgorBo (Member, Author) commented Oct 5, 2023

seems there are more regressions on linux/windows x64. Do we know why?

These are reverted improvements from #92852; they turned out not to be legal (but fortunately, most of the improvements remained).

@EgorBo (Member, Author) commented Oct 5, 2023

The x86 SPMI jobs failed with timeout / "no space left" errors; I'll check the other runs.

@EgorBo EgorBo merged commit ce655e3 into dotnet:main Oct 5, 2023
@EgorBo EgorBo deleted the merge-stores-simd branch October 5, 2023 19:16
@ghost ghost locked as resolved and limited conversation to collaborators Nov 5, 2023