Conversation

@EgorBo (Member) commented Oct 3, 2023

Adjust the rules for when we can use unaligned stores for merged stores. Also, enable 2xLONG/REF -> SIMD and 2xSIMD -> wider SIMD.

Merging into wider scalar primitives, for naturally aligned primitives (>1B):

Target memory              | Crosses cache-line boundary? | x64* | arm64
Global memory              | Yes                          | 🚫   | 🚫
Global memory              | No                           | ✅   | 🚫
Local memory (not exposed) | Yes                          | ✅   | ✅
Local memory (not exposed) | No                           | ✅   | ✅

Merging into SIMD, for naturally aligned primitives (>1B):

Target memory              | Known alignment  | x64*      | arm64
Global memory              | 1B (aka unknown) | 🚫        | 🚫
Global memory              | 8B (most common) | 🚫        | ✅
Global memory              | 16B (rare**)     | ✅ (AVX+)  | ✅
Local memory (not exposed) | 1B               | ✅        | ✅
Local memory (not exposed) | 8B               | ✅        | ✅
Local memory (not exposed) | 16B              | ✅        | ✅

* both Intel and AMD
** it's very unlikely the JIT can assume 16-byte alignment currently anyhow
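
For illustration, a minimal C# sketch (hypothetical type and method names, not code from this PR) of the scalar pattern the tables above describe: two adjacent, naturally aligned 4-byte stores that the merged-stores phase may combine into a single 8-byte store when the rules above allow it.

```csharp
using System.Runtime.CompilerServices;

struct Point
{
    public int X; // offset 0, naturally (4-byte) aligned
    public int Y; // offset 4, naturally (4-byte) aligned
}

static class ScalarMergeExample
{
    // Two consecutive 4-byte stores to adjacent, naturally aligned fields.
    // Depending on the rules above (local vs. global memory, and whether the
    // combined 8-byte store can cross a cache line), the JIT may merge them
    // into a single 8-byte store.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Reset(ref Point p)
    {
        p.X = 0;
        p.Y = 0;
    }
}
```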

PS: Merged stores are conservatively disabled on LA64 and RISC-V

Per the "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory
  are treated as a pair of single-copy atomic 64-bit writes.

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far, that seems to be the only thing x64 can guarantee us.

Related issues: #76503, #51638.

@ghost ghost assigned EgorBo Oct 3, 2023
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 3, 2023
@ghost commented Oct 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Merge consecutive SIMD stores, e.g. 2x Vector256 into 1x Vector512.
This is safe to do, since we take existing SIMD stores that make no atomicity guarantees and convert them into a bigger SIMD store.
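
As a rough illustration (hypothetical helper, not code from this change), the kind of pattern this covers: two consecutive Vector256 stores to adjacent offsets, which may be recognized as a single Vector512 store on capable hardware.

```csharp
using System.Runtime.Intrinsics;

static class SimdMergeExample
{
    // Two adjacent 32-byte stores. Since the original Vector256 stores make
    // no atomicity promises, merging them into one 64-byte (Vector512) store
    // on capable hardware gives up nothing.
    static unsafe void Clear64Bytes(byte* dst)
    {
        Vector256<byte>.Zero.Store(dst);
        Vector256<byte>.Zero.Store(dst + 32);
    }
}
```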

But I am still trying to build a mental model for the case with "multiple scalar stores -> SIMD store" (we currently don't do it).
I came up with this (only for 64-bit, for simplicity):

Target memory | Known alignment  | x64 | arm64
Heap          | 1B (aka unknown) | 🚫  | 🚫
Heap          | 8B               | 🚫  | ✅
Stack         | 1B               | ✅* | ✅*
Stack         | 8B               | ✅* | ✅*
Unmanaged     | 1B               |     |
Unmanaged     | 8B               |     |

* only if the target (e.g. a struct) is known not to contain GC handles

So far, it seems that x86/AMD64 doesn't officially offer any kind of atomicity guarantee here (even per component).
At the same time, per the "Arm Architecture Reference Manual":

* Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory
  are treated as a pair of single-copy atomic 64-bit writes.
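
To illustrate why that matters here, a minimal sketch (hypothetical helper, assuming 'dst' is 8-byte aligned): two naturally aligned long stores expressed as one 128-bit SIMD store, which the quoted rule treats as a pair of single-copy atomic 64-bit writes.

```csharp
using System.Runtime.Intrinsics;

static class Arm64PairStoreExample
{
    // Two naturally aligned 8-byte values written with one 128-bit SIMD store.
    // If 'dst' is 64-bit aligned, arm64 treats this as a pair of single-copy
    // atomic 64-bit writes, so merging two scalar long stores into this form
    // keeps each element's write atomic.
    static unsafe void StorePair(long* dst, long a, long b)
    {
        Vector128.Create(a, b).Store(dst);
    }
}
```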

Related issues: #76503, #51638.

Author: EgorBo
Assignees: EgorBo
Labels: area-CodeGen-coreclr
Milestone: -

@EgorBo EgorBo force-pushed the merge-stores-simd branch from 4250c12 to 1a31d1a on October 3, 2023 15:46
@tannergooding (Member) commented:

@tannergooding said that x64 with AVX promises atomicity for 16B stores to 16B-aligned data - so far, that seems to be the only thing x64 can guarantee us.

Note: this is from "9.1.1 Guaranteed Atomic Operations" in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide.

@tannergooding (Member) commented:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte.
• Reading or writing a word aligned on a 16-bit boundary.
• Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary.
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee
that the 16-byte memory operations performed by the following instructions will always be carried out atomically:
• MOVAPD, MOVAPS, and MOVDQA.
• VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
• VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking
disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be
atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and
P6 family processors provide bus control signals that permit external memory subsystems to make split accesses
atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be
avoided.

Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
4.10.4.4), such page faults may occur even if all accesses are to the same page.
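
For reference, a minimal C# sketch (hypothetical helper, not part of this PR) of the kind of aligned 16-byte store the quoted guarantee covers; it assumes 'dst' is 16-byte aligned and that the processor enumerates AVX.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AvxAlignedStoreExample
{
    // Emits an aligned 16-byte store (MOVDQA / VMOVDQA). Per the quoted text,
    // this store is atomic on AVX-capable processors when 'dst' is 16-byte
    // aligned; an unaligned address makes the instruction fault instead.
    static unsafe void StoreAligned16(byte* dst, Vector128<byte> value)
    {
        if (Avx.IsSupported)
        {
            Sse2.StoreAligned(dst, value);
        }
    }
}
```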

EgorBo and others added 2 commits October 3, 2023 23:24
@EgorBo EgorBo marked this pull request as ready for review October 3, 2023 21:54
@EgorBo EgorBo changed the title from "JIT: Merge SIMD stores into wider SIMDs" to "Merged stores: Fix alignment-related issues and enable SIMD where possible" on Oct 4, 2023
@EgorBo EgorBo mentioned this pull request Oct 4, 2023
@EgorBo (Member, Author) commented Oct 4, 2023

@jakobbotsch @dotnet/jit-contrib PTAL, Diffs (a regression is expected because this made the whole #92852 algorithm more conservative, but the initial diffs were -400kb, so most wins are expected to remain; obviously, most base addresses are TYP_REF, like Jakob predicted).

Wins on ARM64 are due to better SIMD guarantees.

@EgorBo EgorBo requested a review from jakobbotsch October 4, 2023 13:11
@EgorBo (Member, Author) commented Oct 5, 2023

Improved Diffs on arm64

@kunalspathak (Contributor) commented:

Improved Diffs on arm64

seems there are more regressions on linux/windows x64. Do we know why?

[image]

@EgorBo (Member, Author) commented Oct 5, 2023

seems there are more regressions on linux/windows x64. Do we know why?

These are reverted improvements from #92852; they turned out not to be legal (but fortunately, most of the improvements remained).

@EgorBo (Member, Author) commented Oct 5, 2023

The x86 SPMI jobs failed with timeout / "no space left" errors; I'll check the other runs.

@EgorBo EgorBo merged commit ce655e3 into dotnet:main Oct 5, 2023
@EgorBo EgorBo deleted the merge-stores-simd branch October 5, 2023 19:16
@ghost ghost locked as resolved and limited conversation to collaborators Nov 5, 2023