Safe vectorization is a lot slower than unsafe #111309

Open
Tracked by #94941 ...
EgorBo opened this issue Jan 11, 2025 · 17 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), reduce-unsafe
Milestone: Future

Comments

@EgorBo
Member

EgorBo commented Jan 11, 2025

Currently, when we vectorize code in C#, we switch to raw pointers (either managed or unmanaged) and unsafe APIs, giving up safety/bounds checks. A good example is a function that calculates a sum using SIMD; today we would write it like this:

ref int r = ref MemoryMarshal.GetReference(span);
for (int i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
{
    vSum += Vector128.LoadUnsafe(ref r, (nuint)i);
}

It would be nice to move away from this practice towards fully safe APIs:

-ref int r = ref MemoryMarshal.GetReference(span);
for (int i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
{
-    vSum += Vector128.LoadUnsafe(ref r, (nuint)i);
+    vSum += Vector128.Create(span[i..]);
}

Unfortunately, the JIT is not able to eliminate the safety checks produced by Slice, despite those being redundant in this case. This makes the safer version up to 2x slower; see EgorBot/runtime-utils#226

Full C# implementation; we want it to have zero redundant bounds checks:

public static int Sum(ReadOnlySpan<int> span)
{
    Vector128<int> vSum = default;

    // Main loop
    int i;
    for (i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
        vSum += Vector128.Create(span[i..]);

    // Horizontal sum
    int sum = Vector128.Sum(vSum);

    // Trailing elements
    for (; i < span.Length; i++)
        sum += span[i];

    return sum;
}
Current codegen
; Method Bench:Sum(System.ReadOnlySpan`1[int]):int (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 32

G_M000_IG02:                ;; offset=0x0005
       vxorps   xmm0, xmm0, xmm0
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       lea      r8d, [rdx-0x04]
       test     r8d, r8d
       jle      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=0x0017
       mov      r10d, edx
       align    [6 bytes for IG04]

G_M000_IG04:                ;; offset=0x0020
       mov      r9d, edx
       sub      r9d, eax
       mov      r11d, eax
       mov      ebx, r9d
       add      rbx, r11
       cmp      rbx, r10
       ja       SHORT G_M000_IG11
       mov      rbx, bword ptr [rcx]
       lea      r11, bword ptr [rbx+4*r11]
       cmp      r9d, 4
       jl       SHORT G_M000_IG10
       vpaddd   xmm0, xmm0, xmmword ptr [r11]
       add      eax, 4
       cmp      eax, r8d
       jl       SHORT G_M000_IG04

G_M000_IG05:                ;; offset=0x004E
       vpsrldq  xmm1, xmm0, 8
       vpaddd   xmm0, xmm1, xmm0
       vpsrldq  xmm1, xmm0, 4
       vpaddd   xmm0, xmm1, xmm0
       vmovd    r8d, xmm0
       cmp      eax, edx
       jge      SHORT G_M000_IG08

G_M000_IG06:                ;; offset=0x0069
       mov      rbx, bword ptr [rcx]
       align    [0 bytes for IG07]

G_M000_IG07:                ;; offset=0x006C
       cmp      eax, edx
       jae      SHORT G_M000_IG12
       mov      ecx, eax
       add      r8d, dword ptr [rbx+4*rcx]
       inc      eax
       cmp      eax, edx
       jl       SHORT G_M000_IG07

G_M000_IG08:                ;; offset=0x007C
       mov      eax, r8d

G_M000_IG09:                ;; offset=0x007F
       add      rsp, 32
       pop      rbx
       ret      

G_M000_IG10:                ;; offset=0x0085
       mov      ecx, 6
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
       int3     

G_M000_IG11:                ;; offset=0x0091
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     

G_M000_IG12:                ;; offset=0x0098
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code: 158

There are other loop shapes typically used with SIMD.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 11, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jan 11, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member Author

EgorBo commented Jan 11, 2025

@EgorBot -amd -arm

using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

public class Bench
{
    private static int[] Testdata = new int[10000];

    [Benchmark]
    public int Sum_safe()
    {
        ReadOnlySpan<int> span = Testdata;

        Vector128<int> vSum = default;
        int i;
        for (i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            vSum += Vector128.Create(span[i..]);
        int sum = Vector128.Sum(vSum);
        for (; i < span.Length; i++)
            sum += span[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public int Sum_unsafe()
    {
        ReadOnlySpan<int> span = Testdata;

        Vector128<int> vSum = default;
        ref int r = ref MemoryMarshal.GetReference(span);
        int i;
        for (i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            vSum += Vector128.LoadUnsafe(ref r, (nuint)i);
        int sum = Vector128.Sum(vSum);
        for (; i < span.Length; i++)
            sum += span[i];
        return sum;
    }
}

@EgorBo EgorBo added this to the Future milestone Jan 11, 2025
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Jan 11, 2025
@neon-sunset
Contributor

neon-sunset commented Jan 11, 2025

There could also be a few helper methods provided by CoreLib as a middle ground for simpler vectorized loop implementations. For example, Rust exposes https://doc.rust-lang.org/std/primitive.slice.html#method.as_simd_mut which returns a tuple of scalar prefix, aligned SIMD-typed middle section and scalar suffix. It's not ideal if you can use masking/overlapping or other alternatives for handling short inputs and/or remainders, but it's quite nice when you know you just want to make the "long input" case go fast.
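
A rough C# sketch of what such a helper could look like (AsVector128 is a hypothetical name; unlike Rust's as_simd, nothing here can guarantee alignment of the middle section, since the GC may relocate the backing array, and C# tuples cannot hold spans, so out parameters stand in for Rust's return tuple; it also carries the same unaligned Span<Vector128<T>> caveat discussed further down the thread):

static void AsVector128<T>(Span<T> span,
    out Span<T> prefix, out Span<Vector128<T>> middle, out Span<T> suffix)
    where T : unmanaged
{
    // Without alignment tracking the prefix stays empty; the middle is as many
    // whole vectors as fit, and the suffix holds the scalar remainder.
    prefix = Span<T>.Empty;
    middle = MemoryMarshal.Cast<T, Vector128<T>>(span);
    suffix = span.Slice(middle.Length * Vector128<T>.Count);
}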

@EgorBo
Member Author

EgorBo commented Jan 16, 2025

cc @AndyAyersMS, this is the issue I was talking about

@tannergooding
Member

I'm not sure I'd call the pattern above necessarily representative of what the BCL would write, so I don't think it's a pattern that would allow us to move off of the unsafe logic.

Rather, it's the baseline pattern we'd want to support to allow some code to move off. Generally, this shape of loop (given that Vector128<T>.Count is a constant):

for (int i = 0; i < (span.Length - constant); i += constant)

Should allow bounds checks to be elided for span[i] (inclusive) through span[i + constant] (exclusive), as well as allow the following branch to be elided:

if (span.Length < constant)
{
    ArgumentOutOfRangeException.Throw(...);
}

More generally the patterns that need to be supported are a bit more expansive and also include things like manual unrolling or tracking a remainder:

int remainder = span.Length;

while (remainder >= (Vector128<T>.Count * 4))
{
    vector1 = Vector128.Create(span);
    vector2 = Vector128.Create(span[Vector128<T>.Count..]);
    vector3 = Vector128.Create(span[(Vector128<T>.Count * 2)..]);
    vector4 = Vector128.Create(span[(Vector128<T>.Count * 3)..]);

    span = span[(Vector128<T>.Count * 4)..]; // advance past the four vectors just read
    remainder -= Vector128<T>.Count * 4;
}

There are also scenarios like adjusting the baseline to handle remaining elements in a single (possibly overlapping) vector operation:

if (span.Length >= constant) // >= so a span of exactly 'constant' elements is still handled
{
    int i;
    for (i = 0; i < (span.Length - constant); i += constant)
    {
        M(Vector128.Create(span[i..]));
    }

    i = span.Length - constant;
    M(Vector128.Create(span[i..]));
}

A more complex example of the previous pattern that is done alongside unrolling is:

nuint endIndex = remainder;
remainder = (remainder + (uint)(Vector128<T>.Count - 1)) & (nuint)(-Vector128<T>.Count);
 
switch (remainder / (uint)Vector128<T>.Count)
{
    case 4:
    {
        Vector128<T> vector = M(Vector128.Create(x[remainder - (uint)(Vector128<T>.Count * 4)..]));
        vector.CopyTo(d[remainder - (uint)(Vector128<T>.Count * 4)..]);
        goto case 3;
    }

    case 3: // ...
}

and so on. The general goal is for the JIT to get a better understanding of the valid ranges implied by a condition that has already been checked, thus allowing bounds-check elimination for safe arithmetic. For example, from 0 <= i < span.Length - constant it follows that span[i..] is at least constant elements long, so neither the slice nor the vector load needs a check.

@tfenise
Contributor

tfenise commented Apr 30, 2025

Currently it is possible to do the following:

//Unsafeness encapsulated
readonly ref struct Vector128Span<T> where T : struct
{
    readonly Span<Vector128<T>> _spanVector;//Unaligned
    internal readonly Span<T> _spanRemainder;

    internal Vector128Span(Span<T> span)
    {
        if (!Vector128<T>.IsSupported) throw new NotSupportedException();

        _spanVector = MemoryMarshal.Cast<T, Vector128<T>>(span);
        _spanRemainder = span.Slice(_spanVector.Length * Vector128<T>.Count);
    }

    internal int VectorLength => _spanVector.Length;
    internal Vector128<T> ReadVector(int index) => Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<Vector128<T>, byte>(ref _spanVector[index]));
    internal void WriteVector(int index, Vector128<T> v) => Unsafe.WriteUnaligned<Vector128<T>>(ref Unsafe.As<Vector128<T>, byte>(ref _spanVector[index]), v);
}

//Safe code
static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);

    if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
        throw new ArgumentException();

    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}

The codegen is decent without extra range checks in the loop.

Disasm

; Assembly listing for method ConsoleApp1.SafeVec:Xor(System.Span`1[ubyte],System.Span`1[ubyte]) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 71
; 4 inlinees with PGO data; 10 single block inlinees; 2 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40

G_M000_IG02:                ;; offset=0x0006
       mov      rax, bword ptr [rcx]
       mov      ecx, dword ptr [rcx+0x08]
       mov      r8d, ecx
       shr      r8d, 4
       mov      r10d, r8d
       shl      r10d, 4
       cmp      r10d, ecx
       ja       SHORT G_M000_IG08
       mov      r9d, r10d
       add      r9, rax
       sub      ecx, r10d
       mov      r10, bword ptr [rdx]
       mov      edx, dword ptr [rdx+0x08]
       mov      r11d, edx
       shr      r11d, 4
       mov      ebx, r11d
       shl      ebx, 4
       cmp      ebx, edx
       ja       SHORT G_M000_IG08
       mov      esi, ebx
       add      rsi, r10
       sub      edx, ebx
       cmp      r8d, r11d
       jne      SHORT G_M000_IG09
       cmp      ecx, edx
       jne      SHORT G_M000_IG09
       test     r8d, r8d
       jle      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=0x0058
       xor      edx, edx
       align    [6 bytes for IG04]

G_M000_IG04:                ;; offset=0x0060
       vmovups  xmm0, xmmword ptr [rax+rdx]
       vpxor    xmm0, xmm0, xmmword ptr [r10+rdx]
       vmovups  xmmword ptr [rax+rdx], xmm0
       add      rdx, 16
       dec      r8d
       jne      SHORT G_M000_IG04

G_M000_IG05:                ;; offset=0x0079
       xor      eax, eax
       test     ecx, ecx
       jle      SHORT G_M000_IG07
       align    [1 bytes for IG06]

G_M000_IG06:                ;; offset=0x0080
       mov      edx, eax
       movzx    r8, byte  ptr [rsi+rdx]
       xor      byte  ptr [r9+rdx], r8b
       inc      eax
       cmp      eax, ecx
       jl       SHORT G_M000_IG06

G_M000_IG07:                ;; offset=0x0091
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret

G_M000_IG08:                ;; offset=0x0098
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3

G_M000_IG09:                ;; offset=0x009F
       mov      rcx, 0x7FFB2733C080
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      rcx, rbx
       call     [System.ArgumentException:.ctor():this]
       mov      rcx, rbx
       call     CORINFO_HELP_THROW
       int3

; Total bytes of code 195

However, it only works with relatively simple operations, and it is a bit hacky because internally it keeps an unaligned Span<Vector128<T>>.

It could be used in places like BitArray to make the code look safer.

Side notes

I would expect the following to produce better codegen, as bits1.Length now must be equal to bits2.Length and the arithmetic around the lengths wouldn't need to be duplicated:

internal static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    if (bits1.Length != bits2.Length)
        throw new ArgumentException();

    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);

    //if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
    //    throw new ArgumentException();

    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}

Apparently not. It's worse:

; Assembly listing for method ConsoleApp1.SafeVec:Xor(System.Span`1[ubyte],System.Span`1[ubyte]) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 70
; 4 inlinees with PGO data; 8 single block inlinees; 2 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40

G_M000_IG02:                ;; offset=0x0008
       mov      eax, dword ptr [rcx+0x08]
       mov      r8d, dword ptr [rdx+0x08]
       cmp      eax, r8d
       jne      G_M000_IG12
       mov      rcx, bword ptr [rcx]
       mov      r10d, eax
       shr      r10d, 4
       mov      r9d, r10d
       shl      r9d, 4
       cmp      r9d, eax
       ja       G_M000_IG13
       mov      r11d, r9d
       add      r11, rcx
       sub      eax, r9d
       mov      rdx, bword ptr [rdx]
       mov      r9d, r8d
       shr      r9d, 4
       mov      ebx, r9d
       shl      ebx, 4
       cmp      ebx, r8d
       ja       G_M000_IG13
       mov      esi, ebx
       add      rsi, rdx
       sub      r8d, ebx
       xor      ebx, ebx
       test     r10d, r10d
       jle      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0x0063
       cmp      r10d, r9d
       jg       SHORT G_M000_IG10

G_M000_IG04:                ;; offset=0x0068
       xor      r9d, r9d
       align    [5 bytes for IG05]

G_M000_IG05:                ;; offset=0x0070
       vmovups  xmm0, xmmword ptr [rcx+r9]
       vpxor    xmm0, xmm0, xmmword ptr [rdx+r9]
       vmovups  xmmword ptr [rcx+r9], xmm0
       add      r9, 16
       dec      r10d
       jne      SHORT G_M000_IG05

G_M000_IG06:                ;; offset=0x008B
       xor      ecx, ecx
       test     eax, eax
       jle      SHORT G_M000_IG09

G_M000_IG07:                ;; offset=0x0091
       cmp      eax, r8d
       jg       SHORT G_M000_IG11
       align    [10 bytes for IG08]

G_M000_IG08:                ;; offset=0x00A0
       mov      r8d, ecx
       movzx    rdx, byte  ptr [rsi+r8]
       xor      byte  ptr [r11+r8], dl
       inc      ecx
       cmp      ecx, eax
       jl       SHORT G_M000_IG08

G_M000_IG09:                ;; offset=0x00B2
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret

G_M000_IG10:                ;; offset=0x00BB
       mov      edi, ebx
       shl      rdi, 4
       lea      rbp, bword ptr [rcx+rdi]
       vmovups  xmm0, xmmword ptr [rbp]
       cmp      ebx, r9d
       jae      SHORT G_M000_IG14
       vpxor    xmm0, xmm0, xmmword ptr [rdx+rdi]
       vmovups  xmmword ptr [rbp], xmm0
       inc      ebx
       cmp      ebx, r10d
       jl       SHORT G_M000_IG10
       jmp      SHORT G_M000_IG06

G_M000_IG11:                ;; offset=0x00E2
       mov      edx, ecx
       mov      r10d, ecx
       movzx    r10, byte  ptr [r11+r10]
       cmp      ecx, r8d
       jae      SHORT G_M000_IG14
       mov      r9d, ecx
       movzx    r9, byte  ptr [rsi+r9]
       xor      r10d, r9d
       mov      byte  ptr [r11+rdx], r10b
       inc      ecx
       cmp      ecx, eax
       jl       SHORT G_M000_IG11
       jmp      SHORT G_M000_IG09

G_M000_IG12:                ;; offset=0x0108
       mov      rcx, 0x7FFB2735C080
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax
       mov      rcx, rbx
       call     [System.ArgumentException:.ctor():this]
       mov      rcx, rbx
       call     CORINFO_HELP_THROW
       int3

G_M000_IG13:                ;; offset=0x012C
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3

G_M000_IG14:                ;; offset=0x0133
       call     CORINFO_HELP_RNGCHKFAIL
       int3

; Total bytes of code 313

@tannergooding
Member

_spanVector = MemoryMarshal.Cast<T, Vector128<T>>(span);

This is dangerous for many reasons and is generally not something you'd want to do.

I'd likely argue it is more unsafe than the standard Vector128.LoadUnsafe(ref address, nuint index) and Vector128.StoreUnsafe(ref address, nuint index) pattern, all while generating more complex and likely less optimal code.

The general point of this issue is that we have some simple and clear patterns for unsafely loading/storing the data today. But ideally users should just be able to pass in a Span<T> to get a Vector128<T>, slice it, and iterate along it until the end; that way it is safe and performant.
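
Concretely, that safe shape is something like the following sketch, which is exactly the form whose bounds checks the JIT cannot yet eliminate (scalar tail omitted):

static Vector128<int> SumCore(ReadOnlySpan<int> span)
{
    Vector128<int> vSum = default;
    while (span.Length >= Vector128<int>.Count)
    {
        vSum += Vector128.Create(span);       // reads the first Count elements
        span = span[Vector128<int>.Count..];  // slice forward, staying in safe code
    }
    return vSum; // fewer than Count elements remain for the scalar tail
}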

@tfenise
Contributor

tfenise commented Apr 30, 2025

If you don't like _spanVector = MemoryMarshal.Cast<T, Vector128<T>>(span);, then

[StructLayout(LayoutKind.Sequential, Size = 16)]
struct Vector128Unaligned { }

readonly Span<Vector128Unaligned> _spanVector;

_spanVector = MemoryMarshal.Cast<T, Vector128Unaligned>(span);

@tannergooding
Member

tannergooding commented Apr 30, 2025

If you don't like _spanVector = MemoryMarshal.Cast<T, Vector128<T>>(span);, then

That is also unsafe/dangerous. It is strictly "worse" and more error prone than the existing unsafe Vector128.LoadUnsafe(ref address, nuint index) pattern.

If users are using unsafe, they should just use the existing idiomatic patterns and APIs, which do the right thing already. Such APIs are simpler, more readable, less likely to hit various bugs/edge cases, will produce far better codegen, and are safer than things like MemoryMarshal.Cast, etc. -- Not all unsafe code is equal, and some unsafe patterns are inherently more dangerous than others.

This issue however is about being able to use safe idiomatic patterns to achieve the same effect so that unsafe code isn't required at all.

@tfenise
Contributor

tfenise commented Apr 30, 2025

Why do you think it's more unsafe? There is no longer an alignment issue. MemoryMarshal.Cast checks IsReferenceOrContainsReferences and computes the length correctly. Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<Vector128Unaligned, byte>(ref _spanVector[index])) is only an unaligned read, and index is checked by the indexer of Span<>, so there is no buffer overflow.

@tannergooding
Member

Alignment issues still exist (nothing in the code you gave guarantees alignment)

The backing data can still be relocated by the GC changing any opportunistically checked alignment from under you

You can more easily lose track of "remaining data" -- MemoryMarshal.Cast<float, Vector128<float>>(new float[6]).Length returns 1 and the remaining 2 trailing elements are silently lost; it would also fail for MemoryMarshal.Cast<float, Vector128<float>>(new float[2]).Length, even though the normal algorithm should process it

Platforms aren't expecting this pattern and so aren't always going to handle or optimize it well. It also performs a number of additional operations, leading to subpar codegen compared to the more idiomatic unsafe pattern. -- The explicit Load APIs are designed to have the semantics necessary to do the right thing efficiently, so they are explicitly tuned and handled to ensure the right codegen occurs in most scenarios.

The list of general problems this type of pattern represents, and of the ways it can confuse or cause problems for the code, could go on for quite a while.

The existing unsafe patterns are fairly simple and intuitive. They also produce the "best" codegen and are the most easily extensible to other scenarios. They are the right thing to use if you're going to use any unsafe code.

But the point of this issue is to enable the safe pattern so that users don't have to write any unsafe code. They just write the safe code instead and get the same "best" codegen with the added benefit of it being safe and so not prone to memory safety errors.

@tfenise
Contributor

tfenise commented Apr 30, 2025

Alignment issues still exist (nothing in the code you gave guarantees alignment)

The backing data can still be relocated by the GC changing any opportunistically checked alignment from under you

You can more easily lose track of "remaining data" -- MemoryMarshal.Cast<float, Vector128<float>>(new float[6]).Length returns 1 and the remaining 2 trailing elements are silently lost; it would also fail for MemoryMarshal.Cast<float, Vector128<float>>(new float[2]).Length, even though the normal algorithm should process it

Please read my code more carefully.

Code with Vector128Unaligned

//Unsafeness encapsulated
readonly ref struct Vector128Span<T> where T : struct
{
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    struct Vector128Unaligned { }

    readonly Span<Vector128Unaligned> _spanVector;
    internal readonly Span<T> _spanRemainder;

    internal Vector128Span(Span<T> span)
    {
        if (!Vector128<T>.IsSupported) throw new NotSupportedException();

        _spanVector = MemoryMarshal.Cast<T, Vector128Unaligned>(span);
        _spanRemainder = span.Slice(_spanVector.Length * Vector128<T>.Count);
    }

    internal int VectorLength => _spanVector.Length;
    internal Vector128<T> ReadVector(int index) => Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<Vector128Unaligned, byte>(ref _spanVector[index]));
    internal void WriteVector(int index, Vector128<T> v) => Unsafe.WriteUnaligned<Vector128<T>>(ref Unsafe.As<Vector128Unaligned, byte>(ref _spanVector[index]), v);
}

//Safe code
static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);

    if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
        throw new ArgumentException();

    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}

Platforms aren't expecting this pattern and so aren't going to always handle or optimize it well. It also does a number of additional operations leading to subpar codegen compared to the more idiomatic unsafe pattern.

At least for (int i = 0; i < someLengthOfSpan; i++) is a well-known pattern and generates no extra range check.

The existing unsafe patterns are fairly simple and intuitive.

Apparently some of those who worked on https://github.com/dotnet/runtime/blob/dbb3f759798208ca7463059e0c87c0f45704b62f/src/libraries/System.Collections/src/System/Collections/BitArray.cs before me didn't feel so. They have unnecessarily used uint in the for loops and written the loop condition in a variety of ways (#33749 3b0ba24), and one has to reason about the possibility of overflow in each of them.

But the point of this issue is to enable the safe pattern so that users don't have to write any unsafe code.

My point is to encapsulate the unsafeness in one place, so that elsewhere users can write safe code.

@tannergooding
Member

tannergooding commented Apr 30, 2025

Please read my code more carefully.

I did and I know your particular sample handles it. However, I'm referring to the usage of MemoryMarshal.Cast and this being one of the major pitfalls with it in general. Developers are much more likely to lose track of data accidentally with such patterns.

At least for (int i = 0; i < someLengthOfSpan; i++) is a well-known pattern and generates no extra range check.

This isn't the problematic part. It's the general reinterpreting of T to Vector128<T> that is problematic, combined with ReadUnaligned and other APIs which have subtly different semantics on some platforms/architectures and which will not always produce the intended IR that allows the correct codegen.

My point is to encapsulate the unsafeness in one place, so that elsewhere users can write safe code.

This doesn't really "encapsulate" it; it just moves some of it and leaves other pitfalls still wide open. It is not a pattern that I would recommend anyone use for their SIMD code, particularly as it will tend to have poor performance on various platforms, runtimes, or scenarios.

Apparently some of those who worked on https://github.com/dotnet/runtime/blob/dbb3f759798208ca7463059e0c87c0f45704b62f/src/libraries/System.Collections/src/System/Collections/BitArray.cs before me didn't feel so. They have unnecessarily used uint in the for loops and written the loop condition in a variety of ways (#33749 3b0ba24), and one has to reason about the possibility of overflow in each of them.

There is quite a bit of code that exists historically that is not following best practices and which is being improved. Most of this will eventually be moved to one of the patterns I called out above, where it slices the span as it iterates and avoids any unsafe code. -- In general we are trying to reduce the amount of unsafe/dangerous code, not increase it, replace it with alternative unsafe code, or introduce other suboptimal patterns.

In the interim, the looping logic is likely to be isolated (where feasible) using a pattern similar to how TensorPrimitives is doing it (using internal generic methods where the generics are constrained to functional interfaces). Such patterns allow easy sharing of the core looping logic, do not require unsafe or remainder handling duplicated in each algorithm, and can use the optimal APIs to ensure good codegen.
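
For reference, that pattern looks roughly like the sketch below; the names (IBinaryOperator, XorOperator, InvokeSpanSpan) are illustrative rather than the actual internal API:

interface IBinaryOperator<T>
{
    static abstract T Invoke(T x, T y);
    static abstract Vector128<T> Invoke(Vector128<T> x, Vector128<T> y);
}

struct XorOperator : IBinaryOperator<byte>
{
    public static byte Invoke(byte x, byte y) => (byte)(x ^ y);
    public static Vector128<byte> Invoke(Vector128<byte> x, Vector128<byte> y) => x ^ y;
}

// The looping logic is written once and shared by every operator; length
// validation (x, y, destination all equal) is omitted for brevity.
static void InvokeSpanSpan<T, TOp>(ReadOnlySpan<T> x, ReadOnlySpan<T> y, Span<T> destination)
    where TOp : IBinaryOperator<T>
{
    int i = 0;
    for (; i <= x.Length - Vector128<T>.Count; i += Vector128<T>.Count)
        TOp.Invoke(Vector128.Create(x[i..]), Vector128.Create(y[i..])).CopyTo(destination[i..]);
    for (; i < x.Length; i++)
        destination[i] = TOp.Invoke(x[i], y[i]);
}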

@tfenise
Contributor

tfenise commented Apr 30, 2025

It's the general reinterpreting of T to Vector128<T> that is problematic, combined with ReadUnaligned and other APIs which have subtly different semantics on some platforms/architectures and which will not always produce the intended IR that allows the correct codegen.

If ReadUnaligned is not a good idea, we may use

Vector128<T> ReadVector(int index) => Vector128.LoadUnsafe(ref Unsafe.As<Vector128Unaligned, T>(ref _spanVector[index]));

This doesn't really "encapsulate" it; it just moves some of it and leaves other pitfalls still wide open.

At least one is very unlikely to write code with memory safety problems with this encapsulation, as long as they don't use unsafe code themselves. It doesn't solve all problems, but at least the developer can write simple vectorized code without worrying about memory safety issues.

Most of this will eventually be moved to one of the patterns I called out above, where it slices the span as it iterates and avoids any unsafe code.

Yes, eventually, but I see the milestone of this issue is set to Future. Maybe the codegen gets enough improvements this year, maybe next year, maybe 2027, maybe 2028... On the other hand, it is possible to use the technique that I outlined to make simple vectorized code safer now.

In the interim, the looping logic is likely to be isolated (where feasible) using a pattern similar to how TensorPrimitives is doing it (using internal generic methods where the generics are constrained to functional interfaces).

Patterns like TensorPrimitives probably help, but they don't seem very flexible. For example, it is not obvious to me how to use the existing patterns in TensorPrimitives to implement BitArray.CopyTo(bool[], ...) where basically every int produces a Vector256<byte>, and I don't want to invent a new pattern just for BitArray.CopyTo(bool[], ...). On the other hand, the technique that I outlined would just work.

@tannergooding
Member

If ReadUnaligned is not a good idea, we may use

This is still generally dangerous/problematic. There are lots of special rules that don't always work the way you expect for structs that don't have clearly defined fields. On many platforms it will "mostly" work how you expect, but on others or for various edge cases it won't. In general it's not something users should be doing, especially when there are existing clear/safe patterns that work instead.

At least one is very unlikely to write code with memory safety problems with this encapsulation, as long as they don't use unsafe code themselves. It doesn't solve all problems, but at least the developer can write simple vectorized code without worrying about memory safety issues.

You've not removed the memory safety issues; they still exist, and in many cases they are just different or more subtle. The code is "more dangerous" with the pattern you've defined and there are many subtle ways it could break.

A general abstraction that preserved the Span<T> and allowed iterating it as Vector128<T>, such as via IEnumerable<T>, without doing MemoryMarshal.Cast would somewhat solve the issue. There's still a lot that is non-ideal about it, but it wouldn't have the same safety problems. The same would be true for something that simply provided an indexed wrapper and validated the bounds in a safer way up front, to help minimize the risk of accessing data it shouldn't.
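
An indexed wrapper of that kind could be as small as this hypothetical sketch; every read goes through the span's own bounds check and no reinterpretation is involved:

readonly ref struct Vector128View<T>
{
    private readonly ReadOnlySpan<T> _span;

    public Vector128View(ReadOnlySpan<T> span) => _span = span;

    public int VectorCount => _span.Length / Vector128<T>.Count;

    // Vector128.Create validates that the slice holds at least Count elements.
    public Vector128<T> this[int index] => Vector128.Create(_span[(index * Vector128<T>.Count)..]);
}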

On the other hand, it is possible to use the technique that I outlined to make simple vectorized code safer now.

It's not safer and it isn't something I would recommend anyone do. There are variations that would be acceptable and could encapsulate things in a safer way but none of them involve MemoryMarshal.Cast or defining sized structs without backing fields.

For example, it is not obvious to me how to use the existing patterns in TensorPrimitives to implement BitArray.CopyTo(bool[], ...) where basically every int produces a Vector256<byte>

There are multiple ways to do this.

The simplest is to take in a scalar at a time, broadcast it to the vector, then do a variable shift and mask, and finally return or store the result. You'd generally use the same helper as you would for widening byte->ulong (that is, a helper that needs 8 times as many bits of output as input).
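
As an illustrative sketch of that idea for the 32-bits-to-32-bools case (using a byte shuffle plus mask rather than per-lane variable shifts; not BitArray's actual implementation):

static Vector256<byte> BitsToBools(int bits)
{
    // Broadcast the scalar, then shuffle so result byte j holds the source
    // byte containing bit j: bytes 0..7 <- source byte 0, 8..15 <- byte 1, ...
    Vector256<byte> bytes = Vector256.Shuffle(
        Vector256.Create(bits).AsByte(),
        Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
                         2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));

    // Isolate each byte's bit (masks 0x01..0x80 repeating), then normalize to
    // exactly 0 or 1, the only valid representations for bool.
    Vector256<byte> mask = Vector256.Create(0x8040201008040201UL).AsByte();
    return Vector256.Min(bytes & mask, Vector256<byte>.One);
}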

@tfenise
Contributor

tfenise commented May 1, 2025

This is still generally dangerous/problematic. There are lots of special rules that don't always work the way you expect for structs that don't have clearly defined fields.

If you don't like structs without fields, it can just be a usual struct with 16 fields of byte, or fixed-size buffer, or InlineArray.

You have been saying how MemoryMarshal.Cast is so dangerous or "more dangerous" or whatever, but any vectorization involving Span<char> (common when processing strings) or Span<bool> must first use MemoryMarshal.Cast to reinterpret them to Span<short> or Span<byte>, or do something equivalent. If you think MemoryMarshal.Cast is even more dangerous than Vector128.LoadUnsafe(ref T address, nuint index), then there is no point ever trying to make any vectorization involving Span<char> or Span<bool> safe, because it must always contain something even more dangerous than Vector128.LoadUnsafe(ref T address, nuint index).

@tannergooding
Member

If you don't like structs without fields, it can just be a usual struct with 16 fields of byte, or fixed-size buffer, or InlineArray.

It's not about liking them or not. It's about these types of reinterpretations being generally unsafe and potentially not having well-defined behavior.

You have been saying how MemoryMarshal.Cast is so dangerous or "more dangerous" or whatever

The way it's being used in your sample is dangerous. Not all usages of it are dangerous.

Reinterpretation of char to ushort or vice-versa is safe. Reinterpretation from bool to byte is generally safe, but the inverse (byte to bool) is potentially dangerous. Reinterpretation of T to a U that is not layout compatible can be problematic.
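
For example, the char case is a safe use of the same API, since the two types are layout-compatible:

ReadOnlySpan<char> text = "some text";
ReadOnlySpan<ushort> widened = MemoryMarshal.Cast<char, ushort>(text);
// widened.Length == text.Length; each element is the same 16 bits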

However, such safe cases are also scenarios where we'd likely want an explicit API such as Vector128<ushort> Vector128.LoadUnsafe(ref char address, nuint index) or some Vector128<ushort> Vector128.Create(ReadOnlySpan<char> span) to make it convenient to go from char to ushort for the operation.

Some API dealing with Vector128<byte> to Span<bool> would explicitly ensure normalization to 0 or 1 as part of its storage process.
