Safe vectorization is a lot slower than unsafe #111309
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBot -amd -arm

```csharp
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

using BenchmarkDotNet.Attributes;

public class Bench
{
    private static int[] Testdata = new int[10000];

    [Benchmark]
    public int Sum_safe()
    {
        ReadOnlySpan<int> span = Testdata;
        Vector128<int> vSum = default;
        int i;
        for (i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            vSum += Vector128.Create(span[i..]);
        int sum = Vector128.Sum(vSum);
        for (; i < span.Length; i++)
            sum += span[i];
        return sum;
    }

    [Benchmark(Baseline = true)]
    public int Sum_unsafe()
    {
        ReadOnlySpan<int> span = Testdata;
        Vector128<int> vSum = default;
        ref int r = ref MemoryMarshal.GetReference(span);
        int i;
        for (i = 0; i < span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            vSum += Vector128.LoadUnsafe(ref r, (nuint)i);
        int sum = Vector128.Sum(vSum);
        for (; i < span.Length; i++)
            sum += span[i];
        return sum;
    }
}
```
There could also be a few helper methods provided by CoreLib as a middle ground for simpler vectorized loop implementations. For example, Rust exposes https://doc.rust-lang.org/std/primitive.slice.html#method.as_simd_mut, which returns a tuple of a scalar prefix, an aligned SIMD-typed middle section, and a scalar suffix. It's not ideal if you can use masking/overlapping or other alternatives for handling short inputs and/or remainders, but it's quite nice when you know you just want to make the "long input" case go fast.
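A sketch of what such a helper could look like in C# (the name `AsVector128s` and the tuple shape are hypothetical, not an existing or proposed API; unlike Rust's `as_simd`, no aligned prefix is attempted, since the GC can relocate the backing memory anyway):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class SpanSimdExtensions
{
    // Split a span into a Vector128-typed middle and a scalar tail.
    public static (Span<Vector128<T>> Vectors, Span<T> Remainder) AsVector128s<T>(this Span<T> span)
        where T : unmanaged
    {
        Span<Vector128<T>> vectors = MemoryMarshal.Cast<T, Vector128<T>>(span);
        Span<T> remainder = span.Slice(vectors.Length * Vector128<T>.Count);
        return (vectors, remainder);
    }
}
```

Note that, as discussed further down the thread, the resulting `Span<Vector128<T>>` is not guaranteed to be aligned, so loads and stores through it would still need unaligned semantics.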
cc @AndyAyersMS, this is the issue I was talking about.
I'm not sure I'd call the pattern above necessarily representative of what the BCL would write, so I don't think it's a pattern that would allow us to move off of the unsafe logic. Rather, it's more the baseline pattern that we'd want to support for allowing some code to move off. Which is generally that this shape of loop:

```csharp
for (int i = 0; i < (span.Length - constant); i += constant)
```

should allow bounds checks to be elided for:

```csharp
if (span.Length < constant)
{
    ArgumentOutOfRangeException.Throw(...);
}
```

More generally, the patterns that need to be supported are a bit more expansive and also include things like manual unrolling or tracking a remainder:

```csharp
int remainder = span.Length;

while (remainder >= (Vector128<T>.Count * 4))
{
    vector1 = Vector128.Create(span);
    vector2 = Vector128.Create(span[Vector128<T>.Count..]);
    vector3 = Vector128.Create(span[(Vector128<T>.Count * 2)..]);
    vector4 = Vector128.Create(span[(Vector128<T>.Count * 3)..]);

    remainder -= Vector128<T>.Count * 4;
}
```

scenarios like adjusting the baseline to handle remaining elements in a single vector operation:

```csharp
if (span.Length > constant)
{
    int i;

    for (i = 0; i < (span.Length - constant); i += constant)
    {
        M(Vector128.Create(span[i..]));
    }

    i = span.Length - constant;
    M(Vector128.Create(span[i..]));
}
```

and a more complex example of the previous pattern, done alongside unrolling:

```csharp
nuint endIndex = remainder;
remainder = (remainder + (uint)(Vector128<T>.Count - 1)) & (nuint)(-Vector128<T>.Count);

switch (remainder / (uint)Vector128<T>.Count)
{
    case 4:
    {
        Vector128<T> vector = M(Vector128.Create(x[(remainder - (uint)(Vector128<T>.Count * 4))..]));
        vector.CopyTo(d[(remainder - (uint)(Vector128<T>.Count * 4))..]);
        goto case 3;
    }

    case 3: // ...
}
```

and so on; the general goal is getting a better understanding of the valid ranges given a condition that was already checked higher up, thus allowing bounds check elimination for safe arithmetic.
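As a concrete illustration of the "single overlapping vector for the tail" shape above, here is a hedged, self-contained sketch (my own example, not code from the thread) using an idempotent reduction, where re-processing overlapped elements is harmless:

```csharp
using System;
using System.Runtime.Intrinsics;

static class SafeMax
{
    // Max over a non-empty span; the tail is handled by one overlapping vector,
    // which is harmless because re-processing elements doesn't change a max.
    public static int Max(ReadOnlySpan<int> span)
    {
        if (span.Length < Vector128<int>.Count)
        {
            int m = span[0];
            for (int k = 1; k < span.Length; k++)
                m = Math.Max(m, span[k]);
            return m;
        }

        Vector128<int> vMax = Vector128.Create(span);
        int i = Vector128<int>.Count;
        for (; i <= span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            vMax = Vector128.Max(vMax, Vector128.Create(span[i..]));

        // One final, possibly overlapping vector covers the remainder.
        vMax = Vector128.Max(vMax, Vector128.Create(span[^Vector128<int>.Count..]));

        int max = vMax.GetElement(0);
        for (int j = 1; j < Vector128<int>.Count; j++)
            max = Math.Max(max, vMax.GetElement(j));
        return max;
    }
}
```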
Currently it is possible to do the following:

```csharp
// Unsafeness encapsulated
readonly ref struct Vector128Span<T> where T : struct
{
    readonly Span<Vector128<T>> _spanVector; // Unaligned
    internal readonly Span<T> _spanRemainder;

    internal Vector128Span(Span<T> span)
    {
        if (!Vector128<T>.IsSupported) throw new NotSupportedException();
        _spanVector = MemoryMarshal.Cast<T, Vector128<T>>(span);
        _spanRemainder = span.Slice(_spanVector.Length * Vector128<T>.Count);
    }

    internal int VectorLength => _spanVector.Length;

    internal Vector128<T> ReadVector(int index) =>
        Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<Vector128<T>, byte>(ref _spanVector[index]));

    internal void WriteVector(int index, Vector128<T> v) =>
        Unsafe.WriteUnaligned<Vector128<T>>(ref Unsafe.As<Vector128<T>, byte>(ref _spanVector[index]), v);
}

// Safe code
static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);
    if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
        throw new ArgumentException();
    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}
```

The codegen is decent, without extra range checks in the loop.

Disasm:
```asm
; Assembly listing for method ConsoleApp1.SafeVec:Xor(System.Span`1[ubyte],System.Span`1[ubyte]) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 71
; 4 inlinees with PGO data; 10 single block inlinees; 2 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
       push rsi
       push rbx
       sub rsp, 40
G_M000_IG02: ;; offset=0x0006
       mov rax, bword ptr [rcx]
       mov ecx, dword ptr [rcx+0x08]
       mov r8d, ecx
       shr r8d, 4
       mov r10d, r8d
       shl r10d, 4
       cmp r10d, ecx
       ja SHORT G_M000_IG08
       mov r9d, r10d
       add r9, rax
       sub ecx, r10d
       mov r10, bword ptr [rdx]
       mov edx, dword ptr [rdx+0x08]
       mov r11d, edx
       shr r11d, 4
       mov ebx, r11d
       shl ebx, 4
       cmp ebx, edx
       ja SHORT G_M000_IG08
       mov esi, ebx
       add rsi, r10
       sub edx, ebx
       cmp r8d, r11d
       jne SHORT G_M000_IG09
       cmp ecx, edx
       jne SHORT G_M000_IG09
       test r8d, r8d
       jle SHORT G_M000_IG05
G_M000_IG03: ;; offset=0x0058
       xor edx, edx
       align [6 bytes for IG04]
G_M000_IG04: ;; offset=0x0060
       vmovups xmm0, xmmword ptr [rax+rdx]
       vpxor xmm0, xmm0, xmmword ptr [r10+rdx]
       vmovups xmmword ptr [rax+rdx], xmm0
       add rdx, 16
       dec r8d
       jne SHORT G_M000_IG04
G_M000_IG05: ;; offset=0x0079
       xor eax, eax
       test ecx, ecx
       jle SHORT G_M000_IG07
       align [1 bytes for IG06]
G_M000_IG06: ;; offset=0x0080
       mov edx, eax
       movzx r8, byte ptr [rsi+rdx]
       xor byte ptr [r9+rdx], r8b
       inc eax
       cmp eax, ecx
       jl SHORT G_M000_IG06
G_M000_IG07: ;; offset=0x0091
       add rsp, 40
       pop rbx
       pop rsi
       ret
G_M000_IG08: ;; offset=0x0098
       call [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3
G_M000_IG09: ;; offset=0x009F
       mov rcx, 0x7FFB2733C080
       call CORINFO_HELP_NEWSFAST
       mov rbx, rax
       mov rcx, rbx
       call [System.ArgumentException:.ctor():this]
       mov rcx, rbx
       call CORINFO_HELP_THROW
       int3
; Total bytes of code 195
```

However, it only works with relatively simple operations, and it is a bit hacky because internally it keeps an unaligned `Span<Vector128<T>>`. It could be practised in places like `BitArray`.

Side notes:
I would expect the following to produce better codegen, as the length equality is established up front:

```csharp
internal static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    if (bits1.Length != bits2.Length)
        throw new ArgumentException();
    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);
    //if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
    //    throw new ArgumentException();
    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}
```

Apparently not. It's worse:

```asm
; Assembly listing for method ConsoleApp1.SafeVec:Xor(System.Span`1[ubyte],System.Span`1[ubyte]) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 70
; 4 inlinees with PGO data; 8 single block inlinees; 2 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
       push rdi
       push rsi
       push rbp
       push rbx
       sub rsp, 40
G_M000_IG02: ;; offset=0x0008
       mov eax, dword ptr [rcx+0x08]
       mov r8d, dword ptr [rdx+0x08]
       cmp eax, r8d
       jne G_M000_IG12
       mov rcx, bword ptr [rcx]
       mov r10d, eax
       shr r10d, 4
       mov r9d, r10d
       shl r9d, 4
       cmp r9d, eax
       ja G_M000_IG13
       mov r11d, r9d
       add r11, rcx
       sub eax, r9d
       mov rdx, bword ptr [rdx]
       mov r9d, r8d
       shr r9d, 4
       mov ebx, r9d
       shl ebx, 4
       cmp ebx, r8d
       ja G_M000_IG13
       mov esi, ebx
       add rsi, rdx
       sub r8d, ebx
       xor ebx, ebx
       test r10d, r10d
       jle SHORT G_M000_IG06
G_M000_IG03: ;; offset=0x0063
       cmp r10d, r9d
       jg SHORT G_M000_IG10
G_M000_IG04: ;; offset=0x0068
       xor r9d, r9d
       align [5 bytes for IG05]
G_M000_IG05: ;; offset=0x0070
       vmovups xmm0, xmmword ptr [rcx+r9]
       vpxor xmm0, xmm0, xmmword ptr [rdx+r9]
       vmovups xmmword ptr [rcx+r9], xmm0
       add r9, 16
       dec r10d
       jne SHORT G_M000_IG05
G_M000_IG06: ;; offset=0x008B
       xor ecx, ecx
       test eax, eax
       jle SHORT G_M000_IG09
G_M000_IG07: ;; offset=0x0091
       cmp eax, r8d
       jg SHORT G_M000_IG11
       align [10 bytes for IG08]
G_M000_IG08: ;; offset=0x00A0
       mov r8d, ecx
       movzx rdx, byte ptr [rsi+r8]
       xor byte ptr [r11+r8], dl
       inc ecx
       cmp ecx, eax
       jl SHORT G_M000_IG08
G_M000_IG09: ;; offset=0x00B2
       add rsp, 40
       pop rbx
       pop rbp
       pop rsi
       pop rdi
       ret
G_M000_IG10: ;; offset=0x00BB
       mov edi, ebx
       shl rdi, 4
       lea rbp, bword ptr [rcx+rdi]
       vmovups xmm0, xmmword ptr [rbp]
       cmp ebx, r9d
       jae SHORT G_M000_IG14
       vpxor xmm0, xmm0, xmmword ptr [rdx+rdi]
       vmovups xmmword ptr [rbp], xmm0
       inc ebx
       cmp ebx, r10d
       jl SHORT G_M000_IG10
       jmp SHORT G_M000_IG06
G_M000_IG11: ;; offset=0x00E2
       mov edx, ecx
       mov r10d, ecx
       movzx r10, byte ptr [r11+r10]
       cmp ecx, r8d
       jae SHORT G_M000_IG14
       mov r9d, ecx
       movzx r9, byte ptr [rsi+r9]
       xor r10d, r9d
       mov byte ptr [r11+rdx], r10b
       inc ecx
       cmp ecx, eax
       jl SHORT G_M000_IG11
       jmp SHORT G_M000_IG09
G_M000_IG12: ;; offset=0x0108
       mov rcx, 0x7FFB2735C080
       call CORINFO_HELP_NEWSFAST
       mov rbx, rax
       mov rcx, rbx
       call [System.ArgumentException:.ctor():this]
       mov rcx, rbx
       call CORINFO_HELP_THROW
       int3
G_M000_IG13: ;; offset=0x012C
       call [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3
G_M000_IG14: ;; offset=0x0133
       call CORINFO_HELP_RNGCHKFAIL
       int3
; Total bytes of code 313
```
This is dangerous for many reasons and is generally not something you'd want to do. I'd likely argue it is more unsafe than the standard unsafe pattern shown above. The general point of this issue is that we have some simple and clear patterns for unsafely loading/storing the data today, but ideally users should just be able to pass in a `Span<T>`/`ReadOnlySpan<T>` and get the same codegen without any unsafe code.
If you don't like the unaligned `Span<Vector128<T>>`:

```csharp
[StructLayout(LayoutKind.Sequential, Size = 16)]
struct Vector128Unaligned { }

readonly Span<Vector128Unaligned> _spanVector;

_spanVector = MemoryMarshal.Cast<T, Vector128Unaligned>(span);
```
That is also unsafe/dangerous. It is strictly "worse" and more error prone than the existing unsafe pattern. If users are using unsafe, they should just use the existing idiomatic patterns and APIs, which do the right thing already. Such APIs are simpler, more readable, less likely to hit various bugs/edge cases, will produce far better codegen, and are safer than things like `MemoryMarshal.Cast` to a fieldless placeholder struct. This issue, however, is about being able to use safe idiomatic patterns to achieve the same effect so that unsafe code isn't required at all.
Why do you think it's more unsafe? There is no alignment issue anymore.
Alignment issues still exist (nothing in the code you gave guarantees alignment). The backing data can still be relocated by the GC, changing any opportunistically checked alignment from under you. You can also more easily lose track of the "remaining data".

--

Platforms aren't expecting this pattern and so aren't always going to handle or optimize it well. It also does a number of additional operations, leading to subpar codegen compared to the more idiomatic unsafe pattern.

--

The explicit unaligned reads/writes can likewise pessimize codegen on platforms where alignment matters.

The list could go on for quite a while around the general problems this type of pattern represents and how it can confuse or cause problems for the code. The existing unsafe patterns are fairly simple and intuitive. They also produce the "best" codegen and are most easily extensible to other scenarios. They are the right thing to use if you're going to use any unsafe code.

But the point of this issue is to enable the safe pattern so that users don't have to write any unsafe code. They just write the safe code instead and get the same "best" codegen, with the added benefit of it being safe and so not prone to memory safety errors.
Please read my code more carefully.

Code with Vector128Unaligned:

```csharp
// Unsafeness encapsulated
readonly ref struct Vector128Span<T> where T : struct
{
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    struct Vector128Unaligned { }

    readonly Span<Vector128Unaligned> _spanVector;
    internal readonly Span<T> _spanRemainder;

    internal Vector128Span(Span<T> span)
    {
        if (!Vector128<T>.IsSupported) throw new NotSupportedException();
        _spanVector = MemoryMarshal.Cast<T, Vector128Unaligned>(span);
        _spanRemainder = span.Slice(_spanVector.Length * Vector128<T>.Count);
    }

    internal int VectorLength => _spanVector.Length;

    internal Vector128<T> ReadVector(int index) =>
        Unsafe.ReadUnaligned<Vector128<T>>(ref Unsafe.As<Vector128Unaligned, byte>(ref _spanVector[index]));

    internal void WriteVector(int index, Vector128<T> v) =>
        Unsafe.WriteUnaligned<Vector128<T>>(ref Unsafe.As<Vector128Unaligned, byte>(ref _spanVector[index]), v);
}

// Safe code
static void Xor(Span<byte> bits1, Span<byte> bits2)
{
    Vector128Span<byte> vs1 = new(bits1), vs2 = new(bits2);
    if (vs1.VectorLength != vs2.VectorLength || vs1._spanRemainder.Length != vs2._spanRemainder.Length)
        throw new ArgumentException();
    for (int i = 0; i < vs1.VectorLength; i++)
    {
        vs1.WriteVector(i, vs1.ReadVector(i) ^ vs2.ReadVector(i));
    }
    for (int i = 0; i < vs1._spanRemainder.Length; i++)
    {
        vs1._spanRemainder[i] = (byte)(vs1._spanRemainder[i] ^ vs2._spanRemainder[i]);
    }
}
```
At least `Vector128Unaligned` itself carries no alignment requirement, so the alignment concern doesn't apply to it.
Apparently some of those who worked on https://github.com/dotnet/runtime/blob/dbb3f759798208ca7463059e0c87c0f45704b62f/src/libraries/System.Collections/src/System/Collections/BitArray.cs before me didn't feel so; they used unsafe code there unnecessarily.
My point is to encapsulate the unsafeness in one place, so that elsewhere users can write safe code.
I did, and I know your particular sample handles it. However, I'm referring to the usage of the pattern more broadly.
This isn't the problematic part. It's the general reinterpretation of the underlying memory as a different type that is.
This doesn't really "encapsulate" it; it just moves some of it and leaves other pitfalls still wide open. It is not a pattern that I would recommend anyone use for their SIMD code, particularly as it will tend to have poor performance on various platforms, runtimes, or scenarios.
There is quite a bit of code that exists historically that is not following best practices and which is being improved. Most of this will eventually be moved to one of the patterns I called out above, where it slices the span as it iterates and avoids any unsafe code.

--

In general we are trying to reduce the amount of unsafe/dangerous code, not increase it, not replace it with alternative unsafe code, or other suboptimal patterns. In the interim, the looping logic is likely to be isolated (where feasible) using a pattern similar to how the existing vectorized helpers in the BCL structure their loops.
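A hedged sketch of that "slice the span as it iterates" shape (illustrative only, not the BCL's actual helper):

```csharp
using System;
using System.Runtime.Intrinsics;

static class SliceLoop
{
    // XORs src into dst (assumes equal lengths); the spans themselves track
    // the remaining work, so no index arithmetic leaves the guarded pattern.
    public static void XorInPlace(Span<byte> dst, ReadOnlySpan<byte> src)
    {
        while (dst.Length >= Vector128<byte>.Count && src.Length >= Vector128<byte>.Count)
        {
            (Vector128.Create(dst) ^ Vector128.Create(src)).CopyTo(dst);
            dst = dst.Slice(Vector128<byte>.Count);
            src = src.Slice(Vector128<byte>.Count);
        }

        for (int i = 0; i < dst.Length && i < src.Length; i++)
            dst[i] ^= src[i];
    }
}
```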
If `Unsafe.ReadUnaligned` is a concern, `ReadVector` can use `Vector128.LoadUnsafe` instead:

```csharp
Vector128<T> ReadVector(int index) =>
    Vector128.LoadUnsafe(ref Unsafe.As<Vector128Unaligned, T>(ref _spanVector[index]));
```
At least one is very unlikely to write code with memory safety problems given this encapsulation, as long as they don't use unsafe code themselves. It doesn't solve all problems, but at least a developer can write simple vectorized code without worrying about memory safety issues.
Yes, eventually, but I see the milestone of this issue is set to Future. Maybe the codegen gets enough improvements this year, maybe next year, maybe 2027, maybe 2028... On the other hand, the technique I outlined can make simple vectorized code safer now.
Patterns like widening each bit into a full element are less obvious, though. How would something like that be vectorized?
This is still generally dangerous/problematic. There are lots of special rules that don't always work the way you expect for structs that don't have clearly defined fields. On many platforms it will "mostly" work how you expect, but on others or for various edge cases it won't. In general it's not something users should be doing, especially when there are existing clear/safe patterns that work instead.
You've not removed the memory safety issues; they still exist and in many cases are just different or more subtle. The code is "more dangerous" with the pattern you've defined, and there are many subtle ways it could break. A general abstraction that preserved the safety guarantees would have to look quite different.
It's not safer, and it isn't something I would recommend anyone do. There are variations that would be acceptable and could encapsulate things in a safer way, but none of them involve reinterpreting spans this way.
There are multiple ways to do this. The simplest is to take in a scalar at a time, broadcast it to the vector, then do a variable shift and mask. Finally, return or store the result. You'd generally use the same helper as you would for widening elements.
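A hedged sketch of my reading of that description (not code from the thread): broadcast one scalar across the vector, select a different bit per lane, and normalize. The cross-platform `Vector128` surface has no per-lane variable shift, so this sketch substitutes a per-lane mask and compare; on x64, `Avx2.ShiftRightLogicalVariable` could implement the literal shift step instead:

```csharp
using System.Runtime.Intrinsics;

static class BitWidening
{
    // Widens the low 4 bits of 'bits' into 4 int lanes: lane i is 1 if bit i is set, else 0.
    public static Vector128<int> LowBitsToLanes(int bits)
    {
        Vector128<int> broadcast = Vector128.Create(bits);    // same scalar in every lane
        Vector128<int> masks = Vector128.Create(1, 2, 4, 8);  // a distinct bit per lane
        Vector128<int> selected = broadcast & masks;          // isolate lane i's bit
        // Equals yields an all-ones lane where the bit was set; AND with One normalizes to 1.
        return Vector128.Equals(selected, masks) & Vector128<int>.One;
    }
}
```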
If you don't like structs without fields, it can just be a usual struct with 16 fields of `byte`. You have been saying how dangerous `MemoryMarshal.Cast` is, yet it is a public API that exists to be used.
It's not about liking them or not. It's about these types of reinterpretations being generally unsafe and potentially not having well-defined behavior.
The way it's being used in your sample is dangerous. Not all usages of it are dangerous; reinterpretation between same-sized primitives is fine, for example. However, such safe cases are also scenarios where it'd be likely to get an explicit API. Some API dealing with vectors over spans may be exposed eventually instead.
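For instance (an illustrative example of a safe use, not taken from the thread), a same-size primitive reinterpretation has fully defined results:

```csharp
using System;
using System.Runtime.InteropServices;

Span<int> ints = stackalloc int[] { 1, -1, 2, -2 };
// int and uint have identical size and alignment, and every bit pattern
// is valid for both, so this reinterpretation is total and well-defined.
Span<uint> uints = MemoryMarshal.Cast<int, uint>(ints);
Console.WriteLine(uints[1]); // 4294967295
```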
Currently, when we do code vectorization in C#, we switch to raw pointers (either managed or unmanaged) and unsafe APIs, resulting in a lack of safety/bounds checks. A good example is a function that calculates a sum using SIMD; today we would write it like the `Sum_unsafe` method in the benchmark above.

It would be nice to move away from this practice towards fully safe APIs, as in `Sum_safe`. Unfortunately, the JIT is not able to eliminate the safety checks produced by `Slice`, despite those being redundant in this case. This makes the safer version up to 2x slower; see EgorBot/runtime-utils#226 for the benchmark results and current codegen.

There are other loop shapes typically used with SIMD.