Add vectorization to improve CRC32 performance #83321
Conversation
This significantly improves performance for System.IO.Hashing.Crc32 for cases where the source span is 64 bytes or larger on Intel x86/x64 architectures. The change only applies to .NET 7 and later targets of System.IO.Hashing because it uses some Vector128 APIs added in .NET 7.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1641/21H2)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.1.23115.2
  [Host]     : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
  Job-PBKTIR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-TVEBLV : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

| Method | Job        | BufferSize |        Mean |     Error |    StdDev |      Median |         Min |         Max | Ratio |
|------- |----------- |----------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|
| Append | Current    |        128 |   228.20 ns |  2.366 ns |  2.213 ns |   228.07 ns |   225.54 ns |   232.75 ns |  1.00 |
| Append | Intrinsics |        128 |    17.62 ns |  0.096 ns |  0.075 ns |    17.59 ns |    17.56 ns |    17.80 ns |  0.08 |
|        |            |            |             |           |           |             |             |             |       |
| Append | Current    |       1024 | 1,988.07 ns | 47.120 ns | 54.264 ns | 1,990.18 ns | 1,892.83 ns | 2,089.15 ns |  1.00 |
| Append | Intrinsics |       1024 |    64.71 ns |  0.794 ns |  0.704 ns |    64.67 ns |    63.13 ns |    65.96 ns |  0.03 |
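For context, the numbers above come from a BenchmarkDotNet run over `Append`. A minimal sketch of such a harness (the class name, seed, and setup are illustrative, not the actual benchmark source):

```csharp
using System;
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Crc32AppendBenchmark
{
    private byte[] _buffer = Array.Empty<byte>();
    private readonly Crc32 _crc = new();

    // Matches the BufferSize column in the results table.
    [Params(128, 1024)]
    public int BufferSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _buffer = new byte[BufferSize];
        new Random(42).NextBytes(_buffer);
    }

    [Benchmark]
    public void Append() => _crc.Append(_buffer);
}
```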
Tagging subscribers to this area: @dotnet/area-system-io
Nice improvements. Thanks.
{
    public partial class Crc32
    {
        private const int X86BlockSize = 64;
Nit: I'm generally a fan of putting values like this into named consts, but in this particular case I think it actually muddies the water. There are a bunch of other related const values throughout the code, e.g. 16, 32, 48, that don't have or need such a name, but then when 64 is used there is a name, which to me at least makes it harder to understand the relationship and code. I'd just inline this number into where it's used, and put a comment on the very first use in the up-front guard check that explains where the number comes from.

You could also avoid the numbers and named consts and use things like Vector128<byte>.Count and Vector128<byte>.Count * 4 throughout the code.
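For illustration only (not code from the PR), the computed form the reviewer describes, where the 64-byte minimum is spelled in terms of the vector type:

```csharp
using System;
using System.Runtime.Intrinsics;

internal static class Crc32Thresholds
{
    // Hypothetical helper: 64 bytes is four 128-bit lanes, so the guard can be
    // written with Vector128<byte>.Count instead of a separate named constant.
    public static bool HasFullVectorBlock(ReadOnlySpan<byte> source) =>
        source.Length >= Vector128<byte>.Count * 4; // 16 * 4 == 64
}
```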
After considering it, I agree. The constant was a holdover from a previous iteration where I was checking the length before calling the method, which made it harder to intuit the value. Since you requested that the length check be moved to the Update method, I've left the constant for that purpose only and renamed it appropriately. All the other sites use Vector128<byte>.Count.

Let me know if you still think the Update method should just use Vector128<byte>.Count * 4. I just thought it made things clearer when the logic is split between two files.
I believe I have this all resolved. Thanks.
// Processes the bytes in source in X86BlockSize chunks using x86 intrinsics, followed by processing 16
// byte chunks, and then processing remaining bytes individually. Requires support for Sse2 and Pclmulqdq intrinsics.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
This results in ~800 bytes of asm. We don't want to inline it :)
Done
private const byte CarrylessMultiplyLeftLowerRightUpper = 0x10;
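For readers following the selector constants like the one quoted above, a hedged sketch of how the PCLMULQDQ immediate is encoded (my reading of the instruction's immediate, not text from the PR; the helper name is illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class ClmulSelectors
{
    // For Pclmulqdq.CarrylessMultiply(left, right, imm8): bit 0 of imm8 picks the
    // 64-bit half of `left` (0 = lower, 1 = upper) and bit 4 picks the half of
    // `right`, so 0x10 selects left-lower x right-upper.
    private const byte LowerLower = 0x00;
    private const byte LeftUpperRightLower = 0x01;
    private const byte LeftLowerRightUpper = 0x10;
    private const byte UpperUpper = 0x11;

    internal static Vector128<ulong> MultiplyLeftLowerRightUpper(Vector128<ulong> left, Vector128<ulong> right) =>
        Pclmulqdq.CarrylessMultiply(left, right, LeftLowerRightUpper);
}
```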
// Processes the bytes in source in X86BlockSize chunks using x86 intrinsics, followed by processing 16
// byte chunks, and then processing remaining bytes individually. Requires support for Sse2 and Pclmulqdq intrinsics.
It'd be nice to include the name of the paper this is based on.
Done
Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
Given this references other work, do we need to add an entry in the third party notices file?
x5 = Pclmulqdq.CarrylessMultiply(x1, x0, CarrylessMultiplyLower);
Vector128<ulong> x6 = Pclmulqdq.CarrylessMultiply(x2, x0, CarrylessMultiplyLower);
If you abstract this out into a helper, you can also support it on Arm64 by using the "polynomial multiply" instructions. I believe you specifically want PMULL and PMULL2, which correspond to AdvSimd.PolynomialMultiplyWideningLower/Upper:

Polynomial Multiply Long. This instruction multiplies corresponding elements in the lower or upper half of the vectors of the two source SIMD&FP registers, places the results in a vector, and writes the vector to the destination SIMD&FP register. The destination vector elements are twice as long as the elements that are multiplied.
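A hedged sketch of the kind of helper being suggested, assuming the 64x64-bit polynomial multiply is exposed through Pclmulqdq on x86 and through the Arm crypto intrinsics class (System.Runtime.Intrinsics.Arm.Aes) on Arm64; the helper name is illustrative:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using ArmAes = System.Runtime.Intrinsics.Arm.Aes;

internal static class CarrylessMultiplyHelper
{
    // Carry-lessly multiplies the lower 64-bit lanes of both operands into a
    // 128-bit polynomial product. Assumes the caller has already verified that
    // either Pclmulqdq or the Arm crypto extension is supported.
    internal static Vector128<ulong> CarrylessMultiplyLower(Vector128<ulong> left, Vector128<ulong> right)
    {
        if (Pclmulqdq.IsSupported)
        {
            return Pclmulqdq.CarrylessMultiply(left, right, 0x00);
        }

        // PMULL on the 64-bit lanes, surfaced via the Arm Aes intrinsics class.
        return ArmAes.PolynomialMultiplyWideningLower(left.GetLower(), right.GetLower());
    }
}
```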
I actually considered using PMULL and PMULL2 on ARM when I was writing this, but I didn't do so because:
- Some of this implementation feels like it might be affected by byte order. Probably addressable, but it was a concern.
- I couldn't find an equivalent of Sse2.ShiftRightLogical128BitLane on ARM, so we'd need to create a less performant equivalent (correct me if I'm just blind on this one).
- ARM has a built-in CRC32 intrinsic that uses the same polynomial, so I assumed (potentially incorrectly) that it would be a better choice. So I was thinking I'd come back next and add an ARM-specific implementation.
I've done a bit more research on this, and it appears that on larger buffers the PMULL approach is more performant than the CRC32 intrinsic on modern ARM hardware that supports it, because it can operate on wider data sets. I see evidence of commits in the Linux kernel and Java based on this. It seems like they use the CRC32 intrinsic as a fallback when PMULL is not available.
Based on this, I'm considering refactoring this to support ARM PMULL as well. However, I'm a bit stumped on how to test it since my dev laptop is Intel. Are there any tricks documented to test using QEMU or something similar?
The easiest way is likely just to let our CI cover it if you don't have your own box.
I wouldn't expect any significant differences, and you can filter on BitConverter.IsLittleEndian to ensure it works on big-endian platforms if that's a concern.
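A minimal sketch of the kind of filter described, using a hypothetical predicate name; the scalar path stays in place for big-endian platforms and for hardware without the multiply instructions:

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using ArmAes = System.Runtime.Intrinsics.Arm.Aes;

internal static class VectorizedCrcGate
{
    // Hypothetical predicate: only take the vectorized path on little-endian
    // hardware that exposes a carry-less/polynomial multiply instruction.
    public static bool IsVectorizationSupported =>
        BitConverter.IsLittleEndian
        && (Pclmulqdq.IsSupported || ArmAes.IsSupported);
}
```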
Done
Good question, I was hoping someone would tell me. I'm not really an expert on licensing. This work was based on the ImageSharp implementation of the algorithm, but it's a pretty major overhaul of the C#, so I'd assume that any required attribution would just be to the Intel paper that it was originally based on. The Intel paper doesn't really have clear licensing, at least to my untrained eye. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
If it's a derivative of the ImageSharp code, the relevant information should be added to the third party notices file. If it's instead just based on the algorithm in the paper, I think including the paper title in the source is sufficient. But @richlander would have the final say here.
For reference, here is the ImageSharp code: https://github.com/SixLabors/ImageSharp/blob/f4f689ce67ecbcc35cebddba5aacb603e6d1068a/src/ImageSharp/Formats/Png/Zlib/Crc32.cs#L80
Overhaul or not isn't the bar, nor is the bar high. If you started with the ImageSharp code, then attribution should be given. We're not trying to limit attributions but give them where warranted. From what I read, it is warranted here. Please correct me if I've got that wrong. ImageSharp is going through a license change, but this file says Apache 2, so we're fine. /cc @JimBobSquarePants. We should double check with our legal staff on the Intel paper. I suspect it is fine, however the licensing aspects at the end of the doc are confusing. I'll ask.
All good by me. Would be great if someone could backport the ARM intrinsics to ImageSharp though.
Will this change be suitable for you to use, @JimBobSquarePants, or do you mean so that you have an improved implementation generally (since this change is .NET 8+)?
Assuming I get the ARM intrinsics working, you should be able to take System.IO.Hashing 8.0.0 (once released) as a dependency for net7.0 and forward targets to get the intrinsics, without the need to port it. If you want net6.0 support it would require a port, though. This version currently uses some Vector128 APIs added in .NET 7.
I've just realised we've already added ARM support since that commit, so no need for changes there. The only thing we're missing is the endianness check. We'd definitely delete our code and use this once ImageSharp targets .NET 8.
One question / suggestion, otherwise LGTM.
I heard back. I was told that this logic is good. If you believe you could have made this implementation with just the .asm file, then we can go forward with making a 3PN entry for just it. We should still add one for ImageSharp if you also based your implementation on that code (even if you ended up with something quite different).
Assigned myself to give a final review pass and merge if everything looks good. Please let me know once you've resolved or responded to all feedback and I'll get started (there are a number of open, but likely already handled, comments still above).
To my knowledge, all feedback is resolved above. I just wasn't sure what the procedure is: whether I'm supposed to mark comments resolved or let the reviewer mark them. I'm happy to do so if that's the procedure.
It varies from repo to repo and reviewer to reviewer, unfortunately. The "safest" thing to do is to at least leave a little comment indicating "Fixed" or "Resolved" if it's been explicitly addressed. You can optionally leave a link or comment elaborating if appropriate. That at least lets other reviewers see that something isn't still pending or simply missed.
Okay, I resolved some from gfoidl since he approved the PR and replied on the other conversations. Thanks.
I'm almost done with vectorizing the CRC64 implementation as well. I wanted to check and see, from a workflow perspective, what would be best for you. Would it make sense to get this merged first and then do a separate PR for CRC64? There is some dependency between them, so I don't want to do them both as parallel PRs. But separate commits to
Thanks. Let's get this one merged and then do the other.
// Compute in 8 byte chunks
if (source.Length >= sizeof(ulong))
{
    ReadOnlySpan<ulong> longSource = MemoryMarshal.Cast<byte, ulong>(source);
This should really use ReadUnaligned instead of casting, since there is no guarantee that things are "properly aligned" otherwise.
Rewritten to use ref byte and Unsafe.ReadUnaligned
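A minimal sketch of that pattern, using a hypothetical helper name; it walks the span 8 bytes at a time with Unsafe.ReadUnaligned so no alignment is assumed:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class UnalignedReads
{
    // Hypothetical helper: reads source in 8-byte chunks via ref byte +
    // Unsafe.ReadUnaligned, returning how many bytes were consumed so the
    // caller can finish the tail byte-by-byte.
    internal static int ConsumeUInt64Chunks(ReadOnlySpan<byte> source, Action<ulong> consume)
    {
        ref byte data = ref MemoryMarshal.GetReference(source);
        int consumed = 0;

        while (source.Length - consumed >= sizeof(ulong))
        {
            ulong chunk = Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref data, consumed));
            consume(chunk);
            consumed += sizeof(ulong);
        }

        return consumed;
    }
}
```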
Debug.Fail("This path should be unreachable.");
return default;
We have a new System.Diagnostics.UnreachableException, which is probably better. You'd then want:

ThrowHelper.ThrowUnreachableException();
return default;
Updated
Changes overall LGTM. Should probably have a secondary review/sign-off before merging.
This significantly improves performance for System.IO.Hashing.Crc32 for cases where the source span is 64 bytes or larger on Intel x86/x64 and modern ARM architectures. It also improves the performance on ARM in cases where vectorization is not an option, such as systems without the necessary intrinsic or for short source spans.
The vectorization change only applies to .NET 7 and later targets of System.IO.Hashing because it uses some Vector128 APIs added in .NET 7. The scalar processing ARM changes also apply to .NET 6 and later.
The vectorization algorithm is a C# implementation of the algorithm put forth in the Intel paper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction", published in December 2009. It is a modernization of the implementation found in ImageSharp offered here: #40244 (comment).
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1413)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
Job-UHMIUW : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-IZYDKJ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-LINWAX : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-SOJHQU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
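For reference, the call shape the Append benchmarks above exercise; callers don't change anything, and the vectorized path is selected internally when supported (the buffer contents here are just an example):

```csharp
using System;
using System.IO.Hashing;

byte[] buffer = new byte[1024];
new Random(42).NextBytes(buffer);

// Incremental use: Append can be called repeatedly before reading the hash.
var crc = new Crc32();
crc.Append(buffer);
byte[] hash = crc.GetCurrentHash();

// One-shot static helper for a single buffer.
byte[] oneShot = Crc32.Hash(buffer);

Console.WriteLine(Convert.ToHexString(hash) == Convert.ToHexString(oneShot)); // True
```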