brantburnett
Contributor

@brantburnett brantburnett commented Mar 13, 2023

This significantly improves performance for System.IO.Hashing.Crc32 for cases where the source span is 64 bytes or larger on Intel x86/x64 and modern ARM architectures. It also improves the performance on ARM in cases where vectorization is not an option, such as systems without the necessary intrinsic or for short source spans.

The vectorization change only applies to .NET 7 and later targets of System.IO.Hashing because it uses some Vector128 APIs added in .NET 7. The scalar processing ARM changes also apply to .NET 6 and later.

The vectorization algorithm is a C# implementation of the algorithm put forth in the Intel paper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" in December 2009. It is a modernization of the implementation found in ImageSharp offered here: #40244 (comment).
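For orientation, here is a minimal scalar sketch of the checksum that any vectorized path has to reproduce: the standard reflected CRC-32 with polynomial 0xEDB88320, computed one bit at a time. This is not the code in this PR, nor the library's existing scalar fallback; the class and method names are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical reference helper, for illustration only.
public static class Crc32Reference
{
    // Bit-reflected form of the CRC-32 polynomial 0x04C11DB7.
    private const uint ReflectedPolynomial = 0xEDB88320u;

    public static uint Compute(ReadOnlySpan<byte> source)
    {
        uint crc = 0xFFFFFFFFu; // conventional initial value

        foreach (byte b in source)
        {
            crc ^= b;

            // Process the byte one bit at a time, least significant bit first.
            for (int bit = 0; bit < 8; bit++)
            {
                crc = (crc & 1) != 0
                    ? (crc >> 1) ^ ReflectedPolynomial
                    : crc >> 1;
            }
        }

        return ~crc; // conventional final XOR
    }
}
```

For the ASCII bytes of "123456789" this returns 0xCBF43926, the usual CRC-32 check value, so a slow routine like this can serve as an oracle when testing a faster path.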

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1413)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
Job-UHMIUW : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-IZYDKJ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

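| Method | Job        | BufferSize |        Mean |     Error |    StdDev |      Median |         Min |         Max | Ratio | RatioSD |
|------- |----------- |----------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|--------:|
| Append | Current    |         16 |    31.41 ns |  0.637 ns |  0.734 ns |    31.35 ns |    30.23 ns |    32.36 ns |  1.00 |    0.00 |
| Append | Intrinsics |         16 |    30.97 ns |  0.778 ns |  0.896 ns |    31.14 ns |    29.59 ns |    32.12 ns |  0.99 |    0.04 |
| Append | Current    |        128 |   250.96 ns |  4.743 ns |  5.075 ns |   251.44 ns |   242.05 ns |   257.93 ns |  1.00 |    0.00 |
| Append | Intrinsics |        128 |    19.05 ns |  0.297 ns |  0.263 ns |    18.99 ns |    18.55 ns |    19.57 ns |  0.08 |    0.00 |
| Append | Current    |       1024 | 1,990.18 ns | 31.113 ns | 29.104 ns | 1,994.39 ns | 1,948.98 ns | 2,030.06 ns |  1.00 |    0.00 |
| Append | Intrinsics |       1024 |    58.31 ns |  1.452 ns |  1.672 ns |    58.49 ns |    55.71 ns |    60.18 ns |  0.03 |    0.00 |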

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-LINWAX : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-SOJHQU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

| Method | Job        | BufferSize |         Mean |     Error |    StdDev |       Median |          Min |          Max | Ratio |
|------- |----------- |----------- |-------------:|----------:|----------:|-------------:|-------------:|-------------:|------:|
| Append | Current    |         16 |    44.797 ns | 0.0125 ns | 0.0117 ns |    44.796 ns |    44.777 ns |    44.815 ns |  1.00 |
| Append | Intrinsics |         16 |     8.521 ns | 0.0502 ns | 0.0445 ns |     8.525 ns |     8.444 ns |     8.590 ns |  0.19 |
| Append | Current    |        128 |   363.017 ns | 0.0252 ns | 0.0223 ns |   363.021 ns |   362.977 ns |   363.048 ns |  1.00 |
| Append | Intrinsics |        128 |    29.491 ns | 0.0412 ns | 0.0385 ns |    29.498 ns |    29.414 ns |    29.543 ns |  0.08 |
| Append | Current    |       1024 | 2,887.236 ns | 0.3149 ns | 0.2946 ns | 2,887.090 ns | 2,886.898 ns | 2,887.818 ns |  1.00 |
| Append | Intrinsics |       1024 |    92.073 ns | 0.4069 ns | 0.3807 ns |    92.078 ns |    91.529 ns |    92.833 ns |  0.03 |

This significantly improves performance for System.IO.Hashing.Crc32 for
cases where the source span is 64 bytes or larger on Intel x86/x64
architectures.

The change only applies to .NET 7 and later targets of System.IO.Hashing
because it uses some Vector128 APIs added in .NET 7.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1641/21H2)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.1.23115.2
  [Host]     : .NET 8.0.0 (8.0.23.11008), X64 RyuJIT AVX2
  Job-PBKTIR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-TVEBLV : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

| Method | Job        | BufferSize |        Mean |     Error |    StdDev |      Median |         Min |         Max | Ratio |
|------- |----------- |----------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|
| Append | Current    |        128 |   228.20 ns |  2.366 ns |  2.213 ns |   228.07 ns |   225.54 ns |   232.75 ns |  1.00 |
| Append | Intrinsics |        128 |    17.62 ns |  0.096 ns |  0.075 ns |    17.59 ns |    17.56 ns |    17.80 ns |  0.08 |
|        |            |            |             |           |           |             |             |             |       |
| Append | Current    |       1024 | 1,988.07 ns | 47.120 ns | 54.264 ns | 1,990.18 ns | 1,892.83 ns | 2,089.15 ns |  1.00 |
| Append | Intrinsics |       1024 |    64.71 ns |  0.794 ns |  0.704 ns |    64.67 ns |    63.13 ns |    65.96 ns |  0.03 |
@ghost ghost added the area-System.IO and community-contribution labels Mar 13, 2023
@ghost

ghost commented Mar 13, 2023

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.


@brantburnett brantburnett marked this pull request as ready for review March 14, 2023 12:37
Member

@stephentoub stephentoub left a comment

Nice improvements. Thanks.

{
public partial class Crc32
{
private const int X86BlockSize = 64;
Member

Nit: I'm generally a fan of putting values like this into named consts, but in this particular case I think it actually muddies the water. There are a bunch of other related const values throughout the code, e.g. 16, 32, 48, that don't have or need such a name, but then when 64 is used there is a name, which to me at least makes it harder to understand the relationship and code. I'd just inline this number into where it's used, and put a comment on the very first use in the up-front guard check that explains where the number comes from.

You could also avoid the numbers and named consts and use things like Vector128<byte>.Count and Vector128<byte>.Count * 4 throughout the code.

Contributor Author

After considering it, I agree. The constant was a holdover from a previous iteration where I was checking the length before calling the method, which made it harder to intuit the value. Since you requested that the length check be moved to the Update method, I've left the constant for that purpose only and renamed it appropriately. All the other sites use Vector128<byte>.Count.

Let me know if you still think the Update method should just use Vector128<byte>.Count * 4. I just thought it made things clearer when the logic is split between two files.

Contributor Author

I believe I have this all resolved. Thanks.
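As a rough sketch of the point discussed in this thread, the chunk sizes can be derived from `Vector128<byte>.Count` instead of scattering the literals 16 and 64 through the code. The helper below only computes how many bytes each stage would handle (64-byte blocks, then 16-byte vectors, then a scalar tail); the type and method names are hypothetical and this is not the PR's actual loop.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class Crc32Chunking
{
    public static (int BlockBytes, int VectorBytes, int TailBytes) Split(int length)
    {
        int vectorSize = Vector128<byte>.Count; // 16 bytes per 128-bit vector
        int blockSize = vectorSize * 4;         // 64 bytes: four vectors per iteration

        // Below one full block, everything is left to the scalar path.
        if (length < blockSize)
        {
            return (0, 0, length);
        }

        int blockBytes = length - (length % blockSize);
        int remaining = length - blockBytes;
        int vectorBytes = remaining - (remaining % vectorSize);

        return (blockBytes, vectorBytes, remaining - vectorBytes);
    }
}
```

For a 100-byte span this yields one 64-byte block, 32 bytes handled a vector at a time, and a 4-byte tail.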


// Processes the bytes in source in X86BlockSize chunks using x86 intrinsics, followed by processing 16
// byte chunks, and then processing remaining bytes individually. Requires support for Sse2 and Pclmulqdq intrinsics.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

This results in ~800 bytes of asm. We don't want to inline it :)

Contributor Author

Done

private const byte CarrylessMultiplyLeftLowerRightUpper = 0x10;

// Processes the bytes in source in X86BlockSize chunks using x86 intrinsics, followed by processing 16
// byte chunks, and then processing remaining bytes individually. Requires support for Sse2 and Pclmulqdq intrinsics.
Member

It'd be nice to include the name of the paper this is based on.

Contributor Author

Done

@ghost

ghost commented Mar 15, 2023

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.


@danmoseley
Member

Given this references other work do we need to add an entry in the third party notices file?

Comment on lines 49 to 50
x5 = Pclmulqdq.CarrylessMultiply(x1, x0, CarrylessMultiplyLower);
Vector128<ulong> x6 = Pclmulqdq.CarrylessMultiply(x2, x0, CarrylessMultiplyLower);
Member

@tannergooding tannergooding Mar 15, 2023

If you abstract this out into a helper, you can also support it on Arm64 by using the "polynomial multiply" instructions. Believe you specifically want PMULL and PMULL2 which correspond to AdvSimd.PolynomialMultiplyWideningLower/Upper

Polynomial Multiply Long. This instruction multiplies corresponding elements in the lower or upper half of the
vectors of the two source SIMD&FP registers, places the results in a vector, and writes the vector to the destination
SIMD&FP register. The destination vector elements are twice as long as the elements that are multiplied.
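A minimal sketch of the kind of helper suggested here, assuming the 64-bit polynomial multiply is the one exposed on `System.Runtime.Intrinsics.Arm.Aes` (the PMULL form that takes 64-bit operands) and that `Pclmulqdq.CarrylessMultiply` with control byte 0x00 selects the lower halves of both inputs. The class name, method names, and the software fallback are hypothetical; the fallback is there only so the sketch runs on any machine and is not part of the PR.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using ArmAes = System.Runtime.Intrinsics.Arm.Aes;

// Hypothetical helper, for illustration only.
public static class CarrylessMultiplyHelper
{
    // Multiplies the lower 64 bits of each operand as polynomials over GF(2),
    // producing a 128-bit product.
    public static Vector128<ulong> MultiplyLower(Vector128<ulong> left, Vector128<ulong> right)
    {
        if (Pclmulqdq.IsSupported)
        {
            // x86/x64: PCLMULQDQ, lower half of left times lower half of right.
            return Pclmulqdq.CarrylessMultiply(left, right, 0x00);
        }

        if (ArmAes.IsSupported)
        {
            // Arm64: PMULL on the lower 64-bit lanes.
            return ArmAes.PolynomialMultiplyWideningLower(left.GetLower(), right.GetLower());
        }

        // Illustrative fallback (assumption, not in the PR).
        return SoftwareMultiply(left.ToScalar(), right.ToScalar());
    }

    // Bit-by-bit carry-less multiply of two 64-bit values.
    public static Vector128<ulong> SoftwareMultiply(ulong a, ulong b)
    {
        ulong lo = 0, hi = 0;

        for (int i = 0; i < 64; i++)
        {
            if (((b >> i) & 1) != 0)
            {
                lo ^= a << i;
                if (i != 0)
                {
                    // Shift counts are masked to 6 bits in C#, so i == 0 must be skipped.
                    hi ^= a >> (64 - i);
                }
            }
        }

        return Vector128.Create(lo, hi);
    }
}
```

With a wrapper like this, the folding code can stay platform-neutral and only the helper needs to know which instruction set is present.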

Contributor Author

I actually considered using PMULL and PMULL2 on ARM when I was writing this, but I didn't do so because:

  • Some of this implementation feels like it might be affected by byte order. Probably addressable, but it was a concern.
  • I couldn't find an equivalent of Sse2.ShiftRightLogical128BitLane on ARM, so we'd need to create a less performant equivalent (correct me if I'm just blind on this one)
  • ARM has a built-in CRC32 intrinsic that uses the same polynomial, so I assumed (potentially incorrectly) that it would be a better choice. So I was thinking I'd come back next and add an ARM-specific implementation.

Contributor Author

I've done a bit more research on this, and it appears that the PMULL approach is more performant on modern ARM that supports it than the CRC32 intrinsic on larger buffers because it can operate on wider data sets. I see evidence of commits in the Linux kernel and Java based on this. It seems like they use the CRC32 intrinsic as a fallback when PMULL is not available.

Based on this, I'm considering refactoring this to support ARM PMULL as well. However, I'm a bit stumped on how to test it since my dev laptop is Intel. Are there any tricks documented to test using QEMU or something similar?

Member

The easiest way is likely just to let our CI cover it if you don't have your own box.

I wouldn't expect any significant differences and you can filter on BitConverter.IsLittleEndian to ensure it works on BigEndian platforms if that's a concern.

Contributor Author

Done
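The endianness filter suggested in this thread might look roughly like the guard below. It is a sketch under assumptions: the type and method names are hypothetical, and a real check would need to cover every instruction set the vectorized loop actually uses.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using ArmAes = System.Runtime.Intrinsics.Arm.Aes;

// Hypothetical guard, for illustration only.
public static class Crc32VectorGuard
{
    public static bool CanUseVectorPath(ReadOnlySpan<byte> source) =>
        // Skip big-endian platforms, where the vector loads would see a different byte order.
        BitConverter.IsLittleEndian
        // A carry-less multiply instruction must be available.
        && (Pclmulqdq.IsSupported || ArmAes.IsSupported)
        // At least four 128-bit vectors (64 bytes) of input.
        && source.Length >= Vector128<byte>.Count * 4;
}
```

Callers would fall back to the scalar path whenever this returns false, so short spans and unsupported hardware keep the existing behavior.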

@brantburnett
Contributor Author

> Given this references other work do we need to add an entry in the third party notices file?

Good question, I was hoping someone would tell me. I'm not really an expert on licensing. This work was based on the ImageSharp implementation of the algorithm, but it's a pretty major overhaul of the C# so I'd assume that any required attribution would just be to the Intel paper that it was originally based on. The Intel paper doesn't really have clear licensing, at least to my untrained eye. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

@stephentoub
Member

If it's a derivative of the ImageSharp code, the relevant information should be added to the third party notices file. If it's instead just based on the algorithm in the paper, I think including the paper title in the source is sufficient. But @richlander would have the final say here.

@brantburnett
Contributor Author

> If it's a derivative of the ImageSharp code, the relevant information should be added to the third party notices file. If it's instead just based on the algorithm in the paper, I think including the paper title in the source is sufficient. But @richlander would have the final say here.

For reference here is the ImageSharp code: https://github.com/SixLabors/ImageSharp/blob/f4f689ce67ecbcc35cebddba5aacb603e6d1068a/src/ImageSharp/Formats/Png/Zlib/Crc32.cs#L80

@richlander
Member

Overhaul or not overhaul isn't the bar, nor is the bar high. If you started with the ImageSharp code, then attribution should be given. We're not trying to limit attributions but give them where warranted. From what I read, it is. Please correct me if I've got that wrong.

ImageSharp is going through a license change, but this file says Apache 2, so we're fine. /cc @JimBobSquarePants.

We should double check with our legal staff on the Intel paper. I suspect it is fine, however the licensing aspects at the end of the doc are confusing. I'll ask.

@JimBobSquarePants

All good by me. Would be great if someone could backport the ARM intrinsics to ImageSharp though.

@richlander
Member

Will this change be suitable for you to use @JimBobSquarePants, or do you mean so that you have an improved implementation generally (since this change is .NET 8+)?

@brantburnett
Contributor Author

> All good by me. Would be great if someone could backport the ARM intrinsics to ImageSharp though.

Assuming I get the ARM intrinsics working, you should be able to take System.IO.Hashing 8.0.0 (once released) as a dependency for net7.0 and forward targets to get the intrinsics, without the need to port it. If you want net6.0 support it would require a port, though. This version is currently using some APIs added in .NET 7 to Vector128.

@JimBobSquarePants

JimBobSquarePants commented Mar 18, 2023

@richlander @brantburnett

I've just realised we've already added ARM support since that commit so no need for changes there.

The only thing we're missing is the endianness check.
https://github.com/dotnet/runtime/pull/83321/files#diff-8c4ea1d8b9624f9e2f25b7ffc2f776d6ea77492d9758f116bf1058c6882c2481R176-R180

We'd definitely delete our code once ImageSharp targets .NET 8 and use this.

@brantburnett brantburnett requested review from stephentoub and removed request for gfoidl March 22, 2023 12:13
Member

@gfoidl gfoidl left a comment

One question / suggestion, otherwise LGTM.

@brantburnett brantburnett requested review from gfoidl and removed request for stephentoub March 22, 2023 23:00
@richlander
Member

I'm assuming that, as the developer of this implementation, that you find that this .asm file you linked to has the same/sufficient information as the paper such that you would have been able to move forward if you'd only had this file initially

I heard back. I was told that this logic is good. If you believe you could have made this implementation with just the .asm file, then we can go forward with making a 3PN entry for just it. We should still add one for ImageSharp if you also based your implementation on that code (even if you ended up with something quite different).

@tannergooding tannergooding self-assigned this Mar 23, 2023
@tannergooding
Member

Assigned myself to give a final review pass and merge if everything looks good.

Please let me know once you've resolved or responded to all feedback and I'll get started (there are a number of open, but likely already handled comments still above).

@brantburnett
Contributor Author

> Assigned myself to give a final review pass and merge if everything looks good.
>
> Please let me know once you've resolved or responded to all feedback and I'll get started (there are a number of open, but likely already handled comments still above).

To my knowledge, all feedback is resolved above. I just wasn't sure what the procedure is, if I'm supposed to mark it resolved or let the reviewer mark it as resolved. I'm happy to do so if that's the procedure.

@tannergooding
Member

It varies from repo to repo and reviewer to reviewer, unfortunately.

The "safest" thing to do is to at least leave a little comment indicating "Fixed" or "Resolved" if it's been explicitly addressed. You can optionally leave a link or comment elaborating if appropriate.

That at least lets other reviewers see that something isn't still pending or simply missed.

@brantburnett
Contributor Author

> It varies from repo to repo and reviewer to reviewer, unfortunately.
>
> The "safest" thing to do is to at least leave a little comment indicating "Fixed" or "Resolved" if it's been explicitly addressed. You can optionally leave a link or comment elaborating if appropriate.
>
> That at least lets other reviewers see that something isn't still pending or simply missed.

Okay, I resolved some from gfoidl since he approved the PR and replied on the other conversations. Thanks.

@brantburnett
Contributor Author

@tannergooding

I'm almost done with vectorizing the CRC64 implementation as well. I wanted to check and see, from a workflow perspective, what would be best for you. Would it make sense to get this merged first and then do a separate PR for CRC64? There is some dependency between them, so I don't want to do them both as parallel PRs. But separate commits to main may give a clearer history. Or would it save you time to deal with them both at once in a single PR?

@stephentoub
Member

Thanks. Let's get this one merged and then do the other.

// Compute in 8 byte chunks
if (source.Length >= sizeof(ulong))
{
ReadOnlySpan<ulong> longSource = MemoryMarshal.Cast<byte, ulong>(source);
Member

@tannergooding tannergooding Apr 5, 2023

This should really use ReadUnaligned instead of casting since there is no guarantee that things are "properly aligned" otherwise.

Contributor Author

Rewritten to use ref byte and Unsafe.ReadUnaligned
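A minimal sketch of the pattern described here: walking a span in 8-byte chunks through a `ref byte` and `Unsafe.ReadUnaligned<ulong>` rather than `MemoryMarshal.Cast`, so no alignment is assumed. The XOR accumulation is a stand-in for the real per-chunk CRC step, and the type and method names are hypothetical.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class UnalignedReads
{
    // XORs together every complete 8-byte chunk of source and reports how many
    // bytes were consumed; any trailing bytes are left for the caller.
    public static ulong XorUInt64Chunks(ReadOnlySpan<byte> source, out int bytesConsumed)
    {
        ref byte start = ref MemoryMarshal.GetReference(source);

        // Round the length down to a multiple of 8.
        int chunkBytes = source.Length & ~(sizeof(ulong) - 1);

        ulong result = 0;
        for (int offset = 0; offset < chunkBytes; offset += sizeof(ulong))
        {
            // ReadUnaligned makes no assumption about the address being 8-byte aligned.
            result ^= Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref start, offset));
        }

        bytesConsumed = chunkBytes;
        return result;
    }
}
```

Slicing a buffer at an odd offset, for example `buffer.AsSpan(1, 9)`, is a simple way to exercise the unaligned case in a test.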

Comment on lines 30 to 31
Debug.Fail("This path should be unreachable.");
return default;
Member

We have a new System.Diagnostics.UnreachableException which is probably better.

You'd then want:

ThrowHelper.ThrowUnreachableException();
return default;

Contributor Author

Updated
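To illustrate the suggested pattern outside the runtime repository, here is a small sketch. `ThrowHelper` below is a hypothetical stand-in for the internal helper named in the review, and `BytesPerVector` is an invented example method; only `System.Diagnostics.UnreachableException` (available from .NET 7) is a real framework type.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;

// Hypothetical stand-in for the internal ThrowHelper used in dotnet/runtime.
internal static class ThrowHelper
{
    [DoesNotReturn]
    internal static void ThrowUnreachableException() => throw new UnreachableException();
}

// Hypothetical example type, for illustration only.
public static class LaneInfo
{
    // Callers are expected to pass only 128 or 256.
    public static int BytesPerVector(int vectorBits)
    {
        if (vectorBits == 128) return 16;
        if (vectorBits == 256) return 32;

        ThrowHelper.ThrowUnreachableException();

        // Still required: the compiler's "all paths return a value" check
        // does not take [DoesNotReturn] into account.
        return default;
    }
}
```

Unlike `Debug.Fail`, the exception is also thrown in release builds, so a broken invariant surfaces instead of silently returning a default value.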

@tannergooding
Member

Changes overall LGTM. Should probably have a secondary review/sign-off before merging.

@stephentoub stephentoub merged commit d0ca558 into dotnet:main Apr 22, 2023
@brantburnett brantburnett deleted the crc32-x86 branch April 23, 2023 01:19
@adamsitnik adamsitnik added the tenet-performance Performance related issue label May 17, 2023
@adamsitnik adamsitnik added this to the 8.0.0 milestone May 17, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 16, 2023
Labels: area-System.Security, community-contribution, tenet-performance