Improve JPEG Block8x8F Intrinsics for Vector128 paths. #2918
Conversation
Pull Request Overview
This PR enhances JPEG decoding by migrating intrinsic implementations in Block8x8F from legacy SSE/AVX paths to more versatile Vector128 and Vector256 methods, which should boost performance on mobile platforms and improve consistency.
- Renamed methods and field references (e.g. TransposeInplace → TransposeInPlace) for improved naming clarity.
- Introduced separate Vector128 and Vector256 intrinsic implementations and removed legacy intrinsic and generated files.
- Updated SIMD helper classes to use new alias naming (e.g. Vector128_ instead of Vector128Utilities).
Reviewed Changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs | Renamed transpose method and updated intrinsic vector field references. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs | Updated SIMD intrinsic checks and operations; added new normalization and load methods; removed legacy warnings. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector256.cs | Added a new Vector256-based implementation of Block8x8F operations. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector128.cs | Added a new Vector128-based implementation of Block8x8F operations. |
| src/ImageSharp/Common/Helpers/*Utilities.cs | Updated utility classes to use the new alias naming conventions (e.g. `Vector128_`). |
| src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | Adjusted references to intrinsics helpers in accordance with alias renaming. |
Files not reviewed (1)
- src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Generated.tt: Language not supported
Comments suppressed due to low confidence (1)
src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs:23
- The method 'TransposeInPlace' has been renamed from 'TransposeInplace' for consistency. Confirm that all call sites and inline comments are updated to reflect this naming convention.
block.TransposeInPlace();
@JimBobSquarePants, when I build the NuGet package I get a build warning (screenshot not captured). I assume that's what the above says, and we want to be using the .NET 9 SDK. Should I be forcing it to use .NET 9 with a global.json or some other setting?
ImageSharp actually only targets a single LTS version. We fudge the target for CI and tests so we can track potential JIT issues when building against previews.
Ah, gotcha. All good. I have my head in the MAUI world a lot; they follow the latest release instead of LTS, so I'm not used to seeing .NET 8 pop up 😅
Doesn't seem like the numbers moved too much. Some are higher, some lower (or within the margin of error). Keep in mind these tests are not using BenchmarkDotNet yet, so there is no warmup or any of the other things it does: I just loop 10 times and jot down the average. Debug (3.1.8)
Debug (Modernize JPEG Color Converters)
Debug (this updated PR)
Release (3.1.8)
Release (Modernize JPEG Color Converters)
Release (this updated PR)
I think we need to find a way to properly benchmark, because I cannot see how the numbers could be worse in this PR than in the last one. Edit: it appears we could just run BenchmarkDotNet… https://benchmarkdotnet.org/articles/samples/IntroXamarin.html
The new project I started is using that with MAUI. I'm having some issues getting it working with the Mac Catalyst (macOS desktop) variant. But if the current issues are on Android I can just make a net9.0-android repo with an Android-only app to put the tests in and focus on that while I deal with full MAUI later.
Yeah, as I recall iOS was very good, let’s focus on the numbers for Android.
How is the project being compiled for Android? Is it using Mono LLVM or standard Mono AOT?
```csharp
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
    => Vector128.Min(Vector128.Max(value, min), max);
```
In .NET 9+ this can just use `Vector128.Clamp`. Alternatively it can use `Vector128.ClampNative` if you don't need to care about `-0` vs `+0` or `NaN` handling for `float`/`double`.
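As a sketch of that suggestion (assuming a net9.0 target, where both APIs exist; the helper class name mirrors the PR's `Vector128_` alias):

```csharp
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // .NET 9+: defer to the framework's Clamp, equivalent to Min(Max(value, min), max).
    public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
        => Vector128.Clamp(value, min, max);

    // When -0/+0 ordering and NaN propagation for float/double don't matter,
    // ClampNative maps to the platform's native min/max instructions.
    public static Vector128<T> ClampNative<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
        => Vector128.ClampNative(value, min, max);
}
```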
```csharp
if (Avx.IsSupported)
{
    Vector256<float> lower = Avx.RoundToNearestInteger(vector.GetLower());
    Vector256<float> upper = Avx.RoundToNearestInteger(vector.GetUpper());
    return Vector512.Create(lower, upper);
}
```
What's the AVX path for with Vector512? `Vector512.IsHardwareAccelerated` will only report true if `Avx512F+BW+CD+DQ+VL` is supported, so this path should generally be "dead".
```csharp
[MethodImpl(InliningOptions.ShortMethod)]
public void NormalizeColorsAndRoundInPlaceVector128(float maximum)
{
    Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));
```
This would actually be more efficient as `Vector128.Ceiling(Vector128.Create(maximum) * 0.5f)`.
While the codegen (see below) is the same size and looks nearly identical, the change to be vectorized instead of scalar avoids a very minor penalty that exists because scalar operations mutate element 0 while preserving elements 1, 2, and 3 as is.
In general it's better to convert to vector up front and do operations as vectorized where possible.
Here's what you're getting now:

```asm
; XMM
vmulss xmm0, xmm1, [reloc @RWD00]
vroundss xmm0, xmm0, xmm0, 0xa
vbroadcastss xmm0, xmm0
```
Here's what you would be getting with the suggested change:

```asm
; XMM
vbroadcastss xmm0, xmm1
vmulps xmm0, xmm0, [reloc @RWD00]
vroundps xmm0, xmm0, 2
```
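Put together, the suggestion is a one-line swap (a sketch; `off` feeds the same normalization code either way):

```csharp
// Current: scalar multiply + scalar ceiling, then broadcast the scalar result.
Vector128<float> offScalar = Vector128.Create(MathF.Ceiling(maximum * 0.5F));

// Suggested: broadcast first, then do the multiply and ceiling as vector ops.
// This also lets the JIT reuse Vector128.Create(maximum) where it is needed again.
Vector128<float> offVector = Vector128.Ceiling(Vector128.Create(maximum) * 0.5f);
```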
Notably it also allows the `Vector128.Create(maximum)` used for initializing `max` to be reused, rather than requiring a distinct instruction.
Ha! I shouldn't have missed these!
```csharp
dRef = Avx.ConvertToVector256Single(top);
Unsafe.Add(ref dRef, 1) = Avx.ConvertToVector256Single(bottom);
```
This one becomes stylistic preference, but you can freely mix the xplat APIs and the platform-specific intrinsics.
That is, while you want to use `V256 Avx2.ConvertToVector256Int32(V128)` instead of `V256.WidenLower/WidenUpper` for efficiency, you can still just use `V256.ConvertToSingle()` instead of `Avx.ConvertToVector256Single` since it is a 1-to-1 mapping.
-- As a note to myself, it would likely be beneficial to have `V256.Widen(V128)` APIs or similar; or to pattern match `V256.WidenLower(V256)` followed by `V256.WidenUpper(V256)`, so devs don't need to use platform-specific intrinsics in such cases.
I didn't bother porting the existing Avx code as it worked but I might still do it.
```csharp
Vector256<int> row0 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 0), Unsafe.Add(ref bBase, i + 0)));
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));
```
nit: You can use `x * y` instead of `Avx.Multiply(x, y)`.
You can also use `Vector256.ConvertToInt32` instead of `Avx.ConvertToVector256Int32`.
I'm actually seeing a difference in output if I switch from `Avx.ConvertToVector256Int32` to `Vector256.ConvertToInt32`. Do they use the same rounding? `Avx.ConvertToVector256Int32` uses the equivalent of `MidpointRounding.ToEven`, but the `Vector256` equivalent is undocumented.
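For reference, the instruction behind `Avx.ConvertToVector256Int32` (CVTPS2DQ, under the default MXCSR rounding mode) rounds to nearest with ties to even, which matches `Math.Round` with `MidpointRounding.ToEven`. A quick scalar illustration of that tie-breaking:

```csharp
using System;

class RoundingDemo
{
    static void Main()
    {
        // Ties go to the even neighbour, not away from zero.
        Console.WriteLine(Math.Round(0.5, MidpointRounding.ToEven));  // 0
        Console.WriteLine(Math.Round(1.5, MidpointRounding.ToEven));  // 2
        Console.WriteLine(Math.Round(2.5, MidpointRounding.ToEven));  // 2
        Console.WriteLine(Math.Round(-2.5, MidpointRounding.ToEven)); // -2
    }
}
```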
```csharp
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));

Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
row = Avx2.PermuteVar8x32(row.AsInt32(), multiplyIntoInt16ShuffleMask).AsInt16();
```
nit: This should do the right thing in .NET 8 if you have `Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7))`.
In general, declaring the indices directly in the call like this will do the right thing provided all indices are constant. We improved the handling in .NET 9, and even more so in .NET 10, to handle more patterns so that devs who manually hoist the indices will still get good codegen if the JIT can detect them as constant during compilation (so in .NET 10 you can keep the code as you have it right now, rather than declaring `V256.Create(...)` directly inside the `Vector256.Shuffle` call as is needed for .NET 8).
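The two forms described above, sketched side by side (variable names follow the surrounding diff):

```csharp
// .NET 8: the indices must be declared directly in the call
// for the JIT to see them as constant and emit a single shuffle.
row = Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7)).AsInt16();

// .NET 10: a hoisted index vector is also recognized,
// provided the JIT can prove it constant at compile time.
Vector256<int> indices = Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7);
row = Vector256.Shuffle(row.AsInt32(), indices).AsInt16();
```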
```csharp
Vector256<float> r0 = Avx.InsertVector128(
    this.V256_0,
    Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
    1);
```
nit: You can use `this.V256_0.WithUpper(Unsafe.As<Vector4, Vector128<float>>(ref this.V4L))`.
It can be the following, can it not?

```csharp
Vector256<float> r0 = this.V256_0.WithUpper(this.V4L.AsVector128());
```
```diff
@@ -421,16 +488,17 @@ public void LoadFromInt16ExtendedAvx2(ref Block8x8 source)
 /// <param name="value">Value to compare to.</param>
 public bool EqualsToScalar(int value)
 {
     // TODO: Can we provide a Vector128 implementation for this?
```
What's blocking a V128 path from being added? At a glance it looks like it should be almost a copy/paste of the V256 path...
```csharp
Vector256<int> areEqual = Avx2.CompareEqual(Avx.ConvertToVector256Int32WithTruncation(Unsafe.Add(ref this.V256_0, i)), targetVector);
if (Avx2.MoveMask(areEqual.AsByte()) != equalityMask)
```
This could be simplified to `if (!V256.EqualsAll(V256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)), targetVector))`.
That avoids a dependency on `MoveMask` and maps better to V128/V512.
-- Notably on .NET 9+ you may want to use `V256.ConvertToInt32Native` instead, since `ConvertToInt32` will saturate for out-of-bounds values, rather than saturating on some platforms and returning a "sentinel" value on x86/x64.
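Spelled out against the loop body above (a sketch; `equalityMask` is no longer needed at all):

```csharp
// xplat replacement for the CompareEqual + MoveMask pair:
// EqualsAll returns true only when every lane compares equal.
if (!Vector256.EqualsAll(
        Vector256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)),
        targetVector))
{
    return false;
}
```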
```csharp
Vector256<float> tmp0 = Avx.Add(block.V256_0, block.V256_7);
Vector256<float> tmp7 = Avx.Subtract(block.V256_0, block.V256_7);
Vector256<float> tmp1 = Avx.Add(block.V256_1, block.V256_6);
Vector256<float> tmp6 = Avx.Subtract(block.V256_1, block.V256_6);
Vector256<float> tmp2 = Avx.Add(block.V256_2, block.V256_5);
Vector256<float> tmp5 = Avx.Subtract(block.V256_2, block.V256_5);
Vector256<float> tmp3 = Avx.Add(block.V256_3, block.V256_4);
Vector256<float> tmp4 = Avx.Subtract(block.V256_3, block.V256_4);
```
nit: These could use `x + y` and `x - y`. Similar for other arithmetic operations in the method.
```csharp
if (Vector128.IsHardwareAccelerated)
{
    Vector128<int> targetVector = Vector128.Create(value);
    ref Vector4 blockStride = ref this.V0L;
```
Is `blockStride` intentionally unused?
Badly named (copy/paste), but yeah, I'm pointing to the `Vector4` field at offset 0. I don't have explicit Vector128 fields, but am considering adding them to avoid some of the To/From `Vector128` code.
@beeradmoore Those numbers are wild and yes, Skia is cheating there. Here's my desktop decoding the same image. I benchmarked against System.Drawing because the JPEG decoder there is incredibly fast (I don't know what the underlying implementation is, but it's blazing). Appreciating the fact that the CPU on the Android device (could you post the details btw?) is less powerful than my laptop, I'm surprised that the …

@tannergooding I'm suspicious of the scalar timing on desktop and those Android numbers lining up so closely. Could just be coincidence though...
My test Android device is a Pixel 2 XL. 8 years old and still chugging along. From the output of one of the previous runs (test_pr.txt) I also see … I checked to see if I could use …
I think the easiest way to confirm this is to add the following, which should tell the Mono AOT compiler to skip intrinsic usage...

```xml
<ItemGroup>
  <MonoAOTCompilerDefaultProcessArguments Include="-O=-intrins" />
</ItemGroup>
```
With that added:

Before,

After,
If that setting is correct, then intrinsics are not being used at all in any scenario. Are you able to stick a … Is this relevant?
I'll make a details page and make it output some general device info. Any other properties you'd care about?
I’d say any chipset info, runtime info, and intrinsics tests. E.g. `AdvSimd.IsSupported`.
I added everything I could find. Very likely too much. The info is from MAUI's DeviceInfo. I couldn't get an OpenGL surface to initialise, so I couldn't fetch GPU-specific information. Here is a giant dump of my Android device in release mode.
(Side note, I am not sure why Android.OS.Build.Type is ….) I did a build with this, and then with the above, and the only change was …

But that may be my misunderstanding of what that property means.
I did another test of a debug build. Keeping in mind the csproj is …

The only property that changed (aside from …) …

Swapping back to …
So it should be getting generally accelerated in most places, except for where …
That’s the odd thing. There are actually very few places left where I use AdvSimd directly (png, web encoders). Almost everything I do have uses xplat as a fallback also.
```diff
 {
     [MethodImpl(MethodImplOptions.AggressiveInlining)]
-    get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported;
+    get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported || PackedSimd.IsSupported;
```
For Arm64 and WASM you should just be able to use (for `byte`) `Vector128.Shuffle`, due to how `VectorTableLookup` and `Swizzle` work.
You really just want at least `Ssse3` for x86/x64, since they cannot be done otherwise.
So perhaps you want:
```csharp
get
{
    if (Vector128.IsHardwareAccelerated)
    {
        if (RuntimeInformation.ProcessArchitecture is Architecture.X86 or Architecture.X64)
        {
            return Ssse3.IsSupported;
        }

        // You could optionally do:
        //     return ProcessArchitecture is Architecture.Arm64 or Architecture.Wasm;
        // if you wanted to restrict it to platforms you know should be safe.
        return true;
    }

    return false;
}
```
Good catch. I’ll review my other helpers.
```diff
 Vector128<short> u0 = Vector128_.PackSignedSaturate(w0, w1);
 Vector128<short> u1 = Vector128_.PackSignedSaturate(w2, w3);

-Unsafe.Add(ref destinationBase, i) = Vector128Utilities.PackUnsignedSaturate(u0, u1);
+Unsafe.Add(ref destinationBase, i) = Vector128_.PackUnsignedSaturate(u0, u1);
```
This could be made cheaper with a direct `int -> byte` helper.
For x86/x64 it should do roughly the same as it does right now (`float->int->short->byte`), but for the `V128.IsHardwareAccelerated` fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like it's doing currently.
Notably for AVX512 there are even some other optimizations you could do, since instructions exist to go from `V512<uint> -> V128<byte>`. You can likewise fix up the `V512<float>` as part of the conversion to uint using `ConvertToVector512UInt32(Max(float.AsInt32(), Zero).AsSingle())` (since out-of-range values become `uint.MaxValue`, you just have to care about negatives becoming 0).
There are also some Arm64/WASM-specific behaviors you can take advantage of, because `float->integer` already saturates there: instead of doing `float->int->short->byte`, you can do `float->uint`, clamp to `byte.MaxValue`, and then do the narrowing. On Arm64 you might be able to do a VectorTableLookup with 2 inputs instead of 4 narrowing instructions, or do some zip instructions instead, which might be faster.
-- The other optimizations aren't as important, but for Android, since `AdvSimd.IsSupported` reports `false`, the suggestion on improving the `V128.IsHardwareAccelerated` path will likely benefit there, since it cuts out almost half the work.
You've lost me a bit here with the `Vector128` path.
That was this part:

> This could be made cheaper with a direct int -> byte helper.
> For x86/x64 it should do roughly the same as it is right now (float->int->short->byte), but for the V128.IsHardwareAccelerated fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like it's doing currently.

In particular, right now you're doing:
- Clamp the int32 to int16
- Narrow the int32 to int16 now that it's in range
- Clamp the int16 to uint8
- Narrow the int16 to uint8 now that it's in range

You could instead just do:
- Clamp the int32 to uint8
- Narrow the int32 to uint16 now that it's in range
- Narrow the uint16 to uint8, it's already in range
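A minimal sketch of such a direct `int -> byte` helper for the V128 fallback. The helper name is hypothetical; it follows the three steps above using only xplat APIs:

```csharp
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // Hypothetical helper: clamp four int32 vectors to [0, 255]
    // and narrow them into a single Vector128<byte>.
    public static Vector128<byte> NarrowInt32ToByte(
        Vector128<int> a, Vector128<int> b, Vector128<int> c, Vector128<int> d)
    {
        Vector128<int> min = Vector128<int>.Zero;
        Vector128<int> max = Vector128.Create(255);

        // 1. Clamp the int32 lanes to the byte range.
        a = Vector128.Min(Vector128.Max(a, min), max);
        b = Vector128.Min(Vector128.Max(b, min), max);
        c = Vector128.Min(Vector128.Max(c, min), max);
        d = Vector128.Min(Vector128.Max(d, min), max);

        // 2. Narrow int32 -> uint16; the values are already in range.
        Vector128<ushort> lo = Vector128.Narrow(a.AsUInt32(), b.AsUInt32());
        Vector128<ushort> hi = Vector128.Narrow(c.AsUInt32(), d.AsUInt32());

        // 3. Narrow uint16 -> uint8; still in range, so no second clamp is needed.
        return Vector128.Narrow(lo, hi);
    }
}
```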
Ah yes I see now! Thanks!
```diff
-    (Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleByte) ||
-    (Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleByte))
+if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeByte) ||
+    (Vector256.IsHardwareAccelerated && Vector256_.SupportsShuffleByte) ||
```
Noticed this middle one is `SupportsShuffleByte` while the other two are `SupportsShuffleNativeByte`; guessing that was a mistake?
```diff
-if ((Vector512.IsHardwareAccelerated && Vector512Utilities.SupportsShuffleFloat) ||
-    (Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleFloat) ||
-    (Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleFloat))
+if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeFloat) ||
```
Some of these could be supported on other platforms if you passed in the indexes rather than a `byte control`.
There are some other functions like that throughout the `SimdUtils.HwIntrinsics.cs` file as well, which could be impactful to Arm64, Wasm, and Android; but I've not looked at them in depth and I'm guessing many are larger refactorings to clean up.
Given these are marked as `AggressiveInlining` and `byte control` is expected to be a constant, the JIT "might" (just double check the codegen) be able to do the right thing for something like `Vector128.Shuffle(vector, Vector128.Create(control & 3, (control >> 2) & 3, (control >> 4) & 3, (control >> 6) & 3))`, since that should itself break down to 4 constants which the JIT can recognize.
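As a sketch, such a helper might look like the following. The helper name is illustrative; with `control` a JIT-time constant, the `Create` call should fold to constant indices:

```csharp
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // Decode an SSE-style immediate control byte (2 bits per lane)
    // into an index vector for the xplat Shuffle.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> Shuffle(Vector128<float> vector, [ConstantExpected] byte control)
        => Vector128.Shuffle(
            vector,
            Vector128.Create(control & 3, (control >> 2) & 3, (control >> 4) & 3, (control >> 6) & 3));
}
```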
I got the above (device details page, not benchmarks) to run on my iPhone 15 Pro Max in release mode. Trying to build with benchmarks enabled takes forever while it's doing AOT compilation. I'll leave it overnight to see if it actually does finish. The only property it had different over the Android build in the same config was …

Full details
@beeradmoore If you get the chance, could you please run another benchmark? I want to see if some of the additional shuffle changes have made a difference. Thanks!
New is this updated PR, old is re-running code from about 4 days ago.
I'm going to merge this AS-IS as it's blocking important work. I'm hoping the new narrowing code will speed things up a bit since the last benchmark, but I can revisit later as I look for further optimization opportunities.
Prerequisites
Description
This PR adds `Vector128` intrinsic implementations to several methods in `Block8x8F` and reimplements `ZigZag` to migrate intrinsics from `Sse` to the general `Vector128` methods, which should provide a good speedup on mobile. Performance improvements are measurable.
Benchmarks
Main
This PR
CC
@tannergooding - I think I got everything right performance-wise, though I have commented with `TODO` where there may be more low-hanging fruit.
@beeradmoore - I'm hoping this makes a real difference with the MAUI benchmarks. There were several places where we were falling back to scalar implementations for ARM and WASM.