Improve JPEG Block8x8F Intrinsics for Vector128 paths. #2918
base: main
Conversation
Pull Request Overview
This PR enhances JPEG decoding by migrating intrinsic implementations in Block8x8F from legacy SSE/AVX paths to more versatile Vector128 and Vector256 methods, which should boost performance on mobile platforms and improve consistency.
- Renamed methods and field references (e.g. TransposeInplace → TransposeInPlace) for improved naming clarity.
- Introduced separate Vector128 and Vector256 intrinsic implementations and removed legacy intrinsic and generated files.
- Updated SIMD helper classes to use new alias naming (e.g. Vector128_ instead of Vector128Utilities).
Reviewed Changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs | Renamed transpose method and updated intrinsic vector field references. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs | Updated SIMD intrinsic checks and operations; added new normalization and load methods; removed legacy warnings. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector256.cs | Added a new Vector256-based implementation of Block8x8F operations. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector128.cs | Added a new Vector128-based implementation of Block8x8F operations. |
| src/ImageSharp/Common/Helpers/*Utilities.cs | Updated utility classes to use the new alias naming conventions (e.g. Vector128_). |
| src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | Adjusted references to intrinsics helpers in accordance with alias renaming. |
Files not reviewed (1)
- src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Generated.tt: Language not supported
Comments suppressed due to low confidence (1)
src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs:23
- The method 'TransposeInPlace' has been renamed from 'TransposeInplace' for consistency. Confirm that all call sites and inline comments are updated to reflect this naming convention.
`block.TransposeInPlace();`
@JimBobSquarePants, when I build the nuget I get a build warning (screenshot omitted). I assume that's what the above says, and we want to be using the .NET 9 SDK. Should I be forcing it to use .NET 9 with a global.json or some other setting?
ImageSharp actually only targets a single LTS version. We fudge the target for CI and tests so we can track potential JIT issues when building against previews.
Ah, gotcha. All good. I have my head in the MAUI world a lot; they follow the latest release instead of LTS, so I am not used to seeing .NET 8 pop up 😅
Doesn't seem like numbers moved too much. Some higher, some lower (or within margin of error). Keep in mind these tests are not using BenchmarkDotNet yet, so it isn't doing the warmup and other things BenchmarkDotNet does; it just loops 10 times and I am jotting down the average. Runs recorded (result tables omitted):

- Debug (3.1.8)
- Debug (Modernize JPEG Color Converters)
- Debug (this updated PR)
- Release (3.1.8)
- Release (Modernize JPEG Color Converters)
- Release (this updated PR)
I think we need to find a way to properly benchmark, because I cannot see how the numbers could be worse in this PR than the last one. Edit: It appears we could just run BenchmarkDotNet… https://benchmarkdotnet.org/articles/samples/IntroXamarin.html
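For reference, a minimal BenchmarkDotNet decode benchmark might look something like this (a sketch; the class name, input file name, and use of `Image.Load` are my assumptions, not the actual test app code):

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using SixLabors.ImageSharp;

// Hypothetical JPEG decode benchmark; "test.jpg" is a placeholder input file.
[MemoryDiagnoser]
public class JpegDecodeBenchmarks
{
    private byte[] data = null!;

    [GlobalSetup]
    public void Setup() => this.data = File.ReadAllBytes("test.jpg");

    [Benchmark]
    public void DecodeJpeg()
    {
        // Decode from the in-memory bytes so disk I/O stays out of the measurement.
        using Image image = Image.Load(this.data);
    }
}
```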
The new project I started is using that with MAUI. I'm having some issues getting it working with the Mac Catalyst (macOS desktop) variant, but if the current issues are Android-specific I can just create a net9.0-android repo with an Android-only app to put the tests in and focus on that while I deal with full MAUI later.
Yeah, as I recall iOS was very good, let’s focus on the numbers for Android.
How is the project being compiled for Android, is it using Mono LLVM or standard Mono AOT?
```csharp
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
    => Vector128.Min(Vector128.Max(value, min), max);
```
In .NET 9+ this can just use `Vector128.Clamp`. Alternatively it can use `Vector128.ClampNative` if you don't need to care about `-0` vs `+0` or `NaN` handling for `float`/`double`.
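A quick sketch of those .NET 9+ alternatives (the helper names are mine, mirroring the snippet above; `ClampNative` behavior for `NaN`/`±0` is left to the hardware):

```csharp
using System.Runtime.Intrinsics;

// IEEE-style clamp: well-defined NaN and -0/+0 handling for floats.
static Vector128<float> ClampStrict(Vector128<float> value, Vector128<float> min, Vector128<float> max)
    => Vector128.Clamp(value, min, max);

// "Native" clamp: defers NaN / -0 vs +0 behavior to the platform, potentially faster.
static Vector128<float> ClampFast(Vector128<float> value, Vector128<float> min, Vector128<float> max)
    => Vector128.ClampNative(value, min, max);
```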
```csharp
if (Avx.IsSupported)
{
    Vector256<float> lower = Avx.RoundToNearestInteger(vector.GetLower());
    Vector256<float> upper = Avx.RoundToNearestInteger(vector.GetUpper());
    return Vector512.Create(lower, upper);
}
```
What's the AVX path for with Vector512? `Vector512.IsHardwareAccelerated` will only report true if `Avx512F+BW+CD+DQ+VL` is supported, so this path should generally be "dead".
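If that holds, the guard could presumably collapse to the xplat helper alone; a hedged sketch (assumes .NET 9+, where `Vector512.Round` is available, with `vector` as in the snippet above):

```csharp
// Vector512.IsHardwareAccelerated implies full AVX-512 support, so the
// xplat rounding helper covers this case without a separate Avx path.
if (Vector512.IsHardwareAccelerated)
{
    return Vector512.Round(vector); // rounds to nearest, ties to even
}
```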
```csharp
[MethodImpl(InliningOptions.ShortMethod)]
public void NormalizeColorsAndRoundInPlaceVector128(float maximum)
{
    Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));
```
This would actually be more efficient as `Vector128.Ceiling(Vector128.Create(maximum) * 0.5f)`.

While the codegen (see below) is the same size and looks nearly identical, the change to be vectorized instead of scalar avoids a very minor penalty that exists because scalar operations mutate element 0 and preserve elements 1, 2, and 3 as is. In general it's better to convert to vector up front and do operations as vectorized where possible.

Here's what you're getting now:

```asm
; XMM
vmulss xmm0, xmm1, [reloc @RWD00]
vroundss xmm0, xmm0, xmm0, 0xa
vbroadcastss xmm0, xmm0
```

Here's what you would be getting with the suggested change:

```asm
; XMM
vbroadcastss xmm0, xmm1
vmulps xmm0, xmm0, [reloc @RWD00]
vroundps xmm0, xmm0, 2
```
Notably it also allows the `Vector128.Create(maximum)` used for initializing `max` to be reused, rather than requiring a distinct instruction.
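Putting both points together, the setup might look like this (a sketch; `maximum`, `max`, and `off` are names from the surrounding method):

```csharp
// Broadcast once, then stay vectorized; `max` reuses the same broadcast.
Vector128<float> max = Vector128.Create(maximum);
Vector128<float> off = Vector128.Ceiling(max * 0.5f);
```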
Ha! I shouldn't have missed these!
```csharp
dRef = Avx.ConvertToVector256Single(top);
Unsafe.Add(ref dRef, 1) = Avx.ConvertToVector256Single(bottom);
```
This one becomes stylistic preference, but you can freely mix the xplat APIs and the platform-specific intrinsics. That is, while you're wanting to use `V256 Avx2.ConvertToVector256Int32(V128)` instead of `V256.WidenLower/WidenUpper` for efficiency, you can still just use `V256.ConvertToSingle()` instead of `Avx.ConvertToVector256Single` since it is a 1-to-1 mapping.
-- As a note to myself, it would likely be beneficial to have `V256.Widen(V128)` APIs or similar, or to pattern match `V256.WidenLower(V256)` followed by `V256.WidenUpper(V256)`, so devs don't need to use platform-specific intrinsics in such cases.
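Applied to the snippet above, the mixed form might read (a sketch; `top`, `bottom`, and `dRef` as in the diff):

```csharp
// The platform-specific widen stays, but the int -> float conversion can use
// the xplat helper since it maps 1-to-1 to Avx.ConvertToVector256Single.
dRef = Vector256.ConvertToSingle(top);
Unsafe.Add(ref dRef, 1) = Vector256.ConvertToSingle(bottom);
```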
I didn't bother porting the existing Avx code as it worked, but I might still do it.
```csharp
Vector256<int> row0 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 0), Unsafe.Add(ref bBase, i + 0)));
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));
```
nit: You can use `x * y` instead of `Avx.Multiply(x, y)`.
You can also use `Vector256.ConvertToInt32` instead of `Avx.ConvertToVector256Int32`.
I'm actually seeing a difference in output if I switch from `Avx.ConvertToVector256Int32` to `Vector256.ConvertToInt32`; do they use the same rounding? `Avx.ConvertToVector256Int32` uses the equivalent of `MidpointRounding.ToEven`, but the `Vector256` equivalent is undocumented.
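For illustration, a sketch of how the two could diverge, under the assumption that the xplat helper truncates toward zero (matching a C# `(int)` cast) while the AVX instruction defaults to round-to-nearest-even:

```csharp
Vector256<float> v = Vector256.Create(2.9f);

// vcvtps2dq: uses the current rounding mode, round-to-nearest-even by default -> 3
Vector256<int> nearest = Avx.ConvertToVector256Int32(v);

// Assumed cast-like truncation toward zero -> 2
Vector256<int> truncated = Vector256.ConvertToInt32(v);
```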
```csharp
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));

Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
row = Avx2.PermuteVar8x32(row.AsInt32(), multiplyIntoInt16ShuffleMask).AsInt16();
```
nit: This should do the right thing in .NET 8 if you have `Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7))`.

In general, declaring the indices directly in the call like this will do the right thing provided all indices are constant. We improved the handling in .NET 9, and even more so in .NET 10, to handle more patterns, so devs who manually hoist the indices will still get good codegen if the JIT can detect them as constant during compilation (so in .NET 10 you can keep the code as you have it right now, rather than declaring `V256.Create(...)` directly inside the `Vector256.Shuffle` call as is needed for .NET 8).
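Applied to the snippet above, the .NET 8-friendly form would be something like (a sketch using the names from the diff):

```csharp
Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
// Inline constant indices so the JIT can emit the shuffle directly on .NET 8.
row = Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7)).AsInt16();
```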
```csharp
Vector256<float> r0 = Avx.InsertVector128(
    this.V256_0,
    Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
    1);
```
nit: You can use `this.V256_0.WithUpper(Unsafe.As<Vector4, Vector128<float>>(ref this.V4L))`.
It can be the following, can it not?

```csharp
Vector256<float> r0 = this.V256_0.WithUpper(this.V4L.AsVector128());
```
```csharp
/// <param name="value">Value to compare to.</param>
public bool EqualsToScalar(int value)
{
    // TODO: Can we provide a Vector128 implementation for this?
```
What's blocking a V128 path from being added? At a glance it looks like it should be almost a copy/paste of the V256 path...
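A hedged sketch of what such a path could look like, mirroring the Vector256 loop quoted below (field names like `V0L` come from this PR; the iteration count assumes the block's 64 floats):

```csharp
if (Vector128.IsHardwareAccelerated)
{
    Vector128<int> targetVector = Vector128.Create(value);
    ref Vector4 start = ref this.V0L;
    for (nuint i = 0; i < 16; i++) // 64 floats / 4 lanes per Vector128
    {
        // Reinterpret each Vector4 field as Vector128<float> and compare all lanes.
        Vector128<int> current = Vector128.ConvertToInt32(Unsafe.Add(ref start, i).AsVector128());
        if (!Vector128.EqualsAll(current, targetVector))
        {
            return false;
        }
    }

    return true;
}
```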
```csharp
Vector256<int> areEqual = Avx2.CompareEqual(Avx.ConvertToVector256Int32WithTruncation(Unsafe.Add(ref this.V256_0, i)), targetVector);
if (Avx2.MoveMask(areEqual.AsByte()) != equalityMask)
```
This could be simplified to `if (!V256.EqualsAll(V256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)), targetVector))`. That avoids a dependency on `MoveMask` and maps better to V128/V512.

-- Notably, on .NET 9+ you may want to use `V256.ConvertToInt32Native` instead, since `ConvertToInt32` will saturate for out-of-bounds values, rather than saturating on some platforms and returning a "sentinel" value on x86/x64.
```csharp
Vector256<float> tmp0 = Avx.Add(block.V256_0, block.V256_7);
Vector256<float> tmp7 = Avx.Subtract(block.V256_0, block.V256_7);
Vector256<float> tmp1 = Avx.Add(block.V256_1, block.V256_6);
Vector256<float> tmp6 = Avx.Subtract(block.V256_1, block.V256_6);
Vector256<float> tmp2 = Avx.Add(block.V256_2, block.V256_5);
Vector256<float> tmp5 = Avx.Subtract(block.V256_2, block.V256_5);
Vector256<float> tmp3 = Avx.Add(block.V256_3, block.V256_4);
Vector256<float> tmp4 = Avx.Subtract(block.V256_3, block.V256_4);
```
nit: These could use `x + y` and `x - y`. Similar for other arithmetic operations in the method.
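For example, the first pair in operator form (the operators map to the same instructions; names from the diff):

```csharp
Vector256<float> tmp0 = block.V256_0 + block.V256_7;
Vector256<float> tmp7 = block.V256_0 - block.V256_7;
```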
Changes LGTM. Left some suggestions for potential additional cleanup or minor improvements
@tannergooding, the Android test app the numbers are coming from can have its guts found here. It's AOT+LLVM (not using profiled AOT) for release mode. For compiling/deploying/running it I used this script, so something like …

As for the nuget itself that I put in the tests, I am just using …
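For reference, AOT+LLVM for an Android release build is typically driven by MSBuild properties along these lines (a sketch with assumed values, not the actual project file):

```xml
<!-- Standard .NET Android properties: enable AOT and route it through LLVM. -->
<PropertyGroup Condition="'$(Configuration)' == 'Release'">
  <RunAOTCompilation>true</RunAOTCompilation>
  <EnableLLVM>true</EnableLLVM>
</PropertyGroup>
```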
Just for a test, could you try it without LLVM as well? That is, with `EnableLLVM=false`.
Release (this PR), Android device (table omitted)

EDIT: These numbers with `EnableLLVM=true` are the best yet (better than what I recorded previously). The only difference here is that I deleted the nuget cache. It's very possible all my tests (manually built from PRs) are wrong 😑
The new project I am using for benchmarking is called MAUIImageBenchmarks. It uses BenchmarkDotNet, and the test script deletes the local nuget for ImageSharp to make sure local PR testing is accurate. Currently it is only building for Android, and the only benchmarks are load png and load jpg. I added Android native and SkiaSharp into the mix. There's also less friction getting the results: I don't have to read the logs, I get the output directly and can hit share and use LocalSend to send the text results to my Mac from my Android device.

3.1.8 (jpg only)

This PR (jpg only)

I set … for these tests. You can peek at the ImageSharp code here; let me know if there is a more optimised way to do these tests. Another thing to note: I am using async code where possible, as that is generally the use case I would be using them in. If you want me to test async vs not async I can also do a run of those to see if there is any difference. The good news is it does not look like a regression from 3.1.8 to this PR, which is great. No idea why SkiaSharp is stupidly fast. It doesn't appear to be erroring; I guess it is only loading the metadata of the file and not the data itself, considering it is also so much faster than native. If that's true, Resize+Save should show a different story.
```csharp
if (Vector128.IsHardwareAccelerated)
{
    Vector128<int> targetVector = Vector128.Create(value);
    ref Vector4 blockStride = ref this.V0L;
```
Is `blockStride` intentionally unused?
Badly named (copy/paste) but yeah, I'm pointing to the `Vector4` field at offset 0. I don't have explicit Vector128 fields but am considering adding them to avoid some of the To/From `Vector128` code.
@beeradmoore Those numbers are wild and yes, Skia is cheating there. Here's my desktop decoding the same image (table omitted). I benchmarked against System.Drawing because the JPEG decoder there is incredibly fast (I don't know what the underlying implementation is, but it's blazing). Appreciating the fact that the CPU on the Android device (could you post the details btw?) is less powerful than my laptop, I'm surprised that the …

@tannergooding I'm suspicious of the scalar timing on desktop and those Android numbers lining up so closely. Could just be coincidence though...
My test Android device is a Pixel 2 XL: 8 years old and still chugging along. From the output of one of the previous runs (test_pr.txt) I also see … I checked to see if I could use …
I think the easiest way to confirm this is to add the following, which should tell the Mono AOT compiler to skip intrinsic usage...

```xml
<ItemGroup>
  <MonoAOTCompilerDefaultProcessArguments Include="-O=-intrins" />
</ItemGroup>
```
With that added,

Before, …

After, …
Prerequisites
Description
This PR adds `Vector128` intrinsic implementations to several methods in `Block8x8F` and reimplements `ZigZag` to migrate intrinsics from `Sse` to general `Vector128` methods, which should provide a good speedup on mobile. Performance improvements are measurable.
Benchmarks
Main
This PR
CC @tannergooding - I think I got everything right performance-wise, though I have commented with `TODO` where there may be more low-hanging fruit.

@beeradmoore - I'm hoping this makes a real difference with the MAUI benchmarks. There were several places where we were falling back to scalar implementations for ARM and WASM.