Improve JPEG Block8x8F Intrinsics for Vector128 paths. #2918

Open
wants to merge 8 commits into
base: main

JimBobSquarePants
Member

@JimBobSquarePants JimBobSquarePants commented May 7, 2025

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds Vector128 intrinsic implementations to several methods in Block8x8F and reimplements ZigZag, migrating its intrinsics from Sse to the cross-platform Vector128 methods, which should provide a good speedup on mobile.

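Performance improvements are measurable on the SSE (Vector128) job: in the benchmarks below, 'Baseline 4:4:4 Interleaved' drops from 24.525 ms to 13.185 ms. The scalar and AVX jobs are roughly unchanged.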

Benchmarks

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.300-preview.0.25177.5
  [Host]             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  1. No HwIntrinsics : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT
  2. SSE             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT SSE4.2
  3. AVX             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Runtime=.NET 8.0

Main

| Method                              | Job                | EnvironmentVariables                            | Mean       | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------------------------ |------------------- |------------------------------------------------ |-----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| 'Baseline 4:4:4 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 107.930 ms | 1.5990 ms | 1.4957 ms |  1.00 |    0.02 |      - |  47.46 KB |        1.00 |
| 'Baseline 4:4:4 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  24.525 ms | 0.1969 ms | 0.1842 ms |  0.23 |    0.00 |      - |   47.1 KB |        0.99 |
| 'Baseline 4:4:4 Interleaved'        | 3. AVX             | Empty                                           |   8.838 ms | 0.0784 ms | 0.0733 ms |  0.08 |    0.00 |      - |  47.06 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:2:0 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  40.974 ms | 0.1971 ms | 0.1844 ms |  1.00 |    0.01 |      - |  15.22 KB |        1.00 |
| 'Baseline 4:2:0 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  11.841 ms | 0.0494 ms | 0.0438 ms |  0.29 |    0.00 |      - |  15.14 KB |        0.99 |
| 'Baseline 4:2:0 Interleaved'        | 3. AVX             | Empty                                           |   7.467 ms | 0.0922 ms | 0.0863 ms |  0.18 |    0.00 |      - |  15.13 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:0:0 (grayscale)'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |   8.920 ms | 0.0859 ms | 0.0718 ms |  1.00 |    0.01 |      - |  12.73 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 2. SSE             | DOTNET_EnableAVX=0                              |   2.713 ms | 0.0152 ms | 0.0142 ms |  0.30 |    0.00 |      - |  12.72 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 3. AVX             | Empty                                           |   1.204 ms | 0.0078 ms | 0.0065 ms |  0.13 |    0.00 | 1.9531 |  12.71 KB |        1.00 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Progressive 4:2:0 Non-Interleaved' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  74.589 ms | 0.4589 ms | 0.4068 ms |  1.00 |    0.01 |      - |  39.54 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 2. SSE             | DOTNET_EnableAVX=0                              |  20.615 ms | 0.1037 ms | 0.0919 ms |  0.28 |    0.00 |      - |  39.38 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 3. AVX             | Empty                                           |  11.544 ms | 0.0490 ms | 0.0458 ms |  0.15 |    0.00 |      - |  39.35 KB |        1.00 |

This PR

| Method                              | Job                | EnvironmentVariables                            | Mean       | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------------------------ |------------------- |------------------------------------------------ |-----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| 'Baseline 4:4:4 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 108.958 ms | 1.3012 ms | 1.2172 ms |  1.00 |    0.02 |      - |  47.46 KB |        1.00 |
| 'Baseline 4:4:4 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  13.185 ms | 0.1547 ms | 0.1447 ms |  0.12 |    0.00 |      - |  47.06 KB |        0.99 |
| 'Baseline 4:4:4 Interleaved'        | 3. AVX             | Empty                                           |   8.754 ms | 0.0501 ms | 0.0468 ms |  0.08 |    0.00 |      - |  47.06 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:2:0 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  41.072 ms | 0.2252 ms | 0.1996 ms |  1.00 |    0.01 |      - |  15.22 KB |        1.00 |
| 'Baseline 4:2:0 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |   8.928 ms | 0.0815 ms | 0.0722 ms |  0.22 |    0.00 |      - |  15.14 KB |        0.99 |
| 'Baseline 4:2:0 Interleaved'        | 3. AVX             | Empty                                           |   7.399 ms | 0.0449 ms | 0.0398 ms |  0.18 |    0.00 |      - |  15.13 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:0:0 (grayscale)'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |   8.967 ms | 0.0404 ms | 0.0358 ms |  1.00 |    0.01 |      - |  12.73 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 2. SSE             | DOTNET_EnableAVX=0                              |   1.723 ms | 0.0079 ms | 0.0070 ms |  0.19 |    0.00 | 1.9531 |  12.71 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 3. AVX             | Empty                                           |   1.215 ms | 0.0051 ms | 0.0048 ms |  0.14 |    0.00 | 1.9531 |  12.73 KB |        1.00 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Progressive 4:2:0 Non-Interleaved' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  74.980 ms | 0.3103 ms | 0.2751 ms |  1.00 |    0.01 |      - |  39.58 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 2. SSE             | DOTNET_EnableAVX=0                              |  14.541 ms | 0.1421 ms | 0.1329 ms |  0.19 |    0.00 |      - |  39.35 KB |        0.99 |
| 'Progressive 4:2:0 Non-Interleaved' | 3. AVX             | Empty                                           |  12.342 ms | 0.2376 ms | 0.2440 ms |  0.16 |    0.00 |      - |  39.35 KB |        0.99 |

CC
@tannergooding - I think I got everything right performance-wise, though I have commented with TODO where there may be more low-hanging fruit.
@beeradmoore - I'm hoping this makes a real difference with the MAUI benchmarks. There were several places where we were falling back to scalar implementations for ARM and WASM.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances JPEG decoding by migrating intrinsic implementations in Block8x8F from legacy SSE/AVX paths to more versatile Vector128 and Vector256 methods, which should boost performance on mobile platforms and improve consistency.

  • Renamed methods and field references (e.g. TransposeInplace → TransposeInPlace) for improved naming clarity.
  • Introduced separate Vector128 and Vector256 intrinsic implementations and removed legacy intrinsic and generated files.
  • Updated SIMD helper classes to use new alias naming (e.g. Vector128_ instead of Vector128Utilities).

Reviewed Changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|--- |--- |
| src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs | Renamed transpose method and updated intrinsic vector field references. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs | Updated SIMD intrinsic checks and operations; added new normalization and load methods; removed legacy warnings. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector256.cs | Added a new Vector256-based implementation of Block8x8F operations. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector128.cs | Added a new Vector128-based implementation of Block8x8F operations. |
| src/ImageSharp/Common/Helpers/*Utilities.cs | Updated utility classes to use the new alias naming conventions (e.g. Vector128_). |
| src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | Adjusted references to intrinsics helpers in accordance with alias renaming. |
Files not reviewed (1)
  • src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Generated.tt: Language not supported
Comments suppressed due to low confidence (1)

src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs:23

  • The method 'TransposeInPlace' has been renamed from 'TransposeInplace' for consistency. Confirm that all call sites and inline comments are updated to reflect this naming convention.
block.TransposeInPlace();

@beeradmoore

@JimBobSquarePants, when I build the NuGet package (dotnet pack -c Release) to test with, it looks like it compiles with the .NET 8 SDK.

Screenshot 2025-05-07 at 5 57 53 pm

I assume that's what the above says, and we want to be using .NET 9 SDK. Should I be forcing it to use .NET 9 with a global.json or some other setting?

@JimBobSquarePants
Member Author

ImageSharp actually only targets a single LTS version. We fudge the target for CI and tests so we can track potential JIT issues when building against previews.

@beeradmoore

Ah, gotya. All good.

I have my head in MAUI world a lot, they follow latest release instead of LTS so I am not used to seeing .NET 8 pop up 😅

@beeradmoore

Doesn't seem like numbers moved too much. Some higher, some lower (or within margin of error).

Keep in mind these tests are not using BenchmarkDotNet yet so it isn't doing the warmup and other things it does. Just loops over 10 times and I am jotting down the average.
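For context, a manual timing loop of the kind described above might look roughly like the sketch below. `MeasureAverageMs` and the placeholder workload are hypothetical names for illustration, not code from the test app; unlike BenchmarkDotNet there is no warmup, so the first iteration includes JIT and cache effects.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs `action` `iterations` times and returns the mean wall-clock time in ms.
// No warmup and no outlier removal, so results are noisier than BenchmarkDotNet's.
static double MeasureAverageMs(Action action, int iterations = 10)
{
    double totalMs = 0;
    for (int i = 0; i < iterations; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        totalMs += sw.Elapsed.TotalMilliseconds;
    }

    return totalMs / iterations;
}

// Placeholder workload (assumption); a real test would decode or resize an image here.
double average = MeasureAverageMs(() => Math.Sqrt(12345.678));
Console.WriteLine($"Average: {average:F3} ms");
```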

Debug (3.1.8)

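| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          |  1084.1 |    1312.4 |    37.1 |      44.4 |
| Android Emulator |   189.5 |     245.1 |    13.8 |      14.2 |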

Debug (Modernize JPEG Color Converters)

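| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          | 1344.77 |    1586.2 |    37.5 |      48.1 |
| Android Emulator |   233.3 |     285.8 |    13.8 |      15.6 |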

Debug (this updated PR)

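| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          |  1366.6 |    1605.1 |    37.1 |      48.4 |
| Android Emulator |   245.5 |     287.8 |    15.5 |      15.0 |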

Release (3.1.8)

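| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          |   285.5 |     392.9 |    19.3 |      26.1 |
| Android Emulator |    83.7 |      96.2 |     9.4 |       9.9 |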

Release (Modernize JPEG Color Converters)

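| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          |   341.0 |     469.4 |    20.1 |      25.8 |
| Android Emulator |    99.3 |     121.1 |     9.2 |      10.4 |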

Release (this updated PR)

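| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|----------------- |--------:|----------:|--------:|----------:|
| Android          |   360.3 |     482.0 |    19.5 |      27.5 |
| Android Emulator |   106.1 |     119.9 |    10.5 |      12.6 |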

@JimBobSquarePants
Member Author

JimBobSquarePants commented May 7, 2025

I think we need to find a way to properly benchmark because I cannot see how the numbers could be worse in this PR than the last one.

Edit.

It appears we could just run BenchmarkDotNet…

https://benchmarkdotnet.org/articles/samples/IntroXamarin.html

@beeradmoore

The new project I started is using that with MAUI. I'm having some issues getting it working with the Mac Catalyst (macOS desktop) variant.

But if the current issues are on Android, I can just do a net9.0-android repo with an Android-only app to hold the tests and focus on that while I deal with full MAUI later.

@JimBobSquarePants
Member Author

Yeah, as I recall iOS was very good, let’s focus on the numbers for Android

@tannergooding
Contributor

How is the project being compiled for Android, is it using Mono LLVM or standard Mono AOT?

Comment on lines +328 to +329
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
=> Vector128.Min(Vector128.Max(value, min), max);
Contributor

In .NET 9+ this can just use Vector128.Clamp. Alternatively, it can use Vector128.ClampNative if you don't need to care about -0 vs +0 or NaN handling for float/double.
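As a minimal sketch of the helper under discussion, the following clamps each lane with the Min/Max composition that works on .NET 8. The input values are illustrative assumptions; on .NET 9+ the body could, per the comment above, be swapped for Vector128.Clamp.

```csharp
using System;
using System.Runtime.Intrinsics;

// Same shape as the helper under review: clamp every lane to the range [min, max].
static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
    => Vector128.Min(Vector128.Max(value, min), max);

// Illustrative input (assumption): two in-range lanes, one below 0 and one above 255.
Vector128<float> input = Vector128.Create(-1F, 0.5F, 2F, 300F);
Vector128<float> clamped = Clamp(input, Vector128<float>.Zero, Vector128.Create(255F));

// Lanes are now 0, 0.5, 2, 255.
Console.WriteLine(clamped);
```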

Comment on lines +131 to +136
if (Avx.IsSupported)
{
Vector256<float> lower = Avx.RoundToNearestInteger(vector.GetLower());
Vector256<float> upper = Avx.RoundToNearestInteger(vector.GetUpper());
return Vector512.Create(lower, upper);
}
Contributor

What's the AVX path for with Vector512?

Vector512.IsHardwareAccelerated will only report true if Avx512F+BW+CD+DQ+VL is supported, so this path should generally be "dead".

[MethodImpl(InliningOptions.ShortMethod)]
public void NormalizeColorsAndRoundInPlaceVector128(float maximum)
{
Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));
Contributor

This would actually be more efficient as Vector128.Ceiling(Vector128.Create(maximum) * 0.5f)

While the codegen (see below) is the same size and looks nearly identical, the change to be vectorized instead of scalar avoids a very minor penalty that exists because scalar operations mutate element 0 and preserve elements 1, 2, and 3 as is.

In general it's better to convert to vector up front and do operations as vectorized where possible.

Here's what you're getting now

; XMM
vmulss xmm0, xmm1, [reloc @RWD00]
vroundss xmm0, xmm0, xmm0, 0xa
vbroadcastss xmm0, xmm0

Here's what you would be getting with the suggested change

; XMM
vbroadcastss xmm0, xmm1
vmulps xmm0, xmm0, [reloc @RWD00]
vroundps xmm0, xmm0, 2

Contributor

Notably it also allows the Vector128.Create(maximum) used for initializing max to be reused, rather than a distinct instruction.
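A small sketch of the two forms side by side, assuming a `maximum` of 255 purely for illustration. Both produce the same offset vector; the vector-first form also leaves a broadcast `max` that can be reused.

```csharp
using System;
using System.Runtime.Intrinsics;

float maximum = 255F; // illustrative value (assumption)

// Scalar-first: multiply and round one float, then broadcast the result.
Vector128<float> offScalar = Vector128.Create(MathF.Ceiling(maximum * 0.5F));

// Vector-first: broadcast once, then multiply and round all lanes.
Vector128<float> max = Vector128.Create(maximum);
Vector128<float> offVector = Vector128.Ceiling(max * 0.5F);

// Both forms hold 128 in every lane; `max` is still available for the later clamp.
Console.WriteLine(offScalar == offVector);
```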

Member Author

Ha! I shouldn't have missed these!

Comment on lines 79 to 80
dRef = Avx.ConvertToVector256Single(top);
Unsafe.Add(ref dRef, 1) = Avx.ConvertToVector256Single(bottom);
Contributor

This one becomes stylistic preference, but you can freely mix the xplat APIs and the platform specific intrinsics.

That is, while you're wanting to use V256 Avx2.ConvertToVector256Int32(V128) instead of V256.WidenLower/WidenUpper for efficiency, you can just use V256.ConvertToSingle() still instead of Avx.ConvertToVector256Single since it is a 1-to-1 mapping.

-- As a note to myself, it would likely be beneficial to have V256.Widen(V128) APIs or similar; or to pattern match V256.WidenLower(V256) followed by V256.WidenUpper(V256); so devs don't need to use platform specific intrinsics in such cases
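The mixing described above could be sketched as follows. `ShortsToSingles` is a hypothetical helper name, and the non-AVX2 fallback is an assumption added so the sample runs anywhere; only the final conversion is the point.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: widen 8 shorts to 8 ints, then convert to 8 floats.
static Vector256<float> ShortsToSingles(Vector128<short> top)
{
    // Platform-specific widening where available, cross-platform widening otherwise.
    Vector256<int> widened = Avx2.IsSupported
        ? Avx2.ConvertToVector256Int32(top)
        : Vector256.Create(Vector128.WidenLower(top), Vector128.WidenUpper(top));

    // Cross-platform API in place of Avx.ConvertToVector256Single.
    return Vector256.ConvertToSingle(widened);
}

Vector128<short> source = Vector128.Create(new short[] { -3, 0, 1, 2, 3, 4, 5, 32767 });
Vector256<float> singles = ShortsToSingles(source);
Console.WriteLine(singles);
```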

Member Author

I didn't bother porting the existing Avx code as it worked but I might still do it.

Comment on lines 114 to 115
Vector256<int> row0 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 0), Unsafe.Add(ref bBase, i + 0)));
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));
Contributor

nit: You can use x * y instead of Avx.Multiply(x, y)

Contributor

You can also use Vector256.ConvertToInt32 instead of Avx.ConvertToVector256Int32

Member Author

I'm actually seeing a difference in output if I switch from Avx.ConvertToVector256Int32 to Vector256.ConvertToInt32. Do they use the same rounding?

Avx.ConvertToVector256Int32 uses the equivalent of MidpointRounding.ToEven but the Vector256 equivalent is undocumented.
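One way to probe the question is to feed both conversions some midpoint values. The sketch below assumes, and this is worth confirming on the target runtime, that Vector256.ConvertToInt32 truncates toward zero like a C# `(int)` cast, while Avx.ConvertToVector256Int32 rounds to nearest even under the default rounding mode. The input values are illustrative.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

Vector256<float> v = Vector256.Create(0.5F, 1.5F, 2.5F, -0.5F, -1.5F, 2.7F, -2.7F, 3.2F);

// Cross-platform conversion: truncates toward zero, like a C# (int) cast.
Vector256<int> truncated = Vector256.ConvertToInt32(v);
Console.WriteLine(truncated);

if (Avx.IsSupported)
{
    // x86-specific conversion: uses the current rounding mode, round-to-nearest-even by default.
    Vector256<int> rounded = Avx.ConvertToVector256Int32(v);
    Console.WriteLine(rounded);
}
```

If the two printed vectors differ on the midpoint lanes, that would account for a change in output when swapping one call for the other.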

Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));

Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
row = Avx2.PermuteVar8x32(row.AsInt32(), multiplyIntoInt16ShuffleMask).AsInt16();
Contributor

nit: This should do the right thing in .NET 8 if you have Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7))

In general, declaring the indices directly into the call like this will do the right thing provided all indices are constant. We improved the handling in .NET 9 and even more so in .NET 10 to handle more patterns so that devs that are manually hoisting the indices will still get good codegen if the JIT can detect them as constant during compilation (so in .NET 10, you can have the code as you do right now, rather than directly declaring V256.Create(...) inside the Vector256.Shuffle call as is needed for .NET 8).
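A minimal sketch of the suggested pattern, with the indices declared directly inside the call. The input values are illustrative; the sample checks the lane order only and makes no claim about the instructions the JIT emits.

```csharp
using System;
using System.Runtime.Intrinsics;

Vector256<int> row = Vector256.Create(10, 11, 12, 13, 14, 15, 16, 17);

// Constant indices declared inline, as suggested for .NET 8.
Vector256<int> shuffled = Vector256.Shuffle(row, Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7));

// Lane order is now 10, 11, 14, 15, 12, 13, 16, 17.
Console.WriteLine(shuffled);
```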

Comment on lines 127 to 130
Vector256<float> r0 = Avx.InsertVector128(
this.V256_0,
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
1);
Contributor

nit: You can use this.V256_0.WithUpper(Unsafe.As<Vector4, Vector128<float>>(ref this.V4L))

Member Author

It can be the following, can it not?
Vector256<float> r0 = this.V256_0.WithUpper(this.V4L.AsVector128());
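A standalone sketch of that form, using local stand-ins (an assumption) for the `V256_0` and `V4L` fields:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Local stand-ins for this.V256_0 and this.V4L (hypothetical values).
Vector256<float> v256_0 = Vector256.Create(0F, 1F, 2F, 3F, 4F, 5F, 6F, 7F);
Vector4 v4L = new(10F, 11F, 12F, 13F);

// Replace the upper 128 bits without Avx.InsertVector128 or Unsafe.As.
Vector256<float> r0 = v256_0.WithUpper(v4L.AsVector128());

// Lanes are now 0, 1, 2, 3, 10, 11, 12, 13.
Console.WriteLine(r0);
```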

@@ -421,16 +488,17 @@ public void LoadFromInt16ExtendedAvx2(ref Block8x8 source)
/// <param name="value">Value to compare to.</param>
public bool EqualsToScalar(int value)
{
// TODO: Can we provide a Vector128 implementation for this?
Contributor

What's blocking a V128 path from being added? At a glance it looks like it should be almost a copy/paste of the V256 path...

Comment on lines 501 to 502
Vector256<int> areEqual = Avx2.CompareEqual(Avx.ConvertToVector256Int32WithTruncation(Unsafe.Add(ref this.V256_0, i)), targetVector);
if (Avx2.MoveMask(areEqual.AsByte()) != equalityMask)
Contributor

This could be simplified to if (!V256.EqualsAll(V256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)), targetVector))

That avoids a dependency on MoveMask and maps better to V128/V512.

-- Notably on .NET 9+ you may want to use V256.ConvertToInt32Native instead, since ConvertToInt32 will saturate for out of bounds values, rather than saturating on some platforms and returning a "sentinel" value on x86/x64.
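A rough sketch of what a Vector128 path built on EqualsAll might look like. This is not the PR's implementation: the method here takes a span of floats as a hypothetical stand-in for the Block8x8F fields, and it uses Vector128.ConvertToInt32 as available on .NET 8.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical stand-in for Block8x8F.EqualsToScalar: the block is a span of 64 floats.
static bool EqualsToScalar(ReadOnlySpan<float> block, int value)
{
    Vector128<int> target = Vector128.Create(value);
    int i = 0;

    // Compare four lanes at a time; no MoveMask dependency.
    for (; i <= block.Length - Vector128<float>.Count; i += Vector128<float>.Count)
    {
        Vector128<float> lanes = Vector128.Create<float>(block.Slice(i));
        if (!Vector128.EqualsAll(Vector128.ConvertToInt32(lanes), target))
        {
            return false;
        }
    }

    // Scalar tail for lengths that are not a multiple of four.
    for (; i < block.Length; i++)
    {
        if ((int)block[i] != value)
        {
            return false;
        }
    }

    return true;
}

float[] data = new float[64];
Array.Fill(data, 42F);
bool allEqual = EqualsToScalar(data, 42);

data[63] = 41F;
bool afterChange = EqualsToScalar(data, 42);

Console.WriteLine($"{allEqual} {afterChange}");
```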

Comment on lines 29 to 36
Vector256<float> tmp0 = Avx.Add(block.V256_0, block.V256_7);
Vector256<float> tmp7 = Avx.Subtract(block.V256_0, block.V256_7);
Vector256<float> tmp1 = Avx.Add(block.V256_1, block.V256_6);
Vector256<float> tmp6 = Avx.Subtract(block.V256_1, block.V256_6);
Vector256<float> tmp2 = Avx.Add(block.V256_2, block.V256_5);
Vector256<float> tmp5 = Avx.Subtract(block.V256_2, block.V256_5);
Vector256<float> tmp3 = Avx.Add(block.V256_3, block.V256_4);
Vector256<float> tmp4 = Avx.Subtract(block.V256_3, block.V256_4);
Contributor

nit: These could use x + y and x - y. Similar for other arithmetic operations in the method
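For illustration, the first butterfly pair written with operators. The row values are made up for the sample; `row0` and `row7` stand in for `block.V256_0` and `block.V256_7`.

```csharp
using System;
using System.Runtime.Intrinsics;

// Stand-ins for block.V256_0 and block.V256_7 (hypothetical values).
Vector256<float> row0 = Vector256.Create(1F, 2F, 3F, 4F, 5F, 6F, 7F, 8F);
Vector256<float> row7 = Vector256.Create(8F, 7F, 6F, 5F, 4F, 3F, 2F, 1F);

// Operator forms of Avx.Add and Avx.Subtract.
Vector256<float> tmp0 = row0 + row7;
Vector256<float> tmp7 = row0 - row7;

Console.WriteLine(tmp0);
Console.WriteLine(tmp7);
```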

Contributor

@tannergooding tannergooding left a comment

Changes LGTM. Left some suggestions for potential additional cleanup or minor improvements

@beeradmoore

@tannergooding, the guts of the Android test app the numbers are coming from can be found here. It uses AOT+LLVM (not profiled AOT) for release mode.

For compiling/deploying/running it I used this script.

So something like,

dotnet build \
    /t:Run \
    --framework net9.0-android \
    --configuration Release \
    -p:RuntimeIdentifier=android-arm64 \
    ImageSharpMAUITest/ImageSharpMAUITest.csproj

As for the nuget itself I put in the tests I am just using dotnet pack -c Release

@tannergooding
Contributor

Just for a test, could you try it without LLVM as well? That is, with <EnableLLVM>false</EnableLLVM>?

@tannergooding
Contributor

For reference, these are the accelerated APIs in Mono on .NET 8...

https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L2381-L2399 for APIs defined on Vector128<T>

https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L1149-L1219 for APIs defined on Vector128 (non-generic static class)

You can see that for the first group (https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L2403C1-L2588) it has paths for V128 to work with or without LLVM (it is entered from https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L5821-L5824)

The second group has something similar: https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L1369-L2379 -- It is entered from https://github.com/dotnet/runtime/blob/v8.0.0/src/mono/mono/mini/simd-intrinsics.c#L5816-L5819

Since it should work with or without LLVM, I'd at least like to determine if one of them appears to be working as intended. That can potentially help root cause the issue and what needs to be looked at next.

@beeradmoore

beeradmoore commented May 7, 2025

Release (this PR), Android device

| EnableLLVM | JpgLoad | JpgResize | PngLoad | PngResize |
|----------- |--------:|----------:|--------:|----------:|
| true       |   292.3 |     428.9 |    21.0 |      26.0 |
| false      |   685.5 |     960.3 |    33.1 |      49.2 |

EDIT: These numbers with EnableLLVM=true are the best yet (better than what I recorded previously).

The only difference I did here is I deleted the nuget cache ~/.nuget/packages/sixlabors.imagesharp/0.0.1

It's very possible all my tests (manually built from PRs) are wrong 😑
Adding removal of that cache folder to my test scripts now.

@beeradmoore

New project I am using for benchmarking is called MAUIImageBenchmarks. Uses BenchmarkDotNet. Test script deletes local nuget for ImageSharp to make sure local PR testing is accurate.

Currently it is only building for Android. The only benchmarks are load png and load jpg. I added Android native and SkiaSharp into the mix. There is also less friction getting the results: I don't have to read the logs, I get the output and can hit share and use LocalSend to send the text results directly to my Mac from my Android device.

3.1.8 (jpg only)

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|---------------------------- |-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 292.552 ms | 3.2035 ms | 2.9966 ms |  3.53 |    0.06 |   96416 B |       39.81 |
| ImageSharp_FromFile         | 289.824 ms | 4.5948 ms | 4.2980 ms |  3.50 |    0.07 |   73856 B |       30.49 |
| SkiaSharp_FromResource      |   5.706 ms | 0.0591 ms | 0.0553 ms |  0.07 |    0.00 |    1584 B |        0.65 |
| SkiaSharp_FromFile          |   1.683 ms | 0.0157 ms | 0.0147 ms |  0.02 |    0.00 |     766 B |        0.32 |
| Native_Android_FromResource |  82.838 ms | 1.2323 ms | 1.1527 ms |  1.00 |    0.02 |    2422 B |        1.00 |
| Native_Android_FromFile     |  83.926 ms | 1.3806 ms | 1.1529 ms |  1.01 |    0.02 |    1469 B |        0.61 |

This PR (jpg only)

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|---------------------------- |-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 287.057 ms | 1.7632 ms | 1.4724 ms |  3.52 |    0.08 |   97176 B |       40.81 |
| ImageSharp_FromFile         | 281.226 ms | 2.3356 ms | 2.0704 ms |  3.45 |    0.08 |   74848 B |       31.44 |
| SkiaSharp_FromResource      |   5.594 ms | 0.0369 ms | 0.0308 ms |  0.07 |    0.00 |    1584 B |        0.67 |
| SkiaSharp_FromFile          |   1.673 ms | 0.0067 ms | 0.0059 ms |  0.02 |    0.00 |     766 B |        0.32 |
| Native_Android_FromResource |  81.557 ms | 1.5679 ms | 1.8665 ms |  1.00 |    0.03 |    2381 B |        1.00 |
| Native_Android_FromFile     |  81.679 ms | 1.5681 ms | 1.7430 ms |  1.00 |    0.03 |    1397 B |        0.59 |

I set Native_Android_FromResource as the baseline.

With these tests _FromResource is because the files are an embedded resource. _FromFile I use the benchmark setup to save the file from embedded resource out to physical file on disk and load it.

You can peek the ImageSharp code here, let me know if there is more optimised way to do these tests.

Another thing to note, I am using async code where possible, as that is generally the use case I would be using them as. If you want me to test async vs not async I can also do a run of those to see if there is any difference.

The good news is it does not look like a regression from 3.1.8 to this PR which is great.
We have actual numbers: about 3.5x slower than native, which tbh says more about Android than it does about this library. (Unsure if my Android code can be optimised more).

No idea why SkiaSharp is stupidly fast. It doesn't appear to be erroring. I guess it is only loading the metadata of the file and not the data itself, considering it is also so much faster than native. If that's true, Resize+Save should show a different story.

if (Vector128.IsHardwareAccelerated)
{
Vector128<int> targetVector = Vector128.Create(value);
ref Vector4 blockStride = ref this.V0L;
Contributor

Is blockStride intentionally unused?

Member Author

Badly named (copy/paste) but yeah, I'm pointing to the Vector4 field at offset 0. I don't have explicit Vector128 fields but am considering adding them to avoid some of the To/From Vector128 code.

@JimBobSquarePants
Member Author

JimBobSquarePants commented May 8, 2025

@beeradmoore Those numbers are wild and yes, Skia is cheating there.

Here's my desktop decoding the same image. I benchmarked against System.Drawing because the JPEG decoder there is incredibly fast (I don't know what the underlying implementation is but it's blazing).

Appreciating the fact that the CPU on the Android (could you post the details btw) is less powerful than my laptop, I'm surprised that the Vector128 performance is still so bad.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.300-preview.0.25177.5
  [Host]             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  1. No HwIntrinsics : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT
  2. SSE             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT SSE4.2
  3. AVX             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Runtime=.NET 8.0

| Method                     | Job                | EnvironmentVariables                            | Mean      | Error    | StdDev   | Ratio | RatioSD | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|--------------------------- |------------------- |------------------------------------------------ |----------:|---------:|---------:|------:|--------:|---------:|---------:|---------:|----------:|------------:|
| 'Maui Test'                | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 259.10 ms | 4.091 ms | 3.626 ms |  1.00 |    0.02 | 500.0000 | 500.0000 | 500.0000 |   20668 B |        1.00 |
| 'Maui Test'                | 2. SSE             | DOTNET_EnableAVX=0                              |  47.15 ms | 0.806 ms | 0.754 ms |  0.18 |    0.00 |        - |        - |        - |   19403 B |        0.94 |
| 'Maui Test'                | 3. AVX             | Empty                                           |  37.05 ms | 0.507 ms | 0.449 ms |  0.14 |    0.00 |        - |        - |        - |   19366 B |        0.94 |
|                            |                    |                                                 |           |          |          |       |         |          |          |          |           |             |
| 'Maui Test System Drawing' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  31.10 ms | 0.237 ms | 0.210 ms |  1.00 |    0.01 |        - |        - |        - |     129 B |        1.00 |
| 'Maui Test System Drawing' | 2. SSE             | DOTNET_EnableAVX=0                              |  30.77 ms | 0.232 ms | 0.217 ms |  0.99 |    0.01 |        - |        - |        - |     116 B |        0.90 |
| 'Maui Test System Drawing' | 3. AVX             | Empty                                           |  30.65 ms | 0.189 ms | 0.177 ms |  0.99 |    0.01 |        - |        - |        - |     116 B |        0.90 |

@tannergooding I'm suspicious of the scalar timing on desktop and those Android numbers lining up so closely. Could just be coincidence though...

@beeradmoore

My test Android device is a Pixel 2 XL. 8 years old and still chugging along.

From the output of one of the previous runs (test_pr.txt) I also see // HardwareIntrinsics=ArmBase VectorSize=128. I assume that is what is expected?

I checked to see if I could use System.Drawing to compare but I think that is a Windows only API.

@tannergooding
Contributor

I think the easiest way to confirm this is to add the following, which should tell the Mono AOT compiler to skip intrinsic usage...

<ItemGroup>
  <MonoAOTCompilerDefaultProcessArguments Include="-O=-intrins" />
</ItemGroup>

@beeradmoore

With that added, HardwareIntrinsics is still ArmBase VectorSize=128. Times were all the same as well.

Before,

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|---------------------------- |-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 289.242 ms | 1.6725 ms | 1.4826 ms |  3.49 |    0.05 |   97448 B |       40.93 |
| ImageSharp_FromFile         | 284.213 ms | 4.5314 ms | 4.2387 ms |  3.42 |    0.07 |   74848 B |       31.44 |
| SkiaSharp_FromResource      |   5.612 ms | 0.0430 ms | 0.0402 ms |  0.07 |    0.00 |    1584 B |        0.67 |
| SkiaSharp_FromFile          |   1.737 ms | 0.0100 ms | 0.0094 ms |  0.02 |    0.00 |     766 B |        0.32 |
| Native_Android_FromResource |  83.006 ms | 1.3020 ms | 1.2179 ms |  1.00 |    0.02 |    2381 B |        1.00 |
| Native_Android_FromFile     |  85.044 ms | 1.6291 ms | 1.8107 ms |  1.02 |    0.03 |    1397 B |        0.59 |

After,

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|---------------------------- |-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 292.062 ms | 2.8498 ms | 2.6657 ms |  3.44 |    0.09 |  228536 B |       93.17 |
| ImageSharp_FromFile         | 285.281 ms | 3.6495 ms | 3.4137 ms |  3.36 |    0.09 |  341120 B |      139.06 |
| SkiaSharp_FromResource      |   5.635 ms | 0.0288 ms | 0.0255 ms |  0.07 |    0.00 |    1584 B |        0.65 |
| SkiaSharp_FromFile          |   1.663 ms | 0.0119 ms | 0.0111 ms |  0.02 |    0.00 |     766 B |        0.31 |
| Native_Android_FromResource |  85.023 ms | 1.6433 ms | 2.0181 ms |  1.00 |    0.03 |    2453 B |        1.00 |
| Native_Android_FromFile     |  85.023 ms | 1.0300 ms | 0.8601 ms |  1.00 |    0.03 |    1469 B |        0.60 |
