Improve JPEG Block8x8F Intrinsics for Vector128 paths. #2918
Conversation
Pull Request Overview
This PR enhances JPEG decoding by migrating intrinsic implementations in Block8x8F from legacy SSE/AVX paths to more versatile Vector128 and Vector256 methods, which should boost performance on mobile platforms and improve consistency.
- Renamed methods and field references (e.g. TransposeInplace → TransposeInPlace) for improved naming clarity.
- Introduced separate Vector128 and Vector256 intrinsic implementations and removed legacy intrinsic and generated files.
- Updated SIMD helper classes to use new alias naming (e.g. Vector128_ instead of Vector128Utilities).
Reviewed Changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs | Renamed transpose method and updated intrinsic vector field references. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs | Updated SIMD intrinsic checks and operations; added new normalization and load methods; removed legacy warnings. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector256.cs | Added a new Vector256-based implementation of Block8x8F operations. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector128.cs | Added a new Vector128-based implementation of Block8x8F operations. |
| src/ImageSharp/Common/Helpers/*Utilities.cs | Updated utility classes to use the new alias naming conventions (e.g. `Vector128_`). |
| src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | Adjusted references to intrinsics helpers in accordance with alias renaming. |
Files not reviewed (1)
- src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Generated.tt: Language not supported
Comments suppressed due to low confidence (1)
src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs:23
- The method 'TransposeInPlace' has been renamed from 'TransposeInplace' for consistency. Confirm that all call sites and inline comments are updated to reflect this naming convention.
block.TransposeInPlace();
@JimBobSquarePants, when I build the NuGet package I get a build warning (screenshot not captured). I assume that's what the above says, and we want to be using the .NET 9 SDK. Should I be forcing it to use .NET 9 with a global.json or some other setting?
ImageSharp actually only targets a single LTS version. We fudge the target for CI and tests so we can track potential JIT issues when building against previews.
Ah, gotcha. All good. I have my head in the MAUI world a lot; they follow the latest release instead of LTS, so I'm not used to seeing .NET 8 pop up 😅
Doesn't seem like the numbers moved too much. Some are higher, some lower (or within the margin of error). Keep in mind these tests are not using BenchmarkDotNet yet, so there is no warmup or any of the other things it does: I just loop 10 times and jot down the average. Debug (3.1.8)
Debug (Modernize JPEG Color Converters)
Debug (this updated PR)
Release (3.1.8)
Release (Modernize JPEG Color Converters)
Release (this updated PR)
I think we need to find a way to properly benchmark, because I cannot see how the numbers could be worse in this PR than in the last one. Edit: it appears we could just run BenchmarkDotNet… https://benchmarkdotnet.org/articles/samples/IntroXamarin.html
The new project I started is using that with MAUI. I'm having some issues getting it working with the Mac Catalyst (macOS desktop) variant. But if the current issues are on Android I can just make a net9.0-android repo with an Android-only app to put the tests in and focus on that while I deal with full MAUI later.
Yeah, as I recall iOS was very good, let’s focus on the numbers for Android.
How is the project being compiled for Android? Is it using Mono LLVM or standard Mono AOT?
```csharp
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
    => Vector128.Min(Vector128.Max(value, min), max);
```
In .NET 9+ this can just use `Vector128.Clamp`. Alternatively it can use `Vector128.ClampNative` if you don't need to care about `-0` vs `+0` or `NaN` handling for `float`/`double`.
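As a sketch of that suggestion (assuming a net9.0 target, where both APIs exist; the helper class name mirrors the PR's `Vector128_` alias):

```csharp
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // .NET 9+: defer to the framework's Clamp, equivalent to Min(Max(value, min), max).
    public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
        => Vector128.Clamp(value, min, max);

    // When -0/+0 ordering and NaN propagation for float/double don't matter,
    // ClampNative maps to the platform's native min/max instructions.
    public static Vector128<T> ClampNative<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
        => Vector128.ClampNative(value, min, max);
}
```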
```csharp
if (Avx.IsSupported)
{
    Vector256<float> lower = Avx.RoundToNearestInteger(vector.GetLower());
    Vector256<float> upper = Avx.RoundToNearestInteger(vector.GetUpper());
    return Vector512.Create(lower, upper);
}
```
What's the AVX path for with Vector512? `Vector512.IsHardwareAccelerated` will only report true if `Avx512F+BW+CD+DQ+VL` is supported, so this path should generally be "dead".
```csharp
[MethodImpl(InliningOptions.ShortMethod)]
public void NormalizeColorsAndRoundInPlaceVector128(float maximum)
{
    Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));
```
This would actually be more efficient as `Vector128.Ceiling(Vector128.Create(maximum) * 0.5f)`.
While the codegen (see below) is the same size and looks nearly identical, the change to be vectorized instead of scalar avoids a very minor penalty that exists because scalar operations mutate element 0 while preserving elements 1, 2, and 3 as is.
In general it's better to convert to vector up front and do operations as vectorized where possible.
Here's what you're getting now:

```asm
; XMM
vmulss xmm0, xmm1, [reloc @RWD00]
vroundss xmm0, xmm0, xmm0, 0xa
vbroadcastss xmm0, xmm0
```
Here's what you would be getting with the suggested change:

```asm
; XMM
vbroadcastss xmm0, xmm1
vmulps xmm0, xmm0, [reloc @RWD00]
vroundps xmm0, xmm0, 2
```
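Put together, the suggestion is a one-line swap (a sketch; `off` feeds the same normalization code either way):

```csharp
// Current: scalar multiply + scalar ceiling, then broadcast the scalar result.
Vector128<float> offScalar = Vector128.Create(MathF.Ceiling(maximum * 0.5F));

// Suggested: broadcast first, then do the multiply and ceiling as vector ops.
// This also lets the JIT reuse Vector128.Create(maximum) where it is needed again.
Vector128<float> offVector = Vector128.Ceiling(Vector128.Create(maximum) * 0.5f);
```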
Notably it also allows the `Vector128.Create(maximum)` used for initializing `max` to be reused, rather than requiring a distinct instruction.
Ha! I shouldn't have missed these!
```csharp
dRef = Avx.ConvertToVector256Single(top);
Unsafe.Add(ref dRef, 1) = Avx.ConvertToVector256Single(bottom);
```
This one becomes stylistic preference, but you can freely mix the xplat APIs and the platform-specific intrinsics.
That is, while you want to use `V256 Avx2.ConvertToVector256Int32(V128)` instead of `V256.WidenLower/WidenUpper` for efficiency, you can still just use `V256.ConvertToSingle()` instead of `Avx.ConvertToVector256Single` since it is a 1-to-1 mapping.
-- As a note to myself, it would likely be beneficial to have `V256.Widen(V128)` APIs or similar; or to pattern match `V256.WidenLower(V256)` followed by `V256.WidenUpper(V256)`, so devs don't need to use platform-specific intrinsics in such cases.
I didn't bother porting the existing Avx code as it worked but I might still do it.
```csharp
Vector256<int> row0 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 0), Unsafe.Add(ref bBase, i + 0)));
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));
```
nit: You can use `x * y` instead of `Avx.Multiply(x, y)`.
You can also use `Vector256.ConvertToInt32` instead of `Avx.ConvertToVector256Int32`.
I'm actually seeing a difference in output if I switch from `Avx.ConvertToVector256Int32` to `Vector256.ConvertToInt32`. Do they use the same rounding? `Avx.ConvertToVector256Int32` uses the equivalent of `MidpointRounding.ToEven`, but the `Vector256` equivalent is undocumented.
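For reference, the instruction behind `Avx.ConvertToVector256Int32` (CVTPS2DQ, under the default MXCSR rounding mode) rounds to nearest with ties to even, which matches `Math.Round` with `MidpointRounding.ToEven`. A quick scalar illustration of that tie-breaking:

```csharp
using System;

class RoundingDemo
{
    static void Main()
    {
        // Ties go to the even neighbour, not away from zero.
        Console.WriteLine(Math.Round(0.5, MidpointRounding.ToEven));  // 0
        Console.WriteLine(Math.Round(1.5, MidpointRounding.ToEven));  // 2
        Console.WriteLine(Math.Round(2.5, MidpointRounding.ToEven));  // 2
        Console.WriteLine(Math.Round(-2.5, MidpointRounding.ToEven)); // -2
    }
}
```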
```csharp
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));

Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
row = Avx2.PermuteVar8x32(row.AsInt32(), multiplyIntoInt16ShuffleMask).AsInt16();
```
nit: This should do the right thing in .NET 8 if you have `Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7))`.
In general, declaring the indices directly in the call like this will do the right thing provided all indices are constant. We improved the handling in .NET 9, and even more so in .NET 10, to handle more patterns so that devs who manually hoist the indices will still get good codegen if the JIT can detect them as constant during compilation (so in .NET 10 you can keep the code as you have it right now, rather than declaring `V256.Create(...)` directly inside the `Vector256.Shuffle` call as is needed for .NET 8).
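The two forms described above, sketched side by side (variable names follow the surrounding diff):

```csharp
// .NET 8: the indices must be declared directly in the call
// for the JIT to see them as constant and emit a single shuffle.
row = Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7)).AsInt16();

// .NET 10: a hoisted index vector is also recognized,
// provided the JIT can prove it constant at compile time.
Vector256<int> indices = Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7);
row = Vector256.Shuffle(row.AsInt32(), indices).AsInt16();
```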
```csharp
Vector256<float> r0 = Avx.InsertVector128(
    this.V256_0,
    Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
    1);
```
nit: You can use `this.V256_0.WithUpper(Unsafe.As<Vector4, Vector128<float>>(ref this.V4L))`.
It can be the following, can it not?

```csharp
Vector256<float> r0 = this.V256_0.WithUpper(this.V4L.AsVector128());
```
```diff
@@ -421,16 +488,17 @@ public void LoadFromInt16ExtendedAvx2(ref Block8x8 source)
 /// <param name="value">Value to compare to.</param>
 public bool EqualsToScalar(int value)
 {
     // TODO: Can we provide a Vector128 implementation for this?
```
What's blocking a V128 path from being added? At a glance it looks like it should be almost a copy/paste of the V256 path...
```csharp
Vector256<int> areEqual = Avx2.CompareEqual(Avx.ConvertToVector256Int32WithTruncation(Unsafe.Add(ref this.V256_0, i)), targetVector);
if (Avx2.MoveMask(areEqual.AsByte()) != equalityMask)
```
This could be simplified to `if (!V256.EqualsAll(V256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)), targetVector))`.
That avoids a dependency on `MoveMask` and maps better to V128/V512.
-- Notably on .NET 9+ you may want to use `V256.ConvertToInt32Native` instead, since `ConvertToInt32` will saturate for out-of-bounds values, rather than saturating on some platforms and returning a "sentinel" value on x86/x64.
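Spelled out against the loop body above (a sketch; `equalityMask` is no longer needed at all):

```csharp
// xplat replacement for the CompareEqual + MoveMask pair:
// EqualsAll returns true only when every lane compares equal.
if (!Vector256.EqualsAll(
        Vector256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)),
        targetVector))
{
    return false;
}
```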
```csharp
Vector256<float> tmp0 = Avx.Add(block.V256_0, block.V256_7);
Vector256<float> tmp7 = Avx.Subtract(block.V256_0, block.V256_7);
Vector256<float> tmp1 = Avx.Add(block.V256_1, block.V256_6);
Vector256<float> tmp6 = Avx.Subtract(block.V256_1, block.V256_6);
Vector256<float> tmp2 = Avx.Add(block.V256_2, block.V256_5);
Vector256<float> tmp5 = Avx.Subtract(block.V256_2, block.V256_5);
Vector256<float> tmp3 = Avx.Add(block.V256_3, block.V256_4);
Vector256<float> tmp4 = Avx.Subtract(block.V256_3, block.V256_4);
```
nit: These could use `x + y` and `x - y`. Similar for other arithmetic operations in the method.
```csharp
if (Vector128.IsHardwareAccelerated)
{
    Vector128<int> targetVector = Vector128.Create(value);
    ref Vector4 blockStride = ref this.V0L;
```
Is `blockStride` intentionally unused?
Badly named (copy/paste), but yeah, I'm pointing to the `Vector4` field at offset 0. I don't have explicit Vector128 fields, but am considering adding them to avoid some of the To/From `Vector128` code.
@beeradmoore Those numbers are wild and yes, Skia is cheating there. Here's my desktop decoding the same image. I benchmarked against System.Drawing because the JPEG decoder there is incredibly fast (I don't know what the underlying implementation is, but it's blazing). Appreciating the fact that the CPU on the Android device (could you post the details btw?) is less powerful than my laptop, I'm surprised that the …

@tannergooding I'm suspicious of the scalar timing on desktop and those Android numbers lining up so closely. Could just be coincidence though...
My test Android device is a Pixel 2 XL. 8 years old and still chugging along. From the output of one of the previous runs (test_pr.txt) I also see … I checked to see if I could use …
I think the easiest way to confirm this is to add the following, which should tell the Mono AOT compiler to skip intrinsic usage...

```xml
<ItemGroup>
  <MonoAOTCompilerDefaultProcessArguments Include="-O=-intrins" />
</ItemGroup>
```
With that added:

Before,

After,
If that setting is correct, then intrinsics are not being used at all in any scenario. Are you able to stick a … Is this relevant?
I'll make a details page and make it output some general device info. Any other properties you'd care about?
I’d say any chipset info, runtime info, and intrinsics tests. E.g. `AdvSimd.IsSupported`.
I added everything I could find. Very likely too much. The info is from MAUI's DeviceInfo. I couldn't get an OpenGL surface to initialise, so I couldn't fetch GPU-specific information. Here is a giant dump of my Android device in release mode.
(Side note, I am not sure why Android.OS.Build.Type is ….) I did a build with this, and then with the above, and the only change was …

But that may be my misunderstanding of what that property means.
I did another test of a debug build. Keeping in mind the csproj is …

The only property that changed (aside from …) …

Swapping back to …
So it should be getting generally accelerated in most places, except for where …
That’s the odd thing. There are actually very few places left where I use AdvSimd directly (png, web encoders). Almost everything I do have uses xplat as a fallback also.
```diff
 {
     [MethodImpl(MethodImplOptions.AggressiveInlining)]
-    get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported;
+    get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported || PackedSimd.IsSupported;
```
For Arm64 and WASM you should just be able to use (for `byte`) `Vector128.Shuffle`, due to how `VectorTableLookup` and `Swizzle` work.
You really just want at least `Ssse3` for x86/x64, since they cannot be done otherwise.
So perhaps you want:
```csharp
get
{
    if (Vector128.IsHardwareAccelerated)
    {
        if (RuntimeInformation.ProcessArchitecture is Architecture.X86 or Architecture.X64)
        {
            return Ssse3.IsSupported;
        }

        // You could optionally do:
        //     return ProcessArchitecture is Architecture.Arm64 or Architecture.Wasm;
        // if you wanted to restrict it to platforms you know should be safe.
        return true;
    }

    return false;
}
```
Good catch. I’ll review my other helpers.
```diff
 Vector128<short> u0 = Vector128_.PackSignedSaturate(w0, w1);
 Vector128<short> u1 = Vector128_.PackSignedSaturate(w2, w3);

-Unsafe.Add(ref destinationBase, i) = Vector128Utilities.PackUnsignedSaturate(u0, u1);
+Unsafe.Add(ref destinationBase, i) = Vector128_.PackUnsignedSaturate(u0, u1);
```
This could be made cheaper with a direct `int -> byte` helper.
For x86/x64 it should do roughly the same as it does right now (`float->int->short->byte`), but for the `V128.IsHardwareAccelerated` fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like it's doing currently.
Notably for AVX512 there are even some other optimizations you could do, since instructions exist to go from `V512<uint> -> V128<byte>`. You can likewise fix up the `V512<float>` as part of the conversion to uint using `ConvertToVector512UInt32(Max(float.AsInt32(), Zero).AsSingle())` (since out-of-range values become `uint.MaxValue`, you just have to care about negatives becoming 0).
There are also some Arm64/WASM-specific behaviors you can take advantage of, because `float->integer` already saturates there: instead of doing `float->int->short->byte`, you can do `float->uint`, clamp to `byte.MaxValue`, and then do the narrowing. On Arm64 you might be able to do a VectorTableLookup with 2 inputs instead of 4 narrowing instructions, or do some zip instructions instead, which might be faster.
-- The other optimizations aren't as important, but for Android, since `AdvSimd.IsSupported` reports `false`, the suggestion on improving the `V128.IsHardwareAccelerated` path will likely benefit there, since it cuts out almost half the work.
You've lost me a bit here with the `Vector128` path.
That was this part:

> This could be made cheaper with a direct int -> byte helper.
> For x86/x64 it should do roughly the same as it is right now (float->int->short->byte), but for the V128.IsHardwareAccelerated fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like it's doing currently.

In particular, right now you're doing:
- Clamp the int32 to int16
- Narrow the int32 to int16 now that it's in range
- Clamp the int16 to uint8
- Narrow the int16 to uint8 now that it's in range

You could instead just do:
- Clamp the int32 to uint8
- Narrow the int32 to uint16 now that it's in range
- Narrow the uint16 to uint8, it's already in range
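A minimal sketch of such a direct `int -> byte` helper for the V128 fallback. The helper name is hypothetical; it follows the three steps above using only xplat APIs:

```csharp
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // Hypothetical helper: clamp four int32 vectors to [0, 255]
    // and narrow them into a single Vector128<byte>.
    public static Vector128<byte> NarrowInt32ToByte(
        Vector128<int> a, Vector128<int> b, Vector128<int> c, Vector128<int> d)
    {
        Vector128<int> min = Vector128<int>.Zero;
        Vector128<int> max = Vector128.Create(255);

        // 1. Clamp the int32 lanes to the byte range.
        a = Vector128.Min(Vector128.Max(a, min), max);
        b = Vector128.Min(Vector128.Max(b, min), max);
        c = Vector128.Min(Vector128.Max(c, min), max);
        d = Vector128.Min(Vector128.Max(d, min), max);

        // 2. Narrow int32 -> uint16; the values are already in range.
        Vector128<ushort> lo = Vector128.Narrow(a.AsUInt32(), b.AsUInt32());
        Vector128<ushort> hi = Vector128.Narrow(c.AsUInt32(), d.AsUInt32());

        // 3. Narrow uint16 -> uint8; still in range, so no second clamp is needed.
        return Vector128.Narrow(lo, hi);
    }
}
```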
Ah yes I see now! Thanks!
```diff
-    (Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleByte) ||
-    (Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleByte))
+if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeByte) ||
+    (Vector256.IsHardwareAccelerated && Vector256_.SupportsShuffleByte) ||
```
Noticed this middle one is `SupportsShuffleByte` while the other two are `SupportsShuffleNativeByte`; guessing that was a mistake?
```diff
-if ((Vector512.IsHardwareAccelerated && Vector512Utilities.SupportsShuffleFloat) ||
-    (Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleFloat) ||
-    (Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleFloat))
+if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeFloat) ||
```
Some of these could be supported on other platforms if you passed in the indexes rather than a `byte control`.
There are some other functions like that throughout the `SimdUtils.HwIntrinsics.cs` file as well, which could be impactful to Arm64, Wasm, and Android; but I've not looked at them in depth and I'm guessing many are larger refactorings to clean up.
Given these are marked as `AggressiveInlining` and `byte control` is expected to be a constant, the JIT "might" (just double check the codegen) be able to do the right thing for something like `Vector128.Shuffle(vector, Vector128.Create(control & 3, (control >> 2) & 3, (control >> 4) & 3, (control >> 6) & 3))`, since that should itself break down to 4 constants which the JIT can recognize.
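As a sketch, such a helper might look like the following. The helper name is illustrative; with `control` a JIT-time constant, the `Create` call should fold to constant indices:

```csharp
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class Vector128_
{
    // Decode an SSE-style immediate control byte (2 bits per lane)
    // into an index vector for the xplat Shuffle.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> Shuffle(Vector128<float> vector, [ConstantExpected] byte control)
        => Vector128.Shuffle(
            vector,
            Vector128.Create(control & 3, (control >> 2) & 3, (control >> 4) & 3, (control >> 6) & 3));
}
```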
I got the above (device details page, not benchmarks) to run on my iPhone 15 Pro Max in release mode. Trying to build with benchmarks enabled takes forever while it's doing AOT compilation. I'll leave it overnight to see if it actually does finish. The only property it had different over the Android build in the same config was …

Full details
@beeradmoore If you get the chance, could you please run another benchmark? I want to see if some of the additional shuffle changes have made a difference. Thanks!
New is this updated PR, old is re-running code from about 4 days ago.
I'm going to merge this AS-IS as it's blocking important work. I'm hoping the new narrowing code will speed things up a bit since the last benchmark, but I can revisit later as I look for further optimization opportunities.
Prerequisites
Description
This PR adds `Vector128` intrinsic implementations to several methods in `Block8x8F` and reimplements `ZigZag` to migrate intrinsics from `Sse` to the general `Vector128` methods, which should provide a good speedup on mobile. Performance improvements are measurable.
Benchmarks
Main
This PR
CC
@tannergooding - I think I got everything right performance-wise, though I have commented with `TODO` where there may be more low-hanging fruit.
@beeradmoore - I'm hoping this makes a real difference with the MAUI benchmarks. There were several places where we were falling back to scalar implementations for ARM and WASM.