Improve JPEG Block8x8F Intrinsics for Vector128 paths. #2918


Merged
13 commits merged into main on May 16, 2025

Conversation

@JimBobSquarePants (Member) commented May 7, 2025

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds Vector128 intrinsic implementations to several methods in Block8x8F and reimplements ZigZag, migrating intrinsics from Sse to the general cross-platform Vector128 methods, which should provide a good speedup on mobile.

Performance improvements are measurable.
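
For context, a minimal sketch of the kind of migration involved (illustrative only; the helper below is hypothetical and not the actual Block8x8F code):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class MigrationSketch
{
    // Hypothetical helper contrasting the old x86-only shape with the cross-platform shape.
    public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (Sse.IsSupported)
        {
            // Before: Sse.* intrinsics only light up on x86/x64; ARM and WASM fall back to scalar code.
            return Sse.Add(Sse.Multiply(a, b), c);
        }

        // After: cross-platform Vector128 operators, which the JIT lowers to SSE, AdvSimd,
        // or PackedSimd depending on the target, so mobile gets hardware acceleration too.
        return (a * b) + c;
    }
}
```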

Benchmarks

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.300-preview.0.25177.5
  [Host]             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  1. No HwIntrinsics : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT
  2. SSE             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT SSE4.2
  3. AVX             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Runtime=.NET 8.0

Main

| Method                              | Job                | EnvironmentVariables                            | Mean       | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------------------------ |------------------- |------------------------------------------------ |-----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| 'Baseline 4:4:4 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 107.930 ms | 1.5990 ms | 1.4957 ms |  1.00 |    0.02 |      - |  47.46 KB |        1.00 |
| 'Baseline 4:4:4 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  24.525 ms | 0.1969 ms | 0.1842 ms |  0.23 |    0.00 |      - |   47.1 KB |        0.99 |
| 'Baseline 4:4:4 Interleaved'        | 3. AVX             | Empty                                           |   8.838 ms | 0.0784 ms | 0.0733 ms |  0.08 |    0.00 |      - |  47.06 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:2:0 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  40.974 ms | 0.1971 ms | 0.1844 ms |  1.00 |    0.01 |      - |  15.22 KB |        1.00 |
| 'Baseline 4:2:0 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  11.841 ms | 0.0494 ms | 0.0438 ms |  0.29 |    0.00 |      - |  15.14 KB |        0.99 |
| 'Baseline 4:2:0 Interleaved'        | 3. AVX             | Empty                                           |   7.467 ms | 0.0922 ms | 0.0863 ms |  0.18 |    0.00 |      - |  15.13 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:0:0 (grayscale)'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |   8.920 ms | 0.0859 ms | 0.0718 ms |  1.00 |    0.01 |      - |  12.73 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 2. SSE             | DOTNET_EnableAVX=0                              |   2.713 ms | 0.0152 ms | 0.0142 ms |  0.30 |    0.00 |      - |  12.72 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 3. AVX             | Empty                                           |   1.204 ms | 0.0078 ms | 0.0065 ms |  0.13 |    0.00 | 1.9531 |  12.71 KB |        1.00 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Progressive 4:2:0 Non-Interleaved' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  74.589 ms | 0.4589 ms | 0.4068 ms |  1.00 |    0.01 |      - |  39.54 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 2. SSE             | DOTNET_EnableAVX=0                              |  20.615 ms | 0.1037 ms | 0.0919 ms |  0.28 |    0.00 |      - |  39.38 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 3. AVX             | Empty                                           |  11.544 ms | 0.0490 ms | 0.0458 ms |  0.15 |    0.00 |      - |  39.35 KB |        1.00 |

This PR

| Method                              | Job                | EnvironmentVariables                            | Mean       | Error     | StdDev    | Ratio | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------------------------------------ |------------------- |------------------------------------------------ |-----------:|----------:|----------:|------:|--------:|-------:|----------:|------------:|
| 'Baseline 4:4:4 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 108.958 ms | 1.3012 ms | 1.2172 ms |  1.00 |    0.02 |      - |  47.46 KB |        1.00 |
| 'Baseline 4:4:4 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |  13.185 ms | 0.1547 ms | 0.1447 ms |  0.12 |    0.00 |      - |  47.06 KB |        0.99 |
| 'Baseline 4:4:4 Interleaved'        | 3. AVX             | Empty                                           |   8.754 ms | 0.0501 ms | 0.0468 ms |  0.08 |    0.00 |      - |  47.06 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:2:0 Interleaved'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  41.072 ms | 0.2252 ms | 0.1996 ms |  1.00 |    0.01 |      - |  15.22 KB |        1.00 |
| 'Baseline 4:2:0 Interleaved'        | 2. SSE             | DOTNET_EnableAVX=0                              |   8.928 ms | 0.0815 ms | 0.0722 ms |  0.22 |    0.00 |      - |  15.14 KB |        0.99 |
| 'Baseline 4:2:0 Interleaved'        | 3. AVX             | Empty                                           |   7.399 ms | 0.0449 ms | 0.0398 ms |  0.18 |    0.00 |      - |  15.13 KB |        0.99 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Baseline 4:0:0 (grayscale)'        | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |   8.967 ms | 0.0404 ms | 0.0358 ms |  1.00 |    0.01 |      - |  12.73 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 2. SSE             | DOTNET_EnableAVX=0                              |   1.723 ms | 0.0079 ms | 0.0070 ms |  0.19 |    0.00 | 1.9531 |  12.71 KB |        1.00 |
| 'Baseline 4:0:0 (grayscale)'        | 3. AVX             | Empty                                           |   1.215 ms | 0.0051 ms | 0.0048 ms |  0.14 |    0.00 | 1.9531 |  12.73 KB |        1.00 |
|                                     |                    |                                                 |            |           |           |       |         |        |           |             |
| 'Progressive 4:2:0 Non-Interleaved' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  74.980 ms | 0.3103 ms | 0.2751 ms |  1.00 |    0.01 |      - |  39.58 KB |        1.00 |
| 'Progressive 4:2:0 Non-Interleaved' | 2. SSE             | DOTNET_EnableAVX=0                              |  14.541 ms | 0.1421 ms | 0.1329 ms |  0.19 |    0.00 |      - |  39.35 KB |        0.99 |
| 'Progressive 4:2:0 Non-Interleaved' | 3. AVX             | Empty                                           |  12.342 ms | 0.2376 ms | 0.2440 ms |  0.16 |    0.00 |      - |  39.35 KB |        0.99 |

CC
@tannergooding - I think I got everything right performance-wise, though I have left TODO comments where there may be more low-hanging fruit.
@beeradmoore - I'm hoping this makes a real difference with the MAUI benchmarks. There were several places where we were falling back to scalar implementations for ARM and WASM.

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR enhances JPEG decoding by migrating intrinsic implementations in Block8x8F from legacy SSE/AVX paths to more versatile Vector128 and Vector256 methods, which should boost performance on mobile platforms and improve consistency.

  • Renamed methods and field references (e.g. TransposeInplace → TransposeInPlace) for improved naming clarity.
  • Introduced separate Vector128 and Vector256 intrinsic implementations and removed legacy intrinsic and generated files.
  • Updated SIMD helper classes to use new alias naming (e.g. Vector128_ instead of Vector128Utilities).

Reviewed Changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs | Renamed transpose method and updated intrinsic vector field references. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs | Updated SIMD intrinsic checks and operations; added new normalization and load methods; removed legacy warnings. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector256.cs | Added a new Vector256-based implementation of Block8x8F operations. |
| src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Vector128.cs | Added a new Vector128-based implementation of Block8x8F operations. |
| src/ImageSharp/Common/Helpers/*Utilities.cs | Updated utility classes to use the new alias naming conventions (e.g. Vector128_). |
| src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs | Adjusted references to intrinsics helpers in accordance with alias renaming. |
Files not reviewed (1)
  • src/ImageSharp/Formats/Jpeg/Components/Block8x8F.Generated.tt: Language not supported
Comments suppressed due to low confidence (1)

src/ImageSharp/Formats/Jpeg/Components/FloatingPointDCT.Intrinsic.cs:23

  • The method 'TransposeInPlace' has been renamed from 'TransposeInplace' for consistency. Confirm that all call sites and inline comments are updated to reflect this naming convention.
block.TransposeInPlace();

@beeradmoore

@JimBobSquarePants, when I build the NuGet package (dotnet pack -c Release) to test with, it looks like it compiles with the .NET 8 SDK.

(Screenshot of the dotnet pack build output, 2025-05-07)

I assume that's what the above says, and we want to be using .NET 9 SDK. Should I be forcing it to use .NET 9 with a global.json or some other setting?

@JimBobSquarePants (Member Author)

ImageSharp actually only targets a single LTS version. We fudge the target for CI and tests so we can track potential JIT issues when building against previews.

@beeradmoore

Ah, gotcha. All good.

I have my head in the MAUI world a lot; they follow the latest release instead of LTS, so I'm not used to seeing .NET 8 pop up 😅

@beeradmoore

Doesn't seem like the numbers moved much. Some are higher, some lower (or within the margin of error).

Keep in mind these tests are not using BenchmarkDotNet yet, so there's no warmup or any of the other things it does. They just loop 10 times and I jot down the average.

Debug (3.1.8)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          |  1084.1 |    1312.4 |    37.1 |      44.4 |
| Android Emulator |   189.5 |     245.1 |    13.8 |      14.2 |

Debug (Modernize JPEG Color Converters)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          | 1344.77 |    1586.2 |    37.5 |      48.1 |
| Android Emulator |   233.3 |     285.8 |    13.8 |      15.6 |

Debug (this updated PR)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          |  1366.6 |    1605.1 |    37.1 |      48.4 |
| Android Emulator |   245.5 |     287.8 |    15.5 |      15.0 |

Release (3.1.8)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          |   285.5 |     392.9 |    19.3 |      26.1 |
| Android Emulator |    83.7 |      96.2 |     9.4 |       9.9 |

Release (Modernize JPEG Color Converters)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          |   341.0 |     469.4 |    20.1 |      25.8 |
| Android Emulator |    99.3 |     121.1 |     9.2 |      10.4 |

Release (this updated PR)

| Device           | JpgLoad | JpgResize | PngLoad | PngResize |
|------------------|--------:|----------:|--------:|----------:|
| Android          |   360.3 |     482.0 |    19.5 |      27.5 |
| Android Emulator |   106.1 |     119.9 |    10.5 |      12.6 |

@JimBobSquarePants (Member Author) commented May 7, 2025

I think we need to find a way to properly benchmark because I cannot see how the numbers could be worse in this PR than the last one.

Edit.

It appears we could just run BenchmarkDotNet…

https://benchmarkdotnet.org/articles/samples/IntroXamarin.html

@beeradmoore

The new project I started is using that with MAUI. I'm having some issues getting it working with the Mac Catalyst (macOS desktop) variant.

But if the current issues are on Android, I can just set up a net9.0-android repo with an Android-only app to put the tests in and focus on that while I deal with full MAUI later.

@JimBobSquarePants (Member Author)

Yeah, as I recall iOS was very good; let's focus on the numbers for Android.

@tannergooding (Contributor)

How is the project being compiled for Android? Is it using Mono LLVM or standard Mono AOT?

Comment on lines +328 to +329
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
=> Vector128.Min(Vector128.Max(value, min), max);
Contributor

In .NET 9+ this can just use Vector128.Clamp. Alternatively it can use Vector128.ClampNative if you don't need to care about -0 vs +0 or NaN handling for float/double.
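
A rough sketch of how that could look behind a TFM check (assuming a multi-targeted build with a NET9_0_OR_GREATER symbol):

```csharp
public static Vector128<T> Clamp<T>(Vector128<T> value, Vector128<T> min, Vector128<T> max)
{
#if NET9_0_OR_GREATER
    // .NET 9+ exposes Vector128.Clamp directly; ClampNative is an option when
    // -0 vs +0 and NaN handling for float/double doesn't matter.
    return Vector128.Clamp(value, min, max);
#else
    return Vector128.Min(Vector128.Max(value, min), max);
#endif
}
```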

Comment on lines +131 to +136
if (Avx.IsSupported)
{
Vector256<float> lower = Avx.RoundToNearestInteger(vector.GetLower());
Vector256<float> upper = Avx.RoundToNearestInteger(vector.GetUpper());
return Vector512.Create(lower, upper);
}
Contributor

What's the AVX path for with Vector512?

Vector512.IsHardwareAccelerated will only report true if Avx512F+BW+CD+DQ+VL is supported, so this path should generally be "dead".

[MethodImpl(InliningOptions.ShortMethod)]
public void NormalizeColorsAndRoundInPlaceVector128(float maximum)
{
Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));
Contributor

This would actually be more efficient as Vector128.Ceiling(Vector128.Create(maximum) * 0.5f)

While the codegen (see below) is the same size and looks nearly identical, the change to be vectorized instead of scalar avoids a very minor penalty that exists because scalar operations mutate element 0 and preserve elements 1, 2, and 3 as-is.

In general it's better to convert to vector up front and do operations as vectorized where possible.

Here's what you're getting now

; XMM
vmulss xmm0, xmm1, [reloc @RWD00]
vroundss xmm0, xmm0, xmm0, 0xa
vbroadcastss xmm0, xmm0

Here's what you would be getting with the suggested change

; XMM
vbroadcastss xmm0, xmm1
vmulps xmm0, xmm0, [reloc @RWD00]
vroundps xmm0, xmm0, 2
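
In C#, the two forms being compared are roughly (sketch; maximum is the existing method parameter):

```csharp
// Current: scalar multiply and ceiling, then broadcast (vmulss / vroundss / vbroadcastss).
Vector128<float> off = Vector128.Create(MathF.Ceiling(maximum * 0.5F));

// Suggested: broadcast first, then multiply and ceiling as vector operations
// (vbroadcastss / vmulps / vroundps).
Vector128<float> off2 = Vector128.Ceiling(Vector128.Create(maximum) * 0.5F);
```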

Contributor

Notably it also allows the Vector128.Create(maximum) used for initializing max to be reused, rather than a distinct instruction.

Member Author

Ha! I shouldn't have missed these!

Comment on lines 79 to 80
dRef = Avx.ConvertToVector256Single(top);
Unsafe.Add(ref dRef, 1) = Avx.ConvertToVector256Single(bottom);
Contributor

This one becomes stylistic preference, but you can freely mix the xplat APIs and the platform specific intrinsics.

That is, while you want to use V256 Avx2.ConvertToVector256Int32(V128) instead of V256.WidenLower/WidenUpper for efficiency, you can still use V256.ConvertToSingle() instead of Avx.ConvertToVector256Single since it is a 1-to-1 mapping.

-- As a note to myself, it would likely be beneficial to have V256.Widen(V128) APIs or similar; or to pattern match V256.WidenLower(V256) followed by V256.WidenUpper(V256); so devs don't need to use platform specific intrinsics in such cases
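
A hedged sketch of that mix (the local names here are hypothetical, not the variables in the PR):

```csharp
// Keep the platform-specific widen, since there is no 1-to-1 xplat equivalent
// of Avx2.ConvertToVector256Int32(Vector128<short>) yet...
Vector256<int> widenedTop = Avx2.ConvertToVector256Int32(shortTop);
Vector256<int> widenedBottom = Avx2.ConvertToVector256Int32(shortBottom);

// ...but use the cross-platform conversion for the int -> float step,
// since Vector256.ConvertToSingle maps 1-to-1 to Avx.ConvertToVector256Single.
dRef = Vector256.ConvertToSingle(widenedTop);
Unsafe.Add(ref dRef, 1) = Vector256.ConvertToSingle(widenedBottom);
```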

Member Author

I didn't bother porting the existing Avx code since it worked, but I might still do it.

Comment on lines 114 to 115
Vector256<int> row0 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 0), Unsafe.Add(ref bBase, i + 0)));
Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));
Contributor

nit: You can use x * y instead of Avx.Multiply(x, y)

Contributor

You can also use Vector256.ConvertToInt32 instead of Avx.ConvertToVector256Int32

Member Author

I'm actually seeing a difference in output if I switch from Avx.ConvertToVector256Int32 to Vector256.ConvertToInt32. Do they use the same rounding?

Avx.ConvertToVector256Int32 uses the equivalent of MidpointRounding.ToEven but the Vector256 equivalent is undocumented.

Vector256<int> row1 = Avx.ConvertToVector256Int32(Avx.Multiply(Unsafe.Add(ref aBase, i + 1), Unsafe.Add(ref bBase, i + 1)));

Vector256<short> row = Avx2.PackSignedSaturate(row0, row1);
row = Avx2.PermuteVar8x32(row.AsInt32(), multiplyIntoInt16ShuffleMask).AsInt16();
Contributor

nit: This should do the right thing in .NET 8 if you have Vector256.Shuffle(row.AsInt32(), Vector256.Create(0, 1, 4, 5, 2, 3, 6, 7))

In general, declaring the indices directly inside the call like this will do the right thing provided all indices are constant. We improved the handling in .NET 9, and even more so in .NET 10, to handle more patterns so that devs who manually hoist the indices will still get good codegen if the JIT can detect them as constant during compilation (so in .NET 10 you can keep the code as you have it right now, rather than declaring V256.Create(...) directly inside the Vector256.Shuffle call as is needed for .NET 8).

Comment on lines 127 to 130
Vector256<float> r0 = Avx.InsertVector128(
this.V256_0,
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
1);
Contributor

nit: You can use this.V256_0.WithUpper(Unsafe.As<Vector4, Vector128<float>>(ref this.V4L))

Member Author

It can be the following, can it not?
Vector256<float> r0 = this.V256_0.WithUpper(this.V4L.AsVector128());

@@ -421,16 +488,17 @@ public void LoadFromInt16ExtendedAvx2(ref Block8x8 source)
/// <param name="value">Value to compare to.</param>
public bool EqualsToScalar(int value)
{
// TODO: Can we provide a Vector128 implementation for this?
Contributor

What's blocking a V128 path from being added? At a glance it looks like it should be almost a copy/paste of the V256 path...

Comment on lines 501 to 502
Vector256<int> areEqual = Avx2.CompareEqual(Avx.ConvertToVector256Int32WithTruncation(Unsafe.Add(ref this.V256_0, i)), targetVector);
if (Avx2.MoveMask(areEqual.AsByte()) != equalityMask)
Contributor

This could be simplified to if (!V256.EqualsAll(V256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i)), targetVector))

That avoids a dependency on MoveMask and maps better to V128/V512.

-- Notably on .NET 9+ you may want to use V256.ConvertToInt32Native instead, since ConvertToInt32 will saturate for out of bounds values, rather than saturating on some platforms and returning a "sentinel" value on x86/x64.
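
A sketch of the suggested simplification (assuming the surrounding loop returns false on the first mismatch, as the existing Avx2 path appears to):

```csharp
Vector256<int> converted = Vector256.ConvertToInt32(Unsafe.Add(ref this.V256_0, i));
if (!Vector256.EqualsAll(converted, targetVector))
{
    return false;
}

// On .NET 9+, Vector256.ConvertToInt32Native could be swapped in when the
// defined saturating behaviour of ConvertToInt32 isn't needed.
```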

Comment on lines 29 to 36
Vector256<float> tmp0 = Avx.Add(block.V256_0, block.V256_7);
Vector256<float> tmp7 = Avx.Subtract(block.V256_0, block.V256_7);
Vector256<float> tmp1 = Avx.Add(block.V256_1, block.V256_6);
Vector256<float> tmp6 = Avx.Subtract(block.V256_1, block.V256_6);
Vector256<float> tmp2 = Avx.Add(block.V256_2, block.V256_5);
Vector256<float> tmp5 = Avx.Subtract(block.V256_2, block.V256_5);
Vector256<float> tmp3 = Avx.Add(block.V256_3, block.V256_4);
Vector256<float> tmp4 = Avx.Subtract(block.V256_3, block.V256_4);
Contributor

nit: These could use x + y and x - y. Similar for other arithmetic operations in the method

if (Vector128.IsHardwareAccelerated)
{
Vector128<int> targetVector = Vector128.Create(value);
ref Vector4 blockStride = ref this.V0L;
Contributor

Is blockStride intentionally unused?

Member Author

Badly named (copy/paste), but yeah, I'm pointing to the Vector4 field at offset 0. I don't have explicit Vector128 fields but am considering adding them to avoid some of the to/from Vector128 conversion code.

@JimBobSquarePants (Member Author) commented May 8, 2025

@beeradmoore Those numbers are wild and yes, Skia is cheating there.

Here's my desktop decoding the same image. I benchmarked against System.Drawing because the JPEG decoder there is incredibly fast (I don't know what the underlying implementation is, but it's blazing).

Appreciating the fact that the CPU on the Android device (could you post the details, btw?) is less powerful than my laptop, I'm surprised that the Vector128 performance is still so bad.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3915)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.300-preview.0.25177.5
  [Host]             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  1. No HwIntrinsics : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT
  2. SSE             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT SSE4.2
  3. AVX             : .NET 8.0.15 (8.0.1525.16413), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Runtime=.NET 8.0

| Method                     | Job                | EnvironmentVariables                            | Mean      | Error    | StdDev   | Ratio | RatioSD | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|--------------------------- |------------------- |------------------------------------------------ |----------:|---------:|---------:|------:|--------:|---------:|---------:|---------:|----------:|------------:|
| 'Maui Test'                | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 | 259.10 ms | 4.091 ms | 3.626 ms |  1.00 |    0.02 | 500.0000 | 500.0000 | 500.0000 |   20668 B |        1.00 |
| 'Maui Test'                | 2. SSE             | DOTNET_EnableAVX=0                              |  47.15 ms | 0.806 ms | 0.754 ms |  0.18 |    0.00 |        - |        - |        - |   19403 B |        0.94 |
| 'Maui Test'                | 3. AVX             | Empty                                           |  37.05 ms | 0.507 ms | 0.449 ms |  0.14 |    0.00 |        - |        - |        - |   19366 B |        0.94 |
|                            |                    |                                                 |           |          |          |       |         |          |          |          |           |             |
| 'Maui Test System Drawing' | 1. No HwIntrinsics | DOTNET_EnableHWIntrinsic=0,DOTNET_FeatureSIMD=0 |  31.10 ms | 0.237 ms | 0.210 ms |  1.00 |    0.01 |        - |        - |        - |     129 B |        1.00 |
| 'Maui Test System Drawing' | 2. SSE             | DOTNET_EnableAVX=0                              |  30.77 ms | 0.232 ms | 0.217 ms |  0.99 |    0.01 |        - |        - |        - |     116 B |        0.90 |
| 'Maui Test System Drawing' | 3. AVX             | Empty                                           |  30.65 ms | 0.189 ms | 0.177 ms |  0.99 |    0.01 |        - |        - |        - |     116 B |        0.90 |

@tannergooding I'm suspicious of the scalar timing on desktop and those Android numbers lining up so closely. Could just be coincidence though...

@beeradmoore

My test Android device is a Pixel 2 XL. 8 years old and still chugging along.

From the output of one of the previous runs (test_pr.txt) I also see // HardwareIntrinsics=ArmBase VectorSize=128. I assume that is what is expected?

I checked to see if I could use System.Drawing to compare, but I think that is a Windows-only API.

@tannergooding (Contributor)

I think the easiest way to confirm this is to add the following, which should tell the Mono AOT compiler to skip intrinsic usage...

<ItemGroup>
  <MonoAOTCompilerDefaultProcessArguments Include="-O=-intrins" />
</ItemGroup>

@beeradmoore

With that added, HardwareIntrinsics is still ArmBase VectorSize=128. Times were all the same as well.

Before,

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------------|-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 289.242 ms | 1.6725 ms | 1.4826 ms |  3.49 |    0.05 |   97448 B |       40.93 |
| ImageSharp_FromFile         | 284.213 ms | 4.5314 ms | 4.2387 ms |  3.42 |    0.07 |   74848 B |       31.44 |
| SkiaSharp_FromResource      |   5.612 ms | 0.0430 ms | 0.0402 ms |  0.07 |    0.00 |    1584 B |        0.67 |
| SkiaSharp_FromFile          |   1.737 ms | 0.0100 ms | 0.0094 ms |  0.02 |    0.00 |     766 B |        0.32 |
| Native_Android_FromResource |  83.006 ms | 1.3020 ms | 1.2179 ms |  1.00 |    0.02 |    2381 B |        1.00 |
| Native_Android_FromFile     |  85.044 ms | 1.6291 ms | 1.8107 ms |  1.02 |    0.03 |    1397 B |        0.59 |

After,

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------------|-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| ImageSharp_FromResource     | 292.062 ms | 2.8498 ms | 2.6657 ms |  3.44 |    0.09 |  228536 B |       93.17 |
| ImageSharp_FromFile         | 285.281 ms | 3.6495 ms | 3.4137 ms |  3.36 |    0.09 |  341120 B |      139.06 |
| SkiaSharp_FromResource      |   5.635 ms | 0.0288 ms | 0.0255 ms |  0.07 |    0.00 |    1584 B |        0.65 |
| SkiaSharp_FromFile          |   1.663 ms | 0.0119 ms | 0.0111 ms |  0.02 |    0.00 |     766 B |        0.31 |
| Native_Android_FromResource |  85.023 ms | 1.6433 ms | 2.0181 ms |  1.00 |    0.03 |    2453 B |        1.00 |
| Native_Android_FromFile     |  85.023 ms | 1.0300 ms | 0.8601 ms |  1.00 |    0.03 |    1469 B |        0.60 |

@JimBobSquarePants (Member Author) commented May 10, 2025

If that setting is correct, then intrinsics are not being used at all in any scenario. Are you able to stick a Vector128.IsHardwareAccelerated check anywhere?

Is this relevant?

dotnet/runtime#60792

@beeradmoore

I'll make a details page and make it output some general device info. Any other properties you'd care about?

@JimBobSquarePants (Member Author)

> I'll make a details page and make it output some general device info. Any other properties you'd care about?

I’d say any chipset info, runtime info and intrinsics tests. E.g. AdvSimd.IsSupported

@tannergooding (Contributor)

Vector128.IsHardwareAccelerated and AdvSimd.IsSupported are likely the two most important as far as vectorization is concerned. You might also print Vector64.IsHardwareAccelerated, AdvSimd.Arm64.IsSupported, and System.Numerics.Vector.IsHardwareAccelerated.
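
For example, a quick dump of those flags could look like this (sketch):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

Console.WriteLine($"Vector.IsHardwareAccelerated:    {System.Numerics.Vector.IsHardwareAccelerated}");
Console.WriteLine($"Vector64.IsHardwareAccelerated:  {Vector64.IsHardwareAccelerated}");
Console.WriteLine($"Vector128.IsHardwareAccelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"AdvSimd.IsSupported:             {AdvSimd.IsSupported}");
Console.WriteLine($"AdvSimd.Arm64.IsSupported:       {AdvSimd.Arm64.IsSupported}");
```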

@beeradmoore

I added everything I could find. Very likely too much.

Info from MAUI's DeviceInfo, System.Environment, System.Runtime.InteropServices.RuntimeInformation, and then a bunch of intrinsics info including the Arm64 and X64 sub-properties. After that there is Android-specific info such as CPU info, the Android runtime, and Android build information.

I couldn't get an OpenGL surface to initialise, so I couldn't fetch GPU-specific information.

Here is a giant dump of my Android device in release mode.

DeviceInfo
DeviceType: Physical
Idiom: Phone
Manufacturer: Google
Model: Pixel 2 XL
Name: Pixel 2 XL
Platform: Android
VersionString: 15

Environment
Is64BitOperatingSystem: True
Is64BitProcess: True
IsPrivilegedProcess: False
OSVersion: Unix 35.0.0.0
ProcessorCount: 8
Version: 9.0.4

RuntimeInformation
FrameworkDescription: .NET 9.0.4
OSArchitecture: Arm64
OSDescription: Linux 4.4.302-g113bd1cfa6f5 #1 SMP PREEMPT Tue May 6 00:14:01 UTC 2025
ProcessArchitecture: Arm64
RuntimeIdentifier: android-arm64

Intrinsics
System.Numerics.Vector.IsHardwareAccelerated: True
Vector64.IsHardwareAccelerated: True
Vector128.IsHardwareAccelerated: True
Vector256.IsHardwareAccelerated: False
Vector512.IsHardwareAccelerated: False

Intrinsics.Arm
AdvSimd.IsSupported: False
AdvSimd.Arm64.IsSupported: False
Aes.IsSupported: False
Aes.Arm64.IsSupported: False
ArmBase.IsSupported: True
ArmBase.Arm64.IsSupported: True
Crc32.IsSupported: False
Crc32.Arm64.IsSupported: False
Dp.IsSupported: False
Dp.Arm64.IsSupported: False
Rdm.IsSupported: False
Rdm.Arm64.IsSupported: False
Sha1.IsSupported: False
Sha1.Arm64.IsSupported: False
Sha256.IsSupported: False
Sha256.Arm64.IsSupported: False

Intrinsics.X86
Aes.IsSupported: False
Aes.X64.IsSupported: False
Avx.IsSupported: False
Avx.X64.IsSupported: False
Avx2.IsSupported: False
Avx2.X64.IsSupported: False
Avx10v1.IsSupported: False
Avx10v1.X64.IsSupported: False
Avx512BW.IsSupported: False
Avx512BW.X64.IsSupported: False
Avx512CD.IsSupported: False
Avx512CD.X64.IsSupported: False
Avx512DQ.IsSupported: False
Avx512DQ.X64.IsSupported: False
Avx512F.IsSupported: False
Avx512F.X64.IsSupported: False
Avx512Vbmi.IsSupported: False
Avx512Vbmi.X64.IsSupported: False
AvxVnni.IsSupported: False
AvxVnni.X64.IsSupported: False
Bmi1.IsSupported: False
Bmi1.X64.IsSupported: False
Bmi2.IsSupported: False
Bmi2.X64.IsSupported: False
Fma.IsSupported: False
Fma.X64.IsSupported: False
Lzcnt.IsSupported: False
Lzcnt.X64.IsSupported: False
Pclmulqdq.IsSupported: False
Pclmulqdq.X64.IsSupported: False
Popcnt.IsSupported: False
Popcnt.X64.IsSupported: False
Sse.IsSupported: False
Sse.X64.IsSupported: False
Sse2.IsSupported: False
Sse2.X64.IsSupported: False
Sse3.IsSupported: False
Sse3.X64.IsSupported: False
Sse41.IsSupported: False
Sse41.X64.IsSupported: False
Sse42.IsSupported: False
Sse42.X64.IsSupported: False
Ssse3.IsSupported: False
Ssse3.X64.IsSupported: False
X86Base.IsSupported: False
X86Base.X64.IsSupported: False
X86Serialize.IsSupported: False
X86Serialize.X64.IsSupported: False

Intrinsics.Wasm
PackedSimd.IsSupported: False

Android
CPU Info: /proc/cpuinfo
Processor	: AArch64 Processor rev 1 (aarch64)
processor	: 0
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x801
CPU revision	: 4

processor	: 1
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x801
CPU revision	: 4

processor	: 2
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x801
CPU revision	: 4

processor	: 3
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x801
CPU revision	: 4

processor	: 4
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x800
CPU revision	: 1

processor	: 5
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x800
CPU revision	: 1

processor	: 6
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x800
CPU revision	: 1

processor	: 7
BogoMIPS	: 38.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x51
CPU architecture: 8
CPU variant	: 0xa
CPU part	: 0x800
CPU revision	: 1

Hardware	: Qualcomm Technologies, Inc MSM8998

CPU Info: /proc/gpuinfo
Error: Unable to fetch GPU info (/proc/gpuinfo)
Runtime.AvailableProcessors: 8
Runtime.TotalMemory: 28781376
Build.Board: taimen
Build.Bootloader: TMZ30m
Build.Brand: google
Build.Device: taimen
Build.Display: lineage_taimen-userdebug 15 BP1A.250405.007 95382b1173
Build.Hardware: taimen
Build.Host: 295e78adf884
Build.Id: BP1A.250405.007
Build.Manufacturer: Google
Build.Model: Pixel 2 XL
Build.OdmSku: unknown
Build.Product: lineage_taimen
Build.Sku: G011C
Build.SocManufacturer: Qualcomm
Build.SocModel: MSM8998
Build.SupportedAbis: arm64-v8a, armeabi-v7a, armeabi
Build.Tags: release-keys
Build.Time: 1746489034000
Build.Type: userdebug

(Side note: I am not sure why Android.OS.Build.Type is userdebug. That is the OS build, not the app build. Surely LineageOS (a custom Android ROM) isn't a debug build 👀)

I did a build with this, and then with the above and the only change was

- Runtime.TotalMemory: 28781376
+ Runtime.TotalMemory: 8388608

But that may be my misunderstanding of what that property means.

@beeradmoore

I did another test of a debug build. Keep in mind the csproj has:


<PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Debug'">
    <!-- performance improvements for debug mode, will break hot reload. -->
    <UseInterpreter>false</UseInterpreter>
</PropertyGroup>

<PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Release'">
    <EnableLLVM>true</EnableLLVM>
    <RunAOTCompilation>true</RunAOTCompilation>
    <AndroidEnableProfiledAot>false</AndroidEnableProfiledAot>
</PropertyGroup>

The only property that changed (aside from Runtime.TotalMemory) was

Vector64.IsHardwareAccelerated: False

Swapping back to UseInterpreter=true in debug mode, the only properties that changed (from standard release) were:

System.Numerics.Vector.IsHardwareAccelerated: False
Vector64.IsHardwareAccelerated: False
ArmBase.IsSupported: False
ArmBase.Arm64.IsSupported: False

@tannergooding (Contributor)

Intrinsics
System.Numerics.Vector.IsHardwareAccelerated: True
Vector64.IsHardwareAccelerated: True
Vector128.IsHardwareAccelerated: True
Vector256.IsHardwareAccelerated: False
Vector512.IsHardwareAccelerated: False

Intrinsics.Arm
AdvSimd.IsSupported: False
AdvSimd.Arm64.IsSupported: False

So it should be getting generally accelerated in most places, except where AdvSimd is being used directly. Those paths would need to use the xplat intrinsics instead, potentially just as a fallback. Many of those should be fairly easy to switch over, but feel free to tag me on any of them if you have questions, @JimBobSquarePants, and I can give more direct guidance.

@JimBobSquarePants (Member Author)

That's the odd thing. There are actually very few places left where I use AdvSimd directly (the PNG and WebP encoders). Almost everything I do have uses the xplat APIs as a fallback too.

{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported;
get => Ssse3.IsSupported || AdvSimd.Arm64.IsSupported || PackedSimd.IsSupported;
Contributor

For Arm64 and WASM you should just be able to use (for byte) Vector128.Shuffle due to how VectorTableLookup and Swizzle work.

You really just want at least Ssse3 for x86/x64, since they cannot be done otherwise.

So perhaps you want:

get
{
    if (Vector128.IsHardwareAccelerated)
    {
        if (RuntimeInformation.ProcessArchitecture is Architecture.X86 or Architecture.X64)
        {
            return Ssse3.IsSupported;
        }

        // You could optionally do:
        //    return ProcessArchitecture is Architecture.Arm64 or Architecture.Wasm;
        // if you wanted to restrict it to platforms you know should be safe
        return true;
    }

    return false;
}

Member Author

Good catch. I’ll review my other helpers.

Comment on lines 1018 to 1021
Vector128<short> u0 = Vector128_.PackSignedSaturate(w0, w1);
Vector128<short> u1 = Vector128_.PackSignedSaturate(w2, w3);

Unsafe.Add(ref destinationBase, i) = Vector128Utilities.PackUnsignedSaturate(u0, u1);
Unsafe.Add(ref destinationBase, i) = Vector128_.PackUnsignedSaturate(u0, u1);
Contributor

This could be made cheaper with a direct int-> byte helper.

For x86/x64 it should do roughly the same as it is right now (float->int->short->byte), but for the V128.IsHardwareAccelerated fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like its doing currently.

Notably for AVX512 there's even some other optimizations you could do since instructions exist to go from V512<uint> -> V128<byte>. You can likewise fixup the V512<float> as part of the conversion to uint using ConvertToVector512UInt32(Max(float.AsInt32(), Zero).AsSingle()) (since out of range values become uint.MaxValue, you just have to care about negatives becoming 0).

There's also some Arm64/WASM specific behaviors you can take advantage of because float->integer already saturates there, so instead of doing float->int->short->byte, you can do float->uint then clamp to byte.MaxValue and then do the narrowing. On Arm64 you might be able to do a VectorTableLookup with 2 inputs instead of 4 narrowing instructions or do some zip instructions instead, which might be faster.

Contributor

-- The other optimizations aren't as important, but for Android since AdvSimd.IsSupported reports false, the suggestion on improving V128.IsHardwareAccelerated path will likely benefit there since it cuts out almost half the work.

Member Author

You've lost me a bit here with the Vector128 path.

Contributor

That was this part

This could be made cheaper with a direct int-> byte helper.

For x86/x64 it should do roughly the same as it is right now (float->int->short->byte), but for the V128.IsHardwareAccelerated fallback it can clamp to byte, narrow to short, narrow to byte; rather than clamp to short, narrow to short, clamp to byte, narrow to byte like its doing currently.

In particular right now you're doing:

  1. Clamp the int32 to int16
  2. Narrow the int32 to int16 now that it's in range
  3. Clamp the int16 to uint8
  4. Narrow the int16 to uint8 now that it's in range

You could instead just do (a rough sketch follows below):

  1. Clamp the int32 to the byte range
  2. Narrow the int32 to uint16 now that it's in range
  3. Narrow the uint16 to uint8, it's already in range
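
A rough sketch of what that direct helper could look like on the Vector128.IsHardwareAccelerated fallback path (the helper name is hypothetical):

```csharp
public static Vector128<byte> NarrowToUnsignedByte(
    Vector128<int> a, Vector128<int> b, Vector128<int> c, Vector128<int> d)
{
    Vector128<int> min = Vector128<int>.Zero;
    Vector128<int> max = Vector128.Create((int)byte.MaxValue);

    // 1. Clamp the int32 lanes into the byte range [0, 255].
    a = Vector128.Min(Vector128.Max(a, min), max);
    b = Vector128.Min(Vector128.Max(b, min), max);
    c = Vector128.Min(Vector128.Max(c, min), max);
    d = Vector128.Min(Vector128.Max(d, min), max);

    // 2. Narrow int32 -> uint16; the values are already in range.
    Vector128<ushort> lo = Vector128.Narrow(a.AsUInt32(), b.AsUInt32());
    Vector128<ushort> hi = Vector128.Narrow(c.AsUInt32(), d.AsUInt32());

    // 3. Narrow uint16 -> uint8; again already in range.
    return Vector128.Narrow(lo, hi);
}
```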

Member Author

Ah yes I see now! Thanks!

(Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleByte) ||
(Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleByte))
if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeByte) ||
(Vector256.IsHardwareAccelerated && Vector256_.SupportsShuffleByte) ||
Contributor

Noticed this middle one is SupportsShuffleByte while the other two are SupportsShuffleNativeByte, guessing that was a mistake?

if ((Vector512.IsHardwareAccelerated && Vector512Utilities.SupportsShuffleFloat) ||
(Vector256.IsHardwareAccelerated && Vector256Utilities.SupportsShuffleFloat) ||
(Vector128.IsHardwareAccelerated && Vector128Utilities.SupportsShuffleFloat))
if ((Vector512.IsHardwareAccelerated && Vector512_.SupportsShuffleNativeFloat) ||
Contributor

Some of these could be supported on other platforms if you passed in the indexes rather than a byte control.

There's some other functions like that throughout the SimdUtils.HwIntrinsics.cs file as well which could be impactful to Arm64, Wasm, and Android; but I've not looked at them in depth and I'm guessing many are larger refactorings to clean up.


Given these are marked as AggressiveInlining and byte control is expected to be a constant, the JIT "might" (just double check the codegen) be able to do the right thing for something like Vector128.Shuffle(vector, Vector128.Create((control & 3), ((control >> 2) & 3), ((control >> 4) & 3), ((control >> 6) & 3))) since that should itself break down to 4 constants which the JIT can recognize

@beeradmoore

I got the above (device details page, not benchmarks) to run on my iPhone 15 Pro Max in release mode. Trying to build with benchmarks enabled takes forever while it's doing AOT compilation. I'll leave it overnight to see if it actually does finish.

The only properties it had that differed from the Android build in the same config were:

AdvSimd.IsSupported: True
AdvSimd.Arm64.IsSupported: True

Full details

DeviceInfo
DeviceType: Physical
Idiom: Phone
Manufacturer: Apple
Model: iPhone16,2
Name: iPhone
Platform: iOS
VersionString: 18.4.1

Environment
Is64BitOperatingSystem: True
Is64BitProcess: True
IsPrivilegedProcess: False
OSVersion: Unix 18.4.1
ProcessorCount: 6
Version: 9.0.4

RuntimeInformation
FrameworkDescription: .NET 9.0.4
OSArchitecture: Arm64
OSDescription: Darwin 24.4.0 Darwin Kernel Version 24.4.0: Sat Mar 15 18:28:20 PDT 2025; root:xnu-11417.102.9~20/RELEASE_ARM64_T8122
ProcessArchitecture: Arm64
RuntimeIdentifier: ios-arm64

Intrinsics
System.Numerics.Vector.IsHardwareAccelerated: True
Vector64.IsHardwareAccelerated: True
Vector128.IsHardwareAccelerated: True
Vector256.IsHardwareAccelerated: False
Vector512.IsHardwareAccelerated: False

Intrinsics.Arm
AdvSimd.IsSupported: True
AdvSimd.Arm64.IsSupported: True
Aes.IsSupported: False
Aes.Arm64.IsSupported: False
ArmBase.IsSupported: True
ArmBase.Arm64.IsSupported: True
Crc32.IsSupported: False
Crc32.Arm64.IsSupported: False
Dp.IsSupported: False
Dp.Arm64.IsSupported: False
Rdm.IsSupported: False
Rdm.Arm64.IsSupported: False
Sha1.IsSupported: False
Sha1.Arm64.IsSupported: False
Sha256.IsSupported: False
Sha256.Arm64.IsSupported: False

Intrinsics.X86
Aes.IsSupported: False
Aes.X64.IsSupported: False
Avx.IsSupported: False
Avx.X64.IsSupported: False
Avx2.IsSupported: False
Avx2.X64.IsSupported: False
Avx10v1.IsSupported: False
Avx10v1.X64.IsSupported: False
Avx512BW.IsSupported: False
Avx512BW.X64.IsSupported: False
Avx512CD.IsSupported: False
Avx512CD.X64.IsSupported: False
Avx512DQ.IsSupported: False
Avx512DQ.X64.IsSupported: False
Avx512F.IsSupported: False
Avx512F.X64.IsSupported: False
Avx512Vbmi.IsSupported: False
Avx512Vbmi.X64.IsSupported: False
AvxVnni.IsSupported: False
AvxVnni.X64.IsSupported: False
Bmi1.IsSupported: False
Bmi1.X64.IsSupported: False
Bmi2.IsSupported: False
Bmi2.X64.IsSupported: False
Fma.IsSupported: False
Fma.X64.IsSupported: False
Lzcnt.IsSupported: False
Lzcnt.X64.IsSupported: False
Pclmulqdq.IsSupported: False
Pclmulqdq.X64.IsSupported: False
Popcnt.IsSupported: False
Popcnt.X64.IsSupported: False
Sse.IsSupported: False
Sse.X64.IsSupported: False
Sse2.IsSupported: False
Sse2.X64.IsSupported: False
Sse3.IsSupported: False
Sse3.X64.IsSupported: False
Sse41.IsSupported: False
Sse41.X64.IsSupported: False
Sse42.IsSupported: False
Sse42.X64.IsSupported: False
Ssse3.IsSupported: False
Ssse3.X64.IsSupported: False
X86Base.IsSupported: False
X86Base.X64.IsSupported: False
X86Serialize.IsSupported: False
X86Serialize.X64.IsSupported: False

Intrinsics.Wasm
PackedSimd.IsSupported: False

@JimBobSquarePants (Member Author)

@beeradmoore If you get the chance, could you please run another benchmark? I want to see if some of the additional shuffle changes have made a difference. Thanks!

@beeradmoore

New is this updated PR; old is re-running the code from about 4 days ago.

| Method                      | Mean       | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------------|-----------:|----------:|----------:|------:|--------:|----------:|------------:|
| New ImageSharp_FromResource | 293.021 ms | 1.7252 ms | 1.6138 ms |  3.48 |    0.09 |   97448 B |       39.73 |
| New ImageSharp_FromFile     | 286.549 ms | 2.9329 ms | 2.7434 ms |  3.40 |    0.09 |   74848 B |       30.51 |
| Old ImageSharp_FromResource | 290.849 ms | 2.7201 ms | 2.5444 ms |  3.41 |    0.10 |   97448 B |       39.73 |
| Old ImageSharp_FromFile     | 284.574 ms | 2.0648 ms | 1.8304 ms |  3.34 |    0.09 |   74752 B |       30.47 |

@JimBobSquarePants (Member Author)

@beeradmoore @tannergooding

I'm going to merge this AS-IS as it's blocking important work.

I'm hoping the new narrowing code will speed things up a bit since the last benchmark, but I can revisit later as I look for further optimization opportunities.

@JimBobSquarePants merged commit d8b464b into main on May 16, 2025
10 checks passed
@JimBobSquarePants deleted the js/block8x8-simd branch on May 16, 2025 at 13:26