
Implement GGML_CPU_ALL_VARIANTS for ARM #14080


Merged
4 commits merged into ggml-org:master from the arm64-all-variants branch on Jun 11, 2025

Conversation

@ckastner (Collaborator) commented on Jun 9, 2025

This supersedes #14049, which also has more context.

There are two notable design decisions, better explained in the respective commit messages:

  1. The use of GGML_INTERNAL_<FEAT> within cmake, and the GGML_USE_<FEAT> it activates in code (see the sketch below)
  2. The somewhat odd backend naming, a consequence of some features being optional
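
Schematically, the first point works as follows (an illustrative sketch only, not the actual cmake code from the PR; the variant target name is taken from the build log below, and the flag names just follow the GGML_INTERNAL_<FEAT>/GGML_USE_<FEAT> pattern described in the commit notes):

    # A variant records the features it is built with via internal flags:
    set(GGML_INTERNAL_DOTPROD ON)

    # Each internal flag is then exposed to that variant's sources as a
    # GGML_USE_<FEAT> compile definition, so the feature-detection code can
    # tell at compile time which instructions the variant assumes:
    if (GGML_INTERNAL_DOTPROD)
        target_compile_definitions(ggml-cpu-armv8.2_1 PRIVATE GGML_USE_DOTPROD)
    endif()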

I tested this on a 4-vCPU Graviton4, which is armv9.0-a. The test command was simply llama-bench -m ggml-model-q4_0.gguf, as I just needed something simple to show that loading worked correctly and that no regressions were introduced.
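
For reference, the two configurations were roughly as follows (a sketch; only the flags named here are from the actual runs, the rest of the invocation is assumed):

    # baseline: native build
    cmake -B build -DGGML_NATIVE=ON
    cmake --build build
    ./build/bin/llama-bench -m ggml-model-q4_0.gguf

    # multi-variant build with dynamically loaded CPU backends
    cmake -B build -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON
    cmake --build build
    ./build/bin/llama-bench -m ggml-model-q4_0.gguf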

First, the results with GGML_NATIVE=ON:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           pp512 |         38.57 ± 0.01 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           tg128 |          6.77 ± 0.00 |

Then, the results with GGML_NATIVE=OFF GGML_BACKEND_DL=ON GGML_CPU_ALL_VARIANTS=ON (some debug messages omitted):

ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv9.2_2.so score: 0
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_3.so score: 15
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_1.so score: 3
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_2.so score: 7
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_2.so score: 63
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.0_1.so score: 1
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv9.2_1.so score: 0
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_1.so score: 31
load_backend: loaded CPU backend from /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_2.so

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           pp512 |         38.58 ± 0.01 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           tg128 |          6.77 ± 0.00 |

The scoring for each backend was calculated correctly.

The armv9.2 backends are for SME, which the Graviton4 doesn't have, hence they scored 0. So the armv8.6_2 backend (+dotprod+fp16+sve+i8mm) was picked.
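
The scoring follows the same idea as cpu-feats-x86.cpp: a variant scores 0 if any feature it was compiled with is missing at runtime, and otherwise each compiled-in feature raises the score so that the most capable loadable variant wins, consistent with the scores in the log above. A minimal sketch of that idea (the GGML_USE_<FEAT> macro names just follow the pattern from the commit notes, and the ggml_arm_has_*() probes are hypothetical stand-ins for the real runtime detection):

    // Illustrative sketch only, not the actual ggml scoring code.
    bool ggml_arm_has_dotprod();
    bool ggml_arm_has_sve();
    bool ggml_arm_has_sme();

    static int ggml_backend_cpu_arm_score() {
        int score = 1;                              // base score for any loadable backend
    #ifdef GGML_USE_DOTPROD
        if (!ggml_arm_has_dotprod()) { return 0; }  // compiled-in feature missing at runtime
        score += 1 << 1;
    #endif
    #ifdef GGML_USE_SVE
        if (!ggml_arm_has_sve())     { return 0; }
        score += 1 << 2;
    #endif
    #ifdef GGML_USE_SME
        if (!ggml_arm_has_sme())     { return 0; }  // e.g. the armv9.2 variants on this Graviton4
        score += 1 << 3;
    #endif
        // ... fp16, i8mm, etc. would follow the same pattern
        return score;
    }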

Incidentally, this showcases one problem that I left for future work. When choosing the MCPU to target, I chose the first version that supported a particular instruction, which, e.g. for i8mm, was armv8.6-a. However, the test above was run on armv9.0-a, so a build with the same features but targeting armv9.0-a might have performed even better. The solution would be to include the runtime arch in the scoring, but the above implements the necessary base case and I'll look into this improvement later.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 9, 2025
@ckastner force-pushed the arm64-all-variants branch from 82b2f82 to 0a9ae27 on June 9, 2025 07:58
@ckastner (Collaborator, Author) commented on Jun 9, 2025

Regarding the test failures:

  • I don't have a macOS environment, so I can't easily say what's going on with the macOS-latest-swift and ios-xcode-build failures. The error message suggests that some header is missing. I added a few in another branch, but this did not resolve the issue.
  • The Android build failure seems related to armv7 and NEON. No idea yet why this pops up here.

@slaren (Member) commented on Jun 9, 2025

We are going to need a different set of variants for each platform. For example, I would expect the variants used for Apple to be one for each of the M1-M4 chips (or at least the ones that have different CPU features). The Android variants are also likely to be different from the Linux variants. Windows at the moment probably only needs one variant. It's not necessary to support every platform from the first moment, but this list of variants probably should only be used for Linux.

Generally I think it is easier to build the list of variants if we know exactly the list of chips we are targeting.

@ckastner (Collaborator, Author) commented on Jun 9, 2025

> We are going to need a different set of variants for each platform. [...] It's not necessary to support every platform from the first moment, but this list of variants probably should only be used for Linux.

I agree, I'll add that later today.

> Generally I think it is easier to build the list of variants if we know exactly the list of chips we are targeting.

Same, though I saw this as a topic for future iterations once the basic mechanism is in place. Might also be an opportunity for the respective manufacturers to chip in (pun intended).

@chaxu01 (Collaborator) commented on Jun 9, 2025

I got a build error on my M4 Pro with -DGGML_METAL=OFF -DGGML_BLAS=OFF -DBUILD_SHARED_LIBS=ON -DGGML_OPENMP=OFF -DGGML_CPU_ALL_VARIANTS=ON -DGGML_BACKEND_DL=ON -DGGML_NATIVE=OFF:

Building CXX object ggml/src/CMakeFiles/ggml-cpu-armv9.2_2.dir/ggml-cpu/binary-ops.cpp.o
fatal error: error in backend: Cannot select: 0x159f960e0: nxv4i32 = AArch64ISD::SUNPKLO 0x1591f17e0
  0x1591f17e0: v2i64,ch = CopyFromReg 0x159268eb0:1, Register:v2i64 %498
    0x15920f940: v2i64 = Register %498
In function: ggml_compute_forward_mul
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.5.0

So it needs a different set of variants for each platform.

@ckastner (Collaborator, Author) commented on Jun 9, 2025

I have now limited GGML_CPU_ALL_VARIANTS to Linux.

I also figured out why some tests were failing. This was a mistake on my part: I put some of the variant-building logic in an else-branch rather than an elseif-branch. The previous behavior has been restored.

ckastner added 4 commits June 9, 2025 18:09
This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT>, which needs to be set
in cmake, instead of the GGML_<FEAT> that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.
As with x86; however, to pass arch flags around within cmake, we use
GGML_INTERNAL_<FEAT>, since we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.
The other platforms will need their own specific variants.

This also fixes the bug that the variant-building branch was always being
executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an
elseif-branch, which restores the previous behavior.
@ckastner force-pushed the arm64-all-variants branch from d79bfe9 to f5a8b9a on June 9, 2025 16:25
@ckastner (Collaborator, Author) commented on Jun 9, 2025

Rebased onto current master.

@chaxu01 (Collaborator) commented on Jun 10, 2025

Benchmarked on Graviton3:

Without GGML_CPU_ALL_VARIANTS:

$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16
load_backend: loaded CPU backend from /tmp/chaxu01/libggml-cpu-armv8.6_1.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.73 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.26 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.46 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.23 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.25 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        101.29 ± 0.05 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         30.75 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |        178.37 ± 0.46 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |         40.93 ± 0.21 |

build: f5a8b9ad (5619)
With GGML_CPU_ALL_VARIANTS:
$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16
load_backend: loaded CPU backend from /tmp/chaxu01/libggml-cpu-armv8.6_1.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.15 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.72 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.44 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.24 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.18 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        101.18 ± 0.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         30.50 ± 0.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |       133.59 ± 34.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |        35.80 ± 10.28 |

build: f5a8b9ad (5619)

Not sure why the numbers for 16 threads were off with GGML_CPU_ALL_VARIANTS; otherwise it looks good. Looking forward to adding support for Apple and Android.

@ckastner (Collaborator, Author) commented:

Thanks for the check!

Are you sure the first example is with GGML_CPU_ALL_VARIANTS=OFF? It loaded libggml-cpu-armv8.6_1.so, a name which should only appear with =ON. The =OFF version should just be called libggml-cpu.so.

I tested it with GGML_BACKEND_DL=ON and then GGML_NATIVE=ON GGML_CPU_ALL_VARIANTS=OFF, and vice versa.

@slaren (Member) left a comment

I don't have a Linux ARM machine to test this, but the code looks good to me. The only thing I don't like very much is how the base ARM arch is set automatically depending on which features are added; it is not obvious to me that this will always be correct. But it can be changed if necessary when adding support for other platforms.

@ckastner (Collaborator, Author) commented:

> The only thing I don't like very much is how the base ARM arch is set automatically depending on which features are added

Yeah, this definitely has room for future improvement. Regarding correctness, I assumed that if some instruction was only added in arch version X, then it is safe to bump the base arch to X and possibly get other improvements. @chaxu01, your thoughts on that?

@chaxu01 (Collaborator) commented on Jun 11, 2025

In my case, I check for compiler support for arch X for each feature and add the supported variant.

        set(DOTPROD_SUPPORTED -1)
        set(I8MM_SUPPORTED -1)
        set(SVE_SUPPORTED -1)
        set(SME_SUPPORTED -1)

        check_compiler_support(dotprod "#include <arm_neon.h>\nint main() { int8x16_t _a, _b; volatile int32x4_t _s = vdotq_s32(_s, _a, _b); return 0; }" DOTPROD_SUPPORTED)
        check_compiler_support(i8mm    "#include <arm_neon.h>\nint main() { int8x16_t _a, _b; volatile int32x4_t _s = vmmlaq_s32(_s, _a, _b); return 0; }" I8MM_SUPPORTED)
        check_compiler_support(sve     "#include <arm_sve.h>\nint main()  { svfloat32_t _a, _b; volatile svfloat32_t _c = svadd_f32_z(svptrue_b8(), _a, _b); return 0; }" SVE_SUPPORTED)
        check_compiler_support(sme     "#include <arm_sme.h>\n__arm_locally_streaming int main() { __asm__ volatile(\"smstart; smstop;\"); return 0; }" SME_SUPPORTED)

        function(add_isa_support ISA_NAME SUPPORTED)
            if (DEFINED ${SUPPORTED} AND ${SUPPORTED} EQUAL 1)
                if (${ISA_NAME} STREQUAL DOTPROD)
                    ggml_add_cpu_backend_variant(dotprod DOTPROD)
                elseif (${ISA_NAME} STREQUAL I8MM)
                    ggml_add_cpu_backend_variant(i8mm DOTPROD I8MM)
                elseif (${ISA_NAME} STREQUAL SVE)
                    ggml_add_cpu_backend_variant(sve DOTPROD I8MM SVE)
                elseif (${ISA_NAME} STREQUAL SME)
                    if (APPLE)
                        ggml_add_cpu_backend_variant(sme DOTPROD I8MM SME)
                    else()
                        ggml_add_cpu_backend_variant(sme DOTPROD I8MM SVE SME)
                    endif()
                endif()
            else()
                message(STATUS "Skipping ${ISA_NAME} variant — not supported by compiler")
            endif()
        endfunction()

        add_isa_support(DOTPROD DOTPROD_SUPPORTED)
        add_isa_support(I8MM    I8MM_SUPPORTED)
        if (NOT APPLE)
            add_isa_support(SVE SVE_SUPPORTED)
        endif()
        add_isa_support(SME SME_SUPPORTED) 

Otherwise one could run into build failures when the compiler doesn't support the specific feature.

@chaxu01 (Collaborator) commented on Jun 11, 2025

> Are you sure the first example is with GGML_CPU_ALL_VARIANTS=OFF? It loaded libggml-cpu-armv8.6_1.so, a name which should only appear with =ON. The =OFF version should just be called libggml-cpu.so.

Ah, the leftover library from the previous build got picked up. Rerun after cleaning up:
$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.12 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.73 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.17 ± 0.02 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.62 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.30 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        100.84 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         31.97 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |        178.34 ± 0.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |         41.88 ± 0.09 |

@ckastner (Collaborator, Author) commented on Jun 11, 2025

> In my case, I check for compiler support for arch X for each feature and add the supported variant.

A fair point. It's even a bit more complicated, I think: for example, the support check is the same as the one done for GGML_NATIVE=ON, so it could be factored out, like FindSIMD does for x86.

Then there's a choice of where to check and skip the build if not supported: before, or in, the _variant function. Before is cleaner, I think, but it also feels a bit like a layer violation given the current setup.
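
For illustration, such a factored-out check could look roughly like this (a sketch using CMake's stock check_cxx_source_compiles, reusing the test program from the snippet above; the result variable name and the -march value are assumptions):

    include(CheckCXXSourceCompiles)

    # Probe whether the compiler can target dotprod before registering any
    # variant that assumes it.
    set(CMAKE_REQUIRED_FLAGS "-march=armv8.2-a+dotprod")
    check_cxx_source_compiles("
        #include <arm_neon.h>
        int main() { int8x16_t _a, _b; volatile int32x4_t _s = vdotq_s32(_s, _a, _b); return 0; }
    " GGML_COMPILER_SUPPORTS_DOTPROD)
    unset(CMAKE_REQUIRED_FLAGS)

    if (GGML_COMPILER_SUPPORTS_DOTPROD)
        # only now add the variants that require dotprod
    endif()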

@slaren this new question seems nuanced enough to warrant its own PR; is that OK, or would you like it addressed here?

@slaren (Member) commented on Jun 11, 2025

Should be OK to merge this now; other improvements can be added in a later PR. Once everything is ironed out, we can enable it in the Docker ARM releases and re-enable the Linux ARM releases.

@slaren merged commit 532802f into ggml-org:master on Jun 11, 2025 (47 checks passed)
@ckastner (Collaborator, Author) commented:

@chaxu01, I'll step back now from the ARM part so you can add your queued work without us crossing wires.
