
Implement GGML_CPU_ALL_VARIANTS for ARM #14080


Merged
4 commits merged into ggml-org:master from the arm64-all-variants branch on Jun 11, 2025

Conversation

@ckastner (Collaborator) commented on Jun 9, 2025

This supersedes #14049, which also has more context.

There are two notable design decisions, better explained in the respective commit messages:

  1. The use of GGML_INTERNAL_<FEAT> within cmake, and the GGML_USE_<FEAT> it activates in code (see the sketch below)
  2. The somewhat odd backend naming, a consequence of some features being optional
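
Schematically, the first point works as follows (an illustrative sketch only, not the actual cmake code from the PR; the variant target name is taken from the build log below, and the flag names just follow the GGML_INTERNAL_<FEAT>/GGML_USE_<FEAT> pattern described in the commit notes):

    # A variant records the features it is built with via internal flags:
    set(GGML_INTERNAL_DOTPROD ON)

    # Each internal flag is then exposed to that variant's sources as a
    # GGML_USE_<FEAT> compile definition, so the feature-detection code can
    # tell at compile time which instructions the variant assumes:
    if (GGML_INTERNAL_DOTPROD)
        target_compile_definitions(ggml-cpu-armv8.2_1 PRIVATE GGML_USE_DOTPROD)
    endif()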

I tested this on a 4-vCPU Graviton4, which is armv9.0-a. The test command was simply llama-bench -m ggml-model-q4_0.gguf, as I just needed something simple to show that loading worked correctly and that no regressions were introduced.
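
For reference, the two configurations were roughly as follows (a sketch; only the flags named here are from the actual runs, the rest of the invocation is assumed):

    # baseline: native build
    cmake -B build -DGGML_NATIVE=ON
    cmake --build build
    ./build/bin/llama-bench -m ggml-model-q4_0.gguf

    # multi-variant build with dynamically loaded CPU backends
    cmake -B build -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON
    cmake --build build
    ./build/bin/llama-bench -m ggml-model-q4_0.gguf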

First, the results with GGML_NATIVE=ON:

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           pp512 |         38.57 ± 0.01 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           tg128 |          6.77 ± 0.00 |

Then, the results with GGML_NATIVE=OFF GGML_BACKEND_DL=ON GGML_CPU_ALL_VARIANTS=ON (some debug messages omitted):

ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv9.2_2.so score: 0
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_3.so score: 15
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_1.so score: 3
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.2_2.so score: 7
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_2.so score: 63
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.0_1.so score: 1
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv9.2_1.so score: 0
ggml_backend_load_best: /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_1.so score: 31
load_backend: loaded CPU backend from /home/christian/GitHub/llama.cpp/build/bin/libggml-cpu-armv8.6_2.so

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           pp512 |         38.58 ± 0.01 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       4 |           tg128 |          6.77 ± 0.00 |

The scoring for each backend was calculated correctly.

The armv9.2 backends are for SME, which the Graviton4 doesn't have, hence they scored 0. So the armv8.6_2 backend (+dotprod+fp16+sve+i8mm) was picked.
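
The scoring follows the same idea as cpu-feats-x86.cpp: a variant scores 0 if any feature it was compiled with is missing at runtime, and otherwise each compiled-in feature raises the score so that the most capable loadable variant wins, consistent with the scores in the log above. A minimal sketch of that idea (the GGML_USE_<FEAT> macro names just follow the pattern from the commit notes, and the ggml_arm_has_*() probes are hypothetical stand-ins for the real runtime detection):

    // Illustrative sketch only, not the actual ggml scoring code.
    bool ggml_arm_has_dotprod();
    bool ggml_arm_has_sve();
    bool ggml_arm_has_sme();

    static int ggml_backend_cpu_arm_score() {
        int score = 1;                              // base score for any loadable backend
    #ifdef GGML_USE_DOTPROD
        if (!ggml_arm_has_dotprod()) { return 0; }  // compiled-in feature missing at runtime
        score += 1 << 1;
    #endif
    #ifdef GGML_USE_SVE
        if (!ggml_arm_has_sve())     { return 0; }
        score += 1 << 2;
    #endif
    #ifdef GGML_USE_SME
        if (!ggml_arm_has_sme())     { return 0; }  // e.g. the armv9.2 variants on this Graviton4
        score += 1 << 3;
    #endif
        // ... fp16, i8mm, etc. would follow the same pattern
        return score;
    }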

Incidentally, this showcases one problem that I left for future work. When choosing the MCPU to target, I chose the first version that supported a particular instruction, which, e.g. for i8mm, was armv8.6-a. However, the test above was run on armv9.0-a, so a build with the same features but targeting armv9.0-a might have performed even better. The solution would be to include the runtime arch in the scoring, but the above implements the necessary base case and I'll look into this improvement later.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 9, 2025
@ckastner force-pushed the arm64-all-variants branch from 82b2f82 to 0a9ae27 on June 9, 2025 07:58
@ckastner (Collaborator, Author) commented on Jun 9, 2025

Regarding the test failures:

  • I don't have a macOS environment, so I can't easily say what's going on with the macOS-latest-swift and ios-xcode-build failures. The error message suggests that some header is missing. I added a few in another branch, but this did not resolve the issue.
  • The Android build failure seems related to armv7 and NEON. No idea yet why this pops up here.

@slaren (Member) commented on Jun 9, 2025

We are going to need a different set of variants for each platform. For example, I would expect the variants used for Apple to be one for each of the M1-M4 chips (or at least the ones that have different CPU features). The Android variants are also likely to be different from the Linux variants. Windows at the moment probably only needs one variant. It's not necessary to support every platform from the first moment, but this list of variants probably should only be used for Linux.

Generally I think it is easier to build the list of variants if we know exactly the list of chips we are targeting.

@ckastner (Collaborator, Author) commented on Jun 9, 2025

> We are going to need a different set of variants for each platform. [...] It's not necessary to support every platform from the first moment, but this list of variants probably should only be used for Linux.

I agree, I'll add that later today.

> Generally I think it is easier to build the list of variants if we know exactly the list of chips we are targeting.

Same, though I saw this as a topic for future iterations once the basic mechanism is in place. Might also be an opportunity for the respective manufacturers to chip in (pun intended).

@chaxu01 (Collaborator) commented on Jun 9, 2025

I got a build error on my M4 Pro with -DGGML_METAL=OFF -DGGML_BLAS=OFF -DBUILD_SHARED_LIBS=ON -DGGML_OPENMP=OFF -DGGML_CPU_ALL_VARIANTS=ON -DGGML_BACKEND_DL=ON -DGGML_NATIVE=OFF:

Building CXX object ggml/src/CMakeFiles/ggml-cpu-armv9.2_2.dir/ggml-cpu/binary-ops.cpp.o
fatal error: error in backend: Cannot select: 0x159f960e0: nxv4i32 = AArch64ISD::SUNPKLO 0x1591f17e0
  0x1591f17e0: v2i64,ch = CopyFromReg 0x159268eb0:1, Register:v2i64 %498
    0x15920f940: v2i64 = Register %498
In function: ggml_compute_forward_mul
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.5.0

So it needs a different set of variants for each platform.

@ckastner (Collaborator, Author) commented on Jun 9, 2025

I have now limited GGML_CPU_ALL_VARIANTS to Linux.

I also figured out why some tests were failing. This was a mistake on my part: I put some of the variant-building logic in an else-branch rather than an elseif-branch. The previous behavior has been restored.

ckastner added 4 commits June 9, 2025 18:09
This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT>, which needs to be set
in cmake, instead of the GGML_<FEAT> that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.
As with x86; however, to pass arch flags around within cmake, we use
GGML_INTERNAL_<FEAT>, since we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.
The other platforms will need their own specific variants.

This also fixes the bug that the variant-building branch was always being
executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an
elseif-branch, which restores the previous behavior.
@ckastner force-pushed the arm64-all-variants branch from d79bfe9 to f5a8b9a on June 9, 2025 16:25
@ckastner (Collaborator, Author) commented on Jun 9, 2025

Rebased onto current master.

@chaxu01 (Collaborator) commented on Jun 10, 2025

Benchmarked on Graviton3:

Without GGML_CPU_ALL_VARIANTS:

$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16
load_backend: loaded CPU backend from /tmp/chaxu01/libggml-cpu-armv8.6_1.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.73 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.26 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.46 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.23 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.25 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        101.29 ± 0.05 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         30.75 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |        178.37 ± 0.46 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |         40.93 ± 0.21 |

build: f5a8b9ad (5619)
With GGML_CPU_ALL_VARIANTS:
$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16
load_backend: loaded CPU backend from /tmp/chaxu01/libggml-cpu-armv8.6_1.so
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.15 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.72 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.44 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.24 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.18 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        101.18 ± 0.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         30.50 ± 0.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |       133.59 ± 34.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |        35.80 ± 10.28 |

build: f5a8b9ad (5619)

Not sure why the numbers for 16 threads were off with GGML_CPU_ALL_VARIANTS; otherwise it looks good. Looking forward to adding support for Apple and Android.

@ckastner (Collaborator, Author) commented:

Thanks for the check!

Are you sure the first example is with GGML_CPU_ALL_VARIANTS=OFF? It loaded libggml-cpu-armv8.6_1.so, a name which should only appear with =ON. The =OFF version should just be called libggml-cpu.so.

I tested it with GGML_BACKEND_DL=ON and then GGML_NATIVE=ON GGML_CPU_ALL_VARIANTS=OFF, and vice versa.

@slaren (Member) left a comment

I don't have a Linux ARM machine to test this, but the code looks good to me. The only thing I don't like very much is how the base ARM arch is set automatically depending on which features are added; it is not obvious to me that this will always be correct. But it can be changed if necessary when adding support for other platforms.

@ckastner (Collaborator, Author) commented:

> The only thing I don't like very much is how the base ARM arch is set automatically depending on which features are added

Yeah, this definitely has room for future improvement. Regarding correctness, I assumed that if some instruction was only added in arch version X, then it is safe to bump the base arch to X and possibly get other improvements. @chaxu01, your thoughts on that?

@chaxu01 (Collaborator) commented on Jun 11, 2025

In my case, I check for compiler support for arch X for each feature and add the supported variant.

        set(DOTPROD_SUPPORTED -1)
        set(I8MM_SUPPORTED -1)
        set(SVE_SUPPORTED -1)
        set(SME_SUPPORTED -1)

        check_compiler_support(dotprod "#include <arm_neon.h>\nint main() { int8x16_t _a, _b; volatile int32x4_t _s = vdotq_s32(_s, _a, _b); return 0; }" DOTPROD_SUPPORTED)
        check_compiler_support(i8mm    "#include <arm_neon.h>\nint main() { int8x16_t _a, _b; volatile int32x4_t _s = vmmlaq_s32(_s, _a, _b); return 0; }" I8MM_SUPPORTED)
        check_compiler_support(sve     "#include <arm_sve.h>\nint main()  { svfloat32_t _a, _b; volatile svfloat32_t _c = svadd_f32_z(svptrue_b8(), _a, _b); return 0; }" SVE_SUPPORTED)
        check_compiler_support(sme     "#include <arm_sme.h>\n__arm_locally_streaming int main() { __asm__ volatile(\"smstart; smstop;\"); return 0; }" SME_SUPPORTED)

        function(add_isa_support ISA_NAME SUPPORTED)
            if (DEFINED ${SUPPORTED} AND ${SUPPORTED} EQUAL 1)
                if (${ISA_NAME} STREQUAL DOTPROD)
                    ggml_add_cpu_backend_variant(dotprod DOTPROD)
                elseif (${ISA_NAME} STREQUAL I8MM)
                    ggml_add_cpu_backend_variant(i8mm DOTPROD I8MM)
                elseif (${ISA_NAME} STREQUAL SVE)
                    ggml_add_cpu_backend_variant(sve DOTPROD I8MM SVE)
                elseif (${ISA_NAME} STREQUAL SME)
                    if (APPLE)
                        ggml_add_cpu_backend_variant(sme DOTPROD I8MM SME)
                    else()
                        ggml_add_cpu_backend_variant(sme DOTPROD I8MM SVE SME)
                    endif()
                endif()
            else()
                message(STATUS "Skipping ${ISA_NAME} variant — not supported by compiler")
            endif()
        endfunction()

        add_isa_support(DOTPROD DOTPROD_SUPPORTED)
        add_isa_support(I8MM    I8MM_SUPPORTED)
        if (NOT APPLE)
            add_isa_support(SVE SVE_SUPPORTED)
        endif()
        add_isa_support(SME SME_SUPPORTED) 

Otherwise one could run into build failures when the compiler doesn't support the specific feature.

@chaxu01 (Collaborator) commented on Jun 11, 2025

> Are you sure the first example is with GGML_CPU_ALL_VARIANTS=OFF? It loaded libggml-cpu-armv8.6_1.so, a name which should only appear with =ON. The =OFF version should just be called libggml-cpu.so.

Ah, the leftover library from the previous build got picked up. Rerun after cleaning up:
$TOOLCHAIN_SYSROOT/lib/ld-linux-aarch64.so.1 --library-path $TOOLCHAIN_SYSROOT/lib64:$TOOLCHAIN_SYSROOT/usr/lib64:/tmp/chaxu01 ./llama-bench -m llama-2-7b-chat.Q4_0.gguf -ngl 0 -t 1,2,4,8,16

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           pp512 |         15.12 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |           tg128 |          5.73 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           pp512 |         30.17 ± 0.02 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |           tg128 |         10.62 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         60.17 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         18.30 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           pp512 |        100.84 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       8 |           tg128 |         31.97 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           pp512 |        178.34 ± 0.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |      16 |           tg128 |         41.88 ± 0.09 |

@ckastner (Collaborator, Author) commented on Jun 11, 2025

> In my case, I check for compiler support for arch X for each feature and add the supported variant.

A fair point. It's even a bit more complicated, I think: for example, the support check is the same as the one done for GGML_NATIVE=ON, so it could be factored out, like FindSIMD does for x86.

Then there's a choice of where to check and skip the build if not supported: before, or in, the _variant function. Before is cleaner, I think, but it also feels a bit like a layer violation given the current setup.
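
For illustration, such a factored-out check could look roughly like this (a sketch using CMake's stock check_cxx_source_compiles, reusing the test program from the snippet above; the result variable name and the -march value are assumptions):

    include(CheckCXXSourceCompiles)

    # Probe whether the compiler can target dotprod before registering any
    # variant that assumes it.
    set(CMAKE_REQUIRED_FLAGS "-march=armv8.2-a+dotprod")
    check_cxx_source_compiles("
        #include <arm_neon.h>
        int main() { int8x16_t _a, _b; volatile int32x4_t _s = vdotq_s32(_s, _a, _b); return 0; }
    " GGML_COMPILER_SUPPORTS_DOTPROD)
    unset(CMAKE_REQUIRED_FLAGS)

    if (GGML_COMPILER_SUPPORTS_DOTPROD)
        # only now add the variants that require dotprod
    endif()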

@slaren this new question seems nuanced enough to warrant its own PR; is that OK, or would you like it addressed here?

@slaren (Member) commented on Jun 11, 2025

Should be OK to merge this now; other improvements can be added in a later PR. Once everything is ironed out, we can enable it in the Docker ARM releases and re-enable the Linux ARM releases.

@slaren merged commit 532802f into ggml-org:master on Jun 11, 2025 (47 checks passed)
@ckastner (Collaborator, Author) commented:

@chaxu01, I'll step back now from the ARM part so you can add your queued work without us crossing wires.
