Sync master with upstream release b5798 #142

jan-service-account · 2025-07-02T06:38:35Z

Updates dev branch with latest release (b5798) from ggml-org/llama.cpp

Co-authored-by: dinhhuy <[email protected]>

* Add Reorder to Q6_K mmvq implementation
* Address PR comments: clean up comments
* Remove unused parameter after refactoring q4_k
* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <[email protected]>

ggml-ci

* webui: fix sidebar being covered by main content
  Signed-off-by: Xiaodong Ye <[email protected]>
* webui: update index.html.gz
  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>

* Simplify the environment variable setting to specify the memory pool type.
* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
* update
* fix CI
* update
* delete whitespace
* fix according to review
* update CANN.md
* update CANN.md
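The option handling described for GGML_CANN_ASYNC_MODE (accepting yes, enable, 1, or on, case-insensitively) can be illustrated with a minimal sketch. The helper names `parse_bool_option` and `env_flag_enabled` are hypothetical and are not taken from the CANN backend; only the accepted values come from the commit message.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper (not the actual CANN backend code): returns true when
// the value is one of "yes", "enable", "1" or "on", ignoring letter case.
static bool parse_bool_option(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "yes" || value == "enable" || value == "1" || value == "on";
}

// Hypothetical wrapper: an unset variable is treated as "disabled".
static bool env_flag_enabled(const char * name) {
    const char * raw = std::getenv(name);
    return raw != nullptr && parse_bool_option(raw);
}
```

With this sketch, `env_flag_enabled("GGML_CANN_ASYNC_MODE")` would be true for values such as `ON` or `Yes`, and false for anything else, including an unset variable.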

ggml-ci

* move ggml-cpu-aarch64 to repack
* split quantize_row_q8_0/1
* split helper functions
* split ggml_vec_dot_q4_0_q8_0
* split ggml_vec_dot_q4_1_q8_1
* split ggml_vec_dot_q5_0_q8_0
* split ggml_vec_dot_q5_1_q8_1
* split ggml_vec_dot_q8_0_q8_0
* split ggml_vec_dot_tq1_0_q8_K
* split ggml_vec_dot_tq2_0_q8_K
* split ggml_vec_dot_q2_K_q8_K
* split ggml_vec_dot_q3_K_q8_K
* split ggml_vec_dot_q4_K_q8_K
* split ggml_vec_dot_q5_K_q8_K
* split ggml_vec_dot_q6_K_q8_K
* split ggml_vec_dot_iq2_xxs_q8_K
* split ggml_vec_dot_iq2_xs_q8_K
* split ggml_vec_dot_iq2_s_q8_K
* split ggml_vec_dot_iq3_xxs_q8_K
* split ggml_vec_dot_iq3_s_q8_K
* split ggml_vec_dot_iq1_s_q8_K
* split ggml_vec_dot_iq1_m_q8_K
* split ggml_vec_dot_iq4_nl_q8_0
* split ggml_vec_dot_iq4_xs_q8_K
* fix typos
* fix missing prototypes
* rename ggml-cpu-quants.c
* rename ggml-cpu-traits
* rename arm folder
* move cpu-feats-x86.cpp
* rename ggml-cpu-hbm
* update arm detection macro in quants.c
* move iq quant tables
* split ggml_quantize_mat_q8_0/K
* split ggml_gemv_*
* split ggml_gemm_*
* rename namespace aarch64 to repack
* use weak aliases to replace test macros
* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
* rename more aarch64 to repack
* clean up rebase leftover
* fix compilation errors
* remove trailing spaces
* try to fix clang compilation errors
* try to fix clang compilation errors again
* try to fix clang compilation errors, 3rd attempt
* try to fix clang compilation errors, 4th attempt
* try to fix clang compilation errors, 5th attempt
* try to fix clang compilation errors, 6th attempt
* try to fix clang compilation errors, 7th attempt
* try to fix clang compilation errors, 8th attempt
* try to fix clang compilation errors, 9th attempt
* more cleanup
* fix compilation errors
* fix apple targets
* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K
  Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>

ggml-org#13980)

* llama : allow building all tests on windows when not using shared libraries
* add static windows build to ci
* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <[email protected]>

…org#13834)

* kv-cache : avoid modifying recurrent cells when setting inputs
* kv-cache : remove inp_s_mask
  It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.
* kv-cache : fix non-consecutive token pos warning for recurrent models
  The problem was apparently caused by how the tail cells were swapped.
* graph : simplify logic for recurrent state copies
* kv-cache : use cell without src refs for rs_z in recurrent cache
* llama-graph : fix recurrent state copy
  The `state_copy` shuffle assumes everything is moved at once, which is not true when `states_extra` is copied back to the cache before copying the range of states between `head` and `head + n_seqs`. This is only a problem if any of the cells in [`head`, `head + n_seqs`) have an `src` in [`head + n_seqs`, `head + n_kv`), which does happen when `n_ubatch > 1` in the `llama-parallel` example. Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
* llama-graph : rename n_state to state_size in build_recurrent_state
  This naming should reduce confusion between the state size and the number of states.

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

)

* CUDA: add bf16 and f32 support to cublas_mul_mat_batched
* Review: add type traits and make function more generic
* Review: make check more explicit, add back comments, and fix formatting
* Review: fix formatting, remove useless type conversion, fix naming for bools

* vulkan: Add fusion support for RMS_NORM+MUL
  - Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
  - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
  - Add detection logic and basic fusion logic in ggml-vulkan.
  - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups style fixes
* last node doesn't need single use. fix type. handle mul operands being swapped.
* remove redundant parameter

---------

Co-authored-by: slaren <[email protected]>
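The use-count check described above (fuse RMS_NORM into the following MUL only when the RMS_NORM output has a single consumer, with the MUL operands allowed in either order) can be sketched on a toy graph. This is an illustration under stated assumptions, not the ggml implementation: the `Node` struct, `count_uses`, and `can_fuse_rms_norm_mul` are hypothetical names, and the real code tracks use counts through the graph's hash table rather than a plain vector.

```cpp
#include <cstddef>
#include <vector>

// Toy graph representation (hypothetical, for illustration only).
enum class Op { NONE, RMS_NORM, MUL };

struct Node {
    Op  op;
    int src0;  // index of first operand, or -1 if unused
    int src1;  // index of second operand, or -1 if unused
};

// Count how many times each node is referenced as an operand.
static std::vector<int> count_uses(const std::vector<Node> & graph) {
    std::vector<int> uses(graph.size(), 0);
    for (const Node & n : graph) {
        if (n.src0 >= 0) uses[static_cast<size_t>(n.src0)]++;
        if (n.src1 >= 0) uses[static_cast<size_t>(n.src1)]++;
    }
    return uses;
}

// Node i can be fused with node i + 1 when i is an RMS_NORM, i + 1 is a MUL
// that consumes it (as either operand), and nothing else consumes node i.
static bool can_fuse_rms_norm_mul(const std::vector<Node> & graph,
                                  const std::vector<int> & uses, size_t i) {
    if (i + 1 >= graph.size()) return false;
    const Node & norm = graph[i];
    const Node & mul  = graph[i + 1];
    if (norm.op != Op::RMS_NORM || mul.op != Op::MUL) return false;
    const int idx = static_cast<int>(i);
    const bool consumed = (mul.src0 == idx) || (mul.src1 == idx);
    return consumed && uses[i] == 1;
}
```

If a later node also reads the RMS_NORM result, its use count rises to 2 and the sketch refuses to fuse, since the intermediate value would still be needed.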

* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci] ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (ggml-org#14181)
  * implement GLU for split up/gate
  * add tests for ggml_glu_split
  * Vulkan: Implement glu_split logic and shader support
  * add split to logging [no ci]
  * SYCL: refactor element_size ops and add split up and gate support to gated kernels
  * SYCL: switch GEGLU to use tanh approximation

  ---------

  Co-authored-by: 0cc4m <[email protected]>
  Co-authored-by: Akarshan <[email protected]>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
  This commit refactors the SYCL element-wise operations to improve performance by:
  - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
  - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
  - Replacing direct kernel calls with calls to these inlined functions.
  - Using `__dpct_inline__` to encourage compiler inlining.
  - Minor code cleanup and consistency improvements.

  The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (ggml-org#14345)
  * vulkan: Increase workgroup size for GLU, for performance
  * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap ggml-ci

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Akarshan <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
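For readers unfamiliar with the gated ops named above, here is a scalar reference sketch of REGLU, GEGLU, and SWIGLU in both the split form (separate gate and up inputs) and the single-tensor form with an optional swap. It uses the conventional definitions (activation of the gate multiplied element-wise by the up value) and the tanh approximation of GELU. The function names and the assumption that the first half of a fused row is the gate unless swapped are illustrative and are not taken from the ggml sources.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

enum class GluOp { REGLU, GEGLU, SWIGLU };

static float relu(float x) { return x > 0.0f ? x : 0.0f; }
static float silu(float x) { return x / (1.0f + std::exp(-x)); }
static float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

static float activate(GluOp op, float x) {
    switch (op) {
        case GluOp::REGLU:  return relu(x);
        case GluOp::GEGLU:  return gelu_tanh(x);
        case GluOp::SWIGLU: return silu(x);
    }
    return x;  // unreachable for valid enum values
}

// Split form: gate and up are separate inputs of equal length.
static std::vector<float> glu_split(GluOp op, const std::vector<float> & gate,
                                    const std::vector<float> & up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        out[i] = activate(op, gate[i]) * up[i];
    }
    return out;
}

// Single-tensor form: one row holds both halves. ASSUMPTION: the first half
// is the gate unless `swapped` is set, in which case the roles are exchanged.
static std::vector<float> glu_fused(GluOp op, const std::vector<float> & row, bool swapped) {
    const size_t n = row.size() / 2;
    std::vector<float> first(row.begin(), row.begin() + n);
    std::vector<float> second(row.begin() + n, row.begin() + 2 * n);
    return swapped ? glu_split(op, second, first) : glu_split(op, first, second);
}
```

The output row has half the width of a fused input row, which is why the fused and split forms can share the same inner loop.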

…eature), from command line and from client (ggml-org#13196)

* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* coding standard: common/chat.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* Apply suggestions from code review
  coding standard: cosmetic changes
  Co-authored-by: Georgi Gerganov <[email protected]>
* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Olivier Chafik <[email protected]>

* Update docker.yml
  Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.
* Remove redundant include path in CMakeLists.txt
  The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
* Enable scheduled Docker image builds
  Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
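The difference between the "align corners" mode and the default sampling can be shown with a one-dimensional linear resize. This is a minimal sketch using the conventional coordinate mappings (corner-aligned versus half-pixel centers); the function `resize_linear_1d` is hypothetical, and the ggml kernels may differ in detail.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1-D linear resize. Works for both upscaling and downscaling.
static std::vector<float> resize_linear_1d(const std::vector<float> & in,
                                           size_t out_size, bool align_corners) {
    std::vector<float> out(out_size, 0.0f);
    const size_t n = in.size();
    if (n == 0) return out;
    for (size_t i = 0; i < out_size; ++i) {
        float src;
        if (align_corners) {
            // First and last output samples land exactly on the first and last inputs.
            src = out_size > 1 ? static_cast<float>(i) * (n - 1) / (out_size - 1) : 0.0f;
        } else {
            // Half-pixel centers: sample positions are scaled about pixel centers.
            src = (static_cast<float>(i) + 0.5f) * n / out_size - 0.5f;
        }
        src = std::max(0.0f, std::min(src, static_cast<float>(n - 1)));
        const size_t x0 = static_cast<size_t>(std::floor(src));
        const size_t x1 = std::min(x0 + 1, n - 1);
        const float  t  = src - static_cast<float>(x0);
        out[i] = in[x0] + t * (in[x1] - in[x0]);
    }
    return out;
}
```

With corner alignment, resizing `{0, 10}` to three samples gives the endpoints unchanged and the midpoint in between; the same function also handles downscaling, for example from three samples to two.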

…4411)

* [CANN]update to aclnnGroupedMatmulV2
  Signed-off-by: noemotiovon <[email protected]>
* Support MUL_MAT_ID on 310p
  Signed-off-by: noemotiovon <[email protected]>
* fix editorconfig
  Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>

* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
* Return previous callback to allow callback chaining
* style fixes

---------

Co-authored-by: Diego Devesa <[email protected]>
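The chaining pattern mentioned above (the setter returns the previously installed callback so a new handler can forward to it) can be sketched in isolation. All names here (`abort_callback_t`, `set_abort_callback`, `report_fatal`) are hypothetical stand-ins and do not call into ggml; a real fatal-error path would terminate the process after invoking the callback.

```cpp
#include <string>

// Hypothetical callback type: receives the error message before the abort.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Install a new callback and return the previous one so callers can chain.
static abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the fatal-error path. It only notifies the callback here;
// a real implementation would call abort() afterwards.
static void report_fatal(const char * message) {
    if (g_abort_callback) g_abort_callback(message);
}

// Example handlers: the second one forwards to whatever was installed before it.
static std::string      g_log;
static abort_callback_t g_prev = nullptr;

static void first_handler(const char * msg) {
    g_log += std::string("first:") + msg + ";";
}

static void second_handler(const char * msg) {
    g_log += std::string("second:") + msg + ";";
    if (g_prev) g_prev(msg);  // chain to the previously installed handler
}
```

An application without a console could use the same shape to show a dialog or flush unsaved data in its handler, then forward to any handler a library had already installed.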

shink and others added 30 commits July 2, 2025 12:24

CANN: Enable labeler for Ascend NPU (ggml-org#13914)

c8e213d

add geglu activation function (ggml-org#14074)

bbd51ae

Co-authored-by: dinhhuy <[email protected]>

server : fix LRU check (ggml-org#14079)

0363bd9

ggml-ci

graph : fix geglu (ggml-org#14077)

652d610

ggml-ci

cuda : fix device sync on buffer clear (ggml-org#14033)

4eebf7c

kv-cache : fix shift and defrag logic (ggml-org#14081)

6341356

* kv-cache : fix shift ggml-ci
* cont : reset shift[i] ggml-ci
* cont : fix defrag erasing cells that didn't move ggml-ci

metal : use less stack memory in FA kernel (ggml-org#14088)

9b90308

* metal : use less stack memory in FA kernel ggml-ci
* cont : fix BF16 variant

Add in-build ggml::ggml ALIAS library (ggml/1260)

5d9c218

Enable uniform linking with subproject and with find_package.

sync : ggml

f6eca5c

ggml-ci

rpc : nicer error messages for RPC server crash (ggml-org#14076)

e227eef

Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)

2375744

ggml : fix weak alias win32 (whisper/0)

b4d1bcb

ggml-ci

sync : ggml

c13372c

ggml-ci

Fixed spec timings to: accepted/tested instead of accepted/drafted (ggml-org#14104)

50162e6

vulkan: force device 0 in CI (ggml-org#14106)

11d3265

llama : support GEGLU for jina-bert-v2 (ggml-org#14090)

627669e

convert : fix duplicate key DeepSeek-R1 conversion error (ggml-org#14103)

34641af

opencl: add mul_mv_id_q4_0_f32_8x_flat (ggml-org#14003)

39d4a90

kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (ggml-org#14121)

038e0ef

server : pass default --keep argument (ggml-org#14120)

5043576

kv-cache : relax SWA masking condition (ggml-org#14119)

5fa57da

ggml-ci

webui: Wrap long numbers instead of infinite horizontal scroll (ggml-org#14062)

3539498

* webui: Wrap long numbers instead of infinite horizontal scroll
* Use tailwind class
* update index.html.gz

am17an and others added 25 commits July 2, 2025 12:28

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (ggml-org#14443)

0f5b1fd

SYCL: disable faulty fp16 exp kernel (ggml-org#14395)

54caf5d

* SYCL: disable faulty fp16 CPU exponent for now
* Revert "SYCL: disable faulty fp16 CPU exponent for now"
  This reverts commit ed0aab1.
* SYCL: disable faulty fp16 CPU exponent for now
* Fix logic of disabling exponent kernel

server : fix appearance of the chats list context menu for Safari (ggml-org#14322)

71c0d60

scripts : make the shell scripts cross-platform (ggml-org#14341)

0d0ef3e

test-backend-ops : disable llama test (ggml-org#14461)

52d0667

ggml-cpu: sycl: Re-enable exp f16 (ggml-org#14462)

195134c

metal : disable fast-math for some cpy kernels (ggml-org#14460)

89e5342

* metal : disable fast-math for some cpy kernels ggml-ci
* cont : disable for q4_1 ggml-ci
* cont : disable for iq4_nl ggml-ci

memory : correctly handle failure in apply() (ggml-org#14438)

3e39a42

ggml-ci

Add Conv2d for CPU (ggml-org#14388)

a6b9824

* Conv2D: Add CPU version
* Half decent
* Tiled approach for F32
* remove file
* Fix tests
* Support F16 operations
* add assert about size
* Review: further formatting fixes, add assert and use CPU version of fp32->fp16

opencl : add GEGLU, REGLU, SWIGLU (ggml-org#14456)

780ba6d

ggml-quants : rename best_mad to best_error (ggml/1283)

af33c35

This commit renames the variable `best_mad` to `best_error` in the `make_qkx2_quants` function. The motivation for this is that the name `best_mad` can be somewhat confusing if mean absolute deviation (MAD) is not in use.

sync : ggml

f7ca5cc

ggml-ci

ggml : remove trailing whitespace (#0)

afe880b

add GELU_ERF (ggml-org#14455)

edd05a2

vulkan: Split large mul_mat_id to fit in shared memory (ggml-org#14451)

80b5906

Add Vulkan images to docker.md (ggml-org#14472)

530c9a9

Right now it's not easy to find those.

ci : disable fast-math for Metal GHA CI (ggml-org#14478)

244305f

* ci : disable fast-math for Metal GHA CI ggml-ci
* cont : remove -g flag ggml-ci

qnixsynapse force-pushed the update-dev-from-master-2025-07-02-06-38 branch from 68b3cd6 to 6399ac4 Compare July 2, 2025 06:58

Minh141120 self-requested a review July 2, 2025 07:46

Minh141120 approved these changes Jul 2, 2025

Minh141120 merged commit 0e28dd9 into dev Jul 2, 2025
9 checks passed

Minh141120 deleted the update-dev-from-master-2025-07-02-06-38 branch July 2, 2025 07:47