Sync master with upstream release b6880 #308

jan-service-account · 2025-10-30T00:36:15Z

Updates dev branch with latest release (b6880) from ggml-org/llama.cpp

* sycl: add RMS_NORM_BACK operation support
* sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes
* sycl: add RMS_NORM_BACK support

  Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv).
* revert: restore .gitignore and tools/run/CMakeLists.txt to upstream
* revert: restore tests/CMakeLists.txt to upstream
* sycl: optimize rms_norm_back
* fix: restore SYCL.csv to correct state with RMS_NORM_BACK support
* Update ggml/src/ggml-sycl/norm.cpp

  Co-authored-by: Neo Zhang Jianyu <[email protected]>
* fix: remove trailing whitespace and add missing newline (EditorConfig)

---------

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* CUDA: Fix bug in topk-moe for gpt-oss

  When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it doesn't fuse in the actual gpt-oss.
* fix for qwen3 too
* change ifndef to ifdef

* sync minja.hpp

  Adds Call/EndCall support, used in MiniCPM3 and MiniCPM4-MCP.
* remove spurious semicolon
* sync from ochafik/minja

* CUDA: use fastdiv in set-rows
* add assert about value fitting in u32
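As background for the fastdiv commit, the sketch below shows the general idea on the host: division by a fixed divisor is replaced by a multiply-high, an add, and a shift, using constants precomputed once per divisor. The struct and function names (`FastDiv`, `make_fastdiv`, `fastdiv`, `fastmod`) are hypothetical and are not claimed to match the helpers in the CUDA backend; the assert mirrors the commit's point that values must fit in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d (hypothetical struct).
struct FastDiv {
    uint32_t mp;  // magic multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
    uint32_t d;   // original divisor, kept for computing remainders
};

// Takes a 64-bit value so the "fits in u32" requirement can be checked explicitly.
FastDiv make_fastdiv(uint64_t d64) {
    assert(d64 != 0 && d64 <= UINT32_MAX);  // divisor must be non-zero and fit in u32
    const uint32_t d = static_cast<uint32_t>(d64);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;  // smallest L with 2^L >= d
    }
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return {static_cast<uint32_t>(mp), L, d};
}

// n / d without a division instruction.
uint32_t fastdiv(uint32_t n, const FastDiv& fd) {
    const uint64_t hi = (static_cast<uint64_t>(n) * fd.mp) >> 32;  // high 32 bits of n * mp
    return static_cast<uint32_t>((hi + n) >> fd.L);                // 64-bit sum avoids overflow
}

// n % d derived from the fast quotient.
uint32_t fastmod(uint32_t n, const FastDiv& fd) {
    return n - fastdiv(n, fd) * fd.d;
}
```

On the GPU the high-half multiply would typically map to an intrinsic such as `__umulhi`. The constants are computed once on the host and passed to the kernel, so the per-element index arithmetic avoids comparatively slow integer division.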

* hexagon: remove dspqueue callbacks and do all read processing inplace
* hexagon: there is no need to ref/deref the buffers at this point

  We're not going to release the buffers without flushing the session queue. So there is no need to inc/dec the refcounts for every request. We also don't need to include those bufs in the response.
* hexagon: bump the thread count in the adb wrapper scripts

  We can use more CPU cores now that the dedicated dspqueue polling threads are not used (i.e. no contention). Also enable more aggressive polling for now since we still map Flash Attention (and a few other kernels) to the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs.
* hexagon: add lhez as the second code owner

* vulkan: add mmq q2_k integer dot support
* Refactor mmq caching
* Reduce mmq register use
* Load 4 quant blocks into shared memory in one step
* Pack q2_k blocks into caches of 32
* Use 32-bit accumulators for integer dot matmul
* Add q4_k mmq
* Add q3_k mmq
* Add q5_k mmq
* Add q6_k mmq
* Add mxfp4 mmq, enable MMQ MUL_MAT_ID
* Fix mmv dm loads

…#16656)

* vulkan: Update topk_moe fusion to handle gpt's late softmax

  Based on ggml-org#16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in ggml-org#16655
* Update ggml/src/ggml-impl.h

  Co-authored-by: Diego Devesa <[email protected]>

* llama: store mrope data in KV cell
* correct x,y ordering
* address review comments
* add consistency checks
* Update src/llama-kv-cache.cpp

  Co-authored-by: Georgi Gerganov <[email protected]>
* add TODO
* fix asan error
* kv-cells : improve ext handling
* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>

This pattern appears in a lot of models: the rope operation is applied right before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32. Sometimes rope uses three srcs and needs the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping the write of the roped K value helps to allow more nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph.

Add new backend tests.
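To illustrate the idea behind a write mask for fused ops, the sketch below works on a toy graph: for a fused run of consecutive nodes, a bit is set for each node whose result still has to be written to memory, either because it is the final output or because a node outside the run reads it. The `Node` struct and `compute_write_mask` function are hypothetical and are not the Vulkan backend's data structures; the real fused_ops_write_mask may be computed differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a graph node (hypothetical; not ggml's tensor struct).
struct Node {
    std::vector<int> srcs;  // indices of the nodes this node reads from
};

// For the fused run [first, first + count), return a bitmask in which bit i
// is set if node first+i must write its result to memory.
uint32_t compute_write_mask(const std::vector<Node>& graph, int first, int count) {
    assert(count >= 1 && count <= 32);
    uint32_t mask = uint32_t{1} << (count - 1);  // the final output is always written
    for (int j = 0; j < static_cast<int>(graph.size()); ++j) {
        if (j >= first && j < first + count) {
            continue;  // consumers inside the fused run read the value in-kernel
        }
        for (int s : graph[j].srcs) {
            if (s >= first && s < first + count) {
                mask |= uint32_t{1} << (s - first);  // an outside consumer needs it in memory
            }
        }
    }
    return mask;
}
```

For a run such as ROPE, VIEW, SET_ROWS where only the SET_ROWS result is consumed later, just the last bit is set, so the roped intermediate does not need its own memory write.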

YaelLogic and others added 11 commits October 29, 2025 14:14

vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (ggml-o…

f549b00

…rg#16793) This lets the copy to the destination device use the host-visible vidmem optimization.

vendor : sync minja (ggml-org#16500)

144a4ce

* sync minja.hpp

  Adds Call/EndCall support, used in MiniCPM3 and MiniCPM4-MCP.
* remove spurious semicolon
* sync from ochafik/minja

CUDA: use fastdiv in set-rows (ggml-org#16834)

e41bcce

* CUDA: use fastdiv in set-rows
* add assert about value fitting in u32

llama: fix ASAN error with M-RoPE (ggml-org#16848)

3464bda

jan-service-account merged commit 27ef034 into dev Oct 30, 2025
1 check passed

jan-service-account deleted the update-dev-from-master-2025-10-30-00-36 branch October 30, 2025 00:37

9 participants
