forked from ggml-org/llama.cpp
Sync master with upstream release b6891 #309
Merged: jan-service-account merged 11 commits into dev from update-dev-from-master-2025-10-31-00-34 on Oct 31, 2025
Conversation
This is realised by loading the two extra floats into registers before the dot-product is computed, effectively batching their loads together with that dot-product. Because many threads are active at this point, the warp scheduler has enough work available to hide the cost of the two additional loads.
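A minimal CUDA sketch of that pattern (not the actual llama.cpp FA kernel; the buffer layout and what the two floats represent are assumptions here): the two extra loads are issued before the dot-product loop, so their latency overlaps with the arithmetic instead of stalling afterwards.

```cuda
// Sketch only: each thread preloads two per-row floats into registers *before*
// the dot-product loop, so their global-memory latency is hidden by the loop.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dot_with_preloaded_floats(const float * __restrict__ q,
                                          const float * __restrict__ k,
                                          const float * __restrict__ extra, // 2 floats per row (assumed layout)
                                          float * __restrict__ out,
                                          int n_rows, int d) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    // issue the two extra loads first; they sit in registers while the loop runs
    const float e0 = extra[2*row + 0];
    const float e1 = extra[2*row + 1];

    float sum = 0.0f;
    for (int i = 0; i < d; ++i) {
        sum += q[row*d + i] * k[row*d + i];
    }

    // the preloaded values are only consumed after the dot product
    out[row] = sum * e0 + e1;
}

int main() {
    const int n_rows = 1024, d = 128;
    float *q, *k, *extra, *out;
    cudaMallocManaged(&q,     n_rows * d * sizeof(float));
    cudaMallocManaged(&k,     n_rows * d * sizeof(float));
    cudaMallocManaged(&extra, n_rows * 2 * sizeof(float));
    cudaMallocManaged(&out,   n_rows     * sizeof(float));
    for (int i = 0; i < n_rows * d; ++i) { q[i] = 1.0f; k[i] = 0.5f; }
    for (int i = 0; i < n_rows * 2; ++i) { extra[i] = (i % 2 == 0) ? 1.0f : 0.0f; }

    dot_with_preloaded_floats<<<(n_rows + 255) / 256, 256>>>(q, k, extra, out, n_rows, d);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]); // expect 128 * 1.0 * 0.5 = 64
    return 0;
}
```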
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
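The "merged QKV" items in the list above refer to replacing three separate Q/K/V projections with one fused matmul whose output is then sliced into views. A hedged sketch of that pattern using ggml's public view API (not the actual CogVLM graph code; the layout, shapes, and helper name below are assumptions):

```cpp
// Sketch of the fused-QKV pattern: one matmul against a merged W_qkv, then
// three views into the result. Assumes the merged output is laid out as
// [3*n_embd, n_tokens] with Q, K, V stacked along dim 0; the real graph code
// (head splitting, permutes, etc.) is more involved.
#include "ggml.h"

struct qkv_views {
    struct ggml_tensor * q;
    struct ggml_tensor * k;
    struct ggml_tensor * v;
};

static struct qkv_views build_fused_qkv(struct ggml_context * ctx,
                                        struct ggml_tensor  * wqkv, // [n_embd, 3*n_embd]
                                        struct ggml_tensor  * cur,  // [n_embd, n_tokens]
                                        int64_t n_embd) {
    // single projection instead of three separate Q/K/V matmuls
    struct ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur); // [3*n_embd, n_tokens]

    const int64_t n_tokens = cur->ne[1];
    const size_t  es       = ggml_element_size(qkv);

    struct qkv_views out;
    out.q = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 0*n_embd*es);
    out.k = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 1*n_embd*es);
    out.v = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 2*n_embd*es);
    return out;
}
```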
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.
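Roughly, the refactor pattern looks like the sketch below (function signatures and data types are stand-ins, not the actual llama.cpp code): the per-chunk body is extracted into a helper, and the caller becomes a thin loop over chunks of the KV sequence.

```cpp
// Sketch of the chunking refactor only: a one-chunk worker plus an outer loop.
#include <algorithm>
#include <cstdio>
#include <vector>

// hypothetical stand-in for the factored-out body: processes KV positions
// [kv_start, kv_end) and accumulates into `acc`
static void flash_atten_f16_one_chunk(const std::vector<float> & scores,
                                      int kv_start, int kv_end, float & acc) {
    for (int i = kv_start; i < kv_end; ++i) {
        acc += scores[i];
    }
}

// outer loop that walks the KV sequence chunk by chunk
static float flash_atten_f16(const std::vector<float> & scores, int chunk_size) {
    float acc = 0.0f;
    const int n_kv = (int) scores.size();
    for (int kv_start = 0; kv_start < n_kv; kv_start += chunk_size) {
        const int kv_end = std::min(kv_start + chunk_size, n_kv);
        flash_atten_f16_one_chunk(scores, kv_start, kv_end, acc);
    }
    return acc;
}

int main() {
    std::vector<float> scores(1000, 1.0f);
    printf("sum = %f\n", flash_atten_f16(scores, 256)); // expect 1000
    return 0;
}
```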
* support qwen3vl series.

  Co-authored-by: Thireus ☠ <[email protected]>
  Co-authored-by: yairpatch <[email protected]>
  Co-authored-by: LETS-BEE <[email protected]>

* bugfix: fix the arch check for qwen3vl-moe.
* use build_ffn
* optimize deepstack structure
* optimize deepstack feature saving
* Revert "optimize deepstack feature saving" for temporal fix

  This reverts commit f321b9f.

* code clean
* use fused qkv in clip
* clean up / rm is_deepstack_layers for simplification
* add test model
* move test model to "big" section
* fix imrope check
* remove trailing whitespace
* fix rope fail
* metal : add imrope support
* add imrope support for sycl
* vulkan: add imrope w/o check
* fix vulkan
* webgpu: add imrope w/o check
* Update gguf-py/gguf/tensor_mapping.py

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <[email protected]>
Co-authored-by: yairpatch <[email protected]>
Co-authored-by: LETS-BEE <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…ing on ARM64 (ggml-org#16833)

Very similar implementation to the flash-attention chunking, with similar benefits.
* server : remove n_past
* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()
* server : fixes + clean-up
* cont : fix context shift
* server : add server_tokens::pos_next()

  Co-authored-by: Xuan-Son Nguyen <[email protected]>

* server : fix pos_next() usage

  Co-authored-by: Xuan-Son Nguyen <[email protected]>

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
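For illustration only, a toy version of the idea behind pos_next(): derive the next decode position from the token container itself rather than tracking a separate n_past counter. The real server_tokens class in the llama.cpp server is more involved (e.g. multimodal chunks), and all names and members below are assumptions.

```cpp
// Toy illustration of deriving the next position from the stored tokens
// instead of a separately maintained n_past counter.
#include <cstdint>
#include <cstdio>
#include <vector>

using llama_pos   = int32_t;
using llama_token = int32_t;

struct toy_server_tokens {
    std::vector<llama_token> tokens;

    size_t size() const { return tokens.size(); }

    // position where the next decoded token would go
    llama_pos pos_next() const { return (llama_pos) tokens.size(); }

    void push_back(llama_token t) { tokens.push_back(t); }
};

int main() {
    toy_server_tokens cache;
    for (llama_token t : {1, 2, 3}) cache.push_back(t);

    // previously this value would have been a separately tracked n_past
    printf("next position: %d\n", cache.pos_next()); // 3
    return 0;
}
```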
Updates dev branch with latest release (b6891) from ggml-org/llama.cpp