
Conversation

@jan-service-account

Updates dev branch with latest release (b6891) from ggml-org/llama.cpp

ORippler and others added 11 commits October 30, 2025 11:34
This is realised by loading them into registers before the dot-product is
computed, effectively batching their loads together with it. Since many
threads are alive here, the warp scheduler has enough threads available to
hide the cost of the two additional float loads.
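
The idea, as a minimal host-side C++ sketch (the actual change is in a CUDA kernel, and all names here are illustrative, not the real code): hoisting the two extra floats into locals before the dot-product loop lets them sit in registers while the loop runs, so their loads are issued together with the dot-product's own memory traffic instead of stalling after it.

```cpp
#include <cstddef>

// Illustrative sketch only: the real change lives in a CUDA kernel, where
// the warp scheduler hides the latency of the two extra loads by switching
// between the many threads that are alive at this point.
// 'scales' holds two per-row floats needed after the dot product; loading
// them *before* the loop batches their loads with the dot-product's own.
static float dot_with_preloaded_scales(const float * x, const float * y,
                                       const float * scales, size_t n) {
    // issue the two scalar loads up front; they stay in registers while
    // the loop below runs, instead of adding latency at the end
    const float d = scales[0];
    const float m = scales[1];

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum * d + m; // both floats are already resident when needed
}
```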
* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for CogVLM graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor (see the sketch after this list)

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
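
The merged-QKV switch referenced above is a standard graph simplification: one matmul produces the concatenated Q/K/V activations, which are then split with views instead of running three separate projections. A hedged ggml-style sketch, with tensor names and shapes assumed rather than taken from the actual CogVLM graph:

```cpp
#include "ggml.h"

// Hedged sketch of a merged-QKV projection; names/shapes are illustrative.
// wqkv: merged weight with Q, K and V projections concatenated along ne1.
static void build_merged_qkv(struct ggml_context * ctx,
                             struct ggml_tensor  * wqkv, // [n_embd, 3*n_embd]
                             struct ggml_tensor  * cur,  // [n_embd, n_tokens]
                             int64_t               n_embd,
                             struct ggml_tensor ** q,
                             struct ggml_tensor ** k,
                             struct ggml_tensor ** v) {
    // one matmul instead of three: result is [3*n_embd, n_tokens]
    struct ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur);

    const int64_t n_tokens = cur->ne[1];
    const size_t  es       = ggml_element_size(qkv);

    // split the fused result into Q, K, V as views (no copies)
    *q = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 0*n_embd*es);
    *k = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 1*n_embd*es);
    *v = ggml_view_2d(ctx, qkv, n_embd, n_tokens, qkv->nb[1], 2*n_embd*es);
}
```

Fusing the projections into one GEMM reduces launch overhead, and because the splits are views, a later `ggml_cont` can become unnecessary, which matches the clean-up commit above.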
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
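
A hedged C++ sketch of the shape of that refactor (function and parameter names are assumptions, not the actual ggml-cpu code): the per-chunk body is unchanged, and a thin outer loop walks the KV range in fixed-size chunks. The ARM64 commit further down applies the same chunking idea outside of flash attention.

```cpp
#include <algorithm>
#include <cstdint>

// assumed-name sketch; the real flash_atten_f16_one_chunk operates on
// ggml tensors and per-thread state rather than this toy params struct
struct fa_params {
    int64_t n_kv;       // total KV positions to process
    int64_t chunk_size; // positions per chunk
};

static void flash_atten_f16_one_chunk(const fa_params & p, int64_t kv0, int64_t kv1) {
    // core FA loop over KV positions [kv0, kv1) -- unchanged by the refactor
    (void) p; (void) kv0; (void) kv1;
}

static void flash_atten_f16(const fa_params & p) {
    // outer loop added on top: threads can pick up whole chunks as work
    // items instead of carving up one monolithic loop
    for (int64_t kv0 = 0; kv0 < p.n_kv; kv0 += p.chunk_size) {
        const int64_t kv1 = std::min(kv0 + p.chunk_size, p.n_kv);
        flash_atten_f16_one_chunk(p, kv0, kv1);
    }
}
```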
* support qwen3vl series.

Co-authored-by: Thireus ☠ <[email protected]>
Co-authored-by: yairpatch <[email protected]>
Co-authored-by: LETS-BEE <[email protected]>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9f.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check (see the imrope sketch after this commit list)

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <[email protected]>
Co-authored-by: yairpatch <[email protected]>
Co-authored-by: LETS-BEE <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
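
For the imrope commits in the list above, a hedged sketch of the concept (the section sizes and the modulo scheme here are assumptions, not ggml's actual code): multimodal RoPE assigns each rotary dimension pair to one of the temporal/height/width position streams; a contiguous ("mrope") layout dedicates whole blocks of pairs to each stream, while an interleaved ("imrope") layout alternates streams pair by pair.

```cpp
#include <array>
#include <cstdint>

// Hedged concept sketch, not ggml's implementation: which position stream
// (temporal / height / width) drives a given rotary dimension pair.
enum pos_stream { TEMPORAL = 0, HEIGHT = 1, WIDTH = 2 };

// contiguous sections (mrope-style): the first sections[0] pairs use the
// temporal positions, the next sections[1] use height, the rest use width
static pos_stream stream_contiguous(int64_t pair, const std::array<int64_t, 3> & sections) {
    if (pair < sections[0])               return TEMPORAL;
    if (pair < sections[0] + sections[1]) return HEIGHT;
    return WIDTH;
}

// interleaved (imrope-style): the streams alternate pair by pair
static pos_stream stream_interleaved(int64_t pair) {
    return static_cast<pos_stream>(pair % 3);
}
```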
…ing on ARM64 (ggml-org#16833)

Very similar implementation to the flash-attention chunking, with similar benefits.
* server : remove n_past

* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()

* server : fixes + clean-up

* cont : fix context shift

* server : add server_tokens::pos_next() (sketched below)

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* server : fix pos_next() usage

Co-authored-by: Xuan-Son Nguyen <[email protected]>

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
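
A hedged sketch of what a helper like `server_tokens::pos_next()` buys (a minimal version assuming plain text tokens; the real container also tracks multimodal chunks, which can occupy more than one position each): the next position is derived from the stored tokens themselves, so there is no separate `n_past` counter to keep in sync.

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;
using llama_pos   = int32_t;

// minimal stand-in for the server's token container (text-only assumption)
struct server_tokens {
    std::vector<llama_token> toks;

    // position at which the next token would be placed; deriving it from
    // the stored tokens removes the separate n_past counter that previously
    // had to be updated in lockstep (and could drift after context shifts)
    llama_pos pos_next() const {
        return (llama_pos) toks.size();
    }
};
```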
@jan-service-account jan-service-account merged commit 8e9fc48 into dev Oct 31, 2025
3 checks passed
@jan-service-account jan-service-account deleted the update-dev-from-master-2025-10-31-00-34 branch October 31, 2025 00:35