
[pull] master from ggml-org:master #475


Open: wants to merge 103 commits into master from ggml-org:master

Conversation

pull[bot]

@pull pull bot commented Jun 4, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

@github-actions github-actions bot added the devops label Jun 4, 2025
slaren and others added 3 commits June 4, 2025 15:37
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size in tests reduced so they pass with llvmpipe.

* supports_op condition moved from unintended position
@github-actions github-actions bot added the build label Jun 5, 2025
ggerganov and others added 4 commits June 5, 2025 15:29
…4006)

* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <[email protected]>
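The change itself lives in the CMake build scripts, which are not quoted here. As a hedged illustration of the same detection idea, the C++ sketch below shells out to `nvidia-smi --query-gpu=compute_cap`; the query field, output format, and helper name are assumptions about the installed driver, not the actual CMake code.

```cpp
// Sketch: query the GPU compute capability via nvidia-smi instead of relying
// on CMake's "native" architecture detection. Assumes a driver recent enough
// to support the `compute_cap` query field; output looks like "8.6\n".
#include <cstdio>
#include <optional>
#include <string>

static std::optional<std::string> detect_cuda_arch() {
    FILE * pipe = popen("nvidia-smi --query-gpu=compute_cap --format=csv,noheader", "r");
    if (!pipe) {
        return std::nullopt;
    }
    char        buf[64] = {0};
    std::string out;
    while (fgets(buf, sizeof(buf), pipe)) {
        out += buf;
    }
    pclose(pipe);
    // "8.6" -> "86", the form CMAKE_CUDA_ARCHITECTURES expects
    std::string arch;
    for (char c : out) {
        if (c >= '0' && c <= '9') {
            arch += c;
        } else if (c == '\n') {
            break;
        }
    }
    return arch.empty() ? std::nullopt : std::optional<std::string>(arch);
}
```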
…#14001)

* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
* add add_classifier_output_labels

* use add_classifier_output_labels
@github-actions github-actions bot added the python label Jun 5, 2025
ggerganov added 2 commits June 6, 2025 13:29
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
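A minimal usage sketch of the optional data clear, assuming the flag is a boolean `data` parameter on `llama_memory_clear` and that the memory handle comes from `llama_get_memory`; check llama.h for the exact signatures.

```cpp
// Sketch: clearing the KV memory with and without wiping the data buffers.
// Assumes the post-refactor API described in this PR: llama_get_memory() to
// obtain the handle and a boolean `data` parameter on llama_memory_clear().
#include "llama.h"

static void reset_context_memory(llama_context * ctx, bool wipe_buffers) {
    llama_memory_t mem = llama_get_memory(ctx);

    // data == false: only the metadata (cells, sequence bookkeeping) is reset,
    // which is enough when the same context is reused for a new prompt.
    // data == true: the underlying buffers are cleared as well.
    llama_memory_clear(mem, /*data =*/ wipe_buffers);
}
```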
ggerganov and others added 30 commits June 13, 2025 13:47
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
* Update multimodal.md

* Update multimodal.md
* batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

* cont : improve seq_id display
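A hedged sketch of how such a debug switch is typically read once and used to gate batch dumps; only the variable name `LLAMA_BATCH_DEBUG` comes from the commit message, the level semantics and helper are assumptions.

```cpp
// Sketch: gate verbose batch dumps behind the LLAMA_BATCH_DEBUG environment
// variable. The exact levels and output format are assumptions.
#include <cstdio>
#include <cstdlib>

static int batch_debug_level() {
    static const int level = [] {
        const char * env = std::getenv("LLAMA_BATCH_DEBUG");
        return env ? std::atoi(env) : 0;
    }();
    return level;
}

// hypothetical dump helper showing per-token positions and seq ids
static void debug_print_batch(int n_tokens, const int * pos, const int * seq_id) {
    if (batch_debug_level() == 0) {
        return;
    }
    for (int i = 0; i < n_tokens; ++i) {
        std::fprintf(stderr, "  token %3d: pos = %5d, seq_id = %d\n", i, pos[i], seq_id[i]);
    }
}
```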
* vocab : prevent integer overflow during load

* Add static cast and GGML_ABORT

---------

Co-authored-by: Georgi Gerganov <[email protected]>
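A hedged sketch of the pattern named in the commit: validate that a 64-bit count read from the file fits in the 32-bit target type before the static cast, and abort loading otherwise. The variable names are illustrative; GGML_ABORT is ggml's existing fatal-error macro.

```cpp
// Sketch: bounds-check a 64-bit count before narrowing it to int32_t,
// aborting instead of silently overflowing during vocab load.
#include <cstdint>
#include <limits>

#include "ggml.h"

static int32_t checked_narrow_to_i32(uint64_t n_tokens_u64) {
    if (n_tokens_u64 > static_cast<uint64_t>(std::numeric_limits<int32_t>::max())) {
        GGML_ABORT("vocab size %llu overflows int32_t", (unsigned long long) n_tokens_u64);
    }
    return static_cast<int32_t>(n_tokens_u64);
}
```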
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
Currently, when a model generates output that looks like a tool call but is
invalid, an exception is thrown and not handled, causing the CLI or
llama-server to bail. Instead, handle the chat parser exception and simply
return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <[email protected]>
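A hedged sketch of that fallback with hypothetical names (`parse_tool_calls`, `chat_msg`); the real parser lives in the chat handling code and its types differ.

```cpp
// Sketch: fall back to returning the raw generated text when tool-call
// parsing fails, instead of letting the exception escape and take down the
// CLI/server. parse_tool_calls() and chat_msg are hypothetical stand-ins.
#include <cstdio>
#include <exception>
#include <string>

struct chat_msg {
    std::string content;        // plain text shown to the user
    std::string tool_call_json; // parsed tool call, if any
};

chat_msg parse_tool_calls(const std::string & text); // may throw on malformed output

static chat_msg handle_generation(const std::string & generated) {
    try {
        return parse_tool_calls(generated);
    } catch (const std::exception & e) {
        // malformed tool call: keep the text as-is rather than bailing out
        std::fprintf(stderr, "failed to parse tool call: %s\n", e.what());
        chat_msg msg;
        msg.content = generated;
        return msg;
    }
}
```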
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
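A hedged sketch of the position auto-generation idea: when the caller leaves the positions unset, continue each sequence from the slot right after its last stored token. The helper names and data layout are assumptions, not the actual llama_batch_allocr internals.

```cpp
// Sketch: auto-generate positions for a batch whose pos array was left null,
// continuing each sequence from the next free position. n_past_for_seq() is a
// hypothetical lookup into the memory module.
#include <cstdint>
#include <vector>

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

llama_pos n_past_for_seq(llama_seq_id seq_id); // hypothetical: tokens already stored

static std::vector<llama_pos> autogen_positions(const std::vector<llama_seq_id> & seq_ids) {
    std::vector<llama_pos> pos(seq_ids.size());
    std::vector<llama_pos> next(64, -1); // illustrative cap on the number of sequences

    for (size_t i = 0; i < seq_ids.size(); ++i) {
        const llama_seq_id s = seq_ids[i];
        if (next[s] < 0) {
            next[s] = n_past_for_seq(s); // first token of this sequence in the batch
        }
        pos[i] = next[s]++;
    }
    return pos;
}
```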
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The model architecture is called "dots.llm1" (I decided to shorten it to
dots1 or DOTS1 in the code).

The only models that follow this architecture as of the writing of this
commit are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
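Chat template detection in llama-chat.cpp generally works by looking for distinctive substrings in the model's Jinja template. The sketch below shows that general pattern with an entirely hypothetical marker string, since the actual dots.llm1 template text is not quoted in this PR.

```cpp
// Sketch: substring-based chat template detection, the general pattern used
// in llama-chat.cpp. The marker "<|dots1-marker|>" is a placeholder, NOT the
// real dots.llm1 template string.
#include <string>

enum llm_chat_template_sketch {
    LLM_CHAT_TEMPLATE_UNKNOWN_SKETCH,
    LLM_CHAT_TEMPLATE_DOTS1_SKETCH,
};

static llm_chat_template_sketch detect_template(const std::string & tmpl) {
    auto contains = [&](const char * s) { return tmpl.find(s) != std::string::npos; };

    if (contains("<|dots1-marker|>")) { // hypothetical marker
        return LLM_CHAT_TEMPLATE_DOTS1_SKETCH;
    }
    return LLM_CHAT_TEMPLATE_UNKNOWN_SKETCH;
}
```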
…nd port (#14180)

Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <[email protected]>
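A hedged sketch of how the log line might be chosen: treat an endpoint that is a socket file rather than host:port as a unix socket and print its path. The struct, field names, and heuristic are illustrative, not the llama-server implementation.

```cpp
// Sketch: print the unix socket path instead of host:port when the server is
// bound to a socket file. The is_unix_socket heuristic is illustrative only.
#include <cstdio>
#include <string>

struct listen_params_sketch {
    std::string hostname; // either a host name/IP or a unix socket path
    int         port = 8080;
};

static void log_listening(const listen_params_sketch & p) {
    const bool is_unix_socket = p.hostname.find('/') != std::string::npos ||
                                p.hostname.rfind(".sock") != std::string::npos;
    if (is_unix_socket) {
        std::printf("main: server is listening on %s - starting the main loop\n",
                    p.hostname.c_str());
    } else {
        std::printf("main: server is listening on http://%s:%d - starting the main loop\n",
                    p.hostname.c_str(), p.port);
    }
}
```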
* Add Arcee AFM support

* Add draft update code

* Fix linter and update URL, may still not be final

* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Remove accidental blank line

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* ggml-cpu : rework weak alias on apple targets

* fix powerpc detection

* fix ppc detection

* fix powerpc detection on darwin
This fixes the remaining crash in test-thread-safety on my system.
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
* convert neobert model to gguf

* add inference graph

* fix flake8 lint

* followed reviewer suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* follow reviewers suggestions

Co-authored-by: Georgi Gerganov <[email protected]>

* override NeoBERT feed-forward length

---------

Co-authored-by: dinhhuy <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* Remove install step for vulkan-shaders-gen

* Add install step to normalize msvc with make

* Regenerate modified shaders at build-time
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0, GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <[email protected]>
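A hedged usage sketch of the reworked split-mode logic: LLAMA_SPLIT_MODE_NONE, main_gpu, and llama_model_default_params() are existing llama.cpp API; the comment about negative main_gpu reflects the commit message above.

```cpp
// Sketch: request CPU-only execution under the reworked LLAMA_SPLIT_MODE_NONE
// logic, where a negative main_gpu means "do not use any GPU device".
#include "llama.h"

static llama_model_params cpu_only_params() {
    llama_model_params params = llama_model_default_params();

    params.split_mode = LLAMA_SPLIT_MODE_NONE; // keep the whole model on one device
    params.main_gpu   = -1;                    // per this PR: < 0 -> no GPU devices are used

    return params;
}
```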
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
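The commit does not quote the incorrect call, so the sketch below only illustrates the relevant accessors: llama_get_embeddings_seq() returns the pooled embedding for one sequence, while llama_get_embeddings_ith() returns the embedding of a single output token; which one applies depends on the pooling mode. These functions and LLAMA_POOLING_TYPE_NONE are existing llama.cpp API, but the helper itself is an assumption.

```cpp
// Sketch of the embedding accessors involved: per-sequence pooled embeddings
// vs. per-token output. Illustrates the API surface only; the PR does not
// spell out the exact server-side fix.
#include "llama.h"

static const float * get_embd_for_seq(llama_context * ctx, llama_seq_id seq_id, int32_t i_last_token) {
    if (llama_pooling_type(ctx) != LLAMA_POOLING_TYPE_NONE) {
        // pooled embedding for the whole sequence
        return llama_get_embeddings_seq(ctx, seq_id);
    }
    // no pooling: take the embedding of a specific output token instead
    return llama_get_embeddings_ith(ctx, i_last_token);
}
```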