[pull] master from ggml-org:master #407


Open
wants to merge 89 commits into base: master

Conversation

pull[bot]

@pull pull bot commented Jun 4, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 2 commits June 4, 2025 18:58
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
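The "memory : improve status handling" item above can be pictured with a tiny sketch. The enum values and helper below are hypothetical, not the actual llama.cpp API; they only illustrate the idea of propagating the most severe status out of an update/defrag step.

```cpp
#include <cstdio>

// Hypothetical status enum for illustration; the real llama.cpp names differ.
enum memory_status {
    MEMORY_STATUS_SUCCESS,
    MEMORY_STATUS_NO_UPDATE,      // nothing to do (e.g. no defrag needed)
    MEMORY_STATUS_FAILED_PREPARE, // could not prepare the update (recoverable)
    MEMORY_STATUS_FAILED_COMPUTE, // update failed during compute (fatal)
};

// Combine the status of several sub-operations: the most severe one wins
// (the enum is ordered from least to most severe).
static memory_status memory_status_combine(memory_status a, memory_status b) {
    return a > b ? a : b;
}

int main() {
    const memory_status s = memory_status_combine(MEMORY_STATUS_SUCCESS, MEMORY_STATUS_NO_UPDATE);
    printf("combined status: %d\n", (int) s);
    return 0;
}
```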
* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
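For readers unfamiliar with the op added in the commit above, the math of a transposed 1D convolution can be summarized with a tiny CPU reference. The single-channel routine below is only a sketch; the real ggml-vulkan op also handles multiple input/output channels and ggml's tensor layout.

```cpp
#include <cstdio>
#include <vector>

// Naive single-channel CONV_TRANSPOSE_1D with stride s and no padding:
// each input element scatters a scaled copy of the kernel into the output.
// Output length is (n_in - 1) * s + n_kernel.
static std::vector<float> conv_transpose_1d(
        const std::vector<float> & input,
        const std::vector<float> & kernel,
        int stride) {
    const int n_in = (int) input.size();
    const int n_k  = (int) kernel.size();
    std::vector<float> out((n_in - 1) * stride + n_k, 0.0f);
    for (int i = 0; i < n_in; ++i) {
        for (int k = 0; k < n_k; ++k) {
            out[i * stride + k] += input[i] * kernel[k];
        }
    }
    return out;
}

int main() {
    const auto out = conv_transpose_1d({1, 2, 3}, {1, 1}, 2);
    for (float v : out) printf("%g ", v); // prints: 1 1 2 2 3 3
    printf("\n");
    return 0;
}
```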
@github-actions github-actions bot added the build label Jun 5, 2025
ggerganov and others added 4 commits June 5, 2025 15:29
…4006)

* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
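Structurally, "merge llama_kv_cache into llama_memory" means the KV cache becomes one implementation of a common memory interface. The sketch below is illustrative only; the method names and classes are assumptions, not the actual llama.cpp API.

```cpp
#include <cstdint>
#include <memory>

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

// Hypothetical memory interface: operations a context needs from its memory.
struct llama_memory_i {
    virtual ~llama_memory_i() = default;

    virtual void clear() = 0;
    virtual bool seq_rm (llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    virtual void seq_cp (llama_seq_id src, llama_seq_id dst, llama_pos p0, llama_pos p1) = 0;
    virtual llama_pos seq_pos_max(llama_seq_id seq_id) const = 0;
};

// A KV cache is then just one implementation of that interface (stubbed here).
struct kv_cache_sketch : public llama_memory_i {
    void clear() override {}
    bool seq_rm (llama_seq_id, llama_pos, llama_pos) override { return true; }
    void seq_cp (llama_seq_id, llama_seq_id, llama_pos, llama_pos) override {}
    llama_pos seq_pos_max(llama_seq_id) const override { return -1; }
};

int main() {
    std::unique_ptr<llama_memory_i> mem = std::make_unique<kv_cache_sketch>();
    mem->clear();
    return 0;
}
```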
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <[email protected]>
…#14001)

* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
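The commit above gates cooperative-matrix use on Intel Xe2-class GPUs (B580, Core Ultra 9 288V). The fragment below only sketches the general shape of such a check using the standard Vulkan properties query; the device-name matching is purely illustrative and the actual ggml-vulkan detection logic differs.

```cpp
#include <vulkan/vulkan.h>
#include <cstring>

// Rough sketch: enable coopmat paths only for Intel Xe2-class GPUs.
// Name matching below is an assumption for illustration only.
static bool is_intel_xe2(VkPhysicalDevice device) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(device, &props);

    if (props.vendorID != 0x8086) { // Intel PCI vendor ID
        return false;
    }

    // Assumed identifiers for Battlemage (B580) and Lunar Lake (U9-288V) parts.
    return strstr(props.deviceName, "B580") != nullptr ||
           strstr(props.deviceName, "288V") != nullptr;
}
```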
* add add_classifier_output_labels

* use add_classifier_output_labels
@github-actions github-actions bot added the python label Jun 5, 2025
ggerganov added 2 commits June 6, 2025 13:29
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
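The "flag for optional data clear" in the commit above separates a cheap metadata reset from a full wipe of the buffers. The sketch below is an assumption about the shape of that flag, not the real llama_memory_clear implementation.

```cpp
#include <cstdio>

// Sketch: clearing only the metadata invalidates the cache cheaply,
// while clear_data = true also zeroes the underlying buffers.
struct kv_cell { int pos; int seq_id; };

struct memory_sketch {
    kv_cell cells[8] = {};
    float   data[8]  = {};

    void clear(bool clear_data) {
        for (auto & c : cells) {
            c.pos    = -1; // mark every cell as unused
            c.seq_id = -1;
        }
        if (clear_data) {
            for (auto & v : data) {
                v = 0.0f; // optionally wipe the stored values as well
            }
        }
    }
};

int main() {
    memory_sketch mem;
    mem.clear(/*clear_data=*/false); // fast path: metadata only
    mem.clear(/*clear_data=*/true);  // also zero the buffers
    printf("cleared\n");
    return 0;
}
```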
CISC and others added 2 commits June 7, 2025 14:13
* SYCL: Implement few same quantized type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
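The template refactor described above replaces per-type copy kernels with one generic routine, since copying between identical quantized types is just a block copy. The block struct below is a stand-in for illustration; the real block types and kernel signatures live in ggml's SYCL backend.

```cpp
#include <cstdint>

// Stand-in block type; the real block_q8_0 is defined in ggml.
struct block_q8_0 {
    uint16_t d;      // scale (fp16 bits)
    int8_t   qs[32]; // quantized values
};

// One template replaces cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0, ...:
// same quantized type on both sides, so a plain block copy preserves the data.
template <typename block_t>
static void cpy_blck_q_q(const char * cx, char * cdst) {
    const block_t * src = reinterpret_cast<const block_t *>(cx);
    block_t       * dst = reinterpret_cast<block_t *>(cdst);
    *dst = *src;
}

int main() {
    block_q8_0 a = {}, b = {};
    cpy_blck_q_q<block_q8_0>(reinterpret_cast<const char *>(&a),
                             reinterpret_cast<char *>(&b));
    return 0;
}
```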
@github-actions github-actions bot added the SYCL label Jun 7, 2025
ggerganov and others added 30 commits June 12, 2025 11:50
* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments
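With the logits_all flag gone, the number of outputs in the commit above can be derived from per-token output flags alone. The struct below mirrors that idea but is not the exact llama_batch definition.

```cpp
#include <cstdint>

// Illustrative batch view: per-token output flags, nullptr = "last token only".
struct batch_view {
    int32_t        n_tokens;
    const int8_t * output;
};

static int32_t count_outputs(const batch_view & batch) {
    if (batch.output == nullptr) {
        return batch.n_tokens > 0 ? 1 : 0; // default: only the last token
    }
    int32_t n_outputs = 0;
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        n_outputs += batch.output[i] != 0;
    }
    return n_outputs;
}

int main() {
    const int8_t flags[4] = {0, 0, 1, 1};
    const batch_view b = {4, flags};
    return count_outputs(b) == 2 ? 0 : 1;
}
```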
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index gets
changed.

* cmake: generate build-info.cpp in build dir
Update oneMath commit to merged PR uxlfoundation/oneMath#669
which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.

```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT

* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
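The batch allocator rework above pulls validation and output counting into one class that owns the prepared batch. The class below is only a sketch of that structure, with an interface that is assumed rather than copied from llama_batch_allocr.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Illustrative token entry; pos/seq_id layout differs from the real llama_batch.
struct token_entry {
    int32_t token;
    int32_t seq_id;
    bool    output;
};

class batch_allocr_sketch {
public:
    bool init(std::vector<token_entry> batch, int32_t n_vocab) {
        // validation lives inside the class instead of being scattered around
        for (const auto & e : batch) {
            if (e.token < 0 || e.token >= n_vocab) {
                fprintf(stderr, "invalid token id %d\n", e.token);
                return false;
            }
            if (e.seq_id < 0) {
                fprintf(stderr, "invalid seq_id %d\n", e.seq_id);
                return false;
            }
        }
        tokens = std::move(batch);

        // output counting also moved into the class
        n_outputs = 0;
        for (const auto & e : tokens) {
            n_outputs += e.output ? 1 : 0;
        }
        return true;
    }

    int32_t get_n_outputs() const { return n_outputs; }

private:
    std::vector<token_entry> tokens;
    int32_t n_outputs = 0;
};

int main() {
    batch_allocr_sketch allocr;
    const bool ok = allocr.init({{1, 0, false}, {2, 0, true}}, /*n_vocab=*/32000);
    printf("ok=%d n_outputs=%d\n", ok, allocr.get_n_outputs());
    return 0;
}
```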
* Update multimodal.md

* Update multimodal.md
* batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

* cont : improve seq_id display
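An opt-in switch like the LLAMA_BATCH_DEBUG variable above typically reads the environment once and gates extra per-batch logging on it. The mechanism below is a generic sketch; the actual output format in llama.cpp is different.

```cpp
#include <cstdio>
#include <cstdlib>

// Read LLAMA_BATCH_DEBUG once; 0 (or unset) disables the extra logging.
static int batch_debug_level() {
    static const int level = [] {
        const char * env = getenv("LLAMA_BATCH_DEBUG");
        return env ? atoi(env) : 0;
    }();
    return level;
}

int main() {
    if (batch_debug_level() > 0) {
        fprintf(stderr, "batch debug enabled (level %d)\n", batch_debug_level());
    }
    return 0;
}
```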
* vocab : prevent integer overflow during load

* Add static cast and GGML_ABORT

---------

Co-authored-by: Georgi Gerganov <[email protected]>
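The overflow guard above boils down to range-checking a wide value before narrowing it with a static_cast. The commit uses GGML_ABORT on failure; the self-contained sketch below uses plain abort() instead, and the "size read from a model file" framing is an assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Check that a 64-bit value fits in int32_t before narrowing it.
static int32_t checked_narrow(uint64_t value) {
    if (value > (uint64_t) INT32_MAX) {
        fprintf(stderr, "value %llu does not fit in int32_t\n", (unsigned long long) value);
        abort(); // the real code aborts via GGML_ABORT with a message
    }
    return static_cast<int32_t>(value);
}

int main() {
    const int32_t n_tokens = checked_narrow(32000u); // fine, well within range
    printf("n_tokens = %d\n", n_tokens);
    return 0;
}
```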
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
Currently, when a model generates output that looks like a tool call but is
invalid, an exception is thrown and not handled, causing the CLI or
llama-server to bail. Instead, handle the chat parser exception and simply
return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <[email protected]>
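The fix described above amounts to catching the parser exception and degrading to plain text. The parser and result struct below are hypothetical stand-ins used only to show the control flow.

```cpp
#include <stdexcept>
#include <string>

struct chat_msg { std::string content; bool is_tool_call = false; };

static chat_msg parse_tool_call(const std::string & text); // may throw

// Instead of letting the exception escape and take down the CLI/server,
// fall back to returning the raw generated text.
static chat_msg parse_response(const std::string & text) {
    try {
        return parse_tool_call(text);
    } catch (const std::exception &) {
        return { text, false }; // malformed tool call: degrade gracefully
    }
}

// Toy parser for the sketch: anything not starting with '{' is invalid.
static chat_msg parse_tool_call(const std::string & text) {
    if (text.empty() || text[0] != '{') {
        throw std::runtime_error("not a tool call");
    }
    return { text, true };
}

int main() {
    const chat_msg msg = parse_response("<tool_call>oops");
    return msg.is_tool_call ? 1 : 0; // returns 0: fell back to plain text
}
```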
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
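The position auto-generation and multi-sequence verification above can be summarized as: continue each sequence from its last known position when no positions are supplied, and reject positions that do not follow on from that sequence. The logic below is simplified and illustrative only.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct tok { int32_t seq_id; int32_t pos; }; // pos == -1 => not provided

static bool prepare_positions(std::vector<tok> & batch,
                              std::unordered_map<int32_t, int32_t> & seq_pos_max) {
    for (auto & t : batch) {
        const auto it = seq_pos_max.find(t.seq_id);
        const int32_t expected = (it == seq_pos_max.end()) ? 0 : it->second + 1;

        if (t.pos == -1) {
            t.pos = expected;   // auto-generate the position
        } else if (t.pos != expected) {
            return false;       // out-of-order position for this sequence
        }
        seq_pos_max[t.seq_id] = t.pos;
    }
    return true;
}

int main() {
    std::unordered_map<int32_t, int32_t> seq_pos_max = {{0, 4}}; // seq 0 already has 5 tokens
    std::vector<tok> batch = {{0, -1}, {1, -1}, {0, -1}};
    return prepare_positions(batch, seq_pos_max) ? 0 : 1; // seq 0 -> 5, 6; seq 1 -> 0
}
```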
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The model architecture is called "dots.llm1" (I decided to shorten it to
dots1 or DOTS1 in the code generally).

The only models that follow this architecture, as of the writing of this
commit, are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
…nd port (#14180)

Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <[email protected]>
* Add Arcee AFM support

* Add draft update code

* Fix linter and update URL, may still not be final

* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Remove accidental blank line

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>