

Cherry pick 20250224 #7

Merged: 403 commits merged into master on Feb 27, 2025

Conversation

@arthw (Owner) commented Feb 26, 2025

ericcurtin and others added 30 commits February 25, 2025 18:42
More RAII mainly

Signed-off-by: Eric Curtin <[email protected]>
There is no need to use a map; just store the base pointer in the buffer
context.
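
A minimal sketch of the idea, using hypothetical names rather than the actual ggml types: the buffer context carries the base pointer directly, so no map lookup is needed.

```cpp
// Sketch only; type and member names are hypothetical.
#include <cstddef>

struct buffer_context_example {
    void * base = nullptr; // base pointer stored directly in the context
    size_t size = 0;
};

static void * buffer_get_base(const buffer_context_example & ctx) {
    return ctx.base;       // no map lookup needed
}
```
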
* Copy minja from google/minja@58f0ca6

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (google/minja#22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to google/minja@b8437df

* Update minja to google/minja#25

* Update minja from google/minja#27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* init

* add readme

* update readme

* no use make

* update readme

* update fix code

* fix editorconfig-checker

* no change convert py

* use clip_image_u8_free
ggml-org#11342)

* Factor string_join, string_split, string_repeat into common

* json: refactor to surface a versatile builder

* Update common.cpp
* main : update README documentation for batch size

* fix formatting

* minor
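
The string helpers factored into common above might look roughly like this; an illustrative sketch, not the actual common.cpp implementations.

```cpp
// Illustrative sketch of string_join / string_split / string_repeat helpers.
#include <sstream>
#include <string>
#include <vector>

static std::string string_join(const std::vector<std::string> & parts, const std::string & sep) {
    std::string out;
    for (size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) out += sep;
        out += parts[i];
    }
    return out;
}

static std::vector<std::string> string_split(const std::string & s, char sep) {
    std::vector<std::string> out;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, sep)) {
        out.push_back(item);
    }
    return out;
}

static std::string string_repeat(const std::string & s, size_t n) {
    std::string out;
    out.reserve(s.size() * n);
    for (size_t i = 0; i < n; ++i) {
        out += s;
    }
    return out;
}
```
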
With robustBufferAccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgroup dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
There appears to be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
ollama uses the hf.co/ prefix to specify Hugging Face models, just as RamaLama
uses hf://

Treat them similarly.

Signed-off-by: Eric Curtin <[email protected]>
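
A rough sketch of the prefix handling this describes; the helper name is hypothetical, not the actual llama-run code.

```cpp
// Sketch only: treat "hf://" and "hf.co/" model references the same way.
#include <string>

static std::string strip_hf_prefix(const std::string & model) {
    const std::string prefixes[] = { "hf://", "hf.co/" };
    for (const std::string & p : prefixes) {
        if (model.rfind(p, 0) == 0) {       // starts_with
            return model.substr(p.size());  // keep only "org/repo"
        }
    }
    return model;
}
```
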
* server : add more clean up when cancel_tasks is called

* fix recv_with_timeout

* std::remove_if

* fix std::remove_if
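
The std::remove_if cleanup above follows the standard erase-remove idiom; a minimal sketch with hypothetical task types:

```cpp
// Sketch only: drop cancelled tasks from a queue with std::remove_if + erase.
#include <algorithm>
#include <vector>

struct server_task_example {
    int id;
};

static void cancel_task_example(std::vector<server_task_example> & queue, int id_to_cancel) {
    queue.erase(
        std::remove_if(queue.begin(), queue.end(),
                       [id_to_cancel](const server_task_example & t) { return t.id == id_to_cancel; }),
        queue.end());
}
```
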
Most other llama.cpp CLI tools accept -ngl with a single dash.

Signed-off-by: Eric Curtin <[email protected]>
To show that -n, -ngl, and --ngl are all acceptable.

Signed-off-by: Eric Curtin <[email protected]>
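
A sketch of accepting both single-dash and double-dash spellings in an argument parser; names are illustrative, not the actual llama-run code.

```cpp
// Sketch only: accept -n, -ngl and --ngl for the GPU layer count.
#include <cstdlib>
#include <string>

static bool parse_ngl_example(const std::string & arg, const char * value, int & n_gpu_layers) {
    if (arg == "-n" || arg == "-ngl" || arg == "--ngl") {
        n_gpu_layers = std::atoi(value);
        return true;
    }
    return false;
}
```
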
Now that we have batched mat-vec mul Vulkan shaders for up to n==8,
these tests weren't actually exercising the mat-mat mul path. Test
n==9 as well. Also, change to use all_types.
…nt (ggml-org#11364)

* webui : put DeepSeek R1 CoT in a collapsible <details> element

* webui: refactor split

* webui: don't use regex to split cot and response

* webui: format+qol

* webui: no loading icon if the model isn't generating

* ui fix, add configs

* add jsdoc types

* only filter </think> for assistant msg

* build

* update build

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
For consistency

Signed-off-by: Eric Curtin <[email protected]>
…rg#11366)

See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Without this patch, compiling on different machines produced different binaries, which made verification of results difficult.

Fixes: ggml-org#11317

This patch was done while working on reproducible builds for openSUSE.
* release : pack /lib and /include in the packages

* cmake : put libs in /bin

* TMP : push artifacts

* Revert "TMP : push artifacts"

This reverts commit 4decf2c.

* ci : fix HIP cmake compiler options to be on first line

* ci : restore the original HIP commands

* ci : change ubuntu build from latest to 20.04

* ci : try to fix macos build rpaths

* ci : remove obsolete MacOS build

* TMP : push artifacts

* ci : change back to ubuntu latest

* ci : macos set build rpath to "@loader_path"

* ci : fix typo

* ci : change ubuntu package to 22.04

* Revert "TMP : push artifacts"

This reverts commit 537b09e.
* Add hipGraph support

* Enable VMM on rocm
* CANN: Add Ascend CANN build ci

* Update build.yml

* Modify cann image version

* Update build.yml

* Change to run on x86 system

* Update build.yml

* Update build.yml

* Modify format error

* Update build.yml

* Add 'Ascend NPU' label restrictions

* Exclude non PR event

Co-authored-by: Yuanhao Ji <[email protected]>

* Update build.yml

---------

Co-authored-by: Yuanhao Ji <[email protected]>
* ci : fix line breaks on windows builds

* cont : another try

* ci : fix powershell line breaks
Rohanjames1997 and others added 14 commits February 25, 2025 18:49
* ci : fix arm upload artifacts

* cont : fix archive name to use matrix
* llava: export function `clip_build_img_from_pixels` to build an image from pixels decoded by other libraries instead of stb_image.h, for better performance

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
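
A hedged usage sketch for the exported helper; the exact signature of clip_build_img_from_pixels below is an assumption for illustration, not the verified clip.h API.

```cpp
// Hypothetical usage sketch; the signature of clip_build_img_from_pixels
// is assumed here, not verified against clip.h.
#include "clip.h"   // from examples/llava

void encode_decoded_frame(const unsigned char * rgb_pixels, int nx, int ny) {
    // rgb_pixels: nx * ny * 3 bytes decoded by another library (not stb_image.h)
    clip_image_u8 * img = clip_image_u8_init();
    clip_build_img_from_pixels(rgb_pixels, nx, ny, img);  // assumed signature
    // ... continue with the usual clip preprocessing / encoding on img ...
    clip_image_u8_free(img);
}
```
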
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <[email protected]>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <[email protected]>

* ggml: remove test.py

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <[email protected]>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <[email protected]>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms (see the sketch after this commit list)

Signed-off-by: Aaron Teo <[email protected]>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <[email protected]>

* ggml : fix LoongArch compile error with 128-bit SIMD (ggml-org#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <[email protected]>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
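
One of the items above changes ggml_aligned_malloc to 256-byte alignment because that is the s390x cache line size; a minimal sketch of the idea (hypothetical helper, not the actual ggml implementation):

```cpp
// Sketch only: align allocations to the 256-byte s390x cache line,
// with a smaller example alignment elsewhere.
#include <cstddef>
#include <cstdlib>

static void * aligned_malloc_example(size_t size) {
#if defined(__s390x__)
    const size_t alignment = 256;  // s390x cache line size
#else
    const size_t alignment = 64;   // example default for other platforms
#endif
    // std::aligned_alloc requires size to be a multiple of the alignment
    const size_t padded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, padded);
}
```
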
Use the consolidated open function call from the File class. Change
read_all to to_string(). Remove exclusive locking: the intent of that
lock is to avoid multiple processes writing to the same file, which is
not an issue for readers, although we may want to consider adding a
shared lock. Remove passing nullptr as a reference; references are
never supposed to be null. clang-format the code for consistent
styling.

Signed-off-by: Eric Curtin <[email protected]>
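
A rough sketch of the RAII File pattern described above, with one consolidated open call and a to_string() helper; names are illustrative rather than the actual llama-run code.

```cpp
// Sketch only: RAII file wrapper with a single open entry point and a
// to_string() that reads the whole file. No exclusive lock is taken for
// reading; a shared lock could be added later if needed.
#include <fstream>
#include <sstream>
#include <string>

class File {
  public:
    bool open(const std::string & path) {
        stream_.open(path, std::ios::in | std::ios::binary);
        return stream_.is_open();
    }

    std::string to_string() {
        std::ostringstream out;
        out << stream_.rdbuf();   // read the entire file
        return out.str();
    }

  private:
    std::ifstream stream_;        // closed automatically by the destructor
};
```
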
* optimize performance by reordering for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimizing the graph once when computing the graph; record the opt status in tensor->extra; make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture to replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>
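
The GGML_SYCL_DISABLE_OPT debug switch mentioned above is an environment-variable toggle; a minimal sketch of that pattern (hypothetical helper name):

```cpp
// Sketch only: read the environment variable once to decide whether the
// SYCL reorder optimization should be skipped.
#include <cstdlib>
#include <cstring>

static bool sycl_opt_disabled_example() {
    static const bool disabled = [] {
        const char * v = std::getenv("GGML_SYCL_DISABLE_OPT");
        return v != nullptr && std::strcmp(v, "0") != 0;  // treat any non-"0" value as set
    }();
    return disabled;
}
```
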
* Add super wip scripts for multimodal granite gguf

Signed-off-by: Alex-Brooks <[email protected]>

* Add example for converting mmgranite to gguf

Signed-off-by: Alex-Brooks <[email protected]>

* remove hardcoded path

Signed-off-by: Alex-Brooks <[email protected]>

* Add vision feature layer to gguf params

Signed-off-by: Alex-Brooks <[email protected]>

* Clean up llava surgery and remove name substitution hacks

Signed-off-by: Alex-Brooks <[email protected]>

* Add transformers llava next tensor name mapping

Signed-off-by: Alex-Brooks <[email protected]>

* Make siglip / openclip mutually exclusive

Signed-off-by: Alex-Brooks <[email protected]>

* Fix projector linear substitution

Signed-off-by: Alex-Brooks <[email protected]>

* Fix linear 2 substitution index

Signed-off-by: Alex-Brooks <[email protected]>

* Increase max flattened gridpoints to 64

Signed-off-by: Alex-Brooks <[email protected]>

* Fix hardcoded concat for multiple feature layers

Signed-off-by: Alex-Brooks <[email protected]>

* Pull vision feature layers out of gguf keys

Signed-off-by: Alex-Brooks <[email protected]>

* fix num gridpoints and use all layers

Signed-off-by: Alex-Brooks <[email protected]>

* Avoid dropping last image encoder layer in llava models

Signed-off-by: Alex-Brooks <[email protected]>

* Use 10 for max number of patches

Signed-off-by: Alex-Brooks <[email protected]>

* Standardize vision feature layers

Signed-off-by: Alex-Brooks <[email protected]>

* Cleanup logs

Signed-off-by: Alex-Brooks <[email protected]>

* Update comment for vision feature layer init

Signed-off-by: Alex-Brooks <[email protected]>

* Update notes for alternative to legacy llm conversion script

Signed-off-by: Alex-Brooks <[email protected]>

* Fix notes rendering

Signed-off-by: Alex-Brooks <[email protected]>

* Add v prefix to vision feature layer log

Signed-off-by: Alex-Brooks <[email protected]>

* Use current defaults for feature layer

Signed-off-by: Alex-Brooks <[email protected]>

* Use constant for max gridpoints / feat layers, style fixes

Signed-off-by: Alex-Brooks <[email protected]>

* clarify non-negative feature layers

Signed-off-by: Alex-Brooks <[email protected]>

* Remove CLIP_API from func signature

Signed-off-by: Alex-Brooks <[email protected]>

* USE MAX_IMAGE_FEATURE_LAYERS const in layer calc

Signed-off-by: Alex-Brooks <[email protected]>

* Clarify feature layers are non-negative ints and not uint

Signed-off-by: Alex-Brooks <[email protected]>

* Fix condition for reading feature layers

Signed-off-by: Alex-Brooks <[email protected]>

* pop last llava layer when feature layers are unset

Signed-off-by: Alex-Brooks <[email protected]>

* Fix unset vision layer 0

Signed-off-by: Alex-Brooks <[email protected]>

* Update examples/llava/clip.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Reenable assertion for out of bounds get_rows

Signed-off-by: Alex-Brooks <[email protected]>

* Use std vector for gridpoints and feature layers

Signed-off-by: Alex-Brooks <[email protected]>

* Calculate max feature layer at load time (see the sketch after this commit list)

Signed-off-by: Alex-Brooks <[email protected]>

* Include base patch for granite vision allocation

Signed-off-by: Alex-Brooks <[email protected]>

* Fix trailing whitespace

Signed-off-by: Alex-Brooks <[email protected]>

* Add max num patches = 10 back for minicpmv

Signed-off-by: Alex-Brooks <[email protected]>

* Use unordered set to store feature layers

Co-authored-by: Xuan-Son Nguyen <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>

* Use max feature layer for postnorm

Signed-off-by: Alex-Brooks <[email protected]>

* Apply suggestions from code review

---------

Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
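
A minimal sketch of two of the ideas above: keeping vision feature layers in an unordered set and computing the deepest required encoder layer once at load time (hypothetical names, not the actual clip.cpp structures).

```cpp
// Sketch only: unordered_set of requested feature layers, with the deepest
// layer computed once at model load time.
#include <algorithm>
#include <unordered_set>

struct clip_vision_hparams_example {
    int n_layer = 27;                              // total encoder layers
    std::unordered_set<int> vision_feature_layers; // e.g. {3, 7, 15, 26}
    int max_feature_layer = -1;                    // filled in at load time
};

static int get_deepest_feature_layer(const clip_vision_hparams_example & hp) {
    // If no feature layers were requested, fall back to the full stack.
    if (hp.vision_feature_layers.empty()) {
        return hp.n_layer;
    }
    int deepest = 0;
    for (int l : hp.vision_feature_layers) {
        deepest = std::max(deepest, l);
    }
    return deepest;
}
```
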
* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <[email protected]>
Co-authored-by: Skyler Szot <[email protected]>
@arthw merged commit c69f491 into master Feb 27, 2025
50 checks passed