Tags: philipbroadway/llama.cpp

b6097

ggml: WebGPU disable SET_ROWS for now (ggml-org#15078)

* Add parameter buffer pool, batching of submissions, refactor command building/submission (see the buffer-pool sketch after this commit message)

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci
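
The "parameter buffer pool" and batched submission in the first bullet follow a generic pattern: reuse a fixed set of staging buffers for per-dispatch parameters and return the whole batch to the pool in one step once the submission completes. A minimal, library-agnostic sketch of that idea; the names are hypothetical, and the real ggml WebGPU backend manages GPU buffer objects and queue submissions, not byte vectors.

```cpp
// Sketch only: a pool of reusable parameter buffers with batched release.
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <vector>

struct param_buffer {
    std::vector<uint8_t> data;   // stands in for a GPU-visible staging buffer
};

class param_buffer_pool {
public:
    param_buffer_pool(size_t count, size_t size) {
        for (size_t i = 0; i < count; ++i) {
            free_.push_back(new param_buffer{std::vector<uint8_t>(size)});
        }
    }
    ~param_buffer_pool() {
        for (auto * b : free_) { delete b; }
    }
    // Acquire a buffer for one queued operation. Returns nullptr when the pool
    // is exhausted, in which case the caller would flush the pending batch.
    param_buffer * acquire() {
        std::lock_guard<std::mutex> lock(mutex_);   // thread-safe access, as in the commit
        if (free_.empty()) { return nullptr; }
        param_buffer * b = free_.back();
        free_.pop_back();
        return b;
    }
    // "Free staged parameter buffers at once": return the whole batch after the
    // corresponding submission has completed.
    void release_all(std::vector<param_buffer *> & staged) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.insert(free_.end(), staged.begin(), staged.end());
        staged.clear();
    }
private:
    std::mutex mutex_;
    std::vector<param_buffer *> free_;
};

int main() {
    param_buffer_pool pool(4, 256);
    std::vector<param_buffer *> staged;
    while (param_buffer * b = pool.acquire()) { staged.push_back(b); }  // stage a batch
    std::printf("staged %zu parameter buffers\n", staged.size());
    pool.release_all(staged);   // all returned in one step
}
```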

b6096

llama : add gpt-oss (ggml-org#15091)

* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (ggml-org#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (ggml-org#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (ggml-org#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf (see the E8M0 sketch after this commit message)

Co-authored-by: Diego Devesa <[email protected]>

change kvalues_mxfp4 table to match e2m1 (ggml-org#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (ggml-org#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: slaren <[email protected]>
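
Among the mxfp4 changes above, the switch from powf to an E8M0 conversion is easy to illustrate. A standalone sketch, assuming the usual E8M0 meaning (an 8-bit biased exponent with no mantissa, value = 2^(e − 127)); the actual ggml helper names and edge-case handling differ.

```cpp
// Sketch only: convert an E8M0 scale byte to float by writing the exponent
// field of an IEEE-754 float directly, instead of calling powf(2.0f, e - 127).
#include <cstdint>
#include <cstdio>
#include <cstring>

static inline float e8m0_to_float(uint8_t e) {
    // value = 2^(e - 127) for the normal range; e == 0 and e == 255 would need
    // special handling (subnormal / NaN conventions) that is omitted here.
    uint32_t bits = (uint32_t) e << 23;   // sign = 0, mantissa = 0, exponent = e
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    std::printf("%g %g %g\n", e8m0_to_float(127), e8m0_to_float(130), e8m0_to_float(120));
    // prints: 1 8 0.0078125
}
```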

b6095

chat : only remove double bos/eos if added (ggml-org#15086)

* only remove double bos/eos if added

* fix tests

b6093

sycl: fix mul_mat selection (ggml-org#15092)

b6092

Fix `glm4moe` bug (ggml-org#15088)

b6090

context : fix index overflow on huge outputs (ggml-org#15080)

* context : fix overflow when re-ordering huge outputs

* context : fix logits size overflow for huge batches
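
Both fixes address the same class of bug: a size or offset computed as a 32-bit product of per-batch counts. A generic illustration (not the llama.cpp code itself; the sizes are hypothetical):

```cpp
// Illustration of the overflow class fixed above: a logits buffer of
// n_outputs * n_vocab floats overflows 32-bit arithmetic for huge batches.
#include <cstdint>
#include <cstdio>

int main() {
    const int32_t n_outputs = 20000;   // hypothetical huge batch
    const int32_t n_vocab   = 150000;  // hypothetical vocab size

    // Wrong: n_outputs * n_vocab in 32-bit arithmetic overflows (3e9 > INT32_MAX),
    // so an index or size computed that way wraps around / is undefined behavior.
    // Right: promote one operand to 64 bits before multiplying.
    const int64_t logits_size = (int64_t) n_outputs * n_vocab;

    std::printf("required logits elements: %lld\n", (long long) logits_size);
    std::printf("fits in int32_t: %s\n", logits_size <= INT32_MAX ? "yes" : "no");
}
```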

b6089

llama : add --n-cpu-moe option (ggml-org#15077)

* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers on the CPU
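
An illustrative invocation (the model path and layer count are placeholders): `llama-cli -m model.gguf -ngl 99 --n-cpu-moe 10` offloads all layers to the GPU while keeping the expert weights of the first 10 layers in host memory.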

b6088

imatrix : warn when GGUF imatrix is saved without .gguf suffix (ggml-org#15076)

* imatrix : add warning when suffix is not .gguf for GGUF imatrix

* imatrix : only warn about suffix when output format is unspecified
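
For example (an illustrative run; file names are placeholders and the flags are assumed to be the usual llama-imatrix ones): `llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat` with no output format specified would now warn that the GGUF output lacks a .gguf suffix, while `-o imatrix.gguf` would not.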

b6087

cmake: Add GGML_BACKEND_DIR option (ggml-org#15074)

* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.

* Fix phrasing
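
An illustrative distro configure step (the install path is a placeholder): `cmake -B build -DGGML_BACKEND_DL=ON -DGGML_BACKEND_DIR=/usr/lib/ggml` builds the backends as loadable modules and points ggml at the directory where the distribution installs them.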

b6085

model: support GLM 4.5 family of models (ggml-org#14939)

* model: Add GLM 4.5 (ggml-org#14921)

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Merge in PR suggestions

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: Add GLM 4.5 family of models (ggml-org#14921)

1. Updated tensor_mapping.py with NextN tensor mappings

- Added proper tensor mappings for all NextN/MTP tensors in gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm

2. Added num_nextn_predict_layers configuration

- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config

3. Added FIM tokens for GLM4_MOE (see the sketch at the end of this entry)

- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
  - <|code_prefix|> for FIM_PRE
  - <|code_suffix|> for FIM_SUF
  - <|code_middle|> for FIM_MID

4. Removed manual NextN tensor handling

- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system

* glm 4.5 update tensors names

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: glm 4.5 apply suggestions from code review

* Apply suggestions from code review

* patch broken chat template

* typings fix

* add TENSOR_SKIP flag


Co-authored-by: Diego Devesa <[email protected]>

* Update src/llama-model-loader.h

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
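
The FIM markers listed in item 3 of this entry lend themselves to a short illustration. A sketch of assembling a fill-in-the-middle prompt from those tokens, assuming the conventional prefix/suffix/middle ordering; the exact template GLM-4.5 expects is not specified above, and the helper name is made up.

```cpp
// Sketch only: build a fill-in-the-middle prompt using the GLM-4.5 FIM markers
// mapped to FIM_PRE / FIM_SUF / FIM_MID in the commit above.
#include <iostream>
#include <string>

std::string make_glm45_fim_prompt(const std::string & prefix, const std::string & suffix) {
    return "<|code_prefix|>" + prefix +
           "<|code_suffix|>" + suffix +
           "<|code_middle|>";   // the model generates the missing middle after this marker
}

int main() {
    std::cout << make_glm45_fim_prompt("def add(a, b):\n    ", "\n    return result\n") << "\n";
}
```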