Granite Four #13550
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
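For context, the distinction matters because with more than one sequence the KV cells used by a given ubatch are generally not contiguous, so the loop counter and the real cell index diverge. A minimal sketch of the idea, with hypothetical container names rather than the actual llama.cpp structures:

#include <cstdint>
#include <vector>

// Hypothetical sketch: cells_of_ubatch lists the kv cells used by the current
// ubatch, and s_copy_data is the destination buffer for the state-copy indices.
void populate_s_copy(const std::vector<uint32_t> & cells_of_ubatch, int32_t * s_copy_data) {
    for (size_t i = 0; i < cells_of_ubatch.size(); ++i) {
        const uint32_t cell_id = cells_of_ubatch[i];
        // Writing `i` here would assume the ubatch's cells start at 0 and are
        // contiguous, which breaks once a second user/sequence owns earlier cells;
        // writing the actual cell_id keeps multi-user inference correct.
        s_copy_data[i] = (int32_t) cell_id;
    }
}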
Ok, yeah, the logic in kv_self_update would definitely trigger this if the two caches had different statuses. I think it makes sense to contain this fix there. I'll open a standalone PR to fix this.
There are conditions where the two child contexts can end up with different status values based on the logic in the init_update constructor for llama_kv_cache_unified_context, which can conditionally set status to either LLAMA_MEMORY_STATUS_SUCCESS or LLAMA_MEMORY_STATUS_NO_UPDATE. See full discussion: ggml-org#13550 (comment) Branch: HybridCacheApplyLogic Signed-off-by: Gabe Goodhart <[email protected]>
Fix PR: #14428
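To make the failure mode concrete: a hybrid memory context wraps an attention child and a recurrent child, and per the init_update logic above each child can independently end up as LLAMA_MEMORY_STATUS_SUCCESS or LLAMA_MEMORY_STATUS_NO_UPDATE, so the parent has to reconcile the two instead of assuming they agree. A minimal sketch of one way to combine them (a hypothetical helper, not the code from the fix PR):

// Status values mirror the ones named in the discussion; the failure value and
// the combine helper are illustrative assumptions.
enum llama_memory_status_sketch {
    STATUS_SUCCESS,
    STATUS_NO_UPDATE,
    STATUS_FAILED,
};

// If either child failed, the whole update failed; if either child has real
// work to apply, the combined status must be SUCCESS so apply() still runs.
static llama_memory_status_sketch combine_status(llama_memory_status_sketch attn,
                                                 llama_memory_status_sketch recr) {
    if (attn == STATUS_FAILED || recr == STATUS_FAILED) {
        return STATUS_FAILED;
    }
    if (attn == STATUS_SUCCESS || recr == STATUS_SUCCESS) {
        return STATUS_SUCCESS;
    }
    return STATUS_NO_UPDATE;
}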
* origin/master:
  metal : disable fast-math for some cpy kernels (ggml-org#14460)
  ggml-cpu: sycl: Re-enable exp f16 (ggml-org#14462)
  test-backend-ops : disable llama test (ggml-org#14461)
  cmake : Remove redundant include path in CMakeLists.txt (ggml-org#14452)
  scripts : make the shell scripts cross-platform (ggml-org#14341)
  server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (ggml-org#13196)
  server : fix appearance of the chats list context menu for Safari (ggml-org#14322)
  SYCL: disable faulty fp16 exp kernel (ggml-org#14395)
  ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (ggml-org#14443)
  ggml : implement REGLU/GEGLU/SWIGLU ops (ggml-org#14158)
  vulkan: Add fusion support for RMS_NORM+MUL (ggml-org#14366)
  CUDA: add bf16 and f32 support to cublas_mul_mat_batched (ggml-org#14361)
  vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)
  vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
  model : add support for ERNIE 4.5 0.3B model (ggml-org#14408)
  fix async_mode bug (ggml-org#14432)
  ci : fix windows build and release (ggml-org#14431)
  vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)
  graph : make llm_graph_context destructor virtual (ggml-org#14410)
* origin/gg/memory-is-fail: memory : correctly handle failure in apply()
* origin/master: memory : correctly handle failure in apply() (ggml-org#14438)
* origin/master:
  Add Vulkan images to docker.md (ggml-org#14472)
  CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (ggml-org#14411)
  vulkan: Split large mul_mat_id to fit in shared memory (ggml-org#14451)
  add GELU_ERF (ggml-org#14455)
  ggml : remove trailing whitespace (#0)
  sync : ggml
  ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
  ggml-quants : rename best_mad to best_error (ggml/1283)
  opencl : add GEGLU, REGLU, SWIGLU (ggml-org#14456)
  Add Conv2d for CPU (ggml-org#14388)
* origin/master:
  llama : initial Mamba-2 support (ggml-org#9126)
  sync : ggml
  ggml : add version function to get lib version (ggml/1286)
  Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (ggml-org#14309)
  CUDA: add softmax broadcast (ggml-org#14475)
  CUDA: broadcasting for FlashAttention mask (ggml-org#14500)
  vulkan: support softmax/FA batch and broadcast (ggml-org#14449)
  ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (ggml-org#14435)
  opencl : fix possible buffer overflow in dump_tensor (ggml-org#14490)
  simple-chat : fix context-exceeded condition (ggml-org#14494)
  opencl : skip empty nodes on cgraph compute (ggml-org#14491)
  opencl : update upscale to support align corners (ggml-org#14488)
  ci : add OpenCL to labeler workflow (ggml-org#14496)
  github : add OpenCL backend to issue templates (ggml-org#14492)
  ggml : Callback before abort (ggml-org#14481)
  ci : disable fast-math for Metal GHA CI (ggml-org#14478)
@compilade, huge thanks for pushing #9126 over the line! With that merged, this is ready for full review (cc @ggerganov). This will be the first real instance of the hybrid recurrent cache implementation.
@@ -4875,6 +4875,9 @@ def __init__(self, dir_model: Path, *args, **kwargs):
         with open(dir_model / "config.json", "r", encoding="utf-8") as f:
             hparams = json.load(f)
         super().__init__(dir_model, *args, hparams=hparams, **kwargs)
+        self.d_model = self.find_hparam(["hidden_size", "d_model", "dim"])
I pulled these into the class so that they can be set differently by derived conversion classes and then used in the common methods below.
    EXAONE             = auto()
    GRANITE            = auto()
    GRANITE_MOE        = auto()
    GRANITE_MOE_HYBRID = auto()
This has been one of the most annoying changes keeping this branch up to date: the GRANITE_MOE_HYBRID name is two characters longer than the previous longest name, so to keep vertical alignment, it changes the indentation of all values (here and in llama-arch.cpp).
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
        Qcur = ggml_add(ctx0, Qcur, model.layers[il].bq);
        cb(Qcur, "Qcur", il);
    }

    cur = build_granite_attention_layer(
I had originally extracted these as standalone methods so that I could reuse them in the hybrid implementation. Ultimately, any inheritance / static method / mixin approach I tried felt too tangled, so I went back to simply duplicating these methods in the hybrid model. I left these separated out here for the symmetry and encapsulation, but I could also revert this set of changes to llm_build_granite to keep the changeset smaller.
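To illustrate the trade-off being described: once a layer builder is pulled out as a static method so that an unrelated hybrid builder can call it, every member it used to read from this has to become an explicit parameter. The signature below is a hypothetical shape, not the actual code in this branch:

// Hypothetical shape only: forward declarations stand in for the real types.
struct ggml_context;
struct ggml_cgraph;
struct ggml_tensor;
struct llama_model;
struct llama_ubatch;

struct llm_build_granite_sketch {
    // Formerly a non-static method that read ctx0, ubatch, etc. from `this`;
    // as a static helper, all of that state is passed explicitly.
    static ggml_tensor * build_attention_layer(
            ggml_context       * ctx0,
            ggml_cgraph        * gf,
            ggml_tensor        * cur,
            const llama_model  & model,
            const llama_ubatch & ubatch,
            int                  il);
};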
cb(cur, "result_output", -1); | ||
res->t_logits = cur; | ||
|
||
ggml_build_forward_expand(gf, cur); | ||
} | ||
|
||
ggml_tensor * build_mamba2_layer( |
This is a copy-paste from llm_build_mamba. Per the other comment, it got too tangled to try to reliably reuse these across model builders. That said, I would still love to find a way to avoid this kind of duplication if there's appetite.
@gabe-l-hart I might have found a way to avoid this kind of duplication, see src/llama-model.cpp and src/llama-graph.cpp in #7531.
llm_graph_context_mamba is a child class of llm_graph_context with Mamba-specific layer builders:
Lines 9883 to 9894 in 908e655
struct llm_graph_context_mamba : public llm_graph_context {
    llm_graph_context_mamba(const llm_graph_params & params) : llm_graph_context(params) {}

    ggml_tensor * build_mamba_layer(
            llm_graph_input_rs * inp,
            ggml_cgraph * gf,
            ggml_tensor * cur,
            const llama_model & model,
            const llama_ubatch & ubatch,
            int il) {

        const auto * mctx_cur = inp->mctx;
llm_graph_context_mamba is the parent class of llm_build_mamba and llm_build_jamba. Not sure if that would still be appropriate with multiple-inheritance, though that wasn't necessary for Jamba.
The methods could potentially be moved to llm_graph_context (in src/llama-graph.cpp), but I preferred to keep model-specific graph building methods in src/llama-model.cpp, at least for now.
Lines 10156 to 10157 in 908e655
struct llm_build_mamba : public llm_graph_context_mamba {
    llm_build_mamba(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context_mamba(params) {
Note that I've also removed build_inp_mem_hybrid and llm_graph_input_hybrid in favor of directly using the recurrent and self-attention input builders separately. This is relatively clean, I think. build_inp_rs and build_inp_attn_kv_unified accept an optional mctx override argument.
Lines 10213 to 10259 in 908e655
struct llm_build_jamba : public llm_graph_context_mamba {
    llm_build_jamba(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context_mamba(params) {
        const int64_t n_embd_head = hparams.n_embd_head_v;

        ggml_tensor * cur;
        ggml_tensor * inpL;

        // {n_embd, n_tokens}
        inpL = build_inp_embd(model.tok_embd);

        const auto * mctx_hyb = static_cast<const llama_memory_hybrid_context *>(mctx);

        auto * inp_rs = build_rs_inp(mctx_hyb->get_recr());

        auto * inp_attn = build_attn_inp_kv_unified(mctx_hyb->get_attn());

        ggml_tensor * inp_out_ids = build_inp_out_ids();

        for (int il = 0; il < n_layer; ++il) {
            const int64_t n_head_kv = hparams.n_head_kv(il);

            cur = build_norm(inpL, model.layers[il].attn_norm, NULL, LLM_NORM_RMS, il);
            cb(cur, "attn_norm", il);

            if (n_head_kv == 0) {
                cur = build_mamba_layer(inp_rs, gf, cur, model, ubatch, il);
            } else {
                // Attention
                struct ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
                struct ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
                struct ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);
                cb(Qcur, "Qcur", il);
                cb(Kcur, "Kcur", il);
                cb(Vcur, "Vcur", il);

                Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
                Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
                Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
                cb(Qcur, "Qcur", il);
                cb(Kcur, "Kcur", il);
                cb(Vcur, "Vcur", il);

                // No RoPE :)
                cur = build_attn(inp_attn, gf, model.layers[il].wo, NULL, Qcur, Kcur, Vcur, NULL, NULL, 1.0f/sqrtf(float(n_embd_head)), il);
This makes use of the fact that mctx is stored in inp_rs and inp_attn already, and so build_rs and build_attn were changed to use that instead of trying to cast llm_graph_context::mctx again.
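A condensed sketch of that pattern, with simplified stand-in types rather than the exact llama.cpp declarations: the graph input object keeps the memory context it was built from, so the builder reads it from the input instead of down-casting the graph context's mctx a second time.

// Simplified illustration only.
struct llama_memory_recurrent_context;  // recurrent-state memory view (stand-in)

struct llm_graph_input_rs_sketch {
    // Set when the input is built, e.g. from mctx_hyb->get_recr() for hybrid models.
    const llama_memory_recurrent_context * mctx = nullptr;
};

struct llm_graph_context_sketch {
    const void * mctx = nullptr;  // may be hybrid, recurrent, unified, ...

    void build_rs(const llm_graph_input_rs_sketch * inp) {
        // Before: every call site re-cast this->mctx and had to know whether the
        // model was hybrid. After: the right child context comes from the input.
        const llama_memory_recurrent_context * mctx_cur = inp->mctx;
        (void) mctx_cur;  // ... build the recurrent-state ops against mctx_cur ...
    }
};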
…d_mamba Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: gguf-py : add support for chat template jinja files (ggml-org#14508)
Description
This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:
Additionally, this PR replaces some work done on other PRs / branches:
* Bamba support: Bamba architecture #10810
* Bamba support: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
* Granite 4.0 support: https://github.com/gabe-l-hart/llama.cpp/tree/GraniteFourDraft
  * Like the Bamba work, this will also be abandoned in favor of this PR
* Jamba: llama : support Jamba hybrid Transformer-Mamba models #7531
  * … master. I had considered keeping Jamba support in this branch, but on further inspection, it looks like the Jamba architecture has some additional bells-and-whistles (eg sliding-window-attention) that would need further work, so my plan is to leave Jamba off for now and possibly tackle it later (hopefully it's much easier than the original branch!)

Outstanding Questions
Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:
* There are a number of changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
* Is there a better way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right (see the sketch after this list).
* I'm seeing somewhat different results for Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (ie "working as expected") or a bug somewhere.
* Using dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
* The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
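For reference on the second question above, this is roughly the shape being debated; LLAMA_MAX_LAYERS and the member names follow the discussion in this PR, but the declaration itself is a sketch rather than a copy of the branch:

#include <array>
#include <cstdint>

constexpr uint32_t LLAMA_MAX_LAYERS_SKETCH = 512;  // assumed fixed upper bound on layer count

struct llama_hparams_sketch {
    uint32_t n_layer = 0;

    // One flag per layer: true -> recurrent (Mamba-2) layer, false -> attention layer.
    // The open question is whether a max-layer-size std::array is the right container.
    std::array<bool, LLAMA_MAX_LAYERS_SKETCH> recurrent_layer_arr = {};

    bool recurrent_layer(uint32_t il) const {
        return il < n_layer && recurrent_layer_arr[il];
    }
};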
Testing
To test out this branch, I've been using the following models:
* granite-4.0-tiny-preview: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
* Bamba-9B-v1: https://huggingface.co/ibm-ai-platform/Bamba-9B-v1
* mamba2-370m-hf: https://huggingface.co/AntonV/mamba2-370m-hf

Details
This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side
* Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
* Add logic in gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
* Add a HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?

c++ side
* Add llama_model_is_hybrid akin to llama_model_is_recurrent
* Move the logic for llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
* Add hparams.recurrent_layer_arr and support parsing it
* Update hparams.n_embd_k_s / hparams.n_embd_v_s so that for non-recurrent layers they return 0. This should be fine since none of those places interact with the hybrid caching.
* Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
* Add support for bamba and granitemoeshared in llama-arch.* (the boring part!)
* Add hparams as an additional argument to the llama_model.create_memory method
* In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches (see the sketch after this list).
* Add support for the hybrid cache in llama-model.cpp
* Add model support for bamba and granitemoehybrid in llama-model
* Factor build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
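Since the dynamic_cast-based lookup comes up both in the outstanding questions and in the list above, here is a rough sketch of the pattern as described; the class names are abbreviated and the hybrid accessors are assumptions, not the exact llama.cpp API:

// Abbreviated stand-ins for the real cache classes.
struct llama_kv_cache_sketch { virtual ~llama_kv_cache_sketch() = default; };
struct llama_kv_cache_unified_sketch   : llama_kv_cache_sketch {};
struct llama_kv_cache_recurrent_sketch : llama_kv_cache_sketch {};
struct llama_kv_cache_hybrid_sketch    : llama_kv_cache_sketch {
    llama_kv_cache_unified_sketch   * attn = nullptr;  // assumed accessor layout
    llama_kv_cache_recurrent_sketch * recr = nullptr;
};

// Resolve the unified (attention) cache whether the model uses a plain unified
// cache or a hybrid cache that wraps one; the recurrent variant mirrors this.
static llama_kv_cache_unified_sketch * get_unified_cache(llama_kv_cache_sketch * kv) {
    if (auto * hybrid = dynamic_cast<llama_kv_cache_hybrid_sketch *>(kv)) {
        return hybrid->attn;
    }
    return dynamic_cast<llama_kv_cache_unified_sketch *>(kv);
}

static llama_kv_cache_recurrent_sketch * get_recurrent_cache(llama_kv_cache_sketch * kv) {
    if (auto * hybrid = dynamic_cast<llama_kv_cache_hybrid_sketch *>(kv)) {
        return hybrid->recr;
    }
    return dynamic_cast<llama_kv_cache_recurrent_sketch *>(kv);
}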