Granite Four #13550
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
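For context, the distinction matters because with more than one sequence the KV cells used by a given ubatch are generally not contiguous, so the loop counter and the real cell index diverge. A minimal sketch of the idea, with hypothetical container names rather than the actual llama.cpp structures:

#include <cstdint>
#include <vector>

// Hypothetical sketch: cells_of_ubatch lists the kv cells used by the current
// ubatch, and s_copy_data is the destination buffer for the state-copy indices.
void populate_s_copy(const std::vector<uint32_t> & cells_of_ubatch, int32_t * s_copy_data) {
    for (size_t i = 0; i < cells_of_ubatch.size(); ++i) {
        const uint32_t cell_id = cells_of_ubatch[i];
        // Writing `i` here would assume the ubatch's cells start at 0 and are
        // contiguous, which breaks once a second user/sequence owns earlier cells;
        // writing the actual cell_id keeps multi-user inference correct.
        s_copy_data[i] = (int32_t) cell_id;
    }
}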
Ok, yeah, the logic in kv_self_update would definitely trigger this if the two caches had different statuses. I think it makes sense to contain this fix there. I'll open a standalone PR to fix this.
There are conditions where the two child contexts can end up with different status values based on the logic in the init_update constructor for llama_kv_cache_unified_context, which can conditionally set status to either LLAMA_MEMORY_STATUS_SUCCESS or LLAMA_MEMORY_STATUS_NO_UPDATE. See full discussion: ggml-org#13550 (comment) Branch: HybridCacheApplyLogic Signed-off-by: Gabe Goodhart <[email protected]>
Fix PR: #14428
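To make the failure mode concrete: a hybrid memory context wraps an attention child and a recurrent child, and per the init_update logic above each child can independently end up as LLAMA_MEMORY_STATUS_SUCCESS or LLAMA_MEMORY_STATUS_NO_UPDATE, so the parent has to reconcile the two instead of assuming they agree. A minimal sketch of one way to combine them (a hypothetical helper, not the code from the fix PR):

// Status values mirror the ones named in the discussion; the failure value and
// the combine helper are illustrative assumptions.
enum llama_memory_status_sketch {
    STATUS_SUCCESS,
    STATUS_NO_UPDATE,
    STATUS_FAILED,
};

// If either child failed, the whole update failed; if either child has real
// work to apply, the combined status must be SUCCESS so apply() still runs.
static llama_memory_status_sketch combine_status(llama_memory_status_sketch attn,
                                                 llama_memory_status_sketch recr) {
    if (attn == STATUS_FAILED || recr == STATUS_FAILED) {
        return STATUS_FAILED;
    }
    if (attn == STATUS_SUCCESS || recr == STATUS_SUCCESS) {
        return STATUS_SUCCESS;
    }
    return STATUS_NO_UPDATE;
}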
* origin/master:
  metal : disable fast-math for some cpy kernels (ggml-org#14460)
  ggml-cpu: sycl: Re-enable exp f16 (ggml-org#14462)
  test-backend-ops : disable llama test (ggml-org#14461)
  cmake : Remove redundant include path in CMakeLists.txt (ggml-org#14452)
  scripts : make the shell scripts cross-platform (ggml-org#14341)
  server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (ggml-org#13196)
  server : fix appearance of the chats list context menu for Safari (ggml-org#14322)
  SYCL: disable faulty fp16 exp kernel (ggml-org#14395)
  ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (ggml-org#14443)
  ggml : implement REGLU/GEGLU/SWIGLU ops (ggml-org#14158)
  vulkan: Add fusion support for RMS_NORM+MUL (ggml-org#14366)
  CUDA: add bf16 and f32 support to cublas_mul_mat_batched (ggml-org#14361)
  vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)
  vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
  model : add support for ERNIE 4.5 0.3B model (ggml-org#14408)
  fix async_mode bug (ggml-org#14432)
  ci : fix windows build and release (ggml-org#14431)
  vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)
  graph : make llm_graph_context destructor virtual (ggml-org#14410)
* origin/gg/memory-is-fail: memory : correctly handle failure in apply()
* origin/master: memory : correctly handle failure in apply() (ggml-org#14438)
* origin/master:
  Add Vulkan images to docker.md (ggml-org#14472)
  CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (ggml-org#14411)
  vulkan: Split large mul_mat_id to fit in shared memory (ggml-org#14451)
  add GELU_ERF (ggml-org#14455)
  ggml : remove trailing whitespace (#0)
  sync : ggml
  ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
  ggml-quants : rename best_mad to best_error (ggml/1283)
  opencl : add GEGLU, REGLU, SWIGLU (ggml-org#14456)
  Add Conv2d for CPU (ggml-org#14388)
* origin/master:
  llama : initial Mamba-2 support (ggml-org#9126)
  sync : ggml
  ggml : add version function to get lib version (ggml/1286)
  Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (ggml-org#14309)
  CUDA: add softmax broadcast (ggml-org#14475)
  CUDA: broadcasting for FlashAttention mask (ggml-org#14500)
  vulkan: support softmax/FA batch and broadcast (ggml-org#14449)
  ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (ggml-org#14435)
  opencl : fix possible buffer overflow in dump_tensor (ggml-org#14490)
  simple-chat : fix context-exceeded condition (ggml-org#14494)
  opencl : skip empty nodes on cgraph compute (ggml-org#14491)
  opencl : update upscale to support align corners (ggml-org#14488)
  ci : add OpenCL to labeler workflow (ggml-org#14496)
  github : add OpenCL backend to issue templates (ggml-org#14492)
  ggml : Callback before abort (ggml-org#14481)
  ci : disable fast-math for Metal GHA CI (ggml-org#14478)
@compilade, huge thanks for pushing #9126 over the line! With that merged, this is ready for full review (cc @ggerganov). This will be the first real instance of the hybrid recurrent cache implementation.
@@ -4875,6 +4875,9 @@ def __init__(self, dir_model: Path, *args, **kwargs):
         with open(dir_model / "config.json", "r", encoding="utf-8") as f:
             hparams = json.load(f)
         super().__init__(dir_model, *args, hparams=hparams, **kwargs)
+        self.d_model = self.find_hparam(["hidden_size", "d_model", "dim"])
I pulled these into the class so that they can be set differently by derived conversion classes and then used in the common methods below.
    EXAONE             = auto()
    GRANITE            = auto()
    GRANITE_MOE        = auto()
    GRANITE_MOE_HYBRID = auto()
This has been one of the most annoying changes keeping this branch up to date: the GRANITE_MOE_HYBRID name is two characters longer than the previous longest name, so to keep vertical alignment, it changes the indentation of all values (here and in llama-arch.cpp).
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
        Qcur = ggml_add(ctx0, Qcur, model.layers[il].bq);
        cb(Qcur, "Qcur", il);
    }

    cur = build_granite_attention_layer(
I had originally extracted these as standalone methods so that I could reuse them in the hybrid implementation. Ultimately, any inheritance / static method / mixin approach I tried felt too tangled, so I went back to simply duplicating these methods in the hybrid model. I left these separated out here for the symmetry and encapsulation, but I could also revert this set of changes to llm_build_granite to keep the changeset smaller.
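To illustrate the trade-off being described: once a layer builder is pulled out as a static method so that an unrelated hybrid builder can call it, every member it used to read from this has to become an explicit parameter. The signature below is a hypothetical shape, not the actual code in this branch:

// Hypothetical shape only: forward declarations stand in for the real types.
struct ggml_context;
struct ggml_cgraph;
struct ggml_tensor;
struct llama_model;
struct llama_ubatch;

struct llm_build_granite_sketch {
    // Formerly a non-static method that read ctx0, ubatch, etc. from `this`;
    // as a static helper, all of that state is passed explicitly.
    static ggml_tensor * build_attention_layer(
            ggml_context       * ctx0,
            ggml_cgraph        * gf,
            ggml_tensor        * cur,
            const llama_model  & model,
            const llama_ubatch & ubatch,
            int                  il);
};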
cb(cur, "result_output", -1); | ||
res->t_logits = cur; | ||
|
||
ggml_build_forward_expand(gf, cur); | ||
} | ||
|
||
ggml_tensor * build_mamba2_layer( |
This is a copy-paste from llm_build_mamba. Per the other comment, it got too tangled to try to reliably reuse these across model builders. That said, I would still love to find a way to avoid this kind of duplication if there's appetite.
@gabe-l-hart I might have found a way to avoid this kind of duplication, see src/llama-model.cpp and src/llama-graph.cpp in #7531.
llm_graph_context_mamba is a child class of llm_graph_context with Mamba-specific layer builders:
Lines 9883 to 9894 in 908e655
struct llm_graph_context_mamba : public llm_graph_context {
    llm_graph_context_mamba(const llm_graph_params & params) : llm_graph_context(params) {}

    ggml_tensor * build_mamba_layer(
            llm_graph_input_rs * inp,
            ggml_cgraph * gf,
            ggml_tensor * cur,
            const llama_model & model,
            const llama_ubatch & ubatch,
            int il) {

        const auto * mctx_cur = inp->mctx;
llm_graph_context_mamba is the parent class of llm_build_mamba and llm_build_jamba. Not sure if that would still be appropriate with multiple-inheritance, though that wasn't necessary for Jamba.
The methods could potentially be moved to llm_graph_context (in src/llama-graph.cpp), but I preferred to keep model-specific graph building methods in src/llama-model.cpp, at least for now.
Lines 10156 to 10157 in 908e655
struct llm_build_mamba : public llm_graph_context_mamba {
    llm_build_mamba(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context_mamba(params) {
Note that I've also removed build_inp_mem_hybrid and llm_graph_input_hybrid in favor of directly using the recurrent and self-attention input builders separately. This is relatively clean, I think. build_inp_rs and build_inp_attn_kv_unified accept an optional mctx override argument.
Lines 10213 to 10259 in 908e655
struct llm_build_jamba : public llm_graph_context_mamba {
    llm_build_jamba(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context_mamba(params) {
        const int64_t n_embd_head = hparams.n_embd_head_v;

        ggml_tensor * cur;
        ggml_tensor * inpL;

        // {n_embd, n_tokens}
        inpL = build_inp_embd(model.tok_embd);

        const auto * mctx_hyb = static_cast<const llama_memory_hybrid_context *>(mctx);

        auto * inp_rs = build_rs_inp(mctx_hyb->get_recr());

        auto * inp_attn = build_attn_inp_kv_unified(mctx_hyb->get_attn());

        ggml_tensor * inp_out_ids = build_inp_out_ids();

        for (int il = 0; il < n_layer; ++il) {
            const int64_t n_head_kv = hparams.n_head_kv(il);

            cur = build_norm(inpL, model.layers[il].attn_norm, NULL, LLM_NORM_RMS, il);
            cb(cur, "attn_norm", il);

            if (n_head_kv == 0) {
                cur = build_mamba_layer(inp_rs, gf, cur, model, ubatch, il);
            } else {
                // Attention
                struct ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
                struct ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
                struct ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);
                cb(Qcur, "Qcur", il);
                cb(Kcur, "Kcur", il);
                cb(Vcur, "Vcur", il);

                Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
                Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
                Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
                cb(Qcur, "Qcur", il);
                cb(Kcur, "Kcur", il);
                cb(Vcur, "Vcur", il);

                // No RoPE :)
                cur = build_attn(inp_attn, gf, model.layers[il].wo, NULL, Qcur, Kcur, Vcur, NULL, NULL, 1.0f/sqrtf(float(n_embd_head)), il);
This makes use of the fact that mctx is stored in inp_rs and inp_attn already, and so build_rs and build_attn were changed to use that instead of trying to cast llm_graph_context::mctx again.
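A condensed sketch of that pattern, with simplified stand-in types rather than the exact llama.cpp declarations: the graph input object keeps the memory context it was built from, so the builder reads it from the input instead of down-casting the graph context's mctx a second time.

// Simplified illustration only.
struct llama_memory_recurrent_context;  // recurrent-state memory view (stand-in)

struct llm_graph_input_rs_sketch {
    // Set when the input is built, e.g. from mctx_hyb->get_recr() for hybrid models.
    const llama_memory_recurrent_context * mctx = nullptr;
};

struct llm_graph_context_sketch {
    const void * mctx = nullptr;  // may be hybrid, recurrent, unified, ...

    void build_rs(const llm_graph_input_rs_sketch * inp) {
        // Before: every call site re-cast this->mctx and had to know whether the
        // model was hybrid. After: the right child context comes from the input.
        const llama_memory_recurrent_context * mctx_cur = inp->mctx;
        (void) mctx_cur;  // ... build the recurrent-state ops against mctx_cur ...
    }
};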
…d_mamba Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: gguf-py : add support for chat template jinja files (ggml-org#14508)
Description
This PR is the end-point for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:
Additionally, this PR replaces some work done on other PRs / branches:
* Bamba support: Bamba architecture #10810
* Bamba support: https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitectureRefactor
* Granite 4.0 support: https://github.com/gabe-l-hart/llama.cpp/tree/GraniteFourDraft
  * Like the Bamba work, this will also be abandoned in favor of this PR
* Jamba: llama : support Jamba hybrid Transformer-Mamba models #7531
  * … master. I had considered keeping Jamba support in this branch, but on further inspection, it looks like the Jamba architecture has some additional bells-and-whistles (eg sliding-window-attention) that would need further work, so my plan is to leave Jamba off for now and possibly tackle it later (hopefully it's much easier than the original branch!)

Outstanding Questions
Besides the upstream PRs, there are a few questions to answer before this PR is merge ready:
* There are a number of changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
* Is there a better way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right (see the sketch after this list).
* I'm seeing somewhat different results for Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine if this is due to changes in the attention implementation (ie "working as expected") or a bug somewhere.
* Using dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
* The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
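For reference on the second question above, this is roughly the shape being debated; LLAMA_MAX_LAYERS and the member names follow the discussion in this PR, but the declaration itself is a sketch rather than a copy of the branch:

#include <array>
#include <cstdint>

constexpr uint32_t LLAMA_MAX_LAYERS_SKETCH = 512;  // assumed fixed upper bound on layer count

struct llama_hparams_sketch {
    uint32_t n_layer = 0;

    // One flag per layer: true -> recurrent (Mamba-2) layer, false -> attention layer.
    // The open question is whether a max-layer-size std::array is the right container.
    std::array<bool, LLAMA_MAX_LAYERS_SKETCH> recurrent_layer_arr = {};

    bool recurrent_layer(uint32_t il) const {
        return il < n_layer && recurrent_layer_arr[il];
    }
};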
Testing
To test out this branch, I've been using the following models:
* granite-4.0-tiny-preview: https://huggingface.co/ibm-granite/granite-4.0-tiny-preview
* Bamba-9B-v1: https://huggingface.co/ibm-ai-platform/Bamba-9B-v1
* mamba2-370m-hf: https://huggingface.co/AntonV/mamba2-370m-hf

Details
This PR has a lot of changes in it, some of which are isolated in the prereq-PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side
* Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
* Add logic in gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
* Add a HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?

c++ side
* Add llama_model_is_hybrid akin to llama_model_is_recurrent
* Move the logic for llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
* Add hparams.recurrent_layer_arr and support parsing it
* Update hparams.n_embd_k_s / hparams.n_embd_v_s so that for non-recurrent layers they return 0. This should be fine since none of those places interact with the hybrid caching.
* Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
* Add support for bamba and granitemoeshared in llama-arch.* (the boring part!)
* Add hparams as an additional argument to the llama_model.create_memory method
* In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches (see the sketch after this list).
* Add support for the hybrid cache in llama-model.cpp
* Add model support for bamba and granitemoehybrid in llama-model
* Factor build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
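Since the dynamic_cast-based lookup comes up both in the outstanding questions and in the list above, here is a rough sketch of the pattern as described; the class names are abbreviated and the hybrid accessors are assumptions, not the exact llama.cpp API:

// Abbreviated stand-ins for the real cache classes.
struct llama_kv_cache_sketch { virtual ~llama_kv_cache_sketch() = default; };
struct llama_kv_cache_unified_sketch   : llama_kv_cache_sketch {};
struct llama_kv_cache_recurrent_sketch : llama_kv_cache_sketch {};
struct llama_kv_cache_hybrid_sketch    : llama_kv_cache_sketch {
    llama_kv_cache_unified_sketch   * attn = nullptr;  // assumed accessor layout
    llama_kv_cache_recurrent_sketch * recr = nullptr;
};

// Resolve the unified (attention) cache whether the model uses a plain unified
// cache or a hybrid cache that wraps one; the recurrent variant mirrors this.
static llama_kv_cache_unified_sketch * get_unified_cache(llama_kv_cache_sketch * kv) {
    if (auto * hybrid = dynamic_cast<llama_kv_cache_hybrid_sketch *>(kv)) {
        return hybrid->attn;
    }
    return dynamic_cast<llama_kv_cache_unified_sketch *>(kv);
}

static llama_kv_cache_recurrent_sketch * get_recurrent_cache(llama_kv_cache_sketch * kv) {
    if (auto * hybrid = dynamic_cast<llama_kv_cache_hybrid_sketch *>(kv)) {
        return hybrid->recr;
    }
    return dynamic_cast<llama_kv_cache_recurrent_sketch *>(kv);
}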