llama: Attempt to add ModernBert #14014


Open · wants to merge 34 commits into master
Conversation

huydt84 (Collaborator) commented Jun 4, 2025

I don't know whether my implementation is correct or not

@github-actions github-actions bot added the python (python script changes) label Jun 4, 2025
@huydt84 huydt84 marked this pull request as draft June 4, 2025 15:27
@huydt84 huydt84 marked this pull request as ready for review June 4, 2025 15:36
huydt84 (Collaborator, Author) commented Jun 4, 2025

hparams.set_swa_pattern can't work properly with ModernBert

@huydt84 huydt84 marked this pull request as draft June 4, 2025 15:40
huydt84 (Collaborator, Author) commented Jun 4, 2025

The embedding result seems random and very low. There is something wrong with this

@huydt84 huydt84 marked this pull request as ready for review June 4, 2025 16:21
CISC (Collaborator) left a comment

Delete the files you added in models; we don't need them. Just make sure test-tokenizer-0 succeeds with the GGUF.

@huydt84 huydt84 requested a review from CISC June 4, 2025 22:55
inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
cb(inpL, "inp_norm", -1);

auto * inp_attn = build_attn_inp_kv_unified_iswa();

ggerganov (Member) commented Jun 5, 2025

This should probably become:

Suggested change:
-   auto * inp_attn = build_attn_inp_kv_unified_iswa();
+   auto * inp_attn = build_attn_inp_no_cache_iswa();

And add the corresponding mask logic in llama-graph. Special attention should be paid to how the SWA works for this model, i.e. whether it is symmetric or not:

# non-symmetric:
token i attends to [i - n_swa, i]

# symmetric:
token i attends to [i - n_swa/2, i + n_swa/2]
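
For illustration, the two variants boil down to a mask predicate roughly like the following (a minimal sketch, not code from this PR; the helper name and signature are made up):

#include <cstdint>
#include <cstdlib>   // std::abs

// true if the token at position p1 is allowed to attend to the token at position p0
static bool swa_allows(int32_t p0, int32_t p1, int32_t n_swa, bool symmetric) {
    if (symmetric) {
        // token i attends to [i - n_swa/2, i + n_swa/2]
        return std::abs(p1 - p0) <= n_swa / 2;
    }
    // token i attends to [i - n_swa, i]
    return p1 - p0 >= 0 && p1 - p0 <= n_swa;
}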

CISC (Collaborator) commented:

@huydt-bti Hey, is the issue that you forgot to implement this function, so that SWA is actually never applied?

huydt84 (Collaborator, Author) replied:

@CISC No, I made it here:

if (hparams.use_alibi &&
    (hparams.n_swa == 0 || (pos_diff >= -half_n_swa && pos_diff <= half_n_swa))) {
    f = -std::abs(ubatch->pos[ti] - ubatch->pos[tj]);
} else {

I use build_attn_inp_no_cache()
#14014 (review)

CISC (Collaborator) replied:

Yes, but there's no kq_mask_swa, so is this even executed?

huydt84 (Collaborator, Author) replied:

@CISC I have just implemented it. Please check again

ggerganov (Member) left a comment

You have to add the new arch here:

llama.cpp/src/llama-model.cpp

Lines 13195 to 13203 in 5a8ae30

switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            res = nullptr;
        } break;

To avoid creating a memory module (a.k.a. KV cache) for these models.
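
For example, assuming the new enum value ends up being named LLM_ARCH_MODERN_BERT (the actual identifier is up to this PR), the switch would simply gain one more case:

switch (arch) {
    case LLM_ARCH_BERT:
    case LLM_ARCH_JINA_BERT_V2:
    case LLM_ARCH_NOMIC_BERT:
    case LLM_ARCH_NOMIC_BERT_MOE:
    case LLM_ARCH_MODERN_BERT:        // hypothetical name for the new arch
    case LLM_ARCH_WAVTOKENIZER_DEC:
        {
            // no memory module (KV cache) is created for these archs
            res = nullptr;
        } break;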

@huydt84 huydt84 requested a review from ggerganov June 5, 2025 13:55
CISC (Collaborator) commented Jun 5, 2025

So, since the vocab is BPE, you need to add modern-bert vocab handling in a few places:

tokenizer_pre == "roberta-bpe") {

Set the correct attribute on the [MASK] token, similarly to this:

llama.cpp/src/llama-vocab.cpp

Lines 2097 to 2105 in 9f47fa5

if (false
    || _contains_any(tokenizer_pre, {"jina-v2-de", "jina-v2-es", "jina-v2-code"})
    || _contains_any(general_arch, {"nomic-bert-moe"})
) {
    if (token_to_id.count("<mask>") == 0) {
        LLAMA_LOG_WARN("%s: Mask token is missing in vocab, please reconvert model!\n", __func__);
    } else {
        _set_token_attr("<mask>", LLAMA_TOKEN_ATTR_LSTRIP, true);
    }
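
A rough sketch of the second part (assuming the mask token is spelled "[MASK]" and that a "modern-bert" string is used for the arch check; neither is confirmed here):

// llama-vocab.cpp, sketch only
if (_contains_any(general_arch, {"modern-bert"})) {
    if (token_to_id.count("[MASK]") == 0) {
        LLAMA_LOG_WARN("%s: Mask token is missing in vocab, please reconvert model!\n", __func__);
    } else {
        _set_token_attr("[MASK]", LLAMA_TOKEN_ATTR_LSTRIP, true);
    }
}

The first part is just registering the new pre-tokenizer string next to the existing "roberta-bpe" case.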

CISC (Collaborator) commented Jun 5, 2025

> The embedding result seems random and very low. There is something wrong with this

Yep, I also noticed the same with jina-reranker-v2, most likely the same issue, will investigate.

huydt84 (Collaborator, Author) commented Jun 8, 2025

@CISC cc: @ggerganov

I tried running the embedding with various models, but the output results barely change across those attempts. Maybe the parameter loading or the inference graph has a problem somewhere. Can you check that part?
This is the model implementation in Hugging Face Transformers: https://github.com/huggingface/transformers/blob/v4.52.3/src/transformers/models/modernbert/modeling_modernbert.py

CISC (Collaborator) commented Jun 8, 2025

So, I just noticed at least part of the problem:

llama.cpp/src/llama-graph.cpp

Lines 1567 to 1571 in 3ac6753

if (cls != nullptr && cls_b != nullptr) {
    // classification head
    // https://github.com/huggingface/transformers/blob/5af7d41e49bbfc8319f462eb45253dcb3863dfb7/src/transformers/models/roberta/modeling_roberta.py#L1566
    cur = ggml_add(ctx0, ggml_mul_mat(ctx0, cls, inp), cls_b);
    cur = ggml_tanh(ctx0, cur);

We have cls, but not cls_b, so this has to be modified to handle that...
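
One way to relax that check (a sketch of the idea, not necessarily how this PR ends up resolving it) is to require only cls and add the bias conditionally; whether tanh is the right activation for ModernBert's head is a separate question:

if (cls != nullptr) {
    // classification head; the bias is optional for models that don't ship cls_b
    cur = ggml_mul_mat(ctx0, cls, inp);
    if (cls_b != nullptr) {
        cur = ggml_add(ctx0, cur, cls_b);
    }
    cur = ggml_tanh(ctx0, cur);
}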

huydt84 (Collaborator, Author) commented Jun 9, 2025

> We have cls, but not cls_b, so this has to be modified to handle that...

@CISC After fixing that, the result is much better now :) but it is still lower than my expectations for ModernBert. Maybe there is a problem somewhere else...

CISC (Collaborator) left a comment

Everything else LGTM, so pending finding the output issue.

@@ -1328,6 +1328,12 @@ bool llama_kv_cache_unified::is_masked_swa(llama_pos p0, llama_pos p1) const {
                return true;
            }
        } break;
        case LLAMA_SWA_TYPE_SYMMETRIC:
            {
                if (p1 - p0 <= (int32_t) n_swa / 2 || p0 - p1 >= (int32_t) n_swa / 2) {

huydt84 (Collaborator, Author) commented Jun 16, 2025

@CISC I see part of the problem! I'm masking the tokens inside the window, when it should be the ones outside.
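
In other words, the symmetric case should mask only positions outside the window, roughly (sketch):

case LLAMA_SWA_TYPE_SYMMETRIC:
    {
        const int32_t half_n_swa = (int32_t) n_swa / 2;
        // masked only if p0 lies outside [p1 - n_swa/2, p1 + n_swa/2]
        if (p1 - p0 > half_n_swa || p0 - p1 > half_n_swa) {
            return true;
        }
    } break;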

huydt84 (Collaborator, Author) replied:

No, the function isn't used because it belongs to llama_kv_cache_unified

@@ -351,6 +351,69 @@ void llm_graph_input_attn_no_cache::set_input(const llama_ubatch * ubatch) {
            }
        }
    }

    // Handle symmetric SWA mask separately if it exists
    if (kq_mask_swa) {

CISC (Collaborator) commented:

No, this is unnecessary duplication; it should be handled like this once you add llm_graph_input/build_attn_inp_no_cache_iswa:

void llm_graph_input_attn_kv_unified::set_input(const llama_ubatch * ubatch) {
    if (self_kq_mask) {
        kv_state->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
    }
}

void llm_graph_input_attn_kv_unified_iswa::set_input(const llama_ubatch * ubatch) {
    if (self_kq_mask) {
        kv_state->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
    }
    if (self_kq_mask_swa) {
        kv_state->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
    }
}
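
Applied to the no-cache path, that could end up looking roughly like this (a sketch only; the _iswa type and both helper methods are hypothetical names, not existing API):

void llm_graph_input_attn_no_cache_iswa::set_input(const llama_ubatch * ubatch) {
    if (kq_mask) {
        // full (non-windowed) mask, same logic as the existing no-cache path
        set_input_kq_mask(kq_mask, ubatch, cparams.causal_attn);
    }
    if (kq_mask_swa) {
        // windowed mask; the symmetric-window check lives inside this helper
        set_input_kq_mask_swa(kq_mask_swa, ubatch, cparams.causal_attn);
    }
}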

huydt84 (Collaborator, Author) replied:

It is not interleaved SWA, so I still prefer using build_attn_inp_no_cache. I will try to refactor llm_graph_input_attn_no_cache::set_input, but the mechanism is different from llm_graph_input_attn_kv_unified and llm_graph_input_attn_kv_unified_iswa, since those use kv_state.

CISC (Collaborator) replied:

Sure, the point is just that you can handle it similarly.

huydt84 (Collaborator, Author) replied:

I just pushed my code - but it's ugly to copy is_masked_swa and place it inside llm_graph_input_attn_no_cache::set_input. Do you have any suggestions?

CISC (Collaborator) replied:

I don't understand why you can't split the methods just like in unified; you can have an is_masked_swa implementation for no_cache and a much cleaner set_input.

huydt84 (Collaborator, Author) replied:

Personally I don't think that would be cleaner, since the new build_attn_inp_no_cache_swa would be almost the same as the current build_attn_inp_no_cache, and we would end up with an extra build_attn_inp_no_cache variant. But I will try that.

CISC (Collaborator) replied:

That doesn't matter; build_attn_inp_* are tiny. The important part is that you can reuse the same set_input_kq_mask.

@CISC CISC added the model (Model specific) label Jun 16, 2025