model : add hunyuan moe #14425
base: master
Conversation
convert_hf_to_gguf.py (outdated)
```python
for token, rank in mergeable_ranks.items():
    vocab[QwenModel.token_bytes_to_string(token)] = rank
    if len(token) == 1:
        continue
    merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
    if len(merged) == 2:
        merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))
```
I quite doubt whether this is correct. If someone knows, or has time to run a tokenizer test, please feel free to leave a comment.
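Not a definitive test, but one way to probe this is the sketch below. It assumes the Hunyuan tokenizer exposes a tiktoken-style `mergeable_ranks` dict the way the Qwen tokenizer does (that attribute name is an assumption here), and checks that every multi-byte token can be formed by merging two lower-ranked tokens, which is what the merge-reconstruction above relies on.

```python
# Hedged sanity check for the reconstructed BPE merges.
# Assumption: the trust_remote_code tokenizer exposes `mergeable_ranks`
# (bytes -> rank), as the conversion snippet above relies on.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)
mergeable_ranks = tokenizer.mergeable_ranks

unreachable = 0
for token, rank in mergeable_ranks.items():
    if len(token) == 1:
        continue  # single bytes need no merge
    # the token is reachable if some split into two known, lower-ranked pieces exists
    reachable = any(
        token[:i] in mergeable_ranks
        and token[i:] in mergeable_ranks
        and mergeable_ranks[token[:i]] < rank
        and mergeable_ranks[token[i:]] < rank
        for i in range(1, len(token))
    )
    unreachable += not reachable

print(f"{unreachable} multi-byte tokens have no valid lower-rank merge")
```

If that count comes out non-zero, the emitted merges list probably cannot reproduce the original vocabulary.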
OK, getting somewhere now. The model runs, but the output is gibberish.
Thanks for working on this! I got the same-looking output when I tried it. The only odd things I noticed were:
Tested on an AMD 7965WX 24-core, 256GB DDR5@4800 + dual RTX A6000 (96GB total VRAM) rig.

👈 a few more commands and logs fwiw

convert:

```bash
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct/
```
llama-server:

```bash
model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf
./build/bin/llama-server \
    --model "$model" \
    -fa \
    -ctk f16 -ctv f16 \
    -c 8192 \
    -ts 48,48 \
    -ngl 10 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080
```
client:

```
>>> User:
Tell a funny joke in English.

>>> Assistant:
[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]
```
I don't know as much about this as you guys, but could it be that the tokenizer is splitting characters like 新 ("new") into raw bytes? The UTF-8 sequence of such a character gets broken apart, and the fragments get wrapped in `UNK_BYTE` markers, because common Chinese characters always use 3 bytes in UTF-8. It matches the error output above.
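As a quick sanity check of that theory (plain Python, nothing model-specific), the hex bytes in the markers do line up with the UTF-8 encodings of 新 and 旧:

```python
# The bytes embedded in the UNK_BYTE markers are exactly the UTF-8 encodings
# of the characters being emitted, which supports the byte-splitting theory.
print("新".encode("utf-8").hex())  # e696b0 -> matches [UNK_BYTE_0xe696b0...]
print("旧".encode("utf-8").hex())  # e697a7 -> matches [UNK_BYTE_0xe697a7...]
```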
The cgraph is still not correct. Testing with this tiny random weight: https://huggingface.co/ngxson/hunyuan-moe-tiny-random/tree/main. It seems like the problem is in the self-attention block.
I don't know if the improvements I am seeing are from your latest changes or from my own edits. The changes I made were:

My edits are here: https://github.com/kooshi/llama.cpp/tree/hunyuan
The more I look at the upstream implementation, the more I wonder whether it actually works. My Mac M3 Ultra can't load the original model even though it has 512GB of RAM, so for now I'm testing with the tiny weight. Switching between attention implementations changes the output. Also, … And more importantly, … If that is true, it means they messed up badly this time.
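For anyone who wants to reproduce that kind of comparison against the reference code, here is a minimal sketch. The `attn_implementation` values are standard transformers options; whether the remote Hunyuan code honors both of them, and whether the tiny-random repo loads this way, are assumptions.

```python
# Sketch: compare next-token logits of the HF reference model under two
# attention backends, to see whether they diverge. Paths/IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM

path = "ngxson/hunyuan-moe-tiny-random"  # tiny random weights linked above

def last_logits(attn_impl: str) -> torch.Tensor:
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float32, device_map="cpu",
        trust_remote_code=True, attn_implementation=attn_impl,
    )
    with torch.no_grad():
        out = model(input_ids=torch.tensor([[16, 17]]))  # arbitrary token IDs
    return out.logits[0, -1]

diff = (last_logits("eager") - last_logits("sdpa")).abs().max()
print("max |eager - sdpa| on last-position logits:", diff.item())
```

A large difference here would point at one of the attention codepaths being broken rather than at the llama.cpp port.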
https://www.diffchecker.com/P3e0hQM5/
https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Instruct/

and https://www.diffchecker.com/P9FIR5OD/

In other words, it's almost Hunyuan Large? I'm not sure why the HF attention implementations would be bugged, but other reimplementations like vllm's seem to work, so maybe they can shed some light on this.
I take that back, apparently vllm is only sometimes working with A13B, heh.
I had the original model from Huggingface working coherently on pure CPU. It uses the HunYuanSdpaAttention codepath. This is all tentative, as I only just got it running at all: if I compare logits for a single-token prompt, I get a very similar logit distribution from both llama.cpp and the HF implementation. With more than one token, things look different.

I'm going purely by numerical token IDs for llama.cpp, as the tokenizer is messed up as observed (I tried 'a', token 64, for the single-token prompt and a '12' prompt, tokens (16, 17), for the two-token test).

This is with the combined code from @ngxson and @kooshi, with the .gguf made using @kooshi's code (I took the latest efforts I saw here in the discussion to start off). Below in the dropdowns are the test script and its stdout.

My machine has 256GB of memory, a Hetzner server with a modern AMD EPYC CPU. I do have a Mac Studio (M2, 192GB) as well, but for CPU work this Hetzner is usually much faster. (I don't know why asking it to use bfloat16 helps; maybe it doesn't make giant copies of tensors or something when you ask it to do that. It's just something I observed and never checked what it's doing behind the scenes.)

test.py

This is a version of the example code from the Huggingface page that I modified a bit.

```python
#!/usr/bin/env python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re


def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                                     torch_dtype=torch.bfloat16, trust_remote_code=True)
        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(
            messages, tokenize=True, return_tensors="pt",
            enable_thinking=True  # Toggle thinking mode (default: True)
        )
        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=20)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
```

stdout of test.py

The output contains the result both as token IDs and as text (two print()s). To run this, you need to install …
I'm on and off this weekend trying to also figure out where exactly the computation graph is off. If I find out before someone else does, I'll let you all know. (It runs surprisingly fast on transformers+CPU; I'm used to that combo being extraordinarily slow. It is still very slow, just not "it will take 30 minutes to make 10 tokens" slow.)
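In case it helps others repeat the single-token logit comparison described above, here is a rough sketch of the HF side. The token ID 64 for 'a' is taken from the comment and the model path is just a placeholder; the top-k list is what you would diff against llama.cpp's logits for the same raw IDs.

```python
# Dump the top next-token logits from the HF reference model for a raw
# token-ID prompt, to diff against llama.cpp's logits for the same IDs.
import torch
from transformers import AutoModelForCausalLM

model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
model = AutoModelForCausalLM.from_pretrained(
    model_path, local_files_only=True, device_map="cpu",
    torch_dtype=torch.bfloat16, trust_remote_code=True,
)

with torch.no_grad():
    out = model(input_ids=torch.tensor([[64]]))  # 'a' per the comment above

top = torch.topk(out.logits[0, -1].float(), k=10)
for token_id, logit in zip(top.indices.tolist(), top.values.tolist()):
    print(token_id, round(logit, 3))
```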
Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I'm limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won't work for me.
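Whether the custom Hunyuan modeling code tolerates it is exactly the open question here, but the generic transformers + bitsandbytes path would look roughly like this untested sketch:

```python
# Untested sketch: generic 4-bit (NF4) loading via bitsandbytes.
# Whether the remote Hunyuan MoE code works under this is unverified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # spread the layers across the available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)
```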
Their official inference script for running the int4 quant on vllm uses … (it still didn't work for me, though).
To add to @ubergarm's options, I did notice there are some quantized versions like https://huggingface.co/tencent/Hunyuan-A13B-Instruct-FP8 or https://huggingface.co/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 (they look like they are designed to work with vllm). The GPTQ-Int4 one has a single …

I haven't tried any of them. For computation graph work it feels better to get whatever is the highest precision I am able to run conveniently.
If someone can run it, could you please verify whether the `attention_mask` passed to `forward()` is None or not?
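One non-invasive way to check this, instead of editing the modeling file, is a forward pre-hook. This sketch reuses the `model` and `tokenized_chat` objects from the test.py script above, and the `model.model.layers[0].self_attn` module path is a guess based on typical HF decoder layouts:

```python
# Sketch: print whether the first attention layer receives an attention_mask,
# without modifying the remote modeling code. The module path is an assumption.
def report_mask(module, args, kwargs):
    print("attention_mask:", type(kwargs.get("attention_mask")))

handle = model.model.layers[0].self_attn.register_forward_pre_hook(
    report_mask, with_kwargs=True
)
model.generate(tokenized_chat.to(model.device), max_new_tokens=1)
handle.remove()
```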
@ngxson is this the part you wanted, to check whether it's None or not? The argument to the forward()? (screenshot)

Edit: took a bigger screenshot to show more clearly where I put that. Here is the stdout tail, because that first paste is cut off; I see: …
Edit 2: I'm going to let this thing generate a full response, which might take a while. But I feel this might be a bit short as a test; it almost verbatim mentions the prompt in the <think>, so maybe it's about to repeat itself or something. I'll paste it as a new comment when it's done. I just want more confirmation that the HF implementation itself works beyond very short generations.
Full response example of the stdout from test2.py (I cut off all the parts that said attention mask is None)
Code is almost the same as before, pasting for reproducibility:

test2.py

```python
#!/usr/bin/env python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re


def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                                     torch_dtype=torch.bfloat16, trust_remote_code=True)
        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(
            messages, tokenize=True, return_tensors="pt",
            enable_thinking=True  # Toggle thinking mode (default: True)
        )
        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
```

The output looks normal to me and it answered the prompt. It does look to me like it works. CPU-only, 256GB Hetzner server.
The … should be resolved by the PR from @kooshi.
```cpp
@@ -94,6 +94,7 @@ enum llm_type {
    LLM_TYPE_57B_A14B,
    LLM_TYPE_17B_16E,  // llama4 Scout
    LLM_TYPE_17B_128E, // llama4 Maverick
    LLM_TYPE_A13B,
```
nit: for consistency this should be `80B_A13B`, but it's not important.
The pow is fixed, but now I am consistently getting repetitions with the current build after roughly 100 tokens. I will try some commits to see where it was introduced.
You'll need to recreate the gguf to embed the scaled RoPE base.
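If anyone wants to double-check what a given conversion actually embedded, the gguf-py reader that ships with llama.cpp can print the RoPE metadata. A minimal sketch, with the file name taken from the split shared earlier in the thread as an example:

```python
# Print every RoPE-related key/value pair in a converted GGUF, to confirm the
# scaled base actually made it into the file (file path is an example).
from gguf import GGUFReader

reader = GGUFReader("Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf")
for name, field in reader.fields.items():
    if "rope" in name:
        # for scalar metadata, the last part holds the value
        print(name, field.parts[-1])
```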
Has anyone tried making an imatrix? My testing on both mainline and ik seems to give rather higher than expected perplexity values. Also, has anyone tried running without …? I have an experimental ~5bpw quant running with folks on BeaverAI Club testing it out right now. Definitely some quirks, but it is able to have multi-turn chats.
There is something wrong with my PPL test too. I'm getting PPL = 501.5049 +/- 5.54021.
Yeah, that seems really high for a model this large, thanks for confirming. In testing yesterday the model is able to have multi-turn chats that make some sense, but it still feels kind of "loose", like it loses its train of thought and feels a little too random even with a reasonable temp in the ~0.5-0.7 range...

Also, it seemed like with an empty system prompt it would occasionally mumble things before the first … Finally, the Hugging Face bot auto-created some …

EDIT: Also, it seems like vllm support just landed, and I managed to get the official Int4 quant running like so on 2x A6000s (96GB VRAM total):

```bash
NCCL_P2P_DISABLE=1 \
vllm serve \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/ \
    --served-model-name "Hunyuan-A13B-Instruct-GPTQ-Int4" \
    --quantization gptq_marlin \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --host 127.0.0.1 \
    --port 8080
```

In very initial testing ☝️ it seems more coherent, fwiw... I've got to figure out how to run perplexity with vllm to know for sure, though.
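For quickly eyeballing coherence across backends, both llama-server and vllm expose an OpenAI-compatible endpoint, so a minimal request like this works against either (host, port, and model name taken from the commands above):

```python
# Minimal chat request against the OpenAI-compatible server started above,
# useful for comparing llama.cpp and vllm output on the same prompt.
import json
import urllib.request

payload = {
    "model": "Hunyuan-A13B-Instruct-GPTQ-Int4",  # used by vllm; llama-server ignores it
    "messages": [{"role": "user", "content": "Tell a funny joke in English."}],
    "max_tokens": 256,
    "temperature": 0.6,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```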
They just removed the .bin files.
I just tried to run perplexity using the latest …:

```
$ export model="/mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/"
$ export max_model_len=512
$ VLLM_USE_V1=0 \
  VLLM_ATTENTION_BACKEND=XFORMERS \
  NCCL_P2P_DISABLE=1 \
  lm_eval \
    --model vllm \
    --model_args pretrained="$model",add_bos_token=True,dtype=bfloat16,max_model_len="$max_model_len",enforce_eager=True,tensor_parallel_size=2,gpu_memory_utilization=0.95,quantization=gptq_marlin,disable_custom_all_reduce=True,trust_remote_code=True \
    --seed 1337 \
    --tasks wikitext
...
2025-07-01:17:09:05 INFO [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
vllm (pretrained=/mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/,add_bos_token=True,dtype=bfloat16,max_model_len=512,enforce_eager=True,tensor_parallel_size=2,gpu_memory_utilization=0.95,quantization=gptq_marlin,disable_custom_all_reduce=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
```

So a perplexity of 727.4 seems really high?! (assuming I ran it correctly...) For comparison, I would expect qwen3-30b-a3b awq to come in around ~11... I re-ran the above commands without …

Soo... is Hunyuan-80B-A13B really that high of perplexity? Or is how it chooses experts (or not) messing with how perplexity is calculated? I'm just spitballing, out of my league here lol...
Fwiw, the model responds poorly to non-formatted input (i.e. no chat template), so I assume it's very sensitive to the chat template, hence the high PPL with wikitext.
Post-training is known to increase perplexity, and they also released the base model. Maybe with the base model the perplexity will be lower, though it's weird that a post-trained model causes such a perplexity increase, unless the base model has a high perplexity to start with. Unfortunately I don't have enough VRAM to test the perplexity of these models.
Have you tried examining the output logits? To get a crazy high perplexity like that, the model must be predicting a token with near certainty that didn't occur.
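Along those lines, a rough sketch (not the llama-perplexity methodology) of how one could look at which tokens blow up, using the HF reference model on CPU; the text snippet and k are arbitrary:

```python
# Run a short text through the HF model and print the worst per-token
# log-probs, to see which positions drive the perplexity up.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

path = "tencent/Hunyuan-A13B-Instruct"
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True
)

text = "The Hunyuan series of models are large language models released by Tencent."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
nll = -logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

worst = torch.topk(nll, k=5)
for pos, val in zip(worst.indices.tolist(), worst.values.tolist()):
    print(repr(tok.decode(ids[0, pos + 1].item())), round(val, 2))
print("mean NLL:", nll.mean().item(), "ppl:", nll.mean().exp().item())
```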
Was that for the full-sized bf16? I just re-pulled the new .safetensors, converted, did imatrix, and made a test quant mix weighing in at 34.088 GiB (3.642 BPW), giving me … Given vLLM scores high, and we're both getting pretty high values, that may just be normal for this model, as mentioned. I haven't tried the base model, nor looked at what tokens it is actually predicting, e.g. is it always trying to say …

In some more testing, even this under-4bpw quant seems to be working okay, so I'd say ship it! I saw some updated bullerwins/Hunyuan-A13B-Instruct-GGUF mainline quants already available as well! If something official comes out later about it, it could be addressed in a subsequent PR. None of my biz, I know, but just my 2 cents hah... Cheers and thanks!
This model is definitely weird in my testing as well.
I can't run the vllm version because it doesn't support pipeline parallel (or it didn't earlier; maybe the official release does?), so I can't know if this is inherent to the model, but I do suspect that a lot of its weirdness comes from how they scaled the context. Stretching the RoPE base from the standard 10k to 11M means that it's much more difficult for it to differentiate between tokens that are close together, but it's shockingly effective at long contexts. This also results in additional weirdness if you adjust the …

I do wish we had an empirical comparison of the "official" implementation with this one, though.

Edit: Oh! I had skipped past some comments. Thank you @ubergarm for the ppl test on the GPTQ version. I guess this model really is just weird.
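To put a rough number on that RoPE intuition: with the usual θ_i = base^(-2i/d) frequencies, the slowest-rotating dimension advances orders of magnitude less per token when the base goes from 10k to ~11M. The head_dim of 128 below is an assumption, purely for illustration:

```python
# Illustration only: per-token rotation of the slowest RoPE frequency for two
# bases. A larger base means adjacent positions look far more similar in the
# low-frequency dimensions. head_dim = 128 is assumed, not taken from the model.
def slowest_rotation_per_token(base: float, head_dim: int = 128) -> float:
    i = head_dim // 2 - 1               # index of the lowest-frequency pair
    return base ** (-2 * i / head_dim)  # theta_i = base^(-2i/d)

for base in (10_000.0, 11_000_000.0):
    print(f"base={base:>12,.0f}  slowest rotation/token = {slowest_rotation_per_token(base):.3e} rad")
```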
Tested PPL with the FP8 model and vllm compiled from source with the latest code with support for hunyuan:
Even higher PPL.
@ubergarm that was for Q8_0. The day it launched I converted the .bin files to safetensors manually, and I've been using that for all my tests since: https://huggingface.co/bullerwins/Hunyuan-A13B-Instruct-hf. But now it's not needed, as they updated the OG repo. Normal use seems fine for general knowledge and coding questions, but I haven't tested thoroughly.

Edit: MMLU-Pro computer science test for the FP8 = 75.85, in line with the Q4-Q5 gguf results.
Here is the PPL with the pretrain model from https://huggingface.co/tencent/Hunyuan-A13B-Pretrain:

```
make -j && ./bin/llama-perplexity -m ../models/hunyuan-a13b-pt/ggml-model-q8_0.gguf -f wikitext-2-raw/wiki.test.raw -fa
Final estimate: PPL = 5.2861 +/- 0.03234
```
@ngxson Do the logits match if this new expert router algorithm is disabled in the reference implementation?
Fix #14415
TODO: