model : add hunyuan moe #14425
base: master
Conversation
convert_hf_to_gguf.py (outdated)
```python
for token, rank in mergeable_ranks.items():
    vocab[QwenModel.token_bytes_to_string(token)] = rank
    if len(token) == 1:
        continue
    merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
    if len(merged) == 2:
        merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))
```
I quite doubt whether this is correct. If someone knows, or has time to run a tokenizer test, please feel free to leave a comment.
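Not a definitive test, but one way to probe this is the sketch below. It assumes the Hunyuan tokenizer exposes a tiktoken-style `mergeable_ranks` dict the way the Qwen tokenizer does (that attribute name is an assumption here), and checks that every multi-byte token can be formed by merging two lower-ranked tokens, which is what the merge-reconstruction above relies on.

```python
# Hedged sanity check for the reconstructed BPE merges.
# Assumption: the trust_remote_code tokenizer exposes `mergeable_ranks`
# (bytes -> rank), as the conversion snippet above relies on.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)
mergeable_ranks = tokenizer.mergeable_ranks

unreachable = 0
for token, rank in mergeable_ranks.items():
    if len(token) == 1:
        continue  # single bytes need no merge
    # the token is reachable if some split into two known, lower-ranked pieces exists
    reachable = any(
        token[:i] in mergeable_ranks
        and token[i:] in mergeable_ranks
        and mergeable_ranks[token[:i]] < rank
        and mergeable_ranks[token[i:]] < rank
        for i in range(1, len(token))
    )
    unreachable += not reachable

print(f"{unreachable} multi-byte tokens have no valid lower-rank merge")
```

If that count comes out non-zero, the emitted merges list probably cannot reproduce the original vocabulary.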
OK, getting somewhere now. The model runs, but the output is gibberish.
Thanks for working on this! I got the same-looking output when I tried it. The only odd things I noticed were:
Tested on an AMD 7965WX 24-core, 256GB DDR5@4800 + dual RTX A6000 (96GB total VRAM) rig.

👈 a few more commands and logs fwiw

convert:

```bash
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct/
```
llama-server:

```bash
model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf
./build/bin/llama-server \
    --model "$model" \
    -fa \
    -ctk f16 -ctv f16 \
    -c 8192 \
    -ts 48,48 \
    -ngl 10 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080
```
client:

```
>>> User:
Tell a funny joke in English.

>>> Assistant:
[UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧][UNK_BYTE_0xe697a7新旧][UNK_BYTE_0xe696b0新旧]
```
I don't know as much about this as you guys, but could it be that the tokenizer is splitting characters like 新 ("new") into raw bytes? The UTF-8 sequence of such a character gets broken apart, and the fragments get wrapped in `UNK_BYTE` markers, because common Chinese characters always use 3 bytes in UTF-8. It matches the error output above.
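As a quick sanity check of that theory (plain Python, nothing model-specific), the hex bytes in the markers do line up with the UTF-8 encodings of 新 and 旧:

```python
# The bytes embedded in the UNK_BYTE markers are exactly the UTF-8 encodings
# of the characters being emitted, which supports the byte-splitting theory.
print("新".encode("utf-8").hex())  # e696b0 -> matches [UNK_BYTE_0xe696b0...]
print("旧".encode("utf-8").hex())  # e697a7 -> matches [UNK_BYTE_0xe697a7...]
```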
The cgraph is still not correct. Testing with this tiny random weight: https://huggingface.co/ngxson/hunyuan-moe-tiny-random/tree/main. It seems like the problem is in the self-attention block.
I don't know if the improvements I am seeing are from your latest changes or from my own edits. The changes I made were:

My edits are here: https://github.com/kooshi/llama.cpp/tree/hunyuan
The more I look at the upstream implementation, the more I wonder whether it actually works. My Mac M3 Ultra can't load the original model even though it has 512GB of RAM, so for now I'm testing with the tiny weight. Switching between attention implementations changes the output. Also, … And more importantly, … If that is true, it means they messed up badly this time.
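For anyone who wants to reproduce that kind of comparison against the reference code, here is a minimal sketch. The `attn_implementation` values are standard transformers options; whether the remote Hunyuan code honors both of them, and whether the tiny-random repo loads this way, are assumptions.

```python
# Sketch: compare next-token logits of the HF reference model under two
# attention backends, to see whether they diverge. Paths/IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM

path = "ngxson/hunyuan-moe-tiny-random"  # tiny random weights linked above

def last_logits(attn_impl: str) -> torch.Tensor:
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float32, device_map="cpu",
        trust_remote_code=True, attn_implementation=attn_impl,
    )
    with torch.no_grad():
        out = model(input_ids=torch.tensor([[16, 17]]))  # arbitrary token IDs
    return out.logits[0, -1]

diff = (last_logits("eager") - last_logits("sdpa")).abs().max()
print("max |eager - sdpa| on last-position logits:", diff.item())
```

A large difference here would point at one of the attention codepaths being broken rather than at the llama.cpp port.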
https://www.diffchecker.com/P3e0hQM5/
https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Instruct/

and https://www.diffchecker.com/P9FIR5OD/

In other words, it's almost Hunyuan Large? I'm not sure why the HF attention implementations would be bugged, but other reimplementations like vllm's seem to work, so maybe they can shed some light on this.
I take that back, apparently vllm is only sometimes working with A13B, heh.
I had the original model from Huggingface working coherently on pure CPU. It uses the HunYuanSdpaAttention codepath. This is all tentative, as I only just got it running at all: if I compare logits for a single-token prompt, I get a very similar logit distribution from both llama.cpp and the HF implementation. With more than one token, things look different.

I'm going purely by numerical token IDs for llama.cpp, as the tokenizer is messed up as observed (I tried 'a', token 64, for the single-token prompt and a '12' prompt, tokens (16, 17), for the two-token test).

This is with the combined code from @ngxson and @kooshi, with the .gguf made using @kooshi's code (I took the latest efforts I saw here in the discussion to start off). Below in the dropdowns are the test script and its stdout.

My machine has 256GB of memory, a Hetzner server with a modern AMD EPYC CPU. I do have a Mac Studio (M2, 192GB) as well, but for CPU work this Hetzner is usually much faster. (I don't know why asking it to use bfloat16 helps; maybe it doesn't make giant copies of tensors or something when you ask it to do that. It's just something I observed and never checked what it's doing behind the scenes.)

test.py

This is a version of the example code from the Huggingface page that I modified a bit.

```python
#!/usr/bin/env python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re


def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                                     torch_dtype=torch.bfloat16, trust_remote_code=True)
        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(
            messages, tokenize=True, return_tensors="pt",
            enable_thinking=True  # Toggle thinking mode (default: True)
        )
        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=20)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
```

stdout of test.py

The output contains the result both as token IDs and as text (two print()s). To run this, you need to install …
I'm on and off this weekend trying to also figure out where exactly the computation graph is off. If I find out before someone else does, I'll let you all know. (It runs surprisingly fast on transformers+CPU; I'm used to that combo being extraordinarily slow. It is still very slow, just not "it will take 30 minutes to make 10 tokens" slow.)
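In case it helps others repeat the single-token logit comparison described above, here is a rough sketch of the HF side. The token ID 64 for 'a' is taken from the comment and the model path is just a placeholder; the top-k list is what you would diff against llama.cpp's logits for the same raw IDs.

```python
# Dump the top next-token logits from the HF reference model for a raw
# token-ID prompt, to diff against llama.cpp's logits for the same IDs.
import torch
from transformers import AutoModelForCausalLM

model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
model = AutoModelForCausalLM.from_pretrained(
    model_path, local_files_only=True, device_map="cpu",
    torch_dtype=torch.bfloat16, trust_remote_code=True,
)

with torch.no_grad():
    out = model(input_ids=torch.tensor([[64]]))  # 'a' per the comment above

top = torch.topk(out.logits[0, -1].float(), k=10)
for token_id, logit in zip(top.indices.tolist(), top.values.tolist()):
    print(token_id, round(logit, 3))
```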
Is it possible to load this model in 4-bit precision using Transformers? Does bitsandbytes support this model? I'm limited to a total of 72GB of VRAM across several GPUs, so bfloat16 won't work for me.
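Whether the custom Hunyuan modeling code tolerates it is exactly the open question here, but the generic transformers + bitsandbytes path would look roughly like this untested sketch:

```python
# Untested sketch: generic 4-bit (NF4) loading via bitsandbytes.
# Whether the remote Hunyuan MoE code works under this is unverified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # spread the layers across the available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)
```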
Their official inference script for running the int4 quant on vllm uses … (it still didn't work for me, though).
To add to @ubergarm's options, I did notice there are some quantized versions like https://huggingface.co/tencent/Hunyuan-A13B-Instruct-FP8 or https://huggingface.co/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 (they look like they are designed to work with vllm). The GPTQ-Int4 one has a single …

I haven't tried any of them. For computation graph work it feels better to get whatever is the highest precision I am able to run conveniently.
If someone can run it, could you please verify whether the `attention_mask` passed to `forward()` is None or not?
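One non-invasive way to check this, instead of editing the modeling file, is a forward pre-hook. This sketch reuses the `model` and `tokenized_chat` objects from the test.py script above, and the `model.model.layers[0].self_attn` module path is a guess based on typical HF decoder layouts:

```python
# Sketch: print whether the first attention layer receives an attention_mask,
# without modifying the remote modeling code. The module path is an assumption.
def report_mask(module, args, kwargs):
    print("attention_mask:", type(kwargs.get("attention_mask")))

handle = model.model.layers[0].self_attn.register_forward_pre_hook(
    report_mask, with_kwargs=True
)
model.generate(tokenized_chat.to(model.device), max_new_tokens=1)
handle.remove()
```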
@ngxson is this the part you wanted, to check whether it's None or not? The argument to the forward()? (screenshot)

Edit: took a bigger screenshot to show more clearly where I put that. Here is the stdout tail, because that first paste is cut off; I see: …
Edit 2: I'm going to let this thing generate a full response, which might take a while. But I feel this might be a bit short as a test; it almost verbatim mentions the prompt in the <think>, so maybe it's about to repeat itself or something. I'll paste it as a new comment when it's done. I just want more confirmation that the HF implementation itself works beyond very short generations.
Full response example of the stdout from test2.py (I cut off all the parts that said attention mask is None)
Code is almost the same as before, pasting for reproducibility:

test2.py

```python
#!/usr/bin/env python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import re


def main():
    with torch.no_grad():
        model_path = '/home/shannon/llama.cpp/tencent_Hunyuan-A13B-Instruct'
        tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, device_map="cpu",
                                                     torch_dtype=torch.bfloat16, trust_remote_code=True)
        messages = [
            {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
        ]
        tokenized_chat = tokenizer.apply_chat_template(
            messages, tokenize=True, return_tensors="pt",
            enable_thinking=True  # Toggle thinking mode (default: True)
        )
        outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=5000)
        output_text = tokenizer.decode(outputs[0])
        print(outputs)
        print(output_text)


if __name__ == '__main__':
    main()
```

The output looks normal to me and it answered the prompt. It does look to me like it works. CPU-only, 256GB Hetzner server.
The … should be resolved by the PR from @kooshi.
```cpp
@@ -94,6 +94,7 @@ enum llm_type {
    LLM_TYPE_57B_A14B,
    LLM_TYPE_17B_16E,  // llama4 Scout
    LLM_TYPE_17B_128E, // llama4 Maverick
    LLM_TYPE_A13B,
```
nit: for consistency this should be `80B_A13B`, but it's not important.
The pow is fixed, but now I am consistently getting repetitions with the current build after roughly 100 tokens. I will try some commits to see where it was introduced.
You'll need to recreate the gguf to embed the scaled RoPE base.
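If anyone wants to double-check what a given conversion actually embedded, the gguf-py reader that ships with llama.cpp can print the RoPE metadata. A minimal sketch, with the file name taken from the split shared earlier in the thread as an example:

```python
# Print every RoPE-related key/value pair in a converted GGUF, to confirm the
# scaled base actually made it into the file (file path is an example).
from gguf import GGUFReader

reader = GGUFReader("Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf")
for name, field in reader.fields.items():
    if "rope" in name:
        # for scalar metadata, the last part holds the value
        print(name, field.parts[-1])
```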
Has anyone tried making an imatrix? My testing on both mainline and ik seems to give rather higher than expected perplexity values. Also, has anyone tried running without …? I have an experimental ~5bpw quant running with folks on BeaverAI Club testing it out right now. Definitely some quirks, but it is able to have multi-turn chats.
There is something wrong with my PPL test too. I'm getting PPL = 501.5049 +/- 5.54021.
Yeah, that seems really high for a model this large, thanks for confirming. In testing yesterday the model is able to have multi-turn chats that make some sense, but it still feels kind of "loose", like it loses its train of thought and feels a little too random even with a reasonable temp in the ~0.5-0.7 range...

Also, it seemed like with an empty system prompt it would occasionally mumble things before the first … Finally, the Hugging Face bot auto-created some …

EDIT: Also, it seems like vllm support just landed, and I managed to get the official Int4 quant running like so on 2x A6000s (96GB VRAM total):

```bash
NCCL_P2P_DISABLE=1 \
vllm serve \
    /mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/ \
    --served-model-name "Hunyuan-A13B-Instruct-GPTQ-Int4" \
    --quantization gptq_marlin \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --host 127.0.0.1 \
    --port 8080
```

In very initial testing ☝️ it seems more coherent, fwiw... I've got to figure out how to run perplexity with vllm to know for sure, though.
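For quickly eyeballing coherence across backends, both llama-server and vllm expose an OpenAI-compatible endpoint, so a minimal request like this works against either (host, port, and model name taken from the commands above):

```python
# Minimal chat request against the OpenAI-compatible server started above,
# useful for comparing llama.cpp and vllm output on the same prompt.
import json
import urllib.request

payload = {
    "model": "Hunyuan-A13B-Instruct-GPTQ-Int4",  # used by vllm; llama-server ignores it
    "messages": [{"role": "user", "content": "Tell a funny joke in English."}],
    "max_tokens": 256,
    "temperature": 0.6,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```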
They just removed the .bin files.
I just tried to run perplexity using the latest …:

```
$ export model="/mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/"
$ export max_model_len=512
$ VLLM_USE_V1=0 \
  VLLM_ATTENTION_BACKEND=XFORMERS \
  NCCL_P2P_DISABLE=1 \
  lm_eval \
    --model vllm \
    --model_args pretrained="$model",add_bos_token=True,dtype=bfloat16,max_model_len="$max_model_len",enforce_eager=True,tensor_parallel_size=2,gpu_memory_utilization=0.95,quantization=gptq_marlin,disable_custom_all_reduce=True,trust_remote_code=True \
    --seed 1337 \
    --tasks wikitext
...
2025-07-01:17:09:05 INFO [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
vllm (pretrained=/mnt/raid/models/tencent/Hunyuan-A13B-Instruct-GPTQ-Int4/,add_bos_token=True,dtype=bfloat16,max_model_len=512,enforce_eager=True,tensor_parallel_size=2,gpu_memory_utilization=0.95,quantization=gptq_marlin,disable_custom_all_reduce=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
```

So a perplexity of 727.4 seems really high?! (assuming I ran it correctly...) For comparison, I would expect qwen3-30b-a3b awq to come in around ~11... I re-ran the above commands without …

Soo... is Hunyuan-80B-A13B really that high of perplexity? Or is how it chooses experts (or not) messing with how perplexity is calculated? I'm just spitballing, out of my league here lol...
Fwiw, the model responds poorly to non-formatted input (i.e. no chat template), so I assume it's very sensitive to the chat template, hence the high PPL with wikitext.
Post-training is known to increase perplexity, and they also released the base model. Maybe with the base model the perplexity will be lower, though it's weird that a post-trained model causes such a perplexity increase, unless the base model has a high perplexity to start with. Unfortunately I don't have enough VRAM to test the perplexity of these models.
Have you tried examining the output logits? To get a crazy high perplexity like that, the model must be predicting a token with near certainty that didn't occur.
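Along those lines, a rough sketch (not the llama-perplexity methodology) of how one could look at which tokens blow up, using the HF reference model on CPU; the text snippet and k are arbitrary:

```python
# Run a short text through the HF model and print the worst per-token
# log-probs, to see which positions drive the perplexity up.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

path = "tencent/Hunyuan-A13B-Instruct"
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True
)

text = "The Hunyuan series of models are large language models released by Tencent."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
nll = -logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

worst = torch.topk(nll, k=5)
for pos, val in zip(worst.indices.tolist(), worst.values.tolist()):
    print(repr(tok.decode(ids[0, pos + 1].item())), round(val, 2))
print("mean NLL:", nll.mean().item(), "ppl:", nll.mean().exp().item())
```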
Was that for the full-sized bf16? I just re-pulled the new .safetensors, converted, did imatrix, and made a test quant mix weighing in at 34.088 GiB (3.642 BPW), giving me … Given vLLM scores high, and we're both getting pretty high values, that may just be normal for this model, as mentioned. I haven't tried the base model, nor looked at what tokens it is actually predicting, e.g. is it always trying to say …

In some more testing, even this under-4bpw quant seems to be working okay, so I'd say ship it! I saw some updated bullerwins/Hunyuan-A13B-Instruct-GGUF mainline quants already available as well! If something official comes out later about it, it could be addressed in a subsequent PR. None of my biz, I know, but just my 2 cents hah... Cheers and thanks!
This model is definitely weird in my testing as well.
I can't run the vllm version because it doesn't support pipeline parallel (or it didn't earlier; maybe the official release does?), so I can't know if this is inherent to the model, but I do suspect that a lot of its weirdness comes from how they scaled the context. Stretching the RoPE base from the standard 10k to 11M means that it's much more difficult for it to differentiate between tokens that are close together, but it's shockingly effective at long contexts. This also results in additional weirdness if you adjust the …

I do wish we had an empirical comparison of the "official" implementation with this one, though.

Edit: Oh! I had skipped past some comments. Thank you @ubergarm for the ppl test on the GPTQ version. I guess this model really is just weird.
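To put a rough number on that RoPE intuition: with the usual θ_i = base^(-2i/d) frequencies, the slowest-rotating dimension advances orders of magnitude less per token when the base goes from 10k to ~11M. The head_dim of 128 below is an assumption, purely for illustration:

```python
# Illustration only: per-token rotation of the slowest RoPE frequency for two
# bases. A larger base means adjacent positions look far more similar in the
# low-frequency dimensions. head_dim = 128 is assumed, not taken from the model.
def slowest_rotation_per_token(base: float, head_dim: int = 128) -> float:
    i = head_dim // 2 - 1               # index of the lowest-frequency pair
    return base ** (-2 * i / head_dim)  # theta_i = base^(-2i/d)

for base in (10_000.0, 11_000_000.0):
    print(f"base={base:>12,.0f}  slowest rotation/token = {slowest_rotation_per_token(base):.3e} rad")
```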
Tested PPL with the FP8 model and vllm compiled from source with the latest code with support for hunyuan:
Even higher PPL.
@ubergarm that was for Q8_0. The day it launched I converted the .bin files to safetensors manually, and I've been using that for all my tests since: https://huggingface.co/bullerwins/Hunyuan-A13B-Instruct-hf. But now it's not needed, as they updated the OG repo. Normal use seems fine for general knowledge and coding questions, but I haven't tested thoroughly.

Edit: MMLU-Pro computer science test for the FP8 = 75.85, in line with the Q4-Q5 gguf results.
Here is the PPL with the pretrain model from https://huggingface.co/tencent/Hunyuan-A13B-Pretrain:

```
make -j && ./bin/llama-perplexity -m ../models/hunyuan-a13b-pt/ggml-model-q8_0.gguf -f wikitext-2-raw/wiki.test.raw -fa
Final estimate: PPL = 5.2861 +/- 0.03234
```
@ngxson Do the logits match if this new expert router algorithm is disabled in the reference implementation?
Fix #14415
TODO: