
Conversation

@TrevorS (Contributor) commented Dec 28, 2025

Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.

Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

This PR implements the thinker model only, providing just text -> text.

thinker-f16 on dgx-spark:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1856.94 ± 11.77 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         34.88 ± 0.06 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1692.98 ± 4.34 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         32.07 ± 0.12 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1552.70 ± 1.64 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         29.64 ± 0.14 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1304.71 ± 2.41 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.26 ± 0.03 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1001.73 ± 1.68 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         21.43 ± 0.02 |
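
For context, a `llama-bench` invocation along these lines would produce a table like the one above; the exact flags are an assumption reconstructed from the column headers, not taken from the PR:

```sh
# Assumed reconstruction from the table's columns (ngl=99, n_ubatch=2048,
# fa=1, mmap=0, pp2048/tg32 at several depths); not copied from the PR.
llama-bench -m thinker-f16.gguf \
  -ngl 99 -ub 2048 -fa 1 -mmp 0 \
  -p 2048 -n 32 -d 0,4096,8192,16384,32768
```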
```
Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : thinker-f16.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Why write smaller PRs? Respond with less than 10 words.

Easier to review, test, and merge quickly.

[ Prompt: 68.6 t/s | Generation: 31.5 t/s ]

>
```

AI Disclosure

AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!

@arch-btw (Contributor) commented Dec 28, 2025

Nice job. I think `deepcopy` might not be needed since you're not modifying anything nested.
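
For illustration, a minimal sketch of the shallow-vs-deep copy point, using a hypothetical config dict rather than the PR's code:

```python
import copy

# Hypothetical HF-style config with one nested section (illustrative only).
config = {"model_type": "qwen3_omni_moe",
          "thinker_config": {"num_hidden_layers": 48}}

# A shallow copy suffices when only top-level keys are added or replaced;
# the nested dict stays shared, which is safe because it is never mutated.
flat = dict(config)
flat.update(config["thinker_config"])

# copy.deepcopy also works, but it clones every nested object, which is
# only needed if something nested were modified in place.
deep = copy.deepcopy(config)
assert flat["thinker_config"] is config["thinker_config"]
assert deep["thinker_config"] is not config["thinker_config"]
```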

Add support for Qwen3-Omni Thinker, a 48-layer MoE model with 128 experts
(8 active per token) and optional shared expert. This enables text-only
inference as the foundation for full multimodal support.

Key changes:
- New architecture: LLM_ARCH_QWEN3OMNIMOE
- GGUF conversion with nested thinker_config handling
- IMRoPE (Interleaved M-RoPE) with sections [24, 20, 20, 0]
- Shared expert support in qwen3vl-moe graph builder
- Reuses llm_build_qwen3vlmoe for graph construction

Address review feedback:
- Rename class to Qwen3OmniMoeModel, inherit from Qwen2MoeModel
- Remove __init__ override (thinker_config handled at L720-722)
- Remove set_gguf_parameters (mrope_section via rope_scaling)

Keep set_vocab for EOS/PAD: Qwen3-Omni lacks tokenizer.json
(uses vocab.json + merges.txt), so SpecialVocab can't discover
token IDs automatically.
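
For illustration, a minimal, self-contained sketch of the nested thinker_config handling described in the first commit; the `text_config` sub-key, function name, and sample values are assumptions, not the PR's actual code:

```python
# Hypothetical sketch: hoist the text model's hyperparameters out of the
# nested "thinker_config" so a generic MoE conversion path can read
# num_hidden_layers, num_experts, etc. at the top level.
def flatten_thinker_config(hparams: dict) -> dict:
    thinker = hparams.get("thinker_config")
    if thinker is None:
        return hparams  # plain text-only checkpoint, nothing to hoist
    merged = dict(hparams)  # a shallow copy is enough (see the note above)
    merged.update(thinker.get("text_config", thinker))
    return merged

cfg = {"model_type": "qwen3_omni_moe",
       "thinker_config": {"num_hidden_layers": 48, "num_experts": 128}}
flat = flatten_thinker_config(cfg)
assert flat["num_hidden_layers"] == 48 and flat["num_experts"] == 128
```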
Comment on lines +4326 to +4328
```python
# Qwen3-Omni lacks tokenizer.json, so token IDs must be set explicitly
self.gguf_writer.add_eos_token_id(151645)  # <|im_end|> - required for generation
self.gguf_writer.add_pad_token_id(151643)  # <|endoftext|> - required for batching
```
A Collaborator commented:
The comment is incorrect; it's because, for some reason, they are explicitly set to null in config.json.
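
For illustration, a minimal sketch of why the IDs must be set explicitly when the fields are null (the key names are assumed; the token IDs come from the snippet above):

```python
import json

# Illustrative config.json fragment: per the review comment, the fields
# are explicitly null rather than merely absent.
cfg = json.loads('{"eos_token_id": null, "pad_token_id": null}')

# A plain .get() default does not help here: the keys exist, so the
# stored null (None) wins and the IDs must be set explicitly.
eos = cfg.get("eos_token_id") or 151645  # <|im_end|>
pad = cfg.get("pad_token_id") or 151643  # <|endoftext|>
assert (eos, pad) == (151645, 151643)
```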

```cpp
layer.ffn_up_exps = create_tensor(tn(LLM_TENSOR_FFN_UP_EXPS, "weight", i), { n_embd, n_ff_exp, n_expert}, 0);
    }
} break;
case LLM_ARCH_QWEN3OMNIMOE:
```
A Collaborator commented:
Since this is only Qwen3VLMoe with shared experts added, and you are adding shared-expert support to qwen3vl-moe.cpp, I suggest you do the same here instead of duplicating code.

@ngxson (Collaborator) commented Jan 1, 2026

If I understand correctly, Qwen3-Omni is just Qwen3-VL with a Whisper encoder for audio.

There is no need to introduce this many changes. The conversion script can simply mark this info.
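
For illustration, a hypothetical sketch of what "simply mark this info" in the conversion script could look like; the mapping table and names are assumptions, not llama.cpp's actual code:

```python
# Hypothetical sketch: reuse the existing Qwen3-VL MoE text path for the
# Omni checkpoint instead of registering a whole new architecture.
ARCH_FOR_MODEL_TYPE = {
    "qwen3_vl_moe": "qwen3vlmoe",
    "qwen3_omni_moe": "qwen3vlmoe",  # same text graph; audio handled separately
}

def pick_arch(model_type: str) -> str:
    return ARCH_FOR_MODEL_TYPE[model_type]

assert pick_arch("qwen3_omni_moe") == "qwen3vlmoe"
```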

Besides, I don't feel comfortable using AI for anything related to mtmd; it generates too much redundant and overkill code.

I will replace this PR with another, much simpler approach.

@ngxson closed this on Jan 1, 2026.