feat: Add DeepSeek R1 and distilled model support #2131

Draft
ljluestc wants to merge 2 commits into abetlen:main from ljluestc:feat/deepseek-r1-support

ljluestc commented Mar 1, 2026

feat: Add DeepSeek R1 and distilled model support

Closes #1952

Summary

Adds full chat format support for DeepSeek R1, DeepSeek R1 Distill (Qwen), and DeepSeek R1 Distill (Llama) models. Updates the llama.cpp submodule to b8184, which includes native architecture support for DeepSeek R1/V2/V3.

Problem

DeepSeek R1 and its distilled variants are among the most popular open-weight reasoning models, but llama-cpp-python currently lacks both the inference backend support and the chat format handling required to run them correctly. Users attempting to load DeepSeek R1 GGUFs get incorrect prompt formatting, double BOS tokens, and missing architecture support at the C++ layer.

Changes

llama_cpp/llama_chat_format.py

  • Added DEEPSEEK_R1_CHAT_TEMPLATE constant sourced from the official HuggingFace tokenizer config
  • Added DEEPSEEK_R1_BOS_TOKEN and DEEPSEEK_R1_EOS_TOKEN constants, written with the non-ASCII characters used in DeepSeek's special tokens: the fullwidth vertical line (\uff5c) and the lower one eighth block (\u2581)
  • Registered three new chat formats:
    • deepseek-r1 — primary format with correct special token handling (<|User|>, <|Assistant|>, <|begin▁of▁sentence|>, <|end▁of▁sentence|>)
    • deepseek-r1-distill-qwen — alias for Qwen-based distilled models
    • deepseek-r1-distill-llama — alias for Llama-based distilled models
  • Updated guess_chat_format_from_gguf_metadata() to auto-detect DeepSeek R1 models via:
    • Exact template match against DEEPSEEK_R1_CHAT_TEMPLATE
    • Heuristic fallback checking for characteristic <|User|> / <|Assistant|> tokens in the chat template
  • Strips reasoning content delimited by </think> from prior assistant turns in multi-turn conversations, so earlier chain-of-thought does not clutter the context
  • Sets added_special=True in the formatter response to prevent double BOS token injection during tokenization
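
To make the formatting behavior above concrete, here is a minimal sketch of how such a formatter could assemble a prompt. It is not the code from this PR: `strip_reasoning` and `format_prompt_sketch` are hypothetical names, and the token layout (system text right after BOS, EOS after each prior assistant turn, a trailing assistant tag) is an assumption modeled on the description rather than copied from the official template.

```python
from typing import Dict, List

# Special tokens spelled with the code points named above:
# U+FF5C (fullwidth vertical line) and U+2581 (lower one eighth block).
BOS = "<\uff5cbegin\u2581of\u2581sentence\uff5c>"
EOS = "<\uff5cend\u2581of\u2581sentence\uff5c>"
USER = "<\uff5cUser\uff5c>"
ASSISTANT = "<\uff5cAssistant\uff5c>"


def strip_reasoning(content: str) -> str:
    """Keep only the text after the last </think> tag, if there is one."""
    return content.split("</think>")[-1]


def format_prompt_sketch(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: render chat messages into one prompt string."""
    # Assumption: all system messages are concatenated and placed after BOS.
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    parts = [BOS, system]
    for m in messages:
        if m["role"] == "user":
            parts.append(USER + m["content"])
        elif m["role"] == "assistant":
            # Prior assistant turns lose their chain-of-thought.
            parts.append(ASSISTANT + strip_reasoning(m["content"]) + EOS)
    # End with the assistant tag so the model generates the next reply.
    parts.append(ASSISTANT)
    return "".join(parts)
```

Because the rendered prompt already starts with the BOS token, a caller has to tell the tokenizer not to prepend another one, which is what the added_special flag is for.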

llama_cpp/__init__.py

  • Version bump from 0.3.16 → 0.3.17

vendor/llama.cpp

  • Updated submodule to b8184 (3191462) which adds native DeepSeek R1/V2/V3 architecture support in the inference backend

Testing

All 11 tests pass (2 existing + 9 new).

ljluestc added 2 commits March 1, 2026 12:30
- Update llama.cpp submodule to latest (b8184) for full DeepSeek R1/V2/V3 architecture support
- Add 'deepseek-r1' chat format with correct special tokens (<|User|>, <|Assistant|>, <|begin▁of▁sentence|>, <|end▁of▁sentence|>)
- Add 'deepseek-r1-distill-qwen' and 'deepseek-r1-distill-llama' chat format aliases for distilled model variants
- Add DEEPSEEK_R1_CHAT_TEMPLATE constant from official HuggingFace tokenizer config
- Update guess_chat_format_from_gguf_metadata() to auto-detect DeepSeek R1 models via template matching and heuristic token detection
- Handle </think> reasoning content stripping for multi-turn conversations
- Bump version to 0.3.17

Closes abetlen#1952

The format_deepseek_r1 function already includes the BOS token
(<|begin▁of▁sentence|>) in the formatted prompt, but was not setting
added_special=True in the ChatFormatterResponse. This caused
chat_formatter_to_chat_completion_handler to pass add_bos=True to the
tokenizer, resulting in a duplicate BOS token.
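
The duplicate-BOS failure mode described here can be reproduced with a toy model. Everything in this sketch is a stand-in: `FormatterResult`, `toy_tokenize`, and `tokenize_prompt` are hypothetical names that only mimic the relationship between the formatter's added_special flag and the tokenizer's add_bos argument, not the library's actual classes.

```python
from dataclasses import dataclass
from typing import List

BOS_TEXT = "<\uff5cbegin\u2581of\u2581sentence\uff5c>"
BOS_ID = 0


@dataclass
class FormatterResult:
    """Stand-in for a formatter response: prompt text plus a BOS flag."""
    prompt: str
    added_special: bool = False


def toy_tokenize(text: str, add_bos: bool) -> List[int]:
    """Toy tokenizer: BOS text maps to BOS_ID, other characters to ord + 1."""
    tokens: List[int] = [BOS_ID] if add_bos else []
    while text:
        if text.startswith(BOS_TEXT):
            tokens.append(BOS_ID)
            text = text[len(BOS_TEXT):]
        else:
            tokens.append(ord(text[0]) + 1)  # never collides with BOS_ID
            text = text[1:]
    return tokens


def tokenize_prompt(result: FormatterResult) -> List[int]:
    # Only ask the tokenizer for a BOS when the formatter has not added one.
    return toy_tokenize(result.prompt, add_bos=not result.added_special)
```

With a prompt that already begins with the BOS text, leaving added_special at its default yields two BOS ids at the start of the sequence, while setting it to True yields one.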

Also adds comprehensive tests for:
- Single-turn and multi-turn conversations
- System message handling
- </think> reasoning content stripping
- Distilled model aliases (qwen/llama)
- Auto-detection via exact match and heuristic
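
The two detection paths covered by the last test item can be pictured with a small sketch. This is an illustration under assumptions, not the PR's implementation: `guess_format_sketch` is a hypothetical name, `KNOWN_TEMPLATE` is a short placeholder standing in for the full official template, and the metadata is assumed to expose the template under the standard GGUF key `tokenizer.chat_template`.

```python
from typing import Dict, Optional

USER_TAG = "<\uff5cUser\uff5c>"
ASSISTANT_TAG = "<\uff5cAssistant\uff5c>"

# Placeholder standing in for the full official template text (assumption).
KNOWN_TEMPLATE = "{{ bos_token }}" + USER_TAG + "{{ content }}" + ASSISTANT_TAG


def guess_format_sketch(metadata: Dict[str, str]) -> Optional[str]:
    """Return 'deepseek-r1' if the metadata looks like a DeepSeek R1 model."""
    template = metadata.get("tokenizer.chat_template")
    if template is None:
        return None
    # Path 1: exact match against the known template.
    if template == KNOWN_TEMPLATE:
        return "deepseek-r1"
    # Path 2: heuristic fallback on the characteristic role tags.
    if USER_TAG in template and ASSISTANT_TAG in template:
        return "deepseek-r1"
    return None
```

The heuristic path matters for GGUF files whose embedded template was edited or re-exported, since those no longer match the reference string byte for byte.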

Development

Successfully merging this pull request may close these issues.

Update llama.cpp: DeepSeek R1 and distilled models are currently not supported