fix(api): honor chat_template_kwargs.enable_thinking (#387)#389
Merged
Conversation
Reported by @smallhadroncollider on v0.6.49: passing ``chat_template_kwargs: {enable_thinking: false}`` to /v1/chat/completions was silently ignored, so qwen3.6-27b-8bit burned ~3000 reasoning tokens for a one-line joke. Root cause: ``ChatCompletionRequest`` did not declare ``chat_template_kwargs`` at all. Pydantic accepted the payload but dropped the unknown field on parse, so the chat template ran with thinking enabled by default. llama.cpp + vLLM upstream both honor this OpenAI extended-spec field. Fix: - Add ``chat_template_kwargs: dict | None = None`` to ``ChatCompletionRequest`` (api/models.py) - Add ``_extract_thinking_from_request(request)`` — request-only precedence (ctk.enable_thinking → request.enable_thinking → None), with string-bool tolerance ("true"/"false") for client-friendliness. Single source of truth, used by both wrappers below. - Add ``_resolve_enable_thinking(request)`` — adds the cfg.no_thinking layer on top of the extractor for the OpenAI/anthropic routes. - Wire the helper into routes/chat.py and routes/anthropic.py (×2 call sites — non-streaming and streaming). - routes/chat.py: also resolve the max_tokens default through the same thinking flag (was based on top-level enable_thinking only, giving the smaller thinking-off budget when ctk said thinking-on). - speculative/dflash/server.py: layers its own closure ``no_thinking`` arg on top of ``_extract_thinking_from_request`` (NOT the cfg-aware helper — dflash never sets cfg.no_thinking, and consulting it would inject a hidden global-state dependency). Precedence (OpenAI / anthropic routes): 1. server --no-thinking (cfg.no_thinking) → False 2. request.chat_template_kwargs["enable_thinking"] 3. request.enable_thinking (top-level field, our extension) 4. None (template default) Precedence (dflash route): 1. dflash closure --no-thinking → False 2. request.chat_template_kwargs["enable_thinking"] 3. request.enable_thinking 4. None Verification: - ruff check + format clean - 23 new tests in tests/test_chat_template_kwargs.py covering: request model field accept, OpenAI/anthropic precedence (10 cases incl. string-bool tolerance + garbage-value handling), the shared extractor (6 cases), dflash inline precedence (4 cases incl. the regression guard for "dflash must NOT consult cfg.no_thinking even if set") - pytest tests/test_chat_template_kwargs.py: 23 passed - Full unit suite: 3474 passed (3451 baseline + 23 new), 19 skipped, 7 deselected, 1 xfailed — no regressions vs main - DeepSeek V4 Pro adversarial review: 3 rounds to convergence (round 1 → dflash hidden cfg dependency, round 2 → drift risk + missing dflash test, round 3 → "No blocking issues found.") Closes #387. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
9 tasks
raullenchai
added a commit
that referenced
this pull request
May 15, 2026
Same-day patch for #387 (chat_template_kwargs.enable_thinking silently ignored) which landed in #389 / 744a919. Pre-merge SHA capture (Release SOP §8 audit anchor): wheel: 3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4 sdist: 5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6 Release SOP gates skipped/run: - §3 install size: skipped — no dep changes - §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish) - §5 perf regression: skipped — no inference engine changes - §6 agent integration smoke: skipped — request-model + parser fix doesn't change tool-call/reasoning shape - §7B dep freshness: PASSED — uvicorn 0.47.0 (uploaded 1d ago) verified at upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway - §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49 Co-authored-by: Your Name <[email protected]> Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reported as #387 by @smallhadroncollider on v0.6.49: passing
chat_template_kwargs: {enable_thinking: false}to/v1/chat/completionswas silently ignored, so qwen3.6-27b-8bit burned ~3000 reasoning tokens for a one-line joke.Root cause:
ChatCompletionRequestdid not declarechat_template_kwargsat all — Pydantic accepted the payload but dropped the unknown field on parse, so the chat template ran with thinking enabled by default. llama.cpp + vLLM upstream both honor this OpenAI extended-spec field.Fix
chat_template_kwargs: dict | None = NonetoChatCompletionRequest(api/models.py)_extract_thinking_from_request(request)— request-only precedence (ctk.enable_thinking → request.enable_thinking → None), with string-bool tolerance ("true"/"false") for client-friendliness. Single source of truth used by both wrappers below._resolve_enable_thinking(request)— adds the cfg.no_thinking layer on top of the extractor for the OpenAI/anthropic routes.routes/chat.pyandroutes/anthropic.py(×2 call sites — non-streaming and streaming).routes/chat.py: also resolve the max_tokens default through the same thinking flag (was based on top-levelenable_thinkingonly, giving the smaller thinking-off budget when ctk said thinking-on).speculative/dflash/server.py: layers its own closureno_thinkingarg on top of_extract_thinking_from_request(NOT the cfg-aware helper — dflash never setscfg.no_thinking, and consulting it would inject a hidden global-state dependency).Precedence (OpenAI / anthropic routes)
--no-thinking(cfg.no_thinking) →Falserequest.chat_template_kwargs["enable_thinking"](OpenAI ext spec)request.enable_thinking(top-level field, our extension)None(template default)Precedence (dflash route)
--no-thinking→Falserequest.chat_template_kwargs["enable_thinking"]request.enable_thinkingNoneTest plan
ruff check+ruff format --check— cleanpytest tests/test_chat_template_kwargs.py— 23 passed (3 model-field tests + 10 OpenAI/anthropic helper tests + 6 extractor tests + 4 dflash precedence tests including a regression guard for "dflash must NOT consult cfg.no_thinking even when set")pytest tests/ --ignore=integrations --ignore=mllm/video/event_loop) — 3474 passed (3451 baseline + 23 new), 19 skipped, 7 deselected, 1 xfailed — no regressions_extract_thinking_from_requestand addingTestDflashPrecedenceCloses #387.
🤖 Generated with Claude Code