Thanks to visit codestin.com
Credit goes to github.com

Skip to content

fix(api): honor chat_template_kwargs.enable_thinking (#387)#389

Merged
raullenchai merged 1 commit into
mainfrom
fix/chat-template-kwargs-passthrough
May 15, 2026
Merged

fix(api): honor chat_template_kwargs.enable_thinking (#387)#389
raullenchai merged 1 commit into
mainfrom
fix/chat-template-kwargs-passthrough

Conversation

@raullenchai
Copy link
Copy Markdown
Owner

Summary

Reported as #387 by @smallhadroncollider on v0.6.49: passing chat_template_kwargs: {enable_thinking: false} to /v1/chat/completions was silently ignored, so qwen3.6-27b-8bit burned ~3000 reasoning tokens for a one-line joke.

Root cause: ChatCompletionRequest did not declare chat_template_kwargs at all — Pydantic accepted the payload but dropped the unknown field on parse, so the chat template ran with thinking enabled by default. llama.cpp + vLLM upstream both honor this OpenAI extended-spec field.

Fix

  • Add chat_template_kwargs: dict | None = None to ChatCompletionRequest (api/models.py)
  • Add _extract_thinking_from_request(request) — request-only precedence (ctk.enable_thinking → request.enable_thinking → None), with string-bool tolerance ("true"/"false") for client-friendliness. Single source of truth used by both wrappers below.
  • Add _resolve_enable_thinking(request) — adds the cfg.no_thinking layer on top of the extractor for the OpenAI/anthropic routes.
  • Wire the helper into routes/chat.py and routes/anthropic.py (×2 call sites — non-streaming and streaming).
  • routes/chat.py: also resolve the max_tokens default through the same thinking flag (was based on top-level enable_thinking only, giving the smaller thinking-off budget when ctk said thinking-on).
  • speculative/dflash/server.py: layers its own closure no_thinking arg on top of _extract_thinking_from_request (NOT the cfg-aware helper — dflash never sets cfg.no_thinking, and consulting it would inject a hidden global-state dependency).

Precedence (OpenAI / anthropic routes)

  1. server --no-thinking (cfg.no_thinking) → False
  2. request.chat_template_kwargs["enable_thinking"] (OpenAI ext spec)
  3. request.enable_thinking (top-level field, our extension)
  4. None (template default)

Precedence (dflash route)

  1. dflash closure --no-thinkingFalse
  2. request.chat_template_kwargs["enable_thinking"]
  3. request.enable_thinking
  4. None

Test plan

  • ruff check + ruff format --check — clean
  • pytest tests/test_chat_template_kwargs.py23 passed (3 model-field tests + 10 OpenAI/anthropic helper tests + 6 extractor tests + 4 dflash precedence tests including a regression guard for "dflash must NOT consult cfg.no_thinking even when set")
  • Full unit suite (pytest tests/ --ignore=integrations --ignore=mllm/video/event_loop) — 3474 passed (3451 baseline + 23 new), 19 skipped, 7 deselected, 1 xfailed — no regressions
  • DeepSeek V4 Pro adversarial review: 3 rounds to convergence
    • R1: surfaced dflash hidden cfg dependency → fixed by inlining
    • R2: surfaced drift risk + missing dflash test → fixed by extracting _extract_thinking_from_request and adding TestDflashPrecedence
    • R3: "No blocking issues found."
  • CI green (await this PR's run)
  • After merge: bump 0.6.49 → 0.6.50 (separate PR per Release SOP)

Closes #387.

🤖 Generated with Claude Code

Reported by @smallhadroncollider on v0.6.49: passing
``chat_template_kwargs: {enable_thinking: false}`` to
/v1/chat/completions was silently ignored, so qwen3.6-27b-8bit burned
~3000 reasoning tokens for a one-line joke.

Root cause: ``ChatCompletionRequest`` did not declare
``chat_template_kwargs`` at all. Pydantic accepted the payload but
dropped the unknown field on parse, so the chat template ran with
thinking enabled by default. llama.cpp + vLLM upstream both honor
this OpenAI extended-spec field.

Fix:
- Add ``chat_template_kwargs: dict | None = None`` to
  ``ChatCompletionRequest`` (api/models.py)
- Add ``_extract_thinking_from_request(request)`` — request-only
  precedence (ctk.enable_thinking → request.enable_thinking → None),
  with string-bool tolerance ("true"/"false") for client-friendliness.
  Single source of truth, used by both wrappers below.
- Add ``_resolve_enable_thinking(request)`` — adds the cfg.no_thinking
  layer on top of the extractor for the OpenAI/anthropic routes.
- Wire the helper into routes/chat.py and routes/anthropic.py (×2 call
  sites — non-streaming and streaming).
- routes/chat.py: also resolve the max_tokens default through the
  same thinking flag (was based on top-level enable_thinking only,
  giving the smaller thinking-off budget when ctk said thinking-on).
- speculative/dflash/server.py: layers its own closure ``no_thinking``
  arg on top of ``_extract_thinking_from_request`` (NOT the cfg-aware
  helper — dflash never sets cfg.no_thinking, and consulting it would
  inject a hidden global-state dependency).

Precedence (OpenAI / anthropic routes):
  1. server --no-thinking (cfg.no_thinking) → False
  2. request.chat_template_kwargs["enable_thinking"]
  3. request.enable_thinking (top-level field, our extension)
  4. None (template default)

Precedence (dflash route):
  1. dflash closure --no-thinking → False
  2. request.chat_template_kwargs["enable_thinking"]
  3. request.enable_thinking
  4. None

Verification:
- ruff check + format clean
- 23 new tests in tests/test_chat_template_kwargs.py covering: request
  model field accept, OpenAI/anthropic precedence (10 cases incl.
  string-bool tolerance + garbage-value handling), the shared extractor
  (6 cases), dflash inline precedence (4 cases incl. the regression
  guard for "dflash must NOT consult cfg.no_thinking even if set")
- pytest tests/test_chat_template_kwargs.py: 23 passed
- Full unit suite: 3474 passed (3451 baseline + 23 new), 19 skipped,
  7 deselected, 1 xfailed — no regressions vs main
- DeepSeek V4 Pro adversarial review: 3 rounds to convergence
  (round 1 → dflash hidden cfg dependency, round 2 → drift risk +
  missing dflash test, round 3 → "No blocking issues found.")

Closes #387.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@raullenchai raullenchai merged commit 744a919 into main May 15, 2026
7 checks passed
@raullenchai raullenchai deleted the fix/chat-template-kwargs-passthrough branch May 15, 2026 04:07
@raullenchai raullenchai mentioned this pull request May 15, 2026
9 tasks
raullenchai added a commit that referenced this pull request May 15, 2026
Same-day patch for #387 (chat_template_kwargs.enable_thinking silently
ignored) which landed in #389 / 744a919.

Pre-merge SHA capture (Release SOP §8 audit anchor):
  wheel:  3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4
  sdist:  5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6

Release SOP gates skipped/run:
- §3 install size: skipped — no dep changes
- §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish)
- §5 perf regression: skipped — no inference engine changes
- §6 agent integration smoke: skipped — request-model + parser fix doesn't
  change tool-call/reasoning shape
- §7B dep freshness: PASSED — uvicorn 0.47.0 (uploaded 1d ago) verified at
  upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway
- §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chat_template_kwargs: { enable_thinking: false} ignored

1 participant