
raullenchai · 2026-05-15T03:39:59Z

Summary

Reported as #387 by @smallhadroncollider on v0.6.49: passing chat_template_kwargs: {enable_thinking: false} to /v1/chat/completions was silently ignored, so qwen3.6-27b-8bit burned ~3000 reasoning tokens for a one-line joke.

Root cause: ChatCompletionRequest did not declare chat_template_kwargs at all — Pydantic accepted the payload but dropped the unknown field on parse, so the chat template ran with thinking enabled by default. llama.cpp + vLLM upstream both honor this OpenAI extended-spec field.
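
The root cause can be reproduced in isolation. The sketch below assumes Pydantic's default model configuration, which ignores undeclared fields; `RequestBefore` and `RequestAfter` are hypothetical stand-ins, not the project's actual `ChatCompletionRequest`.

```python
from typing import Optional

from pydantic import BaseModel


class RequestBefore(BaseModel):
    """Hypothetical pre-fix model: chat_template_kwargs is not declared."""

    model: str
    enable_thinking: Optional[bool] = None


class RequestAfter(RequestBefore):
    """Hypothetical post-fix model: the field is declared, so it survives parsing."""

    chat_template_kwargs: Optional[dict] = None


payload = {
    "model": "example-model",
    "chat_template_kwargs": {"enable_thinking": False},
}

before = RequestBefore(**payload)  # parses without error
after = RequestAfter(**payload)

# With the default config the undeclared key is dropped, not rejected.
dropped = not hasattr(before, "chat_template_kwargs")
kept = after.chat_template_kwargs

print(dropped, kept)
```

Because parsing succeeds either way, the client gets no signal that its setting was discarded, which is why the bug was silent.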

Fix

- Add chat_template_kwargs: dict | None = None to ChatCompletionRequest (api/models.py).
- Add _extract_thinking_from_request(request): request-only precedence (chat_template_kwargs.enable_thinking → request.enable_thinking → None), tolerating the string booleans "true"/"false" for client-friendliness. It is the single source of truth used by both wrappers below.
- Add _resolve_enable_thinking(request): layers cfg.no_thinking on top of the extractor for the OpenAI/anthropic routes.
- Wire the helper into routes/chat.py and routes/anthropic.py (×2 call sites: non-streaming and streaming).
- routes/chat.py: also resolve the max_tokens default through the same thinking flag. Previously it looked only at the top-level enable_thinking, so a request that enabled thinking through chat_template_kwargs got the smaller thinking-off budget.
- speculative/dflash/server.py: layers its own closure no_thinking argument on top of _extract_thinking_from_request. It does not use the cfg-aware helper, because dflash never sets cfg.no_thinking and consulting it would introduce a hidden global-state dependency.
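
A minimal sketch of the request-only extractor described above, not the merged code. The names `coerce_bool` and `extract_thinking` are hypothetical, and two behaviors are assumptions: string matching is case-insensitive, and an unrecognized value falls through to the next layer rather than raising.

```python
from types import SimpleNamespace
from typing import Any, Optional


def coerce_bool(value: Any) -> Optional[bool]:
    """Return a real bool, accept "true"/"false" strings, else None."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()  # assumption: case-insensitive match
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    return None  # assumption: garbage values are treated as "not set"


def extract_thinking(request: Any) -> Optional[bool]:
    """Request-only precedence: chat_template_kwargs first, then top-level field."""
    ctk = getattr(request, "chat_template_kwargs", None)
    if isinstance(ctk, dict) and "enable_thinking" in ctk:
        value = coerce_bool(ctk["enable_thinking"])
        if value is not None:
            return value
    return coerce_bool(getattr(request, "enable_thinking", None))


# chat_template_kwargs wins over the top-level field.
req = SimpleNamespace(chat_template_kwargs={"enable_thinking": "false"},
                      enable_thinking=True)
print(extract_thinking(req))
```

Returning None when neither field is set leaves the decision to the chat template's own default, matching the last step of the precedence lists below.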

Precedence (OpenAI / anthropic routes)

1. server --no-thinking (cfg.no_thinking) → False
2. request.chat_template_kwargs["enable_thinking"] (OpenAI ext spec)
3. request.enable_thinking (top-level field, our extension)
4. None (template default)

Precedence (dflash route)

1. dflash closure --no-thinking → False
2. request.chat_template_kwargs["enable_thinking"]
3. request.enable_thinking
4. None
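
The two precedence chains differ only in where the override flag comes from. The sketch below illustrates that difference under stated assumptions: `ServerConfig`, `resolve_with_cfg`, and `make_dflash_resolver` are hypothetical names, and `request_value` stands for whatever the shared extractor returned (True, False, or None).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ServerConfig:
    """Hypothetical stand-in for the server-wide config object."""

    no_thinking: bool = False


def resolve_with_cfg(cfg: ServerConfig, request_value: Optional[bool]) -> Optional[bool]:
    """OpenAI/anthropic routes: the server flag overrides the request."""
    if cfg.no_thinking:
        return False
    return request_value


def make_dflash_resolver(no_thinking: bool) -> Callable[[Optional[bool]], Optional[bool]]:
    """dflash route: the flag is captured in a closure, never read from cfg."""

    def resolve(request_value: Optional[bool]) -> Optional[bool]:
        if no_thinking:
            return False
        return request_value

    return resolve


cfg = ServerConfig(no_thinking=True)
dflash_resolve = make_dflash_resolver(no_thinking=False)

# The cfg-aware path forces thinking off; the dflash path ignores cfg entirely.
print(resolve_with_cfg(cfg, True), dflash_resolve(True))
```

Keeping the dflash flag in a closure means its behavior depends only on its own arguments, which is the property the regression guard in the test plan checks.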

Test plan

- ruff check + ruff format --check: clean
- pytest tests/test_chat_template_kwargs.py: 23 passed (3 model-field tests + 10 OpenAI/anthropic helper tests + 6 extractor tests + 4 dflash precedence tests, including a regression guard for "dflash must NOT consult cfg.no_thinking even when set")
- Full unit suite (pytest tests/ --ignore=integrations --ignore=mllm/video/event_loop): 3474 passed (3451 baseline + 23 new), 19 skipped, 7 deselected, 1 xfailed, no regressions
- DeepSeek V4 Pro adversarial review: 3 rounds to convergence
  - R1: surfaced dflash hidden cfg dependency → fixed by inlining
  - R2: surfaced drift risk + missing dflash test → fixed by extracting _extract_thinking_from_request and adding TestDflashPrecedence
  - R3: "No blocking issues found."
- CI green (await this PR's run)
- After merge: bump 0.6.49 → 0.6.50 (separate PR per Release SOP)

Closes #387.

🤖 Generated with Claude Code

@smallhadroncollider

Reported by @smallhadroncollider on v0.6.49: passing ``chat_template_kwargs: {enable_thinking: false}`` to /v1/chat/completions was silently ignored, so qwen3.6-27b-8bit burned ~3000 reasoning tokens for a one-line joke.

Root cause: ``ChatCompletionRequest`` did not declare ``chat_template_kwargs`` at all. Pydantic accepted the payload but dropped the unknown field on parse, so the chat template ran with thinking enabled by default. llama.cpp + vLLM upstream both honor this OpenAI extended-spec field.

Fix:
- Add ``chat_template_kwargs: dict | None = None`` to ``ChatCompletionRequest`` (api/models.py)
- Add ``_extract_thinking_from_request(request)``: request-only precedence (ctk.enable_thinking → request.enable_thinking → None), with string-bool tolerance ("true"/"false") for client-friendliness. Single source of truth, used by both wrappers below.
- Add ``_resolve_enable_thinking(request)``: adds the cfg.no_thinking layer on top of the extractor for the OpenAI/anthropic routes.
- Wire the helper into routes/chat.py and routes/anthropic.py (×2 call sites: non-streaming and streaming).
- routes/chat.py: also resolve the max_tokens default through the same thinking flag (was based on top-level enable_thinking only, giving the smaller thinking-off budget when ctk said thinking-on).
- speculative/dflash/server.py: layers its own closure ``no_thinking`` arg on top of ``_extract_thinking_from_request`` (NOT the cfg-aware helper: dflash never sets cfg.no_thinking, and consulting it would inject a hidden global-state dependency).

Precedence (OpenAI / anthropic routes):
1. server --no-thinking (cfg.no_thinking) → False
2. request.chat_template_kwargs["enable_thinking"]
3. request.enable_thinking (top-level field, our extension)
4. None (template default)

Precedence (dflash route):
1. dflash closure --no-thinking → False
2. request.chat_template_kwargs["enable_thinking"]
3. request.enable_thinking
4. None

Verification:
- ruff check + format clean
- 23 new tests in tests/test_chat_template_kwargs.py covering: request model field accept, OpenAI/anthropic precedence (10 cases incl. string-bool tolerance + garbage-value handling), the shared extractor (6 cases), dflash inline precedence (4 cases incl. the regression guard for "dflash must NOT consult cfg.no_thinking even if set")
- pytest tests/test_chat_template_kwargs.py: 23 passed
- Full unit suite: 3474 passed (3451 baseline + 23 new), 19 skipped, 7 deselected, 1 xfailed, no regressions vs main
- DeepSeek V4 Pro adversarial review: 3 rounds to convergence (round 1 → dflash hidden cfg dependency, round 2 → drift risk + missing dflash test, round 3 → "No blocking issues found.")

Closes #387.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Same-day patch for #387 (chat_template_kwargs.enable_thinking silently ignored), which landed in #389 / 744a919.

Pre-merge SHA capture (Release SOP §8 audit anchor):
wheel: 3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4
sdist: 5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6

Release SOP gates skipped/run:
- §3 install size: skipped, no dep changes
- §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish)
- §5 perf regression: skipped, no inference engine changes
- §6 agent integration smoke: skipped, request-model + parser fix doesn't change tool-call/reasoning shape
- §7B dep freshness: PASSED, uvicorn 0.47.0 (uploaded 1d ago) verified at upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway
- §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

raullenchai merged commit 744a919 into main May 15, 2026
7 checks passed

raullenchai deleted the fix/chat-template-kwargs-passthrough branch May 15, 2026 04:07

raullenchai mentioned this pull request May 15, 2026

chore: bump version to 0.6.50 #390

Merged

9 tasks

raullenchai mentioned this pull request May 15, 2026

chat_template_kwargs: { enable_thinking: false} ignored #387

Closed


fix(api): honor chat_template_kwargs.enable_thinking (#387) #389

raullenchai merged 1 commit into main from fix/chat-template-kwargs-passthrough
