feat(json-output): strip spurious backslashes before non-ASCII chars #394
Merged
Cherry-pick the portable half of upstream waybarrios#525. ``lm-format-enforcer``'s JSON grammar permits ``\\`` followed by any codepoint as a valid escape, so a model emitting structured JSON with non-ASCII content (CJK, emoji, …) can produce strings like ``"\\빠\\르\\게"`` — valid JSON, but the decoded value carries literal backslashes that look like corruption to clients. Strip those spurious backslashes recursively across dict/list/str when finalizing the response_format payload in ``routes/chat.py``. Switch ``json.dumps`` to ``ensure_ascii=False`` so the cleaned text keeps the characters as-is rather than re-escaping them. The off-by-one half of #525 lives in upstream's ``constrained/json_schema_processor.py``, which we don't carry — not ported. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
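As a rough illustration of the two changes this commit describes, the sketch below strips a backslash that directly precedes a non-ASCII codepoint and then compares `json.dumps` with and without `ensure_ascii=False`. The regex and the name `strip_spurious_backslashes` are assumptions made for this example, not the code in `routes/chat.py`.

```python
import json
import re

# Assumed pattern: one literal backslash followed (lookahead) by a non-ASCII codepoint.
_BACKSLASH_BEFORE_NON_ASCII = re.compile(r"\\(?=[^\x00-\x7f])")


def strip_spurious_backslashes(text: str) -> str:
    """Drop backslashes that sit directly before a non-ASCII character."""
    return _BACKSLASH_BEFORE_NON_ASCII.sub("", text)


# JSON text as a model might emit it: every backslash is escaped, so it parses.
raw = r'{"speed": "\\빠\\르\\게"}'
decoded = json.loads(raw)["speed"]  # the value still carries literal backslashes
cleaned = strip_spurious_backslashes(decoded)

as_utf8 = json.dumps({"speed": cleaned}, ensure_ascii=False)  # keeps the Hangul as-is
as_ascii = json.dumps({"speed": cleaned})  # default re-escapes to \uXXXX sequences
```

Both serializations decode to the same value; the `ensure_ascii=False` form simply leaves the characters readable on the wire.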
Codex review caught: ``_strip_backslash_before_unicode`` only walked dict *values*, not keys. ``lm-format-enforcer`` makes no distinction between JSON keys and values, so a key like ``"\\한\\글"`` would leak through to the client as ``"\\한\\글"`` even when the value was cleaned. Apply the helper to keys as well, plus a regression test using a non-ASCII key (the previous coverage used ASCII keys, which is why the bug wasn't caught originally). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
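A minimal sketch of a recursive walk that cleans dict keys as well as values, assuming the same kind of regex as the helper; `strip_recursive` is a hypothetical name, and the sample payload is made up for illustration.

```python
import re

# Assumed pattern: a literal backslash directly before a non-ASCII codepoint.
_BACKSLASH_BEFORE_NON_ASCII = re.compile(r"\\(?=[^\x00-\x7f])")


def strip_recursive(value):
    """Clean strings, list items, and both keys and values of dicts."""
    if isinstance(value, str):
        return _BACKSLASH_BEFORE_NON_ASCII.sub("", value)
    if isinstance(value, list):
        return [strip_recursive(item) for item in value]
    if isinstance(value, dict):
        # Keys are cleaned too. Note that a comprehension lets a later entry
        # overwrite an earlier one if two keys clean to the same string.
        return {strip_recursive(k): strip_recursive(v) for k, v in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


# Regression-style input: a non-ASCII key, plus an ASCII escape that must survive.
dirty = {"\\한\\글": ["\\값", {"ascii": "a\\nb"}], "count": 3}
clean = strip_recursive(dirty)
```

The backslash in `"a\\nb"` is left alone because the character after it is ASCII.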
…ontract
Codex review round 2 caught two issues:
1. ``_strip_backslash_before_unicode`` on a dict silently overwrote entries when two distinct dirty keys collapsed to the same clean key (e.g. ``"\한"`` and ``"한"`` both → ``"한"``). Switch to an explicit loop that keeps the first occurrence and logs a WARNING on collision so the data loss is visible.
2. ``ensure_ascii=False`` was a deliberate behavior change but not documented. Add a comment explaining the rationale (raw UTF-8 in JSON-over-HTTP is the standard recommendation; FastAPI emits UTF-8 anyway; smaller wire bytes; no double-decoding by clients).
Codex's other finding (the regex doesn't match a literal backslash) is a false positive: ``r"\\"`` in a Python raw string IS the regex for a literal backslash; all 7 unit tests pass and a manual ``re.sub`` trial substitutes correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
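The keep-first collision handling described in this commit could look roughly like the following. This is a sketch under assumptions: the function name `strip_keep_first`, the log wording, and the exact regex are illustrative rather than taken from the repository.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern: a literal backslash directly before a non-ASCII codepoint.
_BACKSLASH_BEFORE_NON_ASCII = re.compile(r"\\(?=[^\x00-\x7f])")

# r"\\" in a raw string is the regex for one literal backslash.
literal_backslash_removed = re.sub(r"\\", "", "a\\b")


def strip_keep_first(value):
    """Recursive strip that keeps the first entry when cleaned keys collide."""
    if isinstance(value, str):
        return _BACKSLASH_BEFORE_NON_ASCII.sub("", value)
    if isinstance(value, list):
        return [strip_keep_first(item) for item in value]
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            new_key = strip_keep_first(key)
            if new_key in cleaned:
                # Make the dropped entry visible instead of overwriting silently.
                logger.warning(
                    "Key %r collides after backslash stripping; keeping first occurrence",
                    new_key,
                )
                continue
            cleaned[new_key] = strip_keep_first(item)
        return cleaned
    return value


result = strip_keep_first({"\\한": "first", "한": "second"})
```

Because Python dicts preserve insertion order, "first occurrence" here means the entry that appeared first in the parsed payload.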
… test Codex review round 2 asked for a smoke test that drives the actual chain ``routes/chat.py`` runs (parse_json_output → strip helper → json.dumps with ensure_ascii=False) rather than relying solely on the isolated helper unit tests. This guards against a future refactor that moves the helper to a different module or drops the wiring. The test is hermetic — no engine boot, no HTTP — but exercises the real production functions on a synthetic dirty input that mirrors what lm-format-enforcer would produce for a CJK-heavy response. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The new TestStripBackslashBeforeUnicode class imports vllm_mlx.routes.chat, which transitively pulls in mlx. The no-MLX Linux CI matrix (test-matrix 3.10/3.11/3.12) doesn't have mlx installed, so these tests previously raised ModuleNotFoundError. Gate the class with @pytest.mark.skipif so it skips cleanly on Linux but still runs on test-apple-silicon (where mlx is available). This mirrors the existing -k "not Integration and not InjectJson" filter in .github/workflows/ci.yml for similar mlx-touching tests. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
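One way such a gate might be written is shown below. The availability check via `importlib.util.find_spec` and the test body are assumptions for illustration; the commit only states that the class is gated with `@pytest.mark.skipif`.

```python
import importlib.util

import pytest

# True only when the mlx package can be located; find_spec does not import it.
MLX_AVAILABLE = importlib.util.find_spec("mlx") is not None


@pytest.mark.skipif(not MLX_AVAILABLE, reason="mlx is not installed (no-MLX CI matrix)")
class TestStripBackslashBeforeUnicode:
    def test_module_imports(self):
        # Deferred import so that merely collecting this file never touches mlx.
        import vllm_mlx.routes.chat  # noqa: F401
```

Keeping the import inside the test method matters: a module-level import would raise `ModuleNotFoundError` during collection, before the skip marker is ever evaluated.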
Codex review round 3 flagged the unconditional backslash-strip as a
potential data-loss risk for legitimate JSON outputs that contain a
backslash before a non-ASCII codepoint (e.g. Windows paths like
``"C:\\사용자\\file.txt"`` in a ``response_format=json_object`` reply).
The tradeoff is intentional and matches upstream waybarrios#525:
- The lm-format-enforcer bug is the overwhelming source of these
sequences in JSON-output responses.
- The file-path case is rare in practice.
- Heuristic gating ("looks like enforcer output") would be fragile.
- Clients needing raw backslash + non-ASCII can use response_format=text.
Document the scope and tradeoff in code so future maintainers
understand the choice. If a user hits the false-positive in practice,
the right next step is a config flag (--no-strip-spurious-backslashes),
not a heuristic.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
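To make the accepted false positive concrete, here is a small sketch of what an unconditional strip does to the Windows path from the message above. The regex is an assumed stand-in for the one in the helper.

```python
import re

# Assumed pattern: a literal backslash directly before a non-ASCII codepoint.
_BACKSLASH_BEFORE_NON_ASCII = re.compile(r"\\(?=[^\x00-\x7f])")

# Decoded value of the JSON string "C:\\사용자\\file.txt", i.e. C:\사용자\file.txt
windows_path = "C:\\사용자\\file.txt"
stripped = _BACKSLASH_BEFORE_NON_ASCII.sub("", windows_path)

# The separator before the Hangul directory name is lost; the separator
# before the ASCII file name survives.
print(stripped)
```

Only the first separator is affected, which is why the damage is easy to miss in paths that mix ASCII and non-ASCII components.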
raullenchai added a commit that referenced this pull request on May 16, 2026
* chore: bump version to 0.6.51
  User-facing changes since v0.6.50:
  * fix(prefix-cache): slice 3D KV state along the right axis (#392) - inference correctness fix for prefix cache restore on KV tensors shaped (B, H, S) (was slicing the wrong axis, surfacing as silent first-token corruption on cache-hit prompts).
  * perf(scheduler): drop per-decode list() copy of output_token_ids (#391) - small per-token allocation removed from the decode hot path; benefit scales with output length on multi-stream batches.
  * feat(json-output): strip spurious backslashes before non-ASCII chars (#394) - JSON-emitting models occasionally emit `\\é` etc. during constrained generation; now stripped before delivery so `json.loads` doesn't choke on user output.
  Dev/docs (no user-visible runtime change):
  * docs(contributing): codify test precision policy correctness=8bit / perf=4bit (#396)
  * chore(pr_validate): swap Qwen3.6-27B-4bit → 8bit in stress matrix with smoke-tier 4-bit fallback for ≤32 GB hosts (#395)
  Pre-merge artifact SHA-256 (audit anchor; will not match post-publish SHAs because publish.yml rebuilds on Linux runners; see Release SOP §8):
  rapid_mlx-0.6.51-py3-none-any.whl 5affa5c527bf543b72ddab94e96f2c6308225e6f5facd229d8f21f0b9ede8ce7
  rapid_mlx-0.6.51.tar.gz d248b8a7754cbe4d94b4708fa0fb88caa623f7fc71f13f156ece0a3b36224755
  Release SOP gates:
  * §3 install size: 448 MB (vs 445 MB v0.6.48 baseline, +0.67%, well under 1.05× soft-warn threshold).
  * §7 supply chain: pip-audit clean on critical deps via OSV; recent uploads (HF hub 1.15.0 today, transformers 5.8.1 3 days) verified legit upstream (substantive changelogs, known maintainers / HF bot); no install.sh / workflows / pyproject.toml diffs since v0.6.50; OIDC scope minimal (id-token: write only on publish, contents: write only on auto-release); 3 third-party actions still moving-tag pinned (peter-evans, peaceiris, codecov); pre-existing, not a release blocker but tracked for follow-up.
  * §5 perf: make full in flight (qwen3.5-35b done; qwen3.6-35b mid-run). Will report results in PR before merge.
  * §4 user-onboarding personas + §6 agent smoke: pending; will run serially after §5 (per CLAUDE.md "never in parallel" guidance for model-server workloads).
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* fix(api): detect text-only forks of multimodal architectures (#393)
  Issue #393 reports that ``rapid-mlx serve /path/to/Qwen3.6-35B-A3B-MLX-8bit`` crashes on startup because the model is routed to the MLLM batched engine even though the user's checkpoint is text-only. Reporter (Tylast) correctly identified the chain: the checkpoint's config.json declares ``vision_config`` (because the base ``Qwen3_5MoeForConditionalGeneration`` architecture is multimodal-capable), so ``is_mllm_model`` returns True, so the MLLM loader takes over, then hits the hybrid-backbone / ArraysCache incompatibility documented in the closed-as-spam #385.
  The text-only A3B fork ships zero vision tensors in its safetensors, even though its config.json carries the full vision_config block. Our detection trusted the config and ignored the actual weight presence.
  Fix: when config indicates VLM AND the path is a local directory AND that directory ships ``model.safetensors.index.json``, scan the index for tensor-name prefixes that indicate real vision/audio weights (``vision_tower``, ``visual.``, ``audio_tower``, ``mm_projector``, …). If none are present, override to text-only routing. The check fires only in the True → False direction; the False direction is preserved as-is to keep existing text-routed models stable.
  Conservative on edge cases:
  * Single-file safetensors (no sharded index) → return None from the probe and trust config. Wrong-True here means the text path errors clearly at first image request, whereas wrong-False would silently corrupt every text request on a real VLM. The bad-direction cost is much smaller.
  * Unreadable / oversized / malformed index → same. Fall back to config.
  * HF repo IDs (not local dirs) → unchanged; we'd need a network call to inspect remote tensors.
  Tests:
  * New ``TestIsMllmModelWeightsPresenceOverride`` class, 6 cases:
    - vision_config + no vision tensors → False (the #393 fix path)
    - vision_config + vision tensors → True (genuine VLM still works)
    - audio_config + audio tensors → True (audio branch covered)
    - missing index → fall back to config
    - malformed index → fall back to config
    - text-only config → never even probe weights
  Total tests: 102 pass (was 96). ruff clean.
  This change is bundled into the v0.6.51 bump because the user-facing fix is small + isolated and waiting for v0.6.52 would mean Tylast keeps hitting the crash for another release cycle.
  Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* docs(install): add brew 5.x homebrew/core pre-flight hint
  Brew 5.x's install sandbox cannot auto-tap homebrew/core mid-install when a third-party formula depends on core packages ([email protected], rust). Users on fresh brew installs (API-only, no homebrew/core tap cloned) see "Operation not permitted" on /opt/homebrew/Library/Taps/homebrew/. Pre-tapping with `brew tap homebrew/core --force` (one-time, ~1.3 GB) lets the install complete. Brew 4.x and earlier never needed this.
---------
Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
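The weight-presence probe described for #393 can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `has_multimodal_weights` is a hypothetical name, the marker tuple contains only the four names the commit message spells out, substring matching stands in for whatever prefix logic the real code uses, and the oversized-index guard is omitted. The `weight_map` key is the usual layout of a sharded safetensors index file.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Markers named in the commit message; the real list is longer ("…").
MULTIMODAL_MARKERS = ("vision_tower", "visual.", "audio_tower", "mm_projector")


def has_multimodal_weights(model_dir) -> Optional[bool]:
    """Return True/False from the sharded index, or None to fall back to config."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    if not index_path.is_file():
        return None  # single-file checkpoint or missing index: trust config
    try:
        weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # unreadable or malformed index: trust config
    if not isinstance(weight_map, dict):
        return None
    return any(marker in name for name in weight_map for marker in MULTIMODAL_MARKERS)


# Exercise the probe against synthetic index files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "model.safetensors.index.json"
    text_index = {"weight_map": {"model.layers.0.mlp.up_proj.weight": "shard-1.safetensors"}}
    index_file.write_text(json.dumps(text_index), encoding="utf-8")
    text_only = has_multimodal_weights(tmp)

    vlm_index = {"weight_map": {"vision_tower.blocks.0.attn.qkv.weight": "shard-1.safetensors"}}
    index_file.write_text(json.dumps(vlm_index), encoding="utf-8")
    with_vision = has_multimodal_weights(tmp)

    index_file.write_text("{not json", encoding="utf-8")
    malformed = has_multimodal_weights(tmp)
    missing = has_multimodal_weights(Path(tmp) / "absent")
```

Returning `None` rather than `False` for the undecidable cases keeps the caller's fallback explicit: only a definite `False` should override a config that declares vision support.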
Summary
Cherry-picks the portable half of upstream waybarrios#525.
`lm-format-enforcer`'s JSON grammar permits `\\` followed by any codepoint as a valid escape. A model emitting structured JSON with non-ASCII content (CJK, emoji, …) can therefore produce strings like `"\\빠\\르\\게"` — valid JSON, but the decoded value carries literal backslashes that look like corruption to clients.
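This PR adds a small helper in `routes/chat.py` that strips those spurious backslashes recursively across dict/list/str (dict keys included) when the `response_format` payload is finalized. It also switches `json.dumps` to `ensure_ascii=False` so the cleaned text keeps its characters as-is instead of re-escaping them.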
What I did NOT port
The off-by-one fix in upstream's `vllm_mlx/constrained/json_schema_processor.py` — that module doesn't exist in our tree (we don't use `lm-format-enforcer`-driven token enforcement; our JSON output path is parse-after-the-fact in `routes/chat.py`).
Test plan
Blast radius
Low — single helper, called only when `response_format` is set AND the model emits parseable JSON. No effect on tool-calling, streaming, or any non-JSON response path.
🤖 Generated with Claude Code