
v0.6.54

chore: bump version to 0.6.54

SOP §10 routing escape hatches — closes the bug-class behind #393.

## What

5 new CLI override flags (force-off, plus force-on where listed), one for each binary auto-routing decision:

| Flag | Override target |
|---|---|
| `--no-mllm` / `--text-only` | `is_mllm_model()` MLLM auto-detection (#393) |
| `--no-tool-call-parser` | AliasProfile tool-call parser auto-selection |
| `--no-reasoning-parser` | AliasProfile reasoning parser auto-selection |
| `--force-hybrid` / `--no-hybrid` | `ModelConfig.is_hybrid` (gates spec/suffix decode) |
| `--force-spec-decode` / `--no-spec-decode` | `ModelConfig.supports_spec_decode` (gates MTP/DFlash/suffix) |

Plus a friendly-error wrapper for the original #393 symptom (`ValueError: Missing N parameters: vision_tower.*` → re-raised as `RuntimeError` pointing at `--no-mllm`).

## Why

In #393 (reported by Tylast; Qwen3.6-35B-A3B-MLX-8bit, half-quantized vision), `is_mllm_model()` returned True based on the config structure, but the actual safetensors contained only text weights. No CLI override existed, so the user had no way to recover without waiting for a patch release.

The fix isn't just "add `--no-mllm`" — it's "every binary auto-routing decision needs both directions, and the standard tests prove it stays that way."

## Coverage

Plumbed end-to-end through 3 entrypoints:
- `rapid-mlx serve` (unified CLI)
- `python -m vllm_mlx.server` (standalone)
- `vllm-mlx-bench` (only `--no-mllm` — the other 4 aren't relevant to benchmarking)

The override is applied in `EngineCore.__init__`, *after* `enrich_model_config()` and *before* `self.scheduler.model_config = self.model_config`: a single point of mutation that every Scheduler reader sees.

3 lines of defense against mutex misuse:
1. CLI dispatch (`sys.exit(2)`)
2. `server.load_model` (`ValueError`)
3. `EngineCore.__init__` (`ValueError`)
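
The layering above might be sketched as follows: one shared validator that raises `ValueError` (usable from the library-level layers) and a CLI layer that converts it into exit status 2. The helper names and the `serve-sketch` parser are hypothetical, and using `parser.error` to reach exit status 2 is an assumption about how the CLI layer could do it.

```python
import argparse


def validate_routing_overrides(force_hybrid=False, no_hybrid=False,
                               force_spec_decode=False, no_spec_decode=False):
    """Reject contradictory override pairs; reusable by library-level layers."""
    if force_hybrid and no_hybrid:
        raise ValueError("--force-hybrid and --no-hybrid are mutually exclusive")
    if force_spec_decode and no_spec_decode:
        raise ValueError(
            "--force-spec-decode and --no-spec-decode are mutually exclusive"
        )


def parse_cli(argv):
    parser = argparse.ArgumentParser(prog="serve-sketch")
    for flag in ("--force-hybrid", "--no-hybrid",
                 "--force-spec-decode", "--no-spec-decode"):
        parser.add_argument(flag, action="store_true")
    args = parser.parse_args(argv)
    try:
        validate_routing_overrides(**vars(args))
    except ValueError as exc:
        parser.error(str(exc))  # prints usage to stderr and exits with status 2
    return args
```

Repeating the check at each layer means a caller that bypasses the CLI (for example, by importing the server module directly) still gets a clear error instead of an arbitrary tie-break.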

## SOP gate

`tests/test_no_mllm_flag.py::test_auto_routing_flags_have_force_on_and_force_off_pair` — AST-based registry test. Every entry in `AUTO_ROUTING_FLAG_PAIRS` declares which entrypoint files must register BOTH directions. Adding a new model-taking CLI = adding it to every pair's required list. CI fails loudly until you do.

## Validation

- 4 codex review rounds (R1 caught DFlash bypass + MTP gap + missing standalone-server wiring + weak registry test; R2 caught bench gap; R3 clean)
- `pr_validate 407 -v`: MERGE-SAFE (lint, supply-chain, 299 targeted, 3646 unit, 3×3 stress matrix on Qwen3-0.6B / Qwen3.5-35B-A3B / Qwen3.6-27B all PASS)

## Out of scope (documented)

`OutputRouter.from_tokenizer` (Gemma 4 / Harmony channel auto-detect) — vocab-based multi-format detection with built-in legacy-parser fallback. Not in the registry. If a false-positive surfaces in the wild, add an override flag then.

## Files

- `vllm_mlx/{cli.py, server.py, engine/batched.py, engine_core.py, scheduler.py, models/mllm.py, benchmark.py}`
- `tests/test_no_mllm_flag.py` (21 tests)

May 18, 2026
b86f664

v0.6.53

chore: bump version to 0.6.53

Fixes #404 (M5 single-stream GPU crash).

Probe-and-cache shim around mx.new_thread_local_stream, installed lazily
at every mlx_lm.generate consumer (scheduler, decode, model_runner,
mamba_cache). Transparent on M1-M4; falls back to mx.default_stream on
M5 single-stream devices. __init__.py stays mlx-free so metadata-only
import works on Metal-less systems.
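
The probe-and-cache idea can be sketched independently of MLX. The version below takes the two stream constructors as plain callables, so it runs anywhere; in the real shim they would presumably wrap mx.new_thread_local_stream and mx.default_stream. How the actual shim detects an unsupported device is not described in these notes, so this sketch assumes the probe fails with a catchable Python exception, and `make_stream_factory` is a hypothetical name.

```python
import functools


def make_stream_factory(preferred, fallback):
    """Return a zero-arg callable that probes `preferred` once, then caches."""

    @functools.lru_cache(maxsize=1)
    def _preferred_works() -> bool:
        try:
            preferred()
            return True
        except Exception:  # assumption: unsupported devices raise on creation
            return False

    def factory():
        # After the first call the verdict is cached, so unsupported devices
        # never retry the failing constructor.
        return preferred() if _preferred_works() else fallback()

    return factory
```

Building the factory lazily at each consumer, rather than at package import, is what lets the top-level package stay importable on systems without Metal.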

Also closes an SOP gap: a structural audit test now fails if any future
mlx_lm.generate consumer file omits the install prelude.

Validation: pr_validate MERGE-SAFE (lint, supply_chain, targeted, full
unit 3625 passed, stress matrix 3x3 PASS). Codex review x 3 rounds.

May 18, 2026
550045c

v0.6.52

chore: bump version to 0.6.52

* fix(cli): plumb --prefill-step-size into SchedulerConfig + add fidelity audit (#400)

Root cause: serve_command built SchedulerConfig(...) without
prefill_step_size=args.prefill_step_size, so the field kept its 2048
dataclass default. The MLLM batch path then used 2048 as a per-batch
cap (prefill_step_size * len(requests)) and rejected any prompt >2048
tokens, even when the user passed --prefill-step-size 32768.

The CLI flag was being routed to load_model(prefill_step_size=...) but
that parameter was accepted and silently discarded — BatchedEngine reads
the value off scheduler_config only. So the flag was effectively dead
for both LLM and MLLM continuous-batching paths.

Changes:
- vllm_mlx/cli.py — pass prefill_step_size in SchedulerConfig kwargs
  inside serve_command (the actual fix).
- vllm_mlx/server.py — remove the dead load_model(prefill_step_size=...)
  parameter and its docstring entry. Remove the dead arg-pass in
  server.py:main()'s load_model call.
- vllm_mlx/mllm_batch_generator.py — rewrite the "exceeds safe limit"
  error to name the actual cap formula and point at --prefill-step-size,
  so users hitting the cap know what knob to turn.

New SOP gate (so this class of bug can't recur silently):
- scripts/audit_cli_config_fidelity.py — AST audit. For every function
  in cli.py, finds the 3-way intersection of (function reads args.X)
  ∧ (SchedulerConfig has field X) ∧ (kwarg X not passed at site).
  Each hit is a user-visible silent-flag-drop bug. Zero-import
  (parses dataclass fields from source), runs on plain Linux CI.
- scripts/dev_test.py — new "audit" tier, included in smoke/all/full.
- Makefile — `make audit` target; `make smoke` now runs lint → audit → unit.
- .github/workflows/ci.yml — audit step in the lint job, so every PR
  and every push to main gets the gate for free.
- tests/test_cli_config_fidelity.py — 4 tests covering: (1) the #400
  regression itself via AST, (2) audit clean on current main, (3) audit
  detects synthetic drift, (4) audit doesn't false-positive when a
  field is in the config but not read by the function.

Known follow-up (not in this PR):
- vllm_mlx/server.py:main() (the `python -m vllm_mlx.server` /
  `mise run` entry) declares its own --prefill-step-size flag but
  builds no SchedulerConfig, so the flag is also dead in that path.
  Removing it would break argparse for anyone using that entry —
  needs a small refactor (build a SchedulerConfig in main()) rather
  than a flag deletion. Tracking separately.

Verified:
- python3.12 scripts/audit_cli_config_fidelity.py → exit 0
- python3.12 -m pytest tests/test_cli_config_fidelity.py → 4 passed
- python3.12 -m pytest tests/ (excl. integrations) → 3481 passed
- ruff check + ruff format --check → clean
- Live: `rapid-mlx serve qwen3-0.6b-8bit --port 8765 --prefill-step-size 8192
  --continuous-batching` boots, responds to /v1/chat/completions

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.52

PR #400 fix (silent --prefill-step-size flag drop on continuous-batching
path) is user-facing and triggers version-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server): keep load_model(prefill_step_size=...) kwarg for back-compat (codex round 2)

Codex round 2 on PR #405 flagged a real regression risk: removing the
documented load_model(prefill_step_size=...) kwarg in a patch release
breaks any external caller written against the previous signature
(would TypeError on upgrade, before any model loads).

Resolution: restore the kwarg as deprecated-but-functional. When passed:
  - emit DeprecationWarning
  - translate the value into scheduler_config.prefill_step_size (which
    is what the original kwarg was *supposed* to do but never did —
    that was the root cause of #400 itself).
  - synthesise a default SchedulerConfig if the caller didn't pass one.

So callers using the legacy kwarg get the value they asked for *and* a
nudge to migrate to scheduler_config. Pre-0.6.52 silent no-op becomes
silent correctness + a deprecation pointer.

New regression test test_load_model_prefill_step_size_back_compat_translation
asserts: the kwarg is still in the signature, calling with it emits
DeprecationWarning, and the value lands in scheduler_config.prefill_step_size.
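
The back-compat translation might look roughly like this. `SchedulerConfig` here is a one-field stand-in and the function returns the config instead of loading a model, so treat it as a sketch of the pattern rather than the real signature.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerConfig:  # stand-in with only the field discussed above
    prefill_step_size: int = 2048


def load_model(model_name: str,
               scheduler_config: Optional[SchedulerConfig] = None,
               prefill_step_size: Optional[int] = None) -> SchedulerConfig:
    """Honour the legacy kwarg by translating it into scheduler_config."""
    if prefill_step_size is not None:
        warnings.warn(
            "load_model(prefill_step_size=...) is deprecated; "
            "set scheduler_config.prefill_step_size instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        if scheduler_config is None:
            scheduler_config = SchedulerConfig()  # synthesise a default
        scheduler_config.prefill_step_size = prefill_step_size
    if scheduler_config is None:
        scheduler_config = SchedulerConfig()
    # A real loader would go on to load weights; the sketch returns the config.
    return scheduler_config
```

Using `None` as the sentinel keeps "kwarg not passed" distinguishable from any legitimate integer value, so callers who never used the legacy kwarg see no warning.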

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server.py:main): forward --prefill-step-size via SchedulerConfig (codex round 3)

Codex round 3 on PR #405 caught the same silent-flag-drop pattern in the
secondary `python -m vllm_mlx.server` / `mise run` entrypoint. The flag
was parsed by argparse but not forwarded to load_model — leaving the
exact bug class this PR claims to eliminate, in a file the PR already
touched.

server.py:main now constructs a SchedulerConfig with the user-supplied
prefill_step_size and passes it to load_model. The unified rapid-mlx CLI
in cli.py already does this (via its richer SchedulerConfig builder);
the standalone entry only needs the one knob it exposes.

Hardening:

1. Extended audit to also scan server.py — `CLI_ENTRY_PATHS` now lists
   both cli.py and server.py. Either entrypoint adding a future
   args.X-reads-but-not-passed-to-SchedulerConfig pattern will fail the
   audit at PR time.

2. New regression test test_prefill_step_size_is_plumbed_in_server_main
   mirrors the existing serve_command one — symmetric coverage for both
   entrypoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

May 17, 2026
e51a558

v0.6.51

chore: bump version to 0.6.51

* chore: bump version to 0.6.51

User-facing changes since v0.6.50:

* fix(prefix-cache): slice 3D KV state along the right axis (#392) —
  inference correctness fix for prefix cache restore on KV tensors
  shaped (B, H, S) (was slicing the wrong axis, surfacing as
  silent first-token corruption on cache-hit prompts).
* perf(scheduler): drop per-decode list() copy of output_token_ids
  (#391) — small per-token allocation removed from the decode hot
  path; benefit scales with output length on multi-stream batches.
* feat(json-output): strip spurious backslashes before non-ASCII
  chars (#394) — JSON-emitting models occasionally emit `\\é` etc.
  during constrained generation; now stripped before delivery so
  `json.loads` doesn't choke on user output.
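
To make the #394 item concrete, here is one possible way to strip such backslashes, assuming the rule is "remove any run of backslashes that directly precedes a non-ASCII character". The function name and regex are illustrative and not taken from the project; a legitimately escaped backslash followed by a non-ASCII character would also be stripped by this simplified rule.

```python
import json
import re

# One or more backslashes immediately followed by a non-ASCII character.
_SPURIOUS = re.compile(r"\\+(?=[^\x00-\x7f])")


def strip_spurious_backslashes(text: str) -> str:
    """Drop backslashes before non-ASCII chars; ASCII escapes stay intact."""
    return _SPURIOUS.sub("", text)


raw = '{"name": "caf\\é"}'  # a single backslash before é: invalid JSON escape
cleaned = strip_spurious_backslashes(raw)
parsed = json.loads(cleaned)  # parses once the stray backslash is gone
```

Standard JSON escapes such as `\n` or `\"` are untouched because the lookahead only matches characters outside the ASCII range.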

Dev/docs (no user-visible runtime change):
* docs(contributing): codify test precision policy correctness=8bit /
  perf=4bit (#396)
* chore(pr_validate): swap Qwen3.6-27B-4bit → 8bit in stress matrix
  with smoke-tier 4-bit fallback for ≤32 GB hosts (#395)

Pre-merge artifact SHA-256 (audit anchor; will not match post-publish
SHAs because publish.yml rebuilds on Linux runners — see Release SOP §8):
  rapid_mlx-0.6.51-py3-none-any.whl
    5affa5c527bf543b72ddab94e96f2c6308225e6f5facd229d8f21f0b9ede8ce7
  rapid_mlx-0.6.51.tar.gz
    d248b8a7754cbe4d94b4708fa0fb88caa623f7fc71f13f156ece0a3b36224755

Release SOP gates:
* §3 install size: 448 MB (vs 445 MB v0.6.48 baseline, +0.67% — well
  under 1.05× soft-warn threshold).
* §7 supply chain: pip-audit clean on critical deps via OSV; recent
  uploads (HF hub 1.15.0 today, transformers 5.8.1 3 days) verified
  legit upstream (substantive changelogs, known maintainers / HF bot);
  no install.sh / workflows / pyproject.toml diffs since v0.6.50;
  OIDC scope minimal (id-token: write only on publish, contents: write
  only on auto-release); 3 third-party actions still moving-tag pinned
  (peter-evans, peaceiris, codecov) — pre-existing, not a release
  blocker but tracked for follow-up.
* §5 perf: make full in flight (qwen3.5-35b done; qwen3.6-35b mid-run).
  Will report results in PR before merge.
* §4 user-onboarding personas + §6 agent smoke: pending; will run
  serially after §5 (per CLAUDE.md "never in parallel" guidance for
  model-server workloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(api): detect text-only forks of multimodal architectures (#393)

Issue #393 reports that ``rapid-mlx serve /path/to/Qwen3.6-35B-A3B-MLX-8bit``
crashes on startup because the model is routed to the MLLM batched
engine even though the user's checkpoint is text-only. Reporter (Tylast)
correctly identified the chain: the checkpoint's config.json declares
``vision_config`` (because the base ``Qwen3_5MoeForConditionalGeneration``
architecture is multimodal-capable), so ``is_mllm_model`` returns True,
so the MLLM loader takes over, then hits the hybrid-backbone /
ArraysCache incompatibility documented in the closed-as-spam #385.

The text-only A3B fork ships zero vision tensors in its safetensors,
even though its config.json carries the full vision_config block. Our
detection trusted the config and ignored the actual weight presence.

Fix: when config indicates VLM AND the path is a local directory AND
that directory ships ``model.safetensors.index.json``, scan the index
for tensor-name prefixes that indicate real vision/audio weights
(``vision_tower``, ``visual.``, ``audio_tower``, ``mm_projector``, …).
If none are present, override to text-only routing. The check fires
only in the True → False direction; the False direction is preserved
as-is to keep existing text-routed models stable.

Conservative on edge cases:
* Single-file safetensors (no sharded index) → return None from the
  probe and trust config. Wrong-True here means the text path errors
  clearly at first image request, whereas wrong-False would silently
  corrupt every text request on a real VLM, so a wrong True is by far
  the cheaper mistake.
* Unreadable / oversized / malformed index → same. Fall back to config.
* HF repo IDs (not local dirs) → unchanged; we'd need a network call
  to inspect remote tensors.
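
A simplified sketch of the probe and the one-directional override. It relies on the standard `weight_map` key of a sharded safetensors index; the function names are hypothetical, the marker list contains only the four prefixes named above, matching by substring is an assumption, and the size cap mentioned for oversized indexes is omitted.

```python
import json
from pathlib import Path
from typing import Optional

# Markers named in the notes above; the real list is longer (elided there).
MULTIMODAL_MARKERS = ("vision_tower", "visual.", "audio_tower", "mm_projector")


def has_multimodal_weights(model_dir: str) -> Optional[bool]:
    """True/False when the sharded index is readable, None when we can't tell."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    try:
        weight_map = json.loads(index_path.read_text())["weight_map"]
        names = list(weight_map.keys())
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None  # missing, unreadable, or malformed: caller trusts config
    return any(marker in name
               for name in names
               for marker in MULTIMODAL_MARKERS)


def route_as_mllm(config_says_vlm: bool, model_dir: str) -> bool:
    """Weights can only downgrade True to False, never promote False to True."""
    if not config_says_vlm:
        return False  # text-only config: never probe weights
    return has_multimodal_weights(model_dir) is not False
```

Returning a three-valued result (True, False, None) keeps "no vision tensors found" distinct from "could not inspect", which is what lets every uncertain case fall back to the config.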

Tests:
* New ``TestIsMllmModelWeightsPresenceOverride`` class — 6 cases:
  - vision_config + no vision tensors → False (the #393 fix path)
  - vision_config + vision tensors → True (genuine VLM still works)
  - audio_config + audio tensors → True (audio branch covered)
  - missing index → fall back to config
  - malformed index → fall back to config
  - text-only config → never even probe weights

Total tests: 102 pass (was 96). ruff clean.

This change is bundled into the v0.6.51 bump because the user-facing
fix is small + isolated and waiting for v0.6.52 would mean Tylast
keeps hitting the crash for another release cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* docs(install): add brew 5.x homebrew/core pre-flight hint

Brew 5.x's install sandbox cannot auto-tap homebrew/core mid-install
when a third-party formula depends on core packages ([email protected], rust).
Users on fresh brew installs (API-only, no homebrew/core tap cloned)
see "Operation not permitted" on /opt/homebrew/Library/Taps/homebrew/.

Pre-tapping with `brew tap homebrew/core --force` (one-time, ~1.3 GB)
lets the install complete. Brew 4.x and earlier never needed this.

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

May 16, 2026
51a00f0

v0.6.50

chore: bump version to 0.6.50

Same-day patch for #387 (chat_template_kwargs.enable_thinking silently
ignored) which landed in #389 / 744a919.

Pre-merge SHA capture (Release SOP §8 audit anchor):
  wheel:  3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4
  sdist:  5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6

Release SOP gates skipped/run:
- §3 install size: skipped — no dep changes
- §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish)
- §5 perf regression: skipped — no inference engine changes
- §6 agent integration smoke: skipped — request-model + parser fix doesn't
  change tool-call/reasoning shape
- §7B dep freshness: PASSED — uvicorn 0.47.0 (uploaded 1d ago) verified at
  upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway
- §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

May 15, 2026
f591b9b

v0.6.49

chore: bump version to 0.6.49

* feat(bench): DFlash sweep harness + 6 new 8-bit aliases

scripts/bench_dflash.py: sequential two-server (baseline vs DFlash), 3 runs
x 5 workloads (4 code + chat), --disable-prefix-cache, reliability gates
inherited from bench_suffix_decoding_integrated.py (MIN_DECODE_TIME=0.5s,
TPS_CEILING=500). enable_thinking=false in the chat payload prevents the
Qwen3-family reasoning-token TTFT artifact that misclassifies thinking-mode
output as prefill. Configurable SHIP gate: --gate (code median, default
1.30x) and --non-code-floor (chat etc., default 1.00x).

aliases.json: 6 new 8-bit entries with no DFlash:
- qwen3.5-4b-8bit, qwen3.5-9b-8bit (hybrid, chat regresses)
- qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit (mlx-vlm 0.5.0 has no
  base loader for these architectures - DFlash unavailable regardless of
  drafter)
- gemma-4-31b-8bit (dense, chat regresses)

evals/results/dflash_*.json: 6 raw bench JSONs from the sweep. Code
workloads strongly favor DFlash on every dense candidate (median 1.31x-
1.98x), but chat regresses 22-44% universally. MoE re-bench
(qwen3.6-35b-8bit-poc) confirms z-lab's latest A3B drafter still loses
0.89x median - MoE gate stays.

test_aliases_contract.py: drafter prefix allow-list extended to include
z-lab/Qwen3-, z-lab/gemma-4-, z-lab/LLaMA3.1- (future-proofs the gate for
non-Qwen3.5/3.6 drafters z-lab now ships). Marker check accepts "DFlash"
anywhere in the repo name (handles -DFlash-b16, -DFlash-UltraChat suffix
tags). No currently-shipping alias uses these new prefixes; production
qwen3.5-27b-8bit and qwen3.6-27b-8bit DFlash settings are untouched -
whether to apply the strict chat-floor gate retroactively to those is a
separate policy call.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test(ssot): bump alias counts for 6 new 8-bit aliases

Hard-coded counts in test_model_profiles_ssot.py drift when aliases are
added. Bump to match aliases.json after the DFlash sweep additions:
- list_aliases / list_profiles: 58 → 64 (6 new 8-bit aliases)
- qwen3.5-* family: 8 → 10 (qwen3.5-4b-8bit, qwen3.5-9b-8bit added)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): address codex round-1 review

- bench_dflash.py: wrap subprocess.Popen + readiness loop in try/except
  to close logf on any error path (file-descriptor leak fix).
- bench_dflash.py: raise_for_status() on the chat-completions stream so
  4xx/5xx server errors surface as exceptions rather than producing a
  silent 0-token WorkloadRun.
- bench_dflash.py: separate ``ship: bool`` from the human-readable
  ``decision`` string; exit code keys off the bool so a future
  decision-string tweak can't accidentally flip a successful run to a
  non-zero exit.
- test_aliases_contract.py: anchor the DFlash drafter marker check with
  ``re.search(r"(?:^|-)DFlash(?:$|-)", d)`` so substring matches like
  ``-notDFlash-utils`` no longer pass the allow-list.

All four findings reported by deepseek-v4-pro on round 1 of pr_validate.
No P0 findings; these are P1/P2 hardenings.
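
The effect of anchoring the marker can be checked in isolation. The regex is the one quoted above; the helper name and the sample repo names are made up for illustration.

```python
import re

# "DFlash" must be a whole hyphen-delimited token (or sit at a string edge).
_DFLASH_MARKER = re.compile(r"(?:^|-)DFlash(?:$|-)")


def has_dflash_marker(repo_name: str) -> bool:
    return _DFLASH_MARKER.search(repo_name) is not None


accepted = [n for n in ("z-lab/Qwen3-8B-DFlash-b16", "model-DFlash")
            if has_dflash_marker(n)]
rejected = [n for n in ("pkg-notDFlash-utils", "model-DFlashy")
            if not has_dflash_marker(n)]
```

Both names in the first tuple end up in `accepted` and both in the second end up in `rejected`, since a bare substring match would have let `-notDFlash-utils` through.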

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): close server log file in ServerHandle.stop()

Round-2 codex review caught the success path: ``logf`` was only closed
on error paths, never on a successful server startup. Tie ``logf``
lifetime to the ``ServerHandle`` (held while the child writes, closed
in ``stop()`` after the process exits). Bench runs many ports back-to-
back, so leaking 2 FDs per model would build up over a long sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.49

Surface-touching release. Bundle since v0.6.48:
- 6 new 8-bit aliases in aliases.json (qwen3.5-4b-8bit, qwen3.5-9b-8bit,
  qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit, gemma-4-31b-8bit)
- New scripts/bench_dflash.py harness (Model Onboarding SOP §6 tooling)
- Contract-test allow-list extension for future z-lab DFlash drafter
  families (Qwen3-, gemma-4-, LLaMA3.1-)
- 6 raw DFlash bench JSONs captured in evals/results/

No inference-path changes. Production DFlash settings for qwen3.5-27b-8bit
and qwen3.6-27b-8bit unchanged.

Pre-merge artifact SHAs (Release SOP §8):
- wheel:  167294f382553f0e4b06073fb412d72ca6282c3e05b81e77097269fae40dfb9a
- sdist:  d1dc9e97e8581f59fe3b288b51ab23fdbef78b5053a7c10412f2c1e13ca684c3

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

May 14, 2026
1cbbb86

v0.6.48

chore: bump version to 0.6.48

Curate recommended_sampling for 10 aliases with empty/partial upstream generation_config.json.

- Devstral 1.x / 2.x: temperature=0.15 (Mistral code-tuned default)
- Gemma 3 (1B/12B/27B) + 3n E4B + Gemma 4 (26B/31B): (1.0, 0.95, 64)
- GLM-4.5-Air: temperature=0.6 (thinking-mode default; alias has reasoning_parser=glm4)
- GLM-4.7-Flash: top_p=0.95 only (upstream ships temperature=1.0)

Curation is gap-filling only; no entry contradicts a non-empty upstream value. Pinned by fixture-based contract tests.

User-visible behavior change: aliases listed above now use the model author's published sampling defaults instead of the framework hard-fallback 0.7/0.9.

May 14, 2026
9eedbf2

v0.6.47

chore: bump version to 0.6.47

Cascade sampling defaults through request > CLI > AliasProfile > generation_config.json > fallback.

- New CLI flags: --default-min-p, --default-repetition-penalty, --default-presence-penalty, --default-frequency-penalty
- New AliasProfile.recommended_sampling field (curated per-alias overrides)
- New vllm_mlx/utils/generation_config.py reads HF generation_config.json from the local snapshot (refs/main aware), no network fetch
- All three OpenAI/Anthropic-compat routes (chat / completions / messages) share build_extended_sampling_kwargs
- Anthropic adapter forwards None for unset sampling fields so the cascade fires (was previously hard-coding 0.7/0.9 and short-circuiting)
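
The cascade reduces to "first non-None value wins", which is also why the Anthropic adapter has to forward None rather than a hard-coded number. A minimal sketch, assuming each layer is a plain dict and that the 0.7/0.9 fallback refers to temperature and top_p; `resolve_sampling` is a hypothetical name, not build_extended_sampling_kwargs itself.

```python
from typing import Any, Optional

# Assumed mapping of the 0.7/0.9 hard fallback mentioned above.
FALLBACK = {"temperature": 0.7, "top_p": 0.9}


def resolve_sampling(name: str, *layers: Optional[dict]) -> Any:
    """Return the first non-None value for `name`, scanning in priority order."""
    for layer in layers:
        if layer and layer.get(name) is not None:
            return layer[name]
    return FALLBACK.get(name)  # None when there is no fallback either


request = {"temperature": None, "top_p": 0.5}  # None means "not set by client"
cli = {"temperature": None}
alias = {"temperature": 0.15}
generation_config = {"temperature": 1.0, "top_k": 64}

temperature = resolve_sampling("temperature", request, cli, alias, generation_config)
```

Here `temperature` resolves to the alias-level 0.15: the request and CLI layers both leave it unset, so the lookup falls through to the AliasProfile layer before generation_config.json is consulted.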

May 14, 2026
63989bb

v0.6.46

chore: bump version to 0.6.46

Add --default-top-k flag mirroring --default-temperature / --default-top-p. Closes #369. Request > CLI default > engine default; returns None to skip forwarding when neither set. 4 new TestResolveTopK unit tests; full suite 3323 pass.

May 13, 2026
5f731ca

v0.6.45

chore: bump version to 0.6.45

Fix Gemma3/Gemma3n MLLM text-only path returning zero tokens. Adds 8 regression tests, no behavior change for other models. Full unit suite 3319 pass, CI 8/8 green, make check qwen3.5-4b green (pre-existing thinking-toggle flake noted in PR).

May 13, 2026
762cbde


Tags: raullenchai/Rapid-MLX