Tags: raullenchai/Rapid-MLX
chore: bump version to 0.6.54

SOP §10 routing escape hatches: closes the bug-class behind #393.

## What

5 new force-off CLI flags that pair with every binary auto-routing decision:

| Flag | Override target |
|---|---|
| `--no-mllm` / `--text-only` | `is_mllm_model()` MLLM auto-detection (#393) |
| `--no-tool-call-parser` | AliasProfile tool-call parser auto-selection |
| `--no-reasoning-parser` | AliasProfile reasoning parser auto-selection |
| `--force-hybrid` / `--no-hybrid` | `ModelConfig.is_hybrid` (gates spec/suffix decode) |
| `--force-spec-decode` / `--no-spec-decode` | `ModelConfig.supports_spec_decode` (gates MTP/DFlash/suffix) |

Plus a friendly-error wrapper for the original #393 symptom (`ValueError: Missing N parameters: vision_tower.*` → re-raised as `RuntimeError` pointing at `--no-mllm`).

## Why

#393 (Tylast, Qwen3.6-35B-A3B-MLX-8bit half-quantized vision) hit `is_mllm_model()` returning True structurally, but the actual safetensors only had text weights. No CLI override existed, so the user had no way to recover without waiting for a patch release.

The fix isn't just "add `--no-mllm`"; it's "every binary auto-routing decision needs both directions, and the standard tests prove it stays that way."

## Coverage

Plumbed end-to-end through 3 entrypoints:

- `rapid-mlx serve` (unified CLI)
- `python -m vllm_mlx.server` (standalone)
- `vllm-mlx-bench` (only `--no-mllm`; the other 4 aren't relevant to benchmarking)

Override applied at `EngineCore.__init__` *after* `enrich_model_config()` and *before* `self.scheduler.model_config = self.model_config`: a single point of mutation that every Scheduler reader sees.

3 lines of defense against mutex misuse:

1. CLI dispatch (`sys.exit(2)`)
2. `server.load_model` (`ValueError`)
3. `EngineCore.__init__` (`ValueError`)

## SOP gate

`tests/test_no_mllm_flag.py::test_auto_routing_flags_have_force_on_and_force_off_pair`: AST-based registry test. Every entry in `AUTO_ROUTING_FLAG_PAIRS` declares which entrypoint files must register BOTH directions. Adding a new model-taking CLI = adding it to every pair's required list. CI fails loudly until you do.

## Validation

- 4 codex review rounds (R1 caught DFlash bypass + MTP gap + missing standalone-server wiring + weak registry test; R2 caught bench gap; R3 clean)
- `pr_validate 407 -v`: MERGE-SAFE (lint, supply-chain, 299 targeted, 3646 unit, 3×3 stress matrix on Qwen3-0.6B / Qwen3.5-35B-A3B / Qwen3.6-27B all PASS)

## Out of scope (documented)

`OutputRouter.from_tokenizer` (Gemma 4 / Harmony channel auto-detect): vocab-based multi-format detection with built-in legacy-parser fallback. Not in the registry. If a false-positive surfaces in the wild, add an override flag then.

## Files

- `vllm_mlx/{cli.py, server.py, engine/batched.py, engine_core.py, scheduler.py, models/mllm.py, benchmark.py}`
- `tests/test_no_mllm_flag.py` (21 tests)
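The force-on / force-off pairing described in the 0.6.54 notes can be pictured as a tri-state override: forced on, forced off, or "leave auto-detection alone". The following is a minimal sketch of that idea, not the project's code. The flag names `--force-hybrid` and `--no-hybrid` come from the notes; `build_parser`, `resolve_override`, and `apply_override` are hypothetical helper names invented for illustration.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing both directions of one routing decision."""
    parser = argparse.ArgumentParser(prog="demo-serve")
    parser.add_argument("--force-hybrid", action="store_true")
    parser.add_argument("--no-hybrid", action="store_true")
    return parser


def resolve_override(force_on: bool, force_off: bool) -> Optional[bool]:
    """Collapse a flag pair into True / False / None (None keeps auto-detection)."""
    if force_on and force_off:
        # Mirrors the "mutex misuse" guard: both directions at once is an error.
        raise ValueError("--force-hybrid and --no-hybrid are mutually exclusive")
    if force_on:
        return True
    if force_off:
        return False
    return None


def apply_override(auto_detected: bool, override: Optional[bool]) -> bool:
    """Use the auto-detected value unless the user forced a direction."""
    return auto_detected if override is None else override


args = build_parser().parse_args(["--no-hybrid"])
is_hybrid = apply_override(
    auto_detected=True,  # pretend auto-routing said "hybrid"
    override=resolve_override(args.force_hybrid, args.no_hybrid),
)
print(is_hybrid)
```

Keeping "no override" as `None` rather than `False` is what lets a single mutation point distinguish "user said no" from "user said nothing".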
chore: bump version to 0.6.53 Fixes #404 (M5 single-stream GPU crash). Probe-and-cache shim around mx.new_thread_local_stream, installed lazily at every mlx_lm.generate consumer (scheduler, decode, model_runner, mamba_cache). Transparent on M1-M4; falls back to mx.default_stream on M5 single-stream devices. __init__.py stays mlx-free so metadata-only import works on Metal-less systems. Also adds SOP gap fix: structural audit test that guards every future mlx_lm.generate consumer file from forgetting the install prelude. Validation: pr_validate MERGE-SAFE (lint, supply_chain, targeted, full unit 3625 passed, stress matrix 3x3 PASS). Codex review x 3 rounds.
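As a rough illustration of the probe-and-cache pattern named in the 0.6.53 notes, the sketch below tries a preferred factory once, remembers whether it worked, and routes later calls accordingly. It assumes the failure surfaces as a catchable Python exception, which may not match how the real shim probes the device. `make_probed_factory` and both stand-in stream functions are hypothetical; no MLX calls are made.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def make_probed_factory(
    preferred: Callable[[], T], fallback: Callable[[], T]
) -> Callable[[], T]:
    """Return a factory that probes `preferred` once and caches the verdict."""
    use_fallback: Optional[bool] = None  # None means "not probed yet"

    def factory() -> T:
        nonlocal use_fallback
        if use_fallback is None:
            try:
                value = preferred()
            except Exception:
                use_fallback = True  # probe failed: never try preferred again
            else:
                use_fallback = False
                return value
        return fallback() if use_fallback else preferred()

    return factory


# Stand-ins for the two stream constructors (assumptions, not MLX APIs).
calls = {"preferred": 0}


def broken_thread_local_stream() -> str:
    calls["preferred"] += 1
    raise RuntimeError("thread-local streams unsupported on this device")


def default_stream() -> str:
    return "default-stream"


get_stream = make_probed_factory(broken_thread_local_stream, default_stream)
first = get_stream()
second = get_stream()
```

Because the verdict is cached, the failing call is attempted only once per process rather than on every generation.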
chore: bump version to 0.6.52

* fix(cli): plumb --prefill-step-size into SchedulerConfig + add fidelity audit (#400)

Root cause: serve_command built SchedulerConfig(...) without prefill_step_size=args.prefill_step_size, so the field kept its 2048 dataclass default. The MLLM batch path then used 2048 as a per-batch cap (prefill_step_size * len(requests)) and rejected any prompt >2048 tokens, even when the user passed --prefill-step-size 32768.

The CLI flag was being routed to load_model(prefill_step_size=...) but that parameter was accepted and silently discarded: BatchedEngine reads the value off scheduler_config only. So the flag was effectively dead for both LLM and MLLM continuous-batching paths.

Changes:

- vllm_mlx/cli.py: pass prefill_step_size in SchedulerConfig kwargs inside serve_command (the actual fix).
- vllm_mlx/server.py: remove the dead load_model(prefill_step_size=...) parameter and its docstring entry. Remove the dead arg-pass in server.py:main()'s load_model call.
- vllm_mlx/mllm_batch_generator.py: rewrite the "exceeds safe limit" error to name the actual cap formula and point at --prefill-step-size, so users hitting the cap know what knob to turn.

New SOP gate (so this class of bug can't recur silently):

- scripts/audit_cli_config_fidelity.py: AST audit. For every function in cli.py, finds the 3-way intersection of (function reads args.X) ∧ (SchedulerConfig has field X) ∧ (kwarg X not passed at site). Each hit is a user-visible silent-flag-drop bug. Zero-import (parses dataclass fields from source), runs on plain Linux CI.
- scripts/dev_test.py: new "audit" tier, included in smoke/all/full.
- Makefile: `make audit` target; `make smoke` now runs lint → audit → unit.
- .github/workflows/ci.yml: audit step in the lint job, so every PR and every push to main gets the gate for free.
- tests/test_cli_config_fidelity.py: 4 tests covering: (1) the #400 regression itself via AST, (2) audit clean on current main, (3) audit detects synthetic drift, (4) audit doesn't false-positive when a field is in the config but not read by the function.

Known follow-up (not in this PR):

- vllm_mlx/server.py:main() (the `python -m vllm_mlx.server` / `mise run` entry) declares its own --prefill-step-size flag but builds no SchedulerConfig, so the flag is also dead in that path. Removing it would break argparse for anyone using that entry; it needs a small refactor (build a SchedulerConfig in main()) rather than a flag deletion. Tracking separately.

Verified:

- python3.12 scripts/audit_cli_config_fidelity.py → exit 0
- python3.12 -m pytest tests/test_cli_config_fidelity.py → 4 passed
- python3.12 -m pytest tests/ (excl. integrations) → 3481 passed
- ruff check + ruff format --check → clean
- Live: `rapid-mlx serve qwen3-0.6b-8bit --port 8765 --prefill-step-size 8192 --continuous-batching` boots, responds to /v1/chat/completions

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.52

PR #400 fix (silent --prefill-step-size flag drop on continuous-batching path) is user-facing and triggers version-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server): keep load_model(prefill_step_size=...) kwarg for back-compat (codex round 2)

Codex round 2 on PR #405 flagged a real regression risk: removing the documented load_model(prefill_step_size=...) kwarg in a patch release breaks any external caller written against the previous signature (would TypeError on upgrade, before any model loads).

Resolution: restore the kwarg as deprecated-but-functional. When passed:

- emit DeprecationWarning
- translate the value into scheduler_config.prefill_step_size (which is what the original kwarg was *supposed* to do but never did; that was the root cause of #400 itself).
- synthesise a default SchedulerConfig if the caller didn't pass one.

So callers using the legacy kwarg get the value they asked for *and* a nudge to migrate to scheduler_config. Pre-0.6.52 silent no-op becomes silent correctness + a deprecation pointer.

New regression test test_load_model_prefill_step_size_back_compat_translation asserts: the kwarg is still in the signature, calling with it emits DeprecationWarning, and the value lands in scheduler_config.prefill_step_size.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server.py:main): forward --prefill-step-size via SchedulerConfig (codex round 3)

Codex round 3 on PR #405 caught the same silent-flag-drop pattern in the secondary `python -m vllm_mlx.server` / `mise run` entrypoint. The flag was parsed by argparse but not forwarded to load_model, leaving the exact bug class this PR claims to eliminate, in a file the PR already touched.

server.py:main now constructs a SchedulerConfig with the user-supplied prefill_step_size and passes it to load_model. The unified rapid-mlx CLI in cli.py already does this (via its richer SchedulerConfig builder); the standalone entry only needs the one knob it exposes.

Hardening:

1. Extended audit to also scan server.py: `CLI_ENTRY_PATHS` now lists both cli.py and server.py. Either entrypoint adding a future args.X-reads-but-not-passed-to-SchedulerConfig pattern will fail the audit at PR time.
2. New regression test test_prefill_step_size_is_plumbed_in_server_main mirrors the existing serve_command one, giving symmetric coverage for both entrypoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
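The three-way intersection behind the 0.6.52 fidelity audit (function reads `args.X`, the config dataclass has field `X`, and `X` is not passed as a keyword at the construction site) can be sketched with the standard `ast` module. This is an illustrative approximation, not the contents of `scripts/audit_cli_config_fidelity.py`: the embedded `SOURCE`, the `max_num_seqs` field, and the `dataclass_fields` / `audit` helpers are assumptions made for the demo.

```python
import ast

# Toy stand-in for cli.py; only prefill_step_size mirrors the real bug.
SOURCE = '''
from dataclasses import dataclass

@dataclass
class SchedulerConfig:
    max_num_seqs: int = 256
    prefill_step_size: int = 2048

def serve_command(args):
    print(args.max_num_seqs, args.prefill_step_size)
    return SchedulerConfig(max_num_seqs=args.max_num_seqs)
'''


def dataclass_fields(tree: ast.Module, class_name: str) -> set[str]:
    """Collect annotated field names from a class body without importing it."""
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                stmt.target.id
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign)
                and isinstance(stmt.target, ast.Name)
            }
    return set()


def audit(source: str, config_name: str = "SchedulerConfig") -> dict[str, set[str]]:
    """Map function name -> config fields read from args but never forwarded."""
    tree = ast.parse(source)
    fields = dataclass_fields(tree, config_name)
    findings: dict[str, set[str]] = {}
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        args_read: set[str] = set()
        passed: set[str] = set()
        builds_config = False
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "args"
            ):
                args_read.add(node.attr)
            elif (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == config_name
            ):
                builds_config = True
                passed.update(kw.arg for kw in node.keywords if kw.arg)
        dropped = (args_read & fields) - passed
        if builds_config and dropped:
            findings[func.name] = dropped
    return findings


findings = audit(SOURCE)
print(findings)
```

Since the audit only parses source text, it needs none of the runtime dependencies, which is consistent with the note that the real script runs on plain Linux CI.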
chore: bump version to 0.6.51

* chore: bump version to 0.6.51

User-facing changes since v0.6.50:

  * fix(prefix-cache): slice 3D KV state along the right axis (#392). Inference correctness fix for prefix cache restore on KV tensors shaped (B, H, S) (was slicing the wrong axis, surfacing as silent first-token corruption on cache-hit prompts).
  * perf(scheduler): drop per-decode list() copy of output_token_ids (#391). Small per-token allocation removed from the decode hot path; benefit scales with output length on multi-stream batches.
  * feat(json-output): strip spurious backslashes before non-ASCII chars (#394). JSON-emitting models occasionally emit `\\é` etc. during constrained generation; now stripped before delivery so `json.loads` doesn't choke on user output.

Dev/docs (no user-visible runtime change):

  * docs(contributing): codify test precision policy correctness=8bit / perf=4bit (#396)
  * chore(pr_validate): swap Qwen3.6-27B-4bit → 8bit in stress matrix with smoke-tier 4-bit fallback for ≤32 GB hosts (#395)

Pre-merge artifact SHA-256 (audit anchor; will not match post-publish SHAs because publish.yml rebuilds on Linux runners, see Release SOP §8):

  rapid_mlx-0.6.51-py3-none-any.whl 5affa5c527bf543b72ddab94e96f2c6308225e6f5facd229d8f21f0b9ede8ce7
  rapid_mlx-0.6.51.tar.gz d248b8a7754cbe4d94b4708fa0fb88caa623f7fc71f13f156ece0a3b36224755

Release SOP gates:

  * §3 install size: 448 MB (vs 445 MB v0.6.48 baseline, +0.67%, well under 1.05× soft-warn threshold).
  * §7 supply chain: pip-audit clean on critical deps via OSV; recent uploads (HF hub 1.15.0 today, transformers 5.8.1 3 days) verified legit upstream (substantive changelogs, known maintainers / HF bot); no install.sh / workflows / pyproject.toml diffs since v0.6.50; OIDC scope minimal (id-token: write only on publish, contents: write only on auto-release); 3 third-party actions still moving-tag pinned (peter-evans, peaceiris, codecov), which is pre-existing, not a release blocker but tracked for follow-up.
  * §5 perf: make full in flight (qwen3.5-35b done; qwen3.6-35b mid-run). Will report results in PR before merge.
  * §4 user-onboarding personas + §6 agent smoke: pending; will run serially after §5 (per CLAUDE.md "never in parallel" guidance for model-server workloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(api): detect text-only forks of multimodal architectures (#393)

Issue #393 reports that ``rapid-mlx serve /path/to/Qwen3.6-35B-A3B-MLX-8bit`` crashes on startup because the model is routed to the MLLM batched engine even though the user's checkpoint is text-only. Reporter (Tylast) correctly identified the chain: the checkpoint's config.json declares ``vision_config`` (because the base ``Qwen3_5MoeForConditionalGeneration`` architecture is multimodal-capable), so ``is_mllm_model`` returns True, so the MLLM loader takes over, then hits the hybrid-backbone / ArraysCache incompatibility documented in the closed-as-spam #385.

The text-only A3B fork ships zero vision tensors in its safetensors, even though its config.json carries the full vision_config block. Our detection trusted the config and ignored the actual weight presence.

Fix: when config indicates VLM AND the path is a local directory AND that directory ships ``model.safetensors.index.json``, scan the index for tensor-name prefixes that indicate real vision/audio weights (``vision_tower``, ``visual.``, ``audio_tower``, ``mm_projector``, …). If none are present, override to text-only routing. The check fires only in the True → False direction; the False direction is preserved as-is to keep existing text-routed models stable.

Conservative on edge cases:

  * Single-file safetensors (no sharded index) → return None from the probe and trust config. Wrong-True here means the text path errors clearly at first image request, whereas wrong-False would silently corrupt every text request on a real VLM. The bad-direction cost is much smaller.
  * Unreadable / oversized / malformed index → same. Fall back to config.
  * HF repo IDs (not local dirs) → unchanged; we'd need a network call to inspect remote tensors.

Tests:

  * New ``TestIsMllmModelWeightsPresenceOverride`` class, 6 cases:
    - vision_config + no vision tensors → False (the #393 fix path)
    - vision_config + vision tensors → True (genuine VLM still works)
    - audio_config + audio tensors → True (audio branch covered)
    - missing index → fall back to config
    - malformed index → fall back to config
    - text-only config → never even probe weights

Total tests: 102 pass (was 96). ruff clean.

This change is bundled into the v0.6.51 bump because the user-facing fix is small + isolated and waiting for v0.6.52 would mean Tylast keeps hitting the crash for another release cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* docs(install): add brew 5.x homebrew/core pre-flight hint

Brew 5.x's install sandbox cannot auto-tap homebrew/core mid-install when a third-party formula depends on core packages ([email protected], rust). Users on fresh brew installs (API-only, no homebrew/core tap cloned) see "Operation not permitted" on /opt/homebrew/Library/Taps/homebrew/. Pre-tapping with `brew tap homebrew/core --force` (one-time, ~1.3 GB) lets the install complete. Brew 4.x and earlier never needed this.

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
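The weight-presence probe described in the #393 fix can be approximated as a three-valued check on the sharded safetensors index: True or False when the index is readable, None when the config should be trusted. The sketch below is an assumption-laden illustration, not the project's `is_mllm_model`. The marker tuple lists only the prefixes named in the notes (the real list is elided there), the substring match is a simplification of the "prefix" scan, and `has_multimodal_weights`, `route_as_mllm`, and `_write_index` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Only the markers named in the release note; the full list is not shown there.
MULTIMODAL_MARKERS = ("vision_tower", "visual.", "audio_tower", "mm_projector")


def has_multimodal_weights(model_dir: str) -> Optional[bool]:
    """True/False if the index is readable, None if we must trust config."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    if not index_path.is_file():
        return None  # single-file checkpoint or missing index
    try:
        names = list(json.loads(index_path.read_text())["weight_map"].keys())
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None  # unreadable or malformed index
    return any(marker in name for name in names for marker in MULTIMODAL_MARKERS)


def route_as_mllm(config_says_mllm: bool, model_dir: str) -> bool:
    """Only ever downgrade True -> False; never probe text-only configs."""
    if not config_says_mllm:
        return False
    probe = has_multimodal_weights(model_dir)
    return True if probe is None else probe


def _write_index(directory: str, tensor_names: list[str]) -> None:
    """Demo helper: write a minimal sharded-index file."""
    index = {"weight_map": {n: "model-00001-of-00001.safetensors" for n in tensor_names}}
    (Path(directory) / "model.safetensors.index.json").write_text(json.dumps(index))


with tempfile.TemporaryDirectory() as text_only, \
        tempfile.TemporaryDirectory() as real_vlm, \
        tempfile.TemporaryDirectory() as no_index:
    _write_index(text_only, ["model.layers.0.self_attn.q_proj.weight"])
    _write_index(real_vlm, ["vision_tower.blocks.0.attn.qkv.weight"])
    results = {
        "text_only_fork": route_as_mllm(True, text_only),
        "real_vlm": route_as_mllm(True, real_vlm),
        "no_index": route_as_mllm(True, no_index),
    }
print(results)
```

Returning None instead of False for the unreadable cases is what keeps the override one-directional: an inconclusive probe can never flip a real VLM to text-only routing.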
chore: bump version to 0.6.50

Same-day patch for #387 (chat_template_kwargs.enable_thinking silently ignored) which landed in #389 / 744a919.

Pre-merge SHA capture (Release SOP §8 audit anchor):

  wheel: 3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4
  sdist: 5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6

Release SOP gates skipped/run:

- §3 install size: skipped, no dep changes
- §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish)
- §5 perf regression: skipped, no inference engine changes
- §6 agent integration smoke: skipped, request-model + parser fix doesn't change tool-call/reasoning shape
- §7B dep freshness: PASSED, uvicorn 0.47.0 (uploaded 1d ago) verified at upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway
- §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
chore: bump version to 0.6.49

* feat(bench): DFlash sweep harness + 6 new 8-bit aliases

scripts/bench_dflash.py: sequential two-server (baseline vs DFlash), 3 runs x 5 workloads (4 code + chat), --disable-prefix-cache, reliability gates inherited from bench_suffix_decoding_integrated.py (MIN_DECODE_TIME=0.5s, TPS_CEILING=500). enable_thinking=false in the chat payload prevents the Qwen3-family reasoning-token TTFT artifact that misclassifies thinking-mode output as prefill. Configurable SHIP gate: --gate (code median, default 1.30x) and --non-code-floor (chat etc., default 1.00x).

aliases.json: 6 new 8-bit entries with no DFlash:

- qwen3.5-4b-8bit, qwen3.5-9b-8bit (hybrid, chat regresses)
- qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit (mlx-vlm 0.5.0 has no base loader for these architectures, so DFlash is unavailable regardless of drafter)
- gemma-4-31b-8bit (dense, chat regresses)

evals/results/dflash_*.json: 6 raw bench JSONs from the sweep. Code workloads strongly favor DFlash on every dense candidate (median 1.31x-1.98x), but chat regresses 22-44% universally. MoE re-bench (qwen3.6-35b-8bit-poc) confirms z-lab's latest A3B drafter still loses 0.89x median, so the MoE gate stays.

test_aliases_contract.py: drafter prefix allow-list extended to include z-lab/Qwen3-, z-lab/gemma-4-, z-lab/LLaMA3.1- (future-proofs the gate for non-Qwen3.5/3.6 drafters z-lab now ships). Marker check accepts "DFlash" anywhere in the repo name (handles -DFlash-b16, -DFlash-UltraChat suffix tags).

No currently-shipping alias uses these new prefixes; production qwen3.5-27b-8bit and qwen3.6-27b-8bit DFlash settings are untouched. Whether to apply the strict chat-floor gate retroactively to those is a separate policy call.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test(ssot): bump alias counts for 6 new 8-bit aliases

Hard-coded counts in test_model_profiles_ssot.py drift when aliases are added. Bump to match aliases.json after the DFlash sweep additions:

- list_aliases / list_profiles: 58 → 64 (6 new 8-bit aliases)
- qwen3.5-* family: 8 → 10 (qwen3.5-4b-8bit, qwen3.5-9b-8bit added)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): address codex round-1 review

- bench_dflash.py: wrap subprocess.Popen + readiness loop in try/except to close logf on any error path (file-descriptor leak fix).
- bench_dflash.py: raise_for_status() on the chat-completions stream so 4xx/5xx server errors surface as exceptions rather than producing a silent 0-token WorkloadRun.
- bench_dflash.py: separate ``ship: bool`` from the human-readable ``decision`` string; exit code keys off the bool so a future decision-string tweak can't accidentally flip a successful run to a non-zero exit.
- test_aliases_contract.py: anchor the DFlash drafter marker check with ``re.search(r"(?:^|-)DFlash(?:$|-)", d)`` so substring matches like ``-notDFlash-utils`` no longer pass the allow-list.

All four findings reported by deepseek-v4-pro on round 1 of pr_validate. No P0 findings; these are P1/P2 hardenings.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): close server log file in ServerHandle.stop()

Round-2 codex review caught the success path: ``logf`` was only closed on error paths, never on a successful server startup. Tie ``logf`` lifetime to the ``ServerHandle`` (held while the child writes, closed in ``stop()`` after the process exits). Bench runs many ports back-to-back, so leaking 2 FDs per model would build up over a long sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.49

Surface-touching release. Bundle since v0.6.48:

- 6 new 8-bit aliases in aliases.json (qwen3.5-4b-8bit, qwen3.5-9b-8bit, qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit, gemma-4-31b-8bit)
- New scripts/bench_dflash.py harness (Model Onboarding SOP §6 tooling)
- Contract-test allow-list extension for future z-lab DFlash drafter families (Qwen3-, gemma-4-, LLaMA3.1-)
- 6 raw DFlash bench JSONs captured in evals/results/

No inference-path changes. Production DFlash settings for qwen3.5-27b-8bit and qwen3.6-27b-8bit unchanged.

Pre-merge artifact SHAs (Release SOP §8):

- wheel: 167294f382553f0e4b06073fb412d72ca6282c3e05b81e77097269fae40dfb9a
- sdist: d1dc9e97e8581f59fe3b288b51ab23fdbef78b5053a7c10412f2c1e13ca684c3

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
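The anchored marker check quoted in the 0.6.49 round-1 fix is easy to try in isolation. The pattern below is copied from the notes; the `has_dflash_marker` wrapper and the sample repo names are hypothetical and exist only to show why anchoring on a hyphen or string boundary rejects lookalike substrings.

```python
import re

# Pattern quoted in the release note: "DFlash" must be a whole
# hyphen-delimited token (or sit at the start/end of the string).
DFLASH_MARKER = re.compile(r"(?:^|-)DFlash(?:$|-)")


def has_dflash_marker(repo_name: str) -> bool:
    """Return True when the name carries a standalone DFlash token."""
    return DFLASH_MARKER.search(repo_name) is not None


# Hypothetical repo names, chosen to exercise each branch of the pattern.
samples = {
    "z-lab/Qwen3-8B-DFlash-b16": has_dflash_marker("z-lab/Qwen3-8B-DFlash-b16"),
    "z-lab/gemma-4-31b-DFlash": has_dflash_marker("z-lab/gemma-4-31b-DFlash"),
    "someone/model-notDFlash-utils": has_dflash_marker("someone/model-notDFlash-utils"),
    "someone/model-DFlashy-utils": has_dflash_marker("someone/model-DFlashy-utils"),
}
for name, matched in samples.items():
    print(f"{name}: {matched}")
```

A plain `"DFlash" in name` test would accept all four samples; the anchored form accepts only the first two.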
chore: bump version to 0.6.48

Curate recommended_sampling for 10 aliases with empty/partial upstream generation_config.json.

- Devstral 1.x / 2.x: temperature=0.15 (Mistral code-tuned default)
- Gemma 3 (1B/12B/27B) + 3n E4B + Gemma 4 (26B/31B): (temperature, top_p, top_k) = (1.0, 0.95, 64)
- GLM-4.5-Air: temperature=0.6 (thinking-mode default; alias has reasoning_parser=glm4)
- GLM-4.7-Flash: top_p=0.95 only (upstream ships temperature=1.0)

Curation is gap-filling only; no entry contradicts a non-empty upstream value. Pinned by fixture-based contract tests.

User-visible behavior change: aliases listed above now use the model author's published sampling defaults instead of the framework hard-fallback 0.7/0.9.
chore: bump version to 0.6.47

Cascade sampling defaults through request > CLI > AliasProfile > generation_config.json > fallback.

- New CLI flags: --default-min-p, --default-repetition-penalty, --default-presence-penalty, --default-frequency-penalty
- New AliasProfile.recommended_sampling field (curated per-alias overrides)
- New vllm_mlx/utils/generation_config.py reads HF generation_config.json from the local snapshot (refs/main aware), no network fetch
- All three OpenAI/Anthropic-compat routes (chat / completions / messages) share build_extended_sampling_kwargs
- Anthropic adapter forwards None for unset sampling fields so the cascade fires (was previously hard-coding 0.7/0.9 and short-circuiting)
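The precedence chain in the 0.6.47 notes (request > CLI > AliasProfile > generation_config.json > fallback) amounts to "first layer with a non-None value wins". The sketch below illustrates that rule and why the adapter must forward None for unset fields. `resolve_sampling` and the sample dictionaries are hypothetical; this is not the project's `build_extended_sampling_kwargs`. The 0.7/0.9 fallback values are the ones quoted in the notes.

```python
from typing import Any, Mapping, Optional

# Framework fallback values quoted in the release notes.
FALLBACK = {"temperature": 0.7, "top_p": 0.9}


def resolve_sampling(key: str, *layers: Optional[Mapping[str, Any]]) -> Any:
    """Return the first non-None value for `key`, walking layers in priority order."""
    for layer in layers:
        value = layer.get(key) if layer else None
        if value is not None:
            return value
    return FALLBACK.get(key)


# Illustrative layers, highest priority first.
request = {"temperature": None, "top_p": 0.8}  # client left temperature unset
cli_defaults: dict = {}                          # no --default-* flags passed
alias_profile = {"temperature": 0.15}            # curated recommended_sampling
generation_config = {"temperature": 1.0, "top_p": 0.95}

layers = (request, cli_defaults, alias_profile, generation_config)
temperature = resolve_sampling("temperature", *layers)
top_p = resolve_sampling("top_p", *layers)
min_p = resolve_sampling("min_p", *layers)
print(temperature, top_p, min_p)
```

If the request layer carried a hard-coded 0.7 instead of None, the loop would stop at the first layer and the alias and generation_config values would never be consulted, which is the short-circuit the notes describe fixing in the Anthropic adapter.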
chore: bump version to 0.6.46 Add --default-top-k flag mirroring --default-temperature / --default-top-p. Closes #369. Request > CLI default > engine default; returns None to skip forwarding when neither set. 4 new TestResolveTopK unit tests; full suite 3323 pass.
chore: bump version to 0.6.45 Fix Gemma3/Gemma3n MLLM text-only path returning zero tokens. Adds 8 regression tests, no behavior change for other models. Full unit suite 3319 pass, CI 8/8 green, make check qwen3.5-4b green (pre-existing thinking-toggle flake noted in PR).