Tags: raullenchai/Rapid-MLX

v0.6.54

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.54

SOP §10 routing escape hatches — closes the bug-class behind #393.

## What

5 new override flags (or flag pairs), one for each binary auto-routing decision:

| Flag | Override target |
|---|---|
| `--no-mllm` / `--text-only` | `is_mllm_model()` MLLM auto-detection (#393) |
| `--no-tool-call-parser` | AliasProfile tool-call parser auto-selection |
| `--no-reasoning-parser` | AliasProfile reasoning parser auto-selection |
| `--force-hybrid` / `--no-hybrid` | `ModelConfig.is_hybrid` (gates spec/suffix decode) |
| `--force-spec-decode` / `--no-spec-decode` | `ModelConfig.supports_spec_decode` (gates MTP/DFlash/suffix) |

Plus a friendly-error wrapper for the original #393 symptom (`ValueError: Missing N parameters: vision_tower.*` → re-raised as `RuntimeError` pointing at `--no-mllm`).
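
A minimal sketch of that wrapper pattern, not the project's actual code. `load_with_friendly_error` and its `loader` argument are hypothetical stand-ins, and the regex assumes the upstream message looks like the symptom quoted above:

```python
import re

# Assumed shape of the upstream error text; the real message format may differ.
_MISSING_VISION = re.compile(r"Missing \d+ parameters:.*vision_tower", re.DOTALL)


def load_with_friendly_error(loader, *args, **kwargs):
    """Call `loader` (hypothetical) and re-raise the vision-weights error with a hint."""
    try:
        return loader(*args, **kwargs)
    except ValueError as exc:
        if _MISSING_VISION.search(str(exc)):
            # Keep the original exception chained so the full detail is not lost.
            raise RuntimeError(
                "Checkpoint appears to ship no vision weights; "
                "retry with --no-mllm (or --text-only)."
            ) from exc
        raise  # unrelated ValueErrors pass through unchanged
```

Chaining with `from exc` keeps the original `ValueError` in the traceback, so the hint adds information without hiding the root cause.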

## Why

#393 (Tylast — Qwen3.6-35B-A3B-MLX-8bit half-quantized vision) hit `is_mllm_model()` returning True structurally, but the actual safetensors only had text weights. No CLI override existed → the user had no way to recover without waiting for a patch release.

The fix isn't just "add `--no-mllm`" — it's "every binary auto-routing decision needs both directions, and the standard tests prove it stays that way."

## Coverage

Plumbed end-to-end through 3 entrypoints:
- `rapid-mlx serve` (unified CLI)
- `python -m vllm_mlx.server` (standalone)
- `vllm-mlx-bench` (only `--no-mllm` — the other 4 aren't relevant to benchmarking)

The override is applied in `EngineCore.__init__`, *after* `enrich_model_config()` and *before* `self.scheduler.model_config = self.model_config`: a single point of mutation that every Scheduler reader sees.

3 lines of defense against passing mutually exclusive flags together (e.g. `--force-hybrid` with `--no-hybrid`):
1. CLI dispatch (`sys.exit(2)`)
2. `server.load_model` (`ValueError`)
3. `EngineCore.__init__` (`ValueError`)
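
The layering can be sketched as follows. This is an illustrative reduction to one flag pair and two layers, assuming a shared validator; `build_parser`, `validate_overrides`, and `main` are hypothetical names, not the project's functions:

```python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="demo-serve")
    parser.add_argument("--force-hybrid", action="store_true")
    parser.add_argument("--no-hybrid", action="store_true")
    return parser


def validate_overrides(force_hybrid, no_hybrid):
    """Library-level check: raises instead of exiting so programmatic callers can recover."""
    if force_hybrid and no_hybrid:
        raise ValueError("--force-hybrid and --no-hybrid are mutually exclusive")


def main(argv):
    """CLI-level check: same rule, reported as exit status 2 like an argparse usage error."""
    args = build_parser().parse_args(argv)
    try:
        validate_overrides(args.force_hybrid, args.no_hybrid)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        sys.exit(2)
    return args
```

Repeating the check below the CLI matters because callers that import the server or engine directly never pass through argument parsing.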

## SOP gate

`tests/test_no_mllm_flag.py::test_auto_routing_flags_have_force_on_and_force_off_pair` — AST-based registry test. Every entry in `AUTO_ROUTING_FLAG_PAIRS` declares which entrypoint files must register BOTH directions. Adding a new model-taking CLI = adding it to every pair's required list. CI fails loudly until you do.
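
A rough sketch of how such an AST-based check might work, assuming flags are registered through `add_argument` calls with string literals. The helper names and the `(force_on, force_off)` pair shape are assumptions; the real `AUTO_ROUTING_FLAG_PAIRS` layout is not shown in this message:

```python
import ast


def registered_flags(source):
    """Collect every string literal passed positionally to an `.add_argument(...)` call."""
    flags = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "add_argument"
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    flags.add(arg.value)
    return flags


def missing_directions(source, pairs):
    """Return the flags from `pairs` that the entrypoint source never registers."""
    flags = registered_flags(source)
    return sorted(flag for pair in pairs for flag in pair if flag not in flags)
```

Because it only parses source text, a check like this needs no model, no GPU, and no import of the entrypoint module.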

## Validation

- 4 codex review rounds (R1 caught DFlash bypass + MTP gap + missing standalone-server wiring + weak registry test; R2 caught bench gap; R3 clean)
- `pr_validate 407 -v`: MERGE-SAFE (lint, supply-chain, 299 targeted, 3646 unit, 3×3 stress matrix on Qwen3-0.6B / Qwen3.5-35B-A3B / Qwen3.6-27B all PASS)

## Out of scope (documented)

`OutputRouter.from_tokenizer` (Gemma 4 / Harmony channel auto-detect) — vocab-based multi-format detection with built-in legacy-parser fallback. Not in the registry. If a false-positive surfaces in the wild, add an override flag then.

## Files

- `vllm_mlx/{cli.py, server.py, engine/batched.py, engine_core.py, scheduler.py, models/mllm.py, benchmark.py}`
- `tests/test_no_mllm_flag.py` (21 tests)

v0.6.53

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.53

Fixes #404 (M5 single-stream GPU crash).

Probe-and-cache shim around mx.new_thread_local_stream, installed lazily
at every mlx_lm.generate consumer (scheduler, decode, model_runner,
mamba_cache). Transparent on M1-M4; falls back to mx.default_stream on
M5 single-stream devices. __init__.py stays mlx-free so metadata-only
import works on Metal-less systems.
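
A generic sketch of the probe-and-cache idea, under stated assumptions: the two MLX functions are injected as zero-argument callables (their real signatures are not given here), failure is assumed to surface as a catchable exception, and `make_stream_factory` is a hypothetical name. How the actual shim detects an M5 single-stream device is not described in this message.

```python
import functools


def make_stream_factory(new_thread_local_stream, default_stream):
    """Return a zero-arg factory that probes once and remembers which path works.

    Both arguments are stand-ins for the MLX calls named above.
    """

    @functools.lru_cache(maxsize=None)
    def _probe():
        # Runs at most once per factory; the boolean result is cached.
        try:
            new_thread_local_stream()
            return True
        except Exception:
            return False

    def get_stream():
        return new_thread_local_stream() if _probe() else default_stream()

    return get_stream
```

Building the factory lazily inside each consumer, rather than at package import time, is what would let the top-level package stay importable without MLX.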

Also adds SOP gap fix: structural audit test that guards every future
mlx_lm.generate consumer file from forgetting the install prelude.

Validation: pr_validate MERGE-SAFE (lint, supply_chain, targeted, full
unit 3625 passed, stress matrix 3x3 PASS). Codex review x 3 rounds.

v0.6.52

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.52

* fix(cli): plumb --prefill-step-size into SchedulerConfig + add fidelity audit (#400)

Root cause: serve_command built SchedulerConfig(...) without
prefill_step_size=args.prefill_step_size, so the field kept its 2048
dataclass default. The MLLM batch path then used 2048 as a per-batch
cap (prefill_step_size * len(requests)) and rejected any prompt >2048
tokens, even when the user passed --prefill-step-size 32768.

The CLI flag was being routed to load_model(prefill_step_size=...) but
that parameter was accepted and silently discarded — BatchedEngine reads
the value off scheduler_config only. So the flag was effectively dead
for both LLM and MLLM continuous-batching paths.

Changes:
- vllm_mlx/cli.py — pass prefill_step_size in SchedulerConfig kwargs
  inside serve_command (the actual fix).
- vllm_mlx/server.py — remove the dead load_model(prefill_step_size=...)
  parameter and its docstring entry. Remove the dead arg-pass in
  server.py:main()'s load_model call.
- vllm_mlx/mllm_batch_generator.py — rewrite the "exceeds safe limit"
  error to name the actual cap formula and point at --prefill-step-size,
  so users hitting the cap know what knob to turn.

New SOP gate (so this class of bug can't recur silently):
- scripts/audit_cli_config_fidelity.py — AST audit. For every function
  in cli.py, finds the 3-way intersection of (function reads args.X)
  ∧ (SchedulerConfig has field X) ∧ (kwarg X not passed at site).
  Each hit is a user-visible silent-flag-drop bug. Zero-import
  (parses dataclass fields from source), runs on plain Linux CI.
- scripts/dev_test.py — new "audit" tier, included in smoke/all/full.
- Makefile — `make audit` target; `make smoke` now runs lint → audit → unit.
- .github/workflows/ci.yml — audit step in the lint job, so every PR
  and every push to main gets the gate for free.
- tests/test_cli_config_fidelity.py — 4 tests covering: (1) the #400
  regression itself via AST, (2) audit clean on current main, (3) audit
  detects synthetic drift, (4) audit doesn't false-positive when a
  field is in the config but not read by the function.
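
The 3-way intersection can be sketched roughly as below. This is an illustration of the idea, not the contents of scripts/audit_cli_config_fidelity.py; the helper names are hypothetical, and it assumes the parsed namespace is always called `args` and the config is built by a direct `SchedulerConfig(...)` call:

```python
import ast


def dataclass_fields(source, class_name):
    """Field names of `class_name`, read from annotated assignments (no import needed)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                stmt.target.id
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)
            }
    return set()


def dropped_flags(cli_source, config_source, class_name="SchedulerConfig"):
    """List (function, field) pairs where args.X is read but X is not passed to the config."""
    fields = dataclass_fields(config_source, class_name)
    hits = []
    for func in ast.walk(ast.parse(cli_source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        reads, passed, builds = set(), set(), False
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "args"
            ):
                reads.add(node.attr)
            elif (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == class_name
            ):
                builds = True
                passed |= {kw.arg for kw in node.keywords if kw.arg}
        if builds:  # only functions that actually construct the config are audited
            hits += [(func.name, name) for name in sorted((reads & fields) - passed)]
    return hits
```

A field that is in the config but never read from `args` drops out of the intersection, which corresponds to the no-false-positive case in test (4).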

Known follow-up (not in this PR):
- vllm_mlx/server.py:main() (the `python -m vllm_mlx.server` /
  `mise run` entry) declares its own --prefill-step-size flag but
  builds no SchedulerConfig, so the flag is also dead in that path.
  Removing it would break argparse for anyone using that entry —
  needs a small refactor (build a SchedulerConfig in main()) rather
  than a flag deletion. Tracking separately.

Verified:
- python3.12 scripts/audit_cli_config_fidelity.py → exit 0
- python3.12 -m pytest tests/test_cli_config_fidelity.py → 4 passed
- python3.12 -m pytest tests/ (excl. integrations) → 3481 passed
- ruff check + ruff format --check → clean
- Live: `rapid-mlx serve qwen3-0.6b-8bit --port 8765 --prefill-step-size 8192
  --continuous-batching` boots, responds to /v1/chat/completions

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.52

PR #400 fix (silent --prefill-step-size flag drop on continuous-batching
path) is user-facing and triggers version-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server): keep load_model(prefill_step_size=...) kwarg for back-compat (codex round 2)

Codex round 2 on PR #405 flagged a real regression risk: removing the
documented load_model(prefill_step_size=...) kwarg in a patch release
breaks any external caller written against the previous signature
(would TypeError on upgrade, before any model loads).

Resolution: restore the kwarg as deprecated-but-functional. When passed:
  - emit DeprecationWarning
  - translate the value into scheduler_config.prefill_step_size (which
    is what the original kwarg was *supposed* to do but never did —
    that was the root cause of #400 itself).
  - synthesise a default SchedulerConfig if the caller didn't pass one.

So callers using the legacy kwarg get the value they asked for *and* a
nudge to migrate to scheduler_config. The pre-0.6.52 silent no-op becomes
correct behavior plus a deprecation pointer.
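
A minimal sketch of the deprecated-but-functional pattern described above. The `SchedulerConfig` stand-in and the reduced `load_model` signature are assumptions for illustration; only the 2048 default and the three listed behaviours come from this message:

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerConfig:  # stand-in with only the one field discussed here
    prefill_step_size: int = 2048


def load_model(model, scheduler_config: Optional[SchedulerConfig] = None,
               prefill_step_size: Optional[int] = None):
    """Illustrative signature; returns the effective config instead of loading anything."""
    if prefill_step_size is not None:
        warnings.warn(
            "prefill_step_size= is deprecated; set scheduler_config.prefill_step_size",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        if scheduler_config is None:
            scheduler_config = SchedulerConfig()  # synthesise a default config
        scheduler_config.prefill_step_size = prefill_step_size
    return scheduler_config
```

In this sketch the legacy kwarg wins over a value already present in a passed-in config, and the caller's config object is mutated in place; whether the real implementation makes the same choices is not stated.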

New regression test test_load_model_prefill_step_size_back_compat_translation
asserts: the kwarg is still in the signature, calling with it emits
DeprecationWarning, and the value lands in scheduler_config.prefill_step_size.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(server.py:main): forward --prefill-step-size via SchedulerConfig (codex round 3)

Codex round 3 on PR #405 caught the same silent-flag-drop pattern in the
secondary `python -m vllm_mlx.server` / `mise run` entrypoint. The flag
was parsed by argparse but not forwarded to load_model — leaving the
exact bug class this PR claims to eliminate, in a file the PR already
touched.

server.py:main now constructs a SchedulerConfig with the user-supplied
prefill_step_size and passes it to load_model. The unified rapid-mlx CLI
in cli.py already does this (via its richer SchedulerConfig builder);
the standalone entry only needs the one knob it exposes.

Hardening:

1. Extended audit to also scan server.py — `CLI_ENTRY_PATHS` now lists
   both cli.py and server.py. Either entrypoint adding a future
   args.X-reads-but-not-passed-to-SchedulerConfig pattern will fail the
   audit at PR time.

2. New regression test test_prefill_step_size_is_plumbed_in_server_main
   mirrors the existing serve_command one — symmetric coverage for both
   entrypoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

v0.6.51

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.51

* chore: bump version to 0.6.51

User-facing changes since v0.6.50:

* fix(prefix-cache): slice 3D KV state along the right axis (#392) —
  inference correctness fix for prefix cache restore on KV tensors
  shaped (B, H, S) (was slicing the wrong axis, surfacing as
  silent first-token corruption on cache-hit prompts).
* perf(scheduler): drop per-decode list() copy of output_token_ids
  (#391) — small per-token allocation removed from the decode hot
  path; benefit scales with output length on multi-stream batches.
* feat(json-output): strip spurious backslashes before non-ASCII
  chars (#394) — JSON-emitting models occasionally emit `\\é` etc.
  during constrained generation; now stripped before delivery so
  `json.loads` doesn't choke on user output.
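
For the #394 item, a naive sketch of the stripping step, assuming it can be done with a single regex over the emitted text. `strip_spurious_backslashes` is a hypothetical name; note that this simple version would also remove a legitimately escaped backslash that happens to precede a non-ASCII character, which the real fix may handle differently:

```python
import json
import re

# One or more backslashes immediately followed by a non-ASCII character.
_BACKSLASH_BEFORE_NON_ASCII = re.compile(r"\\+(?=[^\x00-\x7f])")


def strip_spurious_backslashes(text):
    """Drop backslashes that sit directly before a non-ASCII character."""
    return _BACKSLASH_BEFORE_NON_ASCII.sub("", text)


raw = '{"name": "caf\\é"}'  # the model emitted a backslash before the accented letter
cleaned = strip_spurious_backslashes(raw)
parsed = json.loads(cleaned)  # parses; json.loads(raw) raises on the invalid escape
```

Standard JSON escapes such as `\n` or `\u00e9` are untouched, because the character after the backslash is ASCII in those cases.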

Dev/docs (no user-visible runtime change):
* docs(contributing): codify test precision policy correctness=8bit /
  perf=4bit (#396)
* chore(pr_validate): swap Qwen3.6-27B-4bit → 8bit in stress matrix
  with smoke-tier 4-bit fallback for ≤32 GB hosts (#395)

Pre-merge artifact SHA-256 (audit anchor; will not match post-publish
SHAs because publish.yml rebuilds on Linux runners — see Release SOP §8):
  rapid_mlx-0.6.51-py3-none-any.whl
    5affa5c527bf543b72ddab94e96f2c6308225e6f5facd229d8f21f0b9ede8ce7
  rapid_mlx-0.6.51.tar.gz
    d248b8a7754cbe4d94b4708fa0fb88caa623f7fc71f13f156ece0a3b36224755

Release SOP gates:
* §3 install size: 448 MB (vs 445 MB v0.6.48 baseline, +0.67% — well
  under 1.05× soft-warn threshold).
* §7 supply chain: pip-audit clean on critical deps via OSV; recent
  uploads (HF hub 1.15.0 today, transformers 5.8.1 3 days) verified
  legit upstream (substantive changelogs, known maintainers / HF bot);
  no install.sh / workflows / pyproject.toml diffs since v0.6.50;
  OIDC scope minimal (id-token: write only on publish, contents: write
  only on auto-release); 3 third-party actions still moving-tag pinned
  (peter-evans, peaceiris, codecov) — pre-existing, not a release
  blocker but tracked for follow-up.
* §5 perf: make full in flight (qwen3.5-35b done; qwen3.6-35b mid-run).
  Will report results in PR before merge.
* §4 user-onboarding personas + §6 agent smoke: pending; will run
  serially after §5 (per CLAUDE.md "never in parallel" guidance for
  model-server workloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(api): detect text-only forks of multimodal architectures (#393)

Issue #393 reports that ``rapid-mlx serve /path/to/Qwen3.6-35B-A3B-MLX-8bit``
crashes on startup because the model is routed to the MLLM batched
engine even though the user's checkpoint is text-only. Reporter (Tylast)
correctly identified the chain: the checkpoint's config.json declares
``vision_config`` (because the base ``Qwen3_5MoeForConditionalGeneration``
architecture is multimodal-capable), so ``is_mllm_model`` returns True,
so the MLLM loader takes over, then hits the hybrid-backbone /
ArraysCache incompatibility documented in the closed-as-spam #385.

The text-only A3B fork ships zero vision tensors in its safetensors,
even though its config.json carries the full vision_config block. Our
detection trusted the config and ignored the actual weight presence.

Fix: when config indicates VLM AND the path is a local directory AND
that directory ships ``model.safetensors.index.json``, scan the index
for tensor-name prefixes that indicate real vision/audio weights
(``vision_tower``, ``visual.``, ``audio_tower``, ``mm_projector``, …).
If none are present, override to text-only routing. The check fires
only in the True → False direction; the False direction is preserved
as-is to keep existing text-routed models stable.

Conservative on edge cases:
* Single-file safetensors (no sharded index) → return None from the
  probe and trust config. Wrong-True here means the text path errors
  clearly at first image request, whereas wrong-False would silently
  corrupt every text request on a real VLM. The bad-direction cost
  is much smaller.
* Unreadable / oversized / malformed index → same. Fall back to config.
* HF repo IDs (not local dirs) → unchanged; we'd need a network call
  to inspect remote tensors.
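
The probe and its conservative fallbacks can be sketched like this. It is an illustration under assumptions: the function names are hypothetical, markers are matched as substrings of tensor names, only the four markers named above are checked (the original list is longer), and the oversized-index guard is omitted. The `weight_map` key is the standard layout of a sharded safetensors index.

```python
import json
from pathlib import Path
from typing import Optional

# Subset of the markers listed above; the full list is elided in the message.
MULTIMODAL_MARKERS = ("vision_tower", "visual.", "audio_tower", "mm_projector")


def has_multimodal_weights(model_dir) -> Optional[bool]:
    """True/False if the sharded index answers the question, None if it cannot."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    try:
        weight_map = json.loads(index_path.read_text())["weight_map"]
        names = list(weight_map)
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing, unreadable, or malformed index: no verdict
    return any(marker in name for name in names for marker in MULTIMODAL_MARKERS)


def route_as_mllm(config_says_vlm, model_dir):
    """Only ever downgrades True to False; a text-only config is never probed."""
    if not config_says_vlm:
        return False
    probe = has_multimodal_weights(model_dir)
    return True if probe is None else probe
```

Returning `None` rather than `False` for an unreadable index keeps "no evidence" distinct from "evidence of no vision weights", which is what lets the caller fall back to the config.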

Tests:
* New ``TestIsMllmModelWeightsPresenceOverride`` class — 6 cases:
  - vision_config + no vision tensors → False (the #393 fix path)
  - vision_config + vision tensors → True (genuine VLM still works)
  - audio_config + audio tensors → True (audio branch covered)
  - missing index → fall back to config
  - malformed index → fall back to config
  - text-only config → never even probe weights

Total tests: 102 pass (was 96). ruff clean.

This change is bundled into the v0.6.51 bump because the user-facing
fix is small + isolated and waiting for v0.6.52 would mean Tylast
keeps hitting the crash for another release cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* docs(install): add brew 5.x homebrew/core pre-flight hint

Brew 5.x's install sandbox cannot auto-tap homebrew/core mid-install
when a third-party formula depends on core packages ([email protected], rust).
Users on fresh brew installs (API-only, no homebrew/core tap cloned)
see "Operation not permitted" on /opt/homebrew/Library/Taps/homebrew/.

Pre-tapping with `brew tap homebrew/core --force` (one-time, ~1.3 GB)
lets the install complete. Brew 4.x and earlier never needed this.

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

v0.6.50

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.50

Same-day patch for #387 (chat_template_kwargs.enable_thinking silently
ignored) which landed in #389 / 744a919.

Pre-merge SHA capture (Release SOP §8 audit anchor):
  wheel:  3db4cd7a28ab8271ac39bb0bff1bf40f1e58bacad0913dd34142611eff7336c4
  sdist:  5c29d56b355a5e3d0cf403dfdbcf8f5df5d8a7badad7bd444d974c6f622abbc6

Release SOP gates skipped/run:
- §3 install size: skipped — no dep changes
- §4 fresh-install simulation: skipped pre-bump (will re-run §10 post-publish)
- §5 perf regression: skipped — no inference engine changes
- §6 agent integration smoke: skipped — request-model + parser fix doesn't
  change tool-call/reasoning shape
- §7B dep freshness: PASSED — uvicorn 0.47.0 (uploaded 1d ago) verified at
  upstream tag Kludex/uvicorn@479a2c0c, not a new floor on our side anyway
- §7C-E workflow/OIDC/SHA-pin tampering: no diff vs v0.6.49

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

v0.6.49

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.49

* feat(bench): DFlash sweep harness + 6 new 8-bit aliases

scripts/bench_dflash.py: sequential two-server (baseline vs DFlash), 3 runs
x 5 workloads (4 code + chat), --disable-prefix-cache, reliability gates
inherited from bench_suffix_decoding_integrated.py (MIN_DECODE_TIME=0.5s,
TPS_CEILING=500). enable_thinking=false in the chat payload prevents the
Qwen3-family reasoning-token TTFT artifact that misclassifies thinking-mode
output as prefill. Configurable SHIP gate: --gate (code median, default
1.30x) and --non-code-floor (chat etc., default 1.00x).

aliases.json: 6 new 8-bit entries with no DFlash:
- qwen3.5-4b-8bit, qwen3.5-9b-8bit (hybrid, chat regresses)
- qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit (mlx-vlm 0.5.0 has no
  base loader for these architectures - DFlash unavailable regardless of
  drafter)
- gemma-4-31b-8bit (dense, chat regresses)

evals/results/dflash_*.json: 6 raw bench JSONs from the sweep. Code
workloads strongly favor DFlash on every dense candidate (median 1.31x-
1.98x), but chat regresses 22-44% universally. MoE re-bench
(qwen3.6-35b-8bit-poc) confirms z-lab's latest A3B drafter still loses
0.89x median - MoE gate stays.

test_aliases_contract.py: drafter prefix allow-list extended to include
z-lab/Qwen3-, z-lab/gemma-4-, z-lab/LLaMA3.1- (future-proofs the gate for
non-Qwen3.5/3.6 drafters z-lab now ships). Marker check accepts "DFlash"
anywhere in the repo name (handles -DFlash-b16, -DFlash-UltraChat suffix
tags). No currently-shipping alias uses these new prefixes; production
qwen3.5-27b-8bit and qwen3.6-27b-8bit DFlash settings are untouched -
whether to apply the strict chat-floor gate retroactively to those is a
separate policy call.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* test(ssot): bump alias counts for 6 new 8-bit aliases

Hard-coded counts in test_model_profiles_ssot.py drift when aliases are
added. Bump to match aliases.json after the DFlash sweep additions:
- list_aliases / list_profiles: 58 → 64 (6 new 8-bit aliases)
- qwen3.5-* family: 8 → 10 (qwen3.5-4b-8bit, qwen3.5-9b-8bit added)

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): address codex round-1 review

- bench_dflash.py: wrap subprocess.Popen + readiness loop in try/except
  to close logf on any error path (file-descriptor leak fix).
- bench_dflash.py: raise_for_status() on the chat-completions stream so
  4xx/5xx server errors surface as exceptions rather than producing a
  silent 0-token WorkloadRun.
- bench_dflash.py: separate ``ship: bool`` from the human-readable
  ``decision`` string; exit code keys off the bool so a future
  decision-string tweak can't accidentally flip a successful run to a
  non-zero exit.
- test_aliases_contract.py: anchor the DFlash drafter marker check with
  ``re.search(r"(?:^|-)DFlash(?:$|-)", d)`` so substring matches like
  ``-notDFlash-utils`` no longer pass the allow-list.
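
A short demonstration of what the anchored pattern accepts and rejects. The regex is the one quoted above; the helper name and the repo names in the comments are made up for illustration:

```python
import re

# Pattern quoted in the commit message: "DFlash" must be a whole dash-delimited token.
DFLASH_MARKER = re.compile(r"(?:^|-)DFlash(?:$|-)")


def has_dflash_marker(name):
    """True when `name` contains DFlash as a complete dash-separated component."""
    return DFLASH_MARKER.search(name) is not None


# has_dflash_marker("example/Model-DFlash-b16")       -> True  (dash on both sides)
# has_dflash_marker("example/Model-DFlash")           -> True  (dash before, end after)
# has_dflash_marker("example/Model-notDFlash-utils")  -> False (preceded by a letter)
```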

All four findings reported by deepseek-v4-pro on round 1 of pr_validate.
No P0 findings; these are P1/P2 hardenings.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* fix(bench): close server log file in ServerHandle.stop()

Round-2 codex review caught the success path: ``logf`` was only closed
on error paths, never on a successful server startup. Tie ``logf``
lifetime to the ``ServerHandle`` (held while the child writes, closed
in ``stop()`` after the process exits). Bench runs many ports back-to-
back, so leaking 2 FDs per model would build up over a long sweep.
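
One way to tie the log file's lifetime to the handle, as a sketch only. This `ServerHandle` is a simplified stand-in, not the class in bench_dflash.py, and it omits the readiness loop:

```python
import subprocess
import sys


class ServerHandle:
    """Owns both the child process and the log file the child writes to."""

    def __init__(self, cmd, log_path):
        self.logf = open(log_path, "w")
        try:
            self.proc = subprocess.Popen(
                cmd, stdout=self.logf, stderr=subprocess.STDOUT
            )
        except Exception:
            self.logf.close()  # error path: do not leak the descriptor
            raise

    def stop(self, timeout=10.0):
        try:
            if self.proc.poll() is None:  # still running
                self.proc.terminate()
                try:
                    self.proc.wait(timeout=timeout)
                except subprocess.TimeoutExpired:
                    self.proc.kill()
                    self.proc.wait()
        finally:
            self.logf.close()  # success path: close only after the child is gone


# Usage: handle = ServerHandle([sys.executable, "-c", "print('ready')"], "server.log")
#        ... run the workload ...
#        handle.stop()
```

Closing in `finally` means the file is released even if terminating the child raises.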

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

* chore: bump version to 0.6.49

Surface-touching release. Bundle since v0.6.48:
- 6 new 8-bit aliases in aliases.json (qwen3.5-4b-8bit, qwen3.5-9b-8bit,
  qwen3-4b-8bit, qwen3-8b-8bit, llama-3.1-8b-8bit, gemma-4-31b-8bit)
- New scripts/bench_dflash.py harness (Model Onboarding SOP §6 tooling)
- Contract-test allow-list extension for future z-lab DFlash drafter
  families (Qwen3-, gemma-4-, LLaMA3.1-)
- 6 raw DFlash bench JSONs captured in evals/results/

No inference-path changes. Production DFlash settings for qwen3.5-27b-8bit
and qwen3.6-27b-8bit unchanged.

Pre-merge artifact SHAs (Release SOP §8):
- wheel:  167294f382553f0e4b06073fb412d72ca6282c3e05b81e77097269fae40dfb9a
- sdist:  d1dc9e97e8581f59fe3b288b51ab23fdbef78b5053a7c10412f2c1e13ca684c3

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

---------

Co-authored-by: Your Name <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

v0.6.48

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.48

Curate recommended_sampling for 10 aliases with empty/partial upstream generation_config.json.

- Devstral 1.x / 2.x: temperature=0.15 (Mistral code-tuned default)
- Gemma 3 (1B/12B/27B) + 3n E4B + Gemma 4 (26B/31B): temperature=1.0, top_p=0.95, top_k=64
- GLM-4.5-Air: temperature=0.6 (thinking-mode default; alias has reasoning_parser=glm4)
- GLM-4.7-Flash: top_p=0.95 only (upstream ships temperature=1.0)

Curation is gap-filling only; no entry contradicts a non-empty upstream value. Pinned by fixture-based contract tests.

User-visible behavior change: aliases listed above now use the model author's published sampling defaults instead of the framework hard-fallback 0.7/0.9.

v0.6.47

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.47

Cascade sampling defaults through request > CLI > AliasProfile > generation_config.json > fallback.

- New CLI flags: --default-min-p, --default-repetition-penalty, --default-presence-penalty, --default-frequency-penalty
- New AliasProfile.recommended_sampling field (curated per-alias overrides)
- New vllm_mlx/utils/generation_config.py reads HF generation_config.json from the local snapshot (refs/main aware), no network fetch
- All three OpenAI/Anthropic-compat routes (chat / completions / messages) share build_extended_sampling_kwargs
- Anthropic adapter forwards None for unset sampling fields so the cascade fires (was previously hard-coding 0.7/0.9 and short-circuiting)
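
The cascade amounts to "first source with a non-None value wins". A minimal sketch, assuming each layer can be viewed as a mapping; `resolve_sampling` is a hypothetical helper, not `build_extended_sampling_kwargs`:

```python
from typing import Any, Mapping, Optional


def resolve_sampling(name, *sources: Optional[Mapping[str, Any]], fallback=None):
    """Return the first non-None value for `name`, scanning sources in priority order."""
    for source in sources:
        if source is not None and source.get(name) is not None:
            return source[name]
    return fallback


# Priority order from the summary line: request > CLI > AliasProfile > generation_config.json.
request = {"temperature": None, "top_p": 0.8}  # None means "unset", so the cascade continues
cli = {"temperature": 0.2}
alias_profile = {"temperature": 0.6, "top_k": 64}
generation_config = None  # no generation_config.json in the local snapshot

temperature = resolve_sampling(
    "temperature", request, cli, alias_profile, generation_config, fallback=0.7
)
```

This is also why the Anthropic adapter change matters: a layer that fills in a concrete default instead of `None` stops the scan before lower-priority sources are consulted.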

v0.6.46

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.46

Add --default-top-k flag mirroring --default-temperature / --default-top-p. Closes #369. Precedence: request > CLI default > engine default; the resolver returns None, so top_k is not forwarded, when neither the request nor the CLI sets it. 4 new TestResolveTopK unit tests; full suite 3323 pass.

v0.6.45

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump version to 0.6.45

Fix Gemma3/Gemma3n MLLM text-only path returning zero tokens. Adds 8 regression tests, no behavior change for other models. Full unit suite 3319 pass, CI 8/8 green, make check qwen3.5-4b green (pre-existing thinking-toggle flake noted in PR).