STATUS.md

VoiceLLM — Status & Handoff

For a fresh LLM picking this up cold. Read this first, then 06_milestones.md for the milestone definitions and 01_architecture.md for the module/bus layout.

Last updated: 2026-06-10. M2 complete; M3 quick path shipped; M3.5 continuous STT in tree (opt-in via STT_MODE, still unverified live); LLM-gated speech protocol shipped. Code layout is agent/ + nodes/ + transport/ + core/ (JROS 0.5 structure). June hardening pass: bus is real pub/sub fanout, LLM context-overflow guard + spoken error fallback, metrics now measure the listen/STT phase, Whisper-hallucination ingress filter, farewell detection, mlx stop-marker holdback, config validation, 13-scenario smoke test.

TL;DR

Modular, bus-driven local voice assistant on Apple Silicon. All code lives at the repo root (this docs/ folder is a sibling of config.py, main.py, agent/, nodes/, transport/, core/). Demo/reference code that informed the design lives in references/ and the sibling MockingAgent/ repo. The local references/voice_assistant.py is the proven Google-Home-style baseline we ported from, and references/voice_chat.py preserves the full-duplex AEC/barge-in experiment as carry-forward design material.

Stack: sounddevice + WebRTC VAD → pywhispercpp (whisper.cpp) → swappable LLM (llama-cpp-python default or mlx-lm, both running Gemma 4 26B-A4B 4-bit) → Kokoro TTS. State + routing through a Pub/Sub Bus.

M2 status: complete. Modular voice loop runs end-to-end. Both LLM backends supported; LLM_BACKEND = "llamacpp" is the default since llama-cpp-python handles Gemma's <end_of_turn> natively. The MLX backend now stops correctly via an in-stream marker detector (agent/adapters/mlx/backend.py) — the old eot_token getattr was a no-op and let Gemma run to max_tokens and loop.

M3 status (quick path): shipped. No wake word in the default mode. The orchestrator runs a self-speech similarity filter, a single-slot pending-turn queue, and the LLM gate — every reply must begin with <ignore> or <reply>; ignored replies are suppressed from TTS without audible cost. Every decision is logged to outputs/m3_eval.jsonl.

M3.5 status: in tree, opt-in. nodes/stt/continuous.py is a drop-in replacement for STTTwoPassNode with rolling re-transcription and energy-based phrase segmentation. Activate via STT_MODE = "continuous" in config.py. Default stays "two_pass" so the proven baseline is unchanged.

Polish since M2: all four model loads (fast STT, accurate STT, LLM, TTS) happen at startup with real warm passes — first user turn pays only inference time, not setup. Conversation history is capped at MAX_HISTORY_TURNS = 8 user/assistant pairs to keep prompt size bounded.
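A minimal sketch of the history cap, assuming history is a list of role/content dicts and that any system message is kept outside the cap. Both are assumptions; the real logic lives in agent/llm/node.py, and `trim_history` is a hypothetical name.

```python
MAX_HISTORY_TURNS = 8  # value quoted above; the real setting lives in config.py


def trim_history(history, max_turns=MAX_HISTORY_TURNS):
    """Keep system messages plus the newest `max_turns` user/assistant pairs.

    Assumes `history` is a list of {"role": ..., "content": ...} dicts in
    chronological order (hypothetical layout, not confirmed by the repo).
    """
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    # One turn is a user message plus the assistant reply, so 2 messages.
    kept = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + kept
```

With 8 pairs the prompt carries at most 16 dialogue messages regardless of session length.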

Continuation note: This repo is in active development (June 2026) as the operator's daily voice assistant and the JROS voice-pipeline testbed. Next planned work: M4 barge-in (now unblocked — the bus supports multiple subscribers), a metrics viewer, and a persona system. Persistent memory / LLM-callable tools / skills remain carry-forward ideas, labeled (planned) where mentioned.

What works right now (M2 + M3 quick path)

cd VoiceLLM
python main.py
# REQUIRE_WAKE_WORD=False is the default — just talk.
# Flip back to True in config.py for the wake-word "okay eve" flow.

This reproduces references/voice_assistant.py's behavior, but every concern is now its own node communicating over the bus.

Modules in place

File	Role
config.py	All tunables. Flip `LLM_BACKEND` between `"llamacpp"` (default) and `"mlx"`; flip `STT_MODE` between `"two_pass"` (default) and `"continuous"`.
transport/bus.py	Pub/sub fanout: `subscribe(topics)` returns a per-subscriber queue; `publish()` never blocks (sheds oldest). `Bus.get()` polls the orchestrator's default catch-all subscription. Invariant: raw mic audio never goes on the bus.
core/state.py	`SysState`: `IDLE`/`THINKING`/`RESPONDING`.
core/metrics.py	Per-turn timing → `metrics.csv`.
nodes/audio_session/mic_stream.py	`MicStream` with `paused` flag (ported from voice_assistant.py:96-125).
nodes/audio_session/chimes.py	Wake and follow-up earcons, controlled by `CHIMES_ENABLED`, `WAKE_CHIME_ENABLED`, and `FOLLOWUP_CHIME_ENABLED`.
nodes/stt/two_pass.py	Two-pass cascade ported from voice_assistant.py: VAD worker, fast (`base.en`) + accurate (`medium.en`) eager-loaded with silence warm-up, wake-word + follow-up window. Publishes `stt.text` (dict payload with timing).
nodes/stt/continuous.py	M3.5 hybrid pipeline. Energy-based phrase segmentation, rolling re-transcription. Drop-in interface match for `STTTwoPassNode`. Unverified live; also re-transcribes silence continuously in quiet rooms (known inefficiency).
agent/llm/backend_base.py	`BackendBase` ABC: `load`, `warm`, `stream_chat`, `cancel`.
agent/adapters/mlx/backend.py	mlx-lm impl. In-stream stop-marker detector for `<end_of_turn>` / `<eos>` / `<|im_end|>` (replaces the old broken `eot_token` getattr).
agent/adapters/llama_cpp/backend.py	llama-cpp-python impl. Default backend; chat completion handles Gemma's stop tokens natively.
agent/llm/node.py	Owns history; streams `llm.token` deltas; publishes cleaned reply on `llm.done`. Caps history at `MAX_HISTORY_TURNS = 8` pairs. `clean_for_tts()` ported.
nodes/tts/node.py	Real `KPipeline`. Synth thread + play thread. Sentence-streams. Cancellable. Publishes `mic.pause`, `tts.audio_chunk`, `tts.done`.
agent/orchestrator.py	Single bus consumer; state machine; spawns LLM thread per turn; LLM-gate token buffer (`<ignore>` suppresses TTS, `<reply>` forwards the tail).
main.py	`make_backend()` + `make_stt()` factory funcs, then `Orchestrator(...).run()`.
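For orientation, a sketch of what the `BackendBase` contract could look like. Only the four method names come from the table; the signatures, the message format, and the `EchoBackend` stand-in are assumptions made for illustration.

```python
import threading
from abc import ABC, abstractmethod
from typing import Iterator


class BackendBase(ABC):
    """Method names from the table above; signatures here are assumed."""

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def warm(self) -> None: ...

    @abstractmethod
    def stream_chat(self, messages: list) -> Iterator[str]: ...

    @abstractmethod
    def cancel(self) -> None: ...


class EchoBackend(BackendBase):
    """Hypothetical stand-in that echoes the last user message word by word."""

    def __init__(self):
        self.loaded = False
        self._cancelled = threading.Event()

    def load(self):
        self.loaded = True

    def warm(self):
        # Pull a single token so the first real turn pays no setup cost.
        for _ in self.stream_chat([{"role": "user", "content": "hi"}]):
            break

    def stream_chat(self, messages):
        self._cancelled.clear()
        for word in messages[-1]["content"].split():
            if self._cancelled.is_set():
                return  # stop mid-stream when cancel() was called
            yield word + " "

    def cancel(self):
        self._cancelled.set()
```

A stand-in like this is also handy for exercising the orchestrator without loading a model.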

Bus topics in use

stt.text (dict) — committed user phrase, post-wake-word: {"text": str, "t_speech_start", "t_last_voice", "t_commit", "t_stt_done"} (perf_counter timestamps; continuous mode omits the first two). Plain-str payloads still accepted (pending-turn refires).
llm.token (str) — streaming reply delta.
llm.done (str) — full cleaned reply, fired after the stream ends.
llm.error (str) — generation failed with no usable output; the orchestrator speaks a short fallback. LLMNode already rolled back the user message.
mic.pause (bool) — TTS toggles this around playback (TTS also calls stt.set_paused directly so the pause is synchronous; the bus message feeds the orchestrator's tts_start metric).
tts.audio_chunk (np.float32) — published before sd.play(); nobody consumes it yet (subscriber is M4 (planned) — AEC reference). Use bus.subscribe(("tts.audio_chunk",)) to consume without stealing from the orchestrator.
tts.done (None) — TTS audio queue drained.
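The fanout behavior these topics rely on can be sketched as follows. This is not the code in transport/bus.py: the class name `FanoutBus`, the `maxsize` default, and the `(topic, payload)` tuple format are assumptions. It shows per-subscriber queues, topic filtering, and a `publish()` that sheds the oldest message instead of blocking.

```python
import queue
import threading


class FanoutBus:
    """Illustrative pub/sub fanout; names and defaults are hypothetical."""

    def __init__(self, maxsize=256):
        self._maxsize = maxsize
        self._lock = threading.Lock()
        self._subs = []  # list of (topic set or None for catch-all, queue)

    def subscribe(self, topics=None):
        """Return a private queue receiving only `topics` (None = everything)."""
        q = queue.Queue(maxsize=self._maxsize)
        key = None if topics is None else frozenset(topics)
        with self._lock:
            self._subs.append((key, q))
        return q

    def publish(self, topic, payload=None):
        """Deliver to every matching subscriber without ever blocking."""
        with self._lock:
            subs = list(self._subs)
        for topics, q in subs:
            if topics is not None and topic not in topics:
                continue
            while True:
                try:
                    q.put_nowait((topic, payload))
                    break
                except queue.Full:
                    try:
                        q.get_nowait()  # shed the oldest message to make room
                    except queue.Empty:
                        pass  # a consumer drained it first; retry the put
```

Because each subscriber owns its queue, a slow consumer of tts.audio_chunk loses only its own oldest chunks and never starves the orchestrator.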

Models & paths (verified on disk)

LMSTUDIO_MODELS = ~/.lmstudio/models/
MLX_PATH        = LMSTUDIO_MODELS/mlx-community/gemma-4-26b-a4b-4bit/
GGUF_PATH       = LMSTUDIO_MODELS/lmstudio-community/gemma-4-26B-A4B-it-GGUF/
                  gemma-4-26B-A4B-it-Q4_K_M.gguf

STT: base.en (fast) and medium.en (accurate) — both eager-loaded at startup with a 1.5 s silence warm pass, so first user turn pays only inference cost, not setup. Old behavior lazy-loaded medium.en on the first wake match (a hangover from M2's wake-word path); pointless now that M3 fires the accurate model on every phrase.

Repo layout right now

GITHUB/
├── MockingAgent/                       # working Google-Home-style baseline
│   ├── voice_assistant.py              # the canonical reference for STT/TTS plumbing
│   ├── ollamacpp/                      # chat_mlx.py, chat_llama.py, bench.py
│   ├── kokoro_tts/                     # standalone Kokoro experiments
│   ├── PywisperCpp/                    # all the always-listening STT demos
│   └── legacy_voicellm_drafts/         # old loose demos that used to live in VoiceLLM/
│
└── VoiceLLM/                           # ← THE CODE (flat at the repo root)
    ├── config.py
    ├── main.py
    ├── smoke_test.py                   # 13 fast scenarios, no model loads
    ├── agent/                          # orchestrator + llm node + backend adapters
    ├── nodes/                          # audio_session/ (mic, chimes) + stt/ + tts/
    ├── transport/                      # bus.py (in-process pub/sub fanout)
    ├── core/                           # metrics.py + state.py
    ├── references/                     # local copies of pasted reference scripts
    ├── docs/                           # ← these planning docs
    ├── outputs/                        # m3_eval.jsonl etc.
    ├── requirements.txt
    ├── metrics.csv                     # auto-written by MetricsLog
    ├── LICENSE
    └── README.md

Carry-Forward Ideas

The sections below are preserved as design notes and possible implementation recipes. They are not an active roadmap for this repository unless development is explicitly resumed here.

M3 — Continuous hearing (the actual goal)

Drop the wake word. STT runs always-on; every committed phrase becomes a turn unless we filter it out. This is the "ChatGPT Voice" feel.

We split this into a quick path (lean on the existing two-pass STT) and M3.5 (build the hybrid pipeline node for lower latency / better feel).

M3 quick path — shipped

✅ REQUIRE_WAKE_WORD = False in config.py. The existing STTTwoPassNode already has the no-wake-word branch (nodes/stt/two_pass.py) — every phrase becomes a turn.
✅ Self-speech similarity filter on stt.text ingress in the orchestrator. Compares incoming text against the most recent assistant turn from LLMNode.history_snapshot() via difflib.SequenceMatcher; drops if >= cfg.SELF_SPEECH_SIMILARITY_THRESHOLD (default 0.75).
✅ Pending-turn queue replaces the old "drop while busy" placeholder. Single-slot, last-write-wins; fires when _on_tts_done returns to IDLE, provided the queued utterance is younger than cfg.PENDING_TURN_MAX_AGE_S (default 3.0 s).
✅ LLM gate — the user explicitly didn't want the audio pipeline deciding what's directed speech. So the LLM does. Every reply must begin with <ignore> or <reply> (instruction in config.py:SYSTEM_PROMPT). The orchestrator buffers the first LLM_GATE_BUFFER_CHARS = 30 chars of the streaming reply, decides based on the tag, and either:
- <ignore> → suppresses TTS, transitions straight to IDLE, logs llm_ignored to the eval JSONL.
- <reply> → forwards the post-tag tail to TTS as normal. Falls back to "treat as reply" if the tag never appears within the buffer. See _on_llm_token and _gate_check in agent/orchestrator.py.
✅ Eval logging — every STT decision (accepted / dropped_self_echo / queued_pending / pending_fired / pending_stale / llm_ignored) is appended to outputs/m3_eval.jsonl for offline review. Disable by setting cfg.M3_EVAL_LOG = None.
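The gate decision described above can be sketched like this. It is an illustration, not the code in agent/orchestrator.py: the `ReplyGate` class and its methods are hypothetical, and matching the tag only at the start of the stripped buffer is an assumption.

```python
LLM_GATE_BUFFER_CHARS = 30  # value quoted above


class ReplyGate:
    """Buffers the head of a streamed reply until a gate tag decides it."""

    def __init__(self, buffer_chars=LLM_GATE_BUFFER_CHARS):
        self.buffer_chars = buffer_chars
        self.buf = ""
        self.decision = None  # None until decided, then "ignore" or "reply"

    def feed(self, delta):
        """Return the text to forward to TTS for this delta (may be empty)."""
        if self.decision == "ignore":
            return ""
        if self.decision == "reply":
            return delta
        self.buf += delta
        head = self.buf.lstrip()
        if head.startswith("<ignore>"):
            self.decision = "ignore"
            return ""
        if head.startswith("<reply>"):
            self.decision = "reply"
            return head[len("<reply>"):]  # forward only the post-tag tail
        if len(self.buf) >= self.buffer_chars:
            self.decision = "reply"  # no tag within the buffer: treat as reply
            return self.buf
        return ""

    def flush(self):
        """At end of stream, release a short untagged reply still buffered."""
        if self.decision is None:
            self.decision = "reply"
            return self.buf
        return ""
```

The `flush()` step covers replies shorter than the buffer that never carried a tag; without it they would be silently dropped.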

M3 quick path — verification still owed

Run alongside a YouTube video for 5 minutes. The LLM should not fire on background dialogue. Inspect outputs/m3_eval.jsonl afterwards; expect dropped_self_echo for assistant playback and a few accepted for real user turns. Tune SELF_SPEECH_SIMILARITY_THRESHOLD if false positives slip through.
Sanity check on the barge-in placeholder — until M4 ships, talking over the assistant queues your follow-up rather than interrupting; that reads in the log as queued_pending → pending_fired.
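When reading the log, it helps to know how the dropped_self_echo, pending_fired, and pending_stale outcomes arise. A rough sketch under assumptions: the function and class names are hypothetical, and lowercasing before comparison is an assumed normalization. Only `difflib.SequenceMatcher`, the 0.75 threshold, and the 3.0 s age limit come from this document.

```python
import difflib
import time

SELF_SPEECH_SIMILARITY_THRESHOLD = 0.75  # default quoted above
PENDING_TURN_MAX_AGE_S = 3.0             # default quoted above


def is_self_echo(heard, last_assistant,
                 threshold=SELF_SPEECH_SIMILARITY_THRESHOLD):
    """True when the STT text looks like the assistant's own last reply."""
    if not last_assistant:
        return False
    ratio = difflib.SequenceMatcher(
        None, heard.lower().strip(), last_assistant.lower().strip()
    ).ratio()
    return ratio >= threshold


class PendingTurn:
    """Single-slot, last-write-wins queue for speech heard while busy."""

    def __init__(self, max_age_s=PENDING_TURN_MAX_AGE_S, clock=time.monotonic):
        self.max_age_s = max_age_s
        self._clock = clock
        self._slot = None  # (text, timestamp) or None

    def put(self, text):
        self._slot = (text, self._clock())  # overwrites any earlier utterance

    def take(self):
        """Return (text, outcome); call when the state returns to IDLE."""
        if self._slot is None:
            return None, None
        text, stamp = self._slot
        self._slot = None
        if self._clock() - stamp < self.max_age_s:
            return text, "pending_fired"
        return None, "pending_stale"
```

Raising the threshold lets more near-echoes through as turns; lowering it risks dropping a user who repeats the assistant's wording.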

M3.5 — Hybrid phrase/word STT node

Only do this once the quick path is verified and we hit a quality wall the filter+queue can't paper over (e.g. trailing-word loss on long sentences).

✅ Port always_listening_hybrid_phrase_word_pipeline.py to nodes/stt/continuous.py. Same node interface as STTTwoPassNode (publishes stt.text, has start/stop/set_paused/open_followup). Energy-based phrase segmentation; rolling re-transcription every cfg.STT_TRANSCRIBE_EVERY_S; phrase commits on cfg.STT_PHRASE_TIMEOUT_S of quiet. Single Whisper model (cfg.STT_CONTINUOUS_MODEL, default base.en).
✅ make_stt() in main.py — STT_MODE == "continuous" branch wired with the M3.5 tunables.
Verification still owed — flip STT_MODE = "continuous" in config.py, run python main.py, and compare:
- First-token latency vs. two_pass (rolling re-transcription should cut the wait at phrase end).
- False-positive rate on background dialogue (rerun the YouTube test against outputs/m3_eval.jsonl). M3.5 uses base.en by default; if accuracy lags, raise STT_CONTINUOUS_MODEL to small.en or medium.en.
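For readers new to the M3.5 node, energy-based phrase segmentation can be sketched as below. This is not the code in nodes/stt/continuous.py: the class name, the RMS threshold, and the frame length are hypothetical, and the rolling re-transcription step is omitted. Only the idea of committing a phrase after a quiet timeout (cfg.STT_PHRASE_TIMEOUT_S) comes from this document.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class PhraseSegmenter:
    """Collects frames into a phrase and commits it after enough quiet."""

    def __init__(self, energy_threshold=0.01, phrase_timeout_s=0.8,
                 frame_s=0.1):
        self.energy_threshold = energy_threshold  # hypothetical tunable
        # Number of consecutive quiet frames that ends a phrase.
        self._quiet_needed = max(1, math.ceil(phrase_timeout_s / frame_s))
        self._frames = []
        self._quiet_run = 0

    def push(self, frame):
        """Feed one frame; return the committed phrase (list of frames) or None."""
        if rms(frame) >= self.energy_threshold:
            self._frames.append(frame)
            self._quiet_run = 0
            return None
        if not self._frames:
            return None  # quiet before any speech: stay idle
        self._frames.append(frame)  # keep trailing quiet inside the phrase
        self._quiet_run += 1
        if self._quiet_run >= self._quiet_needed:
            phrase, self._frames, self._quiet_run = self._frames, [], 0
            return phrase
        return None
```

A pure energy gate never buffers silence as a phrase, but a noisy room can hold a phrase open indefinitely, so the threshold needs tuning per microphone.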

M4 — Barge-in

Talk over the assistant; it cuts off and listens.

1. Wire AEC: build a speexdsp wrapper (the old sketch was deleted as dead code 2026-06-10; references/voice_chat.py preserves the working experiment). Use bus.subscribe(("tts.audio_chunk",)) for the far-end reference (the bus supports multiple subscribers now) and run mic frames through AEC before they reach the VAD.
2. VAD on cleaned audio while state == RESPONDING: when VAD says speech for ≥150 ms, publish tts.cancel, call llm.cancel(), transition state = LISTENING. Add a 250 ms start-grace at the top of each TTS turn so the speaker click doesn't self-trigger.
3. Add tts.cancel topic (planned) to the bus contract; route it in the orchestrator's _dispatch. KokoroNode.cancel() already exists and does the right thing.
4. config.BARGE_IN_ENABLED and AEC_ENABLED exist as flags but nothing reads them yet; wire them when steps 1-3 land.
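The timing rule above (speech sustained for ≥150 ms, ignored during a 250 ms start-grace) could be sketched as a small debounce. Nothing like this exists in the repo yet: the class name, method names, and the 30 ms frame length are assumptions.

```python
class BargeInDetector:
    """Debounces per-frame VAD decisions into a single barge-in trigger."""

    def __init__(self, min_speech_ms=150, start_grace_ms=250, frame_ms=30):
        self.min_speech_ms = min_speech_ms
        self.start_grace_ms = start_grace_ms
        self.frame_ms = frame_ms  # hypothetical VAD frame length
        self._elapsed_ms = 0
        self._speech_ms = 0

    def start_tts_turn(self):
        """Call when TTS playback begins; restarts the grace window."""
        self._elapsed_ms = 0
        self._speech_ms = 0

    def push(self, is_speech):
        """Feed one VAD decision; return True when barge-in should fire."""
        self._elapsed_ms += self.frame_ms
        if self._elapsed_ms <= self.start_grace_ms:
            self._speech_ms = 0  # ignore the speaker click at turn start
            return False
        # Count only consecutive speech; any non-speech frame resets the run.
        self._speech_ms = self._speech_ms + self.frame_ms if is_speech else 0
        return self._speech_ms >= self.min_speech_ms
```

On a True result the orchestrator would publish tts.cancel and call llm.cancel() as described in the list; that wiring is left out here.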

M5 — Polish

Latency dashboard: metrics.csv is already being written; add a tiny live print of TTFT/first-audio per turn.
Voice picker (config.KOKORO_VOICE).
System-prompt presets.
Optional GUI (PySide6 demo exists in MockingAgent).

Known gotchas

transport/bus.py is pub/sub fanout (since 2026-06-10): each subscribe(topics) call gets its own queue, so M4's AEC can consume tts.audio_chunk without stealing messages from the orchestrator. publish() never blocks — full subscriber queues shed oldest-first. Invariant: raw mic audio never goes on the bus (it stays on MicStream.q / phrase_q / audio_q).
TTS publishes mic.pause before sd.play() returns. The orchestrator forwards it to STTTwoPassNode.set_paused() which calls MicStream.set_paused(). Check the exact ordering in nodes/tts/node.py:_play_loop before tightening barge-in timing — there's a tail_sleep_s = 0.12 to let speakers drain before un-pausing the mic.
Dead modules were deleted 2026-06-10 (nodes/audio_session/vad.py, aec.py, wakeword.py, audio_io.py — zero importers; git history has them). references/ remains historical material that main.py never imports.
webrtcvad vs webrtcvad-wheels: requirements.txt asks for -wheels (prebuilt). The old root requirements named bare webrtcvad which builds from source. Consistent now.
Gemma 4 in mlx-lm doesn't stop on <end_of_turn> by default — the tokenizer wrapper's API for additional EOS tokens varies by mlx-lm version. We solved it via an in-stream stop-marker detector in agent/adapters/mlx/backend.py:stream_chat() that buffers the last 32 chars of streamed text and stops when it sees <end_of_turn>, <eos>, or <|im_end|>. Without this fix, MLX runs to max_tokens and loops.
First-run latency: all four model loads happen at startup before the mic opens — Kokoro 1-line warm synth (~3-5 s), BackendBase.warm() 1-token gen (~1-2 s), both Whisper sizes plus a 1.5 s silence transcribe each (~1-2 s combined). First user turn then pays only inference time.
Whisper warm-up needs ≥1000 ms of audio; we use 1.5 s of silence in nodes/stt/two_pass.py:_warm_stt() and nodes/stt/continuous.py. Sub-1000-ms warm transcribes get rejected by Whisper with input is too short - ... ms and silently skip inference, defeating the whole point.
macOS mic permission: launching from VS Code's terminal sometimes inherits the editor's TCC grant, sometimes prompts. If MicStream silently captures zeros, that's the issue.
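The MLX stop-marker gotcha above is easier to follow with a sketch. This is an illustration of the holdback idea, not the code in agent/adapters/mlx/backend.py: the function name is hypothetical and it works on any iterable of text deltas. The three markers and the 32-character window come from this document.

```python
STOP_MARKERS = ("<end_of_turn>", "<eos>", "<|im_end|>")
HOLDBACK_CHARS = 32  # window size quoted above; longer than any marker


def stream_until_marker(deltas, markers=STOP_MARKERS, holdback=HOLDBACK_CHARS):
    """Yield streamed text, stopping just before the first stop marker.

    The newest `holdback` characters are withheld until later text proves
    they are not the start of a marker split across deltas.
    """
    tail = ""
    for delta in deltas:
        tail += delta
        hits = [i for i in (tail.find(m) for m in markers) if i != -1]
        if hits:
            cut = min(hits)
            if cut > 0:
                yield tail[:cut]  # text before the marker is still valid
            return  # stop generation here
        if len(tail) > holdback:
            yield tail[:-holdback]  # safe: too old to belong to a marker
            tail = tail[-holdback:]
    if tail:
        yield tail  # stream ended without a marker: release the holdback
```

The trade-off is that the last 32 characters reach TTS slightly late, which is the price of never speaking a partial marker aloud.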

Sanity-check commands

# Compile-check all M2 + M3.5 modules:
cd VoiceLLM
python -m py_compile config.py main.py \
  agent/llm/backend_base.py agent/adapters/mlx/backend.py \
  agent/adapters/llama_cpp/backend.py agent/llm/node.py \
  nodes/tts/node.py nodes/audio_session/mic_stream.py nodes/audio_session/chimes.py \
  nodes/stt/two_pass.py nodes/stt/continuous.py \
  agent/orchestrator.py transport/bus.py core/state.py core/metrics.py

# Confirm models exist:
python -c "import config as c; print('mlx:', c.MLX_PATH.exists(), 'gguf:', c.GGUF_PATH.exists())"

# Run end-to-end (loads ~5 GB into memory):
python main.py

If you're picking this up cold

Read this file.
Read 00_overview.md and 01_architecture.md.
Read references/voice_assistant.py — that is the canonical reference for every STT/TTS/LLM glue decision in M2.
Read 02_stt_pipelines.md before touching M3.
Run python smoke_test.py before and after any orchestrator/LLM/bus change — it's fast (no model loads) and covers the failure paths a live session hits.
