For a fresh LLM picking this up cold. Read this first, then 06_milestones.md for the milestone definitions and 01_architecture.md for the module/bus layout.
Last updated: 2026-06-10. M2 complete; M3 quick path shipped; M3.5
continuous STT in tree (opt-in via STT_MODE, still unverified live);
LLM-gated speech protocol shipped. Code layout is agent/ + nodes/ +
transport/ + core/ (JROS 0.5 structure). June hardening pass: bus is
real pub/sub fanout, LLM context-overflow guard + spoken error fallback,
metrics now measure the listen/STT phase, Whisper-hallucination ingress
filter, farewell detection, mlx stop-marker holdback, config validation,
13-scenario smoke test.
Modular, bus-driven local voice assistant on Apple Silicon. All code lives
at the repo root (this docs/ folder is a sibling of config.py,
main.py, agent/, nodes/, transport/, core/).
Demo/reference code that informed the design lives in references/
and the sibling MockingAgent/ repo. The local
references/voice_assistant.py is the proven Google-Home-style baseline we
ported from, and references/voice_chat.py preserves the full-duplex
AEC/barge-in experiment as carry-forward design material.
Stack: sounddevice + WebRTC VAD → pywhispercpp (whisper.cpp) → swappable
LLM (llama-cpp-python default or mlx-lm, both running Gemma 4
26B-A4B 4-bit) → Kokoro TTS. State + routing through a Pub/Sub Bus.
M2 status: complete. Modular voice loop runs end-to-end. Both LLM
backends supported; LLM_BACKEND = "llamacpp" is the default since
llama-cpp-python handles Gemma's <end_of_turn> natively. The MLX backend
now stops correctly via an in-stream marker detector
(agent/adapters/mlx/backend.py) — the old eot_token getattr was
a no-op and let Gemma run to max_tokens and loop.
M3 status (quick path): shipped. No wake word in the default mode.
The orchestrator runs a self-speech similarity filter, a single-slot
pending-turn queue, and the LLM gate — every reply must begin with
<ignore> or <reply>; ignored replies are suppressed from TTS without
audible cost. Every decision is logged to outputs/m3_eval.jsonl.
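
The gate described above can be sketched as a small buffer in front of TTS. This is a minimal illustration, not the code in agent/orchestrator.py: the `GateBuffer` class and its `feed`/`flush` methods are hypothetical names, and searching for the tag anywhere in the buffered head is an assumption. Only the two tags, the 30-char buffer, and the treat-as-reply fallback come from this document.

```python
from typing import Optional

IGNORE_TAG = "<ignore>"
REPLY_TAG = "<reply>"
LLM_GATE_BUFFER_CHARS = 30  # value quoted in this document


class GateBuffer:
    """Hypothetical helper: holds the head of a streamed reply until the gate tag is known."""

    def __init__(self, limit: int = LLM_GATE_BUFFER_CHARS) -> None:
        self.limit = limit
        self.buf = ""
        self.decision: Optional[str] = None  # None until decided, then "ignore" or "reply"

    def feed(self, delta: str) -> str:
        """Return the text to forward to TTS for this delta (possibly empty)."""
        if self.decision == "ignore":
            return ""
        if self.decision == "reply":
            return delta
        self.buf += delta
        if IGNORE_TAG in self.buf:
            self.decision = "ignore"
            return ""
        idx = self.buf.find(REPLY_TAG)
        if idx != -1:
            self.decision = "reply"
            return self.buf[idx + len(REPLY_TAG):]  # forward only the post-tag tail
        if len(self.buf) >= self.limit:
            self.decision = "reply"  # fallback: no tag within the buffer
            return self.buf
        return ""

    def flush(self) -> str:
        """Call at end of stream; releases an undecided short reply as a normal reply."""
        if self.decision is None:
            self.decision = "reply"
            return self.buf
        return ""
```

Feeding the deltas `"<rep"`, `"ly>Hel"`, `"lo"` forwards `"Hel"` then `"lo"`, while a stream that starts with `<ignore>` forwards nothing.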
M3.5 status: in tree, opt-in. nodes/stt/continuous.py
is a drop-in replacement for STTTwoPassNode with rolling re-transcription
and energy-based phrase segmentation. Activate via
STT_MODE = "continuous" in config.py. Default stays
"two_pass" so the proven baseline is unchanged.
Polish since M2: all four model loads (fast STT, accurate STT, LLM,
TTS) happen at startup with real warm passes — first user turn pays only
inference time, not setup. Conversation history is capped at
MAX_HISTORY_TURNS = 8 user/assistant pairs to keep prompt size bounded.
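
One way to apply the history cap, as a sketch only. It assumes history is a list of `{"role", "content"}` dicts with an optional system message first; the `trim_history` function is a hypothetical name and the real logic in agent/llm/node.py may differ.

```python
from typing import Dict, List

MAX_HISTORY_TURNS = 8  # user/assistant pairs, per this document


def trim_history(history: List[Dict[str, str]],
                 max_turns: int = MAX_HISTORY_TURNS) -> List[Dict[str, str]]:
    """Keep an optional leading system message plus the newest max_turns pairs."""
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    keep = rest[-2 * max_turns:] if max_turns > 0 else []
    # Never start the window on an orphaned assistant message.
    if keep and keep[0]["role"] == "assistant":
        keep = keep[1:]
    return system + keep
```

Dropping an orphaned leading assistant message keeps the prompt starting on a user turn when the newest user message has no reply yet.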
Continuation note: This repo is in active development (June 2026) as the operator's daily voice assistant and the JROS voice-pipeline testbed. Next planned work: M4 barge-in (now unblocked — the bus supports multiple subscribers), a metrics viewer, and a persona system. Persistent memory / LLM-callable tools / skills remain carry-forward ideas, labeled (planned) where mentioned.
cd VoiceLLM
python main.py
# REQUIRE_WAKE_WORD=False is the default — just talk.
# Flip back to True in config.py for the wake-word "okay eve" flow.
This reproduces references/voice_assistant.py's behavior, but every concern is now its own node communicating over the bus.
| File | Role |
|---|---|
| config.py | All tunables. Flip LLM_BACKEND between "llamacpp" (default) and "mlx"; flip STT_MODE between "two_pass" (default) and "continuous". |
| transport/bus.py | Pub/sub fanout: subscribe(topics) returns a per-subscriber queue; publish() never blocks (sheds oldest). Bus.get() polls the orchestrator's default catch-all subscription. Invariant: raw mic audio never goes on the bus. |
| core/state.py | SysState: IDLE/THINKING/RESPONDING. |
| core/metrics.py | Per-turn timing → metrics.csv. |
| nodes/audio_session/mic_stream.py | MicStream with paused flag (ported from voice_assistant.py:96-125). |
| nodes/audio_session/chimes.py | Wake and follow-up earcons, controlled by CHIMES_ENABLED, WAKE_CHIME_ENABLED, and FOLLOWUP_CHIME_ENABLED. |
| nodes/stt/two_pass.py | Two-pass cascade ported from voice_assistant.py: VAD worker, fast (base.en) + accurate (medium.en) eager-loaded with silence warm-up, wake-word + follow-up window. Publishes stt.text (dict payload with timing). |
| nodes/stt/continuous.py | M3.5 hybrid pipeline. Energy-based phrase segmentation, rolling re-transcription. Drop-in interface match for STTTwoPassNode. Unverified live; also re-transcribes silence continuously in quiet rooms (known inefficiency). |
| agent/llm/backend_base.py | BackendBase ABC: load, warm, stream_chat, cancel. |
| agent/adapters/mlx/backend.py | mlx-lm impl. In-stream stop-marker detector for <end_of_turn> / <eos> / <im_end> (replaces the old broken eot_token getattr). |
| agent/adapters/llama_cpp/backend.py | llama-cpp-python impl. Default backend; chat completion handles Gemma's stop tokens natively. |
| agent/llm/node.py | Owns history; streams llm.token deltas; publishes cleaned reply on llm.done. Caps history at MAX_HISTORY_TURNS = 8 pairs. clean_for_tts() ported. |
| nodes/tts/node.py | Real KPipeline. Synth thread + play thread. Sentence-streams. Cancellable. Publishes mic.pause, tts.audio_chunk, tts.done. |
| agent/orchestrator.py | Single bus consumer; state machine; spawns LLM thread per turn; LLM-gate token buffer (<ignore> suppresses TTS, <reply> forwards the tail). |
| main.py | make_backend() + make_stt() factory funcs, then Orchestrator(...).run(). |
- stt.text (dict): committed user phrase, post-wake-word: {"text": str, "t_speech_start", "t_last_voice", "t_commit", "t_stt_done"} (perf_counter timestamps; continuous mode omits the first two). Plain-str payloads still accepted (pending-turn refires).
- llm.token (str): streaming reply delta.
- llm.done (str): full cleaned reply, fired after the stream ends.
- llm.error (str): generation failed with no usable output; the orchestrator speaks a short fallback. LLMNode already rolled back the user message.
- mic.pause (bool): TTS toggles this around playback (TTS also calls stt.set_paused directly so the pause is synchronous; the bus message feeds the orchestrator's tts_start metric).
- tts.audio_chunk (np.float32): published before sd.play(); nobody consumes it yet (the subscriber is M4 (planned), as the AEC reference). Use bus.subscribe(("tts.audio_chunk",)) to consume without stealing from the orchestrator.
- tts.done (None): TTS audio queue drained.
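
The fanout behavior behind these topics can be sketched as follows. This is a hypothetical stand-in, not the contents of transport/bus.py: the `FanoutBus` class, the `maxsize` default, and the `(topic, payload)` tuple shape are assumptions. What it illustrates from this document is one queue per `subscribe()` call and a `publish()` that never blocks because full queues shed their oldest message.

```python
import queue
import threading
from typing import Any, Iterable, List, Optional, Tuple


class FanoutBus:
    """Hypothetical stand-in for transport/bus.py; not the real implementation."""

    def __init__(self, maxsize: int = 64) -> None:
        self._maxsize = maxsize
        self._lock = threading.Lock()
        self._subs: List[Tuple[Optional[frozenset], queue.Queue]] = []

    def subscribe(self, topics: Optional[Iterable[str]] = None) -> queue.Queue:
        """Return a private queue; topics=None means catch-all."""
        q: queue.Queue = queue.Queue(maxsize=self._maxsize)
        wanted = None if topics is None else frozenset(topics)
        with self._lock:
            self._subs.append((wanted, q))
        return q

    def publish(self, topic: str, payload: Any = None) -> None:
        """Deliver to every matching subscriber without blocking."""
        with self._lock:
            subs = list(self._subs)
        for wanted, q in subs:
            if wanted is not None and topic not in wanted:
                continue
            while True:
                try:
                    q.put_nowait((topic, payload))
                    break
                except queue.Full:
                    try:
                        q.get_nowait()  # shed the oldest message, then retry
                    except queue.Empty:
                        pass
```

Because each subscriber owns its queue, a second consumer of `tts.audio_chunk` sees every chunk without taking anything away from the catch-all subscription.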
LMSTUDIO_MODELS = ~/.lmstudio/models/
MLX_PATH = LMSTUDIO_MODELS/mlx-community/gemma-4-26b-a4b-4bit/
GGUF_PATH = LMSTUDIO_MODELS/lmstudio-community/gemma-4-26B-A4B-it-GGUF/
gemma-4-26B-A4B-it-Q4_K_M.gguf
STT: base.en (fast) and medium.en (accurate) — both eager-loaded at
startup with a 1.5 s silence warm pass, so first user turn pays only
inference cost, not setup. Old behavior lazy-loaded medium.en on the
first wake match (a hangover from M2's wake-word path); pointless now that
M3 fires the accurate model on every phrase.
GITHUB/
├── MockingAgent/ # working Google-Home-style baseline
│ ├── voice_assistant.py # the canonical reference for STT/TTS plumbing
│ ├── ollamacpp/ # chat_mlx.py, chat_llama.py, bench.py
│ ├── kokoro_tts/ # standalone Kokoro experiments
│ ├── PywisperCpp/ # all the always-listening STT demos
│ └── legacy_voicellm_drafts/ # old loose demos that used to live in VoiceLLM/
│
└── VoiceLLM/ # ← THE CODE (flat at the repo root)
├── config.py
├── main.py
├── smoke_test.py # 13 fast scenarios, no model loads
├── agent/ # orchestrator + llm node + backend adapters
├── nodes/ # audio_session/ (mic, chimes) + stt/ + tts/
├── transport/ # bus.py (in-process pub/sub fanout)
├── core/ # metrics.py + state.py
├── references/ # local copies of pasted reference scripts
├── docs/ # ← these planning docs
├── outputs/ # m3_eval.jsonl etc.
├── requirements.txt
├── metrics.csv # auto-written by MetricsLog
├── LICENSE
└── README.md
The sections below are preserved as design notes and possible implementation recipes. They are not an active roadmap for this repository unless development is explicitly resumed here.
Drop the wake word. STT runs always-on; every committed phrase becomes a turn unless we filter it out. This is the "ChatGPT Voice" feel.
We split this into a quick path (lean on the existing two-pass STT) and M3.5 (build the hybrid pipeline node for lower latency / better feel).
- ✅ REQUIRE_WAKE_WORD = False in config.py. The existing STTTwoPassNode already has the no-wake-word branch (nodes/stt/two_pass.py): every phrase becomes a turn.
- ✅ Self-speech similarity filter on stt.text ingress in the orchestrator. Compares incoming text against the most recent assistant turn from LLMNode.history_snapshot() via difflib.SequenceMatcher; drops if >= cfg.SELF_SPEECH_SIMILARITY_THRESHOLD (default 0.75).
- ✅ Pending-turn queue replaces the old "drop while busy" placeholder. Single-slot, last-write-wins; fires when _on_tts_done returns to IDLE, provided the queued utterance is younger than cfg.PENDING_TURN_MAX_AGE_S (default 3.0 s).
- ✅ LLM gate: the user explicitly didn't want the audio pipeline deciding what's directed speech, so the LLM does. Every reply must begin with <ignore> or <reply> (instruction in config.py:SYSTEM_PROMPT). The orchestrator buffers the first LLM_GATE_BUFFER_CHARS = 30 chars of the streaming reply, decides based on the tag, and either:
  - <ignore> → suppresses TTS, transitions straight to IDLE, logs llm_ignored to the eval JSONL.
  - <reply> → forwards the post-tag tail to TTS as normal.

  Falls back to "treat as reply" if the tag never appears within the buffer. See _on_llm_token and _gate_check in agent/orchestrator.py.
- ✅ Eval logging: every STT decision (accepted / dropped_self_echo / queued_pending / pending_fired / pending_stale / llm_ignored) is appended to outputs/m3_eval.jsonl for offline review. Disable by setting cfg.M3_EVAL_LOG = None.
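
The self-speech filter in the list above comes down to a single `difflib.SequenceMatcher` ratio check. A minimal sketch, assuming the text is lowercased and stripped of punctuation before comparison (that normalization and the `is_self_echo` name are assumptions, not confirmed by the orchestrator code):

```python
import re
from difflib import SequenceMatcher

SELF_SPEECH_SIMILARITY_THRESHOLD = 0.75  # default quoted in this document


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed preprocessing)."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


def is_self_echo(heard: str, last_assistant: str,
                 threshold: float = SELF_SPEECH_SIMILARITY_THRESHOLD) -> bool:
    """True when the transcribed phrase looks like the assistant's own last reply."""
    if not heard or not last_assistant:
        return False
    ratio = SequenceMatcher(None, _normalize(heard), _normalize(last_assistant)).ratio()
    return ratio >= threshold
```

A phrase that matches the last assistant reply up to punctuation and case scores 1.0 and is dropped; an unrelated short question stays well under the threshold.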
- Run alongside a YouTube video for 5 minutes. The LLM should not fire on background dialogue. Inspect outputs/m3_eval.jsonl afterwards; expect dropped_self_echo for assistant playback and a few accepted for real user turns. Tune SELF_SPEECH_SIMILARITY_THRESHOLD if false positives slip through.
- Sanity check on the barge-in placeholder: until M4 ships, talking over the assistant queues your follow-up rather than interrupting; that reads in the log as queued_pending → pending_fired.
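
The queued_pending → pending_fired (or pending_stale) path can be sketched as a single-slot holder with an age check. The `PendingTurn` class and its injectable clock are hypothetical; the single slot, last-write-wins behavior, and 3.0 s default come from this document. The sketch is not thread-safe on its own.

```python
import time
from typing import Callable, Optional, Tuple

PENDING_TURN_MAX_AGE_S = 3.0  # default quoted in this document


class PendingTurn:
    """Hypothetical single-slot queue: the newest utterance replaces any older one."""

    def __init__(self, max_age_s: float = PENDING_TURN_MAX_AGE_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_s = max_age_s
        self._clock = clock
        self._slot: Optional[Tuple[str, float]] = None

    def put(self, text: str) -> None:
        """Store the utterance with its arrival time (last write wins)."""
        self._slot = (text, self._clock())

    def take(self) -> Optional[str]:
        """Empty the slot; return the text only if it is still fresh."""
        slot, self._slot = self._slot, None
        if slot is None:
            return None
        text, queued_at = slot
        if self._clock() - queued_at >= self.max_age_s:
            return None  # stale: would be logged as pending_stale
        return text
```

Calling `take()` when the orchestrator returns to IDLE yields at most one utterance, and only if it was queued within the age limit.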
Only do this once the quick path is verified and we hit a quality wall the filter+queue can't paper over (e.g. trailing-word loss on long sentences).
- ✅ Port always_listening_hybrid_phrase_word_pipeline.py to nodes/stt/continuous.py. Same node interface as STTTwoPassNode (publishes stt.text, has start / stop / set_paused / open_followup). Energy-based phrase segmentation; rolling re-transcription every cfg.STT_TRANSCRIBE_EVERY_S; phrase commits on cfg.STT_PHRASE_TIMEOUT_S of quiet. Single Whisper model (cfg.STT_CONTINUOUS_MODEL, default base.en).
- ✅ make_stt() in main.py: STT_MODE == "continuous" branch wired with the M3.5 tunables.
- Verification still owed: flip STT_MODE = "continuous" in config.py, run python main.py, and compare:
  - First-token latency vs. two_pass (rolling re-transcription should cut the wait at phrase end).
  - False-positive rate on background dialogue (rerun the YouTube test against outputs/m3_eval.jsonl).

  M3.5 uses base.en by default; if accuracy lags, raise STT_CONTINUOUS_MODEL to small.en or medium.en.
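
Energy-based phrase segmentation, as named in the list above, might look roughly like this. It is a sketch under assumptions: the `PhraseSegmenter` class, the RMS measure, and the `energy_threshold` value are illustrative and not taken from nodes/stt/continuous.py, and the rolling re-transcription step is left out entirely. Only the idea of committing a phrase after a quiet timeout comes from this document.

```python
import math
from typing import List, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class PhraseSegmenter:
    """Hypothetical segmenter: commits a phrase after enough consecutive quiet frames."""

    def __init__(self, frame_s: float, phrase_timeout_s: float,
                 energy_threshold: float = 0.01) -> None:
        self.energy_threshold = energy_threshold  # assumed value, tune per mic
        self._quiet_needed = max(1, math.ceil(phrase_timeout_s / frame_s))
        self._quiet_frames = 0
        self._phrase: List[float] = []

    def push(self, frame: Sequence[float]) -> Optional[List[float]]:
        """Feed one frame; return the phrase samples on commit, else None."""
        if rms(frame) >= self.energy_threshold:
            self._phrase.extend(frame)
            self._quiet_frames = 0
            return None
        if not self._phrase:
            return None  # leading silence: nothing to commit yet
        self._phrase.extend(frame)  # keep trailing quiet so word endings survive
        self._quiet_frames += 1
        if self._quiet_frames >= self._quiet_needed:
            phrase, self._phrase = self._phrase, []
            self._quiet_frames = 0
            return phrase
        return None
```

In this sketch `phrase_timeout_s` plays the role of cfg.STT_PHRASE_TIMEOUT_S; a committed phrase would then be handed to Whisper and published as stt.text.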
Talk over the assistant; it cuts off and listens.
- Wire AEC: build a speexdsp wrapper (the old sketch was deleted as dead code 2026-06-10; references/voice_chat.py preserves the working experiment). Use bus.subscribe(("tts.audio_chunk",)) for the far-end reference (the bus supports multiple subscribers now) and run mic frames through AEC before they reach the VAD.
- VAD on cleaned audio while state == RESPONDING: when VAD says speech for ≥150 ms, publish tts.cancel, call llm.cancel(), transition state = LISTENING. Add a 250 ms start-grace at the top of each TTS turn so the speaker click doesn't self-trigger.
- Add tts.cancel topic (planned) to the bus contract; route it in the orchestrator's _dispatch. KokoroNode.cancel() already exists and does the right thing.
- config.BARGE_IN_ENABLED and AEC_ENABLED exist as flags but nothing reads them yet; wire them once the three items above land.
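
The VAD rule in the list above (≥150 ms of continuous speech, ignored during a 250 ms start-grace) can be sketched as a small debounce. The `BargeInDetector` class is hypothetical and the 30 ms frame length is an assumption; since M4 is planned work, nothing like this exists in the tree yet.

```python
class BargeInDetector:
    """Hypothetical debounce: fires once speech has persisted long enough after the grace window."""

    def __init__(self, frame_ms: int = 30, min_speech_ms: int = 150,
                 start_grace_ms: int = 250) -> None:
        self.frame_ms = frame_ms              # assumed VAD frame length
        self.min_speech_ms = min_speech_ms    # from this document
        self.start_grace_ms = start_grace_ms  # from this document
        self._elapsed_ms = 0
        self._speech_ms = 0

    def start_turn(self) -> None:
        """Call at the top of each TTS turn to restart the grace window."""
        self._elapsed_ms = 0
        self._speech_ms = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD verdict; return True when barge-in should trigger."""
        self._elapsed_ms += self.frame_ms
        if self._elapsed_ms <= self.start_grace_ms:
            self._speech_ms = 0  # ignore speaker click / onset during grace
            return False
        self._speech_ms = self._speech_ms + self.frame_ms if is_speech else 0
        return self._speech_ms >= self.min_speech_ms
```

On a True result the caller would publish tts.cancel and call llm.cancel(), as the list describes; a single non-speech frame resets the run so short noises do not interrupt playback.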
- Latency dashboard: metrics.csv is already being written; add a tiny live print of TTFT/first-audio per turn.
- Voice picker (config.KOKORO_VOICE).
- System-prompt presets.
- Optional GUI (PySide6 demo exists in MockingAgent).
- transport/bus.py is pub/sub fanout (since 2026-06-10): each subscribe(topics) call gets its own queue, so M4's AEC can consume tts.audio_chunk without stealing messages from the orchestrator. publish() never blocks; full subscriber queues shed oldest-first. Invariant: raw mic audio never goes on the bus (it stays on MicStream.q / phrase_q / audio_q).
- TTS publishes mic.pause before sd.play() returns. The orchestrator forwards it to STTTwoPassNode.set_paused() which calls MicStream.set_paused(). Check the exact ordering in nodes/tts/node.py:_play_loop before tightening barge-in timing; there's a tail_sleep_s = 0.12 to let speakers drain before un-pausing the mic.
- Dead modules were deleted 2026-06-10 (nodes/audio_session/vad.py, aec.py, wakeword.py, audio_io.py; zero importers; git history has them). references/ remains historical material that main.py never imports.
- webrtcvad vs webrtcvad-wheels: requirements.txt asks for -wheels (prebuilt). The old root requirements named bare webrtcvad which builds from source. Consistent now.
- Gemma 4 in mlx-lm doesn't stop on <end_of_turn> by default: the tokenizer wrapper's API for additional EOS tokens varies by mlx-lm version. We solved it via an in-stream stop-marker detector in agent/adapters/mlx/backend.py:stream_chat() that buffers the last 32 chars of streamed text and stops when it sees <end_of_turn>, <eos>, or <|im_end|>. Without this fix, MLX runs to max_tokens and loops.
- First-run latency: all four model loads happen at startup before the mic opens: Kokoro 1-line warm synth (~3-5 s), BackendBase.warm() 1-token gen (~1-2 s), both Whisper sizes plus a 1.5 s silence transcribe each (~1-2 s combined). First user turn then pays only inference time.
- Whisper warm-up needs ≥1000 ms of audio; we use 1.5 s of silence in nodes/stt/two_pass.py:_warm_stt() and nodes/stt/continuous.py. Sub-1000-ms warm transcribes get rejected by Whisper with "input is too short - ... ms" and silently skip inference, defeating the whole point.
- macOS mic permission: launching from VS Code's terminal sometimes inherits the editor's TCC grant, sometimes prompts. If MicStream silently captures zeros, that's the issue.
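
The in-stream stop-marker detector mentioned in the Gemma/mlx-lm note can be sketched as a generator that holds back the last 32 characters, so a marker split across deltas is never partly emitted. This is an illustration, not the code in agent/adapters/mlx/backend.py: the `stream_until_marker` function is a hypothetical name, and the exact holdback behavior is an assumption based on the "buffers the last 32 chars" description.

```python
from typing import Iterable, Iterator, Sequence

STOP_MARKERS = ("<end_of_turn>", "<eos>", "<|im_end|>")  # markers listed in this document
HOLDBACK_CHARS = 32  # must be at least the longest marker length minus one


def stream_until_marker(deltas: Iterable[str],
                        markers: Sequence[str] = STOP_MARKERS,
                        holdback: int = HOLDBACK_CHARS) -> Iterator[str]:
    """Yield streamed text, stopping just before the first stop marker."""
    tail = ""
    for delta in deltas:
        tail += delta
        hits = [i for i in (tail.find(m) for m in markers) if i != -1]
        if hits:
            cut = min(hits)
            if cut > 0:
                yield tail[:cut]  # emit the text that precedes the marker
            return                # stop consuming the stream
        if len(tail) > holdback:
            yield tail[:-holdback]     # safe: cannot contain a marker start that matters
            tail = tail[-holdback:]    # keep the window that may hold a partial marker
    if tail:
        yield tail  # stream ended without a marker: release the held text
```

Returning from the generator stops iteration over the upstream token stream, which is what prevents a run to max_tokens in this sketch.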
# Compile-check all M2 + M3.5 modules:
cd VoiceLLM
python -m py_compile config.py main.py \
agent/llm/backend_base.py agent/adapters/mlx/backend.py \
agent/adapters/llama_cpp/backend.py agent/llm/node.py \
nodes/tts/node.py nodes/audio_session/mic_stream.py nodes/audio_session/chimes.py \
nodes/stt/two_pass.py nodes/stt/continuous.py \
agent/orchestrator.py transport/bus.py core/state.py core/metrics.py
# Confirm models exist:
python -c "import config as c; print('mlx:', c.MLX_PATH.exists(), 'gguf:', c.GGUF_PATH.exists())"
# Run end-to-end (loads ~5 GB into memory):
python main.py

- Read this file.
- Read 00_overview.md and 01_architecture.md.
- Read references/voice_assistant.py — that is the canonical reference for every STT/TTS/LLM glue decision in M2.
- Read 02_stt_pipelines.md before touching M3.
- Run python smoke_test.py before and after any orchestrator/LLM/bus change; it's fast (no model loads) and covers the failure paths a live session hits.