VoiceLLM

A continuously-listening, low-latency local voice assistant for Apple Silicon. Speak naturally; the assistant streams a spoken reply back; the conversation keeps going without a wake word. Think ChatGPT Voice, fully offline.

Status: current voice-loop milestone complete. The modular local assistant, plugin-style STT/TTS/LLM organization, continuous hearing, LLM-gated speech, self-speech rejection, streaming Kokoro TTS, and chimes are in place. Further advanced work is unlikely to continue here; the likely path is carrying ideas such as full-duplex AEC/barge-in, persistent memory, and LLM-callable tools into newer AgenticLLM-style frameworks. See docs/STATUS.md for the handoff notes.

What it does today

Continuous hearing. No wake word in the default mode — every directed utterance becomes a turn.
LLM-gated speech. Every reply begins with <ignore> or <reply>; the orchestrator suppresses TTS when the LLM judges the input as not addressed to it (background TV, keystroke noise, transcription artifacts, ambient conversation). The audio pipeline does not gatekeep — the LLM does.
Two interchangeable LLM backends. Both run the same model (Gemma 4 26B-A4B 4-bit): llama.cpp (default, robust chat template handling) and mlx-lm (Apple MLX, faster on M-series). Switch between them by changing LLM_BACKEND in config.py.
Two interchangeable STT pipelines. two_pass (proven Google-Home-style fast→accurate Whisper cascade) and continuous (rolling re-transcription hybrid). The STT_MODE flag in config.py selects between them.
Self-speech rejection. Mic-pause during TTS plus a similarity filter against the most recent reply. Full-duplex AEC/barge-in is preserved as carry-forward design work rather than expected development in this repo.
Friendly chimes. A short wake chime acknowledges wake-only prompts, and a double chime marks that the assistant is ready for a follow-up.
Streaming TTS via Kokoro. Audio starts at the first sentence boundary, not after the full reply.
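
As a rough illustration of the LLM gate described in the list above, the sketch below splits a reply on its leading tag and decides whether TTS should speak. The names `GateDecision` and `parse_gated_reply` are hypothetical, and treating an untagged reply as ignored is an assumption made for this example, not necessarily what the orchestrator does.

```python
from typing import NamedTuple


class GateDecision(NamedTuple):
    speak: bool  # True when the reply should be handed to TTS
    text: str    # reply text with the gate tag removed


def parse_gated_reply(raw: str) -> GateDecision:
    """Split an LLM reply into a speak/ignore decision and its text.

    Assumption: a reply without a recognised leading tag is treated as
    ignored, so the assistant stays silent if the model breaks format.
    """
    stripped = raw.lstrip()
    if stripped.startswith("<reply>"):
        return GateDecision(True, stripped[len("<reply>"):].strip())
    # "<ignore>" and anything untagged both suppress speech.
    return GateDecision(False, "")


if __name__ == "__main__":
    print(parse_gated_reply("<reply> It is sunny today."))
    print(parse_gated_reply("<ignore>"))
```

With a streaming backend the same check could run as soon as the first few tokens arrive, so that TTS is never started for an ignored turn.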

Stack

Layer	Choice
Audio I/O	`sounddevice` (CoreAudio)
VAD	`webrtcvad`
STT	`pywhispercpp` (whisper.cpp), `base.en` + `medium.en`
LLM	`llama-cpp-python` (default) or `mlx-lm`, both running Gemma 4 26B-A4B 4-bit
TTS	`kokoro` (`KPipeline`)

State and routing run through a single in-process pub/sub Bus consumed by the runner in agent/orchestrator.py (IDLE → THINKING → RESPONDING → IDLE).
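
To make the bus-plus-state-machine idea concrete, here is a minimal sketch of an in-process pub/sub bus driving the IDLE → THINKING → RESPONDING → IDLE cycle. The topic names (`stt.final`, `llm.first_token`, `tts.done`) and the `Bus`/`Runner` method names are assumptions for illustration; the real interfaces live in transport/bus.py and agent/orchestrator.py and may differ.

```python
from collections import defaultdict
from enum import Enum, auto
from typing import Any, Callable, DefaultDict, List


class State(Enum):
    IDLE = auto()
    THINKING = auto()
    RESPONDING = auto()


class Bus:
    """Minimal synchronous in-process pub/sub bus."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload: Any = None) -> None:
        # Copy the list so handlers may subscribe while we iterate.
        for handler in list(self._subs[topic]):
            handler(payload)


class Runner:
    """Moves through IDLE -> THINKING -> RESPONDING -> IDLE on bus events."""

    def __init__(self, bus: Bus) -> None:
        self.state = State.IDLE
        self.history: List[State] = [self.state]
        # Hypothetical topic names, chosen only for this sketch.
        bus.subscribe("stt.final", self._on_transcript)
        bus.subscribe("llm.first_token", self._on_first_token)
        bus.subscribe("tts.done", self._on_tts_done)

    def _move(self, expected: State, new: State) -> None:
        # Events that arrive in the wrong state are dropped.
        if self.state is expected:
            self.state = new
            self.history.append(new)

    def _on_transcript(self, _payload: Any) -> None:
        self._move(State.IDLE, State.THINKING)

    def _on_first_token(self, _payload: Any) -> None:
        self._move(State.THINKING, State.RESPONDING)

    def _on_tts_done(self, _payload: Any) -> None:
        self._move(State.RESPONDING, State.IDLE)


if __name__ == "__main__":
    bus = Bus()
    runner = Runner(bus)
    for topic in ("stt.final", "llm.first_token", "tts.done"):
        bus.publish(topic)
    print([s.name for s in runner.history])
```

Keeping the transition check in one place (`_move`) means a stray event, such as a late `tts.done`, cannot push the runner into an inconsistent state.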

Current status

I would call this project complete for the current local voice assistant milestone. It is not intended to become the main open-ended assistant platform; advanced continuation work is expected to move into newer frameworks such as AgenticLLM.

Done:

Runnable local voice loop through python main.py.
Plugin-style layout for STT, TTS, and LLM integrations.
Runner classification for the orchestrator.
Continuous-hearing default with optional wake-word mode.
LLM gate for ignoring ambient/non-directed speech.
Mic-pause plus similarity filtering for self-speech rejection.
Streaming Kokoro TTS and configurable wake/follow-up chimes.
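
The similarity half of self-speech rejection mentioned in the list above could look roughly like the following. This is a sketch using the standard library's `difflib`, not the project's actual filter; the function name `is_self_speech`, the 0.8 threshold, and the three-word minimum for fragment matching are all assumptions.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    return " ".join(cleaned.split())


def is_self_speech(transcript: str, last_reply: str, threshold: float = 0.8) -> bool:
    """Return True when a transcript looks like an echo of the last reply.

    Assumptions: an echo is either a fragment of at least three words
    taken from the reply, or a near-identical copy of the whole reply.
    """
    heard = _normalize(transcript)
    spoken = _normalize(last_reply)
    if not heard or not spoken:
        return False
    # The mic often catches only part of the reply, so check fragments first.
    if len(heard.split()) >= 3 and heard in spoken:
        return True
    return SequenceMatcher(None, heard, spoken).ratio() >= threshold


if __name__ == "__main__":
    reply = "The weather is sunny today!"
    print(is_self_speech("the weather is sunny today", reply))
    print(is_self_speech("what time is it", reply))
```

The three-word minimum is there so that a short, genuine user utterance such as "today" is not discarded just because the word also appeared in the previous reply.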

Carry-forward ideas:

True full-duplex AEC/barge-in.
Persistent memory beyond in-session conversation history.
LLM-callable tools or skills.
GUI or device/voice picker.

Quick start

Apple Silicon Mac (M1/M2/M3/M4), 24 GB+ unified memory recommended for the 26B 4-bit model.

git clone https://github.com/<your-fork>/VoiceLLM.git
cd VoiceLLM
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Place the Gemma 4 26B-A4B model files (or symlink them) at the locations in config.py:

~/.lmstudio/models/lmstudio-community/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-Q4_K_M.gguf
~/.lmstudio/models/mlx-community/gemma-4-26b-a4b-4bit/

Then run:

python main.py

The first run loads ~5 GB into unified memory (LLM, both Whisper sizes, Kokoro). Once you see `[ready] VoiceLLM running. Ctrl-C to quit.`, start talking; no wake word is needed.

Configuration

All tunables live in config.py. The flags you'll touch most:

Flag	Default	What it does
`LLM_BACKEND`	`"llamacpp"`	`"llamacpp"` (proven) or `"mlx"` (faster on M-series)
`STT_MODE`	`"two_pass"`	`"two_pass"` (default) or `"continuous"` (M3.5 rolling)
`REQUIRE_WAKE_WORD`	`False`	`True` reverts to "okay eve" gating
`LLM_TEMPERATURE`	`0.6`	LLM sampling temperature
`MAX_HISTORY_TURNS`	`8`	rolling user/assistant pair cap
`KOKORO_VOICE`	`"af_heart"`	Kokoro voice id
`CHIMES_ENABLED`	`True`	master switch for wake/follow-up audio cues

See config.py for the full list (VAD aggressiveness, phrase timeouts, energy thresholds, etc.).
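
For orientation, the flags in the table might appear in config.py along these lines. The names and defaults are taken from the table; the `check_config` and `trim_history` helpers are hypothetical additions that show how the values could be validated and how a pair-based history cap could work, assuming history is a flat list alternating user and assistant messages.

```python
# Flag names and defaults as listed in the configuration table.
LLM_BACKEND = "llamacpp"    # or "mlx"
STT_MODE = "two_pass"       # or "continuous"
REQUIRE_WAKE_WORD = False   # True re-enables wake-word gating
LLM_TEMPERATURE = 0.6
MAX_HISTORY_TURNS = 8       # cap on user/assistant pairs
KOKORO_VOICE = "af_heart"
CHIMES_ENABLED = True


def check_config() -> None:
    """Fail fast on a mistyped flag (illustrative helper, not in the repo)."""
    if LLM_BACKEND not in {"llamacpp", "mlx"}:
        raise ValueError(f"unknown LLM_BACKEND: {LLM_BACKEND!r}")
    if STT_MODE not in {"two_pass", "continuous"}:
        raise ValueError(f"unknown STT_MODE: {STT_MODE!r}")


def trim_history(history: list, max_turns: int = MAX_HISTORY_TURNS) -> list:
    """Keep only the most recent user/assistant pairs.

    Assumption: history is a flat list with two messages per turn.
    """
    if max_turns <= 0:
        return []
    return history[-2 * max_turns:]


if __name__ == "__main__":
    check_config()
    messages = [f"msg{i}" for i in range(20)]
    print(len(trim_history(messages)))
```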

Project layout

VoiceLLM/
├── config.py                  # all tunables
├── main.py                    # wire bus + nodes, start orchestrator
│
├── agent/                     # THE CONSCIOUS SIDE — LLM + orchestrator
│   ├── orchestrator.py        # STT → LLM → TTS state machine
│   ├── llm/                   # LLM bus node + BackendBase
│   │   ├── node.py
│   │   └── backend_base.py
│   └── adapters/              # model backends (selected via config.LLM_BACKEND)
│       ├── llama_cpp/         # GGUF backend
│       └── mlx/               # Apple-Silicon-native MLX backend
│
├── nodes/                     # PERIPHERAL NODES — STT, TTS, mic
│   ├── audio_session/         # MicStream + VAD + AEC + chimes
│   ├── stt/                   # whisper.cpp continuous + two_pass
│   └── tts/                   # Kokoro playback node
│
├── transport/                 # the bus — pub/sub between agent + nodes
│   └── bus.py
│
├── core/                      # SHARED INFRASTRUCTURE
│   ├── state.py               # process state
│   └── metrics.py             # per-turn timing
│
├── ASSETS/                    # chimes, packaged audio
├── references/                # pasted historical scripts, not imported
├── docs/                      # architecture / milestones / status
├── outputs/                   # m3_eval.jsonl (runtime decision log)
└── metrics.csv                # per-turn timing log

Vocabulary-wise (aligned with JROS 0.5):

agent/ is the cognitive side — the LLM and its orchestration loop.
nodes/ is the peripheral side — STT, TTS, mic + AEC.
transport/ is the bus connecting them.
core/ is what both sides share (state, metrics).

The split mirrors JROS exactly so lessons here merge cleanly into the full framework.

Documentation

docs/STATUS.md — completion state and handoff notes.
docs/00_overview.md — design intent.
docs/01_architecture.md — module/bus layout.
docs/02_stt_pipelines.md — STT strategies.
docs/03_llm_backends.md — backend selection.
docs/04_tts_kokoro.md — Kokoro coordination.
docs/05_barge_in_and_self_speech.md — barge-in + self-speech.
docs/06_milestones.md — build order M0–M5.
docs/07_open_questions.md — design questions.
docs/08_vocabulary_contract.md — component naming and future plugin/skill layout.

Acknowledgments

Gemma 4 — Google's open-weight model family (26B-A4B 4-bit at the heart of this).
Kokoro — natural TTS that runs in real-time on M-series.
whisper.cpp via pywhispercpp — fastest local Whisper on Apple Silicon.
mlx-lm — Apple MLX inference for Gemma.
llama.cpp / llama-cpp-python — the cross-platform fallback backend.

License

See LICENSE.
