eaRS is a Rust-based streaming speech-to-text stack built on Kyutai's models. The tool is now delivered as a single CLI with two responsibilities:
- Server management:
ears server start|stoplaunches and controls the inference backend. - Client capture: Running
earswithout subcommands streams microphone audio to the server and prints live transcripts.
cargo build --release
cargo build --release --features metal # For Apple silicon
cargo build --release --features cuda # For NVIDIA GPUAll binaries are emitted into ./target/release/.
cargo install --path .
cargo install --path . --features metal # For Apple silicon
cargo install --path . --features cuda # For NVIDIA GPU# 1. Start the transcription server (runs in the background)
./target/release/ears server start
# 2. Stream your microphone to the server and print live text
./target/release/earsPress Ctrl+C in the client to stop streaming. When you are done with the backend:
./target/release/ears server stop./target/release/ears server start \
[--bind 0.0.0.0:8765] \
[--hf-repo kyutai/stt-1b-en_fr-candle] \
[--cpu] \
[--timestamps] \
[--vad] \
[--whisper] # requires --features whisper
--bind: Override the default bind address (0.0.0.0:<port-from-config>).--hf-repo: Choose a different Kyutai Speech repo hosted on Hugging Face.--cpu: Force CPU execution (otherwise CUDA/Metal is used when available).--timestamps: Include word timestamps in the server stream.--vad: Enable voice-activity detection for automatic sentence segmentation.--whisper: Force-enable Whisper post-processing (only when compiled with thewhisperfeature).
The server writes a PID file to $XDG_STATE_HOME/ears/server.pid (or ~/.local/state/ears/server.pid) so subsequent start commands will refuse to launch if an instance is already running. ears server stop sends a SIGTERM to the stored PID and removes the PID file; stale files are cleaned up automatically.
./target/release/ears [--device 1] [--server ws://host:port/] [--timestamps] [--list-devices]
--list-devices: Print available input devices and exit.--device: Select a specific capture device by index.--server: Point the client at a remote server (ws://127.0.0.1:<config-port>/by default).--timestamps: Print the final transcript with per-word timing instead of live text.
The client streams raw 24 kHz mono PCM to the server and displays each live word as it appears. When the backend signals completion, the final transcript (and optional timestamps) is printed.
Runtime configuration lives at:
$XDG_CONFIG_HOME/ears/config.toml
# or ~/.config/ears/config.toml
Key sections:
[storage]: Override model cache directories and reference audio location.[whisper]: Configure optional Whisper enhancement defaults (model, quantization, languages, sentence detection thresholds).[server]: Default WebSocket port used byears server startand the capture client.
If the file does not exist, it is created on first run together with the reference audio bundle.
The server emits JSON events:
{"type":"word","word":"hello","start_time":1.23,"end_time":null}– live word updates.{"type":"final","text":"…","words":[…]}– final transcript with timestamp list.{"type":"whisper_processing"|"whisper_complete",…}– optional Whisper status messages when Whisper is enabled.
Clients may send {"type":"stop"} to end the current session (the capture client does this automatically when interrupted).
ears server startreports "already running" – Useears server stopto terminate the existing instance. If the PID no longer exists,stopwill clean up the stale PID file.- Client prints "failed to connect" – Ensure the server is running and reachable at the URL passed via
--server(check the configured port). - High latency – Run the server on the same machine as the client or enable GPU acceleration (
--features cudaor--features metal).