

Pocket TTS (Rust/Candle)

A native Rust port of Kyutai's Pocket TTS using Candle for tensor operations.

Text-to-speech that runs entirely on CPU—no Python, no GPU required.

Features

  • Pure Rust - No Python runtime, just a single binary
  • CPU-only - Runs on CPU, no GPU required
  • Streaming - Generate audio progressively as it's synthesized
  • Voice cloning - Clone any voice from a few seconds of audio
  • Infinite text - Handle arbitrarily long text inputs via automatic segmentation
  • int8 Quantization - Significant speedup and smaller memory footprint
  • WebAssembly - Run the full model in any modern web browser
  • Pause Handling - Support for natural pauses and explicit [pause:Xms] syntax (see the sketch after this list)
  • HTTP API - REST API server with OpenAI-compatible endpoint
  • Web UI - Built-in web interface for interactive use
  • Python Bindings - Use the Rust implementation from Python for improved performance
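
As a minimal sketch of the pause syntax, using only the library API shown under Library Usage below (the text and the 500 ms value are illustrative):

use pocket_tts::TTSModel;
use anyhow::Result;

fn main() -> Result<()> {
    let model = TTSModel::load("b6369a24")?;
    let voice_state = model.get_voice_state("voice.wav")?;
    // An explicit [pause:500ms] marker requests roughly half a second of silence.
    let audio = model.generate("Welcome. [pause:500ms] Let's begin.", &voice_state)?;
    pocket_tts::audio::write_wav("paused.wav", &audio, model.sample_rate as u32)?;
    Ok(())
}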

Quick Start

Build from source

cd candle
cargo build --release

Generate audio

# Using default voice
cargo run --release -p pocket-tts-cli -- generate --text "Hello, world!"

# Using a custom voice (WAV file)
cargo run --release -p pocket-tts-cli -- generate \
    --text "Hello, world!" \
    --voice ./my_voice.wav \
    --output output.wav

# Using a predefined voice
cargo run --release -p pocket-tts-cli -- generate --voice alba

Start the HTTP server

cargo run --release -p pocket-tts-cli -- serve
# Navigate to http://localhost:8000

WebAssembly Demo

The browser demo offers a zero-setup experience: the tokenizer and config are embedded in the package.

1. Build the WASM package

From the crates/pocket-tts directory:

wasm-pack build --target web --out-dir pkg . -- --features wasm

2. Launch the demo server

cargo run --release -p pocket-tts-cli -- wasm-demo

  • Navigate to http://localhost:8080
  • Provides built-in voice cloning and Hugging Face Hub integration.

Installation

Add to your Cargo.toml:

[dependencies]
pocket-tts = { path = "candle/crates/pocket-tts" }

Library Usage

use pocket_tts::TTSModel;
use anyhow::Result;

fn main() -> Result<()> {
    // Load the model
    let model = TTSModel::load("b6369a24")?;
    
    // Get voice state from audio file
    let voice_state = model.get_voice_state("voice.wav")?;
    
    // Generate audio
    let audio = model.generate("Hello, world!", &voice_state)?;
    
    // Save to file
    pocket_tts::audio::write_wav("output.wav", &audio, model.sample_rate as u32)?;
    
    Ok(())
}

Streaming Generation

use pocket_tts::TTSModel;

let model = TTSModel::load("b6369a24")?;
let voice_state = model.get_voice_state("voice.wav")?;

// Stream audio chunks as they're generated
for chunk in model.generate_stream("Long text here...", &voice_state) {
    let audio_chunk = chunk?;
    // Process or play each chunk
}
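
As a follow-up, the chunks can be accumulated and written out as a single file. This sketch assumes each chunk yields f32 PCM samples compatible with write_wav; the chunk type is not spelled out in this README:

use pocket_tts::TTSModel;
use anyhow::Result;

fn main() -> Result<()> {
    let model = TTSModel::load("b6369a24")?;
    let voice_state = model.get_voice_state("voice.wav")?;
    let mut samples = Vec::new();
    for chunk in model.generate_stream("Long text here...", &voice_state) {
        // Append each decoded chunk as soon as it arrives.
        samples.extend(chunk?);
    }
    pocket_tts::audio::write_wav("streamed.wav", &samples, model.sample_rate as u32)?;
    Ok(())
}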

Custom Parameters

let model = TTSModel::load_with_params(
    "b6369a24",     // variant
    0.7,            // temperature (higher = more variation)
    1,              // lsd_decode_steps (more = better quality, slower)
    -4.0,           // eos_threshold (more negative = longer audio)
)?;
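
A model loaded this way is used exactly like the default one, for example:

let voice_state = model.get_voice_state("voice.wav")?;
let audio = model.generate("Testing custom parameters.", &voice_state)?;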

CLI Reference

generate command

Generate audio from text and save to a WAV file.

pocket-tts generate [OPTIONS]

Options:
  -t, --text <TEXT>              Text to synthesize [default: greeting]
  -v, --voice <VOICE>            Voice: predefined name, .wav file, or .safetensors
  -o, --output <PATH>            Output file [default: output.wav]
      --variant <VARIANT>        Model variant [default: b6369a24]
      --temperature <FLOAT>      Sampling temperature [default: 0.7]
      --lsd-decode-steps <INT>   LSD decode steps [default: 1]
      --eos-threshold <FLOAT>    EOS threshold [default: -4.0]
      --stream                   Stream raw PCM to stdout
  -q, --quiet                    Suppress output

Predefined voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

serve command

Start an HTTP API server with web interface.

pocket-tts serve [OPTIONS]

Options:
      --host <HOST>              Bind address [default: 127.0.0.1]
  -p, --port <PORT>              Port number [default: 8000]
      --voice <VOICE>            Default voice [default: alba]
      --variant <VARIANT>        Model variant [default: b6369a24]
      --temperature <FLOAT>      Temperature [default: 0.7]
      --lsd-decode-steps <INT>   LSD steps [default: 1]
      --eos-threshold <FLOAT>    EOS threshold [default: -4.0]

wasm-demo command

Serve the WASM package and browser demo.

pocket-tts wasm-demo [OPTIONS]

Options:
  -p, --port <PORT>              Port number [default: 8080]

Python Bindings

The Rust implementation can be used as a Python module for improved performance (~1.34x speedup).

Installation

Requires maturin (https://github.com/PyO3/maturin):

cd candle/crates/pocket-tts-bindings
uvx maturin develop --release

Usage

import pocket_tts_bindings

# Load the model
model = pocket_tts_bindings.PyTTSModel.load("b6369a24")

# Generate audio
audio_samples = model.generate(
    "Hello from Rust!",
    "path/to/voice.wav"
)

API Endpoints

| Method | Endpoint         | Description                   |
|--------|------------------|-------------------------------|
| GET    | /                | Web interface                 |
| GET    | /health          | Health check                  |
| POST   | /generate        | Generate audio (JSON)         |
| POST   | /stream          | Streaming generation          |
| POST   | /tts             | Python-compatible (multipart) |
| POST   | /v1/audio/speech | OpenAI-compatible             |

Example API call

curl -X POST http://localhost:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello world", "voice": "alba"}' \
  --output output.wav
Project Structure

candle/
├── Cargo.toml              # Workspace configuration
├── crates/
│   ├── pocket-tts/         # Core library
│   │   ├── src/
│   │   │   ├── lib.rs          # Public API
│   │   │   ├── tts_model.rs    # Main TTSModel
│   │   │   ├── wasm.rs         # WASM entry points
│   │   │   ├── audio.rs        # WAV I/O, resampling
│   │   │   ├── quantize.rs     # int8 quantization
│   │   │   ├── pause.rs        # Pause/silence handling
│   │   │   ├── config.rs       # YAML config types
│   │   │   ├── models/         # Neural network models
│   │   │   │   ├── flow_lm.rs      # Flow language model
│   │   │   │   ├── mimi.rs         # Audio codec
│   │   │   │   ├── seanet.rs       # Encoder/decoder
│   │   │   │   └── transformer.rs  # Transformer blocks
│   │   │   └── modules/        # Reusable components
│   │   │       ├── attention.rs    # Multi-head attention
│   │   │       ├── conv.rs         # Convolution layers
│   │   │       ├── mlp.rs          # MLP with AdaLN
│   │   │       └── rope.rs         # Rotary embeddings
│   │   ├── tests/
│   │   └── benches/
│   └── pocket-tts-cli/     # CLI binary
│       ├── src/
│       │   ├── main.rs         # Entry point
│       │   ├── commands/       # generate, serve
│       │   ├── server/         # Axum HTTP server
│       │   └── voice.rs        # Voice resolution
│       └── static/             # Web UI assets
└── docs/                   # Documentation

Architecture

The Rust port mirrors the Python implementation:

  1. Text Conditioning: SentencePiece tokenizer → embedding lookup table
  2. FlowLM Transformer: Generates latent representations from text using Lagrangian Self Distillation (LSD)
  3. Mimi Decoder: Converts latents to audio via SEANet decoder
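
As a schematic only, the three stages compose like a simple function pipeline. The stand-in types and functions below are hypothetical, not the crate's API; the real implementations live in models/flow_lm.rs, models/mimi.rs, and models/seanet.rs:

// Hypothetical stand-ins for the three stages; not real pocket-tts APIs.
struct Tokens(Vec<u32>);
struct Latents(Vec<Vec<f32>>);

fn text_conditioning(text: &str) -> Tokens {
    // Stage 1: tokenize the text (stand-in: raw byte ids).
    Tokens(text.bytes().map(u32::from).collect())
}

fn flow_lm(tokens: &Tokens) -> Latents {
    // Stage 2: generate one latent frame per token (stand-in: zeros).
    Latents(vec![vec![0.0; 32]; tokens.0.len()])
}

fn mimi_decode(latents: &Latents) -> Vec<f32> {
    // Stage 3: decode latents to PCM samples (stand-in: flatten).
    latents.0.iter().flatten().copied().collect()
}

fn main() {
    let pcm = mimi_decode(&flow_lm(&text_conditioning("Hello")));
    println!("{} samples", pcm.len());
}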

Key differences from Python

  • Uses Candle instead of PyTorch
  • Stateless streaming (no internal module state)
  • Polyphase resampling via rubato (matches scipy)
  • Compiled to native code—no JIT, no Python overhead

Benchmarking

Run benchmarks to measure performance on your hardware:

cargo bench -p pocket-tts

Note: Performance may differ from the Python implementation. Candle is optimized for portability rather than raw speed.

Performance Results

Representative results from one machine, relative to the Python baseline (numbers will vary with hardware):

  • Short Text: ~3.57x speedup
  • Medium Text: ~2.40x speedup
  • Long Text: ~2.44x speedup
  • Latency: ~128ms to first audio chunk

Rust is consistently >2.4x faster than the optimized Python implementation.

Numerical Parity

The Rust implementation achieves strong numerical parity with Python:

| Component           | Max Difference | Status        |
|---------------------|----------------|---------------|
| Input audio         | 0              | ✅ Perfect    |
| SEANet Decoder      | ~0.000004      | ✅ Excellent  |
| Decoder Transformer | ~0.002         | ✅ Good       |
| Voice Conditioning  | ~0.004         | ✅ Good       |
| Full Pipeline       | ~0.06          | ✅ Acceptable |

Run parity tests:

cargo test -p pocket-tts parity --release

Dependencies

Core dependencies (see the full list in Cargo.toml):

  • candle - tensor operations
  • rubato - polyphase resampling
  • axum - HTTP server
  • anyhow - error handling

License

MIT License - see LICENSE
