5 releases
| Version | Released |
|---|---|
| 0.1.5 | Oct 4, 2025 |
| 0.1.4 | Oct 4, 2025 |
kokoroxide [WIP]
A high-performance Rust implementation of Kokoro TTS (Text-to-Speech) synthesis, leveraging ONNX Runtime for efficient neural speech generation. Uses espeak-ng for text-to-phoneme conversion, with built-in conversion logic into Misaki phoneme notation expected by Kokoro models. Distributed under a dual MIT/Apache-2.0 license to match the broader Rust ecosystem.
Note: The crate currently supports, and has only been tested with, American English. Contributions for other languages are very welcome!
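To make the espeak-to-Misaki step more concrete, here is a minimal sketch of a longest-match-free prefix rewrite over an IPA string. The rule table is an illustrative subset based on Misaki's published American English notation (for example, the diphthong oʊ written as O); it is an assumption for demonstration and not the table kokoroxide ships. The function name ipa_to_misaki_sketch is hypothetical.

```rust
/// Illustrative subset of espeak-IPA -> Misaki rewrites.
/// Assumption: these pairs follow Misaki's American English notation;
/// the crate's real conversion table is larger and may differ.
const REWRITES: &[(&str, &str)] = &[
    ("eɪ", "A"),
    ("aɪ", "I"),
    ("oʊ", "O"),
    ("aʊ", "W"),
    ("ɔɪ", "Y"),
    ("dʒ", "ʤ"),
    ("tʃ", "ʧ"),
];

/// Hypothetical helper: rewrite known IPA sequences, copy everything else.
fn ipa_to_misaki_sketch(ipa: &str) -> String {
    let mut out = String::with_capacity(ipa.len());
    let mut rest = ipa;
    'outer: while !rest.is_empty() {
        for &(from, to) in REWRITES {
            if let Some(tail) = rest.strip_prefix(from) {
                out.push_str(to);
                rest = tail;
                continue 'outer;
            }
        }
        // No rule matched: copy one character through unchanged.
        let ch = rest.chars().next().unwrap();
        out.push(ch);
        rest = &rest[ch.len_utf8()..];
    }
    out
}

fn main() {
    println!("{}", ipa_to_misaki_sketch("həˈloʊ wɜːld"));
}
```

Set DEBUG_PHONEMES=1 to compare this kind of output against what the crate actually produces.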
Features
- 🎨 Voice Style Control - Customize voice characteristics with style vectors
- 🔤 Phoneme Support - Direct phoneme input for precise pronunciation control
- ⚡ Speed Control - Adjust speech rate dynamically
- 🔧 Flexible API - Multiple generation methods for different use cases
Installation
Add this to your Cargo.toml:
[dependencies]
kokoroxide = "0.1.3"
Quick Start
use kokoroxide::{load_voice_style, KokoroTTS, TTSConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure the ONNX model + tokenizer that Kokoro requires.
    // These files live outside the crate; download them from Kokoro's distribution
    // (https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX).
    let config = TTSConfig::new("path/to/kokoro.onnx", "path/to/tokenizer.json")
        .with_sample_rate(24000)
        .with_max_tokens_length(512)
        .with_graph_optimization_level(kokoroxide::GraphOptimizationLevel::Disable);

    // Build the speech engine with the explicit configuration so advanced knobs are available.
    let tts_service = KokoroTTS::with_config(config)?;

    // Load a voice style vector (.bin) that controls prosody and speaker identity.
    let voice = load_voice_style("path/to/voice.bin")?;

    // Generate speech at 1.0x speed for the requested text.
    let text = "Hello, this is a text-to-speech synthesis example.";
    let audio = tts_service.generate_speech(text, &voice, 1.0)?;

    // Persist the synthesized waveform to a WAV file for playback.
    audio.save_to_wav("path/to/output.wav")?;

    Ok(())
}
For a complete runnable example pointing at real assets, see the kokoroxide-demo sample project in this workspace (kokoroxide-demo/src/main.rs).
API Overview
Core Types
KokoroTTS
The main TTS engine that handles text-to-speech conversion.
// Create with default config
let tts = KokoroTTS::new(model_path, tokenizer_path)?;
// Create with custom config
let config = TTSConfig::new(model_path, tokenizer_path)
    .with_max_tokens_length(128)
    .with_sample_rate(24000);
let tts = KokoroTTS::with_config(config)?;
VoiceStyle
Represents voice characteristics as a style vector. Voice files contain multiple style vectors indexed by token length.
// Load from binary file
let voice = load_voice_style("voice.bin")?;
// Create custom voice with vector size
let custom_voice = VoiceStyle::new(vec![0.1, 0.2, /* ... remaining values ... */], 256);
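As a rough illustration of "style vectors indexed by token length", the sketch below treats a voice file as a flat run of little-endian f32 values arranged in rows of a fixed width, and picks the row whose index equals the token count. That layout mirrors how the upstream Kokoro ONNX examples read voice files, but it is an assumption here; load_voice_style may parse and index the data differently. Both function names are hypothetical.

```rust
/// Decode raw little-endian f32 bytes (the assumed on-disk layout).
/// Trailing bytes that do not fill a whole f32 are ignored.
fn decode_f32_le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Pick the style row for a token count, treating `data` as rows of `dim` floats.
/// Returns None when the file has no row for that length.
fn style_for_token_len(data: &[f32], dim: usize, token_len: usize) -> Option<&[f32]> {
    if dim == 0 {
        return None;
    }
    let start = token_len.checked_mul(dim)?;
    let end = start.checked_add(dim)?;
    data.get(start..end)
}

fn main() {
    // Three fake rows of width 2 stand in for a real voice file.
    let bytes: Vec<u8> = [0.0f32, 0.1, 1.0, 1.1, 2.0, 2.1]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let data = decode_f32_le(&bytes);
    println!("{:?}", style_for_token_len(&data, 2, 1)); // second row
    println!("{:?}", style_for_token_len(&data, 2, 3)); // out of range
}
```

Returning an Option keeps an over-long token sequence from silently reading past the end of the table.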
GeneratedAudio
Contains the generated audio samples and metadata.
let audio = tts.speak("Hello!", &voice)?;
println!("Duration: {} seconds", audio.duration_seconds);
println!("Sample rate: {} Hz", audio.sample_rate);
audio.save_to_wav("output.wav")?;
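For readers who want to see how duration and WAV output relate to the raw samples, here is a standalone sketch using only the standard library. It assumes mono audio stored as f32 samples in the range -1.0 to 1.0 and writes 16-bit PCM; whether save_to_wav uses the same bit depth is not stated in this README, so treat the encoder as an illustration rather than the crate's implementation. Both function names are hypothetical.

```rust
/// Duration in seconds of a mono buffer at the given sample rate.
fn duration_seconds(num_samples: usize, sample_rate: u32) -> f32 {
    num_samples as f32 / sample_rate as f32
}

/// Encode mono f32 samples as a 16-bit PCM WAV byte buffer (44-byte header + data).
fn encode_wav_mono16(samples: &[f32], sample_rate: u32) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32;
    let mut out = Vec::with_capacity(44 + samples.len() * 2);
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes()); // remaining file size
    out.extend_from_slice(b"WAVE");
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // format tag: PCM
    out.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&(sample_rate * 2).to_le_bytes()); // byte rate
    out.extend_from_slice(&2u16.to_le_bytes()); // block align
    out.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        // Clamp first so out-of-range samples saturate instead of wrapping.
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let samples = vec![0.0f32; 24000]; // one second of silence at 24 kHz
    let wav = encode_wav_mono16(&samples, 24000);
    println!("{} bytes, {} s", wav.len(), duration_seconds(samples.len(), 24000));
}
```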
Generation Methods
1. Simple Text-to-Speech
let audio = tts.speak("Hello, world!", &voice)?;
2. With Speed Control
let audio = tts.generate_speech("Speak faster!", &voice, 1.5)?; // 1.5x speed
3. From Phonemes
let audio = tts.generate_speech_from_phonemes("həˈloʊ wɜːld", &voice, 1.0)?;
4. From Token IDs
let tokens = vec![101, 2234, 1567, 102]; // Pre-tokenized input
let audio = tts.generate_from_tokens(&tokens, &voice, 1.0)?;
Configuration
TTSConfig Options
use ort::{execution_providers::CoreMLExecutionProviderOptions, ExecutionProvider, GraphOptimizationLevel};
let config = TTSConfig::new(model_path, tokenizer_path)
    .with_max_tokens_length(512) // Maximum token sequence length
    .with_sample_rate(24000) // Audio sample rate in Hz
    .with_graph_optimization_level(GraphOptimizationLevel::Level3)
    .with_execution_providers(vec![
        ExecutionProvider::CoreML(CoreMLExecutionProviderOptions::default()),
    ]); // Optional hardware acceleration
If you don't need custom providers, you can skip the call to with_execution_providers and the default CPU provider will be used.
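Because with_max_tokens_length caps the sequence length, long inputs may need to be split before synthesis. This README does not say how the crate handles input that exceeds the limit, so the sketch below shows one possible caller-side approach: pack whole sentences into chunks under a character budget. Character count is only a rough proxy for token count, and chunk_text is a hypothetical helper, not part of kokoroxide.

```rust
/// Greedily pack sentences into chunks of at most `max_chars` characters.
/// A single sentence longer than the budget is kept whole in its own chunk.
fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for sentence in text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?')) {
        let sentence = sentence.trim();
        if sentence.is_empty() {
            continue;
        }
        // +1 accounts for the space that joins two sentences.
        let needed = current.chars().count() + 1 + sentence.chars().count();
        if !current.is_empty() && needed > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(sentence);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    for chunk in chunk_text("One. Two. Three.", 9) {
        println!("{chunk}");
    }
}
```

Each chunk could then be passed to generate_speech separately and the resulting audio concatenated.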
Graph Optimization Levels
The with_graph_optimization_level() method allows you to control ONNX Runtime's graph optimization:
- GraphOptimizationLevel::Disable - No optimizations
- GraphOptimizationLevel::Level1 - Basic optimizations
- GraphOptimizationLevel::Level2 - Extended optimizations
- GraphOptimizationLevel::Level3 - Maximum optimizations (default)
System Requirements
Prerequisites
- Rust 1.70+
- espeak-ng (required for text-to-phoneme conversion):
  - Ubuntu/Debian: sudo apt-get install espeak-ng libespeak-ng-dev
  - macOS: brew install espeak-ng
  - Windows: Download from espeak-ng releases
  - Arch Linux: sudo pacman -S espeak-ng
- ONNX Runtime (automatically downloaded via the ort crate)
- Kokoro model files:
  - Model file (e.g., kokoro-v0_19.onnx)
  - Tokenizer configuration (tokenizer.json)
  - Voice style files (.bin format)
  - Downloaded at runtime or managed outside the crate package to keep the published crate lightweight
Build Configuration
The crate automatically links to espeak-ng based on your platform:
- macOS: Looks for espeak-ng in /opt/homebrew/lib (Homebrew default)
- Linux: Uses system library paths
If espeak-ng is installed in a non-standard location, you may need to set:
export LD_LIBRARY_PATH=/path/to/espeak-ng/lib:$LD_LIBRARY_PATH # Linux
export DYLD_LIBRARY_PATH=/path/to/espeak-ng/lib:$DYLD_LIBRARY_PATH # macOS
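For context, platform-specific linking of this kind is usually done in a build script that prints cargo:rustc-link-search and cargo:rustc-link-lib directives. The sketch below shows the general shape only; it is not the crate's actual build.rs, and the ESPEAK_NG_LIB_DIR override variable is hypothetical (the crate does not document such a variable).

```rust
/// Cargo directives a build script could print for a given target OS.
/// `custom_dir` models a user-supplied library directory (hypothetical override).
fn link_directives(target_os: &str, custom_dir: Option<&str>) -> Vec<String> {
    let mut lines = Vec::new();
    if let Some(dir) = custom_dir {
        lines.push(format!("cargo:rustc-link-search=native={dir}"));
    } else if target_os == "macos" {
        // Homebrew's default prefix on Apple Silicon, as described above.
        lines.push("cargo:rustc-link-search=native=/opt/homebrew/lib".to_string());
    }
    // On Linux the system linker paths are searched without extra hints.
    lines.push("cargo:rustc-link-lib=dylib=espeak-ng".to_string());
    lines
}

fn main() {
    // Cargo sets CARGO_CFG_TARGET_OS for build scripts; fall back to the host OS.
    let target_os = std::env::var("CARGO_CFG_TARGET_OS")
        .unwrap_or_else(|_| std::env::consts::OS.to_string());
    // ESPEAK_NG_LIB_DIR is a hypothetical variable used only in this sketch.
    let custom = std::env::var("ESPEAK_NG_LIB_DIR").ok();
    for line in link_directives(&target_os, custom.as_deref()) {
        println!("{line}");
    }
}
```

Link-search directives only affect the build; the LD_LIBRARY_PATH and DYLD_LIBRARY_PATH settings above are what the dynamic loader consults when the binary runs.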
Environment Variables
- DEBUG_PHONEMES - Enable phoneme debugging output:
  DEBUG_PHONEMES=1 cargo run
  This will print:
  - Input text
  - Espeak IPA output
  - Converted Misaki phonemes
- DEBUG_TOKENS - Enable token debugging output:
  DEBUG_TOKENS=1 cargo run
  This will print:
  - Generated token IDs array
- DEBUG_TIMING - Enable performance timing logs:
  DEBUG_TIMING=1 cargo run
  This will print:
  - Phoneme tokenization time
  - Espeak IPA conversion time
  - Total tokenization time
- All debug modes:
  DEBUG_PHONEMES=1 DEBUG_TOKENS=1 DEBUG_TIMING=1 cargo run
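If you want the same switches in your own application code, a flag check along these lines is a common pattern. How kokoroxide itself interprets the values is not documented here (it may treat any set value as enabled), so the rule below, where an empty value, 0, or false counts as off, is an assumption. Both function names are hypothetical.

```rust
/// Interpret an optional environment value as an on/off flag.
/// Assumption: unset, empty, "0", and "false" mean disabled; anything else enables.
fn flag_is_set(value: Option<&str>) -> bool {
    match value {
        Some(v) => !matches!(v.trim(), "" | "0" | "false"),
        None => false,
    }
}

/// Read a flag such as DEBUG_TIMING from the process environment.
fn debug_enabled(name: &str) -> bool {
    flag_is_set(std::env::var(name).ok().as_deref())
}

fn main() {
    for name in ["DEBUG_PHONEMES", "DEBUG_TOKENS", "DEBUG_TIMING"] {
        println!("{name}: {}", debug_enabled(name));
    }
}
```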
Model Files
Download the Kokoro model files from the official repository:
- Model: Kokoro-82M ONNX
- Required files:
  - *.onnx - The model file
  - tokenizer.json - Tokenizer configuration
  - Voice files (*.bin) - Style vectors for different voices
- Provide these assets at runtime (they are not packaged with the crate to keep the published tarball lightweight)
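Since the assets are supplied at runtime, it can help to check that they exist before constructing the engine, so a missing download produces a clear message instead of a loader error. The sketch below uses only the standard library; missing_assets is a hypothetical helper and the file names are placeholders.

```rust
use std::path::{Path, PathBuf};

/// Return every path in `paths` that is not an existing regular file.
fn missing_assets<P: AsRef<Path>>(paths: &[P]) -> Vec<PathBuf> {
    paths
        .iter()
        .map(|p| p.as_ref())
        .filter(|p| !p.is_file())
        .map(Path::to_path_buf)
        .collect()
}

fn main() {
    // Placeholder names; point these at wherever you stored the downloads.
    let required = ["model.onnx", "tokenizer.json", "voice.bin"];
    let missing = missing_assets(&required);
    if missing.is_empty() {
        println!("all assets present");
    } else {
        for path in &missing {
            eprintln!("missing asset: {}", path.display());
        }
    }
}
```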
Examples
Basic TTS Application
use kokoroxide::{KokoroTTS, load_voice_style};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tts = KokoroTTS::new("model.onnx", "tokenizer.json")?;
    let voice = load_voice_style("voice.bin")?;
    let text = "Welcome to kokoroxide TTS!";
    let audio = tts.generate_speech(text, &voice, 1.0)?;
    audio.save_to_wav("welcome.wav")?;
    println!("Generated {} seconds of audio", audio.duration_seconds);
    Ok(())
}
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
This project implements the Kokoro TTS model in Rust, providing a high-performance alternative to Python implementations.