BACKENDS.md

TTS Backend Guide

What is a TTS Backend?

A TTS backend is the actual text-to-speech engine that generates audio. Think of Open Unified TTS as a smart router that:

Takes your text and voice request
Chooses the best backend for that voice
Chunks the text if needed
Sends chunks to the backend
Stitches the results together seamlessly

Each backend has different strengths, voice capabilities, and hardware requirements.

Quick Decision Guide

I want to...

Just get started quickly → Kokoro (CPU-friendly, Docker ready, 67 voices)
Clone a specific voice → VoxCPM or OpenAudio (requires GPU and manual setup)
Use cloud/hosted solution → ElevenLabs or MiniMax (API key required)
Generate singing/music → ACE-Step (requires GPU)
Creative voice generation → Higgs Audio (GPU, scene-based voices)

Backend Comparison

Kokoro (Recommended Starting Point)

Type: Neural TTS Voices: 67 built-in voices Hardware: CPU or GPU (CPU works great!) Setup Difficulty: Easy (Docker one-liner) Quality: Excellent natural speech

Best For:

Getting started quickly
Audiobook narration
General-purpose TTS
Running on machines without GPU

Setup:

# CPU version (works anywhere)
docker run -d --name kokoro-tts -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# GPU version (faster)
docker run -d --name kokoro-tts --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

Full Kokoro Setup Guide →

Voice Categories:

American Female (9 voices): af_alloy, af_bella, af_heart, af_nova, af_sky, etc.
American Male (8 voices): am_adam, am_echo, am_eric, am_onyx, etc.
British Female (3 voices): bf_alice, bf_emma, bf_lily
British Male (4 voices): bm_daniel, bm_fable, bm_george, bm_lewis

VoxCPM (Voice Cloning)

Type: Voice Cloning Voices: Custom (clone any voice with reference audio) Hardware: GPU required (12-16GB VRAM) Setup Difficulty: Hard (manual installation) Quality: Excellent clone quality

Best For:

Cloning specific celebrity/character voices
Matching a brand voice exactly
Creating custom voice personas

Limitations:

~75 words per generation (handled by chunking)
Requires 5-30 seconds of reference audio
GPU-intensive

Setup: Requires manual installation of VoxCPM from GitHub. Not covered in this guide.

VoxCPM 1.5 (High-Quality Voice Cloning)

Type: Voice Cloning (improved version) Voices: 88+ pre-loaded voices Hardware: GPU required (8GB+ VRAM) Setup Difficulty: Hard (manual installation) Quality: 44.1kHz output (2x quality of original VoxCPM)

Best For:

High-quality production work
When you need better audio fidelity
Lighter VRAM usage than original VoxCPM

Improvements over VoxCPM:

Higher sample rate (44.1kHz vs 22.05kHz)
More efficient VRAM usage (~8GB vs ~16GB)
88+ pre-loaded character voices
Better handling of longer texts (~150 words vs ~75)

OpenAudio S1-mini (Voice Cloning)

Type: Voice Cloning (Fish Speech based) Voices: Custom clones Hardware: GPU required Setup Difficulty: Hard (manual installation) Quality: Good clone quality

Best For:

Alternative to VoxCPM
Fish Speech ecosystem users

FishTTS (Voice Synthesis)

Type: Voice Cloning Voices: Custom Hardware: GPU required Setup Difficulty: Hard Quality: Good

Best For:

Fish Speech ecosystem integration
Alternative voice cloning approach

Higgs Audio (Generative/Creative)

Type: Generative TTS Voices: Unlimited (scene-based generation) Hardware: GPU required Setup Difficulty: Hard Quality: Variable (creative, not clone-accurate)

Best For:

Creating NEW unique voices on the fly
Experimental/creative projects
Scene-based character voices ("gruff pirate", "ethereal spirit", etc.)

How it works: Instead of cloning existing voices, Higgs generates voices based on scene descriptions:

"nervous alien diplomat speaking through translator"
"1920s radio announcer, tinny and enthusiastic"
"ancient stone golem, grinding and rumbling"

Not recommended for: Cloning specific people/characters (use VoxCPM instead)

Kyutai/Moshi (Emotion TTS)

Type: Emotional Expression Voices: 8 emotion modes Hardware: GPU required Setup Difficulty: Hard Quality: Good emotional range

Best For:

Adding emotional expression to speech
When you need happy/sad/angry variations
Interactive applications

Emotions: neutral, happy, sad, angry, surprised, disgusted, fearful, contemptuous

Chatterbox (Emotion Voice Cloning)

Type: Voice Clone + Emotion Voices: Custom with emotion control Hardware: GPU required Setup Difficulty: Hard Quality: Good

Best For:

Voice clones with emotional variation
Character dialogue with mood shifts

VibeVoice (Streaming TTS)

Type: Microsoft-based streaming TTS Voices: Microsoft voice set Hardware: GPU required Setup Difficulty: Hard Quality: Good

Best For:

Real-time streaming applications
Low-latency requirements

ACE-Step (Musical/Singing TTS)

Type: Musical TTS (singing!) Voices: Custom Hardware: GPU required Setup Difficulty: Hard Quality: Unique (singing, not speech)

Best For:

Generating singing vocals
Musical applications
Creative audio projects

Note: This is for SINGING, not speech. Use other backends for spoken word.

MiniMax TTS (Cloud)

Type: Cloud API Voices: 7 professional voices Hardware: None (cloud-based) Setup Difficulty: Easy (API key only) Quality: Excellent professional voices

Best For:

No local GPU available
Professional narration voices
Production-quality output

Voices:

English_expressive_narrator (British narrator)
English_magnetic_voiced_man (Deep American)
English_Aussie_Bloke (Australian)
English_Comedian (Comedic American)
English_radiant_girl (Energetic female)
English_compelling_lady1 (Professional British female)
English_CalmWoman (Meditation voice)

Setup: Set environment variable: MINIMAX_API_KEY=your_api_key

ElevenLabs (Cloud)

Type: Cloud API Voices: Many professional voices Hardware: None (cloud-based) Setup Difficulty: Easy (API key only) Quality: Industry-leading

Best For:

Highest quality output
When cost isn't a concern
Fallback when local backends unavailable

Setup: Set environment variable: ELEVENLABS_API_KEY=sk_...

Note: Can be expensive for long-form content. Consider Kokoro or self-hosted alternatives for audiobooks.

Multi-Backend Strategy

You don't have to choose just one! Run multiple backends and use:

Voice Preferences

Route specific voices to specific backends:

# Set "morgan" to always use VoxCPM (your best clone)
curl -X POST http://localhost:8765/v1/voice-prefs/morgan \
  -H "Content-Type: application/json" \
  -d '{"backend": "voxcpm"}'

# Set default voices to Kokoro (fast and free)
curl -X POST http://localhost:8765/v1/backends/switch \
  -H "Content-Type: application/json" \
  -d '{"backend": "kokoro"}'

Automatic Failover

If a backend is unavailable, the router automatically tries alternatives. Example flow:

Request "morgan" voice → checks voice preference → VoxCPM
VoxCPM unavailable → falls back to Kokoro
Kokoro unavailable → tries ElevenLabs (if configured)

Hardware Requirements Summary

Backend	CPU Only	GPU (8GB)	GPU (12GB+)	Cloud
Kokoro	✓	✓	✓	-
VoxCPM	-	-	✓	-
VoxCPM 1.5	-	✓	✓	-
OpenAudio	-	✓	✓	-
Higgs	-	✓	✓	-
ACE-Step	-	✓	✓	-
MiniMax	-	-	-	✓
ElevenLabs	-	-	-	✓

Setup Recommendations

Beginner Path

Start with Kokoro (Docker, CPU-friendly)
Learn the API with simple voices
Generate your first audiobook

Intermediate Path

Kokoro for general use
Add MiniMax cloud for professional voices (easy API key setup)
Experiment with voice routing

Advanced Path

Kokoro for fallback
VoxCPM 1.5 for voice cloning (GPU required)
ACE-Step for musical applications
ElevenLabs as cloud backup
Complex voice routing and preferences

Production Path

Multiple Kokoro instances (load balancing)
Dedicated VoxCPM GPU server for brand voices
MiniMax for professional narration
Comprehensive monitoring and failover

Backend Profiles (Technical)

Each backend has a profile that defines its capabilities:

"kokoro": {
    "max_words": 200,        # Hard limit before errors
    "max_chars": 1200,
    "optimal_words": 150,    # Target chunk size
    "needs_chunking": True,
    "crossfade_ms": 30,      # Stitching overlap
},
"voxcpm": {
    "max_words": 75,         # Stricter limit
    "max_chars": 400,
    "optimal_words": 50,     # Smaller chunks
    "needs_chunking": True,
    "crossfade_ms": 50,      # Longer crossfade
},
"voxcpm15": {
    "max_words": 150,
    "max_chars": 800,
    "optimal_words": 100,
    "needs_chunking": True,
    "crossfade_ms": 50,
    "sample_rate": 44100,    # Higher quality
}

The chunker uses these profiles to automatically split your text appropriately for each backend.

Common Questions

Q: Can I use multiple backends at once?

A: Yes! Run multiple backends and route different voices to different engines. The router handles selection automatically.

Q: Which backend is fastest?

A: Kokoro on GPU is fastest for general TTS. Cloud APIs (MiniMax, ElevenLabs) are fast but have network latency.

Q: Which has the best quality?

A: For cloning: VoxCPM 1.5 (44.1kHz). For neural TTS: Kokoro or ElevenLabs. For music: ACE-Step.

Q: Can I run this without a GPU?

A: Yes! Kokoro CPU mode works great for most uses. Cloud APIs (MiniMax, ElevenLabs) also require no GPU.

Q: How do I add a new backend?

A: Create an adapter in adapters/ that implements the TTSBackend interface. See adapters/kokoro.py for reference.

Q: What if my preferred backend goes down?

A: The router automatically fails over to available alternatives. You can also set a fallback chain in preferences.

Next Steps

Start simple: Set up Kokoro using the Kokoro Setup Guide
Test it: Generate some audio with different voices
Expand: Add more backends as needed
Route voices: Configure preferences for optimal quality

For implementation details, see the main README.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TTS Backend Guide

What is a TTS Backend?

Quick Decision Guide

Backend Comparison

Kokoro (Recommended Starting Point)

VoxCPM (Voice Cloning)

VoxCPM 1.5 (High-Quality Voice Cloning)

OpenAudio S1-mini (Voice Cloning)

FishTTS (Voice Synthesis)

Higgs Audio (Generative/Creative)

Kyutai/Moshi (Emotion TTS)

Chatterbox (Emotion Voice Cloning)

VibeVoice (Streaming TTS)

ACE-Step (Musical/Singing TTS)

MiniMax TTS (Cloud)

ElevenLabs (Cloud)

Multi-Backend Strategy

Voice Preferences

Automatic Failover

Hardware Requirements Summary

Setup Recommendations

Beginner Path

Intermediate Path

Advanced Path

Production Path

Backend Profiles (Technical)

Common Questions

Q: Can I use multiple backends at once?

Q: Which backend is fastest?

Q: Which has the best quality?

Q: Can I run this without a GPU?

Q: How do I add a new backend?

Q: What if my preferred backend goes down?

Next Steps

FilesExpand file tree

BACKENDS.md

Latest commit

History

BACKENDS.md

File metadata and controls

TTS Backend Guide

What is a TTS Backend?

Quick Decision Guide

Backend Comparison

Kokoro (Recommended Starting Point)

VoxCPM (Voice Cloning)

VoxCPM 1.5 (High-Quality Voice Cloning)

OpenAudio S1-mini (Voice Cloning)

FishTTS (Voice Synthesis)

Higgs Audio (Generative/Creative)

Kyutai/Moshi (Emotion TTS)

Chatterbox (Emotion Voice Cloning)

VibeVoice (Streaming TTS)

ACE-Step (Musical/Singing TTS)

MiniMax TTS (Cloud)

ElevenLabs (Cloud)

Multi-Backend Strategy

Voice Preferences

Automatic Failover

Hardware Requirements Summary

Setup Recommendations

Beginner Path

Intermediate Path

Advanced Path

Production Path

Backend Profiles (Technical)

Common Questions

Q: Can I use multiple backends at once?

Q: Which backend is fastest?

Q: Which has the best quality?

Q: Can I run this without a GPU?

Q: How do I add a new backend?

Q: What if my preferred backend goes down?

Next Steps