RunPod serverless worker for Echo-TTS inference
This repository is the RunPod serverless inference worker for Echo-TTS. It runs handler.py as a queue-based serverless worker, loads models from Hugging Face, and uploads generated audio to S3-compatible storage.
Core model/inference code is vendored from the upstream Echo-TTS repository at image build time (pinned via Dockerfile).
Model: jordand/echo-tts-base | Live Demo: echo-tts-preview | Blog Post: Technical Details
- Multi-Speaker Generation: Condition on reference audio for voice cloning or use no reference for default voice
- Advanced Architecture: Diffusion Transformer with rotary position embeddings and low-rank AdaLN adaptation
- High-Quality Output: Generates natural prosody and expression, output as 24kHz Opus (128k bitrate)
- Universal Compatibility: OGG/Opus format works with WhatsApp, Telegram, modern browsers, and most platforms
- Fine Control: Independent classifier-free guidance for text and speaker conditioning
- Long Prompts: Default text chunking (per-request) for long prompts
- Serverless: RunPod queue-based worker, S3 uploads, persistent voice directory
The Echo-TTS architecture has three main components built around a Diffusion Transformer (DiT):
- Text Encoder: Processes tokenized text with a 14-layer transformer (1280-dim)
- Speaker Encoder: Encodes reference audio into patches using a 14-layer transformer
- Latent Processor: Denoises audio latents through a 24-layer DiT with multi-modal attention
The model uses the Fish Speech S1-DAC autoencoder for audio encoding/decoding and supports classifier-free guidance with independent scales for text and speaker conditioning.
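To make "independent scales" concrete, here is a minimal sketch of one way to combine three model predictions (fully conditioned, text dropped, speaker dropped) with a separate guidance scale per condition. The additive form and the function name `combine_independent_cfg` are assumptions for illustration; the upstream sampler may combine or schedule these terms differently.

```python
from typing import List, Sequence


def combine_independent_cfg(
    v_full: Sequence[float],
    v_no_text: Sequence[float],
    v_no_speaker: Sequence[float],
    cfg_scale_text: float = 3.0,
    cfg_scale_speaker: float = 8.0,
) -> List[float]:
    """Combine three predictions using one guidance scale per condition.

    ASSUMPTION: this additive formulation is illustrative only and is not
    taken from the Echo-TTS source.
    """
    return [
        # Push away from each "condition dropped" prediction independently.
        full
        + cfg_scale_text * (full - no_text)
        + cfg_scale_speaker * (full - no_speaker)
        for full, no_text, no_speaker in zip(v_full, v_no_text, v_no_speaker)
    ]
```

In this formulation, setting both scales to zero returns the fully conditioned prediction unchanged, and each scale can be tuned without affecting the other term.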
This repo is intended to be built and run as a RunPod serverless worker image. For interactive demos or the upstream Python API examples, use the upstream Echo-TTS repository.
| Parameter | Description | Default | Range |
|---|---|---|---|
| cfg_scale_text | Classifier-free guidance scale for text conditioning | 3.0 | 1.0-10.0 |
| cfg_scale_speaker | CFG scale for speaker conditioning | 8.0 | 1.0-15.0 |
| sequence_length | Output latent length | 640 | 64-640 |
| num_steps | Diffusion sampling steps | 40 | 10-100 |
| speaker_kv_scale | Force speaker scaling | None | 1.0-2.0 |
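The ranges in the table can be checked on the client before a request is sent. The sketch below is a hypothetical helper (`validate_sampler_params` is not part of this repo), and it does not assume that handler.py enforces the same limits.

```python
# Defaults and ranges copied from the parameter table; the helper is hypothetical.
SAMPLER_DEFAULTS = {
    "cfg_scale_text": 3.0,
    "cfg_scale_speaker": 8.0,
    "sequence_length": 640,
    "num_steps": 40,
    "speaker_kv_scale": None,  # None leaves force-speaker scaling off
}
SAMPLER_RANGES = {
    "cfg_scale_text": (1.0, 10.0),
    "cfg_scale_speaker": (1.0, 15.0),
    "sequence_length": (64, 640),
    "num_steps": (10, 100),
    "speaker_kv_scale": (1.0, 2.0),
}


def validate_sampler_params(overrides=None):
    """Merge overrides onto the defaults and reject out-of-range values."""
    params = dict(SAMPLER_DEFAULTS)
    params.update(overrides or {})
    for name, (low, high) in SAMPLER_RANGES.items():
        value = params.get(name)
        if value is None:
            continue  # unset optional parameter
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return params
```

Keys that are not in the table (for example `seed`) pass through unchanged.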
For low-VRAM tuning in the upstream demos, refer to the upstream Echo-TTS documentation. This repo does not run the Gradio demo.
The serverless handler includes enhanced chunking to reduce audio artifacts:
Features:
- Audio-aware chunking: Splits text based on both character count and estimated audio duration (~12 chars/second)
- Cross-fading: Overlaps adjacent chunks with smooth transitions (100ms default) to eliminate clicks and pops
- Boundary normalization: Ensures consistent silence between chunks and removes excessive trailing silence
- Deterministic seeding: Uses spaced seeds for better audio continuity between chunks
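The features above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in handler.py: the function names, the sentence-splitting rule, the linear fade shape, and the seed stride of 1000 are all hypothetical. Only the ~12 chars/second estimate, the 100 ms overlap, and the 24 kHz sample rate come from this document.

```python
import re

CHARS_PER_SECOND = 12  # estimate quoted in the feature list


def split_text(text, max_chars_per_chunk=300, target_duration_seconds=10.0):
    """Greedily pack sentences under both the character and duration budgets."""
    budget = min(max_chars_per_chunk, int(target_duration_seconds * CHARS_PER_SECOND))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > budget:
            chunks.append(current)
            current = sentence  # a single over-long sentence stays oversized
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(left, right, sample_rate=24000, overlap_ms=100):
    """Linearly fade the tail of `left` into the head of `right`."""
    n = min(int(sample_rate * overlap_ms / 1000), len(left), len(right))
    if n == 0:
        return list(left) + list(right)
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the incoming chunk
        mixed.append(left[len(left) - n + i] * (1 - w) + right[i] * w)
    return list(left[:-n]) + mixed + list(right[n:])


def chunk_seeds(base_seed, num_chunks, stride=1000):
    """Spaced, reproducible seeds; the stride value is an assumption."""
    return [base_seed + i * stride for i in range(num_chunks)]
```

A real implementation would more likely operate on NumPy arrays or torch tensors; lists keep the sketch dependency-free.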
Example request with improved chunking:
curl -X POST "https://api.runpod.ai/v2/${ENDPOINT_ID}/runsync" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -d '{
    "input": {
      "text": "This is a long text that will be split into multiple chunks and processed with cross-fading to ensure smooth audio transitions between segments.",
      "parameters": {
        "max_chars_per_chunk": 300,
        "enable_crossfade": true,
        "normalize_boundaries": true,
        "target_duration_seconds": 10.0,
        "seed": 1234
      }
    }
  }'
For streaming applications or longer audio, use inference_blockwise.py:
from inference_blockwise import sample_blockwise

# model, fish_ae, pca_state, and speaker_audio are assumed to be loaded already
# (see the upstream Echo-TTS repository for the loading helpers).
# Generate in chunks for memory efficiency
audio_chunks = sample_blockwise(
    model=model,
    fish_ae=fish_ae,
    pca_state=pca_state,
    text_prompt="Your long text here...",
    chunk_size=160,  # 7.5 seconds per chunk
    speaker_audio=speaker_audio,
)
Text prompts follow WhisperD format:
- Start with [S1]; it is added automatically if not present
- Use commas for pauses
- Colons, semicolons, and em-dashes are normalized to commas
- Exclamation points increase expressiveness
Example prompts:
[S1] Welcome to our presentation today.
[S1] Hello! How are you doing?
[S1] The weather is beautiful, isn't it?
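The prompt rules described here can be approximated with a few lines of Python. This is a rough sketch of the described behavior, not the upstream normalizer; `normalize_prompt` is a hypothetical name, and the sketch only handles the [S1] tag.

```python
import re


def normalize_prompt(text: str) -> str:
    """Apply the WhisperD-style rules listed in this README (illustrative only)."""
    text = text.strip()
    # Colons, semicolons, and em-dashes (U+2014) become commas.
    text = re.sub(r"\s*[:;\u2014]\s*", ", ", text)
    # Collapse repeated whitespace and drop any trailing separator space.
    text = re.sub(r"\s+", " ", text).strip()
    # Prepend the speaker tag when it is missing.
    if not text.startswith("[S1]"):
        text = f"[S1] {text}"
    return text
```

Note that this naive replacement also rewrites colons inside times such as 10:30, so a production normalizer would need extra care there.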
- Reference Audio: 10 seconds typical, up to 5 minutes supported
- Force Speaker: Enable for out-of-distribution text
  - Scale 1.0: baseline (no forcing)
  - Scale 1.5: default when enabled
  - Use the lowest scale that produces the correct speaker
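Use pre-configured presets from sampler_presets.json:
import json
from functools import partial

# Assumed import path: the sampler is expected to live in the upstream inference module
from inference import sample_euler_cfg_independent_guidances

with open("sampler_presets.json") as f:
    presets = json.load(f)

# Use a preset
preset = presets["balanced"]
sample_fn = partial(sample_euler_cfg_independent_guidances, **preset)
For improved performance: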
from inference import compile_model, compile_fish_ae
model = compile_model(model)
fish_ae = compile_fish_ae(fish_ae)
# Run inference tests
python -m inference
# Test blockwise generation
python inference_blockwise.py
The serverless worker runs handler.py. Reference voices are selected by filename (no base64 payloads) from a mounted directory, and outputs are written as compressed audio and uploaded to S3-compatible storage (e.g., Backblaze B2).
Key environment variables
- AUDIO_VOICES_DIR (default /runpod-volume/echo-tts/audio_voices; override if you mount elsewhere): directory containing reference audio files (.wav/.mp3/.m4a/.ogg/.flac/.webm/.aac/.opus). Pass speaker_voice: "<filename>" in requests.
- OUTPUT_AUDIO_DIR (default /runpod-volume/echo-tts/output_audio; override if you mount elsewhere): temp dir for generated audio before upload.
- S3_ENDPOINT_URL: S3-compatible endpoint (e.g., Backblaze B2).
- S3_ACCESS_KEY_ID: S3 access key.
- S3_SECRET_ACCESS_KEY: S3 secret.
- S3_BUCKET_NAME: bucket to store generated audio.
- S3_REGION (default us-east-1): region name for the client.
- HF_TOKEN: Hugging Face token (required because the model weights are gated).
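A startup check along the following lines can catch a misconfigured endpoint before the first request. It is a sketch, not the worker's actual configuration code: `WorkerConfig` and `load_config` are hypothetical names, and treating the four S3 variables as mandatory is an assumption (this README only states that HF_TOKEN is required).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    audio_voices_dir: str
    output_audio_dir: str
    s3_endpoint_url: str
    s3_access_key_id: str
    s3_secret_access_key: str
    s3_bucket_name: str
    s3_region: str
    hf_token: str


# ASSUMPTION: the S3 variables are treated as required because they have no documented default.
REQUIRED = (
    "S3_ENDPOINT_URL",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
    "S3_BUCKET_NAME",
    "HF_TOKEN",
)


def load_config(env=None):
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return WorkerConfig(
        audio_voices_dir=env.get("AUDIO_VOICES_DIR", "/runpod-volume/echo-tts/audio_voices"),
        output_audio_dir=env.get("OUTPUT_AUDIO_DIR", "/runpod-volume/echo-tts/output_audio"),
        s3_endpoint_url=env["S3_ENDPOINT_URL"],
        s3_access_key_id=env["S3_ACCESS_KEY_ID"],
        s3_secret_access_key=env["S3_SECRET_ACCESS_KEY"],
        s3_bucket_name=env["S3_BUCKET_NAME"],
        s3_region=env.get("S3_REGION", "us-east-1"),
        hf_token=env["HF_TOKEN"],
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.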
RunPod cached models (recommended)
- Configure your endpoint's Model (optional) setting to jordand/echo-tts-base so workers are scheduled onto hosts with the model already cached.
- This worker is configured to use RunPod's cached-model mount path via HF_HOME=/runpod-volume/huggingface-cache and HF_HUB_CACHE=/runpod-volume/huggingface-cache/hub.
Request shape (serverless handler)
- text (str): text to synthesize.
- speaker_voice (str, optional): filename in AUDIO_VOICES_DIR.
- parameters (dict, optional): sampler config (num_steps, cfg_scale_text, cfg_scale_speaker, cfg_min_t, cfg_max_t, truncation_factor, rescale_k, rescale_sigma, speaker_kv_scale, speaker_kv_max_layers, speaker_kv_min_t, sequence_length, seed, max_chars_per_chunk, enable_crossfade, normalize_boundaries, target_duration_seconds).
  - max_chars_per_chunk (int, default 300): long prompts are split and synthesized chunk-by-chunk, then concatenated. Set to 0 to disable chunking.
  - enable_crossfade (bool, default true): apply cross-fading between audio chunks for smoother transitions (reduces clicks/pops at boundaries).
  - normalize_boundaries (bool, default true): normalize silences at chunk boundaries to reduce artifacts (adds consistent silence, removes trailing silence).
  - target_duration_seconds (float, default 10.0): target duration per chunk in seconds when audio-aware chunking is enabled.
- session_id (str, optional): used for the output filename; defaults to a UUID.
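A small builder can keep client payloads aligned with the request shape described here. The function name `build_request` is hypothetical; the field names and the outer "input" envelope follow this README's description and its curl examples.

```python
def build_request(text, speaker_voice=None, session_id=None, **parameters):
    """Assemble a RunPod payload for the handler; optional fields are omitted when unset."""
    if not text or not text.strip():
        raise ValueError("text must be a non-empty string")
    body = {"text": text}
    if speaker_voice is not None:
        body["speaker_voice"] = speaker_voice  # filename inside AUDIO_VOICES_DIR
    if session_id is not None:
        body["session_id"] = session_id
    if parameters:
        body["parameters"] = parameters  # sampler and chunking options
    return {"input": body}
```

For example, `build_request("Hello", speaker_voice="voice.mp3", num_steps=40, max_chars_per_chunk=0)` yields a payload with chunking disabled; the result can be passed as the `json=` argument of an HTTP client.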
Response
- status: completed or error.
- filename: generated audio filename (OGG/Opus).
- url: presigned URL for download.
- s3_key: object key in the bucket.
- metadata: sample_rate (24kHz), codec (opus), bitrate (128k), duration, seed.
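On the client side, the handler response arrives wrapped in RunPod's job envelope, with a top-level status of COMPLETED and the fields above under "output" (as in the truncated sample response in the client examples). The sketch below unwraps it; `extract_output` is a hypothetical helper, and the layout of error payloads is not specified in this README, so failures are reported verbatim.

```python
def extract_output(response: dict) -> dict:
    """Return the handler output from a RunPod job response, or raise on failure."""
    job_status = response.get("status")
    if job_status != "COMPLETED":
        raise RuntimeError(f"job not completed (status={job_status!r})")
    output = response.get("output") or {}
    if output.get("status") != "completed":
        # Error payload layout is an unknown here, so include it as-is.
        raise RuntimeError(f"handler reported an error: {output!r}")
    return output
```

After a successful call, `extract_output(r.json())["url"]` gives the presigned download URL and `["metadata"]` the audio details.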
Deploying to RunPod (critical settings)
- Build & push an amd64 image:
docker build --platform linux/amd64 -t <registry>/<repo>:echo-tts . && docker push <registry>/<repo>:echo-tts
- In the RunPod endpoint config:
  - Container Image: the pushed tag above
  - Container Disk: set to >= 30 GB (CUDA base image + deps)
  - Endpoint Type: Queue (serverless worker)
  - Command/Args: leave blank (uses CMD ["bash", "/opt/bootstrap.sh"])
  - GPU: any CUDA 12-compatible GPU (e.g., A10 or L4)
  - Env vars: HF_TOKEN, S3_ENDPOINT_URL, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_BUCKET_NAME, S3_REGION (default us-east-1), AUDIO_VOICES_DIR (default /runpod-volume/echo-tts/audio_voices), OUTPUT_AUDIO_DIR (default /runpod-volume/echo-tts/output_audio)
Client examples (RunPod API)
- Synchronous run with Bearer token:
ENDPOINT_ID=<your-endpoint-id>
RUNPOD_API_KEY=<your-runpod-api-key>
curl -X POST "https://api.runpod.ai/v2/${ENDPOINT_ID}/runsync" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -d '{
    "input": {
      "text": "Hello from Echo-TTS on RunPod.",
      "speaker_voice": "EARS p004 freeform.mp3",
      "parameters": {
        "num_steps": 40,
        "cfg_scale_text": 3.0,
        "cfg_scale_speaker": 8.0,
        "seed": 1234
      }
    }
  }'
# Response (truncated): {"id":"...","status":"COMPLETED","output":{"status":"completed","filename":"...","url":"...","s3_key":"...","metadata":{...}}}
- Async run + poll:
REQUEST_ID=$(curl -s -X POST "https://api.runpod.ai/v2/${ENDPOINT_ID}/run" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -d '{"input": {"text": "Async test"}}' | jq -r '.id')
curl -X POST "https://api.runpod.ai/v2/${ENDPOINT_ID}/status/${REQUEST_ID}" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}"
- Python snippet:
import os, requests
endpoint_id = os.environ["ENDPOINT_ID"]
api_key = os.environ["RUNPOD_API_KEY"]
url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
payload = {"input": {"text": "Python client call", "speaker_voice": None}}
r = requests.post(url, json=payload, headers={"Authorization": f"Bearer {api_key}"})
r.raise_for_status()
print(r.json()["output"]["url"])
Don't use this model to:
- Impersonate real people without their consent
- Generate deceptive audio (fraud, misinformation, deepfakes)
- Create harmful or inappropriate content
You are responsible for complying with local laws regarding biometric data and voice cloning.
- Code: MIT License (except autoencoder.py: Apache-2.0)
- Model Weights: CC-BY-NC-SA-4.0
- Audio Outputs: CC-BY-NC-SA-4.0 (due to Fish Speech dependency)
- Audio Prompts: See audio_prompts/LICENSE
- Fish Speech for the S1-DAC autoencoder
- TPU Research Cloud for compute support
- The Hugging Face community for model hosting
@misc{darefsky2025echo,
author = {Darefsky, Jordan},
title = {Echo-TTS: Multi-Speaker Text-to-Speech with Reference Conditioning},
year = {2025},
url = {https://jordandarefsky.com/blog/2025/echo/}
}