Kokoro Serverless

Deploy directly on Pod (Alternative, not serverless) - Pod Template

A production-ready RunPod serverless endpoint for Kokoro TTS - a high-quality, lightweight text-to-speech model with exceptional voice mixing and phoneme generation capabilities.

Features

Official Kokoro Model - v0.19 (82M parameters) high-fidelity TTS
GPU Optimized - Runs efficiently on RTX 3090, 4090, A4000, and L4 GPUs
Auto-scaling - Scales to 0 when idle to save costs
Network Volume Storage - Model cached persistently across all workers
Fast Cold Starts - Optimized Docker image with pre-installed dependencies
Advanced Audio - Voice mixing, word timestamps, and phoneme generation

Model Specifications

Model: hexgrad/Kokoro-82M
Recommended VRAM: 4GB+ (8GB recommended)
Precision: float32 / float16
Languages: English (American/British), French, Japanese, Chinese, and more
Voices: Multiple high-quality voices with mixing support (af_bella+af_sky)
License: Apache 2.0

API Usage

This serverless endpoint supports multiple operations through a single entry point. The operation is determined by the endpoint parameter in the input payload (defaults to standard TTS).

1. Standard Text-to-Speech

Endpoint: /v1/audio/speech (Default)

Generate high-quality audio from text.

{
  "input": {
    "model": "kokoro",
    "input": "Hello world! This is Kokoro running on serverless.",
    "voice": "af_bella",
    "response_format": "mp3",
    "speed": 1.0
  }
}

2. Voice Combination

Endpoint: /v1/audio/voices/combine

Mix two voices together to create a unique persona. Join voice IDs with +.

{
  "input": {
    "endpoint": "/v1/audio/voices/combine",
    "method": "POST",
    "voices": "af_bella+af_sky"
  }
}

Returns a base64-encoded .pt file containing the combined voice style.

3. Captioned Speech (Timestamps/SRT)

Endpoint: /dev/captioned_speech

Generate audio along with word-level timestamps, perfect for creating subtitles (SRT).

{
  "input": {
    "endpoint": "/dev/captioned_speech",
    "method": "POST",
    "input": "This text will have timestamps.",
    "voice": "af_bella",
    "return_timestamps": true,
    "response_format": "mp3"
  }
}

Returns audio (base64) and a timestamps array with start/end times for each word.

4. Text to Phonemes

Endpoint: /dev/phonemize

Convert text into its phonemic representation to see exactly how the model interprets pronunciation.

{
  "input": {
    "endpoint": "/dev/phonemize",
    "method": "POST",
    "text": "Hello world",
    "language": "a"
  }
}

Returns the phonemes string.

5. Generate from Phonemes

Endpoint: /dev/generate_from_phonemes

Generate audio directly from raw phonemes.

{
  "input": {
    "endpoint": "/dev/generate_from_phonemes",
    "method": "POST",
    "phonemes": "həˈloʊ wɜrld",
    "voice": "af_bella"
  }
}

6. List Available Voices

Endpoint: /v1/audio/voices

Get a list of all available voice profiles.

{
  "input": {
    "endpoint": "/v1/audio/voices",
    "method": "GET"
  }
}

Parameters Reference

Common Parameters (TTS)

Parameter	Type	Default	Description
`input`	string	required	The text to generate speech from.
`voice`	string	`af_bella`	Voice ID (e.g., `af_bella`, `af_sky`) or combination (`voice1+voice2`).
`speed`	float	`1.0`	Speed multiplier (0.5 to 2.0).
`response_format`	string	`mp3`	Audio format: `mp3`, `wav`, `pcm`.
`model`	string	`kokoro`	Model identifier.

Advanced Parameters

Parameter	Type	Default	Description
`endpoint`	string	`/v1/audio/speech`	The internal API route to target.
`method`	string	`POST`	HTTP method (`GET` or `POST`).
`return_timestamps`	boolean	`false`	For `/dev/captioned_speech`: include word-level timing data.
`language`	string	`a`	For `/dev/phonemize`: language code (`a` for American English).
`phonemes`	string	-	For `/dev/generate_from_phonemes`: raw phoneme string.

Output Format

All successful responses return a JSON object. Binary data (audio, files) is Base64 encoded.

Success Response:

{
  "success": true,
  "audio_base64": "SUQzBAAAAA...",  // Base64 encoded audio
  "text": "Hello world...",
  "voice": "af_bella",
  "size_bytes": 123456
}

Error Response:

{
  "success": false,
  "error": "Error description here"
}

Example Request (Python)

import requests
import base64

endpoint_id = "YOUR_ENDPOINT_ID"
api_key = "YOUR_API_KEY"

url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"

payload = {
    "input": {
        "input": "Hello from Kokoro Serverless!",
        "voice": "af_bella",
        "response_format": "mp3"
    }
}

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
result = response.json()

if "audio_base64" in result:
    audio_data = base64.b64decode(result["audio_base64"])
    with open("output.mp3", "wb") as f:
        f.write(audio_data)
    print("Audio saved to output.mp3")
else:
    print("Error:", result)

Example Request (cURL)

curl -X POST https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "input": "Hello from Kokoro Serverless!",
      "voice": "af_bella",
      "response_format": "mp3"
    }
  }'

Deployment Configuration

The template is configured with optimal settings in runpod.toml:

GPU Types: RTX 4000 Ada, L4, A4000, RTX 3090/4090, A100
Recommended VRAM: 8GB
Container Disk: 10GB (code + dependencies)
Network Volume: ~20GB (persistent model storage) - ⚠️ HIGHLY RECOMMENDED
Workers: 0-5 (auto-scaling)
Timeout: 300 seconds per job

⚠️ Important: Network Volume Recommended

We highly recommend attaching a network volume (~20GB) when deploying this endpoint.

Without a network volume:

⚠️ Models will be downloaded on every cold start (~30-60s delay)
⚠️ Higher bandwidth usage and slower scaling

The network volume:

✅ Stores the model persistently across all workers
✅ Enables instant cold starts after the first run
✅ Reduces bandwidth costs

Performance

Cold Start: ~30-60 seconds (first run), <5 seconds (cached)
Warm Inference: <1 second for typical sentences
Memory Usage: ~3-4GB VRAM
Throughput: High (lightweight model)

Tips for Best Results

Voice Mixing: Combine voices with + (e.g., af_bella+af_sky) to create unique personas.
Timestamps: Use the /dev/captioned_speech endpoint to get word-level timestamps for subtitles.
Phonemes: Use /dev/phonemize to check how text is processed before generation.
Speed: Adjust speed (0.5x to 2.0x) to control pacing without altering pitch.

License

This project is licensed under the Apache 2.0 License. The Kokoro model itself is Apache 2.0.

Credits

This project is a serverless wrapper around the excellent Kokoro-FastAPI by remsky. It uses the official ghcr.io/remsky/kokoro-fastapi-gpu Docker image as its base.

Support

For issues or questions about this RunPod template, please open an issue on the GitHub repository.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.runpod		.runpod
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
api_schema.json		api_schema.json
app.py		app.py
build.sh		build.sh
handler.py		handler.py
inference.py		inference.py
requirements.txt		requirements.txt
runpod.toml		runpod.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Kokoro Serverless

Features

Model Specifications

API Usage

1. Standard Text-to-Speech

2. Voice Combination

3. Captioned Speech (Timestamps/SRT)

4. Text to Phonemes

5. Generate from Phonemes

6. List Available Voices

Parameters Reference

Common Parameters (TTS)

Advanced Parameters

Output Format

Example Request (Python)

Example Request (cURL)

Deployment Configuration

⚠️ Important: Network Volume Recommended

Performance

Tips for Best Results

License

Credits

Support

About

Uh oh!

Releases 1

Packages

Languages

License

arkodeepsen/kokoro

Folders and files

Latest commit

History

Repository files navigation

Kokoro Serverless

Features

Model Specifications

API Usage

1. Standard Text-to-Speech

2. Voice Combination

3. Captioned Speech (Timestamps/SRT)

4. Text to Phonemes

5. Generate from Phonemes

6. List Available Voices

Parameters Reference

Common Parameters (TTS)

Advanced Parameters

Output Format

Example Request (Python)

Example Request (cURL)

Deployment Configuration

⚠️ Important: Network Volume Recommended

Performance

Tips for Best Results

License

Credits

Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages