A curated list of open-source Text-to-Speech (TTS), voice cloning, and music generation models. Within each section, models are listed roughly newest first by release date.
- Text-to-Speech (TTS) Models
- Music Generation Models
- Anything to Audio
- Audio Restoration & Enhancement
- Speech Recognition (ASR)
- Additional Resources
| Model | Voice Cloning | ASR | Languages | Streaming | License |
|---|---|---|---|---|---|
| LongCat-AudioDiT | ✅ | ❌ | Zh/En | ❌ | MIT |
| VoxCPM2 | ✅ | ❌ | 30 | ✅ | Apache-2.0 |
| MOSS-TTS-Nano | ✅ | ❌ | 20 | ✅ | Apache-2.0 |
| T5Gemma-TTS | ✅ | ❌ | En/Zh/Jp | ❌ | MIT |
| TinyTTS | ❌ | ❌ | En | ✅ | Apache-2.0 |
| LEMAS-TTS | ✅ | ❌ | 10 | ❌ | Apache-2.0 |
| OmniVoice | ✅ | ❌ | 600+ | ❌ | Apache-2.0 |
| LongCat-Next | ✅ | ✅ | Zh/En | ✅ | MIT |
| Voxtral-4B-TTS | ✅ | ❌ | 9 | ✅ | CC BY-NC 4.0 |
| Irodori-TTS-500M-v2 | ✅ | ❌ | Jp | ❌ | MIT |
| Fish Audio S2 Pro | ✅ | ❌ | 80+ | ✅ | Research License |
| KittenTTS | ✅ | ❌ | En+ | ✅ | Apache-2.0 |
| MOSS-TTS | ✅ | ❌ | 20 | ✅ | Apache-2.0 |
| SoulX-Singer | ✅ (Singing) | ❌ | Zh/En/Canto | ✅ | Apache-2.0 |
| SoproTTS | ✅ | ❌ | En | ✅ | Apache-2.0 |
| NeuTTS | ✅ | ❌ | En/Es/De/Fr | ✅ | Apache-2.0 |
| Qwen3-TTS | ✅ | ❌ | 10 | ✅ | Apache-2.0 |
| GLM-TTS | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
| VibeVoice-Realtime | ✅ | ❌ | Multi | ✅ | MIT |
| Fun-CosyVoice 3.0 | ✅ | ❌ | 9 + 18+ dialects | ✅ | Apache-2.0 |
| MioTTS-2.6B | ✅ | ❌ | En/Jp | ✅ | LFM |
| Supertonic 2 | ❌ | ❌ | 5 | ✅ | OpenRAIL-M |
| KugelAudio | ✅ | ❌ | 23 EU | ✅ | MIT |
| Kokoro-82M | ✅ | ❌ | 8 (54 voices) | ✅ | Apache-2.0 |
| KokoClone | ✅ | ❌ | 8 | ✅ | Apache-2.0 |
| IndexTTS2 | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
| Maya1 | ✅ | ❌ | En | ✅ | Apache-2.0 |
| LFM2-Audio-1.5B | ✅ | ✅ | En | ✅ | LFM |
| Step-Audio-EditX | ✅ | ❌ | Zh/En/Jp/Ko | ✅ | Apache-2.0 |
| FireRedTTS2 | ✅ | ❌ | 7 langs | ✅ | Apache-2.0 |
| VoxCPM | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
| LuxTTS | ✅ | ❌ | - | ✅ | Apache-2.0 |
| MegaTTS3 | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
| Spark-TTS | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
| Fish Speech | ✅ | ❌ | 8 langs | ✅ | Apache-2.0 |
| Step-Audio | ✅ | ✅ | Zh/En/Jp | ✅ | Apache-2.0 |
| SoulX-Podcast | ✅ | ❌ | Zh/En/Canto | ✅ | Apache-2.0 |
| Chatterbox | ✅ | ❌ | 23+ | ✅ | MIT |
| Orpheus-TTS | ✅ | ❌ | Multi | ✅ | Apache-2.0 |
| Dia | ✅ | ❌ | En | ✅ | Apache-2.0 |
| VieNeu-TTS | ✅ | ❌ | Vi | ✅ | Apache-2.0 |
| MiMo-Audio | ✅ | ✅ | Multi | ✅ | Apache-2.0 |
| Kimi-Audio | ✅ | ✅ | Multi | ✅ | MIT/Apache-2.0 |
| ZipVoice | ✅ | ❌ | Zh/En | ✅ | Apache-2.0 |
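To shortlist models from a table like the one above, the rows can be parsed and filtered programmatically. The following is a minimal sketch that uses a handful of rows copied from the table; the helper names `parse_markdown_table` and `pick` are illustrative and not part of any listed project.

```python
# A few rows copied from the comparison table above (not the full list).
TABLE = """\
| Model | Voice Cloning | ASR | Languages | Streaming | License |
|---|---|---|---|---|---|
| LongCat-AudioDiT | ✅ | ❌ | Zh/En | ❌ | MIT |
| VoxCPM2 | ✅ | ❌ | 30 | ✅ | Apache-2.0 |
| TinyTTS | ❌ | ❌ | En | ✅ | Apache-2.0 |
| Voxtral-4B-TTS | ✅ | ❌ | 9 | ✅ | CC BY-NC 4.0 |
| Chatterbox | ✅ | ❌ | 23+ | ✅ | MIT |
"""


def parse_markdown_table(text):
    """Turn a pipe-delimited markdown table into a list of dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator
        cells = [c.strip() for c in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def pick(rows, licenses, need_cloning=True, need_streaming=True):
    """Keep models with an allowed license and the requested ✅ flags."""
    selected = []
    for row in rows:
        if row["License"] not in licenses:
            continue
        if need_cloning and not row["Voice Cloning"].startswith("✅"):
            continue
        if need_streaming and not row["Streaming"].startswith("✅"):
            continue
        selected.append(row["Model"])
    return selected


rows = parse_markdown_table(TABLE)
print(pick(rows, {"MIT", "Apache-2.0"}))
```

Matching on `startswith("✅")` keeps annotated cells such as `✅ (Singing)` in the result set.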
LongCat-AudioDiT
Description: State-of-the-art diffusion-based TTS model operating directly in waveform latent space. Developed by Meituan's LongCat team, it requires only a Waveform VAE and Diffusion backbone, effectively mitigating compounding errors.
Release Date: March 30, 2026
| Feature | Value |
|---|---|
| Parameters | 1B / 3.5B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | Chinese, English |
| Streaming | ❌ |
| Sample Rate | 24 kHz |
| License | MIT |
Key Innovation: Adaptive Projection Guidance (APG) replaces traditional classifier-free guidance to improve generation quality. Reported to outperform Seed-TTS on zero-shot voice cloning benchmarks.
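The entry names the guidance technique without giving its form. As a rough illustration only, and assuming it follows the projected-guidance idea from the diffusion literature (split the guidance update into components parallel and orthogonal to the conditional prediction, then down-weight the parallel one), the difference from classifier-free guidance can be sketched as below. This is not taken from the LongCat-AudioDiT code; the function names and the `eta` parameter are illustrative, and practical variants add further steps that are omitted here.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond, uncond, scale):
    """Classifier-free guidance: cond + (scale - 1) * (cond - uncond)."""
    return [c + (scale - 1.0) * (c - u) for c, u in zip(cond, uncond)]


def projected_guidance(cond, uncond, scale, eta=0.0):
    """Guidance that keeps only a fraction `eta` of the update parallel to `cond`."""
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = dot(cond, cond)
    coeff = dot(diff, cond) / norm_sq if norm_sq > 0 else 0.0
    parallel = [coeff * c for c in cond]              # component along cond
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


cond = [1.0, 2.0, 0.5]      # toy conditional prediction
uncond = [0.5, 1.0, 1.0]    # toy unconditional prediction
print(cfg(cond, uncond, 3.0))
print(projected_guidance(cond, uncond, 3.0, eta=0.0))
```

With `eta=1.0` the function reduces to plain classifier-free guidance; with `eta=0.0` the update is orthogonal to the conditional prediction.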
VoxCPM2
Description: OpenBMB's next-generation tokenizer-free diffusion autoregressive TTS model with 2 billion parameters. Supports 30 languages with automatic detection, voice design from text descriptions, and high-fidelity voice cloning.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 2B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ (Voice Design) |
| Languages | 30 (+ 9 Chinese dialects) |
| Streaming | ✅ (RTF ~0.3) |
| Audio Output | 48 kHz |
| License | Apache-2.0 |
Key Innovation: Tokenizer-free design with a LocEnc → TSLM → RALM → LocDiT pipeline. Built-in super-resolution via AudioVAE V2 for 48 kHz output.
MOSS-TTS-Nano
Description: Ultra-lightweight open-source multilingual speech generation model with only 0.1B parameters. Designed for realtime speech generation that runs directly on CPU without GPU.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 0.1B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | 20 |
| Streaming | ✅ (CPU-friendly) |
| Audio Output | 48 kHz Stereo |
| License | Apache-2.0 |
Key Innovation: Pure autoregressive architecture with MOSS-Audio-Tokenizer-Nano. Compresses audio to 12.5 Hz token stream using RVQ with 16 codebooks. Runs on 4-core CPU.
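The tokenizer figures above translate directly into sequence lengths. A small sketch of the arithmetic, using only the 12.5 Hz frame rate and 16 codebooks stated in the entry (the function names are illustrative, and how the model actually arranges codebooks during decoding is not specified here):

```python
FRAME_RATE_HZ = 12.5   # token frames per second, from the entry above
NUM_CODEBOOKS = 16     # RVQ codebooks per frame, from the entry above


def frames_for(seconds):
    """Number of token frames needed for a clip of the given length."""
    return int(seconds * FRAME_RATE_HZ)


def codes_for(seconds):
    """Total discrete codes across all codebooks for the clip."""
    return frames_for(seconds) * NUM_CODEBOOKS


def samples_per_frame(sample_rate_hz):
    """Audio samples (per channel) represented by one token frame."""
    return sample_rate_hz / FRAME_RATE_HZ


print(frames_for(10))              # 125
print(codes_for(10))               # 2000
print(samples_per_frame(48_000))   # 3840.0
```

At these settings, ten seconds of audio is 125 frames, or 2,000 codes in total.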
T5Gemma-TTS
Description: Multilingual TTS model with voice cloning and duration control, built on the T5Gemma encoder-decoder LLM architecture. Supports batch generation for multiple audio variations.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 2B-2B (encoder-decoder) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ❌ |
| Languages | English, Chinese, Japanese |
| Streaming | ❌ |
| VRAM | 7.6-10.6 GB |
| License | MIT |
Key Innovation: PM-RoPE positional encoding with XCodec2 audio codec. Low-VRAM options with CPU offloading. Batch inference efficiency with single encoder pass.
TinyTTS
Description: The smallest English TTS model with only 1.6 million parameters. End-to-end neural network achieving ~53x real-time synthesis speed on CPU via ONNX optimization.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 1.6M |
| Zero-shot Voice Cloning | ❌ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ❌ |
| Languages | English |
| Streaming | ✅ (~53x real time) |
| Model Size | ~3.4 MB (ONNX FP16) |
| License | Apache-2.0 |
Key Innovation: Ultra-compact architecture optimized for CPU-only deployment. Multi-platform support via Python and Node.js APIs. Works on laptops, edge devices, and embedded systems.
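The listed model size is close to what the parameter count implies. A back-of-the-envelope sketch, assuming size is dominated by raw weights (the `weight_size_mb` helper is illustrative; real ONNX files also carry graph metadata, which may account for the small gap to the ~3.4 MB listed):

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_size_mb(num_params, dtype="fp16"):
    """Raw weight storage in megabytes (10**6 bytes), excluding file overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


tiny_tts_params = 1_600_000  # 1.6M parameters, from the table above
print(f"{weight_size_mb(tiny_tts_params, 'fp16'):.1f} MB")  # 3.2 MB
print(f"{weight_size_mb(tiny_tts_params, 'fp32'):.1f} MB")  # 6.4 MB
```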
LEMAS-TTS
Description: Part of the LEMAS (Large-scale Extensible Multilingual Audio Suite) project. Zero-shot multilingual TTS with 0.3B parameters, supporting 10 languages and precise word-level editing.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 0.3B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | 10 (zh/en/de/fr/es/pt/it/ru/id/vi) |
| Streaming | ❌ |
| Special Feature | Word-level editing (LEMAS-Edit) |
| License | Apache-2.0 |
Key Innovation: Built on 150,000+ hours of multilingual speech data with word-level timestamps. Includes LEMAS-Edit for precise word-level speech editing via masked token infilling.
OmniVoice
Description: Massively multilingual zero-shot TTS model scaling to 600+ languages. Uses a discrete, non-autoregressive architecture in the style of diffusion language models, with single-stage text-to-acoustic mapping.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ (Pinyin/CMU) |
| Emotion Control | ✅ (Voice Design) |
| Languages | 600+ |
| Streaming | ❌ |
| Training Data | 581k hours |
| License | Apache-2.0 |
Key Innovation: Simplified single-stage architecture vs conventional two-stage pipelines. Full-codebook random masking strategy with LLM initialization for superior intelligibility. Noise-robust prompt processing.
LongCat-Next
Description: Native multimodal foundation model by Meituan LongCat Team processing text, vision, and audio under a single autoregressive objective. Industrial-strength model with strong speech synthesis and voice cloning.
Release Date: March 2026
| Feature | Value |
|---|---|
| Parameters | 3B (MoE A3B) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ✅ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Chinese, English |
| Streaming | ✅ (Low latency) |
| Audio Output | 24 kHz |
| License | MIT |
Key Innovation: Discrete Native Autoregression Paradigm (DiNA) unifying modalities in shared discrete token space. Combines visual understanding, generation, and audio processing in single model.
Voxtral-4B-TTS
Description: Frontier, open-weights text-to-speech model developed by Mistral AI. Designed to be fast and instantly adaptable, it produces lifelike speech with natural prosody and emotional range.
Release Date: March 2026
| Feature | Value |
|---|---|
| Parameters | 4B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ✅ (expressive speech) |
| Languages | 9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi) |
| Streaming | ✅ (RTF 0.103 at concurrency 1) |
| Audio Output | 24 kHz |
| License | CC BY-NC 4.0 |
Irodori-TTS-500M-v2
Description: Japanese Text-to-Speech model based on Rectified Flow Diffusion Transformer. Features emoji-based style and sound effect control by embedding emojis in input text for expressive speech generation.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 500M |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ✅ (emoji-based) |
| Languages | Japanese |
| Streaming | ❌ |
| Output Quality | 48 kHz waveform |
| License | MIT |
Key Feature: Emoji annotation control - insert specific emojis into text to control speaking styles, emotions, and sound effects.
Fish Audio S2 Pro
Description: Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. It combines reinforcement learning alignment with a dual-autoregressive architecture for high-quality speech synthesis.
Release Date: March 10, 2026
| Feature | Value |
|---|---|
| Parameters | 5B (4B Slow AR + 400M Fast AR) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ (15,000+ tags) |
| Emotion Control | ✅ (fine-grained inline control) |
| Languages | 80+ (Tier 1: En, Zh, Jp) |
| Streaming | ✅ (RTF 0.195, 100ms TTFA) |
| Model Size | ~10 GB (BF16) |
| License | Fish Audio Research License |
KittenTTS
Description: KittenTTS is an open-source realistic text-to-speech model designed for lightweight deployment. It is a state-of-the-art TTS model under 25MB with just 15 million parameters, running without GPU on any device.
Release Date: February 24, 2026 (v0.8.1)
| Feature | Value |
|---|---|
| Parameters | 15M-80M |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ✅ |
| Languages | English, Multiple |
| Streaming | ✅ |
| License | Apache-2.0 |
MOSS-TTS
Description: MOSS-TTS is a production-grade Text-to-Speech foundation model developed by the OpenMOSS Team and MOSI.AI. Achieves state-of-the-art results on the Seed-TTS-eval benchmark and supports zero-shot voice cloning.
Release Date: February 10, 2026
| Feature | Value |
|---|---|
| Parameters | 8B (Delay), 1.7B (Local) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ (Pinyin/Phoneme-level) |
| Emotion Control | ✅ |
| Languages | 20 languages |
| Streaming | ✅ |
| Max Duration | 1 hour |
| License | Apache-2.0 |
SoulX-Singer
Description: SoulX-Singer is a high-fidelity, zero-shot singing voice synthesis model for generating realistic singing voices for unseen singers without fine-tuning.
Release Date: February 6, 2026
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ (Singing) |
| ASR | ❌ |
| Pronunciation Control | ✅ (MIDI/F0) |
| Emotion Control | ✅ |
| Languages | Mandarin, English, Cantonese |
| Streaming | ✅ |
| License | Apache-2.0 |
SoproTTS
Description: SoproTTS is a lightweight English text-to-speech model with zero-shot voice cloning. It uses dilated convolutions (WaveNet-style) and lightweight cross-attention layers instead of the common Transformer architecture.
Release Date: February 4, 2026 (v1.5)
| Feature | Value |
|---|---|
| Parameters | 135M |
| Zero-shot Voice Cloning | ✅ (3-12s) |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ✅ (style_strength) |
| Languages | English |
| Streaming | ✅ (250ms TTFA) |
| RTF | 0.05 (CPU M3) |
| Training Cost | ~$100 |
| License | Apache-2.0 |
NeuTTS
Description: NeuTTS is a collection of open-source on-device TTS models with instant voice cloning. Built on LLM backbones, with GGUF quantizations for efficient on-device deployment.
Release Date: Early 2026
| Feature | Value |
|---|---|
| Parameters | 360M (Air), 120M (Nano) |
| Zero-shot Voice Cloning | ✅ (3-second) |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | English, Spanish, German, French |
| Streaming | ✅ |
| On-Device | ✅ (GGUF quantizations) |
| License | Apache-2.0 (Air), NeuTTS Open License 1.0 (Nano) |
Qwen3-TTS
Description: Qwen3-TTS is an open-source series of Text-to-Speech models developed by Alibaba Cloud. Supports stable, expressive, and streaming speech generation with free-form voice design.
Release Date: January 22, 2026
| Feature | Value |
|---|---|
| Parameters | 0.6B-1.7B |
| Zero-shot Voice Cloning | ✅ (3-second) |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
| Streaming | ✅ (97ms latency) |
| License | Apache-2.0 |
GLM-TTS
Description: High-quality TTS synthesis system based on LLMs from ZhipuAI, supporting zero-shot voice cloning with Multi-Reward Reinforcement Learning.
Release Date: December 11, 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ (3-10s) |
| ASR | ❌ |
| Pronunciation Control | ✅ (Phoneme-level) |
| Emotion Control | ✅ (RL-enhanced) |
| Languages | Chinese, English |
| Streaming | ✅ |
| License | Apache-2.0 |
VibeVoice-Realtime
Description: Real-time TTS model from Microsoft with streaming text input and ultra-low latency (~300ms).
Release Date: December 3, 2025
| Feature | Value |
|---|---|
| Parameters | 0.5B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Multilingual |
| Streaming | ✅ (300ms) |
| Max Duration | ~10 minutes |
| License | MIT |
Fun-CosyVoice 3.0
Description: Advanced TTS system based on LLMs for zero-shot multilingual speech synthesis from FunAudioLLM.
Release Date: December 2025
| Feature | Value |
|---|---|
| Parameters | 0.5B |
| Zero-shot Voice Cloning | ✅ (Multi-lingual/Cross-lingual) |
| ASR | ❌ |
| Pronunciation Control | ✅ (Pinyin/CMU) |
| Emotion Control | ✅ |
| Languages | 9 + 18+ Chinese dialects |
| Streaming | ✅ (150ms) |
| License | Apache-2.0 |
MioTTS-2.6B
Description: Lightweight, high-speed LLM-based TTS model for English and Japanese with minimal resource usage.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 2.6B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | English, Japanese |
| Streaming | ✅ |
| RTF | 0.135-0.145 |
| License | LFM Open License |
Supertonic 2
Description: Lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on-device—no cloud, no API calls, no privacy concerns. Outperforms ElevenLabs Flash v2.5 by up to 42× in speed benchmarks.
Release Date: 2026
| Feature | Value |
|---|---|
| Parameters | 66M |
| Zero-shot Voice Cloning | ❌ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | English, Korean, Spanish, Portuguese, French |
| Streaming | ✅ |
| RTF | 0.001-0.015 (up to 167× realtime) |
| On-Device | ✅ (ONNX Runtime) |
| License | OpenRAIL-M |
Performance Comparison:
| System | Speed (chars/sec) | RTF |
|---|---|---|
| Supertonic 2 (RTX 4090) | 12,164 | 0.001 |
| Supertonic 2 (M4 Pro CPU) | 1,263 | 0.012 |
| ElevenLabs Flash v2.5 | 287 | 0.5 |
| Kokoro (Open-source) | 117 | 1.3 |
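The two columns in this table measure different things: characters per second is throughput, while RTF (real-time factor) is processing time divided by audio duration, so lower is faster and `1 / RTF` gives the "x real time" multiple. A short sketch of both conversions, using only the numbers copied from the table (the function names are illustrative):

```python
# (system, chars/sec, RTF) copied from the table above.
ROWS = [
    ("Supertonic 2 (RTX 4090)", 12164, 0.001),
    ("Supertonic 2 (M4 Pro CPU)", 1263, 0.012),
    ("ElevenLabs Flash v2.5", 287, 0.5),
    ("Kokoro (Open-source)", 117, 1.3),
]


def speedup(chars_per_sec, baseline_chars_per_sec):
    """Throughput ratio relative to a baseline system."""
    return chars_per_sec / baseline_chars_per_sec


def realtime_multiple(rtf):
    """RTF = processing time / audio duration, so 1 / RTF is the x-real-time speed."""
    return 1.0 / rtf


baseline = {name: cps for name, cps, _ in ROWS}["ElevenLabs Flash v2.5"]
for name, cps, rtf in ROWS:
    print(
        f"{name}: {speedup(cps, baseline):.1f}x baseline throughput, "
        f"{realtime_multiple(rtf):.1f}x real time"
    )
```

The first row works out to about 42.4x the ElevenLabs throughput, which matches the "up to 42×" figure in the description. By the same conversion, 167× real time corresponds to an RTF of about 0.006.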
KugelAudio
Description: Open-source TTS for European languages with 7B parameters. Outperformed ElevenLabs in human preference testing.
Release Date: Early 2026
| Feature | Value |
|---|---|
| Parameters | 7B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ (Speaking styles) |
| Languages | 23 European languages |
| Streaming | ✅ |
| License | MIT |
Kokoro-82M
Description: Kokoro is an open-weight Text-to-Speech model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
Release Date: January 27, 2025 (v1.0)
| Feature | Value |
|---|---|
| Parameters | 82M |
| Architecture | StyleTTS 2, ISTFTNet |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ (via misaki G2P) |
| Emotion Control | ✅ (voice styles) |
| Languages | 8 (54 voices) |
| Streaming | ✅ (generator pattern) |
| Cost | <$0.06 per hour of audio |
| License | Apache-2.0 |
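"Generator pattern" streaming means the caller receives audio chunk by chunk instead of waiting for the whole utterance. The sketch below shows the general shape with a stand-in synthesizer; `fake_synthesize`, `stream_speech`, and the samples-per-character constant are hypothetical and are not Kokoro's API.

```python
import re

SAMPLES_PER_CHAR = 1_440  # made-up rate: 60 ms per character at 24 kHz


def fake_synthesize(sentence):
    """Hypothetical stand-in for a TTS call: returns silence sized to the text."""
    return [0.0] * (SAMPLES_PER_CHAR * len(sentence))


def stream_speech(text, synthesize=fake_synthesize):
    """Yield (sentence, audio) pairs one sentence at a time."""
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield sentence, synthesize(sentence)


for sentence, audio in stream_speech("Hello there. This is a streaming test!"):
    # Playback or file writing could start here, before later chunks exist.
    print(f"{len(audio):6d} samples <- {sentence!r}")
```

Because the function is a generator, the first chunk is available as soon as the first sentence is synthesized, which is what keeps time-to-first-audio low for long inputs.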
KokoClone
Description: KokoClone is a fast, real-time compatible multilingual voice cloning system built on top of Kokoro-ONNX. It enables users to type text in multiple languages, provide a short 3-10 second reference audio clip, and instantly generate speech in that same voice.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | 82M (Base: Kokoro-ONNX) |
| Zero-shot Voice Cloning | ✅ (3-10s reference) |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ✅ |
| Languages | 8 (En, Hi, Fr, Ja, Zh, It, Pt, Es) |
| Streaming | ✅ (CPU real-time) |
| License | Apache-2.0 |
IndexTTS2
Description: AI-Enhanced Text-to-Speech System with Intelligent Optimization and self-learning capabilities.
Release Date: November 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ (5 emotions) |
| Languages | Chinese, English |
| Streaming | ✅ |
| Multi-speaker | ✅ (1-4 speakers) |
| License | Apache-2.0 |
Maya1
Description: State-of-the-art speech model for expressive voice generation with natural language voice control.
Release Date: November 2025
| Feature | Value |
|---|---|
| Parameters | 3B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ (Tags) |
| Languages | English (Multi-accent) |
| Streaming | ✅ (<100ms) |
| License | Apache-2.0 |
LFM2-Audio-1.5B
Description: Liquid AI's first end-to-end audio foundation model with low latency and real-time conversation.
Release Date: November 28, 2025
| Feature | Value |
|---|---|
| Parameters | 1.5B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ✅ (Integrated) |
| Pronunciation Control | N/A |
| Emotion Control | ✅ |
| Languages | English |
| Streaming | ✅ |
| License | LFM Open License |
Step-Audio-EditX
Description: 3B-parameter LLM-based RL audio model specialized in expressive and iterative audio editing.
Release Date: November 2025
| Feature | Value |
|---|---|
| Parameters | 3B (4B BF16) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ (Polyphone) |
| Emotion Control | ✅ (14 emotions) |
| Languages | Mandarin, English, Sichuanese, Cantonese, Japanese, Korean |
| Streaming | ✅ |
| License | Apache-2.0 |
FireRedTTS2
Description: Long-form streaming TTS system for multi-speaker dialogue generation with stable, natural speech.
Release Date: September 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | EN, ZH, JP, KO, FR, DE, RU |
| Streaming | ✅ (140ms) |
| Multi-speaker | ✅ (4 speakers) |
| Max Duration | 3 minutes |
| License | Apache-2.0 |
VoxCPM
Description: Tokenizer-free TTS system for context-aware speech generation and true-to-life voice cloning.
Release Date: September 16, 2025
| Feature | Value |
|---|---|
| Parameters | 640M-800M |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Chinese, English |
| Streaming | ✅ (RTF 0.17) |
| License | Apache-2.0 |
LuxTTS
Description: Lightweight ZipVoice-based TTS model for high-quality voice cloning at speeds exceeding 150x real time.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ❌ |
| Emotion Control | ❌ |
| Languages | - |
| Streaming | ✅ |
| Speed | 150x real time |
| VRAM | 1 GB |
| License | Apache-2.0 |
MegaTTS3
Description: Advanced zero-shot speech synthesis with Sparse Alignment Enhanced Latent Diffusion Transformer.
Release Date: March 22, 2025
| Feature | Value |
|---|---|
| Parameters | 0.45B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Chinese, English |
| Streaming | ✅ |
| License | Apache-2.0 |
Spark-TTS
Description: Efficient LLM-Based TTS Model with Single-Stream Decoupled Speech Tokens, built on Qwen2.5.
Release Date: March 2025
| Feature | Value |
|---|---|
| Parameters | 0.5B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Chinese, English |
| Streaming | ✅ |
| License | Apache-2.0 |
Fish Speech
Description: State-of-the-art open source TTS and voice cloning model that generates natural, realistic, and emotionally rich speech.
Release Date: May 31, 2025 (v1.5.1)
| Feature | Value |
|---|---|
| Parameters | 4B (S1), 0.5B (S1-mini) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | 8 (EN, JP, KO, ZH, FR, DE, AR, ES) |
| Streaming | ✅ |
| RTF | ~1:7 |
| License | Apache-2.0 |
Step-Audio
Description: Production-ready open-source framework for intelligent speech interaction with unified speech comprehension and generation.
Release Date: February 17, 2025
| Feature | Value |
|---|---|
| Parameters | 130B (Chat), 3B (TTS) |
| Zero-shot Voice Cloning | ✅ |
| ASR | ✅ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Chinese, English, Japanese |
| Streaming | ✅ |
| License | Apache-2.0 |
Audio Flamingo 3 (AF3) / Audio Flamingo Next
Description: NVIDIA ADLR's fully open-source Large Audio Language Model with state-of-the-art audio understanding. Audio Flamingo Next (AF-Next) is the latest generation featuring stronger general audio understanding, longer context support, and timestamp-grounded reasoning.
Release Date: July 2025 (AF3), 2026 (AF-Next)
| Feature | Value |
|---|---|
| Parameters | 7B |
| Zero-shot Voice Cloning | ❌ |
| ASR | ✅ |
| Pronunciation Control | N/A |
| Emotion Control | ✅ |
| Languages | Multi-lingual |
| Streaming | ✅ |
| Context | Up to 30 minutes |
| License | Apache-2.0 |
Key Innovation (AF-Next): Staged curriculum training with GRPO-based RL post-training. Three specialized checkpoints: Instruct, Think (reasoning), and Captioner. Temporal Audio Chain-of-Thought grounding intermediate reasoning to timestamps.
SoulX-Podcast
Description: SOTA Multi-Speaker TTS model for generating realistic long-form podcasts with dialectal diversity.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Mandarin, English, Cantonese, Sichuanese, Henanese |
| Streaming | ✅ |
| Max Duration | 90+ minutes |
| License | Apache-2.0 |
Chatterbox
Description: Family of SOTA open-source TTS models by Resemble AI with zero-shot voice cloning and multilingual synthesis.
Release Date: June 13, 2025 (v0.1.2)
| Feature | Value |
|---|---|
| Parameters | 350M-500M |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ (Tags) |
| Languages | 23+ |
| Streaming | ✅ |
| License | MIT |
Orpheus-TTS
Description: SOTA open-source TTS built on a Llama-3B backbone, demonstrating emergent capabilities of LLMs for speech synthesis.
Release Date: April 2025
| Feature | Value |
|---|---|
| Parameters | 3B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | Multilingual |
| Streaming | ✅ (200ms) |
| License | Apache-2.0 |
Dia
Description: 1.6B parameter TTS model by Nari Labs for generating ultra-realistic dialogue in one pass.
Release Date: June 27, 2024
| Feature | Value |
|---|---|
| Parameters | 1.6B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ✅ |
| Languages | English |
| Streaming | ✅ |
| License | Apache-2.0 |
VieNeu-TTS
Description: Advanced on-device Vietnamese TTS model with instant voice cloning from 3-5 seconds of reference audio.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | 0.3B-0.6B |
| Zero-shot Voice Cloning | ✅ (3-5s) |
| ASR | ❌ |
| Pronunciation Control | ✅ |
| Emotion Control | ❌ |
| Languages | Vietnamese |
| Streaming | ✅ (On-device) |
| License | Apache-2.0 |
MiMo-Audio
Description: Audio Language Model by Xiaomi functioning as a Few-Shot Learner with SOTA audio understanding.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | 7B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ✅ |
| Pronunciation Control | N/A |
| Emotion Control | ✅ |
| Languages | Multi-lingual |
| Streaming | ✅ |
| License | Apache-2.0 |
Kimi-Audio
Description: Open-source audio foundation model by Moonshot AI for audio understanding, generation, and conversation.
Release Date: 2024
| Feature | Value |
|---|---|
| Parameters | 7B |
| Zero-shot Voice Cloning | ✅ |
| ASR | ✅ |
| Pronunciation Control | N/A |
| Emotion Control | ✅ |
| Languages | Multi-lingual |
| Streaming | ✅ |
| License | MIT/Apache-2.0 |
ZipVoice
Description: Fast and high-quality zero-shot TTS models based on flow matching.
Release Date: June 16, 2025
| Feature | Value |
|---|---|
| Parameters | 123M |
| Zero-shot Voice Cloning | ✅ |
| Languages | Chinese, English |
| Dialogue | ✅ |
| License | Apache-2.0 |
| Model | Music Gen | Languages | Streaming | License |
|---|---|---|---|---|
| ACE-Step 1.5 | ✅ | 50+ | ✅ | MIT |
| LeVo 2 | ✅ | Zh/En | ❌ | Apache-2.0 |
| Foundation-1 | ✅ (Samples) | - | ❌ | Stability AI Community |
| Music Flamingo | ❌ | - | - | Apache-2.0 |
| Magenta Realtime | ✅ | - | ✅ | Apache-2.0/CC-BY-4.0 |
| Uni-MoE (Audio) | ✅ | - | ✅ | Apache-2.0 |
ACE-Step 1.5
Description: Local music generation model that, according to its authors, outperforms most commercial alternatives. Supports Mac, AMD, Intel, and CUDA devices.
Release Date: February 20, 2026 (v0.1.2)
| Feature | Value |
|---|---|
| Parameters | 0.6B-4B (LM), DiT variants |
| Music Generation | ✅ |
| Lyrics Support | ✅ (50+ languages) |
| Voice2BGM | ✅ |
| Reference Audio | ✅ |
| Track Separation | ✅ |
| Duration | 10s - 10min |
| VRAM | <4 GB |
| Platforms | CUDA, MPS, ROCm, XPU, CPU |
| License | MIT |
LeVo 2
Description: Open-source foundation model for commercial-grade music generation by Tencent AI Lab. It outperforms open-source baselines and rivals commercial systems in Overall Quality, Melody, Arrangement, Sound Quality, and Structure.
Release Date: 2025
| Feature | Value |
|---|---|
| Architecture | Hybrid LLM-Diffusion |
| Music Generation | ✅ |
| Lyrics Support | ✅ (Chinese, English) |
| Multilingual | ✅ (Zh, En) |
| Text/Audio Prompts | ✅ |
| VRAM | 12-22 GB |
| License | Apache-2.0 |
Foundation-1
Description: Structured text-to-sample generation model for music production workflows. Generates tempo-synced, key-aware, bar-aware samples with support for instrument identity, timbre control, and FX processing.
Release Date: 2025
| Feature | Value |
|---|---|
| Type | Text-to-Sample (Music) |
| Base Model | stabilityai/stable-audio-open-1.0 |
| Instrument Control | ✅ |
| Timbre Descriptors | ✅ (Warm, Bright, etc.) |
| FX Tags | ✅ (Reverb, Delay, etc.) |
| Musical Notation | ✅ (Chord, Melody, Arp) |
| VRAM | ~8 GB |
| License | Stability AI Community License |
Music Flamingo
Description: Large audio-language model designed to advance music (including song) understanding. Achieves SOTA on 10+ music benchmarks.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Music Understanding | ✅ |
| Music Generation | ❌ |
| Rich Captions | ✅ |
| Music QA | ✅ |
| Reasoning | ✅ (Chain-of-thought) |
| Long-form | ✅ |
| License | Apache-2.0 |
Magenta Realtime
Description: Open music generation model from Google DeepMind enabling continuous generation of musical audio steered by text prompts or audio examples.
Release Date: August 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Music Generation | ✅ (Real-time) |
| Text-to-Music | ✅ |
| Audio-to-Music | ✅ |
| Reference Audio | ✅ |
| Continuous Generation | ✅ |
| Latency | Style prompt 2s+ |
| Context | 10 seconds |
| Training Data | ~190k hours |
| License | Apache-2.0 (code), CC-BY-4.0 (model) |
SoulX-Singer
(Already listed in TTS - singing voice synthesis)
| Feature | Value |
|---|---|
| Parameters | - |
| Singing Generation | ✅ |
| Zero-shot | ✅ |
| Melody Control | ✅ (F0/MIDI) |
| Languages | Mandarin, English, Cantonese |
| License | Apache-2.0 |
Uni-MoE (Audio)
Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).
Release Date: October 16, 2025 (Uni-MoE-Audio)
| Feature | Value |
|---|---|
| Parameters | - |
| Voice Cloning | ✅ |
| TTS | ✅ |
| Text-to-Music | ✅ |
| Video-to-Music | ✅ |
| Dynamic Routing | ✅ |
| License | Apache-2.0 |
Models that can generate audio from multiple input modalities (video, text, image, audio). These are unified frameworks for multimodal audio synthesis.
| Model | Text | Video | Image | Audio | License |
|---|---|---|---|---|---|
| Woosh | ✅ | ✅ | ❌ | ✅ | Apache-2.0 |
| PrismAudio | ❌ | ✅ | ❌ | ❌ | Apache-2.0 |
| ThinkSound | ✅ | ✅ | ❌ | ✅ | Apache-2.0 |
| HunyuanVideo-Foley | ✅ | ✅ | ❌ | ❌ | Research Only |
| MMAudio | ✅ | ✅ | ✅ | ❌ | Apache-2.0 |
| AudioX | ✅ | ✅ | ✅ | ✅ | Apache-2.0 |
| Uni-MoE (Audio) | ✅ | ✅ | ❌ | ✅ | Apache-2.0 |
AudioX / Audio-Omni
Description: Audio-Omni is the first end-to-end framework unifying understanding, generation, and editing across general sound, music, and speech domains. Presented at SIGGRAPH 2026. AudioX is a unified framework integrating text, video, image, and audio conditions.
Release Date: March 2025 (AudioX), 2026 (Audio-Omni)
| Feature | Value |
|---|---|
| Parameters | - |
| Text-to-Audio | ✅ |
| Text-to-Music | ✅ |
| Text-to-Speech | ✅ |
| Video-to-Audio/Music | ✅ |
| Audio Editing | ✅ (Add/Remove/Extract/Style) |
| Voice Conversion | ✅ |
| License | Apache-2.0 / CC-BY-NC-4.0 |
Key Innovation: First unified framework covering all three audio domains. Combines frozen multimodal LLM (Qwen2.5-Omni) with trainable Diffusion Transformer for high-fidelity synthesis. Any-to-any audio processing.
MMAudio
Description: Multimodal joint training framework for high-quality synchronized audio generation from video and/or text inputs. State-of-the-art open source model for generating sounds for videos, images, and text prompts.
Release Date: December 2024 (CVPR 2025)
| Feature | Value |
|---|---|
| Parameters | - |
| Video-to-Audio | ✅ |
| Text-to-Audio | ✅ |
| Image-to-Audio | ✅ |
| Synchronized Audio | ✅ |
| Multimodal Joint Training | ✅ |
| License | Apache-2.0 |
HunyuanVideo-Foley
Description: Tencent's end-to-end video sound effect generation model for professional-grade AI Foley sound generation. Analyzes footage and creates immersive audio that matches the visual content perfectly.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Video-to-Audio (Foley) | ✅ |
| Text-to-Audio | ✅ |
| High-Quality Foley | ✅ |
| Context-Aware | ✅ |
| Output Quality | 48 kHz |
| License | Research & Non-commercial only |
ThinkSound
Description: Unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning. Supports generating or editing audio from video, text, audio, or their combinations. Accepted to NeurIPS 2025.
Release Date: 2025
| Feature | Value |
|---|---|
| Parameters | - |
| Video-to-Audio | ✅ (SOTA) |
| Text-to-Audio | ✅ |
| Audio-to-Audio | ✅ |
| Audio Editing | ✅ |
| CoT-Driven Reasoning | ✅ |
| Interactive Object-centric Editing | ✅ |
| License | Apache-2.0 (Research only) |
Woosh
Description: Sony AI's sound effect foundation model for text-to-audio and video-to-audio generation. Includes Woosh-AE (audio encoder/decoder), Woosh-Flow/DFlow (T2A), and Woosh-VFlow/DVFlow (V2A) with distilled fast inference variants.
Release Date: 2026
| Feature | Value |
|---|---|
| Text-to-Audio | ✅ |
| Video-to-Audio | ✅ |
| Audio Encoding | ✅ |
| Fast Inference | ✅ (Distilled models) |
| Architecture | Flow-based generative models |
| License | Apache-2.0 |
Key Innovation: Optimized for sound effects (not general audio) with both public and private model versions. Video-conditioned generation without requiring captions. Competitive with Stable Audio Open and TangoFlux.
PrismAudio
Description: Video-to-Audio generation framework with Reinforcement Learning and specialized Chain-of-Thought (CoT) planning. Decomposes reasoning into four specialized modules (Semantic, Temporal, Aesthetic, Spatial CoT) for comprehensive video understanding. Built upon ThinkSound.
Release Date: 2025 (ICLR 2026)
| Feature | Value |
|---|---|
| Parameters | 518M |
| Video-to-Audio | ✅ |
| CoT Planning | ✅ (4 modules) |
| Multi-Dimensional RL | ✅ |
| Fast-GRPO | ✅ (Hybrid ODE-SDE) |
| Inference Time | 0.63 seconds |
| License | Apache-2.0 |
Performance Benchmarks:
| Metric | VGGSound | AudioCanvas |
|---|---|---|
| Semantic (CLAP) | 0.47 | 0.52 |
| Temporal (DeSync↓) | 0.41 | 0.36 |
| Aesthetic (MOS-Q) | 4.21±0.35 | 4.12±0.28 |
Uni-MoE (Audio)
Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).
Release Date: October 16, 2025 (Uni-MoE-Audio)
| Feature | Value |
|---|---|
| Parameters | - |
| Voice Cloning | ✅ |
| TTS | ✅ |
| Text-to-Music | ✅ |
| Video-to-Music | ✅ |
| Dynamic Routing | ✅ |
| License | Apache-2.0 |
Audio Restoration & Enhancement
| Model | Type | Bandwidth Extension | Inpainting | License |
|---|---|---|---|---|
| NVIDIA A2SB | Restoration | ✅ | ✅ | NVIDIA Non-Commercial |
| NovaSR | Enhancement | ✅ | ❌ | Apache-2.0 |
| AudioSR | Enhancement | ✅ | ❌ | MIT |
NVIDIA A2SB
Description: Diffusion-based audio restoration model tailored for high-resolution music at 44.1kHz. An end-to-end, vocoder-free, multi-task model capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Can restore hour-long audio inputs without boundary artifacts.
Release Date: January 2025
| Feature | Value |
|---|---|
| Architecture | End-to-end vocoder-free |
| Bandwidth Extension | ✅ |
| Audio Inpainting | ✅ |
| High-Resolution | ✅ (44.1kHz) |
| Long Audio | ✅ (hour-long) |
| Streaming | ❌ |
| License | NVIDIA OneWay NonCommercial License |
NovaSR
Description: Lightning-fast audio upsampler. A model of roughly 50kB that upscales 16kHz audio to 48kHz at over 3500x realtime.
Release Date: 2025
| Feature | Value |
|---|---|
| Size | 52kB |
| Speed | 3600x realtime (A100) |
| Input | 16kHz |
| Output | 48kHz |
| VRAM | Minimal |
| License | Apache-2.0 |
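To put the speed figure in perspective, the arithmetic behind a "realtime factor" can be sketched as below. This is only an illustration of the numbers in the table, not NovaSR code: the function names `processing_seconds` and `upsampled_length` are hypothetical, and actual throughput depends on hardware and batch size.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to process audio at a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def upsampled_length(num_samples: int, in_rate: int = 16_000, out_rate: int = 48_000) -> int:
    """Number of output samples after resampling from in_rate to out_rate."""
    return num_samples * out_rate // in_rate


if __name__ == "__main__":
    one_hour = 3600.0  # seconds of input audio
    # At the 3600x figure quoted for an A100, one hour takes about one second.
    print(processing_seconds(one_hour, 3600.0))  # 1.0
    # One second of 16kHz audio becomes 48,000 samples at 48kHz (a 3x increase).
    print(upsampled_length(16_000))  # 48000
```

At the quoted 3600x factor, an hour of 16kHz audio would take about one second to upsample, and every input sample yields three output samples.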
AudioSR
Description: Audio super resolution model using latent diffusion to upscale low-quality audio to 48kHz.
Release Date: February 12, 2026 (v1.1.1)
| Feature | Value |
|---|---|
| Input | 8kHz-48kHz |
| Output | 48kHz |
| VRAM | 6GB min |
| Stereo | ✅ |
| Long Audio | ✅ |
| License | MIT |
Speech Recognition (ASR)
| Model | Languages | Streaming | License |
|---|---|---|---|
| Cohere Transcribe | 14 | ✅ | Apache-2.0 |
| VibeVoice-ASR | 50+ | ✅ | MIT |
| FunASR | 50+ | ✅ | MIT |
Cohere Transcribe
Description: Open-source automatic speech recognition (ASR) model developed by Cohere. A dedicated 2-billion-parameter audio-in, text-out model reported to rank #1 on the English ASR leaderboard as of its release.
Release Date: March 2026
| Feature | Value |
|---|---|
| Parameters | 2B |
| Architecture | Conformer-based encoder-decoder |
| ASR | ✅ |
| Languages | 14 (En, Fr, De, It, Es, Pt, Gr, Nl, Pl, Zh, Jp, Ko, Vi, Ar) |
| Streaming | ✅ |
| RTFx | Up to 3x faster than comparable models |
| License | Apache-2.0 |
Key Features:
- Long-form transcription with automatic chunking (>35 seconds)
- Optional punctuation control
- Batched inference support
- vLLM integration for production serving
- Apple Silicon support via mlx-audio
- WebGPU browser deployment via transformers.js
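The long-form chunking mentioned in the list above can be pictured with a minimal sketch. This is not Cohere's implementation: the helper `chunk_spans` is hypothetical, the 35-second limit is taken from the feature list, and fixed non-overlapping windows are an assumption (a real pipeline may split at silences or overlap chunks).

```python
from typing import List, Tuple


def chunk_spans(total_seconds: float, max_chunk: float = 35.0) -> List[Tuple[float, float]]:
    """Split [0, total_seconds] into consecutive spans no longer than max_chunk."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        # Clamp the last span so it ends exactly at the end of the audio.
        end = min(start + max_chunk, total_seconds)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # An 80-second recording exceeds the limit and is split into three spans.
    print(chunk_spans(80.0))  # [(0.0, 35.0), (35.0, 70.0), (70.0, 80.0)]
    # A 20-second recording fits in a single span.
    print(chunk_spans(20.0))  # [(0.0, 20.0)]
```

Each span would then be transcribed separately and the resulting text concatenated in order.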
VibeVoice-ASR
Description: Microsoft's unified speech-to-text model for 60-minute long-form audio processing with speaker diarization and timestamping.
Release Date: January 21, 2026
| Feature | Value |
|---|---|
| Parameters | 7B |
| ASR | ✅ |
| Languages | 50+ |
| Streaming | ✅ |
| License | MIT |
FunASR
Description: Fundamental end-to-end speech recognition toolkit with SOTA pretrained models.
Release Date: Ongoing (First: 2023)
| Feature | Value |
|---|---|
| ASR | ✅ |
| VAD | ✅ |
| Punctuation | ✅ |
| Speaker Diarization | ✅ |
| Multi-talker ASR | ✅ |
| Timestamp | ✅ |
| Emotion Recognition | ✅ |
| Languages | 50+ |
| License | MIT/Model License |
Additional Resources
| Resource | Description | Link |
|---|---|---|
| Open ASR Leaderboard | Hugging Face leaderboard for comparing ASR model performance across languages and metrics. | - |
This list is continuously evolving. If you have any models to add or updates to suggest, please feel free to contribute!
Last Updated: March 2026
