MLX-Audio

An audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

Features

  • Fast inference optimized for Apple Silicon (M series chips)
  • Multiple model architectures for TTS, STT, and STS
  • Multilingual support across models
  • Voice customization and cloning capabilities
  • Adjustable speech speed control
  • Interactive web interface with 3D audio visualization
  • OpenAI-compatible REST API
  • Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
  • Swift package for iOS/macOS integration

Installation

pip install mlx-audio

For development or web interface:

git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"

Quick Start

Command Line

# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!"

# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --voice af_heart --speed 1.2

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --play

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --output_path ./my_audio

Python API

from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array

Supported Models

Text-to-Speech (TTS)

| Model | Description | Languages | Repo |
|---|---|---|---|
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | mlx-community/Kokoro-82M-bf16 |
| Qwen3-TTS | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |
| CSM | Conversational Speech Model with voice cloning | EN | mlx-community/csm-1b |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-bf16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-0.2-500M |
| Spark | SparkTTS model | EN, ZH | mlx-community/SparkTTS-0.5B-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/Chatterbox-bf16 |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-bf16 |
Soprano High-quality TTS EN mlx-community/Soprano-bf16

Speech-to-Text (STT)

| Model | Description | Languages | Repo |
|---|---|---|---|
| Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Parakeet | NVIDIA's accurate STT | EN | mlx-community/parakeet-tdt-0.6b-v2 |
| Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16 |

Speech-to-Speech (STS)

| Model | Description | Use Case | Repo |
|---|---|---|---|
| SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large |
| Liquid2.5-Audio | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | mlx-community/LFM2.5-Audio-1.5B-8bit |
| MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX |

Model Examples

Kokoro TTS

Kokoro is a fast, multilingual TTS model with 54 voice presets.

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate with different voices
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a"  # American English
):
    audio = result.audio

Available Voices:

  • American English: af_heart, af_bella, af_nova, af_sky, am_adam, am_echo, etc.
  • British English: bf_alice, bf_emma, bm_daniel, bm_george, etc.
  • Japanese: jf_alpha, jm_kumo, etc.
  • Chinese: zf_xiaobei, zm_yunxi, etc.

Language Codes:

| Code | Language | Note |
|---|---|---|
| a | American English | Default |
| b | British English | |
| j | Japanese | Requires pip install misaki[ja] |
| z | Mandarin Chinese | Requires pip install misaki[zh] |
| e | Spanish | |
| f | French | |
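The voice IDs above follow a compact convention: the first letter matches the language code in the table, the second is f (female) or m (male), and the rest is the voice name. A small decoder, purely illustrative and not part of the mlx_audio API:

```python
# Language prefix -> language, per the voice list and language codes above
LANGS = {"a": "American English", "b": "British English",
         "j": "Japanese", "z": "Mandarin Chinese",
         "e": "Spanish", "f": "French"}
GENDERS = {"f": "female", "m": "male"}

def describe_voice(voice_id):
    """Decode a Kokoro voice ID, e.g. 'af_heart' -> ('American English', 'female', 'heart')."""
    prefix, name = voice_id.split("_", 1)
    return LANGS[prefix[0]], GENDERS[prefix[1]], name
```

This is handy when listing or filtering presets, e.g. picking only British English voices.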

Qwen3-TTS

Alibaba's state-of-the-art multilingual TTS with three model variants:

from mlx_audio.tts.utils import load_model

# Base model with predefined voices
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

# CustomVoice model - predefined voices with emotion control
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16")
results = list(model.generate_custom_voice(
    text="I'm so excited to meet you!",
    speaker="Vivian",
    language="English",
    instruct="Very happy and excited.",
))

# VoiceDesign model - create any voice from text description
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16")
results = list(model.generate_voice_design(
    text="Big brother, you're back!",
    language="English",
    instruct="A cheerful young female voice with high pitch and energetic tone.",
))

# Access generated audio
audio = results[0].audio  # mx.array

Available Models:

| Model | Method | Description |
|---|---|---|
| mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 | generate() | Fast, predefined voices |
| mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 | generate() | Higher quality |
| mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16 | generate_custom_voice() | Voices + emotion |
| mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16 | generate_custom_voice() | Better emotion control |
| mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 | generate_voice_design() | Create any voice |

Speakers (Base/CustomVoice): Chelsie, Ethan, Serena, Vivian, Ryan, Aiden, Eric, Dylan

CSM (Voice Cloning)

Clone any voice using a reference audio sample:

mlx_audio.tts.generate \
    --model mlx-community/csm-1b \
    --text "Hello from Sesame." \
    --ref_audio ./reference_voice.wav \
    --play

Whisper STT

from mlx_audio.stt.utils import load_model, transcribe

model = load_model("mlx-community/whisper-large-v3-turbo-asr-fp16")
result = transcribe("audio.wav", model=model)
print(result["text"])

SAM-Audio (Source Separation)

Separate specific sounds from audio using text prompts:

from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio

model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")

batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)

result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2/32},
)

save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")

MossFormer2 (Speech Enhancement)

Remove noise from speech recordings:

from mlx_audio.sts import MossFormer2SEModel, save_audio

model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)

Web Interface & API Server

MLX-Audio includes a modern web interface and OpenAI-compatible API.

Starting the Server

# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000

# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev

API Endpoints

Text-to-Speech (OpenAI-compatible):

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
  --output speech.wav

Speech-to-Text:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
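The same endpoints can be called from Python with only the standard library. The sketch below builds the TTS request using the OpenAI-compatible payload fields shown above; it assumes the server is running locally on port 8000:

```python
import json
import urllib.request

SERVER = "http://localhost:8000"  # the server started in the section above

def build_speech_request(text, model="mlx-community/Kokoro-82M-bf16", voice="af_heart"):
    """Build a POST request for the OpenAI-compatible /v1/audio/speech endpoint."""
    payload = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{SERVER}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize(text, out_path="speech.wav", **kwargs):
    """Send the request and save the returned audio (requires a running server)."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```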

Swift

Looking for Swift/iOS support? Check out mlx-audio-swift for on-device TTS using MLX on macOS and iOS.

Quantization

Reduce model size and improve performance with quantization using the convert script:

# Convert and quantize to 4-bit
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-4bit \
    --quantize \
    --q-bits 4 \
    --upload-repo username/Kokoro-82M-4bit  # optional: upload to Hugging Face

# Convert with specific dtype (bfloat16)
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-bf16 \
    --dtype bfloat16 \
    --upload-repo username/Kokoro-82M-bf16  # optional: upload to Hugging Face

Options:

| Flag | Description |
|---|---|
| --hf-path | Source Hugging Face model or local path |
| --mlx-path | Output directory for converted model |
| -q, --quantize | Enable quantization |
| --q-bits | Bits per weight (4, 6, or 8) |
| --q-group-size | Group size for quantization (default: 64) |
| --dtype | Weight dtype: float16, bfloat16, float32 |
| --upload-repo | Upload converted model to HF Hub |
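For a rough sense of what --q-bits and --q-group-size buy, the arithmetic below estimates packed weight size. The fp16 scale-and-bias-per-group overhead is an assumption about MLX's affine quantization format, so treat this as a back-of-the-envelope sketch rather than an exact figure:

```python
def quantized_size_mb(n_params, q_bits=4, group_size=64):
    """Rough on-disk size for affine-quantized weights, in MB.

    Assumes each group of `group_size` weights is packed to `q_bits` bits
    and stores one fp16 scale plus one fp16 bias (an assumption about the
    MLX format). Real checkpoints add headers and unquantized layers, so
    this is a lower-bound estimate.
    """
    packed_bytes = n_params * q_bits / 8          # packed low-bit weights
    group_bytes = n_params / group_size * 2 * 2   # fp16 scale + fp16 bias per group
    return (packed_bytes + group_bytes) / 1e6

# Kokoro-82M: ~164 MB at bf16 vs roughly 46 MB at 4-bit
print(quantized_size_mb(82e6, q_bits=4))  # -> 46.125
```

Larger group sizes shrink the per-group overhead slightly at some cost in quantization accuracy.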

Requirements

  • Python 3.10+
  • Apple Silicon Mac (M1/M2/M3/M4)
  • MLX framework
  • ffmpeg (required for MP3/FLAC audio encoding)

Installing ffmpeg

ffmpeg is required for saving audio in MP3 or FLAC format. Install it using:

# macOS (using Homebrew)
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

WAV format works without ffmpeg.
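Since WAV needs no external encoder, generated samples can be written with the standard library alone. This sketch assumes mono float samples in [-1, 1]; the 24 kHz default matches Kokoro's output rate, but verify the rate for your model:

```python
import struct
import wave

def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV (no ffmpeg needed).

    The 24000 Hz default is Kokoro's output rate -- an assumption; check
    your model's actual sample rate before saving.
    """
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm)

# Usage with a generation result (converting the mx.array to a Python list):
# save_wav(result.audio.tolist(), "out.wav")
```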

License

MIT License

Citation

@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
