Multilingual Text-to-Speech with CSM 1B [Work in progress]

This repository contains a multilingual adaptation of the CSM 1B text-to-speech model. The project is a work in progress: German is currently supported, with further languages planned.

Overview

This project extends the CSM 1B model to support multilingual text-to-speech generation. The implementation includes:

Language-specific text preprocessing
Training pipeline for multiple languages
Inference tools for generating speech in different languages
Support for Mozilla Common Voice datasets

Architecture

The system is based on the CSM 1B model architecture, which combines:

A Llama 3.2-1B backbone for text understanding
A smaller Llama 3.2-style decoder (about 100M parameters) for audio token generation
The Mimi codec for audio tokenization and reconstruction

Our multilingual extension adds:

Language-specific text processors
Adapted training pipeline for different languages
Language-aware generation tools

Features

Multilingual Support: Generate speech in German, with more languages to come
Language-Specific Processing: Custom text normalization for each supported language
Complete Training Pipeline: Scripts for data preparation, model training, and evaluation
Easy Inference: Simple API for generating speech from text in any supported language
Configurable: Language-specific configurations for fine-tuning

Installation

# Clone this repository
git clone https://github.com/rooms-solutions/csm-multilingual
cd csm-multilingual

# Install dependencies
pip install -r requirements.txt

Usage

Generating Speech

Generate speech in German:

python generate_multilingual.py --text "Hallo, wie geht es dir?" --language de --checkpoint ./checkpoints/de/best_model.pt
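
To synthesize several sentences in a row, the same command can be driven from Python. The following is a minimal sketch, not part of the repository: `build_generate_command` and `generate_batch` are hypothetical helpers, and only the three flags shown above are assumed to exist. Where the script writes its audio is not stated in this README, so consecutive runs may overwrite each other's output.

```python
import subprocess
import sys
from typing import List


def build_generate_command(text: str, language: str, checkpoint: str) -> List[str]:
    """Build the argument list for generate_multilingual.py.

    Uses only the flags shown in the README: --text, --language, --checkpoint.
    """
    return [
        sys.executable,              # run with the current Python interpreter
        "generate_multilingual.py",
        "--text", text,
        "--language", language,
        "--checkpoint", checkpoint,
    ]


def generate_batch(sentences: List[str], language: str, checkpoint: str) -> None:
    """Run the generation script once per sentence (hypothetical helper)."""
    for sentence in sentences:
        # check=True raises CalledProcessError if the script exits with an error
        subprocess.run(build_generate_command(sentence, language, checkpoint), check=True)
```

Passing the arguments as a list avoids shell quoting problems with punctuation in the input text.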

Training a New Language Model

The training process consists of two steps:

Data Preparation:

python prepare_commonvoice.py --input_dir /path/to/cv-corpus-20.0-de --output_dir ./data/de --language de --filter_quality

Model Training:

python train_multilingual.py --language de --train_csv ./data/de/train_de.tsv --data_dir ./data --checkpoint ckpt.pt
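
This README does not describe what `--filter_quality` does in the data preparation step. One plausible reading is a filter on the community vote columns (`up_votes`, `down_votes`) that Common Voice ships in its TSV files. The sketch below illustrates that idea only; the function names and thresholds are assumptions, not the behavior of `prepare_commonvoice.py`.

```python
import csv
from typing import Dict, Iterable, List


def load_tsv(path: str) -> List[Dict[str, str]]:
    """Read a Common Voice TSV file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting
        return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def filter_by_votes(rows: Iterable[Dict[str, str]],
                    min_up: int = 2, max_down: int = 0) -> List[Dict[str, str]]:
    """Keep rows with enough up-votes, few down-votes, and a non-empty sentence.

    The thresholds are illustrative defaults, not values taken from the project.
    """
    kept = []
    for row in rows:
        try:
            up = int(row.get("up_votes") or 0)
            down = int(row.get("down_votes") or 0)
        except ValueError:
            continue  # skip rows whose vote counts are not numeric
        if up >= min_up and down <= max_down and (row.get("sentence") or "").strip():
            kept.append(row)
    return kept
```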

Training Multiple Languages

Use the provided script to train models for multiple languages:

python train_multilingual_example.py --common_voice_root /path/to/common-voice --languages de --cv_version 20.0
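
What `train_multilingual_example.py` does internally is not shown here. Conceptually, training several languages amounts to repeating the two steps above for each language code. The sketch below builds those commands under stated assumptions: the corpus directory naming `cv-corpus-<version>-<language>` and the `train_<language>.tsv` file name are inferred from the single-language examples, and `plan_language` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def plan_language(cv_root: str, language: str, cv_version: str = "20.0",
                  data_root: str = "./data", checkpoint: str = "ckpt.pt") -> List[List[str]]:
    """Return the prepare and train commands for one language as argument lists."""
    # Directory layout inferred from the README example: cv-corpus-20.0-de
    corpus_dir = Path(cv_root) / f"cv-corpus-{cv_version}-{language}"
    output_dir = Path(data_root) / language

    prepare = [
        "python", "prepare_commonvoice.py",
        "--input_dir", corpus_dir.as_posix(),
        "--output_dir", output_dir.as_posix(),
        "--language", language,
        "--filter_quality",
    ]
    train = [
        "python", "train_multilingual.py",
        "--language", language,
        "--train_csv", (output_dir / f"train_{language}.tsv").as_posix(),
        "--data_dir", Path(data_root).as_posix(),
        "--checkpoint", checkpoint,
    ]
    return [prepare, train]


def plan_all(cv_root: str, languages: List[str]) -> List[List[str]]:
    """Flatten the per-language plans into one ordered list of commands."""
    commands: List[List[str]] = []
    for language in languages:
        commands.extend(plan_language(cv_root, language))
    return commands
```

Each returned command can be passed to `subprocess.run(command, check=True)` so that a failed preparation step stops the run before training starts.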

Dataset

This project uses the Mozilla Common Voice dataset for training. Common Voice is a crowdsourced speech dataset, released under the CC0 public domain dedication and available in many languages.

To train a model, you need to:

Download the appropriate language dataset from the Common Voice website
Use the prepare_commonvoice.py script to preprocess the data
Train the model using the prepared data

Supported Languages

The following languages are supported with language-specific processing:

🇩🇪 German (de)
🇫🇷 French (fr) [coming soon]
🇪🇸 Spanish (es) [coming soon]

Additional languages can be used with the default text processor.
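
The fallback to a default processor can be pictured as a small registry lookup. The sketch below is an assumption about how such a factory might look, not the code in this repository: the `LanguageProcessor` stand-in, its whitespace-only default normalization, the German abbreviation rule, and the `_PROCESSORS` registry are all illustrative. Only the class name, the `get_processor` name, and the `(code, name)` constructor arguments come from this README.

```python
import re
from typing import Dict, Type


class LanguageProcessor:
    """Stand-in for the project's base class (interface assumed)."""

    def __init__(self, language_code: str, language_name: str):
        self.language_code = language_code
        self.language_name = language_name

    def normalize_text(self, text: str) -> str:
        # Default processing: collapse whitespace runs and trim both ends
        return re.sub(r"\s+", " ", text).strip()


class GermanProcessor(LanguageProcessor):
    """Illustrative German processor with a single example rule."""

    def __init__(self):
        super().__init__("de", "German")

    def normalize_text(self, text: str) -> str:
        text = super().normalize_text(text)
        # Example rule only: expand "z. B." / "z.B." to "zum Beispiel"
        return re.sub(r"\bz\.\s?B\.", "zum Beispiel", text)


# Hypothetical registry mapping language codes to processor classes
_PROCESSORS: Dict[str, Type[LanguageProcessor]] = {"de": GermanProcessor}


def get_processor(language_code: str) -> LanguageProcessor:
    """Return the language-specific processor, or the default one if none is registered."""
    processor_cls = _PROCESSORS.get(language_code)
    if processor_cls is None:
        return LanguageProcessor(language_code, language_code)
    return processor_cls()
```

With this layout, an unregistered code such as `nl` still yields a usable processor, just without language-specific normalization.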

Adding a New Language

To add support for a new language:

Extend the LanguageProcessor class with language-specific normalization
Add your new processor to the get_processor factory method
Create a language configuration file in language_configs/

Example for adding Italian support:

import re

# LanguageProcessor is the project's base class
# (assumed here to be importable from language_utils.py)

class ItalianProcessor(LanguageProcessor):
    """Italian language text processor"""

    def __init__(self):
        super().__init__("it", "Italian")

        # Italian-specific replacements, applied in order
        self.replacements = [
            (r'(\d)\.(\d)', r'\1,\2'),     # Convert decimal points to commas
            (r'\bSig\.ra\b', 'Signora'),   # Expand the longer abbreviation first
            (r'\bSig\.', 'Signore'),       # Otherwise "Sig.ra" becomes "Signorera"
            (r'(\d)\s*€', r'\1 euro'),     # Spell out the euro sign after an amount
        ]

    def normalize_text(self, text: str) -> str:
        """Italian-specific text normalization"""
        text = super().normalize_text(text)

        # Apply Italian-specific replacements
        for pattern, replacement in self.replacements:
            text = re.sub(pattern, replacement, text)

        return text
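
Replacement rules of this kind are applied sequentially, so their order matters: a short abbreviation such as `Sig.` has to come after a longer one that contains it, such as `Sig.ra`. The following standalone sketch checks a rule list without the rest of the pipeline. `ITALIAN_RULES` and `apply_rules` are illustrative names, not part of the repository.

```python
import re
from typing import List, Tuple

# Ordered (pattern, replacement) pairs; the longer abbreviation comes first
ITALIAN_RULES: List[Tuple[str, str]] = [
    (r"(\d)\.(\d)", r"\1,\2"),     # decimal point to decimal comma
    (r"\bSig\.ra\b", "Signora"),   # must run before the "Sig." rule
    (r"\bSig\.", "Signore"),
    (r"(\d)\s*€", r"\1 euro"),     # spell out the euro sign after an amount
]


def apply_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each regex replacement in list order and return the result."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    print(apply_rules("La Sig.ra Rossi paga 3.50€ al Sig. Bianchi", ITALIAN_RULES))
```

Note that the decimal rule also rewrites a dot used as a thousands separator (for example `1.000`), so a real processor would likely need a narrower pattern.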

Performance Considerations

GPU with at least 8GB VRAM recommended for inference
16GB+ VRAM recommended for training
Mixed precision training is supported to reduce memory requirements
Model checkpoints are approximately 2GB in size per language
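
As a back-of-the-envelope illustration of why numeric precision matters for these figures, the memory taken by the weights alone is the parameter count times the bytes per parameter. The sketch below is plain arithmetic; the parameter count of one billion is an assumption taken from the model name, not a measured value, and optimizer state, gradients, and activations are not included.

```python
# Bytes used to store one parameter in common floating-point formats
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_size_gb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    assumed_params = 1_000_000_000  # assumption: roughly 1B parameters
    for dtype in ("float32", "float16"):
        print(dtype, weights_size_gb(assumed_params, dtype), "GB")
```

Under that assumption, 16-bit weights take about half the space of 32-bit weights, which is the saving that mixed precision builds on. Training needs considerably more memory than this estimate because of gradients and optimizer state.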

Acknowledgments

This project builds upon:

CSM 1B - The base text-to-speech model
Mozilla Common Voice - Multilingual speech datasets
Llama 3.2 - The backbone language model

License

This project is licensed under the MIT License.
