CSM

2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
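
As a rough illustration of what RVQ (residual vector quantization) codes are, the toy sketch below quantizes a single number in stages, where each stage encodes the error left by the previous one. It is not Mimi's codec: real codecs quantize vectors with learned codebooks, and the function names and codebook values here are made up for the example.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value, codebooks):
    """Return one code per stage; each stage quantizes the remaining residual."""
    codes, residual = [], value
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual -= codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the value by summing the selected entry from each stage."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))


# Toy codebooks (illustrative values): a coarse stage, then a finer one.
codebooks = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]
codes = rvq_encode(0.8, codebooks)    # [2, 0]
approx = rvq_decode(codes, codebooks)  # 0.75
```

Each extra stage shrinks the reconstruction error, which is why a codec of this kind emits several codes per audio frame rather than one.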

A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.

A hosted Hugging Face space is also available for testing audio generation.

Requirements

- A CUDA-compatible GPU (runs on CPU otherwise)
- The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
- Similarly, Python 3.10 is recommended, but newer versions may be fine
- For some audio operations, ffmpeg may be required
- Access to the following Hugging Face models:
  - Llama-3.2-1B
  - CSM-1B
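
A quick way to check most of these requirements up front, sketched with the standard library only. `check_environment` is a hypothetical helper, not part of this repo, and it does not verify Hugging Face model access.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report which local requirements appear to be satisfied."""
    report = {
        "python_3_10_or_newer": sys.version_info >= (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            # torch is present but could not be imported; treat it as missing.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing `cuda_available` is not fatal, since the model also runs on CPU.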

Setup

Clone and set up the repo:

```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate

# Set environment variable to disable Triton compilation
export NO_TORCH_COMPILE=1

pip install -r requirements.txt
```

Install ffmpeg (required for audio processing):

```bash
# On macOS with Homebrew
brew install ffmpeg

# On Ubuntu/Debian
sudo apt-get install ffmpeg

# On Windows
# Download from https://ffmpeg.org/download.html
```

Log in to Hugging Face. You will need access to CSM-1B and Llama-3.2-1B:

```bash
huggingface-cli login
```

Quick Start

Run the model using the command-line interface:

```bash
# Run normally (environment variable is set in the script)
python run_csm.py
```

The script automatically uses CUDA if available for faster generation; otherwise it falls back to CPU.

To generate a sentence from your own Python code:

```python
from huggingface_hub import hf_hub_download
from generator import load_csm_1b
import torchaudio
import torch

# Use Apple MPS or CUDA if available, otherwise CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Download the checkpoint and load the model from it
model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
generator = load_csm_1b(model_path, device)

# If your copy of generator.py fetches the checkpoint itself, load with:
# generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```


CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker's utterance.

```python
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
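
The context example pairs three parallel lists with `zip`, which silently stops at the shortest list, so a missing transcript or audio path would drop a turn without any error. A minimal sketch of a length check follows. `Turn` and `pair_turns` are hypothetical names used for illustration; `Turn` stands in for the repo's `Segment`, which holds the loaded audio tensor rather than a path.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical stand-in for one context entry (not the repo's Segment)."""

    speaker: int
    text: str
    audio_path: str


def pair_turns(speakers, transcripts, audio_paths):
    """Pair the three parallel lists, refusing mismatched lengths."""
    lengths = {len(speakers), len(transcripts), len(audio_paths)}
    if len(lengths) != 1:
        raise ValueError(
            f"expected equal lengths, got {len(speakers)} speakers, "
            f"{len(transcripts)} transcripts, {len(audio_paths)} audio paths"
        )
    return [
        Turn(speaker=s, text=t, audio_path=p)
        for s, t, p in zip(speakers, transcripts, audio_paths)
    ]
```

On Python 3.10 and newer, `zip(..., strict=True)` gives a similar guarantee without the explicit check.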

FAQ

Does this model come with any voices?

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

Can I converse with the model?

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
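
One way to wire the two together is sketched below with placeholder callables. `generate_reply` stands for whatever text LLM you choose and `synthesize` for a wrapper around the speech generation call; both names, along with `take_turn`, are hypothetical and not part of this repo.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, str]]  # (speaker id, text) pairs, oldest first


def take_turn(
    user_text: str,
    history: History,
    generate_reply: Callable[[History], str],
    synthesize: Callable[[str, History], object],
    user_speaker: int = 0,
    model_speaker: int = 1,
):
    """Run one turn: the text LLM writes the reply, the speech model voices it."""
    history.append((user_speaker, user_text))
    reply_text = generate_reply(history)
    # The speech model sees the earlier turns as context, not the reply itself.
    audio = synthesize(reply_text, history)
    history.append((model_speaker, reply_text))
    return reply_text, audio
```

Keeping the two models behind plain callables makes it easy to swap either one without touching the turn logic.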

Does it support other languages?

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

Authors

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
