An open research effort toward end-to-end spoken dialogue systems — starting with the audio codec foundation.
SoviaMate is a long-term research project aiming to build an end-to-end spoken dialogue system (SDS): a single model that listens, reasons, and speaks naturally, with controllable voice, robust to real-world noise, and integrable with large language models.
The first released component is SoviaMate-Codec, a neural audio codec designed from the ground up for LLM integration. Future releases will add a speech-to-speech LLM, dialogue management, and full-pipeline streaming.
⚠️ Current scope: SoviaMate is in active research. Today we ship the codec architecture, the training pipeline, and pretrained codec checkpoints (alpha). The full dialogue system is the goal, not the current deliverable. We are looking for collaborators and compute — see Collaborate.
Existing neural audio codecs (EnCodec, SoundStream, DAC) optimize for perceptual quality but lack the properties needed to drive a downstream speech LLM: measurable semantic preservation, noise robustness by design, and content–speaker decoupling. SoviaMate-Codec is built around four architectural choices that target exactly those properties.
- ASR decoder before quantization — A lightweight ASR head reads the encoder's continuous features and is trained jointly with the codec. Its gradient forces the encoder to bake linguistic information into its representation. Semantic fidelity becomes directly measurable (WER), not assumed.
- Continuous features for LLM input — Discrete tokens are used for transmission/storage; the downstream LLM consumes the pre-quantization continuous features, avoiding quantization-induced information loss while keeping the codec's low-bitrate transmission path intact.
- Speech enhancement as a training paradigm — The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it. Real-world robustness comes from the objective, not from post-hoc adaptation.
- Post-quantization speaker adapter — Voice identity is injected after quantization via a hybrid AdaLN + cross-attention adapter conditioned on a 3–5 s reference. This decouples "what is said" from "who says it", enables zero-shot voice swapping, and frees the quantizer's capacity for content.
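The first bullet treats semantic fidelity as something measurable via word error rate (WER). As a minimal sketch of that metric, assuming whitespace-tokenized transcripts and no text normalization, WER can be computed as the word-level edit distance divided by the reference length. The function name `word_error_rate` is illustrative and is not taken from the repository's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In practice both transcripts are usually normalized (casing, punctuation) before scoring, and WER can exceed 1.0 when the hypothesis contains many insertions.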
Audio Input
   │
   ▼
Encoder ──► [Continuous Features] ────┐
   │                │                 │
   │                └──► ASR Decoder  └──► LLM Input (continuous)
   │                     (text output)
   ▼
Quantizer ──► [Discrete Tokens] ──► Bitstream (transmission)
   │
   ▼
Speaker Adapter ◄── Speaker Prompt (3–5 s)
   │
   ▼
Audio Decoder ──► Clean Speech
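The Speaker Adapter stage in the diagram is described as a hybrid AdaLN + cross-attention module. As a rough illustration of the AdaLN half only, the sketch below applies adaptive layer normalization to a single feature frame in plain Python: the frame is normalized, then scaled and shifted by values projected from a speaker embedding. The function names, the `(1 + scale)` parameterization, and the toy dimensions are assumptions for illustration, not the repository's implementation, and the cross-attention path is omitted.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Normalize one feature frame to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight: list[list[float]], bias: list[float], vec: list[float]) -> list[float]:
    """Plain matrix-vector product plus bias; weight has one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaln(
    frame: list[float],
    speaker_embedding: list[float],
    w_scale: list[list[float]],
    b_scale: list[float],
    w_shift: list[list[float]],
    b_shift: list[float],
) -> list[float]:
    """Adaptive LayerNorm: per-channel scale and shift come from the speaker embedding."""
    normed = layer_norm(frame)
    scale = linear(w_scale, b_scale, speaker_embedding)
    shift = linear(w_shift, b_shift, speaker_embedding)
    # With scale == 0 and shift == 0 this reduces to an ordinary LayerNorm.
    return [(1.0 + s) * n + t for n, s, t in zip(normed, scale, shift)]


# Toy example: a 4-channel frame conditioned on a 2-dimensional speaker embedding.
frame = [1.0, 2.0, 3.0, 4.0]
speaker = [0.3, -0.7]
w = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]
print(adaln(frame, speaker, w, [0.0] * 4, w, [0.0] * 4))
```

Because the content frame is normalized before the speaker-dependent scale and shift are applied, the same content features can be restyled by swapping only the speaker embedding, which is the decoupling the diagram describes.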
A more detailed architecture write-up will accompany the forthcoming technical report.
| Component | Status |
|---|---|
| Codec architecture (encoder / quantizer / decoder / ASR head / speaker adapter) | ✅ Implemented |
| Multi-objective training pipeline (audio + adversarial + text losses) | ✅ Implemented |
| Speech enhancement training (noisy → clean) | ✅ Implemented |
| Streaming inference | ✅ Implemented (not yet benchmarked end-to-end) |
| Pretrained checkpoint release | ✅ Released (alpha) — samson-ailabs/SoviaMate-Codec |
| Benchmarking against EnCodec / SoundStream / DAC | 🔄 In progress |
| Technical report / paper | 🔄 In progress |
| LLM integration adapters | ⏳ Planned |
| End-to-end spoken dialogue system | ⏳ Long-term goal |
Honest disclaimer: this is alpha research code. APIs will change, results are preliminary, and many evaluation numbers are not in yet.
- Python 3.12
- uv package manager
- CUDA-capable GPU recommended for training (single-GPU inference is feasible)
git clone https://github.com/samson-ailabs/SoviaMate.git
cd SoviaMate
uv sync --frozen
Pretrained codec weights are published on Hugging Face at samson-ailabs/SoviaMate-Codec — see the model card for download recipes and the full usage API.
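Before loading a bundle, it can help to confirm that the downloaded weights landed where the usage example expects them. The sketch below is a small preflight check using only the standard library. The two relative paths are copied from the usage example in this README; the helper name `missing_checkpoints` is hypothetical, and the actual file layout on Hugging Face may differ, so treat the model card as the authority.

```python
from pathlib import Path

# Relative paths as they appear in this README's usage example.
EXPECTED_CHECKPOINTS = (
    "neural_audio_codec/audio_codec_base.ckpt",
    "neural_audio_codec/audio_codec_spk.ckpt",
)


def missing_checkpoints(
    root: str = "checkpoints",
    expected: tuple[str, ...] = EXPECTED_CHECKPOINTS,
) -> list[Path]:
    """Return the expected checkpoint files that do not exist under root."""
    base = Path(root)
    return [base / rel for rel in expected if not (base / rel).is_file()]


if __name__ == "__main__":
    for path in missing_checkpoints():
        print(f"missing: {path}")
```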
Once weights are in checkpoints/, the bundle is a one-liner:
from soviamate.bundles import AudioCodecBundle
# source_audio and target_speaker_audio are audio inputs you have already loaded;
# see the model card for the expected format.
# Reconstruction (encode → decode)
reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)
reconstructed, _ = reconstructor(source_audio)
# Voice conversion — always with a speaker prompt
voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)
The example training config is configs/training/audio_codec.yaml. Required fields are marked ??? and must be supplied:
uv run python train.py --config-name audio_codec \
    task.data.trainset.filepaths=/path/to/trainset.jsonl \
    task.data.valset.filepaths=/path/to/valset.jsonl \
    task.model.speaker_adapter.sv_checkpoint=/path/to/campplus.bin \
    loggers.tb.name=my_run \
    trainer.devices=1
You can also copy the file and edit it directly, or compose your own config on top of it via Hydra.
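In OmegaConf, which Hydra uses for its configs, the literal `???` marks a mandatory value that has not been set. As a minimal sketch of how the required fields relate to the command-line overrides, the code below walks a nested dict, lists the dotted keys still set to `???`, and formats supplied values as `key=value` overrides. The sample config is hypothetical and only mirrors the override keys used in the training command; which fields are actually marked `???` in configs/training/audio_codec.yaml is not confirmed here.

```python
from typing import Any, Iterator, Mapping

MISSING = "???"  # OmegaConf marker for a mandatory value that has not been set


def iter_missing(config: Mapping[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf whose value is still '???'."""
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            yield from iter_missing(value, dotted)
        elif value == MISSING:
            yield dotted


def to_overrides(values: Mapping[str, Any]) -> list[str]:
    """Format dotted-key values as Hydra-style key=value command-line overrides."""
    return [f"{key}={value}" for key, value in values.items()]


# Hypothetical config fragment, shaped after the override keys in the command above.
example_config = {
    "task": {
        "data": {
            "trainset": {"filepaths": MISSING},
            "valset": {"filepaths": MISSING},
        },
        "model": {"speaker_adapter": {"sv_checkpoint": MISSING}},
    },
    "trainer": {"devices": 1},
}

for dotted_key in iter_missing(example_config):
    print(f"required: {dotted_key}")
```

Passing the supplied values through `to_overrides` gives the argument list that would follow `--config-name audio_codec` on the command line.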
uv run python scripts/eval_audio_codec.py --help
The repository will evolve in three releases:
- v0.1 — Codec foundation (current). Alpha release of SoviaMate-Codec with ASR-constrained encoding, zero-shot speaker adaptation, and enhancement-trained robustness. Comprehensive benchmarking and the technical report are still in progress.
- v0.2 — LLM integration. Input/output adapters that bridge the codec's continuous features with a speech-aware LLM. Streaming speech-to-speech inference.
- v1.0 — End-to-end SDS. Dialogue management, multi-turn context, emotion/prosody control. The full vision of SoviaMate.
Building a credible end-to-end spoken dialogue system from scratch needs more than code — it needs compute, datasets, and people. We are actively looking for:
- Academic & industry collaborators with expertise in speech codecs, speech LLMs, ASR/TTS, or dialogue systems.
- Compute grants & sponsorships for large-scale codec and LLM training (e.g., academic compute programs, cloud research credits, GPU partnerships).
- Dataset partners — multilingual conversational speech, real-world noisy recordings, expressive/emotional speech corpora.
- Engineers and researchers who want to own a piece of the stack (codec internals, LLM adapters, streaming runtime, evaluation harness).
If any of that fits you or your organization, please reach out: [email protected] with subject line SoviaMate collaboration. For code-level discussion, open a GitHub issue or discussion.
Code contributions are welcome — see CONTRIBUTING.md for setup, coding style, and a list of good first contributions. By participating you agree to the Code of Conduct.
A technical report is in preparation. In the meantime, please cite the repository:
@misc{soviamate2026,
  author       = {Son Dang Dinh (Samson)},
  title        = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
  year         = {2026},
  howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
A CITATION.cff is provided for GitHub's "Cite this repository" button.
SoviaMate is released under the Apache License 2.0. It is intended for open research and beneficial applications of conversational AI.
The architecture supports zero-shot voice cloning. It must not be used for impersonation, fraud, non-consensual voice synthesis, or any deceptive or harmful purpose. Outputs may contain biases or inaccuracies inherited from training data; the authors accept no liability for downstream use. By using SoviaMate you agree to these terms and to applicable law in your jurisdiction.
SoviaMate builds on a large body of public research in neural codecs, self-supervised speech models, ASR/TTS, and speech LLMs. The forthcoming technical report will include a full bibliography. Thanks to the open-source PyTorch, Lightning, Hydra, SentencePiece, and HuggingFace communities — this project would not be possible without them.