━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo-Audio-Training Toolkit
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Welcome to the MiMo-Audio-Training toolkit! It is designed for fine-tuning the XiaomiMiMo/MiMo-Audio-7B-Instruct model and serves as a reference implementation for researchers and developers who are interested in MiMo-Audio and want to adapt it to their own custom tasks.
The MiMo-Audio-Training toolkit supports a comprehensive set of tasks. Key features include:

Tasks:
- SFT:
  - ASR
  - TTS / InstructTTS
  - Audio Understanding and Reasoning
  - Spoken Dialogue
To get started with the MiMo-Audio-Training toolkit, follow the instructions below to set up the environment and install the required dependencies.
Requirements:
- Python 3.12
- CUDA >= 12.0
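Before installing, it can help to confirm that the interpreter and CUDA toolkit meet these requirements. A minimal check (the CUDA probe assumes `nvcc` is on `PATH`; adjust if your toolkit lives elsewhere):

```python
import re
import shutil
import subprocess
import sys

def python_ok(min_version=(3, 12)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version

def cuda_version():
    """Parse the CUDA release number from `nvcc --version`, or None if unavailable."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    match = re.search(r"release (\d+)\.(\d+)", out)
    return (int(match.group(1)), int(match.group(2))) if match else None

if __name__ == "__main__":
    print("Python >= 3.12:", python_ok())
    ver = cuda_version()
    print("CUDA >= 12.0:", ver is not None and ver >= (12, 0))
```

Note that `nvcc` reports the toolkit version, not the driver's maximum supported CUDA version; both matter when building flash-attn from source.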
git clone --recurse-submodules https://github.com/XiaomiMiMo/MiMo-Audio-Training
cd MiMo-Audio-Training
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1
pip install -e .

Note:
If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

Download the fine-tuning dataset and pre-process the data as described in instruct_template.md.
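The exact sample schema is defined in instruct_template.md. As a rough illustration only (the `task`/`messages`/`audio` field names below are assumptions, not the real template), preprocessed SFT data is commonly stored as a JSONL file with one conversation per line:

```python
import json

# Hypothetical SFT samples -- field names are illustrative placeholders,
# not the actual schema from instruct_template.md.
samples = [
    {
        "task": "asr",
        "messages": [
            {"role": "user", "content": "Transcribe the audio.", "audio": "clips/utt_0001.wav"},
            {"role": "assistant", "content": "hello world"},
        ],
    },
]

# Write one JSON object per line (JSONL).
with open("sft_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Reading it back: each line parses independently.
with open("sft_data.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # number of samples
```

JSONL is convenient for large corpora because samples can be streamed line by line without loading the whole file; check instruct_template.md for the fields the toolkit actually expects.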
We provide multiple training scripts under the scripts directory, supporting both single-GPU and multi-GPU training setups.
cd MiMo-Audio-Training
bash scripts/train_multiGPU_torchrun.sh
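The multi-GPU script wraps a torchrun launch. As a sketch, an equivalent command can be assembled programmatically (`--standalone` and `--nproc_per_node` are real torchrun flags; the training script name and its arguments here are placeholders, not the toolkit's actual entry point):

```python
def build_torchrun_cmd(nproc, script="train.py", extra_args=()):
    """Build an argv list for a single-node torchrun launch.

    `--standalone` and `--nproc_per_node` are standard torchrun flags;
    the script path and its arguments are illustrative placeholders.
    """
    cmd = [
        "torchrun",
        "--standalone",               # single node, no external rendezvous backend
        f"--nproc_per_node={nproc}",  # one worker process per GPU
        script,
    ]
    cmd.extend(extra_args)
    return cmd

cmd = build_torchrun_cmd(nproc=8, extra_args=["--config", "configs/sft.yaml"])
print(" ".join(cmd))
```

In practice, inspect scripts/train_multiGPU_torchrun.sh for the actual entry point and flags, and set `nproc` to the number of GPUs on the node.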
Run inference using generate.py.
Evaluate the SFT model with 🌐MiMo-Audio-Eval.
@misc{coreteam2025mimoaudio,
title={MiMo-Audio: Audio Language Models are Few-Shot Learners},
author={LLM-Core-Team Xiaomi},
year={2025},
url={https://github.com/XiaomiMiMo/MiMo-Audio},
}

Please contact us at [email protected] or open an issue if you have any questions.