We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS as probabilistic Gaussian distributions, our approach makes the model compatible with reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that uses two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by verification models. Experimental results on zero-shot voice cloning show that F5R-TTS improves both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative increase in SIM score) compared to conventional flow-matching based TTS systems. Audio samples are available at demo.
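The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes a scalar reward that adds SIM and subtracts WER with equal weights; this weighting, the function names, and the sample values are illustrative assumptions, not the reward actually used in F5R-TTS.

```python
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: higher SIM is better, lower WER is better."""
    return sim_weight * sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are scored relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt as (WER, SIM) pairs; made-up values.
group = [(0.10, 0.60), (0.05, 0.70), (0.20, 0.55), (0.05, 0.65)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward in their group receive positive advantages and are reinforced; the others are pushed down. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.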
# Create a Python 3.10 conda env (you could also use virtualenv)
conda create -n f5r-tts python=3.10
conda activate f5r-tts
# Install PyTorch built for your CUDA version, e.g.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
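Then choose one of the options below:
# Option 1: install as a pip package directly from GitHub
pip install git+https://github.com/shaw0fr/F5R-TTS
# Option 2: clone the repository and install it in editable mode
git clone https://github.com/shaw0fr/F5R-TTS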
cd F5R-TTS
# git submodule update --init --recursive # (optional, only if you need BigVGAN)
pip install -e .
If you initialize the submodule, add the following code at the beginning of src/third_party/BigVGAN/bigvgan.py:
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
python ./src/f5_tts/infer/infer_cli.py \
--model F5-TTS \
--ckpt_file "your_model_path" \
--ref_audio "path_to_reference.wav" --ref_text "reference_text" \
--gen_text "generated_text" \
--output_dir ./tests
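To synthesize several sentences with the same reference voice, the command above can be wrapped in a small script. The following is a sketch only: the helper names are hypothetical, the flags are exactly those shown above, and writing each sample to its own output directory is an assumption made to keep results apart.

```python
import subprocess
import sys


def build_infer_command(ckpt_file, ref_audio, ref_text, gen_text, output_dir="./tests"):
    """Build the argument list for infer_cli.py using the flags shown above."""
    return [
        sys.executable, "./src/f5_tts/infer/infer_cli.py",
        "--model", "F5-TTS",
        "--ckpt_file", ckpt_file,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--output_dir", output_dir,
    ]


def synthesize(texts, ckpt_file, ref_audio, ref_text, run=subprocess.run):
    """Run the CLI once per text; `run` is injectable so the loop can be dry-run."""
    for i, text in enumerate(texts):
        cmd = build_infer_command(
            ckpt_file, ref_audio, ref_text, text,
            output_dir=f"./tests/sample_{i}",  # hypothetical per-sample layout
        )
        run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
```

Calling `synthesize(["first sentence", "second sentence"], "your_model_path", "path_to_reference.wav", "reference_text")` from the repository root would launch the CLI twice. Passing the arguments as a list avoids shell quoting problems with texts that contain spaces or quotes.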
For the GRPO phase, download SenseVoice_small and wespeaker.
To use our code directly, place the reference model in ckpts/F5TTS_ref. Otherwise, change the loading method in src/rl/trainer_rl.py.
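For context on the WER reward used in the GRPO phase, here is a minimal sketch of word error rate computed between a target text and an ASR transcript. It uses plain whitespace tokenization and a standard word-level edit distance; the reward code in this repository may normalize text differently or work at character level (for example for Chinese), so treat this as an illustration of the metric rather than the repository's implementation.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In an RL setting, `reference` would be the text given to the TTS model and `hypothesis` the ASR transcript of the generated audio; a lower value means more intelligible speech. Note that WER can exceed 1.0 when the transcript contains many inserted words.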
accelerate config
# Data preparation
python src/f5_tts/train/datasets/prepare_emilia.py
# Pretraining phase
accelerate launch src/f5_tts/train/train.py
# GRPO phase
accelerate launch src/f5_tts/train/train_rl.py
- F5-TTS: backbone of our work
- E2-TTS: brilliant work, simple and effective
- Emilia, WenetSpeech4TTS: valuable datasets
- lucidrains: initial CFM structure, with bfs18 for discussion
- SD3 & Hugging Face diffusers: DiT and MMDiT code structure
- torchdiffeq as ODE solver, Vocos as vocoder
- FunASR, faster-whisper, UniSpeech: evaluation tools
- ctc-forced-aligner: speech edit test
If our work and codebase are useful to you, please cite:
@article{sun2025f5r,
title={F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization},
author={Sun, Xiaohui and Xiao, Ruitong and Mo, Jianye and Wu, Bowen and Yu, Qun and Wang, Baoxun},
journal={arXiv preprint arXiv:2504.02407},
year={2025}
}
Our code is released under the MIT License.