Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao | Zhejiang University
PyTorch Implementation of TechSinger (AAAI 2025): Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching.
We provide our implementation in this repository.
Visit our demo page for audio samples.
- 2025.2: We released the checkpoints of TechSinger!
- 2025.2: We released the code of TechSinger!
- 2024.12: TechSinger is accepted by AAAI 2025!
We provide an example of how you can generate high-fidelity samples using TechSinger.
To try it on your own dataset or on GTSinger, clone this repo to a local machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.
Simply download the models from .
Details of each folder are as follows:
| Model | Description |
|---|---|
| stage1 | stage1 model (config) |
| stage2 | stage2 model (config) |
| HIFI-GAN | Neural Vocoder |
Notably, this TechSinger checkpoint only supports Chinese and English! You should train your own model based on GTSinger for multilingual style transfer and control!
A suitable conda environment named `techsinger` can be created and activated with:

```bash
conda create -n techsinger python=3.10
conda activate techsinger
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`.
You can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.
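For example, to restrict training to the first two GPUs (the GPU indices here are illustrative):

```bash
# Use only GPUs 0 and 1; all subsequent commands in this shell inherit the setting.
export CUDA_VISIBLE_DEVICES=0,1
```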
Here we provide a singing voice synthesis pipeline using TechSinger.

- Prepare stage1 and stage2: download the checkpoints and put them in `checkpoints/stage1` and `checkpoints/stage2`.
- Prepare HIFI-GAN: download the checkpoint and put it in `checkpoints/hifigan`.
- Prepare information: provide the singer id, the target ph sequence, and, for each ph, the target note, note_dur, note_type (rest: 1, lyric: 2, slur: 3), and technique (control: 0, technique: 1, random: 2). Enter this information in `inference/techsinger.py`; a sketch of these inputs follows this list. Notably, if you want to use data from GTSinger to infer with this checkpoint, you need to modify the phonemes in GTSinger's `metadata.json` (delete "_zh" or "_en" from each phoneme) to ensure that all phonemes are included in `phone_set`!
- Infer with TechSinger:

```bash
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python inference/techsinger.py --config egs/stage2.yaml --exp_name stage2 --reset
```

- You can also use `cfg_scale` to adjust the degree of the technique: the larger the scale value, the stronger the rendered technique (see the guidance sketch below). The default value is 1.0:

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/techsinger.py --config egs/stage2.yaml --exp_name stage2 --hparams="cfg_scale=2.0" --reset
```

Generated wav files are saved in `infer_out` by default.
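As referenced above, here is a minimal sketch of the per-phoneme inputs for `inference/techsinger.py`. The field names, delimiters, and values below are assumptions for illustration only; consult the script itself for the actual interface:

```python
# Hypothetical example inputs; the four per-ph sequences must align by index.
inp = {
    "singer": "ZH-Alto-1",                 # singer id (placeholder name)
    "ph": "AP l a AP",                     # target ph sequence
    "note": "rest | C4 | C4 | rest",       # target note for each ph
    "note_dur": "0.5 | 0.6 | 0.6 | 0.4",   # target note duration (s) for each ph
    "note_type": "1 | 2 | 2 | 1",          # rest: 1, lyric: 2, slur: 3
    "tech": "0 0 1 0",                     # control: 0, technique: 1, random: 2
}
```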
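For intuition on what `cfg_scale` does: the name suggests classifier-free guidance, in which the model blends conditional and unconditional predictions, and scales above 1.0 extrapolate toward the technique condition. This is a generic sketch of that blend, not TechSinger's exact code:

```python
# Generic classifier-free guidance blend (illustrative only).
# v_cond: model prediction with the technique condition; v_uncond: without it.
v_guided = v_uncond + cfg_scale * (v_cond - v_uncond)
# cfg_scale = 1.0 reproduces the conditional prediction;
# cfg_scale > 1.0 pushes the output further toward the technique.
```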
- Prepare your own singing dataset or download GTSinger.
- Put `metadata.json` (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, ep_types, and techniques for each singing voice) and `phone_set.json` (all phonemes of your dictionary) in `data/processed/tech`. (Note: GTSinger provides `metadata.json` and `phone_set.json`, but you need to change the wav_fn of each wav in `metadata.json` to your own absolute path.) A hypothetical entry is sketched below.
- Set `processed_data_dir` (`data/processed/tech`), `binary_data_dir`, `valid_prefixes` (a list of parts of item names, like `["Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好"]`), and `test_prefixes` in the config; a sketch of these keys also follows.
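As mentioned above, a single `metadata.json` entry might look like the following. It is shown as a Python dict so each field can carry a comment; every value and field shape is illustrative, and GTSinger's released `metadata.json` is the authoritative reference:

```python
# One hypothetical metadata.json entry. All values are made up for illustration.
entry = {
    "item_name": "Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好#0000",
    "ph": ["AP", "y", "i"],               # phoneme sequence
    "word": ["<AP>", "一", "一"],          # word aligned to each phoneme
    "ph_durs": [0.24, 0.08, 0.35],        # phoneme durations in seconds
    "wav_fn": "/abs/path/to/0000.wav",    # must be your own absolute path
    "singer": "ZH-Alto-1",
    "ep_pitches": [0, 62, 62],            # note pitch per phoneme (0 = rest)
    "ep_notedurs": [0.24, 0.43, 0.43],    # note duration per phoneme
    "ep_types": [1, 2, 2],                # rest: 1, lyric: 2, slur: 3
    "techniques": [0, 1, 1],              # technique labels (check GTSinger for the exact format)
}
```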
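And the corresponding config keys might be set like this; the `binary_data_dir` path and the prefixes are placeholders to adapt to your data:

```yaml
# Hypothetical excerpt of the config (e.g. egs/stage1.yaml); values are placeholders.
processed_data_dir: data/processed/tech
binary_data_dir: data/binary/tech
valid_prefixes: ["Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好"]
test_prefixes: ["Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好"]
```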
- Preprocess Dataset:

```bash
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stage1.yaml
```

- Train Stage1 Model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stage1.yaml --exp_name Stage1 --reset
```

- Train Stage2 Model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stage2.yaml --exp_name Stage2 --reset
```

- Infer with the trained Stage2 model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stage2.yaml --exp_name Stage2 --infer
```
This implementation uses parts of the code from the following GitHub repos: NATSpeech, TCSinger, and GTSinger, as described in our code.

If you find this code useful in your research, please cite our work:

```
@article{guo2025techsinger,
title={TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching},
author={Guo, Wenxiang and Zhang, Yu and Pan, Changhao and Huang, Rongjie and Tang, Li and Li, Ruiqi and Hong, Zhiqing and Wang, Yongqi and Zhao, Zhou},
journal={arXiv preprint arXiv:2502.12572},
year={2025}
}
```

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.