Audio samples are available at https://streichgeorg.github.io/autosing_samples/.
The code was developed under Python 3.11 and expects PyTorch 2.1.2 (including torchaudio and torchvision) to be installed. To get started, clone the repository and run `pip install .` in its root directory.
Most of the processing is implemented in processing.py. Data and intermediate results are stored in Parquet files. The initial data is expected to be split across multiple Parquet partitions using the following directory structure:
```
<dataset>/
    0/index.parquet
    1/index.parquet
    2/index.parquet
    ...
```
Each index.parquet file should contain the following columns:
- `id`: Unique identifier for each song.
- `audio`: Byte string of audio data encoded in some format supported by PyTorch.
Optional fields:
- `lrc_lyrics`: Lyrics in LRC format. The code will still compute a word-level alignment but will base it on the LRC timestamps.
- `raw_lyrics`: Text-only lyrics.
- `artist`: Artist name, used to train the embedding model.
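For illustration, the snippet below writes a single partition with pandas. The file names and the example row are hypothetical; any tool that produces Parquet files with the columns above works equally well.

```python
from pathlib import Path

import pandas as pd

# Hypothetical example: write one partition, <dataset>/0/index.parquet,
# with the required columns plus the optional raw_lyrics field.
partition_dir = Path("my_dataset/0")
partition_dir.mkdir(parents=True, exist_ok=True)

rows = [{
    "id": "song-0001",                            # unique song identifier
    "audio": Path("song-0001.mp3").read_bytes(),  # encoded audio as a byte string
    "raw_lyrics": "some text-only lyrics",        # optional
}]
pd.DataFrame(rows).to_parquet(partition_dir / "index.parquet")
```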
After running the necessary processing steps, the construct_dataset.py script is used to shuffle and chunk the dataset.
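The actual logic lives in construct_dataset.py; purely as a sketch of the shuffle-and-chunk idea (not that script's real interface or output layout), a minimal version could look like this:

```python
from pathlib import Path

import pandas as pd

def shuffle_and_chunk(dataset_dir: Path, out_dir: Path, chunk_size: int = 10_000) -> None:
    # Gather all partitions, shuffle rows globally, and rewrite fixed-size chunks.
    frames = [pd.read_parquet(p) for p in sorted(dataset_dir.glob("*/index.parquet"))]
    df = pd.concat(frames, ignore_index=True).sample(frac=1.0, random_state=0)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(df), chunk_size):
        df.iloc[i:i + chunk_size].to_parquet(out_dir / f"chunk_{i // chunk_size:05d}.parquet")
```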
Training runs can be started as follows:
```
python3 autosing/train.py --task_name t2s \
    --task_args '{"size": "<model size: small, medium, large>"}' \
    --dataset-config '{"path": "<path to your dataset>"}' \
    --tunables '{"lr0": 4e-3}'
```
```
python3 autosing/train.py --task_name sm2a \
    --task_args '{"size": "<model size: small, medium, large>"}' \
    --dataset-config '{"path": "<path to your dataset>", "multiscale": <whether to enable multiscale training>}' \
    --tunables '{"lr0": 3e-3}'
```
Samples can be generated with the sing.py script as shown below. The script expects two lines of lyrics: the first line controls the first 15 seconds of the output and the second line controls the last 15 seconds.
```
python3 autosing/sing.py <audiofile containing reference song excerpt> \
    --lyrics $'i saw you standing under moonlight your eyes like diamonds in the sky i felt a spark ignite oh couldn\'t help but catch your smile\n we were strangers in a crowded room but something pulled me close to you a whisper in the wind oh a heartbeat racing to the truth'
```
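If you still need to cut a reference excerpt out of a full song, torchaudio (already a dependency) can do it; the file names and the 30-second length below are arbitrary choices, not requirements of sing.py.

```python
# Cut an excerpt from a full song with torchaudio. File names and the
# 30-second length are illustrative assumptions, not sing.py requirements.
import torchaudio

waveform, sample_rate = torchaudio.load("full_song.mp3")
excerpt = waveform[:, : 30 * sample_rate]  # keep the first 30 seconds
torchaudio.save("reference_excerpt.wav", excerpt, sample_rate)
```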
Our architecture and training code are based on the wonderful WhisperSpeech codebase.