This repo contains PyTorch model definitions, pre-trained weights, and training/sampling code for the paper Flux that Plays Music. It explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation. The model architecture is shown below:
You can refer to the link to set up the running environment.
To launch latent-space training of the small version with N GPUs on one node using PyTorch DDP:

```bash
torchrun --nnodes=1 --nproc_per_node=N train.py \
--version small \
--data-path xxx \
--global_batch_size 128
```

Training scripts for the other model sizes can be found in the scripts folder.
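For reference, many DDP training scripts of this kind split --global_batch_size evenly across all GPUs; the sketch below illustrates that convention and is an assumption about train.py rather than its actual code (WORLD_SIZE is set by torchrun):

```python
# Illustrative sketch only: a common DDP convention is to split the global
# batch evenly across GPUs. The actual behaviour of train.py may differ.
import os

global_batch_size = 128
world_size = int(os.environ.get("WORLD_SIZE", 1))  # set automatically by torchrun

assert global_batch_size % world_size == 0, "global batch must divide evenly across GPUs"
per_gpu_batch_size = global_batch_size // world_size
print(f"{world_size} GPU(s) -> {per_gpu_batch_size} samples per GPU per step")
```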
We include a sample.py script that samples music clips from a FluxMusic model according to text prompts:

```bash
python sample.py \
--version small \
--ckpt_path /path/to/model \
--prompt_file config/example.txt
```

All prompts used in the paper are listed in config/example.txt.
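A minimal sketch of loading the prompts, assuming one text prompt per line; check config/example.txt for the actual format:

```python
# Minimal sketch: read prompts from the prompt file, assuming one prompt per
# line. The actual format of config/example.txt may differ.
with open("config/example.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(prompts)} prompts")
```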
We use the VAE and vocoder from AudioLDM2, along with CLAP-L and T5-XXL as text encoders. You can download them directly from the table below; we also provide the training scripts used in our experiments.
Note that in the actual experiments, training was restarted after a machine malfunction, so some scripts include resume options.
| Model | URL | Training scripts |
|---|---|---|
| VAE | link | - |
| Vocoder | link | - |
| T5-XXL | link | - |
| CLAP-L | link | - |
| FluxMusic-Small | link | link |
| FluxMusic-Base | link | link |
| FluxMusic-Large | link | link |
| FluxMusic-Giant | link | link |
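As a rough sketch, the two text encoders can be loaded with Hugging Face transformers. The hub IDs below are placeholders (use the checkpoints from the table above), and the repo's own loading code may differ:

```python
# Rough sketch of loading the two text encoders with Hugging Face transformers.
# The hub IDs are placeholders -- substitute the checkpoints from the table above.
import torch
from transformers import AutoTokenizer, T5EncoderModel, ClapTextModelWithProjection

prompt = "a calm piano melody with soft strings"

# T5-XXL encoder: produces per-token text features (note: a very large download)
t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
t5_in = t5_tok(prompt, return_tensors="pt", padding="max_length",
               max_length=256, truncation=True)  # max_length is illustrative
with torch.no_grad():
    t5_hidden = t5_enc(**t5_in).last_hidden_state   # shape (1, 256, d_model)

# CLAP text tower: produces a single pooled text embedding
clap_tok = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
clap_enc = ClapTextModelWithProjection.from_pretrained("laion/clap-htsat-unfused")
clap_in = clap_tok(prompt, return_tensors="pt", padding=True)
with torch.no_grad():
    clap_embed = clap_enc(**clap_in).text_embeds    # shape (1, projection_dim)
```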
The construction of the training data is illustrated in the test.py file, which shows a simple way of combining different datasets into a JSON file.
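A minimal sketch of such a build, assuming each source dataset provides a captions file mapping audio filenames to text; the field names and paths below are illustrative and should be adapted to what test.py actually expects:

```python
# Minimal sketch: merge several datasets into one JSON metadata file.
# Field names ("audio_path", "caption") and paths are illustrative only.
import json
from pathlib import Path

def collect(audio_root, caption_file):
    """Read one dataset's caption file ({filename: caption}) into entries."""
    with open(caption_file) as f:
        mapping = json.load(f)
    return [{"audio_path": str(Path(audio_root) / name), "caption": text}
            for name, text in mapping.items()]

combined = []
combined += collect("data/dataset_a/audio", "data/dataset_a/captions.json")
combined += collect("data/dataset_b/audio", "data/dataset_b/captions.json")

with open("train_data.json", "w") as f:
    json.dump(combined, f, indent=2)
```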
Due to copyright issues, the data used in the paper must be downloaded by yourself. A quick download link can be found on Hugging Face :).
The codebase is based on the awesome Flux and AudioLDM2 repos.