PyTorch implementation for filling in MIDI velocities from given MIDI notes. The model is a U-Net image colorizer trained on expert performances from the Piano-e-Competition (MAESTRO dataset). It works on any instrumental MIDI, but is most expressive on piano (training on other instruments is planned for future work).
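The snippet below is a minimal sketch of this colorization framing (note on/off as the grayscale input, velocity as the missing "color"), assuming a pretty_midi-based piano-roll pipeline; the file name and the 10 ms frame rate are illustrative assumptions, not taken from the repo's code.

```python
# Minimal sketch of the piano-roll "image" framing, assuming pretty_midi;
# the frame rate and file name below are illustrative assumptions.
import numpy as np
import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")          # any piano MIDI file
notes = midi.instruments[0].notes

fs = 100                                              # 10 ms frames (assumed)
n_frames = int(np.ceil(midi.get_end_time() * fs))

onoff = np.zeros((128, n_frames), dtype=np.float32)     # "grayscale" input: note on/off
velocity = np.zeros((128, n_frames), dtype=np.float32)  # "color" target: velocity

for note in notes:
    start = int(note.start * fs)
    end = max(int(note.end * fs), start + 1)
    onoff[note.pitch, start:end] = 1.0
    velocity[note.pitch, start:end] = note.velocity / 127.0  # scale to [0, 1]
```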
This repo provides supplementary materials for our paper "Filling MIDI Velocity using U-Net Image Colorizer" (CMMR 2025):
- WandB workspace (publicly available, but requires logging in to WandB first)
- WandB report
  - This WandB report includes the quantitative results; refer to Tables 2 & 3 in the paper.
  - This WandB report also includes the hyperparameter search not detailed in the paper (omitted there to save space).
- train.py: model training entry point
- evaluation.ipynb: reproduces the quantitative results in Tables 2 & 3 of our paper
- interface.ipynb: colorizes & visualizes the MIDI; reproduces Figures 1–4 in our paper
- results/: directory for models, outputs and audio demos
  - checkpoints.zip: download and extract here to run the interface with our trained model
  - subjective_test_audio.rar: listening test samples generated from our colorized MIDI using PianoTeq 8 (see Section 5.2 of the paper)
- compare/: the Flat model and Kim2023's Seq2Seq model
All training settings are defined in conf/config.yaml.
- Some data filtering operations were implemented but not used in our experiments.
- To reproduce our results, please refer to the training logs and parameter tracking in our WandB workspace.
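The dotted overrides in the training commands below look Hydra/OmegaConf-style; assuming that, the sketch here shows one way to inspect or tweak the resolved options outside the CLI. The key names are taken from the commands in this README; everything else is an assumption about the repo's internals.

```python
# Hedged sketch: inspect/override conf/config.yaml with OmegaConf,
# assuming the repo's dotted CLI overrides map onto an OmegaConf config.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/config.yaml")
print(OmegaConf.to_yaml(cfg))  # dump all training settings

# Apply the same kind of dotted overrides the CLI uses
overrides = OmegaConf.from_dotlist(["ae.model=UNet", "loss.cosim=0.2"])
cfg = OmegaConf.merge(cfg, overrides)
```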
Please download the following datasets and place them under the /dataset folder:
Tested on:
- Ubuntu 20.04 (CUDA 12.0)
- Ubuntu 22.04 (CUDA 12.2)
Create the environment from the provided file:

```bash
conda env create -f environment.yaml
```

Alternatively, create it manually:

```bash
conda create --name velocity_pred python=3.11
conda install pytorch=2.2.2 torchvision=0.17.2 torchaudio=2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install lightning -c conda-forge
```

We use Weights & Biases for experiment tracking:

```bash
wandb login
```

If preferred, you can switch to TensorBoard by modifying train.py.
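A rough sketch of what that TensorBoard swap might look like, assuming train.py builds a standard PyTorch Lightning Trainer with a WandbLogger; the project name and log directory below are placeholders, not names from the repo.

```python
# Hedged sketch of switching the experiment logger in train.py,
# assuming a standard Lightning Trainer; names below are placeholders.
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import TensorBoardLogger, WandbLogger

# logger = WandbLogger(project="velocity_pred")        # default: Weights & Biases
logger = TensorBoardLogger(save_dir="results/tb_logs")  # alternative: TensorBoard

trainer = Trainer(logger=logger)
```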
Edit training options in conf/config.yaml. Then run:
```bash
# Re-implemented ConvAE baseline
python train.py exp.test_dataset="MAESTRO" matrix.seg_time=10 ae.model="ConvAE" loss.type="BCELoss" loss.mask='element_wise' loss.weight='u_shape' loss.cosim=0.2 exp.save_k_ckpt=10

# Proposed U-Net with 2x2 attention window
python train.py exp.test_dataset="MAESTRO" matrix.seg_time=10 ae.model="UNet" ae.ablation="attn" ae.attn_window=2 loss.type="BCELoss" loss.mask='element_wise' loss.weight='u_shape' loss.cosim=0.2 exp.save_k_ckpt=10
```

Use evaluation.ipynb to reproduce our results. You can skip training by downloading our pretrained model checkpoints.zip and extracting it into the results/checkpoints folder.
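If you want to sanity-check the extracted checkpoints before opening the notebook, the sketch below assumes they are standard Lightning .ckpt files under results/checkpoints; the glob pattern is an assumption, not the repo's layout.

```python
# Hedged sketch: peek inside a pretrained checkpoint from checkpoints.zip,
# assuming standard Lightning .ckpt files under results/checkpoints.
from pathlib import Path
import torch

ckpt_path = next(Path("results/checkpoints").glob("**/*.ckpt"))
state = torch.load(ckpt_path, map_location="cpu")
print(ckpt_path.name, list(state.keys()))  # typically 'state_dict', 'hyper_parameters', ...
```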
For the other models' results in Tables 2 & 3:

- Flat model: compare/Flat_model/flat_evaluation.ipynb
- Kim2023 Seq2Seq model: compare/Kim2023_model/seq2seq_evalaution.ipynb
Use interface.ipynb to fill in a MIDI file's velocities: given a MIDI without velocity, you can visualize and obtain the corresponding velocity-filled MIDI. You can skip training and directly use our pretrained model from checkpoints.zip, as in Step 5. Below are MIDI examples from [Human] and from the [ConvAE, UNet] results; find more examples in interface.ipynb and audio demos in subjective_test_audio.rar.
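As a rough illustration of the final write-back step the notebook performs, the helper below assumes the model produces a 128 x n_frames velocity map in [0, 1] at the same frame rate as the input roll; the function name and frame rate are placeholders of ours, not the notebook's API.

```python
# Hedged sketch: write a predicted velocity map back into a MIDI file,
# assuming a 128 x n_frames prediction in [0, 1]; names/frame rate are placeholders.
import numpy as np
import pretty_midi

def fill_velocities(in_path: str, out_path: str, pred: np.ndarray, fs: int = 100) -> None:
    midi = pretty_midi.PrettyMIDI(in_path)
    for inst in midi.instruments:
        for note in inst.notes:
            start = min(int(note.start * fs), pred.shape[1] - 1)
            end = min(max(int(note.end * fs), start + 1), pred.shape[1])
            v = float(pred[note.pitch, start:end].mean())      # average prediction over the note
            note.velocity = int(np.clip(round(v * 127), 1, 127))
    midi.write(out_path)
```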
Copyright 2025 Dolby. The code and checkpoints are released for academic and non-commercial use only, under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).