Fine-tuning multimodal LLMs to be world-class DJs 🎵
🚧 This is a work in progress! A release will be made once it's ready.
- Song Structure Analysis – Identify sections like intro, verse, chorus, drop, and outro
- BPM Estimation – Estimate the tempo or rhythmic feel of a track
- Key and Chord Detection – Detect musical key and chord progressions
- Genre Classification – Classify the track into one or more genres
- Mood and Energy Analysis – Tag tracks with emotional and intensity labels
- Cue Point Recommendation – Suggest where to start or end playback for mixing
- Instrumental and Vocal Presence Detection – Identify if a track has vocals or is instrumental
- Loop Region Suggestion – Find sections that can be looped smoothly
- Drop Detection – Locate the most impactful or climactic moment
A novel annotated dataset of music licensed under Creative Commons is introduced. The annotations are provided as metadata for each audio file, containing information such as song sections, BPM, key, chord progression, genre, mood, energy, cue points, instrumental and vocal sections, loopable regions, and beat drops.
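For illustration only, a single annotation record for one audio file could look something like the Python dictionary below. The field names, value formats, and numbers are assumptions made for this sketch, not the released schema.

```python
# Illustrative shape of one annotation record. Field names, units, and values
# are assumptions for this sketch, not the dataset's actual schema.
example_annotation = {
    "file": "dataset/music/12345_0.mp3",  # hypothetical <upload_id>_<file_index>.mp3
    "bpm": 124.0,
    "key": "A minor",
    "chords": ["Am", "F", "C", "G"],
    "genres": ["house", "electronic"],
    "mood": ["uplifting", "energetic"],
    "energy": 0.8,  # e.g. normalized to 0-1
    "sections": [
        {"label": "intro", "start": 0.0, "end": 15.2},  # seconds
        {"label": "drop", "start": 61.4, "end": 92.1},
    ],
    "cue_points": [15.2, 61.4],
    "vocal_sections": [{"start": 15.2, "end": 45.0}],
    "loop_regions": [{"start": 30.7, "end": 46.1}],
    "drops": [61.4],
}
```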
To fetch, review, and select music from ccMixter, the following scripts have been created (they must be run in this order):
- `uv run dataset/fetch_ccmixter.py` uses ccMixter's API to fetch the list of all uploads with a CC BY license, saving the data as JSONL to `dataset/ccmixter_data.jsonl`. This script must be run first.
- `uv run dataset/select_ccmixter.py` provides a Terminal User Interface (TUI) to navigate, view, listen to, and select uploads to be included in the dataset. The selected upload IDs are saved one per line to `dataset/selected_uploads.txt`.
- `uv run dataset/download_ccmixter.py` downloads the selected uploads, saving them to `dataset/music/<upload_id>_<file_index>.mp3`. It currently only downloads the first file of each upload, if it's in MP3 format.
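As a minimal sketch of how the selection and download steps fit together (this is not the actual `download_ccmixter.py` implementation, and the helper names are made up), the selected IDs can be read one per line and mapped to the `dataset/music/<upload_id>_<file_index>.mp3` naming scheme described above:

```python
# Sketch only: mirrors the file layout described above; helper names are
# hypothetical and this is not the real download_ccmixter.py.
from pathlib import Path

MUSIC_DIR = Path("dataset/music")

def read_selected_uploads(path: str = "dataset/selected_uploads.txt") -> list[str]:
    """Read the selected upload IDs, one per line, skipping blanks."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

def output_path(upload_id: str, file_index: int = 0) -> Path:
    """Build the target path dataset/music/<upload_id>_<file_index>.mp3."""
    return MUSIC_DIR / f"{upload_id}_{file_index}.mp3"
```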
The provided dataset can be used to fine-tune any multimodal LLM suited to audio understanding, i.e. one capable of processing text and audio inputs simultaneously.
The project is currently based on Qwen3-Omni. Support for more advanced and smaller multimodal models is planned.
The baseline or fine-tuned LLM should be run via Gradio, which provides an API for evaluations and the demo app. This requires a GPU with sufficient VRAM (e.g. an NVIDIA H100).
Qwen3-Omni has an official Hugging Face Space; as an example, the Gradio API address of that Space is https://qwen-qwen3-omni-demo.hf.space/. In practice, however, the model should be run on a local machine with a suitable GPU or via a cloud GPU provider.
Once Gradio is running, inference can be performed with the `inference/infer.py` script:
```
uv run inference/infer.py \
  --client https://qwen-qwen3-omni-demo.hf.space/ \
  --text "Estimate the BPM (beats per minute) of this track. Provide your answer as a single numerical value representing the tempo." \
  --audio ~/Music/Test.mp3
```

Example output:

```
120
```
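The same Gradio deployment can also be queried directly from Python with the `gradio_client` package. The sketch below assumes a text-plus-audio endpoint: the real `api_name`, argument order, and file path depend on the deployed app (check its "Use via API" page), so treat this as a starting point rather than the project's inference code.

```python
# Hedged sketch: calls a Gradio API directly. The api_name and argument order
# are placeholders; consult the deployed app's "Use via API" page for the real ones.
from gradio_client import Client, handle_file

client = Client("https://qwen-qwen3-omni-demo.hf.space/")

result = client.predict(
    "Estimate the BPM (beats per minute) of this track. "
    "Provide your answer as a single numerical value representing the tempo.",
    handle_file("/path/to/Test.mp3"),  # hypothetical local audio file
    api_name="/predict",               # placeholder endpoint name
)
print(result)
```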
| Task | Baseline Accuracy | Fine-Tuned Accuracy |
|---|---|---|
| Song Structure Analysis | ||
| BPM Estimation | ||
| Key and Chord Detection | ||
| Genre Classification | ||
| Mood and Energy Analysis | ||
| Cue Point Recommendation | ||
| Instrumental and Vocal Presence Detection | ||
| Loop Region Suggestion | ||
| Drop Detection |
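Per-task accuracy definitions are still to be filled in. As one illustration of what such a metric could look like, BPM estimation might be scored as the fraction of tracks whose predicted tempo falls within a tolerance of the annotated value; the tolerance and scoring rule below are assumptions, not the project's evaluation protocol.

```python
# Illustrative metric only: the tolerance and scoring rule are assumptions.
def bpm_accuracy(predicted: list[float], reference: list[float], tol: float = 2.0) -> float:
    """Fraction of tracks whose predicted BPM is within `tol` BPM of the annotation."""
    if not reference:
        return 0.0
    hits = sum(abs(p - r) <= tol for p, r in zip(predicted, reference))
    return hits / len(reference)
```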
A project by Mohammad Tomaraei.
```bibtex
@misc{tomaraei2025,
  title  = {DJ LLM: Fine-tuning multimodal LLMs to be world-class DJs},
  author = {Mohammad Tomaraei},
  year   = {2025},
  url    = {https://github.com/themreza/DJ-LLM},
}
```

- Qwen3-Omni is a large language model (LLM) developed by the Qwen team at Alibaba Cloud
- ms-swift is a fine-tuning framework developed by the ModelScope community
- evalscope is an LLM evaluation framework developed by the ModelScope community
- The music files used in the dataset are licensed under Creative Commons (please see `dataset/ATTRIBUTION.csv` for a complete list of attributions)
- The DJ LLM logo was generated with Microsoft Copilot and animated with OpenAI's Sora 2 Pro