Overview of Lyra:
Lyra shows superiority compared with leading omni-models in:
- Stronger performance: achieves SOTA results across a variety of speech-centric tasks.
- More versatile: supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
- More efficient: requires less training data and supports faster training and inference.
- [12/12] 🔥 Lyra is coming! We release the paper, demo, code, and models. More related data and checkpoints will be released soon!
We provide a video demo here for a better experience and illustration. More examples can be found on our project page, and feel free to try our online demo! Due to the computing cost, the GPU memory of the demo machine (3090), and upload storage limits, the long-speech function is not supported in the current online demo. 😰
Please follow the instructions below to install the required packages.
- Clone this repository:
```bash
git clone https://github.com/dvlab-research/Lyra.git
```

- Install Package:

```bash
conda create -n lyra python=3.10 -y
conda activate lyra
cd Lyra
pip install --upgrade pip
pip install -e .
```

- Install optional packages for simultaneous text-speech generation:
```bash
pip install pip==24.0
pip install fairseq==0.12.2
pip install --upgrade pip
```

Lyra supports multi-modal inputs. When the data contains a speech modality, we use the latent cross-modality regularizer to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, multi-modality LoRA and latent multi-modality extraction modules operate synergistically, facilitating the simultaneous generation of both speech and text outputs.
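As a rough illustration of this flow, here is a minimal PyTorch-style sketch; all class and argument names below are hypothetical and do not correspond to Lyra's actual code:

```python
# Hypothetical sketch of the multi-modal forward flow described above.
# Names (encoders, projectors, OmniForward) are illustrative, not Lyra's real classes.
import torch
import torch.nn as nn

class OmniForward(nn.Module):
    def __init__(self, encoders: dict, projectors: dict, llm: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)      # one encoder per modality
        self.projectors = nn.ModuleDict(projectors)  # map encoder dims to the LLM dim
        self.llm = llm                               # LLM carrying multi-modality LoRA

    def forward(self, inputs: dict) -> torch.Tensor:
        tokens = []
        for modality, data in inputs.items():        # e.g. "image", "video", "speech"
            feats = self.encoders[modality](data)    # modality-specific features
            tokens.append(self.projectors[modality](feats))
        x = torch.cat(tokens, dim=1)                 # concatenated multi-modal sequence
        # Inside the LLM, the LoRA adapters and latent multi-modality extraction
        # modules would jointly produce the text and speech output streams.
        return self.llm(inputs_embeds=x)
```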
We provide the processed data for the model training.
For model pretraining, please download the following multi-modality training data and organize it as:
-> means put the data in the local folder.
- LibriSpeech -> data/Lyra_Pretrain/LibriSpeech, data/Lyra_SFT/multi_modality_speech/LibriSpeech, and data/Lyra_Eval/LibriSpeech: download all training and development data.
- Common Voice -> data/Lyra_Pretrain/CommonVoice: download the English Common Voice Corpus.
During the pretraining process, we filtered out some noisy and overly short speech clips.
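For illustration only, a clip-length filter along these lines could look like the sketch below; the actual thresholds and noise criteria used for Lyra's pretraining are not specified here, so MIN_SECONDS is a made-up value:

```python
# Illustrative duration-based filter; Lyra's exact filtering criteria may differ.
from pathlib import Path
import torchaudio

MIN_SECONDS = 1.0  # hypothetical threshold

def keep_clip(path: str) -> bool:
    info = torchaudio.info(path)                      # reads metadata without decoding
    duration = info.num_frames / info.sample_rate     # clip length in seconds
    return duration >= MIN_SECONDS

clips = [p for p in Path("data/Lyra_Pretrain/LibriSpeech").rglob("*.flac")
         if keep_clip(str(p))]
print(f"kept {len(clips)} clips")
```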
For the image part of the finetuning data, similar to Mini-Gemini, please download the following instruction data and organize it as:
-> means put the data in the local folder.
- COCO train2017 -> data/Lyra_SFT/multi_modality_image/coco
- GQA -> data/Lyra_SFT/multi_modality_image/gqa
- OCR-VQA (we save all files as .jpg) -> data/Lyra_SFT/multi_modality_image/ocr_vqa
- TextVQA (not included for training) -> data/Lyra_SFT/multi_modality_image/textvqa
- VisualGenome part1, VisualGenome part2 -> data/Lyra_SFT/multi_modality_image/vg
- ShareGPT4V-100K -> data/Lyra_SFT/multi_modality_image/sam, share_textvqa, wikiart, web-celebrity, web-landmark
- LAION GPT4V -> data/Lyra_SFT/multi_modality_image/gpt4v-dataset
- ALLaVA Instruction -> data/Lyra_SFT/multi_modality_image/ALLaVA-4V
- DocVQA -> data/Lyra_SFT/multi_modality_image/docvqa
- ChartQA -> data/Lyra_SFT/multi_modality_image/chartqa
- DVQA -> data/Lyra_SFT/multi_modality_image/dvqa
- AI2D -> data/Lyra_SFT/multi_modality_image/ai2d
For the audio part of the finetuning data, please download the following instruction data and organize it as:
-> means put the data in the local folder.
- Lyra_MultiModal -> data/Lyra_SFT/multi_modality_speech/Lyra_MM

For details, please refer to the Lyra multi-modality preparation.
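Since Lyra builds on the LLaVA series, the instruction JSON plausibly follows the LLaVA conversation format; the entry below is a hypothetical illustration, not an actual sample from lyra_multimodal.json:

```json
{
  "id": "000001",
  "speech": "Lyra_MM/000001.mp3",
  "image": "coco/train2017/000000000001.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\n<speech>"},
    {"from": "gpt", "value": "The picture shows ..."}
  ]
}
```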
For the long-speech audio finetuning data, please download the following instruction data and organize it as:
-> means put the data in the local folder.
- Lyra_LongSpeech -> data/Lyra_SFT/long_speech/Lyra_LongSpeech

For details, please refer to the Lyra long-speech preparation.
For the text-speech generation data, please download the following instruction data and organize it as:
-> means put the data in the local folder.
- Lyra_SpeechGeneration -> data/Lyra_SFT/speech_generation

For details, please refer to the Lyra speech generation preparation.
Model evaluation data will be released soon!
Please put the pretraining data, finetuning data, and evaluation data in the Lyra_Pretrain, Lyra_SFT, and Lyra_Eval folders, following the Structure below.
We recommend downloading the pretrained weights from the following links: Qwen2VL_2B_LLM, Qwen2VL_7B_LLM, Qwen2VL_70B_LLM, Qwen2VL_2B_ViT, Qwen2VL_7B_ViT, Qwen2VL_70B_ViT, whisper-large-v3-turbo, whisper-large-v3, and imagebind_huge, and putting them in model_zoo following the Structure below.
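For checkpoints hosted on Hugging Face (e.g., Whisper), one way to fetch them into the expected model_zoo layout is huggingface_hub; the target directory below simply mirrors the Structure section, and the repo_id should be substituted with the repos linked above as needed:

```python
# Sketch: download a Hugging Face checkpoint into the model_zoo layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/whisper-large-v3",            # substitute other repos as needed
    local_dir="model_zoo/audio/whisper-large-v3",  # matches the Structure below
)
```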
Download the unit-based HiFi-GAN vocoder using the following commands:

```bash
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P model_zoo/audio/vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P model_zoo/audio/vocoder/
```

The folder structure should be organized as follows before training.
```
Lyra
├── lyra
├── scripts
├── work_dirs
│ ├── Lyra
│ │ ├── Lyra_Mini_3B
│ │ ├── Lyra_Base_9B
│ │ ├── Lyra_Pro_74B
│ │ ├── ...
├── model_zoo
│ ├── LLM
│ │ ├── Qwen2VL_2B_LLM
│ │ ├── Qwen2VL_7B_LLM
│ │ ├── Qwen2VL_70B_LLM
│ │ ├── Qwen2.5
│ │ ├── LLaMA3.2
│ │ ├── ...
│ ├── vision
│ │ ├── Qwen2VL_2B_ViT
│ │ ├── Qwen2VL_7B_ViT
│ │ ├── Qwen2VL_70B_ViT
│ │ ├── clip-vit-large
│ │ ├── siglip
│ │ ├── ConvNeXt
│ │ ├── ...
│ ├── audio
│ │ ├── whisper-large-v3-turbo
│ │ ├── whisper-large-v3
│ │ ├── imagebind_huge
│ │ ├── vocoder
│ │ ├── ...
├── data
│ ├── Lyra_Pretrain
│ │ ├── lyra_pretrain.json
│ │ ├── LibriSpeech
│ │ ├── CommonVoice
│ ├── Lyra_SFT
│ │ ├── multi_modality_speech
│ │ │ ├── lyra_multimodal.json
│ │ │ ├── Lyra_MM
│ │ │ ├── LibriSpeech
│ │ ├── multi_modality_image (similar to MGM-Finetune)
│ │ │ ├── llava
│ │ │ ├── coco
│ │ │ ├── gqa
│ │ │ ├── ocr_vqa
│ │ │ ├── textvqa
│ │ │ ├── vg
│ │ │ ├── gpt4v-dataset
│ │ │ ├── ...
│ │ ├── long_speech
│ │ │ ├── lyra_longspeech.json
│ │ │ ├── Lyra_LongSpeech
│ │ ├── speech_generation
│ │ │ ├── lyra_speechgeneration.json
│ ├── Lyra_Eval
│ │ ├── LibriSpeech
│ │ ├── TextVQA_speech
│ │ ├── MM_vet_speech
│ │ ├── Docvqa_val
│ │ ├── Chartvqa_human
│ │ ├── VideoMME_speech
│ │ ├── Lyra_needle_in_a_haystack
```
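With the files in place, the vocoder download can be sanity-checked with a short script. This is a sketch assuming fairseq 0.12.2's CodeHiFiGANVocoder API (installed in the optional step above); the dummy units are placeholders for real mHuBERT units:

```python
# Sketch: load the unit-based HiFi-GAN vocoder and synthesize a waveform
# from dummy discrete units (vocab size 1000 for the mHuBERT km1000 model).
import json
import torch
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

with open("model_zoo/audio/vocoder/config.json") as f:
    vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("model_zoo/audio/vocoder/g_00500000", vocoder_cfg)

dummy_units = torch.randint(0, 1000, (1, 50))  # placeholder unit sequence
wav = vocoder({"code": dummy_units}, dur_prediction=True)
print(wav.shape)  # 1-D waveform tensor if loading succeeded
```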
The training process consists of four stages: (1) feature alignment stage: bridge the speech and language tokens; (2) multi-modality instruction tuning stage: teach the model to follow text-image-speech multimodal instructions; (3) long-speech instruction tuning stage: enable the model to handle long speech audio; (4) text-speech streaming generation stage: enable the model to stream text and speech simultaneously.
Our models are trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly, always keeping the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus. For example, a global batch size of 128 from 8 GPUs with per_device_train_batch_size 4 and gradient_accumulation_steps 4 is preserved on 4 GPUs by doubling gradient_accumulation_steps to 8.
Please make sure you download and organize the data following Preparation before training.
NOTE: Please set hostfile/hostfile_2 for 2-machine training and hostfile/hostfile_4 for 4-machine training.
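The hostfile follows the DeepSpeed convention of one node per line with its GPU slot count; a hypothetical 2-machine example (hostnames are placeholders):

```
node-001 slots=8
node-002 slots=8
```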
(1) feature alignment stage:
```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_Pretrain.sh
```

(2) multi-modality instruction tuning stage:

```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_text_image_speech.sh
```

(3) long-speech instruction tuning stage:

```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_long_speech.sh
```

(4) text-speech streaming generation stage:

```bash
bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_speech_generate.sh
```

In the table below, TextVQA, MME, and MM-Vet are text-image benchmarks; VideoMME, MVBench, and Egoschema are text-video; TextVQAs, DocVQAs, and ChartQAs are image-speech; LibriSpeech is text-speech (WER, lower is better).

| Omni Comparison | Params. | TextVQA | MME | MM-Vet | VideoMME | MVBench | Egoschema | TextVQAs | DocVQAs | ChartQAs | LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Gemini | 8B | 71.9 | 1989 | 53.5 | - | - | - | - | - | - | - |
| LLaVA-OV | 7B | 65.4 | 1998 | 57.5 | 58.2 | 56.7 | 60.1 | - | - | - | - |
| Intern-VL2 | 8B | 77.4 | 2211 | 60.0 | 54.0 | - | - | - | - | - | - |
| Mini-Omni | 7B | - | - | - | - | - | - | - | - | - | 4.5 |
| SALMONN | 13B | - | - | - | - | - | - | - | - | - | 2.1 |
| Qwen2-Audio | 8B | - | - | - | - | - | - | - | - | - | 1.6 |
| Intern-Omni | 8B | 80.6 | 2210 | 60.0 | - | - | - | 69.1 | 79.9 | 56.0 | - |
| VITA | 66B | - | 2097 | 41.6 | 59.2 | - | - | - | - | - | 8.1 |
| EMOVA | 14B | 82.0 | 2205 | 55.8 | - | - | - | - | - | - | 4.0 |
| Lyra-Mini | 3B | 78.3 | 1884 | 51.2 | 55.0 | 62.5 | 54.1 | 73.9 | 75.0 | 40.7 | 2.1 |
| Lyra-Base | 9B | 82.6 | 2335 | 63.5 | 62.8 | 67.2 | 63.2 | 80.0 | 85.5 | 61.0 | 2.0 |
| Lyra-Pro | 74B | 83.5 | 2485 | 71.4 | 69.9 | 72.3 | 75.8 | 81.0 | 89.4 | 68.5 | 1.8 |
To be released soon!
Chat with images without the need for a Gradio interface. Multiple GPUs and 4-bit/8-bit quantized inference are also supported. Please make sure you have installed fairseq for speech generation, then try the following command for speech understanding and generation inference:
```bash
# image-file: <path to your image: context>
# speech-file: <path to your audio: instruction>
# generate speech: <output path to generated speech: examples/pred_roundX.wav>
python -m lyra.serve.cli \
    --model-path work_dirs/Lyra_Base_9B \
    --image-file examples/Chinese_painting.jpg \
    --audio-file examples/Chinese_painting.mp3 \
    --generate-speech
```

Lyra can also handle long speech input (the maximum duration can be about two to three hours).
Here is an example: ABC News, Oct. 1, 2024, 20 mins:
```bash
# speech-file: <path to your long audio: context>
# instruction via text keyboard input
python -m lyra.serve.cli \
    --model-path work_dirs/Lyra_Base_9B \
    --audio-file examples/ABC_News_20241001.mp3 \
    --generate-speech
```

Here is an example for video input with its audio (you can use ffmpeg or other tools to extract the video's audio):
```bash
# video-file: <path to your video: context>
# speech-file: <path to your audio: instruction>
python -m lyra.serve.cli \
    --model-path work_dirs/Lyra_Base_9B \
    --video-file examples/movement.mp4 \
    --audio-file examples/movement.mp3 \
    --generate-speech
```

Here is an example for video input and text instruction:
```bash
# video-file: <path to your video: context>
# instruction via text keyboard input
python -m lyra.serve.cli \
    --model-path work_dirs/Lyra_Base_9B \
    --video-file examples/Trump.mp4 \
    --generate-speech
```

To be released soon!
We provide some examples in this section. More examples can be found on our project page.
If you find this repo useful for your research, please consider citing the paper😊:
```bibtex
@article{zhong2024lyra,
  title={Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition},
  author={Zhong, Zhisheng and Wang, Chengyao and Liu, Yuqi and Yang, Senqiao and Tang, Longxiang and Zhang, Yuechen and Li, Jingyao and Qu, Tianyuan and Li, Yanwei and Chen, Yukang and Yu, Shaozuo and Wu, Sitong and Lo, Eric and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.09501},
  year={2024}
}
```
We would like to thank the following repos for their great work:
- This work is built upon the LLaVA Series, Mini-Gemini, LLaMA-Omni, fairseq, lmms-eval.
- This work utilizes models from Qwen2-VL, Qwen2 Series, LLaMA3 Series, and Whisper.
The data and checkpoints are intended for and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, Qwen, LLaMA, Whisper, and GPT-4o. The dataset is licensed CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.