Code and data for the paper "Is Extending Modality The Right Path Towards Omni-Modality?" (arXiv:2506.01872).
To install the inference environment, run the following command:

```bash
conda env create -f environment.yml
```

To generate answers from LLMs, run the script `scripts/infer.sh`.
The script performs three steps:
- Download the model tensors from Hugging Face.
- Extract the LLM component from the multimodal model.
- Generate answers with the extracted LLM.
If certain steps are not needed for a specific model, you can remove them from the script.
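As a rough illustration, the three steps amount to something like the following Python sketch. The model names, save paths, and the submodule holding the language model are assumptions for illustration, not the repo's actual code:

```python
# Hypothetical sketch of the pipeline in scripts/infer.sh; names and attribute
# paths are assumptions and may differ from the actual scripts.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2_5_VLForConditionalGeneration

# Step 1: download the model tensors from Hugging Face.
local_dir = snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct")

# Step 2: extract the LLM component from the multimodal model by copying the
# decoder weights into a plain text-only checkpoint of the same family.
vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(local_dir, torch_dtype="auto")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
llm.model.load_state_dict(vlm.model.state_dict())  # submodule names vary by transformers version
llm.lm_head.load_state_dict(vlm.lm_head.state_dict())
llm.save_pretrained("extracted-llm")

# Step 3: generate answers with the extracted LLM.
tok = AutoTokenizer.from_pretrained(local_dir)
inputs = tok("Question: ...", return_tensors="pt")
print(tok.decode(llm.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```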
To generate answers from multimodal models, run the script `scripts/infer_multimodal.sh`.
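For reference, a minimal multimodal generation loop looks roughly like this; the model name, image path, and question are placeholders, and the actual script presumably iterates over the benchmark data:

```python
# Minimal sketch of multimodal generation with a Qwen2.5-VL checkpoint;
# the checkpoint, image, and prompt below are placeholders.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Answer the question about this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("example.jpg")], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```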
For merged models, first run `python src/utils/save_merged_vlm.py` to load the merged LLM into the multimodal model and save the result. You can change the target model name in the Python file.
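A minimal sketch of what that step does, assuming a Qwen2.5-VL target and hypothetical paths (the real file may differ):

```python
# Hypothetical sketch of src/utils/save_merged_vlm.py: load merged LLM weights
# into the multimodal model's language tower and save the combined checkpoint.
import torch
from transformers import AutoModelForCausalLM, Qwen2_5_VLForConditionalGeneration

MERGED_LLM = "path/to/merged-llm"           # hypothetical: output of model merging
TARGET_VLM = "Qwen/Qwen2.5-VL-7B-Instruct"  # the target model name, editable

vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(TARGET_VLM, torch_dtype=torch.bfloat16)
llm = AutoModelForCausalLM.from_pretrained(MERGED_LLM, torch_dtype=torch.bfloat16)

# Overwrite the VLM's decoder and output head with the merged LLM's weights;
# the exact submodule names depend on the transformers version.
vlm.model.load_state_dict(llm.model.state_dict())
vlm.lm_head.load_state_dict(llm.lm_head.state_dict())

vlm.save_pretrained("merged-vlm")
```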
To train the merged model, run the script `src/training/Qwen2.5-VL/qwen-vl-finetune/train.sh`, which is adapted from the Qwen2.5-VL training code.
If you find this repo useful, please cite the following paper:
```bibtex
@article{zhu2025extending,
  title={Is Extending Modality The Right Path Towards Omni-Modality?},
  author={Zhu, Tinghui and Zhang, Kai and Chen, Muhao and Su, Yu},
  journal={arXiv preprint arXiv:2506.01872},
  year={2025}
}
```