This repo presents the code and scripts for the MoCa-Qwen25VL series of multimodal embedding models.
🏠 Homepage | 🤖 MoCa-Qwen25VL-7B | 🤖 MoCa-Qwen25VL-3B | 📚 Datasets | 📄 Paper
Highlights
- SOTA performance on MMEB (General Multimodal) and ViDoRe V2 (Document Retrieval).
- Supports text, images, and interleaved inputs.
- Generalizes well to out-of-distribution data thanks to modality-aware continual pre-training and heterogeneous contrastive fine-tuning.
- Continually pre-trained on 30B tokens of high-quality interleaved data with modality-aware reconstruction objectives.
- Contrastive fine-tuning on diverse data spanning long-form query-document pairs, curated multimodal pairs, and real-world text pairs.
- 2025‑06‑29: Initial release – paper, codes, model checkpoints, and datasets.
pip install -r requirements.txt
pip install flash-attn==2.5.8
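If you want a quick optional sanity check after installation, a short snippet like the one below confirms that torch sees a GPU and that flash-attn imports cleanly (this is just a convenience check, not part of the repo's scripts):

```python
# Optional sanity check: verify that torch detects a GPU and flash-attn imports.
import torch
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```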
See scripts in https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B and https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B.
- Preparation
bash scripts/prepare_images.sh
This script will download images for Heterogeneous Contrastive Learning from MoCa CL Pairs, mmE5 Synthetic Dataset, and MMEB-eval.
Caution: This could take a while as the images are large. Make sure you have enough disk space (at least 1 TB).
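Before kicking off the download, you can check free space with a short snippet like the following (run it from the directory where the images will be stored; this is just a convenience check, not part of prepare_images.sh):

```python
# Rough pre-flight check: make sure at least ~1 TB is free on the target disk.
import shutil

free_tb = shutil.disk_usage(".").free / 1e12
print(f"Free space: {free_tb:.2f} TB")
if free_tb < 1.0:
    raise SystemExit("Less than 1 TB free; the image download will likely fail.")
```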
We have provided example scripts in the scripts/ directory to help you get started with training and evaluation.
- Continual Pre-training
bash scripts/cpt_train.sh
Caution: Calculating statistics for batching the CPT data can take a while. However, this only needs to be done once; the stats are saved in the ${DATA_DIR}/cache/stats directory.
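For intuition, a modality-aware reconstruction objective combines a token-level loss for masked text with a regression loss for masked image patches. The sketch below illustrates that idea only; it is not the repo's implementation, and the tensor shapes and image_weight knob are assumptions:

```python
# Minimal sketch of a modality-aware reconstruction loss (illustration only):
# masked text tokens get an MLM-style cross-entropy loss, masked image patches
# get an MAE-style regression loss, and the two terms are summed.
import torch
import torch.nn.functional as F

def modality_aware_reconstruction_loss(
    text_logits,       # (n_masked_text, vocab_size) predictions at masked text positions
    text_targets,      # (n_masked_text,) original token ids
    patch_preds,       # (n_masked_patches, patch_dim) predicted patch values
    patch_targets,     # (n_masked_patches, patch_dim) original patch values
    image_weight=1.0,  # hypothetical weighting between the two terms
):
    text_loss = F.cross_entropy(text_logits, text_targets)
    image_loss = F.mse_loss(patch_preds, patch_targets)
    return text_loss + image_weight * image_loss
```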
- Contrastive Learning
bash scripts/cl_train.sh
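Conceptually, the contrastive stage trains query and document embeddings against in-batch negatives. The InfoNCE sketch below illustrates the idea; the function name and temperature value are illustrative and not taken from the training scripts:

```python
# Minimal InfoNCE sketch with in-batch negatives (illustration only):
# embeddings are L2-normalized and each query is scored against every
# document in the batch, with positives on the diagonal.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)                   # (B, D)
    d = F.normalize(doc_emb, dim=-1)                     # (B, D)
    logits = q @ d.T / temperature                       # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positive pairs on the diagonal
    return F.cross_entropy(logits, labels)
```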
- Test MMEB
bash scripts/eval_full.sh
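For reference, MMEB-style scoring boils down to ranking each query's candidates by cosine similarity and checking the top-1 hit; a toy version (not the repo's evaluation code) looks like this:

```python
# Toy precision@1 over normalized embeddings (illustration only).
import torch

def precision_at_1(query_emb, cand_emb, target_idx):
    scores = query_emb @ cand_emb.T      # (num_queries, num_candidates) cosine scores
    preds = scores.argmax(dim=-1)        # top-1 candidate per query
    return (preds == target_idx).float().mean().item()
```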
- Test ViDoRe-v2
- Install the vidore-benchmark package following the instructions in its repository.
- Move __init__.py and mmeb_qwen25_retriever.py from /evaluation/vidore_benchmark/ to the vidore-benchmark repo (src/vidore_benchmark/evaluation).
- Run
bash scripts/eval_vidore.sh
You can also use demo.py to embed your own text and images.
python demo.py
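demo.py is the authoritative reference. As a rough orientation, the text-only path looks something like the sketch below; it assumes the checkpoint loads through transformers' AutoModel/AutoTokenizer with trust_remote_code and that embeddings are mean-pooled last hidden states. For images and interleaved inputs, follow demo.py:

```python
# Minimal text-embedding sketch (assumptions noted above; see demo.py for exact usage).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "moca-embed/MoCa-Qwen25VL-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

texts = ["a photo of a cat", "a dog running on a beach"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
    emb = F.normalize(emb, dim=-1)

print(emb @ emb.T)  # cosine similarities between the two texts
```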
MoCa achieves SOTA performance on the MMEB benchmark.
MoCa surpasses several strong baselines on the ViDoRe-v2 benchmark.
Our code builds upon mmE5, VLM2Vec, and Qwen‑2.5‑VL.
@article{chen2025moca,
title={MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings},
author={Chen, Haonan and Liu, Hong and Luo, Yuping and Wang, Liang and Yang, Nan and Wei, Furu and Dou, Zhicheng},
journal={arXiv preprint arXiv:2506.23115},
year={2025}
}