This repo presents the code and scripts for the MoCa-Qwen25VL series of multimodal embedding models.
🏠 Homepage | 🤖 MoCa-Qwen25VL-7B | 🤖 MoCa-Qwen25VL-3B | 📚 Datasets | 📄 Paper
Highlights
- SOTA performance on MMEB (General Multimodal) and ViDoRe V2 (Document Retrieval).
- Supports text, images, and interleaved inputs.
- Generalizes well to out-of-distribution data thanks to modality-aware continual pre-training and heterogeneous contrastive fine-tuning.
- Continually pre-trained on 30B tokens of high-quality interleaved data with modality-aware reconstruction objectives.
- Contrastive fine-tuning on diverse data spanning long-form query-document pairs, curated multimodal pairs, and real-world text pairs.
- 2025‑06‑29: Initial release – paper, codes, model checkpoints, and datasets.
pip install -r requirements.txt
pip install flash-attn==2.5.8
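If you want a quick optional sanity check after installation, a short snippet like the one below confirms that torch sees a GPU and that flash-attn imports cleanly (this is just a convenience check, not part of the repo's scripts):

```python
# Optional sanity check: verify that torch detects a GPU and flash-attn imports.
import torch
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```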
See scripts in https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B and https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B.
- Preparation
bash scripts/prepare_images.sh
This script will download images for Heterogeneous Contrastive Learning from MoCa CL Pairs, mmE5 Synthetic Dataset, and MMEB-eval.
Caution: This could take a while as the images are large. Make sure you have enough disk space (at least 1 TB).
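Before kicking off the download, you can check free space with a short snippet like the following (run it from the directory where the images will be stored; this is just a convenience check, not part of prepare_images.sh):

```python
# Rough pre-flight check: make sure at least ~1 TB is free on the target disk.
import shutil

free_tb = shutil.disk_usage(".").free / 1e12
print(f"Free space: {free_tb:.2f} TB")
if free_tb < 1.0:
    raise SystemExit("Less than 1 TB free; the image download will likely fail.")
```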
We have provided example scripts in the scripts/ directory to help you get started with training and evaluation.
- Continual Pre-training
bash scripts/cpt_train.sh
Caution: Calculating statistics for batching the CPT data can take a while. However, this only needs to be done once; the stats are saved in the ${DATA_DIR}/cache/stats directory.
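For intuition, a modality-aware reconstruction objective combines a token-level loss for masked text with a regression loss for masked image patches. The sketch below illustrates that idea only; it is not the repo's implementation, and the tensor shapes and image_weight knob are assumptions:

```python
# Minimal sketch of a modality-aware reconstruction loss (illustration only):
# masked text tokens get an MLM-style cross-entropy loss, masked image patches
# get an MAE-style regression loss, and the two terms are summed.
import torch
import torch.nn.functional as F

def modality_aware_reconstruction_loss(
    text_logits,       # (n_masked_text, vocab_size) predictions at masked text positions
    text_targets,      # (n_masked_text,) original token ids
    patch_preds,       # (n_masked_patches, patch_dim) predicted patch values
    patch_targets,     # (n_masked_patches, patch_dim) original patch values
    image_weight=1.0,  # hypothetical weighting between the two terms
):
    text_loss = F.cross_entropy(text_logits, text_targets)
    image_loss = F.mse_loss(patch_preds, patch_targets)
    return text_loss + image_weight * image_loss
```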
- Contrastive Learning
bash scripts/cl_train.sh
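Conceptually, the contrastive stage trains query and document embeddings against in-batch negatives. The InfoNCE sketch below illustrates the idea; the function name and temperature value are illustrative and not taken from the training scripts:

```python
# Minimal InfoNCE sketch with in-batch negatives (illustration only):
# embeddings are L2-normalized and each query is scored against every
# document in the batch, with positives on the diagonal.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)                   # (B, D)
    d = F.normalize(doc_emb, dim=-1)                     # (B, D)
    logits = q @ d.T / temperature                       # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positive pairs on the diagonal
    return F.cross_entropy(logits, labels)
```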
- Test MMEB
bash scripts/eval_full.sh
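For reference, MMEB-style scoring boils down to ranking each query's candidates by cosine similarity and checking the top-1 hit; a toy version (not the repo's evaluation code) looks like this:

```python
# Toy precision@1 over normalized embeddings (illustration only).
import torch

def precision_at_1(query_emb, cand_emb, target_idx):
    scores = query_emb @ cand_emb.T      # (num_queries, num_candidates) cosine scores
    preds = scores.argmax(dim=-1)        # top-1 candidate per query
    return (preds == target_idx).float().mean().item()
```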
- Test ViDoRe-v2
- Install the vidore-benchmark package following the instructions in its repository.
- Move __init__.py and mmeb_qwen25_retriever.py from /evaluation/vidore_benchmark/ to the vidore-benchmark repo (src/vidore_benchmark/evaluation).
- Run
bash scripts/eval_vidore.sh
You can also use demo.py to embed your own text and images.
python demo.py
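demo.py is the authoritative reference. As a rough orientation, the text-only path looks something like the sketch below; it assumes the checkpoint loads through transformers' AutoModel/AutoTokenizer with trust_remote_code and that embeddings are mean-pooled last hidden states. For images and interleaved inputs, follow demo.py:

```python
# Minimal text-embedding sketch (assumptions noted above; see demo.py for exact usage).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "moca-embed/MoCa-Qwen25VL-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

texts = ["a photo of a cat", "a dog running on a beach"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
    emb = F.normalize(emb, dim=-1)

print(emb @ emb.T)  # cosine similarities between the two texts
```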
MoCa achieves SOTA performance on the MMEB benchmark.
MoCa surpasses several strong baselines on the ViDoRe-v2 benchmark.
Our code builds upon mmE5, VLM2Vec, and Qwen‑2.5‑VL.
@article{chen2025moca,
title={MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings},
author={Chen, Haonan and Liu, Hong and Luo, Yuping and Wang, Liang and Yang, Nan and Wei, Furu and Dou, Zhicheng},
journal={arXiv preprint arXiv:2506.23115},
year={2025}
}