🌟 [CVPR26] ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
- Accepted at CVPR 2026
- 🌐 project page
- 📄 arXiv
Daichi Yashima¹,³, Shuhei Kurita²,³, Yusuke Oda³, Komei Sugiura¹
¹Keio University, ²NII, ³NII LLMC
uv sync --extra train
# Or with pip
pip install -e ".[train]"

For the motion-vector extraction / visualization utilities under mviz/, add the mviz extra (also covered by train):
pip install -e ".[mviz]"

The pretrained ReMoRa checkpoint is available on Hugging Face:
python infer_with_mv.py \
--checkpoint checkpoints/ReMoRa-7B \
--base lmms-lab/LLaVA-Video-7B-Qwen2 \
--video /path/to/video.mp4 \
--prompt "Describe what happens in this video."

python scripts/extract_motion_vectors.py \
--video-root /path/to/your/videos \
--output-dir DATAS/motion_vectors \
--fps 16 --block-size 16

# Training
bash scripts/train_remora.sh # add --motion_vector_dir DATAS/motion_vectors
# or:
export REMORA_MV_DIR=DATAS/motion_vectors
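
Because the motion-vector directory can be supplied either as a flag or through REMORA_MV_DIR, it can be worth checking the path before launching a long training run. The following is a minimal pre-flight sketch, not code from this repository: the helper name `resolve_mv_dir` is hypothetical, and the flag-over-environment precedence it applies is an assumption rather than documented behaviour.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_mv_dir(cli_value: Optional[str] = None) -> Path:
    """Return the motion-vector directory, or raise if it is unusable.

    Hypothetical helper: preferring the explicit value over the
    REMORA_MV_DIR environment variable is an assumption.
    """
    raw = cli_value or os.environ.get("REMORA_MV_DIR")
    if not raw:
        raise ValueError("set --motion_vector_dir or REMORA_MV_DIR")

    path = Path(raw)
    if not path.is_dir():
        raise FileNotFoundError(f"motion-vector directory not found: {path}")

    # An empty directory usually means extraction has not been run yet.
    if not any(path.iterdir()):
        raise FileNotFoundError(f"motion-vector directory is empty: {path}")
    return path
```

Calling `resolve_mv_dir("DATAS/motion_vectors")` after running the extraction script returns the path when the directory exists and contains at least one entry, and raises otherwise. The check makes no assumption about the file format inside the directory.
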
# Batch evaluation
python llava/eval/infer.py \
--motion_vector_dir DATAS/motion_vectors \
...

This codebase builds on:
This work is licensed under the BSD-3-Clause-Clear license. See LICENSE.