BIMBA

 @article{islam2025bimba,
   title={BIMBA: Selective-Scan Compression for Long-Range Video Question Answering},
   author={Islam, Md Mohaiminul and Nagarajan, Tushar and Wang, Huiyu and Bertasius, Gedas and Torresani, Lorenzo},
   journal={arXiv preprint arXiv:2503.09590},
   year={2025}
 }

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 Model | 🌟 Demo

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, and Lorenzo Torresani
Accepted by CVPR 2025

BIMBA Overview

BIMBA is a multimodal large language model (MLLM) capable of efficiently processing long-range videos. Our model leverages the selective-scan mechanism of Mamba to select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, Video-MME, and MLVU.
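As background, the selective scan underlying Mamba can be summarized by the standard input-dependent state-space recurrence (this is the general Mamba formulation, in our own notation, not an equation taken from the BIMBA paper):

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t$$

where the discretized parameters $\bar{A}_t$, $\bar{B}_t$, and $C_t$ are themselves functions of the input $x_t$, so at each step the scan can selectively retain or discard information from the token stream.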

Installation 🔧

Please use the following commands to install the required packages:

conda create --name bimba python=3.10
conda activate bimba
pip install --upgrade pip  # enables PEP 660 editable installs
pip install -e .
pip install -e ".[train]"

This codebase is built on the LLaVA-NeXT and Mamba codebases.

Demo Selective-Scan Compression

We provide a demo notebook showing how to use the selective-scan (Mamba-based) token compression method for long-range videos introduced in our paper. Following this notebook, you can easily apply this compression technique to reduce the number of input video tokens fed to your model.
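To make the idea concrete, here is a minimal, self-contained toy sketch of selective-scan-style token compression in NumPy. This is illustrative only and is not BIMBA's implementation: the function name, the sigmoid gates, and the subsampling schedule are all assumptions made for this example. It runs an input-dependent linear recurrence over the token sequence and keeps every `keep_every`-th hidden state as the compressed sequence.

```python
import numpy as np

def selective_scan_compress(tokens, keep_every=8, seed=0):
    """Toy selective-scan compression (illustrative, not BIMBA's code).

    Runs a simple input-dependent ("selective") linear recurrence over
    the token sequence, then keeps every `keep_every`-th hidden state
    as the compressed token sequence.
    """
    rng = np.random.default_rng(seed)
    T, D = tokens.shape
    # Random projections that make the gates input-dependent.
    W_a = rng.standard_normal(D) / np.sqrt(D)
    W_b = rng.standard_normal(D) / np.sqrt(D)

    h = np.zeros(D)
    states = []
    for t in range(T):
        x = tokens[t]
        a = 1.0 / (1.0 + np.exp(-(x @ W_a)))  # selective decay gate in (0, 1)
        b = 1.0 / (1.0 + np.exp(-(x @ W_b)))  # selective input gate in (0, 1)
        h = a * h + b * x                     # h_t = a_t * h_{t-1} + b_t * x_t
        states.append(h)
    states = np.stack(states)                 # (T, D) hidden-state trajectory
    return states[keep_every - 1::keep_every] # (T // keep_every, D)

# Example: compress 64 frame tokens of dim 16 down to 8 tokens.
frames = np.random.default_rng(1).standard_normal((64, 16))
compressed = selective_scan_compress(frames, keep_every=8)
print(compressed.shape)  # (8, 16)
```

Because each kept state summarizes everything scanned up to that point (weighted by the input-dependent gates), the subsampled states carry more context than naively dropping frames; the actual method in the paper learns these dynamics end-to-end.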

Download Model

Download the model from HuggingFace 🤗

cd checkpoints
git clone https://huggingface.co/mmiemon/BIMBA-LLaVA-Qwen2-7B

Model Inference

Use the following script to run inference on any video.

python inference.py

Model Training

  1. Follow the LLaVA-NeXT codebase to prepare the training data (e.g., LLaVA-Video-178K). Update the exp.yaml file to point to your data.
  2. Follow the commands below to train the BIMBA model:
bash scripts/video/train/Train_BIMBA_LLaVA_Qwen2_7B.sh

Evaluation

Evaluate Single Dataset (e.g., VideoMME)

  1. First, download the videos from the huggingface/dataset repo and replace "path_to_video_folder" accordingly.
  2. We provide the formatted json files for the evaluation datasets in the BIMBA-LLaVA-NeXT/DATAS/eval folder. You can format a new dataset using the script.
python llava/eval/format_eval_data.py
  3. Use the following script to evaluate a particular dataset.
model_path="checkpoints/BIMBA-LLaVA-Qwen2-7B"
model_base="lmms-lab/LLaVA-Video-7B-Qwen2"
model_name="llava_qwen_lora"
results_dir="results/BIMBA-LLaVA-Qwen2-7B"
dataset_name="VideoMME"
python llava/eval/infer.py \
    --model_path $model_path \
    --model_base $model_base \
    --model_name $model_name \
    --results_dir ${results_dir}/${dataset_name} \
    --max_frames_num 64 \
    --dataset_name $dataset_name \
    --video_root "path_to_video_folder" \
    --data_path DATAS/eval/VideoMME/formatted_dataset.json \
    --cals_acc

Evaluate All Benchmarks

  1. Use the following script to evaluate PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, Video-MME, and MLVU benchmarks.
bash scripts/video/eval/Eval_BIMBA_LLaVA_Qwen2_7B.sh
  2. For EgoSchema, use the following script to prepare the submission file.
python llava/eval/submit_ego_schema.py

Then, you can either submit directly to the Kaggle competition page or use the following command for submission and evaluation.

kaggle competitions submit -c egoschema-public -f results/BIMBA-LLaVA-Qwen2-7B/EgoSchema/es_submission.csv -m "BIMBA-LLaVA-Qwen2-7B"
  3. You should get the following results on these benchmarks.

| Dataset | EgoSchema | VNBench | VideoMME | MLVU | LongVideoBench | NextQA | PerceptionTest |
|---------|-----------|---------|----------|------|----------------|--------|----------------|
| Results | 71.14     | 77.88   | 64.67    | 71.37| 59.46          | 83.73  | 68.51          |
