⭐ VALOR - No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

This is the code for the paper No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers by Damiano Marsili and Georgia Gkioxari.

arXiv: https://arxiv.org/abs/2512.08889 · Project Page


🚀 Quickstart

Clone the repo:

git clone --recurse-submodules https://github.com/damianomarsili/VALOR.git

We use uv to manage all dependencies. If your system does not have uv, install it via:

curl -LsSf https://astral.sh/uv/install.sh | sh

Setup your environment:

cd VALOR
uv sync

⚠️ Note: This setup assumes CUDA 12.8 and Python 3.10. If you are using a different CUDA version, you may need to install a version of torch and flash-attn compatible with your system.
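For example, a minimal sketch of swapping in different wheels (the cu121 index URL below is illustrative; pick the wheel index that matches your CUDA toolkit):

# Illustrative only: reinstall torch from the wheel index matching your CUDA toolkit,
# then rebuild flash-attn against it.
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
uv pip install flash-attn --no-build-isolation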

VALOR uses MoGe and GroundingDINO (forked for install compatibility).

⚠️ Note: Prior to installing GroundingDINO, please follow the additional installation steps for setting the CUDA_HOME environment variable detailed on their repo.
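For example, assuming a default toolkit location (adjust the path for your system):

# Assumption: CUDA 12.8 installed at the default location.
export CUDA_HOME=/usr/local/cuda-12.8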

Then, install them as follows:

uv run python -m pip install --no-build-isolation -e modules/GroundingDINO && \
uv run python -m pip install git+https://github.com/microsoft/MoGe.git

Additionally, VALOR uses GPT-5-mini for VQA. Please provide your OpenAI API key via the following environment variable:

export OPENAI_API_KEY="API KEY"
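As an optional sanity check (standard library only), you can confirm the key is visible inside the uv environment:

# Fails loudly if the key was not exported.
uv run python -c "import os; assert os.environ.get('OPENAI_API_KEY'), 'OPENAI_API_KEY is not set'"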

All model checkpoints are hosted on Huggingface 🤗. We provide a script to download the trained GroundingDINO checkpoint. First, authenticate with huggingface-cli:

huggingface-cli login                
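If you prefer a non-interactive login (e.g., on a remote machine), huggingface-cli also accepts a token directly; HF_TOKEN below is a placeholder for a token created at huggingface.co/settings/tokens:

# Non-interactive alternative; HF_TOKEN is a placeholder.
huggingface-cli login --token "$HF_TOKEN"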

Then, download the checkpoint:

bash scripts/download_gd_checkpoint.sh

📓 For a brief exploration of VALOR's functionality, we provide a notebook at demo/quickstart.ipynb. To open it in the uv environment, run:

uv run jupyter lab

and navigate to demo/quickstart.ipynb in the UI.

For full evaluations, please refer to the "Evaluating VALOR" section below.

📊 Evaluating VALOR

We use Huggingface 🤗 to load all datasets, except GQA and TallyQA, for which we use the subsets provided in GRIT.

To evaluate VALOR, run the following command:

uv run python -m eval.eval --datasets omni3d-bench,tallyqa,realworldqa --out_dir eval_outputs/
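The --datasets flag takes a comma-separated list, so you can also evaluate a single benchmark at a time, for example:

# Evaluate only RealWorldQA.
uv run python -m eval.eval --datasets realworldqa --out_dir eval_outputs/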

🧠 Reasoning Training

VALOR uses LLM verifiers to improve the reasoning ability of an LLM. Training invokes Gemini and requires a Google Cloud account. We recommend authenticating via a service account (a sketch of this flow is shown below). Once authenticated, update the GENAI_CREDS_PATH and GENAI_PROJECT_ID variables in the reasoning_training/llm_training.sh script.
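A minimal sketch of that flow, assuming you have already created a service account and downloaded its JSON key (the paths and project ID below are placeholders):

# Authenticate with the service-account key (standard Google Cloud tooling).
gcloud auth activate-service-account --key-file=/path/to/service-account.json
# Then, in the reasoning training script, set:
#   GENAI_CREDS_PATH=/path/to/service-account.json
#   GENAI_PROJECT_ID=your-gcp-project-id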

Then, you can launch reasoning training via the following command:

uv run bash valor/reasoning_training/llm_training.sh

The data used to train VALOR is found in reasoning_training/data/reasoning_data.jsonl.
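To take a quick look at the size and format of this file, standard tools suffice (json.tool is part of the Python standard library):

# Number of training examples (one JSON object per line).
wc -l reasoning_training/data/reasoning_data.jsonl
# Pretty-print the first example.
head -n 1 reasoning_training/data/reasoning_data.jsonl | uv run python -m json.tool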

After training, you can convert the verl checkpoint into a Huggingface-compatible checkpoint by running verl's merge script:

uv run python -m verl.model_merger merge --backend fsdp --local_dir /path/to/verl/checkpoint --target_dir /path/to/output/checkpoint
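As a rough sanity check, and assuming the merged model is a standard Huggingface causal LM (an assumption on our part), you can try loading it with transformers:

# Assumption: the merged checkpoint is a causal LM in Huggingface format.
uv run python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('/path/to/output/checkpoint')"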

📌 Grounding Training

VALOR uses VLM verifiers to improve the visual grounding ability of a GroundingDINO model via automated hard-negative mining. Sourcing training data requires an OpenAI API key. Please provide your OpenAI API key via the following environment variable:

export OPENAI_API_KEY="API KEY"

To generate training data, first download the pre-trained GroundingDINO model:

mkdir -p modules/GroundingDINO/weights/
wget -O modules/GroundingDINO/weights/groundingdino_swint_ogc.pth https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
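A quick check that the download completed:

ls -lh modules/GroundingDINO/weights/groundingdino_swint_ogc.pth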

Then, you can source training data via the following command:

uv run bash valor/grounding_training/source_training_data.sh

We build on the third-party Open-GroundingDino repository for training GroundingDINO 🦖, and we thank its contributors for their efforts!

To launch training, you must first copy the list in grounding_training/data/odvg/labels.txt into the label_list entry at the bottom of the training config grounding_training/Open-GroundingDino/config/cfg_odvg.py (a convenience sketch for this step appears after the command below). Then, run the following command:

uv run bash valor/grounding_training/train_gd.sh
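For the label_list step above, one convenient (purely illustrative) option is to print labels.txt as a Python list and paste the output into cfg_odvg.py:

# Illustrative helper: format labels.txt as a Python list for cfg_odvg.py.
uv run python -c "print([l.strip() for l in open('grounding_training/data/odvg/labels.txt') if l.strip()])"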

The trained checkpoint is saved by default to valor/grounding_training/checkpoints/. You can change this target directory in the bash script valor/grounding_training/train_gd.sh.

📚 Citation

If you use VALOR in your research, please consider citing our work:

@misc{marsili2025labelsproblemtrainingvisual,
      title={No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers}, 
      author={Damiano Marsili and Georgia Gkioxari},
      year={2025},
      eprint={2512.08889},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.08889}, 
}
