This is the code for the paper No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers by Damiano Marsili and Georgia Gkioxari.
Clone the repo:
```bash
git clone --recurse-submodules https://github.com/damianomarsili/VALOR.git
```

We use uv to manage all dependencies. If your system does not have uv, install it via:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Setup your environment:
```bash
cd VALOR
uv sync
```

Make sure the installed versions of torch and flash-attn are compatible with your system.
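For example, you could pin a CUDA-specific torch build and then build flash-attn against it (a sketch only; the cu121 index URL below is an assumption and should match your CUDA version):

```bash
# Example only: pick the torch wheel index matching your CUDA version (cu121 assumed here).
uv run python -m pip install torch --index-url https://download.pytorch.org/whl/cu121
# flash-attn generally needs --no-build-isolation so its build can see the installed torch.
uv run python -m pip install flash-attn --no-build-isolation
```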
VALOR uses MoGe and GroundingDINO (forked for install compatibility). Building GroundingDINO requires the CUDA_HOME environment variable to be set, as detailed in their repo.
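For example (the path below is an assumption; point it at your local CUDA toolkit installation):

```bash
# Adjust the path to your CUDA toolkit installation.
export CUDA_HOME=/usr/local/cuda
```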
Then, install them as follows:
```bash
uv run python -m pip install --no-build-isolation -e modules/GroundingDINO && \
uv run python -m pip install git+https://github.com/microsoft/MoGe.git
```
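To sanity-check the installation, you can try importing both packages (the module names groundingdino and moge are assumptions based on the upstream packages):

```bash
# Quick import check for both third-party modules.
uv run python -c "import groundingdino, moge; print('GroundingDINO and MoGe import correctly')"
```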
Additionally, VALOR uses GPT-5-mini for VQA. Please set your OpenAI API key to the following environment variable:
```bash
export OPENAI_API_KEY="API KEY"
```

All model checkpoints are hosted on Huggingface 🤗. We provide a script to download the trained GroundingDINO checkpoint. First, authenticate with huggingface-cli:
```bash
huggingface-cli login
```

Then, download the checkpoint:
```bash
bash scripts/download_gd_checkpoint.sh
```

📓 For a brief exploration of VALOR's functionality, we provide a notebook at demo/quickstart.ipynb. To run the notebook in the uv environment, run:
```bash
uv run jupyter lab
```

and navigate to demo/quickstart.ipynb in the UI.
For full evaluations, please refer to the "Evaluating VALOR" section below.
## Evaluating VALOR

We use Huggingface 🤗 to load all datasets, except GQA and TallyQA, where we use the subsets provided in GRIT.
To evaluate VALOR, please run the following code:

```bash
uv run python -m eval.eval --datasets omni3d-bench,tallyqa,realworldqa --out_dir eval_outputs/
```

## Reasoning Training

VALOR uses LLM verifiers to improve the reasoning ability of an LLM. Training invokes Gemini and requires a Google Cloud account. We recommend authenticating via a service account. Once authenticated, update the GENAI_CREDS_PATH and GENAI_PROJECT_ID variables in the valor/reasoning_training/llm_training.sh script.
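For example, with the gcloud CLI (the key-file path below is a placeholder):

```bash
# Placeholder path: replace with your own service-account key file.
gcloud auth activate-service-account --key-file=/path/to/service-account.json
# GENAI_CREDS_PATH in llm_training.sh presumably points at this same key file (assumption),
# and GENAI_PROJECT_ID at your Google Cloud project ID.
```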
Then, you can launch reasoning training via the following command:
```bash
uv run bash valor/reasoning_training/llm_training.sh
```

The data used to train VALOR is found in reasoning_training/data/reasoning_data.jsonl.
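To inspect the format of this data, you can pretty-print the first record (the file is JSON Lines, so each line is a standalone JSON object):

```bash
# Pretty-print the first training example.
head -n 1 reasoning_training/data/reasoning_data.jsonl | uv run python -m json.tool
```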
After training, you can create a checkpoint compatible with Huggingface by running the merge script in verl:

```bash
uv run python -m verl.model_merger merge --backend fsdp --local_dir /path/to/verl/checkpoint --target_dir /path/to/output/checkpoint
```
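The merged checkpoint can then be loaded with the standard transformers API (a sketch, assuming the merged model is a causal LM and using the --target_dir path from above):

```bash
# Assumption: the merged model is a standard Hugging Face causal LM.
uv run python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('/path/to/output/checkpoint'); print('checkpoint loads cleanly')"
```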
## Grounding Training

VALOR uses VLM verifiers to improve the visual grounding ability of a GroundingDINO model via automated hard-negative mining. Sourcing training data requires an OpenAI API key. Please set your OpenAI API key to the following environment variable:

```bash
export OPENAI_API_KEY="API KEY"
```

To generate training data, first download the pre-trained GroundingDINO model:
```bash
mkdir -p modules/GroundingDINO/weights/
wget -O modules/GroundingDINO/weights/groundingdino_swint_ogc.pth https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```

Then, you can source training data via the following command:

```bash
uv run bash valor/grounding_training/source_training_data.sh
```

We build from the third-party Open-GroundingDino repository for training GroundingDINO 🦖. We thank the contributors of the repository for their efforts!
To launch training, you must first copy the list in grounding_training/data/odvg/labels.txt to the label_list entry at the bottom of the training config at grounding_training/Open-GroundingDino/config/cfg_odvg.py (see the sketch after the training command below for one way to do this). Then, run the following command:
```bash
uv run bash valor/grounding_training/train_gd.sh
```

The trained checkpoint is saved by default to valor/grounding_training/checkpoints/. You can edit this target directory in the bash script valor/grounding_training/train_gd.sh.
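As a reference for the label_list step above, here is one way to print labels.txt as a Python list that can be pasted into cfg_odvg.py (a convenience sketch; the exact formatting expected by the config may differ):

```bash
# Print the labels as a Python list for the label_list entry of cfg_odvg.py.
uv run python -c "print('label_list =', [l.strip() for l in open('grounding_training/data/odvg/labels.txt') if l.strip()])"
```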
If you use VALOR in your research, please consider citing our work:
```bibtex
@misc{marsili2025labelsproblemtrainingvisual,
      title={No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers},
      author={Damiano Marsili and Georgia Gkioxari},
      year={2025},
      eprint={2512.08889},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.08889},
}
```