Dcvlr #1750 (Merged)
129 changes: 94 additions & 35 deletions configs/projects/dcvlr/README.md
@@ -1,4 +1,4 @@
# DCVLR - Getting Under the Hood

[![NeurIPS 2025](https://img.shields.io/badge/NeurIPS-2025-blue.svg)](https://neurips.cc/Conferences/2025)
[![Competition](https://img.shields.io/badge/Competition-Open-green.svg)](https://dcvlr.org)
@@ -16,58 +16,117 @@

---

## What is this directory?

DCVLR is the first open-data, open-models, open-source competition for data curation in vision-language reasoning, hosted at NeurIPS 2025.
This directory is intended to accompany the [2025 DCVLR (Data Curation for Vision-Language Reasoning) NeurIPS competition](https://dcvlr-neurips.github.io/). If you don't know what that is, you should go read the competition website and then come back here!

## DCVLR: Digging Deeper

The DCVLR competition was explicitly designed to have a *low barrier to entry*, allowing a diverse collection of teams to compete. However, we know that many teams may be interested in digging deeper into the data and the tasks in order to optimize the performance of their allowed submissions. If that's you, you've come to the right place. This directory will give you all the building blocks necessary to reproduce the train and eval pipeline used in the DCVLR competition on your own cluster.

Participants can leverage any source datasets to curate high-quality instruction-tuning datasets (1K or 10K examples). Participants are encouraged to explore diverse curation strategies, from synthetic data generation to subset selection. Submissions will be evaluated by fine-tuning an undisclosed, open-source vision-language model on the curated data and measuring performance across a wide variety of benchmarks.
## What You Will Need

In order to reproduce our experimental pipeline with the model architectures we consider for this competition (which range from 7B to 10B parameters), you will need access to a cluster with at least 8 A100 GPUs and 1 TB of disk space. If you don't have access, you can rent a cluster, e.g., on [Lambda](https://lambdalabs.com/service/gpu-cloud). All DCVLR participants are eligible for a credit on Lambda, which they can use to run experiments for the competition.

We plan to add examples of how to experiment with smaller architectures (e.g., 1B parameters) to this directory at a later date, so stay tuned. You can also refer to the [Oumi documentation](https://oumi.ai/docs/en/latest/index.html) for more information on how to run experiments on smaller clusters.

### Data Sourcing

Where can you source data that might be suitable for training for this competition? If you want to draw on existing datasets, here are a few we recommend looking into (a short download sketch follows the list):

- [Llava-O1](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k)
- [Math-Llava](https://huggingface.co/datasets/Zhiqiang007/MathV360K)
- [Geo-170K](https://huggingface.co/datasets/Luckyjhg/Geo170K)
- [Open-R1](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified)
- [AIDC Ovis](https://huggingface.co/datasets/AIDC-AI/Ovis-dataset)
- [Llava 1V](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)
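
If you want to poke around one of these datasets locally before deciding whether to curate from it, a minimal sketch using the Hugging Face CLI is below. The repository id comes from the list above; the local directory name is just an illustration.

```bash
# Sketch: pull one candidate source dataset to local disk for inspection.
# Requires the Hugging Face CLI (ships with the huggingface_hub package).
uv pip install -U "huggingface_hub[cli]"

# The repo id is one of the datasets listed above; swap in any other candidate.
huggingface-cli download lmms-lab/multimodal-open-r1-8k-verified \
  --repo-type dataset \
  --local-dir ./data/open-r1-8k   # illustrative local path
```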

### Data Curation

We will add documentation on how to use Oumi for synthetic data curation and data transformation here soon. Stay tuned!

For now, you will have to BYOD (bring your own dataset) in an Oumi-supported dataset format. For this competition, we highly recommend the flexible "hf_vision" format, which allows you to load a wide range of VL datasets from the Hugging Face Hub. Here's an example we used for training on a filtered version of the Multimodal Open-R1 dataset:

```yaml
datasets:
  - dataset_name: "hf_vision"
    split: "train"
    shuffle: True
    seed: 42
    trust_remote_code: True
    transform_num_workers: "auto"
    dataset_kwargs:
      hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
      image_column: "image"
      question_column: "problem"
      answer_column: "solution"
      return_tensors: True
```
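
Once you have curated your own dataset locally, you will need to host it somewhere the `hf_vision` loader can reach. One way (a sketch; the repository name and local path below are placeholders, not part of the competition materials) is to push it to the Hugging Face Hub and point `hf_dataset_path` at the resulting repo:

```bash
# Sketch: push a locally curated dataset folder to the Hugging Face Hub.
# <YOUR_USERNAME>/dcvlr-curated-10k and ./curated_data are placeholders.
huggingface-cli login                         # authenticate once per machine
huggingface-cli upload <YOUR_USERNAME>/dcvlr-curated-10k ./curated_data --repo-type=dataset
```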

### Model Training

#### Setup and Environment

DCVLR experiments can be run using the main branch of the Oumi repository. We provide a [Dockerfile](https://github.com/oumi-ai/oumi/blob/main/Dockerfile) for building Oumi, or you can follow the instructions in the [Quickstart](https://oumi.ai/docs/en/latest/get_started/quickstart.html).
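
If you prefer not to use Docker, a from-source install along the lines of the Quickstart is sketched below (assumption: a CUDA-capable machine and a recent `uv`; the `[gpu]` extra matches the Oumi packaging at the time of writing):

```bash
# Sketch: install Oumi from the main branch for GPU training.
git clone https://github.com/oumi-ai/oumi.git
cd oumi
uv pip install -e ".[gpu]"
```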

#### Commands

Model training is extremely straightforward, requiring only a single command:

```bash
export MY_CONFIG=<PATH/TO/qwenvl-openr1.yaml>
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```

We provide configurations for three models: Molmo-D, Molmo-O, and Qwen2.5-VL. Other models, such as InternVL3, may also be used in the competition.
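
For example, to launch the Qwen2.5-VL starter config added in this directory (path as it appears in this PR), you could run:

```bash
# Sketch: point MY_CONFIG at one of the provided starter-kit configs.
export MY_CONFIG=configs/projects/dcvlr/starter_kit/qwenvl-openr1.yaml
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```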

Model checkpoints will be saved at the base of the directory specified by `training.output_dir` in the config file.

We then recommend syncing the trained model to the Hugging Face Hub using the `huggingface-cli` tool to enable version control and ease of future access. The repository need not exist in advance; it will be created automatically when you run this command.

```bash
huggingface-cli upload-large-folder <YOUR_HF_REPO> <YOUR_OUTPUT_DIRECTORY> --repo-type=model
```
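
Note that the upload command assumes you are already authenticated with the Hub; if not, log in first (a one-time step per machine):

```bash
# Sketch: authenticate the CLI with a Hugging Face access token before uploading.
huggingface-cli login
```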

### Model Evaluation

#### Setup and Environment

We use a modified version of [VLMEvalKit](https://github.com/oumi-ai/VLMEvalKit) for our evaluation harness. You can clone and install it following the directions in the repo, or use the provided [Dockerfile](https://github.com/oumi-ai/VLMEvalKit/blob/main/docker/Dockerfile.cuda12.9-oumi-molmo-qwen).
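
A from-source setup might look like the sketch below (assumption: the fork installs like upstream VLMEvalKit, i.e. an editable pip install; defer to the repo's README if these steps drift):

```bash
# Sketch: clone and install the oumi-ai fork of VLMEvalKit from source.
git clone https://github.com/oumi-ai/VLMEvalKit.git
cd VLMEvalKit
uv pip install -e .
```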

#### Commands

Model evaluation can also be conducted with a single command. We give an example with four datasets; these datasets are not guaranteed to be the ones used in the competition, but they are a good starting point for the types of tasks we are considering.

```bash
export MODEL_NAME=<YOUR/HF/MODEL/PATH>
export WORK_DIR=<YOUR/OUTPUT/DIRECTORY>
mkdir -p "$WORK_DIR"
export DATASETS="VMCBench_DEV WeMath MathVista_MINI LiveXivVQA"
python scripts/wandb_logger.py --run-and-log \
--data $DATASETS \
--work-dir $WORK_DIR \
--use-vllm \
--save-detailed-eval \
--save-judge-responses \
--max-output-tokens 4096 \
--pass-custom-model $MODEL_NAME
```
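
The logger script reports results to Weights & Biases; if you have not used W&B on the machine before, you will likely need to authenticate first (an assumption based on the script name and the `enable_wandb` flag in the training configs):

```bash
# Sketch: authenticate with Weights & Biases before running the logger script.
wandb login
```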

## How to Cite DCVLR

If you wish to refer to DCVLR in your work, please cite the following:

```bib
@misc{dcvlr2025,
  author = {Feuer, Benjamin and Tripathi, Rohun and Elachqar, Oussama and Zhang, Yuhui and Hulkund, Neha and Nguyen, Thao and Shabtay, Nimrod and Udandarao, Vishaal and Wang, Xiaohan and Webb, Stefan and Koukoumidis, Emmanouil and Schmidt, Ludwig and Xie, Saining and Yeung-Levy, Serena and Liang, Paul and Beery, Sara and Gkioxari, Georgia},
  month = jun,
  title = {{DCVLR}: Data Curation for Vision-Language Reasoning},
  year = {2025}
}
```
1 change: 0 additions & 1 deletion configs/projects/dcvlr/starter_kit/README.md

This file was deleted.

65 changes: 65 additions & 0 deletions configs/projects/dcvlr/starter_kit/molmo-d-train-openr1.yaml
@@ -0,0 +1,65 @@
# Full fine-tune config for Molmo-7B-D.
#
# Note: the original model is not compatible with the latest version of transformers and oumi
# We use the oumi-ai version of the model instead until the original model is updated.
#
# Requirements:
# - uv pip install einops tf-keras
#
# Usage:
# oumi train -c configs/recipes/vision/molmo/sft/molmo_d_full/train.yaml
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
# - Config class: oumi.core.configs.TrainingConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  # model_name: "allenai/Molmo-7B-O-0924"
  model_name: "oumi-ai/Molmo-7B-D-0924"
  torch_dtype_str: "float32"
  model_max_length: 8192
  trust_remote_code: True
  model_kwargs:
    max_position_embeddings: 8192

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
      use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True

training:
  output_dir: "output/molmo_d_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
  per_device_train_batch_size: 1
  optimizer: "adamw_torch_fused"
  logging_steps: 100
  save_steps: 0
  include_performance_metrics: True
  log_model_summary: True
  dataloader_main_process_only: False

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Full fine-tune config for Molmo-7B-O.
#
# Note: the original model is not compatible with the latest version of transformers and oumi
# We use the oumi-ai version of the model instead until the original model is updated.
@@ -8,7 +8,6 @@
#
# Usage:
# oumi train -c configs/recipes/vision/molmo/sft/molmo_o_full/train.yaml
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
@@ -17,11 +16,12 @@
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  # model_name: "allenai/Molmo-7B-O-0924"
  model_name: "oumi-ai/Molmo-7B-O-0924"
  torch_dtype_str: "float32"
  model_max_length: 8192
  trust_remote_code: True
  model_kwargs:
    max_position_embeddings: 8192

data:
  train:
@@ -30,24 +30,26 @@ data:
      process_individually: True
      use_torchdata: True
    datasets:
- dataset_name: "merve/vqav2-small"
split: "validation"
- dataset_name: "hf_vision"
split: "train"
shuffle: True
seed: 42
trust_remote_code: True
transform_num_workers: "auto"
dataset_kwargs:
# processor_name: "allenai/Molmo-7B-O-0924"
processor_name: "oumi-ai/Molmo-7B-O-0924"
return_conversation: True
hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
image_column: "image"
question_column: "problem"
answer_column: "solution"
return_tensors: True

training:
  output_dir: "output/molmo_o_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
  per_device_train_batch_size: 1
  optimizer: "adamw_torch_fused"
  logging_steps: 100
  save_steps: 0
  include_performance_metrics: True
  log_model_summary: True
82 changes: 82 additions & 0 deletions configs/projects/dcvlr/starter_kit/qwenvl-openr1.yaml
@@ -0,0 +1,82 @@
# Qwen 2.5 VL 7B full fine-tune training config.
#
# Requirements:
# - Log into WandB (`wandb login`) or disable `enable_wandb`
# - (optional) If you want to use flash attention, run `pip install -U flash-attn --no-build-isolation`
#
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
# - Config class: oumi.core.configs.TrainingConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 10000
  trust_remote_code: True
  attn_implementation: "sdpa" # You can also use `flash_attention_2` if you install it
  chat_template: "qwen2-vl-instruct" # 2.5 uses the same template as 2.0

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
      use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True
          processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"

training:
  output_dir: "output/qwen2_5_vl_7b_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: True
  per_device_train_batch_size: 1 # Must be 1: the model generates variable-sized image features
  gradient_accumulation_steps: 1
  # max_steps: 20 # Uncomment if you want to limit the number of training steps.
  num_train_epochs: 1
  # If this is not passed, checkpoints may be saved which are suitable for resuming training but not for loading from HF
  save_final_model: True

  gradient_checkpointing_kwargs:
    # Reentrant docs: https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
    use_reentrant: False
  ddp_find_unused_parameters: False
  empty_device_cache_steps: 1
  compile: False

  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  warmup_ratio: 0.03
  weight_decay: 0.01
  lr_scheduler_type: "cosine"

  logging_steps: 5
  save_steps: 0
  dataloader_main_process_only: False
  dataloader_num_workers: 2
  dataloader_prefetch_factor: 8
  include_performance_metrics: True
  log_model_summary: False
  enable_wandb: True

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000