Dcvlr #1750 (Merged)
129 changes: 94 additions & 35 deletions configs/projects/dcvlr/README.md
@@ -1,4 +1,4 @@
# DCVLR - Getting Under the Hood

[![NeurIPS 2025](https://img.shields.io/badge/NeurIPS-2025-blue.svg)](https://neurips.cc/Conferences/2025)
[![Competition](https://img.shields.io/badge/Competition-Open-green.svg)](https://dcvlr.org)
@@ -16,58 +16,117 @@

---

## What is this directory?

DCVLR is the first open-data, open-models, open-source competition for data curation in vision-language reasoning, hosted at NeurIPS 2025.
This directory is intended to accompany the [2025 DCVLR (Data Curation for Vision-Language Reasoning) NeurIPS competition](https://dcvlr-neurips.github.io/). If you don't know what that is, you should go read the competition website and then come back here!

## DCVLR: Digging Deeper

The DCVLR competition was explicitly designed to have a *low barrier to entry*, allowing a diverse collection of teams to compete. However, we know that many teams may be interested in digging deeper into the data and the tasks in order to optimize the performance of their allowed submissions. If that's you, you've come to the right place. This directory will give you all the building blocks necessary to reproduce the train and eval pipeline used in the DCVLR competition on your own cluster.

Participants can leverage any source datasets to curate high-quality instruction-tuning datasets (1K or 10K examples). Participants are encouraged to explore diverse curation strategies, from synthetic data generation to subset selection. Submissions will be evaluated by fine-tuning an undisclosed, open-source vision-language model on the curated data and measuring performance across a wide variety of benchmarks.
## What You Will Need

In order to reproduce our experimental pipeline with the model architectures we consider for this competition (which range from 7B to 10B parameters), you will need access to a cluster with at least 8 A100 GPUs and 1 TB of disk space. If you don't have access, you can rent a cluster, e.g., on [Lambda](https://lambdalabs.com/service/gpu-cloud). All DCVLR participants are eligible for a credit on Lambda, which they can use to run experiments for the competition.

We plan to add examples of how to experiment with smaller architectures (e.g., 1B parameters) to this directory at a later date, so stay tuned. You can also refer to the [Oumi documentation](https://oumi.ai/docs/en/latest/index.html) for more information on how to run experiments on smaller clusters.

### Data Sourcing

Where can you source data that might be suitable for training for this competition? If you want to draw on existing datasets, here are a few we recommend looking into (a short download sketch follows the list):

- [Llava-O1](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k)
- [Math-Llava](https://huggingface.co/datasets/Zhiqiang007/MathV360K)
- [Geo-170K](https://huggingface.co/datasets/Luckyjhg/Geo170K)
- [Open-R1](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified)
- [AIDC Ovis](https://huggingface.co/datasets/AIDC-AI/Ovis-dataset)
- [Llava 1V](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)
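
If you want to poke around one of these datasets locally before deciding whether to curate from it, a minimal sketch using the Hugging Face CLI is below. The repository id comes from the list above; the local directory name is just an illustration.

```bash
# Sketch: pull one candidate source dataset to local disk for inspection.
# Requires the Hugging Face CLI (ships with the huggingface_hub package).
uv pip install -U "huggingface_hub[cli]"

# The repo id is one of the datasets listed above; swap in any other candidate.
huggingface-cli download lmms-lab/multimodal-open-r1-8k-verified \
  --repo-type dataset \
  --local-dir ./data/open-r1-8k   # illustrative local path
```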

### Data Curation

We will add documentation on how to use Oumi for synthetic data curation and data transformation here soon. Stay tuned!

For now, you will have to BYOD (bring your own dataset) in an Oumi-supported dataset format. For this competition, we highly recommend the flexible "hf_vision" format, which allows you to load a wide range of VL datasets from the Hugging Face Hub. Here's an example we used for training on a filtered version of the Multimodal Open-R1 dataset:

```yaml
datasets:
  - dataset_name: "hf_vision"
    split: "train"
    shuffle: True
    seed: 42
    trust_remote_code: True
    transform_num_workers: "auto"
    dataset_kwargs:
      hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
      image_column: "image"
      question_column: "problem"
      answer_column: "solution"
      return_tensors: True
```
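
Once you have curated your own dataset locally, you will need to host it somewhere the `hf_vision` loader can reach. One way (a sketch; the repository name and local path below are placeholders, not part of the competition materials) is to push it to the Hugging Face Hub and point `hf_dataset_path` at the resulting repo:

```bash
# Sketch: push a locally curated dataset folder to the Hugging Face Hub.
# <YOUR_USERNAME>/dcvlr-curated-10k and ./curated_data are placeholders.
huggingface-cli login                         # authenticate once per machine
huggingface-cli upload <YOUR_USERNAME>/dcvlr-curated-10k ./curated_data --repo-type=dataset
```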

### Model Training

#### Setup and Environment

DCVLR experiments can be run using the main branch of the Oumi repository. We provide a [Dockerfile](https://github.com/oumi-ai/oumi/blob/main/Dockerfile) for building Oumi, or you can follow the instructions in the [Quickstart](https://oumi.ai/docs/en/latest/get_started/quickstart.html).
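
If you prefer not to use Docker, a from-source install along the lines of the Quickstart is sketched below (assumption: a CUDA-capable machine and a recent `uv`; the `[gpu]` extra matches the Oumi packaging at the time of writing):

```bash
# Sketch: install Oumi from the main branch for GPU training.
git clone https://github.com/oumi-ai/oumi.git
cd oumi
uv pip install -e ".[gpu]"
```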

#### Commands

Model training is extremely straightforward, requiring only a single command:

```bash
export MY_CONFIG=<PATH/TO/qwenvl-openr1.yaml>
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```

We provide configurations for three models: Molmo-D, Molmo-O, and Qwen2.5-VL. Other models, such as InternVL3, may also be used in the competition.
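
For example, to launch the Qwen2.5-VL starter config added in this directory (path as it appears in this PR), you could run:

```bash
# Sketch: point MY_CONFIG at one of the provided starter-kit configs.
export MY_CONFIG=configs/projects/dcvlr/starter_kit/qwenvl-openr1.yaml
torchrun --nproc-per-node 8 --standalone -m oumi train -c $MY_CONFIG
```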

Model checkpoints will be saved at the base of the directory specified by `training.output_dir` in the config file.

We then recommend syncing the trained model to the Hugging Face Hub using the `huggingface-cli` tool to enable version control and ease of future access. The repository need not exist in advance; it will be created automatically when you run this command.

```bash
huggingface-cli upload-large-folder <YOUR_HF_REPO> <YOUR_OUTPUT_DIRECTORY> --repo-type=model
```
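
Note that the upload command assumes you are already authenticated with the Hub; if not, log in first (a one-time step per machine):

```bash
# Sketch: authenticate the CLI with a Hugging Face access token before uploading.
huggingface-cli login
```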

### Model Evaluation

#### Setup and Environment

We use a modified version of [VLMEvalKit](https://github.com/oumi-ai/VLMEvalKit) for our evaluation harness. You can clone and install it following the directions in the repo, or use the provided [Dockerfile](https://github.com/oumi-ai/VLMEvalKit/blob/main/docker/Dockerfile.cuda12.9-oumi-molmo-qwen).
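
A from-source setup might look like the sketch below (assumption: the fork installs like upstream VLMEvalKit, i.e. an editable pip install; defer to the repo's README if these steps drift):

```bash
# Sketch: clone and install the oumi-ai fork of VLMEvalKit from source.
git clone https://github.com/oumi-ai/VLMEvalKit.git
cd VLMEvalKit
uv pip install -e .
```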

#### Commands

Model evaluation can also be conducted with a single command. We give an example with four datasets; these datasets are not guaranteed to be the ones used in the competition, but they are a good starting point for the types of tasks we are considering.

```bash
export MODEL_NAME=<YOUR/HF/MODEL/PATH>
export WORK_DIR=<YOUR/OUTPUT/DIRECTORY>
mkdir -p "$WORK_DIR"
export DATASETS="VMCBench_DEV WeMath MathVista_MINI LiveXivVQA"
python scripts/wandb_logger.py --run-and-log \
--data $DATASETS \
--work-dir $WORK_DIR \
--use-vllm \
--save-detailed-eval \
--save-judge-responses \
--max-output-tokens 4096 \
--pass-custom-model $MODEL_NAME
```
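
The logger script reports results to Weights & Biases; if you have not used W&B on the machine before, you will likely need to authenticate first (an assumption based on the script name and the `enable_wandb` flag in the training configs):

```bash
# Sketch: authenticate with Weights & Biases before running the logger script.
wandb login
```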

## How to Cite DCVLR

If you wish to refer to DCVLR in your work, please cite the following:

```bib
@misc{dcvlr2025,
  author = {Feuer, Benjamin and Tripathi, Rohun and Elachqar, Oussama and Zhang, Yuhui and Hulkund, Neha and Nguyen, Thao and Shabtay, Nimrod and Udandarao, Vishaal and Wang, Xiaohan and Webb, Stefan and Koukoumidis, Emmanouil and Schmidt, Ludwig and Xie, Saining and Yeung-Levy, Serena and Liang, Paul and Beery, Sara and Gkioxari, Georgia},
  month = jun,
  title = {{DCVLR}: Data Curation for Vision-Language Reasoning},
  year = {2025}
}
```
1 change: 0 additions & 1 deletion configs/projects/dcvlr/starter_kit/README.md

This file was deleted.

65 changes: 65 additions & 0 deletions configs/projects/dcvlr/starter_kit/molmo-d-train-openr1.yaml
@@ -0,0 +1,65 @@
# Full fine-tune config for Molmo-7B-D.
#
# Note: the original model is not compatible with the latest version of transformers and oumi
# We use the oumi-ai version of the model instead until the original model is updated.
#
# Requirements:
# - uv pip install einops tf-keras
#
# Usage:
# oumi train -c configs/recipes/vision/molmo/sft/molmo_d_full/train.yaml
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
# - Config class: oumi.core.configs.TrainingConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  # model_name: "allenai/Molmo-7B-O-0924"
  model_name: "oumi-ai/Molmo-7B-D-0924"
  torch_dtype_str: "float32"
  model_max_length: 8192
  trust_remote_code: True
  model_kwargs:
    max_position_embeddings: 8192

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
      use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True

training:
  output_dir: "output/molmo_d_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
  per_device_train_batch_size: 1
  optimizer: "adamw_torch_fused"
  logging_steps: 100
  save_steps: 0
  include_performance_metrics: True
  log_model_summary: True
  dataloader_main_process_only: False

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Full fine-tune config for Molmo-7B-O.
#
# Note: the original model is not compatible with the latest version of transformers and oumi
# We use the oumi-ai version of the model instead until the original model is updated.
@@ -8,7 +8,6 @@
#
# Usage:
# oumi train -c configs/recipes/vision/molmo/sft/molmo_o_full/train.yaml
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
@@ -17,11 +16,12 @@
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  # model_name: "allenai/Molmo-7B-O-0924"
  model_name: "oumi-ai/Molmo-7B-O-0924"
  torch_dtype_str: "float32"
  model_max_length: 8192
  trust_remote_code: True
  model_kwargs:
    max_position_embeddings: 8192

data:
  train:
@@ -30,24 +30,26 @@ data:
      process_individually: True
      use_torchdata: True
    datasets:
- dataset_name: "merve/vqav2-small"
split: "validation"
- dataset_name: "hf_vision"
split: "train"
shuffle: True
seed: 42
trust_remote_code: True
transform_num_workers: "auto"
dataset_kwargs:
# processor_name: "allenai/Molmo-7B-O-0924"
processor_name: "oumi-ai/Molmo-7B-O-0924"
return_conversation: True
hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
image_column: "image"
question_column: "problem"
answer_column: "solution"
return_tensors: True

training:
  output_dir: "output/molmo_o_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: False # Note: Molmo does not support gradient checkpointing
  per_device_train_batch_size: 1
  optimizer: "adamw_torch_fused"
  logging_steps: 100
  save_steps: 0
  include_performance_metrics: True
  log_model_summary: True
82 changes: 82 additions & 0 deletions configs/projects/dcvlr/starter_kit/qwenvl-openr1.yaml
@@ -0,0 +1,82 @@
# Qwen 2.5 VL 7B full fine-tune training config.
#
# Requirements:
# - Log into WandB (`wandb login`) or disable `enable_wandb`
# - (optional) If you want to use flash attention, run `pip install -U flash-attn --no-build-isolation`
#
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/train/train.html
# - Config class: oumi.core.configs.TrainingConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/training_config.py
# - Other training configs: configs/**/pretraining/, configs/**/sft/, configs/**/dpo/

model:
  model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 10000
  trust_remote_code: True
  attn_implementation: "sdpa" # You can also use `flash_attention_2` if you install it
  chat_template: "qwen2-vl-instruct" # 2.5 uses the same template as 2.0

data:
  train:
    collator_name: "vision_language_sft"
    collator_kwargs:
      process_individually: True
      use_torchdata: True
    datasets:
      - dataset_name: "hf_vision"
        split: "train"
        shuffle: True
        seed: 42
        trust_remote_code: True
        transform_num_workers: "auto"
        dataset_kwargs:
          hf_dataset_path: "penfever/multimodal-open-r1-8192-filtered-tighter"
          image_column: "image"
          question_column: "problem"
          answer_column: "solution"
          return_tensors: True
          processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"

training:
  output_dir: "output/qwen2_5_vl_7b_openr1"
  trainer_type: "TRL_SFT"
  enable_gradient_checkpointing: True
  per_device_train_batch_size: 1 # Must be 1: the model generates variable-sized image features
  gradient_accumulation_steps: 1
  # max_steps: 20 # Uncomment if you want to limit the number of training steps.
  num_train_epochs: 1
  # If this is not passed, checkpoints may be saved which are suitable for resuming training but not for loading from HF
  save_final_model: True

  gradient_checkpointing_kwargs:
    # Reentrant docs: https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
    use_reentrant: False
  ddp_find_unused_parameters: False
  empty_device_cache_steps: 1
  compile: False

  optimizer: "adamw_torch_fused"
  learning_rate: 2e-5
  warmup_ratio: 0.03
  weight_decay: 0.01
  lr_scheduler_type: "cosine"

  logging_steps: 5
  save_steps: 0
  dataloader_main_process_only: False
  dataloader_num_workers: 2
  dataloader_prefetch_factor: 8
  include_performance_metrics: True
  log_model_summary: False
  enable_wandb: True

fsdp:
  enable_fsdp: True
  sharding_strategy: "HYBRID_SHARD"
  mixed_precision: "bf16"
  forward_prefetch: True
  auto_wrap_policy: "SIZE_BASED_WRAP" # TODO: use transformer wrapper instead
  min_num_params: 100000