SSL4RL Logo

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning


Xiaojun Guo*, Runyu Zhou*, Yifei Wang*, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
*Equal Contribution. Correspondence.

📊 Overview

We propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks, with encouraging potential on open-ended image-captioning tasks. Through systematic ablations, we identify key factors that influence the effectiveness of SSL4RL tasks (data volume, model scale, model choice, task difficulty, and semantic alignment with the target domain), offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models with verifiable, self-supervised objectives.

SSL4RL Overview
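
To make this concrete, below is a minimal sketch (our illustration, not the repository's actual reward code) of how one SSL task, rotation prediction, becomes a verifiable reward: because we apply the transformation ourselves, the ground-truth label is known for free and the reward reduces to an exact-match check.

# Illustrative sketch only: an SSL task such as rotation prediction carries its own
# ground-truth label, so the reward is an automatic, verifiable exact-match check.
import random
from PIL import Image

ANGLES = [0, 90, 180, 270]

def make_rotation_example(image: Image.Image):
    """Rotate the image by a random multiple of 90 degrees; the angle is the label."""
    angle = random.choice(ANGLES)
    return image.rotate(angle, expand=True), angle

def rotation_reward(predicted_angle: int, true_angle: int) -> float:
    """Verifiable reward: 1.0 if the model recovers the applied rotation, else 0.0."""
    return 1.0 if predicted_angle == true_angle else 0.0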

📌 Key Takeaways

1️⃣ SSL as Intrinsic Reward Sharpens VLM Reasoning. The SSL4RL paradigm demonstrably enhances vision-language reasoning by repurposing SSL tasks as intrinsic rewards. It deepens the model's perception and understanding of the image itself, leading to more precise visual attention and reduced language bias.

2️⃣ Task Choice is Critical. SSL tasks are effective when their inherent semantics align with core reasoning skills, while an ill-suited task can induce negative transfer and hinder downstream performance.

3️⃣ Goldilocks Principle of Task Difficulty. The effectiveness of an SSL task is contingent on its difficulty being appropriately matched to the model's capacity. Insufficient challenge provides a weak learning signal, while excessive difficulty leads to negative transfer.

4️⃣ Non-additivity of Rewards. A naive combination of multiple SSL rewards does not yield cumulative improvements, indicating potential optimization conflicts and underscoring the need for sophisticated integration strategies rather than simple averaging.

🔥 Open-source Collections

Our models are released in the Hugging Face collection PKU-ML/SSL4RL.

🚀 Environment Setups

Our implementation is based on verl 0.3.0, developed by the ByteDance Seed team.

  1. Requirements:

    • Python: Version >= 3.9
    • CUDA: Version >= 12.1
  2. To install the dependencies, we recommend using a fresh conda environment:

    conda create -n verl python==3.10
    conda activate verl
  3. Execute the install script.

    git clone https://github.com/PKU-ML/SSL4RL.git
    cd SSL4RL
    bash scripts/install_vllm_sglang_mcore.sh

    We list the versions of the key required packages here:

      - accelerate==1.8.1
      - datasets==4.0.0
      - flash-attn==2.7.4.post1
      - pyarrow==20.0.0
      - qwen-vl-utils==0.0.11
      - tokenizers==0.21.1
      - torch==2.6.0
      - torchvision==0.21.0
      - transformers==4.52.4
      - verl==0.3.0.post1
      - vllm==0.8.5.post1
      - xformers==0.0.29.post2
  4. Install our package together with the lightweight dependencies listed in setup.py:

    pip3 install -e .

If you encounter any issues during installation, please refer to the Installation Guide provided by verl. If problems persist, don't hesitate to report them to us.

🎯 Build SSL4RL Tasks

We provide the code for building SSL4RL tasks, including Position, Rotation, Contrastive, and Jigsaw.

Take MMBench as an example:

  1. Download the MMBench benchmark from HuggingFaceM4/MMBench and save it to the local directory datasets/MMBench.

  2. To build the SSL4RL datasets, execute the scripts provided in build_benchmarks; each script generates an SSL4RL dataset that can be loaded directly with datasets.load_dataset (a hypothetical sketch of what such a script does follows the commands below). By default, the dataset is saved in the datasets directory; you can change the target directory by modifying save_dir in the script.

cd build_benchmarks

## For Position Task
python makeposition_mmbench.py
## For Rotation Task
python makerotation_mmbench.py
## For Jigsaw Task
python makejigsaw_mmbench.py
## For Contrastive Task
python makecontrastive_mmbench.py
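
For orientation, the sketch below illustrates, under our own assumptions about field names rather than the exact schema of makerotation_mmbench.py, how a rotation task can be packaged as a multiple-choice dataset with the datasets library.

# Hypothetical builder for a rotation task; the real logic lives in
# build_benchmarks/makerotation_mmbench.py and its fields may differ.
import random
from datasets import Dataset, load_dataset

ANGLES = [0, 90, 180, 270]
LETTERS = ["A", "B", "C", "D"]

def build_rotation_records(samples):
    """Turn each source image into a four-way multiple-choice rotation question."""
    records = []
    for sample in samples:
        angle = random.choice(ANGLES)
        records.append({
            "image": sample["image"].rotate(angle, expand=True),
            "question": "By how many degrees has this image been rotated?",
            "options": [f"{letter}. {a}" for letter, a in zip(LETTERS, ANGLES)],
            "answer": LETTERS[ANGLES.index(angle)],  # verifiable ground truth
        })
    return records

# Example usage (the split and column names of HuggingFaceM4/MMBench are assumptions):
# source = load_dataset("HuggingFaceM4/MMBench", split="dev")
# Dataset.from_list(build_rotation_records(source)).save_to_disk("datasets/MMBench_RotationQA")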

We provide download links for the benchmarks used in the paper here:

🧩 Running Reinforcement Learning

Follow these steps to reproduce our SSL4RL implementation:

  1. Preprocess the dataset for RL training. Run the preprocessing script to convert the dataset format:

    cd verl
    
    python preprocess.py --data_source datasets/MMBench_PositionQA  --local_dir our_datasets/MMBench_PositionQA
    • data_source: The directory of the SSL4RL dataset built in the previous step.
    • local_dir: Output directory for the processed dataset.
  2. Launch RL Training.

    Execute the training script (Position as an example):

    bash run_mmbench_position.sh

    Configuration notes for the training script:

    • SAVE_DIR: Output directory for the trained model.

    • train_path and test_path: Paths to the processed dataset.

    • Logging: Defaults to TensorBoard. To use Weights & Biases, set trainer.logger = ['console','wandb'].

    • trainer.n_gpus_per_node: Set this to your actual GPU count.

    • Our paper used 8×A800 GPUs. With limited GPU resources, reduce the following parameters (this may affect performance):

      actor_rollout_ref.actor.ppo_mini_batch_size
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu
  3. Convert RL-trained Checkpoints to HuggingFace Format

    Merge the model checkpoints into a HuggingFace-compatible format:

    bash scripts/merge_models.sh
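
As a quick sanity check (our suggestion, not a script from this repository), the merged checkpoint should load like any other Hugging Face model. The sketch assumes a Qwen2.5-VL backbone and the model path used later in the evaluation config; adjust both if your setup differs.

# Sanity-check sketch: load the merged checkpoint with standard Hugging Face APIs.
# Assumes a Qwen2.5-VL backbone; swap the model class if you fine-tuned a different one.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "our_models/ssl4rl_qwen_3b_mmbench_position_step300"  # output of the merge step
processor = AutoProcessor.from_pretrained(ckpt)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto"
)
print(model.config.model_type)  # a correctly merged checkpoint should load without errors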

🌊 Inference and Evaluation

We utilize VLMEvalKit to evaluate SSL4RL models.

  1. Install VLMEvalKit

      git clone https://github.com/open-compass/VLMEvalKit.git
      cd VLMEvalKit
      pip install -e .
  2. Customize Prompt

    To extract the answer reliably, we customize the prompt in vlmeval/dataset/image_mcq.py:

    Answer this multiple-choice question. Think step by step before answering. The last line of your response should be of the following format: <think>step-by-step reasoning</think> <answer>$LETTER</answer>, where LETTER is one of the options.

  3. Write Configs

    Add configuration files to the configs directory.

    {
        "model": {
          "ssl4rl_qwen_3b_mmbench_position_step300": {
            "class": "Qwen2VLChat",
            "model_path": "our_models/ssl4rl_qwen_3b_mmbench_position_step300",
            "min_pixels": 35840,
            "max_pixels": 12845056,
            "use_custom_prompt": false
          }
        },
        "data": {
          "MMBench": {
            "class": "ImageMCQDataset",
            "dataset": "MMBench_DEV_EN"
          }
        }
    }
  4. Run Evaluation

    We provide an example evaluation command below:

    torchrun --nproc-per-node=8 run.py \
      --config configs/mmbench_position.json \
      --work-dir eval_results/SSL4RL \
      --mode infer

🎨 Customization Guide

To adapt SSL4RL for your needs, we recommend modifying these key files from our verl-based implementation:

verl/utils/                      # Supporting utilities
  - reward_score/__init__.py     # Reward normalization/scaling
  - reward_score/ssl4rl.py       # Scoring metrics
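
As a starting point for a new task, here is a hedged sketch in the spirit of reward_score/ssl4rl.py (the actual function name and signature in the repository may differ): it extracts the letter inside the <answer>...</answer> tag required by our prompt format and returns a binary exact-match reward.

# Hypothetical scoring function in the spirit of verl/utils/reward_score/ssl4rl.py;
# the real name, signature, and behaviour in the repository may differ.
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return 1.0 if the letter inside <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>\s*([A-Z])\s*</answer>", solution_str, re.IGNORECASE)
    if match is None:
        return 0.0  # malformed output: no verifiable answer found
    return 1.0 if match.group(1).upper() == ground_truth.strip().upper() else 0.0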

Citation

If you find this work useful, please cite our paper:

@article{guo2025ssl4rl,
  title={SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning},
  author={Guo, Xiaojun and Zhou, Runyu and Wang, Yifei and Zhang, Qi and Zhang, Chenheng and Jegelka, Stefanie and Wang, Xiaohan and Chai, Jiajun and Yin, Guojun and Lin, Wei and others},
  journal={arXiv preprint arXiv:2510.16416},
  year={2025}
}
