SSL4RL Logo

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning


Xiaojun Guo*, Runyu Zhou*, Yifei Wang*, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
*Equal Contribution. Correspondence.

📊 Overview

We propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks, with encouraging potential on open-ended image-captioning tasks. Through systematic ablations, we identify key factors that influence the effectiveness of SSL4RL tasks (data volume, model scale, model choice, task difficulty, and semantic alignment with the target domain), offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models with verifiable, self-supervised objectives.

SSL4RL Overview
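
To make this concrete, below is a minimal sketch (our illustration, not the repository's actual reward code) of how one SSL task, rotation prediction, becomes a verifiable reward: because we apply the transformation ourselves, the ground-truth label is known for free and the reward reduces to an exact-match check.

# Illustrative sketch only: an SSL task such as rotation prediction carries its own
# ground-truth label, so the reward is an automatic, verifiable exact-match check.
import random
from PIL import Image

ANGLES = [0, 90, 180, 270]

def make_rotation_example(image: Image.Image):
    """Rotate the image by a random multiple of 90 degrees; the angle is the label."""
    angle = random.choice(ANGLES)
    return image.rotate(angle, expand=True), angle

def rotation_reward(predicted_angle: int, true_angle: int) -> float:
    """Verifiable reward: 1.0 if the model recovers the applied rotation, else 0.0."""
    return 1.0 if predicted_angle == true_angle else 0.0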

📌 Key Takeaways

1️⃣ SSL as Intrinsic Reward Sharpens VLM Reasoning. The SSL4RL paradigm demonstrably enhances vision-language reasoning by repurposing SSL tasks as intrinsic rewards. It deepens the model's perception and understanding of the image itself, leading to more precise visual attention and reduced language bias.

2️⃣ Task Choice is Critical. SSL tasks are effective when their inherent semantics align with core reasoning skills, while an ill-suited task can induce negative transfer and hinder downstream performance.

3️⃣ Goldilocks Principle of Task Difficulty. The effectiveness of an SSL task is contingent on its difficulty being appropriately matched to the model's capacity. Insufficient challenge provides a weak learning signal, while excessive difficulty leads to negative transfer.

4️⃣ Non-additivity of Rewards. A naive combination of multiple SSL rewards does not yield cumulative improvements, indicating potential optimization conflicts and underscoring the need for sophisticated integration strategies rather than simple averaging.

🔥 Open-source Collections

Our models are released in the Hugging Face collection PKU-ML/SSL4RL.

🚀 Environment Setups

Our implementation is based on verl 0.3.0, developed by the ByteDance Seed team.

  1. Requirements:

    • Python: Version >= 3.9
    • CUDA: Version >= 12.1
  2. To install the dependencies, we recommend using a fresh conda environment:

    conda create -n verl python==3.10
    conda activate verl
  3. Execute the install script.

    git clone https://github.com/PKU-ML/SSL4RL.git
    cd SSL4RL
    bash scripts/install_vllm_sglang_mcore.sh

    We list the versions of the key required packages here:

      - accelerate==1.8.1
      - datasets==4.0.0
      - flash-attn==2.7.4.post1
      - pyarrow==20.0.0
      - qwen-vl-utils==0.0.11
      - tokenizers==0.21.1
      - torch==2.6.0
      - torchvision==0.21.0
      - transformers==4.52.4
      - verl==0.3.0.post1
      - vllm==0.8.5.post1
      - xformers==0.0.29.post2
  4. Install our package together with the lightweight dependencies listed in setup.py:

    pip3 install -e .

If you encounter any issues during installation, please refer to the Installation Guide provided by verl. If problems persist, don't hesitate to report them to us.

🎯 Build SSL4RL Tasks

We provide the code for building SSL4RL tasks, including Position, Rotation, Contrastive, and Jigsaw.

Take MMBench as an example:

  1. Download the MMBench benchmark from HuggingFaceM4/MMBench and save it to the local directory datasets/MMBench.

  2. To build the SSL4RL datasets, execute the scripts provided in build_benchmarks; each script generates an SSL4RL dataset that can be loaded directly with datasets.load_dataset (a hypothetical sketch of what such a script does follows the commands below). By default, the dataset is saved in the datasets directory; you can change the target directory by modifying save_dir in the script.

cd build_benchmarks

## For Position Task
python makeposition_mmbench.py
## For Rotation Task
python makerotation_mmbench.py
## For Jigsaw Task
python makejigsaw_mmbench.py
## For Contrastive Task
python makecontrastive_mmbench.py
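
For orientation, the sketch below illustrates, under our own assumptions about field names rather than the exact schema of makerotation_mmbench.py, how a rotation task can be packaged as a multiple-choice dataset with the datasets library.

# Hypothetical builder for a rotation task; the real logic lives in
# build_benchmarks/makerotation_mmbench.py and its fields may differ.
import random
from datasets import Dataset, load_dataset

ANGLES = [0, 90, 180, 270]
LETTERS = ["A", "B", "C", "D"]

def build_rotation_records(samples):
    """Turn each source image into a four-way multiple-choice rotation question."""
    records = []
    for sample in samples:
        angle = random.choice(ANGLES)
        records.append({
            "image": sample["image"].rotate(angle, expand=True),
            "question": "By how many degrees has this image been rotated?",
            "options": [f"{letter}. {a}" for letter, a in zip(LETTERS, ANGLES)],
            "answer": LETTERS[ANGLES.index(angle)],  # verifiable ground truth
        })
    return records

# Example usage (the split and column names of HuggingFaceM4/MMBench are assumptions):
# source = load_dataset("HuggingFaceM4/MMBench", split="dev")
# Dataset.from_list(build_rotation_records(source)).save_to_disk("datasets/MMBench_RotationQA")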

We provide download links for the benchmarks used in the paper here:

🧩 Running Reinforcement Learning

Follow these steps to reproduce our SSL4RL implementation:

  1. Preprocess the dataset for RL training. Run the preprocessing script to convert the dataset format:

    cd verl
    
    python preprocess.py --data_source datasets/MMBench_PositionQA  --local_dir our_datasets/MMBench_PositionQA
    • data_source: The directory of the SSL4RL dataset built in the previous step.
    • local_dir: Output directory for the processed dataset.
  2. Launch RL Training.

    Execute the training script (Position as an example):

    bash run_mmbench_position.sh

    Configuration notes for the training script:

    • SAVE_DIR: Output directory for the trained model.

    • train_path and test_path: Paths to the processed dataset.

    • Logging: Defaults to TensorBoard. To use Weights & Biases, set trainer.logger = ['console','wandb'].

    • trainer.n_gpus_per_node: Set this to your actual GPU count.

    • Our paper used 8×A800 GPUs. With limited GPU resources, reduce the following parameters (this may affect performance):

      actor_rollout_ref.actor.ppo_mini_batch_size
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu
  3. Convert RL-trained Checkpoints to HuggingFace Format

    Merge the model checkpoints into a HuggingFace-compatible format:

    bash scripts/merge_models.sh
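
As a quick sanity check (our suggestion, not a script from this repository), the merged checkpoint should load like any other Hugging Face model. The sketch assumes a Qwen2.5-VL backbone and the model path used later in the evaluation config; adjust both if your setup differs.

# Sanity-check sketch: load the merged checkpoint with standard Hugging Face APIs.
# Assumes a Qwen2.5-VL backbone; swap the model class if you fine-tuned a different one.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "our_models/ssl4rl_qwen_3b_mmbench_position_step300"  # output of the merge step
processor = AutoProcessor.from_pretrained(ckpt)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto"
)
print(model.config.model_type)  # a correctly merged checkpoint should load without errors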

🌊 Inference and Evaluation

We utilize VLMEvalKit to evaluate SSL4RL models.

  1. Install VLMEvalKit

      git clone https://github.com/open-compass/VLMEvalKit.git
      cd VLMEvalKit
      pip install -e .
  2. Customize Prompt

    To extract the answer reliably, we customize the prompt in vlmeval/dataset/image_mcq.py:

    Answer this multiple-choice question. Think step by step before answering. The last line of your response should be of the following format: <think>step-by-step reasoning</think> <answer>$LETTER</answer>, where LETTER is one of the options.

  3. Write Configs

    Add configuration files to the configs directory.

    {
        "model": {
          "ssl4rl_qwen_3b_mmbench_position_step300": {
            "class": "Qwen2VLChat",
            "model_path": "our_models/ssl4rl_qwen_3b_mmbench_position_step300",
            "min_pixels": 35840,
            "max_pixels": 12845056,
            "use_custom_prompt": false
          }
        },
        "data": {
          "MMBench": {
            "class": "ImageMCQDataset",
            "dataset": "MMBench_DEV_EN"
          }
        }
    }
  4. Run Evaluation

    We provide an example evaluation command below:

    torchrun --nproc-per-node=8 run.py \
      --config configs/mmbench_position.json \
      --work-dir eval_results/SSL4RL \
      --mode infer

🎨 Customization Guide

To adapt SSL4RL for your needs, we recommend modifying these key files from our verl-based implementation:

verl/utils/                      # Supporting utilities
  - reward_score/__init__.py     # Reward normalization/scaling
  - reward_score/ssl4rl.py       # Scoring metrics
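
As a starting point for a new task, here is a hedged sketch in the spirit of reward_score/ssl4rl.py (the actual function name and signature in the repository may differ): it extracts the letter inside the <answer>...</answer> tag required by our prompt format and returns a binary exact-match reward.

# Hypothetical scoring function in the spirit of verl/utils/reward_score/ssl4rl.py;
# the real name, signature, and behaviour in the repository may differ.
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return 1.0 if the letter inside <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>\s*([A-Z])\s*</answer>", solution_str, re.IGNORECASE)
    if match is None:
        return 0.0  # malformed output: no verifiable answer found
    return 1.0 if match.group(1).upper() == ground_truth.strip().upper() else 0.0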

Citation

If you find this work useful, please cite our paper:

@article{guo2025ssl4rl,
  title={SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning},
  author={Guo, Xiaojun and Zhou, Runyu and Wang, Yifei and Zhang, Qi and Zhang, Chenheng and Jegelka, Stefanie and Wang, Xiaohan and Chai, Jiajun and Yin, Guojun and Lin, Wei and others},
  journal={arXiv preprint arXiv:2510.16416},
  year={2025}
}
