Long Xing* · Xiaoyi Dong* · Yuhang Zang · Yuhang Cao · Jianze Liang · Qidong Huang · Jiaqi Wang · Feng Wu · Dahua Lin
📖 Paper | 🤗 CapRL-3B Model | 🤗 CapRL-2M Dataset | 🤗 CapRL Collection | 🤗 Daily Paper
🌈We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. By employing the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtain a highly capable captioner, CapRL-3B.


- 🚀 [09/25/2025] We release the CapRL repository, model, evaluation code, and dataset.
- 🔥 Remarkable visual understanding of charts, infographics, and documents: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
- 🔥 Well-organized output: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- 🔥 Detailed descriptions of natural images: the outputs of CapRL-3B thoroughly cover the valid visual information while containing fewer hallucinations.
- Release training code.
- Release the 75K QA dataset.
- Release CapRL-series models built on stronger base models.
git clone https://github.com/InternLM/CapRL.git
conda create -n CapRL python=3.10
conda activate CapRL
bash setup.sh
If you want to use CapRL-3B for captioning, you can directly follow the exact same inference approach as the Qwen2.5-VL series.
We recommend using vLLM to speed up inference.
Run the command below to start an OpenAI-compatible API service:
vllm serve "/PATH/CapRL-3B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu-memory-utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
Then you can use the chat API as shown below (see the OpenAI API protocol documentation for more details):
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=4096,  # adjust as needed
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
Our CapRL-2M dataset is available on 🔗 Hugging Face.
It includes images from ShareGPT-1M and DenseFusion-1M, with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.
In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
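As a rough illustration, the sketch below reads one JSONL shard and pairs each caption with its image path; the file name and the "image"/"caption" field names are assumptions, so check the actual keys in the released files:

import json

# Hypothetical sketch: iterate over a CapRL-2M JSONL shard and resolve image paths.
# The "image" and "caption" keys are assumptions; inspect the released files for
# the actual field names.
image_root = "/path/to/downloaded/images"  # images from ShareGPT-1M / DenseFusion-1M
with open("/path/to/caprl_2m_shard.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        image_path = f"{image_root}/{sample['image']}"
        caption = sample["caption"]
        # ... build (image_path, caption) pairs for pretraining data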
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook initiallize_vlm_3b.ipynb to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMAFactory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
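For intuition, the sketch below mocks up this two-stage protocol with the same OpenAI-compatible client used in the inference section; the served model names, prompts, and exact-match scoring are simplified assumptions, and the actual implementation is in Eval_CapRL.py:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def generate_caption(image_data_url: str) -> str:
    # Stage 1: the captioner (e.g. a served CapRL-3B) describes the image.
    resp = client.chat.completions.create(
        model="caprl",  # assumed served-model-name, as in the vLLM example above
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_url}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }],
    )
    return resp.choices[0].message.content

def answer_from_caption(caption: str, question: str) -> str:
    # Stage 2: a language model answers using ONLY the caption, never the image.
    resp = client.chat.completions.create(
        model="caprl-eval",  # assumed name for a served CapRL-Eval-3B instance
        messages=[{
            "role": "user",
            "content": f"Caption: {caption}\n\nQuestion: {question}\n"
                       "Answer using only the caption.",
        }],
    )
    return resp.choices[0].message.content

def caption_score(image_data_url, qa_pairs):
    # A caption counts as informative if answers derived from it match the ground
    # truth; exact string matching here is a simplification of the real scoring.
    caption = generate_caption(image_data_url)
    hits = sum(
        answer_from_caption(caption, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return hits / len(qa_pairs)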
The complete evaluation scripts can be found in the Prism_Evaluation folder, with the core implementation located in Eval_CapRL.py.
The model used for answering questions based on captions is CapRL-Eval-3B, a fine-tuned version of Qwen2.5-VL-3B. For tasks such as ChartQA, which are not multiple-choice, it produces more stable output formatting.
You can specify --reward-model-path as the path to CapRL-Eval-3B in Eval_CapRL.py.




Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Usage should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!