Long Xing* · Xiaoyi Dong* · Yuhang Zang · Yuhang Cao · Jianze Liang · Qidong Huang · Jiaqi Wang · Feng Wu · Dahua Lin
📖 Paper | 🤗 CapRL-3B Model | 🤗 CapRL-2M Dataset | 🤗 CapRL Collection | 🤗 Daily Paper
🌈We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. By employing the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtain a highly capable captioner, CapRL-3B.


- 🚀 [09/25/2025] We release the CapRL repository, model, evaluation code, and dataset.
- 🔥 Remarkable visual understanding of charts, infographics, and documents: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
- 🔥 Well-organized output: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- 🔥 Detailed descriptions of natural images: the outputs of CapRL-3B thoroughly cover the valid visual information while containing fewer hallucinations.
- Release training code.
- Release the 75K QA dataset.
- Release CapRL-series models built on stronger base models.
git clone https://github.com/InternLM/CapRL.git
conda create -n CapRL python=3.10
conda activate CapRL
bash setup.sh
If you want to use CapRL-3B for captioning, you can directly follow the exact same inference approach as the Qwen2.5-VL series.
We recommend using vLLM to speed up inference.
Run the command below to start an OpenAI-compatible API service:
vllm serve "/PATH/CapRL-3B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu-memory-utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
Then you can use the chat API as shown below (see the OpenAI API protocol documentation for more details):
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=4096,  # adjust as needed
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
Our CapRL-2M dataset is available on 🔗 Hugging Face.
It includes images from ShareGPT-1M and DenseFusion-1M, with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.
In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
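As a rough illustration, the sketch below reads one JSONL shard and pairs each caption with its image path; the file name and the "image"/"caption" field names are assumptions, so check the actual keys in the released files:

import json

# Hypothetical sketch: iterate over a CapRL-2M JSONL shard and resolve image paths.
# The "image" and "caption" keys are assumptions; inspect the released files for
# the actual field names.
image_root = "/path/to/downloaded/images"  # images from ShareGPT-1M / DenseFusion-1M
with open("/path/to/caprl_2m_shard.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        image_path = f"{image_root}/{sample['image']}"
        caption = sample["caption"]
        # ... build (image_path, caption) pairs for pretraining data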
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook initiallize_vlm_3b.ipynb to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMAFactory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
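For intuition, the sketch below mocks up this two-stage protocol with the same OpenAI-compatible client used in the inference section; the served model names, prompts, and exact-match scoring are simplified assumptions, and the actual implementation is in Eval_CapRL.py:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def generate_caption(image_data_url: str) -> str:
    # Stage 1: the captioner (e.g. a served CapRL-3B) describes the image.
    resp = client.chat.completions.create(
        model="caprl",  # assumed served-model-name, as in the vLLM example above
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_url}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }],
    )
    return resp.choices[0].message.content

def answer_from_caption(caption: str, question: str) -> str:
    # Stage 2: a language model answers using ONLY the caption, never the image.
    resp = client.chat.completions.create(
        model="caprl-eval",  # assumed name for a served CapRL-Eval-3B instance
        messages=[{
            "role": "user",
            "content": f"Caption: {caption}\n\nQuestion: {question}\n"
                       "Answer using only the caption.",
        }],
    )
    return resp.choices[0].message.content

def caption_score(image_data_url, qa_pairs):
    # A caption counts as informative if answers derived from it match the ground
    # truth; exact string matching here is a simplification of the real scoring.
    caption = generate_caption(image_data_url)
    hits = sum(
        answer_from_caption(caption, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return hits / len(qa_pairs)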
The complete evaluation scripts can be found in the Prism_Evaluation folder, with the core implementation located in Eval_CapRL.py.
The model used for answering questions based on captions is CapRL-Eval-3B, a fine-tuned version of Qwen2.5-VL-3B. For tasks such as ChartQA, which are not multiple-choice, it produces more stable output formatting.
You can specify --reward-model-path as the path to CapRL-Eval-3B in Eval_CapRL.py.




Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Usage should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!