Sara Ghaboura*
Ketan More*
Wafa Alghallabi
Omkar Thawakar
Jorma Laaksonen
Hisham Cholakkal
Salman Khan
Rao M. Anwer
*Equal Contribution
🔥 [22 May 2025] ARB, the first Arabic multimodal benchmark focused on step-by-step reasoning, is released.
🤗 [22 May 2025] The ARB dataset is available on HuggingFace.
ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities, covering 11 diverse domains spanning science, culture, OCR, and historical interpretation.
- 1,356 multimodal samples, each with an image, an Arabic question, and a reasoning-based answer.
- 5,119 curated reasoning steps reflecting human logic.
- 11 diverse domains, from visual reasoning to historical and scientific analysis.
- Verified by native Arabic speakers and domain experts.
- Hybrid sources: original Arabic data, high-quality translations, and synthetic samples.
- Robust evaluation framework for final-answer accuracy and reasoning quality.
- Fully open-source dataset and toolkit to support research in Arabic reasoning and multimodal AI.
| Domain | English Bench | Arabic Bench | Human-Created | Synthetic |
|---|---|---|---|---|
| Visual Reasoning | ✓ | ✓ | ✓ | ✓ |
| OCR & Document Analysis | ✓ | ✓ | ✓ | ✓ |
| Chart & Data Table (CDT) | ✓ | ✓ | ✓ | ✓ |
| Math & Logic | ✓ | ✓ | ✓ | ✓ |
| Social & Cultural | ✓ | ✓ | ✓ | ✓ |
| Computer Vision Perception | ✓ | ✓ | ✓ | ✓ |
| Medical Image Analysis | ✓ | ✓ | ✓ | ✓ |
| Scientific Reasoning | ✓ | ✓ | ✓ | ✓ |
| Agricultural Interpretation | ✓ | ✓ | ✓ | ✓ |
| Remote Sensing Understanding | ✓ | ✓ | ✓ | ✓ |
| Historical & Anthropological | ✓ | ✓ | ✓ | ✓ |
```python
from datasets import load_dataset

# Log in first (e.g. `huggingface-cli login`) to access this dataset
ds = load_dataset("MBZUAI/ARB")
```
We evaluated 12 open- and closed-source LMMs using:
- Lexical and semantic similarity scores: BLEU, ROUGE, BERTScore.
- Cross-lingual semantic alignment: LaBSE.
- Custom Arabic rubric: our curated rubric includes 10 factors such as faithfulness, interpretive depth, coherence, hallucination, and more.
We evaluate models using:
- Step-by-step reasoning quality (coherence, informativeness, commonsense)
- Final answer accuracy
- Agreement with human raters (Krippendorff's Alpha > 87%)
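As a rough illustration of the lexical-similarity side of this evaluation (this is a sketch, not the benchmark's actual scoring toolkit; the function name and whitespace tokenization are assumptions, and real scoring should use an established implementation such as sacreBLEU with proper Arabic tokenization), the 1-gram component of BLEU can be computed as clipped unigram precision:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (the 1-gram component of BLEU).

    Illustrative only: tokenizes on whitespace and clips each candidate
    token's count by its count in the reference.
    """
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    clipped = sum(
        min(count, ref_counts[tok])
        for tok, count in Counter(cand_tokens).items()
    )
    return clipped / len(cand_tokens)

print(unigram_precision("a b c", "a b c"))  # 1.0 (identical strings)
print(unigram_precision("a b", "x y"))      # 0.0 (no overlap)
```

Higher-order n-grams, the brevity penalty, and semantic metrics such as BERTScore and LaBSE all build on the same candidate-vs-reference comparison idea.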
For Closed-Source Models:
| Metric | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
|---|---|---|---|---|---|---|
| Final Answer (%) | 60.22 | 52.22 | 59.43 | 58.93 | 56.7 | 57.8 |
| Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | 80.75 | 64.34 | 64.09 |
For Open-Source Models:
| Metric | Qwen2.5-VL-7B | Llama-3.2-11B | AIN | Llama-4 Scout | Aya-Vision-8B | InternVL3-8B |
|---|---|---|---|---|---|---|
| Final Answer (%) | 37.02 | 25.58 | 27.35 | 48.52 | 28.81 | 31.04 |
| Reasoning Steps (%) | 64.03 | 53.2 | 52.77 | 77.7 | 63.64 | 54.5 |
Each sample includes:
- `image`: visual input
- `question`: Arabic reasoning prompt
- `choices`: the choices for MCQ
- `steps`: ordered reasoning chain
- `answer`: final solution (Arabic)
- `domain`: one of 11 categories (e.g., OCR, Scientific, Visual, Math)
- `curriculum`: one of the 4 curricula followed by the prompt for step generation (Computational, Sci/Med, Textual/Partial, and General)
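A record from the dataset can thus be pictured as a plain dictionary. The sketch below mirrors the field names listed above; every value is an invented placeholder, not real ARB data:

```python
# Hypothetical ARB-style record: keys follow the schema above,
# values are placeholders rather than actual dataset content.
sample = {
    "image": "<image or file path>",        # visual input
    "question": "...",                      # Arabic reasoning prompt
    "choices": ["A", "B", "C", "D"],        # options for the MCQ
    "steps": ["step 1 ...", "step 2 ..."],  # ordered reasoning chain
    "answer": "...",                        # final solution (Arabic)
    "domain": "Chart & Data Table (CDT)",   # one of the 11 categories
    "curriculum": "Computational",          # one of the 4 curricula
}

# Sanity checks on the expected shape of a record:
assert set(sample) == {
    "image", "question", "choices", "steps", "answer", "domain", "curriculum"
}
assert isinstance(sample["steps"], list) and sample["steps"]
```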
If you use the ARB dataset in your research, please consider citing:
```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
  title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
  author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
  year={2025},
  eprint={2505.17021},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17021},
}
```