Sara Ghaboura*
Ketan More*
Wafa Alghallabi
Omkar Thawakar
Jorma Laaksonen
Hisham Cholakkal
Salman Khan
Rao M. Anwer
*Equal Contribution
🔥 [22 May 2025] ARB, the first Arabic multimodal benchmark focused on step-by-step reasoning, is released.
🤗 [22 May 2025] The ARB dataset is available on HuggingFace.
ARB is the first benchmark focused on step-by-step reasoning in Arabic across both textual and visual modalities, covering 11 diverse domains spanning science, culture, OCR, and historical interpretation.
- 1,356 multimodal samples, each with an image, an Arabic question, and a reasoning-based answer.
- 5,119 curated reasoning steps reflecting human logic.
- 11 diverse domains, from visual reasoning to historical and scientific analysis.
- Verified by native Arabic speakers and domain experts.
- Hybrid sources: original Arabic data, high-quality translations, and synthetic samples.
- Robust evaluation framework for final-answer accuracy and reasoning quality.
- Fully open-source dataset and toolkit to support research in Arabic reasoning and multimodal AI.
| Domain | English Bench | Arabic Bench | Human-Created | Synthetic |
|---|---|---|---|---|
| Visual Reasoning | ✓ | ✓ | ✓ | ✓ |
| OCR & Document Analysis | ✓ | ✓ | ✓ | ✓ |
| Chart & Data Table (CDT) | ✓ | ✓ | ✓ | ✓ |
| Math & Logic | ✓ | ✓ | ✓ | ✓ |
| Social & Cultural | ✓ | ✓ | ✓ | ✓ |
| Computer Vision Perception | ✓ | ✓ | ✓ | ✓ |
| Medical Image Analysis | ✓ | ✓ | ✓ | ✓ |
| Scientific Reasoning | ✓ | ✓ | ✓ | ✓ |
| Agricultural Interpretation | ✓ | ✓ | ✓ | ✓ |
| Remote Sensing Understanding | ✓ | ✓ | ✓ | ✓ |
| Historical & Anthropological | ✓ | ✓ | ✓ | ✓ |
```python
from datasets import load_dataset

# Log in first (e.g. `huggingface-cli login`) to access this dataset
ds = load_dataset("MBZUAI/ARB")
```
We evaluated 12 open- and closed-source LMMs using:
- Lexical and semantic similarity scores: BLEU, ROUGE, BERTScore.
- Cross-lingual semantic alignment: LaBSE.
- Custom Arabic rubric: our curated rubric includes 10 factors such as faithfulness, interpretive depth, coherence, hallucination, and more.
We evaluate models using:
- Step-by-step reasoning quality (coherence, informativeness, commonsense)
- Final answer accuracy
- Agreement with human raters (Krippendorff's Alpha > 87%)
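As a rough illustration of the lexical-similarity side of this evaluation (this is a sketch, not the benchmark's actual scoring toolkit; the function name and whitespace tokenization are assumptions, and real scoring should use an established implementation such as sacreBLEU with proper Arabic tokenization), the 1-gram component of BLEU can be computed as clipped unigram precision:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (the 1-gram component of BLEU).

    Illustrative only: tokenizes on whitespace and clips each candidate
    token's count by its count in the reference.
    """
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    clipped = sum(
        min(count, ref_counts[tok])
        for tok, count in Counter(cand_tokens).items()
    )
    return clipped / len(cand_tokens)

print(unigram_precision("a b c", "a b c"))  # 1.0 (identical strings)
print(unigram_precision("a b", "x y"))      # 0.0 (no overlap)
```

Higher-order n-grams, the brevity penalty, and semantic metrics such as BERTScore and LaBSE all build on the same candidate-vs-reference comparison idea.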
For Closed-Source Models:
| Metric | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
|---|---|---|---|---|---|---|
| Final Answer (%) | 60.22 | 52.22 | 59.43 | 58.93 | 56.7 | 57.8 |
| Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | 80.75 | 64.34 | 64.09 |
For Open-Source Models:
| Metric | Qwen2.5-VL-7B | Llama-3.2-11B | AIN | Llama-4 Scout | Aya-Vision-8B | InternVL3-8B |
|---|---|---|---|---|---|---|
| Final Answer (%) | 37.02 | 25.58 | 27.35 | 48.52 | 28.81 | 31.04 |
| Reasoning Steps (%) | 64.03 | 53.2 | 52.77 | 77.7 | 63.64 | 54.5 |
Each sample includes:
- `image`: visual input
- `question`: Arabic reasoning prompt
- `choices`: the choices for MCQ
- `steps`: ordered reasoning chain
- `answer`: final solution (Arabic)
- `domain`: one of 11 categories (e.g., OCR, Scientific, Visual, Math)
- `curriculum`: one of the 4 curricula followed by the prompt for step generation (Computational, Sci/Med, Textual/Partial, and General)
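A record from the dataset can thus be pictured as a plain dictionary. The sketch below mirrors the field names listed above; every value is an invented placeholder, not real ARB data:

```python
# Hypothetical ARB-style record: keys follow the schema above,
# values are placeholders rather than actual dataset content.
sample = {
    "image": "<image or file path>",        # visual input
    "question": "...",                      # Arabic reasoning prompt
    "choices": ["A", "B", "C", "D"],        # options for the MCQ
    "steps": ["step 1 ...", "step 2 ..."],  # ordered reasoning chain
    "answer": "...",                        # final solution (Arabic)
    "domain": "Chart & Data Table (CDT)",   # one of the 11 categories
    "curriculum": "Computational",          # one of the 4 curricula
}

# Sanity checks on the expected shape of a record:
assert set(sample) == {
    "image", "question", "choices", "steps", "answer", "domain", "curriculum"
}
assert isinstance(sample["steps"], list) and sample["steps"]
```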
If you use the ARB dataset in your research, please consider citing:
```bibtex
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
  title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
  author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
  year={2025},
  eprint={2505.17021},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17021},
}
```