Authors: Tajamul Ashraf*, Amal Saqib*, Hanan Gani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip H.S. Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Salman Khan
* Equal contribution. Correspondence: Tajamul Ashraf, Amal Saqib.
**[2025-11-01]:** Full evaluation and integrated tool server support for custom models are now live!
[2025-08-03]: Agent-X won Second Place in the Research Track at the Agentic AI Summit 2025, UC Berkeley.
[2025-06-02]: Agent-X paper published on arXiv.
[2025-05-29]: Released evaluation & deployment code for Agent-X.
[2025-05-22]: Published the Agent-X dataset on Hugging Face.
Current tool-use tests for vision-centric LLMs rely on single-turn, synthetic queries and text-only inputs, so they miss the real-world challenge of multi-step, multimodal reasoning. Agent-X closes this gap with 828 authentic tasks spanning images, videos, and mixed-modal instructions across six domains, from web browsing to autonomous driving. Each task demands explicit, step-by-step decisions and judicious tool use, and our evaluation scores every reasoning step as well as the overall chain. Even top models (GPT, Gemini, Qwen) solve fewer than half of these tasks, exposing major bottlenecks and pointing the way for future research.
Agent-X is a benchmark for assessing deep-reasoning and tool-use skills of vision-centric LLM agents in real-world settings. It highlights three key aspects:
- Authentic multi-step tasks. The benchmark offers 828 human-authored tasks with implicit tool use and sequential planning requirements, spanning six domains including web browsing, surveillance, autonomous driving, sports, and math reasoning.
- Real deployed tools. Agent-X supplies an evaluation platform stocked with perception, web, manipulation, math, and data-processing tools, compelling agents to choose and apply the right tool at each reasoning step.
- Diverse multimodal contexts. Every task is paired with real images, multi-image comparisons, or video clips, plus textual instructions, closely mirroring the visual complexity of real-world scenarios.
A comparison of Agent-X queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m&m's are explicitly stated, as marked in red and blue. The queries in APIBench are simple, containing only one step. Agent-X queries, in contrast, are both step-implicit and tool-implicit.
Overview of the Agent-X benchmark: key data statistics, overall frequency of tool usage, number of steps, and distribution of tasks across the six vision-centric environments.
We design the Agent-X benchmark using a semi-automated pipeline that ensures each task is solvable with a defined tool subset and requires deep reasoning over realistic, multimodal scenarios. The pipeline begins with an LMM (Large Multimodal Model) generating candidate queries based on visual input and an available toolset. These queries are then refined by human annotators for clarity and realism. Next, the refined queries are passed back to the LMM to produce step-by-step reasoning traces, including tool calls, intermediate outputs, and final answers. Each trace is manually reviewed for logical consistency and correctness.
We evaluate models on the Agent-X benchmark across three distinct modes (a toy scoring sketch follows this list):
- Step-by-Step: Assesses the agent's ability to execute individual reasoning steps, focusing on how well it follows structured tool-use sequences grounded in visual inputs.
- Deep Reasoning: Evaluates the coherence and logical consistency of the full reasoning trace. This mode emphasizes the agent's capacity to integrate visual and textual context to produce semantically meaningful and factually accurate explanations.
- Outcome: Measures the agent's overall task-solving performance by verifying the correctness of the final answer and appropriate tool usage.
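The benchmark's semantic metrics are scored by LLM judges (see below), but the difference between step-level and outcome-level scoring can be illustrated with a toy sketch. The field names (`tool`, `answer`) and the exact-match criterion are assumptions for illustration only, not the benchmark's schema or scoring code.

```python
# Toy illustration of step-level vs. outcome-level scoring for a single task.
# Field names ("tool", "answer") and exact matching are assumptions, not the
# Agent-X schema or its judge-based metrics.

def step_tool_accuracy(pred_steps, gt_steps):
    """Fraction of ground-truth steps whose tool is matched, position-wise."""
    if not gt_steps:
        return 0.0
    hits = sum(p["tool"] == g["tool"] for p, g in zip(pred_steps, gt_steps))
    return hits / len(gt_steps)

def outcome_correct(pred_answer, gt_answer):
    """Binary final-answer check via normalized string comparison."""
    return pred_answer.strip().lower() == gt_answer.strip().lower()

pred = {"steps": [{"tool": "OCR"}, {"tool": "Calculator"}], "answer": "42"}
gt = {"steps": [{"tool": "OCR"}, {"tool": "Calculator"}], "answer": "42"}
print(step_tool_accuracy(pred["steps"], gt["steps"]))  # 1.0
print(outcome_correct(pred["answer"], gt["answer"]))   # True
```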
We report results using GPT-4 and Qwen-15B as evaluation judges. For each metric, the best-performing value is shown in bold and underlined, while the second-best is italicized.
| Model | Instalign | Toolacc | Argacc | Sumacc | Faithfulness | Context-Score | Factual-Precision | Semantic-Accuracy | pf1 | of1 | lf1 | cf1 | Ansacc | Ansacc (w/ img gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-VL | 0.42 | 0.20 | 0.03 | 0.08 | 0.14 | 0.09 | 0.18 | 0.32 | 0.19 | 0.33 | 0.25 | 0.24 | 0.01 | 0.01 |
| InternVL3-8B | 0.49 | 0.28 | 0.04 | 0.12 | 0.14 | 0.09 | 0.13 | 0.41 | 0.11 | 0.08 | 0.12 | 0.12 | 0.02 | 0.02 |
| Model | Grounds | Toolp | Toolacc | Factacc | Contexts | Factp | Semacc | Goalacc | Goalacc* | Toolaccs |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-source | ||||||||||
| Phi-4-VL-Instruct | 0.13 | 0.21 | 0.24 | 0.61 | 0.19 | 0.47 | 0.40 | 0.11 | 0.26 | 0.42 |
| InternVL-2.5-8B | 0.45 | 0.31 | 0.47 | 0.68 | 0.47 | 0.52 | 0.60 | 0.28 | 0.55 | 0.58 |
| Gemma-3-4B | 0.26 | 0.30 | 0.78 | 0.61 | 0.54 | 0.38 | 0.54 | 0.27 | 0.67 | 0.60 |
| InternVL-3-8B | 0.46 | 0.34 | 0.54 | 0.68 | 0.45 | 0.70 | 0.40 | 0.20 | 0.59 | 0.62 |
| VideoLLaMA-3-7B | 0.45 | 0.28 | 0.46 | 0.65 | 0.46 | 0.62 | 0.54 | 0.28 | 0.54 | 0.54 |
| Qwen-2.5-VL-7B | 0.54 | 0.43 | 0.63 | 0.75 | 0.57 | 0.56 | 0.67 | 0.36 | 0.65 | 0.67 |
| Pixtral-12B | 0.12 | 0.20 | 0.63 | 0.45 | 0.19 | 0.26 | 0.34 | 0.07 | 0.55 | 0.54 |
| LLaMA-3.2-11B-Vision | 0.03 | 0.15 | 0.14 | 0.70 | 0.08 | 0.70 | 0.24 | 0.07 | 0.26 | 0.42 |
| Kimi-VL-A3B-Thinking | 0.26 | 0.19 | 0.5 | 0.62 | 0.42 | 0.52 | 0.65 | 0.29 | 0.29 | 0.48 |
| mPLUG-Owl3-7B-240728 | 0.10 | 0.14 | 0.30 | 0.49 | 0.25 | 0.32 | 0.37 | 0.11 | 0.26 | 0.50 |
| Closed-source | ||||||||||
| Gemini-1.5-Pro | 0.43 | 0.23 | 0.84 | 0.62 | 0.45 | 0.53 | 0.62 | 0.04 | 0.56 | 0.48 |
| Gemini-2.5-Pro | 0.40 | 0.36 | 0.81 | 0.72 | 0.48 | 0.64 | 0.73 | 0.40 | 0.56 | 0.62 |
| GPT-4o | 0.60 | 0.47 | 0.72 | 0.81 | 0.57 | 0.79 | 0.59 | 0.37 | 0.70 | 0.68 |
| OpenAI o4-mini | 0.42 | 0.32 | 0.89 | 0.71 | 0.51 | 0.60 | 0.80 | 0.45 | 0.67 | 0.63 |
| Model | Grounds | Toolp | Toolacc | Factacc | Contexts | Factp | Semacc | Goalacc | Goalacc* | Toolaccs |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-source | ||||||||||
| Phi-4-VL-Instruct | 0.27 | 0.11 | 0.32 | 0.54 | 0.39 | 0.59 | 0.46 | 0.16 | 0.35 | 0.39 |
| InternVL2.5-8B | 0.38 | 0.16 | 0.49 | 0.63 | 0.51 | 0.61 | 0.55 | 0.29 | 0.53 | 0.53 |
| Gemma-3-4B | 0.50 | 0.24 | 0.67 | 0.74 | 0.66 | 0.59 | 0.74 | 0.30 | 0.68 | 0.68 |
| InternVL3-8B | 0.41 | 0.16 | 0.51 | 0.71 | 0.61 | 0.60 | 0.69 | 0.23 | 0.51 | 0.62 |
| VideoLLaMA3-7B | 0.39 | 0.15 | 0.40 | 0.68 | 0.56 | 0.60 | 0.68 | 0.27 | 0.53 | 0.56 |
| Qwen2.5-VL-7B | 0.51 | 0.27 | 0.63 | 0.77 | 0.66 | 0.64 | 0.77 | 0.37 | 0.62 | 0.67 |
| Pixtral-12B | 0.30 | 0.17 | 0.68 | 0.59 | 0.50 | 0.42 | 0.58 | 0.10 | 0.68 | 0.58 |
| LLaMA-3.2-11B-Vision | 0.16 | 0.06 | 0.12 | 0.49 | 0.17 | 0.74 | 0.20 | 0.10 | 0.11 | 0.15 |
| Kimi-VL-A3B-Thinking | 0.47 | 0.20 | 0.59 | 0.79 | 0.64 | 0.68 | 0.74 | 0.35 | 0.60 | 0.62 |
| mPLUG-Owl3-7B-240728 | 0.30 | 0.11 | 0.31 | 0.59 | 0.48 | 0.48 | 0.56 | 0.16 | 0.45 | 0.48 |
| Closed-source | ||||||||||
| Gemini-1.5-Pro | 0.57 | 0.36 | 0.80 | 0.82 | 0.73 | 0.76 | 0.63 | 0.05 | 0.77 | 0.71 |
| Gemini-2.5-Pro | 0.63 | 0.40 | 0.84 | 0.86 | 0.76 | 0.80 | 0.83 | 0.50 | 0.74 | 0.72 |
| GPT-4o | 0.46 | 0.27 | 0.63 | 0.72 | 0.59 | 0.75 | 0.69 | 0.44 | 0.48 | 0.56 |
| OpenAI-o4-mini | 0.63 | 0.35 | 0.86 | 0.89 | 0.78 | 0.79 | 0.88 | 0.53 | 0.64 | 0.69 |
- Clone this repo.
git clone https://github.com/mbzuai-oryx/Agent-X.git
cd Agent-X
- You can get the dataset from the release files. The images are under the Files section of the Hugging Face repository; download the other files, such as dataset.json and toolmeta.json, and organize them as shown below. Put all the images under the image subfolder (a scripted download sketch follows the directory tree below).
mkdir ./opencompass/data
Put the dataset under the folder ./opencompass/data/. The structure of the files should be:
Agent-X/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── agentx_dataset
│   │   │   ├── dataset.json
│   │   │   ├── toolmeta.json
│   │   │   └── image
│   ├── ...
├── ...
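If you prefer scripting the download, a minimal sketch using huggingface_hub is shown below. The dataset repository id is a placeholder (substitute the actual Agent-X dataset repo on Hugging Face), and the inspection step assumes nothing beyond the two JSON files listed above.

```python
import json
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual Agent-X dataset repository.
local_dir = snapshot_download(
    repo_id="<org>/<agent-x-dataset>",
    repo_type="dataset",
    local_dir="./opencompass/data/agentx_dataset",
)

# Quick sanity check that the expected files are present and parse as JSON.
with open(f"{local_dir}/dataset.json") as f:
    dataset = json.load(f)
with open(f"{local_dir}/toolmeta.json") as f:
    toolmeta = json.load(f)
print(type(dataset), type(toolmeta))
```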
- Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
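Equivalently, the weights can be pulled from Python with huggingface_hub.snapshot_download; this is just a convenience sketch mirroring the CLI command above.

```python
import os
from huggingface_hub import snapshot_download

# Download Qwen1.5-7B-Chat to the same local path used by the lmdeploy command below.
snapshot_download(
    repo_id="Qwen/Qwen1.5-7B-Chat",
    local_dir=os.path.expanduser("~/models/qwen1.5-7b-chat"),
)
```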
- Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy
For CUDA 12:
pip install lmdeploy
For CUDA 11+:
export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
- Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
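Once the server is running, you can sanity-check it before wiring it into OpenCompass. The sketch below assumes the standard OpenAI-compatible routes exposed by lmdeploy's api_server and the host/port from the example command; adjust them to your deployment.

```python
import requests

base = "http://127.0.0.1:12580"

# List the models served by the OpenAI-compatible API.
print(requests.get(f"{base}/v1/models", timeout=10).json())

# Send a one-turn chat request to confirm generation works end to end.
payload = {
    "model": "qwen1.5-7b-chat",
    "messages": [{"role": "user", "content": "Reply with a one-sentence greeting."}],
}
resp = requests.post(f"{base}/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```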
- Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0
Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py, then change _supports_sdpa = False to _supports_sdpa = True at line 1279.
- Deploy tools for the Agent-X benchmark.
To use the GoogleSearch and MathOCR tools, you should first get the Serper API key from https://serper.dev, and the Mathpix API key from https://mathpix.com/. Then export these keys as environment variables.
export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_key_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_key_for_mathocr_tool'
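Optionally, you can verify the keys are visible to the process before starting the tool server; this is a small convenience check, not a required step.

```python
import os

# Check that the keys needed by the GoogleSearch and MathOCR tools are set.
required = ["SERPER_API_KEY", "MATHPIX_APP_ID", "MATHPIX_APP_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All tool API keys are set.")
```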
Start the tool server.
agentlego-server start --port 16181 --extra ./benchmark.py `cat benchmark_toollist.txt` --host 0.0.0.0
- Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
The following package versions are required:
huggingface_hub==0.25.2 (<0.26.0)
transformers==4.40.1
- Modify the config file at configs/eval_agentx_bench.py as shown below.
The IP and port in openai_api_base are the IP of your model service and the port number you specified when launching lmdeploy.
The IP and port in tool_server are the IP of your tool service and the port number you specified when launching agentlego.
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
If you infer and evaluate in step-by-step mode, you should comment out tool_server and enable tool_meta in configs/eval_agentx_bench.py, and set the infer mode and eval mode to every_with_gt in configs/datasets/agentx_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
# tool_server='http://10.140.0.138:16181',
tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
agentx_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
agentx_bench_eval_cfg = dict(evaluator=dict(type=AGENTXBenchEvaluator, mode='every_with_gt'))
If you infer and evaluate in end-to-end mode, you should comment out tool_meta and enable tool_server in configs/eval_agentx_bench.py, and set the infer mode and eval mode to every in configs/datasets/agentx_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
# tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
agentx_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
agentx_bench_eval_cfg = dict(evaluator=dict(type=AGENTXBenchEvaluator, mode='every'))
- Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --mode infer
# evaluate only
# srun -p llmit -q auto python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
srun -p llmit -q auto python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --reuse 20240628_115514 --mode eval
# infer and evaluate
python run.py configs/eval_agentx_bench.py -p llmit -q auto --max-num-workers 32 --debug
See generation/README.md for details on:
- Frame extraction from video clips
- Query generation using GPT-4o
- Step-by-step reasoning trace generation
Path: generation/README.md
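As a rough illustration of the frame-extraction step listed above (the actual script lives under generation/, and the sampling stride here is an arbitrary assumption), frames can be sampled from a clip with OpenCV:

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every n-th frame of a video clip as a JPEG (illustrative sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: extract_frames("clip.mp4", "./frames", every_n=30)
```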
See analysis/README.md for:
- Error analysis notebook
- Model comparison plots
- Tool usage breakdown and visualizations
Path: analysis/README.md
See eval/ for:
- Scripted evaluation of model inference results
- Accuracy metrics, binary matching scores, and goal success analysis
- Useful for benchmarking your model outputs against Agent-X GT
Path: eval/
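For intuition about the matching-style scores mentioned above, the toy function below computes an F1 over predicted versus ground-truth tool calls. It is an illustrative sketch only, not the repository's implementation.

```python
from collections import Counter

def tool_f1(pred_tools, gt_tools):
    """Toy precision/recall/F1 over tool-name multisets (illustrative only)."""
    pred_c, gt_c = Counter(pred_tools), Counter(gt_tools)
    tp = sum((pred_c & gt_c).values())
    precision = tp / max(sum(pred_c.values()), 1)
    recall = tp / max(sum(gt_c.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(tool_f1(["OCR", "Calculator"], ["OCR", "Calculator", "ImageDescription"]), 2))  # 0.8
```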
This project builds upon the excellent open-source work of the GTA paper, as well as the teams behind OpenCompass and AgentLego.
We sincerely thank the authors and contributors of these projects for sharing their implementations and evaluation frameworks, which greatly supported the development and experimentation in this repository.
If you use Agent-X in your research, please cite the following paper:
@misc{ashraf2025agentxevaluatingdeepmultimodal,
title={Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks},
author={Tajamul Ashraf and Amal Saqib and Hanan Ghani and Muhra AlMahri and Yuhao Li and Noor Ahsan and Umair Nawaz and Jean Lahoud and Hisham Cholakkal and Mubarak Shah and Philip Torr and Fahad Shahbaz Khan and Rao Muhammad Anwer and Salman Khan},
year={2025},
eprint={2505.24876},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.24876},
}