Authors: Tajamul Ashraf*, Amal Saqib*, Hanan Gani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip H.S. Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Salman Khan
* Equal contribution. Correspondence: Tajamul Ashraf, Amal Saqib.
**[2025-11-01]:** Full evaluation and integrated tool server support for custom models are now live!
[2025-08-03]: Agent-X won Second Place in the Research Track at the Agentic AI Summit 2025, UC Berkeley.
[2025-06-02]: Agent-X paper published on arXiv.
[2025-05-29]: Released evaluation & deployment code for Agent-X.
[2025-05-22]: Published the Agent-X dataset on Hugging Face.
Current tool-use tests for vision-centric LLMs rely on single-turn, synthetic queries and text-only inputs, so they miss the real-world challenge of multi-step, multimodal reasoning. Agent-X closes this gap with 828 authentic tasks spanning images, videos, and mixed-modal instructions across six domains, from web browsing to autonomous driving. Each task demands explicit, step-by-step decisions and judicious tool use, and our evaluation scores every reasoning step as well as the overall chain. Even top models (GPT, Gemini, Qwen) solve fewer than half of these tasks, exposing major bottlenecks and pointing the way for future research.
Agent-X is a benchmark for assessing deep-reasoning and tool-use skills of vision-centric LLM agents in real-world settings. It highlights three key aspects:
- Authentic multi-step tasks. The benchmark offers 828 human-authored tasks with implicit tool use and sequential planning requirements, spanning six domains including web browsing, surveillance, autonomous driving, sports, and math reasoning.
- Real deployed tools. Agent-X supplies an evaluation platform stocked with perception, web, manipulation, math, and data-processing tools, compelling agents to choose and apply the right tool at each reasoning step.
- Diverse multimodal contexts. Every task is paired with real images, multi-image comparisons, or video clips, plus textual instructions, closely mirroring the visual complexity of real-world scenarios.
A comparison of Agent-X queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m&m's are explicitly stated, as marked in red and blue. The queries in APIBench are simple, containing only one step. Agent-X queries, in contrast, are both step-implicit and tool-implicit.
Overview of the Agent-X benchmark: key data statistics, overall frequency of tool usage, number of steps, and distribution of tasks across the six vision-centric environments.
We design the Agent-X benchmark using a semi-automated pipeline that ensures each task is solvable with a defined tool subset and requires deep reasoning over realistic, multimodal scenarios. The pipeline begins with an LMM (Large Multimodal Model) generating candidate queries based on visual input and an available toolset. These queries are then refined by human annotators for clarity and realism. Next, the refined queries are passed back to the LMM to produce step-by-step reasoning traces, including tool calls, intermediate outputs, and final answers. Each trace is manually reviewed for logical consistency and correctness.
We evaluate models on the Agent-X benchmark across three distinct modes (a toy scoring sketch follows this list):
- Step-by-Step: Assesses the agent's ability to execute individual reasoning steps, focusing on how well it follows structured tool-use sequences grounded in visual inputs.
- Deep Reasoning: Evaluates the coherence and logical consistency of the full reasoning trace. This mode emphasizes the agent's capacity to integrate visual and textual context to produce semantically meaningful and factually accurate explanations.
- Outcome: Measures the agent's overall task-solving performance by verifying the correctness of the final answer and appropriate tool usage.
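The benchmark's semantic metrics are scored by LLM judges (see below), but the difference between step-level and outcome-level scoring can be illustrated with a toy sketch. The field names (`tool`, `answer`) and the exact-match criterion are assumptions for illustration only, not the benchmark's schema or scoring code.

```python
# Toy illustration of step-level vs. outcome-level scoring for a single task.
# Field names ("tool", "answer") and exact matching are assumptions, not the
# Agent-X schema or its judge-based metrics.

def step_tool_accuracy(pred_steps, gt_steps):
    """Fraction of ground-truth steps whose tool is matched, position-wise."""
    if not gt_steps:
        return 0.0
    hits = sum(p["tool"] == g["tool"] for p, g in zip(pred_steps, gt_steps))
    return hits / len(gt_steps)

def outcome_correct(pred_answer, gt_answer):
    """Binary final-answer check via normalized string comparison."""
    return pred_answer.strip().lower() == gt_answer.strip().lower()

pred = {"steps": [{"tool": "OCR"}, {"tool": "Calculator"}], "answer": "42"}
gt = {"steps": [{"tool": "OCR"}, {"tool": "Calculator"}], "answer": "42"}
print(step_tool_accuracy(pred["steps"], gt["steps"]))  # 1.0
print(outcome_correct(pred["answer"], gt["answer"]))   # True
```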
We report results using GPT-4 and Qwen-15B as evaluation judges. For each metric, the best-performing value is shown in bold and underlined, while the second-best is italicized.
| Model | Instalign | Toolacc | Argacc | Sumacc | Faithfulness | Context-Score | Factual-Precision | Semantic-Accuracy | pf1 | of1 | lf1 | cf1 | Ansacc | Ansacc (w/ img gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-VL | 0.42 | 0.20 | 0.03 | 0.08 | 0.14 | 0.09 | 0.18 | 0.32 | 0.19 | 0.33 | 0.25 | 0.24 | 0.01 | 0.01 |
| InternVL3-8B | 0.49 | 0.28 | 0.04 | 0.12 | 0.14 | 0.09 | 0.13 | 0.41 | 0.11 | 0.08 | 0.12 | 0.12 | 0.02 | 0.02 |
| Model | Grounds | Toolp | Toolacc | Factacc | Contexts | Factp | Semacc | Goalacc | Goalacc* | Toolaccs |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-source | ||||||||||
| Phi-4-VL-Instruct | 0.13 | 0.21 | 0.24 | 0.61 | 0.19 | 0.47 | 0.40 | 0.11 | 0.26 | 0.42 |
| InternVL-2.5-8B | 0.45 | 0.31 | 0.47 | 0.68 | 0.47 | 0.52 | 0.60 | 0.28 | 0.55 | 0.58 |
| Gemma-3-4B | 0.26 | 0.30 | 0.78 | 0.61 | 0.54 | 0.38 | 0.54 | 0.27 | 0.67 | 0.60 |
| InternVL-3-8B | 0.46 | 0.34 | 0.54 | 0.68 | 0.45 | 0.70 | 0.40 | 0.20 | 0.59 | 0.62 |
| VideoLLaMA-3-7B | 0.45 | 0.28 | 0.46 | 0.65 | 0.46 | 0.62 | 0.54 | 0.28 | 0.54 | 0.54 |
| Qwen-2.5-VL-7B | 0.54 | 0.43 | 0.63 | 0.75 | 0.57 | 0.56 | 0.67 | 0.36 | 0.65 | 0.67 |
| Pixtral-12B | 0.12 | 0.20 | 0.63 | 0.45 | 0.19 | 0.26 | 0.34 | 0.07 | 0.55 | 0.54 |
| LLaMA-3.2-11B-Vision | 0.03 | 0.15 | 0.14 | 0.70 | 0.08 | 0.70 | 0.24 | 0.07 | 0.26 | 0.42 |
| Kimi-VL-A3B-Thinking | 0.26 | 0.19 | 0.5 | 0.62 | 0.42 | 0.52 | 0.65 | 0.29 | 0.29 | 0.48 |
| mPLUG-Owl3-7B-240728 | 0.10 | 0.14 | 0.30 | 0.49 | 0.25 | 0.32 | 0.37 | 0.11 | 0.26 | 0.50 |
| Closed-source | ||||||||||
| Gemini-1.5-Pro | 0.43 | 0.23 | 0.84 | 0.62 | 0.45 | 0.53 | 0.62 | 0.04 | 0.56 | 0.48 |
| Gemini-2.5-Pro | 0.40 | 0.36 | 0.81 | 0.72 | 0.48 | 0.64 | 0.73 | 0.40 | 0.56 | 0.62 |
| GPT-4o | 0.60 | 0.47 | 0.72 | 0.81 | 0.57 | 0.79 | 0.59 | 0.37 | 0.70 | 0.68 |
| OpenAI o4-mini | 0.42 | 0.32 | 0.89 | 0.71 | 0.51 | 0.60 | 0.80 | 0.45 | 0.67 | 0.63 |
| Model | Grounds | Toolp | Toolacc | Factacc | Contexts | Factp | Semacc | Goalacc | Goalacc* | Toolaccs |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-source | ||||||||||
| Phi-4-VL-Instruct | 0.27 | 0.11 | 0.32 | 0.54 | 0.39 | 0.59 | 0.46 | 0.16 | 0.35 | 0.39 |
| InternVL2.5-8B | 0.38 | 0.16 | 0.49 | 0.63 | 0.51 | 0.61 | 0.55 | 0.29 | 0.53 | 0.53 |
| Gemma-3-4B | 0.50 | 0.24 | 0.67 | 0.74 | 0.66 | 0.59 | 0.74 | 0.30 | 0.68 | 0.68 |
| InternVL3-8B | 0.41 | 0.16 | 0.51 | 0.71 | 0.61 | 0.60 | 0.69 | 0.23 | 0.51 | 0.62 |
| VideoLLaMA3-7B | 0.39 | 0.15 | 0.40 | 0.68 | 0.56 | 0.60 | 0.68 | 0.27 | 0.53 | 0.56 |
| Qwen2.5-VL-7B | 0.51 | 0.27 | 0.63 | 0.77 | 0.66 | 0.64 | 0.77 | 0.37 | 0.62 | 0.67 |
| Pixtral-12B | 0.30 | 0.17 | 0.68 | 0.59 | 0.50 | 0.42 | 0.58 | 0.10 | 0.68 | 0.58 |
| LLaMA-3.2-11B-Vision | 0.16 | 0.06 | 0.12 | 0.49 | 0.17 | 0.74 | 0.20 | 0.10 | 0.11 | 0.15 |
| Kimi-VL-A3B-Thinking | 0.47 | 0.20 | 0.59 | 0.79 | 0.64 | 0.68 | 0.74 | 0.35 | 0.60 | 0.62 |
| mPLUG-Owl3-7B-240728 | 0.30 | 0.11 | 0.31 | 0.59 | 0.48 | 0.48 | 0.56 | 0.16 | 0.45 | 0.48 |
| Closed-source | ||||||||||
| Gemini-1.5-Pro | 0.57 | 0.36 | 0.80 | 0.82 | 0.73 | 0.76 | 0.63 | 0.05 | 0.77 | 0.71 |
| Gemini-2.5-Pro | 0.63 | 0.40 | 0.84 | 0.86 | 0.76 | 0.80 | 0.83 | 0.50 | 0.74 | 0.72 |
| GPT-4o | 0.46 | 0.27 | 0.63 | 0.72 | 0.59 | 0.75 | 0.69 | 0.44 | 0.48 | 0.56 |
| OpenAI-o4-mini | 0.63 | 0.35 | 0.86 | 0.89 | 0.78 | 0.79 | 0.88 | 0.53 | 0.64 | 0.69 |
- Clone this repo.
git clone https://github.com/mbzuai-oryx/Agent-X.git
cd Agent-X
- You can get the dataset from the release files. The images are under the Files section of the Hugging Face repository; download the other files, such as dataset.json and toolmeta.json, and organize them as shown below. Put all the images under the image subfolder (a scripted download sketch follows the directory tree below).
mkdir ./opencompass/data
Put the dataset under the folder ./opencompass/data/. The structure of the files should be:
Agent-X/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── agentx_dataset
│   │   │   ├── dataset.json
│   │   │   ├── toolmeta.json
│   │   │   └── image
│   ├── ...
├── ...
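If you prefer scripting the download, a minimal sketch using huggingface_hub is shown below. The dataset repository id is a placeholder (substitute the actual Agent-X dataset repo on Hugging Face), and the inspection step assumes nothing beyond the two JSON files listed above.

```python
import json
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual Agent-X dataset repository.
local_dir = snapshot_download(
    repo_id="<org>/<agent-x-dataset>",
    repo_type="dataset",
    local_dir="./opencompass/data/agentx_dataset",
)

# Quick sanity check that the expected files are present and parse as JSON.
with open(f"{local_dir}/dataset.json") as f:
    dataset = json.load(f)
with open(f"{local_dir}/toolmeta.json") as f:
    toolmeta = json.load(f)
print(type(dataset), type(toolmeta))
```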
- Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
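Equivalently, the weights can be pulled from Python with huggingface_hub.snapshot_download; this is just a convenience sketch mirroring the CLI command above.

```python
import os
from huggingface_hub import snapshot_download

# Download Qwen1.5-7B-Chat to the same local path used by the lmdeploy command below.
snapshot_download(
    repo_id="Qwen/Qwen1.5-7B-Chat",
    local_dir=os.path.expanduser("~/models/qwen1.5-7b-chat"),
)
```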
- Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy
For CUDA 12:
pip install lmdeploy
For CUDA 11+:
export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
- Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
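Once the server is running, you can sanity-check it before wiring it into OpenCompass. The sketch below assumes the standard OpenAI-compatible routes exposed by lmdeploy's api_server and the host/port from the example command; adjust them to your deployment.

```python
import requests

base = "http://127.0.0.1:12580"

# List the models served by the OpenAI-compatible API.
print(requests.get(f"{base}/v1/models", timeout=10).json())

# Send a one-turn chat request to confirm generation works end to end.
payload = {
    "model": "qwen1.5-7b-chat",
    "messages": [{"role": "user", "content": "Reply with a one-sentence greeting."}],
}
resp = requests.post(f"{base}/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```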
- Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0
Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py, then change _supports_sdpa = False to _supports_sdpa = True at line 1279.
- Deploy tools for the Agent-X benchmark.
To use the GoogleSearch and MathOCR tools, you should first get the Serper API key from https://serper.dev, and the Mathpix API key from https://mathpix.com/. Then export these keys as environment variables.
export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_key_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_key_for_mathocr_tool'
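Optionally, you can verify the keys are visible to the process before starting the tool server; this is a small convenience check, not a required step.

```python
import os

# Check that the keys needed by the GoogleSearch and MathOCR tools are set.
required = ["SERPER_API_KEY", "MATHPIX_APP_ID", "MATHPIX_APP_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All tool API keys are set.")
```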
Start the tool server.
agentlego-server start --port 16181 --extra ./benchmark.py `cat benchmark_toollist.txt` --host 0.0.0.0
- Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
The following package versions are required:
huggingface_hub==0.25.2 (<0.26.0)
transformers==4.40.1
- Modify the config file at configs/eval_agentx_bench.py as shown below.
The IP and port in openai_api_base are the IP of your model service and the port number you specified when launching lmdeploy.
The IP and port in tool_server are the IP of your tool service and the port number you specified when launching agentlego.
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
If you infer and evaluate in step-by-step mode, you should comment out tool_server and enable tool_meta in configs/eval_agentx_bench.py, and set the infer mode and eval mode to every_with_gt in configs/datasets/agentx_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
# tool_server='http://10.140.0.138:16181',
tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
agentx_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
agentx_bench_eval_cfg = dict(evaluator=dict(type=AGENTXBenchEvaluator, mode='every_with_gt'))
If you infer and evaluate in end-to-end mode, you should comment out tool_meta and enable tool_server in configs/eval_agentx_bench.py, and set the infer mode and eval mode to every in configs/datasets/agentx_bench.py:
models = [
dict(
abbr='qwen1.5-7b-chat',
type=LagentAgent,
agent_type=ReAct,
max_turn=10,
llm=dict(
type=OpenAI,
path='qwen1.5-7b-chat',
key='EMPTY',
openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
query_per_second=1,
max_seq_len=4096,
stop='<|im_end|>',
),
tool_server='http://10.140.0.138:16181',
# tool_meta='data/agentx_dataset/toolmeta.json',
batch_size=8,
),
]
agentx_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
agentx_bench_eval_cfg = dict(evaluator=dict(type=AGENTXBenchEvaluator, mode='every'))
- Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --mode infer
# evaluate only
# srun -p llmit -q auto python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
srun -p llmit -q auto python run.py configs/eval_agentx_bench.py --max-num-workers 32 --debug --reuse 20240628_115514 --mode eval
# infer and evaluate
python run.py configs/eval_agentx_bench.py -p llmit -q auto --max-num-workers 32 --debug
See generation/README.md for details on:
- Frame extraction from video clips
- Query generation using GPT-4o
- Step-by-step reasoning trace generation
Path: generation/README.md
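As a rough illustration of the frame-extraction step listed above (the actual script lives under generation/, and the sampling stride here is an arbitrary assumption), frames can be sampled from a clip with OpenCV:

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every n-th frame of a video clip as a JPEG (illustrative sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: extract_frames("clip.mp4", "./frames", every_n=30)
```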
See analysis/README.md for:
- Error analysis notebook
- Model comparison plots
- Tool usage breakdown and visualizations
Path: analysis/README.md
See eval/ for:
- Scripted evaluation of model inference results
- Accuracy metrics, binary matching scores, and goal success analysis
- Useful for benchmarking your model outputs against Agent-X GT
Path: eval/
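For intuition about the matching-style scores mentioned above, the toy function below computes an F1 over predicted versus ground-truth tool calls. It is an illustrative sketch only, not the repository's implementation.

```python
from collections import Counter

def tool_f1(pred_tools, gt_tools):
    """Toy precision/recall/F1 over tool-name multisets (illustrative only)."""
    pred_c, gt_c = Counter(pred_tools), Counter(gt_tools)
    tp = sum((pred_c & gt_c).values())
    precision = tp / max(sum(pred_c.values()), 1)
    recall = tp / max(sum(gt_c.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(tool_f1(["OCR", "Calculator"], ["OCR", "Calculator", "ImageDescription"]), 2))  # 0.8
```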
This project builds upon the excellent open-source work of the GTA paper, as well as the teams behind OpenCompass and AgentLego.
We sincerely thank the authors and contributors of these projects for sharing their implementations and evaluation frameworks, which greatly supported the development and experimentation in this repository.
If you use Agent-X in your research, please cite the following paper:
@misc{ashraf2025agentxevaluatingdeepmultimodal,
title={Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks},
author={Tajamul Ashraf and Amal Saqib and Hanan Ghani and Muhra AlMahri and Yuhao Li and Noor Ahsan and Umair Nawaz and Jean Lahoud and Hisham Cholakkal and Mubarak Shah and Philip Torr and Fahad Shahbaz Khan and Rao Muhammad Anwer and Salman Khan},
year={2025},
eprint={2505.24876},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.24876},
}