During the development of AgentThink, we drew inspiration from ancient wisdom. As stated by Xunzi:
π "εεηιεΌδΉοΌεεδΊη©δΉγ"
"A gentleman is not inherently different from others; he excels by skillfully leveraging external tools."
This philosophy aligns with our design principles: by integrating multiple tools with reasoning, AgentThink achieves superior capabilities in complex autonomous driving scenarios.
中文 | English
Project • Paper • Repo • License
E-mails: [email protected] or [email protected] or [email protected] or [email protected]
Experience AgentThink's real-world performance through our demonstration materials that illustrate its capabilities in autonomous driving scenarios.
Watch this video to see AgentThink's environmental perception in complex traffic conditions:
Complementing the video, these visualizations demonstrate key capabilities:
| Scenario | Description | Image |
|---|---|---|
| High-Level Planning | Visualizes high-level planning | View |
| Spatial Understanding | Demonstrates spatial relationship analysis | View |
| Environment Adaptability | Shows performance in adverse weather and low-light conditions | View |
- Highlights
- Project Updates
- Quick Navigation
- Getting Started
- Quick Start
- TODO List
- Repository Structure
- Benchmark Results
- Paper Results
- License & Citation
- Tool-Augmented Reasoning: Multi-modal perception through integrated vision, prediction, occupancy, and mapping tools
- Reasoning Chain + Tool Calls: Task decomposition with explicit tool invocation
- GRPO Training: Triple reward signals (final answer, step-wise, tool usage); a reward-combination sketch follows this list
- Performance Boost: 53.91% accuracy improvement over traditional VLM models
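To make the triple reward concrete, the sketch below shows one way the three signals could be combined into a single scalar for GRPO. It is a minimal, hypothetical illustration: the weights, field names, and helper scores are assumptions, not the released training code.

```python
# Hypothetical sketch of combining the three GRPO reward signals.
# Weight values and the sample fields are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    final_answer: float = 1.0   # correctness of the final answer
    step_wise: float = 0.5      # quality of intermediate reasoning steps
    tool_use: float = 0.5       # whether the expected tools were invoked


def combined_reward(sample: dict, weights: RewardWeights = RewardWeights()) -> float:
    """Scalar reward for one rollout, fed into the GRPO advantage computation."""
    r_final = float(sample["pred_answer"] == sample["gt_answer"])
    r_step = sample.get("step_score", 0.0)   # e.g. judged step-by-step score in [0, 1]
    r_tool = sample.get("tool_score", 0.0)   # e.g. fraction of expected tool calls made
    return (weights.final_answer * r_final
            + weights.step_wise * r_step
            + weights.tool_use * r_tool)
```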
- [2025.08.20] Our paper was accepted to EMNLP 2025 Findings
- [2025.07.02] v1.1 released with demo and sample data
- [2025.05.22] Paper published on arXiv
- Web demo and Swift full training pipeline coming soon
| Section | Description | Link |
|---|---|---|
| Environment Setup | Install dependencies and set up the environment | Environment Setup |
| Data Generation | Generate AgentThink Tool-CoT data | Data Generation |
| Model Inference | Real-time inference on val set | Model Inference |
| Demo Inference | Real-time inference on test set | Demo Inference |
| Evaluation Metrics | Scoring pipeline using LLM-as-Judge | Evaluation Metrics |
| Benchmark Results | Quantitative performance comparisons | Benchmark Results |
| Component | Version | Check Command |
|---|---|---|
| OS | Ubuntu 20.04 | cat /etc/issue |
| Python | 3.10.12 | python --version |
| CUDA Toolkit | 12.4 | nvcc --version |
| GPU Driver | 535.129.03 | nvidia-smi |
| PyTorch | 2.6.0 | python -c "import torch; print(torch.__version__)" |
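For convenience, the versions above can also be checked in one go from Python (a helper snippet, not part of the repository):

```python
# Quick environment sanity check; expects the versions listed in the table above.
import platform
import sys

import torch

print("OS       :", platform.platform())
print("Python   :", sys.version.split()[0])   # expect 3.10.x
print("PyTorch  :", torch.__version__)        # expect 2.6.0
print("CUDA     :", torch.version.cuda)       # expect 12.4
print("GPU OK   :", torch.cuda.is_available())
```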
Install dependencies and prepare your environment:
# Create virtual environment
conda create -n agentthink python=3.10
conda activate agentthink
# Install dependencies
pip install -r requirements.txt
# Install ms-swift
bash scripts/env.sh
# Install drivemllm dependency
bash scripts/env_drivemllm.sh

# Clone ms-swift into third_party
cd third_party
git clone https://github.com/modelscope/ms-swift.git

Use GPT-4o-mini to generate Tool-CoT data after filling in the OpenAI api_key and base_url.
# based on the DriveLMM-o1 TRAIN.json
python scripts/tools/agentthink_data_generater_pipeline.py --split train --model_name gpt-4o-mini
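How the credentials are supplied depends on the generation script; a typical pattern with the official openai Python client looks like the following. The environment variable names and the example prompt are assumptions, so adapt them to the script's own configuration.

```python
# Minimal sketch of the OpenAI client setup the Tool-CoT generation needs;
# the environment variable names here are assumptions, adapt them to the script.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],          # your OpenAI key
    base_url=os.environ.get("OPENAI_BASE_URL"),    # optional proxy / gateway URL
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the driving scene step by step."}],
)
print(resp.choices[0].message.content)
```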
Generate combined 6-view images.
# based on the nuScenes dataset
python data_image_converter.py \
--json_file_path /path/to/your/DriveLMMo1_TRAIN/TEST.json \
--image_root_dir /path/to/your/nuscene/root/directory \
--savepath_img /path/to/your/output/directory
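For intuition, data_image_converter.py stitches the six nuScenes camera views into one image. The snippet below is a simplified sketch of that idea; the 2x3 layout, tile size, and camera order are assumptions, not the actual script.

```python
# Simplified sketch of concatenating six nuScenes camera views into one image.
# The 2x3 layout and camera order are assumptions; the actual script may differ.
from PIL import Image

CAMS = ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT",
        "CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"]


def concat_six_views(paths: dict, tile_w: int = 800, tile_h: int = 450) -> Image.Image:
    """paths maps camera name -> image file path; returns a 3x2 tiled canvas."""
    canvas = Image.new("RGB", (3 * tile_w, 2 * tile_h))
    for i, cam in enumerate(CAMS):
        img = Image.open(paths[cam]).resize((tile_w, tile_h))
        canvas.paste(img, ((i % 3) * tile_w, (i // 3) * tile_h))
    return canvas
```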
Use your trained AgentThink checkpoint to run inference on the val samples (AgentThink-CoT-val):

# Inference script
bash scripts/inference_scripts/inference.sh [your_CKPT_PATH] [your_OUTPUT_DIR]
# Inference with tool script
bash scripts/inference_scripts/inference_withtool.sh [your_CKPT_PATH] [your_OUTPUT_DIR]
# Inference using multi-node GPUs
bash scripts/inference_scripts/inference_multigpu.sh [your_CKPT_PATH] [your_OUTPUT_DIR]
# Inference AgentThink
bash scripts/inference_agentthink.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

Use LLM-as-Judge to calculate performance metrics:
# Evaluate reasoning ability and MCQ accuracy
python evaluation/evaluation_script.py
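At its core, LLM-as-Judge scoring sends the model's prediction and a rubric to a judge model. The sketch below illustrates the pattern only; the judge model, prompt wording, and output format are assumptions, not the exact prompt used in evaluation/evaluation_script.py.

```python
# Illustrative LLM-as-Judge call; the judge model and rubric wording are
# assumptions, not the prompt shipped in evaluation/evaluation_script.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
RUBRIC = ("Score the answer from 0-100 on: risk assessment, rule adherence, "
          "scene awareness, relevance, and missing details. Return JSON.")


def judge(question: str, reference: str, prediction: str) -> str:
    """Ask the judge model to grade one prediction against the reference."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Reference: {reference}\n"
                                        f"Prediction: {prediction}"},
        ],
    )
    return resp.choices[0].message.content
```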
See the Results section or the poster for AgentThink's SOTA performance.
Our AgentThink model is based on Qwen2.5-VL-7B.
Clone Depth-Anything-V2 (DAM): https://github.com/DepthAnything/Depth-Anything-V2
git clone https://github.com/DepthAnything/Depth-Anything-V2
Clone YOLO-World: https://github.com/AILab-CVC/YOLO-World
git clone https://github.com/AILab-CVC/YOLO-World
Then download the pretrained models for YOLO-World and Depth-Anything-V2.
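Once the checkpoints are placed under pretrained_model/, the two perception tools can be loaded roughly as follows. This is a sketch based on the upstream repositories' documented usage; the file paths, the ViT-B configuration, and the class list are assumptions.

```python
# Rough sketch of loading the two perception tools, following the upstream
# repositories' documented usage; paths and class names are assumptions.
import cv2
import torch
from ultralytics import YOLOWorld                      # YOLO-World (open-vocabulary detection)
from depth_anything_v2.dpt import DepthAnythingV2      # Depth-Anything-V2 (cloned repo on PYTHONPATH)

# Open-vocabulary detector
detector = YOLOWorld("pretrained_model/yolov8x-world2.pt")
detector.set_classes(["car", "truck", "pedestrian", "traffic light"])

# Monocular depth estimator (ViT-B configuration from the upstream README)
depth_model = DepthAnythingV2(encoder="vitb", features=128,
                              out_channels=[96, 192, 384, 768])
depth_model.load_state_dict(
    torch.load("pretrained_model/depth_anything_v2_vitb.pth", map_location="cpu"))
depth_model.eval()

img = cv2.imread("demo_image/nuscenes_CAM_FRONT_3590.webp")
detections = depth = None
detections = detector.predict(img)          # open-vocabulary detections
depth = depth_model.infer_image(img)        # HxW depth map
```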
Download the val.pkl file from https://github.com/USC-GVL/Agent-Driver.
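A quick way to verify that the downloaded file loads (the target path is an assumption; no particular schema is relied on):

```python
# Quick check that val.pkl loads; no assumptions about its internal schema.
import pickle

with open("data/tool_results/val.pkl", "rb") as f:   # assumed location
    val = pickle.load(f)
print(type(val).__name__, len(val) if hasattr(val, "__len__") else "")
```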
Folder structure:
AgentThink/
├── data/                               # Dataset and processed data
│   ├── DriveLMMo1_TEST_tool_results.jsonl
│   ├── DriveLMMo1_TEST.jsonl
│   ├── image2concat/                   # Concatenated image files
│   └── tool_results/                   # Results from tool processing
│
├── demo_image/                         # Demonstration images
│   ├── nuscenes_CAM_FRONT_3590.webp
│   ├── nuscenes_CAM_FRONT_3757.webp
│   └── nuscenes_CAM_FRONT_3896.webp
│
├── pretrained_model/                   # Pre-trained model files
│   ├── AgentThink/
│   │   └── checkpoint-700-merged
│   ├── depth_anything_v2_vitb.pth
│   └── yolov8x-world2.pt
│
├── assets/                             # Visual assets and resources
├── evaluation/                         # Evaluation scripts and benchmarks
├── Inference/                          # Inference-related scripts and data
├── results/                            # Output and result files
├── scripts/                            # Various utility scripts
├── third_party/                        # Third-party libraries and resources
├── README.cn.md                        # Chinese documentation
├── README.md                           # Project documentation
├── requirements.txt                    # Python dependencies
└── ...                                 # Other project files
# drivemllm
python Inference/inference_demo_drivemllm.py
# drivelmm-o1
python Inference/inference_demo_drivelmm.py
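The demo scripts read their inputs from the JSON files under Inference/; a quick way to peek at the bundled samples before running them (no particular field layout is assumed):

```python
# Peek at the bundled demo inputs; makes no assumptions about field names.
import json

with open("Inference/inference_demo_data_drivemllm.json") as f:
    data = json.load(f)

first = data[0] if isinstance(data, list) else next(iter(data.values()))
print(type(data).__name__, "with", len(data), "entries")
print(json.dumps(first, indent=2)[:500])   # preview of one record
```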
| Status | Task Description |
|---|---|
| ✅ | AgentThink demo implementation |
| ✅ | General reasoning evaluation metrics |
| 🔄 | Tool-specific evaluation metrics |
| 🔄 | Data preprocessing pipeline |
| ✅ | Debug example implementation |
| 🔄 | Multi-stage training framework |
| 🔄 | Tool function interaction environment |
| Vision Language Models | Risk Assess. (%)↑ | Rule Adh. (%)↑ | Scene Aware. (%)↑ | Relevance (%)↑ | Missing (%)↑ | Reason. (%)↑ | MCQ (%)↑ |
|---|---|---|---|---|---|---|---|
| GPT-4o | 71.32 | 80.72 | 72.96 | 76.65 | 71.43 | 72.52 | 57.84 |
| Ovis1.5-Gemma2-9B | 51.34 | 66.36 | 54.74 | 55.72 | 55.74 | 55.62 | 48.85 |
| Mulberry-7B | 51.89 | 63.66 | 56.68 | 57.27 | 57.45 | 57.65 | 52.86 |
| LLaVA-CoT | 57.62 | 69.01 | 60.84 | 62.72 | 60.67 | 61.41 | 49.27 |
| LlamaV-o1 | 60.20 | 73.52 | 62.67 | 64.66 | 63.41 | 63.13 | 50.02 |
| InternVL2.5-8B | 69.02 | 78.43 | 71.52 | 75.80 | 70.54 | 71.62 | 54.87 |
| Qwen2.5-VL-7B | 46.44 | 60.45 | 51.02 | 50.15 | 52.19 | 51.77 | 37.81 |
| DriveLMM-o1 | 73.01 | 81.56 | 75.39 | 79.42 | 74.49 | 75.24 | 62.36 |
| AgentThink (Ours) | 80.51 | 84.98 | 82.11 | 84.99 | 79.56 | 79.68 | 71.35 |
| Type | Model | L/R | F/B | RHD | RD | PPos | BBox | CVD | CD | AccS | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | GPT-4o | 91.72 | 67.60 | 9.58 | 14.69 | 40.90 | 4.07 | 46.11 | 70.65 | 43.16 | 25.63 |
| | GPT-4o-mini | 67.67 | 50.13 | 70.44 | 0.00 | 29.28 | 3.78 | 0.00 | 46.40 | 33.46 | 16.68 |
| | LLaVA-ov-72B | 85.42 | 49.48 | 13.76 | 45.27 | 16.46 | 0.00 | 42.97 | 27.09 | 35.06 | 21.10 |
| | Qwen2.5-VL-7B | 76.55 | 55.24 | 7.14 | 17.11 | 55.97 | 38.31 | 55.94 | 51.52 | 44.72 | 13.36 |
| | Qwen + CoT | 87.06 | 63.09 | 16.69 | 22.56 | 52.51 | 38.87 | 76.90 | 38.71 | 49.55 | 19.31 |
| | Qwen + DirectTool | 78.95 | 48.96 | 58.43 | 67.57 | 58.20 | 42.22 | 51.76 | 51.38 | 57.18 | 24.05 |
| | AgentThink (Ours) | 82.33 | 54.40 | 56.14 | 61.45 | 70.45 | 56.23 | 23.09 | 51.60 | 56.96 | 26.52 |
| One-shot | GPT-4o | 91.08 | 69.37 | 36.51 | 71.17 | 42.44 | 5.10 | 0.00 | 63.88 | 47.44 | 33.17 |
| | GPT-4o-mini | 66.00 | 48.95 | 83.02 | 58.47 | 25.71 | 3.97 | 52.73 | 55.23 | 49.26 | 22.13 |
| | LLaVA-ov-72B | 79.12 | 62.97 | 49.26 | 68.04 | 28.57 | 2.20 | 53.12 | 60.90 | 50.52 | 36.66 |
| | Qwen2.5-VL-7B | 80.30 | 53.14 | 36.96 | 39.13 | 62.69 | 22.63 | 49.88 | 48.32 | 49.13 | 33.53 |
| | Qwen + CoT | 86.35 | 59.95 | 43.29 | 31.81 | 53.64 | 26.93 | 51.02 | 42.30 | 49.41 | 32.06 |
| | Qwen + DirectTool | 84.57 | 55.50 | 67.32 | 59.54 | 85.58 | 26.07 | 52.34 | 53.25 | 60.52 | 42.27 |
| | AgentThink (Ours) | 78.71 | 48.46 | 60.64 | 60.71 | 72.36 | 64.46 | 52.26 | 52.04 | 61.21 | 47.24 |
AgentThink/
├── assets/                             # Visual assets and resources
├── data/                               # Data files and datasets
├── evaluation/                         # Evaluation scripts and benchmarks
│   ├── evaluation_script.py
│   └── inference_agentthink.py
├── Inference/                          # Inference-related scripts and data
│   ├── inference_demo_data_drivemllm.json
│   ├── inference_demo_data_drivelmm.json
│   └── inference_demo_drivemllm.py
├── results/                            # Output and result files
│   └── agentthink/
├── scripts/                            # Various utility scripts
│   ├── debug_scripts/
│   ├── inference_scripts/
│   └── tools/                          # Tool library implementations
├── third_party/                        # Third-party libraries and resources
│   ├── inference.py                    # Main inference script
│   ├── prepare_data.py                 # Data preparation script
│   ├── utlis.py                        # Utility functions
│   ├── env.sh                          # Environment setup script
│   ├── env_drivemllm.sh                # DriveMLLM environment script
│   └── prepare_json_data.sh            # Long JSON data preparation script
├── README.md                           # Project documentation
├── README_CN.md                        # Chinese documentation
└── requirements.txt                    # Python dependencies
| Name | Description | Link |
|---|---|---|
| Depth-Anything-V2 | High-quality monocular depth estimation | GitHub |
| YOLO-World | Open-vocabulary object detection | GitHub |
| all-MiniLM | Sentence embeddings for language semantic similarity | HuggingFace |
| AgentDriver | Provides the tool results | GitHub |
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Please cite our work if you use AgentThink in your research:
@article{qian2025agentthink,
title={Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving},
author={Qian, Kangan and Jiang, Sicong and Zhong, Yang and Luo, Ziang and Huang, Zilin and Zhu, Tianze and Jiang, Kun and Yang, Mengmeng and Fu, Zheng and Miao, Jinyu and others},
journal={arXiv preprint arXiv:2505.15298},
year={2025}
}
Authors:
Kangan Qian*, Sicong Jiang*, Yang Zhong*, Ziang Luo*, Zilin Huang, Tianze Zhu, Kun Jiang†, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye†, Lijun Sun, Diange Yang†
Affiliations:
1 Tsinghua University
2 McGill University
3 Xiaomi Corporation
4 University of Wisconsin-Madison

* Equal contribution, † Corresponding author