[EMNLP'25] πŸš— AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

AgentThink Logo

During the development of AgentThink, we drew inspiration from ancient wisdom. As stated by Xunzi:

πŸ“œ "ε›ε­η”ŸιžεΌ‚δΉŸοΌŒε–„ε‡δΊŽη‰©δΉŸγ€‚"

"A gentleman is not inherently different from others; he excels by skillfully leveraging external tools."

This philosophy aligns with our design principles: by integrating external tools into its reasoning process, AgentThink achieves superior capabilities in complex autonomous driving scenarios.


δΈ­ζ–‡ | English

🌐 Project β€’ πŸ“„ Paper β€’ πŸ”– Repo β€’ πŸͺͺ License

πŸͺš Model | πŸ”’ Data

E-mails: [email protected] or [email protected] or [email protected] or [email protected]

🎬 Demo Showcase

Experience AgentThink's real-world performance through our demonstration materials that illustrate its capabilities in autonomous driving scenarios.

Video Demonstration

Watch this video to see AgentThink's environmental perception in complex traffic conditions:

Demo GIF

Visualization Gallery

Daytime Planning Nighttime Planning

AgentThink zero-shot learning

Complementing the video, these visualizations demonstrate key capabilities:

Scenario Description Image
High-level planning Visualizes high-level planning View
Spatial Understanding Demonstrates spatial relationship analysis View
Environment Adaptability Shows performance in adverse weather or low-light View

✨ Highlights

  • πŸ”§ Tool-Augmented Reasoning: Multi-modal perception through integrated vision, prediction, occupancy, and mapping tools
  • 🧠 Reasoning Chain + Tool Calls: Task decomposition with explicit tool invocation
  • 🎯 GRPO Training: Triple reward signals for the final answer, step-wise reasoning, and tool usage (see the sketch after this list)
  • πŸš€ Performance Boost: 53.91% accuracy improvement over traditional VLM models
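
To make the triple-reward idea concrete, below is a minimal sketch of how the three signals could be combined and turned into group-relative advantages. The weights and helper names (Rewards, combined_reward, group_relative_advantages) are illustrative assumptions, not the released training code.

# Minimal sketch of the triple-reward idea behind GRPO training.
# Weights and helper names are illustrative assumptions, not the released code.
from dataclasses import dataclass

@dataclass
class Rewards:
    final_answer: float  # 1.0 if the final answer matches the ground truth, else 0.0
    step_wise: float     # fraction of reasoning steps judged consistent
    tool_usage: float    # fraction of tool calls that are valid and well-formed

def combined_reward(r: Rewards, w=(0.5, 0.25, 0.25)) -> float:
    # Weighted sum of the three signals; the weights are hypothetical.
    return w[0] * r.final_answer + w[1] * r.step_wise + w[2] * r.tool_usage

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes rewards within a group of responses sampled for the same prompt.
    mean = sum(rewards) / len(rewards)
    std = max((sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5, 1e-8)
    return [(x - mean) / std for x in rewards]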

πŸ“° Project Updates

  • πŸŽ‰ [2025.08.20] Our paper was accepted to EMNLP 2025 Findings
  • πŸš€ [2025.07.02] v1.1 released with demo and sample data
  • πŸ“„ [2025.05.22] Paper published on arXiv
  • πŸŽ₯ Web Demo and Swift full training pipeline coming soon

πŸš€ Quick Navigation

Section Description Link
Environment Setup Install dependencies and set up the environment Environment Setup
Data Generation Generate AgentThink Tool-CoT data Data Generation
Model Inference Real-time inference on the validation set Model Inference
Demo Inference Real-time inference on the test set Demo Inference
Evaluation Metrics Scoring pipeline using LLM-as-Judge Evaluation Metrics
Benchmark Results Quantitative performance comparisons Benchmark Results

Environment Setup

πŸ› οΈ Basic

Component Version Check command
OS Ubuntu 20.04 cat /etc/issue
Python 3.10.12 python --version
CUDA Toolkit 12.4 nvcc --version
GPU Driver 535.129.03 nvidia-smi
PyTorch 2.6.0 python -c "import torch; print(torch.__version__)"
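
To compare an existing environment against the table above in one step, a quick check like the following can be used (a minimal sketch; it only prints versions and does not enforce them):

# Minimal environment sanity check against the versions listed above
import platform
import torch

print("Python :", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda, "| GPU available:", torch.cuda.is_available())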

Basic Setup

Install dependencies and prepare your environment:

# Create virtual environment
conda create -n agentthink python=3.10
conda activate agentthink

# Install dependencies
pip install -r requirements.txt

# Install ms-swift
bash scripts/env.sh

# Install drivemllm dependency
bash scripts/env_drivemllm.sh

Clone ms-swift

cd third_party
git clone https://github.com/modelscope/ms-swift.git

Data Generation

Use GPT-4o-mini to generate Tool-CoT data after filling in your OpenAI api_key and base_url.

# based on the DriveLMM-o1 TRAIN.json
python scripts/tools/agentthink_data_generater_pipeline.py --split train --model_name gpt-4o-mini
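
For reference, the generation step boils down to chat-completion calls like the one below. This is a minimal sketch assuming the official openai Python client; the prompt and message content are illustrative, not the exact prompts used by the script.

# Minimal sketch of a Tool-CoT generation call (illustrative prompt; the real
# prompts and output schema live in scripts/tools/agentthink_data_generater_pipeline.py)
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="YOUR_BASE_URL")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Annotate the driving QA sample with tool calls "
                                      "and step-by-step reasoning."},
        {"role": "user", "content": "Question: Is it safe for the ego vehicle to change lanes to the left?"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)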

Generate the combined six-view images.

# based on the nuScenes dataset
python data_image_converter.py \
    --json_file_path /path/to/your/DriveLMMo1_TRAIN/TEST.json \
    --image_root_dir /path/to/your/nuscene/root/directory \
    --savepath_img /path/to/your/output/directory
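
The converter stitches the six nuScenes camera views into a single image. A minimal sketch of that idea with PIL is shown below; the tile order and output size are assumptions, so refer to data_image_converter.py for the exact layout.

# Minimal sketch of combining six nuScenes camera views into one 2x3 grid image.
# Tile order and size are illustrative; see data_image_converter.py for the real layout.
from PIL import Image

CAMS = ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT",
        "CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"]

def concat_six_views(paths, tile_w=800, tile_h=450):
    # paths: dict mapping camera name -> image file path
    canvas = Image.new("RGB", (3 * tile_w, 2 * tile_h))
    for i, cam in enumerate(CAMS):
        img = Image.open(paths[cam]).convert("RGB").resize((tile_w, tile_h))
        canvas.paste(img, ((i % 3) * tile_w, (i // 3) * tile_h))
    return canvas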

Model Inference

🎬 Use your trained AgentThink checkpoint to run inference on the AgentThink-CoT-val samples:

# Inference script
bash scripts/inference_scripts/inference.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Inference with tool script
bash scripts/inference_scripts/inference_withtool.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Inference using multi-node GPUs
bash scripts/inference_scripts/inference_multigpu.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Inference AgentThink
bash scripts/inference_agentthink.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

Evaluation Metrics

πŸ“Š Use LLM-as-Judge to calculate performance metrics:

# Evaluate reasoning ability and MCQ accuracy
python evaluation/evaluation_script.py
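
As a rough picture of what the judge does, the sketch below scores one prediction against its reference with a single chat-completion call. The judge model, rubric wording, and metric names here are assumptions; the authoritative rubric is in evaluation/evaluation_script.py.

# Minimal sketch of an LLM-as-Judge call; the judge model and rubric wording
# below are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, reference: str, prediction: str) -> str:
    prompt = (
        "Score the predicted reasoning against the reference from 0 to 100 for "
        "risk assessment, rule adherence, scene awareness, and relevance.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content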

Benchmark Results

πŸ†See the Results section or poster for AgentThink’s SOTA performance.

πŸš€ Quick Start

Download the model

Our AgentThink model is based on Qwen2.5-VL-7B.
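
The documented way to run the model is through the inference scripts above; purely as an illustration, a merged checkpoint of this kind can also be loaded directly with transformers. This is a minimal sketch that assumes a transformers release with Qwen2.5-VL support and the checkpoint path shown in the folder structure below.

# Minimal sketch of loading the merged checkpoint with transformers
# (assumes a transformers release that ships Qwen2.5-VL support).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

ckpt = "pretrained_model/AgentThink/checkpoint-700-merged"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(ckpt)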

Download the tool model

Clone Depth-Anything-V2 (DAM): https://github.com/DepthAnything/Depth-Anything-V2

git clone https://github.com/DepthAnything/Depth-Anything-V2

Clone YOLO-World: https://github.com/AILab-CVC/YOLO-World

git clone https://github.com/AILab-CVC/YOLO-World

Then download the pretrained weights for YOLO-World and Depth-Anything-V2.
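
To verify that the detection tool weights load correctly, a quick smoke test like the following can be run. This is a minimal sketch assuming the ultralytics package (which ships YOLO-World support); the prompt classes are illustrative, and the file paths follow the folder structure below.

# Minimal smoke test for the YOLO-World tool weights
# (assumes the ultralytics package; the prompt classes are illustrative).
from ultralytics import YOLOWorld

detector = YOLOWorld("pretrained_model/yolov8x-world2.pt")
detector.set_classes(["car", "truck", "pedestrian", "traffic light"])
results = detector.predict("demo_image/nuscenes_CAM_FRONT_3590.webp")
print(results[0].boxes.xyxy)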

Download the basic tool results

Download the val.pkl file from https://github.com/USC-GVL/Agent-Driver
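
Once downloaded, the file can be placed under data/tool_results/ (an assumption based on the folder structure below) and inspected like this:

# Quick check that the Agent-Driver tool results load; the key structure
# depends on the Agent-Driver release.
import pickle

with open("data/tool_results/val.pkl", "rb") as f:
    tool_results = pickle.load(f)
print(type(tool_results), len(tool_results))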

Folder structure:

AgentThink/
β”œβ”€β”€ πŸ“‚ data/                    # Dataset and processed data
β”‚   β”œβ”€β”€ DriveLMMo1_TEST_tool_results.jsonl
β”‚   β”œβ”€β”€ DriveLMMo1_TEST.jsonl
β”‚   β”œβ”€β”€ πŸ“‚ image2concat/        # Concatenated image files
β”‚   └── πŸ“‚ tool_results/        # Results from tool processing
β”‚
β”œβ”€β”€ πŸ“‚ demo_image/              # Demonstration images
β”‚   β”œβ”€β”€ nuscenes_CAM_FRONT_3590.webp
β”‚   β”œβ”€β”€ nuscenes_CAM_FRONT_3757.webp
β”‚   └── nuscenes_CAM_FRONT_3896.webp
β”‚
β”œβ”€β”€ πŸ“‚ pretrained_model/        # Pre-trained model files
β”‚   β”œβ”€β”€ πŸ“‚ AgentThink/
β”‚   β”‚   └── checkpoint-700-merged
β”‚   β”œβ”€β”€ depth_anything_v2_vitb.pth
β”‚   └── yolov8x-world2.pt
β”‚
β”œβ”€β”€ πŸ“‚ assets/                  # Visual assets and resources
β”œβ”€β”€ πŸ“‚ evaluation/              # Evaluation scripts and benchmarks
β”œβ”€β”€ πŸ“‚ Inference/               # Inference-related scripts and data
β”œβ”€β”€ πŸ“‚ results/                 # Output and result files
β”œβ”€β”€ πŸ“‚ scripts/                 # Various utility scripts
β”œβ”€β”€ πŸ“‚ third_party/             # Third-party libraries and resources
β”œβ”€β”€ README.cn.md                # Chinese documentation
β”œβ”€β”€ README.md                   # Project documentation
β”œβ”€β”€ requirements.txt            # Python dependencies
└── ...                         # Other project files

Demo Inference

# drivemllm
python Inference/inference_demo_drivemllm.py

# drivelmm-o1
python Inference/inference_demo_drivelmm.py

πŸ“‹ TODO List

πŸ”§ Development Roadmap

Status Task
βœ… AgentThink demo implementation
βœ… General reasoning evaluation metrics
πŸ”œ Tool-specific evaluation metrics
πŸ”œ Data preprocessing pipeline
βœ… Debug example implementation
πŸ”œ Multi-stage training framework
πŸ”œ Tool function interaction environment

AgentThink Results

πŸ“Š DriveLMM-o1 Performance

Vision Language Models Risk Assess. (%)↑ Rule Adh. (%)↑ Scene Aware. (%)↑ Relevance (%)↑ Missing (%)↑ Reason. (%)↑ MCQ (%)↑
GPT-4o 71.32 80.72 72.96 76.65 71.43 72.52 57.84
Ovis1.5-Gemma2-9B 51.34 66.36 54.74 55.72 55.74 55.62 48.85
Mulberry-7B 51.89 63.66 56.68 57.27 57.45 57.65 52.86
LLaVA-CoT 57.62 69.01 60.84 62.72 60.67 61.41 49.27
LlamaV-o1 60.20 73.52 62.67 64.66 63.41 63.13 50.02
InternVL2.5-8B 69.02 78.43 71.52 75.80 70.54 71.62 54.87
Qwen2.5-VL-7B 46.44 60.45 51.02 50.15 52.19 51.77 37.81
DriveLMM-o1 73.01 81.56 75.39 79.42 74.49 75.24 62.36
AgentThink (Ours) 80.51 84.98 82.11 84.99 79.56 79.68 71.35

πŸ“Š DriveMLLM Comparison

Type Model L/R F/B RHD RD PPos BBox CVD CD AccS Overall
Zero-shot GPT-4o 91.72 67.60 9.58 14.69 40.90 4.07 46.11 70.65 43.16 25.63
GPT-4o-mini 67.67 50.13 70.44 0.00 29.28 3.78 0.00 46.40 33.46 16.68
LLaVA-ov-72B 85.42 49.48 13.76 45.27 16.46 0.00 42.97 27.09 35.06 21.10
Qwen2.5-VL-7B 76.55 55.24 7.14 17.11 55.97 38.31 55.94 51.52 44.72 13.36
Qwen + CoT 87.06 63.09 16.69 22.56 52.51 38.87 76.90 38.71 49.55 19.31
Qwen + DirectTool 78.95 48.96 58.43 67.57 58.20 42.22 51.76 51.38 57.18 24.05
AgentThink (Ours) 82.33 54.40 56.14 61.45 70.45 56.23 23.09 51.60 56.96 26.52
One-shot GPT-4o 91.08 69.37 36.51 71.17 42.44 5.10 0.00 63.88 47.44 33.17
GPT-4o-mini 66.00 48.95 83.02 58.47 25.71 3.97 52.73 55.23 49.26 22.13
LLaVA-ov-72B 79.12 62.97 49.26 68.04 28.57 2.20 53.12 60.90 50.52 36.66
Qwen2.5-VL-7B 80.30 53.14 36.96 39.13 62.69 22.63 49.88 48.32 49.13 33.53
Qwen + CoT 86.35 59.95 43.29 31.81 53.64 26.93 51.02 42.30 49.41 32.06
Qwen + DirectTool 84.57 55.50 67.32 59.54 85.58 26.07 52.34 53.25 60.52 42.27
AgentThink (Ours) 78.71 48.46 60.64 60.71 72.36 64.46 52.26 52.04 61.21 47.24

πŸ“ Repository Structure

AgentThink/
β”œβ”€β”€ assets/                 # Visual assets and resources
β”œβ”€β”€ data/                   # Data files and datasets
β”œβ”€β”€ evaluation/             # Evaluation scripts and benchmarks
β”‚   β”œβ”€β”€ evaluation_script.py
β”‚   └── inference_agentthink.py
β”œβ”€β”€ Inference/              # Inference-related scripts and data
β”‚   β”œβ”€β”€ inference_demo_data_drivemllm.json
β”‚   β”œβ”€β”€ inference_demo_data_drivelmm.json
β”‚   └── inference_demo_drivemllm.py
β”œβ”€β”€ results/                # Output and result files
β”‚   └── agentthink/
β”œβ”€β”€ scripts/                # Various utility scripts
β”‚   β”œβ”€β”€ debug_scripts/
β”‚   β”œβ”€β”€ inference_scripts/
β”‚   └── tools/              # Tool library implementations
β”œβ”€β”€ third_party/            # Third-party libraries and resources
β”‚   β”œβ”€β”€ 🐍 inference.py         # Main inference script
β”‚   β”œβ”€β”€ 🐍 prepare_data.py      # Data preparation script
β”‚   β”œβ”€β”€ 🐍 utlis.py             # Utility functions
β”‚   β”œβ”€β”€ 🐚 env.sh               # Environment setup script
β”‚   β”œβ”€β”€ 🐚 env_drivemllm.sh     # DriveMLLM environment script
β”‚   └── 🐚 prepare_json_data.sh # Long JSON data preparation script
β”œβ”€β”€ πŸ“„ README.md            # Project documentation
β”œβ”€β”€ πŸ“„ README_CN.md         # δΈ­ζ–‡ζ–‡ζ‘£
β”œβ”€β”€ πŸ“„ requirements.txt     # Python dependencies

πŸ”— Related Works

Name Description Link
Depth-Anything-V2 High-quality monocular depth estimation GitHub
YOLO-World Open-vocabulary object detection GitHub
all-MiniLM Sentence embeddings for semantic similarity HuggingFace
AgentDriver Provides the precomputed tool results GitHub

πŸͺͺ License & Citation

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

Please cite our work if you use AgentThink in your research:

@article{qian2025agentthink,
  title={AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving},
  author={Qian, Kangan and Jiang, Sicong and Zhong, Yang and Luo, Ziang and Huang, Zilin and Zhu, Tianze and Jiang, Kun and Yang, Mengmeng and Fu, Zheng and Miao, Jinyu and others},
  journal={arXiv preprint arXiv:2505.15298},
  year={2025}
}

Authors:

Kangan Qian*, Sicong Jiang*, Yang Zhong*, Ziang Luo*, Zilin Huang, Tianze Zhu, Kun Jiang†, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye†, Lijun Sun, Diange Yang†

Affiliations:

1 Tsinghua University
2 McGill University
3 Xiaomi Corporation
4 University of Wisconsin-Madison

* Equal contribution, † Corresponding author
