

CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification
NeurIPS 2025

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author

arXiv project page GitHub star chart

🔥CogVLA is accepted to NeurIPS 2025!🔥
⭐ Give us a star if you like it! ⭐
✨If you find this work useful for your research, please kindly cite our paper.✨

🔥 Updates

  • [09/2025] 🔥 Code released. Enjoy it!
  • [09/2025] 🔥 CogVLA is accepted to NeurIPS 2025!
  • [08/2025] 🔥 Project page released.
  • [08/2025] 🔥 arXiv paper released.

Introduction

This is the GitHub repository for CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification. CogVLA draws inspiration from human multimodal coordination and introduces a three-stage progressive architecture.

Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5× and decreasing inference latency by 2.8× compared to OpenVLA.

The overall framework of CogVLA is illustrated below.

Installation

# Create and activate conda environment
conda create -n cogvla python=3.10 -y
conda activate cogvla

# Clone CogVLA repo and pip install to download dependencies
git clone git@github.com:JiuTian-VL/CogVLA.git
cd CogVLA
pip install -e .

# Install Flash Attention 2 for training
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation

Training and Evaluation

See LIBERO.md for fine-tuning and evaluating on the LIBERO simulation benchmark task suites.

See ALOHA.md for fine-tuning and evaluating on real-world ALOHA robot tasks.

Demos

After training, fill in your checkpoint path in demo.py, then run the following command:

CUDA_VISIBLE_DEVICES=0 python demo.py
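
For reference, a demo script of this kind typically loads the fine-tuned checkpoint and predicts an action from a camera image plus a language instruction. The sketch below is only an illustration: it assumes an OpenVLA-style interface (AutoProcessor plus a predict_action method), and the checkpoint path, image file, and prompt are hypothetical placeholders; the actual demo.py may differ.

# illustrative inference sketch (OpenVLA-style interface assumed; see demo.py for the real entry point)
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

checkpoint = "/path/to/your/cogvla-checkpoint"  # hypothetical placeholder
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")  # current camera observation (placeholder file)
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# depending on the checkpoint, extra keyword arguments (e.g., an action un-normalization key) may be required
action = model.predict_action(**inputs, do_sample=False)
print(action)  # predicted action vector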

Experiments

Performance. CogVLA achieves state-of-the-art results, with success rates of 97.4% on the LIBERO simulation benchmark and 70.0% on real-world tasks.

Efficiency. CogVLA also reduces training costs by 2.5× and decreases inference latency by 2.8× compared to OpenVLA.

Visualization

The attention maps of CogVLA highlight task-relevant regions in the input image, aligning well with human cognition during task execution.
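
To produce this kind of overlay yourself, one generic recipe (a sketch using only NumPy, Pillow, and Matplotlib; it is not the repo's own visualization script, and the input files are placeholders) is to normalize the attention map, upsample it to the image resolution, and alpha-blend it over the frame as a heatmap:

# generic attention-overlay sketch (standard libraries only; not the repo's own script)
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def overlay_attention(image: np.ndarray, attn: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a low-resolution attention map (h, w) over an RGB image (H, W, 3)."""
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # normalize to [0, 1]
    attn = np.asarray(Image.fromarray((attn * 255).astype(np.uint8))
                      .resize((image.shape[1], image.shape[0]))) / 255.0  # upsample to image size
    heat = plt.cm.jet(attn)[..., :3]  # colorize as an RGB heatmap
    return (1 - alpha) * image / 255.0 + alpha * heat  # alpha-blend in [0, 1]

image = np.asarray(Image.open("observation.png").convert("RGB"))  # placeholder frame
attn = np.random.rand(16, 16)  # stand-in for a real attention map
plt.imshow(overlay_attention(image, attn))
plt.axis("off")
plt.savefig("attention_overlay.png", bbox_inches="tight")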

GaLaXea.R1.Lite.Robot.Folding.Clothes.Demo.mp4
GaLaXea.R1.Lite.Robot.Open.Drawer.and.Place.Toy.Demo.mp4

🔥 Citation

If you find this work useful for your research, please kindly cite our paper.

@article{li2025cogvla,
  title={CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing \& Sparsification},
  author={Li, Wei and Zhang, Renshan and Shao, Rui and He, Jie and Nie, Liqiang},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}

Star History Chart
