CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification
NeurIPS 2025
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author
- [09/2025] 🔥 Code released. Enjoy it!
- [09/2025] 🔥 CogVLA is accepted to NeurIPS 2025!
- [08/2025] 🔥 Project page released.
- [08/2025] 🔥 arXiv paper released.
This is the GitHub repository for CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification. CogVLA draws inspiration from human multimodal coordination and introduces a three-stage progressive architecture.
Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5× and decreasing inference latency by 2.8× compared to OpenVLA.
The overall framework of CogVLA is illustrated below.
# Create and activate conda environment
conda create -n cogvla python=3.10 -y
conda activate cogvla
# Clone CogVLA repo and pip install to download dependencies
git clone [email protected]:JiuTian-VL/CogVLA.git
cd CogVLA
pip install -e .
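# (Optional; not part of the original instructions) Sanity-check that PyTorch sees a CUDA GPU,
# since training and Flash Attention 2 require a CUDA-capable device
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"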
# Install Flash Attention 2 for training
pip install packaging ninja
ninja --version; echo $? # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolationSee LIBERO.md for fine-tuning/evaluating on LIBERO simulation benchmark task suites.
See LIBERO.md for fine-tuning/evaluating on the LIBERO simulation benchmark task suites.
See ALOHA.md for fine-tuning/evaluating on real-world ALOHA robot tasks.
After training, fill in your checkpoint path in demo.py, then run the following command:
CUDA_VISIBLE_DEVICES=0 python demo.py
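For reference, filling in the checkpoint path usually means pointing the model-loading call in demo.py at your trained weights. The snippet below is only a minimal, hypothetical sketch that assumes an OpenVLA-style Hugging Face loading interface (AutoProcessor / AutoModelForVision2Seq with trust_remote_code=True); the actual code in demo.py may differ.

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Hypothetical path: replace with your own trained CogVLA checkpoint.
checkpoint_path = "/path/to/your/cogvla-checkpoint"

# Assumes an OpenVLA-style Hugging Face interface; demo.py may load the model differently.
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    checkpoint_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")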
Performance. CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0% on simulation and real-world tasks, respectively.
Efficiency. CogVLA also reduces training costs by 2.5× and decreases inference latency by 2.8× compared to OpenVLA.
The attention maps of CogVLA highlight task-relevant regions in the input image, aligning well with human cognition during task execution.
Real-world demo videos:
- GaLaXea.R1.Lite.Robot.Folding.Clothes.Demo.mp4
- GaLaXea.R1.Lite.Robot.Open.Drawer.and.Place.Toy.Demo.mp4
If you find this work useful for your research, please cite our paper:
@article{li2025cogvla,
title={CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing \& Sparsification},
author={Li, Wei and Zhang, Renshan and Shao, Rui and He, Jie and Nie, Liqiang},
journal={Advances in Neural Information Processing Systems},
year={2025}
}