# VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation
- 🔥 [2025/09/18] Our VLA-Cache is accepted by NeurIPS 2025!
- 🔥 [2025/06/12] Code for OpenVLA is available (see the OpenVLA README).
- 🔥 [2025/05/29] Code for OpenVLA-OFT is released (see the OpenVLA-OFT README).
Vision-Language-Action (VLA) models can map multi-modal inputs (vision + language) to actions for robotic tasks in an end-to-end manner. However, due to the high frame rate and spatial complexity in robotic control, VLA inference can be computationally expensive.
VLA-Cache introduces a lightweight and effective caching mechanism by detecting unchanged visual tokens between frames and reusing their key-value computations. This leads to substantial speed-up with minimal accuracy loss.
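The core idea can be sketched in a few lines: compare each visual token against its counterpart in the previous frame and reuse cached key-value projections for tokens that are essentially unchanged. The sketch below is illustrative only; the function name, the cosine-similarity criterion, and the threshold are assumptions, not the repository's actual implementation.

```python
import numpy as np

def select_reusable_tokens(prev_tokens, curr_tokens, threshold=0.99):
    """Boolean mask of visual tokens nearly unchanged since the previous frame.

    Tokens whose cosine similarity to the previous frame exceeds the
    threshold can serve their key-value projections from the cache;
    the remaining tokens are recomputed. (Illustrative sketch only.)
    """
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=-1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=-1, keepdims=True)
    similarity = (prev_n * curr_n).sum(axis=-1)  # per-token cosine similarity
    return similarity > threshold  # True = safe to reuse cached KV

# Toy frames: 196 patch tokens (14x14 grid), 768-dim features.
rng = np.random.default_rng(0)
prev = rng.random((196, 768))
curr = prev.copy()
curr[:10] += 1.0  # only the first 10 patches change (e.g. the moving arm)

reuse_mask = select_reusable_tokens(prev, curr)  # static background tokens are reused
```

In a static manipulation scene, most background tokens pass the check, so only the small set of changed tokens (robot arm, manipulated object) is recomputed each frame.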
## Installation

Clone the repository:

```shell
git clone https://github.com/siyuhsu/vla-cache.git
cd vla-cache
```

Follow the OpenVLA and OpenVLA-OFT setup instructions, then install into each environment:
```shell
conda activate openvla
cd src/openvla
pip install -e .
```

```shell
conda activate openvla-oft
cd src/openvla-oft
pip install -e .
```

## LIBERO Evaluation with OpenVLA

Download the fine-tuned checkpoint:

```shell
conda activate openvla
cd src/openvla
python vla_cache_scripts/download_model_local.py \
    --model_id openvla/openvla-7b-finetuned-libero-spatial
```

Run the evaluation with VLA-Cache enabled:

```shell
python experiments/robot/libero/run_libero_eval.py \
    --pretrained_checkpoint checkpoints/openvla-7b-finetuned-libero-spatial \
    --task_suite_name libero_spatial \
    --use_vla_cache True
```

Run the baseline without VLA-Cache:

```shell
python experiments/robot/libero/run_libero_eval.py \
    --pretrained_checkpoint checkpoints/openvla-7b-finetuned-libero-spatial \
    --task_suite_name libero_spatial \
    --use_vla_cache False
```

## LIBERO Evaluation with OpenVLA-OFT

Download the fine-tuned checkpoint:

```shell
conda activate openvla-oft
cd src/openvla-oft
python vla_cache_scripts/download_model_local.py \
    --model_id moojink/openvla-7b-oft-finetuned-libero-spatial
```

Run the evaluation with VLA-Cache enabled:

```shell
python experiments/robot/libero/run_libero_eval.py \
    --pretrained_checkpoint checkpoints/openvla-7b-oft-finetuned-libero-spatial \
    --task_suite_name libero_spatial \
    --use_vla_cache True
```

Run the baseline without VLA-Cache:

```shell
python experiments/robot/libero/run_libero_eval.py \
    --pretrained_checkpoint checkpoints/openvla-7b-oft-finetuned-libero-spatial \
    --task_suite_name libero_spatial \
    --use_vla_cache False
```

## Citation

If you find this work useful, please cite:
```bibtex
@article{xu2025vla,
  title={VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation},
  author={Xu, Siyu and Wang, Yunke and Xia, Chenghao and Zhu, Dihao and Huang, Tao and Xu, Chang},
  journal={arXiv preprint arXiv:2502.02175},
  year={2025}
}
```

## Acknowledgments

We build on the amazing work of OpenVLA, OpenVLA-OFT, and Hugging Face Transformers.
## License

This project is licensed under the Apache 2.0 License.