
ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

[💻 Project page] [📄 Paper]

Puhao Li1,2, Yingying Wu1, Ziheng Xi1, Wanlin Li2, Yuzhe Huang2, Zhiyuan Zhang1, Yinghan Chen3, Jianan Wang4, Song-Chun Zhu1,2,3, Tengyu Liu2 ✉️, Siyuan Huang2 ✉️

1Tsinghua University, 2Beijing Institute for General Artificial Intelligence (BIGAI), 3Peking University, 4AstriBot.

[Teaser figure] ControlVLA is a general framework for few-shot, object-centric adaptation of pre-trained VLA models. It can adapt a pre-trained VLA model to task- and environment-specific skills with only 10-20 expert demonstrations.

🛠️ Installation

  1. Create a virtual environment with conda or another Python package manager.

    conda create -n controlvla python==3.9.18
    conda activate controlvla
  2. Install PyTorch and the other dependencies.

    pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
    pip install -r requirements.txt
    
    ## install sam2 from source
    cd thirdparty
    git clone https://github.com/facebookresearch/sam2.git
    cd sam2
    git checkout 7e1596c0b6462eb1d1ba7e1492430fed95023598
    ## remove the python and pytorch version restrictions in the sam2 setup config
    pip install -e .
    • The code is tested with PyTorch 2.1.0 and CUDA 12.1; other versions may have compatibility issues.
  3. Download the pre-trained models:

    • Pre-trained ControlVLA model: download it here, unzip it, and place it in the ./data folder.
    • SAM2 checkpoint: follow the instructions here and place it in the ./data/checkpoints folder. The default config uses the sam2_hiera_tiny.pt checkpoint, which you can fetch directly with wget: cd data/checkpoints && wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt.
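To verify the installation, a quick sanity check can be helpful. The snippet below is only a sketch: it assumes the directory layout described above and that the sam2 package resolves the sam2_hiera_t.yaml config that ships with the tiny checkpoint.

    # sanity_check.py -- a minimal sketch, not part of the repository
    import torch
    from sam2.build_sam import build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())

    # load the tiny SAM2 checkpoint downloaded in step 3 (builds on GPU by default)
    predictor = SAM2ImagePredictor(
        build_sam2("sam2_hiera_t.yaml", "./data/checkpoints/sam2_hiera_tiny.pt")
    )
    print("SAM2 image predictor ready:", type(predictor).__name__)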

🗃️ Data Collection

Note that our data is collected and pre-processed with the UMI system, which provides robot-arm-agnostic training data. You can also collect data with your own robot setup for faster validation and deployment.

  1. Collect data with the UMI data pipeline and obtain the replay_buffer.zarr.zip data file. An example of this zarr data file is provided here (an inspection sketch follows this list).

  2. Annotate the interactive parts of each object as prompts for SAM2.

    python scripts_objcen_pipeline/prompts_annotation.py -i ./example_finetune_demo/picknplace_toy.d10
    python scripts_objcen_pipeline/prompts_extraction.py -i picknplace_toy.d10

    You can also use GroundingDINO to annotate the interactive parts automatically from the task's language instructions.

  3. Process the object-centric masks and integrate them into the data file (the underlying SAM2 propagation step is sketched after this list).

    python scripts_objcen_pipeline/08_propagate_interactive_parts.py -i ./example_finetune_demo/picknplace_toy.d10
    python scripts_objcen_pipeline/09_integrate_into_dataset.py -i ./example_finetune_demo/picknplace_toy.d10
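For step 1, it can help to peek inside the collected replay buffer before annotating it. The sketch below assumes replay_buffer.zarr.zip sits inside the example demo folder and follows the UMI-style layout with data/ streams and meta/episode_ends; adjust the path and keys to your own recording.

    # inspect a UMI-style replay buffer -- path and key names are assumptions
    import zarr

    path = "./example_finetune_demo/picknplace_toy.d10/replay_buffer.zarr.zip"
    root = zarr.open(path, mode="r")   # .zarr.zip is opened read-only via a ZipStore
    print(root.tree())                 # expected: data/<streams>, meta/episode_ends
    print("episodes:", root["meta/episode_ends"].shape[0])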
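The propagation script in step 3 builds on SAM2's video predictor to carry the annotated prompts through each demonstration. If you need to adapt this step to your own data layout, the documented sam2 usage looks roughly as follows; the frame directory, point prompt, and object id below are placeholders, not values from this repository.

    # rough shape of SAM2 mask propagation -- frames_dir, the point prompt, and
    # obj_id are placeholders for values produced by the annotation step
    import numpy as np
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor("sam2_hiera_t.yaml",
                                           "./data/checkpoints/sam2_hiera_tiny.pt")
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        state = predictor.init_state(video_path="frames_dir")  # directory of JPEG frames
        predictor.add_new_points(inference_state=state, frame_idx=0, obj_id=1,
                                 points=np.array([[320., 240.]], dtype=np.float32),
                                 labels=np.array([1], dtype=np.int32))
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            masks = (mask_logits > 0.0).cpu().numpy()  # boolean mask per tracked object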

🦾 Fine-tuning and Deployment

Finetune the pre-trained ControlVLA model on the example dataset with the following command:

bash runs/controlvla_pnptoy.sh

For real-world deployment, customize the robot and camera interfaces used by the inference script eval_controlvla.py. Then run:

python eval_controlvla.py -i ./data/checkpoints/latest.ckpt -p ./example_finetune_demo/picknplace_toy.d10/picknplace_toy.d10.objectcentric.anno.pkl
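The exact hardware glue depends on your setup. As an illustration only, an adapter for eval_controlvla.py might expose methods along the lines below; the class name, method names, and observation keys here are hypothetical (the keys merely mirror UMI-style training streams), not the actual interface the script imports.

    # hypothetical hardware adapter -- names and keys are illustrative only
    import numpy as np

    class MyRobotCameraInterface:
        """Thin wrapper around a robot arm and its camera(s)."""

        def get_observation(self) -> dict:
            # return the latest synchronized observation; replace the zero arrays
            # with real sensor readings matching your training data streams
            return {
                "camera0_rgb": np.zeros((224, 224, 3), dtype=np.uint8),
                "robot0_eef_pose": np.zeros(6, dtype=np.float32),
                "robot0_gripper_width": np.zeros(1, dtype=np.float32),
            }

        def execute_action(self, action: np.ndarray) -> None:
            # send the predicted end-effector target and gripper command
            # to your robot controller
            raise NotImplementedError("connect this to your robot's control API")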

👏 Acknowledgments

We thank Yuyang Li, Yuwei Guo, and Ziyuan Jiao for their valuable discussions and technical support. This work builds upon the codebase of UMI.

🔗 Citation

If you find this work useful, please consider citing:

@article{li2025controlvla,
  title={ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models},
  author={Li, Puhao and Wu, Yingying and Xi, Ziheng and Li, Wanlin and Huang, Yuzhe and Zhang, Zhiyuan and Chen, Yinghan and Wang, Jianan and Zhu, Song-Chun and Liu, Tengyu and others},
  journal={arXiv preprint arXiv:2506.16211},
  year={2025}
}

If you have any questions about this work, feel free to contact Puhao Li at [email protected]
