This repository contains the replication package for VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation.
The codebase is adapted from SimplerEnv and ManiSkill2_real2sim.
Our generated testing scenes are provided as JSON files under data/. To reproduce our experimental results, follow the installation and replication guides below.
If you want to use the pre-built Docker image, skip 1.1.
Local installation steps
Prerequisites:
- CUDA >= 12
- An NVIDIA GPU
- Python >= 3.10
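The prerequisites above can be sanity-checked from a shell. This is an optional sketch; nvidia-smi and nvcc are the standard NVIDIA tools, but they may live elsewhere on your system:

```shell
# Optional prerequisite check (a sketch; adjust to your setup)
py_ok=$(python3 -c 'import sys; print("ok" if sys.version_info >= (3, 10) else "too old")')
echo "Python >= 3.10: $py_ok"
# These tools are absent on non-GPU hosts, so fall back to a message instead of failing
command -v nvidia-smi >/dev/null && nvidia-smi --query-gpu=name --format=csv,noheader || echo "nvidia-smi not found"
command -v nvcc >/dev/null && nvcc --version | tail -n 1 || echo "nvcc not found"
```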
Clone this repo:
git clone https://github.com/ma-labo/VLATest.git
Install ManiSkill2 real-to-sim environments and their dependencies:
cd VLATest/ManiSkill2_real2sim
pip install -e .
Install this package:
cd VLATest
pip install -e .
Install development dependencies:
sudo apt-get install -yqq --no-install-recommends libvulkan-dev vulkan-tools
sudo apt-get install libglvnd-dev
sudo apt install ffmpeg
pip install -r requirements_full_install.txt
pip install tensorflow[and-cuda]==2.15.1 # tensorflow gpu support
pip install gymnasium==0.29.1
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
Install simulated annealing utils for system identification:
pip install git+https://github.com/nathanrooy/simulated-annealing
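After installation, a quick import check can confirm that the core Python dependencies are in place. This is an optional sketch; it only reports status and tolerates missing packages:

```shell
# Report import status of the core dependencies without failing on missing ones
status=$(python3 - <<'EOF'
for mod in ("torch", "tensorflow", "gymnasium"):
    try:
        __import__(mod)
        print(mod + ": OK")
    except ImportError:
        print(mod + ": MISSING")
EOF
)
echo "$status"
```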
Docker installation

Prerequisites:
- CUDA >= 12
- An NVIDIA GPU
- NVIDIA Container Toolkit

Pull the image and start a container:
sudo docker pull zhijiewang22/vlatest:0.3
sudo docker run -dt -e "USER=vlatest" -e "PASSWORD=vlatest" -e \
"SHELL=zsh" -p 48022:22 --ipc=host \
--gpus 'all,"capabilities=compute,utility,graphics,display"' \
--name=vlatest zhijiewang22/vlatest:0.3
ssh -p 48022 vlatest@localhost
Note that in the Docker image, the repository is located at /VLATest instead of ~/VLATest.
Now, you should be able to run the minimal example in test_vla_inference.ipynb with OpenVLA-7b. To test the RT-1-series models or Octo, proceed to 1.3.
Download model checkpoints

Go to the directory where you cloned the repo (for example, cd /VLATest if you use Docker).
Download the RT-1 checkpoints:
# First, install gsutil following https://cloud.google.com/storage/docs/gsutil_install
# Make a checkpoint dir:
mkdir checkpoints
# RT-1-X
gsutil -m cp -r gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss/rt_1_x_tf_trained_for_002272480_step.zip .
unzip rt_1_x_tf_trained_for_002272480_step.zip
mv rt_1_x_tf_trained_for_002272480_step checkpoints
rm rt_1_x_tf_trained_for_002272480_step.zip
# RT-1-400k
gsutil -m cp -r gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss/rt_1_tf_trained_for_000400120 .
mv rt_1_tf_trained_for_000400120 checkpoints
# RT-1-58k
gsutil -m cp -r gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss/rt_1_tf_trained_for_000058240 .
mv rt_1_tf_trained_for_000058240 checkpoints
# RT-1-1k
gsutil -m cp -r gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss/rt_1_tf_trained_for_000001120 .
mv rt_1_tf_trained_for_000001120 checkpoints
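The three unzipped RT-1 checkpoints follow the same pattern, so the downloads can also be scripted in a loop. The sketch below only prints each command (a dry run); drop the echo to actually download:

```shell
# Dry run: print the download-and-move command for each unzipped RT-1 checkpoint
base=gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss
out=$(for ckpt in rt_1_tf_trained_for_000400120 rt_1_tf_trained_for_000058240 rt_1_tf_trained_for_000001120; do
  echo "gsutil -m cp -r $base/$ckpt . && mv $ckpt checkpoints"
done)
echo "$out"
```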
Install Octo:
pip install --upgrade "jax[cuda12_pip]==0.4.20" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # CUDA 12 build of JAX
git clone https://github.com/octo-models/octo/
cd octo
git checkout 653c54acde686fde619855f2eac0dd6edad7116b # we use octo-1.0
pip install -e .
# You don't need to run "pip install -r requirements.txt" inside the octo repo; the package dependencies are already handled in the simpler_env repo
# Octo checkpoints are managed by huggingface, so you don't need to download them manually.
To reproduce experiment results with our generated testing scenes (data/):
cd experiments
./run_exp_base_performance.sh
cd experiments
./run_exp_100_grasp.sh
./run_exp_100_move_near.sh
./run_exp_100_put_on.sh
./run_exp_100_put_in.sh
cd experiments
./run_random_lighting.sh
cd experiments
./run_random_camera.sh
cd experiments
./run_exp_base_performance_ycb.sh
The experiment results will be generated under results/.
To generate new testing scenes (the commands below assume the repo is at ~/VLATest; adjust PYTHONPATH if it is elsewhere, e.g., /VLATest in Docker):
cd experiments
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 1000 --ro
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 1000 --ro
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 1000 --ro
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 1000 --ro
cd experiments
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 100 --obstacles 0
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 100 --obstacles 1
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 100 --obstacles 2
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 100 --obstacles 3
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 100 --obstacles 4
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 100 --obstacles 0
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 100 --obstacles 1
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 100 --obstacles 2
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 100 --obstacles 3
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 100 --obstacles 4
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 100 --obstacles 0
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 100 --obstacles 1
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 100 --obstacles 2
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 100 --obstacles 3
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 100 --obstacles 4
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 100 --obstacles 0
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 100 --obstacles 1
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 100 --obstacles 2
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 100 --obstacles 3
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 100 --obstacles 4
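The 20 invocations above can be generated with a nested loop over tasks and obstacle counts. The sketch below prints the commands rather than running them; remove the echo to execute:

```shell
# Dry run: emit one test_generation.py command per task/obstacle-count pair
out=$(for task in grasp move put-on put-in; do
  for k in 0 1 2 3 4; do
    echo "PYTHONPATH=~/VLATest python3 test_generation.py -t $task -n 100 --obstacles $k"
  done
done)
echo "$out"
```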
These RQs reuse the data from RQ1. You need to extract the indices of the successful tasks from RQ1.
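One way to collect those indices is sketched below. The inline records are a synthetic stand-in; the actual field names depend on the files under results/ and must be adapted:

```shell
# Hedged sketch: filter successful tasks and keep their indices.
# "task_id" and "success" are assumed field names for illustration only.
idx=$(python3 - <<'EOF'
import json
records = json.loads('[{"task_id": 0, "success": true},'
                     ' {"task_id": 1, "success": false},'
                     ' {"task_id": 2, "success": true}]')
print(",".join(str(r["task_id"]) for r in records if r["success"]))
EOF
)
echo "successful task indices: $idx"
```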
cd experiments
PYTHONPATH=~/VLATest python3 test_generation.py -t grasp -n 1000 --ro --ycb
PYTHONPATH=~/VLATest python3 test_generation.py -t move -n 1000 --ro --ycb
PYTHONPATH=~/VLATest python3 test_generation.py -t put-on -n 1000 --ro --ycb
PYTHONPATH=~/VLATest python3 test_generation.py -t put-in -n 1000 --ro --ycb
This RQ reuses the data from RQ1.
If you find our paper/code useful in your research, please consider citing:
@article{wang2025vlatest,
  author = {Wang, Zhijie and Zhou, Zhehua and Song, Jiayang and Huang, Yuheng and Shu, Zhan and Ma, Lei},
  title = {VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation},
  journal = {Proceedings of the ACM on Software Engineering},
  year = {2025},
  volume = {2},
  number = {FSE},
  articleno = {FSE073},
  month = jul
}