AGNOSTOS: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
[Project Page] | [Paper] | 🤗 Huggingface Data | 🤗 Huggingface Model | [📺 Video]
Jiaming Zhou1, Ke Ye1, Jiayi Liu1, Teli Ma1, Zifan Wang1, Ronghe Qiu1, Kun-Yu Lin2, Zhilin Zhao3, Junwei Liang1,4
1HKUST (Guangzhou), 2HKU, 3SYSU, 4HKUST
The project introduces AGNOSTOS, a simulation manipulation benchmark designed to rigorously evaluate the cross-task zero-shot generalization of Vision-Language-Action (VLA) models, and proposes Cross-Task In-Context Manipulation (X-ICM), a method that significantly improves cross-task generalization capabilities.
Please refer to INSTALL_docker.md to initialize your environment.
For a simpler installation with modern package management, we recommend Pixi. Install it via the official guide, then set up the dependencies with a few commands:
git clone https://github.com/jiaming-zhou/X-ICM.git && cd X-ICM
pixi shell # Install dependencies and enter virtual environment
pixi run setup_env # Install additional system dependencies (e.g., xvfb, CoppeliaSim, and flash-attention)
Inside the Pixi shell, you can run additional tasks. For more options, run pixi run --list.
The benchmark data consists of two parts: 18 seen tasks and 23 unseen tasks. To download them, use the following Pixi tasks:
pixi run get_seen_tasks # Downloads and extracts the 18 seen tasks (140GB)
pixi run get_unseen_tasks # Downloads and extracts the 23 unseen tasks (20.2GB)
Data will be placed in the data/ directory. For manual download instructions, see MANUAL_DATA_DOWNLOAD.md.
To download the pre-trained dynamics diffusion model, run:
pixi run get_model
The model will be extracted to data/dynamics_diffusion/. For manual download instructions, see MANUAL_DATA_DOWNLOAD.md.
Run the evaluation with the pixi run eval_xicm Pixi task, which takes the parameters below. (Alternatively, you can run bash eval_scripts/eval_XICM.sh directly inside the Pixi shell.)
### set seed numbers for different runs
seeds: [example: "0,99"]
### set the number of rollouts for each run
episodes: [example: 25]
### set the LLM model name
modelname: [example: Qwen2.5.7B.instruct]
### set the number of cross-task in-context samples
num_icls: [example: 18]
### set the gpu list
gpu_ids: [example: 0,1]
### set the in-context sample selection method
ranking_method: [example: "lang_vis.out"]
For dynamics-guided in-context manipulation, run:
pixi run eval_xicm "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "lang_vis.out"
Reminder: during evaluation, the Stable-Diffusion and Qwen-LLM models are loaded from Hugging Face. You can also download them manually and point the load_weight function and model_path to the local paths.
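If you prefer to fetch these weights ahead of time, below is a minimal pre-download sketch using huggingface_hub; the repository IDs (in particular the Stable-Diffusion variant) and local directories are assumptions, so match them to whatever load_weight and model_path actually reference.

```python
# Pre-download sketch: repo IDs and local paths are assumptions, not the
# project's canonical checkpoints; adjust them to match load_weight / model_path.
from huggingface_hub import snapshot_download

# LLM used in the example command above (Qwen2.5-7B-Instruct).
snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct",
                  local_dir="checkpoints/Qwen2.5-7B-Instruct")

# Stable-Diffusion weights for the dynamics diffusion module (assumed variant).
snapshot_download(repo_id="stabilityai/stable-diffusion-2-1",
                  local_dir="checkpoints/stable-diffusion-2-1")
```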
For random selection of cross-task samples, run:
pixi run eval_xicm "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "random"
After testing, you can use gather_score.py to collect and analyze the results.
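For reference, the average success rate reported below is simply the mean of per-task success rates over the evaluated tasks. The snippet below only illustrates that arithmetic with hypothetical task names and counts; the real gather_score.py may parse a different log format.

```python
# Illustration of the averaging only; the actual gather_score.py may differ.
per_task_successes = {"task_a": 12, "task_b": 7, "task_c": 3}  # hypothetical counts
episodes_per_task = 25  # matches the `episodes` argument used above

rates = [n / episodes_per_task for n in per_task_successes.values()]
avg_success_rate = 100.0 * sum(rates) / len(rates)
print(f"average success rate: {avg_success_rate:.1f}%")
```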
We provide the testing results of our X-ICM (7B) and X-ICM (72B) models under the sub-folder logs/.
- X-ICM (7B) achieves a 23.5% average success rate and X-ICM (72B) achieves 30.1%, with both versions outperforming all existing VLA models;
- X-ICM (7B) fails on only two tasks, while X-ICM (72B) succeeds on all tasks.
Due to the embodiment gap, existing VLA models need to be fine-tuned on RLBench data.
Please follow your VLA model's fine-tuning guidelines to fine-tune it on our 18 seen tasks, and then test it on our 23 unseen tasks.
Modify the custom_agent.py file:
- Load your VLA model in the load_weights function;
- Implement VLA model inference in the _inference function, including input construction and output format conversion (see the sketch after this list);
- Run the evaluation:
bash eval_scripts/eval_CustomModel.sh seeds episodes gpu_ids
Example:
bash eval_scripts/eval_CustomModel.sh "0,99" 25 0,1
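For orientation, here is a minimal sketch of the two methods to fill in. The class name, method signatures, action format, and the build_input/to_action helpers are placeholders, not the actual custom_agent.py interface.

```python
# Minimal sketch only; the real custom_agent.py defines the actual class and signatures.
import torch

class CustomAgent:
    def load_weights(self, ckpt_path):
        # Load your fine-tuned VLA checkpoint (loader and path are placeholders).
        self.model = torch.load(ckpt_path, map_location="cuda")
        self.model.eval()

    def _inference(self, observation, instruction):
        # 1) Build the model input from the observation (e.g., RGB views resized
        #    to IMAGE_SIZE) plus the language instruction.
        model_input = self.build_input(observation, instruction)  # hypothetical helper
        # 2) Run the VLA model and convert its output into the action format the
        #    evaluator expects (e.g., end-effector pose + gripper state).
        with torch.no_grad():
            raw_output = self.model(model_input)  # placeholder call
        return self.to_action(raw_output)  # hypothetical converter
```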
💡 Note: Different VLA models may require different input image sizes (the default is 256x256). Please modify IMAGE_SIZE in main_custom.py accordingly.
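For example, if your model expects 224x224 inputs, the change might look roughly like this (the exact definition of IMAGE_SIZE in main_custom.py may differ):

```python
# In main_custom.py (illustrative; check the actual definition there).
IMAGE_SIZE = 224  # default corresponds to 256x256; set to your VLA model's input resolution
```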
This repository is built upon RoboPrompt. Some resources from RVT and RLBench are used in this work.
If you find our work helpful to your research, please give us a star and cite our paper.
@article{zhou2025exploring,
title={Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
author={Zhou, Jiaming and Ye, Ke and Liu, Jiayi and Ma, Teli and Wang, Zifan and Qiu, Ronghe and Lin, Kun-Yu and Zhao, Zhilin and Liang, Junwei},
journal={arXiv preprint arXiv:2505.15660},
year={2025}
}