This repo contains the training and evaluation code for our work:
Ruihan Yang1*, Qinxi Yu2*, Yecheng Wu3,4, Rui Yan1, Borui Li1, An-Chieh Cheng1, Xueyan Zou1, Yunhao Fang1, Xuxin Cheng1, Ri-Zhao Qiu1, Hongxu Yin4, Sifei Liu4, Song Han3,4, Yao Lu4, Xiaolong Wang1
1UC San Diego / 2UIUC / 3MIT / 4NVIDIA
Project Page / Arxiv / Simulation Benchmark
Follow the VILA setup instructions:
cd VILA
./environment_setup.sh vila
bash ./build_env.sh
Register at the MANO website and download the models.
- Download the MANO hand model: link
- Unzip the MANO models and place them in the repo directory (EgoVLA/mano_v1_2)
git clone https://github.com/hassony2/manopth # This is for hand pose preprocessing
git clone https://github.com/facebookresearch/hot3d # This is for Hot3d data preprocessing
export PYTHONPATH=$PYTHONPATH:/path/to/your/manopth
export PYTHONPATH=$PYTHONPATH:/path/to/your/hot3d
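With manopth on PYTHONPATH, a quick check that it can find the MANO models catches path mistakes early. Below is a minimal sketch, run from the repo root, assuming the MANO archive was unzipped to EgoVLA/mano_v1_2 as described above (so the .pkl files sit under mano_v1_2/models/).
# Sanity check: load the MANO model through manopth and run one forward pass.
import torch
from manopth.manolayer import ManoLayer

mano_layer = ManoLayer(mano_root="mano_v1_2/models", use_pca=True, ncomps=6, flat_hand_mean=False)
pose = torch.zeros(1, 3 + 6)  # 3 axis-angle root rotation params + 6 PCA pose coefficients
hand_verts, hand_joints = mano_layer(pose)
print(hand_verts.shape, hand_joints.shape)  # expected: (1, 778, 3) and (1, 21, 3)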
Overall instructions for setting up IsaacLab: https://isaac-sim.github.io/IsaacLab/main/index.html
- Follow the instructions to install IsaacSim (4.2.0.2); a quick sanity check follows after this list
pip install isaacsim==4.2.0.2 --extra-index-url https://pypi.nvidia.com
# If this command fails, please try the following instead:
# pip install isaacsim==4.2.0.2 isaacsim-extscache-physics==4.2.0.2 isaacsim-extscache-kit==4.2.0.2 isaacsim-extscache-kit-sdk==4.2.0.2 --extra-index-url https://pypi.nvidia.com
- Clone the Ego Humanoid Manipulation Benchmark, then install it with the command in its instructions
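Once IsaacSim is installed, a short headless launch verifies the pip installation (the sanity check referenced in the list above). This is a sketch assuming IsaacSim 4.x, where SimulationApp is importable from the isaacsim package; the first launch may prompt you to accept the NVIDIA EULA and can take a while.
# Headless launch sanity check for the pip-installed IsaacSim.
from isaacsim import SimulationApp

simulation_app = SimulationApp({"headless": True})  # first launch may require accepting the EULA
print("Isaac Sim launched successfully")
simulation_app.close()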
For the following human datasets, we use egocentric RGB video, hand/head/camera poses, and language labels.
- TACO
Download the raw data following the official instructions: https://taco2024.github.io/
Then follow the instructions at https://github.com/leolyliu/TACO-Instructions to set up the virtual environment for processing the TACO data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/TACO
# HF DATA: data/TACO_HF
conda activate taco
# Preprocess RAW -> HF:
sh human_plan/dataset_preprocessing/taco/hf_dataset/generate_dataset_hands_30hz.sh
sh human_plan/dataset_preprocessing/taco/hf_dataset/generate_dataset_image_30hz.sh
- HOT3D
Download the raw data following the official instructions: https://github.com/facebookresearch/hot3d
Then follow their instructions to set up the virtual environment for processing the HOT3D data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/hot3d
# HF DATA: data/hot3d_hf
# Preprocess RAW -> HF:
conda activate hot3d
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands_job_set1.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands_job_set2.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_image.sh
- HOI4D
Download the raw data from the official website: https://hoi4d.github.io/
Follow the instructions at https://github.com/leolyliu/HOI4D-Instructions to set up the virtual environment for processing the HOI4D data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/HOI4D
# HF DATA: data/hoi4d_hf
conda activate hoi4d
sh human_plan/dataset_preprocessing/hoi4d/hf_dataset/generate_dataset_hands.sh
sh human_plan/dataset_preprocessing/hoi4d/hf_dataset/generate_dataset_image.sh
- HoloAssist
Download the data from the official HoloAssist site
# RAW DATA: data/HoloAssist
# HF DATA: data/ha_dataset
# Preprocess RAW -> HF:
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_image.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_set1.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_set2.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_merge.sh
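After any of the preprocessing runs above, it is worth spot-checking the generated HF data before training. The sketch below assumes the output directories (e.g. data/TACO_HF) are HuggingFace datasets folders saved to disk; adjust the path and loading if your layout differs.
# Spot-check a preprocessed dataset (assumption: the HF data directory is a
# HuggingFace `datasets` folder saved to disk).
from datasets import load_from_disk

ds = load_from_disk("data/TACO_HF")  # e.g. the TACO output from the step above
print(ds)  # splits and columns for a DatasetDict, or rows and columns for a single Dataset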
Download from HuggingFace
huggingface-cli download EgoVLA/EgoVLA-Humanoid-Sim --repo-type dataset --local-dir data/EgoVLA_SIM
Data processing
# Without Augmentation Version
bash human_plan/dataset_preprocessing/otv_isaaclab/hf_dataset_fixed_set/generate_dataset_image.sh
bash human_plan/dataset_preprocessing/otv_isaaclab/hf_dataset_fixed_set/generate_dataset_hands.sh
huggingface-cli download rchal97/egovla_base_vlm --repo-type model --local-dir checkpoints
huggingface-cli download rchal97/ego_vla_human_video_pretrained --repo-type model --local-dir checkpoints
huggingface-cli download rchal97/egovla --repo-type model --local-dir checkpoints
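If you prefer scripting the downloads, huggingface_hub.snapshot_download mirrors the huggingface-cli commands above; a sketch with the same repos and target folders:
# Download the simulation dataset and the released checkpoints from the Hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="EgoVLA/EgoVLA-Humanoid-Sim", repo_type="dataset", local_dir="data/EgoVLA_SIM")
for repo_id in ["rchal97/egovla_base_vlm", "rchal97/ego_vla_human_video_pretrained", "rchal97/egovla"]:
    snapshot_download(repo_id=repo_id, repo_type="model", local_dir="checkpoints")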
bash training_scripts/human_video_pretraining/trans_v2_f1p30_split.sh
Download Pretrained Checkpoints on human video
Put the checkpoints in the correct directory
- Pretrained on Human Video
bash training_scripts/robot_finetuning/hoi4dhot3dholotaco_p30_h5_transv2.sh
Second stage finetuning
bash training_scripts/robot_finetuning/hoi4dhot3dholotaco_p30_h5_transv2_continual_lr.sh
- Not Pretrained on Human Video
Start from the non-pretrained base model (VILA):
bash training_scripts/robot_finetuning/nopretrain_p30_h5_transv2.sh
Second stage finetuning
bash training_scripts/robot_finetuning/nopretrain_p30_h5_transv2_continual_lr.sh
The following scripts output hand_actuation_net.pth and hand_mano_retarget_net.pth, which are used for EgoVLA inference.
The checkpoints we used (hand_actuation_net.pth & hand_mano_retarget_net.pth) are already included.
python human_plan/utils/nn_retarget_formano.py
python human_plan/utils/nn_retarget_tomano.py
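To confirm the retargeting scripts produced usable checkpoints (or to inspect the ones shipped with the repo), you can load them directly. A minimal sketch, assuming they are standard PyTorch .pth files:
# Inspect the retargeting checkpoints (assumption: standard PyTorch .pth files).
import torch

actuation = torch.load("hand_actuation_net.pth", map_location="cpu")
retarget = torch.load("hand_mano_retarget_net.pth", map_location="cpu")
print(type(actuation), type(retarget))  # e.g. state dicts or pickled modules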
Before evaluating trained models on our Ego Humanoid Manipulation Benchmark, please follow the setup instructions in the Ego Humanoid Manipulation Benchmark repo.
mkdir video_output
# Evaluation Result will be stored in result_log.txt
# Evaluation Videos will be stored in video_output
# This command evaluates the given model
bash human_plan/ego_bench_eval/fullpretrain_p30_h5_transv2.sh Humanoid-Push-Box-v0 1 2 0.2 3 1 result_log.txt 0 0 0.8 video_output evaluation_tag
python human_plan/ego_bench_eval/batch_script_30hz.py
The code has not been fully tested yet: there may be some hard-coded path issues. Please let us know if you run into any.
This software is part of the BAIR Commons HIC Repository as of calendar year 2025.
- Error:
File "deepspeed/runtime/config_utils.py", line 116, in get_config_default
    field_name).required, f"'{field_name}' is a required field and does not have a default value"
Cause: VILA requires pydantic v1, while IsaacSim 4.2.0.2 installs pydantic v2.
Fix: remove
anaconda3/envs/vila/lib/python3.10/site-packages/isaacsim/extscache/omni.kit.pip_archive-0.0.0+10a4b5c0.lx64.cp310/pip_prebundle/pydantic
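After removing that folder, a quick check confirms which pydantic ends up on the import path:
# VILA expects pydantic v1, so this should print a 1.x version.
import pydantic
print(pydantic.VERSION)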
@misc{yang2025egovlalearningvisionlanguageactionmodels,
title={EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos},
author={Ruihan Yang and Qinxi Yu and Yecheng Wu and Rui Yan and Borui Li and An-Chieh Cheng and Xueyan Zou and Yunhao Fang and Xuxin Cheng and Ri-Zhao Qiu and Hongxu Yin and Sifei Liu and Song Han and Yao Lu and Xiaolong Wang},
year={2025},
eprint={2507.12440},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2507.12440},
}
