This repo contains the training and evaluation code for our work:
Ruihan Yang1*, Qinxi Yu2*, Yecheng Wu3,4, Rui Yan1, Borui Li1, An-Chieh Cheng1, Xueyan Zou1, Yunhao Fang1, Xuxin Cheng1, Ri-Zhao Qiu1, Hongxu Yin4, Sifei Liu4, Song Han3,4, Yao Lu4, Xiaolong Wang1
1UC San Diego / 2UIUC / 3MIT / 4NVIDIA
Project Page / Arxiv / Simulation Benchmark
Follow the VILA setup instructions:
cd VILA
./environment_setup.sh vila
bash ./build_env.sh
Register at the MANO website and download the models.
- Download the MANO hand model: link
- Unzip the MANO models and place them in the repo directory (EgoVLA/mano_v1_2)
git clone https://github.com/hassony2/manopth # This is for hand pose preprocessing
git clone https://github.com/facebookresearch/hot3d # This is for Hot3d data preprocessing
export PYTHONPATH=$PYTHONPATH:/path/to/your/manopth
export PYTHONPATH=$PYTHONPATH:/path/to/your/hot3d
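With manopth on PYTHONPATH, a quick check that it can find the MANO models catches path mistakes early. Below is a minimal sketch, run from the repo root, assuming the MANO archive was unzipped to EgoVLA/mano_v1_2 as described above (so the .pkl files sit under mano_v1_2/models/).
# Sanity check: load the MANO model through manopth and run one forward pass.
import torch
from manopth.manolayer import ManoLayer

mano_layer = ManoLayer(mano_root="mano_v1_2/models", use_pca=True, ncomps=6, flat_hand_mean=False)
pose = torch.zeros(1, 3 + 6)  # 3 axis-angle root rotation params + 6 PCA pose coefficients
hand_verts, hand_joints = mano_layer(pose)
print(hand_verts.shape, hand_joints.shape)  # expected: (1, 778, 3) and (1, 21, 3)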
Overall instructions for setting up IsaacLab: https://isaac-sim.github.io/IsaacLab/main/index.html
- Follow the instructions to install IsaacSim (4.2.0.2); a quick sanity check follows after this list
pip install isaacsim==4.2.0.2 --extra-index-url https://pypi.nvidia.com
# If this command fails, please try the following instead:
# pip install isaacsim==4.2.0.2 isaacsim-extscache-physics==4.2.0.2 isaacsim-extscache-kit==4.2.0.2 isaacsim-extscache-kit-sdk==4.2.0.2 --extra-index-url https://pypi.nvidia.com
- Clone the Ego Humanoid Manipulation Benchmark, then install it with the command in its instructions
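Once IsaacSim is installed, a short headless launch verifies the pip installation (the sanity check referenced in the list above). This is a sketch assuming IsaacSim 4.x, where SimulationApp is importable from the isaacsim package; the first launch may prompt you to accept the NVIDIA EULA and can take a while.
# Headless launch sanity check for the pip-installed IsaacSim.
from isaacsim import SimulationApp

simulation_app = SimulationApp({"headless": True})  # first launch may require accepting the EULA
print("Isaac Sim launched successfully")
simulation_app.close()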
For the following human datasets, we use egocentric RGB video, hand/head/camera poses, and language labels.
- TACO
Download the raw data following the official instructions: https://taco2024.github.io/
Then follow the instructions at https://github.com/leolyliu/TACO-Instructions to set up the virtual environment for processing the TACO data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/TACO
# HF DATA: data/TACO_HF
conda activate taco
# Preprocess RAW -> HF:
sh human_plan/dataset_preprocessing/taco/hf_dataset/generate_dataset_hands_30hz.sh
sh human_plan/dataset_preprocessing/taco/hf_dataset/generate_dataset_image_30hz.sh
- HOT3D
Download the raw data following the official instructions: https://github.com/facebookresearch/hot3d
Then follow their instructions to set up the virtual environment for processing the HOT3D data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/hot3d
# HF DATA: data/hot3d_hf
# Preprocess RAW -> HF:
conda activate hot3d
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands_job_set1.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands_job_set2.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_hands.sh
sh human_plan/dataset_preprocessing/hot3d/hf_dataset/generate_dataset_image.sh
- HOI4D
Download the raw data from the official website: https://hoi4d.github.io/
Follow the instructions at https://github.com/leolyliu/HOI4D-Instructions to set up the virtual environment for processing the HOI4D data. (It is a bit complicated to merge all the dependencies into one environment.)
# RAW DATA: data/HOI4D
# HF DATA: data/hoi4d_hf
conda activate hoi4d
sh human_plan/dataset_preprocessing/hoi4d/hf_dataset/generate_dataset_hands.sh
sh human_plan/dataset_preprocessing/hoi4d/hf_dataset/generate_dataset_image.sh
- HoloAssist
Download the data from the official HoloAssist site
# RAW DATA: data/HoloAssist
# HF DATA: data/ha_dataset
# Preprocess RAW -> HF:
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_image.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_set1.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_set2.sh
sh human_plan/dataset_preprocessing/holoassist/hf_dataset/generate_dataset_hand_merge.sh
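After any of the preprocessing runs above, it is worth spot-checking the generated HF data before training. The sketch below assumes the output directories (e.g. data/TACO_HF) are HuggingFace datasets folders saved to disk; adjust the path and loading if your layout differs.
# Spot-check a preprocessed dataset (assumption: the HF data directory is a
# HuggingFace `datasets` folder saved to disk).
from datasets import load_from_disk

ds = load_from_disk("data/TACO_HF")  # e.g. the TACO output from the step above
print(ds)  # splits and columns for a DatasetDict, or rows and columns for a single Dataset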
Download from HuggingFace
huggingface-cli download EgoVLA/EgoVLA-Humanoid-Sim --repo-type dataset --local-dir data/EgoVLA_SIM
Data processing
# Without Augmentation Version
bash human_plan/dataset_preprocessing/otv_isaaclab/hf_dataset_fixed_set/generate_dataset_image.sh
bash human_plan/dataset_preprocessing/otv_isaaclab/hf_dataset_fixed_set/generate_dataset_hands.sh
huggingface-cli download rchal97/egovla_base_vlm --repo-type model --local-dir checkpoints
huggingface-cli download rchal97/ego_vla_human_video_pretrained --repo-type model --local-dir checkpoints
huggingface-cli download rchal97/egovla --repo-type model --local-dir checkpoints
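If you prefer scripting the downloads, huggingface_hub.snapshot_download mirrors the huggingface-cli commands above; a sketch with the same repos and target folders:
# Download the simulation dataset and the released checkpoints from the Hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="EgoVLA/EgoVLA-Humanoid-Sim", repo_type="dataset", local_dir="data/EgoVLA_SIM")
for repo_id in ["rchal97/egovla_base_vlm", "rchal97/ego_vla_human_video_pretrained", "rchal97/egovla"]:
    snapshot_download(repo_id=repo_id, repo_type="model", local_dir="checkpoints")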
bash training_scripts/human_video_pretraining/trans_v2_f1p30_split.sh
Download Pretrained Checkpoints on human video
Put the checkpoints in the correct directory
- Pretrained on Human Video
bash training_scripts/robot_finetuning/hoi4dhot3dholotaco_p30_h5_transv2.sh
Second stage finetuning
bash training_scripts/robot_finetuning/hoi4dhot3dholotaco_p30_h5_transv2_continual_lr.sh
- Not Pretrained on Human Video
Start from the non-pretrained base model (VILA):
bash training_scripts/robot_finetuning/nopretrain_p30_h5_transv2.sh
Second stage finetuning
bash training_scripts/robot_finetuning/nopretrain_p30_h5_transv2_continual_lr.sh
The following scripts output hand_actuation_net.pth and hand_mano_retarget_net.pth, which are used for EgoVLA inference.
The checkpoints we used (hand_actuation_net.pth & hand_mano_retarget_net.pth) are already included.
python human_plan/utils/nn_retarget_formano.py
python human_plan/utils/nn_retarget_tomano.py
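To confirm the retargeting scripts produced usable checkpoints (or to inspect the ones shipped with the repo), you can load them directly. A minimal sketch, assuming they are standard PyTorch .pth files:
# Inspect the retargeting checkpoints (assumption: standard PyTorch .pth files).
import torch

actuation = torch.load("hand_actuation_net.pth", map_location="cpu")
retarget = torch.load("hand_mano_retarget_net.pth", map_location="cpu")
print(type(actuation), type(retarget))  # e.g. state dicts or pickled modules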
Before evaluating trained models on our Ego Humanoid Manipulation Benchmark, please follow the setup instructions in the Ego Humanoid Manipulation Benchmark repo.
mkdir video_output
# Evaluation Result will be stored in result_log.txt
# Evaluation Videos will be stored in video_output
# This command evaluates the given model
bash human_plan/ego_bench_eval/fullpretrain_p30_h5_transv2.sh Humanoid-Push-Box-v0 1 2 0.2 3 1 result_log.txt 0 0 0.8 video_output evaluation_tag
python human_plan/ego_bench_eval/batch_script_30hz.py
The code has not been fully tested yet: there may be some hard-coded path issues. Please let us know if you run into any.
This software is part of the BAIR Commons HIC Repository as of calendar year 2025.
- Error:
File "deepspeed/runtime/config_utils.py", line 116, in get_config_default
    field_name).required, f"'{field_name}' is a required field and does not have a default value"
Cause: VILA requires pydantic v1, while IsaacSim 4.2.0.2 installs pydantic v2.
Fix: remove
anaconda3/envs/vila/lib/python3.10/site-packages/isaacsim/extscache/omni.kit.pip_archive-0.0.0+10a4b5c0.lx64.cp310/pip_prebundle/pydantic
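After removing that folder, a quick check confirms which pydantic ends up on the import path:
# VILA expects pydantic v1, so this should print a 1.x version.
import pydantic
print(pydantic.VERSION)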
@misc{yang2025egovlalearningvisionlanguageactionmodels,
title={EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos},
author={Ruihan Yang and Qinxi Yu and Yecheng Wu and Rui Yan and Borui Li and An-Chieh Cheng and Xueyan Zou and Yunhao Fang and Xuxin Cheng and Ri-Zhao Qiu and Hongxu Yin and Sifei Liu and Song Han and Yao Lu and Xiaolong Wang},
year={2025},
eprint={2507.12440},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2507.12440},
}
