ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations. (Oral Presentation @ CoRL 2025)
We provide code to train ReWiND reward models and policies on MetaWorld. The overall pipeline is as follows:
- Train the ReWiND Reward Model on MetaWorld + OXE data.
- Label the offline training dataset with the ReWiND Reward Model.
- Train the ReWiND Policy with offline-to-online RL for new tasks.
git clone [email protected]:Jiahui-3205/ReWiND_Release.git
cd ReWiND_Release/
# Run the setup script to create the environment and install all dependencies
bash -i setup_ReWiND_env.sh
conda activate rewind
This project uses Weights & Biases (WandB) for experiment tracking. Before running experiments:
- For Policy Training: Edit metaworld_policy_training/configs/base_config.yaml lines 15-16:
  wandb_entity_name: your-wandb-entity
  wandb_project_name: rewind-policy-training
- To Disable WandB: Set logging.wandb=false when running policy training commands.
Data Processing (recommended to run with the default paths)
# Download preprocessed OpenX DinoV2 Embeddings
python download_data.py --download_path DOWNLOAD_PATH (default: datasets)
Generate MetaWorld Trajectories for ReWiND Reward Training (recommended to run with the default paths)
# Generate Metaworld trajectories
python data_generation/metaworld_generation.py --save_path SAVE_DATA_PATH (default: datasets)
# Center-crop the videos and convert them to DINOv2 features
python data_preprocessing/metaworld_center_crop.py --video_path SAVE_DATA_PATH (default: datasets) --target_path TARGET_DATASET_PATH (default: datasets)
python data_preprocessing/generate_dino_embeddings.py --video_path_folder TARGET_DATASET_PATH (default: datasets) --target_path EMBEDDING_TARGET_PATH (default: datasets)
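If you keep the defaults, the data preparation stage amounts to running the commands above back to back; the following is simply those commands with their default paths written out (adjust the paths if you changed any of them):
# End-to-end data preparation using the default datasets/ paths
python download_data.py --download_path datasets
python data_generation/metaworld_generation.py --save_path datasets
python data_preprocessing/metaworld_center_crop.py --video_path datasets --target_path datasets
python data_preprocessing/generate_dino_embeddings.py --video_path_folder datasets --target_path datasets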
# Reward training requires a WandB entity
python train_reward.py --wandb_entity YOUR_WANDB_ENTITY (required) \
--wandb_project WANDB_PROJECT_NAME (default: rewind-reward-training) \
--rewind \
--subsample_video \
--clip_grad \
--cosine_scheduler \
--batch_size 1024 \
--worker 1
# Relabel the dataset we collected with the ReWiND reward model
python data_preprocessing/metaworld_label_reward.py --reward_model_path CHECKPOINT_PATH --h5_video_path GENERATION_PATH --h5_embedding_path EMBEDDING_TARGET_PATH --output_path OUTPUT_PATH
Note:
- OUTPUT_PATH: the labeled dataset file path (default: datasets/metaworld_labeled.h5). This will be used as <OUTPUT_PATH> in Offline Training and Online Training below.
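For example, assuming the default paths from the steps above were kept, an invocation might look like the sketch below (PATH/TO/reward_checkpoint.pt is a placeholder; use the checkpoint written by train_reward.py):
# Relabeling with the default dataset paths; the reward checkpoint path is a placeholder
python data_preprocessing/metaworld_label_reward.py \
    --reward_model_path PATH/TO/reward_checkpoint.pt \
    --h5_video_path datasets \
    --h5_embedding_path datasets \
    --output_path datasets/metaworld_labeled.h5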
cd metaworld_policy_training
python train_policy.py metaworld=off_on_15 \
algorithm=wsrl_iql \
reward=rewind_metaworld \
offline_training.offline_training_steps=15000 \
general_training.seed=42 \
environment.env_id=<ENV_ID> \
offline_training.offline_h5_path=<OUTPUT_PATH> \
reward_model.model_path=<CHECKPOINT_PATH>
- <ENV_ID>: the MetaWorld task you want to train online, e.g., button-press-wall-v2, window-close-v2. The full list of evaluation tasks used in the paper (none of which appear in the training data) is: [window-close-v2, reach-wall-v2, faucet-close-v2, coffee-button-v2, button-press-wall-v2, door-lock-v2, handle-press-side-v2, sweep-into-v2].
- <OFFLINE_CKPT_PATH>: path to your offline-trained checkpoint directory (often contains last_offline) to warm-start online training. If set to null, the run will first execute the offline phase for offline_training.offline_training_steps steps on the dataset, and then proceed to the online phase.
- To skip offline learning entirely, set offline_training.offline_training_steps=0.
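As a concrete usage example (a sketch only: the labeled dataset path assumes the default from above and may need adjusting relative to metaworld_policy_training, and the reward-model checkpoint path is a placeholder):
# Offline-to-online training on the held-out window-close-v2 task
python train_policy.py metaworld=off_on_15 \
    algorithm=wsrl_iql \
    reward=rewind_metaworld \
    offline_training.offline_training_steps=15000 \
    general_training.seed=42 \
    environment.env_id=window-close-v2 \
    offline_training.offline_h5_path=datasets/metaworld_labeled.h5 \
    reward_model.model_path=PATH/TO/reward_checkpoint.pt
Append logging.wandb=false to any of these commands to disable WandB logging for that run.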
We also provide code to train the policy offline only, so that you can load the same offline policy checkpoint for online RL on multiple new tasks downstream.
You only need to set online_training.total_time_steps=0.
After offline training completes, check the model_dir in your wandb log to find the <OFFLINE_CKPT_PATH> for online training (see Online Training below).
Then, run the above offline-to-online RL training command with offline_training.ckpt_path=<OFFLINE_CKPT_PATH> as an extra argument to perform online RL directly from the same offline policy.
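A minimal sketch of this two-stage workflow, reusing the command and placeholders from above:
# 1) Offline-only training: the same command as above, with the online phase disabled
python train_policy.py metaworld=off_on_15 \
    algorithm=wsrl_iql \
    reward=rewind_metaworld \
    offline_training.offline_training_steps=15000 \
    general_training.seed=42 \
    environment.env_id=<ENV_ID> \
    offline_training.offline_h5_path=<OUTPUT_PATH> \
    reward_model.model_path=<CHECKPOINT_PATH> \
    online_training.total_time_steps=0

# 2) Online RL on a new task, warm-started from the offline checkpoint found via model_dir in the wandb log
python train_policy.py metaworld=off_on_15 \
    algorithm=wsrl_iql \
    reward=rewind_metaworld \
    general_training.seed=42 \
    environment.env_id=<ENV_ID> \
    offline_training.offline_h5_path=<OUTPUT_PATH> \
    reward_model.model_path=<CHECKPOINT_PATH> \
    offline_training.ckpt_path=<OFFLINE_CKPT_PATH>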
Note:
- In offline training, environment.env_id is not important; the agent is trained over all training tasks found in your offline dataset.
- <OUTPUT_PATH> should point to your labeled offline dataset (see Label Offline Dataset above).
- Download mujoco210 from the mujoco-py installation guide
- Extract the downloaded mujoco210 directory into ~/.mujoco/mujoco210
- Add the following lines to ~/.bashrc:
export LD_LIBRARY_PATH=~/.mujoco/mujoco210/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
- Reload your shell configuration:
source ~/.bashrc
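Put together, these steps look roughly like the sketch below; the archive URL is the standard mujoco210 Linux release linked from the mujoco-py installation guide, so adjust it for your platform:
# Download and extract mujoco210, then expose its libraries in ~/.bashrc
mkdir -p ~/.mujoco
wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -O /tmp/mujoco210.tar.gz
tar -xzf /tmp/mujoco210.tar.gz -C ~/.mujoco
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco210/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia' >> ~/.bashrc
source ~/.bashrc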
If you see the following error when building mujoco-py:
fatal error: GL/glew.h: No such file or directory
    4 | #include <GL/glew.h>
Solution: check openai/mujoco-py#745
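One common fix for this error is to install the GLEW/OpenGL development headers; for example, on Debian/Ubuntu (package names may differ on other distributions):
# Install GL headers used when mujoco-py builds its extensions
sudo apt-get update
sudo apt-get install -y libglew-dev libgl1-mesa-dev libosmesa6-dev patchelf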
@inproceedings{
zhang2025rewind,
title={ReWi{ND}: Language-Guided Rewards Teach Robot Policies without New Demonstrations},
author={Jiahui Zhang and Yusen Luo and Abrar Anwar and Sumedh Anand Sontakke and Joseph J Lim and Jesse Thomason and Erdem Biyik and Jesse Zhang},
booktitle={9th Annual Conference on Robot Learning},
year={2025},
url={https://openreview.net/forum?id=XjjXLxfPou}
}