This is the official implementation of RGM (Reward Gap Minimization) (https://openreview.net/forum?id=WumysvcMvV6). RGM can be viewed as a hybrid offline RL and offline IL method that handles diverse types of imperfect rewards, including but not limited to partially correct rewards, sparse rewards, multi-task data-sharing settings, and completely incorrect rewards.
RGM formalizes offline policy optimization with imperfect rewards as a bilevel optimization problem: the upper level optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data, and the lower level solves a pessimistic RL problem with the corrected rewards.
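For intuition, below is a minimal, illustrative sketch of that bilevel structure in PyTorch. It is not the paper's exact algorithm: the network sizes, the discriminator-style correction term, and the simplified pessimistic critic update are assumptions made purely for illustration.

```python
# Illustrative sketch only; the names and simplified losses below are assumptions,
# not the repository's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class RewardCorrector(nn.Module):
    """Upper level: a discriminator-style correction term trained to match the
    policy's visitation distribution to the expert data."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, 1)

    def correction(self, obs, act):
        logits = self.net(torch.cat([obs, act], dim=-1))
        # Equals log D - log(1 - D) with D = sigmoid(logits):
        # positive where (s, a) looks expert-like, negative otherwise.
        return logits

def upper_level_step(corrector, opt, expert_obs, expert_act, data_obs, data_act):
    """Visitation distribution matching via a GAN-style surrogate (a stand-in for
    the paper's objective): expert pairs are labeled 1, dataset pairs 0."""
    exp_logits = corrector.net(torch.cat([expert_obs, expert_act], dim=-1))
    dat_logits = corrector.net(torch.cat([data_obs, data_act], dim=-1))
    loss = (F.binary_cross_entropy_with_logits(exp_logits, torch.ones_like(exp_logits))
            + F.binary_cross_entropy_with_logits(dat_logits, torch.zeros_like(dat_logits)))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def lower_level_step(critic, opt, corrector, batch, gamma=0.99, pessimism=0.1):
    """Pessimistic RL on the corrected reward r + delta_r; the mean-Q penalty here
    is a crude stand-in for a proper pessimistic term."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        corrected_r = rew + corrector.correction(obs, act)
        next_q = critic(torch.cat([next_obs, next_act], dim=-1))
        target = corrected_r + gamma * (1.0 - done) * next_q
    q = critic(torch.cat([obs, act], dim=-1))
    loss = F.mse_loss(q, target) + pessimism * q.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In the actual method the two levels are coupled (the correction term is optimized against the lower-level solution); the sketch only indicates the role each level plays.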
To install the dependencies, use
pip install -r requirements.txt
If you want to conduct experiments on the Robomimic datasets, you also need to install Robomimic following the instructions in the Robomimic repository.
You can reproduce the MuJoCo tasks and Robomimic tasks like so:
bash run_d4rl.sh
bash run_robomimic.sh
The code for the experiments on the multi-task data-sharing setting will be released soon.
You can log experiments to your personal wandb account by exporting your own wandb API key:
export WANDB_API_KEY=YOUR_WANDB_API_KEY
and run
wandb online
to turn on online synchronization.
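Once the key is set, a script can also log metrics programmatically, e.g. the following minimal sketch (the project name and metric keys are placeholders, not the repository's actual names):

```python
# Minimal sketch; project/run names and metric keys are placeholders.
import wandb

run = wandb.init(project="rgm-example", name="hopper-medium-expert-demo")
for step in range(10):
    wandb.log({"eval/normalized_return": 0.0, "train/critic_loss": 0.0}, step=step)
run.finish()
```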
If you find our code and paper helpful, please cite our paper as:
@inproceedings{
li2023mind,
title={Mind the Gap: Offline Policy Optimization for Imperfect Rewards},
author={Jianxiong Li and Xiao Hu and Haoran Xu and Jingjing Liu and Xianyuan Zhan and Qing-Shan Jia and Ya-Qin Zhang},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=WumysvcMvV6}
}