This repository contains the code to reproduce experiments in the paper: Policy Optimization in RLHF: The Impact of Out-of-preference Data.
The experiments show that policy optimization with out-of-preference data is key to unlocking the reward model's generalization power.
The Python environment can be set up using Anaconda with the provided environment.yml file.
conda env create -f environment.yml
conda activate bandit
bash scripts/run_linear_bandit.sh
bash scripts/run_neural_bandit.sh
If you find this code is helpful, please cite our paper in the following format.
@article{li2023policy,
  title     = {Policy Optimization in RLHF: The Impact of Out-of-preference Data},
  author    = {Li, Ziniu and Xu, Tian and Yu, Yang},
  journal   = {arXiv preprint arXiv:2312.10584},
  year      = {2023},
}