💻 This is the official implementation for our paper GFRIEND: Generative Few-shot Reward Inference through Efficient DPO.
Yiyang Zhao, Zhiqi Shen, Xuejiao Zhao*
Nanyang Technological University
* Corresponding author
- [2025.06.10] We release the latest paper version on arXiv.
- [2025.06.09] We have added more detailed information on the dataset, RewardBench, and data preprocessing. Have a try!
- [2025.06.09] We release the official implementation of GFRIEND.
GFRIEND is a generative reward model for RLHF (Reinforcement Learning from Human Feedback) designed for scenarios with limited human preference data. GFRIEND first integrates a preference refinement module that produces diverse, high-quality preference data, mitigating data sparsity. It then employs a multi-level preference modeling strategy rather than simple binary comparisons, using a perplexity-based scoring mechanism to quantify preference strength and enable finer-grained reward modeling. Finally, it modifies the Direct Preference Optimization (DPO) loss by weighting sample pairs according to their preference disparity, so that more representative pairs are emphasized during reward-model training.
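As a rough illustration of the last point, the snippet below sketches what a preference-gap-weighted DPO loss can look like. The function name `weighted_dpo_loss` and the way `pair_weights` are derived are assumptions made for illustration, not the exact loss used in the paper; see `run_m_dpo.py` and the paper for the actual formulation.

```python
# Illustrative sketch of a preference-gap-weighted DPO loss (hypothetical names;
# the actual M-DPO loss is defined in the paper and implemented under /models).
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      pair_weights, beta=0.1):
    # Standard DPO margin between the chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards
    # Pairs with a larger preference disparity get a larger weight and thus
    # contribute more to the gradient.
    losses = -pair_weights * F.logsigmoid(margins)
    return losses.mean()
```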
The core processes of GFRIEND include:
- SFT: Supervised fine-tuning of the base model using a small amount of (question, chain-of-thought) data to enable it to generate high-quality thoughts/reasoning.
- Preference Refinement: Sampling multiple CoT (chain-of-thought) traces and judgments for preference-labeled data, then expanding the preference data and assigning fine-grained preference levels via perplexity-based scoring (see the sketch after this list).
- M-DPO: Weighted Direct Preference Optimization training on the above multi-level preference data.
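For the perplexity-based scoring mentioned above, a minimal sketch using the Hugging Face `transformers` API is shown below. The helper `perplexity` and the model name are illustrative assumptions; the project's actual sampling and scoring logic lives under `/generate`.

```python
# Minimal sketch: score a sampled CoT/judgment by its perplexity under a causal LM.
# Names here are illustrative; see /generate for the project's implementation.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, device="cuda"):
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing input_ids as labels yields the average next-token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # lower perplexity -> more confident generation

# Example usage (model name is an assumption):
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").to("cuda")
# scores = [perplexity(lm, tok, cot) for cot in sampled_cots]
```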
Figure 1: The overall framework of GFRIEND, illustrating how additional preference data is generated from a dataset that provides preference labels for pairs of answers to a question.
The project primarily utilizes the following two types of datasets for training and evaluation as described in the paper:
- General Domain Dataset: We selected the publicly available "Skywork-Reward-Preference-80K-v0.2" as the base preference data. For the few-shot scenario, we used a small number of high-quality samples (approximately 3,000) for training and tested on public benchmarks such as the following:
  - RewardBench: A benchmark for evaluating the capabilities of reward models, covering categories such as chat, reasoning, and safety, and designed to test reward models on complex, structured queries.
  - UltraFeedback: A large-scale, fine-grained, and diverse preference dataset containing prompts from various sources, annotated by GPT-4 along four aspects: instruction following, truthfulness, honesty, and helpfulness.
  - PKU-SafeRLHF: A human-annotated preference dataset with over 300,000 labeled comparisons covering helpfulness and harmlessness, aimed at advancing research on the safe alignment of large language models.
- Medical Domain Dataset: To verify the effectiveness of the method in specialized scenarios, we constructed a medical preference dataset based on the iCliniq dataset to simulate a low-resource environment. The dataset consists of 3,500 entries, with 3,000 used for training and 500 for validation. The data is derived from anonymized segments of real clinical conversations and publicly available medical data, and has undergone deduplication, normalization, anonymization, and expert annotation to form a structured preference format of (question, answer_pos, answer_neg).
When reproducing or conducting research using the above datasets, please note the following points:
- The preprocessing and filtering methods for the general domain dataset are detailed in the paper and script comments. It is recommended to ensure that there is no overlap between the training and test sets before training.
- If you have other custom preference data (such as for question-answering or dialogue scenarios), you can integrate it into the same pipeline in the (question, answer_pos, answer_neg) format; see the sketch below for a possible record layout.
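For reference, here is one possible way to write such records to a JSONL file. The field names below simply mirror the (question, answer_pos, answer_neg) format described above; the exact schema expected by the scripts should be checked in `/data`.

```python
# Hypothetical example of a custom preference record in the
# (question, answer_pos, answer_neg) format; field names are assumptions,
# verify the exact schema expected by the data-loading code in /data.
import json

record = {
    "question": "What should I do for a mildly sprained ankle?",
    "answer_pos": "Rest, ice, compression, and elevation; see a doctor if swelling persists.",
    "answer_neg": "Keep walking on it; it will heal on its own.",
}

with open("custom_preferences.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```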
Table 1: Judgment accuracy of models on the test sets of UltraFeedback, PKU-SafeRLHF, and RewardBench. BT-model, ArmoRM, and GFRIEND are trained on 3,000 samples of Skywork-Reward-Preference-80K-v0.2 based on Llama3-8B-Instruct.
Table 2: Evaluation of different language-model bases using supervised fine-tuning (SFT), BT-model, and the GFRIEND method on three benchmarks: UltraFeedback, PKU-SafeRLHF, and RewardBench. With the exception of SFT, all models are trained on 3,000 samples.
Table 3 (Left): Judgment accuracy of GFRIEND and other models on the medical-domain dataset. BT-model, ArmoRM, and GFRIEND are trained on 3,000 samples based on Llama3-8B-Instruct.
Table 4 (Right): Judgment accuracy of GFRIEND and its variants. CoT-S indicates whether Preference Refinement is used. With the exception of SFT, all models are trained on 3,000 samples based on Llama3-8B-Instruct.
- `/data`: Scripts for data loading and processing
- `/models`: Core logic for models, trainers, etc.
- `/generate`: Functions for generating diverse preference data, including CoT sampling and perplexity calculation
- `/utils`: General utility functions, such as log management
- `run_sft.py`: Script for running SFT training
- `run_preference_refinement.py`: Script for generating and scoring multi-level preference data
- `run_m_dpo.py`: Script for executing multi-level preference-weighted DPO training
- Python 3.8+
- PyTorch >= 1.13
- transformers >= 4.30
- [Optional] accelerate / deepspeed / flash-attention, and other optimization tools
Installation method:

```bash
pip install -r requirements.txt
```

SFT: Prepare your (question, chain-of-thought) data, adjust the data path in `run_sft.py`, and run:

```bash
python run_sft.py
```

Preference Refinement: Prepare your (question, a-, a+) preference pairs, then run:

```bash
python run_preference_refinement.py
```

M-DPO: Train on the generated multi-level preference data with the multi-level preference-weighted loss:

```bash
python run_m_dpo.py
```

If you find our work useful, please consider citing our paper:
```bibtex
@misc{zhao2025gfriendgenerativefewshotreward,
      title={GFRIEND: Generative Few-shot Reward Inference through Efficient DPO},
      author={Yiyang Zhao and Huiyu Bai and Xuejiao Zhao},
      year={2025},
      eprint={2506.08965},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.08965},
}
```