💻 This is the official implementation for our paper GFRIEND: Generative Few-shot Reward Inference through Efficient DPO.
Yiyang Zhao, Zhiqi Shen, Xuejiao Zhao*
Nanyang Technological University
* Corresponding author
- [2025.06.10] We release the latest paper version on arXiv.
- [2025.06.09] We have added more detailed information on the dataset, RewardBench, and data preprocessing. Have a try!
- [2025.06.09] We release the official implementation of GFRIEND.
GFRIEND is a generative reward model for RLHF (Reinforcement Learning from Human Feedback) designed for scenarios with limited human preference data. GFRIEND first integrates a preference refinement module that produces diverse, high-quality preference data, mitigating data sparsity. It then employs a multi-level preference modeling strategy rather than simple binary comparisons, using a perplexity-based scoring mechanism to quantify preference strength and enable finer-grained reward modeling. Finally, it modifies the Direct Preference Optimization (DPO) loss by weighting sample pairs according to their preference disparity, so that more representative pairs are emphasized during reward-model training.
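As a rough illustration of the last point, the snippet below sketches what a preference-gap-weighted DPO loss can look like. The function name `weighted_dpo_loss` and the way `pair_weights` are derived are assumptions made for illustration, not the exact loss used in the paper; see `run_m_dpo.py` and the paper for the actual formulation.

```python
# Illustrative sketch of a preference-gap-weighted DPO loss (hypothetical names;
# the actual M-DPO loss is defined in the paper and implemented under /models).
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      pair_weights, beta=0.1):
    # Standard DPO margin between the chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards
    # Pairs with a larger preference disparity get a larger weight and thus
    # contribute more to the gradient.
    losses = -pair_weights * F.logsigmoid(margins)
    return losses.mean()
```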
The core processes of GFRIEND include:
- SFT: Supervised fine-tuning of the base model using a small amount of (question, chain-of-thought) data to enable it to generate high-quality thoughts/reasoning.
- Preference Refinement: Sampling multiple CoT (chain-of-thought) traces and judgments for preference-labeled data, then expanding the preference data and assigning fine-grained preference levels via perplexity-based scoring (see the sketch after this list).
- M-DPO: Weighted Direct Preference Optimization training on the above multi-level preference data.
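For the perplexity-based scoring mentioned above, a minimal sketch using the Hugging Face `transformers` API is shown below. The helper `perplexity` and the model name are illustrative assumptions; the project's actual sampling and scoring logic lives under `/generate`.

```python
# Minimal sketch: score a sampled CoT/judgment by its perplexity under a causal LM.
# Names here are illustrative; see /generate for the project's implementation.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, device="cuda"):
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing input_ids as labels yields the average next-token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # lower perplexity -> more confident generation

# Example usage (model name is an assumption):
# tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct").to("cuda")
# scores = [perplexity(lm, tok, cot) for cot in sampled_cots]
```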
Figure 1: The overall framework of GFRIEND, illustrating how additional preference data is generated from a dataset that provides preference labels for pairs of answers to a question.
The project primarily utilizes the following two types of datasets for training and evaluation as described in the paper:
- General Domain Dataset: We selected the publicly available "Skywork-Reward-Preference-80K-v0.2" as the base preference data. For the few-shot scenario, we used a small number of high-quality samples (approximately 3,000) for training and tested on public benchmarks such as the following:
  - RewardBench: A benchmark for evaluating the capabilities of reward models, covering categories such as chat, reasoning, and safety, and designed to test reward models on complex, structured queries.
  - UltraFeedback: A large-scale, fine-grained, and diverse preference dataset containing prompts from various sources, annotated by GPT-4 along four aspects: instruction following, truthfulness, honesty, and helpfulness.
  - PKU-SafeRLHF: A human-annotated preference dataset with over 300,000 labeled comparisons covering helpfulness and harmlessness, aimed at advancing research on the safe alignment of large language models.
- Medical Domain Dataset: To verify the effectiveness of the method in specialized scenarios, we constructed a medical preference dataset based on the iCliniq dataset to simulate a low-resource environment. The dataset consists of 3,500 entries, with 3,000 used for training and 500 for validation. The data is derived from anonymized segments of real clinical conversations and publicly available medical data, and has undergone deduplication, normalization, anonymization, and expert annotation to form a structured preference format of (question, answer_pos, answer_neg).
When reproducing or conducting research using the above datasets, please note the following points:
- The preprocessing and filtering methods for the general domain dataset are detailed in the paper and script comments. It is recommended to ensure that there is no overlap between the training and test sets before training.
- If you have other custom preference data (such as for question-answering or dialogue scenarios), you can integrate it into the same pipeline in the (question, answer_pos, answer_neg) format; see the sketch below for a possible record layout.
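For reference, here is one possible way to write such records to a JSONL file. The field names below simply mirror the (question, answer_pos, answer_neg) format described above; the exact schema expected by the scripts should be checked in `/data`.

```python
# Hypothetical example of a custom preference record in the
# (question, answer_pos, answer_neg) format; field names are assumptions,
# verify the exact schema expected by the data-loading code in /data.
import json

record = {
    "question": "What should I do for a mildly sprained ankle?",
    "answer_pos": "Rest, ice, compression, and elevation; see a doctor if swelling persists.",
    "answer_neg": "Keep walking on it; it will heal on its own.",
}

with open("custom_preferences.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```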
Table 1: Judgment accuracy of models on the test sets of UltraFeedback, PKU-SafeRLHF, and RewardBench. BT-model, ArmoRM, and GFRIEND are trained on 3,000 samples of Skywork-Reward-Preference-80K-v0.2 based on Llama3-8B-Instruct.
Table 2: Evaluation of different language-model bases using supervised fine-tuning (SFT), BT-model, and the GFRIEND method on three benchmarks: UltraFeedback, PKU-SafeRLHF, and RewardBench. With the exception of SFT, all models are trained on 3,000 samples.
Table 3 (Left): Judgment accuracy of GFRIEND and other models on the medical-domain dataset. BT-model, ArmoRM, and GFRIEND are trained on 3,000 samples based on Llama3-8B-Instruct.
Table 4 (Right): Judgment accuracy of GFRIEND and its variants. CoT-S indicates whether Preference Refinement is used. With the exception of SFT, all models are trained on 3,000 samples based on Llama3-8B-Instruct.
- `/data`: Scripts for data loading and processing
- `/models`: Core logic for models, trainers, etc.
- `/generate`: Functions for generating diverse preference data, including CoT sampling and perplexity calculation
- `/utils`: General utility functions, such as log management
- `run_sft.py`: Script for running SFT training
- `run_preference_refinement.py`: Script for generating and scoring multi-level preference data
- `run_m_dpo.py`: Script for executing multi-level preference-weighted DPO training
- Python 3.8+
- PyTorch >= 1.13
- transformers >= 4.30
- [Optional] accelerate / deepspeed / flash-attention, and other optimization tools
Installation method:

```bash
pip install -r requirements.txt
```

SFT: Prepare your (question, chain-of-thought) data, adjust the data path in `run_sft.py`, and run:

```bash
python run_sft.py
```

Preference Refinement: Prepare your (question, a-, a+) preference pairs, then run:

```bash
python run_preference_refinement.py
```

M-DPO: Train on the generated multi-level preference data with the multi-level preference-weighted loss:

```bash
python run_m_dpo.py
```

If you find our work useful, please consider citing our paper:
```bibtex
@misc{zhao2025gfriendgenerativefewshotreward,
      title={GFRIEND: Generative Few-shot Reward Inference through Efficient DPO},
      author={Yiyang Zhao and Huiyu Bai and Xuejiao Zhao},
      year={2025},
      eprint={2506.08965},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.08965},
}
```