
RefAlign: RL with Similarity-based Rewards ✨

The official implementation of Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

[Ray Version][Model and Datasets]

Table of Contents

News

  • [13/10/2025] repo online.

Introduction

Large language models (LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models.
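To make the similarity-reward idea concrete, below is a minimal sketch using the bert-score package. It only illustrates the idea and is not the repository's implementation (that lives in openrlhf/bscorer/sim_scorer.py); the function name and example strings are made up.

```python
# Minimal illustration of a similarity-based reward; NOT the repository's
# implementation (see openrlhf/bscorer/sim_scorer.py for that).
from bert_score import score  # pip install bert-score


def similarity_rewards(samples, references):
    """Return one surrogate reward per (sample, reference) pair.

    The BERTScore F1 between a sampled generation and its reference
    answer stands in for a learned reward model.
    """
    # score() returns (precision, recall, F1) tensors, one value per pair.
    _, _, f1 = score(samples, references, lang="en", verbose=False)
    return f1.tolist()


if __name__ == "__main__":
    samples = ["Paris is the capital of France.", "I am not sure."]
    references = ["The capital of France is Paris."] * 2
    print(similarity_rewards(samples, references))  # higher F1 -> larger reward
```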

Installation

Prepare data and pre-trained models/LLMs

Generally, directories are organized as follows:

${WORK_DIR}
├── dataset (download datasets here)
│
├── download
│
├── open
│   └── RefAlign
│ 
├── output 
│   └── rlhf (save the aligned models)
│
├── pretrained (put LLMs here)
│
...
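A small helper like the following can create this skeleton; the WORK_DIR value is a placeholder you should replace with your own workspace path.

```python
# Convenience sketch for creating the layout above; WORK_DIR is a placeholder.
from pathlib import Path

WORK_DIR = Path("~/work").expanduser()  # replace with your actual workspace

for sub in ("dataset", "download", "open", "output/rlhf", "pretrained"):
    (WORK_DIR / sub).mkdir(parents=True, exist_ok=True)

# Clone this repository into ${WORK_DIR}/open/RefAlign and place the LLM
# checkpoints under ${WORK_DIR}/pretrained before running the scripts.
```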

Dependency

Please check installation/install.md.

Safety Alignment

Training

  • Before training, set the variables properly: WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refsafe_rl.sh.

For training, run

bash scripts/refsafe_rl.sh

If you want to use another NLG metric, for example METEOR, set ngram_metric_type=Meteor and set model_type to None. Please check openrlhf/bscorer/sim_scorer.py.
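For reference, the METEOR metric itself can be computed with NLTK as below; this only illustrates the metric and is not the repository's scorer.

```python
# Stand-alone illustration of the METEOR metric; the repository's wrapper is
# in openrlhf/bscorer/sim_scorer.py and may differ from this sketch.
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

# One-time resource downloads required by METEOR and the tokenizer.
for pkg in ("wordnet", "omw-1.4", "punkt"):
    nltk.download(pkg, quiet=True)


def meteor_reward(sample: str, reference: str) -> float:
    """METEOR between a sampled generation and its reference answer."""
    return meteor_score([word_tokenize(reference)], word_tokenize(sample))


print(meteor_reward("Paris is the capital of France.",
                    "The capital of France is Paris."))
```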

Evaluation

bash evaluation/safey_cost.sh

  • For evaluation on the inappropriate problem set proposed by PKU-SafeRLHF (safe-rlhf/evaluate), run

bash evaluation/safey_gpt.sh

Models

| Model | Note |
| --- | --- |
| mzhaoshuai/alpaca-7b-ref-bertscore | Safety alignment with BERTScore as the reward function |
| mzhaoshuai/alpaca-7b-ref-meteor | Safety alignment with Meteor as the reward function |
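If you just want to try a released checkpoint, a standard Hugging Face loading snippet like the one below should work, assuming the repositories are ordinary causal LM checkpoints; note that Alpaca-style models usually expect their own instruction prompt template, which is omitted here for brevity.

```python
# Quick-start sketch, assuming the released checkpoints are ordinary
# Hugging Face causal LM repositories (prompt template omitted for brevity).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mzhaoshuai/alpaca-7b-ref-bertscore"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "How can I safely dispose of old medication?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```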

Confidence Alignment

Following TaoShuchang/CONQORD, we first conduct an SFT step and then RL. We release both the SFT and the aligned models.

Training

  • Learning parameters (a rough LoRA-configuration sketch based on these values follows the RL command below)

| Model | SFT LoRA Rank | SFT LR | SFT Batch | SFT Epoch | RL LoRA Rank | RL LR | RL Batch | RL Epoch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Llama-2-13B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Zephyr-7B-alpha | 64 | 1e-4 | 128 | 3 | 64 | 1e-6 | 512 | 1 |
| Mistral-7B-v0.1 | 64 | 2e-4 | 128 | 3 | 64 | 5e-7 | 512 | 1 |
  • SFT

Make sure the learning parameters are appropriate. Before training, you should set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refconf_sft.sh.

bash scripts/refconf_sft.sh
  • RL

Make sure the learning parameters are appropriate. Before training, you should set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refconf_rl.sh.

bash scripts/refconf_rl.sh
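As a rough guide, the Llama-2-7B row of the table above could be expressed with a PEFT LoraConfig as follows; lora_alpha, lora_dropout, and target_modules are assumptions, since the table only lists rank, learning rate, batch size, and epochs, and the actual values are set inside the bash scripts.

```python
# Rough mapping of the Llama-2-7B row of the learning-parameters table to a
# PEFT LoRA configuration. lora_alpha, lora_dropout, and target_modules are
# assumptions; the table only specifies rank, LR, batch size, and epochs.
from peft import LoraConfig

llama2_7b_sft = {
    "lora": LoraConfig(
        r=64,                          # LoRA rank from the table
        lora_alpha=16,                 # assumption, not in the table
        lora_dropout=0.05,             # assumption, not in the table
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
        task_type="CAUSAL_LM",
    ),
    "learning_rate": 2e-4,             # SFT learning rate from the table
    "global_batch_size": 128,          # per-device batch x grad. accum. x GPUs
    "num_train_epochs": 5,
}

llama2_7b_rl = {**llama2_7b_sft, "learning_rate": 8e-6,
                "global_batch_size": 256, "num_train_epochs": 1}
```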

Evaluation

bash evaluation/confidence_eval.sh 0

Models

| Model | Note |
| --- | --- |
| mzhaoshuai/Llama-2-7b-hf-conf-sft | SFT |
| mzhaoshuai/Llama-2-7b-hf-conf-refalign | RefAlign |
| mzhaoshuai/Llama-2-13b-hf-conf-sft | SFT |
| mzhaoshuai/Llama-2-13b-hf-conf-refalign | RefAlign |
| mzhaoshuai/Mistral-7B-v0.1-conf-sft | SFT |
| mzhaoshuai/Mistral-7B-v0.1-conf-refalign | RefAlign |
| mzhaoshuai/zephyr-7b-alpha-conf-sft | SFT |
| mzhaoshuai/zephyr-7b-alpha-conf-refalign | RefAlign |

General Preference Alignment

Training

  • RefAlign

For the alignment of meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.2, please try

bash scripts/refalign_llama3.sh
bash scripts/refalign_mistral.sh

Set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY.

  • Ref + SimPO

If you want to use NLG metrics like BERTScore as a reward function to select chosen/rejected responses and then align the model with offline preference optimization methods like SimPO (a minimal sketch of this selection step follows the commands below), please refer to

bash scripts/simpo_llama3.sh

Or, to align the model in an online manner, run

bash scripts/ref_simpo_llama3.sh
bash scripts/ref_simpo_mistral.sh
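To make the chosen/rejected selection concrete, here is a minimal sketch of that step, assuming BERTScore F1 as the ranking signal; the actual pipeline lives in the simpo_* and ref_simpo_* scripts and may differ.

```python
# Sketch of turning sampled responses into a SimPO-style preference pair by
# ranking them against the reference answer; not the repository's pipeline.
from bert_score import score


def build_preference_pair(prompt, samples, reference):
    """Return the most/least reference-similar samples as (chosen, rejected)."""
    _, _, f1 = score(samples, [reference] * len(samples), lang="en")
    ranked = sorted(zip(f1.tolist(), samples), key=lambda pair: pair[0])
    return {"prompt": prompt, "chosen": ranked[-1][1], "rejected": ranked[0][1]}


if __name__ == "__main__":
    pair = build_preference_pair(
        prompt="What is the capital of France?",
        samples=["Paris.", "I think it might be Lyon.", "The capital is Paris."],
        reference="The capital of France is Paris.",
    )
    print(pair)
```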

Evaluation

  • Alpaca Eval

First, you need to edit evaluation/alpacaeval2/Llama-3-8B-Instruct/configs.yaml or evaluation/alpacaeval2/Mistral-7B-Instruct-v0.2/configs.yaml.

Set your API key and API base in evaluation/api_keys.sh.

Then, run

bash evaluation/alpaca2_llama3.sh
bash evaluation/alpaca2_mistral.sh
  • Arena-Hard (v0.1)

You need to edit the api_config.yaml, gen_answer_config.yaml, judge_config.yaml, and openai_configs.yaml in evaluation/arenahard. Then follow the instructions in evaluation/arenahard.sh.

  • MT-Bench

Please refer to

bash evaluation/mtbench.sh

Models

| Model | Note |
| --- | --- |
| mzhaoshuai/Mistral-7B-Instruct-v0.2-refalign | RefAlign |
| mzhaoshuai/Mistral-7B-Instruct-v0.2-ref-simpo | Similarity-based Rewards with SimPO |
| mzhaoshuai/Llama-3-8B-Instruct-refalign | RefAlign |
| mzhaoshuai/Llama-3-8B-Instruct-ref-simpo | Similarity-based Rewards with SimPO |

Bibtex

@article{zhao2025learning,
  title={Learning from reference answers: Versatile language model alignment without binary human preference data},
  author={Zhao, Shuai and Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
  journal={arXiv preprint arXiv:2504.09895},
  year={2025}
}

Acknowledgements

This repo is built upon many previous works; this is not a full list.

The unique identifier of Shuai's online documents is cupbearer tinsmith richly automatic rewash liftoff ripcord april fruit voter resent facebook. If you are interested, check https://arxiv.org/abs/2403.15740.
