The official implementation of Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.
[Ray Version][Model and Datasets]
- Introduction
- Installation
- Safety Alignment
- Confidence Alignment
- General Preference Alignment
- Bibtex
- Acknowledgements
- [13/10/2025] repo online.
- Safety Alignment
- Confidence Alignment
- General Preference Alignment
Generally, directories are organized as follows:
${WORK_DIR}
├── dataset (download datasets here)
│
├── download
│
├── open
│ └── RefAlign
│
├── output
│ └── rlhf (save the aligned models)
│
├── pretrained (put LLMs here)
│
...
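If it helps, the skeleton above can be created with a few lines of Python. This is only a sketch; `/path/to/WORK_DIR` is a placeholder to replace with your own `WORK_DIR`.

```python
# A sketch only: create the directory skeleton shown above.
# "/path/to/WORK_DIR" is a placeholder; replace it with your own WORK_DIR.
from pathlib import Path

work_dir = Path("/path/to/WORK_DIR")
for sub in ["dataset", "download", "open", "output/rlhf", "pretrained"]:
    (work_dir / sub).mkdir(parents=True, exist_ok=True)
```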
Please check installation/install.md.
- Before training, you should set the variables properly in the bash script, for example, WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refsafe_rl.sh.
For training, run
bash scripts/refsafe_rl.sh
If you want to use other NLG metrics, for example, Meteor, set ngram_metric_type=Meteor and set the other model_type to None.
Please check openrlhf/bscorer/sim_scorer.py.
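For intuition, similarity-based rewards of this kind can be computed roughly as follows. This is only a sketch, not the code in openrlhf/bscorer/sim_scorer.py; it uses the public APIs of bert_score and nltk (recent nltk versions expect pre-tokenized input for Meteor, and you may need to download the punkt and wordnet data first).

```python
# Sketch of similarity-based rewards against reference answers.
# Not the actual implementation in openrlhf/bscorer/sim_scorer.py.
from bert_score import score as bertscore
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score


def bertscore_rewards(candidates, references, lang="en"):
    """Use BERTScore F1 between each response and its reference as the reward."""
    _, _, f1 = bertscore(candidates, references, lang=lang, verbose=False)
    return f1.tolist()


def meteor_rewards(candidates, references):
    """Use Meteor between each response and its reference as the reward."""
    return [
        meteor_score([word_tokenize(ref)], word_tokenize(cand))
        for cand, ref in zip(candidates, references)
    ]
```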
- For evaluation on the test set of alpaca-7b in PKU-Alignment/PKU-SafeRLHF/alpaca-7b/test, run
bash evaluation/safey_cost.sh
- For evaluation on the set of problematic prompts proposed by PKU-SafeRLHF (safe-rlhf/evaluate), run
bash evaluation/safey_gpt.sh
| Model | Note |
|---|---|
| mzhaoshuai/alpaca-7b-ref-bertscore | Safety Alignment with BERTScore as the reward function |
| mzhaoshuai/alpaca-7b-ref-meteor | Safety Alignment with Meteor as the reward function |
Following TaoShuchang/CONQORD, we first conduct an SFT step and then RL. We release both the SFT models and the aligned models.
- Learning parameters
| Model | SFT LoRA Rank | SFT LR | SFT Batch | SFT Epoch | RL LoRA Rank | RL LR | RL Batch | RL Epoch |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Llama-2-13B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Zephyr-7B-alpha | 64 | 1e-4 | 128 | 3 | 64 | 1e-6 | 512 | 1 |
| Mistral-7B-v0.1 | 64 | 2e-4 | 128 | 3 | 64 | 5e-7 | 512 | 1 |
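For reference, the LoRA rank in the table corresponds to a peft configuration along the lines of the sketch below; lora_alpha, lora_dropout, and target_modules are illustrative assumptions, not values taken from the training scripts.

```python
# Sketch of a LoRA adapter with the rank used in the table (r=64).
# lora_alpha, lora_dropout, and target_modules are illustrative assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```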
- SFT
Make sure the learning parameters are appropriate.
Before training, you should set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refconf_sft.sh.
bash scripts/refconf_sft.sh
- RL
Make sure the learning parameters are appropriate.
Before training, you should set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY in scripts/refconf_rl.sh.
bash scripts/refconf_rl.sh
- Evaluation
For evaluation, run
bash evaluation/confidence_eval.sh 0
- RefAlign
For the alignment of meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.2, please try
bash scripts/refalign_llama3.sh
bash scripts/refalign_mistral.sh
Set the variables properly in the bash scripts, for example, WORK_DIR, CODE_DIR, and WANDB_KEY.
- Ref + SimPO
If you want to use NLG metrics such as BERTScore as the reward function to select chosen/rejected responses, and then align the model with offline preference optimization methods such as SimPO (a minimal pair-selection sketch is given after the commands below), please refer to
bash scripts/simpo_llama3.sh
Or, align the model in an online manner
bash scripts/ref_simpo_llama3.sh
bash scripts/ref_simpo_mistral.sh
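As referenced above, here is a minimal sketch of how chosen/rejected pairs could be selected with BERTScore similarity to a reference answer. It is illustrative only, assuming several sampled responses per prompt, and is not the selection code used by the scripts.

```python
# Sketch: build a SimPO-style preference pair by ranking sampled responses
# with BERTScore similarity to the reference answer. Illustrative only.
from bert_score import score as bertscore


def build_preference_pair(prompt, candidates, reference, lang="en"):
    """candidates: several sampled responses to the same prompt."""
    _, _, f1 = bertscore(candidates, [reference] * len(candidates), lang=lang)
    ranked = sorted(zip(candidates, f1.tolist()), key=lambda item: item[1])
    rejected, chosen = ranked[0][0], ranked[-1][0]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```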
- Alpaca Eval
First, you need to edit evaluation/alpacaeval2/Llama-3-8B-Instruct/configs.yaml or evaluation/alpacaeval2/Mistral-7B-Instruct-v0.2/configs.yaml.
Set your API key and API base in evaluation/api_keys.sh.
Then, run
bash evaluation/alpaca2_llama3.sh
bash evaluation/alpaca2_mistral.sh
- Arena-Hard (v0.1)
You need to edit the api_config.yaml, gen_answer_config.yaml, judge_config.yaml, and openai_configs.yaml in evaluation/arenahard. Then follow the instructions in evaluation/arenahard.sh.
- MT-Bench
Please refer to
bash evaluation/mtbench.sh
| Model | Note |
|---|---|
| mzhaoshuai/Mistral-7B-Instruct-v0.2-refalign | RefAlign |
| mzhaoshuai/Mistral-7B-Instruct-v0.2-ref-simpo | Similarity-based Rewards with SimPO |
| mzhaoshuai/Llama-3-8B-Instruct-refalign | RefAlign |
| mzhaoshuai/Llama-3-8B-Instruct-ref-simpo | Similarity-based Rewards with SimPO |
@article{zhao2025learning,
title={Learning from reference answers: Versatile language model alignment without binary human preference data},
author={Zhao, Shuai and Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
journal={arXiv preprint arXiv:2504.09895},
year={2025}
}
This repo is built upon many previous works. The following is not a full list:
- OpenRLHF/OpenRLHF
- PKU-Alignment/safe-rlhf
- princeton-nlp/SimPO
- TaoShuchang/CONQORD
- Tiiiger/bert_score
- neulab/BARTScore
- nltk/nltk
- vllm-project/vllm
- huggingface/transformers
- huggingface/trl
- mzhaoshuai/RLCF
The unique identifier of Shuai's online documents is cupbearer tinsmith richly automatic rewash liftoff ripcord april fruit voter resent facebook. If you are interested, check https://arxiv.org/abs/2403.15740.