This repository contains the implementation of UP-RLHF, an uncertainty-aware framework for aligning large language models, used in the paper "Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles".
It is recommended to run the code within a conda virtual environment. Create the environment with:
conda create -n UP-RLHF python=3.7
Activate the virtual environment by running:
conda activate UP-RLHF
Install the following dependencies:
- datasets >= 2.8.0
- torch >= 1.12.0
- deepspeed >= 0.9.0
- transformers == 4.28.1
Install the dependencies listed above, then install the PEFT package bundled in this repository by running the following commands in the root directory of this repository (inside the virtual environment):
cd peft
pip install -e .
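If the pinned dependency versions listed above are not already available in the environment, one way to install them directly with pip is shown below (the exact torch build you need depends on your CUDA setup):
# Install the pinned dependency versions (choose a torch build matching your CUDA version).
pip install "datasets>=2.8.0" "torch>=1.12.0" "deepspeed>=0.9.0" "transformers==4.28.1"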
# Move into the first step of the pipeline.
cd Step1_SFT
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b.sh
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_llama2_7b.sh

# Move into the second step of the pipeline.
cd training/Step2_Diverse-reward-LoRA-ensemble
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_Diverse-LoRA-ensebmle_opt-350m.py 1 1
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Diverse-LoRA-ensebmle_Llama2-7b.py 0.075 0.1

Before proceeding with RL fine-tuning, you can deploy the reward model services on other accessible nodes to conserve GPU memory. Make sure to set your IP address and port accordingly.
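For example, once a service is up you can check that it is reachable from the training node (the address and port below are hypothetical placeholders; use the values you configured):
# Hypothetical host and port; replace with your own settings.
nc -zv 10.0.0.2 5000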
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF/
# Deploy the gold reward model for the evaluation of summarization task.
python RM_server/Summarization_RMsever_GPT-J-6B.py
# Deploy SteamSHP-XL for the evaluation of helpful dialogue task.
python RM_server/Helpful_Dialogue_RMserver_SteamSHP_xl.py
# Deploy Diverse Reward LoRA Ensemble for the training of helpful dialogue task.
python RM_server/Helpful_Dialogue_RMserver_Llama2_Diverse-LoRA-Ensemble.py

With the reward model services deployed, the summarization task can be run on 4x32G V100 GPUs, and the helpful dialogue task on 8x80G A100 GPUs.
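If a service runs on a separate node, one common way to keep it alive during training is to launch it in the background and capture its logs, for example:
# Run a reward model server in the background and write its output to a log file.
nohup python RM_server/Helpful_Dialogue_RMserver_SteamSHP_xl.py > rm_server.log 2>&1 &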
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_350m_UP-RLHF.sh 0.2 0 0 kl1 0.001 0 offline
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_7B_UP-RLHF.sh 0.1 0 0 kl1 0.02 kl2 offline

To run the vanilla RLHF baseline (without the uncertainty penalty):
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_350m_RLHF.sh 0.2 0 0 kl1 0.001 0 offline
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_7B_RLHF.sh 0.1 0 0 kl1 0.02 kl2 offline

To run the DPO baseline:
cd DPO_IPO
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b.sh 0.5
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B.sh 0.1

To run the IPO baseline:
cd DPO_IPO
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b_IPO.sh 0.5
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_IPO.sh 0.1

To train the SFT baselines:
# Move into the first step of the pipeline.
cd Step1_SFT
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_SFT-baseline.sh
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_llama2_7b_SFT-baseline.sh

This implementation builds on the following repositories:
- DeepSpeed Chat: https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat
- Parameter-Efficient Fine-Tuning (PEFT): https://github.com/huggingface/peft
- Official DPO implementation: https://github.com/eric-mitchell/direct-preference-optimization