This repository contains the implementation of UP-RLHF, an uncertainty-aware framework for aligning large language models, used in the paper "Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles".
It is recommended to run the code within a conda virtual environment. Create the environment with:
conda create -n UP-RLHF python=3.7
Activate the virtual environment by running:
conda activate UP-RLHF
Install the following dependencies:
- datasets >= 2.8.0
- torch >= 1.12.0
- deepspeed >= 0.9.0
- transformers == 4.28.1
Install the dependencies listed above, then install the PEFT package bundled in this repository by running the following commands in the root directory of this repository (inside the virtual environment):
cd peft
pip install -e .
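If the pinned dependency versions listed above are not already available in the environment, one way to install them directly with pip is shown below (the exact torch build you need depends on your CUDA setup):
# Install the pinned dependency versions (choose a torch build matching your CUDA version).
pip install "datasets>=2.8.0" "torch>=1.12.0" "deepspeed>=0.9.0" "transformers==4.28.1"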
# Move into the first step of the pipeline.
cd Step1_SFT
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b.sh
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_llama2_7b.sh

# Move into the second step of the pipeline.
cd training/Step2_Diverse-reward-LoRA-ensemble
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_Diverse-LoRA-ensebmle_opt-350m.py 1 1
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Diverse-LoRA-ensebmle_Llama2-7b.py 0.075 0.1

Before proceeding with RL fine-tuning, you can deploy the reward model services on other accessible nodes to conserve GPU memory. Make sure to set your IP address and port accordingly.
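For example, once a service is up you can check that it is reachable from the training node (the address and port below are hypothetical placeholders; use the values you configured):
# Hypothetical host and port; replace with your own settings.
nc -zv 10.0.0.2 5000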
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF/
# Deploy the gold reward model for the evaluation of summarization task.
python RM_server/Summarization_RMsever_GPT-J-6B.py
# Deploy SteamSHP-XL for the evaluation of helpful dialogue task.
python RM_server/Helpful_Dialogue_RMserver_SteamSHP_xl.py
# Deploy Diverse Reward LoRA Ensemble for the training of helpful dialogue task.
python RM_server/Helpful_Dialogue_RMserver_Llama2_Diverse-LoRA-Ensemble.py

With the reward model services deployed, the summarization task can be run on 4x32G V100 GPUs, and the helpful dialogue task on 8x80G A100 GPUs.
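If a service runs on a separate node, one common way to keep it alive during training is to launch it in the background and capture its logs, for example:
# Run a reward model server in the background and write its output to a log file.
nohup python RM_server/Helpful_Dialogue_RMserver_SteamSHP_xl.py > rm_server.log 2>&1 &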
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_350m_UP-RLHF.sh 0.2 0 0 kl1 0.001 0 offline
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_7B_UP-RLHF.sh 0.1 0 0 kl1 0.02 kl2 offline

To run the vanilla RLHF baseline (without the uncertainty penalty):
# Move into the final step of the pipeline
cd training/Step3_UP-RLHF
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_350m_RLHF.sh 0.2 0 0 kl1 0.001 0 offline
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_7B_RLHF.sh 0.1 0 0 kl1 0.02 kl2 offline

To run the DPO baseline:
cd DPO_IPO
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b.sh 0.5
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B.sh 0.1

To run the IPO baseline:
cd DPO_IPO
# Run the training script for the summarization task.
bash training_scripts/summarization/run_opt_1.3b_IPO.sh 0.5
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_Llama2_7B_IPO.sh 0.1

To train the SFT baselines:
# Move into the first step of the pipeline.
cd Step1_SFT
# Run the training script for the summarization task.
bash training_scripts/Summarization/run_opt_1.3b_SFT-baseline.sh
# Run the training script for the helpful dialogue task.
bash training_scripts/Helpful_Dialogue/run_llama2_7b_SFT-baseline.sh

This implementation builds on the following repositories:
- DeepSpeed Chat: https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat
- Parameter-Efficient Fine-Tuning (PEFT): https://github.com/huggingface/peft
- Official DPO implementation: https://github.com/eric-mitchell/direct-preference-optimization