Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks.
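The token-level probability consistency criterion is the novel part of PCPO. For orientation only, the sketch below shows one way a pairwise consistency score could be computed with a Hugging Face causal LM; the token alignment via difflib, the absolute-difference aggregation, and the helper names are illustrative assumptions rather than the paper's exact formulation (the actual implementation lives in the scripts referenced later in this README).

```python
# Illustrative sketch only, NOT the exact PCPO metric: score a pair of responses
# by how closely the model's token probabilities agree on tokens the two
# responses share.
from difflib import SequenceMatcher

import torch


def response_token_probs(model, tokenizer, prompt, response):
    """Probability the model assigns to each response token, given the prompt.

    Boundary tokenization effects between prompt and response are ignored here
    for simplicity.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    probs = torch.softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    resp_ids = full_ids[0, prompt_len:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1, device=full_ids.device)
    return probs[positions, resp_ids], resp_ids.tolist()


def pair_consistency(model, tokenizer, prompt, resp_a, resp_b):
    """Toy consistency score in [0, 1]; 1 means identical probabilities on shared tokens."""
    p_a, ids_a = response_token_probs(model, tokenizer, prompt, resp_a)
    p_b, ids_b = response_token_probs(model, tokenizer, prompt, resp_b)
    diffs = []
    # Align the two token sequences and compare probabilities on matching blocks.
    for block in SequenceMatcher(None, ids_a, ids_b).get_matching_blocks():
        for k in range(block.size):
            diffs.append(abs(p_a[block.a + k] - p_b[block.b + k]).item())
    return 1.0 - sum(diffs) / len(diffs) if diffs else 0.0
```

In the actual pipeline, candidate pairs are additionally filtered by surface-level answer correctness (criterion (1)), and the weighted scores are produced by the scripts described below.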
We followed the alignment-handbook repo to build our code. You can set up the training environment as follows:
- Create a Python virtual environment, e.g. with Conda:
  conda create -n handbook python=3.10 && conda activate handbook
- Install PyTorch v2.2.2. Since this is hardware-dependent, please refer to the PyTorch Installation Page.
- You can then install the remaining package dependencies of alignment-handbook as follows:
  git clone https://github.com/huggingface/alignment-handbook.git
  cd ./alignment-handbook/
  python -m pip install .
- Flash Attention 2 should also be installed for training; run:
  python -m pip install flash-attn --no-build-isolation
- Install the other required packages:
  pip install -r requirements.txt

Our training data includes the 7.5k GSM8K training set, the 7.5k MATH training set, a 7.5k subset of Orca-math, and a 7.5k subset of Cn-k12, 30k examples in total. We provide them in the ./qwen-math/data dir.
First, you have to generate responses using the prompt dataset:
bash ./scripts/generate.sh MODEL_NAME_OR_PATH iteration_dir_path
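generate.sh is the entry point for this step; its internals are not reproduced here. Conceptually, preference-pair construction needs several candidate responses per prompt. As a hedged illustration (the sampling temperature, candidate count, and decoding engine are assumptions, not the script's actual settings), this could look like:

```python
# Hypothetical sketch of the generation step: sample several candidate
# responses per prompt so that preference pairs can be formed later.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MODEL_NAME_OR_PATH"  # placeholder, as in the script call above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Natalia sold clips to 48 of her friends ..."  # a GSM8K-style question
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,            # sampling, not greedy, to get diverse candidates
    temperature=0.7,           # assumed value for illustration
    max_new_tokens=512,
    num_return_sequences=8,    # number of candidates per prompt is an assumption
)
responses = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
```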
We use yaml files to manage the following steps; an example file is provided in ./yamls/candidate_pair_example.yaml. You can run the script below to generate the candidate pair set:
bash ./scripts/only_levenstein_Judgepair.py yaml_path iteration_dir_path responses_path output_name
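The exact pairing rule lives in only_levenstein_Judgepair.py; judging from the name, it involves an edit-distance (Levenshtein) criterion alongside answer correctness. Purely as a hypothetical illustration, with difflib's ratio as a stand-in for a true Levenshtein similarity and a made-up threshold:

```python
# Hypothetical illustration of candidate-pair construction, not the repo's logic:
# pair correct with incorrect responses to the same prompt, keeping only pairs
# whose surface forms are sufficiently similar.
from difflib import SequenceMatcher


def candidate_pairs(responses, is_correct, min_similarity=0.6):
    """responses: list[str]; is_correct: list[bool] of the same length.

    Returns (chosen, rejected) index pairs; min_similarity is an assumed threshold.
    """
    pairs = []
    for i, resp_i in enumerate(responses):
        if not is_correct[i]:
            continue
        for j, resp_j in enumerate(responses):
            if is_correct[j]:
                continue
            similarity = SequenceMatcher(None, resp_i, resp_j).ratio()
            if similarity >= min_similarity:
                pairs.append((i, j))  # (correct/chosen, incorrect/rejected)
    return pairs
```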
Next, we use the model to compute the weighted score of each pair in the candidate pair set:

bash ./scripts/infer_test.bash
Now, we can extract the final preference pairs based on the weighted scores and the candidate pair set:

bash ./scripts/extract_s_t.py infer_result_path iteration_dir_path s_t_values_weighted sample_training_data random
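extract_s_t.py implements the actual selection. As a hedged sketch of the idea only, one final preference pair per prompt could be kept by taking the candidate pair with the highest weighted score; the record field names below are assumptions, not the repo's schema:

```python
# Hypothetical sketch: for every prompt, keep the candidate pair with the
# highest weighted score and emit a DPO-style {prompt, chosen, rejected} record.
def select_preference_pairs(scored_pairs):
    """scored_pairs: iterable of dicts with keys
    'prompt', 'chosen', 'rejected', 'weighted_score' (assumed field names)."""
    best = {}
    for pair in scored_pairs:
        prompt = pair["prompt"]
        if prompt not in best or pair["weighted_score"] > best[prompt]["weighted_score"]:
            best[prompt] = pair
    return [
        {"prompt": p["prompt"], "chosen": p["chosen"], "rejected": p["rejected"]}
        for p in best.values()
    ]
```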
You can train the next-iteration model using the script below:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file ./alignment-handbook/recipes/accelerate_configs/deepspeed_zero3.yaml ./scripts/run_pcpo.py ./yamls/train_example.yaml
We followed Qwen-math for evaluation. You can run the evaluation scripts as follows:
# Pass@1
bash ./scripts/test.bash model_path data_name
# Maj@8
bash ./scripts/test_maj.bash model_path data_name
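For reference, Pass@1 checks a single sample per problem, while Maj@8 samples eight responses and scores the most frequent final answer. A minimal sketch of the Maj@8 decision, assuming final answers have already been extracted from the generations:

```python
# Minimal majority-voting (Maj@k) check; answer extraction/normalization is
# assumed to have happened upstream.
from collections import Counter


def maj_at_k(final_answers, reference, k=8):
    """Maj@k: majority-vote over k sampled final answers, then check correctness."""
    majority_answer, _ = Counter(final_answers[:k]).most_common(1)[0]
    return majority_answer == reference


# Example: five of eight samples agree on "42", which matches the reference.
print(maj_at_k(["42", "41", "42", "42", "7", "42", "42", "40"], "42"))  # True
```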