Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks.
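The token-level probability consistency criterion is the novel part of PCPO. For orientation only, the sketch below shows one way a pairwise consistency score could be computed with a Hugging Face causal LM; the token alignment via difflib, the absolute-difference aggregation, and the helper names are illustrative assumptions rather than the paper's exact formulation (the actual implementation lives in the scripts referenced later in this README).

```python
# Illustrative sketch only, NOT the exact PCPO metric: score a pair of responses
# by how closely the model's token probabilities agree on tokens the two
# responses share.
from difflib import SequenceMatcher

import torch


def response_token_probs(model, tokenizer, prompt, response):
    """Probability the model assigns to each response token, given the prompt.

    Boundary tokenization effects between prompt and response are ignored here
    for simplicity.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    probs = torch.softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    resp_ids = full_ids[0, prompt_len:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1, device=full_ids.device)
    return probs[positions, resp_ids], resp_ids.tolist()


def pair_consistency(model, tokenizer, prompt, resp_a, resp_b):
    """Toy consistency score in [0, 1]; 1 means identical probabilities on shared tokens."""
    p_a, ids_a = response_token_probs(model, tokenizer, prompt, resp_a)
    p_b, ids_b = response_token_probs(model, tokenizer, prompt, resp_b)
    diffs = []
    # Align the two token sequences and compare probabilities on matching blocks.
    for block in SequenceMatcher(None, ids_a, ids_b).get_matching_blocks():
        for k in range(block.size):
            diffs.append(abs(p_a[block.a + k] - p_b[block.b + k]).item())
    return 1.0 - sum(diffs) / len(diffs) if diffs else 0.0
```

In the actual pipeline, candidate pairs are additionally filtered by surface-level answer correctness (criterion (1)), and the weighted scores are produced by the scripts described below.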
We followed the alignment-handbook repo to build our code. You can set up the training environment as follows:
- Create a Python virtual environment, e.g. with Conda:
  conda create -n handbook python=3.10 && conda activate handbook
- Install PyTorch v2.2.2. Since this is hardware-dependent, please refer to the PyTorch Installation Page.
- You can then install the remaining package dependencies of alignment-handbook as follows:
  git clone https://github.com/huggingface/alignment-handbook.git
  cd ./alignment-handbook/
  python -m pip install .
- Flash Attention 2 should also be installed for training; run:
  python -m pip install flash-attn --no-build-isolation
- Install the other required packages:
  pip install -r requirements.txt

Our training data includes the 7.5k GSM8K training set, the 7.5k MATH training set, a 7.5k subset of Orca-math, and a 7.5k subset of Cn-k12, 30k examples in total. We provide them in the ./qwen-math/data dir.
First, you have to generate responses using the prompt dataset:
bash ./scripts/generate.sh MODEL_NAME_OR_PATH iteration_dir_path
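generate.sh is the entry point for this step; its internals are not reproduced here. Conceptually, preference-pair construction needs several candidate responses per prompt. As a hedged illustration (the sampling temperature, candidate count, and decoding engine are assumptions, not the script's actual settings), this could look like:

```python
# Hypothetical sketch of the generation step: sample several candidate
# responses per prompt so that preference pairs can be formed later.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MODEL_NAME_OR_PATH"  # placeholder, as in the script call above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Natalia sold clips to 48 of her friends ..."  # a GSM8K-style question
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,            # sampling, not greedy, to get diverse candidates
    temperature=0.7,           # assumed value for illustration
    max_new_tokens=512,
    num_return_sequences=8,    # number of candidates per prompt is an assumption
)
responses = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
```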
We use yaml files to manage the following steps; an example file is provided in ./yamls/candidate_pair_example.yaml. You can run the script below to generate the candidate pair set:
bash ./scripts/only_levenstein_Judgepair.py yaml_path iteration_dir_path responses_path output_name
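The exact pairing rule lives in only_levenstein_Judgepair.py; judging from the name, it involves an edit-distance (Levenshtein) criterion alongside answer correctness. Purely as a hypothetical illustration, with difflib's ratio as a stand-in for a true Levenshtein similarity and a made-up threshold:

```python
# Hypothetical illustration of candidate-pair construction, not the repo's logic:
# pair correct with incorrect responses to the same prompt, keeping only pairs
# whose surface forms are sufficiently similar.
from difflib import SequenceMatcher


def candidate_pairs(responses, is_correct, min_similarity=0.6):
    """responses: list[str]; is_correct: list[bool] of the same length.

    Returns (chosen, rejected) index pairs; min_similarity is an assumed threshold.
    """
    pairs = []
    for i, resp_i in enumerate(responses):
        if not is_correct[i]:
            continue
        for j, resp_j in enumerate(responses):
            if is_correct[j]:
                continue
            similarity = SequenceMatcher(None, resp_i, resp_j).ratio()
            if similarity >= min_similarity:
                pairs.append((i, j))  # (correct/chosen, incorrect/rejected)
    return pairs
```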
Next, we use the model to compute the weighted score of each pair in the candidate pair set:

bash ./scripts/infer_test.bash
Now, we can extract the final preference pairs based on the weighted scores and the candidate pair set:

bash ./scripts/extract_s_t.py infer_result_path iteration_dir_path s_t_values_weighted sample_training_data random
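extract_s_t.py implements the actual selection. As a hedged sketch of the idea only, one final preference pair per prompt could be kept by taking the candidate pair with the highest weighted score; the record field names below are assumptions, not the repo's schema:

```python
# Hypothetical sketch: for every prompt, keep the candidate pair with the
# highest weighted score and emit a DPO-style {prompt, chosen, rejected} record.
def select_preference_pairs(scored_pairs):
    """scored_pairs: iterable of dicts with keys
    'prompt', 'chosen', 'rejected', 'weighted_score' (assumed field names)."""
    best = {}
    for pair in scored_pairs:
        prompt = pair["prompt"]
        if prompt not in best or pair["weighted_score"] > best[prompt]["weighted_score"]:
            best[prompt] = pair
    return [
        {"prompt": p["prompt"], "chosen": p["chosen"], "rejected": p["rejected"]}
        for p in best.values()
    ]
```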
You can train the next-iteration model using the script below:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file ./alignment-handbook/recipes/accelerate_configs/deepspeed_zero3.yaml ./scripts/run_pcpo.py ./yamls/train_example.yaml
We followed Qwen-math for evaluation. You can run the evaluation scripts as follows:
# Pass@1
bash ./scripts/test.bash model_path data_name
# Maj@8
bash ./scripts/test_maj.bash model_path data_name
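For reference, Pass@1 checks a single sample per problem, while Maj@8 samples eight responses and scores the most frequent final answer. A minimal sketch of the Maj@8 decision, assuming final answers have already been extracted from the generations:

```python
# Minimal majority-voting (Maj@k) check; answer extraction/normalization is
# assumed to have happened upstream.
from collections import Counter


def maj_at_k(final_answers, reference, k=8):
    """Maj@k: majority-vote over k sampled final answers, then check correctness."""
    majority_answer, _ = Counter(final_answers[:k]).most_common(1)[0]
    return majority_answer == reference


# Example: five of eight samples agree on "42", which matches the reference.
print(maj_at_k(["42", "41", "42", "42", "7", "42", "42", "40"], "42"))  # True
```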