Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Overview

LongMab-PO is a framework that treats context chunks as arms of a Multi-Armed Bandit (MAB). Its MAB rollout strategy identifies the most informative chunks of a given long context, uses them to sample high-quality and diverse responses, and builds preference data pairs from those responses for Direct Preference Optimization (DPO) training.

Setup

Use git clone to download this project:

git clone https://github.com/NEUIR/LongMab-PO.git
cd LongMab-PO

Create the environment for training and evaluation:

conda create -n longmab python=3.10
conda activate longmab
pip install -r requirements.txt

Training LongMab-PO

If you do not want to train the model, you can download the models trained with LongMab-PO and skip ahead to the Evaluation section.

If you want to use the ready-made synthetic preference data directly, you can download it from here and skip ahead to DPO Training.

1. Prepare the Training Data

You can follow SeaLong to synthesize the raw training data, or download the file from here and place it in the data/train_data/ directory. Each sample must contain the following four required fields:

{
  "id": "A unique identifier for the sample (int)",
  "input": "The input question (str)",
  "answer": "The ground truth answer to the question (str)",
  "context": "The synthesized long context (str)"
}     
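For reference, the snippet below is a minimal sketch that loads the training data and checks each sample for the four required fields. It assumes the data is stored as JSON Lines; the file name train.jsonl is a placeholder for the file you downloaded or synthesized.

import json

REQUIRED_FIELDS = {"id", "input", "answer", "context"}

# "train.jsonl" is a placeholder; point this at your downloaded or synthesized file.
with open("data/train_data/train.jsonl", "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

for sample in samples:
    missing = REQUIRED_FIELDS - sample.keys()
    assert not missing, f"sample {sample.get('id')} is missing fields: {missing}"

print(f"Loaded {len(samples)} samples with all required fields.")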

2. Run the LongMab-PO Pipeline

(1) Generate Probe CoT: You should download MiniCPM-Embedding, which is used to calculate the initial importance score of each chunk based on the probe CoT.

cd scripts
bash gen_probe.sh
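Conceptually, this step scores each chunk by its embedding similarity to the probe CoT. The snippet below is a minimal sketch of that scoring, assuming mean-pooled embeddings from MiniCPM-Embedding loaded via Hugging Face Transformers; the prompt format, pooling, and chunking used in gen_probe.sh may differ.

import torch
from transformers import AutoModel, AutoTokenizer

# Model from the README; the loading and pooling details here are simplifying assumptions.
model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()

def embed(texts):
    # Mean pooling over non-padding tokens; the released script may pool differently.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

probe_cot = "..."                    # probe CoT generated for the question
chunks = ["chunk 1", "chunk 2"]      # the long context split into chunks

# Initial importance score of a chunk = cosine similarity to the probe CoT.
scores = (embed(chunks) @ embed([probe_cot]).T).squeeze(-1)
print(scores.tolist())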

(2) Run the Multi-Armed Bandit Rollout Process:

bash rollout.sh
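During the rollout, each chunk acts as an arm: a chunk is selected, a response is sampled conditioned on it, and the chunk's value estimate is updated from the response's reward. The snippet below is a minimal UCB-style sketch of that loop; generate_response and reward are hypothetical stand-ins, and the actual selection rule, reward signal, and generation logic in rollout.sh may differ.

import math
import random

def generate_response(chunk):
    # Hypothetical stand-in for sampling an LLM response conditioned on the selected chunk.
    return f"response grounded in: {chunk[:30]}"

def reward(response):
    # Hypothetical stand-in for scoring a response, e.g. against the ground-truth answer.
    return random.random()

def mab_rollout(chunks, init_scores, rounds=20, c=1.0):
    """Chunks-as-arms rollout: pick a chunk by UCB, sample a response
    conditioned on it, and update that chunk's value with the reward."""
    counts = [0] * len(chunks)
    values = list(init_scores)  # initialized from the probe-CoT similarity scores
    responses = []
    for t in range(1, rounds + 1):
        # UCB score: current value estimate plus an exploration bonus.
        ucb = [
            values[i] + c * math.sqrt(math.log(t) / counts[i]) if counts[i] else float("inf")
            for i in range(len(chunks))
        ]
        arm = max(range(len(chunks)), key=lambda i: ucb[i])
        response = generate_response(chunks[arm])
        r = reward(response)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running-mean value update
        responses.append((response, r))
    return responses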

(3) Construct Preference Data Pairs:

bash construct_po_data.sh
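The pairing step takes the scored responses collected during the rollout and turns them into chosen/rejected pairs for DPO. The snippet below is a minimal sketch of that idea; the field names are illustrative and need not match the actual schema used by construct_po_data.sh.

def build_preference_pairs(samples):
    """For each question, pair the highest- and lowest-scored responses
    as chosen/rejected examples for DPO training."""
    pairs = []
    for sample in samples:
        ranked = sorted(sample["responses"], key=lambda r: r["score"], reverse=True)
        if len(ranked) < 2 or ranked[0]["score"] == ranked[-1]["score"]:
            continue  # skip questions without a usable preference gap
        pairs.append({
            "prompt": sample["input"],
            "chosen": ranked[0]["text"],
            "rejected": ranked[-1]["text"],
        })
    return pairs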

3. DPO Training

You can quickly train the model with the LLaMA-Factory framework; we provide the corresponding YAML configuration files. Please refer to LLaMA-Factory for environment installation and configuration.

cd scripts
bash llama3_dpo.sh
bash qwen2_dpo.sh

Evaluation

We provide the evaluation datasets here, or you can download them from LongBench and InfiniteBench. Place the datasets in the data/test_data/ directory.

cd scripts
bash eval.sh

Acknowledgement

We gratefully acknowledge the following projects that LongMab-PO builds upon:

Citation

We appreciate your citation if you find our paper relevant and useful to your research!

@misc{duan2025chunksarmsmultiarmedbanditguided,
      title={Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization}, 
      author={Shaohua Duan and Xinze Li and Zhenghao Liu and Xiaoyuan Yi and Yukun Yan and Shuo Wang and Yu Gu and Ge Yu and Maosong Sun},
      year={2025},
      eprint={2508.13993},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.13993}, 
}

Contact Us

If you have questions, suggestions, or bug reports, please email us; we will do our best to help you.
