Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Overview

LongMab-PO is a framework that treats context chunks as arms of a Multi-Armed Bandit (MAB). Its MAB rollout strategy identifies the most informative chunks of a given long context, uses them to sample high-quality and diverse responses, and builds preference data pairs from those responses for Direct Preference Optimization (DPO) training.

Setup

Use git clone to download this project:

git clone https://github.com/NEUIR/LongMab-PO.git
cd LongMab-PO

Create the environment for training and evaluation:

conda create -n longmab python=3.10
conda activate longmab
pip install -r requirements.txt

Training LongMab-PO

If you do not want to train the model, you can download the models trained with LongMab-PO and skip ahead to the Evaluation section.

If you want to use the ready-made synthetic preference data directly, you can download it from here and skip ahead to DPO Training.

1. Prepare the Training Data

You can follow SeaLong to synthesize the raw training data, or download the file from here and place it in the data/train_data/ directory. Each sample must contain the following four required fields:

{
  "id": "A unique identifier for the sample (int)",
  "input": "The input question (str)",
  "answer": "The ground truth answer to the question (str)",
  "context": "The synthesized long context (str)"
}     
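For reference, the snippet below is a minimal sketch that loads the training data and checks each sample for the four required fields. It assumes the data is stored as JSON Lines; the file name train.jsonl is a placeholder for the file you downloaded or synthesized.

import json

REQUIRED_FIELDS = {"id", "input", "answer", "context"}

# "train.jsonl" is a placeholder; point this at your downloaded or synthesized file.
with open("data/train_data/train.jsonl", "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

for sample in samples:
    missing = REQUIRED_FIELDS - sample.keys()
    assert not missing, f"sample {sample.get('id')} is missing fields: {missing}"

print(f"Loaded {len(samples)} samples with all required fields.")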

2. Run the LongMab-PO Pipeline

(1) Generate Probe CoT: You should download MiniCPM-Embedding, which is used to calculate the initial importance score of each chunk based on the probe CoT.

cd scripts
bash gen_probe.sh
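Conceptually, this step scores each chunk by its embedding similarity to the probe CoT. The snippet below is a minimal sketch of that scoring, assuming mean-pooled embeddings from MiniCPM-Embedding loaded via Hugging Face Transformers; the prompt format, pooling, and chunking used in gen_probe.sh may differ.

import torch
from transformers import AutoModel, AutoTokenizer

# Model from the README; the loading and pooling details here are simplifying assumptions.
model_name = "openbmb/MiniCPM-Embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()

def embed(texts):
    # Mean pooling over non-padding tokens; the released script may pool differently.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

probe_cot = "..."                    # probe CoT generated for the question
chunks = ["chunk 1", "chunk 2"]      # the long context split into chunks

# Initial importance score of a chunk = cosine similarity to the probe CoT.
scores = (embed(chunks) @ embed([probe_cot]).T).squeeze(-1)
print(scores.tolist())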

(2) Run the Multi-Armed Bandit Rollout Process:

bash rollout.sh
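During the rollout, each chunk acts as an arm: a chunk is selected, a response is sampled conditioned on it, and the chunk's value estimate is updated from the response's reward. The snippet below is a minimal UCB-style sketch of that loop; generate_response and reward are hypothetical stand-ins, and the actual selection rule, reward signal, and generation logic in rollout.sh may differ.

import math
import random

def generate_response(chunk):
    # Hypothetical stand-in for sampling an LLM response conditioned on the selected chunk.
    return f"response grounded in: {chunk[:30]}"

def reward(response):
    # Hypothetical stand-in for scoring a response, e.g. against the ground-truth answer.
    return random.random()

def mab_rollout(chunks, init_scores, rounds=20, c=1.0):
    """Chunks-as-arms rollout: pick a chunk by UCB, sample a response
    conditioned on it, and update that chunk's value with the reward."""
    counts = [0] * len(chunks)
    values = list(init_scores)  # initialized from the probe-CoT similarity scores
    responses = []
    for t in range(1, rounds + 1):
        # UCB score: current value estimate plus an exploration bonus.
        ucb = [
            values[i] + c * math.sqrt(math.log(t) / counts[i]) if counts[i] else float("inf")
            for i in range(len(chunks))
        ]
        arm = max(range(len(chunks)), key=lambda i: ucb[i])
        response = generate_response(chunks[arm])
        r = reward(response)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # running-mean value update
        responses.append((response, r))
    return responses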

(3) Construct Preference Data Pairs:

bash construct_po_data.sh
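The pairing step takes the scored responses collected during the rollout and turns them into chosen/rejected pairs for DPO. The snippet below is a minimal sketch of that idea; the field names are illustrative and need not match the actual schema used by construct_po_data.sh.

def build_preference_pairs(samples):
    """For each question, pair the highest- and lowest-scored responses
    as chosen/rejected examples for DPO training."""
    pairs = []
    for sample in samples:
        ranked = sorted(sample["responses"], key=lambda r: r["score"], reverse=True)
        if len(ranked) < 2 or ranked[0]["score"] == ranked[-1]["score"]:
            continue  # skip questions without a usable preference gap
        pairs.append({
            "prompt": sample["input"],
            "chosen": ranked[0]["text"],
            "rejected": ranked[-1]["text"],
        })
    return pairs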

3. DPO Training

You can quickly train the model with the LLaMA-Factory framework; we provide the corresponding YAML configuration files. Please refer to LLaMA-Factory for environment installation and configuration.

cd scripts
bash llama3_dpo.sh
bash qwen2_dpo.sh

Evaluation

We provide the evaluation datasets here, or you can download them from LongBench and InfiniteBench. Place the datasets in the data/test_data/ directory.

cd scripts
bash eval.sh

Acknowledgement

We gratefully acknowledge the following projects that LongMab-PO builds upon:

Citation

We appreciate your citation if you find our paper relevant and useful to your research!

@misc{duan2025chunksarmsmultiarmedbanditguided,
      title={Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization}, 
      author={Shaohua Duan and Xinze Li and Zhenghao Liu and Xiaoyuan Yi and Yukun Yan and Shuo Wang and Yu Gu and Ge Yu and Maosong Sun},
      year={2025},
      eprint={2508.13993},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.13993}, 
}

Contact Us

If you have questions, suggestions, or bug reports, please email us; we will do our best to help you.
