This repository contains code for Model Predictive Prompt Selection (MoPPS), a framework that predicts prompt difficulty online to accelerate reinforcement learning (RL) finetuning of Large Reasoning Models.
Ensure you have CUDA ≥ 12.4, then run:
```bash
bash prepare.sh
```

This script installs all required packages and dependencies.
We support multiple reasoning tasks. Run the following scripts to preprocess each dataset:
```bash
# Mathematics dataset
python recipe/ours/data_preprocess/math_dataset.py --local_dir='./data/math'

# Mathematics evaluation benchmarks from DeepScaleR
python recipe/ours/data_preprocess/deepscaler/deepscaler_dataset.py --local_dir='./data/deepscaler'

# Countdown-34
python recipe/ours/data_preprocess/countdown.py --local_dir='./data/countdown3to4'

# Countdown-4
python recipe/ours/data_preprocess/countdown4.py --local_dir='./data/countdown4'

# Geometry3k
python recipe/ours/data_preprocess/geo3k.py --local_dir='./data/geo3k'
```

You can download models from Hugging Face as follows (example shown with DeepSeek-R1-Distill-Qwen-1.5B):
```bash
huggingface-cli download --resume-download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir models/DeepSeek-R1-Distill-Qwen-1.5B
```

Tip: You can change `--local-dir` to your own model path. Be sure to match it with your training configs.
All training scripts are located in `recipe/ours/scripts/`.
These include task-specific scripts for launching MoPPS and baseline methods with different backbones and datasets.
Below is an example of how to launch MoPPS training on the Countdown task with Qwen2.5-3B:
```bash
bash recipe/ours/scripts/countdown/cd_verl_3b_topk_noinit.sh
```
Model Predictive Prompt Selection (MoPPS) is a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling on a constructed multi-armed bandit, enabling sample-efficient and adaptive prompt selection.
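For intuition, here is a minimal, self-contained toy sketch (not part of the repository; prompt counts, the target rate, and the outcomes are illustrative) of the underlying idea: each prompt keeps a Beta posterior over its success rate, Thompson sampling draws a plausible rate per prompt, and the prompts whose draws are closest to a target difficulty are selected and then used to update the posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: Beta posteriors over the success rates of 6 prompts.
alpha = np.ones(6)   # prior pseudo-successes
beta = np.ones(6)    # prior pseudo-failures
target = 0.5         # target success rate (intermediate difficulty)

# Thompson sampling: draw one plausible success rate per prompt ...
sampled_r = rng.beta(alpha, beta)
# ... and keep the B prompts whose draws are closest to the target.
selected = np.argsort((sampled_r - target) ** 2)[:2]

# After rolling out the selected prompts, update their posteriors
# with the observed successes / failures.
n_rollout = 8
observed_success_rate = np.array([0.25, 0.75])  # hypothetical outcomes
alpha[selected] += observed_success_rate * n_rollout
beta[selected] += (1 - observed_success_rate) * n_rollout
```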
The main implementation is in `mopps.py`, featuring two key operations:
We formulate online prompt selection as a sequential decision-making problem and solve it with a dynamic Bernoulli bandit:
Recursive Bayesian Update with Temporal Discounting:
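Concretely (reconstructed from the implementation below, with $\gamma$ the decay ratio, $\alpha_0,\beta_0$ the prior parameters, $s_i$ the observed success rate of prompt $i$, and $n$ the number of rollouts per prompt), each selected prompt's posterior is discounted toward the prior and then incremented by the latest rollout outcomes:

$$
\alpha_i \leftarrow \gamma\,\alpha_i + (1-\gamma)\,\alpha_0 + s_i\, n,\qquad
\beta_i \leftarrow \gamma\,\beta_i + (1-\gamma)\,\beta_0 + (1 - s_i)\, n
$$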
Implementation:
```python
def train(self, batch_candidates_dict, y):
    if self.no_update:
        return None, None, None
    indices = batch_candidates_dict['index']
    for idx, s in zip(indices, y):
        idx = str(idx)
        # Number of rollouts per prompt (defaults to 8 if not configured).
        n_rollout = self.args.actor_rollout_ref.rollout.n if self.args.actor_rollout_ref.rollout.n > 1 else 8
        # Discount the old posterior toward the prior, then add the
        # observed successes / failures from the latest rollouts.
        self.alpha[idx] = (self.alpha[idx] * self.decay_ratio +
                           self.prior_alpha * (1 - self.decay_ratio) +
                           s * n_rollout)
        self.beta[idx] = (self.beta[idx] * self.decay_ratio +
                          self.prior_beta * (1 - self.decay_ratio) +
                          (1 - s) * n_rollout)
    return None, None, None
```

The posterior estimate of a prompt's success rate correlates strongly with the ground truth. We adopt Thompson Sampling for its natural exploration-exploitation trade-off:
Fast Success Rate Estimates from Approximate Posteriors:
```python
sampled_r = np.random.beta(local_alpha, local_beta)
```

Top-B Selection Strategy:
```python
distances = (sampled_r - target_mu) ** 2
sampled_index = np.argsort(distances)[:self.real_batch_size]
```

💡 Note: MoPPS can easily integrate with alternative selection strategies, e.g., threshold filtering.
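As one example, the hypothetical sketch below (not from the repository; the bounds, helper name, and fallback behaviour are illustrative assumptions) shows how a threshold-filtering variant could replace the top-B rule by keeping only prompts whose sampled success rate falls inside a target band:

```python
import numpy as np

def threshold_filter(sampled_r, low=0.2, high=0.8, batch_size=None):
    """Keep prompts whose Thompson-sampled success rate lies in [low, high]."""
    keep = np.where((sampled_r >= low) & (sampled_r <= high))[0]
    if batch_size is not None:
        if len(keep) >= batch_size:
            # More candidates than needed: subsample uniformly at random.
            keep = np.random.choice(keep, size=batch_size, replace=False)
        else:
            # Too few candidates: fall back to the prompts closest to the band centre.
            center = (low + high) / 2
            keep = np.argsort((sampled_r - center) ** 2)[:batch_size]
    return keep
```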
MoPPS can be seamlessly integrated into your RL training pipeline.
Key Integration Points:
- 📊 Before rollout: call `sample_batch()` to select prompts
- 🔄 After reward computation: call `train()` to update the Bayesian posteriors
```python
for epoch in range(self.config.trainer.total_epochs):
    for batch_dict in self.train_dataloader:
        # Step 1: Active pre-rollout prompt selection
        if self.task_sampler is not None:
            batch_dict, acquisition_score = self.task_sampler.sample_batch(batch_dict)

        # ... (rollout generation & evaluation)

        # Step 2: Update posterior with observed success rates
        ts_loss, ts_recon_loss, ts_kl_loss = self.task_sampler.train(
            batch_dict,
            success_rate
        )

        # ... (RL training)
```

If you find this work useful for your research, please cite our paper:
```bibtex
@article{qu2025can,
  title={Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?},
  author={Qu, Yun and Wang, Qi and Mao, Yixiu and Hu, Vincent Tao and Ommer, Bj{\"o}rn and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2507.04632},
  year={2025}
}
```

This work is inspired by the following prior research:
```bibtex
@article{wang2025model,
  title={Model predictive task sampling for efficient and robust adaptation},
  author={Wang, Qi and Xiao, Zehao and Mao, Yixiu and Qu, Yun and Shen, Jiayi and Lv, Yiqin and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2501.11039},
  year={2025}
}

@inproceedings{qu2025fast,
  title={Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments},
  author={Qu, Yun and Wang, Cheems and Mao, Yixiu and Lv, Yiqin and Ji, Xiangyang},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
```