This repository contains code for Model Predictive Prompt Selection (MoPPS), a framework that predicts prompt difficulty online to accelerate reinforcement learning (RL) finetuning of Large Reasoning Models.
Ensure you have CUDA ≥ 12.4, then run:
```bash
bash prepare.sh
```

This script installs all required packages and dependencies.
We support multiple reasoning tasks. Run the following scripts to preprocess each dataset:
```bash
# Mathematics dataset
python recipe/ours/data_preprocess/math_dataset.py --local_dir='./data/math'

# Mathematics evaluation benchmarks from DeepScaleR
python recipe/ours/data_preprocess/deepscaler/deepscaler_dataset.py --local_dir='./data/deepscaler'

# Countdown-34
python recipe/ours/data_preprocess/countdown.py --local_dir='./data/countdown3to4'

# Countdown-4
python recipe/ours/data_preprocess/countdown4.py --local_dir='./data/countdown4'

# Geometry3k
python recipe/ours/data_preprocess/geo3k.py --local_dir='./data/geo3k'
```

You can download models from Hugging Face as follows (example shown with DeepSeek-R1-Distill-Qwen-1.5B):
```bash
huggingface-cli download --resume-download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir models/DeepSeek-R1-Distill-Qwen-1.5B
```

Tip: You can change `--local-dir` to your own model path. Be sure to match it with your training configs.
All training scripts are located in `recipe/ours/scripts/`.
These include task-specific scripts for launching MoPPS and baseline methods with different backbones and datasets.
Below is an example of how to launch MoPPS training on the Countdown task with Qwen2.5-3B:
```bash
bash recipe/ours/scripts/countdown/cd_verl_3b_topk_noinit.sh
```
Model Predictive Prompt Selection (MoPPS) is a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling on a constructed multi-armed bandit, enabling sample-efficient and adaptive prompt selection.
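For intuition, here is a minimal, self-contained toy sketch (not part of the repository; prompt counts, the target rate, and the outcomes are illustrative) of the underlying idea: each prompt keeps a Beta posterior over its success rate, Thompson sampling draws a plausible rate per prompt, and the prompts whose draws are closest to a target difficulty are selected and then used to update the posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: Beta posteriors over the success rates of 6 prompts.
alpha = np.ones(6)   # prior pseudo-successes
beta = np.ones(6)    # prior pseudo-failures
target = 0.5         # target success rate (intermediate difficulty)

# Thompson sampling: draw one plausible success rate per prompt ...
sampled_r = rng.beta(alpha, beta)
# ... and keep the B prompts whose draws are closest to the target.
selected = np.argsort((sampled_r - target) ** 2)[:2]

# After rolling out the selected prompts, update their posteriors
# with the observed successes / failures.
n_rollout = 8
observed_success_rate = np.array([0.25, 0.75])  # hypothetical outcomes
alpha[selected] += observed_success_rate * n_rollout
beta[selected] += (1 - observed_success_rate) * n_rollout
```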
The main implementation is in `mopps.py`, featuring two key operations:
We formulate online prompt selection as a sequential decision-making problem and solve it with a dynamic Bernoulli bandit:
Recursive Bayesian Update with Temporal Discounting:
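Concretely (reconstructed from the implementation below, with $\gamma$ the decay ratio, $\alpha_0,\beta_0$ the prior parameters, $s_i$ the observed success rate of prompt $i$, and $n$ the number of rollouts per prompt), each selected prompt's posterior is discounted toward the prior and then incremented by the latest rollout outcomes:

$$
\alpha_i \leftarrow \gamma\,\alpha_i + (1-\gamma)\,\alpha_0 + s_i\, n,\qquad
\beta_i \leftarrow \gamma\,\beta_i + (1-\gamma)\,\beta_0 + (1 - s_i)\, n
$$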
Implementation:
```python
def train(self, batch_candidates_dict, y):
    if self.no_update:
        return None, None, None
    indices = batch_candidates_dict['index']
    for idx, s in zip(indices, y):
        idx = str(idx)
        # Number of rollouts per prompt (defaults to 8 if not configured).
        n_rollout = self.args.actor_rollout_ref.rollout.n if self.args.actor_rollout_ref.rollout.n > 1 else 8
        # Discount the old posterior toward the prior, then add the
        # observed successes / failures from the latest rollouts.
        self.alpha[idx] = (self.alpha[idx] * self.decay_ratio +
                           self.prior_alpha * (1 - self.decay_ratio) +
                           s * n_rollout)
        self.beta[idx] = (self.beta[idx] * self.decay_ratio +
                          self.prior_beta * (1 - self.decay_ratio) +
                          (1 - s) * n_rollout)
    return None, None, None
```

The posterior estimate of a prompt's success rate correlates strongly with the ground truth. We adopt Thompson Sampling for its natural exploration-exploitation trade-off:
Fast Success Rate Estimates from Approximate Posteriors:
```python
sampled_r = np.random.beta(local_alpha, local_beta)
```

Top-B Selection Strategy:
```python
distances = (sampled_r - target_mu) ** 2
sampled_index = np.argsort(distances)[:self.real_batch_size]
```

💡 Note: MoPPS can easily integrate with alternative selection strategies, e.g., threshold filtering.
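As one example, the hypothetical sketch below (not from the repository; the bounds, helper name, and fallback behaviour are illustrative assumptions) shows how a threshold-filtering variant could replace the top-B rule by keeping only prompts whose sampled success rate falls inside a target band:

```python
import numpy as np

def threshold_filter(sampled_r, low=0.2, high=0.8, batch_size=None):
    """Keep prompts whose Thompson-sampled success rate lies in [low, high]."""
    keep = np.where((sampled_r >= low) & (sampled_r <= high))[0]
    if batch_size is not None:
        if len(keep) >= batch_size:
            # More candidates than needed: subsample uniformly at random.
            keep = np.random.choice(keep, size=batch_size, replace=False)
        else:
            # Too few candidates: fall back to the prompts closest to the band centre.
            center = (low + high) / 2
            keep = np.argsort((sampled_r - center) ** 2)[:batch_size]
    return keep
```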
MoPPS can be seamlessly integrated into your RL training pipeline.
Key Integration Points:
- 📊 Before rollout: call `sample_batch()` to select prompts
- 🔄 After reward computation: call `train()` to update the Bayesian posteriors
```python
for epoch in range(self.config.trainer.total_epochs):
    for batch_dict in self.train_dataloader:
        # Step 1: Active pre-rollout prompt selection
        if self.task_sampler is not None:
            batch_dict, acquisition_score = self.task_sampler.sample_batch(batch_dict)

        # ... (rollout generation & evaluation)

        # Step 2: Update posterior with observed success rates
        ts_loss, ts_recon_loss, ts_kl_loss = self.task_sampler.train(
            batch_dict,
            success_rate
        )

        # ... (RL training)
```

If you find this work useful for your research, please cite our paper:
```bibtex
@article{qu2025can,
  title={Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?},
  author={Qu, Yun and Wang, Qi and Mao, Yixiu and Hu, Vincent Tao and Ommer, Bj{\"o}rn and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2507.04632},
  year={2025}
}
```

This work is inspired by the following prior research:
```bibtex
@article{wang2025model,
  title={Model predictive task sampling for efficient and robust adaptation},
  author={Wang, Qi and Xiao, Zehao and Mao, Yixiu and Qu, Yun and Shen, Jiayi and Lv, Yiqin and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2501.11039},
  year={2025}
}

@inproceedings{qu2025fast,
  title={Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments},
  author={Qu, Yun and Wang, Cheems and Mao, Yixiu and Lv, Yiqin and Ji, Xiangyang},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
```