[Website] • [Demo Dataset] • [Paper] • [GitHub] • [Twitter] • [Rednote]
Repo for "SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning"
Figure 1: 32B model performance across mainstream reasoning benchmarks and different domains.
- [2025/10/14] We release all code, including implementations for RL training and problem synthesis.
- [2025/09/18] SwS has been accepted to NeurIPS 2025! We welcome any discussions during the conference.
- [2025/06/13] We release all prompts used in the SwS framework in prompts.
- [2025/06/13] We update the demo set of synthetic problems from SwS in datasets, including 500 samples for each model and category. You can also find them in Demo Dataset.
- [2025/06/10] Our full code and datasets are under review by Microsoft and will be released upon approval.
- [2025/06/10] SwS paper, repo, website and demo datasets released.
Figure 2: An overview of our proposed weakness-driven problem synthesis framework, which aims to mitigate the model's reasoning limitations within the RLVR paradigm.
| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1 / Avg@32) | AIME25 (Avg@1 / Avg@32) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | 88.1 | 63.0 | 27.6 | 30.5 | 55.8 | 35.0 | 6.7 / 5.4 | 0.0 / 1.2 | 38.3 |
| Qwen2.5-7B-IT | 91.7 | 75.6 | 38.2 | 40.6 | 63.9 | 50.0 | 16.7 / 10.5 | 13.3 / 6.7 | 48.8 |
| Open-Reasoner-7B | 93.6 | 80.4 | 39.0 | 45.6 | 72.0 | 72.5 | 10.0 / 16.8 | 13.3 / 17.9 | 53.3 |
| SimpleRL-Base-7B | 90.8 | 77.2 | 35.7 | 41.0 | 66.2 | 62.5 | 13.3 / 14.8 | 6.7 / 6.7 | 49.2 |
| BaseRL-7B | 92.0 | 78.4 | 36.4 | 41.6 | 63.4 | 45.0 | 10.0 / 14.5 | 6.7 / 6.5 | 46.7 |
| SwS-7B | 93.9 | 82.6 | 41.9 | 49.6 | 71.7 | 67.5 | 26.7 / 18.3 | 20.0 / 18.5 | 56.7 |
| Δ (vs. BaseRL) | +1.9 | +4.2 | +5.5 | +8.0 | +8.3 | +22.5 | +16.7 / +3.8 | +13.3 / +12.0 | +10.0 |
| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1 / Avg@32) | AIME25 (Avg@1 / Avg@32) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-32B | 90.1 | 66.8 | 34.9 | 29.8 | 55.3 | 50.0 | 10.0 / 4.2 | 6.7 / 2.5 | 42.9 |
| Qwen2.5-32B-IT | 95.6 | 83.2 | 42.3 | 49.5 | 72.5 | 62.5 | 23.3 / 15.0 | 20.0 / 13.1 | 56.1 |
| Open-Reasoner-32B | 95.5 | 82.2 | 46.3 | 54.4 | 75.6 | 57.5 | 23.3 / 23.5 | 33.3 / 31.7 | 58.5 |
| SimpleRL-Base-32B | 95.2 | 81.0 | 46.0 | 47.4 | 69.9 | 82.5 | 33.3 / 26.2 | 20.0 / 15.0 | 59.4 |
| BaseRL-32B | 96.1 | 85.6 | 43.4 | 54.7 | 73.8 | 85.0 | 40.0 / 30.7 | 6.7 / 24.6 | 60.7 |
| SwS-32B | 96.3 | 89.4 | 47.1 | 60.5 | 80.3 | 90.0 | 43.3 / 33.0 | 40.0 / 31.8 | 68.4 |
| Δ (vs. BaseRL) | +0.2 | +3.8 | +3.7 | +5.8 | +6.5 | +5.0 | +3.3 / +2.3 | +33.3 / +7.2 | +7.7 |
We recommend using Conda to manage your environment. We use vLLM (0.10.1.1) to accelerate inference. Run the following commands to set up your environment:
```bash
git clone git@github.com:MasterVito/SwS.git && cd SwS
conda create -n sws python=3.10.16
conda activate sws
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128 # CUDA 12.8 for example
pip install -r requirements.txt
```

Model downloading: Here we utilize the Qwen2.5-7B model trained on the MATH-12k dataset. You can download the base model using the following commands:
```bash
mkdir -p models
pip install -U "huggingface_hub[cli]"
huggingface-cli login # use your huggingface token
huggingface-cli download Qwen/Qwen2.5-7B --local-dir models/Qwen2.5-7B
```

We provide a bash script for running the weakness identification stage on the Qwen2.5-7B base model. During this stage, we do not filter out problems with 0% or 100% accuracy, as we set data.accuracy_lower_bound=0.0 and data.accuracy_upper_bound=1.0. The indices of the selected problems from the training set will be saved to the specified save_path.
```bash
bash scripts/qwen25_7b_weakness_identification.sh
```

The sampling accuracy of the problems at each step is also stored in the model checkpoint path. You can compute and summarize these accuracies following the format in the record folder.
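For reference, here is a minimal sketch of how these per-step accuracies could be aggregated to flag weak problems; the file layout, field names, and the 0.25 threshold are illustrative assumptions rather than the repo's exact record format.

```python
# Illustrative sketch only: file layout, field names, and the 0.25 threshold are assumptions.
import json
from glob import glob
from collections import defaultdict

acc_sum, acc_cnt = defaultdict(float), defaultdict(int)
for path in sorted(glob("records/accuracy_step_*.json")):  # hypothetical per-step accuracy dumps
    with open(path) as f:
        step_acc = json.load(f)                             # assumed format: {problem_id: accuracy}
    for pid, acc in step_acc.items():
        acc_sum[pid] += acc
        acc_cnt[pid] += 1

mean_acc = {pid: acc_sum[pid] / acc_cnt[pid] for pid in acc_sum}
# Problems the model rarely solves during the initial RL run are treated as weaknesses.
weak_ids = sorted(pid for pid, acc in mean_acc.items() if acc < 0.25)
print(f"flagged {len(weak_ids)} low-accuracy problems")
```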
Given the recorded problems with low learning efficiency, we begin by extracting key concepts from them using the Llama-3.3-70B-Instruct model:
```bash
bash scripts/synthesis/step1_concepts_extraction.sh
```
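As a rough illustration of this step (the actual extraction prompt is released in the prompts folder, and the script may batch and parse outputs differently), the model can be queried with vLLM along these lines:

```python
# Rough illustration only; the real extraction prompt lives in the `prompts` folder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.0, max_tokens=256)

template = (
    "Extract the key mathematical concepts needed to solve the problem below. "
    "Return a short comma-separated list.\n\nProblem: {problem}\nConcepts:"
)
weak_problems = ["If x + 1/x = 3, find x^3 + 1/x^3."]  # placeholder for the flagged problems
outputs = llm.generate([template.format(problem=p) for p in weak_problems], params)
concepts = [out.outputs[0].text.strip() for out in outputs]
```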
Next, the extracted concepts are encoded into embeddings using the Llama-3.1-8B model:
```bash
bash scripts/synthesis/step2_concepts_encoding.sh
```
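Below is a minimal sketch of one way to obtain such embeddings, mean-pooling the final hidden states with Hugging Face Transformers; the repo's script may use a different pooling strategy or inference backend.

```python
# Sketch: mean-pool the last hidden states of Llama-3.1-8B as concept embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token               # Llama has no pad token by default
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def embed(texts, batch_size=32):
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                          return_tensors="pt").to(model.device)
        hidden = model(**batch).last_hidden_state        # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)     # exclude padding from the mean
        chunks.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))
    return torch.cat(chunks).float().cpu()
```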
After embedding the concepts, we aggregate them by category and allocate a sampling budget to each category based on its normalized failure ratio across categories:
```bash
bash scripts/synthesis/step3_concepts_sampling.sh
```
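The allocation itself is straightforward. A sketch with illustrative numbers is shown below; the real failure ratios come from the weakness-identification statistics, and the total budget is a configurable hyperparameter.

```python
# Sketch of proportional budget allocation; the ratios and budget below are illustrative.
import random

failure_ratio = {"Algebra": 0.42, "Geometry": 0.31, "Number Theory": 0.27}   # from weakness stats
category_concepts = {                                                        # pooled by category
    "Algebra": ["AM-GM inequality", "quadratic roots"],
    "Geometry": ["inscribed angles", "coordinate geometry"],
    "Number Theory": ["modular arithmetic"],
}

total_budget = 10_000
norm = sum(failure_ratio.values())
budget = {cat: round(total_budget * r / norm) for cat, r in failure_ratio.items()}

# Sample (with replacement) the concepts that will seed problem generation in step 4.
sampled = {cat: random.choices(category_concepts[cat], k=budget[cat]) for cat in budget}
```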
Here we generate new questions using Llama-3.3-70B-Instruct, based on the concepts sampled from the model's low-learning-efficiency problems, i.e., the weaknesses identified in our study.
```bash
bash scripts/synthesis/step4_problem_generation.sh
```
We then evaluate the quality of the synthetic questions using both the Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct models, filtering out those that do not meet our standard: each retained question needs at least one perfect rating and one acceptable rating.
```bash
bash scripts/synthesis/step5_quality_evaluation.sh
```
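In code, this filtering rule amounts to roughly the following; the rating labels and record format are assumptions for illustration.

```python
# Sketch of the quality filter: keep a question only if one judge rates it "perfect"
# and every judge rates it at least "acceptable". Labels and record format are assumptions.
def passes_quality_filter(ratings):
    has_perfect = "perfect" in ratings
    all_acceptable = all(r in ("perfect", "acceptable") for r in ratings)
    return has_perfect and all_acceptable

# One rating each from Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct (placeholder records).
synthetic_questions = [{"question": "...", "ratings": ("perfect", "acceptable")}]
kept = [q for q in synthetic_questions if passes_quality_filter(q["ratings"])]
```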
Next, we generate reference answers for the high-quality synthetic problems using strong reasoning models such as QwQ-32B.
```bash
bash scripts/synthesis/step6_answer_verification.sh
```
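One common way to obtain and check such reference answers is majority voting over several sampled solutions. The sketch below illustrates that idea only; the boxed-answer parser and the vote threshold are hypothetical and may differ from what this script actually does.

```python
# Hedged sketch: majority-vote a reference answer from several QwQ-32B solutions.
# The simplistic \boxed{...} parser and the min_votes threshold are hypothetical.
import re
from collections import Counter

def extract_final_answer(solution: str):
    """Pull the last \\boxed{...} expression from a solution, if any (simplistic parser)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def reference_answer(solutions, min_votes=4):
    counts = Counter(a for a in (extract_final_answer(s) for s in solutions) if a is not None)
    if not counts:
        return None                      # nothing parsable: drop the problem
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_votes else None
```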
After generating the reference answers, we prompt the initially trained model with the synthetic questions and retain only those that fall within an acceptable accuracy range and exhibit an appropriate level of difficulty. Finally, we incorporate the remaining questions into the original set and start the second round of the augmented RL training.
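Concretely, the retention rule can be sketched as an accuracy-band check on rollouts from the initially trained model; the bounds below are illustrative and simply mirror the accuracy bounds used in the augmented RL stage described next.

```python
# Sketch: keep a synthetic question only if the initially trained model solves it at an
# intermediate rate. The 0.125 / 0.875 bounds are illustrative (cf. the RL config below).
def keep_by_difficulty(rollout_correct, lower=0.125, upper=0.875):
    """rollout_correct: list of 0/1 outcomes over n sampled rollouts for one question."""
    acc = sum(rollout_correct) / len(rollout_correct)
    return lower <= acc <= upper
```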
Here is the bash script for running the augmented RL training on the Qwen2.5-7B base model. During this stage, we set data.accuracy_lower_bound=0.125 and data.accuracy_upper_bound=0.875.
```bash
bash scripts/qwen25_7b_augment_training.sh
```

We provide a script for inference. Simply configure model_name_or_path and data_path (MATH-500, AIME24, and AIME25 are used for evaluation by default) in scripts/evaluation.sh and run the following command:
```bash
bash scripts/evaluation.sh
```

If you find this repository helpful, please consider citing our paper:
```bibtex
@misc{liang2025swsselfawareweaknessdrivenproblem,
      title={SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning},
      author={Xiao Liang and Zhong-Zhi Li and Yeyun Gong and Yang Wang and Hengyuan Zhang and Yelong Shen and Ying Nian Wu and Weizhu Chen},
      year={2025},
      eprint={2506.08989},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08989},
}
```
We sincerely appreciate the outstanding work of BigMath, PromptCoT, and veRL. The prompts used in the SwS framework are largely inspired by BigMath and PromptCoT, while the training code is adapted from the excellent veRL repository.