SwS-Logo
SwS: A Weakness-driven Problem Synthesis Framework

[🌐 Website] • [🤗 Demo Dataset] • [📜 Paper] • [🐱 GitHub] • [🐦 Twitter] • [📕 Rednote]

Repo for "SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning"


Figure 1: 32B model performance across mainstream reasoning benchmarks and different domains.

🔥 News

  • [2025/10/14] 🔥 We release all code, including implementations for RL training and problem synthesis.
  • [2025/09/18] SwS has been accepted to NeurIPS 2025! We welcome any discussions during the conference.
  • [2025/06/13] We release all prompts used in the SwS framework in prompts.
  • [2025/06/13] We update the demo set of synthetic problems from SwS in datasets, including 500 samples for each model and category. You can also find them in Demo Dataset.
  • [2025/06/10] Our full code and datasets are under review by Microsoft and will be released upon approval.
  • [2025/06/10] SwS paper, repo, website and demo datasets released.

💡 Introduction

The Self-aware Weakness-driven problem Synthesis (SwS) framework identifies model deficiencies and leverages them for problem augmentation. Weaknesses are defined as questions that the model consistently fails to learn from during RL training. SwS extracts the core concepts from these failure cases and synthesizes new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses.


Figure 2: An overview of our proposed weakness-driven problem synthesis framework, which targets mitigating the model's reasoning limitations within the RLVR paradigm.

📊 Evaluation Results

7B Model Performance

| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1 / 32) | AIME25 (Avg@1 / 32) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | 88.1 | 63.0 | 27.6 | 30.5 | 55.8 | 35.0 | 6.7 / 5.4 | 0.0 / 1.2 | 38.3 |
| Qwen2.5-7B-IT | 91.7 | 75.6 | 38.2 | 40.6 | 63.9 | 50.0 | 16.7 / 10.5 | 13.3 / 6.7 | 48.8 |
| Open-Reasoner-7B | 93.6 | 80.4 | 39.0 | 45.6 | 72.0 | 72.5 | 10.0 / 16.8 | 13.3 / 17.9 | 53.3 |
| SimpleRL-Base-7B | 90.8 | 77.2 | 35.7 | 41.0 | 66.2 | 62.5 | 13.3 / 14.8 | 6.7 / 6.7 | 49.2 |
| BaseRL-7B | 92.0 | 78.4 | 36.4 | 41.6 | 63.4 | 45.0 | 10.0 / 14.5 | 6.7 / 6.5 | 46.7 |
| SwS-7B | 93.9 | 82.6 | 41.9 | 49.6 | 71.7 | 67.5 | 26.7 / 18.3 | 20.0 / 18.5 | 56.7 |
| Δ (vs. BaseRL) | +1.9 | +4.2 | +5.5 | +8.0 | +8.3 | +22.5 | +16.7 / +3.8 | +13.3 / +12.0 | +10.0 |

32B Model Performance

| Model | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | GaoKao 2023 | AMC23 | AIME24 (Avg@1 / 32) | AIME25 (Avg@1 / 32) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-32B | 90.1 | 66.8 | 34.9 | 29.8 | 55.3 | 50.0 | 10.0 / 4.2 | 6.7 / 2.5 | 42.9 |
| Qwen2.5-32B-IT | 95.6 | 83.2 | 42.3 | 49.5 | 72.5 | 62.5 | 23.3 / 15.0 | 20.0 / 13.1 | 56.1 |
| Open-Reasoner-32B | 95.5 | 82.2 | 46.3 | 54.4 | 75.6 | 57.5 | 23.3 / 23.5 | 33.3 / 31.7 | 58.5 |
| SimpleRL-Base-32B | 95.2 | 81.0 | 46.0 | 47.4 | 69.9 | 82.5 | 33.3 / 26.2 | 20.0 / 15.0 | 59.4 |
| BaseRL-32B | 96.1 | 85.6 | 43.4 | 54.7 | 73.8 | 85.0 | 40.0 / 30.7 | 6.7 / 24.6 | 60.7 |
| SwS-32B | 96.3 | 89.4 | 47.1 | 60.5 | 80.3 | 90.0 | 43.3 / 33.0 | 40.0 / 31.8 | 68.4 |
| Δ (vs. BaseRL) | +0.2 | +3.8 | +3.7 | +5.8 | +6.5 | +5.0 | +3.3 / +2.3 | +33.3 / +7.2 | +7.7 |
P.S.: Additional results for Qwen2.5-3B and Qwen2.5-7B-Math are provided in the paper.

🚀 Quick Start

We recommend using Conda to manage your environment. We use vLLM (0.10.1.1) to accelerate inference. Run the following commands to set up your environment:

git clone git@github.com:MasterVito/SwS.git && cd SwS
conda create -n sws python=3.10.16
conda activate sws
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128 # CUDA 12.8 for example
pip install -r requirements.txt

Model downloading: here we use the Qwen2.5-7B base model, which is trained on the MATH-12k dataset in our experiments. You can download the model using the following commands:

mkdir -p models
pip install -U "huggingface_hub[cli]"
huggingface-cli login # use your huggingface token
huggingface-cli download Qwen/Qwen2.5-7B --local-dir models/Qwen2.5-7B

1. Weakness Identification in Initial RL

We provide a bash script for running the weakness identification stage on the Qwen2.5-7B base model. During this stage, we do not filter out problems with 0% or 100% accuracy, as we set data.accuracy_lower_bound=0.0 and data.accuracy_upper_bound=1.0. The indices of the selected problems from the training set will be saved to the specified save_path.

bash scripts/qwen25_7b_weakness_identification.sh

2. Problem Synthesis

The sampling accuracy of each problem at every training step is also stored under the model checkpoint path. You can compute and summarize these accuracies following the format in the record folder.
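For reference, summarizing those records boils down to averaging each problem's sampling accuracy over training steps. Below is a minimal sketch, assuming one JSON record per step; the paths and field layout are illustrative, and the real format follows the record folder:

```python
import glob
import json
from collections import defaultdict

# Per-step sampling accuracies saved alongside the checkpoints
# (paths and field names below are illustrative).
per_problem = defaultdict(list)
for record_file in sorted(glob.glob("checkpoints/qwen25-7b/global_step_*/accuracy.json")):
    with open(record_file) as f:
        step_record = json.load(f)  # {problem_index: sampling accuracy at this step}
    for idx, acc in step_record.items():
        per_problem[idx].append(acc)

# Average over steps; problems whose accuracy stays low are the model's weaknesses.
summary = {idx: sum(accs) / len(accs) for idx, accs in per_problem.items()}
```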

Given the recorded problems with low learning efficiency, we begin by extracting their key concepts using the Llama-3.3-70B-Instruct model:

bash scripts/synthesis/step1_concepts_extraction.sh
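Under the hood, this extraction is a prompting pass over the failure cases. A minimal sketch of the idea with vLLM follows; the prompt wording below is a stand-in, and the actual prompts are released in prompts:

```python
from vllm import LLM, SamplingParams

# Concept extractor (tensor_parallel_size depends on your GPU setup).
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)

# Problems with low learning efficiency, loaded from the weakness records.
failed_problems = ["In triangle ABC, ..."]

prompts = [
    "Extract the core mathematical concepts required to solve the problem below, "
    f"as a short comma-separated list.\n\nProblem: {problem}\n\nConcepts:"
    for problem in failed_problems
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))
concepts = [out.outputs[0].text.strip() for out in outputs]
```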

Next, the extracted concepts are encoded into embeddings using the Llama-3.1-8B model:

bash scripts/synthesis/step2_concepts_encoding.sh
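Conceptually, this step just maps each concept string to a dense vector. Here is a minimal sketch using Transformers with mean pooling; the actual script may batch or pool differently:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Llama-3.1-8B loaded without its LM head, used purely as a text encoder.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModel.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)

concepts = ["modular arithmetic", "law of cosines", "inclusion-exclusion principle"]
batch = tokenizer(concepts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state                  # [batch, seq_len, hidden]
    mask = batch["attention_mask"].unsqueeze(-1)               # [batch, seq_len, 1]
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over valid tokens
```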

After embedding the concepts, we aggregate them by category and allocate a sampling budget to each category based on its normalized failure ratio across categories:

bash scripts/synthesis/step3_concepts_sampling.sh
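The budget allocation itself is a straightforward normalization over per-category failure counts. A small sketch with made-up numbers:

```python
# Number of weakness (low-learning-efficiency) problems observed per category; illustrative values.
failure_counts = {"Geometry": 120, "Number Theory": 80, "Combinatorics": 50}
total_budget = 10_000  # total number of synthetic problems to generate

total_failures = sum(failure_counts.values())
sampling_budget = {
    category: round(total_budget * count / total_failures)
    for category, count in failure_counts.items()
}
# -> {'Geometry': 4800, 'Number Theory': 3200, 'Combinatorics': 2000}
```

Categories where the model fails more often therefore receive a proportionally larger share of the synthesis budget.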

Here we generate new questions using Llama-3.3-70B-Instruct, based on the sampled concepts derived from the model's low-learning-efficiency problems, i.e., the weaknesses identified in our study:

bash scripts/synthesis/step4_problem_generation.sh
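Each generation prompt combines a few of the sampled concepts and is then sent to the generator in the same way as the extraction pass above. A minimal, illustrative sketch (the prompt text and concept names are made up):

```python
import random

# Concepts sampled for one category under its allocated budget (illustrative).
sampled_concepts = ["angle bisector theorem", "power of a point", "similar triangles"]

def build_generation_prompt(concepts, num_concepts=2):
    """Combine a few weakness-related concepts into a single problem-generation prompt."""
    chosen = random.sample(concepts, num_concepts)
    return (
        "Write a new, self-contained competition math problem whose solution requires "
        f"the following concepts: {', '.join(chosen)}. Output only the problem statement."
    )

generation_prompts = [build_generation_prompt(sampled_concepts) for _ in range(4)]
# These prompts are then passed to Llama-3.3-70B-Instruct, e.g., via llm.generate as in step 1.
```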

We then evaluate the quality of the synthetic questions using both the Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct models, filtering out those that do not meet our standard, which requires at least one perfect rating and one acceptable rating:

bash scripts/synthesis/step5_quality_evaluation.sh
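One way to read that criterion: with two raters, a question needs one perfect rating while the other rating is at least acceptable. A minimal sketch of such a filter, with illustrative rating labels:

```python
def passes_quality_check(ratings):
    """Ratings come from Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, one each."""
    has_perfect = any(r == "perfect" for r in ratings)
    none_below_acceptable = all(r in ("perfect", "acceptable") for r in ratings)
    return has_perfect and none_below_acceptable

print(passes_quality_check(["perfect", "acceptable"]))     # True
print(passes_quality_check(["acceptable", "acceptable"]))  # False: no perfect rating
print(passes_quality_check(["perfect", "poor"]))           # False: one rating below acceptable
```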

Next, we generate reference answers for the high-quality synthetic problems using strong reasoning models such as QwQ-32B:

bash scripts/synthesis/step6_answer_verification.sh

After generating the reference answers, we prompt the initially trained model with the synthetic questions and retain only those that fall within an acceptable accuracy range and exhibit an appropriate level of difficulty. Finally, we incorporate the remaining questions into the original training set and start the second, augmented round of RL training.
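The difficulty filter is again an accuracy-band check, this time over the initially trained model's sampling accuracy on each verified synthetic problem. A minimal sketch with illustrative bounds (the actual values are set in the training config):

```python
def keep_for_augmented_training(problem_accuracies, lower=0.125, upper=0.875):
    """Drop synthetic problems the model already solves every time (too easy)
    or never solves (too hard); the bounds here are illustrative."""
    return [idx for idx, acc in enumerate(problem_accuracies) if lower <= acc <= upper]

# Sampling accuracy of the initially trained model on each verified synthetic problem.
synthetic_accuracies = [0.0, 0.3, 0.6, 1.0]
print(keep_for_augmented_training(synthetic_accuracies))  # [1, 2]
```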

3. Augmented RL Training

Here is the bash script for running the augmented RL training on the Qwen2.5-7B base model. During this stage, we set data.accuracy_lower_bound=0.125 and data.accuracy_upper_bound=0.875.

bash scripts/qwen25_7b_augment_training.sh

🔎 Evaluation

We provide a script for inference. Simply configure model_name_or_path and data_path (MATH-500, AIME24, and AIME25 are used for evaluation by default) in scripts/evaluation.sh and run the following command:

bash scripts/evaluation.sh

β˜•οΈ Citation

If you find this repository helpful, please consider citing our paper:

@misc{liang2025swsselfawareweaknessdrivenproblem,
      title={SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning}, 
      author={Xiao Liang and Zhong-Zhi Li and Yeyun Gong and Yang Wang and Hengyuan Zhang and Yelong Shen and Ying Nian Wu and Weizhu Chen},
      year={2025},
      eprint={2506.08989},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08989}, 
}

πŸ™ Acknowledgement

We sincerely appreciate the outstanding work of BigMath, PromptCoT, and veRL. The prompts used in the SwS framework are largely inspired by BigMath and PromptCoT, while the training code is adapted from the excellent veRL repository.

🌟 Star History

Star History Chart
