This repository contains the code and experiments for the research paper "Decomposing Elements of Problem Solving: What 'Math' Does RL Teach?" which investigates how reinforcement learning (RL) affects mathematical reasoning capabilities in large language models.
Mathematical reasoning tasks have become prominent benchmarks for assessing LLM reasoning capabilities, especially with RL methods like GRPO showing significant performance gains. However, accuracy metrics alone don't reveal which problem-solving skills have been internalized.
- Reasoning Decomposition Framework: We propose decomposing math problem solving into three fundamental capabilities:
  - Plan: Mapping questions to sequences of solution steps
  - Execute: Correctly performing solution steps
  - Verify: Identifying the correctness of a solution
- Empirical Analysis of RL: We show that GRPO primarily improves execution on known problems through a "temperature distillation" effect but fails to solve previously unsolved problems, revealing a "coverage wall".
- Synthetic Validation: We construct a minimal synthetic task that replicates our empirical findings and identifies conditions under which RL can overcome the coverage wall.
- Temperature Distillation: GRPO makes correct solutions more likely regardless of sampling temperature, enhancing execution robustness
- Coverage Wall: RL fails to help models solve fundamentally new problems due to insufficient planning skills; coverage is typically reported as pass@k (see the sketch after this list)
- Execution Enhancement: RL primarily strengthens execution by reducing spurious correlations and basic errors
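
Coverage here refers to whether any of a model's sampled solutions is correct, commonly reported as pass@k. For reference, the snippet below shows the standard unbiased pass@k estimator; it is illustrative and not code from this repository.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator given n samples of which c are correct (Chen et al., 2021)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 64 samples per problem, 4 of them correct
print(pass_at_k(n=64, c=4, k=8))
```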
```
RL-Wall/
├── eval/                              # Evaluation framework and utilities
│   ├── utils.py                       # Core evaluation utilities and verifiers
│   ├── generate_responses.py          # Response generation script
│   ├── extract_correct.py             # Answer extraction utilities
│   └── scripts/                       # Collection of evaluation scripts for different models
├── synthetic/                         # Synthetic environment for controlled experiments
│   ├── make_data_synthetic_v5.ipynb   # Synthetic data generation notebook
│   ├── make_models_v5.py              # Synthetic model creation script
│   ├── eval_f.py & eval_t.py          # Evaluation scripts for synthetic models
│   ├── configs/                       # YAML training configurations (v5_1.yaml, etc.)
│   ├── sft/                           # Supervised fine-tuning code
│   │   ├── run_sft_accelerate.py      # SFT training script
│   │   └── lm_tools.py                # Language model utilities
│   └── rl/                            # Reinforcement learning setup (VERL framework)
├── tree_vis/                          # Solution tree visualization tools
│   ├── make_tree_04_14.ipynb          # Interactive tree visualization notebook
│   ├── trees/                         # Generated solution tree files
│   └── *.html                         # Example visualization files
├── math_rl/                           # Mathematical RL experiments (minimal content)
└── README.md
```
The repository uses several dependencies. You'll need:
```bash
# Core dependencies
pip install torch transformers datasets numpy pandas
pip install vllm accelerate wandb tqdm
pip install sympy pylatexenc

# For RL training (VERL framework is included)
cd synthetic/rl/verl
pip install -e .

# For evaluation with GPT-based verification
pip install openai
```

The evaluation framework (`eval/`) includes:
- `utils.py`: Comprehensive utilities with multiple answer verifiers (VERL, SymPy, GPT-based); a hypothetical GPT-judge call is sketched below
- `generate_responses.py`: Script for generating model responses with various parameters
- `extract_correct.py`: Utilities for extracting and processing answers
- `scripts/`: Collection of bash scripts for running evaluations (e.g., `qwen-1.5b-instruct_temps.sh`)
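
For orientation, the sketch below shows one way a GPT-based equivalence check can be wired up with the `openai` client. It is a hypothetical illustration, not the verifier implemented in `eval/utils.py`; the function name, prompt, and judge model are placeholders.

```python
# Hypothetical GPT-judge sketch; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def gpt_judge(question: str, gold_answer: str, model_answer: str,
              judge_model: str = "gpt-4o-mini") -> bool:
    """Ask an LLM judge whether the candidate answer matches the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "Are the two answers mathematically equivalent? Reply with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```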
The synthetic environment (`synthetic/`) includes:
- `make_data_synthetic_v5.ipynb`: Jupyter notebook for creating synthetic datasets
- `make_models_v5.py`: Script for synthetic model creation
- `eval_f.py` and `eval_t.py`: Evaluation scripts for synthetic experiments
- `configs/`: YAML configuration files (`v5_1.yaml` through `v5_17.yaml`)
- `sft/run_sft_accelerate.py`: Training script using Accelerate
- `rl/verl/`: Complete VERL framework for RL training
The tree visualization tools (`tree_vis/`) include:
- `make_tree_04_14.ipynb`: Notebook for generating interactive solution trees
- Various HTML files: Pre-generated visualization examples
- `trees.json`: Solution tree data
You can generate responses using the evaluation framework:
```bash
cd eval
python generate_responses.py \
    --model_name qwen-2.5-1.5b-instruct \
    --dataset_name math_500 \
    --exp_dir ./results/test \
    --temperature 0.1 \
    --n 64
```

The synthetic environment can be explored through the notebooks:
```bash
cd synthetic

# Open and run the data generation notebook
jupyter notebook make_data_synthetic_v5.ipynb

# Train a synthetic model (requires proper setup)
python sft/run_sft_accelerate.py configs/v5_1.yaml
```

Solution trees can be inspected from the `tree_vis/` directory:

```bash
cd tree_vis

# Open the visualization notebook
jupyter notebook make_tree_04_14.ipynb
```

The `eval/utils.py` file contains:
- Multiple answer verification methods (a SymPy-based example is sketched after this list)
- Support for various model architectures (Qwen, Llama, DeepSeek, etc.)
- Batch processing capabilities
- Temperature and sampling analysis tools
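
As a rough illustration of SymPy-based verification (not the exact logic in `eval/utils.py`, which also supports VERL- and GPT-based checking), two answer strings can be compared by simplifying their difference:

```python
from sympy import simplify, sympify

def sympy_equivalent(prediction: str, reference: str) -> bool:
    """Return True if the two answer strings simplify to the same expression."""
    try:
        diff = simplify(sympify(prediction) - sympify(reference))
        return bool(diff == 0)
    except Exception:
        # Fall back to exact string match if parsing fails.
        return prediction.strip() == reference.strip()

# Example: "2*x + 2" and "2*(x + 1)" are judged equivalent.
assert sympy_equivalent("2*x + 2", "2*(x + 1)")
```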
The synthetic setup models mathematical reasoning as:
- State-action navigation through transition tables (see the sketch after this list)
- Built-in spurious correlations for robustness testing
- Configurable complexity and dimensions
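
To make the idea concrete, here is a toy sketch of such a navigation task. Names, sizes, and the output format are invented for illustration; the actual generator lives in `make_data_synthetic_v5.ipynb` and `make_models_v5.py`.

```python
import random

N_STATES, N_ACTIONS, PLAN_LENGTH = 8, 4, 3  # hypothetical sizes

# Random transition table: table[state][action] -> next state
table = [[random.randrange(N_STATES) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def solve(start: int, actions: list[int]) -> list[int]:
    """Execute a plan (sequence of actions) step by step, returning the visited states."""
    states = [start]
    for a in actions:
        states.append(table[states[-1]][a])
    return states

plan = [random.randrange(N_ACTIONS) for _ in range(PLAN_LENGTH)]
trace = solve(start=0, actions=plan)
print(f"plan={plan} -> state trace={trace}")  # a "solution" is the full state trace
```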
The analysis and visualization tools provide:
- Interactive HTML-based solution tree visualization
- Statistical analysis of model behavior patterns
- Tools for comparing pre/post-RL model performance
Based on the code and experiments in this repository:
- GRPO improves precision through temperature distillation but doesn't increase coverage
- Models plan well but struggle with execution on high school math
- RL reduces basic errors but doesn't teach new mathematical knowledge
- Coverage improvements are possible under specific conditions (less spurious correlation, more RL data)
This repository contains the research code and experimental setup. Some components may require additional setup or configuration to run fully. The code represents the state used for the research paper and may need adaptation for different environments or use cases.
Coming Soon
This project is licensed under the MIT License - see the LICENSE file for details.
- VERL framework for efficient RL training
- MATH and GSM8K datasets for evaluation
- Qwen model family for base models
For questions about the code or experiments, please open a GitHub issue.