Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

License: MIT

This repository contains the code and experiments for the research paper "Decomposing Elements of Problem Solving: What 'Math' Does RL Teach?", which investigates how reinforcement learning (RL) affects mathematical reasoning capabilities in large language models.

🔬 Research Overview

Mathematical reasoning tasks have become prominent benchmarks for assessing LLM reasoning capabilities, especially with RL methods like GRPO showing significant performance gains. However, accuracy metrics alone don't reveal which problem-solving skills have been internalized.

Key Contributions

  1. Reasoning Decomposition Framework: We propose decomposing math problem solving into three fundamental capabilities:

    • Plan: Mapping questions to sequences of solution steps
    • Execute: Correctly performing solution steps
    • Verify: Identifying the correctness of a solution
  2. Empirical Analysis of RL: We show that GRPO primarily improves execution on problems the model can already solve through a "temperature distillation" effect, but does not enable the model to solve previously unsolved problems, revealing a "coverage wall".

  3. Synthetic Validation: We construct a minimal synthetic task that replicates our empirical findings and identifies conditions under which RL can overcome the coverage wall.

Key Findings

  • Temperature Distillation: GRPO makes correct solutions more likely regardless of sampling temperature, enhancing execution robustness
  • Coverage Wall: RL fails to help models solve fundamentally new problems due to insufficient planning skills (see the sketch after this list)
  • Execution Enhancement: RL primarily strengthens execution by reducing spurious correlations and basic errors
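
A way to make the accuracy/coverage distinction concrete: accuracy is the average per-sample success rate, while coverage is the fraction of problems solved at least once across many samples. Below is a minimal sketch of how both can be computed from a per-problem correctness matrix; the array shape and random placeholder data are illustrative assumptions, not output of the repository's evaluation scripts.

import numpy as np

# correct[i, j] = True if sample j for problem i is correct.
# Placeholder data; in practice this would come from the eval outputs.
rng = np.random.default_rng(0)
correct = rng.random((500, 64)) < 0.3

accuracy = correct.mean()              # average per-sample success rate
coverage = correct.any(axis=1).mean()  # fraction of problems solved at least once
print(f"accuracy: {accuracy:.3f}, coverage (pass@64): {coverage:.3f}")

In these terms, temperature distillation corresponds to accuracy rising toward coverage, while the coverage wall corresponds to coverage itself staying flat.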

πŸ“ Repository Structure

RL-Wall/
├── eval/                    # Evaluation framework and utilities
│   ├── utils.py            # Core evaluation utilities and verifiers
│   ├── generate_responses.py # Response generation script
│   ├── extract_correct.py   # Answer extraction utilities
│   └── scripts/            # Collection of evaluation scripts for different models
├── synthetic/              # Synthetic environment for controlled experiments
│   ├── make_data_synthetic_v5.ipynb  # Synthetic data generation notebook
│   ├── make_models_v5.py   # Synthetic model creation script
│   ├── eval_f.py & eval_t.py # Evaluation scripts for synthetic models
│   ├── configs/            # YAML training configurations (v5_1.yaml, etc.)
│   ├── sft/               # Supervised fine-tuning code
│   │   ├── run_sft_accelerate.py # SFT training script
│   │   └── lm_tools.py    # Language model utilities
│   └── rl/                # Reinforcement learning setup (VERL framework)
├── tree_vis/              # Solution tree visualization tools
│   ├── make_tree_04_14.ipynb # Interactive tree visualization notebook
│   ├── trees/             # Generated solution tree files
│   └── *.html            # Example visualization files
├── math_rl/               # Mathematical RL experiments (minimal content)
└── README.md

🚀 Getting Started

Prerequisites

The repository relies on several Python packages. You'll need:

# Core dependencies
pip install torch transformers datasets numpy pandas
pip install vllm accelerate wandb tqdm
pip install sympy pylatexenc

# For RL training (VERL framework is included)
cd synthetic/rl/verl
pip install -e .

# For evaluation with GPT-based verification
pip install openai

📋 What's Actually Here

Evaluation Framework (eval/)

  • utils.py: Comprehensive utilities with multiple answer verifiers (VERL, SymPy, GPT-based); a simplified SymPy-style check is sketched after this list
  • generate_responses.py: Script for generating model responses with various parameters
  • extract_correct.py: Utilities for extracting and processing answers
  • scripts/: Collection of bash scripts for running evaluations (e.g., qwen-1.5b-instruct_temps.sh)
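
As a rough illustration of what a SymPy-based verifier does, two answers can be treated as equivalent when their symbolic difference simplifies to zero. This is a simplified sketch, not the implementation in utils.py, and the function name is made up:

from sympy import simplify, sympify

def answers_match(predicted: str, reference: str) -> bool:
    # Equivalent if the symbolic difference simplifies to zero.
    try:
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except Exception:
        # Fall back to exact string comparison when parsing fails.
        return predicted.strip() == reference.strip()

print(answers_match("1/2", "0.5"))      # True
print(answers_match("2*x + x", "3*x"))  # True

In practice the repository combines several verifiers (including GPT-based checking), since symbolic parsing alone cannot handle every answer format.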

Synthetic Environment (synthetic/)

  • make_data_synthetic_v5.ipynb: Jupyter notebook for creating synthetic datasets
  • make_models_v5.py: Script for synthetic model creation
  • eval_f.py and eval_t.py: Evaluation scripts for synthetic experiments
  • configs/: YAML configuration files (v5_1.yaml through v5_17.yaml)
  • sft/run_sft_accelerate.py: Training script using Accelerate
  • rl/verl/: Complete VERL framework for RL training

Tree Visualization (tree_vis/)

  • make_tree_04_14.ipynb: Notebook for generating interactive solution trees
  • Various HTML files: Pre-generated visualization examples
  • trees.json: Solution tree data

🧪 Running Experiments

Basic Evaluation

You can generate responses using the evaluation framework:

cd eval
python generate_responses.py \
    --model_name qwen-2.5-1.5b-instruct \
    --dataset_name math_500 \
    --exp_dir ./results/test \
    --temperature 0.1 \
    --n 64

Synthetic Experiments

The synthetic environment can be explored through the notebooks:

cd synthetic
# Open and run the data generation notebook
jupyter notebook make_data_synthetic_v5.ipynb

# Train a synthetic model (requires proper setup)
python sft/run_sft_accelerate.py configs/v5_1.yaml

Solution Tree Visualization

cd tree_vis
# Open the visualization notebook
jupyter notebook make_tree_04_14.ipynb

📊 Key Components

Evaluation Utilities

The eval/utils.py file contains:

  • Multiple answer verification methods
  • Support for various model architectures (Qwen, Llama, DeepSeek, etc.)
  • Batch processing capabilities
  • Temperature and sampling analysis tools

Synthetic Environment Design

The synthetic setup models mathematical reasoning as:

  • State-action navigation through transition tables (a toy version is sketched after this list)
  • Built-in spurious correlations for robustness testing
  • Configurable complexity and dimensions
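
A stripped-down version of such an environment might look like the sketch below. The table size, step encoding, and function names are illustrative assumptions, not the repository's actual generator in make_data_synthetic_v5.ipynb:

import numpy as np

# Toy setup: a problem is a start state plus a goal state; a plan is a
# sequence of actions that navigates the transition table.
rng = np.random.default_rng(0)
n_states, n_actions = 8, 4
transition = rng.integers(0, n_states, size=(n_states, n_actions))

def execute(start, actions):
    # "Execute": step through the table, returning the visited states.
    states = [start]
    for a in actions:
        states.append(int(transition[states[-1], a]))
    return states

def verify(start, actions, goal):
    # "Verify": check whether the executed plan ends at the goal state.
    return execute(start, actions)[-1] == goal

start, plan = 0, [2, 1]
goal = execute(start, plan)[-1]
print(execute(start, plan), verify(start, plan, goal))  # visited states, True

This mirrors the Plan/Execute/Verify decomposition: planning is choosing the action sequence, execution is stepping through the table, and verification is checking the final state.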

Visualization Tools

  • Interactive HTML-based solution tree visualization (a minimal tree-building sketch follows this list)
  • Statistical analysis of model behavior patterns
  • Tools for comparing pre/post-RL model performance
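
As a rough illustration of the underlying idea, a solution tree can be viewed as a prefix tree over sampled solution paths for a problem. The sketch below builds such a tree and writes it to JSON; the structure and step labels are made-up assumptions and do not follow the schema of the repository's tree files:

import json
from collections import defaultdict

def new_node():
    return {"count": 0, "children": defaultdict(new_node)}

def add_path(root, steps):
    # Insert one sampled solution path (a sequence of steps) into the tree.
    node = root
    for step in steps:
        node = node["children"][step]
        node["count"] += 1

def to_plain(node):
    # Convert nested defaultdicts to plain dicts for JSON serialization.
    return {"count": node["count"],
            "children": {k: to_plain(v) for k, v in node["children"].items()}}

paths = [["factor", "solve"], ["factor", "expand"], ["factor", "solve"]]
root = new_node()
for p in paths:
    add_path(root, p)

with open("example_tree.json", "w") as f:
    json.dump(to_plain(root), f, indent=2)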

📈 Research Findings

Based on the code and experiments in this repository:

  1. GRPO improves precision through temperature distillation but doesn't increase coverage
  2. Models plan well but struggle with execution on high school math
  3. RL reduces basic errors but doesn't teach new mathematical knowledge
  4. Coverage improvements are possible under specific conditions (less spurious correlation, more RL data)

⚠️ Repository Status

This repository contains the research code and experimental setup. Some components may require additional setup or configuration to run fully. The code represents the state used for the research paper and may need adaptation for different environments or use cases.

πŸ“ Citation

Coming Soon

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • VERL framework for efficient RL training
  • MATH and GSM8K datasets for evaluation
  • Qwen model family for base models

For questions about the code or experiments, please open a GitHub issue.
