
From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

Paper · Notion · Hugging Face Collection · GitHub · Twitter

📘 This repository contains the code accompanying the paper "From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones". The repo is built on veRL, and we provide the synthetic data generators and training pipelines required to reproduce the results reported in the paper.

⚡ The repository is organized around reproducible bash entrypoints located in bash/, grouped by the paper sections they support.


πŸ—‚οΈ Table of Contents


βš™οΈ Environment Setup

  1. Clone the repository 📥

    git clone https://github.com/PRIME-RL/RL-Compositionality.git
    cd RL-Compositionality
  2. Create a Python environment 🧑‍💻 (Python ≥ 3.10 recommended; tested with 3.12)

    virtualenv rl_comp
    source rl_comp/bin/activate
    pip install -e .
    pip install flash-attn --no-build-isolation
  3. Log in to Wandb 🔑

    export WANDB_API_KEY="..."
  4. Model checkpoints 📂 Update the MODEL_PATH variables inside the bash scripts if your checkpoint is stored elsewhere.


πŸ“ Repository Layout

  • bash/: One-click pipelines grouped by paper section numbers.

    • section41_42/: Data generation, Stage 1 RFT, Stage 2 RL/RFT experiments.
    • section43/: Countdown data generation and transfer experiments.
    • section44/: Pass@k evaluation utilities.
  • examples/: Python entrypoints used by the bash scripts.


πŸ§‘β€πŸ« Stage 1: Atomic Skill Acquisition

Stage 1 corresponds to rejection fine-tuning (RFT) on tasks where the function definitions are visible. The process produces a Stage 1 checkpoint that serves as the initialization for all Stage 2 experiments.
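For intuition, an atomic skill here is a single string transformation that the model learns to apply in isolation. A minimal sketch follows; these particular functions are hypothetical illustrations, and the actual skill set is defined by the repo's data generators.

```python
# Hypothetical examples of atomic string transformations (Stage 1 skills).
# The actual function set is defined by the repository's data generators.

def reverse_string(s: str) -> str:
    """Reverse the characters of s."""
    return s[::-1]

def swap_case(s: str) -> str:
    """Swap upper- and lowercase letters."""
    return s.swapcase()

def rotate_left(s: str, k: int = 1) -> str:
    """Cyclically rotate s left by k positions."""
    k %= max(len(s), 1)  # guard against empty strings
    return s[k:] + s[:k]
```

During Stage 1, function definitions like these are shown to the model in the prompt; Stage 2 later hides the implementations and asks for their compositions.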

You can download our Stage-1 RFT training data directly from 🤗Hugging Face.

hf download --repo-type dataset --local-dir data/string_task/stage1_level1/rft_data weizechen/RL-Compositionality-Stage1-RFT-Data
Alternatively, you can create the problems and collect the RFT data yourself:
  1. Generate synthetic training data

    bash bash/section41_42/stage1_create_problems.sh

    This script populates data/string_task/stage1_level1/ with train and test Parquet files containing atomic string transformations (the test Parquet is not actually used in the experiments).

  2. Collect rollouts & convert into RFT datasets

    bash bash/section41_42/stage1_create_train_data.sh

    The script samples N_SAMPLES responses per prompt and stores them in data/string_task/stage1_level1/rollout.parquet. It then filters the rollouts by accuracy and emits train.parquet / test.parquet splits under data/string_task/stage1_level1/rft_data/.
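The accuracy-based filtering step can be sketched as follows; the field names and thresholds are hypothetical, and the actual criteria live in the scripts under examples/.

```python
# Sketch of accuracy-based rejection filtering for RFT data.
# Field names ('prompt', 'response', 'correct') and the accuracy band
# are hypothetical; see the repository's scripts for the real criteria.
from collections import defaultdict

def filter_rollouts(rollouts, min_acc=0.05, max_acc=0.95):
    """Keep correct rollouts from prompts whose empirical accuracy
    falls inside [min_acc, max_acc] (neither trivial nor unsolved)."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt"]].append(r)

    kept = []
    for group in by_prompt.values():
        acc = sum(r["correct"] for r in group) / len(group)
        if min_acc <= acc <= max_acc:
            kept.extend(r for r in group if r["correct"])
    return kept
```

The idea is standard rejection fine-tuning: only correct completions survive, and prompts that every sample solves (or none does) carry no training signal and are dropped.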

Then train the Llama 3.1 8B Instruct model on the RFT data:

bash bash/section41_42/stage1_rft.sh

The resulting checkpoint is saved to checkpoints/string-task/stage1-rft/ and initializes every Stage 2 run.


🧩 Stage 2: Learning Compositional Skills

Stage 2 removes access to function implementations and focuses on compositional reasoning. All scripts assume that bash/section41_42/stage1_rft.sh has been executed successfully.

🤖 RL Variants (Sections 4.1 & 4.2)

Download the RL training and evaluation datasets from 🤗Hugging Face:

# Training
hf download --repo-type dataset --local-dir data/string_task/stage2_level1 weizechen/RL-Compositionality-Stage2-RL-Level1-TrainData
hf download --repo-type dataset --local-dir data/string_task/stage2_level2 weizechen/RL-Compositionality-Stage2-RL-Level2-TrainData

# Evaluation
hf download --repo-type dataset --local-dir data/string_task/stage2_level1to8 weizechen/RL-Compositionality-Stage2-RL-Level2-TestData
Alternatively, you can create the problems yourself:
bash bash/section41_42/stage2_create_problems.sh

➑️ Generates Level-1, Level-2 training splits and a Level-1-to-8 evaluation split.
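Conceptually, a Level-k problem composes k atomic functions and asks for the final output, so Level-2 items take the f(g(x)) form from the paper's title. A minimal sketch with hypothetical atoms (the repo's generator and prompt format will differ):

```python
# Sketch of Level-k problem construction: compose k atomic functions
# and ask for the result. The atoms and prompt format are hypothetical.
ATOMS = {
    "reverse": lambda s: s[::-1],
    "upper": lambda s: s.upper(),
    "drop_first": lambda s: s[1:],
}

def make_problem(names, x):
    """Build one problem from an ordered list of atom names.
    The first name is applied innermost, e.g. ['g', 'f'] -> f(g(x))."""
    expr, y = repr(x), x
    for name in names:
        y = ATOMS[name](y)
        expr = f"{name}({expr})"
    return {"question": f"What is {expr}?", "answer": y}
```

A Level-2 call such as `make_problem(["reverse", "upper"], "hello")` asks for `upper(reverse('hello'))`; in Stage 2 the model must answer without seeing the atom implementations.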

Launch RL training with your desired setting:

bash bash/section41_42/stage2_rl_level1.sh      # Level-1 only
bash bash/section41_42/stage2_rl_level2.sh      # Level-2 only
bash bash/section41_42/stage2_rl_level1to2.sh   # Mixed Level-1+2

📊 RFT Baseline (Section 4.2)

To compare RL with supervised rejection fine-tuning:

  1. Partition Level-2 data into iterative RFT chunks

    bash bash/section41_42/stage2_rft_create_problems.sh
  2. Collect rollouts & convert them into RFT datasets

    bash bash/section41_42/stage2_rft_create_train_data.sh
  3. Fine-tune from the Stage 1 checkpoint

    bash bash/section41_42/stage2_rft_train.sh
  4. Iterative RFT 🔁 Repeat steps 2–3 with updated model/data paths.


🔀 Cross-Task Transfer to Countdown (Section 4.3)

  1. Generate Countdown arithmetic datasets

    bash bash/section43/create_countdown_data.sh
  2. Collect rollouts & convert to RFT data

    bash bash/section43/stage1_collect_train_data.sh

    The script collects the model's rollouts on Countdown Level 2 problems, filters them by accuracy, merges them with the string-task RFT data, and saves the result to data/string_countdown_task/stage1_rft_data/.

  3. Train the model 🏋️ You can reuse the script bash/section41_42/stage1_rft.sh, changing the data path to data/string_countdown_task/stage1_rft_data.

Stage 2 training is the same as in Sections 4.1/4.2; you can reuse those scripts and change the model path.
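As background, a Countdown problem asks the model to combine a given set of numbers with the four arithmetic operations to hit a target value. A sketch of a checker for such answers follows; the repository's actual reward function may differ in interface and strictness.

```python
# Sketch of a Countdown-style answer checker (interface hypothetical):
# verify that a candidate arithmetic expression uses exactly the given
# numbers and evaluates to the target.
import ast

def check_countdown(expr: str, numbers, target) -> bool:
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    # The multiset of literals must match the given numbers exactly.
    used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
    if sorted(used) != sorted(numbers):
        return False
    # Only +, -, *, / are allowed; reject names, calls, powers, etc.
    allowed = (ast.Expression, ast.BinOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div)
    if not all(isinstance(n, allowed) for n in ast.walk(tree)):
        return False
    return abs(eval(compile(tree, "<expr>", "eval")) - target) < 1e-6
```

Parsing with `ast` rather than calling `eval` directly keeps the check safe against arbitrary code in model outputs.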


📈 Pass@k Analysis (Section 4.4)

  1. Collect 1000 rollouts

    bash bash/section44/passk.sh

    This command samples completions from a trained policy, saves them to results/stage2_rl_level1/all.parquet, and enables downstream computation of pass@k metrics. Change the model path to obtain 1000 responses from other models.
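To turn the saved responses into pass@k, the standard unbiased estimator of Chen et al. (2021) applies: with n samples per problem of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems. A sketch (the repo's own evaluation code may differ):

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of samples
# per problem and c the number of correct ones.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset
        # contains at least one correct response.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With the 1000-rollout setting above, n = 1000 and c is the per-problem count of correct responses; averaging `pass_at_k(1000, c, k)` over problems gives the reported curves.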


📚 Citing

If you build upon this work, please cite the accompanying paper:

@article{yuan2025rlcompose,
  author    = {Lifan Yuan and Weize Chen and Yuchen Zhang and Ganqu Cui and Hanbin Wang and Ziming You and Ning Ding and Zhiyuan Liu and Maosong Sun and Hao Peng},
  title     = {From $f(x)$ and $g(x)$ to $f(g(x))$: {LLMs} Learn New Skills in {RL} by Composing Old Ones},
  journal   = {arXiv preprint arXiv:2509.25123},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.25123}
}
