This repository contains the code accompanying the paper "From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones".
The repository is organized around reproducible bash entrypoints located in `bash/`, grouped by the paper sections they support.
- Environment Setup
- Repository Layout
- Stage 1: Atomic Skill Acquisition
- Stage 2: Learning Compositional Skills
- Cross-Task Transfer to Countdown (Section 4.3)
- Pass@k Analysis (Section 4.4)
- Citing
## Environment Setup

- Clone the repository

  ```bash
  git clone https://github.com/PRIME-RL/RL-Compositionality.git
  cd RL-Compositionality
  ```

- Create a Python environment (Python ≥ 3.10 recommended, tested with 3.12)

  ```bash
  virtualenv rl_comp
  source rl_comp/bin/activate
  pip install -e .
  pip install flash-attn --no-build-isolation
  ```

- Log in to Wandb

  ```bash
  export WANDB_API_KEY="..."
  ```

- Model checkpoints: update the `MODEL_PATH` variables inside the bash scripts if your checkpoint is stored elsewhere.
## Repository Layout

- `bash/`: One-click pipelines grouped by paper section numbers.
  - `section41_42/`: Data generation, Stage 1 RFT, and Stage 2 RL/RFT experiments.
  - `section43/`: Countdown data generation and transfer experiments.
  - `section44/`: Pass@k evaluation utilities.
- `examples/`: Python entrypoints used by the bash scripts.
## Stage 1: Atomic Skill Acquisition

Stage 1 corresponds to rejection fine-tuning (RFT) on tasks where the function definitions are visible. The process produces a Stage 1 checkpoint that serves as the initialization for all Stage 2 experiments.
You can directly download our Stage-1 RFT training data from 🤗 Hugging Face:

```bash
hf download --repo-type dataset --local-dir data/string_task/stage1_level1/rft_data weizechen/RL-Compositionality-Stage1-RFT-Data
```

Alternatively, you can create the problems and collect the RFT data yourself:

1. Generate synthetic training data

   ```bash
   bash bash/section41_42/stage1_create_problems.sh
   ```

   This script populates `data/string_task/stage1_level1/` with train and test Parquet files containing atomic string transformations (the test Parquet is not actually used in the experiment).

2. Collect rollouts and convert them into RFT datasets (a filtering sketch follows this list)

   ```bash
   bash bash/section41_42/stage1_create_train_data.sh
   ```

   The script samples `N_SAMPLES` responses per prompt and stores them in `data/string_task/stage1_level1/rollout.parquet`. It then filters the rollouts by accuracy and emits `train.parquet`/`test.parquet` splits under `data/string_task/stage1_level1/rft_data/`.
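If you want to sanity-check the generated RFT data outside the script, the sketch below shows the accuracy-based filtering idea. The column names (`prompt`, `response`, `score`) and the 90/10 split are illustrative assumptions, not the exact schema produced by `stage1_create_train_data.sh`.

```python
# Minimal sketch of accuracy-based rollout filtering (assumed schema:
# one row per sampled response with columns `prompt`, `response`, `score`).
import pandas as pd

rollouts = pd.read_parquet("data/string_task/stage1_level1/rollout.parquet")

# Keep only rollouts that were scored as correct.
correct = rollouts[rollouts["score"] > 0]

# Deduplicate so each prompt contributes at most one (prompt, response) pair.
rft = correct.drop_duplicates(subset=["prompt"]).reset_index(drop=True)

# Illustrative 90/10 train/test split.
n_train = int(len(rft) * 0.9)
rft.iloc[:n_train].to_parquet("data/string_task/stage1_level1/rft_data/train.parquet")
rft.iloc[n_train:].to_parquet("data/string_task/stage1_level1/rft_data/test.parquet")
```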
Then train the Llama 3.1 8B Instruct model with the RFT data:

```bash
bash bash/section41_42/stage1_rft.sh
```

The resulting checkpoint is saved to `checkpoints/string-task/stage1-rft/` and initializes every Stage 2 run.
## Stage 2: Learning Compositional Skills

Stage 2 removes access to function implementations and focuses on compositional reasoning. All scripts assume that `bash/section41_42/stage1_rft.sh` has been executed successfully.
Download the RL training and evaluation datasets from 🤗 Hugging Face:

```bash
# Training
hf download --repo-type dataset --local-dir data/string_task/stage2_level1 weizechen/RL-Compositionality-Stage2-RL-Level1-TrainData
hf download --repo-type dataset --local-dir data/string_task/stage2_level2 weizechen/RL-Compositionality-Stage2-RL-Level2-TrainData

# Evaluation
hf download --repo-type dataset --local-dir data/string_task/stage2_level1to8 weizechen/RL-Compositionality-Stage2-RL-Level2-TestData
```

Alternatively, you can create the problems yourself:

```bash
bash bash/section41_42/stage2_create_problems.sh
```

This generates Level-1 and Level-2 training splits and a Level-1-to-8 evaluation split.
Launch RL training with your desired setting:

```bash
bash bash/section41_42/stage2_rl_level1.sh      # Level-1 only
bash bash/section41_42/stage2_rl_level2.sh      # Level-2 only
bash bash/section41_42/stage2_rl_level1to2.sh   # Mixed Level-1+2
```

To compare RL with supervised rejection fine-tuning:

1. Partition the Level-2 data into iterative RFT chunks

   ```bash
   bash bash/section41_42/stage2_rft_create_problems.sh
   ```

2. Collect rollouts and convert them into RFT datasets

   ```bash
   bash bash/section41_42/stage2_rft_create_train_data.sh
   ```

3. Fine-tune from the Stage 1 checkpoint

   ```bash
   bash bash/section41_42/stage2_rft_train.sh
   ```

4. Iterative RFT: repeat steps 2–3 with updated model/data paths.
## Cross-Task Transfer to Countdown (Section 4.3)

1. Generate Countdown arithmetic datasets

   ```bash
   bash bash/section43/create_countdown_data.sh
   ```

2. Collect rollouts and convert them to RFT data (a merging sketch follows this list)

   ```bash
   bash bash/section43/stage1_collect_train_data.sh
   ```

   The script collects the model's rollouts on Countdown Level-2 problems, filters them by accuracy, merges them with the string-task RFT data, and saves the result to `data/string_countdown_task/stage1_rft_data/`.

3. Train the model. You can reuse the script from `bash/section41_42/stage1_rft.sh` and change the data path to `data/string_countdown_task/stage1_rft_data`.
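For reference, the sketch below illustrates the kind of merge the collection step performs: concatenating filtered Countdown RFT rows with the existing string-task RFT data. The file paths and column layout are assumptions for illustration; `stage1_collect_train_data.sh` remains the authoritative pipeline.

```python
# Minimal sketch: merge filtered Countdown RFT rows with the string-task RFT
# data (assumed to share the same columns, e.g. `prompt` and `response`).
import pandas as pd

string_rft = pd.read_parquet("data/string_task/stage1_level1/rft_data/train.parquet")
countdown_rft = pd.read_parquet("data/countdown_task/rft_data/train.parquet")  # hypothetical path

# Concatenate and shuffle so the two tasks are interleaved during fine-tuning.
merged = pd.concat([string_rft, countdown_rft], ignore_index=True)
merged = merged.sample(frac=1.0, random_state=0).reset_index(drop=True)

merged.to_parquet("data/string_countdown_task/stage1_rft_data/train.parquet")
```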
Stage 2 training is the same as in Sections 4.1/4.2; you can reuse those scripts and change the model path.
## Pass@k Analysis (Section 4.4)

- 1000 rollout collection

  ```bash
  bash bash/section44/passk.sh
  ```

  This command samples completions from a trained policy, saves them to `results/stage2_rl_level1/all.parquet`, and enables downstream computation of pass@k metrics. Change the model path to obtain the 1000 responses from other models.
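Once `all.parquet` is available, pass@k can be computed with the standard unbiased estimator, pass@k = 1 − C(n−c, k) / C(n, k), averaged over prompts, where n is the number of sampled responses and c the number of correct ones. The sketch below assumes a prompt identifier column (`prompt`) and a per-response correctness column (`score`); adjust the names to match the actual output of `passk.sh`.

```python
# Minimal sketch: unbiased pass@k from n samples with c correct per prompt.
# Assumed columns: `prompt` (identifier) and `score` (>0 if the response is correct).
from math import comb

import pandas as pd


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    if c == 0:
        return 0.0
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


df = pd.read_parquet("results/stage2_rl_level1/all.parquet")

# Count total samples (n) and correct samples (c) for each prompt.
per_prompt = df.groupby("prompt")["score"].agg(n="count", c=lambda s: int((s > 0).sum()))

for k in (1, 8, 64, 256, 1000):
    score = per_prompt.apply(lambda row: pass_at_k(int(row["n"]), int(row["c"]), k), axis=1).mean()
    print(f"pass@{k}: {score:.4f}")
```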
## Citing

If you build upon this work, please cite the accompanying paper:

```bibtex
@article{yuan2025rlcompose,
  author  = {Lifan Yuan and Weize Chen and Yuchen Zhang and Ganqu Cui and Hanbin Wang and Ziming You and Ning Ding and Zhiyuan Liu and Maosong Sun and Hao Peng},
  title   = {From $f(x)$ and $g(x)$ to $f(g(x))$: {LLMs} Learn New Skills in {RL} by Composing Old Ones},
  journal = {arXiv preprint arXiv:2509.25123},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.25123}
}
```