This repository contains the code accompanying the paper "From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones".
The repository is organized around reproducible bash entrypoints located in `bash/`, grouped by the paper sections they support.
- Environment Setup
- Repository Layout
- Stage 1: Atomic Skill Acquisition
- Stage 2: Learning Compositional Skills
- Cross-Task Transfer to Countdown (Section 4.3)
- Pass@k Analysis (Section 4.4)
- Citing
## Environment Setup

- Clone the repository

  ```bash
  git clone https://github.com/PRIME-RL/RL-Compositionality.git
  cd RL-Compositionality
  ```

- Create a Python environment (Python ≥ 3.10 recommended, tested with 3.12)

  ```bash
  virtualenv rl_comp
  source rl_comp/bin/activate
  pip install -e .
  pip install flash-attn --no-build-isolation
  ```

- Log in to Wandb

  ```bash
  export WANDB_API_KEY="..."
  ```

- Model checkpoints: update the `MODEL_PATH` variables inside the bash scripts if your checkpoint is stored elsewhere.
## Repository Layout

- `bash/`: One-click pipelines grouped by paper section numbers.
  - `section41_42/`: Data generation, Stage 1 RFT, and Stage 2 RL/RFT experiments.
  - `section43/`: Countdown data generation and transfer experiments.
  - `section44/`: Pass@k evaluation utilities.
- `examples/`: Python entrypoints used by the bash scripts.
## Stage 1: Atomic Skill Acquisition

Stage 1 corresponds to rejection fine-tuning (RFT) on tasks where the function definitions are visible. The process produces a Stage 1 checkpoint that serves as the initialization for all Stage 2 experiments.
You can directly download our Stage-1 RFT training data from 🤗 Hugging Face:

```bash
hf download --repo-type dataset --local-dir data/string_task/stage1_level1/rft_data weizechen/RL-Compositionality-Stage1-RFT-Data
```

Alternatively, you can create the problems and collect the RFT data yourself:

1. Generate synthetic training data

   ```bash
   bash bash/section41_42/stage1_create_problems.sh
   ```

   This script populates `data/string_task/stage1_level1/` with train and test Parquet files containing atomic string transformations (the test Parquet is not actually used in the experiment).

2. Collect rollouts and convert them into RFT datasets (a filtering sketch follows this list)

   ```bash
   bash bash/section41_42/stage1_create_train_data.sh
   ```

   The script samples `N_SAMPLES` responses per prompt and stores them in `data/string_task/stage1_level1/rollout.parquet`. It then filters the rollouts by accuracy and emits `train.parquet`/`test.parquet` splits under `data/string_task/stage1_level1/rft_data/`.
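If you want to sanity-check the generated RFT data outside the script, the sketch below shows the accuracy-based filtering idea. The column names (`prompt`, `response`, `score`) and the 90/10 split are illustrative assumptions, not the exact schema produced by `stage1_create_train_data.sh`.

```python
# Minimal sketch of accuracy-based rollout filtering (assumed schema:
# one row per sampled response with columns `prompt`, `response`, `score`).
import pandas as pd

rollouts = pd.read_parquet("data/string_task/stage1_level1/rollout.parquet")

# Keep only rollouts that were scored as correct.
correct = rollouts[rollouts["score"] > 0]

# Deduplicate so each prompt contributes at most one (prompt, response) pair.
rft = correct.drop_duplicates(subset=["prompt"]).reset_index(drop=True)

# Illustrative 90/10 train/test split.
n_train = int(len(rft) * 0.9)
rft.iloc[:n_train].to_parquet("data/string_task/stage1_level1/rft_data/train.parquet")
rft.iloc[n_train:].to_parquet("data/string_task/stage1_level1/rft_data/test.parquet")
```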
Then train the Llama 3.1 8B Instruct model with the RFT data:

```bash
bash bash/section41_42/stage1_rft.sh
```

The resulting checkpoint is saved to `checkpoints/string-task/stage1-rft/` and initializes every Stage 2 run.
## Stage 2: Learning Compositional Skills

Stage 2 removes access to function implementations and focuses on compositional reasoning. All scripts assume that `bash/section41_42/stage1_rft.sh` has been executed successfully.
Download the RL training and evaluation datasets from 🤗 Hugging Face:

```bash
# Training
hf download --repo-type dataset --local-dir data/string_task/stage2_level1 weizechen/RL-Compositionality-Stage2-RL-Level1-TrainData
hf download --repo-type dataset --local-dir data/string_task/stage2_level2 weizechen/RL-Compositionality-Stage2-RL-Level2-TrainData

# Evaluation
hf download --repo-type dataset --local-dir data/string_task/stage2_level1to8 weizechen/RL-Compositionality-Stage2-RL-Level2-TestData
```

Alternatively, you can create the problems yourself:

```bash
bash bash/section41_42/stage2_create_problems.sh
```

This generates Level-1 and Level-2 training splits and a Level-1-to-8 evaluation split.
Launch RL training with your desired setting:

```bash
bash bash/section41_42/stage2_rl_level1.sh      # Level-1 only
bash bash/section41_42/stage2_rl_level2.sh      # Level-2 only
bash bash/section41_42/stage2_rl_level1to2.sh   # Mixed Level-1+2
```

To compare RL with supervised rejection fine-tuning:

1. Partition the Level-2 data into iterative RFT chunks

   ```bash
   bash bash/section41_42/stage2_rft_create_problems.sh
   ```

2. Collect rollouts and convert them into RFT datasets

   ```bash
   bash bash/section41_42/stage2_rft_create_train_data.sh
   ```

3. Fine-tune from the Stage 1 checkpoint

   ```bash
   bash bash/section41_42/stage2_rft_train.sh
   ```

4. Iterative RFT: repeat steps 2–3 with updated model/data paths.
## Cross-Task Transfer to Countdown (Section 4.3)

1. Generate Countdown arithmetic datasets

   ```bash
   bash bash/section43/create_countdown_data.sh
   ```

2. Collect rollouts and convert them to RFT data (a merging sketch follows this list)

   ```bash
   bash bash/section43/stage1_collect_train_data.sh
   ```

   The script collects the model's rollouts on Countdown Level-2 problems, filters them by accuracy, merges them with the string-task RFT data, and saves the result to `data/string_countdown_task/stage1_rft_data/`.

3. Train the model. You can reuse the script from `bash/section41_42/stage1_rft.sh` and change the data path to `data/string_countdown_task/stage1_rft_data`.
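For reference, the sketch below illustrates the kind of merge the collection step performs: concatenating filtered Countdown RFT rows with the existing string-task RFT data. The file paths and column layout are assumptions for illustration; `stage1_collect_train_data.sh` remains the authoritative pipeline.

```python
# Minimal sketch: merge filtered Countdown RFT rows with the string-task RFT
# data (assumed to share the same columns, e.g. `prompt` and `response`).
import pandas as pd

string_rft = pd.read_parquet("data/string_task/stage1_level1/rft_data/train.parquet")
countdown_rft = pd.read_parquet("data/countdown_task/rft_data/train.parquet")  # hypothetical path

# Concatenate and shuffle so the two tasks are interleaved during fine-tuning.
merged = pd.concat([string_rft, countdown_rft], ignore_index=True)
merged = merged.sample(frac=1.0, random_state=0).reset_index(drop=True)

merged.to_parquet("data/string_countdown_task/stage1_rft_data/train.parquet")
```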
Stage 2 training is the same as in Sections 4.1/4.2; you can reuse those scripts and change the model path.
## Pass@k Analysis (Section 4.4)

- 1000 rollout collection

  ```bash
  bash bash/section44/passk.sh
  ```

  This command samples completions from a trained policy, saves them to `results/stage2_rl_level1/all.parquet`, and enables downstream computation of pass@k metrics. Change the model path to obtain the 1000 responses from other models.
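Once `all.parquet` is available, pass@k can be computed with the standard unbiased estimator, pass@k = 1 − C(n−c, k) / C(n, k), averaged over prompts, where n is the number of sampled responses and c the number of correct ones. The sketch below assumes a prompt identifier column (`prompt`) and a per-response correctness column (`score`); adjust the names to match the actual output of `passk.sh`.

```python
# Minimal sketch: unbiased pass@k from n samples with c correct per prompt.
# Assumed columns: `prompt` (identifier) and `score` (>0 if the response is correct).
from math import comb

import pandas as pd


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
    if c == 0:
        return 0.0
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


df = pd.read_parquet("results/stage2_rl_level1/all.parquet")

# Count total samples (n) and correct samples (c) for each prompt.
per_prompt = df.groupby("prompt")["score"].agg(n="count", c=lambda s: int((s > 0).sum()))

for k in (1, 8, 64, 256, 1000):
    score = per_prompt.apply(lambda row: pass_at_k(int(row["n"]), int(row["c"]), k), axis=1).mean()
    print(f"pass@{k}: {score:.4f}")
```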
## Citing

If you build upon this work, please cite the accompanying paper:

```bibtex
@article{yuan2025rlcompose,
  author  = {Lifan Yuan and Weize Chen and Yuchen Zhang and Ganqu Cui and Hanbin Wang and Ziming You and Ning Ding and Zhiyuan Liu and Maosong Sun and Hao Peng},
  title   = {From $f(x)$ and $g(x)$ to $f(g(x))$: {LLMs} Learn New Skills in {RL} by Composing Old Ones},
  journal = {arXiv preprint arXiv:2509.25123},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.25123}
}
```