This project combines reinforcement learning (RL) with large language models (LLMs) to improve exploration at inference time through synthetic data generation.
It implements a multi-step agent loop in which the model can invoke different tools (a calculator, dataset access, a search engine), guided by a reward function that incentivizes diverse tool usage and effective problem-solving.
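As a rough illustration of the reward-shaping idea, here is a minimal sketch; the function name, weighting, and judge score are assumptions for illustration only, and the project's actual reward components live in `src/rewards/`:

```python
# Hypothetical sketch of a reward combining task success with a bonus for
# using distinct tools. Names and weights are illustrative, not the project's
# actual reward (see src/rewards/ for the real components).
def trajectory_reward(tool_calls: list[str], judge_score: float,
                      diversity_weight: float = 0.2) -> float:
    """Task-success score plus a bonus proportional to distinct tools used."""
    n_tools_available = 3  # calculator, dataset access, search engine
    diversity_bonus = len(set(tool_calls)) / n_tools_available
    return judge_score + diversity_weight * diversity_bonus

# Two distinct tools out of three with a judge score of 0.8:
# 0.8 + 0.2 * (2 / 3) ≈ 0.93
print(trajectory_reward(["search", "calculator", "search"], 0.8))
```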
## Installation

```bash
# Clone the repository
git clone [repo-url]
cd synthetic-explore-rl
```

### Option 1: pip

```bash
# Install dependencies (requirements.txt covers all project needs, including spark_rl)
pip install -r requirements.txt
```

### Option 2: conda

```bash
# Method 1: Using environment.yml
conda env create -f environment.yml
conda activate explore-rl

# Method 2: Using the setup script (auto-detects platform)
chmod +x scripts/setup_conda.sh
./scripts/setup_conda.sh
```

### Download datasets

```bash
python -m src.data.download_datasets --base_dir .
```
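If you only need HotpotQA itself, a minimal alternative is to load it directly from the Hugging Face Hub; this is a sketch assuming the `datasets` library, and the project's own script in `src/data/download_datasets.py` may do more than this:

```python
# Minimal alternative to the download script: fetch HotpotQA from the
# Hugging Face Hub. Illustration only, not necessarily what
# src/data/download_datasets.py does.
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor")  # "train" / "validation" splits
print(hotpot["train"][0]["question"])
```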
### PPO-based Reinforcement Learning from Offline Trajectories (spark_rl)
This project includes capabilities for fine-tuning LLMs using Proximal Policy Optimization (PPO) from offline trajectory data, leveraging QLoRA for efficiency. This is particularly useful for adapting models to complex sequential decision-making tasks, such as those involving tool use, based on pre-collected interaction data.
The PPO implementation, training scripts, and detailed instructions are located in the `spark_rl/` directory. Please refer to `spark_rl/README.md` for specific setup steps, example training commands, and information on dependencies.
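To make the QLoRA setup concrete, here is a minimal sketch of preparing a 4-bit quantized base model with trainable LoRA adapters, assuming the Hugging Face `transformers` and `peft` libraries; the model name and hyperparameters are illustrative placeholders, not `spark_rl`'s actual configuration:

```python
# Illustrative QLoRA preparation: quantize the base model to 4-bit and attach
# trainable LoRA adapters. Placeholder model and hyperparameters; see
# spark_rl/README.md for the real training entry points.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",      # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)   # only adapter weights train
```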
**Key Features within `spark_rl`:**
- Offline PPO agent for LLM fine-tuning.
- QLoRA integration for memory-efficient training.
- Flexible data loading from local files or Hugging Face Hub (for training trajectories).
- Advanced evaluation script (`spark_rl/evaluate.py`) capable of:
  - Loading test questions and options directly from Hugging Face datasets (e.g., MMLU-Pro).
  - Using a local CSV to select specific question IDs for evaluation from the HF dataset.
  - Employing GPT-4o for robust extraction of multiple-choice answers from model generations (a minimal sketch of this step follows below).
  - Calculating accuracy against ground-truth answers from the Hugging Face dataset.
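The answer-extraction step works roughly as follows; this is a hypothetical sketch using the OpenAI Python client, where the prompt wording and function name are assumptions and `spark_rl/evaluate.py` defines the actual logic:

```python
# Hypothetical sketch of GPT-4o-based answer extraction: ask the model to map
# a free-form generation to a single option letter. Prompt and function name
# are assumptions; see spark_rl/evaluate.py for the real implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_choice(generation: str, options: list[str]) -> str:
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Options:\n{option_block}\n\n"
                f"Model answer:\n{generation}\n\n"
                "Which option letter does the answer select? Reply with one letter."
            ),
        }],
    )
    return response.choices[0].message.content.strip()[0]
```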
#### Platform-specific Setup Notes
- **macOS (Apple Silicon M1/M2/M3)**: The environment will automatically install PyTorch for Apple Silicon.
- **macOS (Intel)**: The environment will install Intel-compatible PyTorch.
- **Linux with NVIDIA GPU**: The setup script will detect CUDA capability and install appropriate PyTorch packages with CUDA support.
- **Linux without GPU**: CPU-only PyTorch will be installed.
## Project Structure
```
synthetic-explore-rl/
├── configs/              # hydra yaml configurations
├── data/                 # datasets and processed data
│   ├── hotpotqa/
│   └── aimo/
├── src/
│   ├── tools/            # calculator.py, search.py, etc.
│   ├── env/              # rollout_driver.py, langgraph_node.py
│   ├── rewards/          # judge_llm.py, kl.py
│   ├── agent/            # policy_lora.py, value_lora.py
│   ├── rl/               # ppo_trainer.py
│   └── eval/             # hotpot_eval.py, mmlu_eval.py
├── scripts/              # utility scripts
│   ├── setup_conda.sh    # conda environment setup
│   ├── run_sft.sh        # run supervised fine-tuning
│   └── test_sft_model.sh # test fine-tuned models
├── environment.yml       # conda environment specification
└── requirements.txt      # pip requirements
```
## Model Fine-tuning
### Supervised Fine-tuning on HotpotQA
```bash
# Activate the conda environment
conda activate explore-rl
# Run supervised fine-tuning
./scripts/run_sft.sh
# Test the fine-tuned model
./scripts/test_sft_model.sh --model_path checkpoints/sft-hotpotqa/final
```

## Hardware Requirements

- For training and fine-tuning:
  - NVIDIA GPU with at least 16GB VRAM (recommended)
  - Or an Apple Silicon Mac with at least 16GB RAM (slower but supported)
- For inference only:
  - At least 8GB RAM/VRAM
## Roadmap

- Baseline establishment with a zero-shot/few-shot LLM
- Setting up task environment & tool interface
- Reward & value design
- Synthetic-data RL loop (SWiRL-style)
- Evaluation & ablations
## License

[License information]
---

## Supervised Fine-tuning Details

The following sections describe fine-tuning the Llama-3 model on the HotpotQA dataset using LoRA (Low-Rank Adaptation).

### Requirements
- Python 3.10+
- CUDA-capable GPU(s) (training requires GPU)
- At least 16GB GPU memory per GPU
### Setup

- Clone this repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Create and activate a conda environment:

  ```bash
  conda create -n explore-rl python=3.10
  conda activate explore-rl
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the required data:

  ```bash
  chmod +x data/download_data.sh
  ./data/download_data.sh
  ```

### Training

Single-GPU training:

```bash
python -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device cuda:0 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

Multi-GPU training:

```bash
torchrun --nproc_per_node=2 -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device auto \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

Replace `--nproc_per_node=2` with the number of GPUs you want to use.
### Notes

- **GPU Requirements**: Training requires CUDA-capable GPUs; the model is too large to train on CPU.
- **Memory Requirements**: Each GPU should have at least 16GB of memory. Adjust the batch size and gradient accumulation steps if you encounter out-of-memory errors.
- **Data Location**: The training script expects data in `data/hotpotqa/hotpot_data/`. Make sure to run the download script first.
- **Checkpoints**: Model checkpoints are saved in the specified `output_dir` and are not tracked by git.
### Troubleshooting

- If you encounter CUDA out-of-memory errors:
  - Reduce the batch size
  - Increase the gradient accumulation steps
  - Reduce the max sequence length
- If you encounter file-not-found errors:
  - Make sure you've run the download script
  - Check that the data is in the correct location
  - Ensure you have write permissions in the output directory
- For multi-GPU training issues:
  - Make sure all GPUs are available and not in use
  - Check that CUDA is properly installed
  - Verify that torch can see all GPUs with `torch.cuda.device_count()`
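When adjusting the batch size and gradient accumulation steps, note the standard relationship (not a project-specific setting): the effective batch size is `batch_size × gradient_accumulation_steps × num_gpus`. With the example flags above, that is 4 × 4 = 16 samples per optimizer step on a single GPU, or 4 × 4 × 2 = 32 across two GPUs, so you can lower the per-device batch size and raise the accumulation steps without changing the effective batch.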