This project combines reinforcement learning (RL) with large language models (LLMs) to improve exploration at inference time through synthetic data generation.
It implements a multi-step agent loop in which the model can invoke different tools (a calculator, dataset access, a search engine), guided by a reward function that incentivizes diverse tool usage and effective problem-solving.
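As a rough illustration of the reward-shaping idea, here is a minimal sketch; the function name, weighting, and judge score are assumptions for illustration only, and the project's actual reward components live in `src/rewards/`:

```python
# Hypothetical sketch of a reward combining task success with a bonus for
# using distinct tools. Names and weights are illustrative, not the project's
# actual reward (see src/rewards/ for the real components).
def trajectory_reward(tool_calls: list[str], judge_score: float,
                      diversity_weight: float = 0.2) -> float:
    """Task-success score plus a bonus proportional to distinct tools used."""
    n_tools_available = 3  # calculator, dataset access, search engine
    diversity_bonus = len(set(tool_calls)) / n_tools_available
    return judge_score + diversity_weight * diversity_bonus

# Two distinct tools out of three with a judge score of 0.8:
# 0.8 + 0.2 * (2 / 3) ≈ 0.93
print(trajectory_reward(["search", "calculator", "search"], 0.8))
```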
## Installation

```bash
# Clone the repository
git clone [repo-url]
cd synthetic-explore-rl
```

### Option 1: pip

```bash
# Install dependencies (requirements.txt covers all project needs, including spark_rl)
pip install -r requirements.txt
```

### Option 2: conda

```bash
# Method 1: Using environment.yml
conda env create -f environment.yml
conda activate explore-rl

# Method 2: Using the setup script (auto-detects platform)
chmod +x scripts/setup_conda.sh
./scripts/setup_conda.sh
```

### Download datasets

```bash
python -m src.data.download_datasets --base_dir .
```
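If you only need HotpotQA itself, a minimal alternative is to load it directly from the Hugging Face Hub; this is a sketch assuming the `datasets` library, and the project's own script in `src/data/download_datasets.py` may do more than this:

```python
# Minimal alternative to the download script: fetch HotpotQA from the
# Hugging Face Hub. Illustration only, not necessarily what
# src/data/download_datasets.py does.
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor")  # "train" / "validation" splits
print(hotpot["train"][0]["question"])
```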
### PPO-based Reinforcement Learning from Offline Trajectories (spark_rl)
This project includes capabilities for fine-tuning LLMs using Proximal Policy Optimization (PPO) from offline trajectory data, leveraging QLoRA for efficiency. This is particularly useful for adapting models to complex sequential decision-making tasks, such as those involving tool use, based on pre-collected interaction data.
The PPO implementation, training scripts, and detailed instructions are located in the `spark_rl/` directory. Please refer to `spark_rl/README.md` for specific setup steps, example training commands, and information on dependencies.
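To make the QLoRA setup concrete, here is a minimal sketch of preparing a 4-bit quantized base model with trainable LoRA adapters, assuming the Hugging Face `transformers` and `peft` libraries; the model name and hyperparameters are illustrative placeholders, not `spark_rl`'s actual configuration:

```python
# Illustrative QLoRA preparation: quantize the base model to 4-bit and attach
# trainable LoRA adapters. Placeholder model and hyperparameters; see
# spark_rl/README.md for the real training entry points.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",      # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)   # only adapter weights train
```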
**Key Features within `spark_rl`:**
- Offline PPO agent for LLM fine-tuning.
- QLoRA integration for memory-efficient training.
- Flexible data loading from local files or Hugging Face Hub (for training trajectories).
- Advanced evaluation script (`spark_rl/evaluate.py`) capable of:
  - Loading test questions and options directly from Hugging Face datasets (e.g., MMLU-Pro).
  - Using a local CSV to select specific question IDs for evaluation from the HF dataset.
  - Employing GPT-4o for robust extraction of multiple-choice answers from model generations (a minimal sketch of this step follows below).
  - Calculating accuracy against ground-truth answers from the Hugging Face dataset.
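The answer-extraction step works roughly as follows; this is a hypothetical sketch using the OpenAI Python client, where the prompt wording and function name are assumptions and `spark_rl/evaluate.py` defines the actual logic:

```python
# Hypothetical sketch of GPT-4o-based answer extraction: ask the model to map
# a free-form generation to a single option letter. Prompt and function name
# are assumptions; see spark_rl/evaluate.py for the real implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_choice(generation: str, options: list[str]) -> str:
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Options:\n{option_block}\n\n"
                f"Model answer:\n{generation}\n\n"
                "Which option letter does the answer select? Reply with one letter."
            ),
        }],
    )
    return response.choices[0].message.content.strip()[0]
```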
#### Platform-specific Setup Notes
- **macOS (Apple Silicon M1/M2/M3)**: The environment will automatically install PyTorch for Apple Silicon.
- **macOS (Intel)**: The environment will install Intel-compatible PyTorch.
- **Linux with NVIDIA GPU**: The setup script will detect CUDA capability and install appropriate PyTorch packages with CUDA support.
- **Linux without GPU**: CPU-only PyTorch will be installed.
## Project Structure
```
synthetic-explore-rl/
├── configs/              # hydra yaml configurations
├── data/                 # datasets and processed data
│   ├── hotpotqa/
│   └── aimo/
├── src/
│   ├── tools/            # calculator.py, search.py, etc.
│   ├── env/              # rollout_driver.py, langgraph_node.py
│   ├── rewards/          # judge_llm.py, kl.py
│   ├── agent/            # policy_lora.py, value_lora.py
│   ├── rl/               # ppo_trainer.py
│   └── eval/             # hotpot_eval.py, mmlu_eval.py
├── scripts/              # utility scripts
│   ├── setup_conda.sh    # conda environment setup
│   ├── run_sft.sh        # run supervised fine-tuning
│   └── test_sft_model.sh # test fine-tuned models
├── environment.yml       # conda environment specification
└── requirements.txt      # pip requirements
```
## Model Fine-tuning
### Supervised Fine-tuning on HotpotQA
```bash
# Activate the conda environment
conda activate explore-rl
# Run supervised fine-tuning
./scripts/run_sft.sh
# Test the fine-tuned model
./scripts/test_sft_model.sh --model_path checkpoints/sft-hotpotqa/final
```

## Hardware Requirements

- For training and fine-tuning:
  - NVIDIA GPU with at least 16GB VRAM (recommended)
  - Or an Apple Silicon Mac with at least 16GB RAM (slower but supported)
- For inference only:
  - At least 8GB RAM/VRAM
## Roadmap

- Baseline establishment with a zero-shot/few-shot LLM
- Setting up task environment & tool interface
- Reward & value design
- Synthetic-data RL loop (SWiRL-style)
- Evaluation & ablations
## License

[License information]
---

## Supervised Fine-tuning Details

The following sections describe fine-tuning the Llama-3 model on the HotpotQA dataset using LoRA (Low-Rank Adaptation).

### Requirements
- Python 3.10+
- CUDA-capable GPU(s) (training requires GPU)
- At least 16GB GPU memory per GPU
### Setup

- Clone this repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Create and activate a conda environment:

  ```bash
  conda create -n explore-rl python=3.10
  conda activate explore-rl
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the required data:

  ```bash
  chmod +x data/download_data.sh
  ./data/download_data.sh
  ```

### Training

Single-GPU training:

```bash
python -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device cuda:0 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

Multi-GPU training:

```bash
torchrun --nproc_per_node=2 -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device auto \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

Replace `--nproc_per_node=2` with the number of GPUs you want to use.
### Notes

- **GPU Requirements**: Training requires CUDA-capable GPUs; the model is too large to train on CPU.
- **Memory Requirements**: Each GPU should have at least 16GB of memory. Adjust the batch size and gradient accumulation steps if you encounter out-of-memory errors.
- **Data Location**: The training script expects data in `data/hotpotqa/hotpot_data/`. Make sure to run the download script first.
- **Checkpoints**: Model checkpoints are saved in the specified `output_dir` and are not tracked by git.
### Troubleshooting

- If you encounter CUDA out-of-memory errors:
  - Reduce the batch size
  - Increase the gradient accumulation steps
  - Reduce the max sequence length
- If you encounter file-not-found errors:
  - Make sure you've run the download script
  - Check that the data is in the correct location
  - Ensure you have write permissions in the output directory
- For multi-GPU training issues:
  - Make sure all GPUs are available and not in use
  - Check that CUDA is properly installed
  - Verify that torch can see all GPUs with `torch.cuda.device_count()`
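When adjusting the batch size and gradient accumulation steps, note the standard relationship (not a project-specific setting): the effective batch size is `batch_size × gradient_accumulation_steps × num_gpus`. With the example flags above, that is 4 × 4 = 16 samples per optimizer step on a single GPU, or 4 × 4 × 2 = 32 across two GPUs, so you can lower the per-device batch size and raise the accumulation steps without changing the effective batch.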