# Synthetic Data Generation that Incentivizes RL Exploration

This project combines reinforcement learning (RL) and large language models (LLMs) to improve exploration during inference through synthetic data generation.

## Project Overview

The project implements a multi-step process where different tools (calculator, dataset access, search engine) are used with a reward function that incentivizes diverse tool usage and effective problem-solving.
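The concrete reward shaping lives under `src/rewards/`. As a rough illustration only (the function name and weighting below are hypothetical, not the project's actual implementation), a diversity bonus that nudges the policy toward using more than one tool could look like:

```python
from collections import Counter

def diversity_bonus(tool_calls: list[str], weight: float = 0.1) -> float:
    """Toy bonus that rewards spreading calls across distinct tools.

    `tool_calls` is the sequence of tool names used in one rollout,
    e.g. ["calculator", "search", "calculator"].
    """
    if not tool_calls:
        return 0.0
    distinct = len(Counter(tool_calls))
    # Fraction of distinct tools among all calls, scaled by a small weight
    # so the task reward (answer correctness) still dominates.
    return weight * distinct / len(tool_calls)

def total_reward(task_reward: float, tool_calls: list[str]) -> float:
    # Hypothetical combination: correctness reward plus exploration bonus.
    return task_reward + diversity_bonus(tool_calls)
```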

## Setup

### Using pip

```bash
# Clone the repository
git clone [repo-url]
cd synthetic-explore-rl

# Install dependencies (this requirements.txt covers all project needs, including spark_rl)
pip install -r requirements.txt

# Download datasets
# python src/data/download_datasets.py  # Optional, depends on specific needs
```

### Using conda (recommended)

```bash
# Clone the repository
git clone [repo-url]
cd synthetic-explore-rl

# Method 1: Using environment.yml
conda env create -f environment.yml
conda activate explore-rl

# Method 2: Using the setup script (auto-detects platform)
chmod +x scripts/setup_conda.sh
./scripts/setup_conda.sh

# Download datasets
python -m src.data.download_datasets --base_dir .
```

### PPO-based Reinforcement Learning from Offline Trajectories (spark_rl)

This project includes capabilities for fine-tuning LLMs using Proximal Policy Optimization (PPO) from offline trajectory data, leveraging QLoRA for efficiency. This is particularly useful for adapting models to complex sequential decision-making tasks, such as those involving tool use, based on pre-collected interaction data.
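As orientation only, the sketch below shows how a QLoRA policy model is typically assembled with `transformers` and `peft`; the base model name and LoRA hyperparameters are assumptions, and `spark_rl/README.md` remains the authoritative reference for the actual setup.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach LoRA adapters; only these low-rank matrices are updated during PPO.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```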

The PPO implementation, training scripts, and detailed instructions are located in the `spark_rl/` directory. Please refer to `spark_rl/README.md` for specific setup steps, example training commands, and information on dependencies.

**Key Features within `spark_rl`:**
- Offline PPO agent for LLM fine-tuning.
- QLoRA integration for memory-efficient training.
- Flexible data loading from local files or Hugging Face Hub (for training trajectories).
- Advanced evaluation script (`spark_rl/evaluate.py`) capable of:
    - Loading test questions and options directly from Hugging Face datasets (e.g., MMLU-Pro).
    - Utilizing a local CSV to select specific question IDs for evaluation from the HF dataset (see the sketch after this list).
    - Employing GPT-4o for robust extraction of multiple-choice answers from model generations.
    - Calculating accuracy against ground truth answers from the Hugging Face dataset.
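As a rough sketch of the dataset side of that evaluation flow (the dataset id, CSV path, and column names below are assumptions; `spark_rl/evaluate.py` and its flags are authoritative):

```python
import pandas as pd
from datasets import load_dataset

# MMLU-Pro test split from the Hugging Face Hub (dataset id assumed).
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Optionally restrict evaluation to question IDs listed in a local CSV
# (file name and column name are hypothetical).
wanted_ids = set(pd.read_csv("eval_question_ids.csv")["question_id"])
subset = mmlu_pro.filter(lambda ex: ex["question_id"] in wanted_ids)

print(f"Evaluating on {len(subset)} of {len(mmlu_pro)} questions")
```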

#### Platform-specific Setup Notes

- **macOS (Apple Silicon M1/M2/M3)**: The environment will automatically install PyTorch for Apple Silicon.
- **macOS (Intel)**: The environment will install Intel-compatible PyTorch.
- **Linux with NVIDIA GPU**: The setup script will detect CUDA capability and install appropriate PyTorch packages with CUDA support.
- **Linux without GPU**: CPU-only PyTorch will be installed.

## Project Structure

```
synthetic-explore-rl/
├── configs/              # hydra yaml configurations
├── data/                 # datasets and processed data
│   ├── hotpotqa/
│   └── aimo/
├── src/
│   ├── tools/            # calculator.py, search.py, etc.
│   ├── env/              # rollout_driver.py, langgraph_node.py
│   ├── rewards/          # judge_llm.py, kl.py
│   ├── agent/            # policy_lora.py, value_lora.py
│   ├── rl/               # ppo_trainer.py
│   └── eval/             # hotpot_eval.py, mmlu_eval.py
├── scripts/              # utility scripts
│   ├── setup_conda.sh    # conda environment setup
│   ├── run_sft.sh        # run supervised fine-tuning
│   └── test_sft_model.sh # test fine-tuned models
├── environment.yml       # conda environment specification
└── requirements.txt      # pip requirements
```


## Model Fine-tuning

### Supervised Fine-tuning on HotpotQA

```bash
# Activate the conda environment
conda activate explore-rl

# Run supervised fine-tuning
./scripts/run_sft.sh

# Test the fine-tuned model
./scripts/test_sft_model.sh --model_path checkpoints/sft-hotpotqa/final
```

## Hardware Requirements

- For training and fine-tuning:
  - NVIDIA GPU with at least 16GB VRAM (recommended)
  - Or, Apple Silicon Mac with at least 16GB RAM (slower but supported)
- For inference only:
  - At least 8GB RAM/VRAM

## Training Process

1. Baseline establishment with zero-shot/few-shot LLM
2. Setting up task environment & tool interface
3. Reward & value design
4. Synthetic-data RL loop (SWiRL-style, sketched below)
5. Evaluation & ablations
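For step 4, the loop can be pictured roughly as the skeleton below. Every callable is an injected placeholder standing in for components under `src/` (rollout driver, judge reward, PPO trainer); the real interfaces live in those modules.

```python
from typing import Callable, Iterable

def synthetic_data_rl_loop(
    policy,
    tasks: Iterable,
    rollout: Callable,        # stand-in for src/env/rollout_driver.py
    reward_fn: Callable,      # stand-in for src/rewards/judge_llm.py
    update_policy: Callable,  # stand-in for src/rl/ppo_trainer.py
    num_iterations: int = 5,
    keep_threshold: float = 0.5,
):
    """Skeleton of a SWiRL-style synthetic-data RL loop (illustrative only)."""
    for _ in range(num_iterations):
        # 1. Roll out the current policy with tool access to get trajectories.
        trajectories = [rollout(policy, task) for task in tasks]
        # 2. Score each trajectory (task success plus exploration/diversity bonus).
        scored = [(traj, reward_fn(traj)) for traj in trajectories]
        # 3. Keep higher-reward trajectories as synthetic training data.
        keep = [traj for traj, r in scored if r >= keep_threshold]
        # 4. Update the policy on the kept data (PPO or supervised fine-tuning).
        policy = update_policy(policy, keep)
    return policy
```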

## License

[License information]

## HotpotQA Fine-tuning with Llama-3

This section covers fine-tuning the Llama-3 model on the HotpotQA dataset using LoRA (Low-Rank Adaptation).
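For context, LoRA freezes the base weights and learns a low-rank update ΔW = B·A with rank r much smaller than the hidden dimension, so only a small fraction of parameters is trained. As a rough example (assuming adapters on 4096×4096 attention projections), `--lora_r 8` trains about 8 × (4096 + 4096) ≈ 65K parameters per adapted matrix instead of roughly 16.8M, and `--lora_alpha 16` scales the learned update by α/r = 2.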

### Requirements

- Python 3.10+
- CUDA-capable GPU(s) (training requires GPU)
- At least 16GB GPU memory per GPU

### Setup

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd <repository-name>
   ```

2. Create and activate a conda environment:

   ```bash
   conda create -n explore-rl python=3.10
   conda activate explore-rl
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download the required data:

   ```bash
   chmod +x data/download_data.sh
   ./data/download_data.sh
   ```

### Training

#### Single GPU Training

```bash
python -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device cuda:0 \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

#### Multi-GPU Training

```bash
torchrun --nproc_per_node=2 -m src.train_sft \
  --base_model meta-llama/Llama-3.1-8b-instruct \
  --data_path data/hotpotqa \
  --output_dir checkpoints/sft-baseline \
  --device auto \
  --epochs 3 \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --max_seq_length 512 \
  --max_samples 1000 \
  --log_steps 10 \
  --save_steps 100
```

Replace `--nproc_per_node=2` with the number of GPUs you want to use.
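Assuming the script follows the usual convention that `--batch_size` is per device, the effective batch per optimizer step is roughly `nproc_per_node × batch_size × gradient_accumulation_steps`: 2 × 4 × 4 = 32 examples for the command above, versus 4 × 4 = 16 for the single-GPU command.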

### Important Notes

1. **GPU Requirements**: Training requires CUDA-capable GPUs. The model is too large to train on CPU.
2. **Memory Requirements**: Each GPU should have at least 16GB of memory. Adjust batch size and gradient accumulation steps if you encounter out-of-memory errors.
3. **Data Location**: The training script expects data in `data/hotpotqa/hotpot_data/`. Make sure to run the download script first.
4. **Checkpoints**: Model checkpoints will be saved in the specified `output_dir`. These are not tracked by git.

### Troubleshooting

1. If you encounter CUDA out-of-memory errors:
   - Reduce the batch size
   - Increase gradient accumulation steps
   - Reduce max sequence length
2. If you encounter file not found errors:
   - Make sure you've run the download script
   - Check that the data is in the correct location
   - Ensure you have write permissions in the output directory
3. For multi-GPU training issues:
   - Make sure all GPUs are available and not in use
   - Check that CUDA is properly installed
   - Verify that torch can see all GPUs with `torch.cuda.device_count()` (see the snippet below)
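For example, this quick check (plain PyTorch calls, not a project script) confirms CUDA availability and lists the GPUs torch can see:

```python
import torch

# Sanity check before launching multi-GPU training.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```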
