A comprehensive implementation of the Group Relative Policy Optimization (GRPO) algorithm for reinforcement learning, with synthetic environments that model real-world uncertainties.
- Overview
- Features
- Installation
- Quick Start
- Detailed Usage
- Algorithm Details
- Project Structure
- End-to-End Workflow
- Examples
- Results and Visualizations
- Contributing
- License
GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that maintains multiple groups of policies and uses relative performance between groups to guide policy updates. This approach provides:
- Robustness: Multiple policy groups handle different types of uncertainties
- Exploration: Group diversity prevents premature convergence
- Adaptability: Dynamic group selection based on performance
- Scalability: Parallel training of multiple policy groups
This implementation includes:
- Complete GRPO algorithm with detailed explanations
- Synthetic RL environments with real-world uncertainties
- Comprehensive evaluation and visualization tools
- Easy-to-understand code with extensive documentation
Features:

- Group-based Policy Optimization: Multiple policy groups with relative performance tracking
- Generalized Advantage Estimation (GAE): Advanced advantage computation
- Clipped Policy Loss: PPO-style policy updates with clipping
- Value Function Learning: Separate value network for each group
- Observation Noise: Sensor measurement uncertainties
- Action Execution Noise: Imperfect action execution
- Transition Model Uncertainty: Model mismatch and dynamics noise
- Reward Noise and Delays: Delayed and noisy reward signals
- Partial Observability: Missing or masked observations
- Non-stationary Dynamics: Changing environment parameters
- Performance Analysis: Across different uncertainty levels
- Robustness Testing: Against various perturbations
- Group Diversity Analysis: Specialization and diversity metrics
- Baseline Comparisons: Against single-policy methods
- Statistical Analysis: Significance testing and confidence intervals
- Real-time Training Plots: Episode rewards, lengths, and losses
- Group Performance Tracking: Evolution of group performances
- Uncertainty Statistics: Applied uncertainties and their effects
- Comprehensive Reports: JSON summaries and analysis
Prerequisites:

- Python 3.8 or higher
- pip package manager
Installation steps:

- Clone or download the project:

  ```bash
  git clone <repository-url>
  cd GRPO
  ```
- Run the setup script:

  ```bash
  python setup.py
  ```
- Or install manually:

  ```bash
  pip install -r requirements.txt
  python -c "import torch, numpy, matplotlib; print('Installation successful!')"
  ```

Quick start:

```bash
# Run a short demo
python main.py demo --episodes 50

# Train a GRPO agent
python main.py train --episodes 500 --groups 4

# Run the full pipeline (training + evaluation)
python main.py pipeline --train-episodes 500 --eval-episodes 100
```

The project provides a comprehensive CLI with multiple commands:
Training:

```bash
python main.py train [OPTIONS]

Options:
  --episodes INT           Number of training episodes (default: 1000)
  --groups INT             Number of policy groups (default: 4)
  --group-size INT         Number of policies per group (default: 8)
  --learning-rate FLOAT    Learning rate (default: 3e-4)
  --uncertainty-level STR  Uncertainty level: low/medium/high (default: medium)
  --log-dir STR            Directory for logging (default: logs)
```

Evaluation:
```bash
python main.py evaluate --model-path PATH [OPTIONS]

Options:
  --model-path STR  Path to trained model (required)
  --episodes INT    Number of evaluation episodes (default: 100)
  --log-dir STR     Directory for logging (default: logs)
```

Demo:
```bash
python main.py demo [OPTIONS]

Options:
  --episodes INT           Number of demo episodes (default: 50)
  --uncertainty-level STR  Uncertainty level: low/medium/high (default: medium)
```

Pipeline:
```bash
python main.py pipeline [OPTIONS]

Options:
  --train-episodes INT  Number of training episodes (default: 500)
  --eval-episodes INT   Number of evaluation episodes (default: 100)
  --groups INT          Number of policy groups (default: 4)
  --log-dir STR         Directory for logging (default: logs)
```

Training can also be driven directly from the Python API:

```python
from config import GRPOConfig
from trainer import run_training
# Create configuration
config = GRPOConfig(
    num_episodes=1000,
    num_groups=4,
    group_size=8,
    learning_rate=3e-4
)
# Train the agent
trainer = run_training(config)
```

Evaluating a trained agent:

```python
from evaluator import run_comprehensive_evaluation
# Evaluate the trained agent
evaluator = run_comprehensive_evaluation(config, trainer.agent)
```

Creating a custom uncertain environment:

```python
from uncertain_env import create_uncertain_env, UncertaintyConfig
# Create custom uncertainty configuration
uncertainty_config = UncertaintyConfig(
    observation_noise_std=0.1,
    action_noise_std=0.05,
    transition_noise_std=0.02,
    reward_noise_std=0.01
)
# Create environment
env = create_uncertain_env('cartpole', 'high')
```

GRPO extends traditional policy optimization by maintaining multiple groups of policies (a minimal sketch of the group-selection loop follows the list below):
- Group Initialization: Create N groups, each with M policies
- Group Selection: Select group based on performance-weighted sampling
- Policy Execution: Execute action from selected group's policy
- Performance Tracking: Update group performance using episode rewards
- Policy Updates: Update policies using GRPO loss with group-specific advantages
- Group Evolution: Groups adapt based on relative performance
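To make the group-selection and performance-tracking steps concrete, here is a minimal sketch under simplifying assumptions. The helper names (`select_group`, `update_group_performance`) and the softmax/EMA choices are illustrative only; the project's actual logic lives in `grpo_agent.py`.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_group(group_performances, temperature=1.0):
    # Performance-weighted sampling: higher-scoring groups are chosen more often
    scores = np.asarray(group_performances, dtype=np.float64) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(rng.choice(len(group_performances), p=probs))

def update_group_performance(group_performances, group_id, episode_reward, momentum=0.9):
    # Track each group's performance as an exponential moving average of returns
    group_performances[group_id] = (
        momentum * group_performances[group_id] + (1.0 - momentum) * episode_reward
    )

# Toy usage with 4 groups and placeholder episode returns
performances = [0.0] * 4
for _ in range(100):
    gid = select_group(performances)
    episode_reward = float(rng.normal(loc=gid, scale=1.0))  # stand-in for a real rollout
    update_group_performance(performances, gid, episode_reward)
print(performances)
```

The temperature controls how strongly selection favors the best group, and the EMA momentum controls how quickly group scores react to recent episodes.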
Network architecture (outline):

```python
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        # Feedforward network with ReLU activations
        # Xavier weight initialization
        # Softmax output for action probabilities
        ...

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        # Separate value estimation network
        # Used for advantage computation
        ...
```

The GRPO loss combines a clipped policy objective, a value loss, and an entropy bonus:

```python
# Policy loss with clipping
ratio = torch.exp(log_probs - old_log_probs)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1-clip_ratio, 1+clip_ratio) * advantages
policy_loss = -torch.min(surr1, surr2).mean()
# Value loss
value_loss = F.mse_loss(values, returns)
# Entropy loss for exploration
entropy_loss = -action_dist.entropy().mean()
```
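The `advantages` used in the clipped objective come from Generalized Advantage Estimation (see Features). Below is a minimal sketch of GAE, assuming per-step value estimates and a bootstrap value of zero after the last step; the function name `compute_gae` and its defaults are illustrative rather than the repo's exact API.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma * lam * A_{t+1}
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = 0.0 if t == len(rewards) - 1 else values[t + 1]
        next_nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns
```

The returns (advantages plus value estimates) are what the value loss above regresses against.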
The synthetic environments inject several types of uncertainty:

Observation Noise:
- Purpose: Simulates sensor measurement errors
- Implementation: Gaussian noise added to state observations
- Real-world example: Camera blur, sensor drift
Action Execution Noise:
- Purpose: Simulates imperfect action execution
- Implementation: Random action flipping with small probability
- Real-world example: Motor control errors, actuator delays
Transition Model Uncertainty:
- Purpose: Simulates model mismatch and dynamics uncertainty
- Implementation: Noise added to force/control inputs
- Real-world example: Unmodeled dynamics, parameter drift
Reward Noise and Delays:
- Purpose: Simulates delayed and noisy reward signals
- Implementation: Random reward delays and Gaussian noise
- Real-world example: Delayed feedback, measurement noise
Partial Observability:
- Purpose: Simulates missing or masked observations
- Implementation: Random masking of state components
- Real-world example: Sensor failures, occlusions
Non-stationary Dynamics:
- Purpose: Simulates changing environment parameters
- Implementation: Random changes to gravity, mass, etc.
- Real-world example: Weather changes, wear and tear
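To show how the first two uncertainty types can be injected in practice, here is a simplified, hypothetical Gymnasium wrapper. It is not the project's `UncertainCartPoleEnv` (see `uncertain_env.py` for the real implementation covering all six types); the parameter names only loosely mirror `UncertaintyConfig`.

```python
import numpy as np
import gymnasium as gym

class SimpleUncertaintyWrapper(gym.Wrapper):
    """Adds Gaussian observation noise and occasional random action substitution."""

    def __init__(self, env, obs_noise_std=0.1, action_flip_prob=0.05, seed=0):
        super().__init__(env)
        self.obs_noise_std = obs_noise_std
        self.action_flip_prob = action_flip_prob
        self.rng = np.random.default_rng(seed)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._noisy(obs), info

    def step(self, action):
        # Action execution noise: occasionally execute a different action
        if self.rng.random() < self.action_flip_prob:
            action = self.env.action_space.sample()
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._noisy(obs), reward, terminated, truncated, info

    def _noisy(self, obs):
        # Observation noise: Gaussian perturbation of the sensor readings
        return obs + self.rng.normal(0.0, self.obs_noise_std, size=np.shape(obs))

env = SimpleUncertaintyWrapper(gym.make("CartPole-v1"), obs_noise_std=0.1)
```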
```
GRPO/
├── main.py            # Main CLI script
├── setup.py           # Setup and installation script
├── requirements.txt   # Python dependencies
├── README.md          # This file
├── config.py          # Configuration classes
├── grpo_agent.py      # GRPO algorithm implementation
├── uncertain_env.py   # Synthetic environments with uncertainties
├── trainer.py         # Training loop and visualization
├── evaluator.py       # Evaluation and analysis tools
├── logs/              # Training and evaluation logs
├── models/            # Saved model checkpoints
├── plots/             # Generated plots and visualizations
├── results/           # Evaluation results and reports
└── data/              # Data storage (if needed)
```
Key files:

- `config.py`: Configuration classes for GRPO and uncertainty parameters
- `grpo_agent.py`: Core GRPO algorithm with policy/value networks
- `uncertain_env.py`: Synthetic environments with various uncertainty types
- `trainer.py`: Training pipeline with logging and visualization
- `evaluator.py`: Comprehensive evaluation tools and metrics
- `main.py`: Command-line interface for all functionality
End-to-end workflow:

```bash
# Install dependencies
python setup.py
# Verify installation
python -c "import torch, numpy, matplotlib; print('Ready!')"# Run a quick demo to see GRPO in action
python main.py demo --episodes 50 --uncertainty-level medium

# Train GRPO agent
python main.py train \
--episodes 1000 \
--groups 4 \
--group-size 8 \
--learning-rate 3e-4 \
--uncertainty-level medium
```

What happens during training:
- Multiple policy groups are initialized
- Agent interacts with uncertain environment
- Policies are updated using GRPO algorithm
- Group performances are tracked and updated
- Training progress is logged and visualized
```bash
# Evaluate trained model
python main.py evaluate \
--model-path logs/grpo_run_*/models/final_model.pth \
--episodes 100
```

What happens during evaluation:
- Performance across different uncertainty levels
- Robustness testing against perturbations
- Group diversity analysis
- Comparison with baseline methods
- Statistical significance testing
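As one concrete example of the statistical analysis step, a normal-approximation 95% confidence interval over evaluation-episode rewards can be computed as below. This is a generic sketch; it is not necessarily how `evaluator.py` implements it.

```python
import numpy as np

def mean_confidence_interval(episode_rewards, z=1.96):
    # Normal-approximation confidence interval for the mean episode reward
    rewards = np.asarray(episode_rewards, dtype=np.float64)
    mean = rewards.mean()
    sem = rewards.std(ddof=1) / np.sqrt(len(rewards))
    return mean, mean - z * sem, mean + z * sem

m, low, high = mean_confidence_interval([200.0, 180.0, 195.0, 210.0, 205.0])  # toy rewards
print(f"mean reward {m:.1f}, 95% CI [{low:.1f}, {high:.1f}]")
```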
```bash
# Run complete training and evaluation pipeline
python main.py pipeline \
--train-episodes 500 \
--eval-episodes 100 \
--groups 4
```

After training and evaluation, you'll find:
- Training logs: `logs/grpo_run_*/training.log`
- Model checkpoints: `logs/grpo_run_*/models/`
- Training plots: `logs/grpo_run_*/plots/`
- Evaluation results: `logs/evaluation_*/results/`
- Summary reports: `logs/evaluation_*/evaluation_summary.json`
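A quick way to inspect the most recent summary report from Python, assuming it is plain JSON (the exact keys depend on what `evaluator.py` writes):

```python
import glob
import json

# Pick the most recent evaluation run (path pattern from the list above)
paths = sorted(glob.glob("logs/evaluation_*/evaluation_summary.json"))
if paths:
    with open(paths[-1]) as f:
        summary = json.load(f)
    print(json.dumps(summary, indent=2)[:500])  # preview the report
```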
Basic training example:

```python
from config import GRPOConfig
from trainer import run_training
# Configure GRPO
config = GRPOConfig(
    num_episodes=500,
    num_groups=3,
    group_size=6,
    learning_rate=3e-4,
    uncertainty_level='medium'
)
# Train agent
trainer = run_training(config)
# Check results
print(f"Final group performances: {trainer.agent.group_performances}")
print(f"Best episode reward: {max(trainer.episode_rewards)}")from uncertain_env import UncertainCartPoleEnv, UncertaintyConfig
# Create custom uncertainty configuration
uncertainty_config = UncertaintyConfig(
    observation_noise_std=0.15,
    action_noise_std=0.08,
    transition_noise_std=0.03,
    reward_noise_std=0.02,
    reward_delay_prob=0.15,
    partial_obs_prob=0.08,
    non_stationary_prob=0.03
)
# Create environment
env = UncertainCartPoleEnv(uncertainty_config)
# Test environment
state, _ = env.reset()
for step in range(100):
    action = env.action_space.sample()
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```

Evaluation and analysis example:

```python
from evaluator import GRPOEvaluator
# Create evaluator
evaluator = GRPOEvaluator(config, trained_agent)
# Run comprehensive evaluation
performance_results = evaluator.evaluate_performance(num_episodes=100)
robustness_results = evaluator.evaluate_robustness(num_episodes=50)
diversity_results = evaluator.evaluate_group_diversity(num_episodes=100)
# Create visualizations
evaluator.create_evaluation_plots()
# Save results
evaluator.save_evaluation_results()
```

Multi-task training example:

```python
from uncertain_env import MultiTaskUncertainEnv
# Create multi-task environment
env = MultiTaskUncertainEnv()
# Train on multiple tasks
for episode in range(1000):
    state, _ = env.reset()
    current_task = env.get_current_task()

    # Train on current task
    # ... training code ...

    if episode % 100 == 0:
        print(f"Current task: {current_task}")
```

The training process generates several types of plots:
- Episode Rewards: Shows learning progress over time
- Episode Lengths: Tracks episode duration
- Group Performances: Evolution of group performance over time
- Training Losses: Policy, value, and entropy losses
- Group Performance Heatmap: Visual representation of group diversity
The evaluation process creates comprehensive analysis plots:
- Performance Comparison: Across different uncertainty levels
- Robustness Analysis: Performance under various perturbations
- Group Diversity: Individual group performance and diversity metrics
- Baseline Comparison: GRPO vs single-policy methods
- Statistical Analysis: Confidence intervals and significance tests
Typical results from GRPO training:
- Learning Curve: Steady improvement in episode rewards
- Group Specialization: Different groups excel in different scenarios
- Robustness: Consistent performance across uncertainty levels
- Diversity: Groups maintain distinct behaviors and strategies
All training and evaluation processes are logged with:
- Real-time Progress: Episode rewards, lengths, and group performances
- Uncertainty Statistics: Applied uncertainties and their effects
- Training Metrics: Losses, learning rates, and convergence
- Evaluation Metrics: Performance, robustness, and diversity measures
We welcome contributions to improve GRPO! Here's how you can help:
Algorithm Improvements:
- New group selection strategies
- Advanced uncertainty handling
- Multi-objective optimization
Environment Extensions:
- New uncertainty types
- Additional RL environments
- Real-world environment interfaces
Evaluation Enhancements:
- New evaluation metrics
- Additional baseline comparisons
- Statistical analysis improvements
Documentation:
- Code documentation
- Tutorial notebooks
- Algorithm explanations
Development setup:

- Fork the repository
- Create a development environment:

  ```bash
  python -m venv grpo_dev
  source grpo_dev/bin/activate  # On Windows: grpo_dev\Scripts\activate
  pip install -r requirements.txt
  ```
- Make your changes
- Run tests:

  ```bash
  python main.py demo --episodes 10
  python main.py train --episodes 50
  ```
- Submit a pull request
Code style:

- Follow PEP 8 style guidelines
- Add type hints where appropriate
- Include docstrings for all functions and classes
- Write clear, descriptive variable names
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:

- PPO Algorithm: Base policy optimization approach
- Gymnasium: RL environment framework
- PyTorch: Deep learning framework
- CartPole Environment: Classic RL benchmark
If you use this implementation in your research, please cite:
```bibtex
@software{grpo_implementation,
  title={GRPO: Group Relative Policy Optimization Implementation},
  author={Sibi Vishtan},
  year={2024},
  url={https://github.com/urstrulyvishtan/GRPO}
}
```

For questions, issues, or contributions:
- Issues: Use GitHub Issues for bug reports and feature requests
- Discussions: Use GitHub Discussions for questions and ideas
- Email: Contact [[email protected]]
Happy Learning with GRPO! 🚀