A multi-agent reinforcement learning (MARL) project implementing cooperative robot agents for emergency evacuation scenarios using Multi-Agent Proximal Policy Optimization (MAPPO). The project simulates a dynamic environment where autonomous robots must efficiently evacuate humans from a hazardous area with spreading fire and potential robot malfunctions.
Watch the trained MAPPO agents in action:
MARL.Best.Model.mp4
The video shows trained robots (blue) autonomously navigating a grid environment to rescue humans (green) from spreading fire (red) and safely evacuate them through exits (yellow).
This project explores how multiple autonomous agents can learn to cooperate in high-stakes evacuation scenarios through deep reinforcement learning. The simulation features:
- Dynamic Hazards: Fire that spreads probabilistically across the grid
- Robot Malfunctions: Agents can randomly freeze for periods of time
- Complex Cooperation: Multiple robots must coordinate to maximize rescues
- Distance-Aware Navigation: Agents use positional encoding to make informed decisions
- Adaptive Learning: MAPPO algorithm enables efficient multi-agent training
- Environment: Custom Gymnasium environment with grid-based evacuation simulation
- MARL Algorithm: MAPPO (Multi-Agent PPO) with shared critic and individual actors
- Baseline Comparisons: A*, greedy, and random agents for performance benchmarking
- Reward Shaping: Carefully designed reward structure to encourage rescue efficiency
- Grid-based Navigation: Customizable maps loaded from CSV files
- Entity Types: Walls, exits, humans, fire, and robots with distinct behaviors
- Fire Spreading: Probabilistic fire propagation mechanics
- Robot Freezing: Random malfunction events that temporarily disable agents
- One-hot Observations: 8-channel observation space including entity positions and distance maps
- Actor-Critic Architecture: Shared critic with individual actor policies
- GAE (Generalized Advantage Estimation): For variance reduction (see the sketch after this list)
- Entropy Regularization: Encourages exploration during training
- Gradient Clipping: Ensures training stability
- Checkpoint System: Save and resume training sessions
- TensorBoard Integration: Real-time training metrics visualization
- Baseline Agents: Compare MAPPO against rule-based and heuristic approaches
- Visualization Tools: Render evacuation episodes with Pygame
- Performance Metrics: Track rescue rates, efficiency, and coordination
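The project's GAE implementation lives in utils/rollout_buffer.py; the snippet below is only a minimal sketch of how GAE advantages and returns are typically computed from a single agent's rollout, using hypothetical reward/value/done arrays and the GAMMA and LAMBDA defaults listed later in this README.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative sketch).

    rewards, values, dones are 1-D arrays of length T for a single agent;
    last_value bootstraps the value of the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_non_terminal = 1.0 - dones[t]
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        # Exponentially weighted sum of residuals (the GAE recursion)
        gae = delta + gamma * lam * next_non_terminal * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the critic
    return advantages, returns
```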
- Python 3.8+
- PyTorch 2.0+
- Gymnasium
- NumPy
- Pygame
# Clone the repository
git clone https://github.com/yourusername/marl-evacuation-project.git
cd marl-evacuation-project
# Install dependencies
pip install torch gymnasium numpy pygame tensorboard tqdm
# (Optional) Verify installation
python main.py

# Train the MAPPO agents
python training/train_mappo.py

Configure training parameters in config.py:
- NUM_EPISODES: Total training episodes (default: 1,000,000)
- LR: Learning rate (default: 0.0001)
- GAMMA: Discount factor (default: 0.99)
- ENTROPY_COEF: Exploration coefficient (default: 0.01)
# Evaluate MAPPO
python training/eval_mappo.py
# Evaluate baselines
python training/eval_baselines.py

# Run with trained MAPPO model
python main.py

Set DISPLAY = True in config.py to visualize the simulation in real time with Pygame.
marl-evacuation-project/
│
├── baselines/
│ ├── astar_baseline.py # Rule-based A* pathfinding agent
│ ├── greedy_baseline.py # Greedy heuristic agent
│ └── random_baseline.py # Random-action agent
│
├── demos/
│ └── MARL Best Model.mp4 # Video demonstration of trained agents
│
├── environment/
│ ├── maps/ # Predefined map layouts
│ │ ├── map1.csv # Simple 10x10 grid
│ │ ├── map2.csv # Medium complexity
│ │ └── map3.csv # Complex environment
│ └── evacuation_env.py # Gymnasium environment implementation
│
├── mappo_core/
│ ├── actor_critic.py # Actor-Critic neural network model
│ └── mappo_trainer.py # MAPPO training algorithm
│
├── results/ # Generated by utils/visualization.py
│ ├── baseline/ # Baseline agent results
│ └── mappo/ # MAPPO agent results
│
├── training/
│ ├── eval_baselines.py # Baseline evaluation script
│ ├── eval_mappo.py # MAPPO evaluation script
│ └── train_mappo.py # MAPPO training script
│
├── utils/
│ ├── rollout_buffer.py # GAE + rollout buffer implementation
│ └── visualization.py # Rendering and visualization utilities
│
├── checkpoints/ # Saved model checkpoints
├── .gitignore # Git ignore rules
├── config.py # Hyperparameters and environment configuration
├── main.py # Entry point for running simulations
└── README.md # This file
| Symbol | Description |
|---|---|
| 0 | Empty space |
| 1 | Wall (obstacle) |
| 2 | Exit |
| 3 | Human (randomized placement) |
| 4 | Fire (randomized placement with spreading) |
| 5 | Robot |
| 6 | Frozen Robot (malfunctioned) |
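As an illustration of this encoding, a tiny hypothetical map could be written and loaded as below. The real maps live in environment/maps/, and the actual loading logic is part of evacuation_env.py, so treat this only as a sketch of the symbol format.

```python
import io
import numpy as np

# Hypothetical 5x5 map: walls around the border, an exit on the right edge,
# one human (3), one fire cell (4), and one robot spawn (5), per the table above.
MAP_CSV = """\
1,1,1,1,1
1,0,3,0,2
1,0,4,0,1
1,5,0,0,1
1,1,1,1,1
"""

grid = np.loadtxt(io.StringIO(MAP_CSV), delimiter=",", dtype=int)
print(grid.shape)  # (5, 5)
```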
Key parameters in config.py:
- GRID_SIZE: Default grid dimensions (10x10)
- MAP_FILE: CSV file defining the map layout
- NUM_HUMANS: Number of humans to rescue (default: 5)
- NUM_ROBOTS: Number of robot agents (default: 5)
- P_FIRE: Fire spreading probability (default: 0.3)
- P_FREEZE: Robot freeze probability (default: 0.05)
- FREEZE_DURATION: Freeze duration in steps (default: 3)
- HUMAN_RESCUE_REWARD: +1000 for a successful rescue
- HUMAN_PICKUP_REWARD: +200 for picking up a human
- FIRE_PENALTY: -100 when a robot touches fire
- HUMAN_FIRE_PENALTY: -200 when a human touches fire
- EARLY_EXIT_PENALTY: -1000 for a premature exit
- MOVEMENT_REWARD: +0.1 for distance-reducing movement
- HUMAN_DISTANCE_PENALTY: -0.5 for being far from humans
- EXIT_DISTANCE_PENALTY: -0.03 for being far from an exit (when carrying a human)
- MAX_ENV_STEPS: 40 steps per episode
- NUM_EPISODES: 1,000,000 training episodes
- LR: 0.0001 learning rate
- GAMMA: 0.99 discount factor
- LAMBDA: 0.95 GAE parameter
- CLIP_PARAM: 0.2 PPO clipping parameter
- ENTROPY_COEF: 0.01 entropy coefficient
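Taken together, these settings suggest a config.py along the lines of the sketch below. The names and values match the defaults listed above; the grouping and comments are illustrative, not the project's exact file.

```python
# config.py (illustrative layout; values taken from the defaults listed above)

# Environment
GRID_SIZE = 10
MAP_FILE = "environment/maps/map1.csv"
NUM_HUMANS = 5
NUM_ROBOTS = 5
P_FIRE = 0.3            # probability that fire spreads to a neighboring cell
P_FREEZE = 0.05         # per-step probability that a robot malfunctions
FREEZE_DURATION = 3     # steps a frozen robot stays disabled
DISPLAY = False         # set to True to render with Pygame

# Rewards
HUMAN_RESCUE_REWARD = 1000
HUMAN_PICKUP_REWARD = 200
FIRE_PENALTY = -100
HUMAN_FIRE_PENALTY = -200
EARLY_EXIT_PENALTY = -1000
MOVEMENT_REWARD = 0.1
HUMAN_DISTANCE_PENALTY = -0.5
EXIT_DISTANCE_PENALTY = -0.03

# Training
MAX_ENV_STEPS = 40
NUM_EPISODES = 1_000_000
LR = 1e-4
GAMMA = 0.99
LAMBDA = 0.95
CLIP_PARAM = 0.2
ENTROPY_COEF = 0.01
```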
Multi-Agent Proximal Policy Optimization (MAPPO) is a state-of-the-art MARL algorithm that extends PPO to multi-agent settings:
- Shared Critic: All agents share a centralized value function for better credit assignment
- Individual Actors: Each agent has its own policy network for decentralized execution
- Centralized Training, Decentralized Execution (CTDE): Leverages global information during training while maintaining independent policies during execution
- PPO Updates: Clipped surrogate objective for stable policy updates
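The project's actual update logic lives in mappo_core/mappo_trainer.py and may differ; the sketch below only illustrates how a single MAPPO minibatch step can combine a per-agent clipped policy loss, a shared-critic value loss, an entropy bonus, and gradient clipping. The batch layout and the assumption that the actor returns a Categorical distribution are illustrative.

```python
import torch
import torch.nn.functional as F

def mappo_update(actor, critic, optimizer, batch,
                 clip_param=0.2, entropy_coef=0.01,
                 value_coef=0.5, max_grad_norm=0.5):
    """One illustrative minibatch update for one agent's actor plus the shared critic."""
    obs, global_state, actions, old_log_probs, advantages, returns = batch

    # Decentralized actor: fresh log-probs and entropy from local observations
    dist = actor(obs)                      # assumed to return torch.distributions.Categorical
    log_probs = dist.log_prob(actions)
    entropy = dist.entropy().mean()

    # Clipped surrogate objective (PPO)
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Centralized critic: value loss against GAE returns
    values = critic(global_state).squeeze(-1)
    value_loss = F.mse_loss(values, returns)

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        list(actor.parameters()) + list(critic.parameters()), max_grad_norm)
    optimizer.step()
    return loss.item()
```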
- Actor Network: CNN encoder + MLP head for action logits
- Critic Network: CNN encoder + MLP head for value estimation
- Input: 8-channel one-hot observations + 2D positional encoding
- Output: 5-dimensional action space (noop, up, down, left, right)
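This description maps onto a fairly standard PyTorch module. The sketch below is an assumption about the shape of mappo_core/actor_critic.py, not its exact architecture; in particular, the in_channels value assumes the 8 one-hot channels and the 2 positional-encoding channels are stacked into one tensor.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative actor: CNN encoder over the grid observation + MLP head for action logits."""

    def __init__(self, grid_size: int = 10, in_channels: int = 10, num_actions: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * grid_size * grid_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),  # logits for noop, up, down, left, right
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.head(self.encoder(obs))
        return torch.distributions.Categorical(logits=logits)

# The shared critic would reuse the same encoder pattern with a single value output,
# typically over a centralized global state rather than one agent's local observation.
```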
The trained MAPPO agents demonstrate:
- Coordinated rescue operations
- Adaptive fire avoidance
- Efficient exit utilization
- Superior performance vs. rule-based baselines
This project is open-source and available under the MIT License.
Built using PyTorch, Gymnasium, NumPy, Pygame, and TensorBoard.