"Disaster in the Dataverse" – A Chaos Engineering Simulation
Scenario
You are a cloud architecture team working for AstroNet, a company that runs a
massively multiplayer online game (MMO) with millions of global users. The game
backend is hosted on AWS EC2 instances, game assets (e.g. 3D models, textures,
audio) are stored in Amazon S3, and the authentication system is hosted on-premises
(classic private cloud).
Last night, a developer accidentally deleted the wrong IAM role and triggered a
cascade of failures. Some EC2 instances are inaccessible, asset load times are slow,
and S3 buckets appear to be misconfigured. You have one hour to:
Restore service functionality
Harden the architecture
Justify your design choices under budget and performance constraints
Activity Setup
Break into 3 groups (teams of 4–6 work well):
1. Recovery & Compute Team
Responsible for restoring the EC2-based backend and improving fault tolerance
2. Storage & Delivery Team
Responsible for fixing S3 permissions and bucket policies and for optimizing asset
delivery (think CloudFront in front of S3)
3. Security & Hybrid Team
Responsible for restoring the IAM role, securing the VPC/subnets, and managing
hybrid connectivity with the on-prem auth system (deciding between Site-to-Site
VPN and Direct Connect)
The Challenge Components
Each team receives a chaos packet: a simulated problem statement with data and
constraints. Teams must solve their own packet and coordinate with the other teams to
avoid siloed designs. Here’s a breakdown of what each team handles:
1. Recovery & Compute Team
EC2 instances were in a public subnet but lost their public IPs. Users can't
connect.
Question: Will you redeploy the EC2 instances by hand or use an Auto Scaling Group?
Constraint: You have a $500/month compute budget. Choose wisely between
instance types.
Bonus curveball: Can you make the game servers stateless?
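One way to approach the budget constraint is to work out how many always-on instances of each size fit under $500/month, which then caps the Auto Scaling Group's maximum size. The sketch below is only an illustration: the tier names and hourly prices are placeholders (not real AWS prices), 730 hours is an approximate month length, and the minimum fleet of two instances is an assumption (one per Availability Zone). Look up current on-demand rates for your region before relying on any of these numbers.

```python
import math

HOURS_PER_MONTH = 730        # approximate average month length in hours
MONTHLY_BUDGET_USD = 500.0   # compute budget from the chaos packet

# Placeholder hourly prices, assumed for illustration only (not AWS pricing).
HYPOTHETICAL_HOURLY_PRICE_USD = {
    "small": 0.02,
    "medium": 0.04,
    "large": 0.10,
}


def max_instances(hourly_price, budget=MONTHLY_BUDGET_USD, hours=HOURS_PER_MONTH):
    """Return how many always-on instances fit within the monthly budget."""
    monthly_cost_per_instance = hourly_price * hours
    return math.floor(budget / monthly_cost_per_instance)


def asg_bounds(hourly_price, min_size=2):
    """Suggest Auto Scaling Group min/max sizes that stay within budget.

    min_size=2 is an assumption: one instance in each of two Availability Zones.
    """
    ceiling = max_instances(hourly_price)
    if ceiling < min_size:
        raise ValueError("budget cannot cover the minimum fleet at this price")
    return {"min_size": min_size, "max_size": ceiling}


if __name__ == "__main__":
    for tier, price in HYPOTHETICAL_HOURLY_PRICE_USD.items():
        print(tier, asg_bounds(price))
```

With these placeholder prices, a larger instance leaves room for far fewer servers, which is the trade-off your team has to justify: fewer big servers versus more small ones that scale out.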
2. Storage & Delivery Team
The S3 bucket game-assets-prod has been misconfigured — objects are now
publicly exposed, triggering a compliance audit.
Task: Rebuild bucket policies to restrict access, then integrate CloudFront to
accelerate delivery.
Decision point: Should you move to S3 Intelligent-Tiering?
Security twist: Ensure data is encrypted at rest and in transit.
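The bucket-policy task can be sketched as a policy document that lets only one CloudFront distribution read objects and denies any request not made over TLS. This is a minimal sketch, assuming the distribution uses Origin Access Control; the account ID and distribution ID below are placeholders, and only the bucket name comes from the packet.

```python
import json

BUCKET = "game-assets-prod"        # bucket name from the chaos packet
ACCOUNT_ID = "111122223333"        # placeholder account ID (assumption)
DISTRIBUTION_ID = "EXAMPLEDISTID"  # placeholder CloudFront distribution ID (assumption)


def build_bucket_policy(bucket, account_id, distribution_id):
    """Build a bucket policy: CloudFront-only reads, HTTPS-only access."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    distribution_arn = (
        f"arn:aws:cloudfront::{account_id}:distribution/{distribution_id}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Only the named CloudFront distribution may read objects.
                "Sid": "AllowCloudFrontReadOnly",
                "Effect": "Allow",
                "Principal": {"Service": "cloudfront.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {"StringEquals": {"AWS:SourceArn": distribution_arn}},
            },
            {
                # Reject every request that does not use TLS (in-transit encryption).
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy(BUCKET, ACCOUNT_ID, DISTRIBUTION_ID)
    print(json.dumps(policy, indent=2))
```

This covers access control and encryption in transit only. Block Public Access and default bucket encryption (for encryption at rest) are separate bucket settings your team still needs to address.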
3. Security & Hybrid Team
The IAM role GameServerEC2Access was deleted. Recreate it with least
privilege.
The on-prem auth server (running LDAP) needs to re-establish secure
communication with AWS. Choose between VPN and Direct Connect.
Gotcha: A VPN runs over the public internet, so latency is higher and less
predictable; Direct Connect is a dedicated link, so cost is higher.
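Recreating GameServerEC2Access with least privilege could look roughly like the sketch below: a trust policy that lets only EC2 assume the role, a permissions policy scoped to one action on one bucket, and a small check that flags wildcard grants. The packet does not say what the original role allowed, so read-only access to game-assets-prod is an assumption here; adjust it to what the game servers actually need.

```python
import json

ROLE_NAME = "GameServerEC2Access"  # role name from the chaos packet
ASSET_BUCKET = "game-assets-prod"  # bucket name from the chaos packet


def build_trust_policy():
    """Trust policy: only the EC2 service may assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_permissions_policy(bucket):
    """Permissions policy: read-only object access to one bucket (assumed need)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def _as_list(value):
    return value if isinstance(value, list) else [value]


def overly_broad_statements(policy):
    """Return statements that grant a wildcard action or a bare '*' resource."""
    broad = []
    for stmt in policy["Statement"]:
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            broad.append(stmt)
    return broad


if __name__ == "__main__":
    permissions = build_permissions_policy(ASSET_BUCKET)
    assert overly_broad_statements(permissions) == []
    print(ROLE_NAME, json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(permissions, indent=2))
```

Remember that an EC2 instance picks up a role through an instance profile, so recreating the role alone will not restore access until the profile is attached to the instances again.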
Deliverables (within 1 hour)
Each team presents:
1. Architecture Diagram (hand-drawn or whiteboard-style is fine)
2. Service Choices with Justification
o Why you chose X over Y
o What trade-offs you accepted
o What security implications were involved
3. Failure Mode Prediction
o What will break first in your design?
o How would you recover faster next time?