Running Chaos Engineering Game Days to Test System Resilience
Designed to spark real collaboration and high-energy learning among Senior Site Reliability Engineers (SREs) and DevOps Leads who are responsible for system reliability and uptime in organizations adopting microservices architectures.
This is a 90-minute hybrid workshop for SRE/DevOps leads at a fintech company rapidly scaling its cloud infrastructure. Participants have high technical fluency but have not run structured chaos experiments before; their main pain points are fear of introducing instability and a lack of psychological safety in surfacing hidden weaknesses. Sessions combine live demos, small-group breakouts, and whole-room debriefs.
Mystery Outage Story Opener
Open with a dramatic reenactment of a real, high-profile system outage (like the Christmas Eve 2012 AWS Elastic Load Balancing outage that disrupted Netflix streaming), but pause at the climax and ask teams to predict the actual root cause and how it was found. Let participants brainstorm wild theories in chat or on sticky notes before revealing the outcome.
Why this works
Unexpected stories stimulate curiosity and prime brains for learning; guessing before knowing encourages deeper engagement and recall.
Chaos Engineering Mythbusting
Show three popular statements about chaos engineering (e.g., 'Chaos Engineering is just about breaking things randomly,' 'Only hyperscalers need it,' 'It always creates downtime'). Have participants vote true/false via poll, then discuss which are myths and which contain a kernel of truth, citing evidence for each.
Why this works
Surface misconceptions early so the group builds shared understanding; cognitive dissonance helps correct persistent myths.
Safe Bet Mini Poll
Pose a low-stakes, relatable prompt: 'If you could safely inject a single failure into your system today, which would you choose?' Offer quick-select options (e.g., DB connection drops, network latency spike, service dependency fails). Collect responses via sticky notes or polls, and affirm that all choices are valid.
Why this works
Low-pressure participation builds psychological safety, letting quieter voices surface risk concerns without judgment.
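For facilitators who want to show what "safely injecting a single failure" can look like during the live demo, here is a minimal Python sketch. It is an illustration under stated assumptions, not a production fault-injection tool: `with_fault`, `InjectedFault`, `fetch_balance`, and `fetch_balance_with_fallback` are hypothetical names invented for this example, and the "dependency" is an in-process function rather than a real service. It covers two of the poll options: a latency spike and a failing dependency.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_fault(call, *, failure_rate=0.0, added_latency_s=0.0,
               rng=None, sleep=time.sleep):
    """Wrap `call` so every invocation is delayed by `added_latency_s`
    and a `failure_rate` fraction of invocations raise InjectedFault."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)          # simulated latency spike
        if rng.random() < failure_rate:     # simulated dependency failure
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)

    return wrapped


def fetch_balance(account_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"account_id": account_id, "balance": 100}


def fetch_balance_with_fallback(fetch, account_id):
    """The behaviour under test: degrade gracefully instead of crashing."""
    try:
        return fetch(account_id)
    except InjectedFault:
        return {"account_id": account_id, "balance": None, "degraded": True}


if __name__ == "__main__":
    # Fail roughly half of the calls and add 50 ms to each one.
    flaky = with_fault(fetch_balance, failure_rate=0.5, added_latency_s=0.05)
    for _ in range(5):
        print(fetch_balance_with_fallback(flaky, "acct-1"))
```

Keeping the fault behind explicit parameters means the blast radius is visible and the experiment can be switched off by setting both values back to zero, which speaks directly to the group's fear of introducing instability.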
Resilience Rapid-Fire Teams
In small teams, run a competitive round: Who can brainstorm the most realistic failure scenarios in 90 seconds for a given microservice? Use a countdown clock and high-energy music, then have groups shout out their wildest ideas and tally scores for fun.
Why this works
Short, intense bursts of collaboration drive energy, break barriers, and encourage lateral thinking—crucial for uncovering blind spots.
Stakeholder Dilemma Hotseat
Present a scenario where two stakeholders—Product Manager and SRE Lead—disagree on whether to run a Game Day before a major release. Invite two volunteers to role-play each side (with prompt cards), then open the floor: 'If you were in the CTO’s seat, what would you decide?'
Why this works
Real-world dilemmas build empathy for competing priorities and highlight the importance of cross-team communication.
Resilience Wins & Fails Gallery Walk
Prompt everyone to recall a moment when a system they worked on either withstood a surprising failure (win) or crumbled unexpectedly (fail). Collect 1-2 sentence vignettes on sticky notes or an online board. Then do a gallery walk: participants read others' stories and comment with one takeaway or action they would try as a result.
Why this works
Active reflection cements learning and personalizes risk; sharing stories fosters connection and vulnerability.