Running SRE Fire Drills to Train Teams on Incident Response
Designed for Site Reliability Engineers (SREs) and Incident Commanders in mid-sized SaaS organizations that have grown recently, who are now expected to coordinate multi-team incident responses under real-time pressure. The activities are built to spark real collaboration and high-energy learning.
A 120-minute hybrid workshop, blending in-person and remote participants. Previous incident reviews have shown inconsistent response playbooks and unclear handoffs. Teams crave hands-on practice but struggle with engagement, fearing blame for mistakes.
Mystery Outage Scenario Reveal
Kick off with a live poll showing a deliberately vague error message from production. Invite teams to speculate on possible root causes in chat or on sticky notes. Then reveal one more clue every 60 seconds. This builds curiosity and opens minds to the complexity of real incidents.
Why this works
Uncertainty and curiosity prime learners for exploration; ambiguity mirrors real-world incident onset, promoting attentive engagement.
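For facilitators who want to script the reveal in the Mystery Outage activity, the timing can be sketched as a simple schedule. This is a minimal sketch, not part of the original activity: the function name `reveal_schedule` and the example clue texts are hypothetical, and only the 60-second interval comes from the description above.

```python
def reveal_schedule(clues, interval_s=60):
    """Pair each clue with the second (from the start) at which to reveal it.

    The first entry is shown at 0 seconds (the opening error message);
    each later clue follows one interval after the previous one.
    """
    return [(i * interval_s, clue) for i, clue in enumerate(clues)]


# Hypothetical clue texts, for illustration only; replace with your own scenario.
EXAMPLE_CLUES = [
    "Opening poll: the vague production error message",
    "Clue: which service raised the first alert",
    "Clue: what changed shortly before the alert",
]

if __name__ == "__main__":
    for offset_s, clue in reveal_schedule(EXAMPLE_CLUES):
        minutes, seconds = divmod(offset_s, 60)
        print(f"{minutes:02d}:{seconds:02d}  {clue}")
```

Printing the schedule ahead of time gives the facilitator a run sheet to follow, so the reveals stay on pace without watching a clock.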
Blame Game Myth Buster
Present a quick quiz: ‘Which team member is most likely to be blamed in a fire drill?’ Follow with the key point that most errors are systemic, not individual. Share a mini case study in which a focus on blame delayed root cause analysis. Then invite teams to debunk other common incident myths.
Why this works
Surface and correct misconceptions early to reduce anxiety and foster psychological safety for risk-free learning.
Silent Slack Coordination
Run a mini fire drill using Slack or Post-its: teams coordinate a ‘response’ to a sample outage, but must do so silently (no talking, only written messages/sticky notes). This lowers pressure and lets introverts shine, while surfacing coordination challenges.
Why this works
Low-pressure, written-only communication highlights coordination gaps; silent mode levels the playing field for quieter participants.
Rapid-Fire Role Swaps
Challenge teams with a fast-paced drill: every 90 seconds, roles (Incident Commander, Scribe, Comms Lead, Resolver) rotate. Each participant instantly steps into a new role, maintaining the flow. The room buzzes with urgency and collaboration as people adapt.
Why this works
High-energy, dynamic switching simulates real-world chaos and builds confidence in cross-role response agility.
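The 90-second rotation in Rapid-Fire Role Swaps can be planned in advance so nobody has to work out who is next in the middle of the drill. The sketch below is one possible way to do it, assuming exactly four participants per team (one per role) and a simple round-robin order; the function name `rotation_schedule` and the participant names are hypothetical.

```python
from collections import deque

# The four roles named in the activity description.
ROLES = ["Incident Commander", "Scribe", "Comms Lead", "Resolver"]


def rotation_schedule(participants, rounds, interval_s=90):
    """Return a list of (start_second, {participant: role}) entries.

    Assumption for this sketch: one participant per role, rotated
    round-robin so that after len(ROLES) rounds everyone has held
    every role exactly once.
    """
    if len(participants) != len(ROLES):
        raise ValueError("this sketch assumes exactly one participant per role")
    roles = deque(ROLES)
    schedule = []
    for round_index in range(rounds):
        assignment = dict(zip(participants, roles))
        schedule.append((round_index * interval_s, assignment))
        roles.rotate(1)  # each role moves on to the next participant in the list
    return schedule


if __name__ == "__main__":
    # Hypothetical participant names, for illustration only.
    team = ["Ana", "Ben", "Chi", "Dev"]
    for start_s, assignment in rotation_schedule(team, rounds=4):
        print(f"t+{start_s:>3}s  {assignment}")
```

With four rounds the drill lasts six minutes and each person passes through all four roles. Teams with more or fewer people would need a different assignment rule, which this sketch does not attempt.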
Public Outage Dilemma Debate
Present a dilemma: ‘Should we notify customers early, even before the root cause is known?’ Teams break out for 3 minutes to argue both sides, then share quick pros and cons. Facilitator connects this to real public incident communication cases (e.g., Stripe’s transparency).
Why this works
Dilemmas drive real-world relevance and spark debate, deepening understanding of communication trade-offs during crises.
Personal Incident Journaling
Wrap up by inviting each participant to reflect: ‘Write down one moment you felt uncertain or empowered during today’s drills.’ Then, share (anonymously if preferred) a takeaway—what will they do differently next time? Facilitator collects insights for a team-ready improvement board.
Why this works
Personal journaling anchors learning, fostering ownership and emotional resonance; active reflection turns experience into actionable habit change.