An SRE Guide to Designing Effective Alerting and Paging Rules
Designed for mid-career Site Reliability Engineers (SREs) and platform team leads who design and maintain alerting and paging systems in fast-scaling SaaS environments. The activities aim to spark real collaboration and high-energy learning.
A 90-minute hybrid session with a mixed group of remote and onsite SREs and team leads. Participants are overwhelmed by noisy alerts and struggle to balance system reliability with personal wellbeing. Many have inherited legacy paging rules and lack frameworks for rationalizing them. The group is highly technical, skeptical of 'soft skills', and eager for practical wins.
Alert Archaeology Kickoff
Open with a quick 'alert archaeology' exercise: display anonymized screenshots of real-world alert dashboards from different companies, taken from tools such as PagerDuty or Opsgenie. Ask participants to guess what kind of system generated each one, and which alerts are likely actionable versus noise. They jot down their guesses in chat or on sticky notes.
Why this works
Visually encountering authentic alerts sparks curiosity and primes participants to start thinking critically about signal vs. noise in their own environments.
Myths of ‘Noisy’ Alerts
Reveal a set of statements about common alerting misconceptions (e.g., 'More alerts mean higher reliability', 'Paging on every error prevents incidents'). Use a quick poll or thumbs up/down to let participants vote true or false before unpacking the real impact.
Why this works
Directly confronting misconceptions leverages cognitive dissonance, helping learners reframe their beliefs and become receptive to new frameworks.
Paging Rule Icebreaker
Invite participants to share a time when a single alert woke them up at 3AM for a non-urgent issue. Frame this as a low-stakes share-out—no judgment, just commiseration. Use a poll to anonymously collect the most common causes.
Why this works
Sharing low-pressure, real-life stories builds trust and normalizes the experience of imperfect alerting.
Signal or Noise? Lightning Rounds
Divide into small groups. Present each group with a series of hypothetical alert scenarios, allowing 60 seconds per scenario (e.g., 'CPU spikes to 92% for 5 minutes on a single node'). Each group debates: Actionable page? Just a log? Ignore? Reconvene for a rapid-fire share-out of decisions.
Why this works
Fast-paced, gamified group work energizes the room and pushes participants to articulate their criteria for high-signal alerting, uncovering differences in judgment.
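For the debrief, it can help to show one way a group's criteria might be written down explicitly. The sketch below is a hypothetical rubric, not a recommended policy: the three criteria (`user_impact`, `needs_human_now`, `sustained`), the `triage` function, and the way the CPU scenario is scored are all assumptions made for illustration, and groups will reasonably disagree with them.

```python
from dataclasses import dataclass


@dataclass
class AlertScenario:
    description: str
    user_impact: bool      # assumption: users or SLOs are affected right now
    needs_human_now: bool  # assumption: someone must act immediately
    sustained: bool        # assumption: the condition persisted, not a blip


def triage(scenario: AlertScenario) -> str:
    """Map a scenario to one of the three options debated in the round."""
    if scenario.user_impact and scenario.needs_human_now:
        return "page"
    if scenario.user_impact or scenario.sustained:
        return "log"
    return "ignore"


# Scoring of the example scenario is one possible reading, not a fact.
cpu_spike = AlertScenario(
    description="CPU spikes to 92% for 5 minutes on a single node",
    user_impact=False,
    needs_human_now=False,
    sustained=True,
)

print(triage(cpu_spike))  # log
```

Asking each group which field values they would change, and why, tends to surface the differences in judgment the activity is designed to expose.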
The Pager Duty Dilemma
Pose a real-world dilemma: 'You’ve inherited an alerting system that triggers 50+ pages/week, burning out your team. Leadership insists every alert is essential. You can only change three rules this quarter. What do you cut, and how do you make your case?' Participants outline their triage and negotiation strategy in pairs.
Why this works
Real-world dilemmas connect learning to participants’ lived experience, prompting complex problem-solving and application of new frameworks.
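Pairs who want data to back their negotiation could rank rules by how many pages per week led to no action. The following is a minimal sketch under stated assumptions: the rule names, the weekly counts, and the "noise pages" metric are all invented for illustration, and a real audit would pull these figures from the team's own paging history.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    name: str
    pages_per_week: int
    actioned_per_week: int  # pages where the responder actually did something

    @property
    def noise_pages(self) -> int:
        # Pages that woke someone up but required no action.
        return self.pages_per_week - self.actioned_per_week


def pick_rules_to_change(rules, budget=3):
    """Return the `budget` rules with the most non-actioned pages."""
    return sorted(rules, key=lambda r: r.noise_pages, reverse=True)[:budget]


# Hypothetical data totalling 52 pages/week, echoing the '50+' in the dilemma.
rules = [
    RuleStats("disk_usage_warning", 18, 1),
    RuleStats("single_node_cpu_high", 14, 0),
    RuleStats("api_5xx_rate", 6, 6),
    RuleStats("cert_expiry_30d", 9, 2),
    RuleStats("queue_depth_high", 5, 3),
]

chosen = pick_rules_to_change(rules)
saved = sum(r.noise_pages for r in chosen)

for r in chosen:
    print(f"{r.name}: {r.noise_pages} non-actioned pages/week")
print(f"Potential reduction: {saved} pages/week")
```

Framing the case as "these three rules produced N pages that nobody acted on" gives leadership a concrete number to respond to, rather than a general complaint about noise.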
Personal Alert Audit & Commit
Guide participants through a structured self-audit: they list their top three most irritating current alerts, then pick one actionable next step (e.g., 'I’ll review thresholds with my team this week'). Close with participants sharing their commitment with a partner or the group.
Why this works
Active reflection and goal-setting drive transfer—learners are more likely to effect change when they commit to concrete action in front of peers.