If everything is urgent, nothing is.
Author: Sambath Kumar Natarajan | Version: 1.0
Alert Fatigue
Sunday, 3:00 AM. PagerDuty fires. "CPU usage > 80% on worker-node-45." You wake up, check the graph. It dropped back to 70%. You go back to sleep.
The Boy Who Cried Wolf
This is the most dangerous state for an operations team. When 90% of alerts are noise or self-healing blips, operators learn to ignore all of them, including the 10% that signal a real, catastrophic failure.
The Rule of Actionable Alerts
Every alert must pass all three of these tests before it is allowed to page someone:
- Is a user affected? (If CPU is 90% but Latency is 20ms, nobody cares. Do not alert).
- Is human intervention required? (If the auto-scaler handles it, do not alert).
- Do I know what to do? (An alert saying "Unknown Error" usually just stresses people out).
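The three questions above can be sketched as a simple gate. This is a minimal illustration, not a real PagerDuty or monitoring API: `AlertCandidate`, its fields, and `should_page` are hypothetical names, and how you determine each answer is left to your own tooling.

```python
from dataclasses import dataclass


@dataclass
class AlertCandidate:
    """Hypothetical description of an alert, answering the three questions."""
    name: str
    user_affected: bool   # Is a user feeling this (latency, errors)?
    needs_human: bool     # False if automation (e.g. the auto-scaler) handles it
    has_runbook: bool     # Does the responder know what to do?


def should_page(alert: AlertCandidate) -> bool:
    # An alert pages a human only if all three answers are "yes".
    return alert.user_affected and alert.needs_human and alert.has_runbook


# The 3:00 AM example: CPU spiked, nobody was affected, it recovered by itself.
cpu_spike = AlertCandidate(
    name="CPU usage > 80% on worker-node-45",
    user_affected=False,
    needs_human=False,
    has_runbook=True,
)

# An assumed user-facing failure with a known response.
checkout_errors = AlertCandidate(
    name="Checkout error rate elevated",
    user_affected=True,
    needs_human=True,
    has_runbook=True,
)

print(should_page(cpu_spike))        # False
print(should_page(checkout_errors))  # True
```

An alert that fails any one of the three tests can still be recorded for later review; it just should not wake anyone up.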
Shut Down the Noise
| Factor | Weight | Score | Note |
|---|---|---|---|
| Symptom-based | 5 | 5 | Alert on Latency/Errors (User Pain) |
| Cause-based | 5 | 1 | Don't alert on CPU/Disk (Implementation Detail) |
| Sleep-friendliness | 3 | 5 | Can it wait until morning? |
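One way to read the table is as a routing policy: symptoms can page, causes cannot, and anything that can wait until morning becomes a ticket. The sketch below assumes that reading. The signal names, the `route` function, and the three destinations (`"page"`, `"ticket"`, `"dashboard"`) are illustrative assumptions, not part of any specific alerting tool.

```python
# Assumed classification, taken from the table's notes:
# latency and errors are user pain; CPU and disk are implementation details.
SYMPTOM_SIGNALS = {"latency", "error_rate"}
CAUSE_SIGNALS = {"cpu", "disk"}


def route(signal: str, can_wait_until_morning: bool) -> str:
    """Return a hypothetical destination for a firing signal."""
    if signal in CAUSE_SIGNALS:
        # Cause-based signals are useful for debugging, not for waking people.
        return "dashboard"
    if signal in SYMPTOM_SIGNALS:
        # Symptom-based signals reach a human; urgency decides how.
        return "ticket" if can_wait_until_morning else "page"
    raise ValueError(f"unclassified signal: {signal}")


print(route("cpu", can_wait_until_morning=False))         # dashboard
print(route("latency", can_wait_until_morning=True))      # ticket
print(route("error_rate", can_wait_until_morning=False))  # page
```

Raising on unclassified signals is a deliberate choice here: it forces every new alert to be categorized before it can fire.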
