Incident response · executive playbook
Incident response playbook
6 severity tiers (SEV-0 to SEV-4 + Drill) · standardized 8-step IR procedure · 6 copy-ready communication templates · 5-tier escalation matrix · 6-drill annual program. Complements /incident-postmortems (post-incident) with an operational focus on what to do during the incident.
6 severity tiers
| Severity | Criteria | Target | Communication |
|---|---|---|---|
| SEV-0 · Total outage | Service 100% down · all clients affected · revenue stopped | Restore <2h · postmortem mandatory | Status page Investigating immediate + email clients + Twitter |
| SEV-1 · Major degradation | >25% requests failing OR critical feature down · multiple clients affected | Restore <4h · postmortem mandatory | Status page Investigating + email affected clients |
| SEV-2 · Partial degradation | 5-25% requests affected OR single tenant down OR vendor partial outage | Restore <8h · postmortem if customer-visible | Status page Monitoring + targeted email |
| SEV-3 · Minor | <5% affected · workaround exists · NOT customer-visible OR cosmetic | Restore <24h · log only | Internal Slack · NO public communication needed |
| SEV-4 · Informational | Cost approaching limit · vendor warning · NO impact yet | Investigate <72h · prevention focus | Internal log · action item created |
| SEV-Test · Drill | Simulated for training · pre-announced internally | Drill protocol followed · timing measured | NO public · internal documentation only |
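
The criteria in the table above can be sketched as a small classifier. This is a minimal illustration, not the playbook's actual tooling: the `Signal` fields and the `classify` function are hypothetical names, and treating any non-zero failure rate below 5% as SEV-3 is a simplifying assumption (the table also requires a workaround and no customer-visible impact).

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical snapshot of what monitoring reports at triage time."""
    failing_pct: float                  # share of requests failing, 0-100
    critical_feature_down: bool = False
    single_tenant_down: bool = False
    vendor_partial_outage: bool = False
    is_drill: bool = False


def classify(s: Signal) -> str:
    """Map a signal to a severity tier, checking the most severe tier first."""
    if s.is_drill:
        return "SEV-Test"
    if s.failing_pct >= 100:
        return "SEV-0"
    if s.failing_pct > 25 or s.critical_feature_down:
        return "SEV-1"
    if s.failing_pct >= 5 or s.single_tenant_down or s.vendor_partial_outage:
        return "SEV-2"
    if s.failing_pct > 0:
        return "SEV-3"   # assumption: minor impact with a workaround
    return "SEV-4"       # no impact yet: informational


print(classify(Signal(failing_pct=30)))
```

Checking tiers from most to least severe keeps the OR conditions in the table from accidentally downgrading an incident.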
IR procedure · 8-step standardized
- T+0 · DETECT: alert triggered (UptimeRobot · Sentry · manual report) · acknowledgment automatic within 5min
- T+5min · TRIAGE: severity assessed · incident commander designated (founder by default) · war room created in Slack
- T+15min · COMMUNICATE: status page set to Investigating · email affected clients if SEV-0/1 · timestamp every action
- T+30min · MITIGATE: execute relevant runbook (/runbooks-publicos) · degraded modes activated · monitor progress
- T+1h · CHECKPOINT: re-assess whether the MTPD (maximum tolerable period of disruption) is approaching · escalate if stuck · adjust strategy · communicate update
- T+2-8h · RESOLVE: confirmed recovery · 3 polls UptimeRobot green · smoke test end-to-end · status page Resolved
- T+24h · POSTMORTEM: 5 Whys analysis · timeline reconstructed · corrective actions defined · published /incident-postmortems
- T+30/60/90d · REVIEW: corrective actions verified completed · prevention plan validated · lessons institutionalized
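
The fixed offsets in the procedure above lend themselves to a simple deadline calculator, and the RESOLVE criterion to a small check. The following is a sketch under assumptions: `STEP_OFFSETS`, `schedule`, and `recovery_confirmed` are hypothetical names, RESOLVE and REVIEW are left out of the schedule because they are ranges rather than single offsets, and "3 polls green" is read here as three consecutive green polls.

```python
from datetime import datetime, timedelta

# Offsets from detection time, taken from the step list.
STEP_OFFSETS = [
    ("DETECT", timedelta(0)),
    ("TRIAGE", timedelta(minutes=5)),
    ("COMMUNICATE", timedelta(minutes=15)),
    ("MITIGATE", timedelta(minutes=30)),
    ("CHECKPOINT", timedelta(hours=1)),
    ("POSTMORTEM", timedelta(hours=24)),
]


def schedule(detected_at: datetime) -> dict[str, datetime]:
    """Return the target time for each fixed-offset step."""
    return {name: detected_at + offset for name, offset in STEP_OFFSETS}


def recovery_confirmed(polls: list[bool], required: int = 3) -> bool:
    """True when the most recent `required` polls are all green."""
    return len(polls) >= required and all(polls[-required:])


t0 = datetime(2024, 1, 1, 12, 0)
for step, due in schedule(t0).items():
    print(f"{step:<12} {due:%Y-%m-%d %H:%M}")
```

A war-room bot could post these targets at triage so that every later action is timestamped against the same clock.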
6 communication templates
Status page Investigating
We are investigating an issue affecting [scope]. Started at [timestamp]. Investigating root cause. We will post updates at least every 30 min.
Status page Identified
Cause identified: [root cause clear]. Mitigation in progress · recovery ETA [time]. Next update in [interval].
Status page Monitoring
Mitigation applied. We are monitoring the service to confirm full resolution. Next status update in 30 min.
Status page Resolved
Incident resolved. Service fully operational since [timestamp]. A public postmortem will follow within 24h.
Email client SEV-0/1
Hello [name] · the AI Empire service experienced [incident description] between [start] and [end]. Impact on your clinic: [specific impact]. Actions taken: [actions]. Postmortem within 24h. We apologize for the impact. For more details: [postmortem link]
Twitter SEV-0
The AI Empire service is currently experiencing issues. We are investigating. Updates at [status URL]. We apologize for the impact.
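
Because every template uses `[placeholder]` markers, filling them in can be automated. Below is a minimal sketch, assuming an English rendering of the Investigating text; `render` and `PLACEHOLDER` are hypothetical names, not part of any published tooling. It refuses to produce output while any placeholder is unfilled, which guards against publishing a literal `[scope]` on the status page.

```python
import re

# English rendering of the "Status page Investigating" template.
INVESTIGATING = (
    "We are investigating an issue affecting [scope]. "
    "Started at [timestamp]. Investigating root cause. "
    "We will post updates at least every 30 min."
)

# Matches a bracketed placeholder and captures its name.
PLACEHOLDER = re.compile(r"\[([^\]]+)\]")


def render(template: str, values: dict[str, str]) -> str:
    """Replace each [key] with values[key]; raise if a key is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"unfilled placeholder: [{key}]")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


print(render(INVESTIGATING, {"scope": "the booking API",
                             "timestamp": "10:00 UTC"}))
```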
Escalation matrix · 5 severity tiers
| Sev | Primary | Secondary | External |
|---|---|---|---|
| SEV-0 | Founder immediate · phone + Slack + email | Trustees if founder unavailable >30min | Vendor support escalation if a vendor outage is involved |
| SEV-1 | Founder primary · 15min acknowledgment SLA | Technical advisor if complexity exceeds founder bandwidth | Affected vendor support |
| SEV-2 | Founder · 2h acknowledgment | External counsel if legal implications · accountant if financial implications | Typically N/A |
| SEV-3 | Founder · 24h acknowledgment business hours | N/A | N/A |
| SEV-4 | Founder · weekly review | N/A | N/A |
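
The acknowledgment windows in the matrix above can be encoded as a lookup table. This is an illustrative sketch only: `ACK_SLA`, `ack_overdue`, and `page_target` are hypothetical names, "immediate" is modelled as a zero-length window, the business-hours qualifier on SEV-3 is ignored, and SEV-4 has no entry because it is handled in a weekly review rather than acknowledged.

```python
from datetime import timedelta

# Acknowledgment windows per severity, from the Primary column.
ACK_SLA = {
    "SEV-0": timedelta(0),            # immediate
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=2),
    "SEV-3": timedelta(hours=24),     # assumption: business hours ignored
}

# SEV-0 falls back to trustees if the founder is unavailable this long.
FOUNDER_UNAVAILABLE_LIMIT = timedelta(minutes=30)


def ack_overdue(sev: str, waited: timedelta) -> bool:
    """True when an unacknowledged page has exceeded its window."""
    sla = ACK_SLA.get(sev)
    return sla is not None and waited > sla


def page_target(sev: str, waited: timedelta) -> str:
    """Who should be paged next for an unacknowledged incident."""
    if sev == "SEV-0" and waited > FOUNDER_UNAVAILABLE_LIMIT:
        return "trustees"
    return "founder"


print(page_target("SEV-0", timedelta(minutes=31)))
```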
Drill program · 6 annual rituals
- Monthly · DB restore drill (Wave 67 baseline · 2h32min last successful)
- Monthly · Worker rollback drill (Wave 67 · 11min last)
- Quarterly · Full DR simulation (region failure) · 3h47min last successful
- Semiannual · BCP tabletop exercise · stakeholder walkthrough
- Annual · Communication drill · email + Twitter + status page coordination
- Annual · Encryption key recovery drill · Shamir 3-of-5 verified annually
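
One way to keep the cadences above honest is to compute when each drill is next due from its last run. The sketch below assumes calendar-month arithmetic; the drill keys, `CADENCE_MONTHS`, `add_months`, and `overdue` are hypothetical names chosen for illustration.

```python
import calendar
from datetime import date

# Cadence of each drill in months, from the list above.
CADENCE_MONTHS = {
    "db_restore": 1,
    "worker_rollback": 1,
    "full_dr": 3,
    "bcp_tabletop": 6,
    "communication": 12,
    "key_recovery": 12,
}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the month's last day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def overdue(drill: str, last_run: date, today: date) -> bool:
    """True when the drill's next due date has already passed."""
    return today > add_months(last_run, CADENCE_MONTHS[drill])


print(add_months(date(2024, 1, 31), 1))
```

Clamping to the last day of the month avoids invalid dates such as 31 February when a drill was last run at month end.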
Does your SRE team need IR templates?
Markdown templates + sample incidents · drill protocols · communication scripts are available under an Enterprise NDA. Useful for large clinics and DSOs adopting a mature IR culture.