Incident response · executive playbook

Incident response playbook

6 severity tiers (SEV-0 to SEV-4 + Drill) · standardized 8-step IR procedure · 6 copy-ready communication templates · 5-tier escalation matrix · 6-drill annual program. Complements /incident-postmortems (post-incident) with an operational focus during the incident.

6 severity tiers

Severity | Criteria | Target | Communication
SEV-0 · Total outage | Service 100% down · all clients affected · revenue stopped | Restore <2h · postmortem mandatory | Status page Investigating immediate + email clients + Twitter
SEV-1 · Major degradation | >25% requests failing OR critical feature down · multiple clients affected | Restore <4h · postmortem mandatory | Status page Investigating + email affected clients
SEV-2 · Partial degradation | 5-25% requests affected OR single tenant down OR vendor partial outage | Restore <8h · postmortem if customer-visible | Status page Monitoring + targeted email
SEV-3 · Minor | <5% affected · workaround exists · NOT customer-visible OR cosmetic | Restore <24h · log only | Internal Slack · NO public communication needed
SEV-4 · Informational | Cost approaching limit · vendor warning · NO impact yet | Investigate <72h · prevention focus | Internal log · action item created
SEV-Test · Drill | Simulated for training · pre-announced internally | Drill protocol followed · timing measured | NO public · internal documentation only
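
The thresholds in the severity table can be read as a small decision procedure. The sketch below is one possible encoding, not the playbook's actual tooling: the `Signal` fields and the `classify` function are hypothetical names, and treating "any failing requests below 5%" as SEV-3 and "no impact yet" as SEV-4 is an assumption about how the tiers are applied in practice.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical snapshot of an incident's observable impact."""
    failing_pct: float                  # share of requests failing, 0-100
    critical_feature_down: bool = False
    single_tenant_down: bool = False
    vendor_partial_outage: bool = False
    is_drill: bool = False


def classify(s: Signal) -> str:
    """Map a signal to a tier, checking the most severe tier first."""
    if s.is_drill:
        return "SEV-Test"
    if s.failing_pct >= 100:
        return "SEV-0"
    if s.failing_pct > 25 or s.critical_feature_down:
        return "SEV-1"
    if s.failing_pct >= 5 or s.single_tenant_down or s.vendor_partial_outage:
        return "SEV-2"
    if s.failing_pct > 0:
        return "SEV-3"
    return "SEV-4"  # no impact yet: informational only


print(classify(Signal(failing_pct=30)))                       # SEV-1
print(classify(Signal(failing_pct=2, single_tenant_down=True)))  # SEV-2
```

Checking tiers from most to least severe means a signal that matches several rows always lands on the highest applicable tier.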

IR procedure · 8 standardized steps

  1. T+0 · DETECT: alert triggered (UptimeRobot · Sentry · manual report) · acknowledgment automatic within 5min
  2. T+5min · TRIAGE: severity assessed · incident commander designated (founder by default) · Slack war room created
  3. T+15min · COMMUNICATE: status page Investigating · email affected clients if SEV-0/1 · timestamp every action
  4. T+30min · MITIGATE: execute relevant runbook (/runbooks-publicos) · degraded modes activated · monitor progress
  5. T+1h · CHECKPOINT: re-assess whether the MTPD (maximum tolerable period of disruption) is approaching · escalate if stuck · adjust strategy · communicate update
  6. T+2-8h · RESOLVE: confirmed recovery · 3 polls UptimeRobot green · smoke test end-to-end · status page Resolved
  7. T+24h · POSTMORTEM: 5 Whys analysis · timeline reconstructed · corrective actions defined · published /incident-postmortems
  8. T+30/60/90d · REVIEW: corrective actions verified completed · prevention plan validated · lessons institutionalized
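
The T+ offsets above translate directly into wall-clock deadlines once the detection time is known. A minimal sketch follows, assuming the RESOLVE deadline is taken from the per-severity restore targets (2h, 4h, 8h for SEV-0 to SEV-2), which is one reading of "T+2-8h"; the names `timeline` and `review_dates` are illustrative only.

```python
from datetime import datetime, timedelta

# Fixed offsets taken from the 8-step procedure.
STEP_OFFSETS = [
    ("DETECT", timedelta(0)),
    ("TRIAGE", timedelta(minutes=5)),
    ("COMMUNICATE", timedelta(minutes=15)),
    ("MITIGATE", timedelta(minutes=30)),
    ("CHECKPOINT", timedelta(hours=1)),
    ("POSTMORTEM", timedelta(hours=24)),
]

# Assumption: RESOLVE uses the restore target of the incident's tier.
RESTORE_TARGET = {
    "SEV-0": timedelta(hours=2),
    "SEV-1": timedelta(hours=4),
    "SEV-2": timedelta(hours=8),
}


def timeline(detected_at: datetime, severity: str) -> list[tuple[str, datetime]]:
    """Return (step, deadline) pairs in chronological order."""
    plan = [(name, detected_at + offset) for name, offset in STEP_OFFSETS]
    plan.append(("RESOLVE", detected_at + RESTORE_TARGET[severity]))
    return sorted(plan, key=lambda item: item[1])


def review_dates(detected_at: datetime) -> list[datetime]:
    """Dates of the 30/60/90-day corrective-action reviews."""
    return [detected_at + timedelta(days=d) for d in (30, 60, 90)]


for step, deadline in timeline(datetime(2024, 1, 1, 12, 0), "SEV-1"):
    print(f"{step:<12} {deadline:%Y-%m-%d %H:%M}")
```

Printing the plan into the war room channel at triage time gives the incident commander a single list of deadlines to track.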

6 communication templates

Status page Investigating
We are investigating an issue affecting [scope]. Started at [timestamp]. Root cause under investigation. We will post updates at least every 30 minutes.
Status page Identified
Cause identified: [root cause clear]. Mitigation in progress · ETA recovery [time]. Next update in [interval].
Status page Monitoring
Mitigation applied. We are monitoring the service to confirm full resolution. Next status update in 30 minutes.
Status page Resolved
Incident resolved. Service fully operational since [timestamp]. Public postmortem within 24h.
Email client SEV-0/1
Hello [name] · the AI Empire service experienced [incident description] between [start] and [end]. Impact on your clinic: [specific impact]. Actions taken: [actions]. Postmortem within 24h. We apologize for the impact. For more details: [postmortem link]
Twitter SEV-0
The AI Empire service is currently experiencing issues. We are investigating. Updates at [status URL]. We apologize for the impact.
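
Because the templates use `[placeholder]` markers, they can be filled programmatically and checked so that no unfilled bracket reaches the status page. The sketch below is a hypothetical helper (`fill` is not part of any existing tooling) that substitutes bracketed names and fails loudly on a missing value; the sample message follows the Investigating template.

```python
import re

# Matches a [placeholder] with no nested brackets inside it.
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill(template: str, values: dict[str, str]) -> str:
    """Replace every [name] in the template; raise if a value is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing placeholder: {key}")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


investigating = "We are investigating an issue affecting [scope]. Started at [timestamp]."
print(fill(investigating, {"scope": "appointment booking", "timestamp": "10:42 UTC"}))
```

Raising on a missing value is a deliberate choice here: a delayed message is usually preferable to publishing one that still contains a literal `[scope]`.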

Escalation matrix · 5 severity tiers

Sev | Primary | Secondary | External
SEV-0 | Founder immediate · phone + Slack + email | Trustees if founder unavailable >30min | Vendor support escalation if vendor outage involved
SEV-1 | Founder primary · 15min acknowledgment SLA | Technical advisor if complexity exceeds founder bandwidth | Affected vendor support
SEV-2 | Founder · 2h acknowledgment | External counsel if legal implications · accountant if financial | N/A typical
SEV-3 | Founder · 24h acknowledgment business hours | N/A | N/A
SEV-4 | Founder · weekly review | N/A | N/A
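
The escalation matrix can also be expressed as a lookup plus two time-based rules. This is a rough sketch under stated assumptions: `escalation_targets` and `ack_breached` are hypothetical names, the business-hours qualifier on SEV-3 is not modeled, and the conditional secondary contacts (technical advisor, counsel, accountant) are left to human judgment rather than encoded.

```python
from datetime import timedelta

# Acknowledgment SLAs stated in the matrix. SEV-0 ("immediate") and
# SEV-4 (weekly review) have no numeric SLA and are left out on purpose.
ACK_SLA = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=2),
    "SEV-3": timedelta(hours=24),  # business hours not modeled here
}


def escalation_targets(severity: str, unacked_for: timedelta,
                       vendor_outage: bool = False) -> list[str]:
    """Who should be contacted, given how long the founder has not responded."""
    targets = ["founder"]
    if severity == "SEV-0" and unacked_for > timedelta(minutes=30):
        targets.append("trustees")
    if severity in ("SEV-0", "SEV-1") and vendor_outage:
        targets.append("vendor support")
    return targets


def ack_breached(severity: str, unacked_for: timedelta) -> bool:
    """True when the acknowledgment SLA for this tier has been exceeded."""
    sla = ACK_SLA.get(severity)
    return sla is not None and unacked_for > sla


print(escalation_targets("SEV-0", timedelta(minutes=45), vendor_outage=True))
print(ack_breached("SEV-1", timedelta(minutes=20)))
```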

Drill program · 6 annual rituals

  • Monthly · DB restore drill (Wave 67 baseline · 2h32min last successful)
  • Monthly · Worker rollback drill (Wave 67 · 11min last)
  • Quarterly · Full DR simulation (region failure) · 3h47min last successful
  • Semiannual · BCP tabletop exercise · stakeholders walkthrough
  • Annual · Communication drill · email + Twitter + status page coordination
  • Annual · Encryption key recovery drill · Shamir 3-of-5 verified annually
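
Since drill timings are measured, each new run can be compared against the last successful baseline quoted in the list. The sketch below assumes a 20% tolerance before a run counts as a regression; that threshold, the dictionary keys, and the function names are illustrative choices, not part of the playbook. It also counts how many drill runs per year the six cadences imply.

```python
from datetime import timedelta

# Last successful timings quoted in the drill list.
BASELINE = {
    "db_restore": timedelta(hours=2, minutes=32),
    "worker_rollback": timedelta(minutes=11),
    "full_dr": timedelta(hours=3, minutes=47),
}

RUNS_PER_YEAR = {"monthly": 12, "quarterly": 4, "semiannual": 2, "annual": 1}

# Cadence of each of the six drills, in list order.
PROGRAM = ["monthly", "monthly", "quarterly", "semiannual", "annual", "annual"]


def yearly_runs() -> int:
    """Total drill executions per year implied by the cadences."""
    return sum(RUNS_PER_YEAR[cadence] for cadence in PROGRAM)


def regressed(drill: str, measured: timedelta, tolerance: float = 0.20) -> bool:
    """True when a run is slower than baseline by more than the tolerance."""
    return measured > BASELINE[drill] * (1 + tolerance)


print(yearly_runs())                                        # 32
print(regressed("worker_rollback", timedelta(minutes=14)))  # True
```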

Does your SRE team need IR templates?

Markdown templates + sample incidents · drill protocols · communication scripts available under Enterprise NDA. Useful for large clinics/DSOs adopting a mature IR culture.