Observability strategy
3 pillars (metrics · logs · traces) · 4 alert tiers (P0/P1/P2/P3) · 6 privacy rules (NO PII in telemetry) · 6 quantified cost guards. Observability that respects privacy + controls costs · NO data-hoarding fishing expeditions.
3 pillars of observability
| Pillar | Description | Tools | Retention |
|---|---|---|---|
| Metrics · quantitative time-series | Latency · throughput · error rates · cost · business metrics (messages processed · handoffs · resolutions) · aggregated per-tenant | Cloudflare Analytics + Supabase pg_stat + custom Postgres metrics table + Sentry performance | Hot 7d full granularity · warm 30d 1h aggregation · cold 365d daily aggregation · cost-effective tiered |
| Logs · structured + contextual | Application logs as structured JSON · request_id + clinic_id + user_id + operation propagated · error stack traces · compliance audit trail | Cloudflare Workers logs + Supabase pgaudit + Sentry breadcrumbs + durable custom error_log table | 30d hot · 365d warm · 7y compliance-required (healthcare) encrypted cold storage |
| Traces · distributed request flows | Span hierarchy request → webhook validation → DB query → LLM call → DB write → response · per-span timing · errors highlighted · smart sampling | Sentry Performance + custom trace_id propagation + Cloudflare Workers trace context | Sampling 10% normal · 100% errors · 100% high-latency (>5s) · 30d retention · NO PII captured in spans |
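
The sampling policy in the Traces row (10% normal · 100% errors · 100% high-latency) could be sketched as below. This is a minimal illustration, not the production sampler: it assumes the keep/drop decision is made after the request finishes, when duration and error status are known, and the names `should_keep_trace`, `NORMAL_SAMPLE_RATE` and `HIGH_LATENCY_SECONDS` are hypothetical.

```python
import random
from typing import Optional

NORMAL_SAMPLE_RATE = 0.10    # 10% of ordinary traces (from the table)
HIGH_LATENCY_SECONDS = 5.0   # traces slower than this are always kept


def should_keep_trace(duration_s: float, had_error: bool,
                      rng: Optional[random.Random] = None) -> bool:
    """Decide whether a finished trace is exported (hypothetical helper)."""
    if had_error:
        return True                       # 100% of errors
    if duration_s > HIGH_LATENCY_SECONDS:
        return True                       # 100% of high-latency requests
    roll = (rng or random).random()       # uniform value in [0.0, 1.0)
    return roll < NORMAL_SAMPLE_RATE      # ~10% of everything else
```

Passing a seeded `random.Random` makes the decision reproducible in tests; in normal use the module-level generator is enough.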
4 alert tiers · severity-based response
| Severity | Trigger | Action | Response time |
|---|---|---|---|
| P0 · service down | Health endpoint 504 2+ consecutive polls · OR error rate >10% 5min · OR breaker tripped (OpenAI/Meta) | Slack + email + SMS founder immediate · runbook /incident-rca activated · status page Investigating | <5min acknowledge |
| P1 · degraded performance | p95 latency >5s 5min · OR error rate >5% 10min · OR queue lag >10min | Slack + email founder · runbook investigation · status page maybe-degraded | <30min acknowledge |
| P2 · partial failure | Specific endpoint error rate >2% 15min · OR specific clinic affected · OR vendor partial outage | Slack founder business hours · investigate · workaround if available | <2h acknowledge |
| P3 · informational | Cost threshold approaching · disk usage approaching · API quota approaching · token approaching expiry | Slack non-urgent channel · action item created tracker · NO immediate response required | <24h review |
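
The tier triggers above can be read as an ordered rule set, checked from most to least severe. The sketch below is an assumption-laden illustration: the `HealthSnapshot` fields and the `classify` function are hypothetical names, and the thresholds are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthSnapshot:
    """Hypothetical bundle of the signals named in the alert table."""
    consecutive_504s: int = 0
    error_rate_5m_pct: float = 0.0
    error_rate_10m_pct: float = 0.0
    breaker_tripped: bool = False
    p95_latency_5m_s: float = 0.0
    queue_lag_min: float = 0.0
    endpoint_error_rate_15m_pct: float = 0.0
    single_clinic_affected: bool = False
    vendor_partial_outage: bool = False
    quota_or_expiry_approaching: bool = False


# Acknowledge/review deadlines in minutes, taken from the table.
ACK_DEADLINE_MIN = {"P0": 5, "P1": 30, "P2": 120, "P3": 24 * 60}


def classify(s: HealthSnapshot) -> Optional[str]:
    """Return the most severe matching tier, or None if nothing fires."""
    if s.consecutive_504s >= 2 or s.error_rate_5m_pct > 10 or s.breaker_tripped:
        return "P0"
    if s.p95_latency_5m_s > 5 or s.error_rate_10m_pct > 5 or s.queue_lag_min > 10:
        return "P1"
    if (s.endpoint_error_rate_15m_pct > 2 or s.single_clinic_affected
            or s.vendor_partial_outage):
        return "P2"
    if s.quota_or_expiry_approaching:
        return "P3"
    return None
```

Checking tiers in severity order means a snapshot that matches both P0 and P2 conditions is routed once, at P0.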
Privacy rules · 6 mandatory
- NEVER log PII (patient_name · phone · medical_history · conversation content) in metrics/logs/traces · scrubbed before telemetry export
- PII columns in DB encrypted via pgsodium · audit log decrypt operations · NEVER stored plaintext beyond active session
- Sentry PII scrubbing aggressive · custom DataProcessor rejects detected PII · pre-send hook scrubs known patterns
- URL parameters NEVER carry PII · use POST body for sensitive data · prevents leak browser history + referer + logs
- Customer-supplied diagnostics (support ticket attachments) require explicit consent + auto-expire 30d · stored encrypted
- Auditor external access (ChatGPT bundle) ALWAYS anonymized data · NO PII · NO raw conversations · only aggregated stats
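
A scrubbing step of the kind the first and third rules describe might look like the sketch below. It is illustrative only: the key `conversation_content`, the regexes and the `[redacted]` placeholder are assumptions, and the patterns are deliberately simple (they will miss some formats and may over-match long digit runs). The `before_send(event, hint)` wrapper follows the callback shape the Sentry Python SDK accepts; the actual stack may use the JavaScript equivalent instead.

```python
import re
from typing import Any

# Field names from the privacy rules; "conversation_content" is a hypothetical key.
PII_KEYS = {"patient_name", "phone", "medical_history", "conversation_content"}
REDACTED = "[redacted]"

# Simple illustrative patterns, not exhaustive detectors.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(value: Any) -> Any:
    """Recursively redact known PII keys and PII-looking substrings."""
    if isinstance(value, dict):
        return {
            k: REDACTED if str(k).lower() in PII_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub(REDACTED, PHONE_RE.sub(REDACTED, value))
    return value  # numbers, booleans, None pass through unchanged


def before_send(event: dict, hint: dict) -> dict:
    """Pre-send hook shape: scrub the whole event before it leaves the process."""
    return scrub(event)
```

Redacting by key first and by pattern second covers both structured fields and PII that leaks into free-text messages.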
Cost guards · 6 quantified limits
| Guard | Limit | Monitoring |
|---|---|---|
| OpenAI cost-cap monthly | 500€/month hard cap · breaker trips at 80% (400€) · fallback templates only | Daily Slack report + alert at each 100€ threshold |
| Cloudflare Workers requests cap | 10M requests/day · ~30€/month upper bound · auto-scale within plan | Hourly metric · alert on spike via anomaly detection |
| Supabase compute budget | Pro plan 25€/month baseline · auto-scale compute alerts at 50€/month spike | Daily compute usage report · alert on sustained high usage |
| Upstash QStash messages | Free tier 500/day · alert at 400 (80%) · upgrade trigger: sustained breach for 3 days | Real-time queue metrics + daily summary |
| Sentry events quota | Team plan 50k events/month · alert at 40k (80%) · adjust sample rate if approaching | Weekly Sentry usage report |
| R2 storage backups | 1TB free · alert at 700GB (70%) · cleanup old backups beyond retention policy | Monthly storage usage report |
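
The guard arithmetic in the table (limit, alert level, 100€ notification steps) could be expressed as below. This is a sketch with hypothetical names (`Guard`, `evaluate`, `crossed_steps`); the limits and alert levels are the ones listed above, and 1TB is taken as 1000GB so that 700GB is the 70% mark.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Guard:
    name: str
    limit: int      # hard cap / quota from the table
    alert_at: int   # absolute alert level from the table


GUARDS = {
    "openai_eur_month": Guard("openai_eur_month", 500, 400),       # breaker at 80%
    "qstash_messages_day": Guard("qstash_messages_day", 500, 400),
    "sentry_events_month": Guard("sentry_events_month", 50_000, 40_000),
    "r2_storage_gb": Guard("r2_storage_gb", 1000, 700),            # alert at 70%
}


def evaluate(guard: Guard, usage: float) -> str:
    """Return 'ok', 'alert' (alert level reached) or 'breach' (cap reached)."""
    if usage >= guard.limit:
        return "breach"
    if usage >= guard.alert_at:
        return "alert"
    return "ok"


def crossed_steps(previous: float, current: float, step: int = 100) -> List[int]:
    """Multiples of `step` passed when spend moves from previous to current."""
    first = (math.floor(previous / step) + 1) * step
    return list(range(first, math.floor(current) + 1, step))
```

For example, `crossed_steps(180, 320)` yields `[200, 300]`, one Slack alert per 100€ threshold crossed since the last check. For the OpenAI guard, the `alert` state corresponds to the breaker tripping and switching to fallback templates.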
Sentry-mute incident mentioned in CLAUDE.md priorities: item 7, Sentry alerts intermittently silenced with no root cause identified. Status: HIGH priority post-CIF · 2d effort · workaround: founder checks logs manually every day.
Distributed tracing maturity: trace_id propagation works endpoint→DB but is NOT complete across QStash workers + LLM provider calls. Roadmap Q3 2026 · OpenTelemetry adoption planned once the OpenTelemetry CF Workers SDK is stable.
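
One way to close the gap across queue hops is to carry the trace context in a message header and mint a child span on each hop. The sketch below uses the W3C Trace Context `traceparent` header format (the default propagation format in OpenTelemetry); the source does not say which format the current custom trace_id uses, so this is an assumption. The helper names are hypothetical, and how the header is attached to a queued message depends on the queue's API and is left out.

```python
import re
import secrets
from typing import Optional

# traceparent: version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Start a new trace, or a new span in an existing trace."""
    trace_id = trace_id or secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                 # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent_header: str) -> str:
    """Keep the parent's trace-id and flags, generate a fresh span-id."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        return new_traceparent()                   # malformed: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_traceparent: str) -> dict:
    """Headers to attach to a queued message or an outbound provider call."""
    return {"traceparent": child_traceparent(incoming_traceparent)}
```

A worker that receives the message would call `child_traceparent` again on the header it was given, so every hop shares one trace-id while keeping its own span-id.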
Does your SRE team need an observability deep-dive?
For Enterprise procurement · sample dashboards · alert configuration · SLO definitions · detailed cost tracking available under Enterprise NDA.