Observability · transparent strategy

Observability strategy

3 pillars (metrics · logs · traces) · 4 alert tiers (P0/P1/P2/P3) · 6 privacy rules (NO PII in telemetry) · 6 quantified cost guards. Observability that respects privacy and controls costs · NO data hoarding for fishing expeditions.

3 pillars of observability

Pillar	Description	Tools	Retention
Metrics · quantitative time-series	Latency · throughput · error rates · cost · business metrics (messages processed · handoffs · resolutions) · aggregated per-tenant	Cloudflare Analytics + Supabase pg_stat + custom Postgres metrics table + Sentry performance	Hot 7d full granularity · warm 30d 1h aggregation · cold 365d daily aggregation · cost-effective tiered
Logs · structured + contextual	Structured JSON application logs · request_id + clinic_id + user_id + operation propagated · error stack traces · compliance audit trail	Cloudflare Workers logs + Supabase pgaudit + Sentry breadcrumbs + durable custom error_log table	30d hot · 365d warm · 7y encrypted cold storage (compliance-required, healthcare)
Traces · distributed request flows	Span hierarchy request → webhook validation → DB query → LLM call → DB write → response · per-span timing · errors highlighted · smart sampling	Sentry Performance + custom trace_id propagation + Cloudflare Workers trace context	Sampling 10% normal · 100% errors · 100% high-latency (>5s) · 30d retention · NO PII captured in spans
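
The trace sampling policy in the table (10% of normal requests, 100% of errors, 100% of requests slower than 5s) can be sketched as a decision made once a request has finished, since errors and latency are only known at that point. This is a minimal illustration, not the production sampler: the function name, constants, and the optional `rng` parameter are assumptions introduced here.

```python
import random
from typing import Optional

NORMAL_SAMPLE_RATE = 0.10    # 10% of ordinary requests (from the table)
HIGH_LATENCY_SECONDS = 5.0   # slower traces are always kept (from the table)


def should_keep_trace(
    had_error: bool,
    duration_seconds: float,
    rng: Optional[random.Random] = None,  # hypothetical hook for deterministic tests
) -> bool:
    """Decide whether a finished trace is retained."""
    if had_error:
        return True  # 100% of errors
    if duration_seconds > HIGH_LATENCY_SECONDS:
        return True  # 100% of high-latency requests
    source = rng if rng is not None else random
    return source.random() < NORMAL_SAMPLE_RATE  # 10% of everything else
```

Passing a seeded `random.Random` makes the 10% branch reproducible in tests; in normal use the argument is omitted.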

4 alert tiers · severity-based response

Severity	Trigger	Action	Response time
P0 · service down	Health endpoint 504 2+ consecutive polls · OR error rate >10% 5min · OR breaker tripped (OpenAI/Meta)	Slack + email + SMS founder immediate · runbook /incident-rca activated · status page Investigating	<5min acknowledge
P1 · degraded performance	p95 latency >5s 5min · OR error rate >5% 10min · OR queue lag >10min	Slack + email founder · runbook investigation · status page maybe-degraded	<30min acknowledge
P2 · partial failure	Specific endpoint error rate >2% 15min · OR specific clinic affected · OR vendor partial outage	Slack founder business hours · investigate · workaround if available	<2h acknowledge
P3 · informational	Cost threshold approaching · disk usage approaching · API quota approaching · token approaching expiry	Slack non-urgent channel · action item created tracker · NO immediate response required	<24h review
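
As an illustration of how the numeric triggers above map to a tier, the sketch below evaluates a snapshot of health metrics from most to least severe. The `HealthSnapshot` fields and the `classify` function are hypothetical names, not the actual alerting code. The non-numeric P2 triggers (specific clinic affected, vendor partial outage) and the P3 capacity warnings are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# Acknowledge/review deadlines from the table, in minutes.
ACK_DEADLINE_MINUTES = {"P0": 5, "P1": 30, "P2": 120, "P3": 24 * 60}


@dataclass
class HealthSnapshot:
    """Hypothetical metric snapshot; rates are fractions (0.10 == 10%)."""
    consecutive_504_polls: int = 0
    breaker_tripped: bool = False
    error_rate_5min: float = 0.0
    error_rate_10min: float = 0.0
    p95_latency_5min_s: float = 0.0
    queue_lag_min: float = 0.0
    worst_endpoint_error_rate_15min: float = 0.0


def classify(s: HealthSnapshot) -> Optional[str]:
    """Return the highest matching tier, or None when nothing fires."""
    if s.consecutive_504_polls >= 2 or s.error_rate_5min > 0.10 or s.breaker_tripped:
        return "P0"
    if s.p95_latency_5min_s > 5.0 or s.error_rate_10min > 0.05 or s.queue_lag_min > 10:
        return "P1"
    if s.worst_endpoint_error_rate_15min > 0.02:
        return "P2"
    return None
```

Checking tiers in descending severity means a snapshot that satisfies both a P0 and a P1 condition is reported once, as P0.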

Privacy rules · 6 mandatory

NEVER log PII (patient_name · phone · medical_history · conversation content) in metrics/logs/traces · scrubbed before telemetry export
PII columns in the DB encrypted via pgsodium · decrypt operations audit-logged · NEVER stored in plaintext beyond the active session
Sentry PII scrubbing aggressive · custom DataProcessor rejects detected PII · pre-send hook scrubs known patterns
URL parameters NEVER carry PII · use POST body for sensitive data · prevents leak browser history + referer + logs
Customer-supplied diagnostics (support ticket attachments) require explicit consent + auto-expire 30d · stored encrypted
External auditor access (ChatGPT bundle) ALWAYS uses anonymized data · NO PII · NO raw conversations · only aggregated stats
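
To make the "scrub before export" rule concrete, here is a minimal sketch of a scrubber that could run on an event before it leaves the process. It is an assumption-laden illustration, not the real Sentry hook: the key `conversation_content`, the phone pattern, and the `scrub` function are all introduced here for the example.

```python
import re
from typing import Any

# Field names from the first privacy rule; "conversation_content" is an assumed key name.
PII_KEYS = {"patient_name", "phone", "medical_history", "conversation_content"}
# Rough pattern for phone-like digit runs; an assumption, not a validated detector.
PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")
REDACTED = "[redacted]"


def scrub(value: Any) -> Any:
    """Return a copy of the event with known PII keys and phone-like strings redacted."""
    if isinstance(value, dict):
        return {k: (REDACTED if k in PII_KEYS else scrub(v)) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return PHONE_RE.sub(REDACTED, value)
    return value  # numbers, booleans, None pass through unchanged
```

The pattern errs on the side of over-redaction (a timestamp such as `2024-01-15 10` also matches), so a real deployment would likely restrict it to free-text fields and keep key-based redaction as the primary defense.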

Cost guards · 6 quantified limits

Guard	Limit	Monitoring
OpenAI cost-cap monthly	€500/month hard cap · breaker trips at 80% (€400) · fallback templates only	Daily Slack report + alert at each €100 threshold
Cloudflare Workers requests cap	10M requests/day · ~€30/month upper bound · auto-scale within plan	Hourly metric · alert on spikes via anomaly detection
Supabase compute budget	Pro plan €25/month baseline · auto-scale compute · alert on spike to €50/month	Daily compute usage report · alert on sustained high usage
Upstash QStash messages	Free tier 500/day · alert at 400 (80%) · upgrade trigger: sustained breach for 3 days	Real-time queue metrics + daily summary
Sentry events quota	Team plan 50k events/month · alert at 40k (80%) · adjust sample rate if approaching	Weekly Sentry usage report
R2 storage backups	1TB free · alert at 700GB (70%) · cleanup old backups beyond retention policy	Monthly storage usage report
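
The guards that state both a limit and an alert threshold can be expressed as data plus one comparison. The sketch below is illustrative only: the `Guard` type, the dictionary keys, and the status strings are hypothetical. The Cloudflare and Supabase rows are omitted because their alerts are described as spike detection rather than a fixed fraction of a limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guard:
    name: str
    limit: float     # hard limit from the table
    alert_at: float  # absolute alert threshold from the table


GUARDS = {
    "openai_eur_month": Guard("OpenAI monthly cost (EUR)", 500, 400),   # breaker trips at 80%
    "qstash_msgs_day": Guard("QStash messages per day", 500, 400),      # alert at 80%
    "sentry_events_month": Guard("Sentry events per month", 50_000, 40_000),
    "r2_storage_gb": Guard("R2 backup storage (GB)", 1000, 700),        # alert at 70%
}


def guard_status(guard: Guard, usage: float) -> str:
    """Compare current usage against one guard."""
    if usage >= guard.limit:
        return "over_limit"
    if usage >= guard.alert_at:
        return "threshold_reached"
    return "ok"
```

For the OpenAI guard, `threshold_reached` corresponds to the point where the table says the breaker trips and only fallback templates are served; for the other guards it corresponds to an alert.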

Known gaps · transparent disclosure

Sentry-mute incident, mentioned in CLAUDE.md priorities (item 7): Sentry alerts are intermittently silenced, with no root cause identified. Status: HIGH priority post-CIF · 2d effort · workaround: founder checks logs manually every day.

Distributed tracing maturity: trace_id propagation works endpoint→DB but is NOT complete across QStash workers + LLM provider calls. Roadmap Q3 2026 · OpenTelemetry adoption planned once the OpenTelemetry CF Workers SDK is stable.
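
One way the queue gap could be closed is to carry trace context inside the queued message itself. The sketch below assumes the W3C Trace Context `traceparent` format (`00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`), which OpenTelemetry uses by default; the envelope shape and both function names are hypothetical, and nothing here reflects how QStash or the current trace_id propagation actually works.

```python
import re
import secrets
from typing import Optional

# Version 00 traceparent: trace id (32 hex), parent span id (16 hex), flags (2 hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(parent: Optional[str]) -> str:
    """Keep the parent's trace id and flags, mint a new span id.

    A missing or malformed parent starts a fresh, sampled trace.
    """
    match = TRACEPARENT_RE.match(parent or "")
    if match:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def enqueue_payload(body: dict, incoming_traceparent: Optional[str]) -> dict:
    """Hypothetical message envelope: the worker reads 'traceparent' on dequeue."""
    return {"traceparent": child_traceparent(incoming_traceparent), "body": body}
```

A worker that receives this envelope would call `child_traceparent(msg["traceparent"])` before its own outbound calls, so the LLM provider request stays on the same trace id.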

Does your SRE team need an observability deep-dive?

For Enterprise procurement: sample dashboards · alert configuration · SLO definitions · detailed cost tracking are available under an Enterprise NDA.
