Observability · transparent strategy

Observability strategy

3 pillars (metrics · logs · traces) · 4 alert tiers (P0/P1/P2/P3) · 6 privacy rules (NO PII in telemetry) · 6 quantified cost guards. Observability that respects privacy and controls costs · NO data hoarding, NO fishing expeditions.

3 pillars observability

| Pillar | Description | Tools | Retention |
|---|---|---|---|
| Metrics · quantitative time-series | Latency · throughput · error rates · cost · business metrics (messages processed · handoffs · resolutions) · aggregated per-tenant | Cloudflare Analytics + Supabase pg_stat + custom Postgres metrics table + Sentry performance | Hot 7d full granularity · warm 30d 1h aggregation · cold 365d daily aggregation · cost-effective tiered |
| Logs · structured + contextual | Application logs structured JSON · request_id + clinic_id + user_id + operation propagated · error stack traces · audit trail compliance | Cloudflare Workers logs + Supabase pgaudit + Sentry breadcrumbs + custom error_log table durable | 30d hot · 365d warm · 7y compliance-required (healthcare) cold storage encrypted |
| Traces · distributed request flows | Span hierarchy request → webhook validation → DB query → LLM call → DB write → response · timing per-span · errors highlighted · sampling smart | Sentry Performance + custom trace_id propagation + Cloudflare Workers trace context | Sampling 10% normal · 100% errors · 100% high-latency (>5s) · 30d retention · NO PII captured in spans |
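
The trace sampling rule (10% normal · 100% errors · 100% high-latency) can be sketched as a small decision function. This is a minimal illustration, not the production implementation: the `RequestOutcome` shape and the function name are hypothetical, and the sketch assumes the decision is made after the request finishes, since the error and latency conditions are only known at that point.

```typescript
// Hypothetical summary of a finished request; field names are illustrative.
interface RequestOutcome {
  durationMs: number;
  hadError: boolean;
}

const BASELINE_SAMPLE_RATE = 0.1; // 10% of normal traffic
const HIGH_LATENCY_MS = 5000;     // requests slower than 5s are always kept

// Returns true when the trace should be kept.
// `random` is injectable so the rule can be tested deterministically.
function shouldKeepTrace(
  outcome: RequestOutcome,
  random: () => number = Math.random,
): boolean {
  if (outcome.hadError) return true;                      // 100% of errors
  if (outcome.durationMs > HIGH_LATENCY_MS) return true;  // 100% of high-latency
  return random() < BASELINE_SAMPLE_RATE;                 // 10% of everything else
}
```

Keeping the random source as a parameter makes the 10% branch easy to unit-test without statistical assertions.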

4 alert tiers · severity-based response

| Severity | Trigger | Action | Response time |
|---|---|---|---|
| P0 · service down | Health endpoint 504 2+ consecutive polls · OR error rate >10% 5min · OR breaker tripped (OpenAI/Meta) | Slack + email + SMS founder immediate · runbook /incident-rca activated · status page Investigating | <5min acknowledge |
| P1 · degraded performance | p95 latency >5s 5min · OR error rate >5% 10min · OR queue lag >10min | Slack + email founder · runbook investigation · status page maybe-degraded | <30min acknowledge |
| P2 · partial failure | Specific endpoint error rate >2% 15min · OR specific clinic affected · OR vendor partial outage | Slack founder business hours · investigate · workaround if available | <2h acknowledge |
| P3 · informational | Cost threshold approaching · disk usage approaching · API quota approaching · token approaching expiry | Slack non-urgent channel · action item created tracker · NO immediate response required | <24h review |
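
The tier triggers above can be read as an ordered rule set in which the most severe matching tier wins. The sketch below encodes the thresholds from the table. The `HealthSnapshot` shape and all field names are hypothetical, and it assumes some monitoring layer has already aggregated each metric over the stated window.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical pre-aggregated snapshot; error rates are fractions in [0, 1].
interface HealthSnapshot {
  consecutiveHealth504s: number;
  errorRate5m: number;
  errorRate10m: number;
  breakerTripped: boolean;            // OpenAI or Meta circuit breaker
  p95LatencyMs5m: number;
  queueLagMinutes: number;
  worstEndpointErrorRate15m: number;
  singleClinicAffected: boolean;
  vendorPartialOutage: boolean;
  quotaOrExpiryWarning: boolean;      // cost, disk, API quota, or token expiry approaching
}

// Checks tiers from most to least severe and returns the first match, or null if healthy.
function classifyAlert(s: HealthSnapshot): Severity | null {
  if (s.consecutiveHealth504s >= 2 || s.errorRate5m > 0.10 || s.breakerTripped) return "P0";
  if (s.p95LatencyMs5m > 5000 || s.errorRate10m > 0.05 || s.queueLagMinutes > 10) return "P1";
  if (s.worstEndpointErrorRate15m > 0.02 || s.singleClinicAffected || s.vendorPartialOutage) return "P2";
  if (s.quotaOrExpiryWarning) return "P3";
  return null;
}

// Acknowledge/review deadlines from the table, in minutes.
const ACK_DEADLINE_MINUTES: Record<Severity, number> = { P0: 5, P1: 30, P2: 120, P3: 1440 };
```

Evaluating tiers top-down means a snapshot that satisfies both a P0 and a P1 trigger is reported once, as P0.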

Privacy rules · 6 mandatory

  • NEVER log PII (patient_name · phone · medical_history · conversation content) in metrics/logs/traces · scrubbed before telemetry export
  • PII columns in the DB encrypted via pgsodium · decrypt operations audit-logged · NEVER stored in plaintext beyond the active session
  • Sentry PII scrubbing aggressive · custom DataProcessor rejects detected PII · pre-send hook scrubs known patterns
  • URL parameters NEVER carry PII · use POST body for sensitive data · prevents leaks via browser history + referer + logs
  • Customer-supplied diagnostics (support ticket attachments) require explicit consent + auto-expire after 30d · stored encrypted
  • External auditor access (ChatGPT bundle) ALWAYS uses anonymized data · NO PII · NO raw conversations · only aggregated stats
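
As an illustration of the "scrub before export" rule, the sketch below redacts known PII fields by key and phone-like strings by pattern. It is a minimal example under stated assumptions: the key list reuses the field names from the first rule (with `conversation_content` as a guessed key name), the phone pattern is deliberately loose, and the function name is hypothetical. A function like this could be called from a pre-send hook (Sentry's JavaScript SDK exposes `beforeSend` for that purpose), though the page does not say how its scrubbing is actually implemented.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Keys whose values are always redacted (assumed names, taken from the rule above).
const PII_KEYS = new Set<string>(["patient_name", "phone", "medical_history", "conversation_content"]);

// Loose phone-like pattern (an assumption): optional "+", then at least 9 digits,
// allowing spaces and dashes in between. It over-redacts long digit runs on purpose.
const PHONE_PATTERN = /\+?\d[\d\s-]{7,}\d/g;

const REDACTED = "[REDACTED]";

// Returns a scrubbed deep copy; the input value is not mutated.
function scrubPii(value: Json): Json {
  if (typeof value === "string") return value.replace(PHONE_PATTERN, REDACTED);
  if (Array.isArray(value)) return value.map((item) => scrubPii(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      const inner = value[key];
      if (inner === undefined) continue;
      out[key] = PII_KEYS.has(key) ? REDACTED : scrubPii(inner);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Redacting by key catches structured fields, while the pattern pass is a second line of defense for PII embedded in free-text messages. Erring toward over-redaction is the safer failure mode for healthcare telemetry.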

Cost guards · 6 quantified limits

| Guard | Limit | Monitoring |
|---|---|---|
| OpenAI cost-cap monthly | 500€/month hard cap · breaker trips at 80% (400€) · fallback templates only | Daily Slack report + alert at each 100€ threshold |
| Cloudflare Workers requests cap | 10M requests/day · ~30€/month upper bound · auto-scale within plan | Hourly metric · alert on spike anomaly detection |
| Supabase compute budget | Pro plan 25€/month baseline · auto-scale compute alerts at 50€/month spike | Daily compute usage report · alert on sustained high usage |
| Upstash QStash messages | Free tier 500/day · alert at 400 (80%) · upgrade trigger: sustained breach for 3 days | Real-time queue metrics + daily summary |
| Sentry events quota | Team plan 50k events/month · alert at 40k (80%) · adjust sample rate if approaching | Weekly Sentry usage report |
| R2 storage backups | 1TB free · alert at 700GB (70%) · cleanup of old backups beyond retention policy | Monthly storage usage report |
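
Several of these guards share the same shape: a hard limit plus an alert level expressed as a percentage of it. The sketch below shows that arithmetic for two rows of the table, along with the "alert at each 100€ threshold" rule. The `CostGuard` type and the function names are hypothetical, and how usage figures are collected is outside the sketch.

```typescript
interface CostGuard {
  label: string;
  limit: number;        // hard limit, in `unit`
  alertPercent: number; // alert level as a whole percentage of `limit`
  unit: string;
}

// Values copied from the table above.
const openAiGuard: CostGuard = { label: "OpenAI monthly spend", limit: 500, alertPercent: 80, unit: "EUR" };
const r2Guard: CostGuard = { label: "R2 backup storage", limit: 1000, alertPercent: 70, unit: "GB" };

type GuardState = "ok" | "alert" | "over_limit";

// Integer percentage math avoids floating-point surprises at the exact threshold.
function evaluateGuard(guard: CostGuard, usage: number): GuardState {
  if (usage >= guard.limit) return "over_limit";
  if (usage * 100 >= guard.limit * guard.alertPercent) return "alert";
  return "ok";
}

// Lists every multiple of `step` crossed between two readings,
// e.g. to send one Slack alert per 100 EUR of OpenAI spend.
function crossedThresholds(previous: number, current: number, step: number): number[] {
  const crossed: number[] = [];
  for (let t = (Math.floor(previous / step) + 1) * step; t <= current; t += step) {
    crossed.push(t);
  }
  return crossed;
}
```

For the OpenAI guard, the `alert` state at 400€ is the point where, per the table, the breaker trips and only fallback templates are served.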

Known gaps · transparent disclosure

Sentry-mute incident (mentioned in CLAUDE.md priorities: 7): Sentry alerts intermittently silenced with no root cause identified. Status: HIGH priority post-CIF · 2d effort · workaround: founder checks logs manually every day.

Distributed tracing maturity: trace_id propagation works endpoint→DB but is NOT complete across QStash workers + LLM provider calls. Roadmap Q3 2026 · OpenTelemetry adoption planned once the OpenTelemetry SDK for CF Workers is stable.
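
One way the propagation gap could eventually be closed is by carrying the trace context in a header when a message is queued and parsing it again in the worker. The sketch below formats and parses the W3C Trace Context `traceparent` header, which is the format OpenTelemetry uses. It is an illustration only: the page does not say which header or format the custom trace_id currently uses, and how a queue such as QStash forwards headers would need to be checked against its documentation.

```typescript
interface TraceContext {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// traceparent layout: version "00", trace id, parent span id, 2-hex-digit flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentSpanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for missing or malformed headers so the caller can start a fresh trace.
function parseTraceparent(header: string | null | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  const flags = match[3];
  if (!traceId || !parentSpanId || !flags) return null;
  // All-zero ids are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // The lowest flag bit marks the trace as sampled.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A worker that receives `null` from `parseTraceparent` can start a new trace instead of failing, which keeps tracing best-effort and off the critical path.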

Does your SRE team need an observability deep-dive?

For Enterprise procurement · sample dashboards · alert configuration · SLO definitions · detailed cost tracking available under Enterprise NDA.