Incident Questions
Root cause in three minutes, not three hours.
What Escher is great at
| You ask | What you get | Time |
|---|---|---|
| "What changed in the last hour?" | Diff of deployments, configs, IAM, ranked by risk | ~2 min |
| "What caused this incident?" | Timeline + responsible change + affected resources | ~3 min |
| "Why is this service slow?" | Correlated metric anomalies + recent changes | ~4 min |
| "What's the blast radius of this outage?" | Downstream services + customer-facing impact | ~3 min |
| "Who deployed last and what did they change?" | Last 10 deploys + diff summary + author | ~1 min |
Example: 2am page
You're on call. Latency on the checkout API is 3× normal. You ask:
"What changed in production in the last 90 minutes that could explain elevated checkout-api latency?"
Escher returns a Canvas:
Conclusion: A deployment of checkout-api v2.41.3 at 01:48 UTC
(46 min ago) introduced a synchronous call to a new
recommendation service that has 800ms p99 latency.
Timeline:
01:48 UTC Deploy: checkout-api v2.41.3 (alice@company.com)
01:51 UTC First customer-facing latency alert
02:04 UTC Incident #4827 opened in your incident system
02:14 UTC You asked Escher
Responsible change:
Commit: abc123de "Add personalized recommendations"
PR: github.com/company/checkout-api/pull/847
New dep: recommendations-service:v0.3.1
Affected:
Service: checkout-api (4 pods)
Customers: ~60% of checkout traffic
SLO: p99 latency budget consumed 4.2× normal rate
Recommended:
Revert to checkout-api v2.41.2 (estimated MTTR: 4 min)
Evidence: 9 citations (deployment log, commit, PR, metric, traces).You revert, the page resolves, and Escher's Canvas becomes the incident write-up.
Example: Cross-cloud incident
"BigQuery jobs from prod-analytics started failing at 14:00 — what changed?"
Escher correlates across Azure (the BigQuery-equivalent job) and AWS (which holds the source S3 data) and surfaces a recent S3 bucket policy change that broke the cross-cloud read path.
Tips that get better incident answers
TIP
Anchor to a time window. "Last 90 minutes," "since 14:00 UTC," "after the deploy at 02:30." Escher uses the window to filter events.
TIP
Name the symptom. "Latency on checkout-api" / "5xx errors on payments" / "BigQuery jobs failing" — Escher uses the symptom to pick the right correlation strategy.
TIP
Ask for the change first, then the impact. "What changed?" returns a ranked list. "What's the blast radius?" then traces the impact of the most likely culprit.