
Incident Questions

Root cause in three minutes, not three hours.


What Escher is great at

| You ask | What you get | Time |
| --- | --- | --- |
| "What changed in the last hour?" | Diff of deployments, configs, IAM, ranked by risk | ~2 min |
| "What caused this incident?" | Timeline + responsible change + affected resources | ~3 min |
| "Why is this service slow?" | Correlated metric anomalies + recent changes | ~4 min |
| "What's the blast radius of this outage?" | Downstream services + customer-facing impact | ~3 min |
| "Who deployed last and what did they change?" | Last 10 deploys + diff summary + author | ~1 min |

Example: 2am page

You're on call. Latency on the checkout API is 3× normal. You ask:

"What changed in production in the last 90 minutes that could explain elevated checkout-api latency?"

Escher returns a Canvas:

Conclusion: A deployment of checkout-api v2.41.3 at 01:48 UTC
(26 min ago) introduced a synchronous call to a new
recommendation service that has 800ms p99 latency.

Timeline:
  01:48 UTC  Deploy: checkout-api v2.41.3 (alice@company.com)
  01:51 UTC  First customer-facing latency alert
  02:04 UTC  Incident #4827 opened in your incident system
  02:14 UTC  You asked Escher

Responsible change:
  Commit:  abc123de "Add personalized recommendations"
  PR:      github.com/company/checkout-api/pull/847
  New dep: recommendations-service:v0.3.1

Affected:
  Service:   checkout-api (4 pods)
  Customers: ~60% of checkout traffic
  SLO:       p99 latency budget consumed at 4.2× the normal rate

Recommended:
  Revert to checkout-api v2.41.2  (estimated MTTR: 4 min)

Evidence: 9 citations (deployment log, commit, PR, metric, traces).

You revert, the page resolves, and Escher's Canvas becomes the incident write-up.
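The gaps between timeline entries are usually the first numbers a write-up needs. A minimal sketch of that arithmetic in Python, using only timestamps from the Canvas above. The calendar date is a placeholder (the example gives times of day only), and the helpers `at` and `minutes_between` are hypothetical names, not part of Escher:

```python
from datetime import datetime, timezone

# Placeholder date: the example Canvas gives times of day only.
DAY = (2024, 1, 1)


def at(hhmm: str) -> datetime:
    """Build a UTC timestamp on the placeholder day from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(*DAY, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


deploy = at("01:48")           # Deploy: checkout-api v2.41.3
first_alert = at("01:51")      # First customer-facing latency alert
incident_opened = at("02:04")  # Incident #4827 opened

time_to_alert = minutes_between(deploy, first_alert)
time_to_incident = minutes_between(deploy, incident_opened)

print(f"Deploy to first alert:   {time_to_alert} min")
print(f"Deploy to incident open: {time_to_incident} min")
```

With the example's timestamps, the alert fired 3 minutes after the deploy and the incident was opened 16 minutes after it.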


Example: Cross-cloud incident

"BigQuery jobs from prod-analytics started failing at 14:00 — what changed?"

Escher correlates across Google Cloud (where the BigQuery jobs run) and AWS (which holds the source S3 data) and surfaces a recent S3 bucket policy change that broke the cross-cloud read path.


Tips that get better incident answers

TIP

Anchor to a time window. "Last 90 minutes," "since 14:00 UTC," "after the deploy at 02:30." Escher uses the window to filter events.
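One way to picture the window filter: keep only the events whose timestamp falls inside the window, in time order. The sketch below is illustrative only. The `Event` type, the `in_window` helper, the placeholder date, and the out-of-window sample event are assumptions, not Escher's actual data model or implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical change event; not Escher's real schema."""
    when: datetime
    kind: str
    summary: str


def in_window(events, end, lookback):
    """Return events with end - lookback <= when <= end, oldest first."""
    start = end - lookback
    return sorted(
        (e for e in events if start <= e.when <= end),
        key=lambda e: e.when,
    )


def utc(hour, minute):
    # Placeholder date; only the time of day matters for this sketch.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    # Hypothetical event that predates the window and should be dropped.
    Event(utc(0, 10), "config", "hypothetical config change"),
    # The deploy from the 2am example.
    Event(utc(1, 48), "deploy", "checkout-api v2.41.3"),
]

asked_at = utc(2, 14)
recent = in_window(events, end=asked_at, lookback=timedelta(minutes=90))

for event in recent:
    print(event.when.strftime("%H:%M UTC"), event.kind, event.summary)
```

A "last 90 minutes" window ending at 02:14 UTC starts at 00:44 UTC, so only the 01:48 deploy survives the filter.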

TIP

Name the symptom. "Latency on checkout-api" / "5xx errors on payments" / "BigQuery jobs failing" — Escher uses the symptom to pick the right correlation strategy.

TIP

Ask for the change first, then the impact. "What changed?" returns a ranked list. "What's the blast radius?" then traces the impact of the most likely culprit.
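The two-step flow can be sketched as: pick the top-ranked change, then walk a dependency graph outward from the service it touched. Everything in this sketch is hypothetical: the graph, the service names other than checkout-api and recommendations-service, the risk scores, and the `blast_radius` helper. It shows the idea of the traversal, not how Escher ranks changes or computes impact.

```python
from collections import deque

# Hypothetical dependency map: service -> services that call it,
# i.e. the services affected when it degrades.
CALLERS = {
    "recommendations-service": ["checkout-api"],
    "checkout-api": ["web-storefront", "mobile-gateway"],
    "web-storefront": [],
    "mobile-gateway": [],
}


def blast_radius(service, callers):
    """Breadth-first walk returning every service that transitively calls `service`."""
    seen = {service}
    impacted = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                impacted.append(caller)
                queue.append(caller)
    return impacted


# Step 1: "What changed?" yields a ranked list (risk scores are made up).
changes = [
    {"change": "checkout-api v2.41.3 deploy", "service": "checkout-api", "risk": 0.9},
    {"change": "hypothetical config tweak", "service": "mobile-gateway", "risk": 0.2},
]
culprit = max(changes, key=lambda c: c["risk"])

# Step 2: "What's the blast radius?" traces impact from the likeliest culprit.
impacted = blast_radius(culprit["service"], CALLERS)
print(culprit["change"], "->", impacted)
```

Starting from the highest-risk change keeps the traversal focused on one service instead of fanning out from every recent change.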



Escher — Agentic CloudOps by Tessell