Detection identifies what’s broken, where it’s broken, and when it started.

Process

When you ask AI SRE about a failure or alert, it:
  1. Characterizes the failure — Identifies the error type, failure mode, and observed symptoms
  2. Determines scope — Identifies the affected services, endpoints, and components
  3. Establishes timeline — Determines when the issue started and whether it is ongoing or intermittent
  4. Ranks suspects — Orders likely causes by the strength of the supporting evidence
  5. Proposes diagnostics — Suggests specific actions to confirm or rule out each hypothesis

Example

Question: “Why is checkout failing?”
Response:
Failure: Connection pool exhausted (503 Service Unavailable)
Scope: Checkout API, payment service blocked
Timeline: Started 14:23 UTC, ongoing
Suspect: Connection leak in PR #1847 (deployed 14:11 UTC)
Confidence: Medium-High

Diagnostics:
1. Query pg_stat_activity for active connections (see the sketch after this list)
2. Check for long-running transactions since 14:20
3. Review PR #1847 error handling paths
4. Check application metrics for traffic spike
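For diagnostics 1 and 2, here is a minimal sketch of what the checks against pg_stat_activity might look like, assuming direct read access to the Postgres instance and psycopg2 installed; the PG_DSN environment variable and the 5-minute threshold are illustrative assumptions, not part of AI SRE’s output:

```python
# Minimal diagnostic sketch (assumptions: direct Postgres access, psycopg2
# installed, DSN supplied via the hypothetical PG_DSN environment variable).
import os

import psycopg2


def check_connection_pool(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Diagnostic 1: active connections grouped by state.
            cur.execute(
                "SELECT state, count(*) FROM pg_stat_activity GROUP BY state"
            )
            for state, count in cur.fetchall():
                print(f"{state or 'unknown'}: {count}")

            # Diagnostic 2: transactions open longer than 5 minutes
            # (threshold chosen to bracket the 14:20 window in the example).
            cur.execute(
                """
                SELECT pid, now() - xact_start AS xact_age, query
                FROM pg_stat_activity
                WHERE xact_start IS NOT NULL
                  AND now() - xact_start > interval '5 minutes'
                ORDER BY xact_start
                """
            )
            for pid, age, query in cur.fetchall():
                print(pid, age, (query or "")[:80])


if __name__ == "__main__":
    check_connection_pool(os.environ["PG_DSN"])
```

A large number of connections stuck in “idle in transaction”, or transactions open since roughly 14:20, would support the connection-leak hypothesis.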

Data sources

AI SRE uses:
  • GitHub — Code analysis, recent changes, deployment history
  • Observability tools — Logs, metrics, traces
  • Correlation — Cross-references findings across systems

Confidence levels

  • High — Multiple corroborating signals, clear error signatures
  • Medium — Strong correlation in one dimension, some data gaps
  • Low — Limited data, alternative explanations possible

Best practices

  • Provide specific error messages or symptoms
  • Include relevant logs or metrics when available
  • Mention when the issue started
  • Describe what’s affected
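
A question that follows these practices might look like (details drawn from the example above): “Checkout has been returning 503s since roughly 14:20 UTC. Payment calls are timing out, and PR #1847 was deployed at 14:11. What’s the likely cause?”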