Process
When you ask AI SRE about a failure or alert, it:
- Characterizes the failure — Identifies error type, failure mode, and symptoms
- Determines scope — Identifies affected services, endpoints, and components
- Establishes timeline — Determines when the issue started and its pattern
- Ranks suspects — Identifies likely causes based on evidence
- Proposes diagnostics — Suggests specific actions to confirm or rule out hypotheses
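The five steps above can be pictured as the fields of a single investigation record. The sketch below is illustrative only: `Investigation`, `Suspect`, and their fields are hypothetical names rather than part of any AI SRE API, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Suspect:
    """One candidate cause plus its supporting evidence (hypothetical type)."""
    cause: str
    evidence: list[str]
    score: float  # illustrative weight between 0 and 1; higher means more likely


@dataclass
class Investigation:
    """Hypothetical container mirroring the five steps listed above."""
    error_type: str                                        # 1. characterize the failure
    affected: list[str]                                    # 2. determine scope
    started_at: datetime                                   # 3. establish timeline
    suspects: list[Suspect] = field(default_factory=list)  # 4. rank suspects
    diagnostics: list[str] = field(default_factory=list)   # 5. propose diagnostics

    def ranked_suspects(self) -> list[Suspect]:
        """Return suspects ordered from most to least likely."""
        return sorted(self.suspects, key=lambda s: s.score, reverse=True)


# Made-up sample values, for illustration only.
inv = Investigation(
    error_type="HTTP 500 on payment submission",
    affected=["checkout-service"],
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    suspects=[
        Suspect("database connection pool exhausted", ["timeout errors in logs"], 0.4),
        Suspect("recent deploy changed payment client", ["errors begin after deploy"], 0.7),
    ],
    diagnostics=["compare error rate before and after the deploy"],
)
top = inv.ranked_suspects()[0]
```

Keeping the evidence next to each suspect makes it easier to see why one cause is ranked above another.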
Example
Question: “Why is checkout failing?”
Response:
Data sources
AI SRE uses:
- GitHub — Code analysis, recent changes, deployment history
- Observability tools — Logs, metrics, traces
- Correlation — Cross-references findings across systems
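One simple form of cross-referencing is checking which recent changes landed shortly before a failure began. The sketch below shows that idea only; it is not how AI SRE performs correlation internally. The function name `changes_before_onset`, the one-hour default window, and all timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def changes_before_onset(change_times, onset, window=timedelta(hours=1)):
    """Return change timestamps within `window` before `onset`, newest first."""
    candidates = [t for t in change_times if onset - window <= t <= onset]
    return sorted(candidates, reverse=True)


# Made-up timestamps, for illustration only.
onset = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deploys = [
    datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),   # too early, outside the window
    datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc),  # 15 minutes before onset
    datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc),  # after onset, so excluded
]
recent = changes_before_onset(deploys, onset)
```

A change that precedes the onset is only a lead, not proof of cause; it still needs to be confirmed against logs, metrics, or traces.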
Confidence levels
- High — Multiple corroborating signals, clear error signatures
- Medium — Strong correlation in one dimension, some data gaps
- Low — Limited data, alternative explanations possible
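The three levels can be read as a rough decision rule. The sketch below is one possible reading of the descriptions above, not AI SRE's actual scoring; the function name, its parameters, and the threshold of two signals are all assumptions.

```python
def confidence_level(corroborating_signals: int,
                     clear_error_signature: bool,
                     data_gaps: bool) -> str:
    """Map evidence quality to a confidence label (illustrative thresholds)."""
    # High: multiple corroborating signals and a clear error signature, no gaps.
    if corroborating_signals >= 2 and clear_error_signature and not data_gaps:
        return "high"
    # Medium: at least one strong signal, possibly with some data gaps.
    if corroborating_signals >= 1:
        return "medium"
    # Low: limited data, so alternative explanations remain possible.
    return "low"
```

For example, `confidence_level(3, True, False)` returns `"high"`, while the same three signals with data gaps (`confidence_level(3, True, True)`) drop to `"medium"`.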
Best practices
- Provide specific error messages or symptoms
- Include relevant logs or metrics when available
- Mention when the issue started
- Describe what’s affected
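A question that follows these practices can be assembled from a few fields. The helper below is a hypothetical sketch: `build_question` and its parameters are not part of AI SRE, and the sample values are placeholders.

```python
def build_question(symptom, started=None, affected=None, evidence=None):
    """Combine symptom, start time, scope, and evidence into one question."""
    parts = [symptom.strip()]
    if started:
        parts.append(f"Started: {started}.")
    if affected:
        parts.append(f"Affected: {', '.join(affected)}.")
    if evidence:
        # Each evidence item (a log line or metric reading) goes on its own line.
        parts.append("Evidence:\n" + "\n".join(evidence))
    return "\n".join(parts)


# Placeholder values, for illustration only.
question = build_question(
    "Why is checkout failing with HTTP 500 errors?",
    started="about 12:00 UTC today",
    affected=["checkout-service"],
    evidence=["<paste relevant log line here>"],
)
```

Optional fields are skipped when they are not supplied, so the helper still works when only the symptom is known.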