# SREs

> AI SRE workflow for Site Reliability Engineers

Site Reliability Engineers (SREs) are responsible for system reliability, incident response, and ensuring that services meet their service level objectives (SLOs). AI SRE helps them investigate incidents, perform root cause analysis, and improve system reliability.

## Workflow stages

### Alert intake & triage

**Challenge:** Reviews alert quality; suspects missing signals

**How AI SRE helps:**

* Evaluates alert quality and completeness
* Identifies which signals are missing from an alert
* Suggests improvements to alerting
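
One way to picture a "missing signals" check is as a comparison between the labels an alert carries and the labels your team expects every alert to carry. The sketch below is illustrative only and is not AI SRE's implementation or API. The `Alert` type, the `missing_signals` helper, and the contents of `REQUIRED_SIGNALS` are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical list of signals every alert is expected to carry.
# Replace it with whatever your own alerting standards require.
REQUIRED_SIGNALS = frozenset({"service", "severity", "slo", "runbook_url", "owner"})


@dataclass
class Alert:
    """Minimal stand-in for an alert: a name plus key/value labels."""
    name: str
    labels: dict[str, str] = field(default_factory=dict)


def missing_signals(alert: Alert, required: frozenset[str] = REQUIRED_SIGNALS) -> list[str]:
    """Return the required signals that are absent or empty, in sorted order."""
    return sorted(key for key in required if not alert.labels.get(key))


alert = Alert("HighLatency", {"service": "checkout", "severity": "page"})
print(missing_signals(alert))  # ['owner', 'runbook_url', 'slo']
```

A check like this only covers completeness of metadata. Whether an alert fires on the right condition is a separate question that needs the underlying telemetry.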

### Scope & impact assessment

**Challenge:** Reconstructs dependencies from diagrams or tribal knowledge

**How AI SRE helps:**

* Analyzes dependencies from code and integrations
* Maps dependencies from evidence, not diagrams
* Identifies blast radius from system analysis
* Quantifies impact with evidence
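
To make "blast radius" concrete: given a dependency map, the services affected by a failure are those that depend on the failed service directly or transitively. The following is a minimal sketch of that idea using a breadth-first search over reversed dependency edges. It is not how AI SRE derives dependencies. The `blast_radius` function and the example service names are hypothetical.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can ask "who depends on this service?"
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            # Skip services already seen so cycles terminate.
            if caller != failed and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Hypothetical dependency map: service -> services it calls.
graph = {
    "web": {"checkout", "search"},
    "checkout": {"payments", "inventory"},
    "search": {"inventory"},
    "payments": set(),
    "inventory": set(),
}
print(sorted(blast_radius(graph, "payments")))  # ['checkout', 'web']
```

The result is only as good as the dependency map. The point of building that map from code and integrations rather than diagrams is that stale diagrams would produce a stale blast radius.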

### Root cause investigation

**Challenge:** Drives the investigation but is blocked by a lack of evidence

**How AI SRE helps:**

* Gathers evidence from multiple sources
* Correlates data across systems
* Builds evidence chains systematically
* Identifies what evidence is missing

### Fix design

**Challenge:** Pushes for defensive fixes to reduce risk

**How AI SRE helps:**

* Suggests fixes based on evidence
* Recommends fixes that address root cause
* Assesses fix risks with evidence
* Validates fix approach

### Deployment & verification

**Challenge:** Monitors SLIs/SLOs, hoping metrics stabilize

**How AI SRE helps:**

* Monitors SLOs automatically
* Verifies that fixes address the root cause
* Validates fixes with evidence
* Provides confidence in resolution

### Post-incident learning

**Challenge:** Authors postmortems from partial data

**How AI SRE helps:**

* Provides complete investigation data
* Documents all evidence gathered
* Reconstructs complete timeline
* Enables comprehensive postmortems
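
Timeline reconstruction can be thought of as merging events from several sources into one chronological, de-duplicated list. The sketch below shows that idea under simple assumptions: every event has a timezone-aware timestamp, and exact duplicates can be dropped. The `Event` type, the `build_timeline` function, and the sample events are made up for illustration and do not reflect AI SRE's data model.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One timestamped observation from a single source."""
    timestamp: datetime  # assumed timezone-aware; mixing naive and aware values fails on sort
    source: str
    message: str


def build_timeline(*sources: Iterable[Event]) -> list[Event]:
    """Merge events from all sources, drop exact duplicates, and sort chronologically."""
    unique = {event for events in sources for event in events}
    return sorted(unique, key=lambda e: (e.timestamp, e.source, e.message))


ts = datetime.fromisoformat

# Made-up sample events for illustration only.
alerts = [Event(ts("2024-01-01T10:07:00+00:00"), "alerts", "HighLatency fired")]
deploys = [Event(ts("2024-01-01T10:02:00+00:00"), "deploys", "checkout rollout started")]
logs = [
    Event(ts("2024-01-01T10:05:30+00:00"), "logs", "payment timeout errors begin"),
    Event(ts("2024-01-01T10:07:00+00:00"), "alerts", "HighLatency fired"),  # duplicate
]

for event in build_timeline(alerts, deploys, logs):
    print(event.timestamp.isoformat(), event.source, event.message)
```

Sorting by timestamp first and by source and message second keeps the order stable when two events share a timestamp.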

## Key workflows

### Alert quality review

1. Review alerts with AI SRE
2. Identify missing signals
3. Assess alert quality
4. Improve alerting
5. Validate improvements

### Deep investigation

1. Start investigation with AI SRE
2. Gather evidence systematically
3. Build evidence chain
4. Identify root cause
5. Document findings

### SLO management

1. Monitor SLOs with AI SRE
2. Investigate SLO breaches
3. Identify root causes
4. Implement fixes
5. Validate SLO recovery
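
When investigating an SLO breach or validating recovery, two standard quantities are useful: how much of the error budget remains, and how fast it is being consumed (the burn rate). The sketch below computes both for a request-based SLO. It is a generic illustration rather than AI SRE functionality, and the function names and sample numbers are assumptions.

```python
def _validate(slo_target: float, total: int, failed: int) -> None:
    """Reject inputs for which the budget math is undefined."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total <= 0 or failed < 0 or failed > total:
        raise ValueError("need total > 0 and 0 <= failed <= total")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    _validate(slo_target, total, failed)
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    values above 1.0 exhaust it early.
    """
    _validate(slo_target, total, failed)
    return (failed / total) / (1 - slo_target)


# Sample numbers: a 99.9% availability SLO, 1,000,000 requests, 250 failures.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 3))  # 0.75
print(round(burn_rate(0.999, 1_000_000, 250), 3))               # 0.25
```

For step 5, one reasonable check is that the burn rate measured over a window after the fix has dropped back below 1.0.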

## Best practices

* Use AI SRE systematically for investigations
* Verify findings before acting
* Improve alerting based on AI SRE insights
* Use AI SRE for comprehensive documentation
* Build complete evidence chains

## Next steps

<CardGroup cols={2}>
  <Card title="Production Engineers" icon="code" href="/workflows/production-engineers" />

  <Card title="Working with AI SRE" icon="wrench" href="/working-with-ai-sre/overview/detection" />
</CardGroup>
