Process
1. Incident summary
Document:
- Timeline (when it happened, duration)
- Impact (user, business, technical)
- Resolution (how it was resolved)
2. Root cause analysis
Document:
- Root cause (underlying cause)
- Contributing factors
- Evidence supporting the conclusion
- Confidence level
3. Impact assessment
Document:
- User impact (how many users affected)
- Business impact (revenue, SLA)
- Technical impact (service availability)
- Reputation impact (if applicable)
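The availability and SLA figures above come down to simple arithmetic. The following is a minimal sketch, assuming a 30-day measurement window, 120 minutes of downtime, and a 99.9% SLA target; all three numbers and both function names are hypothetical and should be replaced with values from your own agreement and monitoring data.

```python
def availability_pct(window_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the window during which the service was available."""
    return 100.0 * (1.0 - downtime_minutes / window_minutes)


def allowed_downtime_minutes(window_minutes: float, sla_target_pct: float) -> float:
    """Downtime the SLA target permits over the window (the error budget)."""
    return window_minutes * (1.0 - sla_target_pct / 100.0)


# Hypothetical inputs, for illustration only.
WINDOW_MINUTES = 30 * 24 * 60   # 30-day window
DOWNTIME_MINUTES = 120          # total outage duration
SLA_TARGET_PCT = 99.9           # assumed SLA target

measured = availability_pct(WINDOW_MINUTES, DOWNTIME_MINUTES)
budget = allowed_downtime_minutes(WINDOW_MINUTES, SLA_TARGET_PCT)
sla_breached = DOWNTIME_MINUTES > budget

print(f"Availability: {measured:.3f}%")
print(f"Allowed downtime: {budget:.1f} min, SLA breached: {sla_breached}")
```

Recording the inputs alongside the result makes the technical impact figure reproducible when the review is read later.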
4. Timeline
Document:
- Detection time
- Response start time
- Mitigation time
- Resolution time
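Once the four timestamps are recorded, the intervals between them can be derived rather than estimated. The sketch below is one possible way to do that; the `IncidentTimeline` class, its field names, and the sample timestamps are assumptions made for illustration and are not part of any product API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """The four timestamps listed above (hypothetical field names)."""
    detected_at: datetime
    response_started_at: datetime
    mitigated_at: datetime
    resolved_at: datetime

    def intervals(self) -> dict[str, timedelta]:
        """Elapsed time from detection to each later milestone."""
        return {
            "detection_to_response": self.response_started_at - self.detected_at,
            "detection_to_mitigation": self.mitigated_at - self.detected_at,
            "detection_to_resolution": self.resolved_at - self.detected_at,
        }


# Made-up timestamps, for illustration only.
timeline = IncidentTimeline(
    detected_at=datetime(2024, 1, 15, 9, 0),
    response_started_at=datetime(2024, 1, 15, 9, 5),
    mitigated_at=datetime(2024, 1, 15, 9, 40),
    resolved_at=datetime(2024, 1, 15, 11, 0),
)

for name, delta in timeline.intervals().items():
    print(f"{name}: {delta}")
```

Measuring every interval from the detection time keeps the figures comparable across incidents, since the true start of an incident is often only known after root cause analysis.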
5. Response analysis
Document:
- What went well
- What could improve
- Response time
- Communication effectiveness
6. Action items
Document action items with owners and due dates:
- Immediate: Quick fixes needed now
- Short-term: Fixes in next sprint
- Long-term: Improvements over time
- Process: Process improvements
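The four categories above, together with an owner and a due date, map naturally onto a small record type. The following is a minimal sketch of how a team might track them; `Horizon`, `ActionItem`, `overdue`, and the sample items are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Horizon(Enum):
    """The four action item categories listed above."""
    IMMEDIATE = "immediate"
    SHORT_TERM = "short-term"
    LONG_TERM = "long-term"
    PROCESS = "process"


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    horizon: Horizon
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Made-up items, for illustration only.
items = [
    ActionItem("Roll back faulty config", "alice", date(2024, 1, 16), Horizon.IMMEDIATE, done=True),
    ActionItem("Add alert on queue depth", "bob", date(2024, 1, 29), Horizon.SHORT_TERM),
    ActionItem("Update on-call runbook", "carol", date(2024, 3, 1), Horizon.PROCESS),
]

for item in overdue(items, today=date(2024, 2, 5)):
    print(f"OVERDUE [{item.horizon.value}] {item.title} (owner: {item.owner}, due {item.due})")
```

A helper like `overdue` gives a regular follow-up review something concrete to check, because every item carries both an owner and a date.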
Example
Best practices
- Blameless: Focus on systems and processes, not individuals
- Timely: Hold the review within days of the incident
- Comprehensive: Document the incident thoroughly
- Actionable: Assign an owner and a due date to every action item
- Follow-up: Review action items regularly until they are closed
AI SRE’s role
AI SRE provides:
- Investigation timeline
- Evidence chain
- Root cause analysis
- Suggested action items