Post-mortems document what happened, why it happened, and how to prevent recurrence.

Process

1. Incident summary

Document:
  • Timeline (when it happened, duration)
  • Impact (user, business, technical)
  • Resolution (how it was resolved)

2. Root cause analysis

Document:
  • Root cause (underlying cause)
  • Contributing factors
  • Evidence supporting the conclusion
  • Confidence level

3. Impact assessment

Document:
  • User impact (how many users were affected)
  • Business impact (revenue, SLA)
  • Technical impact (service availability)
  • Reputation impact (if applicable)
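
SLA impact is often stated as availability lost over a reporting window. The following is a rough sketch of that arithmetic, not a prescribed method. It assumes a 30-day window, a hypothetical 99.9% availability target, and an outage that affects all traffic for its full duration. The function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(downtime_min: float, period_days: int = 30) -> float:
    """Availability over the period, as a percentage, given total downtime."""
    total_min = period_days * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_min / total_min)


def error_budget_min(slo_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime the period can absorb before the target is missed."""
    total_min = period_days * MINUTES_PER_DAY
    return total_min * (1 - slo_pct / 100.0)


# Hypothetical inputs: a 45-minute full outage against a 99.9% target.
outage_min = 45
print(round(availability_pct(outage_min), 3))  # 99.896
print(round(error_budget_min(99.9), 1))        # 43.2
print(outage_min > error_budget_min(99.9))     # True: the budget is exceeded
```

Under these assumptions, a single 45-minute outage is enough to miss a 99.9% monthly target, which is the kind of statement the business impact line should make explicit.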

4. Timeline

Document:
  • Detection time
  • Response start time
  • Mitigation time
  • Resolution time
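
These timestamps are usually turned into durations such as time to detect or time to mitigate. The following is a minimal sketch, assuming an incident start time is recorded as well (it is not in the list above). The start and resolution times match the example later on this page. The detection, response, and mitigation times are invented for illustration, and the field and function names are hypothetical.

```python
from datetime import datetime, timezone


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed minutes from start to end."""
    return (end - start).total_seconds() / 60


def utc(hour: int, minute: int) -> datetime:
    """Helper: a UTC timestamp on the (hypothetical) incident day."""
    return datetime(2026, 1, 21, hour, minute, tzinfo=timezone.utc)


# Illustrative timeline; only "started" and "resolved" come from the example.
timeline = {
    "started": utc(14, 23),
    "detected": utc(14, 26),
    "response_started": utc(14, 30),
    "mitigated": utc(14, 55),
    "resolved": utc(15, 8),
}


def timeline_metrics(t: dict) -> dict:
    """Derive common response durations (in minutes) from the raw timestamps."""
    return {
        "time_to_detect": minutes_between(t["started"], t["detected"]),
        "time_to_respond": minutes_between(t["detected"], t["response_started"]),
        "time_to_mitigate": minutes_between(t["started"], t["mitigated"]),
        "time_to_resolve": minutes_between(t["started"], t["resolved"]),
    }


metrics = timeline_metrics(timeline)
for name, minutes in metrics.items():
    print(f"{name}: {minutes:.0f} min")
```

Storing timezone-aware UTC timestamps, as done here, avoids ambiguity when responders work across time zones.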

5. Response analysis

Document:
  • What went well
  • What could improve
  • Response time
  • Communication effectiveness

6. Action items

Document action items with owners and due dates:
  • Immediate — Quick fixes needed now
  • Short-term — Fixes in next sprint
  • Long-term — Improvements over time
  • Process — Process improvements
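
One way to keep owners and due dates from being dropped is to record action items as structured data. The following is a sketch only: the class, the field names, the owners, and the due dates are all hypothetical, and the item titles are borrowed from the example later on this page.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Category(Enum):
    IMMEDIATE = "immediate"
    SHORT_TERM = "short-term"
    LONG_TERM = "long-term"
    PROCESS = "process"


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    category: Category
    done: bool = False

    def __post_init__(self) -> None:
        # Every action item needs an owner; reject blank ones early.
        if not self.owner.strip():
            raise ValueError(f"action item {self.title!r} has no owner")


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical owners and due dates.
items = [
    ActionItem("Fix connection leak", "alice", date(2026, 1, 22),
               Category.IMMEDIATE, done=True),
    ActionItem("Add connection pool monitoring", "bob", date(2026, 2, 4),
               Category.SHORT_TERM),
    ActionItem("Improve error handling in batch queries", "carol",
               date(2026, 3, 31), Category.LONG_TERM),
]

for item in overdue(items, today=date(2026, 2, 10)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```

A check like `overdue` can feed the regular follow-up review, so open items surface without someone having to reread each post-mortem.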

Example

Incident: Checkout Service Failure
Date: 2026-01-21
Duration: 45 minutes (14:23 - 15:08 UTC)
Impact: 100% checkout failure, revenue-blocking
Resolution: Mitigated by increasing the connection pool size; fixed in PR #1850

Root Cause: Connection leak in OrderRepository.findPendingOrders()
Introduced: PR #1847 (deployed 14:11 UTC)
Confidence: High

Action Items:
- [x] Fix connection leak (PR #1850) - DONE
- [ ] Add connection pool monitoring - In Progress
- [ ] Improve error handling in batch queries - Next Sprint

Best practices

  • Blameless — Focus on systems, not people
  • Timely — Schedule within days of incident
  • Comprehensive — Document thoroughly
  • Actionable — Assign owners and due dates
  • Follow-up — Review action items regularly

AI SRE’s role

AI SRE provides:
  • Investigation timeline
  • Evidence chain
  • Root cause analysis
  • Suggested action items