Process
- Establish timeline — Incident onset time T and the correlation window leading up to it (typically the preceding hour, T-1h to T)
- Identify changes — Recent deployments, commits, config changes in the window
- Build evidence chain — Temporal + code + signal correlation
- Present findings — Root cause with confidence level
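The first two steps above can be sketched in code. This is a minimal Python illustration under stated assumptions, not a prescribed implementation: the `Change` record, the `changes_in_window` helper, the sample change entries, and the calendar date are all hypothetical. Only the 14:23 UTC onset and the one-hour window come from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    """One deployment, commit, or config change (hypothetical record shape)."""
    kind: str       # "deploy", "commit", or "config"
    ref: str        # illustrative identifier, not a real one
    at: datetime    # when the change landed, timezone-aware


def changes_in_window(changes, onset, window=timedelta(hours=1)):
    """Return changes that landed in [onset - window, onset], newest first."""
    start = onset - window
    hits = [c for c in changes if start <= c.at <= onset]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Placeholder date; only the 14:23 UTC time of day comes from the example.
onset = datetime(2024, 1, 1, 14, 23, tzinfo=timezone.utc)

# Made-up changes for illustration only.
changes = [
    Change("deploy", "deploy-A", onset - timedelta(minutes=11)),
    Change("config", "config-B", onset - timedelta(minutes=48)),
    Change("commit", "commit-C", onset - timedelta(hours=3)),    # too old
    Change("deploy", "deploy-D", onset + timedelta(minutes=5)),  # after onset
]

candidates = changes_in_window(changes, onset)
candidate_refs = [c.ref for c in candidates]
```

Sorting newest first is a design choice for this sketch: changes closest to the onset are usually the first ones worth inspecting, but the ordering carries no weight on its own.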
Example
Question: “What caused checkout failures at 14:23 UTC?”
Response:

Correlation dimensions
- Temporal — Change timing vs. incident onset
- Code — What changed vs. what’s failing
- Signal — Logs/metrics matching changed code
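One way to make the three dimensions concrete is to score each on a 0 to 1 scale. The sketch below is an assumption about how such scoring could work, not a description of any actual tool: the function names, the linear time decay, the path-overlap measure, and the sample paths and log lines are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def temporal_score(changed_at, onset, window=timedelta(hours=1)):
    """1.0 for a change right at onset, decaying linearly to 0.0 at the window edge."""
    lag = onset - changed_at
    if lag < timedelta(0) or lag > window:
        return 0.0
    return 1.0 - lag / window


def code_score(changed_paths, failing_paths):
    """Fraction of failing code paths that the change touched."""
    failing = set(failing_paths)
    if not failing:
        return 0.0
    return len(failing & set(changed_paths)) / len(failing)


def signal_score(log_lines, changed_paths):
    """Fraction of error log lines that mention any changed path."""
    if not log_lines:
        return 0.0
    hits = sum(1 for line in log_lines if any(p in line for p in changed_paths))
    return hits / len(log_lines)


# Illustrative inputs only; none of these paths or log lines are real.
onset = datetime(2024, 1, 1, 14, 23, tzinfo=timezone.utc)
changed_at = onset - timedelta(minutes=15)
changed_paths = ["checkout/payment.py", "checkout/cart.py"]
failing_paths = ["checkout/payment.py"]
log_lines = [
    "ERROR checkout/payment.py:88 charge failed",
    "ERROR checkout/payment.py:88 charge failed",
    "ERROR checkout/payment.py:91 retry exhausted",
    "WARN inventory/stock.py:12 slow query",
]

temporal = temporal_score(changed_at, onset)
code = code_score(changed_paths, failing_paths)
signal = signal_score(log_lines, changed_paths)
```

Substring matching on file paths is deliberately crude. A real pipeline would more likely match on stack-trace frames, service names, or metric labels, depending on what the logs and metrics actually carry.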
Confidence levels
- High — Strong correlation across all dimensions
- Medium — Good correlation in most dimensions, with gaps in at least one
- Low — Limited evidence, or several plausible causes
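The levels above can be read as a simple decision rule over per-dimension scores. The following Python sketch is one possible reading: the 0.7 threshold, the "two of three" rule for Medium, and the `rivals` parameter (the number of other changes that look about as plausible) are assumptions made for illustration, not values defined on this page.

```python
def confidence_level(temporal, code, signal, rivals=0, strong=0.7):
    """Map three 0..1 dimension scores to "High", "Medium", or "Low".

    rivals: count of other candidate changes with comparable scores
            (hypothetical input; any competing candidate forces "Low").
    strong: assumed cutoff above which a dimension counts as well correlated.
    """
    if rivals > 0:
        return "Low"  # multiple possibilities: cannot single out one cause
    strong_dims = sum(score >= strong for score in (temporal, code, signal))
    if strong_dims == 3:
        return "High"    # strong correlation across all dimensions
    if strong_dims == 2:
        return "Medium"  # good correlation, one dimension has gaps
    return "Low"         # limited evidence


all_strong = confidence_level(0.75, 1.0, 0.75)
one_gap = confidence_level(0.75, 1.0, 0.2)
thin_evidence = confidence_level(0.75, 0.1, 0.0)
contested = confidence_level(0.75, 1.0, 0.75, rivals=1)
```

Treating any competing candidate as grounds for Low is a conservative choice. A softer variant could downgrade by one level instead, so that a fully correlated change with a single rival would report Medium.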