Mitigation reduces incident impact quickly, often before root cause is fully understood. The goal is to restore service or minimize damage.
Mitigation strategies
Quick fixes
Immediate actions to restore service:
- Restart services or pods
- Roll back the deployment
- Scale up capacity
- Disable feature flags
- Clear caches
When to use: The incident clearly correlates with a recent change, a known workaround is available, and the action is low risk.
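One way to make "correlates with a recent change" concrete is to check whether any change landed shortly before the incident began. The sketch below is illustrative only: the function name `correlates_with_change` and the 30-minute window are assumptions, not part of any tool described here.

```python
from datetime import datetime, timedelta


def correlates_with_change(incident_start, change_times, window=timedelta(minutes=30)):
    """Return the most recent change within `window` before the incident, or None.

    The 30-minute default is an arbitrary assumption; tune it to your deploy cadence.
    """
    candidates = [
        t for t in change_times
        if timedelta(0) <= incident_start - t <= window
    ]
    return max(candidates) if candidates else None


incident = datetime(2024, 1, 1, 12, 0)
deploys = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 50)]
suspect = correlates_with_change(incident, deploys)
print(suspect)  # the 11:50 deploy is a rollback candidate
```

A match is only a hint that a rollback is worth considering; it does not establish that the change caused the incident.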
Traffic management
Control traffic to reduce impact:
- Rate limiting
- Circuit breakers
- Load balancing
- Traffic shaping
- Geographic routing
When to use: Partial degradation, capacity issues, dependency failures.
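Of the techniques above, a circuit breaker is the easiest to show in a few lines. The following is a minimal sketch, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration. It opens after a number of consecutive failures, serves a fallback while open, and allows a trial call once a cooldown has passed.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds; assumed default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a failing dependency this way keeps callers from piling up on it, which is what makes the pattern useful during partial degradation.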
Resource management
Manage system resources:
- Scale resources (CPU, memory, connections)
- Increase limits (connection pools)
- Clear stuck resources
- Kill slow queries
When to use: Resource exhaustion, capacity constraints.
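As an illustration of "kill slow queries", the sketch below picks which running queries to terminate. It is a hypothetical helper: the `RunningQuery` shape, the 60-second threshold, and the protected statement prefixes are assumptions. It only selects candidates; the actual termination depends on the database (on PostgreSQL, for instance, the selected pids could be passed to `pg_cancel_backend` or `pg_terminate_backend`).

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    pid: int
    seconds_running: float
    sql: str


def select_queries_to_kill(queries, max_seconds=60.0,
                           protected_prefixes=("VACUUM", "CREATE INDEX")):
    """Pick queries running longer than max_seconds, longest first.

    Statements starting with a protected prefix (an assumed allowlist of
    maintenance operations) are never selected.
    """
    victims = [
        q for q in queries
        if q.seconds_running > max_seconds
        and not q.sql.lstrip().upper().startswith(protected_prefixes)
    ]
    return sorted(victims, key=lambda q: q.seconds_running, reverse=True)
```

Reviewing the selected list before acting keeps this mitigation low risk, since killing the wrong query can make the incident worse.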
Feature flags
Disable problematic features:
- Turn off new features
- Enable fallback implementations
- Reduce feature exposure
- Switch to control group
When to use: New feature causing issues, quick toggle available.
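The flag-based options above can be sketched with a percentage rollout and a fallback path. This is a minimal illustration under stated assumptions: `FeatureFlag`, `price_quote`, and the flag name `new-pricing` are hypothetical and do not refer to any specific feature-flag product.

```python
import hashlib


class FeatureFlag:
    """A flag with a percentage rollout; setting it to 0 turns the feature off."""

    def __init__(self, name, rollout_percent=100):
        self.name = name
        self.rollout_percent = rollout_percent

    def is_enabled_for(self, user_id):
        # Stable hashing keeps each user in the same bucket across requests.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


def price_quote(flag, user_id):
    # The fallback implementation stays in place so the flag can be toggled safely.
    if flag.is_enabled_for(user_id):
        return "new-pricing"
    return "legacy-pricing"


flag = FeatureFlag("new-pricing", rollout_percent=50)
flag.rollout_percent = 0  # mitigation: turn the feature off for everyone
print(price_quote(flag, "user-123"))  # prints "legacy-pricing"
```

Lowering `rollout_percent` reduces exposure without a deployment, which is why a flag is often the fastest mitigation when one exists.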
Example
Incident: Checkout service failing due to connection pool exhaustion
Mitigation:
- Assess: The connection pool is exhausted, 100% of requests are failing, and severity is critical
- Options: Increase the pool size (quick), restart the service (temporary), roll back (if the failure correlates with a recent change)
- Execute: Increase the pool size and restart the service
- Monitor: Check service health, error rates, connection usage
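The assess, options, and monitor steps of this example can be sketched as a small decision helper. Everything here is an assumption made for illustration: the `PoolStatus` fields, the action names, and the thresholds (50% error rate for rollback, 1% error rate and 80% utilization for recovery) are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    in_use: int          # connections currently checked out
    max_size: int        # configured pool limit
    error_rate: float    # fraction of failed requests, 0.0 to 1.0
    recent_deploy: bool  # whether a deployment preceded the incident


def choose_mitigations(status):
    """Map the assessed state to candidate actions (illustrative rules only)."""
    actions = []
    if status.in_use >= status.max_size:
        actions.append("increase_pool_size")
        actions.append("restart_service")
    if status.recent_deploy and status.error_rate > 0.5:
        actions.append("rollback")
    return actions


def is_recovered(status, max_error_rate=0.01, max_utilization=0.8):
    """Monitoring check: low error rate and headroom left in the pool."""
    utilization = status.in_use / status.max_size
    return status.error_rate <= max_error_rate and utilization <= max_utilization


before = PoolStatus(in_use=50, max_size=50, error_rate=1.0, recent_deploy=False)
print(choose_mitigations(before))  # ['increase_pool_size', 'restart_service']

after = PoolStatus(in_use=30, max_size=100, error_rate=0.0, recent_deploy=False)
print(is_recovered(after))  # True
```

Encoding the recovery criteria up front makes the monitoring step unambiguous: the mitigation is done when the check passes, not when the dashboard looks better.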
AI SRE’s role
AI SRE suggests mitigation strategies based on:
- Incident type and evidence
- Recent changes
- Historical patterns
- Risk assessment
AI SRE also helps assess the risk of each option and suggests steps to validate that the mitigation worked.
Mitigation vs. root cause
- Mitigation — Quick action, temporary, reduces impact, restores service
- Root cause — Thorough, permanent, prevents recurrence, improves system
Best practices
- Act quickly to reduce impact
- Prefer low-risk mitigations
- Monitor closely for side effects
- Document actions taken
- Don’t stop at mitigation; investigate the root cause afterward