Mitigation reduces incident impact quickly, often before root cause is fully understood. The goal is to restore service or minimize damage.

Mitigation strategies

Quick fixes

Immediate actions to restore service:
  • Restart services or pods
  • Rollback deployment
  • Scale up capacity
  • Disable feature flags
  • Clear caches
When to use: Clear correlation with recent change, known workaround available, low risk.
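
One way to check for a "clear correlation with recent change" is to list the deployments that finished shortly before the incident began. The sketch below is illustrative only: the function name `correlated_changes`, the 30-minute window, and the `(name, finished_at)` tuple shape are assumptions, not part of any specific tool.

```python
from datetime import datetime, timedelta


def correlated_changes(incident_start, deploys, window=timedelta(minutes=30)):
    """Return deploys that finished within `window` before the incident, newest first.

    `deploys` is an iterable of (name, finished_at) tuples (hypothetical shape).
    """
    recent = [
        (name, finished_at)
        for name, finished_at in deploys
        # Keep only deploys that finished before the incident, inside the window.
        if timedelta(0) <= incident_start - finished_at <= window
    ]
    return sorted(recent, key=lambda d: d[1], reverse=True)
```

If this returns a deployment for the affected service, a rollback is a reasonable first candidate. An empty result suggests looking at other mitigations first.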

Traffic management

Control traffic to reduce impact:
  • Rate limiting
  • Circuit breakers
  • Load balancing
  • Traffic shaping
  • Geographic routing
When to use: Partial degradation, capacity issues, dependency failures.
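
To make the circuit breaker idea concrete, here is a minimal sketch of one. It assumes a simple policy: open after a number of consecutive failures, serve a fallback while open, and allow one trial call after a cooldown. The class name, thresholds, and injectable `clock` are illustrative choices, and production libraries typically add more states and metrics.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a failing dependency as `breaker.call(fetch_prices, use_cached_prices)` (both names hypothetical) stops the caller from piling load onto the dependency while it recovers.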

Resource management

Manage system resources:
  • Scale resources (CPU, memory, connections)
  • Increase limits (connection pools)
  • Clear stuck resources
  • Kill slow queries
When to use: Resource exhaustion, capacity constraints.
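
For "kill slow queries", it helps to choose targets deliberately rather than killing everything over a threshold. The sketch below only selects candidates and leaves the actual kill command to the database in use. The `RunningQuery` shape, the `is_replication` flag, and the default limits are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    query_id: int
    runtime_seconds: float
    is_replication: bool = False  # hypothetical marker for queries that must not be killed


def select_queries_to_kill(queries, max_runtime_seconds=30.0, max_kills=5):
    """Return ids of the longest-running queries over the threshold, capped at max_kills."""
    candidates = [
        q for q in queries
        if q.runtime_seconds > max_runtime_seconds and not q.is_replication
    ]
    # Kill the worst offenders first, and cap the blast radius of a single pass.
    candidates.sort(key=lambda q: q.runtime_seconds, reverse=True)
    return [q.query_id for q in candidates[:max_kills]]
```

Capping the number of kills per pass keeps the mitigation low-risk: check whether resource usage recovers before running another pass.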

Feature flags

Disable problematic features:
  • Turn off new features
  • Enable fallback implementations
  • Reduce feature exposure
  • Switch to control group
When to use: New feature causing issues, quick toggle available.
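
The following sketch shows how "reduce feature exposure" and "enable fallback implementations" can work together. It assumes an in-memory flag store with percentage-based exposure. The `FeatureFlags` class, the flag name `new_recommendations`, and the `recommend` helper are all hypothetical, and a real system would read flags from a flag service.

```python
import zlib


class FeatureFlags:
    """In-memory flag store with percentage-based exposure (illustrative only)."""

    def __init__(self):
        self._exposure = {}  # flag name -> percentage of users exposed (0-100)

    def set_exposure(self, name, percent):
        self._exposure[name] = max(0, min(100, percent))

    def is_enabled(self, name, user_id):
        # Unknown flags default to 0%, so a missing flag selects the fallback path.
        percent = self._exposure.get(name, 0)
        # Deterministic bucket in 0..99 so each user gets a stable answer.
        bucket = zlib.crc32(f"{name}:{user_id}".encode()) % 100
        return bucket < percent


def recommend(flags, user_id, new_impl, fallback_impl):
    """Route to the new implementation only for users inside the exposure percentage."""
    if flags.is_enabled("new_recommendations", user_id):
        return new_impl(user_id)
    return fallback_impl(user_id)
```

During an incident, `flags.set_exposure("new_recommendations", 0)` sends every user to the fallback without a deployment.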

Example

Incident: Checkout service failing due to connection pool exhaustion
Mitigation:
  1. Assess: Connection pool exhausted, 100% failure, critical
  2. Options: Increase pool size (quick), restart service (temporary), rollback (if correlated)
  3. Execute: Increase pool size and restart the service
  4. Monitor: Check service health, error rates, connection usage
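
The monitoring step above can be expressed as a simple check over recent health samples. This is a sketch under assumed thresholds (1% error rate, 80% pool utilization), and the `HealthSample` fields are hypothetical stand-ins for whatever the monitoring system reports.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    pool_in_use: int   # connections currently checked out
    pool_size: int     # configured maximum connections


def mitigation_holding(samples, max_error_rate=0.01, max_pool_utilization=0.8):
    """Return True only if every recent sample is under both thresholds."""
    if not samples:
        return False  # no data is not evidence of recovery
    return all(
        s.pool_size > 0
        and s.error_rate <= max_error_rate
        and s.pool_in_use / s.pool_size <= max_pool_utilization
        for s in samples
    )
```

Requiring every sample to pass, rather than the average, avoids declaring recovery while the pool still saturates intermittently.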

AI SRE’s role

AI SRE suggests mitigation strategies based on:
  • Incident type and evidence
  • Recent changes
  • Historical patterns
  • Risk assessment
AI SRE helps assess risks and suggests validation steps.

Mitigation vs. root cause

  • Mitigation — Quick action, temporary, reduces impact, restores service
  • Root cause — Thorough, permanent, prevents recurrence, improves system

Best practices

  • Act quickly to reduce impact
  • Prefer low-risk mitigations
  • Monitor closely for side effects
  • Document actions taken
  • Don’t stop at mitigation—investigate root cause