Mitigation

Mitigation reduces incident impact quickly, often before the root cause is fully understood. The goal is to restore service or minimize damage.

Mitigation strategies

Quick fixes

Immediate actions to restore service:

Restart services or pods
Roll back the deployment
Scale up capacity
Disable feature flags
Clear caches

When to use: Clear correlation with recent change, known workaround available, low risk.
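
One way to check the "clear correlation with recent change" condition is to list the changes that landed shortly before the incident began. The following is a minimal sketch, not a definitive implementation: the `recent_changes` function, the change record shape, and the 30-minute window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_start, window=timedelta(minutes=30)):
    """Return changes deployed within `window` before the incident, newest first.

    `changes` is a hypothetical list of dicts with "id" and "deployed_at" keys.
    """
    candidates = [
        c for c in changes
        if timedelta(0) <= incident_start - c["deployed_at"] <= window
    ]
    return sorted(candidates, key=lambda c: c["deployed_at"], reverse=True)

# Illustrative data only; these deploy IDs are made up.
changes = [
    {"id": "deploy-101", "deployed_at": datetime(2024, 1, 1, 9, 0)},
    {"id": "deploy-102", "deployed_at": datetime(2024, 1, 1, 11, 50)},
]
incident_start = datetime(2024, 1, 1, 12, 0)

suspects = recent_changes(changes, incident_start)
print([c["id"] for c in suspects])
```

A change that appears in this list is a candidate for rollback, not proof of cause; the time window only narrows the search.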

Traffic management

Control traffic to reduce impact:

Rate limiting
Circuit breakers
Load balancing
Traffic shaping
Geographic routing

When to use: Partial degradation, capacity issues, dependency failures.
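
As an illustration of one item in this list, a circuit breaker stops calling a failing dependency for a cooling-off period so that callers fail fast instead of piling up. The sketch below assumes a simple consecutive-failure threshold; the `CircuitBreaker` class and its parameters are hypothetical and do not describe any particular library.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a failing dependency as `breaker.call(fetch_inventory, item_id)` (where `fetch_inventory` is a placeholder for your own function) lets the rest of the service degrade gracefully while the dependency recovers.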

Resource management

Manage system resources:

Scale resources (CPU, memory, connections)
Increase limits (connection pools)
Clear stuck resources
Kill slow queries

When to use: Resource exhaustion, capacity constraints.
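
Before killing slow queries, it helps to select them by an explicit rule rather than by eye. The following sketch assumes you have already fetched the active queries from your database into plain dicts; the field names (`id`, `user`, `runtime_s`), the 60-second threshold, and the protected user list are illustrative assumptions.

```python
def queries_to_kill(active_queries, max_seconds=60.0, protected_users=("replication",)):
    """Pick queries running longer than `max_seconds`, longest first.

    Queries owned by `protected_users` are never selected.
    """
    slow = [
        q for q in active_queries
        if q["runtime_s"] > max_seconds and q["user"] not in protected_users
    ]
    return sorted(slow, key=lambda q: q["runtime_s"], reverse=True)

# Illustrative data only.
active = [
    {"id": 11, "user": "app", "runtime_s": 4.2},
    {"id": 12, "user": "app", "runtime_s": 310.0},
    {"id": 13, "user": "replication", "runtime_s": 900.0},
    {"id": 14, "user": "reporting", "runtime_s": 75.5},
]
print([q["id"] for q in queries_to_kill(active)])
```

The function only selects candidates. Terminating them is a separate, database-specific step that should be reviewed before it runs.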

Feature flags

Disable problematic features:

Turn off new features
Enable fallback implementations
Reduce feature exposure
Switch to control group

When to use: New feature causing issues, quick toggle available.
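
The sketch below shows how a flag can gate a new code path, fall back to the old implementation, and limit exposure to a percentage of users. It is a minimal illustration under assumptions: the flag shape (`enabled`, `percent`) and the `new_checkout` and `legacy_checkout` functions are hypothetical, and real flag systems typically provide their own evaluation logic.

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in a 0-99 bucket and compare to `percent`."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent

def new_checkout(user_id: str) -> str:
    return f"new checkout for {user_id}"

def legacy_checkout(user_id: str) -> str:
    return f"legacy checkout for {user_id}"

def checkout(user_id: str, flag: dict) -> str:
    """Use the new path only when the flag is on and the user is in the rollout."""
    if flag["enabled"] and in_rollout(user_id, flag["percent"]):
        return new_checkout(user_id)
    return legacy_checkout(user_id)

# Mitigation options, from least to most drastic:
reduced = {"enabled": True, "percent": 5}     # reduce feature exposure
disabled = {"enabled": False, "percent": 100} # turn the feature off entirely
print(checkout("user-42", disabled))
```

Hashing the user ID keeps each user in the same bucket across requests, so lowering `percent` removes users from the new path without flapping.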

Example

Incident: Checkout service failing due to connection pool exhaustion

Mitigation:

Assess: Connection pool exhausted, 100% failure, critical
Options: Increase pool size (quick), restart service (temporary), roll back (if correlated with a recent change)
Execute: Increase pool + restart service
Monitor: Check service health, error rates, connection usage
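
The Monitor step can be made concrete by checking recent metric samples against explicit thresholds. This is a sketch under assumptions: the `Sample` fields, the 1% error-rate ceiling, and the 80% pool-utilization ceiling are illustrative choices, not values from this incident.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    pool_in_use: int   # connections currently checked out
    pool_size: int     # configured pool maximum

def mitigation_holding(samples, max_error_rate=0.01, max_pool_utilization=0.8):
    """Return True only if every sample is within both thresholds."""
    if not samples:
        return False  # no data is not evidence of recovery
    return all(
        s.error_rate <= max_error_rate
        and s.pool_in_use / s.pool_size <= max_pool_utilization
        for s in samples
    )

# Illustrative samples taken after increasing the pool and restarting.
after = [Sample(0.002, 35, 100), Sample(0.001, 41, 100), Sample(0.0, 38, 100)]
print(mitigation_holding(after))
```

Requiring every sample in the window to pass, rather than the average, avoids declaring recovery while the pool is still intermittently saturating.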

AI SRE’s role

AI SRE suggests mitigation strategies based on:

Incident type and evidence
Recent changes
Historical patterns
Risk assessment

AI SRE helps assess risks and suggests validation steps.

Mitigation vs. root cause

Mitigation: Quick action, temporary, reduces impact, restores service
Root cause: Thorough, permanent, prevents recurrence, improves system

Best practices

Act quickly to reduce impact
Prefer low-risk mitigations
Monitor closely for side effects
Document actions taken
Don’t stop at mitigation; investigate the root cause afterward
