Resilience and Chaos Engineering
Test how systems behave when things fail; improve resilience proactively.
TL;DR
Chaos engineering deliberately injects failures into your system to discover weaknesses before users encounter them. Instead of waiting for outages, you break things on purpose: kill pods, take down database replicas, add latency, simulate network partitions, and watch how the system responds. Run experiments in staging first, then in production with controls. Tools include Gremlin (SaaS), Chaos Toolkit (open source), and Pumba (Docker). Iterate: identify weaknesses from each experiment, fix them, retest. Gamedays are simulated incidents where teams practice responding. Measure MTTR (Mean Time To Recovery) and iterate to drive it down. Resilience is an ongoing practice, not a one-time project.
Learning Objectives
After reading this article, you will understand:
- Chaos engineering principles and motivation
- How to design failure scenarios and experiments
- How to run chaos tests safely in staging and production
- Tools for chaos engineering and fault injection
- How to organize gamedays and incident simulations
- How to measure and improve MTTR
Motivating Scenario
Your microservices platform is running smoothly. One day, a database replica fails. Your connection pooling isn't configured correctly; the connection pool is exhausted. The system goes down for 4 hours. Post-mortem: "We should have tested this scenario."
Chaos engineering prevents this: You deliberately kill a database replica in staging. You discover connection pooling breaks. You fix it. In production, when the real replica fails, your system gracefully degrades. MTTR drops from hours to minutes because you've practiced this scenario before.
Core Concepts
Failure Scenarios
| Scenario | How To Inject | What To Test |
|---|---|---|
| Service down | Kill pod/container | Failover, alerts, health checks |
| Database slow | Inject latency | Circuit breakers, timeouts, fallbacks |
| Network partition | Block traffic | Split-brain handling, consistency |
| Disk full | Fill filesystem | Graceful degradation, alerts |
| Memory leak | Reduce available memory | OOM handling, autoscaling |
| High CPU | Stress or throttle CPU | Performance under load, cascading failures |
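Most of these injections come down to a handful of OS and cluster primitives. The sketch below is a minimal illustration rather than a production tool: it wraps two of them with Python's subprocess module, deleting a pod with kubectl and adding latency with tc netem. The namespace, label, and network interface are placeholder values.

# Minimal fault-injection helpers (illustrative sketch, not a production tool).
# Assumes kubectl is configured for the target cluster and tc/netem is available
# where the latency should be injected.
import subprocess

def kill_random_pod(namespace: str = "staging", label: str = "app=api-server") -> None:
    """Delete the first pod matching the label; Kubernetes should reschedule it."""
    pod = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", label,
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)

def add_latency(interface: str = "eth0", delay: str = "500ms", jitter: str = "100ms") -> None:
    """Add latency (with jitter) to all egress traffic on an interface via tc netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", delay, jitter],
        check=True,
    )

def remove_latency(interface: str = "eth0") -> None:
    """Roll back the netem rule; always pair every injection with a cleanup."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root", "netem"], check=True)

Tools like Chaos Mesh, Gremlin, and Pumba wrap these same primitives with scheduling, scoping, and automatic rollback, which is why they are preferred over ad-hoc scripts for anything beyond a first experiment.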
Practical Example
- Kubernetes Chaos
- Python (Chaos Toolkit)
- Gameday Script
# Chaos Mesh: kill a random api-server pod in the production namespace
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-random-pods
  namespace: production
spec:
  action: pod-kill
  mode: one                  # select one matching pod at random
  duration: 5m
  gracePeriod: 30            # seconds allowed for graceful shutdown
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  scheduler:
    cron: "0 2 * * *"        # 2 AM daily (Chaos Mesh 2.x moves scheduling to a separate Schedule CRD)
---
# Add latency to database connections made by the api-server
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency
spec:
  action: delay
  mode: all
  duration: 10m
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  delay:
    latency: "500ms"         # add 500ms latency to DB traffic
    jitter: "100ms"
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres-db
  scheduler:
    cron: "0 3 * * *"
---
# Network partition: split-brain scenario between two availability zones
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  duration: 2m
  direction: both
  selector:
    namespaces:
      - production
    labelSelectors:
      zone: us-east-1a
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        zone: us-east-1b
# Chaos Toolkit-style experiment in Python: actions and probes are plain functions
# that an experiment definition (JSON/YAML) points at via the "python" provider.
import time

import requests

# `db` is a placeholder for your database client (for example, a thin wrapper
# around psycopg2 or SQLAlchemy); swap in whatever your service actually uses.
import db


def setup_database_state():
    """Setup: ensure test data exists in a known state."""
    conn = db.connect()
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE test_orders (id INT, status VARCHAR)")
    cursor.execute("INSERT INTO test_orders VALUES (1, 'pending')")
    conn.commit()


def teardown():
    """Cleanup after the experiment."""
    conn = db.connect()
    cursor = conn.cursor()
    cursor.execute("DROP TABLE test_orders")
    conn.commit()


def inject_database_latency(duration_ms=500, target_namespace="production"):
    """Action: inject latency into database calls."""
    print(f"Injecting {duration_ms}ms latency for {target_namespace}")
    # Implementation options: iptables, tc (traffic control), a proxy sidecar,
    # or a NetworkChaos object like the one in the Kubernetes example above.


def verify_circuit_breaker_triggered():
    """Probe: after injecting latency, verify the circuit breaker activates."""
    try:
        response = requests.get("http://api:8080/orders", timeout=2)
        if response.status_code == 503:  # Service unavailable
            print("Circuit breaker correctly triggered!")
            return True
    except requests.Timeout:
        print("Circuit breaker activated (timeout)")
        return True
    return False


def verify_graceful_degradation():
    """Probe: verify the system degraded gracefully instead of crashing."""
    # Check: primary features work, non-essential features are disabled
    response = requests.get("http://api:8080/health")
    health = response.json()
    assert health["status"] == "degraded"            # not 'healthy', not 'down'
    assert "orders" in health["disabled_services"]   # DB-backed feature is off
    assert "search" in health["available_services"]  # cache-backed feature still works
    print("Graceful degradation confirmed")
    return True


def experiment():
    """Run the chaos experiment end to end."""
    setup_database_state()
    try:
        # Action: inject latency
        inject_database_latency(duration_ms=500, target_namespace="production")
        time.sleep(2)
        # Probe: does the circuit breaker activate?
        verify_circuit_breaker_triggered()
        # Probe: does the system degrade gracefully?
        verify_graceful_degradation()
        print("Experiment passed!")
        return True
    except Exception as e:
        print(f"Experiment failed: {e}")
        return False
    finally:
        teardown()
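In Chaos Toolkit proper, these functions are not called directly; they are referenced from a declarative experiment file, and the chaos runner executes them as the method, probes, and rollbacks. Below is a minimal sketch of such a definition, assuming the functions above live in a hypothetical module named chaos_actions; the URLs and steady-state tolerance are illustrative.

# Sketch of a Chaos Toolkit experiment definition wiring up the functions above.
# The module path "chaos_actions" and the health URL are assumptions.
import json

experiment = {
    "title": "Database latency does not take the API down",
    "description": "Inject DB latency; verify circuit breaking and graceful degradation.",
    "steady-state-hypothesis": {
        "title": "API is healthy",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,
                "provider": {"type": "http", "url": "http://api:8080/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "inject-database-latency",
            "provider": {
                "type": "python",
                "module": "chaos_actions",
                "func": "inject_database_latency",
                "arguments": {"duration_ms": 500, "target_namespace": "production"},
            },
        },
        {
            "type": "probe",
            "name": "circuit-breaker-triggered",
            "provider": {
                "type": "python",
                "module": "chaos_actions",
                "func": "verify_circuit_breaker_triggered",
            },
        },
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "teardown",
            "provider": {"type": "python", "module": "chaos_actions", "func": "teardown"},
        }
    ],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)  # then run: chaos run experiment.json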
# Database Failure Gameday
## Objective
Practice responding to database replica failure. Validate MTTR < 5 min.
## Setup (10 min, 2:00 PM)
- Monitoring dashboards open on big screen
- Incident commander (IC), on-call engineers, SRE team present
- Communication channels open (Slack, Zoom)
## Scenario (2:10 PM)
**Chaos engineer injects:** Kill database replica-1
- Database connection pool exhaustion expected
- Slowness in read-heavy endpoints
## Expected Behavior
- Monitoring alerts fire (database CPU high, connection pool near limit)
- Team is paged via on-call escalation
- Engineers discover replica down
- Failover to replica-2 (should be automatic, but verify)
- System recovers to normal latency
## Metrics (throughout)
- Track latency P99 (should spike, then recover)
- Track error rate (should stay < 1%)
- Track MTTR (time from alert to recovery)
## Post-Game (2:30 PM, 20 min)
- What went well?
- What went badly?
- What did we learn?
- Create tickets to fix issues found
## Success Criteria
- MTTR < 5 minutes
- No manual intervention required (automatic failover works)
- All team members understood their roles
- Alerts fired correctly
## Failure Modes Discovered (and Fixed)
1. Failover didn't activate automatically (root cause: a missing PagerDuty alert; fix: configured the alert)
2. Engineers didn't know how to check replica status (fix: added runbook)
3. Error rate spiked to 2% during failover (fix: configured circuit breaker)
When to Use / When Not to Use
Use chaos engineering when:
- You have distributed systems with multiple failure points
- You want to discover weaknesses before users do
- MTTR (Mean Time To Recovery) is a critical business metric
- You're building mission-critical systems (financial, healthcare, infrastructure)
- You want to practice incident response before real incidents
Hold off when:
- You haven't built monitoring and alerting yet (you can't detect the failures you inject)
- You don't have runbooks or incident response processes
- Your system is too immature (focus on basic reliability first)
- The cost of running experiments exceeds the value (very small systems)
Patterns and Pitfalls
Chaos Engineering Best Practices and Anti-Patterns
Design Review Checklist
- Chaos experiments have defined success criteria
- Experiments start in staging, not production
- Each experiment has automatic rollback/timeout
- Monitoring dashboards set up before running chaos
- Incident response runbooks exist and are up-to-date
- Circuit breakers configured to fail gracefully (a minimal circuit-breaker sketch follows this checklist)
- Alerting rules tested (fire when they should)
- Team trained on roles (IC, responder, observer, chaos engineer)
- MTTR measured before and after experiments
- Post-game reviews scheduled and documented
- Findings converted to tickets and prioritized
- Gamedays run at least quarterly
- Communication channels established (Slack, Zoom, etc.)
- Graceful degradation tested (can system continue partially?)
- Automated failover validated (no manual intervention needed)
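The circuit-breaker item is worth making concrete. Below is a minimal, illustrative circuit breaker; production services would more likely rely on an existing library or a service-mesh policy, but the state machine is the same: closed, open after repeated failures, half-open after a cooldown. The thresholds are placeholders, not recommendations.

# Minimal circuit breaker sketch: closed -> open after N failures -> half-open after a cooldown.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half-open"  # allow one trial request through
            else:
                raise RuntimeError("circuit open: failing fast, serve the fallback")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"
        return result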
Self-Check Questions
- Q: What's the difference between chaos engineering and testing? A: Testing validates expected behavior. Chaos engineering discovers unexpected failure modes by deliberately breaking things.
- Q: Should we do chaos engineering in production? A: Yes, but carefully. Start in staging. Once comfortable, run controlled chaos in production during low-traffic periods with a rollback plan.
- Q: What's MTTR and why does it matter? A: Mean Time To Recovery: how long it takes the system to recover from a failure, typically measured in minutes. Lower MTTR means better reliability. Chaos engineering lets you practice recovery and drive MTTR down (see the sketch after these questions).
- Q: What's a gameday? A: A simulated incident where the team practices responding to a failure scenario. It reveals gaps in processes, tools, and training.
- Q: How do you measure if chaos engineering is working? A: MTTR trend (should decrease over time), incident severity (fewer critical incidents), and team confidence (engineers are less stressed during real incidents).
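MTTR itself is simple to compute, but only meaningful if incident start and recovery timestamps are recorded consistently (usually first alert to confirmed recovery). A small sketch, assuming you export those timestamps from your incident tracker; the incidents shown are illustrative:

# MTTR from (alert_time, recovery_time) pairs; timestamps below are illustrative.
from datetime import datetime, timedelta

incidents = [
    (datetime(2024, 3, 1, 14, 2), datetime(2024, 3, 1, 14, 9)),    # 7 min
    (datetime(2024, 4, 12, 2, 40), datetime(2024, 4, 12, 2, 44)),  # 4 min
    (datetime(2024, 5, 3, 9, 15), datetime(2024, 5, 3, 9, 21)),    # 6 min
]

recovery_times = [recovered - alerted for alerted, recovered in incidents]
mttr = sum(recovery_times, timedelta()) / len(recovery_times)
print(f"MTTR: {mttr.total_seconds() / 60:.1f} minutes")  # -> MTTR: 5.7 minutes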
Next Steps
- Audit resilience — What failures could hurt us most?
- Start small — First chaos: kill non-critical service pod
- Set up monitoring — Dashboards to observe chaos experiments
- Define SLOs — Recovery time targets (e.g., < 5 min)
- Run gamedays — Quarterly incident simulations
- Measure MTTR — Track improvement over time
- Automate — Failover, circuit breakers, health checks (a degraded-mode health endpoint is sketched below)
- Iterate — Each gameday finds issues; fix them; retest
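The graceful-degradation probe earlier expects a health endpoint that reports "degraded" along with which features are available or disabled. Here is a minimal sketch of that contract using Flask; database_is_up and cache_is_up are hypothetical dependency checks you would replace with real connectivity probes.

# Sketch of a degraded-mode health endpoint matching the probe earlier.
# database_is_up() / cache_is_up() are hypothetical stand-ins for real checks.
from flask import Flask, jsonify

app = Flask(__name__)

def database_is_up() -> bool:
    return False  # stubbed: replace with a real connectivity check

def cache_is_up() -> bool:
    return True   # stubbed: replace with a real connectivity check

@app.route("/health")
def health():
    available, disabled = [], []
    (available if cache_is_up() else disabled).append("search")
    (available if database_is_up() else disabled).append("orders")
    if not available:
        status = "down"
    elif disabled:
        status = "degraded"
    else:
        status = "healthy"
    return jsonify({
        "status": status,
        "available_services": available,
        "disabled_services": disabled,
    })

if __name__ == "__main__":
    app.run(port=8080)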