
Gamedays and Chaos Engineering

Practice failure in a controlled environment; discover and fix weaknesses before production.

TL;DR

Gameday: simulate failures in a controlled time window. Kill the database, watch how the system responds. Route traffic to the wrong datacenter. Reveal weaknesses: broken alerting, outdated runbooks, untrained team. Fix them before they cause real incidents.

Chaos engineering: continuously inject failures in staging. Randomly kill instances, inject latency, saturate resources. Measure how gracefully the system degrades. Prevents brittle systems that work until they catastrophically fail.

Learning Objectives

  • Design realistic gameday scenarios matching production risks
  • Run gamedays with clear scope and communication
  • Implement continuous chaos testing in staging
  • Measure system resilience objectively
  • Convert gameday findings into concrete improvements
  • Build team confidence in incident response

Motivating Scenario

Your payment platform processes 50,000 transactions per hour. Your database is a single leader with read replicas. You assume the system "handles database failover" based on configuration review.

One Tuesday, the leader dies. Your read replicas don't automatically promote. For 18 minutes, all transactions fail. 15,000 users lose access. Your incident report concludes: "We thought failover was automatic. It wasn't configured."

A gameday would have revealed this immediately. Day 1: "Kill the database leader." Day 2: discover the misconfiguration. Day 3: fix it. All before a real outage.

With chaos engineering running continuously, this configuration drift would have been caught in the next scheduled chaos test.

Core Concepts

Gameday Framework

flowchart TB
    Plan["Plan Gameday<br/>Scope & Scenario"] --> Pre["Pre-Gameday<br/>Notifications & Setup"]
    Pre --> Execute["Execute Failures<br/>Inject Problem"]
    Execute --> Observe["Observe System<br/>Behavior"]
    Observe --> Respond["Team Responds<br/>Using Runbooks"]
    Respond --> Measure["Measure Response<br/>Collect Data"]
    Measure --> Recover["Recover System<br/>Validate Fix"]
    Recover --> Debrief["Debrief & Document<br/>Lessons Learned"]
    Debrief --> Tickets["Create Tickets<br/>to Fix Issues"]

Gameday Scenarios by Severity

Tier 1: Single Component (1 hour)

  • Kill one database replica
  • Stop one worker queue
  • Disconnect one microservice
  • Turn off one region

Tier 2: Infrastructure Failure (2-3 hours)

  • Entire database unavailable for 30 minutes
  • Network partition (service A cannot reach B; see the sketch after this list)
  • Entire Kubernetes cluster restart
  • Datacenter network disconnect
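
For the network-partition scenario above, one low-tooling sketch is a deny-all NetworkPolicy on the target service, assuming the cluster's CNI enforces NetworkPolicies. Note that this isolates service B from all callers, which is broader than a pure A-to-B cut; a single-edge partition needs CNI-specific policies or a service mesh. Namespace and labels below are illustrative.

# partition-sketch.sh - isolate service-b from all inbound traffic (illustrative)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gameday-isolate-service-b
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: service-b
  policyTypes:
    - Ingress
  # no ingress rules listed: all inbound traffic to service-b is denied
EOF

# Roll back when the exercise ends
kubectl delete networkpolicy gameday-isolate-service-b -n staging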

Tier 3: Cascading Failures (3-4 hours)

  • Database + cache both down
  • Multiple service failures in sequence
  • Quota exhaustion (API limits, database connections; see the sketch after this list)
  • Memory leak cascade (slow failure over 30 minutes)
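
A rough sketch of the connection-exhaustion case referenced above, assuming a PostgreSQL database with psql available and PGHOST/PGUSER/PGPASSWORD set for the staging environment. The connection count and database name are illustrative, and this should never run against production.

# connection-exhaustion.sh - hold many idle database connections open (staging only)
for i in $(seq 1 200); do
  # each backgrounded psql holds one connection for 5 minutes
  psql -d app -c "SELECT pg_sleep(300);" &
done
echo "200 connections held; watch application error rates and pool saturation metrics"
wait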

Chaos Engineering Dimensions

Infrastructure Level:

  • Kill random pods
  • Reboot nodes
  • Saturate disk/CPU
  • Inject network packet loss
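
For the packet-loss item above, the classic low-level tool is tc with the netem qdisc; this sketch assumes root access on the target node and that eth0 is the relevant interface.

# Inject 5% packet loss on eth0 for the duration of the experiment
sudo tc qdisc add dev eth0 root netem loss 5%

# ... observe dashboards, retries, and error rates ...

# Always remove the rule when the experiment ends
sudo tc qdisc del dev eth0 root netem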

Application Level:

  • Kill in-process cache
  • Timeout database connections
  • Slow down HTTP responses (see the sketch after this list)
  • Corrupt message payloads
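
If the services run behind a service mesh such as Istio, HTTP responses can be slowed (the item flagged above) with fault injection instead of touching application code; the host, namespace, and delay values below are illustrative.

# Delay 50% of requests to the checkout service by 2 seconds via Istio fault injection
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-latency-gameday
  namespace: staging
spec:
  hosts:
    - checkout
  http:
    - fault:
        delay:
          percentage:
            value: 50
          fixedDelay: 2s
      route:
        - destination:
            host: checkout
EOF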

Data Level:

  • Enable read-only mode
  • Delay data synchronization
  • Introduce stale data (see the sketch after this list)
  • Trigger backup restoration

Practical Examples

#!/bin/bash
# gameday-setup.sh - Configure controlled failure injection
set -euo pipefail

# Install the Litmus chaos operator
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-latest.yaml

# Define gameday scenario: kill payment-service pods
cat > gameday-scenario.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-payment-kill-pod
  namespace: chaos-testing
spec:
  # Only run during scheduled gamedays
  engineState: "active"
  chaosServiceAccount: litmus-admin   # service account with chaos permissions
  appinfo:
    appns: "production"
    applabel: "app=payment-service"
    appkind: "deployment"
  experiments:
    - name: pod-delete            # Litmus experiment that kills application pods
      spec:
        components:
          env:
            # Run chaos for 5 minutes against half of the replicas
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: PODS_AFFECTED_PERC
              value: "50"         # 50% of replicas
            - name: FORCE
              value: "false"      # graceful shutdown first
          nodeSelector:
            chaos-enabled: "true" # schedule the experiment pod only on designated test nodes
EOF

# Apply chaos scenario
kubectl apply -f gameday-scenario.yaml

# Monitor system response
echo "Gameday started: payment service degradation"
kubectl logs -f -n production -l app=payment-service --tail=50

# In a separate terminal, query metrics
watch 'curl -s "http://prometheus:9090/api/v1/query?query=payment_error_rate" | jq'

# After 5 minutes, chaos stops automatically
# Verify recovery
kubectl wait --for=condition=ready pod -l app=payment-service -n production --timeout=2m
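
Every gameday needs an abort switch. With Litmus, a running engine can be halted by patching its engineState; a minimal sketch:

# Abort immediately if real user impact appears
kubectl patch chaosengine gameday-payment-kill-pod -n chaos-testing \
  --type merge -p '{"spec":{"engineState":"stop"}}'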

When to Use Gamedays vs Continuous Chaos

| Gameday (Scheduled)                  | Chaos (Continuous)                         |
| ------------------------------------ | ------------------------------------------ |
| High-risk, complex scenarios         | Routine, repeatable failures               |
| Team training and practice           | Detects regressions automatically          |
| Testing infrastructure changes       | Staging environment validation             |
| Quarterly or at major releases       | Runs daily/weekly unattended               |
| Involves the entire incident team    | Automated; alerts only when something breaks |
| Deep learning and new discoveries    | Catches configuration drift                |
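
A minimal sketch of the continuous side: a plain Kubernetes CronJob that re-applies the ChaosEngine from the Practical Examples section every week in staging. It assumes a chaos-runner service account with RBAC to create ChaosEngines and a chaos-scenarios ConfigMap holding the manifest; both names are illustrative.

# weekly-chaos.sh - run the chaos scenario every Monday at 03:00 in staging
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-chaos
  namespace: chaos-testing
spec:
  schedule: "0 3 * * 1"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: chaos-runner   # needs RBAC to create ChaosEngines (assumed to exist)
          restartPolicy: Never
          containers:
            - name: apply-chaos
              image: bitnami/kubectl:latest
              command: ["kubectl", "apply", "-f", "/scenarios/gameday-scenario.yaml"]
              volumeMounts:
                - name: scenarios
                  mountPath: /scenarios
          volumes:
            - name: scenarios
              configMap:
                name: chaos-scenarios        # ConfigMap holding the ChaosEngine manifest (assumed)
EOF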

Patterns and Pitfalls

  • Progress gradually: Start with single-component failures. Once the team is comfortable, move on to cascading failures. Month 1: kill a pod. Month 2: kill the database. Month 3: kill the database and cache simultaneously.
  • Measure resilience: Track detection latency, recovery time, false-positive rate, and impact scope. These metrics should improve over time as you fix issues, which quantifies the resilience gains (see the sketch after this list).
  • Act on findings: Running a gameday and then ignoring the findings defeats the purpose. If the same issue appears in gameday #3 that appeared in gameday #1, your process is broken. Every gameday should generate actionable tickets.
  • Keep runbooks practiced: A gameday only works if the team knows the runbooks beforehand. Practice them monthly; junior engineers should be able to execute them without senior guidance.
  • Mind the staging gap: Production systems change, so chaos in staging won't catch production-specific failures. In production, use very conservative chaos (single pod, short duration, business hours only).
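
A lightweight way to capture the response metrics flagged in the list above is to timestamp key moments during the exercise and compute the deltas afterwards; the event names are purely illustrative.

# gameday-timing.sh - run "log_event <name>" at each key moment during the exercise
log_event() { echo "$(date +%s) $1" >> gameday-events.log; }

# Run these at the actual moments, not all at once:
#   log_event injection_start     # chaos applied
#   log_event first_alert         # first page fires
#   log_event recovery_complete   # system healthy again
#
# Afterwards: detection latency = first_alert - injection_start
#             recovery time     = recovery_complete - injection_start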

Design Review Checklist

  • Gameday schedule is published and recurring (quarterly minimum)
  • Each gameday scenario tests a specific architectural weakness or recent incident
  • Gameday has clear success criteria and metrics to measure
  • Both SREs and engineers participate in gamedays
  • Gameday findings generate tickets that are prioritized and tracked
  • Chaos tests run in staging on a defined schedule (weekly minimum)
  • Chaos experiment suite covers all critical services and failure modes
  • Team has practiced executing critical runbooks within the past 3 months
  • Gameday and chaos results are documented and shared across team
  • Post-gameday, at least one finding is addressed before next gameday

Self-Check

  • How many gamedays has your team run in the past year?
  • After the last gameday, how many P0 issues were discovered?
  • Do you have continuous chaos testing in staging?
  • Can a new engineer execute your critical runbook without help?
  • Has a gameday ever prevented a production incident?

Next Steps

  1. Week 1: Schedule first gameday. Choose a single-component failure (kill one service).
  2. Week 2: Run gameday. Document findings. Create tickets.
  3. Week 3: Fix top 3 issues from gameday.
  4. Week 4: Set up chaos testing in staging to catch regressions.
  5. Ongoing: Monthly gameday practice; weekly chaos tests.
