Auto-Remediation and Runbooks

Fix common incidents automatically; guide complex incidents with runbooks.

TL;DR

Auto-remediation fixes common incidents without human intervention: disk full? Auto-delete old logs. Service down? Auto-restart. CPU spike? Auto-scale. Response time drops from roughly 30 minutes to 30 seconds, dramatically reducing MTTR.

Runbooks guide humans through complex incidents with conditional logic: "if X, try Y. If Y fails, try Z. If Z fails, escalate." Junior engineers execute runbooks while seniors focus on design. Update after every incident to keep them current.

Learning Objectives

  • Identify safe candidates for auto-remediation
  • Implement auto-remediation with safety guardrails
  • Write step-by-step runbooks for complex incidents
  • Measure remediation impact on MTTR
  • Establish runbook update processes post-incident

Motivating Scenario

Your ecommerce platform experiences a disk-full incident at 2 AM. The on-call engineer receives the alert at 2:03 AM, wakes up, reads logs, manually deletes old transaction files, and the service recovers at 2:35 AM. Total downtime: 32 minutes. 33,000 lost transactions.

With auto-remediation, the system detects disk usage >90% at 2:00 AM, automatically archives old logs to cold storage, and service continues uninterrupted. Downtime: 0 minutes. Lost transactions: 0.

With a runbook, the engineer would have clear steps within 60 seconds: "check disk usage → review log retention policy → escalate if unknown files → archive or delete selectively → verify service health."

Core Concepts

Auto-Remediation Architecture

flowchart TB
    Alert["Alert Triggered<br/>High Resource Usage"] --> Evaluate{"Meets Remediation<br/>Criteria?"}
    Evaluate -->|No| Escalate["Escalate to<br/>On-Call Engineer"]
    Evaluate -->|Yes| CheckLocks{"Safeguard Checks<br/>Passed?"}
    CheckLocks -->|No| Escalate
    CheckLocks -->|Yes| Execute["Execute<br/>Remediation Action"]
    Execute --> Verify{"Action<br/>Successful?"}
    Verify -->|Yes| Resolved["Incident Resolved<br/>Log Event"]
    Verify -->|No| Escalate
    Escalate --> Manual["Human Reviews<br/>Runbook"]
    Manual --> Resolved

Good Candidates for Auto-Remediation

Auto-remediation works best for idempotent, low-risk, frequent incidents with clear symptoms:

  • Disk full: Archive/delete old logs, rotate files, cleanup temp directories
  • Memory leaks: Restart service gracefully, swap in backup instance
  • Connection pool exhaustion: Force reconnection, cycle connections
  • Cache expiration: Trigger cache refresh, reload configuration
  • Disk I/O throttling: Rebalance workload across disks, circuit breaker
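The disk-full case above is the canonical idempotent action. A minimal sketch of such a cleanup (the directory layout, the `.log` suffix, and the 30-day retention window are illustrative assumptions, not a prescribed policy):

```python
import os
import time

def archive_old_logs(log_dir: str, max_age_days: int = 30) -> list:
    """Remove *.log files older than max_age_days from log_dir.

    Idempotent: a second run removes nothing extra, and an empty
    directory is a no-op, so it is safe for an alert to retry it.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if name.endswith(".log") and os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)  # in production, copy to cold storage before deleting
            removed.append(name)
    return removed
```

Because running it twice is harmless, the alerting pipeline can fire it repeatedly without extra coordination.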

Risky Auto-Remediation (Avoid)

These incidents require human judgment and risk approval:

  • Data deletion: Could permanently lose data despite safeguards
  • Database failover: Might promote stale replica, causing data loss
  • Transaction rollback: Could break consistency or cascade failures
  • Network reconfiguration: Could isolate systems or create asymmetric partitions
  • Authentication token rotation: Could break dependent services unexpectedly
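One way to enforce this boundary in code is an explicit allowlist: any action not on the list, or whose name suggests a risky operation, is routed to a human. A minimal sketch, where the action names and keyword list are illustrative assumptions:

```python
# Hypothetical guardrail: only explicitly allowlisted actions may run
# unattended; anything hinting at data, failover, or auth goes to a human.
SAFE_ACTIONS = {"auto_remediate_disk", "auto_restart_service", "auto_refresh_cache"}
RISKY_KEYWORDS = ("delete", "failover", "rollback", "network", "token")

def is_safe_to_automate(action: str) -> bool:
    """Return True only for allowlisted actions with no risky keyword."""
    if action not in SAFE_ACTIONS:
        return False
    # Defense in depth: even an allowlisted name must not look risky
    return not any(word in action for word in RISKY_KEYWORDS)
```

The double check is deliberate: an allowlist catches typos and unknown actions, while the keyword scan catches a risky action that was allowlisted by mistake.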

Runbook Structure

Effective runbooks follow decision trees with escalation paths:

SYMPTOM: Service returning 500 errors

IMMEDIATE CHECKS:
1. Is service responding? (curl health endpoint)
2. Check logs for panic/crash messages
3. Verify CPU/memory not maxed
4. Check database connectivity

IF healthy → restart service
- Backend: systemctl restart myservice
- Result: healthy? → RESOLVED
- Not healthy? → escalate to on-call lead

IF CPU maxed → check processes
- Runaway process? → kill -9 <pid> → RESOLVED
- Expected load? → scale out or escalate

IF database down → check connection
- Connection timeout? → check firewall rules
- Auth failure? → verify credentials → escalate

ESCALATE TO: On-call Lead (slack #incidents)
TIME LIMIT: 5 minutes to escalate or resolve
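The first and third IMMEDIATE CHECKS above can be scripted so the responder starts with data instead of commands. A hedged sketch assuming a standard HTTP health endpoint; the log and database checks are left to the operator:

```python
import os
import urllib.request

def immediate_checks(health_url: str) -> dict:
    """Run the runbook's first and third immediate checks, return findings."""
    findings = {}
    # 1. Is the service responding? (scripted equivalent of curl-ing the health endpoint)
    try:
        with urllib.request.urlopen(health_url, timeout=5) as resp:
            findings["healthy"] = resp.status == 200
    except Exception as exc:
        findings["healthy"] = False
        findings["error"] = str(exc)
    # 3. Verify CPU is not maxed: compare 1-minute load average to core count
    findings["cpu_maxed"] = os.getloadavg()[0] > (os.cpu_count() or 1)
    return findings
```

A responder (or the webhook itself) can branch on `findings` exactly as the runbook's decision tree does.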

Practical Examples

# prometheus-rules.yml
groups:
  - name: auto_remediation
    interval: 30s
    rules:
      - alert: DiskUsageHigh
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 2m
        annotations:
          action: "auto_remediate_disk"

      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        annotations:
          action: "auto_restart_service"

# alert-webhook.py - receives Prometheus alerts via Alertmanager
from flask import Flask, request

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def handle_alert():
    data = request.json
    alert = data['alerts'][0]
    action = alert['annotations']['action']

    if action == 'auto_remediate_disk':
        # Only run if safeguard checks pass
        if is_staging() and disk_usage() > 90:
            archive_old_logs(days=30)
            return {"status": "remediated"}

    if action == 'auto_restart_service':
        if not is_production() or get_error_budget() > 0:
            restart_service()
            return {"status": "restarted"}

    # Otherwise escalate
    notify_oncall(f"Manual intervention needed: {action}")
    return {"status": "escalated"}
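The webhook leans on safeguard helpers defined elsewhere. The following hypothetical implementations show one plausible shape for them; the ENVIRONMENT variable convention, the error-budget source, and the systemd unit name are all assumptions about your deployment:

```python
import os
import shutil
import subprocess

def is_staging() -> bool:
    # Assumes the deployment sets an ENVIRONMENT variable
    return os.environ.get("ENVIRONMENT", "production") == "staging"

def is_production() -> bool:
    return os.environ.get("ENVIRONMENT", "production") == "production"

def disk_usage(path: str = "/") -> float:
    """Percentage of disk space in use at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def get_error_budget() -> float:
    """Remaining error budget; a real version would query your SLO store."""
    return float(os.environ.get("ERROR_BUDGET_REMAINING", "0"))

def restart_service(name: str = "myservice") -> None:
    # Hypothetical: assumes the service runs under systemd
    subprocess.run(["systemctl", "restart", name], check=True)
```

Keeping the safeguards as small pure-ish functions makes them easy to unit-test, which matters for code that runs unattended at 2 AM.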

When to Use Auto-Remediation vs Runbooks

Use auto-remediation when:
  1. The incident happens 10+ times/month
  2. The fix is deterministic and low-risk
  3. The problem is detected by monitoring
  4. Safeguards can prevent bad outcomes
  5. MTTR matters (financial impact)
  6. The action is idempotent (safe to retry)

Use a runbook when:
  1. The incident is rare or variable
  2. There are multiple paths to resolution
  3. Resolution requires human judgment
  4. The action is high-risk (data, security)
  5. It helps junior engineers learn
  6. Conditions are non-deterministic

Patterns and Pitfalls

  • Incremental adoption: Start with monitoring → logging → alerts → runbook escalation → auto-remediation. Example: CPU spike → log details → alert → try restart → if 3 restarts in 10 minutes, escalate.
  • Runbooks feed automation: Track which runbook steps are executed manually in every incident. If one step is always taken, automate it. This converts runbooks into auto-remediation over time.
  • Restart loops: Auto-restarting without checking whether the root cause is still present creates an infinite restart loop. Safeguard: only restart if the service was running before, and the restart success rate is above 80%.
  • Stale runbooks: Runbooks describing old architecture or removed commands erode trust. Assign an owner, keep them in version control, test them annually, and add a "last executed" date to every runbook.
  • Silent fixes: Auto-remediation that silently fixes an issue leaves no record of what happened. Always log the triggering alert, the condition met, the action taken, and the result, and include it in the incident review.
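The restart-loop pitfall suggests a concrete safeguard: cap restarts per sliding time window and escalate once the cap is hit. A sketch of such a guard, using the 3-restarts-in-10-minutes threshold from the example above:

```python
import time
from collections import deque

class RestartGuard:
    """Allow at most max_restarts within window seconds, then force escalation."""

    def __init__(self, max_restarts=3, window=600.0):
        self.max_restarts = max_restarts
        self.window = window
        self.history = deque()  # timestamps of recent allowed restarts

    def allow_restart(self, now=None):
        """Return True if a restart may proceed; False means page a human."""
        now = time.time() if now is None else now
        # Drop restarts that have aged out of the sliding window
        while self.history and now - self.history[0] > self.window:
            self.history.popleft()
        if len(self.history) >= self.max_restarts:
            return False  # restart loop suspected: stop retrying, escalate
        self.history.append(now)
        return True
```

The webhook would consult `allow_restart()` before calling its restart action, turning the fourth restart attempt into an escalation instead.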

Design Review Checklist

  • Identify top 10 recurring incidents and map to auto-remediation or runbook
  • Each critical alert has corresponding runbook or auto-remediation rule
  • Auto-remediation includes safeguard checks (error budget, environment, locked resources)
  • Auto-remediation actions are idempotent and have max-retry limits
  • Runbooks are tested at least annually with junior engineers
  • Runbooks have clear escalation path and timeout (5-10 min recommended)
  • Auto-remediation logs all actions with timestamp, trigger, and result
  • Runbooks are stored in version control and linked from alert annotations
  • Post-incident review includes 'did we update runbooks?' as agenda item
  • MTTR trending shows improvement after remediation implementation
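The logging requirement in the checklist (timestamp, trigger, action, result) can be met with a small structured audit record. This sketch prints JSON lines as a stand-in for whatever log pipeline you actually use:

```python
import json
import time

def log_remediation(alert: str, condition: str, action: str, result: str) -> str:
    """Emit a structured audit record for every auto-remediation attempt."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "alert": alert,       # which alert triggered the action
        "condition": condition,  # the safeguard condition that was met
        "action": action,     # what the system actually did
        "result": result,     # success / failure / escalated
    }
    line = json.dumps(record, sort_keys=True)
    print(line)  # replace with your real log handler
    return line
```

Because each record is self-describing JSON, the incident review can grep the audit trail rather than reconstructing events from memory.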

Self-Check

  • Can a junior engineer successfully execute your most critical runbook?
  • What incidents would benefit from auto-remediation but currently require manual steps?
  • When were your runbooks last tested? Last updated?
  • Do you track which auto-remediation rules fire most often?
  • Is there an audit trail showing which actions auto-remediation took?

Next Steps

  1. Week 1: Inventory top 10 recurring incidents. Categorize as auto-remediation or runbook.
  2. Week 2: Write runbooks for 3 critical alerts. Include escalation path and timeout.
  3. Week 3: Implement 1 auto-remediation rule with safeguard checks. Monitor MTTR.
  4. Week 4: Schedule annual runbook review. Assign owner. Add to incident checklist.
  5. Ongoing: After each incident, ask "could we have prevented this with auto-remediation?"

References

  1. Beyer, B., et al. (2016). Site Reliability Engineering. O'Reilly Media.
  2. Forsgren, N., et al. (2018). Accelerate. IT Revolution Press.
  3. Cook, R. (1998). How Complex Systems Fail. how.complexsystems.fail