Error Budgets and Toil

Quantify acceptable unreliability; measure and eliminate manual toil.

TL;DR

Error budget: a 99.9% SLO allows 0.1% failure, which is about 43 minutes per 30-day month. Once the budget is consumed, freeze changes until the next month. Spend the budget intentionally: risky deployments, chaos tests, experiments.

Toil: manual operational work (manual deployments, running scripts, repetitive monitoring). Measure it ruthlessly. Target: 20-30% of engineer time on toil; 70-80% on features. Prefer automating over hiring. If toil exceeds 50%, the team is spending headcount on work that automation should absorb.

Learning Objectives

  • Calculate SLO and error budgets accurately
  • Use error budgets to make intentional risk decisions
  • Measure toil systematically across your team
  • Identify and prioritize toil-elimination projects
  • Build automation roadmaps for 50% toil reduction
  • Balance innovation velocity with reliability

Motivating Scenario

Your team shipped 8 features last quarter but spent 40% of time on manual deployments, monitoring, and incident response (toil). Your deployment pipeline is manual: run scripts, wait for tests, toggle feature flags, monitor logs manually.

A competitor's team shipped 20 features. Their deployment is fully automated: push to main, CI/CD runs everything and deploys automatically to production. They spend 10% of their time on toil and 90% on features.

You have a 99.9% SLO, but you consumed your monthly error budget by week 2 because of incidents caused by manual deployment mistakes.

The competitor has a 99.95% SLO and still has budget remaining near month's end, so they can run a risky experiment in week 4.

Eliminating toil unlocks velocity. Error budgets give permission to deploy faster.

Core Concepts

Error Budget Math and Usage

graph LR
    SLO["SLO: 99.9%<br/>per month"] --> Budget["Error Budget<br/>0.1% = 43 min"]
    Budget --> Track["Track Usage<br/>Incidents + Degradation"]
    Track --> Decision{"Budget<br/>Consumed?"}
    Decision -->|Yes| Freeze["Freeze Changes<br/>Focus on Reliability"]
    Decision -->|No| Spend["Spend Intentionally<br/>Risky Deploy, Chaos"]
    Freeze --> Next["Next Month<br/>Fresh Budget"]
    Spend --> Next

SLO Breakdown:

  • 99.9% ("three nines") means the service is available 99.9% of the time
  • Failures allowed per 30-day month: (1 - 0.999) × 43,200 min = 43.2 minutes
  • Failures allowed per day: 43.2 / 30 = 1.44 minutes
  • Per million requests: 1,000 errors out of 1,000,000 requests
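The arithmetic above can be sketched as two small helpers. This is a minimal illustration that assumes a 30-day month; the names `error_budget_minutes` and `allowed_errors` are hypothetical and do not come from any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo) * window_minutes


def allowed_errors(slo: float, total_requests: int) -> float:
    """Return how many failed requests the SLO tolerates."""
    return (1 - slo) * total_requests


# Print the 30-day budget for a few common targets
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.2f} min per 30 days")
```

Passing `window_days=1` gives the per-day figure, and `allowed_errors(0.999, 1_000_000)` gives the per-million-requests figure.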

Budget Consumption:

  • Incident with 5 min of downtime: -5 min of budget
  • Deployment at 99% success: 1% of the requests served during the rollout fail and count against the budget
  • Cascading failure for 2 hours: 120 min, nearly three times the 43.2-minute monthly budget
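To make these cases concrete, the sketch below converts each into budget minutes. It assumes a 43.2-minute monthly budget and treats a partial degradation as downtime weighted by the fraction of failing requests, which is one common convention rather than the only one. The helper `budget_cost_minutes` and the 30-minute rollout duration are assumptions made for illustration.

```python
MONTHLY_BUDGET_MIN = 43.2  # 99.9% SLO over a 30-day month


def budget_cost_minutes(duration_min: float, failure_fraction: float = 1.0) -> float:
    """Budget consumed by an event: duration weighted by the share of failing requests."""
    if not 0 <= failure_fraction <= 1:
        raise ValueError("failure_fraction must be between 0 and 1")
    return duration_min * failure_fraction


events = {
    "5-minute full outage": budget_cost_minutes(5),
    # Assumed 30-minute rollout during which 1% of requests fail
    "30-minute deploy at 99% success": budget_cost_minutes(30, 0.01),
    "2-hour cascading failure": budget_cost_minutes(120),
}

for name, cost in events.items():
    share = cost / MONTHLY_BUDGET_MIN
    print(f"{name}: {cost:.2f} min ({share:.0%} of monthly budget)")
```

Weighting by failure fraction keeps a brief, mostly successful rollout from being charged like a full outage, while a two-hour outage still overshoots the monthly budget on its own.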

Toil Definition and Measurement

Toil: Manual, repetitive, operational work that keeps the service running but produces no lasting improvement.

Examples:

  • Manually deploying to each environment
  • Running pre-deployment checklists
  • Manually scaling services during traffic spikes
  • Manually restarting failed services
  • Manually running batch jobs daily
  • Manually updating configuration files
  • Manually escalating on-call pages and managing alerts

Not Toil: Feature development, design discussions, code review (these produce lasting value).

Measurement: Tracking toil requires honest logging.

Weekly Toil Audit:
- 2 hrs: Deploy process (multiple manual steps)
- 1.5 hrs: Manual scaling (peak traffic management)
- 1 hr: Runbook updates (post-incident)
- 2 hrs: Incident response (debugging + fixing)
- 0.5 hrs: Monitoring + alerting (manual checks)
--------
7 hours of toil in a 40-hour week = 17.5% (GOOD)

Team average: 25%, Target: <30%
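The weekly audit above can be tallied with a few lines of code. This is a sketch only: the category names and the `summarize_toil` helper are hypothetical, and the hours are the sample figures from the audit.

```python
from collections import defaultdict

WORK_WEEK_HOURS = 40
TOIL_TARGET = 0.30  # the <30% target stated above

# (category, hours) entries copied from the sample weekly audit
toil_log = [
    ("deploy", 2.0),
    ("scaling", 1.5),
    ("runbooks", 1.0),
    ("incident", 2.0),
    ("monitoring", 0.5),
]


def summarize_toil(entries, week_hours=WORK_WEEK_HOURS):
    """Return (hours per category, total toil hours, toil fraction of the week)."""
    by_category = defaultdict(float)
    for category, hours in entries:
        by_category[category] += hours
    total = sum(by_category.values())
    return dict(by_category), total, total / week_hours


by_category, total, fraction = summarize_toil(toil_log)
verdict = "within target" if fraction < TOIL_TARGET else "over target"
print(f"{total} h of toil = {fraction:.1%} of the week ({verdict})")
```

Logging by category rather than as a single number makes it easier to see which toil source to automate first.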

Toil-to-Automation Roadmap

Transform toil into automation:

  • Manual deployment → CI/CD pipeline
  • Manual scaling → Auto-scaling groups + metrics
  • Manual monitoring → Alert rules + dashboards
  • Manual runbook steps → Auto-remediation
  • Manual config changes → Infrastructure-as-code
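One way to order such a roadmap is by payback time: how many weeks of saved toil it takes to recover the effort spent building the automation. The sketch below assumes the automation removes the toil entirely. The weekly hours come from the sample audit, while the `effort_hours` figures are placeholders chosen only for illustration, and `ToilSource` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass
class ToilSource:
    name: str
    automation: str
    hours_per_week: float  # measured toil cost
    effort_hours: float    # placeholder estimate to build the automation

    @property
    def payback_weeks(self) -> float:
        """Weeks until the automation has saved as much time as it cost."""
        return self.effort_hours / self.hours_per_week


# Weekly hours are from the sample audit; effort figures are placeholders.
candidates = [
    ToilSource("Manual deployment", "CI/CD pipeline", 2.0, 40),
    ToilSource("Manual scaling", "Auto-scaling groups + metrics", 1.5, 24),
    ToilSource("Manual monitoring", "Alert rules + dashboards", 0.5, 16),
]

# Shortest payback first
roadmap = sorted(candidates, key=lambda s: s.payback_weeks)
for rank, source in enumerate(roadmap, start=1):
    print(f"{rank}. {source.name} -> {source.automation} "
          f"(pays back in {source.payback_weeks:.0f} weeks)")
```

Payback time is only one possible ranking key; a team might instead weight items by the incidents they cause, since manual deployment mistakes also burn error budget.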

Practical Examples

# error_budget_calculator.py

import json
from datetime import datetime

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


class ErrorBudget:
    # Monthly downtime budget (minutes) for common SLO targets
    SLO_TO_MINUTES = {
        0.999: 43.2,     # 99.9%
        0.9999: 4.32,    # 99.99%
        0.99999: 0.432,  # 99.999%
        0.95: 2160,      # 95%
    }

    def __init__(self, slo: float, service_name: str):
        self.slo = slo
        self.service_name = service_name
        # Fall back to the formula for SLO targets missing from the table
        self.budget_total = self.SLO_TO_MINUTES.get(
            slo, (1 - slo) * MINUTES_PER_MONTH
        )
        self.budget_remaining = self.budget_total
        self.incidents = []

    def record_incident(self, name: str, downtime_minutes: float,
                        severity: str, cause: str):
        """Record an incident and deduct its downtime from the budget"""
        self.incidents.append({
            "name": name,
            "downtime_minutes": downtime_minutes,
            "severity": severity,
            "cause": cause,
            "timestamp": datetime.now().isoformat()
        })
        self.budget_remaining -= downtime_minutes

        if self.budget_remaining < 0:
            self.alert_team()

    def record_deployment_error_rate(self, error_rate: float,
                                     requests_deployed: int):
        """Deduct budget for a deploy whose error rate exceeds what the SLO allows"""
        allowed_error_rate = 1 - self.slo
        if error_rate > allowed_error_rate:
            # Only the errors beyond the SLO allowance count against the budget
            excess_errors = requests_deployed * (error_rate - allowed_error_rate)
            # Rough conversion: assumes about 1,000 requests per second
            minutes_lost = excess_errors / (1000 * 60)
            self.budget_remaining -= minutes_lost

    def get_status(self) -> dict:
        """Current budget status"""
        percentage = (self.budget_remaining / self.budget_total) * 100
        if percentage > 20:
            status = "🟢 OK"
        elif percentage > 5:
            status = "🟡 CAUTION"
        else:
            status = "🔴 CRITICAL"

        return {
            "service": self.service_name,
            "slo": f"{self.slo * 100:g}%",
            "budget_remaining_min": round(self.budget_remaining, 2),
            "percentage_remaining": round(percentage, 1),
            "status": status,
            "incidents_this_month": len(self.incidents),
            "recommendation": self._recommendation(percentage)
        }

    def _recommendation(self, percentage: float) -> str:
        if percentage > 50:
            return "Safe to proceed with risky deployments"
        elif percentage > 20:
            return "Proceed with caution; avoid major changes"
        elif percentage > 5:
            return "Freeze non-critical changes until next period"
        else:
            return "CRITICAL: Freeze all changes; focus on stability"

    def alert_team(self):
        msg = f"ERROR BUDGET EXCEEDED: {self.service_name}"
        print(f"\n🚨 {msg}\n")
        # Send to Slack, PagerDuty, etc.


# Example usage
if __name__ == "__main__":
    payment_budget = ErrorBudget(0.999, "payment-service")

    # Record incidents
    payment_budget.record_incident(
        "Database connection pool exhaustion",
        downtime_minutes=15,
        severity="HIGH",
        cause="Undetected leak"
    )

    payment_budget.record_incident(
        "Bad deployment (config typo)",
        downtime_minutes=5,
        severity="CRITICAL",
        cause="Manual deployment error"
    )

    payment_budget.record_deployment_error_rate(
        error_rate=0.002,  # 0.2% error rate, above the 0.1% the SLO allows
        requests_deployed=50000
    )

    # Check status (ensure_ascii=False keeps the status emoji readable)
    print(json.dumps(payment_budget.get_status(), indent=2, ensure_ascii=False))

When to Prioritize Error Budget vs Toil

Focus: Error Budget vs Toil Elimination
Prioritize Error Budget
  1. SLO is at risk (trending down)
  2. Recent incidents consumed budget
  3. Team is reactive to failures
  4. Need to slow deployment velocity
  5. Building fundamental reliability
Prioritize Toil Elimination
  1. Toil is >40% of engineer time
  2. Team is burned out from manual work
  3. Deployment takes >1 hour
  4. Need to increase feature velocity
  5. Team has spare capacity
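Three of the signals above are numeric and can be encoded as a rough decision helper. This is a sketch under stated assumptions: the 40% toil and 60-minute deployment thresholds come from the lists, while the 20% budget cutoff and the rule that reliability wins when both sides apply are assumptions, not guidance from the text. `suggest_focus` is a hypothetical name.

```python
def suggest_focus(budget_remaining_pct: float, toil_pct: float,
                  deploy_minutes: float) -> str:
    """Suggest where to invest next, based on three measurable signals."""
    slo_at_risk = budget_remaining_pct < 20  # assumed cutoff, not from the text
    toil_heavy = toil_pct > 40 or deploy_minutes > 60
    if slo_at_risk:
        return "error budget"  # assumption: reliability wins when both apply
    if toil_heavy:
        return "toil elimination"
    return "no strong signal"


print(suggest_focus(budget_remaining_pct=10, toil_pct=45, deploy_minutes=90))
print(suggest_focus(budget_remaining_pct=60, toil_pct=45, deploy_minutes=30))
```

The qualitative signals, such as burnout or a reactive team culture, still need a human judgment call; the helper only covers what can be measured.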

Patterns and Pitfalls

Pattern: Publish budget status weekly at the all-hands meeting. A consumed budget should never be a secret; visibility creates organizational alignment around reliability.
Pattern: Each quarter, list the top 5 toil sources, estimate the hours per week each costs, and prioritize automation by impact. This keeps toil visible and stops it from growing unbounded.
Pitfall: Never spending budget on risky experiments. Error budgets exist to enable innovation; if yours always goes unspent, you are being more cautious than the SLO requires.
Pitfall: Not measuring toil. Without measurement, toil grows invisibly, and engineers drift to 50% toil without noticing. Measure weekly, discuss monthly, automate ruthlessly.
Pitfall: Automating everything without prioritizing. You cannot automate 100% of toil, so start with the highest-impact items: deployments, scaling, monitoring.

Design Review Checklist

  • SLO is defined for each critical service (99.9% minimum)
  • Error budget is calculated and tracked monthly
  • Budget status is visible to entire team (dashboard/meeting)
  • Team has decision rules for spending budget (governance policy)
  • Toil is measured weekly or bi-weekly across team
  • Toil measurement includes categories (deploy, incident, monitor, config, scaling)
  • Top 5 toil sources are identified and ranked by impact
  • Automation roadmap exists for toil reduction
  • Post-incident reviews include 'what toil could prevent this?'
  • Toil elimination projects are prioritized equally with feature work

Self-Check

  • Can you state your SLO and current error budget without checking a document?
  • What percentage of your time goes to toil? Is it trending down?
  • Have you spent your error budget intentionally this month, or did incidents consume it?
  • What single toil item would save you the most time if automated?
  • Does your team discuss error budget in planning meetings?

Next Steps

  1. Day 1: Calculate SLO and error budget for 3 critical services
  2. Week 1: Measure toil across team for one week
  3. Week 2: Publish budget status and toil metrics to team
  4. Week 3: Create automation roadmap for top 3 toil sources
  5. Month 1: Implement first toil automation (likely CI/CD improvement)
  6. Ongoing: Weekly budget/toil tracking; monthly planning
