
SLO, SLI, SLA, and Error Budgets

Define reliability targets and govern software delivery using error budgets.

TL;DR

SLI (Service Level Indicator): a metric that measures service behavior (uptime %, latency, error rate). SLO (Service Level Objective): a target for the SLI (e.g., 99.9% uptime). SLA (Service Level Agreement): a legal contract; a breach has consequences (refunds, service credits). Error budget: the allowable downtime/errors. If the SLO is 99.9% uptime (0.1% downtime), the monthly error budget is about 43 minutes. When the error budget is exhausted, freeze features and focus on stability. Error budget enforcement prevents teams from prioritizing velocity over reliability.

Learning Objectives

  • Define meaningful SLIs for your service
  • Set realistic SLOs using SLI data
  • Calculate error budgets
  • Use error budgets to govern deployment and feature development
  • Implement SLI monitoring and alerting
  • Establish SLAs with customers
  • Avoid SLO mistakes (too strict, unmeasurable)
  • Scale SLOs to multi-service systems

Motivating Scenario

Product wants to ship 10 features this quarter. Engineering says: "We need a day to stabilize after each feature." They ship 7 features, stability tanks, downtime climbs, and an SLA breach looms. With error budgets: Product knows the monthly error budget is 43 minutes. Current spend: 12 minutes, leaving 31 minutes of room. Features 8-10 are risky; shipping them would use up the budget. Product decides: ship feature 8, then focus on stability for the rest of the quarter. Everyone understands the tradeoff; no friction.

Core Concepts

Definitions

| Term | Definition | Example |
| --- | --- | --- |
| SLI | Measured indicator of service behavior | 99.5% of requests succeed |
| SLO | Target for the SLI | "We aim for a 99.9% success rate" |
| SLA | Legal contract; penalty if breached | 99.5% uptime guaranteed; refund on breach |
| Error Budget | Allowed failures before SLO breach | If SLO is 99%, budget is 1% failures |

Error Budget Calculation

SLO Target: 99.9% uptime

Downtime Budget = (100% - SLO%) × Time Period
Monthly: (100% - 99.9%) × 30 days × 24 hours × 60 min = 43.2 minutes
Quarterly: (100% - 99.9%) × 90 days × 24 hours × 60 min = 129.6 minutes
Annual: (100% - 99.9%) × 365 days × 24 hours × 60 min = 525.6 minutes
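The formula above can be sketched as a small helper; this is a minimal illustration, assuming a 30-day month for the monthly window:

```python
def error_budget_minutes(slo_percent: float, days: int) -> float:
    """Downtime budget in minutes for a given SLO target and window length."""
    return (100.0 - slo_percent) / 100.0 * days * 24 * 60

monthly = error_budget_minutes(99.9, 30)     # 43.2 minutes
quarterly = error_budget_minutes(99.9, 90)   # 129.6 minutes
annual = error_budget_minutes(99.9, 365)     # 525.6 minutes
print(monthly, quarterly, annual)
```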

SLI Types

| Type | Example | How to Measure |
| --- | --- | --- |
| Availability | HTTP requests returning success | Successful requests / Total requests |
| Latency | Requests completing within 100ms | Requests < 100ms / Total requests |
| Error Rate | Requests not returning 5xx errors | Non-5xx responses / Total requests |
| Durability | Data not lost | Successful writes and reads / Total |
| Freshness | Data current within 5 minutes | Queries returning fresh data / Total |

SLO Targets (Common)

| Service | SLO | Error Budget |
| --- | --- | --- |
| Basic SaaS | 99% (2 nines) | 7.2 hours/month |
| Standard SaaS | 99.9% (3 nines) | 43.2 minutes/month |
| Critical Infra | 99.99% (4 nines) | 4.3 minutes/month |
| Tier-1 Critical | 99.999% (5 nines) | 26 seconds/month |

SLI Implementation

from prometheus_client import Counter, Histogram
import time

# SLI: Success Rate
requests_total = Counter(
    'requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

requests_success = Counter(
    'requests_success',
    'Successful HTTP requests',
    ['method', 'endpoint']
)

# SLI: Latency
request_duration = Histogram(
    'request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# SLI: Error Rate
errors_total = Counter(
    'errors_total',
    'Total errors',
    ['type', 'service']
)

def handle_request(method, endpoint, handler):
    """Middleware to record SLIs."""
    start = time.time()

    try:
        result = handler()
        requests_total.labels(method=method, endpoint=endpoint, status='200').inc()
        requests_success.labels(method=method, endpoint=endpoint).inc()

        duration = time.time() - start
        request_duration.labels(method=method, endpoint=endpoint).observe(duration)

        return result

    except Exception as e:
        requests_total.labels(method=method, endpoint=endpoint, status='500').inc()
        errors_total.labels(type=type(e).__name__, service='checkout').inc()

        duration = time.time() - start
        request_duration.labels(method=method, endpoint=endpoint).observe(duration)

        raise

# SLI Calculator
class SLICalculator:
    def __init__(self, prometheus_client):
        self.prom = prometheus_client

    def calculate_success_rate(self, endpoint, window_minutes=5):
        """Calculate the success rate for an endpoint."""
        query = f'''
            sum(rate(requests_success{{endpoint="{endpoint}"}}[{window_minutes}m]))
            /
            sum(rate(requests_total{{endpoint="{endpoint}"}}[{window_minutes}m]))
        '''
        return self.prom.query(query)

    def calculate_latency_percentile(self, endpoint, percentile=99, window_minutes=5):
        """Calculate latency at a percentile."""
        # histogram_quantile needs the bucket boundary label, so aggregate by (le)
        query = f'''
            histogram_quantile({percentile / 100},
                sum by (le) (rate(request_duration_seconds_bucket{{endpoint="{endpoint}"}}[{window_minutes}m]))
            )
        '''
        return self.prom.query(query)

    def calculate_error_rate(self, service, window_minutes=5):
        """Calculate the error rate."""
        query = f'''
            sum(rate(errors_total{{service="{service}"}}[{window_minutes}m]))
            /
            sum(rate(requests_total{{service="{service}"}}[{window_minutes}m]))
        '''
        return self.prom.query(query)

# SLO Monitoring
class SLOMonitor:
    def __init__(self, slo_target_percent=99.9):
        self.slo_target = slo_target_percent / 100
        self.current_sli = None
        self.error_budget = None

    def update_sli(self, measured_sli):
        """Update the current SLI measurement."""
        self.current_sli = measured_sli
        self.recalculate_error_budget()

    def recalculate_error_budget(self):
        """Calculate the remaining error budget."""
        if not self.current_sli:
            return

        monthly_window = 30 * 24 * 60  # minutes

        # How much error budget we have used
        error_used = (1 - self.current_sli) * monthly_window

        # Total budget allowed
        total_budget = (1 - self.slo_target) * monthly_window

        # Remaining budget
        remaining = total_budget - error_used

        self.error_budget = {
            'total_minutes': total_budget,
            'used_minutes': error_used,
            'remaining_minutes': remaining,
            'percent_used': (error_used / total_budget * 100) if total_budget > 0 else 0
        }

    def is_budget_exhausted(self):
        """Check whether the error budget is exhausted."""
        if not self.error_budget:
            return False
        return self.error_budget['remaining_minutes'] <= 0

    def get_budget_status(self):
        """Return budget status for monitoring."""
        return {
            'slo_target': f"{self.slo_target * 100}%",
            'current_sli': f"{self.current_sli * 100:.2f}%",
            'error_budget': self.error_budget,
            'budget_exhausted': self.is_budget_exhausted()
        }

# SLO Alerting
class SLOAlerter:
    def __init__(self, monitor):
        self.monitor = monitor

    def check_slo_breach(self):
        """Alert if the SLO is breached."""
        if not self.monitor.current_sli:
            return None

        if self.monitor.current_sli < self.monitor.slo_target:
            return {
                'alert': 'SLO_BREACHED',
                'slo_target': self.monitor.slo_target,
                'current_sli': self.monitor.current_sli,
                'message': (
                    f"SLO breached! Target: {self.monitor.slo_target * 100}%, "
                    f"Current: {self.monitor.current_sli * 100:.2f}%"
                )
            }

        return None

    def check_budget_warning(self):
        """Alert if the budget is almost exhausted."""
        if not self.monitor.error_budget:
            return None

        percent_used = self.monitor.error_budget['percent_used']

        if percent_used > 80:
            return {
                'alert': 'ERROR_BUDGET_WARNING',
                'percent_used': percent_used,
                'remaining_minutes': self.monitor.error_budget['remaining_minutes'],
                'message': (
                    f"Error budget {percent_used:.1f}% exhausted. "
                    f"{self.monitor.error_budget['remaining_minutes']:.1f} minutes remaining."
                )
            }

        return None

# Example Usage
monitor = SLOMonitor(slo_target_percent=99.9)

# Simulate SLI measurements
measurements = [0.9991, 0.9992, 0.9989, 0.9993, 0.9988]  # 99.88% to 99.93%
average_sli = sum(measurements) / len(measurements)
monitor.update_sli(average_sli)

print(f"SLO Status: {monitor.get_budget_status()}")

alerter = SLOAlerter(monitor)
slo_alert = alerter.check_slo_breach()
budget_alert = alerter.check_budget_warning()

if slo_alert:
    print(f"Alert: {slo_alert['message']}")
if budget_alert:
    print(f"Warning: {budget_alert['message']}")

# Multi-SLI Service
class MultiSLIService:
    def __init__(self):
        self.slis = {
            'availability': SLOMonitor(slo_target_percent=99.9),
            'latency_p99': SLOMonitor(slo_target_percent=99.9),
            'error_rate': SLOMonitor(slo_target_percent=99.9),
        }

    def update_slis(self, measurements):
        """Update all SLIs."""
        for sli_name, measured_value in measurements.items():
            if sli_name in self.slis:
                self.slis[sli_name].update_sli(measured_value)

    def get_overall_status(self):
        """Overall service status."""
        all_within_slo = all(
            not monitor.is_budget_exhausted()
            for monitor in self.slis.values()
        )

        return {
            'all_slos_met': all_within_slo,
            'slis': {
                name: monitor.get_budget_status()
                for name, monitor in self.slis.items()
            }
        }

service = MultiSLIService()
service.update_slis({
    'availability': 0.9991,
    'latency_p99': 0.9989,
    'error_rate': 0.9992,
})
print(f"Service Status: {service.get_overall_status()}")

Real-World Examples & Patterns

E-Commerce Checkout SLO

  • Availability SLI: (Successful checkouts / Total checkouts) × 100
  • SLO: 99.95% (21.6 minutes of error budget/month)
  • Latency SLI: (Checkouts < 5 seconds / Total) × 100
  • SLO: 99% (p99 latency < 5 seconds)

When error budget exhausted: Freeze new checkout features; focus on stability.

API Service SLO

  • Success Rate SLI: (2xx + 3xx responses / Total) × 100
  • SLO: 99.9%
  • Latency SLI: (Requests < 100ms / Total) × 100
  • SLO: 99% for p99
  • Error Rate SLI: (Non-5xx / Total) × 100
  • SLO: 99.99%
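The three SLIs above reduce to simple ratios over request counts. A minimal sketch, using hypothetical counts purely for illustration:

```python
# Hypothetical one-month request counts for an API service
total = 1_000_000
ok_2xx_3xx = 999_050    # 2xx + 3xx responses
under_100ms = 991_200   # requests completing in < 100ms
non_5xx = 999_910       # anything except 5xx

success_rate = ok_2xx_3xx / total    # SLO: >= 99.9%
latency_sli = under_100ms / total    # SLO: >= 99%
error_rate_sli = non_5xx / total     # SLO: >= 99.99%

print(f"{success_rate:.4%} {latency_sli:.4%} {error_rate_sli:.4%}")
```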

SLA Examples

  • Basic SaaS: 95% uptime, 5% service credit
  • Standard SaaS: 99.5% uptime, 25% service credit per 0.5%
  • Enterprise: 99.99% uptime, 100% service credit if missed

Common Mistakes and Pitfalls

Mistake 1: SLO Too Strict

❌ WRONG: "We'll do 99.999% uptime"
- Requires 5-nines infrastructure (very expensive)
- Single datacenter can't meet this
- Unrealistic; most customers won't notice the difference

✅ CORRECT: "99.9% meets customer needs"
- Realistic, achievable with decent infrastructure
- 43 minutes monthly downtime acceptable
- Balance cost and reliability

Mistake 2: Unmeasurable SLI

❌ WRONG: "Service should be fast"
- No measurement, no tracking
- Can't know if met

✅ CORRECT: "95% of requests < 100ms"
- Concrete, measurable
- Easy to track and alert

Mistake 3: Ignoring Error Budget

❌ WRONG: SLO exists but no enforcement
- Teams ship regardless
- Error budget unused
- Reliability issues accumulate

✅ CORRECT: Error budget governs deployment
- Error budget exhausted = no new features
- Prevents reliability debt
- Everyone understands tradeoff

Production Considerations

SLO Setting Process

  1. Measure current: How reliable is service today?
  2. Define targets: What do customers need?
  3. Validate: Can we meet targets?
  4. Implement monitoring: Track SLI continuously
  5. Enforce budget: Gate deployments on budget
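Step 5 can be as simple as a pre-deploy check. The function name `can_deploy` and the per-deploy risk cost are illustrative assumptions; `remaining_minutes` would come from your SLO monitoring:

```python
def can_deploy(remaining_minutes: float, risk_cost_minutes: float = 5.0) -> bool:
    """Allow a deploy only if budget remains after the expected risk cost.

    risk_cost_minutes is a hypothetical estimate of downtime a typical
    deploy might consume if it goes wrong.
    """
    return remaining_minutes - risk_cost_minutes > 0

print(can_deploy(31.0))  # budget available: allow the deploy
print(can_deploy(3.0))   # budget nearly exhausted: block the deploy
```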

Multi-Service SLOs

For microservices, total SLO is the product:

Service A: 99.9%
Service B: 99.9%
Service C: 99.9%

Overall: 99.9% × 99.9% × 99.9% = 99.7%

Workaround: Make internal service SLOs higher, or use fallbacks (caching, degradation).
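The compound availability above is just the product of each service's availability, assuming the services fail independently and are all on the request path:

```python
from math import prod

# Three services in series, each at 99.9% availability
services = [0.999, 0.999, 0.999]
overall = prod(services)
print(f"{overall:.4%}")  # roughly 99.70%
```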

Error Budget Allocation

  • Feature development: 70% of budget
  • Infrastructure improvements: 20%
  • Operational overhead: 10%

When budget exhausted, shift: 0% features, 100% stability.
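Applying that split to the 43.2-minute monthly budget from earlier gives concrete per-category allowances; the percentages are the policy stated above:

```python
budget = 43.2  # monthly error budget in minutes for a 99.9% SLO
allocation = {"features": 0.70, "infra": 0.20, "ops": 0.10}

# Minutes of budget each category may consume per month
minutes = {k: round(budget * v, 2) for k, v in allocation.items()}
print(minutes)  # {'features': 30.24, 'infra': 8.64, 'ops': 4.32}
```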

Self-Check

  • What's the difference between SLO and SLA?
  • How do you calculate error budget?
  • What does it mean to exhaust error budget?
  • How should you respond to SLO breach?
  • What SLI is most important for your service?

Design Review Checklist

  • SLI defined and measurable?
  • SLO targets realistic?
  • Error budget calculated?
  • SLI monitoring configured?
  • Alerts set for breaches?
  • Alerts set for budget warnings?
  • Error budget enforcement process?
  • Multiple SLIs per service?
  • Customer expectations aligned?
  • SLA defined (if applicable)?
  • Error budget tracked publicly?
  • Runbook for SLO breaches?

Next Steps

  1. Define SLIs for your service
  2. Set SLO targets based on customer needs
  3. Implement SLI monitoring
  4. Calculate error budget monthly
  5. Create alerts for breaches and warnings
  6. Enforce error budget gating for deployments
  7. Review SLOs quarterly
