Health Probes

Detect failures quickly and enable automatic self-healing

TL;DR

Services silently fail without active monitoring. Liveness probes answer "Is the service alive?" (restart if dead). Readiness probes answer "Can it serve traffic?" (remove from load balancer if not ready). Implement /health and /ready endpoints that check dependencies and resource usage. Integrate with orchestrators like Kubernetes for automatic self-healing. Fast detection (< 30 seconds) prevents cascading failures. Both types are essential: liveness detects zombie processes; readiness handles temporary overload.

Learning Objectives

  • Distinguish liveness vs. readiness probes and their recovery actions
  • Design health checks that detect both hard failures and performance degradation
  • Implement deep health checks that validate dependencies, not just process existence
  • Configure probe frequency, timeout, and failure thresholds appropriately
  • Integrate probes with container orchestrators and load balancers

Motivating Scenario

A Kubernetes cluster runs 10 payment service replicas. One replica's database connection pool exhausts (a leak in the code). The service is still running (process alive) but cannot process payments. Without readiness probes, the load balancer continues routing traffic to it. Customers experience timeouts on 10% of requests. With readiness probes, Kubernetes detects database connectivity failure within 10 seconds, removes the replica from the service mesh, and distributes traffic to healthy replicas. Customer-facing latency stays within SLA.

Core Concepts

Liveness vs. Readiness Probes

Liveness Probes detect when a process is stuck and needs restarting. A stuck service won't recover on its own—restart it. Kubernetes kills the container and starts a new one.

Readiness Probes detect when a service is temporarily unable to serve traffic (database connection pool exhausted, downstream service slow, disk full). The process is alive but degraded. Remove it from the load balancer so clients route to healthy replicas; once it recovers, re-add it.

Both probe types query lightweight HTTP endpoints on the service (a single /health, or separate /live and /ready paths). Each endpoint must check its critical dependencies quickly and return a status: a slow health check defeats its purpose, because detection ends up taking longer than recovery itself.
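In Kubernetes, the two probe types map to separate fields in the container spec. A hedged sketch, assuming the service exposes /live and /ready on port 8080 (paths, port, and threshold values here are illustrative, not prescribed by this pattern):

```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  periodSeconds: 10      # probe every 10 seconds
  timeoutSeconds: 2      # fail fast on slow responses
  failureThreshold: 3    # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3    # remove from Service endpoints after 3 failures
```

A failed liveness probe restarts the container; a failed readiness probe only removes the pod from Service endpoints until it passes again.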

Practical Example

import time
import asyncio
from enum import Enum
from datetime import datetime

class ProbeStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class HealthProbe:
    def __init__(self, timeout_sec=5, check_interval_sec=10):
        self.timeout_sec = timeout_sec
        self.check_interval_sec = check_interval_sec
        self.start_time = time.time()
        self.db_connected = True
        self.cache_connected = True
        self.request_latency_ms = 50
        self.queue_depth = 100
        self.max_queue = 10000

    async def check_database(self):
        """Simulate a database connectivity check."""
        try:
            # In real code: execute SELECT 1 or equivalent
            await asyncio.sleep(0.01)
            return self.db_connected
        except Exception:
            return False

    async def check_cache(self):
        """Simulate a cache connectivity check."""
        try:
            await asyncio.sleep(0.005)
            return self.cache_connected
        except Exception:
            return False

    async def check_latency(self):
        """Check whether request latency is acceptable (threshold: 500 ms)."""
        return self.request_latency_ms < 500

    async def check_resources(self):
        """Check queue depth against capacity (threshold: 90%)."""
        queue_utilization = self.queue_depth / self.max_queue
        return queue_utilization < 0.9

    async def liveness_check(self):
        """Minimal process check (orchestrator restarts the container on failure).

        Deliberately avoids dependency checks: a down database should fail
        readiness, not trigger a restart loop.
        """
        try:
            # If the event loop can schedule work within the timeout,
            # the process is alive and not deadlocked.
            await asyncio.wait_for(asyncio.sleep(0), timeout=self.timeout_sec)
            return ProbeStatus.HEALTHY
        except asyncio.TimeoutError:
            return ProbeStatus.UNHEALTHY

    async def readiness_check(self):
        """Can the service serve traffic? (removed from the load balancer if not)"""
        try:
            db = await asyncio.wait_for(self.check_database(), timeout=2)
        except asyncio.TimeoutError:
            db = False
        try:
            cache = await asyncio.wait_for(self.check_cache(), timeout=2)
        except asyncio.TimeoutError:
            cache = False
        latency_ok = await self.check_latency()
        resources_ok = await self.check_resources()

        # Hard failures: stop accepting traffic
        if not db or not resources_ok:
            return ProbeStatus.UNHEALTHY

        # Soft failures: degrade but stay in rotation
        if not cache or not latency_ok:
            return ProbeStatus.DEGRADED

        return ProbeStatus.HEALTHY

    async def get_health_report(self):
        """Full health report for the /health endpoint."""
        liveness = await self.liveness_check()
        readiness = await self.readiness_check()

        return {
            "timestamp": datetime.now().isoformat(),
            "liveness": liveness.value,
            "readiness": readiness.value,
            "uptime_seconds": int(time.time() - self.start_time),
            "dependencies": {
                "database": "connected" if self.db_connected else "disconnected",
                "cache": "connected" if self.cache_connected else "disconnected"
            },
            "metrics": {
                "request_latency_ms": self.request_latency_ms,
                "queue_depth": self.queue_depth,
                "queue_utilization_percent": (self.queue_depth / self.max_queue) * 100
            }
        }

# Example usage
async def main():
    probe = HealthProbe()

    # Simulate normal operation
    report = await probe.get_health_report()
    print("Health Report:", report)

    # Simulate degradation: a down database fails readiness, not liveness
    probe.db_connected = False
    report = await probe.get_health_report()
    print("Degraded Report:", report)

asyncio.run(main())
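To expose these checks over HTTP, the probe status must be mapped to the status codes that orchestrators and load balancers act on: 200 when healthy, 503 otherwise. A minimal stdlib sketch (the handler, stubbed probe logic, and endpoint paths are illustrative assumptions, not part of the example above):

```python
import asyncio
from enum import Enum
from http.server import BaseHTTPRequestHandler, HTTPServer

class ProbeStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

def status_to_http(status: ProbeStatus) -> int:
    """Orchestrators key off the status code: 200 = pass, anything else = fail."""
    return 200 if status is ProbeStatus.HEALTHY else 503

class ProbeHandler(BaseHTTPRequestHandler):
    """Illustrative handler: /live and /ready return the mapped status code."""

    # In a real service these would call HealthProbe.liveness_check()
    # and HealthProbe.readiness_check(); here they are stubbed.
    async def _probe(self, path):
        return ProbeStatus.HEALTHY if path == "/live" else ProbeStatus.DEGRADED

    def do_GET(self):
        status = asyncio.run(self._probe(self.path))
        self.send_response(status_to_http(status))
        self.end_headers()
        self.wfile.write(status.value.encode())

if __name__ == "__main__":
    # HTTPServer(("", 8080), ProbeHandler).serve_forever()  # uncomment to serve
    print(status_to_http(ProbeStatus.DEGRADED))  # 503: removed from rotation
```

Returning 503 for DEGRADED is a deliberate choice here: a degraded replica is pulled from rotation, while the liveness endpoint stays 200 so it is not restarted.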

When to Use vs. When NOT to Use

Use Health Probes
  1. Running in Kubernetes or auto-scaling environments
  2. Multiple service replicas behind a load balancer
  3. Dependencies that can fail (databases, caches, APIs)
  4. Long-running processes that may hang or deadlock
  5. Services with both availability and performance SLAs
Avoid Health Probes
  1. Short-lived batch jobs or CLI tools, where no orchestrator acts on the result
  2. Single-instance deployments where an automated restart is more disruptive than the failure

Patterns and Pitfalls

  • Separate liveness and readiness endpoints. /live checks process existence (minimal); /ready checks dependencies (complete). Kubernetes can fail readiness without restarting the container, which suits temporary issues like database maintenance.
  • Check real connectivity. Readiness probes should exercise every critical dependency, not just read configuration. Database down? Don't accept traffic. Cache timeout? Degrade but stay ready. Queue backed up? Don't accept traffic.
  • Keep checks faster than the probe interval. A 30-second health check on a 10-second probe interval means overlapping checks, and a failure takes 30+ seconds to detect. Use 1-5 second timeouts and fail fast on any latency above the threshold.
  • Keep liveness minimal. A liveness probe that pings five dependencies, writes to a log, and queries the database is really a readiness check. Liveness should answer one question: is the process alive? Separating the concerns gives faster detection.
  • Drain before terminating. When a readiness probe fails and the pod is shut down, don't drop in-flight requests; add a PreStop hook so the container finishes them before terminating. This prevents customer-facing errors from abrupt termination.
  • Tune failure thresholds. Failing after 1 probe causes false positives and thrashing; failing after 100 delays detection. Use 2-3 consecutive failures for readiness and 3-5 for liveness.
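The consecutive-failure threshold above can be sketched as a small state machine. A hedged example (the class and its names are illustrative, not a standard API):

```python
class FailureTracker:
    """Flip ready state only after N consecutive probe failures (debouncing)."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.ready = True

    def record(self, probe_passed: bool) -> bool:
        """Record one probe result; return the current ready state."""
        if probe_passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.consecutive_successes >= self.success_threshold:
                self.ready = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.consecutive_failures >= self.failure_threshold:
                self.ready = False
        return self.ready

tracker = FailureTracker(failure_threshold=3)
results = [tracker.record(ok) for ok in [False, False, False, True]]
print(results)  # [True, True, False, True]
```

A one-off blip leaves the replica in rotation; only the third consecutive failure removes it, and a single success restores it.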

Design Review Checklist

  • Liveness probe is minimal (< 2 sec) and only checks process state
  • Readiness probe checks all critical dependencies (database, cache, APIs)
  • Readiness probe includes latency and resource checks (queue depth, memory)
  • Health endpoints return HTTP 200 when healthy, 503 when unhealthy
  • Failure thresholds are tuned (2-3 for readiness, 3-5 for liveness)
  • Probe interval is 10-30 seconds; timeout is 1-5 seconds
  • Health checks have detailed logging for debugging failures
  • Load balancer configuration matches probe behavior (uses readiness)
  • Kubernetes probes are defined in container specs
  • Monitoring alerts on repeated probe failures
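Probe interval, timeout, and failure threshold jointly determine worst-case detection time. A rough back-of-envelope helper (the formula is an approximation that ignores probe scheduling jitter):

```python
def worst_case_detection_sec(period_sec, failure_threshold, timeout_sec):
    """Approximate worst-case time from failure onset to orchestrator action.

    A failure just after a successful probe waits up to a full period per
    probe, needs `failure_threshold` consecutive failing probes, and the
    last probe can take up to `timeout_sec` before timing out.
    """
    return period_sec * failure_threshold + timeout_sec

# Readiness tuned for speed: 10 s interval, 2 failures, 2 s timeout
print(worst_case_detection_sec(10, 2, 2))  # 22
# Liveness tuned for stability: 10 s interval, 3 failures, 5 s timeout
print(worst_case_detection_sec(10, 3, 5))  # 35
```

Running the numbers like this shows why tight readiness thresholds matter: with the conservative liveness settings, detection alone exceeds a 30-second target.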

Self-Check

  • Can you explain why liveness and readiness are different actions?
  • What should a liveness probe check? What should it skip?
  • How do you distinguish between a bug (needs liveness) vs. overload (needs readiness)?
  • What's the impact if a health check takes longer than the probe timeout?
  • How does your load balancer respond to failed readiness probes?

Next Steps

  1. Circuit Breaker: Read Circuit Breaker to detect dependency failures
  2. Load Shedding: Learn Load Shedding and Backpressure to handle overload gracefully
  3. Bulkhead Isolation: Read Bulkhead Isolation to prevent cascading resource exhaustion

References