Partition Tolerance and Failure Modes

Design systems that gracefully degrade when networks fail and prepare for the inevitable cascading failures

TL;DR

Network partitions are inevitable, not rare. When a partition occurs, split-brain is a risk: multiple parts of the system operate independently and diverge. Failure detection is hard: you cannot distinguish a slow node from a dead node. Design for graceful degradation: when critical services fail, downstream services degrade functionality rather than failing completely. Cascade failures are the real danger: one service's failure triggers timeouts that exhaust resources in connected services, triggering their failures.

Learning Objectives

  • Understand different types of network partitions and how they cascade
  • Recognize the split-brain problem and how to prevent it
  • Implement failure detection strategies and their limitations
  • Design services that gracefully degrade when dependencies fail

Motivating Scenario

A database goes down. The API layer has connection pools sized for normal load. Requests that would normally complete in 50ms now wait for a timeout (30 seconds). The API exhausts its thread pool. Requests queue up, consuming memory. The memory pressure triggers garbage collection pauses, making the API slower, making timeouts more likely. Within minutes, the entire system is down even though only the database failed.

This is a cascading failure. Understanding partitions and failure modes is how you prevent it.

Types of Network Partitions

Hard Partitions

Complete isolation between parts of the system. No messages can pass. Detection is unambiguous, but each side must choose: serve potentially stale data or reject requests.

Examples:

  • Network cable cut between data centers
  • BGP route failure blocking all packets
  • Firewall misconfiguration blocking traffic

Detection: Straightforward. After a timeout, it's clear no response is coming.
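
As a minimal sketch (the peer address below is hypothetical), a blocking connect with a short timeout is the simplest reachability probe:

import socket

# Hypothetical address of a peer on the far side of a suspected partition.
PEER = ("10.0.1.42", 5432)

def peer_reachable(address, timeout_seconds=2.0):
    """Return True if a TCP connection can be opened before the timeout."""
    try:
        with socket.create_connection(address, timeout=timeout_seconds):
            return True
    except OSError:
        # Timed out, refused, or no route: no response is coming.
        return False

print(peer_reachable(PEER))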

Soft Partitions

High latency or intermittent packet loss. Messages eventually arrive, but slowly. Ambiguous: is the node slow or dead?

Examples:

  • Congestion on a network link
  • Satellite networks with high jitter
  • System under extreme load

Detection: Difficult. Timeouts must be high enough to tolerate the jitter but low enough to detect real failures.

Asymmetric Partitions

A can reach B, but B cannot reach A. Dangerous because each side believes it's still functioning.

Examples:

  • Firewall with one-way rules
  • TCP timeout asymmetry (A times out, B still retrying)
  • BGP route asymmetries

Detection: Hardest of all. A believes it can talk to B. B believes A has abandoned it.

The Split-Brain Problem

When a partition divides a system into two parts, each part may think the other is dead. Both parts continue operating, diverging in their state. When the partition heals, you have two conflicting versions of reality.

Split-Brain Scenario

The Danger

During a partition:

  1. Heartbeats stop crossing the partition
  2. The failure-detection timeout expires on both sides
  3. Each side concludes the other is dead and claims (or keeps) leadership
  4. Both leaders accept writes
  5. Data diverges irreparably

Prevention Strategies

Quorum-Based Consensus: Don't make decisions that affect the system unless you have a quorum (majority) of nodes agreeing. The side of the partition without a quorum cannot make decisions.

3-node system: A, B, C

Partition splits into {A} and {B, C}

Before partition:
- All nodes agree on state
- Any node can lead

After partition:
- {A} partition: no quorum (1 < 2)
→ Cannot be leader, rejects writes
- {B, C} partition: has quorum (2 >= 2)
→ Can elect leader, accepts writes
- When partition heals:
→ {A} was never conflicting (was read-only)
→ No split-brain!
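
The quorum rule itself is a one-line check. A minimal sketch (function and variable names are illustrative, not from any particular library):

def has_quorum(cluster_size, reachable_nodes):
    """A side of a partition may lead only if it can see a strict majority."""
    majority = cluster_size // 2 + 1
    return reachable_nodes >= majority

# 3-node cluster split into {A} and {B, C}
print(has_quorum(3, 1))  # False -> {A} must stay read-only and reject writes
print(has_quorum(3, 2))  # True  -> {B, C} may elect a leader and accept writes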

Failure Detection

Failure detection is a fundamental problem. A timeout doesn't prove failure—it just proves no response arrived.

Strategies

Heartbeats: Periodic "I'm alive" messages from every node. Missing heartbeats suggest failure. Problem: still doesn't distinguish slow from dead.
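
A minimal heartbeat-monitor sketch (in a real system this check would run on a timer and feed a membership or alerting component; the names here are illustrative):

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between "I'm alive" messages
SUSPECT_AFTER = 3          # missed intervals before suspecting failure

last_heartbeat = {}        # node id -> time the last heartbeat was received

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def suspected_down(node_id):
    """True if several heartbeats are missing -- the node may be dead,
    partitioned away, or merely slow; we cannot tell which."""
    last = last_heartbeat.get(node_id)
    if last is None:
        return True
    return time.monotonic() - last > SUSPECT_AFTER * HEARTBEAT_INTERVAL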

Ping/Pong: Service A pings service B. B responds. A measures the round-trip time. Problem: high-latency networks trigger false positives.

Adaptive Timeout: Adjust the timeout based on recent latencies. A phi accrual failure detector takes this further: instead of a binary up/down verdict, it computes a suspicion level for "node is down" from observed heartbeat timing. Problem: more complex and system-dependent.
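
The full phi accrual detector models the heartbeat inter-arrival distribution, but the core idea of adapting the threshold to observed latency can be sketched with a rolling mean plus a safety margin (the window size, multiplier, and defaults below are arbitrary choices):

from collections import deque
from statistics import mean, stdev

class AdaptiveTimeout:
    """Timeout = recent mean round-trip time plus several standard deviations."""

    def __init__(self, window=100, multiplier=4.0, floor=0.05):
        self.samples = deque(maxlen=window)   # recent RTTs in seconds
        self.multiplier = multiplier
        self.floor = floor                    # never drop below 50 ms

    def record(self, rtt_seconds):
        self.samples.append(rtt_seconds)

    def current_timeout(self):
        if len(self.samples) < 2:
            return 1.0  # conservative default until we have data
        margin = self.multiplier * stdev(self.samples)
        return max(self.floor, mean(self.samples) + margin)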

External Monitoring: A separate system monitors all services. Problem: adds operational complexity and is itself a single point of failure.

The Fundamental Limitation

In an asynchronous system (no global clock), you cannot distinguish:

  • "Node died"
  • "Network partitioned"
  • "Network is very slow"

All look identical from the outside. Your timeouts and thresholds are guesses.

Cascading Failures

Failure in one service can trigger failures in dependent services.

Cascade Failure Sequence

Why It Happens

  1. Timeout propagation: Service A waits 30 seconds for Service B (which is down). Meanwhile, A's caller waits its own 30 seconds for A. Every hop in the chain holds a thread and a connection for the full timeout, so the delays and the held resources add up.

  2. Resource exhaustion: Waiting requests consume memory, threads, connections. Eventually, all capacity is consumed. Even requests to healthy services fail due to resource starvation.

  3. Thundering herd: When a service recovers, all waiting requests try at once, overwhelming it, causing it to fail again.

Prevention

Timeouts must be short: A timeout of 30 seconds cascades terribly. Timeouts of 1-5 seconds prevent resource exhaustion.

Circuit Breaker: Stop sending requests to a failing service. Let it recover. Resume gradually with health checks. The code sketch at the end of this section shows the pattern.

Bulkhead Isolation: Allocate separate resource pools for different services. One service's failure doesn't drain resources needed by others.
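
A bulkhead can be as simple as a separate, bounded thread pool per dependency, so one hung dependency cannot consume every thread (a sketch; call_payments and call_search are hypothetical client functions):

from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools: if the payments service hangs, it can tie up
# at most 10 threads, while search keeps its own capacity.
# call_payments / call_search: hypothetical clients for the two dependencies.
payments_pool = ThreadPoolExecutor(max_workers=10)
search_pool = ThreadPoolExecutor(max_workers=20)

def handle_request(request):
    payment = payments_pool.submit(call_payments, request)
    results = search_pool.submit(call_search, request)
    # Short result timeouts keep the caller from waiting indefinitely.
    return payment.result(timeout=5), results.result(timeout=5)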

Load Shedding: When overloaded, reject new requests rather than queuing them. Better to fail fast than fail slowly to everyone.
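
Load shedding can be sketched as a bounded in-flight counter: once the limit is reached, new requests fail fast instead of queuing (the limit and the exception type are illustrative):

import threading

MAX_IN_FLIGHT = 200
_in_flight = threading.Semaphore(MAX_IN_FLIGHT)

class Overloaded(Exception):
    """Raised to reject work immediately rather than queue it behind a saturated service."""

def handle(request, process):
    if not _in_flight.acquire(blocking=False):
        raise Overloaded("shedding load: too many requests in flight")
    try:
        return process(request)
    finally:
        _in_flight.release()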

Putting short timeouts and circuit breaking together (a sketch using a hypothetical CircuitBreaker class; requests, service_b, and cache stand in for your own clients):

# Service A calling Service B
circuit_breaker = CircuitBreaker(
    failure_threshold=5,    # open the circuit after 5 failures
    recovery_timeout=60,    # wait 60 seconds before probing again
    half_open_max_calls=2,  # test 2 calls before fully closing
)

for request in requests:
    try:
        # If the circuit is open, this fails immediately.
        # If it is closed or half-open, the call is attempted.
        response = circuit_breaker.call(
            lambda: service_b.call(),
            timeout=5,  # short timeout!
        )
    except CircuitBreakerOpen:
        # Service B is down; serve a cached/degraded response instead.
        response = cache.get(request.key, "default value")

Graceful Degradation

When a service fails, downstream services shouldn't fail with it. Instead, they degrade gracefully (a sketch follows the list below):

  • Cache Hits: Serve cached data if source is unavailable
  • Fallback Values: Use sensible defaults for missing data
  • Reduced Features: Return basic functionality, not full functionality
  • Async Processing: Accept request, process later if synchronous source is unavailable
  • Read-Only Mode: Continue serving reads, reject writes that depend on the failed service
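
As a sketch of the first two strategies (serve from cache, then a sensible default; fetch_profile is a hypothetical client for the primary data source):

DEFAULT_PROFILE = {"name": "Guest", "recommendations": []}
profile_cache = {}

def get_profile(user_id):
    """Prefer fresh data, fall back to cache, then to a degraded default."""
    try:
        profile = fetch_profile(user_id, timeout=2)  # hypothetical primary source
        profile_cache[user_id] = profile
        return profile
    except Exception:
        # Primary source is unreachable or slow: degrade instead of failing.
        return profile_cache.get(user_id, DEFAULT_PROFILE)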

Self-Check

  1. What happens in your system when the database becomes unreachable?
  2. Can a single timeout cascade through your service chain?
  3. How would your system behave during a network partition?
  4. What services rely on each other? Where are the weak links?

One Takeaway

Partitions are not exceptions; they're inevitable. Design for graceful degradation, not for things working perfectly.

Next Steps

  1. Enable Safe Retries: Learn Idempotency
  2. Implement Resilience: Explore Timeouts and Retries
  3. Prevent Cascades: Read about Circuit Breaker
