Partition Tolerance and Failure Modes

Design systems that gracefully degrade when networks fail and prepare for the inevitable cascading failures

TL;DR

Network partitions are inevitable, not rare. When a partition occurs, split-brain is a risk: multiple parts of the system operate independently and diverge. Failure detection is hard: you cannot distinguish a slow node from a dead node. Design for graceful degradation: when critical services fail, downstream services degrade functionality rather than failing completely. Cascade failures are the real danger: one service's failure triggers timeouts that exhaust resources in connected services, triggering their failures.

Learning Objectives

  • Understand different types of network partitions and how they cascade
  • Recognize the split-brain problem and how to prevent it
  • Implement failure detection strategies and their limitations
  • Design services that gracefully degrade when dependencies fail

Motivating Scenario

A database goes down. The API layer has connection pools sized for normal load. Requests that would normally complete in 50ms now wait for a timeout (30 seconds). The API exhausts its thread pool. Requests queue up, consuming memory. The memory pressure triggers garbage collection pauses, making the API slower, making timeouts more likely. Within minutes, the entire system is down even though only the database failed.

This is a cascading failure. Understanding partitions and failure modes is how you prevent it.

Types of Network Partitions

Hard Partitions

Complete isolation between parts of the system. No messages can pass. Detection is unambiguous, but each side must choose: serve potentially stale data or reject requests.

Examples:

  • Network cable cut between data centers
  • BGP route failure blocking all packets
  • Firewall misconfiguration blocking traffic

Detection: Straightforward. After a timeout, it's clear no response is coming.
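
As a minimal sketch (the peer address below is hypothetical), a blocking connect with a short timeout is the simplest reachability probe:

import socket

# Hypothetical address of a peer on the far side of a suspected partition.
PEER = ("10.0.1.42", 5432)

def peer_reachable(address, timeout_seconds=2.0):
    """Return True if a TCP connection can be opened before the timeout."""
    try:
        with socket.create_connection(address, timeout=timeout_seconds):
            return True
    except OSError:
        # Timed out, refused, or no route: no response is coming.
        return False

print(peer_reachable(PEER))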

Soft Partitions

High latency or intermittent packet loss. Messages eventually arrive, but slowly. Ambiguous: is the node slow or dead?

Examples:

  • Congestion on a network link
  • Satellite networks with high jitter
  • System under extreme load

Detection: Difficult. Timeouts must be high enough to tolerate the jitter but low enough to detect real failures.

Asymmetric Partitions

A can reach B, but B cannot reach A. Dangerous because each side believes it's still functioning.

Examples:

  • Firewall with one-way rules
  • TCP timeout asymmetry (A times out, B still retrying)
  • BGP route asymmetries

Detection: Hardest of all. A believes it can talk to B. B believes A has abandoned it.

The Split-Brain Problem

When a partition divides a system into two parts, each part may think the other is dead. Both parts continue operating, diverging in their state. When the partition heals, you have two conflicting versions of reality.

Split-Brain Scenario

The Danger

During a partition:

  1. Heartbeats stop crossing the partition
  2. The failure-detection timeout expires on both sides
  3. Each side concludes the other is dead and claims (or keeps) leadership
  4. Both leaders accept writes
  5. Data diverges irreparably

Prevention Strategies

Quorum-Based Consensus: Don't make decisions that affect the system unless you have a quorum (majority) of nodes agreeing. The side of the partition without a quorum cannot make decisions.

3-node system: A, B, C

Partition splits into {A} and {B, C}

Before partition:
- All nodes agree on state
- Any node can lead

After partition:
- {A} partition: no quorum (1 < 2)
→ Cannot be leader, rejects writes
- {B, C} partition: has quorum (2 >= 2)
→ Can elect leader, accepts writes
- When partition heals:
→ {A} was never conflicting (was read-only)
→ No split-brain!
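
The quorum rule itself is a one-line check. A minimal sketch (function and variable names are illustrative, not from any particular library):

def has_quorum(cluster_size, reachable_nodes):
    """A side of a partition may lead only if it can see a strict majority."""
    majority = cluster_size // 2 + 1
    return reachable_nodes >= majority

# 3-node cluster split into {A} and {B, C}
print(has_quorum(3, 1))  # False -> {A} must stay read-only and reject writes
print(has_quorum(3, 2))  # True  -> {B, C} may elect a leader and accept writes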

Failure Detection

Failure detection is a fundamental problem. A timeout doesn't prove failure—it just proves no response arrived.

Strategies

Heartbeats: Periodic "I'm alive" messages from every node. Missing heartbeats suggest failure. Problem: still doesn't distinguish slow from dead.
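
A minimal heartbeat-monitor sketch (in a real system this check would run on a timer and feed a membership or alerting component; the names here are illustrative):

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between "I'm alive" messages
SUSPECT_AFTER = 3          # missed intervals before suspecting failure

last_heartbeat = {}        # node id -> time the last heartbeat was received

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def suspected_down(node_id):
    """True if several heartbeats are missing -- the node may be dead,
    partitioned away, or merely slow; we cannot tell which."""
    last = last_heartbeat.get(node_id)
    if last is None:
        return True
    return time.monotonic() - last > SUSPECT_AFTER * HEARTBEAT_INTERVAL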

Ping/Pong: Service A pings service B. B responds. A measures the round-trip time. Problem: high-latency networks trigger false positives.

Adaptive Timeout: Adjust the timeout based on recent latencies. A phi accrual failure detector takes this further: instead of a binary up/down verdict, it computes a suspicion level for "node is down" from observed heartbeat timing. Problem: more complex and system-dependent.
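
The full phi accrual detector models the heartbeat inter-arrival distribution, but the core idea of adapting the threshold to observed latency can be sketched with a rolling mean plus a safety margin (the window size, multiplier, and defaults below are arbitrary choices):

from collections import deque
from statistics import mean, stdev

class AdaptiveTimeout:
    """Timeout = recent mean round-trip time plus several standard deviations."""

    def __init__(self, window=100, multiplier=4.0, floor=0.05):
        self.samples = deque(maxlen=window)   # recent RTTs in seconds
        self.multiplier = multiplier
        self.floor = floor                    # never drop below 50 ms

    def record(self, rtt_seconds):
        self.samples.append(rtt_seconds)

    def current_timeout(self):
        if len(self.samples) < 2:
            return 1.0  # conservative default until we have data
        margin = self.multiplier * stdev(self.samples)
        return max(self.floor, mean(self.samples) + margin)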

External Monitoring: A separate system monitors all services. Problem: adds operational complexity and is itself a single point of failure.

The Fundamental Limitation

In an asynchronous system (no global clock), you cannot distinguish:

  • "Node died"
  • "Network partitioned"
  • "Network is very slow"

All look identical from the outside. Your timeouts and thresholds are guesses.

Cascading Failures

Failure in one service can trigger failures in dependent services.

Cascade Failure Sequence

Why It Happens

  1. Timeout propagation: Service A waits 30 seconds for Service B (which is down). Meanwhile, A's caller waits its own 30 seconds for A. Every hop in the chain holds a thread and a connection for the full timeout, so the delays and the held resources add up.

  2. Resource exhaustion: Waiting requests consume memory, threads, connections. Eventually, all capacity is consumed. Even requests to healthy services fail due to resource starvation.

  3. Thundering herd: When a service recovers, all waiting requests try at once, overwhelming it, causing it to fail again.

Prevention

Timeouts must be short: A timeout of 30 seconds cascades terribly. Timeouts of 1-5 seconds prevent resource exhaustion.

Circuit Breaker: Stop sending requests to a failing service. Let it recover. Resume gradually with health checks. The code sketch at the end of this section shows the pattern.

Bulkhead Isolation: Allocate separate resource pools for different services. One service's failure doesn't drain resources needed by others.
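
A bulkhead can be as simple as a separate, bounded thread pool per dependency, so one hung dependency cannot consume every thread (a sketch; call_payments and call_search are hypothetical client functions):

from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools: if the payments service hangs, it can tie up
# at most 10 threads, while search keeps its own capacity.
# call_payments / call_search: hypothetical clients for the two dependencies.
payments_pool = ThreadPoolExecutor(max_workers=10)
search_pool = ThreadPoolExecutor(max_workers=20)

def handle_request(request):
    payment = payments_pool.submit(call_payments, request)
    results = search_pool.submit(call_search, request)
    # Short result timeouts keep the caller from waiting indefinitely.
    return payment.result(timeout=5), results.result(timeout=5)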

Load Shedding: When overloaded, reject new requests rather than queuing them. Better to fail fast than fail slowly to everyone.
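
Load shedding can be sketched as a bounded in-flight counter: once the limit is reached, new requests fail fast instead of queuing (the limit and the exception type are illustrative):

import threading

MAX_IN_FLIGHT = 200
_in_flight = threading.Semaphore(MAX_IN_FLIGHT)

class Overloaded(Exception):
    """Raised to reject work immediately rather than queue it behind a saturated service."""

def handle(request, process):
    if not _in_flight.acquire(blocking=False):
        raise Overloaded("shedding load: too many requests in flight")
    try:
        return process(request)
    finally:
        _in_flight.release()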

Putting short timeouts and circuit breaking together (a sketch using a hypothetical CircuitBreaker class; requests, service_b, and cache stand in for your own clients):

# Service A calling Service B
circuit_breaker = CircuitBreaker(
    failure_threshold=5,    # open the circuit after 5 failures
    recovery_timeout=60,    # wait 60 seconds before probing again
    half_open_max_calls=2,  # test 2 calls before fully closing
)

for request in requests:
    try:
        # If the circuit is open, this fails immediately.
        # If it is closed or half-open, the call is attempted.
        response = circuit_breaker.call(
            lambda: service_b.call(),
            timeout=5,  # short timeout!
        )
    except CircuitBreakerOpen:
        # Service B is down; serve a cached/degraded response instead.
        response = cache.get(request.key, "default value")

Graceful Degradation

When a service fails, downstream services shouldn't fail with it. Instead, they degrade gracefully (a sketch follows the list below):

  • Cache Hits: Serve cached data if source is unavailable
  • Fallback Values: Use sensible defaults for missing data
  • Reduced Features: Return basic functionality, not full functionality
  • Async Processing: Accept request, process later if synchronous source is unavailable
  • Read-Only Mode: Continue serving reads, reject writes that depend on the failed service
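
As a sketch of the first two strategies (serve from cache, then a sensible default; fetch_profile is a hypothetical client for the primary data source):

DEFAULT_PROFILE = {"name": "Guest", "recommendations": []}
profile_cache = {}

def get_profile(user_id):
    """Prefer fresh data, fall back to cache, then to a degraded default."""
    try:
        profile = fetch_profile(user_id, timeout=2)  # hypothetical primary source
        profile_cache[user_id] = profile
        return profile
    except Exception:
        # Primary source is unreachable or slow: degrade instead of failing.
        return profile_cache.get(user_id, DEFAULT_PROFILE)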

Self-Check

  1. What happens in your system when the database becomes unreachable?
  2. Can a single timeout cascade through your service chain?
  3. How would your system behave during a network partition?
  4. What services rely on each other? Where are the weak links?

One Takeaway

Partitions are not exceptions; they're inevitable. Design for graceful degradation, not for things working perfectly.

Next Steps

  1. Enable Safe Retries: Learn Idempotency
  2. Implement Resilience: Explore Timeouts and Retries
  3. Prevent Cascades: Read about Circuit Breaker
