CAP & PACELC Theorems
Master the fundamental constraints that govern every distributed systems architecture decision
TL;DR
The CAP theorem proves that a distributed system cannot simultaneously guarantee Consistency (all nodes see the same data), Availability (every request receives a non-error response), and Partition tolerance (the system keeps working despite network partitions). Since partitions are unavoidable in a distributed system, during a partition you must sacrifice either consistency or availability. PACELC extends this: if the network is Partitioned, choose between Availability and Consistency; Else (when the network is healthy), choose between Latency and Consistency. These theorems guide every architectural decision.
Learning Objectives
- Understand what CAP actually says (and doesn't say)
- Identify which properties your system prioritizes
- Apply PACELC to evaluate consistency-latency trade-offs
- Recognize that you're always making a choice, whether intentionally or not
Motivating Scenario
Your team is building a financial system. You need strong consistency (all ledger entries must be immediately visible), high availability (the system must never be down), and partition tolerance (the system must work when networks fail). Consultants assure you it's possible. They're wrong.
During a network partition between data centers, you must choose: return errors to users (sacrificing availability) or serve stale data (sacrificing consistency). There is no third option. The CAP theorem proves it mathematically.
The CAP Theorem
What Each Property Means
Consistency (C): Every read returns the most recent write. All nodes agree on data values: once a write completes, every subsequent read, on any node, reflects that write.
Availability (A): Every request receives a non-error response, regardless of individual node failures. The system continues operating even when nodes are down.
Partition Tolerance (P): The system continues operating despite network partitions. It survives when communication between nodes is lost or delayed indefinitely.
The Central Claim
In the face of a network partition, you cannot have both Consistency and Availability. You must choose one. More precisely:
- CP (Consistency + Partition Tolerance): When a partition occurs, some nodes stop responding to maintain consistency. Banking systems often choose this.
- AP (Availability + Partition Tolerance): When a partition occurs, nodes keep responding but serve potentially stale data. Eventual consistency systems choose this.
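To make these two behaviors concrete, here is a minimal sketch, with entirely hypothetical names (not any real database's API), of how a replica's read path differs under each choice: the CP variant refuses to answer when it cannot reach a quorum, while the AP variant answers from whatever it has locally.

```python
# Sketch of CP vs. AP read behavior during a partition. Hypothetical names.

class PartitionError(Exception):
    """Raised by a CP replica that cannot reach a quorum."""

class Replica:
    def __init__(self, local_value, can_reach_quorum):
        self.local_value = local_value            # last value this node has seen
        self.can_reach_quorum = can_reach_quorum  # False while partitioned

    def read_cp(self):
        # CP: refuse to answer rather than risk returning stale data.
        if not self.can_reach_quorum:
            raise PartitionError("quorum unreachable; refusing to serve read")
        return self.local_value

    def read_ap(self):
        # AP: always answer, even if the value might be stale.
        return self.local_value

replica = Replica(local_value={"balance": 100}, can_reach_quorum=False)
print(replica.read_ap())          # serves possibly-stale data
try:
    replica.read_cp()
except PartitionError as err:
    print(f"CP behavior: {err}")  # sacrifices availability instead
```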
The Often-Misunderstood Implication
Many people interpret CAP as a permanent choice. This is wrong. Brewer himself clarified: you make different choices at different times. During normal operation (no partition), you can have both consistency and availability. During a partition, you must sacrifice one.
This is where PACELC comes in.
PACELC: A Refinement
PACELC extends CAP by considering what happens during normal (partition-free) operation. The acronym stands for:
- P: if a network Partition occurs...
- A or C: ...choose between Availability and Consistency
- E: Else, during normal (partition-free) operation...
- L or C: ...choose between Latency and Consistency
The PACELC Choices
When Partitioned:
- PC: if a Partition occurs, choose Consistency; the system refuses requests it cannot answer consistently (unavailable during partitions)
- PA: if a Partition occurs, choose Availability; the system keeps responding but may serve inconsistent data
When Normal (Else):
- EL: Else, favor Latency; respond quickly from the nearest replica and accept eventual consistency (reads may be slightly stale)
- EC: Else, favor Consistency; coordinate replicas (for example, synchronous replication) before responding and accept higher latency
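The "else" branch of PACELC shows up in the write path. A minimal sketch under hypothetical names (the sleep stands in for a network round-trip): an EC-style write waits for every replica to acknowledge before returning, paying latency for consistency, while an EL-style write acknowledges immediately and replicates in the background.

```python
# Sketch of the latency/consistency choice during normal operation.
# Hypothetical names; time.sleep models a network round-trip.
import time
import threading

REPLICAS = ["replica-a", "replica-b", "replica-c"]

def send_to_replica(replica, value):
    time.sleep(0.05)  # pretend network round-trip
    return True

def write_ec(value):
    """EC: wait for all replicas before acknowledging (higher latency)."""
    for replica in REPLICAS:
        send_to_replica(replica, value)
    return "acknowledged: all replicas have the value"

def write_el(value):
    """EL: acknowledge immediately, replicate in the background (lower latency)."""
    for replica in REPLICAS:
        threading.Thread(target=send_to_replica, args=(replica, value)).start()
    return "acknowledged: replication still in flight"

for write in (write_ec, write_el):
    start = time.perf_counter()
    print(write("v2"), f"({time.perf_counter() - start:.3f}s)")
```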
Real-World Examples
CP (choose Consistency during partitions):
- Traditional SQL databases with synchronous replication
- HBase
- MongoDB with strong consistency
- Financial transactions requiring immediate visibility
- Inventory management needing accuracy over availability
- Systems where stale data is unacceptable
AP (choose Availability during partitions):
- DynamoDB (eventual consistency mode)
- Cassandra
- Amazon SimpleDB
- High-scale systems requiring constant availability
- Social media feeds (eventual consistency acceptable)
- Caching systems (stale data acceptable)
Hybrid Approaches
Modern systems often use hybrid approaches, applying CAP decisions at different levels:
- Per-Operation Decisions: Strong consistency for critical operations (payment processing), eventual consistency for non-critical operations (user profile updates). A sketch of this pattern follows this list.
- Per-Data Decisions: Inventory data might be CP (must be accurate), while product reviews might be AP (eventual consistency acceptable).
- Quorum-Based: Consistency level depends on how many replicas acknowledge the write. More replicas acknowledged means higher consistency but lower availability.
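A minimal sketch of the per-operation idea above, with hypothetical store functions standing in for a real client: the caller picks the consistency level per write rather than per system.

```python
# Hypothetical per-operation consistency choice; the store functions are stubs.

def write_with_quorum(key, value):
    print(f"strong write: {key} acknowledged by a quorum before returning")

def write_async(key, value):
    print(f"eventual write: {key} acknowledged locally, replicated later")

def record_payment(payment_id, amount):
    # Critical path: pay the latency cost for strong consistency.
    write_with_quorum(f"payment:{payment_id}", {"amount": amount})

def update_profile(user_id, bio):
    # Non-critical path: eventual consistency is fine.
    write_async(f"profile:{user_id}", {"bio": bio})

record_payment("p-42", 19.99)
update_profile("u-7", "Distributed systems enthusiast")
```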
Trade-offs in Practice
- Explicitly identify whether your system is CP or AP
- For CP systems, define fallback behavior during partitions (queue writes, serve errors?)
- For AP systems, implement eventual consistency correctly (conflict resolution, convergence guarantees)
- For normal operation, decide latency tolerance vs consistency requirements
- Monitor actual partition frequency and duration to validate CAP choices
- Test failure scenarios explicitly (simulate partitions, measure behavior)
Common Mistakes
Mistake 1: Assuming you can have all three. You can't. Choose deliberately.
Mistake 2: Assuming the choice is permanent. It's not. Different operations have different requirements.
Mistake 3: Assuming partition tolerance is optional. If you're distributed, you're partitioned. Partitions happen.
Mistake 4: Assuming your system is CA. If you have a network between nodes, you have P. CA systems exist only on single machines.
Practical Decision Framework
Decision Matrix: CAP for Different Use Cases
| Use Case | Requirement | CAP Choice | Behavior During Partition |
|---|---|---|---|
| Inventory management | Prevent overselling | CP | If partition: reject sales, refuse orders, maintain accuracy |
| User authentication | Must never lock legitimate users | AP | If partition: accept logins, sync later, accept some stale data |
| Payment processing | Accuracy is non-negotiable | CP | If partition: queue transaction, don't process blind, validate later |
| Social media feed | Eventual consistency fine | AP | If partition: serve cached feed, sync when partition heals |
| Trading system | Real-time accuracy critical | CP | If partition: halt trading, prevent incorrect execution |
| Logging system | Availability more important | AP | If partition: buffer logs locally, ship when partition heals |
| DNS | Performance > consistency | AP | If partition: serve cached record, refresh when partition heals |
| Account balance | Consistency critical | CP | If partition: reject access, don't show stale balance |
CAP in Real Systems
Google Search (AP System)
- Indexes are distributed across data centers
- Network partition between data centers is handled by serving stale indices
- Consistency: index might be days old after partition
- Availability: search works during partition
- Users might see old results, but search is always fast
Database with Quorum Reads (CP System)
- Quorum: require majority of replicas to agree on value
- Network partition: some replicas isolated, quorum can't form
- Result: system becomes unavailable (can't read)
- Trade: sacrificed availability for consistency
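A sketch of the quorum rule with hypothetical in-memory replicas: with N = 3 and read/write quorums of 2 (so R + W > N), any read quorum overlaps any write quorum on at least one replica, and once a partition isolates two of the three replicas the quorum cannot form and the operation fails.

```python
# Quorum sketch: N=3 replicas, quorum size 2 for both reads and writes.
# Entirely in-memory and hypothetical; "reachable" models the partition.

N, QUORUM = 3, 2
replicas = [{"value": None, "version": 0, "reachable": True} for _ in range(N)]

def quorum_write(value, version):
    reachable = [r for r in replicas if r["reachable"]]
    if len(reachable) < QUORUM:
        raise RuntimeError("write rejected: quorum unavailable")
    for r in reachable[:QUORUM]:
        r["value"], r["version"] = value, version

def quorum_read():
    reachable = [r for r in replicas if r["reachable"]]
    if len(reachable) < QUORUM:
        raise RuntimeError("read rejected: quorum unavailable")
    # Return the newest version among the quorum; R + W > N guarantees the
    # read quorum overlaps the last write quorum on at least one replica.
    return max(reachable[:QUORUM], key=lambda r: r["version"])["value"]

quorum_write("v1", version=1)
print(quorum_read())              # "v1"

replicas[1]["reachable"] = False  # partition isolates two of the three replicas
replicas[2]["reachable"] = False
try:
    quorum_read()
except RuntimeError as err:
    print(err)                    # consistency preserved, availability lost
```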
Cassandra (AP System)
- Read/write with tunable consistency
- Default: eventual consistency
- Network partition: different nodes might have different data
- Example: User updates profile in DC1, reads from DC2 (old version)
- Consistency eventually achieved (DC2 syncs from DC1)
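Cassandra makes this trade-off a per-query knob. A hedged sketch, assuming the DataStax cassandra-driver package and a hypothetical keyspace, table, and contact point: reading at ONE favors latency, reading at QUORUM favors consistency and will fail if a partition prevents a majority from answering.

```python
# Sketch using the DataStax cassandra-driver; the keyspace, table, and
# contact point below are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app_keyspace")

# Fast, possibly stale: any single replica may answer (EL-style).
fast_read = SimpleStatement(
    "SELECT bio FROM profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# Slower, stronger: a majority of replicas must agree (EC-style).
safe_read = SimpleStatement(
    "SELECT bio FROM profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

print(session.execute(fast_read, ("u-7",)).one())
print(session.execute(safe_read, ("u-7",)).one())
```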
Consistency Models in Detail
Consistency Spectrum
Strong Consistency: All nodes always have the same data.
- Cost: High latency, low availability
- Use when: Money, medical data, critical systems
- Implementation: Quorum reads/writes, distributed locking
Eventual Consistency: Nodes eventually converge to the same value.
- Cost: Temporary divergence, complexity handling conflicts
- Use when: User profiles, recommendations, feeds
- Implementation: Replication with async sync, conflict-free replicated data types
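One way to make eventual consistency converge safely is a conflict-free replicated data type. A minimal sketch (hypothetical, not a library API) of a grow-only counter: each node increments its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order in which updates arrive.

```python
# Minimal grow-only counter (G-Counter) CRDT sketch.

class GCounter:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.counts = {n: 0 for n in nodes}

    def increment(self, amount=1):
        self.counts[self.node_id] += amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter the merge order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

nodes = ["dc1", "dc2"]
a, b = GCounter("dc1", nodes), GCounter("dc2", nodes)
a.increment(3)          # updates applied independently during a partition
b.increment(2)
a.merge(b); b.merge(a)  # exchange state after the partition heals
assert a.value() == b.value() == 5
print(a.value())
```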
Causal Consistency: Causally related operations preserve order.
- Cost: Medium latency, moderate complexity
- Use when: Chat messages, email threads, comments
- Implementation: Version vectors, logical clocks
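A sketch of the version-vector idea mentioned above (a hypothetical helper, not a library): each replica keeps a counter per node, and comparing two vectors tells you whether one write causally precedes the other or whether they are concurrent and need conflict resolution.

```python
# Version-vector comparison sketch: detects causal order vs. concurrency.

def compare(vv_a, vv_b):
    """Return 'a_before_b', 'b_before_a', 'equal', or 'concurrent'."""
    nodes = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(n, 0) <= vv_b.get(n, 0) for n in nodes)
    b_le_a = all(vv_b.get(n, 0) <= vv_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"
    if b_le_a:
        return "b_before_a"
    return "concurrent"   # neither saw the other: a conflict to resolve

print(compare({"dc1": 2, "dc2": 1}, {"dc1": 2, "dc2": 3}))  # a_before_b
print(compare({"dc1": 3, "dc2": 1}, {"dc1": 2, "dc2": 3}))  # concurrent
```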
Read-your-writes: Clients see their own writes immediately.
- Cost: Session state needed, server affinity
- Use when: User sessions, post visibility
- Implementation: Sticky sessions, client-side caching
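Read-your-writes can also be sketched without sticky sessions: the client remembers the version of its last write and skips any replica that has not caught up (hypothetical names throughout).

```python
# Read-your-writes sketch: the client tracks the version of its last write
# and only accepts reads from a replica that has caught up to it.

class Replica:
    def __init__(self, version, value):
        self.version, self.value = version, value

def read_own_writes(replicas, min_version):
    for replica in replicas:
        if replica.version >= min_version:
            return replica.value
    raise RuntimeError("no replica has caught up to the client's last write")

replicas = [Replica(4, "old bio"), Replica(5, "new bio")]
last_write_version = 5   # remembered by the client after its write
print(read_own_writes(replicas, last_write_version))  # "new bio"
```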
Trade-Off Examples in Code
```python
# STRONG CONSISTENCY - high latency, but no stale reads.
# distributed_lock, read_from_quorum, write_to_quorum are illustrative helpers.
def transfer_money(from_account, to_account, amount):
    with distributed_lock(from_account, to_account):
        from_balance = read_from_quorum(from_account)
        to_balance = read_from_quorum(to_account)
        if from_balance >= amount:
            write_to_quorum(from_account, from_balance - amount)
            write_to_quorum(to_account, to_balance + amount)
    # All nodes agree before returning; the caller pays the latency.

# EVENTUAL CONSISTENCY - low latency, temporary divergence.
def update_user_profile(user_id, updates):
    write_to_primary(user_id, updates)
    # Replicate asynchronously to other nodes
    async_replicate_to_secondary(user_id, updates)
    # Returns immediately; replicas converge later.
    # Risk: a read from a secondary node might briefly see the old profile.
```
Self-Check
- What happens in your system when a network partition occurs? Does the system become unavailable or serve stale data? (Every system must choose one.)
- For each critical operation, is the choice between consistency and availability explicit or implicit? (Explicit is better; document it.)
- Can you trace back an availability incident to a CAP choice conflict? (If yes, was it the right choice in hindsight?)
- What's your partition tolerance strategy? (No system can avoid partitions; they're inevitable in distributed systems.)
- Have you tested your system behavior during actual network partitions? (Simulation isn't enough; real partition behavior surprises teams.)
The CAP theorem isn't a puzzle to solve—it's a law of nature. Acknowledge the trade-off, make an explicit choice, and design around it. Document your choice in architecture decisions so when an incident happens, you know the system is behaving as designed. The best distributed systems teams spend less time debating CAP and more time testing their actual consistency behavior under partition.
CAP in Real Incidents
Case Study 1: AWS S3 Outage (2008)
During a network partition between data centers:
- Amazon chose availability over consistency (AP system)
- S3 continued accepting writes
- Writes went to available partitions
- When partition healed, conflicts resolved with last-write-wins
Result: Users experienced temporary inconsistency (writes to one region not visible in another), but service never went down. Trade-off paid off—users prefer slow/inconsistent service to unavailable service.
Case Study 2: Payment System Failure
Financial transaction system prioritized consistency (CP):
- Network partition between payment gateway and database
- System chose: reject transactions rather than risk double-charges
- Merchants couldn't process payments for 4 hours
- Revenue loss: $2M+ but no data corruption
Trade-off: Accept availability loss to prevent consistency loss (double-charged customers would be worse).
Both decisions were intentional applications of CAP. The failures happened not because CAP was violated, but because recovery was slow. Better observability and faster failover would have helped.
Advanced Topics
Multi-Object Transactions and CAP
What if a transaction touches multiple objects across partitions?
- CP approach: Abort if partition detected. "Transaction cannot complete—partition detected, aborting."
- AP approach: Commit locally, sync later. "Transaction committed locally; replication may be delayed."
Most systems are hybrid: critical transactions are CP, non-critical ones are AP.
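A sketch of the two postures with hypothetical functions (a real implementation needs an actual commit protocol): the CP path aborts the moment it detects a partition, the AP path commits locally and queues replication for later.

```python
# Multi-object transaction sketch: abort-on-partition (CP) vs.
# local-commit-and-sync-later (AP). All names are hypothetical.

class PartitionDetected(Exception):
    pass

local_store, replication_queue = {}, []

def commit_cp(changes, partitioned):
    if partitioned:
        raise PartitionDetected("transaction aborted: partition detected")
    local_store.update(changes)            # and replicate synchronously

def commit_ap(changes, partitioned):
    local_store.update(changes)            # always commit locally
    if partitioned:
        replication_queue.append(changes)  # reconcile after the partition heals
    return "committed locally; replication may be delayed"

print(commit_ap({"order:1": "paid", "stock:sku9": 41}, partitioned=True))
try:
    commit_cp({"order:2": "paid", "stock:sku9": 40}, partitioned=True)
except PartitionDetected as err:
    print(err)
```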
Leases as CAP Bridge
A lease is a time-bound permission to perform an action. Leases bridge the CAP trade-off:
1. Client requests lease from server (e.g., 10 second lease to write)
2. If partition occurs, client can write for remaining lease time
3. After lease expires and partition heals, consistency is restored
4. Trade: small window of inconsistency, but system continues operating
Netflix uses leases for this reason—tiny inconsistency windows, but high availability.
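A minimal lease sketch (hypothetical and wall-clock based; real systems must also account for clock skew and renewal): the holder may keep acting locally until the lease expires, even if a partition cuts it off from the grantor, which bounds the inconsistency window to the lease duration.

```python
# Lease sketch: a time-bound permission to act without re-checking with the
# grantor. Hypothetical; real leases allow for clock skew and renewal.
import time

class Lease:
    def __init__(self, holder, duration_s):
        self.holder = holder
        self.expires_at = time.monotonic() + duration_s

    def is_valid(self):
        return time.monotonic() < self.expires_at

def write_if_leased(lease, key, value, store):
    if not lease.is_valid():
        raise PermissionError("lease expired: re-acquire before writing")
    store[key] = value   # safe to act locally even while partitioned from grantor

lease = Lease(holder="cache-node-3", duration_s=10)
store = {}
write_if_leased(lease, "session:42", {"user": "u-7"}, store)
print(store)
```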
Eventual Consistency Safety
Eventual consistency can be made safer with conflict resolution:
```python
# Conflict resolution strategies (v1, v2 are two divergent versions).

def merge_last_write_wins(v1, v2):
    # Simple but lossy: the older write is silently discarded.
    return v1 if v1.timestamp > v2.timestamp else v2

def merge_by_field(v1, v2):
    # Safer: resolve each field with its own last-write-wins.
    return {
        'name': v2.name if v2.timestamp > v1.timestamp else v1.name,
        'email': v2.email if v2.timestamp > v1.timestamp else v1.email,
        'updated_at': max(v1.updated_at, v2.updated_at),
    }

def merge_multi_value(v1, v2):
    # No data loss: keep both; the application resolves the conflict explicitly.
    return [v1, v2]
```
Next Steps
- Choose Your Model: Decide for each subsystem: CP or AP?
- Test partitions: Introduce network partitions in staging; observe behavior
- Prepare for Partitions: Explore Partition Tolerance
- Build Resilience: Learn Timeouts and Retries
- Document decisions: For each service: are we CP or AP? Why? What does that mean for clients?
Quick Decision Guide
Critical business operation (payment, inventory)? → Choose CP
- Consistency non-negotiable
- Availability sacrificed during partition
- Example: "Transaction failed" > "Transaction processed twice"
User-facing feature (feed, recommendations)? → Choose AP
- Availability critical
- Eventual consistency acceptable
- Example: "Here's your stale feed" > "Feed unavailable"
Read-heavy, write-infrequent (static content)? → Choose AP + Caching
- Cache widely
- Eventual consistency from cache to database
- Low latency, acceptable staleness
Real-time coordination (distributed transactions)? → Choose CP
- Cannot tolerate inconsistency
- Partition means system stops
- Example: Banking, stock trading
Conclusion
CAP theorem is not a limitation—it's a guide. Acknowledge which property you're sacrificing, design your system accordingly, and document the choice. The best distributed systems teams spend less time debating CAP theory and more time testing actual behavior during partitions. Know your system's CAP choice cold, and you'll make better architecture decisions.
References
- Brewer, E. A. (2000). "Towards Robust Distributed Systems". PODC Keynote.
- Gilbert, S., & Lynch, N. A. (2002). "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services". ACM SIGACT News.
- Brewer, E. A. (2012). "CAP Twelve Years Later: How the 'Rules' Have Changed". IEEE Computer.
- Vogels, W. (2008). "Eventually Consistent". Communications of the ACM.
- Corbett, J. C., et al. (2012). "Spanner: Google's Globally-Distributed Database". OSDI.
- Abadi, D. (2012). "Consistency Tradeoffs in Modern Distributed Database System Design". IEEE Computer Magazine.