
CAP & PACELC Theorems

Master the fundamental constraints that govern every distributed systems architecture decision

TL;DR

The CAP theorem proves that a distributed system cannot simultaneously guarantee Consistency (all nodes see the same data), Availability (the system responds to requests), and Partition tolerance (the system works despite network partitions). You can have at most two. PACELC extends this: if the network is Partitioned, choose between Consistency and Availability; else (if the network is working), choose between Latency and Consistency. These theorems guide every architectural decision.

Learning Objectives

  • Understand what CAP actually says (and doesn't say)
  • Identify which properties your system prioritizes
  • Apply PACELC to evaluate consistency-latency trade-offs
  • Recognize that you're always making a choice, whether intentionally or not

Motivating Scenario

Your team is building a financial system. You need strong consistency (all ledger entries must be immediately visible), high availability (the system must never be down), and partition tolerance (the system must work when networks fail). Consultants assure you it's possible. They're wrong.

During a network partition between data centers, you must choose: return errors to users (sacrificing availability) or serve stale data (sacrificing consistency). There is no third option. The CAP theorem proves it mathematically.

The CAP Theorem

[Diagram: CAP theorem trade-off space]

What Each Property Means

Consistency (C): Every read returns the most recent write. All nodes agree on data values: once a write completes, every subsequent read, on any node, returns that value (or a newer one).

Availability (A): Every request receives a non-error response, regardless of individual node failures. The system continues operating even when nodes are down.

Partition Tolerance (P): The system continues operating despite network partitions. It survives when communication between nodes is lost or delayed indefinitely.

The Central Claim

In the face of a network partition, you cannot have both Consistency and Availability. You must choose one. More precisely:

  • CP (Consistency + Partition Tolerance): When a partition occurs, some nodes stop responding to maintain consistency. Banking systems often choose this.
  • AP (Availability + Partition Tolerance): When a partition occurs, nodes keep responding but serve potentially stale data. Eventual consistency systems choose this.
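
To make the difference concrete, here is a minimal sketch (with hypothetical helpers such as can_reach_quorum, read_from_quorum, and read_local_replica, not any real database's API) of how a CP replica and an AP replica answer a read while cut off from their peers:

class ServiceUnavailable(Exception):
    pass

def handle_read_cp(key):
    # CP: if we cannot coordinate with a quorum, refuse to answer.
    if not can_reach_quorum():
        raise ServiceUnavailable("partition detected; cannot guarantee the latest value")
    return read_from_quorum(key)

def handle_read_ap(key):
    # AP: always answer from the local replica, even if it may be stale.
    return read_local_replica(key)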

The Often-Misunderstood Implication

Many people interpret CAP as a permanent choice. This is wrong. Brewer himself clarified: you make different choices at different times. During normal operation (no partition), you can have both consistency and availability. During a partition, you must sacrifice one.

This is where PACELC comes in.

PACELC: A Refinement

PACELC extends CAP by considering what happens during normal (partition-free) operation. The acronym stands for:

  • P: if a network Partition occurs...
  • A or C: ...choose between Availability and Consistency
  • E: Else (no partition, normal operation)...
  • L or C: ...choose between Latency and Consistency

[Diagram: PACELC decision tree]

The PACELC Choices

When Partitioned:

  • PC: if a Partition occurs, choose Consistency; the system refuses some requests (is unavailable) during the partition
  • PA: if a Partition occurs, choose Availability; the system keeps responding but may serve stale or conflicting data

When Normal (Else):

  • EL: Else (no partition), favor Latency; serve requests quickly from the nearest replica, accepting possibly stale reads
  • EC: Else (no partition), favor Consistency; coordinate replicas on every operation and pay the extra latency

Real-World Examples

CP Systems (Consistency + Partition Tolerance)
  1. Traditional SQL databases with synchronous replication
  2. HBase
  3. MongoDB with strong consistency
  4. Financial transactions requiring immediate visibility
  5. Inventory management needing accuracy over availability
  6. Systems where stale data is unacceptable
AP Systems (Availability + Partition Tolerance)
  1. DynamoDB (eventual consistency mode)
  2. Cassandra
  3. Amazon SimpleDB
  4. High-scale systems requiring constant availability
  5. Social media feeds (eventual consistency acceptable)
  6. Caching systems (stale data acceptable)

Hybrid Approaches

Modern systems often use hybrid approaches, applying CAP decisions at different levels:

  1. Per-Operation Decisions: Strong consistency for critical operations (payment processing), eventual consistency for non-critical operations (user profile updates).

  2. Per-Data Decisions: Inventory data might be CP (must be accurate), while product reviews might be AP (eventual consistency acceptable).

  3. Quorum-Based: The consistency level depends on how many replicas must acknowledge each read or write. Requiring more acknowledgments gives stronger consistency but lower availability and higher latency (see the sketch below).

[Diagram: Hybrid consistency strategy]
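
One way to reason about the quorum point above: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap in at least one replica whenever R + W > N. A small illustrative check (generic parameter names, not tied to any particular database):

def quorums_overlap(n_replicas, write_acks, read_acks):
    # Reads and writes intersect in at least one replica when R + W > N.
    return read_acks + write_acks > n_replicas

print(quorums_overlap(3, 2, 2))  # True: stronger consistency, but needs 2 of 3 nodes reachable
print(quorums_overlap(3, 1, 1))  # False: faster and more available, but reads may be stale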

Trade-offs in Practice

  • Explicitly identify whether your system is CP or AP
  • For CP systems, define fallback behavior during partitions (queue writes, serve errors?)
  • For AP systems, implement eventual consistency correctly (conflict resolution, convergence guarantees)
  • For normal operation, decide latency tolerance vs consistency requirements
  • Monitor actual partition frequency and duration to validate CAP choices
  • Test failure scenarios explicitly (simulate partitions, measure behavior)

Common Mistakes

Mistake 1: Assuming you can have all three. You can't. Choose deliberately.

Mistake 2: Assuming the choice is permanent. It's not. Different operations have different requirements.

Mistake 3: Assuming partition tolerance is optional. If you're distributed, you're partitioned. Partitions happen.

Mistake 4: Assuming your system is CA. If you have a network between nodes, you have P. CA systems exist only on single machines.

Practical Decision Framework

Decision Matrix: CAP for Different Use Cases

| Use Case | Requirement | CAP Choice | Example |
| --- | --- | --- | --- |
| Inventory management | Prevent overselling | CP | If partition: reject orders, maintain accuracy |
| User authentication | Must never lock out legitimate users | AP | If partition: accept logins, sync later, accept some stale data |
| Payment processing | Accuracy is non-negotiable | CP | If partition: queue transactions, don't process blind, validate later |
| Social media feed | Eventual consistency fine | AP | If partition: serve cached feed, sync when partition heals |
| Trading system | Real-time accuracy critical | CP | If partition: halt trading, prevent incorrect execution |
| Logging system | Availability more important | AP | If partition: buffer logs locally, ship when partition heals |
| DNS | Performance > consistency | AP | If partition: serve cached record, refresh when partition heals |
| Account balance | Consistency critical | CP | If partition: reject access, don't show stale balance |

CAP in Real Systems

Google Search (AP System)

  • Indexes are distributed across data centers
  • Network partition between data centers is handled by serving stale indices
  • Consistency: the index might be days old after a partition
  • Availability: search works during partition
  • Users might see old results, but search is always fast

Database with Quorum Reads (CP System)

  • Quorum: require majority of replicas to agree on value
  • Network partition: some replicas isolated, quorum can't form
  • Result: system becomes unavailable (can't read)
  • Trade: sacrificed availability for consistency

Cassandra (AP System)

  • Read/write with tunable consistency
  • Default: eventual consistency
  • Network partition: different nodes might have different data
  • Example: User updates profile in DC1, reads from DC2 (old version)
  • Consistency eventually achieved (DC2 syncs from DC1)

Consistency Models in Detail

[Diagram: Consistency spectrum]

Strong Consistency: Every read sees the most recent acknowledged write; all nodes appear to hold the same data at all times.

  • Cost: Higher latency, reduced availability
  • Use when: Money, medical data, critical systems
  • Implementation: Quorum reads/writes, distributed locking

Eventual Consistency: Nodes eventually converge to the same value.

  • Cost: Temporary divergence, complexity handling conflicts
  • Use when: User profiles, recommendations, feeds
  • Implementation: Replication with async sync, conflict-free replicated data types

Causal Consistency: Causally related operations preserve order.

  • Cost: Medium latency, moderate complexity
  • Use when: Chat messages, email threads, comments
  • Implementation: Version vectors, logical clocks
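
As a small illustration of the version-vector idea (the dictionaries below are hypothetical node-id-to-counter maps, not a specific library's format), two updates are causally ordered only if one vector dominates the other; otherwise they are concurrent and need conflict resolution:

def dominates(vv_a, vv_b):
    # vv_a and vv_b map node id -> counter; vv_a dominates if it is >= in every component.
    nodes = set(vv_a) | set(vv_b)
    return all(vv_a.get(n, 0) >= vv_b.get(n, 0) for n in nodes)

def causal_relation(vv_a, vv_b):
    if vv_a == vv_b:
        return "identical"
    if dominates(vv_a, vv_b) and not dominates(vv_b, vv_a):
        return "a happened after b"
    if dominates(vv_b, vv_a) and not dominates(vv_a, vv_b):
        return "b happened after a"
    return "concurrent (conflict to resolve)"

print(causal_relation({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a happened after b
print(causal_relation({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # concurrent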

Read-your-writes: Clients see their own writes immediately.

  • Cost: Session state needed, server affinity
  • Use when: User sessions, post visibility
  • Implementation: Sticky sessions, client-side caching
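
A minimal sketch of one way to get read-your-writes on top of an eventually consistent store: the client session remembers the version of its own last write and falls back to the primary when a replica returns something older. The store interface here (write returning a version, read returning value and version, read_from_primary) is an assumption for illustration.

class Session:
    def __init__(self, store):
        self.store = store                   # assumed store with versioned read/write
        self.last_written_version = 0

    def write(self, key, value):
        version = self.store.write(key, value)   # returns the new version number
        self.last_written_version = max(self.last_written_version, version)
        return version

    def read(self, key):
        value, version = self.store.read(key)    # may hit a lagging replica
        if version < self.last_written_version:
            # Replica is behind this session's own writes: retry against the primary.
            value, version = self.store.read_from_primary(key)
        return value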

Trade-Off Examples in Code

# STRONG CONSISTENCY - every replica agrees before returning; higher latency
def transfer_money(from_account, to_account, amount):
    # distributed_lock, read_from_quorum, write_to_quorum are assumed primitives
    with distributed_lock(from_account, to_account):
        from_balance = read_from_quorum(from_account)
        to_balance = read_from_quorum(to_account)
        if from_balance >= amount:
            write_to_quorum(from_account, from_balance - amount)
            write_to_quorum(to_account, to_balance + amount)
    # All nodes agree immediately; every call pays quorum and lock latency

# EVENTUAL CONSISTENCY - returns fast; replicas converge later
def update_user_profile(user_id, updates):
    write_to_primary(user_id, updates)
    # Replicate asynchronously to the other nodes
    async_replicate_to_secondary(user_id, updates)
    # Returns immediately; eventual sync
    # Risk: a read from a secondary node might see the old profile temporarily

Self-Check

  1. What happens in your system when a network partition occurs? Does the system become unavailable or serve stale data? (Every system must choose one.)

  2. For each critical operation, is the choice between consistency and availability explicit or implicit? (Explicit is better—document it.)

  3. Can you trace back an availability incident to a CAP choice conflict? (If yes, was it the right choice in hindsight?)

  4. What's your partition tolerance strategy? (No system can avoid partitions—they're inevitable in distributed systems.)

  5. Have you tested your system behavior during actual network partitions? (Simulation isn't enough; real partition behavior surprises teams.)

One Takeaway

The CAP theorem isn't a puzzle to solve—it's a law of nature. Acknowledge the trade-off, make an explicit choice, and design around it. Document your choice in architecture decisions so when an incident happens, you know the system is behaving as designed. The best distributed systems teams spend less time debating CAP and more time testing their actual consistency behavior under partition.

CAP in Real Incidents

Case Study 1: AWS S3 Outage (2008)

During a network partition between data centers:

  • Amazon chose availability over consistency (AP system)
  • S3 continued accepting writes
  • Writes went to available partitions
  • When partition healed, conflicts resolved with last-write-wins

Result: Users experienced temporary inconsistency (writes to one region not visible in another), but service never went down. Trade-off paid off—users prefer slow/inconsistent service to unavailable service.

Case Study 2: Payment System Failure

Financial transaction system prioritized consistency (CP):

  • Network partition between payment gateway and database
  • System chose: reject transactions rather than risk double-charges
  • Merchants couldn't process payments for 4 hours
  • Revenue loss: $2M+ but no data corruption

Trade-off: Accept availability loss to prevent consistency loss (double-charged customers would be worse).

Both decisions were intentional applications of CAP. The failures happened not because CAP was violated, but because recovery was slow. Better observability and faster failover would have helped.

Advanced Topics

Multi-Object Transactions and CAP

What if a transaction touches multiple objects across partitions?

  • CP approach: Abort if partition detected. "Transaction cannot complete—partition detected, aborting."
  • AP approach: Commit locally, sync later. "Transaction committed locally; replication may be delayed."

Most systems hybrid: critical transactions are CP, non-critical are AP.
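
A rough sketch of the two policies, assuming hypothetical helpers (detect_partition, commit_local, enqueue_replication) and a transaction object rather than any real transaction manager's API:

class TransactionAborted(Exception):
    pass

def run_transaction_cp(txn):
    # CP: abort the cross-partition transaction if any participant is unreachable.
    if detect_partition(txn.participants):
        txn.abort()
        raise TransactionAborted("partition detected; aborting to preserve consistency")
    txn.prepare_all()   # e.g., phase one of two-phase commit
    txn.commit_all()    # phase two: all participants commit together

def run_transaction_ap(txn):
    # AP: commit on the reachable side now and reconcile the rest later.
    commit_local(txn)            # visible immediately on this side of the partition
    enqueue_replication(txn)     # replayed to the other side once the partition heals
    # Risk: the two sides may briefly disagree until replication catches up.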

Leases as CAP Bridge

A lease is a time-bound permission to perform an action. It bridges the CAP trade-off:

  1. A client requests a lease from the server (e.g., a 10-second lease to write)
  2. If a partition occurs, the client can keep writing for the remaining lease time
  3. After the lease expires and the partition heals, consistency is restored
  4. Trade-off: a small window of potential inconsistency, but the system keeps operating

Netflix uses leases for this reason—tiny inconsistency windows, but high availability.
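
A minimal lease sketch (illustrative names, not Netflix's or any specific system's implementation): the holder may act only while the lease is unexpired, which bounds how long it can operate on possibly stale permission during a partition.

import time

class Lease:
    def __init__(self, resource, duration_seconds):
        self.resource = resource
        self.expires_at = time.monotonic() + duration_seconds

    def is_valid(self):
        return time.monotonic() < self.expires_at

def write_with_lease(lease, value):
    if not lease.is_valid():
        raise PermissionError("lease expired; re-acquire it from the coordinator")
    # Safe to proceed: the coordinator promised not to grant a conflicting lease
    # until expires_at, so any inconsistency is bounded by the lease window.
    apply_local_write(lease.resource, value)  # assumed local-write primitive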

Eventual Consistency Safety

Eventual consistency can be made safer with conflict resolution:

# Conflict resolution strategies for eventually consistent replicas

def merge_last_write_wins(v1, v2):
    # Last-write-wins (simple but lossy: the older version is discarded)
    return v1 if v1.timestamp > v2.timestamp else v2

def merge_by_field(v1, v2):
    # Merge field by field (safer: keeps the newest value of each field)
    return {
        'name': v2.name if v2.timestamp > v1.timestamp else v1.name,
        'email': v2.email if v2.timestamp > v1.timestamp else v1.email,
        'updated_at': max(v1.updated_at, v2.updated_at),
    }

def merge_multi_value(v1, v2):
    # Multi-value (no data loss): keep both versions;
    # the application resolves the conflict explicitly
    return [v1, v2]

Next Steps

  1. Choose Your Model: Decide for each subsystem: CP or AP?
  2. Test partitions: Introduce network partitions in staging; observe behavior
  3. Prepare for Partitions: Explore Partition Tolerance
  4. Build Resilience: Learn Timeouts and Retries
  5. Document decisions: For each service: are we CP or AP? Why? What does that mean for clients?

Quick Decision Guide

Critical business operation (payment, inventory)? → Choose CP

  • Consistency non-negotiable
  • Availability sacrificed during partition
  • Example: "Transaction failed" > "Transaction processed twice"

User-facing feature (feed, recommendations)? → Choose AP

  • Availability critical
  • Eventual consistency acceptable
  • Example: "Here's your stale feed" > "Feed unavailable"

Read-heavy, write-infrequent (static content)? → Choose AP + Caching

  • Cache widely
  • Eventual consistency from cache to database
  • Low latency, acceptable staleness

Real-time coordination (distributed transactions)? → Choose CP

  • Cannot tolerate inconsistency
  • Partition means system stops
  • Example: Banking, stock trading

Conclusion

CAP theorem is not a limitation—it's a guide. Acknowledge which property you're sacrificing, design your system accordingly, and document the choice. The best distributed systems teams spend less time debating CAP theory and more time testing actual behavior during partitions. Know your system's CAP choice cold, and you'll make better architecture decisions.

References

  • Brewer, E. A. (2000). "Towards Robust Distributed Systems". PODC Keynote.
  • Gilbert, S., & Lynch, N. A. (2002). "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services". ACM SIGACT News.
  • Brewer, E. A. (2012). "CAP Twelve Years Later: How the 'Rules' Have Changed". IEEE Computer, 45(2).
  • Vogels, W. (2008). "Eventually Consistent". Communications of the ACM.
  • Corbett, J. C., et al. (2012). "Spanner: Google's Globally-Distributed Database". OSDI.
  • Abadi, D. (2012). "Consistency Tradeoffs in Modern Distributed Database System Design". IEEE Computer Magazine.