
Wide-Column Stores

Distributed column-oriented storage for petabyte-scale analytics and time-series

TL;DR

Wide-column stores (Cassandra, HBase) organize data into column families partitioned across distributed nodes, and are optimized for time-series, metrics, and analytics. They deliver petabyte scale, tunable consistency (Cassandra), and write-optimized storage. Trade-offs: a more complex operational model, eventual consistency by default, and a design geared toward append-heavy workloads.

Learning Objectives

  • Understand column-family data model
  • Design partition and clustering keys for distribution
  • Recognize when wide-column stores outperform RDBMS
  • Choose appropriate consistency levels

Motivating Scenario

Collecting metrics from 1M servers, 1000 metrics each, at 10-second intervals works out to roughly 100M metrics/second. An RDBMS struggles long before that scale. Cassandra partitions the data by metric_name and stores each measurement as a time-series row, scaling linearly as nodes are added; a query for the last 24 hours of one metric on one server completes in milliseconds.
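
As a sanity check on that rate, the arithmetic in plain Python (no library assumptions):

servers = 1_000_000
metrics_per_server = 1_000
interval_seconds = 10

writes_per_second = servers * metrics_per_server / interval_seconds
print(f"{writes_per_second:,.0f} metrics/second")  # 100,000,000 metrics/second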

Core Concepts

Partition and Clustering Keys

CREATE TABLE metrics (
    metric_name TEXT,      -- Partition key
    server_id   TEXT,      -- Clustering key part 1
    timestamp   TIMESTAMP, -- Clustering key part 2
    value       FLOAT,
    PRIMARY KEY ((metric_name), server_id, timestamp)
);

-- Partition key (metric_name) determines which node owns the row
-- Clustering keys (server_id, timestamp) order rows within the partition
-- All data for one metric lives on the same node (same replica set when RF > 1)
-- Time-series queries within one metric are therefore fast
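
To make the placement rule concrete, here is a toy Python sketch of hash-based partition placement. It is a simplified model, not Cassandra's implementation (Cassandra uses Murmur3 tokens and virtual nodes); the node names, the MD5 hash, and the 32-bit token space are illustrative assumptions.

import hashlib
from bisect import bisect_right

# Toy token ring: each node owns the range ending at its token.
# Real Cassandra uses Murmur3 partitioner tokens and many virtual nodes per host.
NODES = {  # token -> node name (illustrative values)
    2**32 // 4 * 1: "node-a",
    2**32 // 4 * 2: "node-b",
    2**32 // 4 * 3: "node-c",
    2**32 // 4 * 4: "node-d",
}
TOKENS = sorted(NODES)

def token_for(partition_key: str) -> int:
    """Hash the partition key into the ring's token space."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit toy token space

def owner(partition_key: str) -> str:
    """First node whose token is >= the key's token (wrapping around)."""
    t = token_for(partition_key)
    idx = bisect_right(TOKENS, t) % len(TOKENS)
    return NODES[TOKENS[idx]]

# Every row that shares a partition key lands on the same node:
for key in ["cpu", "memory", "disk_io"]:
    print(key, "->", owner(key))

With a replication factor of N, copies of the partition also go to the next N - 1 distinct nodes on the ring (in the simplest replication strategy).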

Practical Example

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from datetime import datetime, timedelta

cluster = Cluster(['localhost'])
session = cluster.connect('metrics')

# Create table
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        metric_name text,
        server_id   text,
        timestamp   timestamp,
        value       float,
        PRIMARY KEY ((metric_name), server_id, timestamp)
    ) WITH CLUSTERING ORDER BY (server_id ASC, timestamp DESC)
""")

# Insert metrics via a prepared statement
insert_stmt = session.prepare(
    "INSERT INTO metrics (metric_name, server_id, timestamp, value) VALUES (?, ?, ?, ?)"
)

def insert_metric(metric_name, server_id, value):
    session.execute(insert_stmt, (metric_name, server_id, datetime.utcnow(), value))

# Bulk insert: one prepared statement, many parameter tuples, executed concurrently
params = [
    ('cpu', f'server-{i}', datetime.utcnow() - timedelta(seconds=j), 50 + (i * j) % 30)
    for i in range(10)
    for j in range(60)
]
execute_concurrent_with_args(session, insert_stmt, params)

# Query: last 24 hours of CPU metrics for one server
# No ALLOW FILTERING needed: the partition key and clustering columns are restricted in order
query = session.prepare("""
    SELECT timestamp, value FROM metrics
    WHERE metric_name = ? AND server_id = ? AND timestamp >= ?
    LIMIT 1440
""")

result = session.execute(query, (
    'cpu',
    'server-1',
    datetime.utcnow() - timedelta(hours=24),
))

for row in result:
    print(f"{row.timestamp}: {row.value}%")  # newest first (timestamp DESC clustering order)

# Tunable consistency: require two replicas to answer this read
from cassandra import ConsistencyLevel

read_query = session.prepare("""
    SELECT value FROM metrics WHERE metric_name = ? AND server_id = ? AND timestamp = ?
""")
read_query.consistency_level = ConsistencyLevel.TWO

# Illustrative point read; an exact timestamp must match a stored row to return data
result = session.execute(read_query, ('cpu', 'server-1', datetime.utcnow()))

When to Use Wide-Column Stores / When Not to Use

Use Wide-Column Stores When
  1. Time-series data (metrics, events, logs)
  2. Petabyte-scale data volume
  3. Write-heavy workloads
  4. Linear horizontal scaling needed
  5. Geographically distributed deployment with multi-datacenter replication
Use RDBMS When
  1. Complex queries with JOINs
  2. Strong ACID transactions required
  3. Dataset is small enough (e.g., < 100 GB) to fit comfortably on one node
  4. Complex reporting queries
  5. Data normalization is important

Patterns and Pitfalls

Design Review Checklist

  • Partition key chosen to avoid hotspots
  • Clustering key order matches query patterns
  • Partition size kept within limits (commonly cited guidance: under roughly 100 MB per partition)
  • Time-bucketing strategy for time-series data (see the sketch after this checklist)
  • Replication factor >= 3 for reliability
  • Consistency level tuned for each query
  • Compaction strategy configured
  • Monitoring for hotspots and imbalance
  • Backup and recovery procedures documented
  • Schema versioning strategy planned
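
As flagged in the checklist, the usual way to bound partition size for time-series is to fold a time bucket into the partition key. The sketch below is a hypothetical day-bucketed variant of the metrics table; the table name metrics_daily and the day format are assumptions, and it reuses the session object from the practical example.

from datetime import datetime

# Hypothetical day-bucketed table: the partition key is (metric_name, day),
# so no partition accumulates more than one day of samples.
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics_daily (
        metric_name text,
        day         text,        -- e.g. '2025-02-15'
        server_id   text,
        timestamp   timestamp,
        value       float,
        PRIMARY KEY ((metric_name, day), server_id, timestamp)
    ) WITH CLUSTERING ORDER BY (server_id ASC, timestamp DESC)
""")

def day_bucket(ts: datetime) -> str:
    """Compute the partition bucket for a sample's timestamp."""
    return ts.strftime("%Y-%m-%d")

def insert_bucketed(metric_name, server_id, value, ts=None):
    ts = ts or datetime.utcnow()
    session.execute(
        "INSERT INTO metrics_daily (metric_name, day, server_id, timestamp, value) "
        "VALUES (%s, %s, %s, %s, %s)",
        (metric_name, day_bucket(ts), server_id, ts, value),
    )

The cost is that a read spanning several days must query one partition per day and merge the results in the application.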

Advanced Concepts: Compaction and Repair

Wide-column stores use compaction to optimize storage and query performance:

Write Path in Cassandra:
1. Append the write to the commit log (persistent, for crash recovery)
2. Write to the memtable (an in-memory, sorted structure)
3. When the memtable fills, flush it to an SSTable (immutable file on disk)
4. Multiple SSTables accumulate over time
5. Compaction merges SSTables, purges deleted data, and keeps reads efficient
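
The following is a deliberately simplified in-memory Python model of steps 1-3. It is a teaching sketch only; the class name ToyNode and the flush threshold are invented, and none of this mirrors Cassandra's real data structures.

class ToyNode:
    """Minimal model of the LSM-style write path: commit log -> memtable -> SSTables."""

    def __init__(self, memtable_limit=4):
        self.commit_log = []        # append-only record of every write (for recovery)
        self.memtable = {}          # in-memory key -> value
        self.sstables = []          # list of immutable, sorted (key, value) lists
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append (conceptually fsync'd)
        self.memtable[key] = value             # 2. update the in-memory structure
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the memtable is full

    def flush(self):
        sstable = sorted(self.memtable.items())  # an immutable, sorted file on disk in reality
        self.sstables.append(sstable)
        self.memtable.clear()

    def read(self, key):
        # Newest data wins: check the memtable first, then SSTables from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

node = ToyNode()
for i in range(10):
    node.write(f"cpu:server-{i}", 50 + i)
print(len(node.sstables), "sstables;", node.read("cpu:server-1"))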

Compaction Strategies:
- SizeTieredCompactionStrategy (STCS): merges similar-sized SSTables; the default, suited to write-heavy workloads
- LeveledCompactionStrategy (LCS): organizes SSTables into levels; more compaction I/O, but lower and more predictable read latency
- TimeWindowCompactionStrategy (TWCS): groups SSTables by time window (e.g., one window per day); well suited to time-series data with TTLs
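
Compaction is configured per table. For example, the metrics table from earlier could be switched to TWCS with daily windows; the one-day window below is an illustrative choice, and the statement reuses the session from the practical example.

# Group SSTables into one-day windows; closed windows are left alone, and
# fully expired (TTL'd) windows can be dropped a whole file at a time
session.execute("""
    ALTER TABLE metrics
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
""")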

Tombstones mark deleted data:

# Delete one row
DELETE FROM metrics WHERE metric_name = 'cpu' AND server_id = 'server-1' AND timestamp = '2025-02-15 10:00:00'

# What actually happens: a tombstone (deletion marker) is written for that row
# The tombstone is kept for gc_grace_seconds (default 10 days)
# After gc_grace_seconds, compaction may discard the tombstone

# Problem: if a replica is down during the delete, it never receives the tombstone
# If the tombstone is purged before that replica is repaired, the old data resurfaces (a "zombie")
# Solution: run repair on every node at least once per gc_grace_seconds
nodetool repair metrics

# Repair compares replicas' data and re-syncs diverged rows
# Cost: expensive I/O; schedule it during low-traffic periods

Designing for Query Patterns

Wide-column stores don't support arbitrary queries; design tables for specific access patterns:

# Anti-pattern: one table for all queries
# Problem: you can't efficiently ask "all metrics for server X in the last 24 hours"
# if the partition key is metric_name

# Good: denormalize per query pattern
CREATE TABLE metrics_by_server (
    server_id   TEXT,
    timestamp   TIMESTAMP,
    metric_name TEXT,
    value       FLOAT,
    PRIMARY KEY ((server_id), timestamp, metric_name)
) WITH CLUSTERING ORDER BY (timestamp DESC, metric_name ASC)

# Query: all metrics for server-1 in the last 24 hours
# (the application computes the cutoff timestamp and passes it as a literal or bind variable)
SELECT * FROM metrics_by_server
WHERE server_id = 'server-1'
  AND timestamp >= '2025-02-14 10:00:00'

# Alternative table for a different query
CREATE TABLE metrics_by_metric (
    metric_name TEXT,
    timestamp   TIMESTAMP,
    server_id   TEXT,
    value       FLOAT,
    PRIMARY KEY ((metric_name), timestamp, server_id)
)

# Query: all servers' CPU in the last hour
SELECT * FROM metrics_by_metric
WHERE metric_name = 'cpu'
  AND timestamp >= '2025-02-15 09:00:00'

# Trade-off: the data is stored twice (denormalization)
# Benefit: both queries are fast; the application keeps the copies in sync (see the sketch below)
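
To keep the two denormalized tables in sync, the application writes every sample to both. A sketch using the driver's BatchStatement (a logged batch makes the two inserts atomic at the cost of extra coordination); the helper name record_metric is made up for illustration.

from datetime import datetime
from cassandra.query import BatchStatement, BatchType

insert_by_server = session.prepare(
    "INSERT INTO metrics_by_server (server_id, timestamp, metric_name, value) VALUES (?, ?, ?, ?)"
)
insert_by_metric = session.prepare(
    "INSERT INTO metrics_by_metric (metric_name, timestamp, server_id, value) VALUES (?, ?, ?, ?)"
)

def record_metric(metric_name, server_id, value):
    """Write one sample to both query-pattern tables atomically (logged batch)."""
    ts = datetime.utcnow()
    batch = BatchStatement(batch_type=BatchType.LOGGED)
    batch.add(insert_by_server, (server_id, ts, metric_name, value))
    batch.add(insert_by_metric, (metric_name, ts, server_id, value))
    session.execute(batch)

record_metric('cpu', 'server-1', 63.2)

For high-throughput ingest, many teams prefer separate asynchronous writes or unlogged batches, since logged batches add coordinator overhead.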

Consistency and Availability Trade-offs

Wide-column stores default to eventual consistency, but Cassandra lets you tune the consistency level per query. The rule of thumb: if the replicas contacted for a read plus those contacted for a write exceed the replication factor (e.g., QUORUM reads with QUORUM writes at RF=3), a read is guaranteed to see the latest acknowledged write.

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Write: require a quorum of replicas to acknowledge before returning
# QUORUM = floor(replication_factor / 2) + 1
# For RF=3: 2 replicas must acknowledge
# Slower but safe: an acknowledged write survives the loss of one replica
write_stmt = SimpleStatement(
    "INSERT INTO metrics (metric_name, server_id, timestamp, value) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write_stmt, ('cpu', 'server-1', datetime.utcnow(), 42.0))

# Fast but risky: return after 1 replica acknowledges
# If that replica dies before the write propagates, the data can be lost
write_stmt.consistency_level = ConsistencyLevel.ONE

# Read: query a quorum of replicas and return the most recent value
read_stmt = SimpleStatement(
    """
    SELECT * FROM metrics
    WHERE metric_name = %s AND server_id = %s AND timestamp >= %s
    """,
    consistency_level=ConsistencyLevel.QUORUM,
)
# Trade-off: slower, but combined with QUORUM writes a read never returns stale data

read_stmt.consistency_level = ConsistencyLevel.ONE
# Fastest read, but the most likely to return stale data
# Fine for metrics (a little staleness is acceptable); not fine for financial data
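
Rather than setting a consistency level on every statement, you can set a session-wide default with an execution profile and override it per statement only where needed. A minimal sketch using the DataStax Python driver's execution profiles; the QUORUM default is an assumption about your workload.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Every statement executed through this session uses QUORUM unless the
# statement itself overrides consistency_level.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
cluster = Cluster(['localhost'], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect('metrics')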

Monitoring and Operational Excellence

Wide-column stores require proactive monitoring:

# Metrics to monitor
metrics = {
    "read_latency_p99": "target: < 10ms",
    "write_latency_p99": "target: < 5ms",
    "gc_pause_duration": "target: < 500ms",
    "compaction_pending_tasks": "target: < 10",
    "tombstone_count_per_sstable": "alert if > 100k",
    "heap_used": "alert if > 80% of max",
    "disk_space_used": "alert if > 80% of capacity",
}

# Common operational tasks
operations = {
    "nodetool status": "Check cluster health; shows each node's state, load, and token ownership",
    "nodetool repair": "Sync data between nodes; run weekly",
    "nodetool compact": "Force compaction; use cautiously (I/O intensive)",
    "nodetool tpstats": "Show thread pool stats; detect stalls",
    "cqlsh": "Interactive shell for ad-hoc queries",
}
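
These checks are easy to automate. A hedged sketch that shells out to nodetool and flags any node not reporting the usual UN (Up/Normal) status; the exact output layout varies by version, so treat the parsing as illustrative.

import subprocess

def unhealthy_nodes():
    """Return nodetool status lines for nodes whose state is not 'UN' (Up/Normal)."""
    output = subprocess.run(
        ["nodetool", "status"], capture_output=True, text=True, check=True
    ).stdout
    problems = []
    for line in output.splitlines():
        code = line[:2]
        # Node lines start with a two-letter status/state code such as UN, DN, UJ, UL
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM" and code != "UN":
            problems.append(line.strip())
    return problems

for line in unhealthy_nodes():
    print("ATTENTION:", line)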

Real-World Scenario: Metrics at 1M Servers

Scale: 1M servers * 1000 metrics/server * 10 measurements/min
= 10B metrics per minute
= ~167M metrics per second (huge!)

Design:
Cluster: RF=3 (3 copies of each metric), sized to the write rate; at ~167M writes/second that means a very large cluster, or more realistically pre-aggregation in front of the store
Partition key: metric_name + date (e.g., "cpu:2025-02-15")
Clustering key: server_id, timestamp

Rationale:
- Daily partitions stop any single partition from growing without bound; very hot metrics may need an extra bucket component to stay within partition-size limits
- The partition key spreads writes across the cluster, so ingest capacity scales roughly linearly as nodes are added
- With RF=3, stored volume is 3x the raw data (see the capacity sketch below)
- A query for "the last 24 hours of CPU across all servers" touches at most 2 daily partitions
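
A rough storage estimate for this design, with the per-sample size as an explicit assumption (on-disk size depends heavily on compression and compaction overhead):

samples_per_minute = 1_000_000 * 1_000 * 10       # 10B samples/minute
samples_per_day = samples_per_minute * 60 * 24     # 14.4T samples/day

bytes_per_sample = 50          # assumption: key + timestamp + float + overhead, pre-compression
replication_factor = 3

raw_per_day = samples_per_day * bytes_per_sample
stored_per_day = raw_per_day * replication_factor

print(f"raw:    {raw_per_day / 1e15:.1f} PB/day")     # ~0.7 PB/day
print(f"stored: {stored_per_day / 1e15:.1f} PB/day")  # ~2.2 PB/day with RF=3

Numbers like these are why raw samples are usually TTL'd after days and pre-aggregated into coarser rollups for long-term retention.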

Query examples:
-- last 24 hours of CPU for server-123 (today's partition; a second query against
-- 'cpu:2025-02-14' covers the rest of the 24-hour window)
SELECT value FROM metrics
WHERE metric_name = 'cpu:2025-02-15'
AND server_id = 'server-123'
AND timestamp >= '2025-02-14 10:00:00'

Operations:
- Run repair weekly, and always within gc_grace_seconds (replicas in a distributed system eventually diverge)
- Monitor the compaction backlog; pair TimeWindowCompactionStrategy with TTLs to enforce retention (e.g., 7-10 days of raw data)
- Plan capacity: add nodes on a regular cadence, ahead of growth
- Backup: snapshot daily; retain snapshots for 30 days

Self-Check

  • What's the difference between partition key and clustering key?
  • Why avoid hotspots in Cassandra?
  • How would you design metrics storage for 1M servers?
  • What's tunable consistency, and when would you use it?
  • What are tombstones and why do they matter?
  • When would you denormalize and maintain multiple tables for the same data?

Wide-column stores are purpose-built for massive-scale time-series and analytical workloads, trading complex queries and strong default consistency for linear horizontal scaling and predictable latency. Use them for metrics, logs, and events when write volume exceeds what an RDBMS can handle (roughly 100K+ writes/sec).

Next Steps

  • Explore Data Pipelines & Analytics for aggregating time-series
  • Learn Sharding Strategies for distributed data design
  • Study Materialized Views for pre-aggregated metrics
  • Dive into Monitoring and Observability for operational insights
  • Master Compaction Strategies for different workload patterns
