Normalization vs Denormalization
Strategic trade-offs between consistency and performance
TL;DR
Normalization reduces data duplication and anomalies through structured schema design (1NF, 2NF, 3NF). Denormalization intentionally reintroduces data duplication for read performance. Most production systems use a hybrid approach: normalized tables for transactions, denormalized views/caches for analytics.
Learning Objectives
- Understand normalization levels and why they matter
- Recognize update/insert/delete anomalies in unnormalized data
- Identify when denormalization helps performance
- Design schemas that balance consistency and speed
Motivating Scenario
E-commerce example: store the author name in every book row, or in a separate authors table? If the author's name changes, how many rows must be updated? Normalized: one. Denormalized: potentially millions. The trade-off: in the normalized design, listing books with author names requires a join.
Core Concepts
Normalization Forms
1NF (First Normal Form): Atomic values only
- No repeating groups in rows
- Each column contains a single value
2NF (Second Normal Form): 1NF + No partial dependencies
- All non-key columns depend on the full primary key
- Remove columns that depend on only part of a composite key (see the sketch below)
3NF (Third Normal Form): 2NF + No transitive dependencies
- Non-key columns depend only on primary key
- Remove columns that depend on other non-key columns
BCNF (Boyce-Codd NF): Stricter than 3NF
- Every determinant is a candidate key
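To make the 2NF partial-dependency rule concrete, here is a minimal sketch using hypothetical enrollments/courses tables (names are illustrative):
-- Violates 2NF: course_title depends only on course_id,
-- i.e. on part of the composite key (student_id, course_id)
CREATE TABLE enrollments_unnormalized (
    student_id INT,
    course_id INT,
    course_title VARCHAR(255),  -- partial dependency
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id)
);
-- 2NF fix: move course_title into a table keyed by course_id alone
CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    course_title VARCHAR(255) NOT NULL
);
CREATE TABLE enrollments (
    student_id INT,
    course_id INT,
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);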
Practical Example
Unnormalized (Problems)
-- Denormalized: Author names embedded in books
CREATE TABLE books_denormalized (
    book_id INT,
    title VARCHAR(255),
    author_id INT,
    author_name VARCHAR(255),   -- Duplication!
    author_email VARCHAR(255),  -- Duplication!
    year INT
);
-- Problems:
-- INSERT anomaly: a new author cannot be added without also adding a book
-- UPDATE anomaly: changing an author's name requires updating ALL of their books
-- DELETE anomaly: deleting an author's last book erases the author's data (demonstrated below)
INSERT INTO books_denormalized VALUES (1, 'Python Guide', 1, 'Alice Smith', 'alice@ex.com', 2025);
INSERT INTO books_denormalized VALUES (2, 'Python Advanced', 1, 'Alice Smith', 'alice@ex.com', 2026);
UPDATE books_denormalized SET author_name = 'Alice Johnson' WHERE author_id = 1;
-- Must update 2 rows!
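Deleting Alice's books shows the delete anomaly noted above:
DELETE FROM books_denormalized WHERE author_id = 1;
-- With her last book gone, Alice's name and email have vanished from the database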
Normalized (3NF)
-- Normalized: Separate tables
CREATE TABLE authors (
    author_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255)
);
CREATE TABLE books (
    book_id INT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    author_id INT,
    year INT,
    FOREIGN KEY (author_id) REFERENCES authors(author_id)
);
-- Benefits:
-- INSERT: Add author once, reference in multiple books
-- UPDATE: Change name once, all books reflect change
-- DELETE: Deleting books no longer erases the author's data
INSERT INTO authors VALUES (1, 'Alice Smith', 'alice@ex.com');
INSERT INTO books VALUES (1, 'Python Guide', 1, 2025);
INSERT INTO books VALUES (2, 'Python Advanced', 1, 2026);
UPDATE authors SET name = 'Alice Johnson' WHERE author_id = 1;
-- Single row update!
-- Query requires JOIN
SELECT b.title, a.name, b.year
FROM books b
JOIN authors a ON b.author_id = a.author_id
WHERE a.author_id = 1;
Strategic Denormalization
-- Base normalized schema, plus denormalized column for performance
CREATE TABLE books (
    book_id INT PRIMARY KEY,
    title VARCHAR(255),
    author_id INT,
    author_name VARCHAR(255),  -- Denormalized for fast reads
    year INT,
    FOREIGN KEY (author_id) REFERENCES authors(author_id)
);
-- Consistency: Update trigger to keep author_name in sync
CREATE TRIGGER update_book_author_name
AFTER UPDATE ON authors
FOR EACH ROW
BEGIN
    UPDATE books SET author_name = NEW.name WHERE author_id = NEW.author_id;
END;
-- Or: Periodic batch update
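-- A minimal batch-sync sketch (PostgreSQL-style UPDATE ... FROM), run on a schedule
-- instead of, or alongside, the trigger above:
UPDATE books b
SET author_name = a.name
FROM authors a
WHERE b.author_id = a.author_id
  AND b.author_name IS DISTINCT FROM a.name;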
-- Performance: Queries don't need JOIN
SELECT title, author_name, year
FROM books
WHERE author_id = 1; -- Single table, no JOIN
When to Use Normalization / Denormalization
Favor normalization when:
- Write-heavy workloads (OLTP)
- Data consistency is critical
- Updates are frequent and widespread
- Storage cost matters
- Complex relationships between entities
Favor denormalization when:
- Read-heavy workloads (OLAP)
- Query performance is critical
- Updates are infrequent and localized
- Eventual consistency is acceptable
- Derived/computed values are accessed frequently (see the sketch after this list)
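For example, a frequently read derived value such as an order total can be stored on the order row instead of being re-aggregated on every read; a minimal sketch, assuming hypothetical orders and order_items tables:
-- Denormalize the derived total onto orders
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(12,2);
UPDATE orders o
SET total_amount = (
    SELECT SUM(quantity * unit_price)
    FROM order_items
    WHERE order_id = o.order_id
);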
Patterns and Pitfalls
Design Review Checklist
- Base schema normalized to at least 3NF
- Primary key defined on all tables (see the catalog query sketch after this checklist)
- Foreign keys enforce relationships
- Update anomalies identified and addressed
- Denormalization justified by measured performance gains
- Mechanisms for consistency (triggers, batch jobs) documented
- Reporting queries can use denormalized views
- Storage requirements documented
- Data quality rules enforced
- Migration path planned if schema changes
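One way to verify the primary-key item above, assuming a PostgreSQL database and its information_schema views (adapt for other engines):
-- List base tables in the public schema that lack a PRIMARY KEY constraint
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
  ON tc.table_schema = t.table_schema
 AND tc.table_name = t.table_name
 AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'public'
  AND t.table_type = 'BASE TABLE'
  AND tc.constraint_name IS NULL;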
Self-Check
- What's the difference between 2NF and 3NF?
- How do you identify update anomalies in unnormalized data?
- When would you denormalize instead of adding indexes?
- How do you keep denormalized copies consistent?
Normalize your base schema to prevent anomalies, then strategically denormalize (with clear mechanisms for consistency) to optimize for measured performance bottlenecks.
Next Steps
- Explore Indexing Strategies for query optimization
- Learn Query Patterns for efficient data retrieval
- Study Materialized Views for denormalization
- Dive into Caching Patterns for application-level denormalization
Real-World Normalization Examples
E-Commerce Product Catalog
Normalized Approach
-- Normalized (3NF)
CREATE TABLE categories (
    id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category_id INT,
    description TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    FOREIGN KEY (category_id) REFERENCES categories(id)
);
CREATE TABLE product_attributes (
    id INT PRIMARY KEY,
    product_id INT,
    attribute_name VARCHAR(255),
    attribute_value VARCHAR(255),
    FOREIGN KEY (product_id) REFERENCES products(id)
);
CREATE TABLE product_prices (
    id INT PRIMARY KEY,
    product_id INT,
    price DECIMAL(10,2),
    currency VARCHAR(3),
    region VARCHAR(50),
    effective_from DATE,
    FOREIGN KEY (product_id) REFERENCES products(id)
);
-- Queries require multiple JOINs
SELECT p.name, c.name as category, pr.price, pa.attribute_name, pa.attribute_value
FROM products p
JOIN categories c ON p.category_id = c.id
LEFT JOIN product_prices pr ON p.id = pr.product_id AND pr.region = 'US'
LEFT JOIN product_attributes pa ON p.id = pa.product_id
WHERE p.id = 123;
Advantages: Single source of truth for each piece of data. Update price once, affects all queries. Consistent category names.
Disadvantages: Multiple JOINs for simple queries. More round-trips to database. Complex schema to understand.
Denormalized Approach (Materialized View)
-- Pre-computed denormalized view
CREATE MATERIALIZED VIEW product_summary AS
SELECT
    p.id,
    p.name,
    c.name as category,
    STRING_AGG(pa.attribute_name || ':' || pa.attribute_value, '; ') as attributes,
    pr.price,
    pr.currency,
    p.created_at
FROM products p
JOIN categories c ON p.category_id = c.id
LEFT JOIN product_prices pr ON p.id = pr.product_id AND pr.region = 'US'
LEFT JOIN product_attributes pa ON p.id = pa.product_id
GROUP BY p.id, c.name, pr.price, pr.currency;
-- Refreshed nightly
REFRESH MATERIALIZED VIEW product_summary;
-- Query is instant
SELECT * FROM product_summary WHERE id = 123;
Advantages: Single table, no JOINs. Fast queries (under 10ms). Good for dashboards/reporting.
Disadvantages: Stale data (refreshed nightly). Disk space (duplicate data). Must maintain refresh schedule.
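One way to maintain the refresh schedule, assuming PostgreSQL with the pg_cron extension installed (REFRESH ... CONCURRENTLY also requires a unique index on the view):
-- Unique index so the view can be refreshed without blocking readers
CREATE UNIQUE INDEX ON product_summary (id);
-- Schedule a nightly refresh at 03:00
SELECT cron.schedule(
    'refresh-product-summary',
    '0 3 * * *',
    'REFRESH MATERIALIZED VIEW CONCURRENTLY product_summary'
);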
Denormalization Strategies
Strategy 1: Cache Denormalization
Keep cache layer in sync with normalized data:
class ProductService:
    def __init__(self, db, cache):
        self.db = db
        self.cache = cache

    def get_product(self, product_id):
        # Try cache first (denormalized)
        cached = self.cache.get(f"product:{product_id}")
        if cached:
            return cached
        # Cache miss: query the normalized DB
        product = self.db.query("""
            SELECT p.*, c.name AS category,
                   STRING_AGG(pa.attribute_name || ': ' || pa.attribute_value, ', ') AS attributes
            FROM products p
            JOIN categories c ON p.category_id = c.id
            LEFT JOIN product_attributes pa ON p.id = pa.product_id
            WHERE p.id = ?
            GROUP BY p.id, c.name
        """, [product_id])
        # Store in cache for an hour
        self.cache.set(f"product:{product_id}", product, ttl=3600)
        return product

    def update_product(self, product_id, data):
        # Update the normalized DB
        self.db.update('products', data, {'id': product_id})
        # Invalidate the stale denormalized copy
        self.cache.delete(f"product:{product_id}")
Strategy 2: Application-Level Denormalization
Compute and store denormalized columns in the database:
-- Add denormalized column to products
ALTER TABLE products ADD COLUMN summary TEXT;
ALTER TABLE products ADD COLUMN avg_rating DECIMAL(3,2);
-- Trigger to recompute the denormalized columns when attributes change
-- (trigger syntax varies by database; a trigger generally cannot update the table it
--  fires on, so this one fires on product_attributes and writes to products)
CREATE TRIGGER update_product_summary
AFTER INSERT ON product_attributes
FOR EACH ROW
BEGIN
    UPDATE products SET
        summary = (
            SELECT STRING_AGG(attribute_name || ': ' || attribute_value, ', ')
            FROM product_attributes
            WHERE product_id = NEW.product_id
        ),
        avg_rating = (
            SELECT AVG(rating)
            FROM reviews
            WHERE product_id = NEW.product_id
        )
    WHERE id = NEW.product_id;
END;
-- Now queries are fast
SELECT id, name, summary, avg_rating FROM products WHERE id = 123;
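avg_rating also depends on reviews, so under the same illustrative assumptions (trigger syntax and restrictions vary by engine) a companion trigger keeps it current as reviews arrive:
CREATE TRIGGER update_product_rating
AFTER INSERT ON reviews
FOR EACH ROW
BEGIN
    UPDATE products
    SET avg_rating = (
        SELECT AVG(rating) FROM reviews WHERE product_id = NEW.product_id
    )
    WHERE id = NEW.product_id;
END;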
Strategy 3: Read Replicas + Denormalization
Keep the master database normalized and denormalize in the read replicas:
Master (Normalized)                  Replicas (Denormalized)
┌─────────────────┐                  ┌──────────────────┐
│ products        │ ──── sync ─────→ │ products_summary │
│ categories      │                  │ (precomputed)    │
│ attributes      │                  └──────────────────┘
└─────────────────┘
     (writes)                              (reads)
# Application logic
class ProductRepository:
    def __init__(self, master_db, replica_db):
        self.master = master_db
        self.replica = replica_db

    def update_product(self, data):
        # Writes go to the normalized master
        self.master.update('products', data)

    def get_product(self, id):
        # Reads come from the denormalized replica
        return self.replica.query(
            "SELECT * FROM products_summary WHERE id = ?", [id]
        )
When to Denormalize: Decision Framework
1. Is this a write-heavy or read-heavy workload?
   ├─ Read-heavy: Consider denormalization
   └─ Write-heavy: Keep normalized
2. Is query performance critical?
   ├─ YES (< 100ms required): Might need denormalization
   └─ NO (> 1 second acceptable): Keep normalized
3. Can you tolerate eventual consistency?
   ├─ YES: Denormalize with batch refresh
   └─ NO: Keep normalized or use triggers
4. Do you have disk space to spare?
   ├─ YES: Denormalization is feasible
   └─ NO: Stick with normalized
5. Can you maintain consistency mechanisms?
   ├─ YES (triggers, batch jobs): Denormalize
   └─ NO: Too risky, keep normalized
Hybrid Approach: Best of Both Worlds
Most production systems use both:
class OptimalSchema:
    """
    Normalized core + denormalized views/cache.
    (db and cache are illustrative module-level handles)
    """

    # Write path: normalized
    def create_product(self, name, category, attributes):
        # Insert into the normalized tables
        product_id = db.insert('products', {'name': name, 'category_id': category})
        for attr_name, attr_value in attributes.items():
            db.insert('product_attributes', {
                'product_id': product_id,
                'attribute_name': attr_name,
                'attribute_value': attr_value
            })
        # No cache entry exists yet; the first read populates it
        return product_id

    # Read path: denormalized
    def get_product(self, product_id):
        # Try the cache (denormalized)
        cached = cache.get(f"product:{product_id}")
        if cached:
            return cached
        # Fall back to the materialized view (also denormalized)
        summary = db.query(
            "SELECT * FROM product_summary WHERE id = ?", [product_id]
        )
        cache.set(f"product:{product_id}", summary)
        return summary

    def list_products(self, category, limit=20):
        # List queries use the denormalized view
        return db.query(
            "SELECT * FROM product_summary WHERE category = ? LIMIT ?",
            [category, limit]
        )
Performance Comparison Example
Illustrative numbers from a typical e-commerce system:
| Operation | Normalized | Denormalized |
|---|---|---|
| Insert product | 50ms | 200ms (10 tables + triggers) |
| Update price | 10ms | 50ms (price + cache invalidation) |
| Get product (no cache) | 150-300ms (3 JOINs) | 20-50ms (single table) |
| Get product (cached) | 5-10ms | 5-10ms |
| List 20 products | 200-500ms | 50-100ms |
| Category change (1M products) | 100ms (single update) | 5-10 minutes (full refresh) |
Conclusion: Normalized for writes, denormalized for reads. Cache for hot paths.
References
- "Database Design" by C.J. Date
- Database Normalization Tutorials
- "Designing Data-Intensive Applications" by Martin Kleppmann
- "SQL Performance Explained" by Markus Winand
- "Database Internals" by Alex Petrov