
ETL/ELT, Data Lakes & Warehouses

Build analytics platforms with raw and curated data

TL;DR

ETL (Extract-Transform-Load) applies transformations before storing: it enforces schema upfront and is slow to adapt. ELT (Extract-Load-Transform) loads raw data first and transforms it in place: flexible, cloud-native, and the modern default. Data lakes store raw, unstructured data at petabyte scale (e.g., S3 plus a data catalog). Data warehouses provide curated, optimized schemas for analytics queries. A common modern architecture layers a data lake as bronze (raw) → silver (cleaned) → gold (analytics), with the warehouse as an optimized query layer for BI tools.

Learning Objectives

By the end of this article, you will understand:

  • Fundamental differences between ETL and ELT approaches
  • Trade-offs between schema-on-write (ETL) and schema-on-read (ELT)
  • Data lake architecture and typical use cases
  • Data warehouse schemas: star, snowflake, denormalized
  • The medallion architecture (bronze-silver-gold layers)
  • How to choose between lake vs warehouse vs hybrid approach

Motivating Scenario

Your company collects data from 50 different SaaS tools (Salesforce, HubSpot, Stripe, etc.) whose schemas vary and change monthly. You need analysts to query this data in near real time. With traditional ETL, you'd write transformations for each source, and every schema change breaks them. With ELT, you dump all raw data to S3 as-is, define schemas in SQL queries, and adapt quickly to changes. Then, for frequently used metrics (revenue by region, customer lifetime value), you materialize optimized tables in a warehouse using dbt, giving analysts both flexibility (raw data queries) and performance (pre-built tables).

Core Concepts

Figure: Data lake to warehouse architecture with the medallion pattern

ETL vs ELT

ETL (Extract-Transform-Load):

  1. Extract data from sources
  2. Transform in a separate pipeline (e.g., Informatica, Talend)
  3. Load into structured warehouse

Characteristics:

  • Schema defined upfront (schema-on-write)
  • Transformations centralized and versioned
  • Data quality enforced before warehouse entry
  • Slow to adapt to new source schemas
  • Cumbersome with messy or unstructured data

ELT (Extract-Load-Transform):

  1. Extract data from sources
  2. Load raw into data lake
  3. Transform using SQL/dbt in the warehouse itself

Characteristics:

  • Schema flexible (schema-on-read)
  • Transformations stay with queries, easier to test
  • Raw data always available for recomputation
  • Adapt quickly to schema changes
  • Better for cloud data warehouses (Snowflake, BigQuery)
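
A minimal ELT sketch of this flow in Python. The bucket name, the read_salesforce_api() connector, and the run_in_warehouse() helper that executes SQL against the warehouse are all hypothetical, and the JSON-access syntax inside the SQL (Snowflake-style here) varies by warehouse.

# Minimal ELT sketch: land raw JSON in the lake, then transform with SQL
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# EXTRACT + LOAD: store the payload exactly as received, keyed by ingestion date
records = read_salesforce_api()  # hypothetical connector
key = f"bronze/salesforce/opportunities/dt={datetime.now(timezone.utc):%Y-%m-%d}/batch.json"
s3.put_object(Bucket="my-data-lake", Key=key, Body=json.dumps(records).encode("utf-8"))

# TRANSFORM: schema-on-read inside the warehouse; unknown fields are simply ignored
run_in_warehouse("""
    CREATE OR REPLACE TABLE silver_opportunities AS
    SELECT
        raw:account_id::string    AS account_id,
        raw:amount::number        AS amount,
        raw:created_at::timestamp AS created_at
    FROM bronze_opportunities
""")  # hypothetical helper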

Data Lake Architecture

A data lake stores raw, unstructured, and structured data at massive scale. Typically S3 + metadata catalog (Glue, Unity Catalog).

Medallion Architecture organizes data in layers:

Bronze Layer:

  • Raw data as-received from sources
  • Minimal transformations: add timestamp, source metadata
  • Format: JSON, Parquet, CSV
  • Retention: Long-term (archive historical changes)

Silver Layer:

  • Deduplicated, validated data
  • Standard naming conventions
  • Dropped nulls/invalid records
  • Merged incremental updates
  • Format: Parquet with schema evolution

Gold Layer:

  • Business-ready metrics and dimensions
  • Denormalized for BI tools
  • Pre-aggregated tables (daily revenue, user cohorts)
  • Join fact and dimension tables
  • Format: Parquet optimized for analytical queries

Example flow:

Raw Salesforce JSON (Bronze)
→ Parse, add metadata, validate (Silver)
→ Group by customer, compute lifetime value (Gold)
→ Pre-build dashboard tables (Warehouse)
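
A rough pandas sketch of the same flow. Column names and S3 paths are illustrative, and reading/writing S3 paths directly from pandas assumes s3fs and pyarrow are installed.

import pandas as pd

# Bronze: raw records plus ingestion metadata
bronze = pd.read_json("s3://my-data-lake/bronze/salesforce/opportunities.json")
bronze["_ingested_at"] = pd.Timestamp.now(tz="UTC")
bronze["_source"] = "salesforce"

# Silver: deduplicate, drop invalid records, standardize types
silver = (
    bronze.drop_duplicates(subset="opportunity_id")
          .dropna(subset=["account_id", "amount"])
          .assign(created_at=lambda df: pd.to_datetime(df["created_at"]))
)

# Gold: business-ready metric (customer lifetime value per account)
gold = (
    silver.groupby("account_id", as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "lifetime_value"})
)

# Persist each layer as Parquet; partition where it helps downstream queries
silver.to_parquet("s3://my-data-lake/silver/opportunities/", partition_cols=["_source"])
gold.to_parquet("s3://my-data-lake/gold/customer_lifetime_value.parquet")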

Data Warehouse

A data warehouse organizes data into highly optimized schemas for analytical queries.

Common schemas:

Star Schema:

  • Central fact table (transactions, events)
  • Dimension tables (users, products, dates)
  • Fact joins dimensions on foreign keys
  • Fast aggregations, simple joins
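
For example, a typical star-schema rollup. The fact_sales, dim_customer, and dim_date tables are hypothetical, and the query runs through the load_from_warehouse() placeholder used in the ETL example below.

revenue_by_region = load_from_warehouse("""
    SELECT d.region, t.year, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    JOIN dim_date t     ON f.date_id     = t.date_id
    GROUP BY d.region, t.year
""")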

Snowflake Schema:

  • Extended star schema
  • Dimension tables further normalized
  • Reduces redundancy, increases query complexity
  • Used when dimension data is very large

Denormalized Schema:

  • Flatten fact and dimension tables into a single wide table
  • Fastest queries, but the most storage
  • Used where storage is cheap, e.g., in data lakes
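
A sketch of materializing such a wide table from the star schema above (same hypothetical table names, executed with the run_in_warehouse() helper from the ELT sketch earlier):

run_in_warehouse("""
    CREATE TABLE sales_wide AS
    SELECT f.*, d.region, d.segment, t.year, t.month
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    JOIN dim_date t     ON f.date_id     = t.date_id
""")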

Data Quality & Transformation

dbt (data build tool) defines transformations as SQL queries:

  • Version control transformations
  • Test data quality (NOT NULL, unique, foreign keys)
  • Generate documentation
  • Modular: reference other models
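
For instance, a minimal sketch of what a dbt model might look like. dbt models are SQL files with Jinja; the model, table, and column names here are hypothetical, and tests and documentation would live in an accompanying schema.yml.

-- models/gold/customer_lifetime_value.sql
{{ config(materialized='table') }}

select
    c.account_id,
    c.region,
    sum(o.amount) as lifetime_value
from {{ ref('silver_opportunities') }} o   -- ref() wires up model dependencies
join {{ ref('silver_accounts') }} c
  on o.account_id = c.account_id
group by 1, 2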

Practical Example

# Informatica-style ETL using Python
import pandas as pd
from sqlalchemy import create_engine, text

# EXTRACT
salesforce_df = read_salesforce_api()  # placeholder: connect to the Salesforce API

# TRANSFORM
# Validate required columns
assert 'account_id' in salesforce_df.columns
assert 'amount' in salesforce_df.columns

# Clean and standardize
salesforce_df['amount'] = salesforce_df['amount'].fillna(0)
salesforce_df['created_at'] = pd.to_datetime(salesforce_df['created_at'])

# Compute derived fields
salesforce_df['year'] = salesforce_df['created_at'].dt.year
salesforce_df['month'] = salesforce_df['created_at'].dt.month

# Join with reference data
accounts_df = load_from_warehouse("SELECT id, name FROM accounts")  # placeholder helper
salesforce_df = salesforce_df.merge(
    accounts_df,
    left_on='account_id',
    right_on='id',
)

# LOAD
# pandas.to_sql needs a SQLAlchemy connectable, not a raw psycopg2 connection
engine = create_engine("postgresql+psycopg2:///warehouse")

with engine.begin() as conn:  # commits on success, rolls back on error
    # Truncate (or use CDC logic)
    conn.execute(text("TRUNCATE TABLE sales"))
    # Insert
    salesforce_df.to_sql('sales', conn, if_exists='append', index=False)

print(f"Loaded {len(salesforce_df)} records")

Issues:

  • If Salesforce adds a new field, the pipeline breaks
  • If validation logic changes, all historical data must be reprocessed
  • Tightly coupled to the source schema

When to Use / When Not to Use

Use a data lake when:
  1. You ingest diverse, unstructured sources
  2. Schemas change frequently
  3. You need long-term historical storage
  4. Analytics are exploratory
  5. Queries are ad-hoc SQL
  6. Storage must be cost-optimized (e.g., S3)

Use a data warehouse when:
  1. BI tools drive dashboards
  2. A consistent schema is required
  3. Queries must be fast and predictable
  4. You report business metrics
  5. Complex aggregations should be pre-built
  6. Query performance is critical

Patterns & Pitfalls

  • Schema drift: when Salesforce adds a field, ELT picks it up automatically while rigid ETL breaks. Use schema-on-read: define expected columns in transformations and ignore unknowns, and version schemas in dbt (v1, v2 models) for controlled evolution.
  • Test every layer: Bronze tests check that source data arrived, Silver tests check for duplicates, Gold tests sanity-check metrics (e.g., revenue_sum > 0). Early tests prevent garbage propagating downstream.
  • Partition for your queries: Parquet files partitioned by date let analysts scan only the dates they need, querying one year of sales instead of ten; query engines such as Athena and Spark prune non-matching partitions automatically.
  • Denormalize deliberately: warehouses often denormalize into wide tables for query speed, while lakes can stay normalized where storage is cheap. Choose based on query patterns: high-cardinality dimensions (millions of customers) justify normalization.
  • Full refresh vs. incremental: full refreshes are simple but slow for large data; incremental loads process only the changes since the last run. Use CDC for sources that support it and timestamp-based loads for the rest; dbt handles both patterns.
  • Late-arriving data: yesterday's fact table looks incomplete when late orders land today. Use incremental loads with a lookback window (reprocess the last 2 days, as sketched below) or deltas that reprocess only changed rows.
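
A sketch of the lookback pattern, using Postgres-flavored SQL, hypothetical table names, and the run_in_warehouse() helper from the ELT sketch earlier:

# Reprocess a 2-day lookback window so late-arriving orders are picked up
run_in_warehouse("""
    DELETE FROM gold_daily_sales
    WHERE sale_date >= CURRENT_DATE - INTERVAL '2 days'
""")
run_in_warehouse("""
    INSERT INTO gold_daily_sales (sale_date, region, revenue)
    SELECT created_at::date, region, SUM(amount)
    FROM silver_orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '2 days'
    GROUP BY 1, 2
""")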

Design Review Checklist

  • Documented all data sources: structure, schema, change frequency
  • Decided ETL vs ELT: which transformation approach fits?
  • Designed medallion layers: Bronze (raw) → Silver (clean) → Gold (metrics)
  • Defined data quality tests at each layer (uniqueness, NOT NULL, ranges)
  • Chose partitioning strategy: by date? by region? what queries access?
  • For ETL: documented transformation logic, error handling, retry logic
  • For ELT: planned schema evolution, versioning, backward compatibility
  • Configured incremental loads: timestamp-based? offset-based? CDC?
  • Monitored pipeline: latency, record counts, quality metrics
  • Documented data lineage: where does each column come from?

Self-Check

  • What's the key difference between ETL and ELT? (Hint: when transformation happens: before or after load)
  • Why is schema-on-read flexible? (Hint: fields can be added/removed without breaking pipeline)
  • What's the medallion architecture? (Hint: Bronze raw → Silver clean → Gold metrics)
  • When would you denormalize in a warehouse? (Hint: when query speed matters more than storage)
  • How do incremental loads improve performance? (Hint: process only changes, not full dataset)

Next Steps

  • Build a simple ELT pipeline: load JSON to S3, transform with SQL queries
  • Design dbt models: write SQL models with tests, materialize as tables
  • Set up medallion layers: create Bronze, Silver, Gold schemas in BigQuery/Snowflake
  • Implement incremental loads: add timestamp tracking, process deltas
  • Monitor data quality: set up tests, alerting on failures
