
Test Data Management

Create and maintain test data efficiently without compromising privacy or realism.

TL;DR

Test data is expensive to create and maintain. Use fixtures (static, pre-built data), factories (dynamic generation), and synthetic data (fake but realistic). Never use raw production data in tests (GDPR risk); if you must copy production data, mask sensitive fields first. Keep test databases small and fast, and seed them consistently so tests are repeatable. Use factories to generate variation quickly (e.g. 100 users with different attributes). Enforce data isolation: each test creates its own data and cleans up after itself, even on failure.

Learning Objectives

  • Design efficient test data strategies
  • Use fixtures and factories to generate test data
  • Create realistic synthetic data
  • Mask sensitive data for GDPR compliance
  • Maintain test database performance
  • Ensure data isolation and repeatability

Motivating Scenario

A test requires 100 realistic users, but creating them manually takes hours; a factory generates 100 users with varied attributes in seconds. A test touches production data and a user's SSN leaks; proper data masking prevents this. Tests run in parallel but share database state: Test A creates user "Alice", Test B modifies "Alice", and Test A fails. The solution is data isolation.

Core Concepts

| Strategy | Use When | Pros | Cons |
| --- | --- | --- | --- |
| Fixtures | Small, stable data | Fast, reproducible | Brittle to changes |
| Factories | Dynamic variation | Flexible, quick | Generation overhead |
| Synthetic data | Realistic but fake | Safe, realistic patterns | Time to generate |
| Production snapshot | Complex real scenarios | Realistic | GDPR risk, slow |

Fixtures (Static Data)

# users.yaml - pre-built test data
users:
  - id: 1
    name: "Alice"
    email: "alice@example.com"
    created_at: "2025-01-01T00:00:00Z"
  - id: 2
    name: "Bob"
    email: "bob@example.com"
    created_at: "2025-01-02T00:00:00Z"
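A fixture like the one above is parsed once and handed to tests as plain records. A minimal, standard-library-only sketch of the loading step (JSON is used here so the example has no dependencies; a YAML fixture would be parsed the same way with yaml.safe_load from PyYAML, which is an assumption):

```python
import json

# Minimal sketch: a static fixture loaded into test code. The inline JSON
# mirrors the users.yaml fixture above; load_fixture is a hypothetical helper.
FIXTURE = """
{
  "users": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"}
  ]
}
"""

def load_fixture(raw: str) -> list[dict]:
    """Parse the fixture text and return the list of user records."""
    return json.loads(raw)["users"]

users = load_fixture(FIXTURE)
assert users[0]["name"] == "Alice"
```

Because the fixture is checked into version control, every test run sees exactly the same two users.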

Factories (Dynamic Generation)

from factory import Factory, LazyFunction, Sequence
from faker import Faker

fake = Faker()

class UserFactory(Factory):
    class Meta:
        model = User

    id = Sequence(lambda n: n)
    # LazyFunction defers the call so each user gets fresh values;
    # calling fake.name() directly would bake one value into the class.
    name = LazyFunction(fake.name)
    email = LazyFunction(fake.email)
    created_at = LazyFunction(fake.date_time)

# Generate test users
user1 = UserFactory()                  # one user with auto-generated data
users = UserFactory.create_batch(100)  # 100 varied users instantly

Synthetic Data

from faker import Faker

fake = Faker()

# Realistic fake data (GDPR safe)
users = [
    {
        "id": i,
        "name": fake.name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "address": fake.address(),
        "created_at": fake.date_time(),
    }
    for i in range(100)
]

Data Masking (Production Snapshot Anonymization)

from faker import Faker

fake = Faker()

def mask_production_data(user: dict) -> dict:
    """Mask sensitive fields so a production record is safe for testing."""
    return {
        "id": user["id"],
        "name": fake.name(),                        # replace with fake name
        "email": f"test+{user['id']}@example.com",  # anonymize email
        "ssn": None,                                # remove SSN entirely
        "phone": fake.phone_number(),               # replace with fake phone
        "created_at": user["created_at"],           # keep timestamp (non-sensitive)
    }

# Load production users, mask sensitive fields
prod_users = fetch_production_users()
masked_users = [mask_production_data(u) for u in prod_users]
# Now safe to use in tests

When to Use / When NOT to Use

Test Data: Best Practices vs Anti-Patterns
Best Practices
  1. DO: Use Factories for Variation: Generate varied test users (100 different names, ages, emails). Catches edge cases.
  2. DO: Mask Production Data: Copy production schema (realistic), but replace names/emails/SSNs with fakes. GDPR safe.
  3. DO: Isolate Test Data: Each test creates own data, cleans up after. Tests don't interfere.
  4. DO: Use Fixtures for Stable Data: Reference data (countries, currencies) in fixtures. Checked into version control.
  5. DO: Keep Test Database Small: In-memory SQLite for unit tests, Docker containers for integration tests. Fast test runs.
  6. DO: Reset Database Consistently: Each test starts with same initial data. Seeding is reproducible.
Anti-Patterns
  1. DON'T: Hardcode a Single User: One hardcoded 'Alice' means tests only cover one scenario.
  2. DON'T: Copy Production Data As-Is: SSNs and emails leak. GDPR violation, lawsuit.
  3. DON'T: Share a Test Database: Test A modifies data, Test B gets unexpected state, tests fail intermittently.
  4. DON'T: Hardcode Reference Data: It drifts from production when the reference data changes.
  5. DON'T: Use a Production-Sized Database: Creating/dropping a full database per test takes minutes.
  6. DON'T: Start from a Random State: Tests pass sometimes, fail other times.
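The isolation and cleanup practices above (each test creates its own data and tears it down even on failure) can be sketched with an in-memory SQLite database; run_isolated_test is a hypothetical test body, not a real framework API:

```python
import sqlite3

# Minimal sketch of per-test data isolation: the test inserts its own row
# and deletes it in a finally block, so cleanup runs even if the assertion
# fails. An in-memory SQLite database keeps the example self-contained.
def run_isolated_test(conn: sqlite3.Connection) -> None:
    conn.execute("INSERT INTO users (name) VALUES (?)", ("test-user",))
    try:
        count = conn.execute(
            "SELECT COUNT(*) FROM users WHERE name = ?", ("test-user",)
        ).fetchone()[0]
        assert count == 1
    finally:
        # Teardown runs even on failure, so other tests see a clean table.
        conn.execute("DELETE FROM users WHERE name = ?", ("test-user",))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
run_isolated_test(conn)
# The table is empty again after the test.
assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0
```

In a real suite the setup/teardown pair usually lives in a pytest fixture or a transaction that is rolled back, but the try/finally shape is the core idea.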

Patterns & Pitfalls

  • Pitfall: Copying the production database for testing puts real PII (names, emails, SSNs) into test logs. GDPR leak.
  • Pitfall: A test that assumes user ID = 1 and name = 'Alice' breaks as soon as the schema or seed data changes.
  • Pitfall: All tests share one database. Test A creates a user, Test B modifies it, Test A fails unpredictably.
  • Pattern: Factories generate realistic, varied data, and each test gets fresh data. Repeatable, isolated.
  • Pattern: Snapshot production, mask PII, then use it for tests. Realistic schema, safe data.
  • Pitfall: A production-sized test database (10 GB, 1B records) makes test setup take 5 minutes.
  • Pattern: Unit tests use SQLite :memory:. Setup: 10 ms, teardown: 1 ms. Fast test runs.
  • Pitfall: Tests don't clean up, so old data accumulates, tests slow down, and the database bloats.
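The "repeatable, isolated" pattern above depends on reproducible seeding: the same seed must produce the same data on every run. A standard-library sketch of that idea (the same principle applies to Faker via Faker.seed(); make_users is a hypothetical helper):

```python
import random
import string

# Minimal sketch of reproducible test-data seeding: a deterministic RNG
# per seed means identical generated users on every run with that seed.
def make_users(seed: int, n: int) -> list[dict]:
    rng = random.Random(seed)  # deterministic generator for this seed

    def rand_name() -> str:
        return "".join(rng.choices(string.ascii_lowercase, k=8))

    return [
        {"id": i, "name": rand_name(), "email": f"{rand_name()}@example.com"}
        for i in range(n)
    ]

# Same seed, same data; different seed, different data.
assert make_users(seed=42, n=5) == make_users(seed=42, n=5)
assert make_users(seed=42, n=5) != make_users(seed=7, n=5)
```

Recording the seed alongside a failing test run lets you regenerate exactly the data that triggered the failure.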

Design Review Checklist

  • Is test data generated programmatically (factories/fixtures, not manual)?
  • Are factories used for variation (100s of test cases with different data)?
  • Is production data never used in tests (GDPR compliance)?
  • Are sensitive fields masked (no real SSNs, emails, passwords)?
  • Is each test's data isolated (its own data, not shared)?
  • Does each test clean up after itself (even on failure)?
  • Is test database small (MB, not GB)?
  • Is test database fast to create/destroy (seconds, not minutes)?
  • Are database seeds reproducible (same seed = same data)?
  • Are fixtures version-controlled (checked into git)?
  • Is synthetic data realistic (matches production patterns)?
  • Can tests run in parallel (database isolation)?
  • Are test data patterns documented (what data for what test)?
  • Is data factory maintenance easy (update in one place)?
  • Are GDPR requirements enforced (no real PII in tests)?
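The parallel-run question in the checklist usually comes down to each worker owning its own database. A minimal sketch, using a uniquely named table in a private in-memory database to stand in for a per-worker database or container (make_worker_db is a hypothetical helper; real suites often use one schema, database, or Testcontainer per worker):

```python
import sqlite3
import uuid

# Minimal sketch of parallel-safe isolation: each "worker" gets its own
# in-memory database and a uniquely named table, so concurrent tests
# cannot see or clobber each other's data.
def make_worker_db() -> tuple[sqlite3.Connection, str]:
    table = f"users_{uuid.uuid4().hex[:8]}"  # unique name per worker
    conn = sqlite3.connect(":memory:")       # private database per worker
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, name TEXT)")
    return conn, table

conn_a, table_a = make_worker_db()
conn_b, table_b = make_worker_db()
conn_a.execute(f"INSERT INTO {table_a} (name) VALUES ('Alice')")
# Worker B's database is untouched by worker A's insert.
assert conn_b.execute(f"SELECT COUNT(*) FROM {table_b}").fetchone()[0] == 0
```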

Self-Check

  1. Right now, how long does it take to create test data for a new test? If > 5 minutes, that's slow.
  2. Can a test run in isolation (no shared state)? If not, tests are fragile.
  3. Do you use production data in tests? If yes, that's a GDPR risk.
  4. Can tests run in parallel? If not, test suite is slow.

Next Steps

  1. Audit current test data — How is it created? Is it manual?
  2. Build factories — For each major entity (User, Order, Product)
  3. Add Faker — Generate realistic fake data
  4. Implement isolation — Each test: setup fresh data, teardown after
  5. Optimize database — Use in-memory for unit tests
  6. Mask production data — If copying prod, anonymize PII
  7. Document patterns — What data for what test

References

  1. Factory Boy: Python Test Data
  2. Faker: Generate Fake Data
  3. PostgreSQL: Testing
  4. Testcontainers: Docker for Tests