# Test Data, Fixtures, and Synthetic Data

*Generate and manage test data efficiently and safely.*
## TL;DR

Test data is expensive to create and maintain:

- Use **fixtures** (static YAML/SQL files) for simple, stable data.
- Use **factories** (programmatic generation) for dynamic variation and complex scenarios.
- Use **synthetic data generators** (Faker, `random`) to create realistic but fake data.
- Seed generators so tests are deterministic: same seed, same generated data, reproducible runs.
- Never use real production data in tests; it is a privacy risk and a likely GDPR violation.
- Organize test data by feature, keep fixtures DRY, and version-control them.
## Learning Objectives
After reading this article, you will understand:
- Trade-offs between fixtures, factories, and synthetic data
- How to design reproducible test data
- Data privacy and compliance in testing
- Strategies for managing test data at scale
- Tools for test data generation
- How to organize and maintain test data
## Motivating Scenario

Your tests depend on hand-crafted test data. Adding a new test means manually building complex fixtures. Tests pass locally but fail in CI because the data changed underneath them. Other developers' tests interfere with yours through a shared test database. You spend 30% of your time managing test data instead of writing tests.

Better approach: use factories to generate realistic data on demand. Each test gets fresh, isolated data. Seed the random generator for reproducibility. No manual fixture maintenance. A minimal sketch of the per-test pattern follows.
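As a sketch of that per-test pattern with pytest and Python's Faker (the fixture and test names here are illustrative):

```python
from faker import Faker
import pytest

@pytest.fixture
def fake():
    # Each test gets its own deterministically seeded generator:
    # reproducible values, and no state shared between tests.
    f = Faker()
    f.seed_instance(42)
    return f

def test_signup_sends_welcome_email(fake):
    email = fake.email()  # same value on every run of this test
    assert "@" in email
```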
## Core Concepts

### Test Data Strategies
| Strategy | Setup Time | Maintenance | Flexibility | Privacy |
|---|---|---|---|---|
| Fixtures | Low | Medium | Low | Good |
| Factories | Medium | Low | High | Good |
| Synthetic | Medium | Medium | High | Excellent |
| Production copy | Low | High | Low | Poor |
## Practical Example

Four approaches to the same users-and-orders domain:
### Factory Boy (Python)
```python
import factory
from faker import Faker

from app.models import Order, User

fake = Faker()

class UserFactory(factory.Factory):
    class Meta:
        model = User

    id = factory.Sequence(lambda n: n)          # 0, 1, 2, ... unique per instance
    email = factory.LazyFunction(fake.email)    # fresh fake value per instance
    name = factory.LazyFunction(fake.name)
    created_at = factory.LazyFunction(fake.date_time)

class OrderFactory(factory.Factory):
    class Meta:
        model = Order

    id = factory.Sequence(lambda n: n)
    user = factory.SubFactory(UserFactory)      # builds a related User automatically
    total = factory.Faker('pydecimal', left_digits=4, right_digits=2)
    status = 'pending'

# Usage in tests
def test_order_total():
    user = UserFactory(name='Alice')            # override only what the test cares about
    order = OrderFactory(user=user, total=99.99)
    assert order.user.name == 'Alice'
    assert order.total == 99.99

def test_bulk_orders():
    orders = OrderFactory.create_batch(100)     # generate 100 orders in one call
    assert len(orders) == 100
    assert all(o.status == 'pending' for o in orders)
```
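One caveat worth knowing: `factory.Faker` declarations draw from factory_boy's shared random state, which can be reseeded for reproducible output (fields built from the module-level `fake` instance above need `Faker.seed()` separately). A minimal sketch, reusing the `OrderFactory` defined above:

```python
import factory.random

def test_factory_output_is_reproducible():
    factory.random.reseed_random('test-data-article')
    first = OrderFactory()
    factory.random.reseed_random('test-data-article')
    second = OrderFactory()
    # Same seed, same pydecimal draw, same total
    assert first.total == second.total
```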
### Faker (JavaScript)

```js
import { faker } from '@faker-js/faker';

// Create a realistic fake user
const user = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
  name: faker.person.fullName(),
  address: faker.location.streetAddress(),
  phone: faker.phone.number(),
  createdAt: faker.date.past(),
};

// Reproducible data: same seed = same data every time
faker.seed(42);
const user1 = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
};

faker.seed(42); // reset to the same seed
const user2 = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
};

expect(user1.email).toBe(user2.email); // identical values
```
### Test Fixtures (YAML)

```yaml
# fixtures/users.yml
users:
  alice:
    id: 1
    email: alice@example.com
    name: Alice
    status: active
  bob:
    id: 2
    email: bob@example.com
    name: Bob
    status: inactive

orders:
  alice_order_1:
    id: 101
    user_id: 1      # references alice
    total: 99.99
    status: pending
  bob_order_1:
    id: 102
    user_id: 2      # references bob
    total: 49.99
    status: completed
```
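Static fixtures are only pleasant to work with if loading them is trivial. A minimal loader sketch, assuming PyYAML and the `User`/`Order` models from the factory example (the file path and helper name are illustrative):

```python
from pathlib import Path

import yaml

from app.models import Order, User

def load_fixtures(path='fixtures/users.yml'):
    """Illustrative helper: turn the YAML fixture file into model objects."""
    data = yaml.safe_load(Path(path).read_text())
    users = {key: User(**attrs) for key, attrs in data['users'].items()}
    orders = {key: Order(**attrs) for key, attrs in data['orders'].items()}
    return users, orders

def test_alice_has_a_pending_order():
    users, orders = load_fixtures()
    assert users['alice'].email == 'alice@example.com'
    assert orders['alice_order_1'].status == 'pending'
```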
### Data Seeding

```python
import random
from decimal import Decimal

# Assumed app-level imports (adjust to your project layout):
from app.models import Order, User
from app.extensions import db  # Flask-SQLAlchemy-style session

def seed_test_database(seed=42):
    """Seed the database with reproducible fake data."""
    random.seed(seed)  # same seed => same "random" values on every run

    # Generate deterministic fake users
    users = [
        User(id=i, email=f'user{i}@example.com', name=f'User {i}')
        for i in range(10)
    ]

    # Generate orders with seeded randomness
    orders = [
        Order(
            id=i,
            user_id=random.randint(0, 9),  # random, but reproducible
            total=Decimal(random.randint(1000, 99999)) / 100,
            status=random.choice(['pending', 'completed', 'cancelled']),
        )
        for i in range(50)
    ]

    # Save to the database
    for record in users + orders:
        db.session.add(record)
    db.session.commit()

# In a test
def test_order_listing():
    seed_test_database()  # fresh, reproducible data
    orders = Order.query.all()
    assert len(orders) == 50
```
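Seeding pairs naturally with cleanup, so every test starts from the same state. A minimal sketch using a pytest fixture; it assumes the Flask-SQLAlchemy-style `db` object from the seeding example above:

```python
import pytest

@pytest.fixture(autouse=True)
def seeded_database():
    db.create_all()               # fresh schema for this test
    seed_test_database(seed=42)   # same rows on every run
    yield                         # test executes here
    db.session.remove()
    db.drop_all()                 # nothing leaks into the next test
```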
## When to Use / When Not to Use

Reach for factories and synthetic data when:

- You need stable, reproducible test data
- Tests are isolated (each test gets fresh data)
- Data setup is complex (relationships, dependencies)
- You want tests to be maintainable and readable
- Performance matters (generate only what each test needs)

Warning signs that your test data strategy needs work:

- You copy production data (privacy/compliance risk)
- Tests manually create data inline (hard to maintain)
- Tests share mutable data and interfere with each other
- Data generation takes longer than the test itself
## Patterns and Pitfalls

*Figure: Test Data Best Practices and Anti-Patterns.*
## Design Review Checklist
- Each test gets fresh, isolated data (no shared state)
- Test data is generated programmatically (factories/synthetic)
- Fixtures are organized by domain/feature
- Random data seeded for reproducibility
- No production data used in tests
- Sensitive fields masked or synthetic
- Test data generation is fast (< 100ms per test)
- Fixtures version-controlled (in git)
- Test data setup is DRY (no duplication)
- Data relationships tested (foreign keys, constraints)
- Documentation explains non-obvious test data
- Data cleanup runs after tests (no orphaned data)
- Synthetic data realistic (not fake-looking)
- Test data doesn't expose implementation details
- GDPR/privacy compliance verified
## Self-Check Questions

- **Q:** Should I use production data in tests? **A:** Never. It violates privacy regulations such as GDPR. Use synthetic data instead.
- **Q:** How do I make test data reproducible? **A:** Seed the random number generator. Same seed, same data.
- **Q:** What's the difference between fixtures and factories? **A:** Fixtures are static files: fast, but rigid and harder to keep in sync as the schema evolves. Factories generate data programmatically: flexible and easier to maintain.
- **Q:** How do I avoid test interference from shared data? **A:** Use factories to generate fresh data per test, and clean up after each test.
- **Q:** Should test data look realistic? **A:** Yes. Use Faker to generate realistic-looking fake data; unrealistic data can hide bugs.
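If you are weaning a suite off production data, masking is the usual bridge: keep the row structure, replace the PII. An illustrative sketch (the `anonymize_user` helper and its field names are assumptions, not an established API):

```python
from faker import Faker

def anonymize_user(user_row):
    """Illustrative masking helper: keep structure, replace PII."""
    fake = Faker()
    fake.seed_instance(user_row['id'])  # same input row => same fake values
    return {
        **user_row,
        'email': fake.email(),
        'name': fake.name(),
        'phone': fake.phone_number(),
    }

masked = anonymize_user({'id': 7, 'email': 'real@person.com',
                         'name': 'Real Person', 'phone': '555-0100'})
assert masked['email'] != 'real@person.com'
```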
## Next Steps

- **Audit test data:** how is it currently managed?
- **Choose a strategy:** fixtures, factories, or synthetic?
- **Implement factories:** reduce manual fixture maintenance
- **Seed for reproducibility:** same seed, same tests
- **Isolate tests:** fresh data per test
- **Remove production data:** replace it with synthetic data
- **Organize fixtures:** by domain/feature
- **Document assumptions:** why is this test data special?