# Test Data, Fixtures, and Synthetic Data

*Generate and manage test data efficiently and safely.*
## TL;DR

Test data is expensive to create and maintain:

- Use **fixtures** (static YAML/SQL files) for simple, stable data.
- Use **factories** (programmatic generation) for dynamic variation and complex scenarios.
- Use **synthetic data generators** (Faker, `random`) to create realistic but fake data.
- Seed generators so tests are deterministic: same seed, same generated data, reproducible runs.
- Never use real production data in tests; it is a privacy risk and a likely GDPR violation.
- Organize test data by feature, keep fixtures DRY, and version-control them.
## Learning Objectives
After reading this article, you will understand:
- Trade-offs between fixtures, factories, and synthetic data
- How to design reproducible test data
- Data privacy and compliance in testing
- Strategies for managing test data at scale
- Tools for test data generation
- How to organize and maintain test data
## Motivating Scenario

Your tests depend on hand-crafted test data. Adding a new test means manually building complex fixtures. Tests pass locally but fail in CI because the data changed underneath them. Other developers' tests interfere with yours through a shared test database. You spend 30% of your time managing test data instead of writing tests.

Better approach: use factories to generate realistic data on demand. Each test gets fresh, isolated data. Seed the random generator for reproducibility. No manual fixture maintenance. A minimal sketch of the per-test pattern follows.
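As a sketch of that per-test pattern with pytest and Python's Faker (the fixture and test names here are illustrative):

```python
from faker import Faker
import pytest

@pytest.fixture
def fake():
    # Each test gets its own deterministically seeded generator:
    # reproducible values, and no state shared between tests.
    f = Faker()
    f.seed_instance(42)
    return f

def test_signup_sends_welcome_email(fake):
    email = fake.email()  # same value on every run of this test
    assert "@" in email
```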
## Core Concepts

### Test Data Strategies
| Strategy | Setup Time | Maintenance | Flexibility | Privacy |
|---|---|---|---|---|
| Fixtures | Low | Medium | Low | Good |
| Factories | Medium | Low | High | Good |
| Synthetic | Medium | Medium | High | Excellent |
| Production copy | Low | High | Low | Poor |
## Practical Example

Four approaches to the same users-and-orders domain:
### Factory Boy (Python)
```python
import factory
from faker import Faker

from app.models import Order, User

fake = Faker()

class UserFactory(factory.Factory):
    class Meta:
        model = User

    id = factory.Sequence(lambda n: n)          # 0, 1, 2, ... unique per instance
    email = factory.LazyFunction(fake.email)    # fresh fake value per instance
    name = factory.LazyFunction(fake.name)
    created_at = factory.LazyFunction(fake.date_time)

class OrderFactory(factory.Factory):
    class Meta:
        model = Order

    id = factory.Sequence(lambda n: n)
    user = factory.SubFactory(UserFactory)      # builds a related User automatically
    total = factory.Faker('pydecimal', left_digits=4, right_digits=2)
    status = 'pending'

# Usage in tests
def test_order_total():
    user = UserFactory(name='Alice')            # override only what the test cares about
    order = OrderFactory(user=user, total=99.99)
    assert order.user.name == 'Alice'
    assert order.total == 99.99

def test_bulk_orders():
    orders = OrderFactory.create_batch(100)     # generate 100 orders in one call
    assert len(orders) == 100
    assert all(o.status == 'pending' for o in orders)
```
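One caveat worth knowing: `factory.Faker` declarations draw from factory_boy's shared random state, which can be reseeded for reproducible output (fields built from the module-level `fake` instance above need `Faker.seed()` separately). A minimal sketch, reusing the `OrderFactory` defined above:

```python
import factory.random

def test_factory_output_is_reproducible():
    factory.random.reseed_random('test-data-article')
    first = OrderFactory()
    factory.random.reseed_random('test-data-article')
    second = OrderFactory()
    # Same seed, same pydecimal draw, same total
    assert first.total == second.total
```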
### Faker (JavaScript)

```js
import { faker } from '@faker-js/faker';

// Create a realistic fake user
const user = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
  name: faker.person.fullName(),
  address: faker.location.streetAddress(),
  phone: faker.phone.number(),
  createdAt: faker.date.past(),
};

// Reproducible data: same seed = same data every time
faker.seed(42);
const user1 = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
};

faker.seed(42); // reset to the same seed
const user2 = {
  id: faker.string.uuid(),
  email: faker.internet.email(),
};

expect(user1.email).toBe(user2.email); // identical values
```
### Test Fixtures (YAML)

```yaml
# fixtures/users.yml
users:
  alice:
    id: 1
    email: alice@example.com
    name: Alice
    status: active
  bob:
    id: 2
    email: bob@example.com
    name: Bob
    status: inactive

orders:
  alice_order_1:
    id: 101
    user_id: 1      # references alice
    total: 99.99
    status: pending
  bob_order_1:
    id: 102
    user_id: 2      # references bob
    total: 49.99
    status: completed
```
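Static fixtures are only pleasant to work with if loading them is trivial. A minimal loader sketch, assuming PyYAML and the `User`/`Order` models from the factory example (the file path and helper name are illustrative):

```python
from pathlib import Path

import yaml

from app.models import Order, User

def load_fixtures(path='fixtures/users.yml'):
    """Illustrative helper: turn the YAML fixture file into model objects."""
    data = yaml.safe_load(Path(path).read_text())
    users = {key: User(**attrs) for key, attrs in data['users'].items()}
    orders = {key: Order(**attrs) for key, attrs in data['orders'].items()}
    return users, orders

def test_alice_has_a_pending_order():
    users, orders = load_fixtures()
    assert users['alice'].email == 'alice@example.com'
    assert orders['alice_order_1'].status == 'pending'
```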
### Data Seeding

```python
import random
from decimal import Decimal

# Assumed app-level imports (adjust to your project layout):
from app.models import Order, User
from app.extensions import db  # Flask-SQLAlchemy-style session

def seed_test_database(seed=42):
    """Seed the database with reproducible fake data."""
    random.seed(seed)  # same seed => same "random" values on every run

    # Generate deterministic fake users
    users = [
        User(id=i, email=f'user{i}@example.com', name=f'User {i}')
        for i in range(10)
    ]

    # Generate orders with seeded randomness
    orders = [
        Order(
            id=i,
            user_id=random.randint(0, 9),  # random, but reproducible
            total=Decimal(random.randint(1000, 99999)) / 100,
            status=random.choice(['pending', 'completed', 'cancelled']),
        )
        for i in range(50)
    ]

    # Save to the database
    for record in users + orders:
        db.session.add(record)
    db.session.commit()

# In a test
def test_order_listing():
    seed_test_database()  # fresh, reproducible data
    orders = Order.query.all()
    assert len(orders) == 50
```
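Seeding pairs naturally with cleanup, so every test starts from the same state. A minimal sketch using a pytest fixture; it assumes the Flask-SQLAlchemy-style `db` object from the seeding example above:

```python
import pytest

@pytest.fixture(autouse=True)
def seeded_database():
    db.create_all()               # fresh schema for this test
    seed_test_database(seed=42)   # same rows on every run
    yield                         # test executes here
    db.session.remove()
    db.drop_all()                 # nothing leaks into the next test
```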
## When to Use / When Not to Use

Reach for factories and synthetic data when:

- You need stable, reproducible test data
- Tests are isolated (each test gets fresh data)
- Data setup is complex (relationships, dependencies)
- You want tests to be maintainable and readable
- Performance matters (generate only what each test needs)

Warning signs that your test data strategy needs work:

- You copy production data (privacy/compliance risk)
- Tests manually create data inline (hard to maintain)
- Tests share mutable data and interfere with each other
- Data generation takes longer than the test itself
## Patterns and Pitfalls

*Figure: Test Data Best Practices and Anti-Patterns.*
## Design Review Checklist
- Each test gets fresh, isolated data (no shared state)
- Test data is generated programmatically (factories/synthetic)
- Fixtures are organized by domain/feature
- Random data seeded for reproducibility
- No production data used in tests
- Sensitive fields masked or synthetic
- Test data generation is fast (< 100ms per test)
- Fixtures version-controlled (in git)
- Test data setup is DRY (no duplication)
- Data relationships tested (foreign keys, constraints)
- Documentation explains non-obvious test data
- Data cleanup runs after tests (no orphaned data)
- Synthetic data realistic (not fake-looking)
- Test data doesn't expose implementation details
- GDPR/privacy compliance verified
## Self-Check Questions

- **Q:** Should I use production data in tests? **A:** Never. It violates privacy regulations such as GDPR. Use synthetic data instead.
- **Q:** How do I make test data reproducible? **A:** Seed the random number generator. Same seed, same data.
- **Q:** What's the difference between fixtures and factories? **A:** Fixtures are static files: fast, but rigid and harder to keep in sync as the schema evolves. Factories generate data programmatically: flexible and easier to maintain.
- **Q:** How do I avoid test interference from shared data? **A:** Use factories to generate fresh data per test, and clean up after each test.
- **Q:** Should test data look realistic? **A:** Yes. Use Faker to generate realistic-looking fake data; unrealistic data can hide bugs.
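If you are weaning a suite off production data, masking is the usual bridge: keep the row structure, replace the PII. An illustrative sketch (the `anonymize_user` helper and its field names are assumptions, not an established API):

```python
from faker import Faker

def anonymize_user(user_row):
    """Illustrative masking helper: keep structure, replace PII."""
    fake = Faker()
    fake.seed_instance(user_row['id'])  # same input row => same fake values
    return {
        **user_row,
        'email': fake.email(),
        'name': fake.name(),
        'phone': fake.phone_number(),
    }

masked = anonymize_user({'id': 7, 'email': 'real@person.com',
                         'name': 'Real Person', 'phone': '555-0100'})
assert masked['email'] != 'real@person.com'
```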
## Next Steps

- **Audit test data:** how is it currently managed?
- **Choose a strategy:** fixtures, factories, or synthetic?
- **Implement factories:** reduce manual fixture maintenance
- **Seed for reproducibility:** same seed, same tests
- **Isolate tests:** fresh data per test
- **Remove production data:** replace it with synthetic data
- **Organize fixtures:** by domain/feature
- **Document assumptions:** why is this test data special?