Load, Stress, Soak, and Spike Testing
Validate system behavior under various load conditions and discover bottlenecks before production.
TL;DR
Load testing validates that the system meets latency/throughput targets under expected peak load. Stress testing finds the breaking point by pushing load beyond expected capacity. Soak testing runs at constant load for hours/days to detect memory leaks and resource exhaustion. Spike testing suddenly doubles/triples load to validate auto-scaling and failover. Use tools like JMeter, Locust, or k6. Start with load testing in staging; escalate to stress/soak only after passing load tests. Never run these in production without careful isolation and rollback plans.
Learning Objectives
- Distinguish load, stress, soak, and spike testing and when to apply each
- Design realistic load profiles based on production traffic patterns
- Identify bottlenecks and determine system breaking points
- Detect memory leaks, resource leaks, and long-running issues
- Plan capacity based on testing results
- Automate performance testing in CI/CD
Motivating Scenario
A service handles 100 requests/sec smoothly in staging during load tests. Black Friday comes, traffic spikes to 500 requests/sec, and the service becomes unresponsive. Post-incident investigation reveals: (1) load test didn't match realistic traffic patterns; (2) no stress testing to find breaking point; (3) auto-scaling triggers were misconfigured. Proper testing would have surfaced all three issues before launch.
Core Concepts
Four Types of Performance Tests
Load Testing
- Purpose: Validate SLO compliance under expected peak load
- Load profile: Constant or gradual ramp to expected peak (e.g., 500 req/sec)
- Duration: 10-30 minutes (enough to warm caches and stabilize)
- Success criteria: P99 latency < SLO, zero errors
- Example: "System handles 500 req/sec with P99 < 200ms"
Stress Testing
- Purpose: Find the breaking point and failure modes
- Load profile: Gradual increase until the system fails (e.g., 500→1000→2000 req/sec)
- Duration: Until saturation or errors exceed a threshold
- Success criteria: Identify the breaking point; ensure graceful degradation
- Example: "System saturates at 1200 req/sec; the circuit breaker then activates"
Soak Testing
- Purpose: Detect memory leaks, resource exhaustion, and staleness over time
- Load profile: Constant moderate load (70-80% of capacity)
- Duration: Hours to days (4-48+ hours)
- Success criteria: Memory stable, no connection leaks, no degradation
- Example: "System stable for 24 hours at 400 req/sec"
Spike Testing
- Purpose: Validate auto-scaling and failover under a sudden load increase
- Load profile: Sudden spike (e.g., 500→1500 req/sec instantly)
- Duration: 5-10 minutes at the spike, then ramp down
- Success criteria: Auto-scaling triggers; P99 latency recovers in < 2 min
- Example: "Spike to 1500 req/sec triggers scale-out; P99 recovers to < 300ms"
Load Profile Design
Design realistic load profiles based on production traffic patterns: derive the ramp-up, steady-state, and peak stages from observed request rates, think times, and payload sizes rather than inventing round numbers.
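As a sketch of what such a profile can look like in code, the Locust shape below ramps to normal load, holds, then ramps to the expected peak. The stage durations, user counts, and spawn rates are illustrative placeholders, not values taken from any real traffic analysis.

```python
from locust import LoadTestShape

class StagedLoadShape(LoadTestShape):
    """Ramp to normal load, hold, then ramp to the expected peak.

    Stage values are illustrative; derive them from production traffic analysis.
    Each "duration" is the cumulative end time of the stage in seconds.
    """
    stages = [
        {"duration": 120,  "users": 100, "spawn_rate": 10},  # warm-up to normal load
        {"duration": 600,  "users": 100, "spawn_rate": 10},  # hold at normal load
        {"duration": 720,  "users": 500, "spawn_rate": 25},  # ramp to expected peak
        {"duration": 1320, "users": 500, "spawn_rate": 25},  # hold at peak
    ]

    def tick(self):
        run_time = self.get_run_time()
        for stage in self.stages:
            if run_time < stage["duration"]:
                return stage["users"], stage["spawn_rate"]
        return None  # returning None ends the test
```

Locust uses a LoadTestShape subclass defined in the locustfile automatically, so a shape like this can sit alongside the user classes shown later on this page.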
Practical Example
- JMeter Load Test
- Locust (Python)
- Load Test Analysis
<!-- JMeter Test Plan for API Load Testing (illustrative, simplified JMX) -->
<TestPlan guiclass="TestPlanGui">
  <elementProp name="TestPlan.user_defined_variables" elementType="Arguments"/>

  <!-- Variables -->
  <Arguments guiclass="ArgumentsPanel">
    <elementProp name="base_url" elementType="Argument">
      <!-- Host only; the protocol is set on the sampler -->
      <stringProp name="Argument.value">api.example.com</stringProp>
    </elementProp>
    <elementProp name="target_rps" elementType="Argument">
      <stringProp name="Argument.value">500</stringProp>
    </elementProp>
  </Arguments>

  <!-- Thread Group: 500 concurrent users, 60 s ramp-up, 10 min duration -->
  <ThreadGroup guiclass="ThreadGroupGui">
    <elementProp name="ThreadGroup.main_controller" elementType="LoopController">
      <stringProp name="LoopController.loops">-1</stringProp>
    </elementProp>
    <stringProp name="ThreadGroup.num_threads">500</stringProp>
    <stringProp name="ThreadGroup.ramp_time">60</stringProp>
    <stringProp name="ThreadGroup.duration">600</stringProp>
  </ThreadGroup>

  <!-- HTTP Request Sampler -->
  <HTTPSampler guiclass="HttpTestSampleGui">
    <stringProp name="HTTPSampler.protocol">http</stringProp>
    <stringProp name="HTTPSampler.domain">${base_url}</stringProp>
    <stringProp name="HTTPSampler.path">/api/products</stringProp>
    <stringProp name="HTTPSampler.method">GET</stringProp>
  </HTTPSampler>

  <!-- Listener: results aggregation -->
  <ResultCollector guiclass="SummaryReport">
    <objProp name="sample_variables"/>
  </ResultCollector>

  <!-- Assertion: response code must be 200 -->
  <ResponseAssertion guiclass="AssertionGui">
    <stringProp name="Assertion.test_type">6</stringProp>
    <stringProp name="Assertion.test_strings">200</stringProp>
  </ResponseAssertion>
</TestPlan>
Expected Results:
- Throughput: 500 req/sec
- P50 latency: ~80ms
- P99 latency: < 200ms
- Error rate: 0%
from locust import HttpUser, task, between
import random

class APIUser(HttpUser):
    wait_time = between(0.5, 2.0)  # Think time between requests

    @task(3)
    def get_products(self):
        """Load test GET /api/products"""
        self.client.get("/api/products")

    @task(1)
    def get_product_detail(self):
        """Load test GET /api/products/{id}"""
        product_id = random.randint(1, 10000)
        # Group all product-detail requests under one stats entry instead of 10,000 distinct URLs
        self.client.get(f"/api/products/{product_id}", name="/api/products/[id]")

    @task(1)
    def create_order(self):
        """Load test POST /api/orders"""
        payload = {
            "user_id": random.randint(1, 1000),
            "items": [{"product_id": random.randint(1, 10000), "qty": 1}],
        }
        self.client.post("/api/orders", json=payload)

# Run with:
# locust -f locustfile.py --host=http://api.example.com -u 500 -r 50 -t 10m
Locust Dashboard: Real-time view of RPS, latency, and failures.
Load Test Results: API Gateway
================================
Duration: 10 minutes (600 seconds)
Target RPS: 500
Actual RPS: 498 avg
Latency Percentiles (ms):
P50: 85
P95: 165
P99: 245
P99.9: 315
Error Rate: 0.02% (1 timeout in 5000 requests)
Analysis:
✓ P99 (245ms) is below SLO (300ms) — PASS
✓ Error rate < 0.1% — PASS
⚠ P99.9 approaching SLO cap — monitor closely
Bottleneck Analysis:
- Database CPU: 72% (add read replicas)
- API memory: 1.2 GB (stable, no leak detected)
- Network: 450 Mbps (headroom available)
Load Testing in CI/CD
Automate performance testing:
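One way to wire this into a pipeline is sketched below. It assumes the load test stage ran Locust headless with `--csv results` and that the thresholds match the example SLOs on this page; the CSV column names reflect recent Locust versions and may need adjusting for yours.

```python
#!/usr/bin/env python3
"""CI gate: fail the build if load test results breach the SLO.

Assumes the pipeline already ran something like:
  locust -f locustfile.py --headless -u 500 -r 50 -t 10m \
         --host=http://staging.example.com --csv results
and that results_stats.csv contains an "Aggregated" row.
"""
import csv
import sys

P99_SLO_MS = 200        # example SLO: P99 < 200 ms
MAX_ERROR_RATE = 0.001  # 0.1% error budget for the test run

def check(stats_file: str = "results_stats.csv") -> int:
    with open(stats_file, newline="") as f:
        rows = {row["Name"]: row for row in csv.DictReader(f)}
    agg = rows["Aggregated"]

    p99 = float(agg["99%"])
    requests = int(agg["Request Count"])
    failures = int(agg["Failure Count"])
    error_rate = failures / requests if requests else 1.0

    print(f"P99={p99:.0f}ms  error_rate={error_rate:.4%}  requests={requests}")
    if p99 > P99_SLO_MS or error_rate > MAX_ERROR_RATE:
        print("FAIL: load test breached SLO thresholds")
        return 1
    print("PASS: load test within SLO")
    return 0

if __name__ == "__main__":
    sys.exit(check(*sys.argv[1:]))
```

A step like this runs after the load test stage and fails the build (non-zero exit code) whenever the aggregated results breach the thresholds, which is what gating a deployment on load test results means in practice.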
CI/CD Load Testing Checklist
Common Pitfalls
Pitfall 1: Load profile doesn't match production
- Risk: Staging passes tests; production fails.
- Fix: Analyze real production traffic (request distribution, think times, payload sizes); replicate in tests.
Pitfall 2: Cache warming ignored
- Risk: Cold cache makes latency look worse than production.
- Fix: Warm caches before measuring (see the warm-up sketch after this list); run load tests long enough to reach steady state.
Pitfall 3: Stress test too aggressive
- Risk: Damages staging infrastructure.
- Fix: Start with low load; ramp gradually; have rollback plan.
Pitfall 4: Tests on under-resourced staging
- Risk: Staging bottleneck hides production problems.
- Fix: Ensure staging hardware matches production (or scale proportionally).
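For pitfall 2, a minimal warm-up sketch is shown below, assuming plain HTTP endpoints. The base URL and endpoint list are hypothetical and should come from your traffic analysis, and warm-up traffic should be excluded from the measured window.

```python
import time
import requests

# Hypothetical hot endpoints; replace with the paths your traffic analysis shows are hottest.
HOT_PATHS = ["/api/products", "/api/products/1", "/api/categories"]
BASE_URL = "http://staging.example.com"

def warm_caches(rounds: int = 50, pause: float = 0.1) -> None:
    """Hit hot endpoints repeatedly so caches, connection pools, and JITs reach steady state."""
    for _ in range(rounds):
        for path in HOT_PATHS:
            requests.get(f"{BASE_URL}{path}", timeout=5)
        time.sleep(pause)

if __name__ == "__main__":
    warm_caches()
    print("Warm-up complete; start the measured load test now.")
```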
Real-World Case Studies
Case Study 1: E-Commerce Black Friday
Scenario: Normal traffic 200 req/sec. Black Friday peak: 2000 req/sec (10x).
Load Test Results:
- Normal peak (200 req/sec): P99 = 85ms ✓
- 2x normal (400 req/sec): P99 = 150ms ✓
Stress Test Results:
- Push to 1000 req/sec: P99 = 1200ms, errors begin
- Push to 2000 req/sec: P99 = 5000ms+, circuit breaker opens
Findings:
- System saturates at ~800 req/sec
- Breaking point: database CPU at 100%, connection pool exhausted
- Auto-scaling kicks in at 750 req/sec (configured threshold)
- With 3 additional instances: handles 2000 req/sec with P99 = 200ms
Recommendations:
1. Trigger scale-out early enough that new instances are serving before the ~800 req/sec saturation point (the 750 req/sec threshold leaves little warm-up time)
2. Add read replicas to spread load
3. Implement request queuing with clear API feedback
4. Cache hot queries (product listings)
Case Study 2: Memory Leak in Batch Service
Soak Test Setup:
- 72 hours at 100 req/sec
- Service processes files and should release memory after each one completes
Results (by hour):
- Hour 0: 250 MB
- Hour 24: 320 MB (+70 MB)
- Hour 48: 410 MB (+160 MB)
- Hour 72: 510 MB (+260 MB)
Trend: linear growth; memory never released
Root cause: event listeners not unregistered after processing
Leak rate: ~3.6 MB/hour
Fix: remove the event listener after file processing
listener.on('complete', cleanup)   // handler registered per file, never removed (the leak)
listener.off('complete', cleanup)  // fix: unregister after the file is processed (or use listener.removeListener)
Verification: Re-run 72-hour soak
Memory: ~250MB throughout (stable)
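A sketch of how the memory trend in this case study could be captured during a soak test, assuming the service's PID is reachable from the test host and that `psutil` is installed; in many setups the same series comes straight from container or host metrics instead.

```python
import csv
import time
import psutil

def sample_rss(pid: int, hours: float = 72, interval_s: int = 300,
               out_file: str = "soak_memory.csv") -> None:
    """Append the process RSS (in MB) to a CSV at a fixed interval for later trend analysis."""
    proc = psutil.Process(pid)
    samples = int(hours * 3600 / interval_s)
    with open(out_file, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            rss_mb = proc.memory_info().rss / (1024 * 1024)
            writer.writerow([int(time.time()), round(rss_mb, 1)])
            f.flush()
            time.sleep(interval_s)

# Usage with a hypothetical PID: sample_rss(12345)
# A roughly linear upward slope over 24-72 hours, as in the table above, points to a leak.
```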
Interpreting Load Test Results
Latency Percentiles Explained
- P50 (median): 50% of requests are faster than this value
- P90: 90% of requests are faster than this value
- P99: 99% of requests are faster than this value (99th percentile)
- P99.9: 999 of every 1,000 requests are faster than this value
Example results from a 10,000-request test:
- P50: 80ms (5,000 requests complete in 80ms or less)
- P90: 150ms (9,000 requests complete in 150ms or less)
- P99: 300ms (9,900 requests complete in 300ms or less; 100 are slower)
- P99.9: 1200ms (9,990 requests complete in 1200ms or less; 10 are slower)
SLO: P99 < 300ms means 99% of requests must be fast, but 100 out of 10,000 may be slow
Outliers matter: if 1% of requests can be slow, make sure that 1% is acceptable for the affected use case (e.g., a page load)
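To make the arithmetic concrete, here is a small sketch that computes these percentiles from raw latency samples using the nearest-rank method; the synthetic data below stands in for the 10,000 recorded requests and will not reproduce the exact numbers above.

```python
import math
import random

def percentile(samples_ms: list[float], pct: float) -> float:
    """Latency at the given percentile, nearest-rank method on the sorted sample."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(len(ordered) * pct / 100))  # 1-based rank
    return ordered[rank - 1]

# Synthetic stand-in for 10,000 recorded latencies; real runs would load the tool's raw export.
random.seed(0)
latencies = [random.lognormvariate(4.4, 0.4) for _ in range(10_000)]
for p in (50, 90, 99, 99.9):
    print(f"P{p}: {percentile(latencies, p):.0f} ms")
```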
Error Budgets from Load Tests
SLO: 99.5% uptime, P99 < 200ms
From load test at capacity:
- Error rate: 0.3% (3 errors per 1000 requests)
- P99 latency: 180ms (within SLO)
Monthly error budget:
99.5% uptime = 3.6 hours of downtime allowed per month
During Black Friday:
- If peak load hits saturation
- Error rate jumps to 5%
- That's 5 failed requests per 100
- At 2000 req/sec, that's 100 failed requests per second
- Quickly exhausts monthly error budget
Lesson: Load test at actual peak traffic; don't underestimate
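The budget and burn-rate arithmetic above, written out so it can be recomputed for other SLOs; the figures are the ones used in this example.

```python
HOURS_PER_MONTH = 30 * 24          # 720 hours
availability_slo = 0.995           # 99.5% uptime

downtime_budget_h = HOURS_PER_MONTH * (1 - availability_slo)
print(f"Monthly downtime budget: {downtime_budget_h:.1f} h")   # 3.6 h

# Burn rate during a saturated Black Friday peak
peak_rps = 2000
error_rate_at_saturation = 0.05    # 5% of requests failing
failed_per_second = peak_rps * error_rate_at_saturation
print(f"Failed requests per second at peak: {failed_per_second:.0f}")  # 100/s
```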
Distributed Load Testing
For large-scale systems, single load generator can't simulate realistic load:
# Instead of 1 JMeter client generating 1000 req/sec
# Use 5 distributed agents each generating 200 req/sec
# Simulates clients in 5 geographically separate regions
# JMeter Distributed Setup:
# Master (coordinates test)
# Agents (generate load):
# - agent-us-east: 200 req/sec
# - agent-us-west: 200 req/sec
# - agent-eu: 200 req/sec
# - agent-asia: 200 req/sec
# - agent-brazil: 200 req/sec
# Total: 1000 req/sec from 5 regions
# Results are aggregated on master
# Identifies regional issues (e.g., Asia latency higher)
Monitoring During Load Tests
Don't just measure latency; monitor system health:
# Metrics to track during load test
latency:
  p50: 85ms
  p99: 245ms
  p99.9: 1200ms
error_rate: 0.02%
cpu_usage:
  api_servers: 72%
  database: 85%
  cache: 45%
memory_usage:
  api_servers: 1.2GB / 2GB (60%)
  database: 8.5GB / 16GB (53%)
  cache: 4.2GB / 8GB (52%)
connections:
  database_connections: 450 / 500 (90% pool utilization)
  thread_pool: 350 / 500 (70% utilization)
network:
  bandwidth: 450 Mbps / 1 Gbps (45%)
  packet_loss: 0.01%
resource_constraints:
  - Database connections approaching the pool limit; scale reads or reduce concurrency
  - Database CPU (85%) is closest to saturation; API server CPU (72%) should trip auto-scaling before it becomes the bottleneck
  - Network utilization low; not the bottleneck
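If these metrics are exposed through Prometheus, a polling sketch like the one below can snapshot system health alongside the load generator's own stats. The Prometheus URL and the metric/query names are illustrative assumptions and depend on your exporters.

```python
import time
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # hypothetical address

# Illustrative queries; actual metric names depend on the exporters in use.
QUERIES = {
    "api_cpu": 'avg(rate(process_cpu_seconds_total{job="api"}[1m]))',
    "db_connections": 'sum(pg_stat_activity_count)',
    "p99_latency_s": 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))',
}

def snapshot() -> dict:
    """Query Prometheus once per metric and return the scalar results."""
    results = {}
    for name, query in QUERIES.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
        data = resp.json()["data"]["result"]
        results[name] = float(data[0]["value"][1]) if data else None
    return results

if __name__ == "__main__":
    for _ in range(20):  # e.g. every 30 s for a 10-minute test
        print(time.strftime("%H:%M:%S"), snapshot())
        time.sleep(30)
```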
Post-Test Analysis
After running a load test, analyze the results thoroughly:
1. Latency analysis:
- Is P99 within SLO? If not, what's the limit before breaching?
- Are there anomalies (sudden spike at specific time)?
- Plot latency over time; look for a degradation pattern (see the sketch after this list)
2. Error analysis:
- What errors occurred? 404? 500? Timeout?
- Are specific endpoints more error-prone?
- Errors by error type:
* 0.01% timeout (database slow)
* 0.005% 500 errors (application exception)
* 0.005% 503 (rate limiter)
3. Resource bottleneck:
- Which resource hit limit first? CPU? Memory? Connections?
- Could bottleneck be relieved with more of that resource?
- Is scaling the answer, or is it an architectural problem?
4. Recommendations:
- Add X more servers to handle 2000 req/sec
- Implement caching for hot queries
- Increase database connection pool from 50 to 100
- Optimize slow endpoint (API response time 800ms)
- Add read replicas to spread database load
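For the latency-over-time check in step 1, a sketch like the following buckets raw samples by minute and prints the per-minute P99. The `timestamp`/`latency_ms` CSV format is a hypothetical export; map it to whatever your load tool actually writes.

```python
import csv
import math
from collections import defaultdict

def p99(values: list[float]) -> float:
    """Nearest-rank P99 of a list of latencies."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(len(ordered) * 0.99)) - 1]

def per_minute_p99(results_csv: str) -> dict[int, float]:
    """Group raw samples into one-minute buckets and compute P99 per bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            minute = int(float(row["timestamp"]) // 60)
            buckets[minute].append(float(row["latency_ms"]))
    return {m: p99(v) for m, v in sorted(buckets.items())}

if __name__ == "__main__":
    trend = per_minute_p99("raw_results.csv")  # hypothetical export file
    for minute, value in trend.items():
        print(f"minute {minute}: P99 = {value:.0f} ms")
    # A steadily rising P99 across minutes suggests degradation (leaks, queue growth, GC pressure).
```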
Next Steps
- Analyze production traffic — Request distribution, think times, payload sizes.
- Design load profile — Realistic model based on production patterns.
- Run load test — Validate P99 < SLO under expected peak.
- Run stress test — Find breaking point; ensure graceful degradation.
- Run soak test — Detect leaks; validate long-running stability.
- Run spike test — Validate auto-scaling and failover under sudden load.
- Automate in CI/CD — Gate deployments on load test results.
- Monitor and iterate — Track performance over time; adjust budgets as scale increases.
- Document findings — Share bottlenecks, limits, and scaling recommendations across the team.
References
- JMeter Official Documentation
- Locust Load Testing Framework
- k6 Modern Load Testing
- Google SRE Book — Load Testing
- Brendan Gregg — Performance Testing (Systems Performance, 2nd Ed.)
- "Load Testing as You Grow" — AWS Architecture Blog
- Load Generator Comparison — JMeter vs Locust vs k6 vs Gatling