Profiling and Bottleneck Analysis

Use profilers and tracing to find where time is spent and what to optimize.

TL;DR

Don't guess—profile. CPU profilers show which functions consume cycles. Memory profilers reveal leaks and allocations. Distributed tracers track requests across services. Flamegraphs visualize call stacks with time spent. Start with production sampling (low overhead 1-5%); drill deeper with synthetic workloads. Optimize the hottest path first; measure improvement. Typical finding: 80% of time spent in 20% of code. Target that 20%.

Learning Objectives

  • Distinguish CPU, memory, I/O, lock, and context-switch profilers
  • Generate and interpret flamegraphs
  • Use distributed tracing to find slow services in a call chain
  • Profile in production with sampling (low overhead)
  • Identify bottlenecks systematically (not guessing)
  • Measure improvements before/after optimization

Motivation: Data-Driven Optimization

API P99 is 1200ms instead of 300ms target. Team guesses:

  • "Database is slow, add cache"
  • "GC pauses, increase heap"
  • "Network latency, add retries"

Without profiling: wildly expensive optimizations that don't help.

With CPU profiling: 60% of time in JSON serialization. Optimize that single function → P99 drops to 400ms. Cost: 4 hours dev time. ROI: 3x improvement.

With distributed tracing: Only 5 out of 1000 requests hit slow DB query. Caching everything wastes memory; fix the 5 queries instead.

Lesson: Profile before optimizing. Measure after each change. Avoid premature optimization.

Profiler Types

CPU Profiler (Sampling)

Purpose: Which functions consume CPU time?

How it works:
✓ Samples call stack every ~1-10ms
✓ Accumulates CPU time per function
✓ Low overhead (1-5%)
✓ Suitable for production

Output: Flamegraph, call tree, function list

Example Output:
main() ────────────── 1000ms
├─ serialize_json() ─ 600ms (60%) ← HOTTEST
├─ database_query() ─ 300ms (30%)
└─ network_call() ── 100ms (10%)

Tools:
- Linux: perf, flamegraph
- Java: Java Flight Recorder, JProfiler
- Python: cProfile, py-spy
- Go: pprof
- .NET: dotTrace
- APM: Datadog, New Relic, Honeycomb

When to use:
✓ API is slow, find hotspot
✓ Batch job takes too long
✓ Understand algorithm efficiency
✗ Don't use for I/O-bound work (mostly waiting, not CPU)
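
For a quick local profile in Python, the built-in cProfile module produces a per-function breakdown like the example output above. A minimal sketch (note cProfile is instrumentation-based rather than sampling, so keep it out of production and prefer a sampler such as py-spy there; serialize_report is a made-up stand-in for a suspected hotspot):

import cProfile
import json
import pstats

def serialize_report():
    # stand-in for a suspected hotspot: repeated JSON serialization
    payload = {"items": list(range(10_000))}
    for _ in range(100):
        json.dumps(payload)

profiler = cProfile.Profile()
profiler.enable()
serialize_report()
profiler.disable()

# print the five functions with the most cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)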

Flamegraphs: Visualizing Call Stacks

Flamegraphs show the call stack with time spent:

Width = CPU time spent
Height = call depth
Color = usually decorative (some tools hash the function name so the same function always gets the same color)

Example: Web request handling

┌─────────────────────────────────────────────────────────────────┐
│ main() [1000ms total]                                            │
├───────────────────┬───────────────────────────┬─────────────────┤
│ receive_req()     │ process_req()             │ send_response() │
│ [100ms]           │ [800ms]                   │ [100ms]         │
├─────────┬─────────┼─────────┬─────────────────┴─────────────────┤
│decode() │validate │ parse() │ serialize_json() [600ms] ← HOTTEST│
│ [30ms]  │ [20ms]  │ [50ms]  │ (FOCUS HERE)                      │
└─────────┴─────────┴─────────┴───────────────────────────────────┘

Optimization: replace the JSON library → P99 drops to ~400ms

Real flamegraph (text-based):

main
└─ handle_request [1000ms]
   ├─ json.loads [100ms]
   ├─ process [200ms]
   │  ├─ database_query [150ms] ← Significant
   │  └─ compute [50ms]
   └─ json.dumps [600ms] ← HOTTEST, optimize here!

# Linux: CPU profiling with perf + flamegraph

# 1. Record CPU profile for 30 seconds
perf record -F 99 -p 12345 -g -- sleep 30

# 2. Generate flamegraph (stackcollapse-perf.pl and flamegraph.pl come from
#    Brendan Gregg's FlameGraph repo: https://github.com/brendangregg/FlameGraph)
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg

# View in browser, click to zoom

# Python: Generate flamegraph
pip install py-spy

py-spy record -o profile.svg -- python app.py
# Writes profile.svg; open it in a browser and click frames to zoom

# Go: Built-in pprof
# Add import _ "net/http/pprof" to your code and serve it,
# e.g. go http.ListenAndServe("localhost:6060", nil)

curl "http://localhost:6060/debug/pprof/profile?seconds=30" > profile.prof
go tool pprof profile.prof
(pprof) top # Shows hottest functions
(pprof) web # Renders a call graph in the browser (requires Graphviz)
(pprof) list serialize_json # Show source code

Distributed Tracing for Microservices

Find slow services in a call chain:

Problem: API request spans multiple services.
Client → API Gateway → Auth Service → Order Service → Payment Service → Response
Total: 1200ms (P99)
Which service is slow?

Solution: Distributed tracing (Jaeger, Datadog, Honeycomb)

Example trace:
Request: GET /orders/123

├─ Auth Service [10ms] ✓ Fast
├─ Order Service [950ms] ← SLOWEST
│ ├─ Database query [900ms] ← Root cause!
│ └─ Cache lookup [50ms]
└─ Payment Service [200ms] ✓ OK

Total: 1160ms

Action:
- Add index to database (order_id)
- Result: Database query [50ms]
- New total: ~310ms ← nearly 4x faster!

Trace visualization (like flamegraph but for services):

Request Trace (1200ms P99)
├─ API Gateway ──────────── [10ms]
├─ Auth Service ─────────── [10ms]
├─ Order Service ─────────────────────────────── [950ms] ← Drill into this
│  ├─ DB Query ───────────────────────────────── [900ms] ← Root cause!
│  └─ Cache ─────────────── [50ms]
└─ Payment Service ──────── [200ms]
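
Traces like the one above are assembled from spans that each service emits. A minimal sketch of span creation with the OpenTelemetry Python SDK (assumes the opentelemetry-api and opentelemetry-sdk packages are installed; service and span names are illustrative, and the console exporter stands in for Jaeger/Datadog/Honeycomb):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# export spans to the console here; a real setup exports to a collector
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

def get_order(order_id):
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.query"):
            pass  # the 900ms query would show up as this child span
        with tracer.start_as_current_span("cache.lookup"):
            pass  # the 50ms cache lookup

get_order(123)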

Systematic Bottleneck Analysis

Process:

  1. Set target (e.g., P99 < 300ms)
  2. Measure current (e.g., P99 = 1200ms)
  3. Profile (identify hottest functions)
  4. Optimize hottest 20%
  5. Measure again (verify improvement)
  6. Repeat until target met

Real example (e-commerce API):

Target: P99 latency < 300ms
Current: P99 = 1200ms

Step 1: CPU Profile
serialize_json() 600ms (50%) ← Hottest
database_query() 300ms (25%)
cache_lookup() 150ms (12%)
request_validation() 100ms (8%)
error_handling() 50ms (5%)

Step 2: Optimize #1 (JSON serialization)
✓ Replace json library with faster one
✓ Cache serialized responses
✓ Use protobuf instead of JSON
Result: 600ms → 100ms

New profile:
database_query() 300ms (35%) ← Now hottest
cache_lookup() 150ms (18%)
serialize_json() 100ms (12%)
...
P99 = 550ms (target 300ms, still not there)

Step 3: Optimize #2 (Database query)
✓ Add index to order_id column
✓ Cache query results
✓ Use connection pooling
Result: 300ms → 50ms

New profile:
cache_lookup() 150ms (60%) ← Now hottest
database_query() 50ms (20%)
...
P99 = 200ms ✓ TARGET MET!
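
A minimal sketch of the measure-before/measure-after step, run against the same workload each time (handle_request is a placeholder for the real endpoint or function under test):

import statistics
import time

def measure_p99_ms(fn, iterations=1000):
    # run the function many times and return the 99th-percentile latency in ms
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=100)[98]

def handle_request():
    sum(i * i for i in range(50_000))  # placeholder workload

before = measure_p99_ms(handle_request)
# ... apply exactly one optimization, then re-measure ...
after = measure_p99_ms(handle_request)
print(f"P99: {before:.1f}ms -> {after:.1f}ms")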

Common Pitfalls and Fixes

  • Guessing the bottleneck → profile first, then optimize the measured hotspot
  • Profiling only in staging → sample in production under realistic load (1-5% overhead)
  • Optimizing cold paths → focus on the hottest 20% of code
  • Skipping the after-measurement → re-measure P99 after every change to confirm the win
  • Heavy instrumentation in production → use sampling profilers and keep overhead under ~5%

Tools Comparison

Tool                   Language       Type                        Overhead   Cost
perf                   C/C++/Python   CPU                         1-2%       Free
Java Flight Recorder   Java           CPU, Memory, Locks          5%         Built-in
pprof                  Go             CPU, Memory, Goroutines     2-5%       Built-in
Datadog                Any            APM, distributed tracing    5-15%      $$
Honeycomb              Any            Tracing, observability      10-20%     $$
Jaeger                 Any            Distributed tracing         5-10%      Free
py-spy                 Python         CPU                         1-5%       Free

Profiling Checklist

  • Have you profiled before optimizing (not guessing)?
  • Is your profiling under realistic load (not staging)?
  • Do you have CPU, memory, and I/O profiles?
  • Are you using distributed tracing across services?
  • Did you identify the hottest 20% of code?
  • Are you measuring before and after optimization?
  • Is production sampling in place (continuous profiling)?
  • Can you generate flamegraphs on demand?
  • Are performance regressions caught automatically (benchmarks/alerts)? See the sketch after this list.
  • Is profiling overhead acceptable (<5%)?
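
A minimal sketch of a performance regression test that can run in CI, assuming a stored baseline; the handler, baseline, and tolerance values are illustrative:

import statistics
import time

BASELINE_MS = 5.0   # last known-good median latency (illustrative)
TOLERANCE = 1.5     # fail if more than 50% slower than the baseline

def handler():
    sum(i * i for i in range(50_000))  # placeholder for the code under test

def test_handler_has_not_regressed():
    samples = []
    for _ in range(200):
        start = time.perf_counter()
        handler()
        samples.append((time.perf_counter() - start) * 1000)
    median = statistics.median(samples)
    assert median <= BASELINE_MS * TOLERANCE, f"regression: median {median:.2f}ms"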

Self-Check

  1. Why profile before optimizing? Guessing wrong wastes time. Profiling shows reality.
  2. What's the 80/20 rule in profiling? 80% of time spent in 20% of code. Optimize that 20%.
  3. How does sampling work? Interrupt thread every ~1-10ms, record call stack. Aggregate over many samples.
  4. What's a flamegraph? Visualization of call stack with time spent (width). Easier than lists to spot bottlenecks.
  5. Why use distributed tracing? Find slow service in a chain of 10+ services.

One Takeaway: Profile before optimizing. Use CPU profiler to find hottest functions, memory profiler for leaks, distributed tracing for microservices. Measure improvement after each change.

Next Steps

  • Set up continuous profiling in production (Datadog, Honeycomb)
  • Create performance regression tests (catch slowdowns before production)
  • Learn flamegraph generation for your language
  • Study distributed tracing setup (Jaeger, Datadog)
  • Benchmark algorithm changes before/after (measure, don't guess)

Production Profiling Best Practices

Continuous Profiling

Enable lightweight profiling (1-5% overhead) continuously in production:

Benefits:

  • Discover real-world bottlenecks (not visible in staging)
  • Detect performance regressions immediately
  • Historical data for trend analysis
  • Actionable insights without crashes

Tools:

  • Datadog Continuous Profiler: 5% overhead, production-safe
  • py-spy (Python): Lightweight sampling profiler that attaches to running processes (pyflame, an earlier option, is no longer maintained)
  • Async Profiler (Java): Sub-percent overhead

Example setup:

# The Datadog agent profiles continuously:
# periodically sample call stacks, aggregate, and upload to Datadog
# production-safe, minimal overhead
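
A minimal sketch of enabling Datadog's continuous profiler in a Python service, assuming the ddtrace package is installed and an agent is reachable (check Datadog's docs for current configuration details):

import ddtrace.profiling.auto  # noqa: F401  # importing this module starts the profiler

# The rest of the application runs unchanged; call stacks are sampled
# continuously and shipped to the Datadog agent in the background.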

Profiling in Microservices

Challenge: A single slow request spans 5+ services. Which one is the bottleneck?

Solution: Distributed tracing + per-service profiling.

Request timeline:
t=0ms Client → API Gateway
t=10ms API Gateway → Auth Service
t=50ms Auth Service → Order Service ← SLOW (800ms here!)
t=850ms Order Service → Payment Service
t=900ms Response to Client
Total: 900ms

Drill down into Order Service:
├─ Validation (50ms)
├─ Database query (750ms) ← ROOT CAUSE
├─ Cache lookup (20ms)
└─ Response (10ms)

Action: Add database index → 750ms → 50ms
Result: 900ms → ~200ms total (roughly 4.5x faster)
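
What makes the per-service breakdown possible is trace-context propagation: the caller injects a trace header, the callee extracts it, and both spans join one trace. A minimal sketch using OpenTelemetry's propagation API (service and function names are illustrative; the actual HTTP call is commented out):

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("example")

# Caller side (e.g., API Gateway calling the Order Service)
def call_order_service(order_id):
    headers = {}
    with tracer.start_as_current_span("call order-service"):
        inject(headers)  # adds the W3C traceparent header to the dict
        # requests.get(f"http://order-service/orders/{order_id}", headers=headers)
    return headers

# Callee side (Order Service request handler)
def handle_order_request(request_headers, order_id):
    ctx = extract(request_headers)  # rebuild the caller's trace context
    with tracer.start_as_current_span("GET /orders/{id}", context=ctx):
        with tracer.start_as_current_span("db.query"):
            pass  # the slow 750ms query appears as this child span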

CPU Profiler Deep Dive

Sampling-based profilers:

  • Interrupt every ~1-10ms, record call stack
  • Accumulate statistics
  • Low overhead (1-5%)
  • Statistical (approximate, but reflects the code paths real traffic exercises)

Instrumentation-based profilers:

  • Hook every function entry/exit
  • Exact measurements
  • High overhead (10-50%)
  • Better suited to dev/staging and synthetic workloads (the overhead skews production timings)

Hybrid approach:

  • Production: Sampling (low overhead)
  • Staging: Instrumentation (accurate details)
  • Synthetic: Both (complete picture)
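
To make the sampling idea concrete, here is a toy sketch (not a real profiler): a background thread wakes every ~10ms, records which function the main thread is currently executing, and counts samples; the hottest function accumulates the most samples.

import collections
import sys
import threading
import time
import traceback

samples = collections.Counter()

def sampler(target_thread_id, interval=0.01, duration=2.0):
    # wake every `interval` seconds and record the target thread's innermost frame
    end = time.time() + duration
    while time.time() < end:
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            samples[traceback.extract_stack(frame)[-1].name] += 1
        time.sleep(interval)

def busy_work():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

main_id = threading.get_ident()
t = threading.Thread(target=sampler, args=(main_id,), daemon=True)
t.start()
busy_work()
t.join()
print(samples.most_common(5))  # functions seen most often ≈ hottest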

Interpreting Flamegraphs

Color coding (different tools use different schemes):

  • In some palettes, kernel space (system calls, I/O) and user space (your code) get different hues
  • Often colors are random warm tones chosen purely for readability
  • Check your tool's legend before reading meaning into color

Width interpretation:

  • Wide function = lots of CPU time (optimize!)
  • Narrow function = little CPU time (not worth optimizing)
  • Many similarly narrow frames = time spread evenly, no single hotspot to target

Height interpretation:

  • Tall stack = deep call hierarchy
  • Optimization: Consider inlining frequently-called chains
  • Trade-off: Code size vs execution speed

Memory Profiler Usage

Java memory example:

Heap dump analysis: Which objects consume memory?

Result:
java.lang.String[]: 500MB (cached JSON responses)
byte[]: 300MB (buffers)
ArrayList: 150MB (parsed data)

Action: Reduce cache TTL, clear old buffers
Result: 950MB → 200MB
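
In Python, the built-in tracemalloc module gives a similar "which allocations dominate" view without an external tool. A minimal sketch (the cache dict is a stand-in for whatever is suspected of holding memory):

import tracemalloc

tracemalloc.start()

# stand-in workload: a cache that holds many large strings
cache = {f"user_{i}": "x" * 1000 for i in range(100_000)}

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # file:line, total size, allocation count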

Lock Contention Profiling

Java threads competing for lock:

Thread 1: Waiting for lock on user_table (50% of time)
Thread 2: Holding lock on user_table (too long)
Thread 3: Waiting for lock on user_table

Solution: Reduce lock scope, use read-write locks, partition data
Result: 50% lock wait → 5% lock wait
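
The usual first fix is to shrink the critical section so the lock is held only for the shared-state update, not for the expensive work. A minimal sketch in Python (names like user_table and expensive_transform are illustrative):

import threading

user_table = {}
user_table_lock = threading.Lock()

def expensive_transform(payload):
    return {"data": payload.upper()}  # stand-in for real processing

def update_user_slow(user_id, payload):
    with user_table_lock:                 # lock held during the expensive work
        user_table[user_id] = expensive_transform(payload)

def update_user_fast(user_id, payload):
    record = expensive_transform(payload)  # expensive work outside the lock
    with user_table_lock:                  # lock held only for the dict write
        user_table[user_id] = record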