
Distributed Tracing

Track requests across services to identify bottlenecks and failures.

TL;DR

Distributed tracing follows a single request (a trace) as it flows through multiple microservices. A trace is a tree of spans; each span represents work done in one service and carries an operation name, start/end time, status, tags, and logs/events. The trace ID propagates through all services via HTTP headers (traceparent, baggage). Collectors (Jaeger, Zipkin) aggregate spans. Query traces to debug slow requests, find latency bottlenecks, and investigate failures. Instrumentation patterns: auto-instrumentation (easier, less control) and manual instrumentation (more control, more verbose). OpenTelemetry is the standard. Record every trace in development; sample in production (always keep errors and slow requests) and retain traces for about 30 days.

Learning Objectives

  • Understand trace structure (traces, spans, context propagation)
  • Implement tracing in microservices using OpenTelemetry
  • Configure tracing backends (Jaeger, Zipkin, cloud providers)
  • Use traces to debug latency issues and failures
  • Design sampling strategies for production
  • Correlate traces with logs and metrics
  • Avoid common pitfalls (missing context, high overhead, poor sampling)
  • Trace async/message-driven architectures

Motivating Scenario

Customer reports: "Checkout sometimes takes 30 seconds, sometimes instant." Without tracing, you see metrics (avg checkout time: 2s) but not individual requests. With tracing: you see that one specific checkout took 30s because:

  • Payment service took 1s (ok)
  • Fraud check took 5s (ok)
  • Inventory check took 24s (SLOW!)

Without tracing, you would be guessing which service to optimize. With tracing, you go straight to the inventory check.

Core Concepts

Trace Structure

A trace is a tree of spans (more generally, a directed acyclic graph):

Trace ID: abc123
└── Span: api-gateway (0-100ms) → HTTP 200
    ├── Span: checkout-service (1-50ms)
    │   ├── Span: payment-service (5-10ms)
    │   ├── Span: fraud-service (5-25ms)      ← SLOW
    │   └── Span: inventory-service (1-30ms)  ← VERY SLOW
    └── Span: notification-service (60-95ms, async)

Span: Unit of work in one service. Contains (see the sketch after this list):

  • Operation name
  • Start/end time
  • Status (ok, error)
  • Tags (key-value pairs)
  • Logs/events
  • Parent span ID (links to parent)
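
A minimal sketch of a span carrying these fields with the OpenTelemetry Python API. The operation name, attribute values, and the charge() call are illustrative, not part of any real service:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge_card") as span:    # operation name
    span.set_attribute("order_id", "order-456")               # tag (key-value pair)
    span.add_event("card_validated", {"scheme": "visa"})      # log/event on the span
    try:
        charge()                                               # hypothetical work
        span.set_status(Status(StatusCode.OK))                 # status: ok
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))              # status: error
        raise
# Start/end timestamps, the span ID, and the parent span ID are recorded automatically.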

Trace Context: Metadata propagated across services:

  • Trace ID (same for all spans in trace)
  • Span ID (unique per span)
  • Parent span ID (links to parent)
  • Baggage (key-value data passed to all children)
  • Sampled flag (should this trace be recorded?)

Instrumentation Levels

Level  | Description                                 | Overhead   | Effort
Auto   | Framework auto-instruments (HTTP, DB, etc.) | Low        | Low
Manual | Explicit span creation                      | Medium     | High
Hybrid | Auto + manual for complex flows             | Low-Medium | Medium

Sampling Strategies

  • No sampling: All traces recorded (100%). High volume, complete data.
  • Static sampling: Always sample X% (e.g., 10%). Low volume, partial data.
  • Head-based sampling: Decision made at start of trace. Client decides.
  • Tail-based sampling: Decision made after trace completes. Server decides based on content.

Head-based sampling is the most common; tail-based sampling gives the best signal (you can keep every error and slow trace) but requires buffering complete traces and is more complex to operate.
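
A head-based ratio sampler can be configured directly in the OpenTelemetry Python SDK. A minimal sketch; the 10% ratio is an illustrative choice, not a recommendation:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: the decision is made when the root span starts.
# ParentBased keeps child services consistent with the caller's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.1))  # ~10% of new traces

trace.set_tracer_provider(TracerProvider(sampler=sampler))

Tail-based sampling is usually done outside the application (for example in the OpenTelemetry Collector), because it needs the complete trace before deciding.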

Context Propagation

Trace context must flow through:

  • HTTP headers: traceparent and tracestate (W3C Trace Context), plus baggage (W3C Baggage)
  • Message queues: trace context and baggage carried in message headers/metadata (see the sketch below)
  • RPC calls: context in RPC metadata
  • Async jobs: store context in the job payload
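
For transports that auto-instrumentation does not cover, you can inject and extract the context yourself with the global propagator. A sketch for a message queue, assuming hypothetical publish/consume helpers; the carrier is simply a dict of headers stored alongside the message:

from opentelemetry import trace, propagate

tracer = trace.get_tracer(__name__)

# Producer: serialize the current trace context into the message headers
def publish_order_event(queue, payload):
    headers = {}
    propagate.inject(headers)  # e.g. adds "traceparent": "00-<trace-id>-<span-id>-01"
    queue.publish(body=payload, headers=headers)  # hypothetical queue client

# Consumer: restore the context so the handler's span joins the same trace
def handle_order_event(message):
    ctx = propagate.extract(message.headers)  # hypothetical message object
    with tracer.start_as_current_span("handle_order_event", context=ctx) as span:
        span.set_attribute("messaging.operation", "process")
        # ... process the message ...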

Code Examples: OpenTelemetry Tracing

from opentelemetry import trace, propagate
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.propagators.jaeger import JaegerPropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from flask import Flask, request
import requests

# Setup Jaeger exporter (spans are sent to a local Jaeger agent over UDP)
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

trace_provider = TracerProvider(
    resource=Resource.create({SERVICE_NAME: "checkout-service"})
)
trace_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(trace_provider)

# Setup context propagation: W3C Trace Context + W3C Baggage (the SDK defaults),
# plus Jaeger-style headers for services that still expect them
propagate.set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
    JaegerPropagator(),
]))

# Auto-instrument libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

# Example 1: Auto-instrumented endpoint
@app.route("/checkout", methods=["POST"])
def checkout():
    """
    Auto-instrumented via FlaskInstrumentor.
    A server span is created automatically for the HTTP request.
    """
    order_id = request.json["order_id"]
    amount = request.json["amount"]

    # The Flask span is already active here; trace context is
    # propagated automatically to downstream services.
    return process_checkout(order_id, amount)

# Example 2: Manual span creation
def process_checkout(order_id, amount):
    """Create explicit spans for business logic"""
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)

        # Call payment service - context automatically propagated
        payment_result = call_payment_service(order_id, amount)

        if not payment_result["success"]:
            span.set_attribute("status", "payment_failed")
            span.record_exception(Exception("Payment declined"))
            raise Exception("Payment failed")

        # Call fraud service - context automatically propagated
        fraud_check = call_fraud_service(order_id, amount)

        if not fraud_check["passed"]:
            span.set_attribute("fraud_risk", "high")
            span.add_event("fraud_check_failed", {
                "risk_score": fraud_check["risk_score"]
            })
            raise Exception("Fraud detected")

        # Call inventory service - context automatically propagated
        inventory_result = call_inventory_service(order_id)

        span.set_attribute("success", True)
        return {"status": "confirmed", "order_id": order_id}

# Example 3: Nested spans
def call_payment_service(order_id, amount):
    """Demonstrates span nesting"""
    with tracer.start_as_current_span("call_payment_service") as span:
        span.set_attribute("service", "payment")
        span.set_attribute("method", "POST")
        span.set_attribute("endpoint", "/charge")

        try:
            # Manual instrumentation for an HTTP call
            # (RequestsInstrumentor would auto-instrument this; shown manually)
            response = requests.post(
                "http://payment-service:8080/charge",
                json={"order_id": order_id, "amount": amount},
                timeout=5
            )

            span.set_attribute("status_code", response.status_code)
            return response.json()

        except requests.Timeout:
            span.set_attribute("error", "timeout")
            span.set_attribute("error.type", "TimeoutError")
            raise

        except Exception as e:
            span.set_attribute("error", True)
            span.record_exception(e)
            raise

# Example 4: Recording events and logs
import time

def call_fraud_service(order_id, amount):
    """Record detailed events within a span"""
    with tracer.start_as_current_span("call_fraud_service") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)

        # Event: checking rules
        span.add_event("fraud_check_started", {
            "rules": "basic,velocity,pattern"
        })

        # Simulate check
        time.sleep(0.5)  # Fraud check takes 500ms

        span.add_event("fraud_check_completed", {
            "duration_ms": 500,
            "rules_passed": 3,
            "risk_score": 25
        })

        return {"passed": True, "risk_score": 25}

# Example 5: Baggage (cross-service metadata)
from opentelemetry import baggage, context

def call_inventory_service(order_id):
    """Use baggage to propagate metadata"""
    with tracer.start_as_current_span("call_inventory_service") as span:
        span.set_attribute("order_id", order_id)

        # Add to baggage. set_baggage returns a new context; attach it so the
        # values propagate to all downstream calls made in this scope.
        ctx = baggage.set_baggage("customer_tier", "premium")
        ctx = baggage.set_baggage("region", "us-west", context=ctx)
        token = context.attach(ctx)

        try:
            # Baggage is included automatically in context propagation;
            # downstream services can read it.
            customer_tier = baggage.get_baggage("customer_tier")
            span.set_attribute("customer_tier", customer_tier)

            return {"status": "reserved", "warehouse": "PDX"}
        finally:
            context.detach(token)

# Example 6: Error handling with tracing
def handle_checkout_error(order_id, error):
    """Record errors in spans"""
    with tracer.start_as_current_span("handle_checkout_error") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("error.type", type(error).__name__)
        span.set_attribute("error.message", str(error))
        span.record_exception(error)

        # Log compensation logic
        span.add_event("initiating_compensation", {
            "compensation": "cancel_order"
        })

        # Call compensation service here (omitted)
        return {"status": "cancelled", "reason": str(error)}

# Example 7: Async context propagation
import asyncio

async def process_checkout_async(order_id, amount):
    """Handle async operations with tracing"""
    with tracer.start_as_current_span("process_checkout_async") as span:
        span.set_attribute("order_id", order_id)

        # OpenTelemetry stores the active context in contextvars, so tasks
        # created here inherit it and their spans link to this parent.

        # Create child spans for parallel operations
        payment_task = asyncio.create_task(
            call_payment_service_async(order_id, amount)
        )
        fraud_task = asyncio.create_task(
            call_fraud_service_async(order_id, amount)
        )

        payment_result, fraud_result = await asyncio.gather(
            payment_task,
            fraud_task
        )

        return {
            "payment": payment_result,
            "fraud": fraud_result
        }

async def call_payment_service_async(order_id, amount):
    """Async operation with tracing"""
    with tracer.start_as_current_span("call_payment_service_async") as span:
        span.set_attribute("order_id", order_id)
        await asyncio.sleep(0.1)
        return {"success": True}

async def call_fraud_service_async(order_id, amount):
    """Async operation with tracing"""
    with tracer.start_as_current_span("call_fraud_service_async") as span:
        span.set_attribute("order_id", order_id)
        await asyncio.sleep(0.5)  # Slower
        return {"passed": True}

if __name__ == "__main__":
    app.run(port=5000)

Real-World Examples

Latency Investigation

Customer reports: "Search is slow." Metrics show: avg 200ms, p99 1000ms.

Trace shows:

├── GET /search (0-150ms)
│   ├── Query Elasticsearch (0-50ms)
│   ├── Enrich results (50-140ms)
│   │   └── Call recommendation service (60-140ms)  ← SLOW
│   │       └── Call ML model (80-120ms)            ← VERY SLOW
│   └── Format response (140-150ms)

Without tracing, Elasticsearch gets the blame (it is actually fast). With tracing, you fix the ML model latency.

Failure Investigation

Trace shows:

├── POST /checkout (0-5000ms) ERROR
│   ├── Call payment service (0-500ms) OK
│   ├── Call fraud service (500-2500ms) TIMEOUT
│   │   └── Network latency to fraud service (>2s)
│   └── [timeout, no inventory call]

Root cause: Fraud service timeout, not fraud logic.

Common Mistakes and Pitfalls

Mistake 1: Missing Context Propagation

❌ WRONG: Context lost between services
Service A creates span, calls Service B
Service B sees no trace context
→ Service B spans not linked to Service A

✅ CORRECT: Propagate context via headers
Service A: Serialize trace context to HTTP headers
Service B: Extract trace context from headers
→ Service B spans linked to Service A
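
Auto-instrumentation (FlaskInstrumentor, RequestsInstrumentor) handles this for you; if a call path is not covered, you can propagate by hand with the global propagator. A sketch, with the service URL and route names chosen for illustration:

from opentelemetry import propagate, trace
import requests
from flask import Flask, request

tracer = trace.get_tracer(__name__)
app = Flask(__name__)

# Service A: inject the current trace context into the outgoing headers
def call_service_b(payload):
    headers = {}
    propagate.inject(headers)  # writes traceparent (and baggage) headers
    return requests.post("http://service-b:8080/work", json=payload, headers=headers)

# Service B: extract the context from the incoming headers and parent the span to it
@app.route("/work", methods=["POST"])
def work():
    ctx = propagate.extract(request.headers)
    with tracer.start_as_current_span("work", context=ctx):
        return {"status": "done"}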

Mistake 2: High Sampling Rate in Production

❌ WRONG: Sample 100% of traces
1M requests/day = 1M traces stored
Storage cost: high, retention: short

✅ CORRECT: Sample deliberately
Sample 100% in dev, ~10% in prod (head-based)
Keep 100% of error and slow traces (tail-based, in the collector)
Retain 30 days

Mistake 3: No Baggage for Context

❌ WRONG: No customer context in traces
Can't correlate user actions
Fraud detection blind

✅ CORRECT: Use baggage
Set baggage: customer_id, region, tier
Propagate to all downstream services
Available in logs and metrics
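
On the receiving side, any service can read the propagated baggage and attach it to its own spans or logs (assuming a baggage propagator such as W3C Baggage is configured, as in the setup above). A sketch; reserve_stock is an illustrative function:

from opentelemetry import baggage, trace

tracer = trace.get_tracer(__name__)

def reserve_stock(order_id):
    with tracer.start_as_current_span("reserve_stock") as span:
        # Baggage set upstream (e.g. customer_tier, region) is readable here
        tier = baggage.get_baggage("customer_tier")
        if tier is not None:
            span.set_attribute("customer_tier", tier)
        # ... reserve inventory ...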

Production Considerations

Tracing Infrastructure

  • Jaeger: Open source, self-hosted. Good for on-prem.
  • Zipkin: Open source, simpler than Jaeger.
  • Cloud providers: AWS X-Ray, GCP Cloud Trace, Azure Application Insights.
  • SaaS: Datadog, New Relic, Lightstep. Most backends accept OTLP, natively or via the OpenTelemetry Collector (see the sketch below).
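
Because most backends speak OTLP, switching backends is often just an exporter change on the application side. A sketch, assuming the opentelemetry-exporter-otlp-proto-grpc package and a collector or backend listening on localhost:4317 (both are assumptions for this example):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# Send spans to a local OpenTelemetry Collector (or any backend that accepts OTLP)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)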

Sampling Strategy

Development: 100% sampling (complete visibility).
Production: adaptive sampling (see the sketch after this list):

  • All errors (100%)
  • All slow requests (p95+)
  • All requests from specific users
  • Random 1-5% otherwise
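
Errors and slow requests can only be identified after a trace completes, so those rules are typically implemented as tail-based sampling in a collector. What the application can decide at the head is which traffic to force-sample; a sketch of a custom SDK sampler, where app.debug_trace is an illustrative attribute the caller sets on root spans for flagged users:

from opentelemetry.sdk.trace.sampling import (
    Decision,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

class DebugAwareSampler(Sampler):
    """Always sample spans flagged by the caller; ratio-sample everything else."""

    def __init__(self, ratio=0.05):
        self._fallback = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        # "app.debug_trace" is a hypothetical attribute set on root spans
        # for flagged users or support sessions.
        if attributes and attributes.get("app.debug_trace"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self):
        return "DebugAwareSampler"

Wrap it as ParentBased(root=DebugAwareSampler()) when passing it to TracerProvider so downstream services follow the root span's decision.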

Retention and Storage

  • Development: 7 days
  • Staging: 14 days
  • Production: 30 days
  • Archive old traces to cold storage

Correlating with Logs and Metrics

Trace ID in logs:

{
  "level": "info",
  "message": "processing checkout",
  "trace_id": "abc123",
  "order_id": "order-456",
  "timestamp": "2024-01-01T12:00:00Z"
}

To jump from a log line to its trace, click the trace ID in the log viewer to open the trace in Jaeger.
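
Getting the trace ID into the logs in the first place can be done with a standard logging filter; a minimal sketch (field names mirror the JSON example above; the opentelemetry-instrumentation-logging package can also inject trace IDs automatically):

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        span_context = trace.get_current_span().get_span_context()
        record.trace_id = (
            format(span_context.trace_id, "032x") if span_context.is_valid else "-"
        )
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", "trace_id": "%(trace_id)s"}'
))
logging.getLogger().addHandler(handler)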

Self-Check

  • What's a trace vs. a span?
  • How does trace context propagate?
  • What's the difference between head-based and tail-based sampling?
  • When should you create manual spans?
  • How do you correlate logs with traces?

Design Review Checklist

  • Auto-instrumentation enabled (HTTP, DB, cache)?
  • Context propagation configured (W3C Trace Context)?
  • Manual spans for business logic?
  • Error handling recorded in spans?
  • Baggage for critical context (user_id, region)?
  • Sampling strategy defined (head vs tail)?
  • Tracing backend configured (Jaeger, cloud)?
  • Storage and retention policy set?
  • Trace correlation with logs/metrics?
  • Performance overhead acceptable?
  • PII filtered from traces?
  • Runbooks for slow/error traces?

Next Steps

  1. Install OpenTelemetry libraries
  2. Configure auto-instrumentation
  3. Setup tracing backend
  4. Add manual spans for business logic
  5. Configure sampling strategy
  6. Create dashboards and alerts
  7. Document runbook for investigating traces
