
Distributed Tracing

Track requests across services to identify bottlenecks and failures.

TL;DR

Distributed tracing follows a single request (a trace) as it flows through multiple microservices. A trace is a tree of spans; each span represents work done in one service and carries an operation name, start/end time, status, tags, and logs/events. The trace ID propagates through all services via HTTP headers (traceparent, baggage). Collectors (Jaeger, Zipkin) aggregate spans. Query traces to debug slow requests, find latency bottlenecks, and investigate failures. Instrumentation patterns: auto-instrumentation (easier, less control) and manual instrumentation (more control, more verbose). OpenTelemetry is the standard. Record every trace in development; sample in production (always keep errors and slow requests) and retain traces for about 30 days.

Learning Objectives

  • Understand trace structure (traces, spans, context propagation)
  • Implement tracing in microservices using OpenTelemetry
  • Configure tracing backends (Jaeger, Zipkin, cloud providers)
  • Use traces to debug latency issues and failures
  • Design sampling strategies for production
  • Correlate traces with logs and metrics
  • Avoid common pitfalls (missing context, high overhead, poor sampling)
  • Trace async/message-driven architectures

Motivating Scenario

Customer reports: "Checkout sometimes takes 30 seconds, sometimes instant." Without tracing, you see metrics (avg checkout time: 2s) but not individual requests. With tracing: you see that one specific checkout took 30s because:

  • Payment service took 1s (ok)
  • Fraud check took 5s (ok)
  • Inventory check took 24s (SLOW!)

Without tracing, you would be guessing which service to optimize. With tracing, you go straight to the inventory check.

Core Concepts

Trace Structure

A trace is a tree of spans (more generally, a directed acyclic graph):

Trace ID: abc123
└── Span: api-gateway (0-100ms) → HTTP 200
    ├── Span: checkout-service (1-50ms)
    │   ├── Span: payment-service (5-10ms)
    │   ├── Span: fraud-service (5-25ms)      ← SLOW
    │   └── Span: inventory-service (1-30ms)  ← VERY SLOW
    └── Span: notification-service (60-95ms, async)

Span: Unit of work in one service. Contains (see the sketch after this list):

  • Operation name
  • Start/end time
  • Status (ok, error)
  • Tags (key-value pairs)
  • Logs/events
  • Parent span ID (links to parent)
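
A minimal sketch of a span carrying these fields with the OpenTelemetry Python API. The operation name, attribute values, and the charge() call are illustrative, not part of any real service:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge_card") as span:    # operation name
    span.set_attribute("order_id", "order-456")               # tag (key-value pair)
    span.add_event("card_validated", {"scheme": "visa"})      # log/event on the span
    try:
        charge()                                               # hypothetical work
        span.set_status(Status(StatusCode.OK))                 # status: ok
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))              # status: error
        raise
# Start/end timestamps, the span ID, and the parent span ID are recorded automatically.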

Trace Context: Metadata propagated across services:

  • Trace ID (same for all spans in trace)
  • Span ID (unique per span)
  • Parent span ID (links to parent)
  • Baggage (key-value data passed to all children)
  • Sampled flag (should this trace be recorded?)

Instrumentation Levels

Level  | Description                                 | Overhead   | Effort
Auto   | Framework auto-instruments (HTTP, DB, etc.) | Low        | Low
Manual | Explicit span creation                      | Medium     | High
Hybrid | Auto + manual for complex flows             | Low-Medium | Medium

Sampling Strategies

  • No sampling: All traces recorded (100%). High volume, complete data.
  • Static sampling: Always sample X% (e.g., 10%). Low volume, partial data.
  • Head-based sampling: Decision made at start of trace. Client decides.
  • Tail-based sampling: Decision made after trace completes. Server decides based on content.

Head-based sampling is the most common; tail-based sampling gives the best signal (you can keep every error and slow trace) but requires buffering complete traces and is more complex to operate.
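
A head-based ratio sampler can be configured directly in the OpenTelemetry Python SDK. A minimal sketch; the 10% ratio is an illustrative choice, not a recommendation:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: the decision is made when the root span starts.
# ParentBased keeps child services consistent with the caller's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.1))  # ~10% of new traces

trace.set_tracer_provider(TracerProvider(sampler=sampler))

Tail-based sampling is usually done outside the application (for example in the OpenTelemetry Collector), because it needs the complete trace before deciding.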

Context Propagation

Trace context must flow through:

  • HTTP headers: traceparent and tracestate (W3C Trace Context), plus baggage (W3C Baggage)
  • Message queues: trace context and baggage carried in message headers/metadata (see the sketch below)
  • RPC calls: context in RPC metadata
  • Async jobs: store context in the job payload
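
For transports that auto-instrumentation does not cover, you can inject and extract the context yourself with the global propagator. A sketch for a message queue, assuming hypothetical publish/consume helpers; the carrier is simply a dict of headers stored alongside the message:

from opentelemetry import trace, propagate

tracer = trace.get_tracer(__name__)

# Producer: serialize the current trace context into the message headers
def publish_order_event(queue, payload):
    headers = {}
    propagate.inject(headers)  # e.g. adds "traceparent": "00-<trace-id>-<span-id>-01"
    queue.publish(body=payload, headers=headers)  # hypothetical queue client

# Consumer: restore the context so the handler's span joins the same trace
def handle_order_event(message):
    ctx = propagate.extract(message.headers)  # hypothetical message object
    with tracer.start_as_current_span("handle_order_event", context=ctx) as span:
        span.set_attribute("messaging.operation", "process")
        # ... process the message ...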

Code Examples: OpenTelemetry Tracing

from opentelemetry import trace, propagate
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.propagators.jaeger import JaegerPropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from flask import Flask, request
import requests

# Setup Jaeger exporter (spans are sent to a local Jaeger agent over UDP)
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

trace_provider = TracerProvider(
    resource=Resource.create({SERVICE_NAME: "checkout-service"})
)
trace_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(trace_provider)

# Setup context propagation: W3C Trace Context + W3C Baggage (the SDK defaults),
# plus Jaeger-style headers for services that still expect them
propagate.set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    W3CBaggagePropagator(),
    JaegerPropagator(),
]))

# Auto-instrument libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

# Example 1: Auto-instrumented endpoint
@app.route("/checkout", methods=["POST"])
def checkout():
    """
    Auto-instrumented via FlaskInstrumentor.
    A server span is created automatically for the HTTP request.
    """
    order_id = request.json["order_id"]
    amount = request.json["amount"]

    # The Flask span is already active here; trace context is
    # propagated automatically to downstream services.
    return process_checkout(order_id, amount)

# Example 2: Manual span creation
def process_checkout(order_id, amount):
    """Create explicit spans for business logic"""
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)

        # Call payment service - context automatically propagated
        payment_result = call_payment_service(order_id, amount)

        if not payment_result["success"]:
            span.set_attribute("status", "payment_failed")
            span.record_exception(Exception("Payment declined"))
            raise Exception("Payment failed")

        # Call fraud service - context automatically propagated
        fraud_check = call_fraud_service(order_id, amount)

        if not fraud_check["passed"]:
            span.set_attribute("fraud_risk", "high")
            span.add_event("fraud_check_failed", {
                "risk_score": fraud_check["risk_score"]
            })
            raise Exception("Fraud detected")

        # Call inventory service - context automatically propagated
        inventory_result = call_inventory_service(order_id)

        span.set_attribute("success", True)
        return {"status": "confirmed", "order_id": order_id}

# Example 3: Nested spans
def call_payment_service(order_id, amount):
    """Demonstrates span nesting"""
    with tracer.start_as_current_span("call_payment_service") as span:
        span.set_attribute("service", "payment")
        span.set_attribute("method", "POST")
        span.set_attribute("endpoint", "/charge")

        try:
            # Manual instrumentation for an HTTP call
            # (RequestsInstrumentor would auto-instrument this; shown manually)
            response = requests.post(
                "http://payment-service:8080/charge",
                json={"order_id": order_id, "amount": amount},
                timeout=5
            )

            span.set_attribute("status_code", response.status_code)
            return response.json()

        except requests.Timeout:
            span.set_attribute("error", "timeout")
            span.set_attribute("error.type", "TimeoutError")
            raise

        except Exception as e:
            span.set_attribute("error", True)
            span.record_exception(e)
            raise

# Example 4: Recording events and logs
import time

def call_fraud_service(order_id, amount):
    """Record detailed events within a span"""
    with tracer.start_as_current_span("call_fraud_service") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)

        # Event: checking rules
        span.add_event("fraud_check_started", {
            "rules": "basic,velocity,pattern"
        })

        # Simulate check
        time.sleep(0.5)  # Fraud check takes 500ms

        span.add_event("fraud_check_completed", {
            "duration_ms": 500,
            "rules_passed": 3,
            "risk_score": 25
        })

        return {"passed": True, "risk_score": 25}

# Example 5: Baggage (cross-service metadata)
from opentelemetry import baggage, context

def call_inventory_service(order_id):
    """Use baggage to propagate metadata"""
    with tracer.start_as_current_span("call_inventory_service") as span:
        span.set_attribute("order_id", order_id)

        # Add to baggage. set_baggage returns a new context; attach it so the
        # values propagate to all downstream calls made in this scope.
        ctx = baggage.set_baggage("customer_tier", "premium")
        ctx = baggage.set_baggage("region", "us-west", context=ctx)
        token = context.attach(ctx)

        try:
            # Baggage is included automatically in context propagation;
            # downstream services can read it.
            customer_tier = baggage.get_baggage("customer_tier")
            span.set_attribute("customer_tier", customer_tier)

            return {"status": "reserved", "warehouse": "PDX"}
        finally:
            context.detach(token)

# Example 6: Error handling with tracing
def handle_checkout_error(order_id, error):
    """Record errors in spans"""
    with tracer.start_as_current_span("handle_checkout_error") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("error.type", type(error).__name__)
        span.set_attribute("error.message", str(error))
        span.record_exception(error)

        # Log compensation logic
        span.add_event("initiating_compensation", {
            "compensation": "cancel_order"
        })

        # Call compensation service here (omitted)
        return {"status": "cancelled", "reason": str(error)}

# Example 7: Async context propagation
import asyncio

async def process_checkout_async(order_id, amount):
    """Handle async operations with tracing"""
    with tracer.start_as_current_span("process_checkout_async") as span:
        span.set_attribute("order_id", order_id)

        # OpenTelemetry stores the active context in contextvars, so tasks
        # created here inherit it and their spans link to this parent.

        # Create child spans for parallel operations
        payment_task = asyncio.create_task(
            call_payment_service_async(order_id, amount)
        )
        fraud_task = asyncio.create_task(
            call_fraud_service_async(order_id, amount)
        )

        payment_result, fraud_result = await asyncio.gather(
            payment_task,
            fraud_task
        )

        return {
            "payment": payment_result,
            "fraud": fraud_result
        }

async def call_payment_service_async(order_id, amount):
    """Async operation with tracing"""
    with tracer.start_as_current_span("call_payment_service_async") as span:
        span.set_attribute("order_id", order_id)
        await asyncio.sleep(0.1)
        return {"success": True}

async def call_fraud_service_async(order_id, amount):
    """Async operation with tracing"""
    with tracer.start_as_current_span("call_fraud_service_async") as span:
        span.set_attribute("order_id", order_id)
        await asyncio.sleep(0.5)  # Slower
        return {"passed": True}

if __name__ == "__main__":
    app.run(port=5000)

Real-World Examples

Latency Investigation

Customer reports: "Search is slow." Metrics show: avg 200ms, p99 1000ms.

Trace shows:

├── GET /search (0-150ms)
│   ├── Query Elasticsearch (0-50ms)
│   ├── Enrich results (50-140ms)
│   │   └── Call recommendation service (60-140ms)  ← SLOW
│   │       └── Call ML model (80-120ms)            ← VERY SLOW
│   └── Format response (140-150ms)

Without tracing, Elasticsearch gets the blame (it is actually fast). With tracing, you fix the ML model latency.

Failure Investigation

Trace shows:

├── POST /checkout (0-5000ms) ERROR
│   ├── Call payment service (0-500ms) OK
│   ├── Call fraud service (500-2500ms) TIMEOUT
│   │   └── Network latency to fraud service (>2s)
│   └── [timeout, no inventory call]

Root cause: Fraud service timeout, not fraud logic.

Common Mistakes and Pitfalls

Mistake 1: Missing Context Propagation

❌ WRONG: Context lost between services
Service A creates span, calls Service B
Service B sees no trace context
→ Service B spans not linked to Service A

✅ CORRECT: Propagate context via headers
Service A: Serialize trace context to HTTP headers
Service B: Extract trace context from headers
→ Service B spans linked to Service A
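
Auto-instrumentation (FlaskInstrumentor, RequestsInstrumentor) handles this for you; if a call path is not covered, you can propagate by hand with the global propagator. A sketch, with the service URL and route names chosen for illustration:

from opentelemetry import propagate, trace
import requests
from flask import Flask, request

tracer = trace.get_tracer(__name__)
app = Flask(__name__)

# Service A: inject the current trace context into the outgoing headers
def call_service_b(payload):
    headers = {}
    propagate.inject(headers)  # writes traceparent (and baggage) headers
    return requests.post("http://service-b:8080/work", json=payload, headers=headers)

# Service B: extract the context from the incoming headers and parent the span to it
@app.route("/work", methods=["POST"])
def work():
    ctx = propagate.extract(request.headers)
    with tracer.start_as_current_span("work", context=ctx):
        return {"status": "done"}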

Mistake 2: High Sampling Rate in Production

❌ WRONG: Sample 100% of traces
1M requests/day = 1M traces stored
Storage cost: high, retention: short

✅ CORRECT: Sample deliberately
Sample 100% in dev, ~10% in prod (head-based)
Keep 100% of error and slow traces (tail-based, in the collector)
Retain 30 days

Mistake 3: No Baggage for Context

❌ WRONG: No customer context in traces
Can't correlate user actions
Fraud detection blind

✅ CORRECT: Use baggage
Set baggage: customer_id, region, tier
Propagate to all downstream services
Available in logs and metrics
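
On the receiving side, any service can read the propagated baggage and attach it to its own spans or logs (assuming a baggage propagator such as W3C Baggage is configured, as in the setup above). A sketch; reserve_stock is an illustrative function:

from opentelemetry import baggage, trace

tracer = trace.get_tracer(__name__)

def reserve_stock(order_id):
    with tracer.start_as_current_span("reserve_stock") as span:
        # Baggage set upstream (e.g. customer_tier, region) is readable here
        tier = baggage.get_baggage("customer_tier")
        if tier is not None:
            span.set_attribute("customer_tier", tier)
        # ... reserve inventory ...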

Production Considerations

Tracing Infrastructure

  • Jaeger: Open source, self-hosted. Good for on-prem.
  • Zipkin: Open source, simpler than Jaeger.
  • Cloud providers: AWS X-Ray, GCP Cloud Trace, Azure Application Insights.
  • SaaS: Datadog, New Relic, Lightstep. Most backends accept OTLP, natively or via the OpenTelemetry Collector (see the sketch below).
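
Because most backends speak OTLP, switching backends is often just an exporter change on the application side. A sketch, assuming the opentelemetry-exporter-otlp-proto-grpc package and a collector or backend listening on localhost:4317 (both are assumptions for this example):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# Send spans to a local OpenTelemetry Collector (or any backend that accepts OTLP)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)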

Sampling Strategy

Development: 100% sampling (complete visibility).
Production: adaptive sampling (see the sketch after this list):

  • All errors (100%)
  • All slow requests (p95+)
  • All requests from specific users
  • Random 1-5% otherwise
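
Errors and slow requests can only be identified after a trace completes, so those rules are typically implemented as tail-based sampling in a collector. What the application can decide at the head is which traffic to force-sample; a sketch of a custom SDK sampler, where app.debug_trace is an illustrative attribute the caller sets on root spans for flagged users:

from opentelemetry.sdk.trace.sampling import (
    Decision,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

class DebugAwareSampler(Sampler):
    """Always sample spans flagged by the caller; ratio-sample everything else."""

    def __init__(self, ratio=0.05):
        self._fallback = TraceIdRatioBased(ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        # "app.debug_trace" is a hypothetical attribute set on root spans
        # for flagged users or support sessions.
        if attributes and attributes.get("app.debug_trace"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self):
        return "DebugAwareSampler"

Wrap it as ParentBased(root=DebugAwareSampler()) when passing it to TracerProvider so downstream services follow the root span's decision.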

Retention and Storage

  • Development: 7 days
  • Staging: 14 days
  • Production: 30 days
  • Archive old traces to cold storage

Correlating with Logs and Metrics

Trace ID in logs:

{
  "level": "info",
  "message": "processing checkout",
  "trace_id": "abc123",
  "order_id": "order-456",
  "timestamp": "2024-01-01T12:00:00Z"
}

To jump from a log line to its trace, click the trace ID in the log viewer to open the trace in Jaeger.
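
Getting the trace ID into the logs in the first place can be done with a standard logging filter; a minimal sketch (field names mirror the JSON example above; the opentelemetry-instrumentation-logging package can also inject trace IDs automatically):

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        span_context = trace.get_current_span().get_span_context()
        record.trace_id = (
            format(span_context.trace_id, "032x") if span_context.is_valid else "-"
        )
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", "trace_id": "%(trace_id)s"}'
))
logging.getLogger().addHandler(handler)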

Self-Check

  • What's a trace vs. a span?
  • How does trace context propagate?
  • What's the difference between head-based and tail-based sampling?
  • When should you create manual spans?
  • How do you correlate logs with traces?

Design Review Checklist

  • Auto-instrumentation enabled (HTTP, DB, cache)?
  • Context propagation configured (W3C Trace Context)?
  • Manual spans for business logic?
  • Error handling recorded in spans?
  • Baggage for critical context (user_id, region)?
  • Sampling strategy defined (head vs tail)?
  • Tracing backend configured (Jaeger, cloud)?
  • Storage and retention policy set?
  • Trace correlation with logs/metrics?
  • Performance overhead acceptable?
  • PII filtered from traces?
  • Runbooks for slow/error traces?

Next Steps

  1. Install OpenTelemetry libraries
  2. Configure auto-instrumentation
  3. Setup tracing backend
  4. Add manual spans for business logic
  5. Configure sampling strategy
  6. Create dashboards and alerts
  7. Document runbook for investigating traces
