
Observability (Cross-Reference)

Deep visibility into system behavior through metrics, logs, and traces. Full coverage in Section 12: Observability and Operations.

TL;DR

Observability is the ability to understand system behavior from external signals without internal code knowledge. It answers: "What's happening in production right now?" Built from three pillars: Metrics (quantitative, time-series data), Logs (textual event records), and Traces (request paths across services). Together, these provide visibility for incident diagnosis, performance optimization, and capacity planning. Strong observability is non-negotiable for all quality attributes: reliability, performance, security, and maintainability.

Why Observability Matters

Traditional monitoring asks: "Is the system up?" Modern observability asks: "Why did this happen? Where exactly is the problem? How do we fix it?"

  • Without observability: System is slow. You have no idea why. Blind debugging.
  • With observability: System is slow. Dashboard shows: 95% of latency is in database queries. One specific query is running slowly. You can fix it in minutes.

The Three Pillars

Metrics are numerical measurements over time. They are the first pillar; logs and traces, the other two, appear in the code examples further down and are covered fully in Section 12.

Examples:

  • CPU usage: 45% (now), 42% (1 min ago), 38% (2 min ago)
  • Request latency: p50=50ms, p95=200ms, p99=1000ms (see the sketch after this list)
  • Error rate: 2 errors per minute
  • Disk free: 150GB
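
To make the latency percentiles above concrete, here is a tiny sketch that computes them from raw samples with numpy (an assumption; in practice the monitoring backend derives them from histogram buckets):

import numpy as np

# Fake latency samples in milliseconds; real values come from instrumentation.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.8, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# Half of requests finish under p50; 1 in 20 is slower than p95;
# 1 in 100 is slower than p99 -- the tail your unhappiest users see.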

Tools:

  • Prometheus (open-source, pull-based)
  • Datadog (commercial, push-based)
  • CloudWatch (AWS native)
  • Grafana (visualization)

Key queries (answered concretely in the sketch after this list):

  • "What's the error rate over the last hour?"
  • "Is CPU usage trending up or down?"
  • "Which service has the highest latency?"

Observability in Practice

Motivating Scenario

Without observability:

  • Alert: "API error rate is 5%"
  • Engineer: "What's causing it? I don't know. Let's restart the service?"
  • The service restarts and the error rate drops. Lucky this time, but the root cause is still unknown and will come back.

With observability:

  • Alert: "API error rate is 5%"
  • Engineer: "Let me check the dashboard."
  • Logs show: "Database connection pool exhausted"
  • Metrics confirm: DB CPU is 99%, connections: 150/100 (over limit)
  • Traces show: Checkout service has 50+ slow requests, each holding 1 connection for 30+ seconds
  • Root cause: Stripe API is slow, checkout service waits, connections starve
  • Fix: Increase the connection pool to 200 and add a timeout to the Stripe calls (sketched below)
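
A minimal sketch of that fix, assuming SQLAlchemy for the connection pool and a plain requests call to the payment provider (both stand-ins; the incident's actual stack isn't specified here):

import requests
from sqlalchemy import create_engine

# Fix 1: raise the pool ceiling (it was exhausted at 100) and fail fast
# instead of letting callers queue forever waiting for a connection.
engine = create_engine(
    "postgresql://app:secret@db/orders",  # illustrative DSN
    pool_size=200,
    max_overflow=0,
    pool_timeout=5,        # seconds to wait for a free connection
)

# Fix 2: bound how long a checkout request can hold its DB connection
# while waiting on the external payment API.
def charge_card(payload):
    resp = requests.post(
        "https://api.stripe.com/v1/charges",  # illustrative; auth omitted
        data=payload,
        timeout=10,        # give up instead of starving the pool
    )
    resp.raise_for_status()
    return resp.json()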

Patterns & Pitfalls

  • Flying blind: The system goes down and you have no data about what happened. You guess and hope. The post-mortem says: "We should have monitored that."
  • Metrics without logs: You know the error rate is 5%, but not why. Logs would tell you the database is down, but you aren't collecting them.
  • Scattered logs: Logs exist, but they are spread across 20 servers and impossible to search. You might as well not have them.
  • Correlated signals: The dashboard shows the error rate is up. Click through to the logs for that error, see the stack trace, then click through to the traces for the affected requests. The root cause is obvious.
  • Actionable alerts: "Error rate > 5% for 5 min", "Latency p95 > 500ms", "Database CPU > 80%". Each alert points at something you can act on.
  • Alert fatigue: Hundreds of alerts, most of them false positives. The team starts ignoring alerts, and the real problem goes unnoticed.
  • Instrumentation by default: An engineer adds custom metrics and trace instrumentation to every new feature and can measure its impact immediately after deploy.

Design Review Checklist

  • Are key business metrics monitored (requests/sec, conversion rate, customer count)?
  • Are key technical metrics monitored (latency p50/p95/p99, error rate, CPU, memory, disk)?
  • Are logs centralized and searchable (not scattered across servers)?
  • Do application logs include request IDs for tracing?
  • Are distributed traces instrumented (all major services)?
  • Do dashboards exist for on-call engineers (quick triage)?
  • Are alerts configured for critical thresholds?
  • Is alert fatigue addressed (low false positive rate)?
  • Can you correlate metrics, logs, and traces (e.g., error in logs → affected requests in traces)? See the sketch after this checklist.
  • Is observability cost understood and budgeted?
  • Are retention policies set (e.g., metrics: 15 days, logs: 30 days, traces: 7 days)?
  • Do new features include observability instrumentation?
  • Is observability testing part of CI/CD (verify telemetry)?
  • Can you answer 'What went wrong?' within 5 minutes of alert?
  • Is observability integrated into on-call runbook?
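
One common way to make that metric/log/trace correlation possible is to stamp every log record with the active trace ID. A sketch using the OpenTelemetry API (field and format choices are assumptions; match them to whatever your log backend expects):

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the current OpenTelemetry trace ID to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id == 0 means there is no active trace
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)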

Implementing Observability in Code

Adding Metrics

from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)

# Counter: how many requests, broken down by endpoint, method, and status.
request_count = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']
)

# Histogram: latency distribution per endpoint (buckets feed p50/p95/p99 queries).
request_duration = Histogram(
    'api_request_duration_seconds',
    'API request duration',
    ['endpoint']
)

@app.route('/orders/<order_id>')
@request_duration.labels(endpoint='/orders/{order_id}').time()
def get_order(order_id):
    try:
        order = db.get_order(order_id)  # db: the existing data-access layer
        request_count.labels(
            endpoint='/orders/{order_id}',
            method='GET',
            status=200
        ).inc()
        return order
    except Exception:
        request_count.labels(
            endpoint='/orders/{order_id}',
            method='GET',
            status=500
        ).inc()
        raise
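
For Prometheus to scrape these counters and histograms, the process also has to expose them. The simplest option in prometheus_client is a side-port HTTP server (the port number here is an arbitrary choice):

from prometheus_client import start_http_server

start_http_server(8000)  # serves http://<host>:8000/metrics alongside the app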

Adding Logs with Structure

import logging
from uuid import uuid4

logger = logging.getLogger(__name__)

def create_order(order_data):
    # One request_id per call lets you find every log line for this order flow.
    request_id = str(uuid4())

    logger.info("Order creation started", extra={
        'request_id': request_id,
        'customer_id': order_data['customer_id'],
        'total_cents': order_data['total_cents']
    })

    try:
        order = order_service.create(order_data)

        logger.info("Order created successfully", extra={
            'request_id': request_id,
            'order_id': order['id'],
            'total_cents': order['total_cents']
        })

        return order
    except PaymentFailedError as e:
        logger.error("Order creation failed: payment declined", extra={
            'request_id': request_id,
            'error': str(e),
            'error_code': e.code
        })
        raise
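
For the extra={...} fields above to come out as searchable structure rather than being dropped by the default formatter, the handler needs a JSON formatter. A sketch using the python-json-logger package (one option among many):

import logging
from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)
# Every log call now emits one JSON object per line, including the extra fields.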

Adding Traces

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def process_order(order_id):
    # Parent span for the whole operation; child spans mark each step.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_order"):
            order = await get_order(order_id)

        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order.items)

        with tracer.start_as_current_span("process_payment"):
            await payment_service.charge(order.total)

        with tracer.start_as_current_span("send_confirmation"):
            await email_service.send(order.customer_email)

        return order

# Trace shows: fetch (10ms) + validate (50ms) + payment (200ms) + email (40ms) = 300ms
# You can see which spans are slow and how operations nest
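
The snippet above only uses the OpenTelemetry API; something still has to collect and export the spans. A minimal sketch of the SDK wiring with a console exporter (swap in a Jaeger or OTLP exporter to send spans to a real backend):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)  # do this once, at process startup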

Anti-Patterns to Avoid

Anti-Pattern: No Context in Logs

# BAD: Can't correlate logs across services
logger.error("Database connection failed")

# GOOD: Include context
logger.error("Database connection failed", extra={
    'user_id': user_id,
    'operation': 'create_order',
    'request_id': request_id
})

Anti-Pattern: Metrics Without Context

# BAD: Just a number, no insight
counter.inc()

# GOOD: Labeled metrics enable drill-down
counter.labels(
    service='order-service',
    operation='create_order',
    status='success'
).inc()

Anti-Pattern: No Alerting

# BAD: Metrics are collected but no one looks at them
# The system is slow; no one notices for hours

# GOOD: Alert-driven observability (illustrative pseudocode)
if metrics.error_rate > 0.05:    # 5% error rate
    send_alert("Order service error rate high")

if metrics.latency_p95 > 500:    # 500ms latency
    send_alert("Order service slow (p95 > 500ms)")

Building Observability-Driven Culture

  1. Expose metrics in dashboards: Team sees P95 latency, error rate, SLOs
  2. Alert on anomalies: Alert when behavior changes, not just thresholds
  3. Incident retrospectives: Every incident leads to better observability
  4. Observability as requirement: New features require metrics, logs, traces
  5. On-call playbooks: Runbooks for common alerts, linking to observability tools

Self-Check

  1. Right now, if a customer reports an issue, can you diagnose it within 5 minutes using your observability tools?
  2. Do you know your P95 latency? Your error rate? Your top 5 slowest endpoints?
  3. Can you trace a single user's request through all services?
  4. Do you have a single pane of glass dashboard for on-call engineers?
  5. If a deployment goes wrong, can you see it immediately (canary metrics, golden signals)?

If you answer "no" to any, observability gaps exist.

One Takeaway: Observability is the foundation of all quality attributes. You can't maintain reliability, performance, or security without seeing what your system is doing. Invest early: metrics, logs, traces from day one. The cost of observability infrastructure is trivial compared to the cost of debugging production issues blind.

Key Differences: Observability vs. Monitoring

Aspect           | Monitoring            | Observability
-----------------|-----------------------|---------------------------------
Scope            | Predefined metrics    | Unknown unknowns
Data             | Metrics only          | Metrics + logs + traces
Question         | "Is it up?"           | "Why did it fail?"
Dashboards       | Static, pre-built     | Dynamic, interactive drill-down
Alerting         | Threshold-based       | Anomaly-based, intelligent
Debugging        | Slow (hunt for info)  | Fast (all signals in one place)
Cost             | Lower                 | Higher (more data collected)
Time to diagnose | Hours                 | Minutes

Real-World Example: E-commerce Checkout Observability

Metrics:
checkout_requests_total (counter) - total requests
checkout_latency_seconds (histogram) - latency distribution
checkout_errors_total (counter) - errors by type
cart_size (gauge) - items in cart
payment_processing_time (histogram) - payment duration

Logs (structured):
User initiates checkout (user_id, cart_size, total)
→ Inventory check starts (request_id, item_count)
→ Inventory check complete (duration, items_available)
→ Payment processing starts (gateway, amount)
→ Payment processing complete (status, auth_code, duration)
→ Checkout complete (order_id, total_time)

Traces (distributed):
checkout (total: 300ms)
├─ inventory_check (50ms)
├─ payment_processing (200ms)
├─ order_persistence (40ms)
└─ email_notification (async)

Alerts:
- Checkout latency > 1000ms
- Checkout error rate > 2%
- Payment processing latency > 5s
- Inventory service unavailable
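
A sketch of how the metrics listed above might be declared with prometheus_client (the label names are assumptions):

from prometheus_client import Counter, Gauge, Histogram

checkout_requests_total = Counter(
    "checkout_requests_total", "Total checkout requests", ["status"])
checkout_latency_seconds = Histogram(
    "checkout_latency_seconds", "End-to-end checkout latency in seconds")
checkout_errors_total = Counter(
    "checkout_errors_total", "Checkout errors by type", ["error_type"])
cart_size = Gauge(
    "cart_size", "Items in the cart at checkout")
payment_processing_time = Histogram(
    "payment_processing_time", "Payment gateway call duration in seconds")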

Next Steps

  1. Choose a platform — Prometheus/ELK (open-source) or Datadog/New Relic (commercial)
  2. Instrument key services — Add metrics, logs, traces
  3. Build dashboards — For on-call engineers, product team
  4. Configure alerts — Based on SLOs (latency, error rate, availability)
  5. Create runbooks — Alert → diagnosis steps → fix
  6. Train team — How to use observability tools
  7. Iterate — Add more signals, improve dashboards, reduce alert fatigue

References

  1. Observability Engineering (O'Reilly) ↗️
  2. Prometheus: Metrics ↗️
  3. Elasticsearch: Logs ↗️
  4. Jaeger: Distributed Tracing ↗️
  5. Datadog: Unified Observability ↗️
  6. Google Cloud: Observability Best Practices ↗️
  7. Honeycomb: Observability for Engineers ↗️

Full Coverage: See Observability and Operations in Section 12 for comprehensive details on building and maintaining observable systems.