
Observability (Cross-Reference)

Deep visibility into system behavior through metrics, logs, and traces. Full coverage in Section 12: Observability and Operations.

TL;DR

Observability is the ability to understand system behavior from external signals without internal code knowledge. It answers: "What's happening in production right now?" Built from three pillars: Metrics (quantitative, time-series data), Logs (textual event records), and Traces (request paths across services). Together, these provide visibility for incident diagnosis, performance optimization, and capacity planning. Strong observability is non-negotiable for all quality attributes: reliability, performance, security, and maintainability.

Why Observability Matters

Traditional monitoring asks: "Is the system up?" Modern observability asks: "Why did this happen? Where exactly is the problem? How do we fix it?"

  • Without observability: System is slow. You have no idea why. Blind debugging.
  • With observability: System is slow. Dashboard shows: 95% of latency is in database queries. One specific query is running slowly. You can fix it in minutes.

The Three Pillars

Metrics are numerical measurements over time. They are the first pillar; logs and traces, the other two, appear in the code examples further down and are covered fully in Section 12.

Examples:

  • CPU usage: 45% (now), 42% (1 min ago), 38% (2 min ago)
  • Request latency: p50=50ms, p95=200ms, p99=1000ms (see the sketch after this list)
  • Error rate: 2 errors per minute
  • Disk free: 150GB
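
To make the latency percentiles above concrete, here is a tiny sketch that computes them from raw samples with numpy (an assumption; in practice the monitoring backend derives them from histogram buckets):

import numpy as np

# Fake latency samples in milliseconds; real values come from instrumentation.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.8, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# Half of requests finish under p50; 1 in 20 is slower than p95;
# 1 in 100 is slower than p99 -- the tail your unhappiest users see.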

Tools:

  • Prometheus (open-source, pull-based)
  • Datadog (commercial, push-based)
  • CloudWatch (AWS native)
  • Grafana (visualization)

Key queries (answered concretely in the sketch after this list):

  • "What's the error rate over the last hour?"
  • "Is CPU usage trending up or down?"
  • "Which service has the highest latency?"

Observability in Practice

Motivating Scenario

Without observability:

  • Alert: "API error rate is 5%"
  • Engineer: "What's causing it? I don't know. Let's restart the service?"
  • The service restarts and the error rate drops. Lucky this time, but the root cause is still unknown and will come back.

With observability:

  • Alert: "API error rate is 5%"
  • Engineer: "Let me check the dashboard."
  • Logs show: "Database connection pool exhausted"
  • Metrics confirm: DB CPU is 99%, connections: 150/100 (over limit)
  • Traces show: Checkout service has 50+ slow requests, each holding 1 connection for 30+ seconds
  • Root cause: Stripe API is slow, checkout service waits, connections starve
  • Fix: Increase the connection pool to 200 and add a timeout to the Stripe calls (sketched below)
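
A minimal sketch of that fix, assuming SQLAlchemy for the connection pool and a plain requests call to the payment provider (both stand-ins; the incident's actual stack isn't specified here):

import requests
from sqlalchemy import create_engine

# Fix 1: raise the pool ceiling (it was exhausted at 100) and fail fast
# instead of letting callers queue forever waiting for a connection.
engine = create_engine(
    "postgresql://app:secret@db/orders",  # illustrative DSN
    pool_size=200,
    max_overflow=0,
    pool_timeout=5,        # seconds to wait for a free connection
)

# Fix 2: bound how long a checkout request can hold its DB connection
# while waiting on the external payment API.
def charge_card(payload):
    resp = requests.post(
        "https://api.stripe.com/v1/charges",  # illustrative; auth omitted
        data=payload,
        timeout=10,        # give up instead of starving the pool
    )
    resp.raise_for_status()
    return resp.json()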

Patterns & Pitfalls

  • Flying blind: The system goes down and you have no data about what happened. You guess and hope. The post-mortem says: "We should have monitored that."
  • Metrics without logs: You know the error rate is 5%, but not why. Logs would tell you the database is down, but you aren't collecting them.
  • Scattered logs: Logs exist, but they are spread across 20 servers and impossible to search. You might as well not have them.
  • Correlated signals: The dashboard shows the error rate is up. Click through to the logs for that error, see the stack trace, then click through to the traces for the affected requests. The root cause is obvious.
  • Actionable alerts: "Error rate > 5% for 5 min", "Latency p95 > 500ms", "Database CPU > 80%". Each alert points at something you can act on.
  • Alert fatigue: Hundreds of alerts, most of them false positives. The team starts ignoring alerts, and the real problem goes unnoticed.
  • Instrumentation by default: An engineer adds custom metrics and trace instrumentation to every new feature and can measure its impact immediately after deploy.

Design Review Checklist

  • Are key business metrics monitored (requests/sec, conversion rate, customer count)?
  • Are key technical metrics monitored (latency p50/p95/p99, error rate, CPU, memory, disk)?
  • Are logs centralized and searchable (not scattered across servers)?
  • Do application logs include request IDs for tracing?
  • Are distributed traces instrumented (all major services)?
  • Do dashboards exist for on-call engineers (quick triage)?
  • Are alerts configured for critical thresholds?
  • Is alert fatigue addressed (low false positive rate)?
  • Can you correlate metrics, logs, and traces (e.g., error in logs → affected requests in traces)? See the sketch after this checklist.
  • Is observability cost understood and budgeted?
  • Are retention policies set (e.g., metrics: 15 days, logs: 30 days, traces: 7 days)?
  • Do new features include observability instrumentation?
  • Is observability testing part of CI/CD (verify telemetry)?
  • Can you answer 'What went wrong?' within 5 minutes of alert?
  • Is observability integrated into on-call runbook?
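
One common way to make that metric/log/trace correlation possible is to stamp every log record with the active trace ID. A sketch using the OpenTelemetry API (field and format choices are assumptions; match them to whatever your log backend expects):

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the current OpenTelemetry trace ID to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # trace_id == 0 means there is no active trace
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)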

Implementing Observability in Code

Adding Metrics

from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)

# Counter: how many requests, broken down by endpoint, method, and status.
request_count = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']
)

# Histogram: latency distribution per endpoint (buckets feed p50/p95/p99 queries).
request_duration = Histogram(
    'api_request_duration_seconds',
    'API request duration',
    ['endpoint']
)

@app.route('/orders/<order_id>')
@request_duration.labels(endpoint='/orders/{order_id}').time()
def get_order(order_id):
    try:
        order = db.get_order(order_id)  # db: the existing data-access layer
        request_count.labels(
            endpoint='/orders/{order_id}',
            method='GET',
            status=200
        ).inc()
        return order
    except Exception:
        request_count.labels(
            endpoint='/orders/{order_id}',
            method='GET',
            status=500
        ).inc()
        raise
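
For Prometheus to scrape these counters and histograms, the process also has to expose them. The simplest option in prometheus_client is a side-port HTTP server (the port number here is an arbitrary choice):

from prometheus_client import start_http_server

start_http_server(8000)  # serves http://<host>:8000/metrics alongside the app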

Adding Logs with Structure

import logging
from uuid import uuid4

logger = logging.getLogger(__name__)

def create_order(order_data):
    # One request_id per call lets you find every log line for this order flow.
    request_id = str(uuid4())

    logger.info("Order creation started", extra={
        'request_id': request_id,
        'customer_id': order_data['customer_id'],
        'total_cents': order_data['total_cents']
    })

    try:
        order = order_service.create(order_data)

        logger.info("Order created successfully", extra={
            'request_id': request_id,
            'order_id': order['id'],
            'total_cents': order['total_cents']
        })

        return order
    except PaymentFailedError as e:
        logger.error("Order creation failed: payment declined", extra={
            'request_id': request_id,
            'error': str(e),
            'error_code': e.code
        })
        raise
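
For the extra={...} fields above to come out as searchable structure rather than being dropped by the default formatter, the handler needs a JSON formatter. A sketch using the python-json-logger package (one option among many):

import logging
from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)
# Every log call now emits one JSON object per line, including the extra fields.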

Adding Traces

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def process_order(order_id):
    # Parent span for the whole operation; child spans mark each step.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_order"):
            order = await get_order(order_id)

        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order.items)

        with tracer.start_as_current_span("process_payment"):
            await payment_service.charge(order.total)

        with tracer.start_as_current_span("send_confirmation"):
            await email_service.send(order.customer_email)

        return order

# Trace shows: fetch (10ms) + validate (50ms) + payment (200ms) + email (40ms) = 300ms
# You can see which spans are slow and how operations nest
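
The snippet above only uses the OpenTelemetry API; something still has to collect and export the spans. A minimal sketch of the SDK wiring with a console exporter (swap in a Jaeger or OTLP exporter to send spans to a real backend):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)  # do this once, at process startup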

Anti-Patterns to Avoid

Anti-Pattern: No Context in Logs

# BAD: Can't correlate logs across services
logger.error("Database connection failed")

# GOOD: Include context
logger.error("Database connection failed", extra={
    'user_id': user_id,
    'operation': 'create_order',
    'request_id': request_id
})

Anti-Pattern: Metrics Without Context

# BAD: Just a number, no insight
counter.inc()

# GOOD: Labeled metrics enable drill-down
counter.labels(
    service='order-service',
    operation='create_order',
    status='success'
).inc()

Anti-Pattern: No Alerting

# BAD: Metrics are collected but no one looks at them
# The system is slow; no one notices for hours

# GOOD: Alert-driven observability (illustrative pseudocode)
if metrics.error_rate > 0.05:    # 5% error rate
    send_alert("Order service error rate high")

if metrics.latency_p95 > 500:    # 500ms latency
    send_alert("Order service slow (p95 > 500ms)")

Building Observability-Driven Culture

  1. Expose metrics in dashboards: Team sees P95 latency, error rate, SLOs
  2. Alert on anomalies: Alert when behavior changes, not just thresholds
  3. Incident retrospectives: Every incident leads to better observability
  4. Observability as requirement: New features require metrics, logs, traces
  5. On-call playbooks: Runbooks for common alerts, linking to observability tools

Self-Check

  1. Right now, if a customer reports an issue, can you diagnose it within 5 minutes using your observability tools?
  2. Do you know your P95 latency? Your error rate? Your top 5 slowest endpoints?
  3. Can you trace a single user's request through all services?
  4. Do you have a single pane of glass dashboard for on-call engineers?
  5. If a deployment goes wrong, can you see it immediately (canary metrics, golden signals)?

If you answer "no" to any, observability gaps exist.

One Takeaway: Observability is the foundation of all quality attributes. You can't maintain reliability, performance, or security without seeing what your system is doing. Invest early: metrics, logs, traces from day one. The cost of observability infrastructure is trivial compared to the cost of debugging production issues blind.

Key Differences: Observability vs. Monitoring

Aspect           | Monitoring            | Observability
-----------------|-----------------------|---------------------------------
Scope            | Predefined metrics    | Unknown unknowns
Data             | Metrics only          | Metrics + logs + traces
Question         | "Is it up?"           | "Why did it fail?"
Dashboards       | Static, pre-built     | Dynamic, interactive drill-down
Alerting         | Threshold-based       | Anomaly-based, intelligent
Debugging        | Slow (hunt for info)  | Fast (all signals in one place)
Cost             | Lower                 | Higher (more data collected)
Time to diagnose | Hours                 | Minutes

Real-World Example: E-commerce Checkout Observability

Metrics:
checkout_requests_total (counter) - total requests
checkout_latency_seconds (histogram) - latency distribution
checkout_errors_total (counter) - errors by type
cart_size (gauge) - items in cart
payment_processing_time (histogram) - payment duration

Logs (structured):
User initiates checkout (user_id, cart_size, total)
→ Inventory check starts (request_id, item_count)
→ Inventory check complete (duration, items_available)
→ Payment processing starts (gateway, amount)
→ Payment processing complete (status, auth_code, duration)
→ Checkout complete (order_id, total_time)

Traces (distributed):
checkout (total: 300ms)
├─ inventory_check (50ms)
├─ payment_processing (200ms)
├─ order_persistence (40ms)
└─ email_notification (async)

Alerts:
- Checkout latency > 1000ms
- Checkout error rate > 2%
- Payment processing latency > 5s
- Inventory service unavailable
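
A sketch of how the metrics listed above might be declared with prometheus_client (the label names are assumptions):

from prometheus_client import Counter, Gauge, Histogram

checkout_requests_total = Counter(
    "checkout_requests_total", "Total checkout requests", ["status"])
checkout_latency_seconds = Histogram(
    "checkout_latency_seconds", "End-to-end checkout latency in seconds")
checkout_errors_total = Counter(
    "checkout_errors_total", "Checkout errors by type", ["error_type"])
cart_size = Gauge(
    "cart_size", "Items in the cart at checkout")
payment_processing_time = Histogram(
    "payment_processing_time", "Payment gateway call duration in seconds")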

Next Steps

  1. Choose a platform — Prometheus/ELK (open-source) or Datadog/New Relic (commercial)
  2. Instrument key services — Add metrics, logs, traces
  3. Build dashboards — For on-call engineers, product team
  4. Configure alerts — Based on SLOs (latency, error rate, availability)
  5. Create runbooks — Alert → diagnosis steps → fix
  6. Train team — How to use observability tools
  7. Iterate — Add more signals, improve dashboards, reduce alert fatigue

References

  1. Observability Engineering (O'Reilly) ↗️
  2. Prometheus: Metrics ↗️
  3. Elasticsearch: Logs ↗️
  4. Jaeger: Distributed Tracing ↗️
  5. Datadog: Unified Observability ↗️
  6. Google Cloud: Observability Best Practices ↗️
  7. Honeycomb: Observability for Engineers ↗️

Full Coverage: See Observability and Operations in Section 12 for comprehensive details on building and maintaining observable systems.