Log Levels and Governance

Use log levels strategically, enforce consistency across teams, and optimize storage costs while maintaining debuggability.

TL;DR

Five standard log levels: DEBUG (verbose development info), INFO (normal operations), WARN (recoverable problems), ERROR (failures), FATAL (system shutdown). In production, default to INFO—capture decisions, state changes, errors. DEBUG is expensive; enable selectively for troubleshooting. WARN for deprecated features and correctable errors. ERROR/FATAL for problems requiring attention. Establish team standards: what should be logged at each level, consistency in message format, required fields. Review and adjust levels based on actual production experience—one team's INFO is noise to another. Governance prevents log explosion while preserving debuggability. Use dynamic log level adjustment to enable DEBUG temporarily during incidents without redeploying.

Learning Objectives

  • Understand the standard log levels and their purposes
  • Apply log levels strategically based on severity and frequency
  • Establish logging governance across teams and services
  • Balance verbosity with storage costs and performance
  • Implement dynamic log level adjustment for debugging
  • Design runbooks to guide what to log during incidents

Motivating Scenario

Your system suddenly generates 10x the logs overnight. Storage costs spike. The log aggregator becomes slow. You investigate: a junior developer added logger.debug() calls at every function entry, now enabled in production. Meanwhile, other teams log aggressively at INFO level. Some services log passwords in error messages. There's no standard for what "ERROR" means—one service logs recoverable database glitches as errors; another only logs catastrophic failures. Governance would have prevented this: clear standards for log levels, code review for logging, and periodic audits of log volume.

Core Concepts

Log Level Hierarchy

DEBUG: Fine-grained information for developers. Function entry/exit, variable values, detailed state. Rarely enabled in production; expensive.

INFO: Normal operations and key events. Requests processed, feature flags toggled, services starting. The expected noise of a running system.

WARN: Recoverable problems. Retries, deprecated features, degraded modes, missing optional data. Investigate, but system continues.

ERROR: Failures. Crashed processes, failed requests, exceptions. Requires investigation and usually human attention.

FATAL: System-level failures requiring shutdown. Core dependency unavailable, critical validation failures. Logs before exit.
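
As a minimal sketch of how this hierarchy behaves in Python's standard `logging` module: a logger set to INFO drops DEBUG records and passes everything above. Python names the top level CRITICAL and keeps `logging.FATAL` as an alias for it. The logger name `demo.levels` and the in-memory stream are assumptions made for illustration.

```python
import io
import logging

def emit_all_levels(threshold: int) -> list[str]:
    """Emit one record per level and return the level names that pass the threshold."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    logger = logging.getLogger("demo.levels")  # hypothetical logger name
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(threshold)

    logger.debug("cache lookup key=%s", "user:42")
    logger.info("order accepted")
    logger.warning("retrying upstream call")
    logger.error("request failed")
    logger.critical("cannot reach database, shutting down")  # FATAL in the list above

    return stream.getvalue().split()

print(emit_all_levels(logging.INFO))  # ['INFO', 'WARNING', 'ERROR', 'CRITICAL']
```

Setting the threshold to `logging.DEBUG` returns all five names, which is why a production default of INFO cuts volume without touching call sites.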

Governance Principles

Establish standards: which level for which scenarios, consistent message format, required fields. Document exceptions (why does service X log differently?). Review logging practices during code reviews. Audit log volume and adjust levels as needed. Use dynamic configuration to adjust levels without redeploying.
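
One way to get a consistent message format is to render every record through a single shared formatter. The sketch below assumes JSON output and a base key set of `timestamp`, `level`, `service`, and `message`; those key names, the `JsonFormatter` class, and the `governance-demo` logger are illustrative choices, not a prescribed standard.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of base keys."""

    # Attribute names every LogRecord carries; anything else came in via `extra=`.
    _STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Append caller-supplied fields (order_id, error_code, ...) after the base keys.
        for key, value in vars(record).items():
            if key not in self._STANDARD and key not in entry:
                entry[key] = value
        return json.dumps(entry, default=str)

stream = io.StringIO()  # in-memory sink so the example has no side effects
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("governance-demo")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("order.created", extra={"event_type": "order.created", "order_id": "A-17"})
print(stream.getvalue())
```

Because the format lives in one class, a team can change or extend the base keys in a single place instead of editing every log call.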

Storage and Cost

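Logs are expensive to store and query, and cost grows with both volume and retention. As a rough illustration, INFO logs might cost 1 cent per million entries retained for a year; real prices vary by vendor, retention period, and indexing. DEBUG logs can cost 10x more because they are both more frequent and more verbose. WARN and above should be rare: a healthy system has few warnings. ERROR logs are valuable; keep them long-term. The balance: log enough to troubleshoot, but not so much that you can't find the signal.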

Practical Example

# ❌ POOR - No governance, inconsistent levels, excessive logging
import logging

logger = logging.getLogger(__name__)

def process_order(order_id, items):
    # Too verbose
    logger.debug(f"Processing order {order_id}")
    logger.debug(f"Items: {items}")

    for item in items:
        logger.debug(f"Checking item {item}")
        if item['quantity'] < 0:
            logger.warning(f"Negative quantity: {item}")  # Should be ERROR
            return None

    result = charge_card(order_id)
    if not result:
        logger.debug("Card charge failed")  # Should be ERROR with details
        return None

    logger.debug("Order completed")  # Should be INFO

    return result

# Result: every order floods the logs with DEBUG lines, and real failures
# are buried at the wrong level.

# ✅ EXCELLENT - Governance with standards
import logging
from enum import Enum
from typing import Optional

class LogLevel(Enum):
    """Standard log levels with documented purposes."""
    DEBUG = "DEBUG"  # Function entry/exit, variable inspection
    INFO = "INFO"    # Key business events, state changes
    WARN = "WARN"    # Recoverable issues, retries
    ERROR = "ERROR"  # Failures requiring attention
    FATAL = "FATAL"  # System shutdown

class LogStandards:
    """Logging governance documentation."""

    RULES = {
        # Database operations
        'database.query': 'INFO - slow queries (>500ms): WARN',
        'database.error': 'ERROR - connection failures, deadlocks',
        'database.debug': 'DEBUG - only in development',

        # Business logic
        'order.created': 'INFO - new order with amount',
        'order.failed': 'ERROR - why it failed, user/order IDs',
        'order.retry': 'WARN - attempt N of M with backoff details',

        # Payments
        'payment.attempt': 'INFO - payment initiation',
        'payment.fraud_check': 'INFO - result and score',
        'payment.failed': 'ERROR - processor response and reason',

        # System
        'startup': 'INFO - version, config summary',
        'shutdown': 'INFO - graceful or error shutdown',
        'feature_flag': 'INFO - flag toggled with new value',
    }

    # Required fields by log level
    REQUIRED_FIELDS = {
        'ERROR': ['error_code', 'error_message', 'impact'],
        'WARN': ['issue', 'action_taken'],
        'INFO': ['event_type'],
    }

    @staticmethod
    def validate(level: str, fields: dict):
        """Validate log entry against standards."""
        required = LogStandards.REQUIRED_FIELDS.get(level, [])
        missing = [f for f in required if f not in fields]
        if missing:
            raise ValueError(f"Missing required fields for {level}: {missing}")

# Environment-based log level configuration
import os

LOG_CONFIG = {
    'local': logging.DEBUG,
    'staging': logging.INFO,
    'production': logging.INFO,
}

# Per-service overrides; these take precedence over the environment default
DYNAMIC_LOG_LEVELS = {
    'payment-service': logging.INFO,
    'fraud-service': logging.INFO,
    'worker-pool': logging.WARNING,  # Workers are noisy
}

def get_logger_for_service(service_name: str):
    """Get logger with governance applied."""
    logger = logging.getLogger(service_name)
    base_level = LOG_CONFIG.get(os.getenv('ENV', 'local'), logging.DEBUG)
    service_level = DYNAMIC_LOG_LEVELS.get(service_name, base_level)
    logger.setLevel(service_level)
    return logger

logger = get_logger_for_service('order-service')

def log_event(level: str, event_type: str, **fields):
    """Log with validation against governance standards."""
    # Set event_type before validating: it is a required field for INFO,
    # so validating first would reject every INFO event.
    fields['event_type'] = event_type
    try:
        LogStandards.validate(level, fields)
    except ValueError as e:
        logger.error(f"Logging validation failed: {e}")
        return

    if level == 'ERROR':
        logger.error(fields.pop('error_message', ''), extra=fields)
    elif level == 'WARN':
        logger.warning(fields.pop('issue', ''), extra=fields)
    else:
        logger.info(event_type, extra=fields)

def process_order(order_id: str, items: list) -> Optional[dict]:
    """Process order with proper log levels."""

    # INFO - order received
    log_event('INFO', 'order.created', order_id=order_id, item_count=len(items))

    # Validate items - ERROR if invalid
    for item in items:
        if item['quantity'] <= 0:
            log_event('ERROR', 'order.invalid_item',
                      error_code='INVALID_QUANTITY',
                      error_message=f"Item {item['id']} has qty {item['quantity']}",
                      impact='order_rejected',
                      item_id=item['id'],
                      order_id=order_id)
            return None

    # Try to charge - WARN on each retry, ERROR if the charge ultimately fails.
    # charge_card() is assumed to be defined elsewhere; the backoff sleep
    # between attempts is omitted here.
    for attempt in range(3):
        try:
            result = charge_card(order_id)
            if result['success']:
                log_event('INFO', 'order.completed',
                          order_id=order_id,
                          transaction_id=result['id'],
                          amount=result['amount'])
                return result

            if attempt < 2:
                # WARN entries must carry 'issue' and 'action_taken'
                log_event('WARN', 'order.retry',
                          issue=f"Card charge attempt {attempt + 1} failed",
                          action_taken='retrying',
                          attempt=attempt + 1,
                          total_attempts=3,
                          last_error=result.get('error'),
                          backoff_seconds=2 ** attempt)
        except Exception as e:
            # Exceptions are treated as non-retryable
            log_event('ERROR', 'order.failed',
                      error_code='PAYMENT_ERROR',
                      error_message=str(e),
                      impact='order_failed',
                      attempt=attempt + 1,
                      order_id=order_id)
            return None

    # All attempts were declined - ERROR
    log_event('ERROR', 'order.failed',
              error_code='PAYMENT_DECLINED',
              error_message='Card charge failed after 3 attempts',
              impact='order_failed',
              order_id=order_id)
    return None

# Dynamic log level adjustment
def set_debug_mode_for_service(service_name: str, enable: bool):
    """Enable DEBUG temporarily for troubleshooting."""
    if enable:
        logging.getLogger(service_name).setLevel(logging.DEBUG)
    else:
        # Restore the governed level instead of assuming INFO
        get_logger_for_service(service_name)
    log_event('INFO', 'debug_mode_changed',
              service=service_name,
              debug_enabled=enable,
              reason='incident_investigation')
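
A DEBUG switch that someone must remember to turn off tends to stay on. As a hedged sketch, the helper below raises a logger to DEBUG and uses `threading.Timer` to restore the previous level automatically. The function name `enable_debug_temporarily` is hypothetical, and how it gets called in a running process (an admin endpoint, a signal handler, a config watcher) is left as an assumption.

```python
import logging
import threading

def enable_debug_temporarily(service_name: str, seconds: float) -> threading.Timer:
    """Switch a logger to DEBUG and restore its previous level after `seconds`."""
    logger = logging.getLogger(service_name)
    previous_level = logger.level

    def restore() -> None:
        logger.setLevel(previous_level)
        logger.info("debug window closed, level restored")

    logger.setLevel(logging.DEBUG)
    timer = threading.Timer(seconds, restore)
    timer.daemon = True  # do not keep the process alive just for the revert
    timer.start()
    return timer

svc = logging.getLogger("payment-service")
svc.setLevel(logging.INFO)

timer = enable_debug_temporarily("payment-service", seconds=0.1)
print(logging.getLevelName(svc.level))  # DEBUG while the window is open
timer.join()
print(logging.getLevelName(svc.level))  # INFO again after the timer fires
```

Overlapping calls for the same logger would record DEBUG as the "previous" level, so a production version would need to track the governed level separately.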

Governance Policies

Log Level Standards by Category

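| Category       | DEBUG         | INFO              | WARN               | ERROR               |
|----------------|---------------|-------------------|--------------------|---------------------|
| API requests   | Yes, rarely   | Entry/exit only   | Slow >500ms        | Failed requests     |
| Database       | Queries       | Conn pool events  | Retries            | Connection failures |
| Business logic | State changes | Decisions         | Recoverable issues | Critical failures   |
| External APIs  | Never         | Success summaries | Rate limits        | Failures            |
| Security       | Never         | Auth events       | Unusual patterns   | Breach attempts     |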

Storage Cost Optimization

Estimate: Processing 100,000 orders/day
- DEBUG enabled: 50 logs/order = 5M logs/day = $1500/month storage
- INFO only: 5 logs/order = 500K logs/day = $150/month storage
- INFO + selective DEBUG: 10 logs/order = 1M logs/day = $300/month storage
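
The arithmetic behind these three lines can be reproduced with a small calculator. The rate of $300 per month for each million logs per day is not a vendor price; it is simply what the estimate above implies ($1500/month for 5M logs/day), and the function name is illustrative.

```python
# USD per month for each 1M logs/day, implied by the estimate above
# ($1500/month for 5M logs/day). An assumption, not a real price list.
COST_PER_MILLION_DAILY_LOGS = 300.0

def monthly_storage_cost(orders_per_day: int, logs_per_order: int) -> tuple[int, float]:
    """Return (logs per day, estimated storage cost in USD per month)."""
    logs_per_day = orders_per_day * logs_per_order
    cost = logs_per_day / 1_000_000 * COST_PER_MILLION_DAILY_LOGS
    return logs_per_day, cost

scenarios = [("DEBUG enabled", 50), ("INFO only", 5), ("INFO + selective DEBUG", 10)]
for label, logs_per_order in scenarios:
    logs, cost = monthly_storage_cost(100_000, logs_per_order)
    print(f"{label}: {logs:,} logs/day, ${cost:,.0f}/month")
```

Plugging in your own order volume and your vendor's actual rate turns the estimate into a budget line you can review alongside log level changes.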

Enable DEBUG selectively: by service, by user (debug headers), or during incidents.
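
Per-request DEBUG can be sketched with a `logging.Filter` and a context variable. The example assumes a hypothetical `X-Debug: 1` request header and a `handle_request` function standing in for real request handling; neither comes from a specific framework.

```python
import contextvars
import io
import logging

# Set per request, here from a hypothetical "X-Debug" header.
debug_requested = contextvars.ContextVar("debug_requested", default=False)

class SelectiveDebugFilter(logging.Filter):
    """Always pass INFO and above; pass DEBUG only for flagged requests."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno >= logging.INFO or debug_requested.get()

def handle_request(logger: logging.Logger, headers: dict) -> None:
    token = debug_requested.set(headers.get("X-Debug") == "1")
    try:
        logger.debug("cart contents loaded")
        logger.info("request handled")
    finally:
        debug_requested.reset(token)  # never leak the flag into the next request

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
handler.addFilter(SelectiveDebugFilter())

logger = logging.getLogger("checkout-demo")  # hypothetical service name
logger.setLevel(logging.DEBUG)  # records are created at DEBUG; the filter decides
logger.addHandler(handler)
logger.propagate = False

handle_request(logger, {})                # emits INFO only
handle_request(logger, {"X-Debug": "1"})  # emits DEBUG and INFO
print(stream.getvalue())
```

The trade-off is that DEBUG records are still constructed for every request and only discarded at the handler, so this saves storage but not all of the CPU cost.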

Design Review Checklist

  • Have you defined log levels for this service's key operations?
  • Are ERROR logs reserved for actual failures, not info disguised as errors?
  • Do your WARN logs indicate recoverable issues that need attention?
  • Is DEBUG logging disabled or very sparse in production?
  • Can you explain why a log is at its chosen level?
  • Are there governance standards documented for the team?
  • Is log volume monitored, with alerts for 10x spikes?
  • Can log levels be adjusted without redeploying?
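
The volume-monitoring item can be approximated with a rolling baseline. This is a minimal sketch, assuming log counts arrive once per interval (for example hourly) from whatever metrics source you already have; the class name, window size, and 10x factor are illustrative defaults.

```python
from collections import deque

class LogVolumeMonitor:
    """Flag an interval whose log count is at least `factor` times the recent average."""

    def __init__(self, window: int = 24, factor: float = 10.0):
        self.history = deque(maxlen=window)  # counts from recent normal intervals
        self.factor = factor

    def observe(self, count: int) -> bool:
        """Record one interval's log count; return True if it is a spike."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        is_spike = baseline is not None and baseline > 0 and count >= self.factor * baseline
        if not is_spike:
            self.history.append(count)  # keep spikes out of the baseline
        return is_spike

monitor = LogVolumeMonitor(window=3)
for count in [1000, 1100, 900, 12000, 1000]:
    print(count, monitor.observe(count))
```

Only the 12000 interval is flagged here, because the preceding three intervals average 1000 logs.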

Self-Check

  1. Review a day's logs from one of your services. What percentage are DEBUG, INFO, WARN, ERROR? Is the distribution healthy?

  2. Design a logging standard for a payment processing service that processes 1 million requests daily. How would you balance debuggability with storage costs?

  3. How would you implement temporary debug logging during a production incident without redeploying?
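
For the first question, a quick way to get the level distribution is to count level tokens in exported log lines. The sketch assumes plain-text lines that contain the level as a separate whitespace-delimited token; the sample lines and the `level_distribution` helper are made up for illustration.

```python
from collections import Counter

LEVELS = ("DEBUG", "INFO", "WARN", "ERROR", "FATAL")

def level_distribution(lines: list[str]) -> dict[str, float]:
    """Return the percentage of lines at each level, rounded to one decimal."""
    counts = Counter()
    for line in lines:
        # Take the first token that is a known level; skip lines without one.
        level = next((tok for tok in line.split() if tok in LEVELS), None)
        if level:
            counts[level] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {lvl: round(100 * counts[lvl] / total, 1) for lvl in LEVELS}

sample = [
    "2024-05-01T10:00:00Z INFO order.created order_id=A-17",
    "2024-05-01T10:00:01Z DEBUG cart contents loaded",
    "2024-05-01T10:00:02Z INFO order.completed order_id=A-17",
    "2024-05-01T10:00:03Z ERROR order.failed order_id=A-18",
]
print(level_distribution(sample))
```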

One Takeaway

Log levels are governance tools, not just severity labels. Establish standards: INFO for routine operations and key decisions, WARN for recoverable issues, ERROR for failures. Enforce these in code review. Monitor log volume and adjust dynamically. This prevents log explosion while keeping debuggability intact.
