Log Levels and Governance

Use log levels strategically, enforce consistency across teams, and optimize storage costs while maintaining debuggability.

TL;DR

There are five standard log levels: DEBUG (verbose development detail), INFO (normal operations), WARN (recoverable problems), ERROR (failures), and FATAL (unrecoverable failures that force shutdown). In production, default to INFO and capture decisions, state changes, and errors. DEBUG is expensive; enable it selectively for troubleshooting. Use WARN for deprecated features and correctable errors, and ERROR or FATAL for problems that require attention. Establish team standards covering what is logged at each level, a consistent message format, and required fields. Review and adjust levels based on actual production experience, since one team's INFO is another team's noise. Governance prevents log explosion while preserving debuggability. Use dynamic log level adjustment to enable DEBUG temporarily during incidents without redeploying.

Learning Objectives

Understand the standard log levels and their purposes
Apply log levels strategically based on severity and frequency
Establish logging governance across teams and services
Balance verbosity with storage costs and performance
Implement dynamic log level adjustment for debugging
Design runbooks to guide what to log during incidents

Motivating Scenario

Your system suddenly generates 10x the logs overnight. Storage costs spike. The log aggregator becomes slow. You investigate: a junior developer added logger.debug() calls at every function entry, now enabled in production. Meanwhile, other teams log aggressively at INFO level. Some services log passwords in error messages. There's no standard for what "ERROR" means—one service logs recoverable database glitches as errors; another only logs catastrophic failures. Governance would have prevented this: clear standards for log levels, code review for logging, and periodic audits of log volume.

Core Concepts

Log Level Hierarchy

DEBUG: Fine-grained information for developers. Function entry/exit, variable values, detailed state. Rarely enabled in production; expensive.

INFO: Normal operations and key events. Requests processed, feature flags toggled, services starting. The expected noise of a running system.

WARN: Recoverable problems. Retries, deprecated features, degraded modes, missing optional data. Investigate, but system continues.

ERROR: Failures. Crashed processes, failed requests, exceptions. Requires investigation and usually human attention.

FATAL: System-level failures requiring shutdown. Core dependency unavailable, critical validation failures. Log the cause before the process exits.
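
As a rough illustration of how the hierarchy acts as a threshold, the sketch below uses Python's standard `logging` module, where each level is an integer and a logger emits only records at or above its configured level. Python names the levels WARNING and CRITICAL rather than WARN and FATAL. The logger name `hierarchy-demo` and the helper `enabled_levels` are placeholders for this example.

```python
import logging

# Python's stdlib level names, lowest to highest severity.
# logging.FATAL is an alias for logging.CRITICAL (both equal 50).
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def enabled_levels(threshold: int) -> list:
    """Return the level names a logger set to `threshold` would emit."""
    demo = logging.getLogger("hierarchy-demo")  # placeholder logger name
    demo.setLevel(threshold)
    return [name for name in LEVELS if demo.isEnabledFor(getattr(logging, name))]


print(enabled_levels(logging.INFO))     # production default in this article
print(enabled_levels(logging.WARNING))  # a quieter setting for noisy services
```

Setting the threshold to INFO drops only DEBUG; raising it to WARNING also drops INFO, which is why the choice of level for each message matters as much as the threshold itself.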

Governance Principles

Establish standards: which level for which scenarios, consistent message format, required fields. Document exceptions (why does service X log differently?). Review logging practices during code reviews. Audit log volume and adjust levels as needed. Use dynamic configuration to adjust levels without redeploying.
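
One way to make "consistent message format, required fields" concrete is a shared formatter that every service installs. The sketch below is a minimal example, not a prescribed implementation: the class name `GovernedJsonFormatter`, the `CORE_FIELDS` tuple, the `"MISSING"` placeholder value, and the logger name are all assumptions made for illustration.

```python
import io
import json
import logging


class GovernedJsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same core keys."""

    # Hypothetical team standard: these keys appear on every entry.
    CORE_FIELDS = ("service", "event_type")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CORE_FIELDS:
            # Values passed via `extra=` become attributes on the record.
            entry[field] = getattr(record, field, "MISSING")
        return json.dumps(entry, sort_keys=True)


# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(GovernedJsonFormatter())

log = logging.getLogger("governance-demo")  # placeholder logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.info("order created", extra={"service": "order-service", "event_type": "order.created"})
log.info("cache warmed")  # no extra fields: flagged as MISSING, not dropped

print(stream.getvalue())
```

Flagging missing fields in the output, rather than raising, keeps a logging mistake from breaking the request path while still making the gap visible in an audit.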

Storage and Cost

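Logs are expensive to store and query, and the cost grows with both volume and retention. Per-entry prices vary widely by vendor, entry size, and retention period, so measure your own rather than relying on a rule of thumb. DEBUG usually costs the most because it produces far more entries and each entry is more verbose. WARN and above should be rare; a healthy system has few warnings. ERROR logs are valuable; keep them long-term. The balance: log enough to troubleshoot, but not so much that you can't find the signal.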
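
A quick way to check whether WARN and above really are rare is to measure the share of each level in a sample of logs. The sketch below assumes each line starts with its level name followed by whitespace; that line format, the sample data, and the function name `level_distribution` are assumptions for illustration, so adapt the parsing to your own format.

```python
from collections import Counter


def level_distribution(lines):
    """Return each level's share of all lines, as a percentage (1 decimal)."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=1)
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


# Made-up sample, assuming "<LEVEL> <message>" lines.
sample = [
    "INFO order.created",
    "INFO order.completed",
    "DEBUG item checked",
    "WARN order.retry",
    "INFO order.created",
    "ERROR order.failed",
    "INFO order.completed",
    "INFO feature_flag",
]

print(level_distribution(sample))
```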

Practical Example

Python
Go

# ❌ POOR - No governance, inconsistent levels, excessive logging
import logging

logger = logging.getLogger(__name__)

def process_order(order_id, items):
    # Too verbose
    logger.debug(f"Processing order {order_id}")
    logger.debug(f"Items: {items}")

    for item in items:
        logger.debug(f"Checking item {item}")
        if item['quantity'] < 0:
            logger.warn(f"Negative quantity: {item}")  # Should be ERROR; logger.warn is also deprecated in favor of logger.warning
            return None

    result = charge_card(order_id)
    if not result:
        logger.debug("Card charge failed")  # Should be ERROR with details
        return None

    logger.debug("Order completed")  # Should be INFO

    return result

# Result: DEBUG noise for every item on every order, while real failures are hidden at DEBUG or mislabeled as WARN.

# ✅ EXCELLENT - Governance with standards
import logging
from enum import Enum
from typing import Optional

class LogLevel(Enum):
    """Standard log levels with documented purposes."""
    DEBUG = "DEBUG"      # Function entry/exit, variable inspection
    INFO = "INFO"        # Key business events, state changes
    WARN = "WARN"        # Recoverable issues, retries
    ERROR = "ERROR"      # Failures requiring attention
    FATAL = "FATAL"      # System shutdown

class LogStandards:
    """Logging governance documentation."""

    RULES = {
        # Database operations
        'database.query': 'INFO - slow queries (>500ms): WARN',
        'database.error': 'ERROR - connection failures, deadlocks',
        'database.debug': 'DEBUG - only in development',

        # Business logic
        'order.created': 'INFO - new order with amount',
        'order.failed': 'ERROR - why it failed, user/order IDs',
        'order.retry': 'WARN - attempt N of M with backoff details',

        # Payments
        'payment.attempt': 'INFO - payment initiation',
        'payment.fraud_check': 'INFO - result and score',
        'payment.failed': 'ERROR - processor response and reason',

        # System
        'startup': 'INFO - version, config summary',
        'shutdown': 'INFO - graceful or error shutdown',
        'feature_flag': 'INFO - flag toggled with new value',
    }

    # Required fields by log level
    REQUIRED_FIELDS = {
        'ERROR': ['error_code', 'error_message', 'impact'],
        'WARN': ['issue', 'action_taken'],
        'INFO': ['event_type'],
    }

    @staticmethod
    def validate(level: str, fields: dict):
        """Validate log entry against standards."""
        required = LogStandards.REQUIRED_FIELDS.get(level, [])
        missing = [f for f in required if f not in fields]
        if missing:
            raise ValueError(f"Missing required fields for {level}: {missing}")

# Environment-based log level configuration
import os

LOG_CONFIG = {
    'local': logging.DEBUG,
    'staging': logging.INFO,
    'production': logging.INFO,
}

DYNAMIC_LOG_LEVELS = {
    'payment-service': logging.INFO,
    'fraud-service': logging.INFO,
    'worker-pool': logging.WARN,  # Workers are noisy
}

def get_logger_for_service(service_name: str):
    """Get logger with governance applied."""
    logger = logging.getLogger(service_name)
    base_level = LOG_CONFIG.get(os.getenv('ENV', 'local'), logging.DEBUG)
    service_level = DYNAMIC_LOG_LEVELS.get(service_name, base_level)
    logger.setLevel(service_level)
    return logger

logger = get_logger_for_service('order-service')

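def log_event(level: str, event_type: str, **fields):
    """Log with validation against governance standards."""
    # Set event_type before validating: it is a required field for INFO
    fields['event_type'] = event_type
    try:
        LogStandards.validate(level, fields)
    except ValueError as e:
        logger.error(f"Logging validation failed: {e}")
        return

    # Dispatch to the matching stdlib method; the remaining fields travel as `extra`
    if level == 'FATAL':
        logger.critical(event_type, extra=fields)
    elif level == 'ERROR':
        logger.error(fields.pop('error_message'), extra=fields)
    elif level == 'WARN':
        logger.warning(fields.pop('issue'), extra=fields)
    elif level == 'DEBUG':
        logger.debug(event_type, extra=fields)
    else:
        logger.info(event_type, extra=fields)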

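import time  # used for the retry backoff below

def process_order(order_id: str, items: list) -> Optional[dict]:
    """Process order with proper log levels."""

    # INFO - order received
    log_event('INFO', 'order.created', order_id=order_id, item_count=len(items))

    # Validate items - ERROR if invalid
    for item in items:
        if item['quantity'] <= 0:
            log_event('ERROR', 'order.invalid_item',
                     error_code='INVALID_QUANTITY',
                     error_message=f"Item {item['id']} has qty {item['quantity']}",
                     impact='order_rejected',
                     item_id=item['id'],
                     order_id=order_id)
            return None

    # Try to charge - WARN on retries, ERROR on failure
    max_attempts = 3
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = charge_card(order_id)
            if result['success']:
                log_event('INFO', 'order.completed',
                         order_id=order_id,
                         transaction_id=result['id'],
                         amount=result['amount'])
                return result

            last_error = result.get('error')
            if attempt < max_attempts - 1:
                backoff_seconds = 2 ** attempt
                # WARN entries must carry 'issue' and 'action_taken'
                log_event('WARN', 'order.retry',
                         issue='payment_failed',
                         action_taken='retry_scheduled',
                         order_id=order_id,
                         attempt=attempt + 1,
                         total_attempts=max_attempts,
                         last_error=last_error,
                         backoff_seconds=backoff_seconds)
                time.sleep(backoff_seconds)
        except Exception as e:
            log_event('ERROR', 'order.failed',
                     error_code='PAYMENT_ERROR',
                     error_message=str(e),
                     impact='order_failed',
                     attempt=attempt + 1,
                     order_id=order_id)
            return None

    # All attempts were declined: record the failure instead of returning silently
    log_event('ERROR', 'order.failed',
             error_code='PAYMENT_DECLINED',
             error_message=str(last_error),
             impact='order_failed',
             attempt=max_attempts,
             order_id=order_id)
    return None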

# Dynamic log level adjustment
def set_debug_mode_for_service(service_name: str, enable: bool):
    """Enable DEBUG temporarily for troubleshooting."""
    if enable:
        logging.getLogger(service_name).setLevel(logging.DEBUG)
    else:
        # Restore the governed level rather than assuming INFO
        # (worker-pool, for example, is configured at WARN)
        get_logger_for_service(service_name)
    log_event('INFO', 'debug_mode_changed',
             service=service_name,
             debug_enabled=enable,
             reason='incident_investigation')

// ❌ POOR - Inconsistent levels, no governance
package order

import (
    "fmt"
    "log"
)

func ProcessOrder(orderID string, items []Item) error {
    log.Printf("Processing order %s", orderID)
    log.Printf("Items: %v", items)

    for i, item := range items {
        log.Printf("Checking item %d: %v", i, item)
        if item.Quantity < 0 {
            log.Printf("WARN: Negative quantity")  // Just a log.Printf
            return fmt.Errorf("invalid quantity")
        }
    }

    err := chargeCard(orderID)
    if err != nil {
        log.Printf("Error: %v", err)  // No severity
        return err
    }

    log.Printf("Order completed")
    return nil
}

// ✅ EXCELLENT - Governed logging with standards
package order

import (
    "context"
    "fmt"
    "log/slog"
    "os"
    "time"
)

type LogLevel string

const (
    DEBUG LogLevel = "DEBUG"
    INFO  LogLevel = "INFO"
    WARN  LogLevel = "WARN"
    ERROR LogLevel = "ERROR"
    FATAL LogLevel = "FATAL"
)

type LogGovernance struct {
    level            LogLevel
    requiredFieldsBy map[LogLevel][]string
}

var governance = LogGovernance{
    level: INFO,
    requiredFieldsBy: map[LogLevel][]string{
        ERROR: {"error_code", "error_message", "impact"},
        WARN:  {"issue", "action_taken"},
        INFO:  {"event_type"},
    },
}

func getLogLevel(env string) slog.Level {
    switch env {
    case "production":
        return slog.LevelInfo
    case "staging":
        return slog.LevelInfo
    default:
        return slog.LevelDebug
    }
}

var logger *slog.Logger

func init() {
    env := os.Getenv("ENV")
    if env == "" {
        env = "local"
    }

    opts := &slog.HandlerOptions{
        Level: getLogLevel(env),
    }
    logger = slog.New(slog.NewJSONHandler(os.Stdout, opts))
}

type LogEntry struct {
    Timestamp time.Time
    Level     LogLevel
    EventType string
    Fields    map[string]interface{}
}

func logEvent(ctx context.Context, level LogLevel, eventType string, fields map[string]interface{}) {
    if fields == nil {
        fields = map[string]interface{}{}
    }

    // Set event_type before validating: it is a required field for INFO
    fields["event_type"] = eventType

    // Validate required fields
    for _, field := range governance.requiredFieldsBy[level] {
        if _, ok := fields[field]; !ok {
            logger.ErrorContext(ctx, "Missing required field",
                slog.String("level", string(level)),
                slog.String("field", field),
                slog.String("event_type", eventType))
            return
        }
    }

    // slog expects alternating key/value arguments, not a map
    args := make([]any, 0, len(fields)*2)
    for k, v := range fields {
        args = append(args, k, v)
    }

    // fmt.Sprint avoids a panic if a message field is not a string
    switch level {
    case ERROR:
        logger.ErrorContext(ctx, fmt.Sprint(fields["error_message"]), args...)
    case WARN:
        logger.WarnContext(ctx, fmt.Sprint(fields["issue"]), args...)
    case INFO:
        logger.InfoContext(ctx, eventType, args...)
    case DEBUG:
        logger.DebugContext(ctx, eventType, args...)
    }
}

type Item struct {
    ID       string
    Quantity int
    Price    float64
}

func ProcessOrder(ctx context.Context, orderID string, items []Item) error {
    // INFO - order received
    logEvent(ctx, INFO, "order.created", map[string]interface{}{
        "order_id":   orderID,
        "item_count": len(items),
    })

    // Validate - ERROR if invalid
    for _, item := range items {
        if item.Quantity <= 0 {
            logEvent(ctx, ERROR, "order.invalid_item", map[string]interface{}{
                "error_code":    "INVALID_QUANTITY",
                "error_message": fmt.Sprintf("Item %s has qty %d", item.ID, item.Quantity),
                "impact":        "order_rejected",
                "item_id":       item.ID,
                "order_id":      orderID,
            })
            return fmt.Errorf("invalid quantity")
        }
    }

    // Try to charge - WARN on retries, ERROR on failure
    for attempt := 0; attempt < 3; attempt++ {
        result, err := chargeCard(ctx, orderID)
        if err == nil && result.Success {
            logEvent(ctx, INFO, "order.completed", map[string]interface{}{
                "order_id":       orderID,
                "transaction_id": result.ID,
                "amount":         result.Amount,
            })
            return nil
        }

        // A declined charge can come back with a nil error; give it one
        // so the err.Error() calls below cannot panic
        if err == nil {
            err = fmt.Errorf("charge declined")
        }

        if attempt < 2 {
            logEvent(ctx, WARN, "order.retry", map[string]interface{}{
                "issue":           "payment_failed",
                "action_taken":    "retry_scheduled",
                "order_id":        orderID,
                "attempt":         attempt + 1,
                "total_attempts":  3,
                "backoff_seconds": 1 << uint(attempt),
                "last_error":      err.Error(),
            })
            time.Sleep(time.Duration(1<<uint(attempt)) * time.Second)
        } else {
            logEvent(ctx, ERROR, "order.failed", map[string]interface{}{
                "error_code":    "PAYMENT_ERROR",
                "error_message": err.Error(),
                "impact":        "order_rejected",
                "order_id":      orderID,
                "attempt":       attempt + 1,
            })
            return err
        }
    }

    return fmt.Errorf("payment failed after retries")
}

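// Dynamic log level adjustment
// Note: this replaces the package-level logger without synchronization.
// Passing a *slog.LevelVar as HandlerOptions.Level is a safer way to change
// the level at runtime, because the handler does not need to be rebuilt.
func SetDebugMode(serviceName string, enable bool) {
    // When disabling, restore the environment default rather than assuming INFO
    level := getLogLevel(os.Getenv("ENV"))
    if enable {
        level = slog.LevelDebug
    }

    opts := &slog.HandlerOptions{
        Level: level,
    }
    logger = slog.New(slog.NewJSONHandler(os.Stdout, opts))

    logEvent(context.Background(), INFO, "debug_mode_changed", map[string]interface{}{
        "event_type":    "debug_mode_changed",
        "service":       serviceName,
        "debug_enabled": enable,
    })
}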

Governance Policies

Log Level Standards by Category

Category	DEBUG	INFO	WARN	ERROR
API requests	Yes, rarely	Entry/exit only	Slow >500ms	Failed requests
Database	Queries	Conn pool events	Retries	Connection failures
Business logic	Variable values, detailed state	Decisions, state changes	Recoverable issues	Critical failures
External APIs	Never	Success summaries	Rate limits	Failures
Security	Never	Auth events	Unusual patterns	Breach attempts

Storage Cost Optimization

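Estimate: processing 100,000 orders/day, at roughly $10 per million stored log entries over a 30-day month (the rate implied by the figures below; actual pricing varies by vendor and retention)
- DEBUG enabled: 50 logs/order = 5M logs/day = $1,500/month storage
- INFO only: 5 logs/order = 500K logs/day = $150/month storage
- INFO + selective DEBUG: 10 logs/order = 1M logs/day = $300/month storage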
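
The three figures above can be reproduced with a few lines of arithmetic. The sketch below assumes a flat rate of $10 per million entries and a 30-day month; that rate is inferred from the figures, not taken from any vendor's price list, and the function name `monthly_storage_cost` is made up for this example.

```python
def monthly_storage_cost(orders_per_day: int, logs_per_order: int,
                         usd_per_million: float = 10.0, days: int = 30) -> float:
    """Estimate monthly log storage cost in USD under a flat per-entry rate."""
    logs_per_month = orders_per_day * logs_per_order * days
    return logs_per_month / 1_000_000 * usd_per_million


ORDERS_PER_DAY = 100_000

for label, logs_per_order in [("DEBUG enabled", 50),
                              ("INFO only", 5),
                              ("INFO + selective DEBUG", 10)]:
    cost = monthly_storage_cost(ORDERS_PER_DAY, logs_per_order)
    print(f"{label}: ${cost:,.0f}/month")
```

Plugging in your own order volume, logs per request, and vendor rate turns this into a quick check on whether a proposed logging change is affordable.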

Enable DEBUG selectively: by service, by user (debug headers), or during incidents.
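
Per-request DEBUG can be sketched with a `logging.Filter` and a context variable: the logger allows DEBUG, and the filter drops DEBUG records unless the current request asked for them. The header name `X-Debug-Log`, the logger name, and the `handle_request` function are hypothetical; a real service would also need to restrict who may send such a header.

```python
import contextvars
import io
import logging

# Set once per request, e.g. from a hypothetical "X-Debug-Log: 1" header.
debug_requested = contextvars.ContextVar("debug_requested", default=False)


class PerRequestDebugFilter(logging.Filter):
    """Always pass INFO and above; pass DEBUG only for flagged requests."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno >= logging.INFO or debug_requested.get()


# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
handler.addFilter(PerRequestDebugFilter())

log = logging.getLogger("selective-debug-demo")  # placeholder logger name
log.setLevel(logging.DEBUG)  # the logger lets DEBUG through; the filter decides
log.addHandler(handler)
log.propagate = False


def handle_request(headers: dict) -> None:
    token = debug_requested.set(headers.get("X-Debug-Log") == "1")
    try:
        log.debug("cart contents loaded")
        log.info("order.created")
    finally:
        debug_requested.reset(token)  # do not leak the flag to other requests


handle_request({})                    # normal request: INFO only
handle_request({"X-Debug-Log": "1"})  # flagged request: DEBUG and INFO
print(stream.getvalue())
```

The trade-off is that DEBUG records are still created for every request before the filter discards them, so this saves storage but not all of the CPU cost.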

Design Review Checklist

Have you defined log levels for this service's key operations?
Are ERROR logs reserved for actual failures, not info disguised as errors?
Do your WARN logs indicate recoverable issues that need attention?
Is DEBUG logging disabled or very sparse in production?
Can you explain why a log is at its chosen level?
Are there governance standards documented for the team?
Is log volume monitored, with alerts for 10x spikes?
Can log levels be adjusted without redeploying?
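
For the volume-monitoring item above, a spike check can be as simple as comparing the current count against a recent baseline. The sketch below uses the median of recent daily counts as the baseline and the 10x factor from the checklist; the function name `is_volume_spike` and the sample numbers are illustrative assumptions.

```python
import statistics


def is_volume_spike(current_count: int, baseline_counts: list,
                    factor: float = 10.0) -> bool:
    """Return True if current_count is at least `factor` times the median baseline."""
    if not baseline_counts:
        return False  # no history yet, so nothing to compare against
    baseline = statistics.median(baseline_counts)
    return baseline > 0 and current_count >= factor * baseline


# Made-up daily log counts for the previous three days.
recent_days = [480_000, 510_000, 500_000]

print(is_volume_spike(5_200_000, recent_days))  # today is over 10x the median
print(is_volume_spike(900_000, recent_days))    # elevated, but under the threshold
```

Using the median rather than the mean keeps one earlier noisy day from inflating the baseline and hiding the next spike.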

Self-Check

Review a day's logs from one of your services. What percentage are DEBUG, INFO, WARN, ERROR? Is the distribution healthy?
Design a logging standard for a payment processing service that processes 1 million requests daily. How would you balance debuggability with storage costs?
How would you implement temporary debug logging during a production incident without redeploying?

One Takeaway

Log levels are governance tools, not just severity labels. Establish standards: INFO for routine operations and key decisions, WARN for recoverable issues, ERROR for failures. Enforce these in code review. Monitor log volume and adjust dynamically. This prevents log explosion while keeping debuggability intact.

Next Steps

Learn log retention and privacy for managing sensitive data
Study metrics for aggregated views of system behavior
Explore alerting to act on log patterns
Review distributed tracing for request-level visibility

