
RED and USE Methodologies

Measure what matters: request rate, errors, and duration (RED), plus resource utilization, saturation, and errors (USE).

TL;DR

RED: Rate (requests/sec), Errors (failed requests), Duration (latency). It measures services. USE: Utilization (% busy), Saturation (queue depth), Errors (error counts). It measures resources (CPU, memory, disk). RED tells you whether a service is healthy; USE tells you why it isn't. Use both together: RED alerts on user-visible issues, USE alerts on capacity and bottlenecks. Don't measure everything; focus on these golden signals.

Learning Objectives

  • Implement RED metrics for microservices
  • Implement USE metrics for infrastructure
  • Understand when to alert on each metric
  • Correlate RED and USE to diagnose problems
  • Avoid metric fatigue (measuring too much)
  • Scale metrics to multiple services and resources
  • Build dashboards around RED and USE

Motivating Scenario

A service is slow. USE metrics look green: CPU at 50%, memory at 30%, disk at 20%. But RED metrics are red: 1,000 req/s with a 10% error rate and high latency. The problem is degraded requests despite low resource utilization. The root cause is an N+1 query in the code, not resource exhaustion. Without RED you would have optimized infrastructure and wasted the effort; with RED you find the code problem.

Core Concepts

RED Methodology (Service Level)

  • Rate: requests per second
  • Errors: failed requests (4xx, 5xx, timeouts)
  • Duration: latency (p50, p95, p99)

Measures from the request perspective—what users see.
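
To make the three signals concrete before the full Prometheus implementation below, here is a minimal sketch that derives Rate, Errors, and Duration from an in-memory request log using only the standard library. The record fields (status, duration_s) and the 60-second window are assumptions for illustration, not part of any particular framework.

import statistics

# Hypothetical request log for a 60-second window:
# each record carries an HTTP status and a duration in seconds.
window_seconds = 60
requests = [
    {"status": 200, "duration_s": 0.12},
    {"status": 200, "duration_s": 0.34},
    {"status": 500, "duration_s": 1.80},
    {"status": 200, "duration_s": 0.08},
]

rate = len(requests) / window_seconds                    # Rate: requests per second
errors = sum(1 for r in requests if r["status"] >= 400)  # Errors: failed requests
error_ratio = errors / len(requests)
durations = sorted(r["duration_s"] for r in requests)
p99 = statistics.quantiles(durations, n=100)[98]         # Duration: p99 latency

print(f"rate={rate:.2f} req/s  errors={error_ratio:.1%}  p99={p99:.2f}s")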

USE Methodology (Resource Level)

  • Utilization: percent of time the resource is busy
  • Saturation: queue depth, work waiting for the resource
  • Errors: resource error events (I/O errors, timeouts)

Measures from the infrastructure perspective—what limits performance.
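
The same idea at the resource level, as a minimal sketch: a one-shot USE snapshot for the CPU using psutil and os, the libraries used in the implementation below. Treating the 1-minute load average per core as the saturation signal is an assumption that holds for Linux-style load accounting; error counts usually come from kernel or hardware counters, not from psutil.

import os
import psutil

# One-shot USE snapshot for the CPU resource (Linux/macOS).
utilization = psutil.cpu_percent(interval=1)   # U: percent of time the CPU was busy
load_1m = os.getloadavg()[0]                   # 1-minute average of runnable tasks
saturation = load_1m / os.cpu_count()          # S: > 1.0 means work is queueing per core
# E: CPU error counts (e.g., MCE/ECC events) live in kernel logs, not in psutil.

print(f"CPU utilization={utilization:.0f}%  saturation={saturation:.2f} tasks/core")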

RED vs. USE

Metric        | RED                | USE
Scope         | Service behavior   | Resource behavior
Example       | HTTP requests      | CPU, disk, memory
User-visible  | Yes                | No (indirect)
Alerts        | Yes                | Yes
Dashboard     | Service dashboard  | Infrastructure dashboard

Implementation

from prometheus_client import Counter, Histogram, Gauge
import time
import psutil
import os

# RED Metrics
request_rate = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint']
)

request_errors = Counter(
    'http_requests_errors_total',
    'HTTP request errors',
    ['method', 'endpoint', 'status_code']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# USE Metrics (resource level)
cpu_utilization = Gauge(
    'cpu_utilization_percent',
    'CPU utilization percentage'
)

cpu_saturation = Gauge(
    'cpu_saturation',
    'CPU saturation (load avg / core count)'
)

memory_utilization = Gauge(
    'memory_utilization_percent',
    'Memory utilization percentage'
)

memory_saturation = Gauge(
    'memory_page_faults_per_sec',
    'Memory page faults per second'
)

disk_utilization = Gauge(
    'disk_utilization_percent',
    'Disk utilization percentage',
    ['device']
)

disk_saturation = Gauge(
    'disk_io_wait_percent',
    'Disk I/O wait percentage',
    ['device']
)

io_errors = Counter(
    'io_errors_total',
    'I/O errors',
    ['device']
)

# RED Middleware
class REDMiddleware:
    def __init__(self):
        self.request_count = 0
        self.error_count = 0

    def handle_request(self, method, endpoint, handler):
        """Track RED metrics for a single request."""
        start = time.time()

        try:
            result = handler()
            # Rate: count every request
            request_rate.labels(method=method, endpoint=endpoint).inc()
            self.request_count += 1

            # Duration: observe latency for percentile calculation
            duration = time.time() - start
            request_duration.labels(method=method, endpoint=endpoint).observe(duration)

            return result

        except Exception as e:
            # Rate: failed requests still count toward the total
            request_rate.labels(method=method, endpoint=endpoint).inc()

            # Errors: record the failure with its status code
            status_code = getattr(e, 'status_code', 500)
            request_errors.labels(
                method=method,
                endpoint=endpoint,
                status_code=status_code
            ).inc()
            self.error_count += 1

            # Duration: failed requests have latency too
            duration = time.time() - start
            request_duration.labels(method=method, endpoint=endpoint).observe(duration)

            raise

# USE Metric Collector
class USEMetricsCollector:
    def __init__(self, interval_seconds=10):
        self.interval = interval_seconds
        self.cpu_count = os.cpu_count()
        self.prev_io_counters = None

    def update_cpu_metrics(self):
        """Collect CPU utilization and saturation."""
        # Utilization: percent of CPU in use
        cpu_percent = psutil.cpu_percent(interval=1)
        cpu_utilization.set(cpu_percent)

        # Saturation: 1-minute load average relative to core count
        load_avg = os.getloadavg()[0]
        saturation = (load_avg / self.cpu_count) * 100
        cpu_saturation.set(saturation)

    def update_memory_metrics(self):
        """Collect memory utilization and saturation."""
        # Utilization: percent of memory in use
        mem = psutil.virtual_memory()
        memory_utilization.set(mem.percent)

        # Saturation: swap-in activity as a rough proxy for memory pressure
        # (psutil reports cumulative bytes swapped in, not page faults)
        try:
            swap = psutil.swap_memory()
            swap_in_rate = swap.sin / self.interval if swap.sin > 0 else 0
            memory_saturation.set(swap_in_rate)
        except Exception:
            pass

    def update_disk_metrics(self):
        """Collect disk utilization and saturation."""
        # Utilization: percent of disk space used
        disk = psutil.disk_usage('/')
        disk_utilization.labels(device='/').set(disk.percent)

        # Saturation: time spent on I/O during the interval
        try:
            io_counters = psutil.disk_io_counters(perdisk=True)
            if self.prev_io_counters:
                for device, counters in io_counters.items():
                    prev = self.prev_io_counters.get(device)
                    if prev:
                        io_time_change = counters.read_time + counters.write_time - \
                            (prev.read_time + prev.write_time)
                        io_wait = (io_time_change / 1000) / self.interval * 100
                        disk_saturation.labels(device=device).set(io_wait)

            # Errors: psutil does not expose device-level I/O error counts;
            # feed io_errors from another source (e.g., kernel logs) in practice.

            self.prev_io_counters = io_counters
        except Exception:
            pass

    def collect_all(self):
        """Collect all USE metrics."""
        self.update_cpu_metrics()
        self.update_memory_metrics()
        self.update_disk_metrics()

# Alerting based on RED and USE
class MetricsAlerter:
    @staticmethod
    def check_red_alert(rate, errors, duration_p99, prev_rate=None):
        """Alert on RED metrics."""
        alerts = []

        # Error rate > 1%
        if rate > 0:
            error_ratio = errors / rate
            if error_ratio > 0.01:
                alerts.append({
                    'type': 'HIGH_ERROR_RATE',
                    'value': error_ratio,
                    'threshold': 0.01,
                    'message': f"Error rate {error_ratio*100:.1f}% is too high"
                })

        # p99 latency > 1 second
        if duration_p99 > 1.0:
            alerts.append({
                'type': 'HIGH_LATENCY',
                'value': duration_p99,
                'threshold': 1.0,
                'message': f"p99 latency {duration_p99:.2f}s exceeds 1 second"
            })

        # Rate drop (outage): traffic was flowing, now it is not
        if rate == 0 and prev_rate and prev_rate > 0:
            alerts.append({
                'type': 'OUTAGE',
                'value': rate,
                'threshold': 1,
                'message': "Request rate dropped to zero"
            })

        return alerts

    @staticmethod
    def check_use_alert(cpu_util, cpu_sat, mem_util, mem_sat, disk_util, disk_sat):
        """Alert on USE metrics."""
        alerts = []

        # CPU utilization > 80%
        if cpu_util > 80:
            alerts.append({
                'type': 'HIGH_CPU',
                'value': cpu_util,
                'threshold': 80,
                'message': f"CPU utilization {cpu_util:.1f}%"
            })

        # CPU saturation > 200% (more than 2 runnable tasks per core)
        if cpu_sat > 200:
            alerts.append({
                'type': 'CPU_SATURATION',
                'value': cpu_sat,
                'threshold': 200,
                'message': f"CPU saturation {cpu_sat/100:.1f} tasks per core"
            })

        # Memory utilization > 85%
        if mem_util > 85:
            alerts.append({
                'type': 'HIGH_MEMORY',
                'value': mem_util,
                'threshold': 85,
                'message': f"Memory utilization {mem_util:.1f}%"
            })

        # Memory saturation (sustained paging/swapping)
        if mem_sat > 100:
            alerts.append({
                'type': 'MEMORY_SATURATION',
                'value': mem_sat,
                'threshold': 100,
                'message': f"High page fault rate {mem_sat:.0f}/sec"
            })

        # Disk nearly full
        if disk_util > 90:
            alerts.append({
                'type': 'DISK_FULL',
                'value': disk_util,
                'threshold': 90,
                'message': f"Disk {disk_util:.1f}% full"
            })

        # Disk I/O saturation > 50%
        if disk_sat > 50:
            alerts.append({
                'type': 'DISK_SATURATION',
                'value': disk_sat,
                'threshold': 50,
                'message': f"Disk I/O wait {disk_sat:.1f}%"
            })

        return alerts

# Example: Diagnose using RED + USE
class Diagnosis:
    @staticmethod
    def diagnose_slow_service(red_metrics, use_metrics):
        """
        Slow service diagnosis:
        - If RED shows high latency + USE shows high CPU = code problem
        - If RED shows high latency + USE shows low resources = external dependency
        - If RED shows high error rate + USE shows high resources = resource exhaustion
        """
        high_latency = red_metrics['duration_p99'] > 1.0
        high_errors = red_metrics['error_rate'] > 0.01
        high_cpu = use_metrics['cpu_util'] > 80
        high_memory = use_metrics['mem_util'] > 80

        if high_latency and high_cpu and not high_errors:
            return "CPU bottleneck - optimize code or scale CPU"

        if high_latency and high_memory and not high_errors:
            return "Memory pressure - optimize memory or scale RAM"

        if high_latency and not high_cpu and not high_memory:
            return "External dependency slow (DB, API, network)"

        if high_errors and high_cpu:
            return "Service overloaded - scale horizontally"

        if high_errors and high_memory:
            return "Memory exhaustion - OOM errors or GC pauses"

        return "Service nominal"

# Usage
collector = USEMetricsCollector()
collector.collect_all()

# Example RED metrics
red_metrics = {
    'rate': 1000,            # req/s
    'errors': 10,            # err/s
    'duration_p99': 0.5,     # seconds
    'error_rate': 10 / 1000  # ratio
}

# Check alerts
alerter = MetricsAlerter()
red_alerts = alerter.check_red_alert(
    red_metrics['rate'],
    red_metrics['errors'],
    red_metrics['duration_p99']
)

# Reading gauge internals (._value.get()) is for demo purposes only;
# in production these values are scraped from the /metrics endpoint.
# Note: the collector labels disk_saturation by physical device (e.g., 'sda'),
# so the '/' label here reads 0 until the labels are aligned.
use_alerts = alerter.check_use_alert(
    cpu_utilization._value.get(),
    cpu_saturation._value.get(),
    memory_utilization._value.get(),
    memory_saturation._value.get(),
    disk_utilization.labels(device='/')._value.get(),
    disk_saturation.labels(device='/')._value.get()
)

print("RED Alerts:", red_alerts)
print("USE Alerts:", use_alerts)

# Diagnose
diag = Diagnosis.diagnose_slow_service(red_metrics, {
    'cpu_util': 50,
    'mem_util': 40
})
print("Diagnosis:", diag)

Real-World Examples

Example: Diagnose Slow Checkout

RED shows:

  • Rate: 500 req/s
  • Errors: 0
  • Duration p99: 2 seconds

USE shows:

  • CPU: 25%
  • Memory: 30%
  • Disk: 40%

Analysis: High latency with low resource usage = external dependency. Likely: Payment service slow.
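
Fed into the Diagnosis sketch from the implementation above (the dictionary keys are the ones that class expects), these numbers land on the external-dependency branch:

checkout_red = {'duration_p99': 2.0, 'error_rate': 0.0}  # 2 s p99, no errors
checkout_use = {'cpu_util': 25, 'mem_util': 30}          # resources mostly idle

print(Diagnosis.diagnose_slow_service(checkout_red, checkout_use))
# -> "External dependency slow (DB, API, network)"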

Example: CPU Bottleneck

RED shows:

  • Rate: 1000 req/s
  • Errors: 5% (50 req/s)
  • Duration p99: 5 seconds

USE shows:

  • CPU: 95%
  • Memory: 40%
  • Disk: 20%

Analysis: High latency + high errors + high CPU = CPU bottleneck. Solution: optimize code or scale CPU.
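
Run through the same Diagnosis sketch, the 5% error rate routes this case to the overload branch rather than the pure CPU-bottleneck branch, but the remediation points the same way: add CPU capacity or make the hot path cheaper.

bottleneck_red = {'duration_p99': 5.0, 'error_rate': 0.05}  # 5 s p99, 5% errors
bottleneck_use = {'cpu_util': 95, 'mem_util': 40}           # CPU pegged

print(Diagnosis.diagnose_slow_service(bottleneck_red, bottleneck_use))
# -> "Service overloaded - scale horizontally"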

Common Mistakes

Mistake 1: Measuring Everything

❌ WRONG: 1000+ metrics per service
- Information overload
- Hard to know what's important
- Dashboards are useless

✅ CORRECT: RED + USE only
- ~10 metrics total
- Clear actionable insights
- Easy to alert on

Mistake 2: Not Correlating RED and USE

❌ WRONG: Alert on high CPU without RED context
- Maybe CPU is high but requests are fast

✅ CORRECT: Correlate
- High CPU + high latency = optimize code
- High CPU + low latency = not a problem

Self-Check

  • What's the difference between RED and USE?
  • When should you alert on RED vs. USE?
  • How do you diagnose slow service using both?
  • What's an example of high USE with low RED impact?

Design Review Checklist

  • RED metrics (rate, errors, duration) implemented?
  • USE metrics (utilization, saturation, errors) implemented?
  • Histograms for latency percentiles?
  • Alerts on RED thresholds?
  • Alerts on USE thresholds?
  • Dashboards show RED and USE together?
  • Error codes tracked and categorized?
  • Resource saturation monitored?
  • Historical data retained (30 days+)?
  • No metric fatigue (< 20 per service)?

Next Steps

  1. Implement RED metrics for all services
  2. Implement USE metrics for infrastructure
  3. Create dashboards combining RED and USE
  4. Set alerts on RED thresholds
  5. Set alerts on USE thresholds
  6. Document diagnosis playbooks
