Load Balancing: L4 vs L7

Route traffic across instances at transport and application layers.

TL;DR

  • L4 (Transport): TCP/UDP level. Fast (minimal overhead), simple algorithms (round-robin, least connections). Best for bulk data, latency-critical traffic, and non-HTTP protocols.
  • L7 (Application): HTTP level. Slower (parsing overhead), but enables smart routing by path, hostname, or header. Best for APIs, microservices, and content-based routing.
  • Start with L4; move to L7 when you need content-based routing.
  • Hairpinning: make sure the load balancer never routes traffic back to itself.
  • Health checks: TCP checks are fast; HTTP checks are more accurate.
  • Session affinity: if you need it, persist users to the same backend, but it complicates horizontal scaling.

Learning Objectives

  • Understand L4 vs. L7 tradeoffs
  • Design load balancing strategies
  • Implement health checks
  • Handle session affinity
  • Debug load balancing issues
  • Optimize for latency
  • Design for high availability
  • Monitor load balancer health

Motivating Scenario

An API is deployed across 10 instances, and some instances are intermittently slow. Round-robin keeps sending those slow instances an equal share of traffic, so tail latency suffers. Solution: an L7 load balancer using least-connections routing, so slow instances (which accumulate in-flight requests) receive less new traffic, while health checks mark failing instances unhealthy. Result: roughly 10x better p99 latency.

Core Concepts

L4 vs. L7

Aspect        L4 (Transport)     L7 (Application)
Protocol      TCP/UDP            HTTP/gRPC
Overhead      Low                High
Speed         Fast               Slower
Routing       IP + port          Path, hostname, header
Latency       < 1ms              1-5ms
Stickiness    Hash-based         Header/cookie-based
Best for      High throughput    APIs, microservices

Load Balancing Algorithms

Algorithm            Behavior                                 Use Case
Round-robin          Cycle through instances                  Stateless, equal capacity
Least connections    Route to fewest active connections       Varying request durations
IP hash              Hash client IP to an instance            Session affinity (not ideal)
Weighted             Custom weights per instance              Different instance sizes
Latency              Route to lowest-latency instance         Geo-distributed
Random               Random selection                         Simple, distributed
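
These algorithms boil down to small selection routines. The sketch below is plain Python, not tied to any particular proxy; the Backend class and its connection counter are illustrative assumptions about what a load balancer tracks.

import itertools

class Backend:
    def __init__(self, address):
        self.address = address
        self.active_connections = 0   # incremented/decremented as requests start/finish

class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # Cycle through instances in order; assumes roughly equal capacity.
        return next(self._cycle)

class LeastConnections:
    def __init__(self, backends):
        self.backends = backends

    def pick(self):
        # Route to the instance with the fewest in-flight requests;
        # slow instances accumulate connections and receive less new traffic.
        return min(self.backends, key=lambda b: b.active_connections)

backends = [Backend("10.0.1.10:8080"), Backend("10.0.1.11:8080"), Backend("10.0.1.12:8080")]
print(LeastConnections(backends).pick().address)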

Implementation

# NGINX L4 load balancing (TCP)
stream {
    upstream api_backend {
        # Load balancing method (default: round-robin)
        # least_conn;           # route to fewest connections
        # hash $remote_addr;    # IP-based stickiness
        server 10.0.1.10:8080 weight=5;  # receives more traffic
        server 10.0.1.11:8080 weight=3;
        server 10.0.1.12:8080 weight=2;
    }

    server {
        listen 8080;
        proxy_pass api_backend;

        # TCP-level failure detection: a failed connect within this
        # timeout marks the server as unavailable (passive check)
        proxy_connect_timeout 5s;
        proxy_socket_keepalive on;

        # Logging
        access_log /var/log/nginx/l4.log;
    }
}
---

# HAProxy L4 load balancing
global
    maxconn 50000

defaults
    mode tcp
    balance leastconn    # least-connections algorithm
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend api_frontend
    bind *:8080
    mode tcp
    default_backend api_servers

backend api_servers
    mode tcp
    balance leastconn

    # Health check (TCP)
    option tcp-check
    tcp-check connect port 8080

    server api1 10.0.1.10:8080 check inter 2000
    server api2 10.0.1.11:8080 check inter 2000
    server api3 10.0.1.12:8080 check inter 2000
---

# AWS Network Load Balancer (L4)
apiVersion: v1
kind: Service
metadata:
  name: api-nlb
spec:
  type: LoadBalancer
  # Class handled by the AWS Load Balancer Controller; adjust for your controller
  loadBalancerClass: service.k8s.aws/nlb
  sessionAffinity: None
  externalTrafficPolicy: Local   # preserve source IP
  healthCheckNodePort: 30000
  selector:
    app: api
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP

Real-World Examples

Scenario 1: Global Traffic Distribution

User in US: latency 50ms to us-east-1
User in EU: latency 200ms to us-east-1

L7 Load Balancer with geo-routing:
US traffic → us-east-1 (50ms)
EU traffic → eu-west-1 (50ms)

Result: 4x better latency for EU users

Scenario 2: Canary Deployment

Current: v1 (100% traffic)
New: v2 (canary)

L7 routing:
- 95% traffic → v1
- 5% traffic → v2
- Monitor v2 error rate
- If OK: increase to 50%, then 100%
- If errors: rollback

L4 can't do this (no HTTP awareness)
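
One way to implement such a split, sketched below in plain Python, is deterministic hash bucketing: a given user or request ID always lands on the same side while the overall percentages hold. The backend names v1-backend/v2-backend are illustrative.

import hashlib

def canary_backend(request_key: str, canary_percent: int = 5) -> str:
    # Hash the key (user ID, session ID, ...) into a 0-99 bucket.
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 100
    return "v2-backend" if bucket < canary_percent else "v1-backend"

print(canary_backend("user-42"))        # 95/5 split
print(canary_backend("user-42", 50))    # widen to 50/50 once v2 looks healthy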

Scenario 3: High-Frequency Trading

Latency-critical workload
Requirement: < 1ms

Solution: L4 Network Load Balancer
- No HTTP overhead
- Direct TCP pass-through
- < 100μs added latency
- 50k+ concurrent connections

L7 would add 1-5ms (unacceptable)

Common Mistakes

Mistake 1: Session Affinity (Breaks Scaling)

❌ WRONG: Sticky sessions
User A always routes to Instance 1
Instance 1 fails → User A disconnected
Can't add instances (breaks affinity)

✅ CORRECT: Stateless design
User state in database/cache
Any instance can serve user
Easy horizontal scaling
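
A sketch of the stateless pattern: session data is keyed by an ID and kept in a shared store, so any instance can serve the next request. The in-memory dict below is a stand-in for an external database or cache such as Redis.

# Stand-in for a shared, external session store (e.g. Redis or a database).
store = {}

def handle_request(session_id: str, instance: str) -> str:
    session = store.setdefault(session_id, {"cart": []})
    session["cart"].append("item")
    # No instance-local state: the same session works on any instance.
    return f"{instance} served {session_id}, cart size {len(session['cart'])}"

print(handle_request("sess-1", "instance-1"))
print(handle_request("sess-1", "instance-2"))   # different instance, same state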

Mistake 2: Health Check Not Matching Traffic

❌ WRONG: TCP health check on HTTP service
TCP connects, but service is hung
Instance marked healthy, but returns errors

✅ CORRECT: HTTP health check
GET /health returns 200
Accurate detection of real problems
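
A health endpoint is most useful when it verifies the service can actually do work, not just accept a TCP connection. Below is a minimal standard-library sketch; the dependency check is a placeholder assumption.

from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok() -> bool:
    # Placeholder: verify DB/cache connectivity, worker pool health, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and dependencies_ok():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)   # load balancer marks the instance unhealthy
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()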

Mistake 3: Single Load Balancer (SPOF)

❌ WRONG: Single LB
LB fails → all traffic lost

✅ CORRECT: HA load balancers
2+ load balancers
VIP (virtual IP) for failover
Active-passive or active-active

Design Checklist

  • L4 or L7 chosen based on use case?
  • Load balancing algorithm appropriate?
  • Health checks configured?
  • Timeouts set (connect, read, send)?
  • Connection pooling/keepalive enabled?
  • Session affinity justified and configured?
  • TLS termination at LB?
  • Compression enabled for static content?
  • Rate limiting configured?
  • Logging enabled for debugging?
  • HA setup (multiple LBs)?
  • Monitoring of LB health and metrics?

Next Steps

  1. Choose L4 or L7 (or both)
  2. Select load balancing algorithm
  3. Configure health checks
  4. Setup timeouts and connection pooling
  5. Test failover scenarios
  6. Monitor load balancer metrics
  7. Document routing rules
  8. Plan for scaling

Advanced Load Balancing

Consistent Hashing

For distributed caches/databases, regular hashing breaks on server addition:

import hashlib

# Simple hashing (breaks on server change)
def simple_hash(key, servers):
    h = hash(key) % len(servers)
    return servers[h]

# Consistent hashing (adds/removes servers gracefully)
class ConsistentHash:
    def __init__(self, servers):
        self.servers = sorted(servers)
        self.hash_ring = {}
        for server in self.servers:
            for i in range(160):  # virtual nodes smooth out the key distribution
                node_key = f"{server}:{i}"
                hash_val = int(hashlib.md5(node_key.encode()).hexdigest(), 16)
                self.hash_ring[hash_val] = server

    def get_server(self, key):
        # Walk clockwise around the ring to the first node at or after the key's hash.
        hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
        for ring_key in sorted(self.hash_ring.keys()):
            if ring_key >= hash_val:
                return self.hash_ring[ring_key]
        # Wrap around to the first node on the ring.
        return self.hash_ring[min(self.hash_ring.keys())]

# When a server is added/removed, only ~1/n of keys rehash,
# vs. simple hashing which rehashes (almost) all keys.
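
For illustration, here is how the class above behaves when a server is added (addresses are hypothetical); typically only a small fraction of keys move.

ring = ConsistentHash(["10.0.1.10", "10.0.1.11", "10.0.1.12"])
before = {k: ring.get_server(k) for k in (f"user:{i}" for i in range(100))}

bigger_ring = ConsistentHash(["10.0.1.10", "10.0.1.11", "10.0.1.12", "10.0.1.13"])
after = {k: bigger_ring.get_server(k) for k in before}

moved = sum(1 for k in before if before[k] != after[k])
print(f"{moved} of {len(before)} keys moved")   # roughly a quarter, not all of them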

Connection Draining

Gracefully remove instance from load balancer:

Phase 1: Drain (stop sending NEW connections)
- LB marks instance as "draining"
- Existing connections continue
- New connections go to other instances

Phase 2: Wait (existing connections finish)
- LB waits for existing connections
- Timeout if connections don't close
- Usually 30-60 seconds

Phase 3: Remove (instance removed from LB)
- Instance can now restart or terminate
- No connection loss
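
As a sketch, the three phases can be expressed as a small control loop; the Backend object and its connection counter below are illustrative assumptions about what the load balancer tracks.

import time

class Backend:
    def __init__(self, address):
        self.address = address
        self.accepting_new = True
        self.active_connections = 0   # maintained by the proxy's accept/close hooks

def drain(backend: Backend, timeout_seconds: float = 60.0, poll_seconds: float = 1.0) -> bool:
    # Phase 1: stop sending NEW connections to this backend.
    backend.accepting_new = False

    # Phase 2: wait for in-flight connections to finish, bounded by a timeout.
    deadline = time.monotonic() + timeout_seconds
    while backend.active_connections > 0 and time.monotonic() < deadline:
        time.sleep(poll_seconds)

    # Phase 3: the instance can now be removed or restarted.
    # True means no connections had to be cut.
    return backend.active_connections == 0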

Content-Based Routing

L7 routing rules:

# Route by hostname
api.example.com → api-backend
admin.example.com → admin-backend
cdn.example.com → cdn-backend

# Route by path
/api/v1/* → v1-backend
/api/v2/* → v2-backend
/admin/* → admin-backend (requires auth)

# Route by header
X-Client: mobile → mobile-optimized-backend
X-Client: web → web-backend

# Route by cookie
session-type: premium → premium-backend
session-type: free → free-tier-backend
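
A sketch of how such rules might be evaluated in order (hostname, then path prefix, then header); the backend names mirror the examples above and are illustrative.

def route(host: str, path: str, headers: dict) -> str:
    if host == "admin.example.com" or path.startswith("/admin/"):
        return "admin-backend"                  # requires auth
    if host == "cdn.example.com":
        return "cdn-backend"
    if path.startswith("/api/v2/"):
        return "v2-backend"
    if path.startswith("/api/v1/"):
        return "v1-backend"
    if headers.get("X-Client") == "mobile":
        return "mobile-optimized-backend"
    return "web-backend"

print(route("api.example.com", "/api/v2/users", {}))          # -> v2-backend
print(route("api.example.com", "/", {"X-Client": "mobile"}))  # -> mobile-optimized-backend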

SSL/TLS Termination

Decrypt HTTPS at load balancer:

Client (HTTPS) → LB (decrypts) → Backend (plain HTTP over the fast, local network)

Benefits:

  • Backends don't waste CPU on encryption
  • Certificate management centralized
  • Can inspect/modify headers

Drawbacks:

  • Requires storing private key in LB
  • LB becomes security boundary

Modern approach: end-to-end TLS (often with mTLS between LB and backends)

Client (HTTPS) → LB (HTTPS) → Backend (HTTPS)
                  |
                  decrypts for inspection only,
                  then re-encrypts toward the backend

Load Balancer Monitoring

Key Metrics

  • Connection count: Active connections to backend
  • Request latency: Time for LB to forward + wait for response
  • Error rate: 5xx responses from backend
  • Dropped connections: LB dropped due to overload
  • Backend health: Number of healthy backends

Alerting

alerts:
  - name: UnhealthyBackends
    condition: healthy_backends < 2
    message: "Fewer than 2 healthy backends, risk of outage"

  - name: HighErrorRate
    condition: error_rate > 0.01
    message: "Error rate > 1%, investigate backends"

  - name: HighLatency
    condition: latency_p99 > 500ms
    message: "p99 latency > 500ms, possible overload"

  - name: DrainedBackends
    condition: draining_backends > 1
    message: "Multiple backends draining, potential issue"

Performance Tuning

For high throughput (>100k req/s):

  1. Connection pooling: Reuse connections to backends (see the sketch after this list)

    Without pooling: New TCP connection per request (slow)
    With pooling: Reuse connection (fast)
  2. Keepalive timeout: Keep connections open longer

    Short (60s): Quick resource cleanup, more overhead
    Long (300s): Less overhead, higher resource usage
  3. Buffer sizes: Match expected packet sizes

    send_buffer_size: 64KB
    receive_buffer_size: 64KB
    # Adjust based on average request/response size
  4. CPU affinity: Pin LB to CPU cores

    taskset -c 0,1,2,3 nginx
    # Improves cache locality, reduces context switching
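
As a rough client-side illustration of why pooling matters (item 1), the sketch below reuses a single HTTP connection versus opening a new one per request; it assumes network access to example.com and only approximates what an LB-to-backend pool does.

import http.client
import time

def fetch_n(n: int, reuse: bool) -> float:
    start = time.perf_counter()
    if reuse:
        conn = http.client.HTTPConnection("example.com", timeout=5)
        for _ in range(n):
            conn.request("GET", "/")
            conn.getresponse().read()   # body must be drained before reusing the socket
        conn.close()
    else:
        for _ in range(n):
            conn = http.client.HTTPConnection("example.com", timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()
            conn.close()                # new TCP handshake every request
    return time.perf_counter() - start

print(f"pooled:      {fetch_n(5, reuse=True):.2f}s")
print(f"per-request: {fetch_n(5, reuse=False):.2f}s")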

Failover and Resilience

Active-Passive Failover

Two load balancers, one active:

VIP: 10.0.1.100 (virtual IP that clients connect to)

Active LB: 10.0.1.10 (currently owns the VIP)
Passive LB: standby, ready to claim the VIP

Backends: 10.0.2.*

If the active LB fails → the passive LB takes over the VIP
Clients don't notice (same IP)

Technologies:

  • VRRP (Virtual Router Redundancy Protocol)
  • AWS Elastic IP + Lambda failover
  • DNS failover (Route53)

Active-Active Load Balancing

Both load balancers serve traffic:

LB1: 10.0.1.10 → serves one share of traffic (e.g., split via DNS round-robin or ECMP)
LB2: 10.0.1.11 → serves the other share

Failure: traffic from the failed LB shifts to the surviving one

Pros: better utilization, no single point of failure
Cons: more complex, requires distributed coordination

Conclusion

Load balancing is critical for availability:

  • L4: Fast, simple, good for TCP protocols
  • L7: Smart routing, good for HTTP

Design for:

  • High availability (multiple LBs)
  • Graceful failover (connection draining)
  • Content-based routing (microservices)
  • Observability (metrics, logging)

Monitor:

  • Backend health
  • Request latency
  • Error rates
  • Dropped connections

Scale the backend pool:

  • Spot instances for burst capacity and cost savings
  • Reserved instances for the baseline load
  • A mixed strategy for flexibility

L4 vs. L7 Decision Matrix

Use L4 when:

  • Protocol: TCP, UDP, non-HTTP
  • Throughput: > 100k req/sec
  • Latency: < 1ms required
  • Examples: Gaming, DNS, NTP, custom protocols

Use L7 when:

  • Protocol: HTTP(S), gRPC
  • Routing: By path, hostname, header
  • Throughput: < 100k req/sec acceptable
  • Examples: APIs, web apps, microservices

Common L7 Routing Patterns

API versioning:

GET /api/v1/users  → v1-backend
GET /api/v2/users → v2-backend

Feature flags:

Header: X-Feature-Flag: experimental
Route to experimental-backend

A/B testing:

Cookie: ab-test=group-a → a-backend
Cookie: ab-test=group-b → b-backend
Random → 50/50 split

Tenant isolation:

Header: X-Tenant: acme → acme-backend
Header: X-Tenant: widgetcorp → widget-backend