Timeouts, Retries, Exponential Backoff, and Jitter
Master the foundation of resilience: the four patterns that prevent cascading failures
TL;DR
Every network call must have a timeout (preventing resource exhaustion). Transient failures are common—retry with exponential backoff (1s, 2s, 4s, 8s...). Add jitter (randomness) to the backoff to prevent a thundering herd (all clients retrying simultaneously). Layer your timeouts: per-request, per-connection, and an overall latency budget. Typical setup: 5-second request timeout, 3-5 retries with exponential backoff, jitter on each retry.
Learning Objectives
- Understand why timeouts are essential
- Implement retries correctly
- Apply exponential backoff to avoid overwhelming recovering services
- Use jitter to prevent synchronized retries
- Set realistic timeout values
Motivating Scenario
A database becomes slow. API clients wait 30 seconds for a response. Meanwhile, new requests arrive and also wait 30 seconds. Thread pools fill up. Memory grows. The system cascades into complete failure. A simple 5-second timeout would have detected the problem, freed resources, and allowed the system to recover.
Timeouts: Prevent Resource Exhaustion
Every network call needs a timeout. Without it, slow services cause resource starvation.
Timeout Levels
Request Level: How long to wait for a single RPC?
- Typical: 5-30 seconds depending on operation
- Too short: High failure rate on normal operations
- Too long: Cascades when service is slow
Connection Level: How long to establish TCP connection?
- Typical: 1-5 seconds
- Protects against network hangs during handshake
Overall Budget: Total time for the entire operation (all retries)?
- Roughly request timeout × number of attempts, plus backoff waits
- Example: 5s per request × 3 attempts = 15s budget (before backoff delays)
Timeout Implementation
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure timeout + retries
def requests_with_timeout():
    session = requests.Session()

    # Note: requests has no session-wide timeout setting; pass
    # timeout=(connect, read) on each request, as in the usage below.

    # Retries with exponential backoff
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,  # ~1s, 2s, 4s (exact schedule varies by urllib3 version)
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"],  # `method_whitelist` in urllib3 < 1.26
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage
session = requests_with_timeout()
response = session.post(
    'http://payment-service/charge',
    json={'amount': 100},
    timeout=(5, 10)  # (connection_timeout, read_timeout) in seconds
)
Retries: Recover from Transient Failures
Network is unreliable. Services temporarily fail. Retrying often succeeds on transient failures.
When to Retry:
- Network timeout (maybe just slow, not dead)
- 5xx errors (server error, might recover)
- Connection refused (service restarting)
When NOT to Retry:
- 4xx errors (client error, won't improve with retry)
- 401 Unauthorized (bad credentials won't get better)
- Business logic failures (not transient)
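These rules can be sketched as a small classifier (the status-code groupings mirror the lists above; `is_retryable` is an illustrative helper, not a library function):

```python
def is_retryable(status_code=None, is_timeout=False, is_connection_error=False):
    """Decide whether a failed call is worth retrying."""
    if is_timeout or is_connection_error:
        return True          # maybe just slow or restarting, not dead
    if status_code is not None:
        if status_code == 429:
            return True      # rate limited: retry after backing off
        if 500 <= status_code < 600:
            return True      # server error, might recover
        if 400 <= status_code < 500:
            return False     # client error, retrying won't help
    return False             # business logic failures: do not retry
```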
Exponential Backoff: Prevent Overwhelming
Retry immediately: all fail again immediately. Retry after 100ms: still too fast. Exponential backoff gives service time to recover.
Pattern:
Attempt 1: Immediate (fail)
Attempt 2: Wait 1s then retry
Attempt 3: Wait 2s then retry
Attempt 4: Wait 4s then retry
Attempt 5: Wait 8s then retry
Why It Works:
- Service crashes, restarts (takes 5-10 seconds)
- Exponential backoff gives it time
- Each retry has higher chance of success
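The cumulative wait makes this concrete; a quick sketch (1s initial delay, factor 2, matching the pattern above):

```python
# Delay before each retry with a 1s initial delay and factor 2
delays = [2 ** i for i in range(5)]                    # 1, 2, 4, 8, 16
cumulative = [sum(delays[:i + 1]) for i in range(5)]   # running totals
print(cumulative)  # [1, 3, 7, 15, 31]
```

By the fourth retry the service has had 15 seconds to come back, comfortably covering a 5-10 second restart window.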
Jitter: Prevent Thundering Herd
All clients retry simultaneously = overwhelming spike.
Without jitter: All 1000 clients wait exactly 1 second, then all retry at once. Server gets hammered.
With jitter: Each client waits 1s ± random value (up to 250ms). Retries spread over time.
Without jitter:
t=1.0s: 1000 retries hit server simultaneously (overwhelm!)
With jitter:
t=0.75s: 50 retries
t=0.85s: 80 retries
t=0.95s: 70 retries
t=1.05s: 75 retries
t=1.15s: 65 retries (spread out, manageable)
Jitter Example
import random
import time

import requests

def retry_with_backoff(func, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff + jitter
            backoff = 2 ** attempt  # 1, 2, 4, 8, ...
            jitter = random.uniform(0, backoff * 0.1)  # up to 10% jitter
            sleep_time = backoff + jitter
            print(f"Attempt {attempt + 1} failed, sleeping {sleep_time:.2f}s")
            time.sleep(sleep_time)

# Usage
def call_payment_service():
    response = requests.post('http://payment-service/charge',
                             json={'amount': 100}, timeout=(5, 10))
    response.raise_for_status()  # treat 4xx/5xx as failures so they retry
    return response

try:
    result = retry_with_backoff(call_payment_service)
except Exception as e:
    print(f"Failed after all retries: {e}")
Practical Configuration
Conservative (Availability Focus):
- Request timeout: 30 seconds
- Retry attempts: 5
- Backoff factor: 2
- Max backoff: 30 seconds
Aggressive (Latency Focus):
- Request timeout: 5 seconds
- Retry attempts: 3
- Backoff factor: 2
- Max backoff: 10 seconds
Balanced (Recommended):
- Request timeout: 10 seconds
- Retry attempts: 3
- Backoff factor: 2
- Jitter: ±10%
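The balanced profile translates directly into code; the constant names below are illustrative:

```python
import random

# Balanced profile from above (illustrative names)
REQUEST_TIMEOUT_S = 10
MAX_ATTEMPTS = 3
BACKOFF_FACTOR = 2
JITTER_FRACTION = 0.10  # ±10%

def backoff_with_jitter(attempt):
    """Delay before retry `attempt` (1-based): exponential, ±10% jitter."""
    base = BACKOFF_FACTOR ** (attempt - 1)   # 1s, 2s, 4s
    return base + random.uniform(-JITTER_FRACTION * base, JITTER_FRACTION * base)
```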
Anti-Patterns
Retry Without Idempotency: Duplicate side effects (double charge, etc.). Always use idempotent operations with retries.
Linear Backoff: 1s, 2s, 3s, 4s. Still too fast, server can't recover. Use exponential.
No Jitter: Thundering herd. Always add jitter.
Same Timeout Everywhere: Payments are critical (short timeout); analytics are non-critical (longer timeout). Vary timeouts by importance.
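One way to avoid the same-timeout-everywhere trap is a per-operation timeout table. A minimal sketch, with illustrative values:

```python
# Illustrative per-operation timeouts (seconds); tune to your own SLOs
TIMEOUTS = {
    'payment': 5,       # critical path: fail fast
    'search': 3,        # users expect quick results
    'cache_read': 0.2,  # should be near-instant
    'analytics': 30,    # non-critical background work
}

def timeout_for(operation):
    """Look up the timeout for an operation, with a sane default."""
    return TIMEOUTS.get(operation, 10)
```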
Timeout and Retry Tuning Guide
Real-World Timeout Examples
| Operation | Typical Timeout | Rationale |
|---|---|---|
| HTTP GET request | 5-30s | Network + service response time |
| Database query | 10-30s | Query complexity varies |
| File upload | 60-300s | Network speed varies |
| Payment authorization | 5-15s | External service latency |
| Search query | 2-5s | User expects quick results |
| Cache read | 100-500ms | Should be very fast |
| Internal service call | 1-5s | Local network, should be fast |
Retry Strategy Decision Tree
Operation is idempotent?
├─ YES: Safe to retry
│ ├─ Is it a read (GET)? → Retry more aggressively (5+ times)
│ └─ Is it a write (POST)? → Retry conservatively (2-3 times)
│ (Idempotent writes: use idempotency key)
└─ NO: Risky to retry
├─ Is it critical? → Implement idempotent wrapper
└─ Is it non-critical? → Retry 1-2 times maximum
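For the "implement idempotent wrapper" branch, a common approach is a client-generated idempotency key that is reused on every retry, so the server can deduplicate. A sketch, assuming a requests-style `post` callable and an `Idempotency-Key` header (the exact header name varies by API):

```python
import uuid

def charge_with_idempotency_key(post, amount, max_attempts=3):
    # `post` is any requests-style callable (e.g. a wrapper around
    # session.post) that raises on failure -- an illustrative interface.
    key = str(uuid.uuid4())  # one key per logical charge, reused on retries
    for attempt in range(max_attempts):
        try:
            return post(
                json={'amount': amount},
                headers={'Idempotency-Key': key},  # header name varies by API
            )
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
```

Because the key is generated once, outside the retry loop, a duplicate submission caused by a retried request carries the same key and can be safely ignored server-side.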
Tuning Exponential Backoff
Initial Delay: Start small (100ms) for fast services, larger (1s) for slower services.
Max Retries:
- Critical path: 3-5 retries
- Non-critical: 2-3 retries
- Background jobs: 5-10 retries
Backoff Factor:
- Standard: 2.0 (delay doubles each retry)
- Aggressive: 1.5 (delays grow slowly, so retries come sooner)
- Conservative: 3.0 (delays grow quickly, backing off harder)
Example:
Initial: 100ms
Retry 1: 100ms × 2 = 200ms
Retry 2: 200ms × 2 = 400ms
Retry 3: 400ms × 2 = 800ms
Retry 4: 800ms × 2 = 1600ms
Total backoff wait: 0.1 + 0.2 + 0.4 + 0.8 + 1.6 = 3.1s. Add the time the requests themselves take (e.g. ~1s each across the 4 failed attempts) for roughly 7.1s end to end.
If you want a max of ~10s total:
- Cap delays with max_backoff: 2s
- Later retries then wait 2s each instead of growing unboundedly
- Allows more retries within the same budget
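Capping can be sketched in a few lines (`backoff_schedule` is an illustrative helper):

```python
def backoff_schedule(initial, factor, retries, max_backoff=float('inf')):
    """Delay (seconds) before each retry: exponential, capped at max_backoff."""
    return [min(initial * factor ** i, max_backoff) for i in range(retries)]

# Uncapped: 0.1, 0.2, 0.4, 0.8, 1.6  -> 3.1s of waiting, as above
# Capped at 2s, later delays stop growing, keeping the budget bounded
```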
Jitter Strategies
Strategy 1: Equal Jitter (Recommended)
sleep_time = (backoff / 2) + random(0, backoff / 2)
Example: backoff=4s → sleep between 2s-4s
Strategy 2: Full Jitter (More spread)
sleep_time = random(0, backoff)
Example: backoff=4s → sleep between 0s-4s
Downside: Sometimes no wait (immediate retry)
Strategy 3: Decorrelated Jitter (Most sophisticated)
sleep_time = min(max_backoff, random(base, last_sleep × 3))
Each delay is drawn relative to the previous one, so clients drift apart over time
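The three strategies side by side, as sketches (the decorrelated variant follows the AWS formula, drawing each sleep between a base delay and 3× the previous sleep):

```python
import random

def equal_jitter(backoff):
    """Half fixed, half random: spreads retries but guarantees a minimum wait."""
    return backoff / 2 + random.uniform(0, backoff / 2)

def full_jitter(backoff):
    """Anywhere from 0 to the full backoff: maximum spread."""
    return random.uniform(0, backoff)

def decorrelated_jitter(last_sleep, base=1.0, max_backoff=30.0):
    """AWS-style: next sleep drawn between base and 3x the previous sleep."""
    return min(max_backoff, random.uniform(base, last_sleep * 3))
```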
Self-Check
- Does every network call have a timeout?
- Are your operations idempotent (safe to retry)?
- Do you use exponential backoff with jitter?
- What's the maximum total time a client waits (all retries)?
- Have you tested timeout behavior (what happens at timeout+1ms)?
- Do you monitor timeout/retry rates? (High rates indicate problems)
Timeouts + exponential backoff + jitter = foundation of resilience. Get these right and most failures recover automatically. Measure timeout and retry rates—they're key indicators of system health.
Next Steps
- Circuit Breaker: Read Circuit Breaker
- Load Shedding: Learn Load Shedding
- Idempotency: Explore Idempotency
Timeout Configuration by Technology Stack
Node.js / JavaScript
// http module (default: no timeout!)
const req = http.request(options, callback);
req.setTimeout(5000); // 5 second timeout
req.on('timeout', () => {
  req.destroy(new Error('Request timed out')); // surfaces as an 'error' event
});

// Axios (popular HTTP client)
const client = axios.create({
  timeout: 5000 // 5 second timeout for all requests
});

// Fetch API (modern browsers/Node 18+)
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);
try {
  const response = await fetch(url, {
    signal: controller.signal
  });
} catch (err) {
  if (err.name === 'AbortError') {
    console.error('Request timeout');
  }
} finally {
  clearTimeout(timeoutId);
}
Python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Requests library with retry strategy
# (note: requests has no session-wide timeout; pass timeout= per request)
session = requests.Session()
retry = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Per-request timeout
response = session.get(url, timeout=5)
Go
// HTTP client with timeout
client := &http.Client{
    Timeout: time.Second * 5, // 5 second timeout
}
response, err := client.Get(url)

// Or with context for finer control
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
response, err = client.Do(req)
Timeout vs Deadline Concepts
Timeout: Absolute time limit from now
Now: 10:00:00
Timeout: 5 seconds
Deadline: 10:00:05
If timeout extends: deadline extends
Deadline: Absolute point in time
Deadline: 10:00:10 (fixed)
Current: 10:00:05
Remaining: 5 seconds
If we delay: remaining shrinks
Use timeout for individual operations. Use deadline for overall request processing (useful in distributed systems where deadlines propagate).
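Deadline propagation can be sketched as deriving each call's timeout from the shared deadline (illustrative helpers, using a monotonic clock):

```python
import time

def remaining_budget(deadline):
    """Seconds left until an absolute deadline (monotonic clock)."""
    return deadline - time.monotonic()

def call_with_deadline(func, deadline):
    """Run func with a per-call timeout derived from the shared deadline,
    so nested calls never outlive the overall request budget."""
    remaining = remaining_budget(deadline)
    if remaining <= 0:
        raise TimeoutError("deadline exceeded before the call started")
    return func(timeout=remaining)
```

As work is done, the remaining budget shrinks, so downstream calls automatically get tighter timeouts; this is the pattern behind gRPC deadlines and Go's `context.WithTimeout`.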
References
- Nygard, M. J. (2007). "Release It!: Design and Deploy Production-Ready Software". Pragmatic Programmers.
- Newman, S. (2015). "Building Microservices". O'Reilly Media.
- Cockcroft, A. (2015). "Hystrix: Latency and Fault Tolerance". Netflix Tech Blog.
- AWS SDK Timeout Documentation: https://docs.aws.amazon.com/general/latest/gr/aws-apis.html
- Kubernetes Timeout Best Practices: https://kubernetes.io/docs/concepts/services-networking/service/#timeouts