
Disaster Recovery Patterns

TL;DR

Disaster recovery combines backups, replication, and failover mechanisms to minimize data loss and downtime. Define two critical metrics: RPO (Recovery Point Objective—acceptable data loss) and RTO (Recovery Time Objective—acceptable downtime). Point-in-time recovery (PITR) restores databases to a specific moment; replication-based failover is faster but requires coordination. Follow the 3-2-1 rule: maintain 3 copies of data, on 2 different media types, with 1 copy offsite. Automate all backups, test failover quarterly, and document runbooks. Ransomware and corruption are threats; make backups immutable.

Learning Objectives

  • Design backup strategies aligned with business continuity requirements and RTO/RPO targets
  • Understand RPO/RTO trade-offs and achieve SLOs without over-engineering
  • Implement point-in-time recovery (PITR) for databases and stateful systems
  • Test failover procedures regularly and document recovery runbooks
  • Protect backups against ransomware and accidental deletion
  • Plan recovery capacity and weigh cost trade-offs

Motivating Scenario

A SaaS company runs PostgreSQL on a single server in one region, with no backups. A user accidentally runs a DELETE query that wipes the customer database, and by the time anyone notices, the transaction is committed. The only "backup" is a week-old snapshot, so the customer loses a week of data. The angry customer files a lawsuit.

With disaster recovery in place: daily automated backups in 2 regions and point-in-time recovery enabled. The company detects the accidental DELETE within 5 minutes and restores from a point 1 hour earlier. The customer loses 1 hour of data instead of a week, and the business survives.

Core Concepts

RPO vs RTO

RPO (Recovery Point Objective): Maximum acceptable data loss, measured in time. If RPO = 1 hour, you accept losing up to 1 hour of data in a disaster. Achieved through backup frequency.

RTO (Recovery Time Objective): Maximum acceptable downtime. If RTO = 15 minutes, your system must be back online within 15 minutes of a failure. Achieved through fast failover and replication.

Metric           | Definition          | Example
-----------------|---------------------|--------------------------------------------------------
RPO              | Data loss tolerance | "Lose data from the last 30 minutes only"
RTO              | Downtime tolerance  | "Offline for at most 2 hours"
Backup frequency | Drives RPO          | Hourly backups → 1-hour RPO
Failover speed   | Drives RTO          | Replication failover → seconds; PITR restore → minutes
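
An RPO target only holds if backups are actually fresh, so it is worth measuring the achieved value rather than just declaring it. Below is a minimal monitoring sketch, assuming a hypothetical s3://dr-backups/postgres/ prefix (matching the example later on this page) and a configured AWS CLI; it alerts when the newest backup is older than the RPO target.

#!/usr/bin/env bash
# Sketch: alert if the newest backup is older than the RPO target.
# Assumes backups land under the (hypothetical) s3://dr-backups/postgres/ prefix.
set -euo pipefail

RPO_TARGET_SECONDS=$((60 * 60))   # 1-hour RPO target

# `aws s3 ls` prints "DATE TIME SIZE KEY"; lexical sort puts the newest object last.
latest_line=$(aws s3 ls s3://dr-backups/postgres/ | sort | tail -n 1)
latest_ts=$(echo "$latest_line" | awk '{print $1" "$2}')

latest_epoch=$(date -d "$latest_ts" +%s)   # GNU date; use gdate on macOS
now_epoch=$(date +%s)
age=$((now_epoch - latest_epoch))

if [ "$age" -gt "$RPO_TARGET_SECONDS" ]; then
  echo "ALERT: newest backup is ${age}s old, violating the ${RPO_TARGET_SECONDS}s RPO target"
  exit 1
fi
echo "OK: newest backup is ${age}s old (within RPO target)"

Wired into cron or a monitoring system, a check like this turns the RPO target into something that pages before a disaster, not after.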

The 3-2-1 Rule

A proven backup strategy:

  • 3 copies: Original + 2 backups
  • 2 media types: Disk + Tape (or Block storage + Object storage)
  • 1 offsite: At least one copy in a different geographic location

Example:

  • Copy 1: Production database (on-disk)
  • Copy 2: Daily backup to local NAS (same region)
  • Copy 3: Daily backup replicated to S3 in another region (offsite)

This protects against:

  • Single disk failure (Copy 2, 3 exist)
  • Regional disaster (Copy 3 offsite)
  • Media format obsolescence (2 different types)
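
As a deliberately simplified sketch of how copies 2 and 3 might be produced, assuming a hypothetical /mnt/nas mount for the local NAS and a hypothetical dr-backups-us-west-2 bucket in the offsite region:

#!/usr/bin/env bash
# Sketch: produce copies 2 and 3 of the 3-2-1 rule from a fresh dump.
# Copy 1 is the live database itself; paths and bucket names are hypothetical.
set -euo pipefail

BACKUP_FILE="backup-$(date +%Y%m%d).sql.gz"

# Copy 2: dump to the local NAS (same region, different media).
pg_dump "$DATABASE_URL" | gzip > "/mnt/nas/backups/$BACKUP_FILE"

# Copy 3: replicate to object storage in another region (offsite).
aws s3 cp "/mnt/nas/backups/$BACKUP_FILE" \
  "s3://dr-backups-us-west-2/postgres/$BACKUP_FILE" --region us-west-2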

Backup Types

Type                          | Strategy                                      | RPO             | RTO                                        | Cost
------------------------------|-----------------------------------------------|-----------------|--------------------------------------------|----------
Full backup                   | Copy the entire dataset                       | 1-7 days        | Medium (restore is slow)                   | High
Incremental                   | Copy only data changed since the last backup  | Daily           | Medium                                     | Low
Differential                  | Copy changes since the last full backup       | Daily           | Medium                                     | Medium
Continuous replication        | Stream changes to a standby                   | Seconds-minutes | Fast (seconds)                             | Very high
Point-in-time recovery (PITR) | Restore to any moment via archived logs       | Minutes         | Medium (restore base backup + replay logs) | Medium
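
The PITR row works by replaying archived WAL on top of a base backup until a chosen timestamp. A minimal restore sketch for PostgreSQL 12 or newer, assuming the base backup has already been unpacked into $PGDATA and archived WAL lives in a hypothetical /mnt/wal-archive directory:

# Sketch: PostgreSQL point-in-time recovery (version 12+).
# Assumes a base backup restored into $PGDATA and WAL archived to /mnt/wal-archive.

# 1. Tell the server how to fetch archived WAL and where to stop replaying.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'cp /mnt/wal-archive/%f "%p"'
recovery_target_time = '2024-05-01 13:45:00 UTC'   # moment just before the bad transaction
recovery_target_action = 'promote'
EOF

# 2. Signal recovery mode and start; the server replays WAL up to the target time.
touch "$PGDATA/recovery.signal"
pg_ctl -D "$PGDATA" start

The recovery target time above is illustrative; in the motivating scenario it would be set to just before the accidental DELETE.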

Failover Patterns

Pattern                    | Mechanism                                                        | RTO      | Consistency | Complexity
---------------------------|------------------------------------------------------------------|----------|-------------|-----------
Passive standby            | Single active node, manual failover                              | Hours    | Strong      | Low
Active-passive (automatic) | Automatic failover on failure detection                          | Minutes  | Strong      | Medium
Active-active              | Both regions serving traffic (eventual consistency)              | Seconds  | Eventual    | High
Warm standby               | Pre-provisioned replica, continuous replication, fast promotion  | 5-15 min | Strong      | Medium
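
To make the active-passive (automatic) row concrete, here is a deliberately naive watchdog sketch with hypothetical hostnames. In practice this job is usually delegated to a battle-tested tool (Patroni, a Kubernetes operator, or a managed database's built-in failover) rather than hand-rolled, because split-brain handling is the hard part.

#!/usr/bin/env bash
# Sketch: naive active-passive failover watchdog (hypothetical hosts).
# Production setups should prefer Patroni, an operator, or managed failover.
set -euo pipefail

PRIMARY=postgres-primary.example.internal
STANDBY=postgres-standby.example.internal
FAILURES=0

while true; do
  if pg_isready -h "$PRIMARY" -p 5432 -q; then
    FAILURES=0
  else
    FAILURES=$((FAILURES + 1))
  fi

  # Require consecutive failures so a brief network blip does not trigger failover.
  if [ "$FAILURES" -ge 3 ]; then
    echo "Primary unreachable 3 times in a row; promoting standby $STANDBY"
    ssh "$STANDBY" 'pg_ctl promote -D "$PGDATA"'
    break   # hand off to the runbook: repoint applications and DNS to the standby
  fi
  sleep 10
done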

Practical Example

# Example: Disaster recovery plan for PostgreSQL + Kubernetes

# 1. Daily automated full backup to local storage
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup-daily
  namespace: data-layer
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: postgres-backup
          containers:
            - name: backup
              # NOTE: the image must also provide the AWS CLI for the S3 upload below;
              # postgres:15-alpine alone does not ship it.
              image: postgres:15-alpine
              env:
                - name: PGHOST
                  value: "postgres.data-layer"
                - name: PGUSER
                  value: "postgres"
                - name: PGDATABASE
                  value: "myapp"
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/sh
                - -c
                - |
                  BACKUP_FILE="backup-$(date +%Y%m%d-%H%M%S).sql.gz"
                  pg_dump --verbose | gzip > /mnt/backups/$BACKUP_FILE
                  echo "Backup created: $BACKUP_FILE"
                  # Replicate to S3 for offsite storage
                  aws s3 cp /mnt/backups/$BACKUP_FILE s3://dr-backups/postgres/$BACKUP_FILE
                  # Clean up old local backups (keep 30 days)
                  find /mnt/backups -name "backup-*.sql.gz" -mtime +30 -delete
              volumeMounts:
                - name: backup-storage
                  mountPath: /mnt/backups
          restartPolicy: OnFailure
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
---
# 2. Continuous transaction logging for PITR
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-wal-config
  namespace: data-layer
data:
  postgresql.conf: |
    # Enable WAL archiving for PITR
    wal_level = replica
    max_wal_senders = 10
    wal_keep_size = 1GB   # replaces wal_keep_segments, removed in PostgreSQL 13+
    archive_mode = on
    archive_command = 'test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f'
    # Switch WAL segments at least every 5 minutes so the archive (and PITR) never lags far behind
    archive_timeout = 300
---
# 3. PostgreSQL cluster managed by CloudNativePG: three instances with continuous
#    WAL archiving and base backups to S3. A cross-region standby can be bootstrapped
#    from this object store as a CNPG replica cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-primary-us-east-1
  namespace: data-layer
spec:
  instances: 3
  bootstrap:
    initdb:
      database: myapp
      owner: postgres
  postgresql:
    parameters:
      log_checkpoints: "on"
      log_statement: "all"
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://dr-backups/postgres
      s3Credentials:
        accessKeyId:
          name: aws-credentials
          key: access_key
        secretAccessKey:
          name: aws-credentials
          key: secret_key
  monitoring:
    enablePodMonitor: true
---
# 4. Failover procedure (manual, documented)
# In case of primary region failure:
#   1. Verify the primary is truly down (not just a network partition)
#   2. Promote the standby to primary: pg_ctl promote
#   3. Update application connection strings / secrets to point at the new primary
#   4. If the cause was corruption or an accidental delete, run a PITR restore instead of promoting
#   5. Smoke-test critical queries against the new primary
#   6. Update DNS to point to the new primary

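The manual failover procedure above becomes far more reliable when captured as an executable runbook. A minimal sketch, assuming a streaming-replication standby reachable over SSH and hypothetical hostnames; the DNS or service-discovery update is provider-specific and left as a placeholder:

#!/usr/bin/env bash
# Sketch: executable failover runbook (hypothetical hosts; adapt to your topology).
set -euo pipefail

PRIMARY=postgres-primary.us-east-1.example.internal
STANDBY=postgres-standby.us-west-2.example.internal

# 1. Verify the primary is truly down, not just unreachable from this host.
if pg_isready -h "$PRIMARY" -p 5432 -q; then
  echo "Primary still answers; aborting failover." >&2
  exit 1
fi

# 2. Promote the standby to primary.
ssh "$STANDBY" 'pg_ctl promote -D "$PGDATA"'

# 3. Wait until the promoted node accepts read-write connections.
until psql -h "$STANDBY" -U postgres -Atc "SELECT NOT pg_is_in_recovery();" | grep -q t; do
  sleep 2
done

# 4. Smoke-test a known query before declaring recovery complete (illustrative table).
psql -h "$STANDBY" -U postgres -d myapp -c "SELECT count(*) FROM customers;" >/dev/null

# 5. Repoint applications: update DNS, service discovery, or connection secrets.
echo "TODO: update DNS / connection secrets to point at $STANDBY"
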
Backup Schedule:

  • Continuous: Transaction logs (WAL) archived and shipped to S3 at least every 5 minutes (archive_timeout), enabling PITR to an arbitrary point in time (see the archive script sketch after this list)
  • Daily: Full backup at 2 AM UTC to local NAS + S3
  • Weekly: Full backup to tape for long-term retention
  • Monthly: Offsite tape shipped to secure vault
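
The WAL bullet above assumes segments actually leave the box; the archive_command in the ConfigMap only copies them to a local directory. One option, sketched below with the same hypothetical bucket, is to point archive_command at a small wrapper script. In practice, purpose-built tools such as WAL-G, pgBackRest, or Barman (which CloudNativePG uses under the hood) handle this more robustly.

#!/usr/bin/env bash
# Sketch: WAL archive wrapper, invoked by PostgreSQL as
#   archive_command = '/usr/local/bin/archive-wal.sh %p %f'
# It must exit non-zero on any failure so PostgreSQL retries the segment
# instead of recycling it.
set -euo pipefail

WAL_PATH="$1"   # %p: path to the WAL segment on disk
WAL_NAME="$2"   # %f: file name of the WAL segment

test ! -f "/mnt/wal-archive/$WAL_NAME"        # never overwrite an already-archived segment
cp "$WAL_PATH" "/mnt/wal-archive/$WAL_NAME"   # local copy for fast restores
aws s3 cp "$WAL_PATH" "s3://dr-backups/postgres/wal/$WAL_NAME"   # offsite copy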

Recovery Capability:

  • RPO: 1 hour (lose at most 1 hour of data if primary fails)
  • RTO: 5 minutes for replication-based failover, 30 minutes for PITR restore

When to Use / When NOT to Use

Disaster Recovery: Sound Strategy vs Over-Engineering
Best Practices
  1. DO: Define RPO/RTO Based on Business Impact: Critical ecommerce: RPO=1h, RTO=5min. Non-critical: RPO=24h, RTO=4h. Let business set targets. Design recovery to meet those targets, no more.
  2. DO: Test Failover Quarterly: Run automated dr-test.sh. Simulate primary failure. Verify standby promotes. Verify data. Takes 1 hour, catches problems before real disaster.
  3. DO: Keep Backups Immutable: Backups in S3 with versioning plus MFA delete or Object Lock. Ransomware can't encrypt or delete them. 30-day retention minimum (to survive crypto-locker discovery lag). See the sketch after this list.
  4. DO: Automate Everything: Backups run via cron. Failover detected/promoted automatically. Recovery scripts tested. Runbook is executable code, not Word document.
  5. DO: Follow 3-2-1 Rule: Original + 2 backups, on 2 media types, 1 offsite. Protects against disk failure, regional disaster, format obsolescence.
  6. DO: Document Recovery RTO/RPO: Runbook includes: what we can lose (RPO), how long it takes to recover (RTO), step-by-step recovery script. Shared with team.
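
Item 3 can be made concrete with S3 versioning plus Object Lock; a sketch with a hypothetical bucket name follows. Note that Object Lock has to be requested when the bucket is created, and MFA delete (an alternative or additional control) can only be enabled by the account root user.

# Sketch: make the (hypothetical) dr-backups bucket effectively immutable.

# Object Lock must be enabled at bucket creation time.
aws s3api create-bucket --bucket dr-backups --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2 \
  --object-lock-enabled-for-bucket

# Versioning is required by Object Lock and also protects against overwrites.
aws s3api put-bucket-versioning --bucket dr-backups \
  --versioning-configuration Status=Enabled

# COMPLIANCE mode: no one, including the root user, can delete objects for 30 days.
aws s3api put-object-lock-configuration --bucket dr-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'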
Anti-Patterns
  1. DON'T: Guess or gold-plate RPO/RTO: Engineer everything to RPO=5min, RTO=30sec (the most expensive option), or ignore business requirements and hope backups are 'good enough.'
  2. DON'T: Skip failover tests: Never test. Assume failover works because the design looks good on paper. Discover critical bugs during an actual disaster (the worst time to learn).
  3. DON'T: Leave backups mutable: Backups stored on a NAS with the same credentials as production. Ransomware encrypts the NAS. Backups gone. No recovery possible.
  4. DON'T: Rely on manual procedures: Backups mean 'call the DBA at 3 AM'; failover means 'manually SSH to the standby and run pg_ctl promote.' Error-prone and slow.
  5. DON'T: Keep a single copy: One backup in the same region on the same storage type. The entire data center burns down and the backups go with it.
  6. DON'T: Keep recovery knowledge in one head: The recovery procedure exists only in one person's head. That person leaves the company. No one else can recover. A disaster kills the company.

Patterns & Pitfalls

  • Pitfall: untested backups. Backups run daily and look successful (0 errors), but no one ever restores one. During a disaster the restore fails: corrupt backup, wrong format, missing credentials. Recovery takes hours or fails entirely. (See the restore-drill sketch after this list.)
  • Pitfall: backups in the blast radius. Primary DB in us-east-1, backups stored in us-east-1. The data center catches fire and the entire region is destroyed. Backups gone, no recovery.
  • Pitfall: ransomware reaches the backups. An attacker gains access to production and encrypts the primary. The admin restores from backup, but the backups sit on a NAS the attacker also encrypted. All copies gone.
  • Pattern: business-driven targets. The business says 'RPO=1h, RTO=15min.' Design accordingly: PITR plus hourly backups for RPO, a warm standby plus automated failover for RTO. Don't over-engineer for 99.99% uptime if the business only needs 99.9%.
  • Pattern: 3-2-1 in practice. Copy 1: primary DB. Copy 2: automated daily backup to same-region S3. Copy 3: daily backup replicated to S3 in another region. All automated, immutable, and tested quarterly.
  • Pattern: regular failover drills. Monthly, kill the primary in production (within the runbook time window). The standby auto-promotes while the team observes; issues are documented and fixed before the next drill. A realistic failure scenario.
  • Pitfall: stale runbooks. The recovery procedure lives in a Word document written in 2020. The code has changed and software versions differ, so the runbook no longer matches reality. When disaster hits, the instructions don't work.
  • Pattern: executable runbooks. The failover script is bash or python that runs hourly in testing, so it is proven to work. When a real failover is needed, run the same script. No surprises.
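
The restore drill referenced in the first pitfall is cheap to automate: pull the newest backup, load it into a scratch database, and run a sanity query. A minimal sketch, assuming the plain-SQL gzipped dumps produced by the CronJob above and a disposable non-production PostgreSQL instance on localhost:

#!/usr/bin/env bash
# Sketch: quarterly restore drill against a scratch database (names hypothetical).
set -euo pipefail

SCRATCH_DB="restore_test_$(date +%Y%m%d)"

# 1. Pull the newest backup from the offsite bucket.
LATEST=$(aws s3 ls s3://dr-backups/postgres/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://dr-backups/postgres/$LATEST" "/tmp/$LATEST"

# 2. Restore it into a throwaway database on a non-production instance.
createdb -h localhost "$SCRATCH_DB"
gunzip -c "/tmp/$LATEST" | psql -h localhost -d "$SCRATCH_DB" --quiet

# 3. Sanity-check that the data is actually there and queryable (illustrative table).
psql -h localhost -d "$SCRATCH_DB" -Atc "SELECT count(*) FROM customers;"

# 4. Clean up.
dropdb -h localhost "$SCRATCH_DB"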

Design Review Checklist

  • Are RPO and RTO targets defined by business (not engineering guesses)?
  • Do backup strategies achieve stated RPO targets? (frequency, retention)
  • Are backups tested quarterly (actual restore, not just 'no errors')?
  • Is the 3-2-1 rule followed (3 copies, 2 media, 1 offsite)?
  • Are backups stored in a different region from production?
  • Are backups immutable (versioning, MFA delete, write-once storage)?
  • Is PITR (point-in-time recovery) available for databases?
  • Is point-in-time RTO realistic, documented, and tested?
  • Is failover automated (not manual multi-step procedure)?
  • Is failover RTO measured, documented, and achievable?
  • Are runbook procedures executable code (scripts), not Word documents?
  • Are runbooks tested quarterly (chaos engineering)?
  • Does runbook include: pre-disaster verification, failure detection, promotion, DNS update, application reconnect, post-recovery validation?
  • Is ransomware considered (immutable backups, offsite copy)?
  • Is recovery capacity planned (spare region/infrastructure exists)?
  • Can team execute failover without on-call engineer (automated)?
  • Are backup/recovery costs justified by business impact (ROI)?
  • Is disaster recovery communication plan in place (who notified, when)?

Self-Check

  1. Right now, could you restore a critical database from backup? How long would it take? Test it on a dev database.
  2. What is your current RPO and RTO? If you don't have targets, ask business what they need.
  3. Where are your backups stored? If they're in the same region as production, you're not protected against regional disaster.
  4. When was the last successful restore test? If it's been > 3 months, test immediately.
  5. If primary database fails right now, who detects it? How long until users notice? How long until it's promoted and working?

Next Steps

  1. Define RPO/RTO — Interview business owners, document targets.
  2. Design backup strategy — Implement 3-2-1 rule. Automate backups.
  3. Implement PITR — Enable transaction logging, retention, offsite copy.
  4. Set up warm standby — Replicate to another region. Automate failover detection.
  5. Test quarterly — Run failover drill. Measure RTO. Document issues.
  6. Create executable runbook — Write recovery scripts (bash/python). Test regularly.
  7. Make backups immutable — Versioning, MFA delete, prevent ransomware.
  8. Plan recovery capacity — Ensure standby infrastructure is pre-provisioned.

References

  1. DigitalOcean: RPO and RTO Explained ↗️
  2. PostgreSQL: Backup and Restore ↗️
  3. PostgreSQL: Point-In-Time Recovery ↗️
  4. AWS: Disaster Recovery ↗️
  5. Kubernetes: Disaster Recovery ↗️
  6. CISA: Ransomware Alerts & Mitigation ↗️