
Drift Detection and Prevention

Detect when infrastructure diverges from code; prevent manual changes from bypassing version control.

TL;DR

Infrastructure drift occurs when the actual deployed infrastructure diverges from the infrastructure-as-code definition. A developer SSHs into a production server and changes a configuration file. Someone creates an S3 bucket through the AWS console instead of Terraform. A load balancer is modified manually. Now the live system doesn't match the code, creating a "source of truth" problem. If the server crashes and is rebuilt from code, the manual changes are lost. Worse, security audits can't reason about the actual configuration; it exists in someone's head. Prevention requires forbidding manual changes and enforcing automation. Detection requires regularly comparing code to reality and alerting on divergence.

Learning Objectives

  • Understand the causes and risks of configuration drift
  • Implement drift detection mechanisms
  • Prevent unauthorized manual changes through process and tooling
  • Reconcile discovered drift safely
  • Monitor and report on compliance
  • Design immutable infrastructure practices

Motivating Scenario

Your company manages a microservices platform on AWS with 200+ resources. Six months ago, you switched to Terraform for infrastructure management. Three weeks ago, on Friday evening, the API service went down. The investigation found that someone had manually increased the RDS instance size through the AWS console to handle a traffic spike, while the Terraform code still specified the smaller size. The next routine terraform apply dutifully "corrected" the drift, shrank the database back to the smaller size, and the service collapsed under the load.

A security audit the next week revealed multiple manual security group modifications that didn't match the code. Production had become a snowflake—undocumented, unreproducible, fragile.

Core Concepts

What Is Drift?

Drift is the gap between desired state (code) and actual state (deployed infrastructure).

Sources of drift:

  • Manual changes: SSH, console, API calls made outside version control
  • Failed automation: A deployment partially completes, leaving the system in an inconsistent state
  • Third-party modifications: An external service changes your infrastructure (e.g., auto-scaling adjusts capacity, patching systems apply updates)
  • Time-dependent changes: Certificates expire, credentials are rotated without updating code
  • Environmental effects: Network issues cause state files to be stale

Impact of drift:

  • Unreproducibility: You can't rebuild infrastructure from code because it won't match production
  • Security vulnerabilities: Security groups, firewall rules, encryption settings may diverge from secure defaults
  • Compliance violations: Audit trails can't match code to reality; regulatory audits fail
  • Incident response failure: During an incident, you can't quickly understand the actual configuration
  • Team knowledge loss: If the person who made manual changes leaves, the knowledge walks out the door

Drift Detection Strategies

Drift Detection Loop

Refresh-based detection: Running terraform plan -refresh-only (the successor to the standalone terraform refresh command) asks the cloud provider for the current attributes of every managed resource and reports where they differ from the state file. If a resource was modified outside Terraform, the refresh surfaces it. This is passive: it reports drift but changes nothing. In CI, adding -detailed-exitcode makes the command exit with status 2 when drift exists, which is easy to alert on from a scheduled job.

Policy-based detection: Run policies (OPA, Sentinel) against infrastructure to detect violations. Different from refresh—you're not checking if actual equals planned, but whether actual configuration meets compliance requirements. Example: "All databases must be encrypted. Is this database encrypted?"
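
If you would rather stay inside Terraform than adopt OPA or Sentinel, Terraform 1.5+ check blocks can express the simple cases of the same idea. A minimal sketch, assuming a live database with the identifier production-postgres (as in the practical example later on this page); a failed assertion surfaces as a warning during plan and apply:

# checks.tf - a compliance assertion evaluated on every plan and apply (Terraform >= 1.5)
check "database_encrypted" {
  # Scoped data source: reads the live database without managing it
  data "aws_db_instance" "live" {
    db_instance_identifier = "production-postgres"
  }

  assert {
    condition     = data.aws_db_instance.live.storage_encrypted
    error_message = "production-postgres is not encrypted at rest; investigate before the next deploy."
  }
}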

Inventory scanning: Regularly query cloud providers for all resources and compare to your IaC definitions. Useful for finding resources that were created manually but never added to IaC.
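
One rough approximation that stays inside Terraform is to query the Resource Groups Tagging API and flag anything missing the ManagedBy tag that default_tags applies in the practical example below. A sketch, with the caveat that it only sees resource types the tagging API covers and treats the tag as a proxy for "managed by IaC":

# inventory.tf - list resources that do not carry the Terraform management tag
data "aws_resourcegroupstaggingapi_resources" "everything" {}

output "possibly_unmanaged_resources" {
  value = [
    for r in data.aws_resourcegroupstaggingapi_resources.everything.resource_tag_mapping_list :
    r.resource_arn if lookup(r.tags, "ManagedBy", "") != "Terraform"
  ]
}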

Continuous monitoring: In cloud-native environments (Kubernetes), control planes continuously reconcile actual state to desired state. Kubernetes controllers constantly compare the cluster's actual state to the desired state expressed in YAML manifests.

Drift Prevention: Immutable Infrastructure

The strongest drift prevention is immutable infrastructure—infrastructure that never changes after deployment. To update configuration, you replace the entire resource, never modify it.

Benefits:

  • No drift by definition (you can't change things after creation)
  • Quick rollback (keep the old version running, redirect traffic)
  • Clear audit trail (every change is a new deployment)
  • Easier testing (test the exact artifact you'll deploy)

Implementation:

  • Use containers (Docker) for application servers—rebuild the image, deploy new containers
  • Use immutable machine images (AMIs, custom VM images)—changes mean building a new image and deploying new instances
  • Treat databases differently—they contain state and can't truly be immutable, but use snapshots for backup/rollback

Practical Example

Let's implement a comprehensive drift detection and prevention system.

# main.tf - Infrastructure with drift prevention

terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Remote state prevents local modifications and enables team collaboration
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
region = var.aws_region

# Prevent accidental modifications from console
default_tags {
tags = {
ManagedBy = "Terraform"
Environment = var.environment
CreatedAt = timestamp()
}
}
}

# Launch template for immutable infrastructure
resource "aws_launch_template" "api_server" {
  name_prefix   = "api-server-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  # Immutability: configuration is baked in at launch via user data;
  # to change it, publish a new template version and replace the instances
  user_data = base64encode(templatefile("${path.module}/user-data.sh", {
    app_version = var.app_version
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "api-server"
    }
  }

  # Require IMDSv2 to harden instance metadata access
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 1
  }

  monitoring {
    enabled = true
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Auto Scaling Group - immutable replacement policy
resource "aws_autoscaling_group" "api_servers" {
  name                      = "api-asg"
  vpc_zone_identifier       = var.private_subnet_ids
  target_group_arns         = [aws_lb_target_group.api.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.api_server.id
    version = "$Latest"
  }

  min_size         = 3
  max_size         = 10
  desired_capacity = 3

  # Publish group metrics so the drift alarm below can watch desired capacity
  metrics_granularity = "1Minute"
  enabled_metrics     = ["GroupDesiredCapacity", "GroupInServiceInstances"]

  # Immutable infrastructure: roll instances instead of mutating them in place
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
      instance_warmup        = 300
    }
  }

  tag {
    key                 = "Name"
    value               = "api-server"
    propagate_at_launch = true
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Database with immutability considerations
resource "aws_db_instance" "postgres" {
  identifier        = "production-postgres"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.medium"
  allocated_storage = 100

  # Encryption at rest
  storage_encrypted = true
  kms_key_id        = aws_kms_key.db.arn

  # Network isolation: private subnets only, never publicly reachable
  db_subnet_group_name = aws_db_subnet_group.private.name
  publicly_accessible  = false

  # Keep a final snapshot on deletion. Use a static name: timestamp() here
  # would produce a new plan diff on every run.
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-postgres-final"

  # Backups
  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  copy_tags_to_snapshot   = true

  # Logging
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # apply_immediately is dangerous in prod; defer changes to the maintenance window
  apply_immediately  = false
  maintenance_window = "mon:04:00-mon:05:00"

  tags = {
    Name = "production-postgres"
  }

  lifecycle {
    # Prevent accidental deletion
    prevent_destroy = true

    # Expected drift: emergency storage increases are reconciled in code later,
    # not reverted by the next apply
    ignore_changes = [
      allocated_storage
    ]
  }
}

# Data source to find the AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-minimal-*"]
  }
}

# Variables
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type = string
}

variable "app_version" {
  type        = string
  description = "Version of application to deploy"
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "vpc_id" {
  type = string
}

# KMS key for encryption
resource "aws_kms_key" "db" {
  description         = "Encryption key for RDS database"
  enable_key_rotation = true
}

resource "aws_db_subnet_group" "private" {
  name       = "private-db-subnet-group"
  subnet_ids = var.private_subnet_ids
}

resource "aws_lb_target_group" "api" {
  name     = "api-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

# CloudWatch alarm for drift detection on ASG capacity.
# CloudWatch has no "not equal" operator, so this alarm catches capacity above the
# code-defined value; pair it with a LessThanThreshold alarm to catch the other direction.
resource "aws_cloudwatch_metric_alarm" "asg_desired_mismatch" {
  alarm_name          = "api-asg-desired-mismatch"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "GroupDesiredCapacity"
  namespace           = "AWS/AutoScaling"
  period              = 300
  statistic           = "Average"
  threshold           = 3

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.api_servers.name
  }

  alarm_description = "Alert when ASG desired capacity rises above the value defined in code"
}
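
The capacity alarm above covers one resource and one attribute. To catch the motivating scenario the moment it happens, you can also alert on manual database modifications directly. A sketch using an EventBridge rule over CloudTrail management events; the SNS topic name is illustrative, and a CloudTrail trail must already be recording API calls in the region:

# Notify on any ModifyDBInstance call so it can be triaged
resource "aws_sns_topic" "drift_alerts" {
  name = "infrastructure-drift-alerts" # hypothetical topic; subscribe your on-call channel
  # NOTE: the topic also needs a policy allowing events.amazonaws.com to publish to it
}

resource "aws_cloudwatch_event_rule" "manual_rds_change" {
  name        = "manual-rds-modification"
  description = "Fires whenever ModifyDBInstance is called; triage whether it came from Terraform"

  event_pattern = jsonencode({
    source        = ["aws.rds"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["rds.amazonaws.com"]
      eventName   = ["ModifyDBInstance"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify" {
  rule = aws_cloudwatch_event_rule.manual_rds_change.name
  arn  = aws_sns_topic.drift_alerts.arn
}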

When to Use / When Not to Use

Use Drift Detection When:
  1. You use IaC (Terraform, CloudFormation, Ansible) but can't enforce exclusive IaC usage
  2. Multiple teams deploy to the same environments
  3. You have compliance or audit requirements to verify configurations
  4. You've experienced manual changes causing incidents
  5. You want to detect unintended modifications quickly
  6. You're migrating to IaC and need to identify manual infrastructure
Avoid Drift Detection When:
  1. You've already eliminated manual access (strong drift prevention)
  2. Your infrastructure is purely immutable (containers, serverless)
  3. You have a small, well-coordinated team with clear processes
  4. You lack the expertise to maintain detection scripts
  5. The overhead of drift detection exceeds the benefit

Patterns and Pitfalls


Detecting drift is useful, but preventing it is better. Use network policies to prevent SSH access to production servers. Use IAM policies to prevent manual resource creation. Make it impossible or at least very difficult to bypass IaC. Detection is a fallback; prevention is the goal.
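
On AWS, "make it impossible" usually comes down to an explicit deny. A sketch of an identity policy you might attach to human users and groups, assuming the pipeline applies Terraform under a role named terraform-ci (both the role name and the action list are illustrative, not exhaustive):

# Deny the mutating API calls people are most tempted to make by hand,
# except when the caller is the Terraform CI role
resource "aws_iam_policy" "deny_manual_changes" {
  name        = "deny-manual-infrastructure-changes"
  description = "Force infrastructure mutations through the Terraform pipeline"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyManualMutation"
        Effect = "Deny"
        Action = [
          "ec2:RunInstances",
          "ec2:ModifyInstanceAttribute",
          "ec2:AuthorizeSecurityGroupIngress",
          "ec2:RevokeSecurityGroupIngress",
          "rds:ModifyDBInstance"
        ]
        Resource = "*"
        Condition = {
          ArnNotLike = {
            "aws:PrincipalArn" = "arn:aws:iam::*:role/terraform-ci" # hypothetical CI role
          }
        }
      }
    ]
  })
}

The same statement works as an organization-level Service Control Policy if you want it to apply to every principal in the account rather than only to the identities it is attached to.
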
Not all drift is bad. If a service auto-scales from 3 to 5 instances due to traffic, that's expected drift. Your detection should ignore expected changes (auto-scaling groups, certificates that rotate, backups that are created). Update your ignore list as you learn what's normal.
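
In Terraform, the ignore list is usually a lifecycle ignore_changes argument on the resource that is expected to drift. A sketch that reuses the launch template and subnet variable from the practical example above:

# Autoscaling moves desired_capacity between min and max during normal operation;
# tell Terraform not to treat that as drift to revert
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 3
  max_size            = 10
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.api_server.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
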
Anti-pattern: Automatically fix drift without investigating why it occurred. Better: Alert on drift, investigate the root cause, fix the cause (process issue? missing policy?), then reconcile. If you just auto-fix, you'll keep fighting the same drift repeatedly.
Comparing state to reality can be slow if you have thousands of resources. Don't run drift detection on every deploy—schedule it as a periodic job (nightly or every 6 hours). Use APIs to query cloud providers in parallel to speed up detection.
Terraform state must be synchronized with reality. If state is stale (Terraform doesn't know about changes), drift detection will be inaccurate. Use remote state (S3, Terraform Cloud) with locking to keep state synchronized. Never manually edit state files.
When drift is detected, escalate appropriately. Minor drift (auto-scaling groups at higher capacity) might just be logged. Security drift (S3 bucket made public) needs immediate escalation and remediation.
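
For the security category, AWS Config managed rules give a continuous check that fires no matter how the change was made (console, CLI, or a bug in automation). A minimal sketch, assuming a Config recorder is already enabled in the account:

# Flags any S3 bucket that allows public reads, however it got that way
resource "aws_config_config_rule" "no_public_buckets" {
  name = "s3-bucket-public-read-prohibited"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }
}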

Design Review Checklist

  • Can manual infrastructure changes be prevented (SSH access restricted, console access limited)?
  • Is drift detection implemented and regularly executed?
  • Are drift detection results reviewed and acted upon?
  • Is Terraform or IaC the source of truth for your infrastructure?
  • Is state stored remotely with encryption and access controls?
  • Can you recreate any environment from IaC without manual steps?
  • Are unmanaged resources identified and brought under IaC control?
  • Is drift reconciliation a documented, tested process?
  • Do you monitor for and alert on security-related drift?
  • Are expected changes (auto-scaling, rotation) excluded from drift detection?
  • Do all teams understand the drift prevention policy?
  • Is there a root cause analysis process when drift is discovered?

Self-Check Questions

  1. Manual Access: Can you prevent SSH access to production servers? Can you prevent console access to infrastructure creation?
  2. Drift Detection: How often is drift detected in your infrastructure? What's the average time to remediate?
  3. State Management: Where is your Terraform state stored? Is it encrypted? Is it locked?
  4. Audit Trail: Can you see who made manual changes and when? Can you trace all infrastructure changes to code commits?
  5. Immutability: What percentage of your infrastructure is immutable (can't be changed in-place)?

Next Steps

  1. Implement Remote State: Move to remote, locked state storage (Terraform Cloud, S3 with locking).
  2. Enable Access Controls: Restrict manual changes through IAM, network policies, and process.
  3. Set Up Detection: Schedule terraform plan -refresh-only (or terraform refresh on older versions) regularly and compare state to actual resources.
  4. Create Runbooks: Document how to investigate and reconcile drift.
  5. Measure Drift: Track how often drift is detected, root causes, and time to remediation.
  6. Move to Immutability: Gradually migrate to immutable infrastructure patterns (containers, AMIs).

References

  1. Terraform State Management
  2. Terraform Cloud
  3. Humble, J., & Farley, D. (2010). Continuous Delivery. Addison-Wesley.
  4. AWS Infrastructure & Automation Blog
  5. Newman, S. (2015). Building Microservices. O'Reilly Media.