
Incident Response and Forensics

Investigation and Recovery from Security Incidents

TL;DR

Incident response protects time, money, and reputation by enabling rapid detection, containment, and recovery. Forensics preserves evidence for investigation and legal proceedings. Key phases: Preparation → Detection/Analysis → Containment → Eradication → Recovery → Post-Incident Review. Automation and pre-planning cut response time from hours to minutes.

Learning Objectives

  • Design incident response teams and processes
  • Develop an incident classification and severity framework
  • Implement detection and alerting for common attacks
  • Plan containment and eradication strategies
  • Preserve forensic evidence for investigation
  • Conduct effective post-incident reviews

Core Concepts

Incident Response Phases

1. Preparation:

  • Define roles and responsibilities
  • Develop playbooks for common incident types
  • Test procedures (tabletop exercises)
  • Pre-provision tools and access

2. Detection and Analysis:

  • SIEM alerts and monitoring
  • Alert triage and severity assessment
  • Initial containment decision (should the affected system be isolated?)
  • Notification to incident commander

3. Containment:

  • Short-term: Stop spread (isolate affected systems)
  • Long-term: Prepare systems for eradication
  • Collect forensic evidence
  • Maintain access for investigation

4. Eradication:

  • Remove attacker access (change passwords, revoke tokens)
  • Patch vulnerabilities that enabled attack
  • Remove malware/backdoors
  • Close exploitation vector

5. Recovery:

  • Restore from clean backups
  • Rebuild compromised systems
  • Verify no backdoors remain
  • Gradual reconnection to network
  • Monitor for re-compromise

6. Post-Incident Review:

  • Timeline reconstruction
  • Root cause analysis
  • Identify lessons learned
  • Update detection rules and playbooks
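
To connect these phases to the timeline metrics discussed later (MTTD, MTTR), the sketch below shows one way to record when an incident enters each phase. The Incident and Phase names and the overall structure are illustrative assumptions, not a prescribed design.

from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    PREPARATION = "preparation"
    DETECTION_ANALYSIS = "detection_analysis"
    CONTAINMENT = "containment"
    ERADICATION = "eradication"
    RECOVERY = "recovery"
    POST_INCIDENT_REVIEW = "post_incident_review"


class Incident:
    """Records when each response phase began, so the timeline and MTTD/MTTR can be reconstructed."""

    def __init__(self, incident_id: str, severity: str):
        self.incident_id = incident_id
        self.severity = severity
        self.phase_started_at: dict[Phase, datetime] = {}

    def enter_phase(self, phase: Phase) -> None:
        # Record the transition time; real tooling would also notify the incident commander
        self.phase_started_at[phase] = datetime.now(timezone.utc)


incident = Incident("INC-2025-0042", severity="critical")
incident.enter_phase(Phase.DETECTION_ANALYSIS)
incident.enter_phase(Phase.CONTAINMENT)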

Incident Severity Classification

Severity | Impact | Response Time | Example
---------|--------|---------------|--------
Critical | Full service down, data breach imminent | < 15 min | Ransomware on file servers
High | Significant service impact, breach confirmed | < 1 hour | Attacker active on critical system
Medium | Limited impact, contained | < 4 hours | Phishing with credentials obtained
Low | Minimal impact, isolated | < 24 hours | Failed login attempts, misconfig detected
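
As a sketch of how the table above might be encoded (the names and the choice to express SLAs as timedeltas are assumptions, not a prescribed schema), a severity-to-SLA lookup can drive paging and deadline timers:

from datetime import datetime, timedelta

# Response-time SLAs keyed by severity, mirroring the table above
SEVERITY_SLA = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(detected_at: datetime, severity: str) -> datetime:
    """Return the latest acceptable time for the initial response."""
    return detected_at + SEVERITY_SLA[severity.lower()]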

Forensics Preservation

Chain of Custody:

  • Document what was captured, by whom, when
  • Cryptographic hashing (SHA-256) for files
  • Maintain secure storage of evidence
  • Restrict access (prevents contamination)

Evidence Collection:

  • Memory dumps before power off
  • Disk images (bit-for-bit copies)
  • Log files and application data
  • Network traffic captures (PCAP)
  • Timing and sequence of events
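
As a minimal sketch of these two practices together (the record layout and the log file name are assumptions, not a mandated format), each collected artifact can be hashed with SHA-256 and logged with collector, time, and reason before it enters evidence storage:

import hashlib
import json
from datetime import datetime, timezone


def record_custody(artifact_path: str, collector: str, reason: str) -> dict:
    """Hash an evidence file and produce a chain-of-custody entry."""
    sha256 = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)

    entry = {
        "item": artifact_path,
        "sha256": sha256.hexdigest(),
        "collected_by": collector,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    # Append-only custody log; in practice this lives in restricted, audited storage
    with open("chain_of_custody.jsonl", "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry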

Practical Example

Incident Response Playbook

IncidentType: Ransomware Detection

Severity: Critical
SLA: 15 minutes to initial containment

Detection:
  Triggers:
    - Multiple file extensions changing rapidly
    - Encryption library activity detected
    - Backup deletion attempts
  Alert: P1-Ransomware-Detected

InitialResponse:
  Incident Commander: On-call CISO
  Team:
    - Security Engineer (evidence collection)
    - Infrastructure Engineer (isolation)
    - Forensics Analyst (investigation)
    - Communications (notification)
  Actions:
    1. Verify alert authenticity (false positive check)
    2. Isolate affected system from network
    3. Preserve memory dump for forensics
    4. Create disk snapshot
    5. Notify backup team (prevent auto-sync)
    6. Document timeline

Containment:
  - Isolate subnet/VLAN
  - Disable affected user account
  - Revoke session tokens
  - Block outbound traffic to C2 domains
  - Scan for lateral movement

Eradication:
  - Identify attacker entry point (RDP, VPN, etc.)
  - Patch vulnerability or disable service
  - Remove malware/ransomware samples
  - Audit admin accounts for backdoors
  - Verify with EDR/SIEM

Recovery:
  - Restore from clean, pre-infection backup
  - Rebuild to baseline configuration
  - Monitor for re-infection
  - Gradual service restoration
  - Verify integrity

PostIncident:
  - Full forensic analysis
  - Timeline and attack chain
  - Root cause (phishing, unpatched server, etc.)
  - Update detection rules
  - Update playbook
  - Communication to stakeholders
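
The "multiple file extensions changing rapidly" trigger in the playbook above can be approximated with a simple rate check over file-rename events. This is a sketch: the event feed would come from EDR or file-audit logs in practice, and the extensions, window, and threshold shown are illustrative values.

from collections import deque
from datetime import datetime, timedelta

SUSPICIOUS_EXTENSIONS = (".encrypted", ".locked", ".crypt")
WINDOW = timedelta(seconds=60)
THRESHOLD = 100  # renames per window before alerting (assumed value)

_recent_renames: deque = deque()


def on_file_renamed(new_name: str, timestamp: datetime) -> bool:
    """Return True (fire P1-Ransomware-Detected) when the rename rate looks like mass encryption."""
    if not new_name.endswith(SUSPICIOUS_EXTENSIONS):
        return False
    _recent_renames.append(timestamp)
    # Drop events outside the sliding window
    while _recent_renames and timestamp - _recent_renames[0] > WINDOW:
        _recent_renames.popleft()
    return len(_recent_renames) >= THRESHOLD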

Forensics Procedures

Memory Collection

# Linux: raw memory acquisition (on modern kernels /dev/mem access is restricted;
# tools such as LiME or AVML are typically used instead, and the resulting image
# is analyzed later with the Volatility framework)
dd if=/dev/mem of=/mnt/usb/memory.dump bs=1M

# Windows: DumpIt (or a similar memory acquisition tool)
dumpit.exe memory.dump

# Note: collect memory before shutting down; volatile data is lost on power off

Disk Imaging

# Create forensic image with hashing
dcfldd if=/dev/sda1 of=/mnt/external/disk.img hash=sha256 hashwindow=256M

Evidence Log

Chain of Custody:
Item: Memory dump from server-prod-01
Collected: 2025-02-14 10:45 UTC
Collected by: John Analyst (ID: john.analyst@company.com)
Reason: Ransomware incident response
Storage: Secure evidence vault (encryption + MFA)
Access: John Analyst, Jane CISO, Legal counsel
Integrity: SHA256: a3f4e2b1c9d8e7f6a5b4c3d2e1f0a9b8... (truncated)
Hash verified: 2025-02-14 10:46 UTC
Status: Preserved for investigation

Best Practices and Common Mistakes

Incident Response Best Practices

  1. Clear incident classification and severity levels
  2. Written playbooks for common incident types
  3. Regular tabletop exercises and drills
  4. Immediate notification to incident commander
  5. Preservation of forensic evidence from start
  6. Post-incident review and lessons learned
  7. Metrics (MTTD, MTTR) tracked
  8. Communication plan for stakeholders

Common Mistakes

  1. No incident classification framework
  2. Manual, ad-hoc response procedures
  3. No tabletop exercises or testing
  4. Delayed incident notification
  5. Destroying evidence during response
  6. No post-incident review
  7. No metrics or improvement tracking
  8. Silent breach (no external communication)

Design Review Checklist

  • Incident response team defined and trained?
  • On-call rotation established?
  • Playbooks for common incident types?
  • Forensic tools and storage ready?
  • Alerting configured for critical events?
  • Alert response time SLAs defined?
  • Incident classification framework documented?
  • Escalation procedures clear?
  • Forensic evidence collection procedures?
  • Chain of custody documented?
  • Evidence stored securely?
  • Legal/compliance requirements understood?
  • Post-incident review process established?
  • Lessons learned documented?
  • Detection rules updated based on findings?
  • Response metrics tracked (MTTD, MTTR)?

Complete Incident Response Examples

Example 1: Data Breach Response (Real Timeline)

2025-02-14 10:45 UTC: Anomaly Detected
SIEM Alert: Unusual database query pattern
- Selecting millions of rows (normally selects 1000s)
- From sensitive tables (customers, transactions)
- By service account (not normal)

Alert severity: HIGH
Incident commander paged

2025-02-14 10:48 UTC: Initial Investigation (3 min)
Security engineer checks:
- Query source: IP 192.168.1.50 (internal)
- Service: Reports API (should only select aggregated data)
- Query: "SELECT * FROM customers" (no WHERE clause!)
- Duration: Running for 15 minutes already

Verdict: Likely data exfiltration
Action: IMMEDIATELY isolate affected system

2025-02-14 10:50 UTC: Containment (5 min)
Infrastructure engineer:
- Kills database connections from Reports API
- Revokes API credentials
- Isolates Reports API servers (network rules)
- Stops running queries

Security engineer:
- Reviews database activity logs (how much data accessed?)
- Begins forensics data collection (memory dump, disk snapshot)
- Checks for lateral movement (other compromised systems?)

2025-02-14 11:15 UTC: Forensic Analysis (25 min)
Discovery: Reports API container had unpatched OpenSSL bug
- Attacker exploited to gain shell access
- Attacker created backdoor (cron job running malicious script)
- Attacker accessed database with stolen API key

Data breach: Customer names, emails, phone numbers
Estimate: 500,000 records accessed (but not all exfiltrated)

2025-02-14 11:30 UTC: Eradication (45 min)
- Delete backdoor cron job
- Patch OpenSSL vulnerability (rebuild container)
- Scan all containers for similar exploits
- Revoke all API keys (need to rotate)
- Audit file system for suspicious changes

2025-02-14 13:00 UTC: Recovery (2 hours 15 min)
- Rebuild Reports API from clean backup (pre-compromise)
- Restore API keys (new, secure)
- Monitor for signs of re-compromise
- Test functionality (ensure no data loss)

2025-02-14 14:00 UTC: Notification (3 hours)
Legal team determines: Breach notification law triggered
- Notify 500K affected customers (email, breach notification letter)
- Notify regulators (depending on jurisdiction)
- Notify press (if required by law)

Message: "We discovered unauthorized access to customer emails/phones.
Passwords were NOT accessed. We've patched the vulnerability.
Please monitor your email for scams."

2025-02-21 (1 week): Post-Incident Review
- Timeline confirmed
- Root cause: Unpatched OpenSSL + lack of network segmentation
- Lessons learned:
* Implement container image scanning (CVE checking)
* Network segmentation (DB access only from specific IPs)
* Database activity monitoring (unusual queries alerted)
* Credential rotation policies (invalidate old API keys)
- Action items assigned with owners and deadlines

Metrics:
- MTTD (Mean Time To Detect): 15 minutes (from start of malicious query to alert)
- Time to contain: 5 minutes (from alert to isolation of the affected system)
- MTTR (Mean Time To Recover): ~2 hours 15 minutes (from alert to service restored)
- Data exposed: 500K records (scope identified within 2 hours)
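
Detection in this timeline hinged on a query returning vastly more rows than the service's baseline. A sketch of such a check (the table names, baseline values, and anomaly factor are assumptions) could look like:

SENSITIVE_TABLES = {"customers", "transactions"}
BASELINE_ROWS = {"reports-api": 5_000}   # typical rows returned per query (assumed)
ANOMALY_FACTOR = 100                     # alert when 100x over baseline (assumed)


def is_anomalous_query(service: str, table: str, rows_returned: int) -> bool:
    """Flag sensitive-table reads that dwarf the service's normal result size."""
    if table not in SENSITIVE_TABLES:
        return False
    baseline = BASELINE_ROWS.get(service)
    if baseline is None:
        return True  # an unknown service touching sensitive data is itself suspicious
    return rows_returned > baseline * ANOMALY_FACTOR


# Example: the Reports API normally returns thousands of rows, not millions
assert is_anomalous_query("reports-api", "customers", 2_000_000)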

Example 2: Ransomware Attack Response

Timeline:

T+0 (8 AM Monday):
Alert: High volume of file modifications detected
Files being renamed with .encrypted extension
Verdict: Ransomware confirmed

Immediate Actions:
- Page incident commander (critical)
- Isolate affected systems (network isolation)
- Preserve memory dump (volatile data)
- Create disk snapshot (forensics)

T+5 min:
Investigation: Ransomware type?
- File signatures suggest: ALPHV (known ransomware gang)
- Ransom note appears: "Pay $2M in Bitcoin or data deleted"

Assessment:
- Affected: Finance server, customer database backups (online copies)
- Not affected: Production servers (different subnet)
- Damage: Online backups encrypted (cannot restore from them)
- Options: Pay ransom (NO!), rebuild from offline backups, notify law enforcement

T+30 min:
Forensics reveals entry point: RDP service (port 3389)
- Weak password on finance admin account
- No MFA
- Attacker brute-forced credentials

Attacker path:
RDP login → Admin shell → Disable antivirus
→ Deploy ransomware → Encrypt all accessible files
→ Display ransom note → Exit

T+1 hour:
Eradication:
- Kill all unauthorized processes
- Patch RDP service (disable weak auth)
- Re-enable antivirus (remove disable commands)
- Force password reset (all domain accounts)
- Enable MFA (all admins)
- Enable network segmentation (finance isolated)

T+4 hours:
Recovery:
- Restore from offline backup (that wasn't encrypted)
- Finance data from 24 hours ago (some work lost)
- Monitor for re-compromise (unusual logins)

T+24 hours:
Notification:
- No evidence customer data accessed
- But transparency: explain what happened, what we're doing
- Offer credit monitoring anyway (goodwill)
- Law enforcement notified

T+2 weeks:
Post-incident review:
- Why no MFA? (Should be mandatory for admins)
- Why a weak password? (Password policy insufficient)
- Why was RDP exposed to the internet? (Should be VPN-only)
- Why no network segmentation? (Finance should be isolated)

Fixes:
- Enforce MFA for all admins (immediate)
- Password policy: 16 chars, complexity (immediate)
- VPN for RDP access only (immediate)
- Network segmentation (2 weeks)
- Backup testing (monthly)
- Incident response drills (quarterly)

Result: Expensive but survived. Lessons learned. No customer data lost.
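
The entry point here was a brute-forced RDP login on an account without MFA. A simple heuristic, sketched below with assumed thresholds, flags a successful login that follows a burst of failures; in practice the events would come from the SIEM (for Windows, failed/successful logon events 4625/4624).

from collections import defaultdict
from datetime import datetime, timedelta

FAILURE_THRESHOLD = 20   # failed logins in the window before a success is suspicious (assumed)
WINDOW = timedelta(minutes=10)

_failures = defaultdict(list)


def on_login_event(account: str, success: bool, timestamp: datetime) -> bool:
    """Return True when a successful login follows a burst of failures (likely brute force)."""
    recent = [t for t in _failures[account] if timestamp - t <= WINDOW]
    _failures[account] = recent
    if not success:
        recent.append(timestamp)
        return False
    return len(recent) >= FAILURE_THRESHOLD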

Forensics Best Practices

Collecting Evidence Without Contaminating

# ❌ BAD: Investigating directly on the live original system
ls -la /var/log/*             # Lists directories, updating their access times
cat /var/log/auth.log         # Updates the file's access time (atime)
grep "attacker" /var/log/*    # Updates atime on every file it reads

# ✅ GOOD: Preserving evidence chain
# Step 1: Create bit-for-bit copy (before touching original)
dcfldd if=/dev/sda1 of=/mnt/usb/disk.img hash=sha256

# Step 2: Hash original (for integrity verification)
sha256sum /dev/sda1 > /mnt/usb/original.hash

# Step 3: Verify copy matches
sha256sum /mnt/usb/disk.img
# Should match original.hash

# Step 4: Only analyze the copy (not original)
# Mount copy as read-only
mount -o ro,loop /mnt/usb/disk.img /mnt/forensics

# Step 5: Document chain of custody
# Who accessed it, when, what tools, what was found

Evidence Preservation Checklist

import hashlib
from datetime import datetime


class EvidencePreservation:
    # Helper methods and the evidence_vault client referenced below are assumed to be
    # provided by the surrounding incident-response tooling.

    def collect_evidence(self, incident_id: str):
        """Properly preserve evidence for investigation."""
        evidence = {
            'volatile_data': [],
            'disk_images': [],
            'network_captures': [],
            'logs': [],
        }

        # Memory (volatile, lost on reboot)
        if self.should_preserve_memory():
            memory = self.memory_dump(process='all')
            evidence['volatile_data'].append(self.hash_and_store(memory))

        # Network traffic (PCAP)
        pcap = self.capture_network_traffic()
        evidence['network_captures'].append(self.hash_and_store(pcap))

        # Disk (if needed, power off first to prevent changes)
        if self.should_image_disk():
            disk_image = self.create_disk_image('/dev/sda1')
            evidence['disk_images'].append(self.hash_and_store(disk_image))

        # Logs (application, system, audit)
        logs = self.collect_logs()
        evidence['logs'].append(self.hash_and_store(logs))

        # Chain of custody documentation
        self.document_chain_of_custody(
            incident_id=incident_id,
            items=evidence,
            collector='Security Team',
            timestamp=datetime.now(),
            storage_location='Evidence Vault (HSM encrypted)',
            access_restrictions='CISO, Legal, Forensics Analyst',
        )

        return evidence

    def hash_and_store(self, evidence_item: bytes):
        """Hash evidence, store securely, document."""
        sha256_hash = hashlib.sha256(evidence_item).hexdigest()

        # Store in secure evidence vault
        self.evidence_vault.store(
            data=evidence_item,
            hash=sha256_hash,
            encrypted=True,   # Encrypt at rest in the vault
            access_log=True,  # Log every access to the stored item
        )

        return {
            'item': evidence_item,
            'hash': sha256_hash,
            'stored_at': datetime.now(),
            'location': 'Secure Evidence Vault',
        }

Metrics for Measuring Response Effectiveness

class IncidentMetrics:
    def calculate_response_metrics(self, incident):
        """Measure incident response effectiveness."""

        # Time metrics
        metrics = {
            'MTTD': incident.detected_at - incident.occurred_at,
            'MTTR': incident.resolved_at - incident.detected_at,
            'MTPT': incident.patched_at - incident.occurred_at,  # Patch time
        }

        # Impact metrics
        metrics.update({
            'customers_affected': incident.affected_users,
            'data_exposed': incident.records_exposed,
            'downtime_minutes': (incident.resolved_at - incident.started_at).total_seconds() / 60,
            'financial_impact': incident.incident_cost,  # Cost of incident + response
        })

        # Response quality
        metrics.update({
            'playbook_used': incident.used_predefined_playbook,
            'all_steps_followed': incident.followed_procedure,
            'escalated_appropriately': incident.escalation_correct,
            'communication_timely': incident.notified_within_sla,
        })

        # Lessons learned
        metrics.update({
            'root_cause_identified': incident.root_cause is not None,
            'preventive_actions_identified': len(incident.preventive_actions) > 0,
            'action_items_assigned': all(a.owner for a in incident.action_items),
        })

        return metrics


# Target metrics (best in class):
# MTTD: < 1 hour (detect within 1 hour)
# MTTR: < 4 hours (resolve within 4 hours)
# MTPT: < 24 hours (patch within 1 day)
# Customer notification: Within 24-72 hours
# Post-incident review: Within 2 weeks

One Takeaway

Incident response preparation saves hours during actual incidents. Playbooks, training, and tools pre-positioned reduce MTTR from days to hours. Post-incident reviews drive continuous improvement. The goal is not zero incidents (impossible) but rapid detection, containment, and recovery.

References