Incident Response and Forensics
Investigation and Recovery from Security Incidents
TL;DR
Incident response saves time, money, and reputation by enabling rapid detection, containment, and recovery. Forensics preserves evidence for investigation and legal proceedings. Key phases: Preparation → Detection/Analysis → Containment → Eradication → Recovery → Post-Incident Review. Automation and pre-planning reduce response time from hours to minutes.
Learning Objectives
- Design incident response teams and processes
- Develop incident classification and severity framework
- Implement detection and alerting for common attacks
- Plan containment and eradication strategies
- Preserve forensic evidence for investigation
- Conduct effective post-incident reviews
Core Concepts
Incident Response Phases
1. Preparation:
- Define roles and responsibilities
- Develop playbooks for common incident types
- Test procedures (tabletop exercises)
- Pre-provision tools and access
2. Detection and Analysis:
- SIEM alerts and monitoring
- Alert triage and severity assessment
- Initial containment (isolate affected system?)
- Notification to incident commander
3. Containment:
- Short-term: Stop spread (isolate affected systems)
- Long-term: Prepare systems for eradication
- Collect forensic evidence
- Maintain access for investigation
4. Eradication:
- Remove attacker access (change passwords, revoke tokens)
- Patch vulnerabilities that enabled attack
- Remove malware/backdoors
- Close exploitation vector
5. Recovery:
- Restore from clean backups
- Rebuild compromised systems
- Verify no backdoors remain
- Gradual reconnection to network
- Monitor for re-compromise
6. Post-Incident Review:
- Timeline reconstruction
- Root cause analysis
- Identify lessons learned
- Update detection rules and playbooks
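The phase progression above can be encoded directly so tooling (ticketing, chatops bots) refuses to skip steps. The state-machine sketch below is purely illustrative: the phase names mirror the list, while the transition map and helper function are assumptions rather than a prescribed implementation.
from enum import Enum, auto

class Phase(Enum):
    PREPARATION = auto()
    DETECTION_ANALYSIS = auto()
    CONTAINMENT = auto()
    ERADICATION = auto()
    RECOVERY = auto()
    POST_INCIDENT_REVIEW = auto()

# Allowed forward transitions; recovery can fall back to containment on re-compromise
ALLOWED = {
    Phase.PREPARATION: {Phase.DETECTION_ANALYSIS},
    Phase.DETECTION_ANALYSIS: {Phase.CONTAINMENT},
    Phase.CONTAINMENT: {Phase.ERADICATION},
    Phase.ERADICATION: {Phase.RECOVERY},
    Phase.RECOVERY: {Phase.POST_INCIDENT_REVIEW, Phase.CONTAINMENT},
    Phase.POST_INCIDENT_REVIEW: set(),
}

def advance(current: Phase, target: Phase) -> Phase:
    """Move an incident to the next phase, rejecting skipped steps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.name} to {target.name}")
    return target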
Incident Severity Classification
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| Critical | Full service down, data breach imminent | < 15 min | Ransomware on file servers |
| High | Significant service impact, breach confirmed | < 1 hour | Attacker active on critical system |
| Medium | Limited impact, contained | < 4 hours | Phishing with credentials obtained |
| Low | Minimal impact, isolated | < 24 hours | Failed login attempts, misconfig detected |
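To make the severity table actionable, the SLA column can be encoded so alert routing computes paging deadlines automatically. The sketch below assumes only what the table states (severity names and response times); the Incident dataclass and helper function are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Response-time SLAs from the table above (minutes to initial response)
SEVERITY_SLA_MINUTES = {
    "critical": 15,
    "high": 60,
    "medium": 240,
    "low": 1440,
}

@dataclass
class Incident:
    title: str
    severity: str
    detected_at: datetime

def response_deadline(incident: Incident) -> datetime:
    """Return the deadline by which the incident commander must be engaged."""
    sla = SEVERITY_SLA_MINUTES[incident.severity.lower()]
    return incident.detected_at + timedelta(minutes=sla)

# Example: a critical ransomware alert must get a response within 15 minutes
inc = Incident("Ransomware on file servers", "critical", datetime.now(timezone.utc))
print(response_deadline(inc))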
Forensics Preservation
Chain of Custody:
- Document what was captured, by whom, when
- Cryptographic hashing (SHA-256) for files
- Maintain secure storage of evidence
- Restrict access (prevents contamination)
Evidence Collection:
- Memory dumps before power off
- Disk images (bit-for-bit copies)
- Log files and application data
- Network traffic captures (PCAP)
- Timing and sequence of events
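The hashing requirement above is typically satisfied by computing a SHA-256 digest at collection time and recording it alongside the collector and timestamp. The helper below is a minimal standard-library sketch; the evidence path and record fields are illustrative, not a mandated format.
import hashlib
from datetime import datetime, timezone

def hash_evidence(path: str, collector: str, reason: str) -> dict:
    """Compute SHA-256 of an evidence file and return a chain-of-custody record."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):  # stream in 1 MiB chunks
            digest.update(chunk)
    return {
        "item": path,
        "sha256": digest.hexdigest(),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "collected_by": collector,
        "reason": reason,
    }

# Example (hypothetical evidence path):
# hash_evidence("/mnt/usb/memory.dump", "john.analyst@company.com", "Ransomware incident response")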
Practical Example
Incident Response Playbook
IncidentType: Ransomware Detection
Severity: Critical
SLA: 15 minutes to initial containment
Detection:
Triggers:
- Multiple file extensions changing rapidly
- Encryption library activity detected
- Backup deletion attempts
Alert: P1-Ransomware-Detected
InitialResponse:
Incident Commander: On-call CISO
Team:
- Security Engineer (evidence collection)
- Infrastructure Engineer (isolation)
- Forensics Analyst (investigation)
- Communications (notification)
Actions:
1. Verify alert authenticity (false positive check)
2. Isolate affected system from network
3. Preserve memory dump for forensics
4. Create disk snapshot
5. Notify backup team (prevent auto-sync)
6. Document timeline
Containment:
- Isolate the affected subnet/VLAN
- Disable affected user accounts
- Revoke session tokens
- Block outbound traffic to known C2 domains
- Scan for lateral movement
Eradication:
- Identify attacker entry point (RDP, VPN, etc.)
- Patch vulnerability or disable service
- Remove malware/ransomware samples
- Audit admin accounts for backdoors
- Verify with EDR/SIEM
Recovery:
- Restore from clean, pre-infection backup
- Rebuild to baseline configuration
- Monitor for re-infection
- Gradual service restoration
- Verify integrity
PostIncident:
- Full forensic analysis
- Timeline and attack chain
- Root cause (phishing, unpatched server, etc.)
- Update detection rules
- Update playbook
- Communication to stakeholders
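The playbook's first detection trigger (multiple file extensions changing rapidly) can be approximated with a simple rate heuristic over recent file modifications. The sketch below is illustrative only: the extension list, threshold, window, and scanned path are assumptions, and in production this signal would normally come from EDR or file-server telemetry rather than periodic polling.
import os
import time

SUSPICIOUS_EXTENSIONS = {".encrypted", ".locked", ".crypt"}  # illustrative list
WINDOW_SECONDS = 300        # look at the last 5 minutes
ALERT_THRESHOLD = 50        # more than 50 matching files in the window -> alert

def ransomware_indicator(root: str) -> bool:
    """Return True if many recently modified files carry ransomware-style extensions."""
    cutoff = time.time() - WINDOW_SECONDS
    hits = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            _, ext = os.path.splitext(name)
            try:
                recent = os.stat(path).st_mtime >= cutoff
            except OSError:
                continue  # file disappeared mid-scan
            if recent and ext.lower() in SUSPICIOUS_EXTENSIONS:
                hits += 1
                if hits > ALERT_THRESHOLD:
                    return True
    return False

if ransomware_indicator("/srv/fileshare"):  # hypothetical path
    print("P1-Ransomware-Detected: page the incident commander")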
Forensics Procedures
Memory Collection
# Linux: raw acquisition (modern kernels restrict /dev/mem; tools such as LiME or AVML are typically used instead)
dd if=/dev/mem of=/mnt/usb/memory.dump bs=1M
# Analyze the resulting dump offline with a framework such as Volatility
# Windows: acquire with DumpIt (output filename and flags vary by version)
DumpIt.exe
# Note: Collect memory before shutting down (volatile data is lost on power-off)
Disk Imaging
# Create forensic image with hashing
dcfldd if=/dev/sda1 of=/mnt/external/disk.img hash=sha256 hashwindow=256M
Evidence Log
Chain of Custody:
Item: Memory dump from server-prod-01
Collected: 2025-02-14 10:45 UTC
Collected by: John Analyst (ID: john.analyst@company.com)
Reason: Ransomware incident response
Storage: Secure evidence vault (encryption + MFA)
Access: John Analyst, Jane CISO, Legal counsel
Integrity: SHA-256: a3f4e2b1c9d8e7f6a5b4c3d2e1f0a9b8... (truncated example digest)
Hash verified: 2025-02-14 10:46 UTC
Status: Preserved for investigation
When to Use / When Not to Use
Healthy practices:
- Clear incident classification and severity levels
- Written playbooks for common incident types
- Regular tabletop exercises and drills
- Immediate notification to incident commander
- Preservation of forensic evidence from start
- Post-incident review and lessons learned
- Metrics (MTTD, MTTR) tracked
- Communication plan for stakeholders
Anti-patterns:
- No incident classification framework
- Manual, ad-hoc response procedures
- No tabletop exercises or testing
- Delayed incident notification
- Destroying evidence during response
- No post-incident review
- No metrics or improvement tracking
- Silent breach (no external communication)
Design Review Checklist
- Incident response team defined and trained?
- On-call rotation established?
- Playbooks for common incident types?
- Forensic tools and storage ready?
- Alerting configured for critical events?
- Alert response time SLAs defined?
- Incident classification framework documented?
- Escalation procedures clear?
- Forensic evidence collection procedures?
- Chain of custody documented?
- Evidence stored securely?
- Legal/compliance requirements understood?
- Post-incident review process established?
- Lessons learned documented?
- Detection rules updated based on findings?
- Response metrics tracked (MTTD, MTTR)?
Complete Incident Response Examples
Example 1: Data Breach Response (Real Timeline)
2025-02-14 10:45 UTC: Anomaly Detected
SIEM Alert: Unusual database query pattern
- Selecting millions of rows (normally selects 1000s)
- From sensitive tables (customers, transactions)
- Issued by a service account (outside its normal query pattern)
Alert severity: HIGH
Incident commander paged
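The alert that opened this timeline boils down to a baseline comparison: flag any service account whose rows-returned count is far above its historical norm. The function below sketches that heuristic; the field names, the 10x multiplier, and the fallback cap are assumptions, and a real deployment would run inside the SIEM or a database activity monitor.
from statistics import median

def unusual_query_volume(account: str, rows_returned: int,
                         history: dict[str, list[int]],
                         multiplier: int = 10) -> bool:
    """Flag a query whose result size dwarfs the account's historical median."""
    baseline = history.get(account, [])
    if not baseline:
        return rows_returned > 100_000  # no history: fall back to an absolute cap
    return rows_returned > multiplier * median(baseline)

# Example: the Reports API normally returns a few thousand rows per query
history = {"svc-reports-api": [1_200, 3_400, 2_100, 900]}
if unusual_query_volume("svc-reports-api", 5_000_000, history):
    print("HIGH: unusual database query pattern - page the incident commander")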
2025-02-14 10:48 UTC: Initial Investigation (3 min)
Security engineer checks:
- Query source: IP 192.168.1.50 (internal)
- Service: Reports API (should only select aggregated data)
- Query: "SELECT * FROM customers" (no WHERE clause!)
- Duration: Running for 15 minutes already
Verdict: Likely data exfiltration
Action: IMMEDIATELY isolate affected system
2025-02-14 10:50 UTC: Containment (5 min)
Infrastructure engineer:
- Kills database connections from Reports API
- Revokes API credentials
- Isolates Reports API servers (network rules)
- Stops running queries
Security engineer:
- Reviews database activity logs (how much data accessed?)
- Begins forensics data collection (memory dump, disk snapshot)
- Checks for lateral movement (other compromised systems?)
2025-02-14 11:15 UTC: Forensic Analysis (25 min)
Discovery: Reports API container had unpatched OpenSSL bug
- Attacker exploited to gain shell access
- Attacker created backdoor (cron job running malicious script)
- Attacker accessed database with stolen API key
Data breach: Customer names, emails, phone numbers
Estimate: 500,000 records accessed (but not all exfiltrated)
2025-02-14 11:30 UTC: Eradication (45 min)
- Delete backdoor cron job
- Patch OpenSSL vulnerability (rebuild container)
- Scan all containers for similar exploits
- Revoke all API keys (need to rotate)
- Audit file system for suspicious changes
2025-02-14 13:00 UTC: Recovery (2 hours 15 min)
- Rebuild Reports API from clean backup (pre-compromise)
- Restore API keys (new, secure)
- Monitor for signs of re-compromise
- Test functionality (ensure no data loss)
2025-02-14 14:00 UTC: Notification (3 hours)
Legal team determines: Breach notification law triggered
- Notify 500K affected customers (email, breach notification letter)
- Notify regulators (depending on jurisdiction)
- Notify press (if required by law)
Message: "We discovered unauthorized access to customer emails/phones.
Passwords were NOT accessed. We've patched the vulnerability.
Please monitor your email for scams."
2025-02-21 (1 week): Post-Incident Review
- Timeline confirmed
- Root cause: Unpatched OpenSSL + lack of network segmentation
- Lessons learned:
* Implement container image scanning (CVE checking)
* Network segmentation (DB access only from specific IPs)
* Database activity monitoring (unusual queries alerted)
* Credential rotation policies (invalidate old API keys)
- Action items assigned with owners and deadlines
Metrics:
- MTTD (Mean Time To Detect): 15 minutes (alert triggered)
- MTTR (Mean Time To Respond): 5 minutes (system isolated)
- Data Exposed: 500K records (identified within 2 hours)
Example 2: Ransomware Attack Response
Timeline:
T+0 (8 AM Monday):
Alert: High volume of file modifications detected
Files being renamed with .encrypted extension
Verdict: Ransomware confirmed
Immediate Actions:
- Page incident commander (critical)
- Isolate affected systems (network isolation)
- Preserve memory dump (volatile data)
- Create disk snapshot (forensics)
T+5 min:
Investigation: Ransomware type?
- File signatures suggest: ALPHV (known ransomware gang)
- Ransom note appears: "Pay $2M in Bitcoin or data deleted"
Assessment:
- Affected: Finance server, customer database backups
- Not affected: Production servers (different subnet)
- Damage: Online backups encrypted (can't restore from them)
- Options: Pay ransom (NO!), rebuild from different backups, law enforcement
T+30 min:
Forensics reveals entry point: RDP service (port 3389)
- Weak password on finance admin account
- No MFA
- Attacker brute-forced credentials
Attacker path:
RDP login → Admin shell → Disable antivirus
→ Deploy ransomware → Encrypt all accessible files
→ Display ransom note → Exit
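The entry point found here, brute-forced RDP credentials, is straightforward to alert on: count failed logons per source IP within a short window. The sketch below works on a generic event stream; the field names and thresholds are assumptions (on Windows the underlying signal is Security log event ID 4625, a failed logon).
from collections import defaultdict
from datetime import datetime, timedelta

FAILED_LOGON_THRESHOLD = 20      # failures per source IP...
WINDOW = timedelta(minutes=5)    # ...within this sliding window

def brute_force_sources(events: list[dict]) -> set[str]:
    """Return source IPs with an excessive rate of failed logons."""
    buckets: dict[str, list[datetime]] = defaultdict(list)
    flagged: set[str] = set()
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["outcome"] != "failure":
            continue
        times = buckets[ev["source_ip"]]
        times.append(ev["timestamp"])
        # drop failures that fell out of the sliding window
        while times and ev["timestamp"] - times[0] > WINDOW:
            times.pop(0)
        if len(times) >= FAILED_LOGON_THRESHOLD:
            flagged.add(ev["source_ip"])
    return flagged

# Example event: {"timestamp": datetime(...), "source_ip": "203.0.113.7", "outcome": "failure"}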
T+1 hour:
Eradication:
- Kill all unauthorized processes
- Patch RDP service (disable weak auth)
- Re-enable antivirus (remove disable commands)
- Force password reset (all domain accounts)
- Enable MFA (all admins)
- Enable network segmentation (finance isolated)
T+4 hours:
Recovery:
- Restore from offline backup (that wasn't encrypted)
- Finance data from 24 hours ago (some work lost)
- Monitor for re-compromise (unusual logins)
T+24 hours:
Notification:
- No evidence customer data accessed
- But transparency: explain what happened, what we're doing
- Offer credit monitoring anyway (goodwill)
- Law enforcement notified
T+2 weeks:
Post-incident review:
- Why no MFA? (Should be mandatory for admins)
- Why weak password? (Password policy insufficient)
- Why RDP exposed to internet? (Should VPN-only)
- Why no network segmentation? (Finance should be isolated)
Fixes:
- Enforce MFA for all admins (immediate)
- Password policy: 16 chars, complexity (immediate)
- VPN for RDP access only (immediate)
- Network segmentation (2 weeks)
- Backup testing (monthly)
- Incident response drills (quarterly)
Result: Expensive but survived. Lessons learned. No customer data lost.
Forensics Best Practices
Collecting Evidence Without Contaminating
# ❌ BAD: Investigating directly on the live system alters evidence metadata
ls -la /var/log/*            # Relies on binaries of the (possibly compromised) host
cat /var/log/auth.log        # Reading the file updates its atime
grep "attacker" /var/log/*   # Also updates atime; content is untouched, but timestamps are evidence too
# ✅ GOOD: Preserving evidence chain
# Step 1: Create bit-for-bit copy (before touching original)
dcfldd if=/dev/sda1 of=/mnt/usb/disk.img hash=sha256
# Step 2: Hash original (for integrity verification)
sha256sum /dev/sda1 > /mnt/usb/original.hash
# Step 3: Verify copy matches
sha256sum /mnt/usb/disk.img
# Should match original.hash
# Step 4: Only analyze the copy (not original)
# Mount copy as read-only
mount -o ro,loop /mnt/usb/disk.img /mnt/forensics
# Step 5: Document chain of custody
# Who accessed it, when, what tools, what was found
Evidence Preservation Checklist
import hashlib
from datetime import datetime

class EvidencePreservation:
def collect_evidence(self, incident_id: str):
"""Properly preserve evidence for investigation"""
evidence = {
'volatile_data': [],
'disk_images': [],
'network_captures': [],
'logs': [],
}
# Memory (volatile, lost on reboot)
if should_preserve_memory():
memory = self.memory_dump(process='all')
evidence['volatile_data'].append(self.hash_and_store(memory))
# Network traffic (PCAP)
pcap = self.capture_network_traffic()
evidence['network_captures'].append(self.hash_and_store(pcap))
# Disk (if needed, power off first to prevent changes)
if should_image_disk():
disk_image = self.create_disk_image('/dev/sda1')
evidence['disk_images'].append(self.hash_and_store(disk_image))
# Logs (application, system, audit)
logs = self.collect_logs()
evidence['logs'].append(self.hash_and_store(logs))
# Chain of custody documentation
self.document_chain_of_custody(
incident_id=incident_id,
items=evidence,
collector='Security Team',
timestamp=datetime.now(),
storage_location='Evidence Vault (HSM encrypted)',
access_restrictions='CISO, Legal, Forensics Analyst'
)
return evidence
def hash_and_store(self, evidence_item):
"""Hash evidence, store securely, document"""
sha256_hash = hashlib.sha256(evidence_item).hexdigest()
# Store in secure evidence vault
self.evidence_vault.store(
data=evidence_item,
hash=sha256_hash,
encrypted=True, # Encrypt in vault
access_log=True # Log all access
)
return {
'item': evidence_item,
'hash': sha256_hash,
'stored_at': datetime.now(),
'location': 'Secure Evidence Vault'
}
Metrics for Measuring Response Effectiveness
class IncidentMetrics:
def calculate_response_metrics(self, incident):
"""Measure incident response effectiveness"""
# Time metrics
metrics = {
'MTTD': incident.detected_at - incident.occurred_at,
'MTTR': incident.resolved_at - incident.detected_at,
'MTPT': incident.patched_at - incident.occurred_at, # Patch time
}
# Impact metrics
metrics.update({
'customers_affected': incident.affected_users,
'data_exposed': incident.records_exposed,
'downtime_minutes': (incident.resolved_at - incident.started_at).total_seconds() / 60,
'financial_impact': incident.incident_cost, # Cost of incident + response
})
# Response quality
metrics.update({
'playbook_used': incident.used_predefined_playbook,
'all_steps_followed': incident.followed_procedure,
'escalated_appropriately': incident.escalation_correct,
'communication_timely': incident.notified_within_sla,
})
# Lessons learned
metrics.update({
'root_cause_identified': incident.root_cause is not None,
'preventive_actions_identified': len(incident.preventive_actions) > 0,
'action_items_assigned': all(a.owner for a in incident.action_items),
})
return metrics
# Target metrics (best in class):
# MTTD: < 1 hour (detect within 1 hour)
# MTTR: < 4 hours (resolve within 4 hours)
# MTPT: < 24 hours (patch within 1 day)
# Customer notification: Within 24-72 hours
# Post-incident review: Within 2 weeks
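Those targets can be checked mechanically once the time metrics are computed as timedeltas. The comparison below is a small sketch; the target values are copied from the comments above, and the key names match the IncidentMetrics sketch.
from datetime import timedelta

# Best-in-class targets from the comments above
TARGETS = {
    "MTTD": timedelta(hours=1),
    "MTTR": timedelta(hours=4),
    "MTPT": timedelta(hours=24),
}

def missed_targets(metrics: dict) -> list:
    """Return the names of time metrics that exceeded their targets."""
    return [name for name, target in TARGETS.items()
            if name in metrics and metrics[name] > target]

# Example: detected in 15 min, resolved in 3 h 15 min, patched after 2 days
print(missed_targets({"MTTD": timedelta(minutes=15),
                      "MTTR": timedelta(hours=3, minutes=15),
                      "MTPT": timedelta(days=2)}))   # -> ['MTPT']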
One Takeaway
Incident response preparation saves hours during actual incidents. Playbooks, training, and tools pre-positioned reduce MTTR from days to hours. Post-incident reviews drive continuous improvement. The goal is not zero incidents (impossible) but rapid detection, containment, and recovery.