
Data Lifecycle

TL;DR

Map each stage of data (ingest, process, store, serve, archive) to clear owners, SLAs, controls, and costs. Classify data on write (PII, secrets, public); apply encryption at rest and in transit; automate retention policies and deletion; track lineage and audit access. Treat data governance as a first-class concern, not an afterthought.

Learning Objectives

After reading this article, you will be able to:

  • Design a data lifecycle strategy aligned with compliance, cost, and business requirements.
  • Classify data by sensitivity and apply appropriate controls at each stage.
  • Implement automated retention and deletion policies.
  • Track data lineage and access for audit purposes.
  • Balance data utility (freshness, availability) with privacy and cost.

Motivating Scenario

Your e-commerce company ingests user activity, payment data, and product information daily. Some data (order history) must be kept for 7 years for tax compliance. Other data (IP addresses, session cookies) should be deleted after 30 days. Without a lifecycle strategy, you accumulate storage debt: expensive storage bills, regulatory risk, and privacy violations. With proper lifecycle management, data flows through hot (operational) → warm (analytical) → cold (archive) → deletion tiers; PII is encrypted and access is audited. Storage costs drop by roughly 60%, compliance improves, and users trust that their data is deleted when promised.

Core Concepts

The Five Stages of Data Lifecycle

Data lifecycle from creation through deletion, with policy gates and cost transitions at each stage.

1. Ingest

Data enters the system from sources: user APIs, databases, event streams, file uploads, sensors.

Key concerns:

  • Classification: Mark data with sensitivity labels (public, internal, PII, secret, health).
  • Validation: Check schema, integrity, and completeness.
  • Deduplication: Avoid duplicate records in the pipeline.
  • Sampling: For high-volume streams, consider sampling for cost-efficiency.
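The validation and deduplication concerns above can be sketched as a simple ingest gate. This is a minimal sketch: the required field names and the content-hash dedup key are illustrative assumptions, and a real pipeline would back the seen-hash set with an expiring store rather than process memory.

```python
import hashlib
import json

REQUIRED_FIELDS = {"event_id", "user_id", "timestamp"}  # assumed schema
_seen_hashes: set = set()  # in production, use an expiring store (e.g., Redis with TTL)

def ingest_gate(record: dict) -> tuple:
    """Validate schema and reject duplicates before a record enters the pipeline."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    # Deduplicate on a hash of the canonical JSON form of the record
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    if digest in _seen_hashes:
        return False, "duplicate"
    _seen_hashes.add(digest)
    return True, "ok"
```

A first submission passes; an exact replay of the same record, or a record missing required fields, is rejected with a reason the pipeline can use for quarantine and alerting.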

2. Process

Data is transformed, enriched, and validated.

Key concerns:

  • Encryption in transit: Use TLS for all data movement.
  • Access controls: Only authorized services process sensitive data.
  • Lineage tracking: Log which source data produced which output.
  • Error handling: Quarantine bad data and alert.
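Lineage tracking can be as simple as recording, for every derived record, the IDs of its inputs and the transform that produced it. A minimal in-memory sketch (the entry structure and function names are assumptions, not a standard lineage format):

```python
from datetime import datetime, timezone

lineage_log: list = []

def record_lineage(output_id: str, input_ids: list, transform: str) -> None:
    """Append one lineage entry: which inputs, via which transform, made this output."""
    lineage_log.append({
        "output_id": output_id,
        "input_ids": input_ids,
        "transform": transform,
        "at": datetime.now(timezone.utc).isoformat(),
    })

def upstream_of(output_id: str) -> list:
    """Walk the log backwards to find all transitive inputs of a record."""
    frontier, seen = [output_id], set()
    while frontier:
        current = frontier.pop()
        for entry in lineage_log:
            if entry["output_id"] == current:
                for src in entry["input_ids"]:
                    if src not in seen:
                        seen.add(src)
                        frontier.append(src)
    return sorted(seen)
```

Given entries like "b was joined from a1 and a2" and "c was aggregated from b", `upstream_of("c")` answers the audit question "which source data produced this output?"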

3. Store

Data is persisted to a database, data lake, or data warehouse.

Key concerns:

  • Encryption at rest: All databases should encrypt data by default.
  • Backup and DR: Regular backups with tested recovery procedures.
  • Partitioning: Partition by retention class (hot, warm, cold) for efficient deletion.
  • Indexing: Create indexes for common access patterns.
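Partitioning by retention class turns expiry into a cheap "drop partition" instead of a row-by-row delete. A sketch of computing a partition key from data class and creation month; the `class/YYYY-MM` naming scheme is an illustrative assumption:

```python
from datetime import date

def partition_key(data_class: str, created: date) -> str:
    """Route a record to a partition named by retention class and creation month.

    Expiring data then means dropping whole partitions, e.g. everything
    under pii/2024-01 once the PII retention window for that month passes.
    """
    return f"{data_class}/{created.strftime('%Y-%m')}"

def expired_partitions(partitions: list, data_class: str, cutoff_month: str) -> list:
    """List partitions of one class whose month sorts strictly before the cutoff."""
    prefix = f"{data_class}/"
    return [p for p in partitions if p.startswith(prefix) and p[len(prefix):] < cutoff_month]
```

Because `YYYY-MM` strings sort chronologically, finding expired partitions is a string comparison, and deletion never has to scan individual rows.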

4. Serve

Data is queried for operational or analytical use.

Key concerns:

  • Access controls: Row-level security (RLS) for sensitive datasets.
  • Audit logging: Track who accessed what, when, and why.
  • Caching: Cache frequently accessed data to reduce query load.
  • Data residency: Keep data in-region for compliance (GDPR, etc.).
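Databases like PostgreSQL support row-level security natively; the same idea can be approximated in application code by filtering rows against the caller's identity and region before returning them, logging the access as a side effect. A sketch under assumed field names (`owner_id`, `region`):

```python
from datetime import datetime, timezone

audit_log: list = []

def serve_rows(rows: list, user_id: str, user_region: str, purpose: str) -> list:
    """Return only the caller's own, in-region rows, and log the access."""
    visible = [r for r in rows if r["owner_id"] == user_id and r["region"] == user_region]
    audit_log.append({
        "user_id": user_id,
        "purpose": purpose,          # "why" is recorded alongside "who" and "when"
        "rows_returned": len(visible),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return visible
```

Enforcing the filter in the database (via RLS policies) is preferable where available, since an application-level filter can be bypassed by any code path that queries the table directly.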

5. Retention & Deletion

Data ages out according to policy.

Key concerns:

  • Retention periods: Regulatory (7 years for financial), operational (30 days for logs), privacy (right to be forgotten).
  • Automated purge: Schedule jobs to delete expired data.
  • Compliance verification: Prove deletion occurred via audit logs.
  • Undelete safety: Hold deleted data in a quarantine before permanent deletion (e.g., 24h grace period).
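The quarantine idea above can be sketched as a two-phase delete: expired records first move to a quarantine with a timestamp, and a second pass permanently purges only those past the grace period. The 24-hour window and the names here are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE = timedelta(hours=24)      # assumed grace period
store: dict = {}                 # primary record store
quarantine: dict = {}            # record_id -> soft-delete timestamp

def soft_delete(record_id: str) -> None:
    """Phase 1: mark an expired record for deletion but keep it recoverable."""
    quarantine[record_id] = datetime.now(timezone.utc)

def restore(record_id: str) -> bool:
    """Undo a soft delete within the grace period; returns True if restored."""
    return quarantine.pop(record_id, None) is not None

def purge(now: Optional[datetime] = None) -> list:
    """Phase 2: permanently delete records whose grace period has elapsed."""
    now = now or datetime.now(timezone.utc)
    purged = [rid for rid, at in quarantine.items() if now - at >= GRACE]
    for rid in purged:
        quarantine.pop(rid)
        store.pop(rid, None)
    return purged
```

A purge run immediately after a soft delete removes nothing; only once the grace period elapses does the record disappear from the store, which is what protects against deletion bugs and misconfigurations.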

Data Classification and Sensitivity

Classify data at ingest time:

| Class | Examples | Retention | Encryption | Access | Risk |
| --- | --- | --- | --- | --- | --- |
| Public | Product catalog, published docs | Per policy | No (optional) | Anyone | Low |
| Internal | Employee directory, company metrics | Per policy | Recommended | Employees | Medium |
| PII | Names, emails, phone numbers | Up to legal max | Required | Minimal (service) | High |
| Secrets | API keys, passwords, tokens | Until revoked | Required | Single service | Critical |
| Payment Card | Credit cards, bank accounts | Up to 7 years | Required (HSM) | Audited | Critical |
| Health | Medical records, diagnoses | Up to 7-10 years | Required | Minimal + audit | Critical |

Practical Example: Tiered Data Storage

```python
from enum import Enum
from datetime import datetime, timedelta
from typing import Dict, List, Optional

class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"
    PAYMENT_CARD = "payment_card"
    SECRETS = "secrets"

class RetentionPolicy:
    def __init__(self, data_class: DataClass, retention_days: int, tier_transitions: Dict[str, int]):
        self.data_class = data_class
        self.retention_days = retention_days
        self.tier_transitions = tier_transitions  # e.g., {"hot": 30, "warm": 90, "cold": 365}

    def get_tier(self, days_old: int) -> str:
        """Determine the storage tier based on age."""
        tiers = sorted(self.tier_transitions.items(), key=lambda x: x[1])
        for tier, threshold in tiers:
            if days_old <= threshold:
                return tier
        return "archive"

    def should_delete(self, days_old: int) -> bool:
        """Check if data has exceeded its retention period."""
        return days_old >= self.retention_days

class DataRecord:
    def __init__(self, record_id: str, data_class: DataClass, data: Dict, created_at: datetime):
        self.record_id = record_id
        self.data_class = data_class
        self.data = data
        self.created_at = created_at
        self.encrypted = False
        self.access_log: List[Dict] = []

    def encrypt(self) -> None:
        """Mark as encrypted (in practice, use a real encryption library)."""
        self.encrypted = True
        print(f"[DataRecord {self.record_id}] Encrypted (class={self.data_class.value})")

    def log_access(self, user_id: str, purpose: str) -> None:
        """Log access for the audit trail."""
        self.access_log.append({
            "user_id": user_id,
            "purpose": purpose,
            "timestamp": datetime.utcnow().isoformat(),
            "data_class": self.data_class.value
        })

    def get_age_days(self) -> int:
        """Get age in days."""
        return (datetime.utcnow() - self.created_at).days

class DataLifecycleManager:
    def __init__(self):
        self.records: Dict[str, DataRecord] = {}
        self.policies = {
            DataClass.PUBLIC: RetentionPolicy(DataClass.PUBLIC, 730, {"hot": 90, "warm": 365}),
            DataClass.INTERNAL: RetentionPolicy(DataClass.INTERNAL, 365, {"hot": 60, "warm": 180}),
            DataClass.PII: RetentionPolicy(DataClass.PII, 180, {"hot": 30, "warm": 90}),
            DataClass.PAYMENT_CARD: RetentionPolicy(DataClass.PAYMENT_CARD, 2555, {"hot": 365, "warm": 1095}),  # 7 years
            DataClass.SECRETS: RetentionPolicy(DataClass.SECRETS, 7, {"hot": 0, "warm": 3}),
        }
        self.deleted_records: List[str] = []

    def ingest(self, record_id: str, data_class: DataClass, data: Dict) -> DataRecord:
        """Ingest data, classify, and encrypt."""
        record = DataRecord(record_id, data_class, data, datetime.utcnow())

        # Encrypt sensitive data immediately
        if data_class in [DataClass.PII, DataClass.PAYMENT_CARD, DataClass.SECRETS]:
            record.encrypt()

        self.records[record_id] = record
        print(f"[Ingest] Record {record_id} (class={data_class.value}) ingested and stored")
        return record

    def serve(self, record_id: str, user_id: str, purpose: str) -> Optional[Dict]:
        """Serve data with access logging."""
        record = self.records.get(record_id)
        if not record:
            print(f"[Serve] Record {record_id} not found")
            return None

        record.log_access(user_id, purpose)
        print(f"[Serve] Record {record_id} served to {user_id} for {purpose}")
        return record.data

    def process_lifecycle(self) -> Dict[str, int]:
        """Process all records: tier, archive, or delete."""
        stats = {"hot": 0, "warm": 0, "cold": 0, "archive": 0, "deleted": 0}

        records_to_delete = []
        for record_id, record in self.records.items():
            policy = self.policies[record.data_class]
            age_days = record.get_age_days()

            if policy.should_delete(age_days):
                records_to_delete.append(record_id)
                stats["deleted"] += 1
                print(f"[Lifecycle] Record {record_id} (age={age_days}d) scheduled for deletion")
            else:
                tier = policy.get_tier(age_days)
                stats[tier] += 1
                print(f"[Lifecycle] Record {record_id} (age={age_days}d) → {tier} tier")

        # Delete expired records
        for record_id in records_to_delete:
            del self.records[record_id]
            self.deleted_records.append(record_id)

        return stats

    def audit_access(self, data_class: Optional[DataClass] = None) -> List[Dict]:
        """Generate an audit report of all accesses."""
        audit = []
        for record_id, record in self.records.items():
            if data_class and record.data_class != data_class:
                continue
            for access in record.access_log:
                audit.append({"record_id": record_id, **access})
        return audit

# Example usage
def main():
    print("=== Data Lifecycle Management ===\n")

    manager = DataLifecycleManager()

    # Ingest various data types
    print("--- Ingest Phase ---")
    manager.ingest("user:001", DataClass.PII, {"name": "Alice", "email": "alice@example.com"})
    manager.ingest("card:001", DataClass.PAYMENT_CARD, {"last_four": "4242", "expiry": "12/25"})
    manager.ingest("log:001", DataClass.SECRETS, {"api_key": "sk_live_***"})
    manager.ingest("product:001", DataClass.PUBLIC, {"sku": "ABC123", "price": 29.99})

    # Serve and audit
    print("\n--- Serve Phase ---")
    manager.serve("user:001", "user_service", "order_processing")
    manager.serve("user:001", "analytics_service", "user_cohort_analysis")
    manager.serve("product:001", "api", "product_detail_page")

    # Simulate aging
    print("\n--- Simulate Aging (modify created_at for demo) ---")
    # In real systems, records age naturally over time
    manager.records["log:001"].created_at = datetime.utcnow() - timedelta(days=10)
    manager.records["card:001"].created_at = datetime.utcnow() - timedelta(days=100)

    # Process lifecycle
    print("\n--- Lifecycle Processing ---")
    stats = manager.process_lifecycle()
    print(f"\nTiering Summary: {stats}")

    # Audit trail
    print("\n--- Audit Trail (PII access) ---")
    audit = manager.audit_access(DataClass.PII)
    for entry in audit:
        print(f"  {entry['record_id']}: {entry['user_id']} accessed for {entry['purpose']} @ {entry['timestamp']}")

    # Summary
    print("\n--- Final State ---")
    print(f"Active records: {len(manager.records)}")
    print(f"Deleted records: {len(manager.deleted_records)}")

if __name__ == "__main__":
    main()
```

When to Use / When NOT to Use

Implement Full Lifecycle Management
  1. Systems with regulatory requirements (GDPR, CCPA, HIPAA, SOX)
  2. Large-scale data systems where storage cost is significant
  3. Systems handling PII, payment data, or health information
  4. Multi-region deployments with data residency requirements
  5. Applications with explicit retention and deletion policies
Simpler Approach May Suffice
  1. Small datasets (< 10GB) where storage cost is negligible
  2. Short-lived data (caches, sessions) with implicit expiry
  3. Single-region, single-storage-tier systems
  4. Systems without regulatory requirements or PII
  5. MVP products validating a hypothesis (lifecycle later)

Patterns & Pitfalls

Tiered storage: Move data through hot → warm → cold → archive tiers based on age. Hot (SSD, fast): recent, frequently accessed data. Warm (disk): older but occasionally needed. Cold (tape, cloud archive): rarely accessed. Archive (offline): legally required but practically inaccessible. Tiering can reduce storage costs by 80% or more, since cold tiers are often an order of magnitude cheaper per GB.

Classify on write: Classify every record at creation time using metadata or schema hints. PII, secrets, and regulated data get automatic encryption and audit logging; public data skips the overhead. Misclassification is the root of many data leaks.

Quarantine before deletion: Schedule deletion but hold records in quarantine for 24-48 hours. This prevents accidental deletion from bugs or misconfigurations. After the grace period, permanently delete (overwrite on disk, no recovery).

Shadow copies: Data lingers in backups, caches, or old replicas long after the primary is deleted. Track all replicas and backups and apply deletion uniformly. Use immutable backups or WAL-based backup systems for easier cleanup.

Policy drift: Teams forget what retention policies are in place, leading to either over-retention (cost waste) or under-retention (compliance risk). Document policies in code; automate enforcement.

Audit everything: Log all access to sensitive data: who, what, when, why. Store audit logs in a write-once system. Auditors will ask for this log, and it also helps detect suspicious access.

Design Review Checklist

  • Have you classified all data types by sensitivity (public, internal, PII, secrets)?
  • Is classification enforced at the point of ingest (not retrospectively)?
  • Are retention periods defined and justified for each data class?
  • Is deletion automated and verifiable (audit trail of deletions)?
  • Are backups and replicas subject to the same retention policies?
  • Is data encrypted at rest for PII, payment data, and secrets?
  • Is data encrypted in transit (TLS) for all inter-service communication?
  • Are access logs maintained and auditable for sensitive data?
  • Have you tested data recovery and deletion procedures?
  • Is the lifecycle strategy documented and communicated to the team?

Self-Check

Before finalizing your lifecycle strategy:

  1. Regulatory requirements: What laws apply (GDPR, HIPAA, SOX)? What's the minimum retention period? What's the right-to-be-forgotten policy?

  2. Data sensitivity: What are the most sensitive datasets? How would a breach impact users or the business?

  3. Storage tiers: Does your infrastructure support cold storage? What's the cost difference between tiers?

  4. Deletion risk: Can you afford to accidentally delete important data? Do you need a grace period?

Next Steps

  • Audit current data: Catalog what data exists, where, how old, how sensitive.
  • Classify: Tag each dataset with retention and sensitivity labels.
  • Define policies: Work with legal/compliance to set retention and deletion rules.
  • Implement automation: Code lifecycle jobs; test in staging first.
  • Monitor: Track data volumes, tier transitions, deletion success rates.
  • Document: Write runbooks for data recovery, emergency deletion, compliance audits.

References

  1. Google Cloud: Data Lifecycle
  2. Azure CAF: Data Management
  3. GDPR Information Portal
  4. Data Lineage Explained