Tokenization & Pseudonymization

Mask sensitive data while retaining usability

TL;DR

Tokenization: Replace sensitive data (credit card 4111111111111111) with a random token (tok_8f3n2k9). The original is stored securely elsewhere; the token is useless to an attacker. Pseudonymization: Replace identifiers with consistent pseudonyms (user_id 5 → user_abc123), enabling analytics without revealing identity. Use these when full encryption is overkill but privacy is critical. Compliance: GDPR still treats pseudonymized data as personal data; properly tokenized data may fall outside that scope.

Learning Objectives

  • Understand tokenization and pseudonymization differences
  • Design tokenization for payment data
  • Implement pseudonymization for analytics
  • Balance privacy with data utility
  • Meet compliance requirements (GDPR, PCI)

Motivating Scenario

Scenario 1: Payment Processing

  • Credit card processor receives card numbers during transactions
  • Storing raw card numbers in application database? PCI DSS violation, liability, breach risk
  • Solution: Tokenize card numbers. Application stores only tokens (useless to attackers). Card numbers stay in secure vault. Payment processor handles detokenization.

Scenario 2: Analytics on Sensitive Data

  • Analytics team needs to analyze user behavior (which features used, session length, error patterns)
  • But can't access PII (names, emails, IDs) due to privacy regulations (GDPR, CCPA)
  • Solution: Pseudonymize user IDs. Replace user_id=5 with pseudonym=user_a3f2e1d6b5c9 (deterministic hash). Analytics can correlate behavior across sessions without knowing who the user is. Data scientist never sees actual identities.

Core Concepts

Tokenization vs Pseudonymization

Tokenization
  1. Replace data with a random token
  2. Original stored in a separate vault
  3. Token has no mathematical relationship to the original
  4. Example: CC 4111... → tok_abc123
  5. Reversible, but only by the vault (detokenization)
  6. Use: payment data, secrets
Pseudonymization
  1. Replace identifier with a consistent pseudonym
  2. Derived from the original (deterministic, e.g. a keyed hash)
  3. Same input = same pseudonym
  4. Example: user_id 5 → user_abc123 (always)
  5. Not directly reversible; with the secret salt/key, the mapping can be recomputed
  6. Use: PII, analytics, research

Implementation Patterns

Format-Preserving Encryption (FPE): Encrypt while preserving format.

  • Input: credit card 4111111111111111
  • Output: 5923847362912456 (still looks like CC, but encrypted)
  • Advantage: No schema changes, reversible

Tokenization with Vault:

  • Input: SSN 123-45-6789
  • Vault stores: 123-45-6789 → token_xyz
  • Output: token_xyz (completely different format)
  • Database has only tokens, original in vault

Hashing for Pseudonymization:

  • Input: email john@example.com, secret key: "secret_key"
  • Pseudonym: HMAC(secret_key, john@example.com) = abc123 (use a keyed hash such as HMAC-SHA256 rather than naive concatenation)
  • Output: abc123 (looks like an ID, but is actually a keyed hash)
  • Deterministic (same input and key always produce the same pseudonym)

Practical Examples

// Tokenization: replace sensitive data with a random token
const crypto = require('crypto');

class TokenVault {
  constructor() {
    this.store = new Map(); // In production: a separate, hardened database
  }

  async tokenize(sensitiveData) {
    // Generate a random token; it has no mathematical relationship to the original
    const token = crypto.randomBytes(16).toString('hex');
    const key = `token_${token}`;

    // Store the mapping in the vault (not in the application database)
    this.store.set(key, {
      data: sensitiveData,
      createdAt: new Date(),
      lastAccessed: new Date()
    });

    // Return only the token to the application
    return token;
  }

  async detokenize(token) {
    // Only an authorized service should be able to retrieve the original
    const key = `token_${token}`;
    const record = this.store.get(key);

    if (!record) {
      throw new Error('Token not found');
    }

    // Update last-accessed time (for audit)
    record.lastAccessed = new Date();
    return record.data;
  }
}

// Usage: Payment Processing
async function processPayment(creditCard) {
  const vault = new TokenVault(); // In production: a shared vault service, not per-request

  // 1. Tokenize the card (happens once, at the payment gateway)
  const token = await vault.tokenize(creditCard.number);
  // vault.store now maps token_abc123 → 4111111111111111

  // 2. The application stores only the token (safe)
  const payment = {
    amount: 99.99,
    currency: 'USD',
    cardToken: token // Not the actual card number!
  };

  // 3. Later: the payment processor detokenizes
  const actualCard = await vault.detokenize(token);
  // Only the payment processor can retrieve the original
}

// Real-world flow:
// User enters card → browser sends it to the payment gateway (Stripe, Square)
// Payment gateway tokenizes → application receives the token
// Application stores the token (safe for PCI compliance)
// Processor keeps the vault (high security, limited access)

Patterns and Pitfalls

  • Pitfall: Tokenizing data but storing the token→original mapping in the same database as the tokens. If that database is breached, the attacker can detokenize everything.
    Fix: Store the mapping in a separate, hardened vault with different access controls. Only authorized services can detokenize. Different teams, different networks.
  • Pitfall: Predictable pseudonyms (sequential IDs, a weak salt, or MD5). Attackers can reverse-engineer the original identifiers.
    Fix: Use a long random salt and a strong keyed hash (HMAC-SHA256 or better), keep the salt secret, and test that pseudonyms cannot be reversed.
  • Pitfall: Pseudonymizing user IDs but leaving other identifying fields (email, phone). Quasi-identifiers expose identity.
    Fix: Combine pseudonymization with encryption, aggregation, and field removal. Layer defenses and test with re-identification attacks.

Self-Check

  • What's the difference between tokenization and pseudonymization?
  • Can you reverse a pseudonym if you have the salt?
  • Why is vault placement important for tokenization?
  • Can you use Format-Preserving Encryption instead of tokenization?
  • How does pseudonymization differ from anonymization?

Design Review Checklist

  • Sensitive data types identified (CC, SSN, health records, emails)?
  • Tokenization used for payment/vault data?
  • Pseudonymization used for analytics/research?
  • Vault separate from application database?
  • Vault access controls minimal and audited?
  • Mapping data encrypted (not plaintext)?
  • Salt unique per organization, stored securely?
  • Hashing algorithm strong (SHA-256+, not MD5)?
  • Reversibility documented and restricted?
  • Compliance requirements mapped (PCI, GDPR, HIPAA)?
  • Data retention policies defined?
  • Audit trails for all detokenization?
  • Tests verify pseudonym determinism?
  • Key rotation schedule defined?

Next Steps

  1. Identify sensitive data — Catalog all PII, payment data, health records
  2. Choose strategy — Tokenization (payment), pseudonymization (analytics), encryption (general)
  3. Design vault — Separate system, restricted access, audit logging
  4. Implement deterministically — Ensure consistency, enable joins
  5. Test thoroughly — Try to reverse pseudonyms, verify vault security
  6. Audit and maintain — Monitor access, rotate keys, update policies
