Architecture Checklists
TL;DR
Architecture checklists are systematic reminders to think through critical quality attributes and design concerns. Dimensions include scalability, security, cost, operational readiness, data governance, disaster recovery, compliance, and performance. Checklists surface blind spots and ensure consistency across designs. Review checklists at three stages: ADR creation (design validation), before implementation (readiness check), and before production deployment (go-live criteria). Organizations with disciplined checklist practices avoid preventable failures and maintain architectural consistency. Iterate checklists based on postmortems: if an incident could have been prevented by a checklist question, add that question to future checklists.
Learning Objectives
You will be able to:
- Design organization-specific checklists addressing critical quality attributes
- Use checklists systematically in design reviews and architecture decisions
- Identify gaps and blind spots in architectural designs through checklist-driven review
- Create role-specific checklists (architect, security, DevOps, compliance)
- Iterate and improve checklists based on incidents and postmortem learnings
- Balance rigor (comprehensive checklists) with speed (don't over-check simple decisions)
Motivating Scenario
Your team ships a new microservice. It works great in staging. On day one in production, the service crashes. Investigation reveals: no monitoring was set up, no alerts were configured, and no runbook existed for common failures. The operations team was blind when the service went down.
The postmortem concludes this was preventable. A simple checklist would have required: "Monitoring configured? Alerts set up? Runbook written?" But no such checklist existed, so the service shipped without these critical operational elements.
Later, you ship a second service. This time, you use a checklist: before production, it must have monitoring, alerts, and a runbook. The second service also has an issue, but the operations team catches it via alerting within a minute. The impact is minimal.
Architecture checklists prevent these failures by ensuring nothing is forgotten. They're not waterfall bureaucracy—they're systematic reminders to think through critical concerns.
Core Content
Why Checklists Matter
Humans have cognitive limits. When designing complex systems, it's easy to focus on the central concern (the feature) and forget critical cross-cutting concerns (security, monitoring, resilience). Checklists combat this.
Research on checklists, notably Atul Gawande's work in hospitals, shows:
- Checklists prevent simple mistakes and oversights
- Teams using checklists have fewer preventable failures
- Checklists enable consistent quality across teams
- Even experts benefit from checklists
- Checklists don't slow things down if designed well
In architecture, checklists ensure:
- Consistency: New services follow same patterns
- Completeness: No critical concerns are missed
- Knowledge preservation: Lessons from past incidents inform future designs
- Distributed decision-making: Junior architects can design with confidence, knowing the checklist covers gaps
- Audit trail: Checklist completion provides evidence of due diligence
Designing Checklists
Principle 1: Organize by Concern
Group checklist items by quality attribute or concern (a machine-readable sketch follows this list):
SCALABILITY
- [ ] Identified max expected load (requests/sec, data volume)
- [ ] Designed for 3x growth without re-architecting
- [ ] Database indexed for query patterns
- [ ] Caching layer designed for read-heavy operations
- [ ] Load balancer distributes traffic
SECURITY
- [ ] All external inputs validated
- [ ] Passwords hashed with strong algorithm (bcrypt, scrypt)
- [ ] Secrets not hardcoded (use secret management)
- [ ] HTTPS/TLS enforced for all external communication
- [ ] Authentication and authorization implemented
- [ ] SQL injection, XSS, CSRF mitigations in place
- [ ] Security review completed
OPERATIONAL READINESS
- [ ] Monitoring and alerting configured
- [ ] Key metrics identified and tracked
- [ ] Alerts with runbooks (what to do when alert fires?)
- [ ] Graceful degradation for dependency failures
- [ ] Health checks implemented
- [ ] Logging centralized (ELK, Datadog, etc.)
- [ ] Log retention policy defined
RESILIENCE & DISASTER RECOVERY
- [ ] Identified critical paths (features that must never fail)
- [ ] Failure modes documented (what can go wrong?)
- [ ] Redundancy for critical components (multi-zone, multi-region)
- [ ] Timeouts configured (no infinite waits)
- [ ] Retry logic with exponential backoff
- [ ] Circuit breaker patterns for external dependencies
- [ ] Data backup and recovery tested
- [ ] RTO/RPO targets defined and achievable
DATA GOVERNANCE
- [ ] Data ownership clear (which service owns which data?)
- [ ] Consistency guarantees documented (ACID vs. eventual?)
- [ ] Data retention policies defined
- [ ] PII identified and protected
- [ ] Data lineage documented (for compliance)
- [ ] Backup strategy and frequency
COST
- [ ] Identified major cost drivers
- [ ] Right-sized infrastructure (not over-provisioned)
- [ ] Reserved instances or commitments used where applicable
- [ ] Cost monitoring and alerts configured
- [ ] Estimated monthly cost documented
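One way to keep these categories consistent across teams is to store the checklist as data rather than prose. A minimal sketch, assuming a hypothetical YAML layout (the field names here are illustrative, not a standard):

```yaml
# Hypothetical checklist-as-data format; field names are illustrative.
checklist: operational-readiness
version: 3
items:
  - id: monitoring-configured
    question: "Monitoring and alerting configured?"
    verification: automated   # can be checked in CI (see Tooling & Automation)
  - id: runbook-written
    question: "Runbook written for common failures?"
    verification: manual
    owner: devops
  - id: log-retention
    question: "Log retention policy defined?"
    verification: manual
    owner: platform
```

Storing checklists as data makes changes easy to review and diff, and it feeds the automation described later in this section.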
Principle 2: Tailor to Organization
Generic checklists are better than nothing, but organization-specific checklists are more valuable. Customize based on:
- Tech stack (microservices? Serverless? Monolith?)
- Industry (fintech has different security/compliance needs than media)
- Scale (startup vs. enterprise)
- Maturity (early-stage teams need simpler checklists)
Principle 3: Right-size the Checklist
Too long, and the checklist becomes a burden that people skip. Too short, and it misses critical concerns.
Guide: 5-20 items per category is typical. If a category has >20 items, break it into sub-categories.
Principle 4: Make Items Testable
Bad: "Performance is good" Good: "API response time p95 < 100ms under peak load"
Bad: "Security is considered" Good: "External inputs validated with spec_security library or equivalent"
Testable items are: verifiable, specific, actionable.
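To see what a testable item buys you, consider wiring it directly into CI. The sketch below is hypothetical: the Prometheus endpoint, metric name, and job label are assumptions for illustration, not part of any prescribed setup.

```yaml
# Hypothetical CI step: fail the build if the checklist item
# "API response time p95 < 100ms under peak load" is not met.
- name: Verify p95 latency target
  run: |
    P95=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))' \
      | jq -r '.data.result[0].value[1]')
    echo "Measured p95: ${P95}s (target: < 0.1s)"
    # Exit non-zero when the measured p95 misses the 100ms target
    awk -v p95="$P95" 'BEGIN { exit !(p95 < 0.1) }'
```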
Role-Specific Checklists
Different roles care about different concerns. Create role-specific checklists:
Architect Checklist (Design Phase)
ARCHITECTURE & DESIGN
- [ ] Identified system boundaries and interfaces
- [ ] Described data flow (request → response)
- [ ] Documented major components and dependencies
- [ ] Design patterns used (if any) identified
- [ ] Identified single points of failure
- [ ] Scalability assumptions documented
- [ ] Failure scenarios identified (what can go wrong?)
- [ ] Alternative designs considered and rejected with reasoning
- [ ] Design aligns with organizational strategy/ADRs
- [ ] Design reviewed by 2+ architects (consensus)
Security Team Checklist
SECURITY REVIEW
- [ ] Authentication/authorization design reviewed
- [ ] All external inputs validated
- [ ] Sensitive data encrypted (in transit + at rest)
- [ ] Secrets management approach (not hardcoded)
- [ ] OWASP Top 10 considerations addressed
- [ ] Network security (firewalls, VPCs, least privilege)
- [ ] Dependency/library versions (no known CVEs)
- [ ] Logging/audit trail for sensitive operations
- [ ] Compliance requirements (GDPR, SOC 2, etc.) addressed
DevOps/Operations Checklist
OPERATIONAL READINESS
- [ ] Infrastructure-as-code (Terraform, K8s manifests; see the sketch after this list)
- [ ] Monitoring and alerting configured
- [ ] Logging centralized (ELK, Datadog, etc.)
- [ ] Deployment process automated (CI/CD)
- [ ] Rollback procedure tested
- [ ] Load testing completed (supports expected load?)
- [ ] Runbook written (operations troubleshooting guide)
- [ ] Disaster recovery tested (backup/restore works?)
- [ ] Capacity planning (growth plan for 2 years?)
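Several of these items (infrastructure-as-code, health checks, redundancy) become concrete in a deployment manifest. This is a minimal Kubernetes sketch; the service name, image, and endpoints are illustrative:

```yaml
# Hypothetical Kubernetes manifest; names and paths are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3                     # redundancy for critical components
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: registry.example.com/example-service:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:          # "Health checks implemented"
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
          resources:
            limits:
              memory: 512Mi       # a limit makes memory alerting meaningful
```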
Data Team Checklist
DATA & ANALYTICS
- [ ] Data schema designed (normalized, indexes)
- [ ] Backup and recovery process tested
- [ ] Data retention policy defined
- [ ] PII identified and protected
- [ ] Data lineage documented
- [ ] Replication/sync strategy (if multi-region)
- [ ] Performance (query latency acceptable?)
- [ ] Cost (right storage tier? Partitioning strategy?)
Three-Stage Checklist Review
Checklists are most effective when reviewed at multiple stages:
Stage 1: ADR/Design Review (Early)
Review the checklist while the decision is being made, not after implementation.
Benefit: Course-correct early when changes are cheap.
Example: "Before we decide on Redis for sessions, let's review the Operational Readiness checklist. Do we have experience operating Redis? Do we have monitoring for Redis cluster health?"
Stage 2: Pre-Implementation Review (Before Coding)
Review again before writing code. Some details emerge during detailed design that weren't visible at the ADR stage.
Example: "Security checklist asks about validating external inputs. Our API accepts file uploads. Have we thought about file validation (size, type, virus scanning)?"
Stage 3: Pre-Production Review (Before Deploy)
Final review before shipping to production. Catch any gaps that slipped through earlier.
Example: "Operational Readiness checklist requires runbook. We haven't written it yet. Before production, let's document how ops team should respond to common failures."
Checklist Iterations
Checklists should evolve based on:
- Incidents: If an incident could have been prevented by a checklist question, add that question
- Lessons learned: If a design review surfaced a gap, add it
- Regulatory/compliance: New compliance requirements → new checklist items
- Technology evolution: New frameworks/tools → updated checklist
Example Incident-Driven Evolution:
Incident: A memory leak in a service caused an OOM (out-of-memory) kill. The service died unexpectedly.
Postmortem: "If memory monitoring and alerts had been in place, we would have caught this before the OOM kill. We should add to the checklist: 'Memory monitoring configured with alerts for high usage?'"
Updated Operational Readiness checklist:
- [ ] CPU monitoring configured
- [ ] Memory monitoring configured with alerts (>80% usage) [NEW]
- [ ] Disk monitoring configured
- [ ] Network monitoring configured
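Where a team runs Prometheus, the new item can be backed by an alerting rule. A minimal sketch, assuming Kubernetes/cAdvisor metric names; adjust the expression and threshold for your stack:

```yaml
# Hypothetical Prometheus alerting rule backing the new checklist item.
groups:
  - name: service-memory
    rules:
      - alert: HighMemoryUsage
        # Fires when working-set memory exceeds 80% of the container limit
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 80% of limit for {{ $labels.pod }}"
          runbook: "docs/runbook.md#high-memory"
```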
Building Organizational Consensus
For checklists to work, the team must buy in. How:
- Start small: 5-10 items. Add incrementally as the organization grows.
- Make items specific: replace vague items ("be secure") with specific, actionable ones.
- Get feedback: Let teams use the checklist, collect feedback, and refine.
- Share wins: When the checklist catches an issue, share the story. "The checklist asked about alerts, and that caught the memory leak early."
- Iterate: Checklists aren't fixed. They evolve as the organization learns.
Tooling & Automation
Checklists can be:
- Manual: Printed document. Fill out by hand.
- Google Sheets: Shared spreadsheet. Track across designs.
- Architecture tools: Structurizr, Archi support checklists.
- Custom app: Track checklist responses in your own tool.
- Automation: Some items can be auto-verified (monitoring configured? CI/CD present?)
Example GitHub Action that validates checklist items:
```yaml
name: Architecture Checklist Verification
on: [pull_request]
jobs:
  checklist:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Check monitoring configuration
        run: |
          if ! grep -r "prometheus\|datadog\|newrelic" config/; then
            echo "ERROR: No monitoring configuration found"
            echo "Checklist requires monitoring configured"
            exit 1
          fi
      - name: Check runbook exists
        run: |
          if [ ! -f "docs/runbook.md" ]; then
            echo "ERROR: Runbook missing"
            echo "Checklist requires: 'Runbook written for common failures'"
            exit 1
          fi
```
Automated checks catch violations at the PR stage instead of in production.
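The same workflow can also fail when the checklist file itself still has open items. The file path and checkbox syntax below are assumptions for illustration:

```yaml
# Hypothetical additional step: fail if the design checklist still
# contains unchecked "- [ ]" items. The path is illustrative.
- name: Check for unchecked checklist items
  run: |
    if grep -nE '^[[:space:]]*- \[ \]' docs/architecture-checklist.md; then
      echo "ERROR: Unchecked checklist items remain (see lines above)"
      exit 1
    fi
```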
Patterns & Pitfalls
Pattern: Lightweight Checklists for Simple Systems Not every decision needs a 50-item checklist. A microservice adding a simple CRUD endpoint? A 5-item checklist is sufficient. A full system redesign? The complete checklist.
Pattern: Checklist Ownership Assign an owner for each checklist category. The architect owns the Design checklist, the security team owns the Security checklist, and so on. This ensures someone is thinking about each dimension.
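On GitHub, ownership can be made mechanical with a CODEOWNERS file so the owning team is automatically requested for review; the paths and team names here are hypothetical:

```
# Hypothetical CODEOWNERS entries; paths and teams are illustrative.
docs/checklists/architecture.md  @org/architects
docs/checklists/security.md      @org/security-team
docs/checklists/operations.md    @org/devops-team
```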
Pattern: Checklist as Review Agenda Use the checklist to structure design reviews: go through each item, discuss, confirm. This keeps reviews systematic rather than ad hoc.
Pitfall: Checkbox Culture Checking boxes without thinking. "Yes, we considered security" (when the team really didn't). Checklists only work when the team engages thoughtfully.
Pitfall: Static Checklists A checklist created five years ago and never updated. The organization has evolved, but the checklist hasn't. Stale checklists lose their value.
Pitfall: Too Long A checklist so comprehensive that it takes four hours to complete will be skipped. Keep checklists concise.
When to Use / When Not to Use
Use checklists for:
- New services/features (ensure nothing is forgotten)
- Designs that will be long-lived and critical
- Organizations with multiple teams (consistency)
- After incidents (prevent recurrence)
- Designs with high-risk decisions (security, data, cost)
Less critical for:
- Trivial changes (a small bug fix doesn't need the full checklist)
- Rapid prototypes (code that will be thrown away)
- Organizations with deep institutional knowledge and a perfect track record (unlikely)
Balance: Use checklists to prevent preventable mistakes. Don't let them slow down innovation.
Operational Considerations
- Tool: Google Sheets, Confluence wiki, or a custom app. Whatever the team already uses for documentation.
- Accessibility: Checklist should be easy to find and use. If hard to access, team won't use it.
- Review cycle: Update checklists quarterly or after major incidents
- Feedback: Collect team feedback. What items are helpful? What's missing?
- Training: Teach team how to use checklists and why they matter
Design Review Checklist (Inception!)
- Checklist created addressing critical quality attributes for organization
- Checklist organized by concern/domain (scalability, security, operations, etc.)
- Each item is specific and testable (not vague)
- Checklist right-sized (5-20 items per category, not too long)
- Role-specific checklists created (architect, security, devops, data, etc.)
- Checklist used at ADR stage, pre-implementation, and pre-production
- Checklist refined based on incidents and postmortems
- Automation added where possible (CI/CD validation)
- Checklist accessible and easy to use (team knows where it is)
- Checklist reviewed quarterly and updated
- Team trained on when and how to use checklists
- Wins shared when checklist catches issues
- Balance between rigor and speed (simple changes don't need full review)
- Tool chosen for tracking responses (Google Sheets, wiki, etc.)
- Ownership assigned (who maintains and updates checklist?)
Organizations with disciplined checklist practices have fewer preventable failures. After shipping a service, they don't discover missing monitoring or alerting. When designing new features, security concerns aren't an afterthought. Teams can design with confidence, knowing checklists will catch gaps. New team members can use checklists to guide their work. Incidents inspire checklist improvements, making the organization smarter over time. This is how checklists become part of engineering culture.
Self-Check
- If you reviewed your last 3 designs, how many could have been improved by using a checklist? If all 3, you need systematic checklists.
- After your last incident, could you add a question to your checklist that would have caught it? This is how checklists evolve and prevent recurrence.
- Does your team know where the architecture checklist is and when to use it? If not, accessibility is the problem.
Next Steps
- Create starter checklist: Pick 3-5 critical dimensions (scalability, security, operations, resilience, cost)
- Add 10-15 items: Be specific and testable
- Share with team: Gather feedback
- Use in next 3 designs: Test it out
- Iterate: Based on feedback and incidents, refine checklist
Architecture checklists are not bureaucracy—they're systematic reminders to think through important concerns. The best checklist is one your team actually uses and finds valuable. Start small, iterate, and evolve based on real experience.
References
- The Checklist Manifesto, Atul Gawande
- NIH: Checklists Reduce Complications
- ThoughtWorks: Effective Architecture Review
- Gartner: Architecture Review Best Practices