Architecture Checklists
TL;DR
Architecture checklists are systematic reminders to think through critical quality attributes and design concerns. Dimensions include scalability, security, cost, operational readiness, data governance, disaster recovery, compliance, and performance. Checklists surface blind spots and ensure consistency across designs. Review checklists at three stages: ADR creation (design validation), before implementation (readiness check), and before production deployment (go-live criteria). Organizations with disciplined checklist practices avoid preventable failures and maintain architectural consistency. Iterate checklists based on postmortems: if an incident could have been prevented by a checklist question, add that question to future checklists.
Learning Objectives
You will be able to:
- Design organization-specific checklists addressing critical quality attributes
- Use checklists systematically in design reviews and architecture decisions
- Identify gaps and blind spots in architectural designs through checklist-driven review
- Create role-specific checklists (architect, security, DevOps, compliance)
- Iterate and improve checklists based on incidents and postmortem learnings
- Balance rigor (comprehensive checklists) with speed (don't over-check simple decisions)
Motivating Scenario
Your team ships a new microservice. It works great in staging. On day one in production, the service crashes. Investigation reveals: no monitoring was set up, no alerts were configured, and no runbook existed for common failures. The operations team was blind when the service went down.
The postmortem concludes this was preventable. A simple checklist would have required: "Monitoring configured? Alerts set up? Runbook written?" But no such checklist existed, so the service shipped without these critical operational elements.
Later, you ship a second service. This time, you use a checklist: before production, it must have monitoring, alerts, and a runbook. The second service also has an issue, but the operations team catches it via alerting within a minute. The impact is minimal.
Architecture checklists prevent these failures by ensuring nothing is forgotten. They're not waterfall bureaucracy—they're systematic reminders to think through critical concerns.
Core Content
Why Checklists Matter
Humans have cognitive limits. When designing complex systems, it's easy to focus on the central concern (the feature) and forget critical cross-cutting concerns (security, monitoring, resilience). Checklists combat this.
Research on checklists, notably Atul Gawande's work in hospitals, shows:
- Checklists prevent simple mistakes and oversights
- Teams using checklists have fewer preventable failures
- Checklists enable consistent quality across teams
- Even experts benefit from checklists
- Checklists don't slow things down if designed well
In architecture, checklists ensure:
- Consistency: New services follow same patterns
- Completeness: No critical concerns are missed
- Knowledge preservation: Lessons from past incidents inform future designs
- Distributed decision-making: Junior architects can design with confidence, knowing the checklist covers gaps
- Audit trail: Checklist completion provides evidence of due diligence
Designing Checklists
Principle 1: Organize by Concern
Group checklist items by quality attribute or concern (a machine-readable sketch follows this list):
SCALABILITY
- [ ] Identified max expected load (requests/sec, data volume)
- [ ] Designed for 3x growth without re-architecting
- [ ] Database indexed for query patterns
- [ ] Caching layer designed for read-heavy operations
- [ ] Load balancer distributes traffic
SECURITY
- [ ] All external inputs validated
- [ ] Passwords hashed with strong algorithm (bcrypt, scrypt)
- [ ] Secrets not hardcoded (use secret management)
- [ ] HTTPS/TLS enforced for all external communication
- [ ] Authentication and authorization implemented
- [ ] SQL injection, XSS, CSRF mitigations in place
- [ ] Security review completed
OPERATIONAL READINESS
- [ ] Monitoring and alerting configured
- [ ] Key metrics identified and tracked
- [ ] Alerts with runbooks (what to do when alert fires?)
- [ ] Graceful degradation for dependency failures
- [ ] Health checks implemented
- [ ] Logging centralized (ELK, Datadog, etc.)
- [ ] Log retention policy defined
RESILIENCE & DISASTER RECOVERY
- [ ] Identified critical paths (features that must never fail)
- [ ] Failure modes documented (what can go wrong?)
- [ ] Redundancy for critical components (multi-zone, multi-region)
- [ ] Timeouts configured (no infinite waits)
- [ ] Retry logic with exponential backoff
- [ ] Circuit breaker patterns for external dependencies
- [ ] Data backup and recovery tested
- [ ] RTO/RPO targets defined and achievable
DATA GOVERNANCE
- [ ] Data ownership clear (which service owns which data?)
- [ ] Consistency guarantees documented (ACID vs. eventual?)
- [ ] Data retention policies defined
- [ ] PII identified and protected
- [ ] Data lineage documented (for compliance)
- [ ] Backup strategy and frequency
COST
- [ ] Identified major cost drivers
- [ ] Right-sized infrastructure (not over-provisioned)
- [ ] Reserved instances or commitments used where applicable
- [ ] Cost monitoring and alerts configured
- [ ] Estimated monthly cost documented
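One way to keep these categories consistent across teams is to store the checklist as data rather than prose. A minimal sketch, assuming a hypothetical YAML layout (the field names here are illustrative, not a standard):

```yaml
# Hypothetical checklist-as-data format; field names are illustrative.
checklist: operational-readiness
version: 3
items:
  - id: monitoring-configured
    question: "Monitoring and alerting configured?"
    verification: automated   # can be checked in CI (see Tooling & Automation)
  - id: runbook-written
    question: "Runbook written for common failures?"
    verification: manual
    owner: devops
  - id: log-retention
    question: "Log retention policy defined?"
    verification: manual
    owner: platform
```

Storing checklists as data makes changes easy to review and diff, and it feeds the automation described later in this section.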
Principle 2: Tailor to Organization
Generic checklists are better than nothing, but organization-specific checklists are more valuable. Customize based on:
- Tech stack (microservices? Serverless? Monolith?)
- Industry (fintech has different security/compliance needs than media)
- Scale (startup vs. enterprise)
- Maturity (early-stage teams need simpler checklists)
Principle 3: Right-size the Checklist
Too long, and the checklist becomes a burden that people skip. Too short, and it misses critical concerns.
Guide: 5-20 items per category is typical. If a category has >20 items, break it into sub-categories.
Principle 4: Make Items Testable
Bad: "Performance is good" Good: "API response time p95 < 100ms under peak load"
Bad: "Security is considered" Good: "External inputs validated with spec_security library or equivalent"
Testable items are: verifiable, specific, actionable.
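To see what a testable item buys you, consider wiring it directly into CI. The sketch below is hypothetical: the Prometheus endpoint, metric name, and job label are assumptions for illustration, not part of any prescribed setup.

```yaml
# Hypothetical CI step: fail the build if the checklist item
# "API response time p95 < 100ms under peak load" is not met.
- name: Verify p95 latency target
  run: |
    P95=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))' \
      | jq -r '.data.result[0].value[1]')
    echo "Measured p95: ${P95}s (target: < 0.1s)"
    # Exit non-zero when the measured p95 misses the 100ms target
    awk -v p95="$P95" 'BEGIN { exit !(p95 < 0.1) }'
```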
Role-Specific Checklists
Different roles care about different concerns. Create role-specific checklists:
Architect Checklist (Design Phase)
ARCHITECTURE & DESIGN
- [ ] Identified system boundaries and interfaces
- [ ] Described data flow (request → response)
- [ ] Documented major components and dependencies
- [ ] Design patterns used (if any) identified
- [ ] Identified single points of failure
- [ ] Scalability assumptions documented
- [ ] Failure scenarios identified (what can go wrong?)
- [ ] Alternative designs considered and rejected with reasoning
- [ ] Design aligns with organizational strategy/ADRs
- [ ] Design reviewed by 2+ architects (consensus)
Security Team Checklist
SECURITY REVIEW
- [ ] Authentication/authorization design reviewed
- [ ] All external inputs validated
- [ ] Sensitive data encrypted (in transit + at rest)
- [ ] Secrets management approach (not hardcoded)
- [ ] OWASP Top 10 considerations addressed
- [ ] Network security (firewalls, VPCs, least privilege)
- [ ] Dependency/library versions (no known CVEs)
- [ ] Logging/audit trail for sensitive operations
- [ ] Compliance requirements (GDPR, SOC 2, etc.) addressed
DevOps/Operations Checklist
OPERATIONAL READINESS
- [ ] Infrastructure-as-code (Terraform, K8s manifests; see the sketch after this list)
- [ ] Monitoring and alerting configured
- [ ] Logging centralized (ELK, Datadog, etc.)
- [ ] Deployment process automated (CI/CD)
- [ ] Rollback procedure tested
- [ ] Load testing completed (supports expected load?)
- [ ] Runbook written (operations troubleshooting guide)
- [ ] Disaster recovery tested (backup/restore works?)
- [ ] Capacity planning (growth plan for 2 years?)
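Several of these items (infrastructure-as-code, health checks, redundancy) become concrete in a deployment manifest. This is a minimal Kubernetes sketch; the service name, image, and endpoints are illustrative:

```yaml
# Hypothetical Kubernetes manifest; names and paths are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3                     # redundancy for critical components
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: registry.example.com/example-service:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:          # "Health checks implemented"
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
          resources:
            limits:
              memory: 512Mi       # a limit makes memory alerting meaningful
```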
Data Team Checklist
DATA & ANALYTICS
- [ ] Data schema designed (normalized, indexes)
- [ ] Backup and recovery process tested
- [ ] Data retention policy defined
- [ ] PII identified and protected
- [ ] Data lineage documented
- [ ] Replication/sync strategy (if multi-region)
- [ ] Performance (query latency acceptable?)
- [ ] Cost (right storage tier? Partitioning strategy?)
Three-Stage Checklist Review
Checklists are most effective when reviewed at multiple stages:
Stage 1: ADR/Design Review (Early)
Review the checklist while the decision is being made, not after implementation.
Benefit: Course-correct early when changes are cheap.
Example: "Before we decide on Redis for sessions, let's review the Operational Readiness checklist. Do we have experience operating Redis? Do we have monitoring for Redis cluster health?"
Stage 2: Pre-Implementation Review (Before Coding)
Review again before writing code. Some details emerge during detailed design that weren't visible at the ADR stage.
Example: "Security checklist asks about validating external inputs. Our API accepts file uploads. Have we thought about file validation (size, type, virus scanning)?"
Stage 3: Pre-Production Review (Before Deploy)
Final review before shipping to production. Catch any gaps that slipped through earlier.
Example: "Operational Readiness checklist requires runbook. We haven't written it yet. Before production, let's document how ops team should respond to common failures."
Checklist Iterations
Checklists should evolve based on:
- Incidents: If an incident could have been prevented by a checklist question, add that question
- Lessons learned: If a design review surfaced a gap, add it
- Regulatory/compliance: New compliance requirements → new checklist items
- Technology evolution: New frameworks/tools → updated checklist
Example Incident-Driven Evolution:
Incident: A memory leak in a service caused an OOM (out-of-memory) kill. The service died unexpectedly.
Postmortem: "If memory monitoring and alerts had been in place, we would have caught this before the OOM kill. We should add to the checklist: 'Memory monitoring configured with alerts for high usage?'"
Updated Operational Readiness checklist:
- [ ] CPU monitoring configured
- [ ] Memory monitoring configured with alerts (>80% usage) [NEW]
- [ ] Disk monitoring configured
- [ ] Network monitoring configured
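Where a team runs Prometheus, the new item can be backed by an alerting rule. A minimal sketch, assuming Kubernetes/cAdvisor metric names; adjust the expression and threshold for your stack:

```yaml
# Hypothetical Prometheus alerting rule backing the new checklist item.
groups:
  - name: service-memory
    rules:
      - alert: HighMemoryUsage
        # Fires when working-set memory exceeds 80% of the container limit
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 80% of limit for {{ $labels.pod }}"
          runbook: "docs/runbook.md#high-memory"
```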
Building Organizational Consensus
For checklists to work, the team must buy in. How:
- Start small: 5-10 items. Add incrementally as the organization grows.
- Make items specific: replace vague items ("be secure") with specific, actionable ones.
- Get feedback: Let teams use the checklist, collect feedback, and refine.
- Share wins: When the checklist catches an issue, share the story. "The checklist asked about alerts, and that caught the memory leak early."
- Iterate: Checklists aren't fixed. They evolve as the organization learns.
Tooling & Automation
Checklists can be:
- Manual: Printed document. Fill out by hand.
- Google Sheets: Shared spreadsheet. Track across designs.
- Architecture tools: Structurizr, Archi support checklists.
- Custom app: Track checklist responses in your own tool.
- Automation: Some items can be auto-verified (monitoring configured? CI/CD present?)
Example GitHub Action that validates checklist items:
```yaml
name: Architecture Checklist Verification
on: [pull_request]
jobs:
  checklist:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Check monitoring configuration
        run: |
          if ! grep -r "prometheus\|datadog\|newrelic" config/; then
            echo "ERROR: No monitoring configuration found"
            echo "Checklist requires monitoring configured"
            exit 1
          fi
      - name: Check runbook exists
        run: |
          if [ ! -f "docs/runbook.md" ]; then
            echo "ERROR: Runbook missing"
            echo "Checklist requires: 'Runbook written for common failures'"
            exit 1
          fi
```
Automated checks catch violations at the PR stage instead of in production.
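The same workflow can also fail when the checklist file itself still has open items. The file path and checkbox syntax below are assumptions for illustration:

```yaml
# Hypothetical additional step: fail if the design checklist still
# contains unchecked "- [ ]" items. The path is illustrative.
- name: Check for unchecked checklist items
  run: |
    if grep -nE '^[[:space:]]*- \[ \]' docs/architecture-checklist.md; then
      echo "ERROR: Unchecked checklist items remain (see lines above)"
      exit 1
    fi
```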
Patterns & Pitfalls
Pattern: Lightweight Checklists for Simple Systems Not every decision needs a 50-item checklist. A microservice adding a simple CRUD endpoint? A 5-item checklist is sufficient. A full system redesign? The complete checklist.
Pattern: Checklist Ownership Assign an owner for each checklist category. The architect owns the Design checklist, the security team owns the Security checklist, and so on. This ensures someone is thinking about each dimension.
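On GitHub, ownership can be made mechanical with a CODEOWNERS file so the owning team is automatically requested for review; the paths and team names here are hypothetical:

```
# Hypothetical CODEOWNERS entries; paths and teams are illustrative.
docs/checklists/architecture.md  @org/architects
docs/checklists/security.md      @org/security-team
docs/checklists/operations.md    @org/devops-team
```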
Pattern: Checklist as Review Agenda Use the checklist to structure design reviews: go through each item, discuss, confirm. This keeps reviews systematic rather than ad hoc.
Pitfall: Checkbox Culture Checking boxes without thinking. "Yes, we considered security" (when the team really didn't). Checklists only work when the team engages thoughtfully.
Pitfall: Static Checklists A checklist created five years ago and never updated. The organization has evolved, but the checklist hasn't. Stale checklists lose their value.
Pitfall: Too Long A checklist so comprehensive that it takes four hours to complete will be skipped. Keep checklists concise.
When to Use / When Not to Use
Use checklists for:
- New services/features (ensure nothing is forgotten)
- Designs that will be long-lived and critical
- Organizations with multiple teams (consistency)
- After incidents (prevent recurrence)
- Designs with high-risk decisions (security, data, cost)
Less critical for:
- Trivial changes (a small bug fix doesn't need the full checklist)
- Rapid prototypes (code that will be thrown away)
- Organizations with deep institutional knowledge and a perfect track record (unlikely)
Balance: Use checklists to prevent preventable mistakes. Don't let them slow down innovation.
Operational Considerations
- Tool: Google Sheets, Confluence wiki, or a custom app. Whatever the team already uses for documentation.
- Accessibility: Checklist should be easy to find and use. If hard to access, team won't use it.
- Review cycle: Update checklists quarterly or after major incidents
- Feedback: Collect team feedback. What items are helpful? What's missing?
- Training: Teach team how to use checklists and why they matter
Design Review Checklist (Inception!)
- Checklist created addressing critical quality attributes for organization
- Checklist organized by concern/domain (scalability, security, operations, etc.)
- Each item is specific and testable (not vague)
- Checklist right-sized (5-20 items per category, not too long)
- Role-specific checklists created (architect, security, devops, data, etc.)
- Checklist used at ADR stage, pre-implementation, and pre-production
- Checklist refined based on incidents and postmortems
- Automation added where possible (CI/CD validation)
- Checklist accessible and easy to use (team knows where it is)
- Checklist reviewed quarterly and updated
- Team trained on when and how to use checklists
- Wins shared when checklist catches issues
- Balance between rigor and speed (simple changes don't need full review)
- Tool chosen for tracking responses (Google Sheets, wiki, etc.)
- Ownership assigned (who maintains and updates checklist?)
Organizations with disciplined checklist practices have fewer preventable failures. After shipping a service, they don't discover missing monitoring or alerting. When designing new features, security concerns aren't an afterthought. Teams can design with confidence, knowing checklists will catch gaps. New team members can use checklists to guide their work. Incidents inspire checklist improvements, making the organization smarter over time. This is how checklists become part of engineering culture.
Self-Check
- If you reviewed your last 3 designs, how many could have been improved by using a checklist? If all 3, you need systematic checklists.
- After your last incident, could you add a question to your checklist that would have caught it? This is how checklists evolve and prevent recurrence.
- Does your team know where the architecture checklist is and when to use it? If not, accessibility is the problem.
Next Steps
- Create starter checklist: Pick 3-5 critical dimensions (scalability, security, operations, resilience, cost)
- Add 10-15 items: Be specific and testable
- Share with team: Gather feedback
- Use in next 3 designs: Test it out
- Iterate: Based on feedback and incidents, refine checklist
Architecture checklists are not bureaucracy—they're systematic reminders to think through important concerns. The best checklist is one your team actually uses and finds valuable. Start small, iterate, and evolve based on real experience.
References
- The Checklist Manifesto, Atul Gawande
- NIH: Checklists Reduce Complications
- ThoughtWorks: Effective Architecture Review
- Gartner: Architecture Review Best Practices