Branch by Abstraction
Manage large refactorings and migrations through feature flags and abstraction layers.
TL;DR
Manage large refactorings and migrations through feature flags and abstraction layers. Rather than risky big-bang replacements, these patterns enable incremental, low-risk transitions that keep systems running while you modernize. They require discipline and clear governance but dramatically reduce modernization risk.
Learning Objectives
- Understand the pattern and when to apply it
- Learn how to avoid common modernization pitfalls
- Apply risk mitigation techniques for major changes
- Plan and execute incremental transitions safely
- Manage team and organizational change during modernization
Motivating Scenario
You have a legacy system that is becoming a bottleneck. Rewriting it would take a year and risk breaking critical functionality. Instead, you incrementally replace it with new services while keeping the old system running, gradually shifting traffic and functionality. Six months later, the legacy system handles 10 percent of traffic and serves as a fallback. Eventually, you can retire it completely. This pattern turns a risky all-or-nothing gamble into a managed, incremental transition.
Core Concepts
Migration Risk
Major system changes carry existential risk: downtime impacts revenue, data corruption destroys trust, performance regression loses customers. These patterns manage that risk through incremental change and careful rollback planning.
Incremental Transition
Rather than "old system" then "new system", these patterns create "old plus new coexisting" then "new plus old as fallback" then "new only". This gives you multiple checkpoints to verify things are working.
The key insight: You do not need to be perfect on day one. You just need to be good enough to carry traffic safely, with a fallback if something goes wrong.
Dual-Write and Data Consistency
When migrating data, you typically need both systems to have current data for a period. Dual writes keep both systems in sync, backfill catches up old data, and CDC (Change Data Capture) handles streaming updates.
Abstraction Layers
Rather than replacing a system wholesale, you wrap it with an abstraction (facade or anti-corruption layer). The abstraction routes traffic gradually: initially 100 percent to old, then 99 percent old / 1 percent new, then 50/50, then eventually 100 percent new.
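This gradual routing can be sketched as a thin facade that splits traffic by percentage. A minimal sketch; `old_backend` and `new_backend` are hypothetical stand-ins for the two implementations:

```python
import random

class MigrationRouter:
    """Facade that splits traffic between an old and a new backend."""

    def __init__(self, old_backend, new_backend, new_percentage=0):
        self.old = old_backend
        self.new = new_backend
        # Share of traffic (0-100) sent to the new system
        self.new_percentage = new_percentage

    def handle(self, request):
        # random.random() is in [0, 1), so a setting of 100 routes everything new
        if random.random() * 100 < self.new_percentage:
            return self.new(request)
        return self.old(request)
```

Ramping the migration is then a one-line config change: raise `new_percentage` from 0 toward 100, and drop it back to 0 to roll back instantly.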
Practical Example
- Strangler Fig Timeline
- Migration Strategies
- Verification Checklist
Phase 1: Preparation (Weeks 1-4)
New service deployed but offline
Load balancer configured but not yet routing traffic to the new service
Both systems connected to same database
Risk: Low (new service not receiving traffic)
Phase 2: Canary (Weeks 5-6)
Traffic Distribution: 1 percent to new, 99 percent to old
Monitoring: Compare response times, error rates
Verify data consistency
Rollback: Instant (revert to 100 percent old)
Risk: Very low (affects 1 percent of traffic)
Phase 3: Ramp (Weeks 7-12)
Gradual Increase: 5 percent, 10 percent, 25 percent, 50 percent
Continuous comparison with old system
Alert on any divergence
Rollback: Revert traffic distribution instantly
Risk: Medium (affects majority by end)
Phase 4: Retirement (Weeks 13+)
New service handles all traffic independently
Old system monitored but not actively used
Archive old system after stable period
Extract data and lessons learned
Core Strategies:
Strangler Fig Pattern:
Grow new functionality alongside old
Gradually shift traffic
Old system stays running as safety net
Works for mostly stateless systems
Branch by Abstraction:
Use feature flags and abstraction layers
Hide complexity behind clean interface
Transition implementation at runtime
Dual Write and Backfill:
New system writes all new data
Old system maintains historical data
Backfill job copies old data to new system
Both stay consistent until cutover
Anti-Corruption Layers:
Wrap legacy systems with adapters
Translate between old and new interfaces
Insulate new code from legacy complexity
Domain-Aligned Decomposition:
Decompose monolith along domain boundaries
Each domain becomes new microservice
Easier to coordinate
Reduces coupling
Data Consistency:
- Old and new systems return identical data
- Both systems show same transaction count
- Randomly sampled records match in both
- Metadata (timestamps, IDs) consistent
Functionality:
- All major features work in new system
- Error cases handled identically
- Edge cases verified
- Security rules enforced identically
Performance:
- New system has acceptable latency
- New system has acceptable error rate
- Database resources not over-utilized
- No unexpected slow queries
Operations:
- Monitoring and alerting working
- Logs contain expected entries
- Can trace requests across systems
- Runbooks updated for new system
Rollback:
- Can quickly switch all traffic back
- Old system receives traffic without issues
- Data not corrupted if rollback needed
- Team has tested rollback procedure
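Several of the checks above can be automated. A minimal sketch of the "randomly sampled records match" check, assuming both stores expose hypothetical `list_ids()` and `get()` methods; adapt the names to your actual data access layer:

```python
import random

def sample_and_compare(old_db, new_db, sample_size=100):
    """Spot-check data consistency by comparing randomly sampled records.

    Returns the IDs of records that differ between the two stores;
    an empty list means the sample is consistent.
    """
    ids = old_db.list_ids()
    sample = random.sample(ids, min(sample_size, len(ids)))
    return [rid for rid in sample if old_db.get(rid) != new_db.get(rid)]
```

Run this on a schedule during the ramp and alert on any non-empty result, alongside the aggregate checks (counts, error rates) listed above.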
Key Success Factors
- Clear success criteria: You know what "done" looks like
- Incremental approach: Big changes done in small, verifiable steps
- Comprehensive testing: Automated tests at multiple levels
- Constant verification: Continuous comparison between old and new
- Fast rollback: Can switch back to old system in minutes
- Team alignment: Everyone understands the approach and timeline
- Transparent communication: Stakeholders understand progress and risks
Pitfalls to Avoid
❌ All-or-nothing thinking: "Just rewrite it" instead of incremental migration
❌ Ignoring data consistency: Assuming old and new data will magically sync
❌ No fallback plan: If new system fails, you are stuck
❌ Invisible progress: Weeks of work with no deployed functionality
❌ Parallel maintenance: Maintaining old and new systems forever
❌ Rushing to cleanup: Retiring old system before new one is truly stable
❌ Team turnover: Key knowledge not documented
Related Patterns
- Strangler Fig: Wrap and gradually replace legacy systems
- Feature Flags: Control which code path executes at runtime
- Blue-Green Deployment: Switch entire systems at once
- Canary Releases: Route small percentage to new version
Checklist: Before Modernization
- Clear business case: Why modernize? What is the benefit?
- Phased approach defined: How will you migrate incrementally?
- Success criteria explicit: What does "done" look like?
- Risk mitigation planned: What could go wrong? How will you recover?
- Testing strategy defined: How will you verify correctness?
- Monitoring in place: Can you detect problems in new system?
- Team capacity sufficient: Do you have capacity for both?
- Communication plan ready: How will you keep stakeholders informed?
Feature Flags for Safe Refactoring
Simple Feature Flag Implementation
```python
import hashlib

class FeatureFlagManager:
    def __init__(self, config_source):
        self.flags = config_source.load_flags()

    def is_enabled(self, flag_name, user_id=None, percentage=None):
        """Check if a feature is enabled for this user.

        `percentage` optionally overrides the rollout percentage
        stored in the flag config.
        """
        if flag_name not in self.flags:
            return False
        flag = self.flags[flag_name]
        if not flag.get('enabled', False):
            return False
        # User-specific overrides grant early access
        if user_id in flag.get('allowed_users', []):
            return True
        # Percentage-based rollout: the bucket must be stable per user+flag,
        # so avoid the built-in hash(), which is salted per process.
        pct = percentage if percentage is not None else flag.get('percentage', 100)
        if pct < 100:
            digest = hashlib.sha256(f"{user_id}:{flag_name}".encode()).hexdigest()
            if int(digest, 16) % 100 >= pct:
                return False
        return True

# Usage during refactoring
flag_manager = FeatureFlagManager(ConfigStore())

def process_order(order_id, user_id):
    if flag_manager.is_enabled('use_new_order_processor', user_id):
        # New implementation
        return NewOrderProcessor().process(order_id)
    else:
        # Old implementation (fallback)
        return LegacyOrderProcessor().process(order_id)
```
Gradual Rollout Strategy
```yaml
# Feature flag config for gradual migration
features:
  new_payment_system:
    enabled: true
    percentage: 0              # Start at 0%
    rollout_schedule:
      - date: 2025-02-15
        percentage: 1          # 1% canary
      - date: 2025-02-16
        percentage: 5          # 5% after monitoring
      - date: 2025-02-17
        percentage: 25
      - date: 2025-02-18
        percentage: 50
      - date: 2025-02-19
        percentage: 100
    allowed_users: []          # Can add specific users for early access
    monitoring:
      error_rate_threshold: 0.01   # Rollback if error rate > 1%
      latency_threshold_ms: 2000   # Rollback if p99 > 2s
      rollback_condition: |
        IF error_rate > 0.01 OR latency_p99 > 2000 THEN rollback
```
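The `rollout_schedule` above has to be evaluated somewhere to decide the percentage in effect today. A minimal sketch, assuming the YAML has been parsed into a list of date/percentage dicts sorted by date:

```python
from datetime import date

def current_percentage(rollout_schedule, today):
    """Return the rollout percentage in effect on `today`.

    `rollout_schedule` mirrors the YAML config: a list of
    {"date": date, "percentage": int} entries, sorted by date.
    """
    percentage = 0  # before the first scheduled step, stay dark
    for step in rollout_schedule:
        if step["date"] <= today:
            percentage = step["percentage"]
        else:
            break
    return percentage

schedule = [
    {"date": date(2025, 2, 15), "percentage": 1},
    {"date": date(2025, 2, 16), "percentage": 5},
    {"date": date(2025, 2, 17), "percentage": 25},
]
```

A scheduler or the flag service itself can run this daily and push the result into the flag store; the monitoring thresholds still take priority and can force the percentage back to 0 at any time.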
Branch by Abstraction Example: Payment Processing
```python
from abc import ABC, abstractmethod

# Old implementation: payment processing tightly coupled to one provider
class LegacyOrderService:
    def __init__(self, stripe_client):
        self.stripe = stripe_client

    def create_order(self, order_data):
        charge = self.stripe.charge(order_data['amount'])
        return Order.create(order_data, charge_id=charge.id)

# Abstraction: a payment gateway interface
class PaymentGateway(ABC):
    @abstractmethod
    def charge(self, amount, customer_id):
        pass

class StripePaymentGateway(PaymentGateway):
    def __init__(self, stripe_client):
        self.stripe = stripe_client

    def charge(self, amount, customer_id):
        return self.stripe.charge(amount, customer_id)

class LegacyPaymentGateway(PaymentGateway):
    """Wraps the old payment system"""
    def __init__(self, legacy_client):
        self.legacy = legacy_client

    def charge(self, amount, customer_id):
        return self.legacy.process_charge(amount, customer_id)

# Refactored order service: depends on the abstraction, not a concrete class
class OrderService:
    def __init__(self, new_gateway: PaymentGateway, old_gateway: PaymentGateway, flag_manager):
        self.new_gateway = new_gateway
        self.old_gateway = old_gateway
        self.flags = flag_manager

    def create_order(self, order_data, user_id):
        # Which implementation runs is controlled by a feature flag
        if self.flags.is_enabled('new_payment_system', user_id):
            gateway = self.new_gateway
        else:
            gateway = self.old_gateway
        payment = gateway.charge(order_data['amount'], order_data['customer_id'])
        return Order.create(order_data, charge_id=payment.id)
```
Dual-Write Pattern for Data Consistency
```python
import logging

logger = logging.getLogger(__name__)

class DualWriteOrderRepository:
    """Write to both old and new databases during migration"""

    def __init__(self, old_db, new_db, flag_manager):
        self.old_db = old_db
        self.new_db = new_db
        self.flags = flag_manager

    def save(self, order, user_id):
        # Always write to the old system (safety net)
        self.old_db.save(order)
        # If the new system is enabled, also write there
        if self.flags.is_enabled('new_order_db', user_id):
            try:
                self.new_db.save(order)
            except Exception as e:
                # Log but don't fail; the old DB write succeeded
                logger.warning(f"Failed to write to new DB: {e}")

    def get(self, order_id, user_id):
        # Read from the appropriate system based on the flag
        if self.flags.is_enabled('read_from_new_db', user_id, percentage=10):
            try:
                order = self.new_db.get(order_id)
                # Verify consistency against the old system during migration
                old_order = self.old_db.get(order_id)
                if order != old_order:
                    logger.warning(f"Data divergence detected for order {order_id}")
                return order
            except Exception as e:
                logger.error(f"Failed to read from new DB: {e}")
                return self.old_db.get(order_id)  # Fallback
        else:
            return self.old_db.get(order_id)
```
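Dual writes only cover data created after the flag is enabled; a backfill job copies historical records into the new store. A minimal sketch, assuming hypothetical `iter_records()`, `exists()`, and `save_many()` methods and an idempotent new store, so the job can safely be re-run after a crash:

```python
def backfill(old_db, new_db, batch_size=500):
    """Copy historical records from the old store into the new one.

    Skips records the dual-write path already created, so running it
    concurrently with live dual writes is safe. Returns the number
    of records copied.
    """
    copied = 0
    batch = []
    for record in old_db.iter_records():
        # Skip records that already exist in the new store
        if new_db.exists(record["id"]):
            continue
        batch.append(record)
        if len(batch) >= batch_size:
            new_db.save_many(batch)  # write in batches to limit load
            copied += len(batch)
            batch = []
    if batch:
        new_db.save_many(batch)
        copied += len(batch)
    return copied
```

Once the backfill reports zero new copies on consecutive runs and the consistency checks pass, both stores are in sync and the read flag can be ramped up.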
Monitoring and Rollback Triggers
```python
import logging

logger = logging.getLogger(__name__)

class MigrationMonitor:
    def __init__(self, metrics_client, flag_manager):
        self.metrics = metrics_client
        self.flags = flag_manager

    def check_health(self, feature_name):
        """Check if the migration is healthy; roll back if needed"""
        metrics = self.metrics.get_recent(feature_name)
        total = max(metrics['total_requests'], 1)  # guard against division by zero
        # Monitor error rate
        error_rate = metrics['errors'] / total
        if error_rate > 0.01:  # More than 1% errors
            logger.error(f"High error rate for {feature_name}: {error_rate}")
            return self.rollback(feature_name)
        # Monitor latency
        p99_latency = metrics['p99_latency_ms']
        if p99_latency > 2000:  # More than 2 seconds
            logger.error(f"High latency for {feature_name}: {p99_latency}ms")
            return self.rollback(feature_name)
        # Monitor data divergence as a fraction of all requests
        divergence_rate = metrics['divergent_reads'] / total
        if divergence_rate > 0.001:  # More than 0.1% divergence
            logger.error(f"Data divergence for {feature_name}: {divergence_rate}")
            return self.rollback(feature_name)
        return True  # All checks passed

    def rollback(self, feature_name):
        """Immediately disable the feature for all users"""
        logger.critical(f"Rolling back {feature_name}")
        self.flags.set_percentage(feature_name, 0)
        self.metrics.emit_event('rollback', {'feature': feature_name})
        return False
```
Self-Check
- Can you do this migration in phases? If not, find a way to break it up. Use feature flags to phase behavior gradually.
- Can you roll back in less than 1 hour? Yes: flip feature flag to 0%. No code deployment needed.
- Have you tested the fallback procedure? Practice rollback in staging: set flag percentage to 100% (use new), then 0% (use old). Verify system remains functional.
- Does everyone understand the timeline? Publish rollout schedule showing percentage by date. Clear triggers for rollback (error rate, latency, divergence).
Warning signs:
- "We'll just rip out the old code once the new code is working"
- "Feature flag performance is negligible"
- "We don't need to monitor during rollout"
- No automated rollback triggers defined
Takeaway
Modernization is not about technology—it is about managing risk while delivering business value. The best migrations are the ones teams do not notice: they happen gradually, safely, with multiple checkpoints. Time is your friend in modernization; rushing increases risk without proportional gain.
Next Steps
- Define the scope: What exactly are you modernizing?
- Identify the risks: What could go wrong? How will you mitigate?
- Plan in phases: How can you migrate incrementally?
- Design verification: How will you prove old and new are equivalent?
- Prepare rollback: How will you revert if something goes wrong?