Service Discovery
Enable services to find each other in dynamic environments where instances scale and fail
TL;DR
Services don't have fixed addresses; instances scale up and down and fail. Service discovery enables Service A to find Service B. Two approaches: client-side (the client queries a registry and picks an instance) and server-side (the client contacts a load balancer, which picks the instance). Combined with health checking, failed instances are removed from rotation automatically. Health checks detect failures; discovery updates are eventually consistent.
Learning Objectives
- Understand client-side and server-side service discovery
- Implement health checking and instance registration
- Handle stale instances and eventual consistency
- Recognize the tension between consistency and availability
Motivating Scenario
You hardcode "payment-service.example.com" into your application. It works until you scale the payment service to three instances. Which one does your code connect to? If one fails, your code doesn't know. Adding a new instance means updating every dependent service. Hard-coded addresses don't work in scalable systems.
Service discovery solves this: when the payment service starts, it registers itself; when it stops, it deregisters. Other services query the registry: "Give me a healthy payment service instance."
Client-Side Discovery
Definition: Client queries a service registry, gets a list of instances, picks one.
Characteristics:
- Client responsibility to query, pick, retry
- Flexible (client can implement smart selection)
- Simple infrastructure
- Client sees all instances (can fail over)
- Stale registry data impacts client
Flow:
1. Payment service starts → registers in registry
2. Order service queries registry: "Give me payment service instances"
3. Order service gets [payment1:3000, payment2:3000, payment3:3000]
4. Order service picks one (round-robin, random, etc.)
5. Order service calls payment1:3000
6. If payment1 fails, order service retries with another instance
- Client-Side Example
import requests
from requests.exceptions import RequestException

registry = ServiceRegistry('consul.example.com')

# Service registration
class PaymentService:
    def __init__(self):
        # Register when starting
        registry.register(
            name='payment-service',
            address='payment1.internal',
            port=3000,
            health_check='http://payment1:3000/health'
        )

    def shutdown(self):
        # Deregister when stopping
        registry.deregister(name='payment-service')

# Client-side discovery
class OrderService:
    def process_payment(self, amount):
        # Query registry for instances
        instances = registry.get_instances('payment-service')

        # Pick one (simple round-robin)
        instance = self.pick_instance(instances)

        # Call it
        try:
            return requests.post(
                f'http://{instance.address}:{instance.port}/charge',
                json={'amount': amount}
            )
        except RequestException:
            # If the call failed, retry with a different instance (helpers sketched below)
            return self.retry_with_fallback(amount, instances)
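The pick_instance and retry_with_fallback helpers are left abstract above. A minimal sketch of what they might look like follows, written as standalone functions for brevity; in the example they would be methods on OrderService. The round-robin counter and the 2-second timeout are illustrative assumptions.
import requests
from requests.exceptions import RequestException

_rr_index = 0  # module-level round-robin counter (illustrative)

def pick_instance(instances):
    """Simple round-robin over the instance list."""
    global _rr_index
    instance = instances[_rr_index % len(instances)]
    _rr_index += 1
    return instance

def retry_with_fallback(amount, instances):
    """Try each instance once before giving up."""
    for instance in instances:
        try:
            return requests.post(
                f'http://{instance.address}:{instance.port}/charge',
                json={'amount': amount},
                timeout=2  # illustrative: fail fast on dead instances
            )
        except RequestException:
            continue  # instance unreachable; try the next one
    raise RuntimeError('all payment-service instances failed')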
Server-Side Discovery
Definition: Client calls a load balancer. Load balancer queries registry, picks an instance, forwards request.
Characteristics:
- Load balancer responsibility
- Simpler client code (just call one address)
- Centralized intelligence
- Load balancer can optimize picks (warm instances, latency, etc.)
- Load balancer becomes a critical component
Flow:
1. Payment service starts → registers in registry
2. Load balancer monitors registry for payment service
3. Order service calls load balancer: "payment-service.internal"
4. Load balancer queries registry, picks healthy instance
5. Load balancer forwards request to picked instance
6. Load balancer retries with another instance if needed (see the client sketch below)
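With server-side discovery, the calling service needs only the load balancer's stable address. A minimal sketch, assuming a hypothetical payment-service.internal hostname that fronts the load balancer:
import requests
from requests.exceptions import RequestException

# Stable address of the load balancer / virtual IP (hypothetical hostname)
PAYMENT_SERVICE_URL = 'http://payment-service.internal:3000'

class PaymentUnavailable(Exception):
    pass

class OrderService:
    def process_payment(self, amount):
        try:
            # No registry lookup, no instance picking: one address, one call.
            # The load balancer resolves it to a healthy instance and may retry.
            return requests.post(
                f'{PAYMENT_SERVICE_URL}/charge',
                json={'amount': amount},
                timeout=2  # illustrative timeout
            )
        except RequestException as exc:
            raise PaymentUnavailable('payment service unreachable via load balancer') from exc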
Comparison
Client-side discovery (client picks the instance):
- Consul
- Eureka
- Custom registry
Server-side discovery (load balancer picks the instance):
- Kubernetes Services
- AWS ELB
- NGINX
Health Checking
Failed instances must be removed from rotation:
Passive Checks: The client detects a failure when a call fails and retries with a different instance.
Active Checks: The registry periodically calls /health on each instance and removes instances that don't respond.
- Health Check Example
from datetime import datetime

import requests
from requests.exceptions import RequestException, Timeout

# Service health endpoint
@app.get('/health')
def health_check():
    return {
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'dependencies': {
            'database': check_database_connection(),
            'cache': check_cache_connection()
        }
    }

# Registry health checking
class HealthChecker:
    def check_instance(self, instance):
        try:
            response = requests.get(
                f'http://{instance.address}:{instance.port}/health',
                timeout=5
            )
            return response.status_code == 200
        except (RequestException, Timeout):
            return False

    def monitor_instances(self):
        for service_name in registry.services():
            instances = registry.get_instances(service_name)
            for instance in instances:
                if not self.check_instance(instance):
                    # Remove failed instance from rotation
                    registry.deregister_instance(
                        service_name, instance
                    )
Eventual Consistency in Discovery
Service registry changes are eventually consistent:
1. Instance fails
2. Health check fails (5-10 seconds later)
3. Registry removes the instance
4. Clients query the registry and get the updated list
Total time: 10-30 seconds
During this window, clients may still try dead instances. To cope, you need (see the sketch after this list):
- Fast timeouts (short timeout catches failures quickly)
- Retries (try next instance if one fails)
- Circuit breaker (stop trying failed service after N failures)
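A minimal sketch of these mitigations on the client side, reusing the instance objects from the earlier examples; the timeout and retry budget values are illustrative assumptions.
import requests
from requests.exceptions import RequestException

def call_with_retries(instances, path, payload, timeout=1.0, max_attempts=3):
    """Try up to max_attempts instances; short timeouts catch dead instances quickly."""
    last_error = RuntimeError('no instances available')
    for instance in instances[:max_attempts]:
        try:
            return requests.post(
                f'http://{instance.address}:{instance.port}{path}',
                json=payload,
                timeout=timeout  # fast timeout: fail quickly on a dead instance
            )
        except RequestException as exc:
            last_error = exc  # remember the failure, move on to the next instance
    # All attempts failed; a circuit breaker would open here after repeated failures
    raise last_error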
Advanced Topics
DNS-Based Service Discovery
DNS is a simpler alternative to service registries for some use cases:
# Simple approach: single DNS record per service
# Example: payment-service.internal → Single IP (load balancer)
Pros:
- Built-in to operating systems
- No special client libraries needed
- Language-agnostic
- Simple and reliable
Cons:
- DNS TTL means stale entries (5-60 seconds)
- The load balancer behind the single record is a single point of failure
- Health checks must be at load balancer level
- Not suitable for rapid instance changes
# Modern DNS-SD (DNS Service Discovery)
# SRV records: _service._proto.name
# Example: _payment._tcp.internal → [payment1:3000, payment2:3000, payment3:3000]
Pros:
- Multiple instances in single DNS query
- Service metadata in DNS
- Clients see all instances and can load-balance across them (see the lookup sketch below)
- Better than single A record
Cons:
- Requires SRV record support (not all systems)
- Still has DNS TTL issues
- Updates take time to propagate
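A minimal SRV lookup sketch, assuming the dnspython library is installed and using the hypothetical _payment._tcp.internal record from the comment above:
import random
import dns.resolver  # assumes the dnspython package is installed

def resolve_srv(name='_payment._tcp.internal'):
    """Return (host, port) pairs from an SRV record (hypothetical record name)."""
    answers = dns.resolver.resolve(name, 'SRV')
    return [(str(answer.target).rstrip('.'), answer.port) for answer in answers]

# Usage: pick one instance; upstream DNS caching means results can still be stale
host, port = random.choice(resolve_srv())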
Handling Network Partitions
Service discovery becomes complex during network splits:
Scenario: Network partition (client can't reach registry for 30 seconds)
Client behavior (should be):
- Use cached registry data
- Try known instances
- If all fail, use stale data (better than failing)
- Once partition heals, refresh from registry
Bad behavior:
- Client loses registry connection → panic
- Tries random instances
- Fails all requests during partition
Implementation:
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

class ServiceDiscoveryClient:
    def __init__(self):
        self.cached_instances = {}  # From last successful query
        self.cache_age = {}         # When cache was updated

    def get_instances(self, service_name):
        try:
            # Query fresh from registry
            fresh = self.registry.query(service_name)
            self.cached_instances[service_name] = fresh
            self.cache_age[service_name] = datetime.now()
            return fresh
        except RegistryUnreachable:
            # Use cache if available
            if service_name in self.cached_instances:
                age = datetime.now() - self.cache_age[service_name]
                logger.warning(f"Using stale cache ({age.seconds}s old)")
                return self.cached_instances[service_name]
            else:
                # No cache, truly isolated
                raise ServiceDiscoveryUnavailable()
Practical Deployment Examples
Kubernetes (Built-in Service Discovery)
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment
  ports:
    - port: 3000
      targetPort: 3000
  type: ClusterIP

# Kubernetes DNS automatically creates:
# payment-service.default.svc.cluster.local → 10.0.0.1 (virtual IP)
# kube-proxy load-balances traffic to all backend Pods
# Clients simply call: payment-service:3000
Consul (Explicit Service Discovery)
import random
import consul

consul_client = consul.Consul(host='consul.example.com', port=8500)

# Service registration (runs on application startup)
consul_client.agent.service.register(
    name='payment-service',
    service_id='payment-1',
    address='payment1.internal',
    port=3000,
    check=consul.Check.http(
        'http://payment1:3000/health',
        interval='5s',
        timeout='1s',
        deregister='10s'  # Deregister after 10s of failed health checks
    )
)

# Service discovery (client queries)
index, services = consul_client.health.service(
    'payment-service',
    passing=True  # Only healthy instances
)

# Usage: randomly pick one healthy instance
instance = random.choice(services)
Docker Swarm (Built-in DNS)
version: '3'
services:
  payment-service:
    image: payment:latest
    deploy:
      replicas: 3  # 3 instances

# Docker Swarm's built-in DNS automatically load-balances across replicas
# Clients call: payment-service:3000 (resolved by Swarm)
Observing Service Discovery
Monitor discovery health to catch issues:
import logging
import time
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

class ServiceDiscoveryMonitoring:
    def monitor_stale_instances(self):
        """Alert if old instances are still registered"""
        all_instances = self.registry.all_instances()
        for instance in all_instances:
            age = datetime.now() - instance.registered_at
            if age > timedelta(days=7):
                logger.error(f"Stale instance {instance.id} ({age.days} days)")

    def monitor_discovery_latency(self):
        """Track how long service discovery queries take"""
        start = time.time()
        instances = self.registry.get_instances('payment-service')
        latency_ms = (time.time() - start) * 1000
        if latency_ms > 100:
            logger.warning(f"Service discovery latency: {latency_ms:.0f}ms")

    def monitor_health_check_coverage(self):
        """Verify all instances have health checks"""
        for instance in self.registry.all_instances():
            if not instance.has_health_check:
                logger.error(f"Instance {instance.id} has no health check!")

    def alert_on_discovery_outage(self):
        """Alert if service discovery itself is down"""
        try:
            self.registry.ping()
        except Exception:
            logger.critical("Service discovery is unreachable!")
            # Trigger page (critical)
Self-Check
1. What happens when a service instance fails? How long until clients know?
- Passive: Clients try the instance, it fails, they retry with a different instance. Time: 1-5 seconds (depends on timeout).
- Active health checks: The registry detects the failure in 5-10 seconds. Clients query the registry and see the updated list. Time: 5-10 seconds.
2. Is your service registration automatic or manual?
- It should be automatic: services register themselves on startup (or the orchestrator registers for them).
- Manual registration is error-prone and doesn't work at scale.
3. Do your health checks actually verify the service works?
- Check: HTTP GET /health returns 200. Problem: the service returns 200 even though its database is down.
- Better: the /health endpoint checks critical dependencies (database, cache) and returns 200 only if all are OK (see the sketch below).
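A minimal sketch of such a dependency-aware /health endpoint, assuming a FastAPI-style app and the check_database_connection / check_cache_connection helpers from the earlier example:
from fastapi.responses import JSONResponse  # assumes a FastAPI app, as in the earlier example

@app.get('/health')
def health_check():
    dependencies = {
        'database': check_database_connection(),  # helpers from the earlier example
        'cache': check_cache_connection(),
    }
    healthy = all(dependencies.values())
    # Registries and load balancers treat any non-200 response as unhealthy
    return JSONResponse(
        status_code=200 if healthy else 503,
        content={
            'status': 'healthy' if healthy else 'unhealthy',
            'dependencies': dependencies,
        },
    )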
Service discovery enables horizontal scaling. Use active health checks to remove failed instances quickly. Combine with retries and timeouts for reliability. Cache discovery results to handle registry outages gracefully.
Next Steps
- Service Mesh: Read Service Mesh for advanced features (retries, timeouts, circuit breakers)
- Resilience: Learn Circuit Breaker
- Communication: Explore Synchronous vs Asynchronous
- Load Balancing: Study Load Balancing
References
- Newman, S. (2015). "Building Microservices". O'Reilly Media.
- Richardson, C. (n.d.). "Service Discovery Pattern". microservices.io.
- Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes". ACM Queue.
- Consul Service Discovery (HashiCorp documentation)
- Kubernetes Services (Kubernetes documentation)