
Scheduling and Autoscaling

Distribute pods and scale capacity dynamically based on demand.

TL;DR

Scheduling: Assign pod to node. Kubernetes scheduler uses: resource requests (CPU, memory), taints/tolerations (node constraints), node affinity (prefer certain nodes), pod affinity (group related pods). HPA (Horizontal Pod Autoscaler): Scale pod count based on metrics (CPU, memory, custom). VPA (Vertical Pod Autoscaler): Adjust resource requests. Cluster Autoscaler: Add nodes when pods can't fit. Golden rule: measure before scaling (scale at 70% utilization, not 90%+).

Learning Objectives

  • Understand the pod scheduling algorithm
  • Configure resource requests and limits
  • Use taints and tolerations
  • Implement pod affinity/anti-affinity
  • Set up horizontal pod autoscaling (HPA)
  • Configure custom metrics for scaling
  • Implement cluster autoscaling
  • Monitor scaling events
  • Debug scheduling issues

Motivating Scenario

A deployment fails with "Insufficient CPU". Pods are scattered across nodes and new ones can't fit. The Cluster Autoscaler doesn't know when to add nodes. HPA scales to 1000 pods, but there are no resources to run them. Result: pending pods and poor utilization. With proper scheduling, pods pack efficiently and HPA and the Cluster Autoscaler work together.

Core Concepts

Scheduling Process

1. Filter nodes:
   ✓ Enough free CPU/memory for the pod's requests
   ✓ Pod tolerates the node's taints
   ✓ Node matches required node affinity

2. Score the remaining nodes:
   - Prefer less loaded nodes
   - Prefer nodes matching soft (preferred) rules

3. Bind the pod to the highest-scoring node

Pod Resource Model

spec:
  containers:
  - name: app
    image: myapp:1.0
    resources:
      requests:
        cpu: "100m"      # Reserve 100 millicores
        memory: "128Mi"  # Reserve 128Mi
      limits:
        cpu: "500m"      # Max 500 millicores
        memory: "512Mi"  # Max 512Mi
  • Request: Reserved for the pod at scheduling time (guaranteed available)
  • Limit: Maximum allowed (CPU above the limit is throttled; memory above the limit gets the container OOM-killed)

Scaling Types

Type | Trigger                       | Adjustment      | Use Case
-----|-------------------------------|-----------------|-------------------
HPA  | CPU/memory % or custom metric | Scale pod count | Stateless services
VPA  | Actual resource usage         | Adjust requests | Right-sizing
CA   | Pods pending (no node space)  | Add nodes       | Cluster growth

Implementation

# Pod with resource requirements
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:1.0
    resources:
      requests:
        cpu: "100m"      # 100 millicores (0.1 CPU)
        memory: "128Mi"  # 128 MB
      limits:
        cpu: "500m"      # 500 millicores
        memory: "512Mi"  # 512 MB
---
# Taints and tolerations (run workloads on dedicated nodes)
# Taint the node (pods without a matching toleration won't schedule):
#   kubectl taint nodes gpu-node gpu=yes:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: compute
    image: ml-trainer:1.0
  tolerations:
  - key: gpu
    operator: Equal
    value: "yes"
    effect: NoSchedule
---
# Pod affinity (co-locate pods)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        # Preferred: API pods on the same nodes as cache pods
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cache
              topologyKey: kubernetes.io/hostname
        # Anti-affinity: spread API pods across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api
            topologyKey: kubernetes.io/hostname
        # Node affinity: only nodes labeled disk=ssd
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk
                operator: In
                values:
                - ssd
      containers:
      - name: api
        image: api:1.0
---
# Node selector (simple affinity)
apiVersion: v1
kind: Pod
metadata:
  name: high-memory-app
spec:
  nodeSelector:
    memory-type: high  # Only nodes with this label
  containers:
  - name: app
    image: app:1.0
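
The Implementation listing above covers scheduling only; for autoscaling, a minimal HPA targeting the api Deployment from the affinity example might look like the sketch below (the 2/20 replica bounds are illustrative; the 70% CPU target matches the guideline in the TL;DR).

# Horizontal Pod Autoscaler (CPU-based), minimal sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # the Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # add pods when average CPU exceeds 70% of requests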

Real-World Scenarios

Scenario 1: E-Commerce Traffic Spike

Normal traffic: 10 pods
Flash sale starts: 1000x traffic spike

HPA detects CPU > 70%
Scales to 100 pods (max)
Cluster Autoscaler adds nodes
System handles spike

30 min later: HPA scales down to 15 pods
CA removes unused nodes
Cost: $X for spike period

Scenario 2: Batch Processing

Batch job: Sort 1M orders
Requires 50 CPU cores, 100GB memory
Pod affinity: Keep batch pods on same nodes (fast network)
Taint node: Only batch workloads
HPA: Scale to 50 pods max
Result: 50 concurrent tasks, completes in 1 hour

Scenario 3: Mixed Workloads

On-demand: Production API (responsive)
Spot instances: Batch jobs (cost-sensitive)
GPU nodes: ML training (expensive)

Scheduling:
- API: preferredDuringScheduling on on-demand
- Batch: preferredDuringScheduling on spot
- ML: required on GPU nodes

Result: 40% cost savings without reliability loss

Common Mistakes

Mistake 1: No Resource Requests

# ❌ WRONG: No explicit requests
resources:
  limits:
    cpu: "500m"

# Requests silently default to the limit value,
# so the scheduler reserves far more than the pod typically needs

# ✅ CORRECT: Explicit requests + limits
resources:
  requests:
    cpu: "100m"
  limits:
    cpu: "500m"

Mistake 2: HPA on Wrong Metrics

# ❌ WRONG: Scale on a raw request counter (cumulative, misleading)
- metric:
    name: http_requests_total

# ✅ CORRECT: Scale on a per-pod rate
- metric:
    name: http_requests_per_second_per_pod
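
In a full autoscaling/v2 HPA, the per-pod rate becomes a Pods-type metric. A hedged sketch, assuming the metric is already exposed through a custom metrics adapter and that ~100 req/s per pod is the desired target (both are assumptions, not values from this guide):

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second_per_pod   # assumed adapter-exposed name
    target:
      type: AverageValue
      averageValue: "100"                      # assumed target: ~100 req/s per pod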

Mistake 3: CA Doesn't Know When to Add

# ❌ WRONG: HPA maxes out and pods stay Pending
# Cluster Autoscaler is misconfigured (or not installed), so it never reacts

# ✅ CORRECT: Cluster Autoscaler watches pending pods
# and automatically adds nodes, typically within a couple of minutes
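
The Cluster Autoscaler runs as a per-cloud Deployment, and its reaction speed is controlled by container args. A sketch of the relevant flags (image tag, cloud provider, and values are illustrative; check your provider's install docs):

# Excerpt from the cluster-autoscaler container spec (illustrative values)
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=gce               # match your cloud
  - --scan-interval=10s                # how often pending pods are re-checked
  - --scale-down-unneeded-time=10m     # idle time before a node is removed
  - --expander=least-waste             # pick the node group that wastes the least capacity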

Design Checklist

  • Resource requests defined for all containers?
  • Resource limits set (prevent OOM kill)?
  • HPA configured (min/max replicas)?
  • Scaling metrics chosen (CPU, memory, custom)?
  • HPA scale-up/down delays tuned?
  • Pod affinity configured (co-location)?
  • Pod anti-affinity configured (spreading)?
  • Node selector or node affinity used?
  • Taints and tolerations for special nodes?
  • Cluster Autoscaler enabled?
  • Monitoring of scaling events?
  • Runbook for scaling failures?

Next Steps

  1. Define resource requests/limits
  2. Set up HPA for stateless services
  3. Configure scaling metrics
  4. Deploy Cluster Autoscaler
  5. Monitor scaling events
  6. Tune scaling parameters
  7. Test scaling under load
  8. Document scaling behavior

Advanced Scaling Scenarios

Custom Metrics Autoscaling

Scale based on application-specific metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 50
  metrics:
  # Scale based on queue depth
  - type: Pods
    pods:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: orders
      target:
        type: AverageValue
        averageValue: "10"  # 10 messages per pod

Metrics flow:

  1. Custom metric emitted by app
  2. Collected by metrics collector (Prometheus, Stackdriver)
  3. Adapter exposes it to the Kubernetes custom metrics API (see the adapter rule sketch below)
  4. HPA reads and scales
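
For step 3, if the collector is Prometheus with the prometheus-adapter, the rule that maps the application metric onto the custom metrics API might look roughly like this (metric and label names follow the queue example above; the query is a sketch, not a drop-in config):

# prometheus-adapter rule (ConfigMap data), rough sketch
rules:
- seriesQuery: 'sqs_queue_depth{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    as: "sqs_queue_depth"
  metricsQuery: 'sum(avg_over_time(sqs_queue_depth{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'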

Scaling Policies

Control scaling speed with the optional behavior block under the HPA spec (autoscaling/v2):

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
    - type: Percent
      value: 100                     # Double pods
      periodSeconds: 15
    - type: Pods
      value: 10                      # Add 10 pods max
      periodSeconds: 15
    selectPolicy: Max                # Use most aggressive

  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 min
    policies:
    - type: Percent
      value: 25                      # Remove 25% of pods
      periodSeconds: 60
    selectPolicy: Min                # Use least aggressive

Pod Disruption Budgets

Ensure a minimum number of pods stays available while nodes are drained or scaled down:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # At least 2 pods must be available
  selector:
    matchLabels:
      app: api

When the Cluster Autoscaler drains a node (or any other voluntary eviction occurs):

  • Kubernetes respects the PDB
  • Won't evict a pod if doing so would violate the constraint
  • Ensures service availability

Vertical Scaling

Automatically adjust resource requests based on observed usage:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Recreate"  # Restart pods to apply
  resourcePolicy:
    containerPolicies:
    - containerName: api
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi

VPA recommender observes usage and adjusts:

  • If consistently using 200m CPU, recommend 250m (+ overhead)
  • If consistently using 512Mi memory, recommend 640Mi (+ overhead)
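
The recommendation is written to the VPA object's status, where it can be inspected. An illustrative shape (the numbers just mirror the bullets above, not real output):

# Illustrative status on the api-vpa object
status:
  recommendation:
    containerRecommendations:
    - containerName: api
      lowerBound:
        cpu: 150m
        memory: 480Mi
      target:                # what the updater applies as the new requests
        cpu: 250m
        memory: 640Mi
      upperBound:
        cpu: 600m
        memory: 1Gi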

Cost Optimization

Reduce scaling costs:

  1. Spot instances: roughly 70% cheaper, but can be interrupted

     nodeAffinity:
       preferredDuringSchedulingIgnoredDuringExecution:
       - weight: 100
         preference:
           matchExpressions:
           - key: cloud.google.com/gke-preemptible
             operator: In
             values:
             - "true"

  2. Reserved instances: 30-50% cheaper for committed capacity

     nodeSelector:
       cloud.google.com/gke-nodepool: reserved-pool

  3. Mixed strategy: Use spot for flexible workloads, reserved for the baseline

Troubleshooting Scaling Issues

Pod Pending

Cause: Not enough resources

Debug:

kubectl describe pod <pod-name>
# Look for: Insufficient CPU, Insufficient Memory

# Check node resources
kubectl describe node <node-name>

# Solution: Reduce resource requests, add nodes, or lower the pod count

HPA Not Scaling

Cause: Metrics not available

Debug:

kubectl get hpa
kubectl describe hpa <hpa-name>

# Check metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"

# Solution: Install metrics-server, configure custom metrics adapter

Scaling Too Slow/Fast

Tune scaling parameters:

# Currently scaling too slow?
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # Was 300, reduce delay
    policies:
    - type: Percent
      value: 100                     # Increase aggression
      periodSeconds: 15

# Currently scaling too fast?
behavior:
  scaleUp:
    stabilizationWindowSeconds: 300  # Add delay
    policies:
    - type: Percent
      value: 50                      # Reduce aggression
      periodSeconds: 30

Conclusion

Scheduling and autoscaling are foundations of Kubernetes:

  • Resource requests guide scheduler
  • Pod affinity/anti-affinity control placement
  • HPA scales based on metrics
  • Cluster Autoscaler adds nodes

Tuning requires:

  • Measuring current resource usage
  • Setting appropriate requests/limits
  • Configuring HPA thresholds
  • Monitoring scaling events

Result: Efficient resource utilization, automatic scaling, cost optimization.

HPA Real-World Configurations

Web Server (CPU-based scaling):

minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 70
# Typical for stateless web servers

Queue Worker (Custom metric):

minReplicas: 1
maxReplicas: 50
metrics:
- type: Pods
  pods:
    metric:
      name: queue_depth
    target:
      type: AverageValue
      averageValue: "10"  # 10 items per worker
# Scale based on work queue size

Database (stateful, fixed replicas):

# Don't use HPA for stateful workloads
# Instead: VPA for right-sizing, Cluster Autoscaler for node capacity
replicas: 3  # Fixed count, manual scaling

Resource Request Tuning

Common starting points (adjust based on actual metrics):

Web service:
cpu: 100m
memory: 128Mi

API service:
cpu: 250m
memory: 512Mi

Worker:
cpu: 500m
memory: 1Gi

Database:
cpu: 2000m
memory: 8Gi

Measure actual usage, then adjust requests accordingly.