Cost Controls & Quotas

TL;DR

CPU and memory requests reserve capacity for a pod; limits prevent a pod from consuming more than allowed. Namespace quotas prevent any single team from hoarding the entire cluster. ResourceQuota, LimitRange, and admission policies enforce governance automatically. Track costs per team and service via resource tags and usage monitoring. Implement chargeback models so teams see (and pay for) their consumption, creating accountability and driving efficiency.

Learning Objectives

  • Design effective resource request and limit strategies aligned with actual workload behavior
  • Implement and enforce namespace quotas for fair multi-tenant resource allocation
  • Monitor and attribute costs to teams, services, and workloads
  • Build chargeback and showback models to drive cost awareness and optimization
  • Enforce Quality of Service (QoS) guarantees through admission policies
  • Optimize resource utilization to reduce cloud spending

Motivating Scenario

A company deploys services to Kubernetes without requests or limits. One team runs a data-processing job that consumes all available CPU and memory. Other teams' pods are evicted. Critical services go down. The bill is astronomical because the cluster never stops scaling up to meet demand. No visibility into what costs what.

With proper cost controls: Each team gets a namespace with a ResourceQuota (10 CPU, 20GB memory). Pods must declare requests and limits. A runaway job hits the quota and gets rejected instead of starving others. Cost attribution per team drives optimization: "Your daily cost is $150; you can reduce it to $80 by tuning that memory allocation."

Core Concepts

Resource Requests vs Limits

Requests tell Kubernetes the minimum resources a pod needs. The scheduler uses requests to decide which node can fit the pod.

Limits prevent a pod from consuming more than specified. If a pod tries to exceed its memory limit, it is OOMKilled. If it tries to exceed its CPU limit, it is throttled.

The relationship matters:

  • Requests ≤ Limits (always)
  • Requests = Guaranteed capacity (scheduler reserves it)
  • Limits = Safety ceiling (prevents runaway)
  • Requests < Limits = Burst capacity (use extra when available, but give up if needed)

Quality of Service (QoS) Classes

Kubernetes assigns every pod a QoS class based on requests and limits. This determines eviction priority.

QoS Class  | Requests & Limits     | Guarantee                      | Eviction Priority
Guaranteed | requests = limits     | Pod gets exactly its requests  | Evicted last (killed only if it exceeds its limits)
Burstable  | requests < limits     | Minimum guarantee plus burst   | Evicted if the node comes under pressure
BestEffort | No requests or limits | None                           | Evicted first
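
For example, a container whose requests exactly equal its limits for both CPU and memory lands in the Guaranteed class. A minimal sketch (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: cache              # illustrative name
  namespace: team-a
spec:
  containers:
    - name: redis
      image: redis:7
      resources:
        requests:          # requests == limits for every resource -> Guaranteed QoS
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"

Kubernetes records the assigned class in the pod's status.qosClass field.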

Namespace Quotas

A ResourceQuota object limits total resource consumption per namespace. Prevents one team from monopolizing the cluster.
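
A minimal sketch of such a quota, using figures in the spirit of the motivating scenario (the name and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"       # sum of CPU requests across the namespace
    requests.memory: 20Gi    # sum of memory requests
    limits.cpu: "20"         # sum of CPU limits
    limits.memory: 40Gi      # sum of memory limits
    pods: "50"               # cap on pod count

Once a quota constrains CPU or memory, every new pod in the namespace must declare the corresponding requests and limits or it is rejected at admission.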

LimitRanges

A LimitRange enforces min/max resource constraints on individual pods. Prevents outliers (e.g., a 128GB memory pod when sane max is 8GB).
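
A sketch of a per-container LimitRange (the values mirror the defaults suggested under Next Steps and are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
      min:
        cpu: 50m
        memory: 64Mi
      max:
        cpu: "4"
        memory: 8Gi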

Cost Attribution

Track actual CPU/memory usage over time. Correlate with pricing. Bill teams fairly. Cost visibility drives efficiency.
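
One way to sketch this with Prometheus recording rules, assuming the Prometheus Operator's PrometheusRule CRD is installed and cAdvisor metrics are being scraped; the rule names and the $0.03 per core-hour price are placeholders, not real rates:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-attribution
  namespace: monitoring
spec:
  groups:
    - name: cost.rules
      rules:
        # Average CPU cores consumed per namespace over the last hour.
        - record: 'namespace:cpu_usage_cores:avg1h'
          expr: 'sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))'
        # Rough hourly CPU cost: cores x assumed $0.03 per core-hour (placeholder).
        - record: 'namespace:cpu_cost_dollars_per_hour'
          expr: 'namespace:cpu_usage_cores:avg1h * 0.03'

Purpose-built tools (e.g., OpenCost) take the same approach further by pulling actual cloud prices and attributing them by namespace and label.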

Practical Example

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: team-a
spec:
  containers:
    - name: app
      image: myapp:1.0.0
      ports:
        - containerPort: 8080
      resources:
        # REQUESTS: guarantee this much capacity
        requests:
          cpu: "500m"      # 0.5 CPU core (can be shared)
          memory: "512Mi"  # 512 MiB minimum
        # LIMITS: cap at this much (prevent runaway)
        limits:
          cpu: "1000m"     # 1.0 CPU core max (throttled beyond this)
          memory: "1Gi"    # 1 GiB max (OOMKilled if exceeded)
---
# Deployment: more realistic (replicas); selector and labels are required
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: team-a
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: api:2.0.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
# StatefulSet with per-instance resources
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: data-platform
spec:
  serviceName: postgres    # headless Service name, required for StatefulSets
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          resources:
            requests:
              cpu: "2"       # 2 cores per replica
              memory: "4Gi"  # 4 GiB per replica
            limits:
              cpu: "4"       # can burst to 4 cores
              memory: "8Gi"  # max 8 GiB

Key Decision Points:

  • Requests = what the pod needs normally
  • Limits = what pod must not exceed (safety valve)
  • The sum of requests scheduled onto a node cannot exceed the node's allocatable capacity (the scheduler enforces this)
  • The sum of limits on a node can exceed its capacity (oversubscription is acceptable when bursts are temporary)

When to Use / When NOT to Use

Cost Controls: Best Practices vs Anti-Patterns
Best Practices
  1. DO: Right-Size Based on Measured Workload: A production API averages 500m CPU and peaks at 1.2 CPU: set request=600m, limit=1500m. A batch job uses 4 CPU consistently: set request=limit=4.
  2. DO: Enforce Quotas Per Team: Team A quota: 20 CPU, 40GB. Team B quota: 10 CPU, 20GB. Over time, enforce tight quotas (5% headroom). Prevents monopolization.
  3. DO: Set Pod Limits Well Below Node Capacity: Node capacity: 16 CPU, 64 GiB. Max pod limit: 4 CPU, 8Gi. Multiple pods fit. Cluster consolidates well.
  4. DO: Monitor Actual Usage vs Requests: A pod requests 2 CPU but actually uses 300m on average. Adjust the request down to 400m and the limit down to 800m, freeing roughly 80% of the reserved capacity.
  5. DO: Implement Cost Attribution: Team sees weekly dashboard: 'Your average daily cost is $215, trend is +5% from last week.' This drives immediate optimization.
  6. DO: Use Admission Control to Enforce Policy: A webhook rejects any pod without requests/limits defined and rejects memory requests above 8Gi. This prevents outliers and keeps the cluster healthy (see the policy sketch after these lists).
Anti-Patterns
  1. DON'T: Guess resource sizes blindly: "Give everything 4 CPU and 8GB just in case." This starves the cluster and prevents autoscaling from working properly.
  2. DON'T: Run without quotas: One team's data job consumes 80% of the cluster, other teams' services fail, and the cluster auto-scales endlessly, costing millions.
  3. DON'T: Let pod limits equal node capacity: With a 16 CPU pod limit on a 16 CPU node, only one pod fits per node. The cluster is drastically underutilized: 70% idle capacity, money wasted.
  4. DON'T: Leave requests unmeasured: Requests stay inflated forever. 70% of the cluster sits idle, but the quota says "full." Waste is invisible.
  5. DON'T: Keep costs opaque: Teams waste freely. Finance gets surprised by a $500k bill at month-end. No accountability.
  6. DON'T: Rely on voluntary guidelines: Some teams follow them, others don't. The result is resource fragmentation, unpredictable performance, and hard-to-schedule workloads.
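
As a sketch of the admission-control practice above, assuming Kyverno is installed (a native ValidatingAdmissionPolicy or a custom webhook could serve the same role):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce   # reject non-compliant pods instead of only warning
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # any non-empty value
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"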

Patterns & Pitfalls

  • Pitfall: No requests or limits. A pod deployed without them uses all the memory on its node, gets OOMKilled, and takes other pods down with it through evictions and cascading failures. There is no visibility into what caused the chaos and no way to prevent it recurring.
  • Pitfall: Copy-pasted sizing. A team copies request/limit values from another team without understanding its own workload. Some pods end up over-provisioned (waste), others under-provisioned (throttled, slow), and the resulting fragmentation prevents efficient scheduling.
  • Pitfall: Opaque quota errors. The quota is 50 CPU and the team hits a "quota exceeded" error, but there is no dashboard showing what consumes the quota and no tooling to investigate. The frustrated team asks for a bigger quota, and the cycle repeats.
  • Pattern: Iterative right-sizing. 1. Deploy with conservative estimates (e.g., 1 CPU, 2 GiB). 2. Monitor actual usage for a week. 3. Adjust requests to actual + 20% headroom. 4. Adjust limits to peak + 30% headroom. 5. Repeat monthly. This converges on sensible values within 2-3 months (see the VPA sketch after this list).
  • Pattern: Quota-driven design. Assign a tight quota to a namespace (e.g., 10 CPU, 20 GiB). It forces the team to design efficiently: small, composable services instead of monoliths, and autoscaling within a budget (5 replicas x 2 CPU = 10 CPU). The team optimizes, or adds resources if the business justifies it.
  • Pattern: Cost dashboards and alerts. A team dashboard shows daily cost, weekly trend, and cost per service, with an alert when daily cost rises more than 20% week-over-week. Runaway processes (a memory leak, a new job) are caught within hours, not weeks, enabling rapid optimization.
  • Pattern: Environment-tiered policies. Prod gets tight quotas and strict admission control; staging gets moderate quotas; dev gets loose quotas and optional requests. Policies match risk tolerance, keeping "big ideas" out of prod while allowing experimentation in dev.
  • Pitfall: Set-and-forget sizing. Requests and limits are set at deployment and never revisited. Over six months usage patterns change, the values go stale (under-provisioned at peak, over-provisioned at baseline), and waste accumulates invisibly.
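
The iterative right-sizing loop can be partly automated: in recommendation-only mode the Vertical Pod Autoscaler publishes suggested requests without touching running pods. A sketch, assuming the VPA add-on is installed and targeting the api-server Deployment from the earlier example:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
  namespace: team-a
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods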

Design Review Checklist

  • Are resource requests based on measured workload data (not guesses or 'just in case')?
  • Do requests account for sustained average load, or peak load? Is the distinction clear?
  • Are limits set to allow normal burst (requests < limits) while preventing runaway (e.g., limits no more than about 2x requests)?
  • Does every namespace have a ResourceQuota defined and documented?
  • Is LimitRange configured to enforce sensible min/max per container and pod?
  • Are teams aware of their current resource quota and usage (e.g., alerted when usage passes 90% of quota)?
  • Is cost monitoring integrated (actual usage tracking, not just requests)?
  • Can teams see their cost per service/feature on a dashboard?
  • Is there a chargeback or showback model in place (teams see their spend)?
  • Are quotas tight enough to force optimization, yet loose enough to avoid team frustration?
  • Are admission webhooks preventing over-request (e.g., max 8Gi memory per container)?
  • Are old, unused pods regularly removed (monthly cost hygiene reviews)?
  • Is cluster autoscaler configured to remove underutilized nodes (cost optimization)?
  • Are cost anomalies detected and investigated within 24 hours (trending alerts)?
  • Can platform team justify quota allocations to finance (business case)?

Self-Check

  1. Right now, what are your top 3 resource consumers in your cluster by namespace? If you can't name them without querying Prometheus, your visibility is poor.
  2. How much does your cluster cost per month? Per team? If you don't know, that's a problem. Unknown costs are runaway costs.
  3. If you double the number of users tomorrow, will your cluster have room, or will autoscaling kick in (adding $X/day cost)? Can you predict the cost impact?
  4. One random pod in prod uses 16GB of memory. Is that a bug, a feature, or expected? How do you find out? How long does investigation take?
  5. If a team requests more quota, what's your approval process? Is it data-driven (cost analysis) or political (who shouts loudest)?

Next Steps

  1. Measure current usage — Deploy Prometheus or similar. Collect actual CPU/memory metrics for all pods for 1 week.
  2. Set realistic requests — For each workload, adjust requests from data (e.g., p95 of observed usage).
  3. Implement namespace quotas — One per team/project. Start generous (150% of current usage), then tighten over time.
  4. Configure LimitRange — Sane defaults: min 50m CPU/64Mi memory, max 4 CPU/8Gi memory per container.
  5. Add cost monitoring — Link resource usage to billing. Build a cost dashboard. Show teams their spend.
  6. Enable admission webhooks — Require all pods to declare requests and limits.
  7. Run cost optimization quarterly — Review quotas, right-size, delete obsolete workloads, celebrate savings.
  8. Educate teams — "Your pod requests 8GB but uses 2GB average. Reduce to 3GB, save ~$60/month."

References

  1. Kubernetes: Managing Resources for Containers
  2. Kubernetes: Resource Quotas
  3. Kubernetes: Configure Default Memory Requests and Limits
  4. FinOps: Cloud Cost Optimization
  5. CNCF: FinOps for Kubernetes
  6. Kubernetes: Limit Ranges