Kubernetes at Scale: Practical Lessons from Production

November 2, 2024

The honeymoon phase

When we first migrated to Kubernetes, everything felt magical. Declarative deployments, auto-healing, horizontal scaling: it was everything we'd dreamed of. Then we hit production scale, and the real lessons began.

Resource limits are not optional

The single biggest mistake teams make is skipping resource requests and limits. Without them, the scheduler has no idea how much a pod needs, and a single noisy-neighbour pod can starve everything else on its node. Set CPU and memory requests based on your service's steady-state usage, and set limits at 1.5 to 2 times that for burst capacity.
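
As a rough sketch of that guidance, a container spec might declare requests and limits like this. The Deployment name, image, replica count, and every number here are hypothetical. In practice the requests would come from your own steady-state measurements.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0.0   # placeholder image
          resources:
            requests:             # assumed steady-state usage
              cpu: 500m
              memory: 512Mi
            limits:               # 2x the requests, for burst capacity
              cpu: "1"
              memory: 1Gi
```

A container that exceeds its memory limit is OOM-killed, while one that hits its CPU limit is only throttled, so the memory limit usually deserves the more careful measurement.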

Pod disruption budgets

Kubernetes will drain nodes for updates, and without PodDisruptionBudgets a drain can evict every replica of a service at once. Set a PDB for any workload that must stay available during voluntary disruptions, stateful ones above all:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

Graceful shutdowns

When a pod is terminated, Kubernetes sends its containers a SIGTERM. Your application needs to catch this signal, stop accepting new requests, finish in-flight ones, and then exit. If it is still running when the grace period ends, it receives a SIGKILL, and the default 30 seconds is often too short for long-running requests.
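
On the Kubernetes side, the grace period is set per pod with terminationGracePeriodSeconds. Here is a minimal sketch, assuming a hypothetical api-server container whose slowest requests take about a minute. The preStop sleep is an optional, commonly used way to give load balancers time to stop routing traffic before SIGTERM arrives. It is an addition for illustration, not something this post prescribes, and it assumes the image ships a sleep binary.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server                # hypothetical pod name
spec:
  # Total time allowed for the preStop hook plus the application's
  # own shutdown before the container is sent SIGKILL.
  terminationGracePeriodSeconds: 90   # assumed value; default is 30
  containers:
    - name: api-server
      image: registry.example.com/api-server:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent to the container.
            command: ["sleep", "10"]
```

The grace period clock starts when termination begins, so the hook's 10 seconds count against the 90. Size the period to cover the hook plus your longest in-flight request, with some margin.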

The control plane is not a toy

etcd is the brain of your cluster. Back it up regularly, monitor its latency, and never run it on the same nodes as your workloads. A slow etcd means a slow API server, which means a slow cluster.
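
One way to monitor etcd latency is an alert on WAL fsync duration, which etcd exposes as the etcd_disk_wal_fsync_duration_seconds histogram. The rule below is a sketch that assumes Prometheus is already scraping your etcd members. The group name, alert name, threshold, and duration are illustrative choices to tune for your own hardware.

```yaml
groups:
  - name: etcd-latency            # hypothetical rule group name
    rules:
      - alert: EtcdSlowWalFsync   # hypothetical alert name
        # 99th percentile WAL fsync latency per etcd member over 5 minutes.
        expr: |
          histogram_quantile(
            0.99,
            sum by (instance, le) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            )
          ) > 0.5
        for: 10m                  # assumed; avoids firing on short spikes
        labels:
          severity: warning
        annotations:
          summary: "etcd member {{ $labels.instance }} has slow WAL fsyncs"
```

Slow fsyncs usually point at disk contention. That is one practical reason to keep etcd off the nodes that run your workloads.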

What I'd do differently

Start with namespace isolation from day one
Use network policies even in development
Invest in a good ingress controller early
Don't use the default storage class for production
Monitor the control plane before monitoring workloads
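
The first two items can be sketched together: a dedicated namespace with a default-deny ingress policy, to which you then add explicit allow rules. The namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                    # hypothetical namespace
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}                 # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress                     # no ingress rules listed, so all inbound traffic is denied
```

NetworkPolicies only take effect when the cluster's network plugin enforces them, so check that in development too.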
