Kubernetes at Scale: Practical Lessons from Production
November 2, 2024
The honeymoon phase
When we first migrated to Kubernetes, everything felt magical. Declarative deployments, auto-healing, horizontal scaling — it was everything we'd dreamed of. Then we hit production scale, and the real lessons began.
Resource limits are not optional
The single biggest mistake teams make is skipping resource requests and limits. Without them, the scheduler cannot place pods sensibly, and a single noisy-neighbour pod can starve everything else on its node. Set CPU and memory requests based on your service's steady-state usage, and set limits at 1.5 to 2 times the request to leave room for bursts.
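As a rough sketch of that guidance, the Deployment below sets requests at an assumed steady-state usage and limits at twice that. The name `api-server`, the image, and every number are illustrative placeholders, not measured values; substitute figures from your own metrics.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server              # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0.0   # placeholder image
          resources:
            requests:           # assumed steady-state usage; the scheduler reserves this much
              cpu: 250m
              memory: 256Mi
            limits:             # 2x the requests, for burst headroom
              cpu: 500m
              memory: 512Mi
```

A container that exceeds its memory limit is OOM-killed, while one that hits its CPU limit is only throttled, so the memory limit usually deserves the more careful margin.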
Pod disruption budgets
Kubernetes will drain nodes for updates, and without PodDisruptionBudgets a drain can evict every replica of a service at once. Set a PDB for any replicated workload that must stay available during a drain, stateful or not:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  # Keep at least 2 matching pods running during voluntary disruptions such as node drains.
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
Graceful shutdowns
When a pod is terminated, Kubernetes sends SIGTERM to its containers. Your application needs to catch this signal, stop accepting new requests, finish in-flight ones, and then exit. The default 30-second grace period (terminationGracePeriodSeconds) is often too short for long-running requests, and any process still running when it expires is killed with SIGKILL.
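On the Kubernetes side, the grace period is set per pod. A minimal sketch, assuming requests can take up to about a minute and that the container image ships a `sleep` binary for the `preStop` hook; the names and durations are illustrative assumptions, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server              # hypothetical name
spec:
  # Total budget covering the preStop hook plus shutdown work after SIGTERM.
  terminationGracePeriodSeconds: 90
  containers:
    - name: api-server
      image: registry.example.com/api-server:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; the pause gives load balancers
            # time to stop routing new traffic to this pod.
            command: ["sleep", "10"]
```

This only controls how long Kubernetes waits before forcing the kill. The application still has to handle SIGTERM itself.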
The control plane is not a toy
etcd is the brain of your cluster. Back it up regularly, monitor its latency, and never run it on the same nodes as your workloads. A slow etcd means a slow API server, which means a slow cluster.
What I'd do differently
- Start with namespace isolation from day one
- Use network policies even in development
- Invest in a good ingress controller early
- Don't use the default storage class for production
- Monitor the control plane before monitoring workloads
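For the first two items, a common starting point is a dedicated namespace with a default-deny ingress policy, to which explicit allow rules are added later. A minimal sketch, assuming a hypothetical namespace called `team-a` and a CNI plugin that actually enforces NetworkPolicy:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                  # hypothetical namespace
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}               # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress                   # no ingress rules listed, so all inbound traffic is denied
```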