Building Resilient Distributed Systems
December 15, 2024
Why resilience matters
Every system fails eventually. The question isn't *if* your distributed system will experience a failure — it's *how gracefully* it degrades when one happens. Over the past eight years, I've designed and operated systems that handle millions of requests daily, and the single most important lesson is this: design for failure from day one.
Circuit breakers
A circuit breaker is a pattern that prevents cascading failures. When a downstream service starts responding slowly or returning errors, the circuit breaker trips (opens) and subsequent calls fail fast instead of waiting for a timeout. This gives the downstream service time to recover. After a cooldown period the breaker goes half-open and lets calls through again to probe whether the service is healthy.
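```go
import (
	"errors"
	"sync"
	"time"
)

// ErrCircuitOpen is returned when the breaker rejects a call without running it.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// CircuitBreaker counts consecutive failures and rejects calls once the
// threshold is reached, until timeout has elapsed since the last failure.
type CircuitBreaker struct {
	failures    int           // consecutive failures seen so far
	threshold   int           // failures needed to open the circuit
	lastFailure time.Time     // when the most recent failure happened
	timeout     time.Duration // how long the circuit stays open
	mu          sync.Mutex    // guards the fields above
}

// Call runs fn unless the circuit is open, and records the outcome.
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.threshold && time.Since(cb.lastFailure) < cb.timeout {
		cb.mu.Unlock()
		return ErrCircuitOpen // open: fail fast
	}
	// Closed, or half-open once the timeout has elapsed: let the call through.
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		// A failure while half-open re-opens the circuit for another timeout.
		cb.failures++
		cb.lastFailure = time.Now()
		return err
	}
	cb.failures = 0 // a success closes the circuit
	return nil
}
```
Bulkheads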
Inspired by ship design, bulkheads isolate different parts of your system so a failure in one component doesn't sink the whole ship. In practice, this means separate connection pools, thread pools, and rate limiters for each downstream dependency.
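One way to picture a bulkhead is as a cap on in-flight calls per dependency. The sketch below is a minimal version built on a buffered channel; `Bulkhead`, `NewBulkhead`, `Do`, and `ErrBulkheadFull` are names made up for this illustration, not part of any library, and rejecting immediately when full (rather than queueing) is an assumption.

```go
package main

import "errors"

// ErrBulkheadFull is a hypothetical sentinel returned when no slot is free.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps the number of concurrent calls to a single dependency.
type Bulkhead struct {
	slots chan struct{} // one buffered slot per allowed in-flight call
}

// NewBulkhead returns a bulkhead that admits at most maxConcurrent calls at once.
func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Do runs fn if a slot is free and rejects immediately otherwise,
// so a slow dependency cannot tie up every caller.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }() // free the slot when fn returns
		return fn()
	default:
		return ErrBulkheadFull
	}
}
```

The isolation comes from giving each dependency its own instance, for example `payments := NewBulkhead(10)` and `search := NewBulkhead(50)`. If payments hangs and fills its ten slots, calls through `search` are unaffected.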
Graceful degradation
When a non-critical service fails, your system should still serve its core functionality. A recommendation engine going down shouldn't prevent users from placing orders. Fall back to default responses or cached data, or use feature flags to disable non-essential features.
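As a rough sketch of the recommendation example, the wrapper below tries the live call first, then the last good response for that user, then a static default. The `Recommendations` type and its `fetch` function are hypothetical stand-ins for whatever client you actually use, and the in-memory map is unbounded here purely to keep the example short.

```go
package main

import "sync"

// Recommendations wraps a non-critical dependency with two fallbacks:
// the last good response per user, then a static default list.
type Recommendations struct {
	fetch    func(userID string) ([]string, error) // the call that may fail
	defaults []string                              // e.g. a bestseller list
	mu       sync.Mutex
	lastGood map[string][]string // unbounded here; a real cache needs eviction
}

// NewRecommendations builds the wrapper around a fetch function and a default list.
func NewRecommendations(fetch func(string) ([]string, error), defaults []string) *Recommendations {
	return &Recommendations{
		fetch:    fetch,
		defaults: defaults,
		lastGood: make(map[string][]string),
	}
}

// Get never returns an error, so the caller always has something to render.
// degraded reports whether a fallback was used instead of a fresh response.
func (r *Recommendations) Get(userID string) (items []string, degraded bool) {
	items, err := r.fetch(userID)

	r.mu.Lock()
	defer r.mu.Unlock()
	if err == nil {
		r.lastGood[userID] = items // remember the fresh response
		return items, false
	}
	if cached, ok := r.lastGood[userID]; ok {
		return cached, true // stale but personalized
	}
	return r.defaults, true // generic, but the page still renders
}
```

Returning a `degraded` flag instead of an error keeps the order flow free of error handling for a feature it doesn't depend on, while still letting you count how often fallbacks are served.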
Observability is non-negotiable
You can't fix what you can't see. Every service should export metrics, structured logs, and distributed traces. I've found the OpenTelemetry standard to be the most practical approach — it's vendor-neutral and has excellent language support.
Key takeaways
- Design for failure, not just success
- Use circuit breakers to prevent cascading failures
- Isolate components with bulkheads
- Always have a degradation path
- Invest in observability early