Architecture | Distributed Systems | Go | 8 min read

Building Resilient Distributed Systems

December 15, 2024

Why resilience matters

Every system fails eventually. The question isn't *if* your distributed system will experience a failure — it's *how gracefully* it degrades when one happens. Over the past eight years, I've designed and operated systems that handle millions of requests daily, and the single most important lesson is this: design for failure from day one.

Circuit breakers

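A circuit breaker is a pattern that prevents cascading failures. When a downstream service starts responding slowly or returning errors, the circuit breaker trips (opens) and subsequent calls fail fast instead of waiting for a timeout. This gives the downstream service time to recover. After a cool-down period the breaker becomes half-open: it lets a trial call through, closes again if that call succeeds, and re-opens if it fails.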

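import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitOpen is returned when the breaker rejects a call without running it.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    failures    int           // consecutive failures seen so far
    threshold   int           // consecutive failures that trip the breaker
    lastFailure time.Time     // when the most recent failure happened
    timeout     time.Duration // how long to stay open before a trial call
    mu          sync.Mutex
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.failures >= cb.threshold && time.Since(cb.lastFailure) < cb.timeout {
        cb.mu.Unlock()
        return ErrCircuitOpen // open: fail fast
    }
    // Closed, or half-open because the timeout has elapsed: let the call through.
    // The failure count is kept, so a failed trial call re-opens the breaker at once.
    cb.mu.Unlock()

    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailure = time.Now()
        return err
    }
    cb.failures = 0 // any success closes the circuit
    return nil
}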

Bulkheads

Inspired by ship design, bulkheads isolate different parts of your system so a failure in one component doesn't sink the whole ship. In practice, this means separate connection pools, thread pools, and rate limiters for each downstream dependency.
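
As a rough illustration, a bulkhead in Go can be as small as a buffered channel used as a semaphore, with one instance per dependency. This is a minimal sketch: `Bulkhead`, `NewBulkhead`, and `ErrBulkheadFull` are hypothetical names, and rejecting immediately when full (rather than queueing with a deadline) is an assumption about the desired behavior.

```go
package main

import "errors"

// ErrBulkheadFull is returned when a compartment has no free slots.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps the number of in-flight calls to one dependency.
type Bulkhead struct {
    slots chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Do runs fn if a slot is free and rejects the call immediately otherwise,
// so a slow dependency cannot tie up every goroutine in the process.
func (b *Bulkhead) Do(fn func() error) error {
    select {
    case b.slots <- struct{}{}:
        defer func() { <-b.slots }() // release the slot when fn returns
        return fn()
    default:
        return ErrBulkheadFull
    }
}
```

With separate instances, say one for payments and one for search, a stalled payments backend can exhaust only its own slots while search calls keep flowing.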

Graceful degradation

When a non-critical service fails, your system should still serve its core functionality. A recommendation engine going down shouldn't prevent users from placing orders. Fall back to default responses or cached data, or use feature flags to switch off non-essential features.
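
To make the recommendation example concrete, here is one possible shape for a fallback wrapper: try the live service, fall back to the last good response, and finally to a static default. Everything named here (`Recommender`, `NewRecommender`, `ErrUnavailable`, the `fetch` callback) is a hypothetical stand-in, and a real cache would likely also need an expiry.

```go
package main

import (
    "errors"
    "sync"
)

// ErrUnavailable is a hypothetical sentinel a fetcher returns when the
// recommendation service cannot be reached.
var ErrUnavailable = errors.New("recommendations unavailable")

// Recommender wraps a non-critical dependency with two fallbacks:
// the last good response, then a static default.
type Recommender struct {
    mu       sync.Mutex
    fetch    func() ([]string, error) // hypothetical remote call
    cached   []string                 // last successful response, if any
    defaults []string                 // served when nothing is cached
}

func NewRecommender(fetch func() ([]string, error), defaults []string) *Recommender {
    return &Recommender{fetch: fetch, defaults: defaults}
}

// Get never returns an error, so the caller always has something to render.
// The second result reports whether the items came from a fallback.
func (r *Recommender) Get() ([]string, bool) {
    fresh, err := r.fetch()
    r.mu.Lock()
    defer r.mu.Unlock()
    if err == nil {
        r.cached = fresh
        return fresh, false
    }
    if r.cached != nil {
        return r.cached, true
    }
    return r.defaults, true
}
```

The order page can call `Get`, render whatever comes back, and use the degraded flag only for logging or a subtle UI hint, so checkout never blocks on recommendations.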

Observability is non-negotiable

You can't fix what you can't see. Every service should export metrics, structured logs, and distributed traces. I've found the OpenTelemetry standard to be the most practical approach — it's vendor-neutral and has excellent language support.

Key takeaways

Design for failure, not just success
Use circuit breakers to prevent cascading failures
Isolate components with bulkheads
Always have a degradation path
Invest in observability early
