Building an Observability Stack with OpenTelemetry
August 5, 2024
Why OpenTelemetry
Proprietary observability agents tie you to a single vendor. OpenTelemetry has become the de facto open standard for generating, collecting, and exporting telemetry data. It is vendor-neutral, has SDKs for most major languages, and can export to most backends, either over its own OTLP protocol or through backend-specific exporters.
The three pillars
Observability rests on three pillars: metrics, logs, and traces. OpenTelemetry handles all three with a unified data model.
Metrics
Metrics are numeric aggregations — request counts, error rates, latency percentiles. They tell you *what* is happening.
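To make "numeric aggregations" concrete, here is a minimal sketch that reduces raw request samples to a request count, an error rate, and latency percentiles. The sample data and the nearest-rank percentile helper are assumptions for illustration. In a real setup the SDK and backend do this aggregation for you, typically from histogram buckets rather than raw samples.

```python
import math

# Hypothetical raw request samples: (latency in milliseconds, HTTP status code).
samples = [(12, 200), (15, 200), (11, 200), (230, 500), (14, 200),
           (18, 200), (95, 200), (13, 200), (16, 200), (400, 503)]

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

latencies = sorted(latency for latency, _ in samples)
request_count = len(samples)
# Count 5xx responses as errors (an assumption; pick the definition that fits your service).
error_rate = sum(1 for _, status in samples if status >= 500) / request_count

summary = {
    "request_count": request_count,
    "error_rate": error_rate,
    "p50_ms": percentile(latencies, 50),
    "p95_ms": percentile(latencies, 95),
    "p99_ms": percentile(latencies, 99),
}
print(summary)
```

Note how the two slow requests barely move the median but dominate P95 and P99, which is why tail percentiles are worth tracking alongside averages.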
Logs
Logs are discrete events with structured metadata. They tell you *what happened* in detail.
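As a rough illustration of "discrete events with structured metadata", the sketch below uses Python's standard `logging` module to emit one JSON object per event. The field names (`severity`, `body`, `attributes`), the `checkout` logger name, and the order data are all assumptions chosen for the example, not a required schema.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields stay machine-readable."""
    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "body": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(event)

stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "payment declined",
    extra={"attributes": {"order_id": "A-1001", "http.status_code": 402}},
)

event = json.loads(stream.getvalue())
print(event["severity"], event["body"], event["attributes"])
```

Because the metadata is structured rather than interpolated into the message string, a backend can filter on `order_id` or `http.status_code` without parsing free text.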
Traces
Traces follow a single request across service boundaries. They tell you *where* the latency is.
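A trace can follow a request across services because each hop forwards a small piece of context. The sketch below hand-rolls the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`) to show the mechanism. The helper names are hypothetical, and in practice the OpenTelemetry SDK's propagators handle this for you.

```python
import secrets

def new_traceparent():
    """Start a trace: random 16-byte trace ID, random 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(incoming):
    """Continue a trace downstream: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# Service A starts the trace and sends the header with its outbound request.
header_from_a = new_traceparent()
# Service B receives the header and creates its own span inside the same trace.
header_from_b = child_traceparent(header_from_a)

trace_id_a = header_from_a.split("-")[1]
trace_id_b = header_from_b.split("-")[1]
# Both services report spans under one trace ID, so the backend can stitch them together.
print(trace_id_a == trace_id_b)
```

The shared trace ID is what lets a tracing backend lay the spans from every service on one timeline and show which hop contributed the latency.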
The collector pattern
The OpenTelemetry Collector is a vendor-agnostic service that receives, processes, and exports telemetry data. We run it as a DaemonSet, so every Kubernetes node has its own Collector instance:
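    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317   # applications send OTLP over gRPC here
    processors:
      batch:                         # flush after 1s or 1024 items, whichever comes first
        timeout: 1s
        send_batch_size: 1024
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889       # Prometheus scrapes metrics from this port
      otlp:
        endpoint: tempo.example.com:4317
        tls:
          insecure: false            # keep TLS enabled for the connection to Tempo
    # The Collector needs a service section to wire components into pipelines.
    # The routing below is an assumption: traces to Tempo, metrics to Prometheus.
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]

The dashboards that matter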
After years of iterating, these are the dashboards every platform team should have:
1. Service Overview: request rate, error rate, latency (P50/P95/P99), CPU, memory
2. Dependency Graph: service-to-service call graph with error rates
3. Deployment Health: metrics before/during/after each deployment
4. Cost by Service: CPU and memory usage per service per namespace
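The Dependency Graph dashboard in the list above can be derived directly from trace data. As a minimal sketch, assuming each finished span records its parent span, its service, and whether it failed, the call graph is an aggregation over parent-child pairs that cross a service boundary. The span tuples and service names here are hypothetical, and tracing backends usually compute this for you.

```python
from collections import defaultdict

# Hypothetical finished spans: (span_id, parent_span_id, service, is_error).
spans = [
    ("a1", None, "gateway", False),
    ("b1", "a1", "checkout", False),
    ("c1", "b1", "payments", True),
    ("a2", None, "gateway", False),
    ("b2", "a2", "checkout", False),
    ("c2", "b2", "payments", False),
]

service_of = {span_id: service for span_id, _, service, _ in spans}

# (caller, callee) -> [call count, error count]
edges = defaultdict(lambda: [0, 0])
for span_id, parent_id, service, is_error in spans:
    if parent_id is None:
        continue  # root span: no caller
    caller = service_of[parent_id]
    if caller == service:
        continue  # in-process child span, not a cross-service call
    edges[(caller, service)][0] += 1
    edges[(caller, service)][1] += int(is_error)

graph = {edge: {"calls": calls, "error_rate": errors / calls}
         for edge, (calls, errors) in edges.items()}
for (caller, callee), stats in graph.items():
    print(f"{caller} -> {callee}: {stats}")
```

Attributing each error to the edge where it occurred, rather than to the request as a whole, is what lets the graph highlight the failing dependency instead of every service upstream of it.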
The MTTR impact
Since implementing full OpenTelemetry instrumentation across our stack, our Mean Time To Resolve incidents dropped by 60%. The biggest win: distributed tracing lets us pinpoint the failing service in seconds instead of minutes.