Observability, OpenTelemetry, Grafana, TypeScript | 9 min read

Building an Observability Stack with OpenTelemetry

August 5, 2024

Why OpenTelemetry

Proprietary observability agents lock you into a vendor. OpenTelemetry, a CNCF project, has become the de facto standard for generating, collecting, and exporting telemetry data. It's vendor-neutral, has SDKs for most major languages, and can export to any backend that accepts OTLP or has a Collector exporter.

The three pillars

Observability rests on three pillars: metrics, logs, and traces. OpenTelemetry handles all three with a unified data model.

Metrics

Metrics are numeric aggregations — request counts, error rates, latency percentiles. They tell you *what* is happening.
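To make "numeric aggregations" concrete, here is a minimal sketch that reduces raw request samples to a request count, an error rate, and latency percentiles. The `RequestSample` shape and the `summarize` helper are hypothetical names for illustration. A real OpenTelemetry setup would usually record a histogram and let the backend estimate percentiles from buckets, so treat this as the idea rather than the SDK's behavior.

```typescript
// Hypothetical shape of one observed request (illustrative, not an OpenTelemetry type).
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Collapse individual requests into the aggregates a dashboard would plot.
// Assumes at least one sample; 5xx responses count as errors.
function summarize(samples: RequestSample[]) {
  const durations = samples.map((s) => s.durationMs);
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    requestCount: samples.length,
    errorRate: errors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}
```

The aggregates are cheap to store and query, which is why metrics are the first place to look, but the individual requests are gone once they are summarized.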

Logs

Logs are discrete events with structured metadata. They tell you *what happened* in detail.
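As an illustration of "discrete events with structured metadata", the sketch below builds a log record and serializes it as one JSON line. The field names loosely follow the OpenTelemetry log data model (timestamp, severity, body, attributes, trace and span IDs), but `makeLogRecord` and `toJsonLine` are hypothetical helpers, not SDK functions.

```typescript
type Severity = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Illustrative record shape; a simplification of the OpenTelemetry log data model.
interface LogRecord {
  timestamp: string;
  severity: Severity;
  body: string;
  attributes: Record<string, string | number | boolean>;
  traceId?: string; // present when the log was emitted inside a traced request
  spanId?: string;
}

// Build one structured event. The optional ctx ties the log to a trace.
function makeLogRecord(
  severity: Severity,
  body: string,
  attributes: Record<string, string | number | boolean> = {},
  ctx?: { traceId: string; spanId: string },
  now: Date = new Date(),
): LogRecord {
  return {
    timestamp: now.toISOString(),
    severity,
    body,
    attributes,
    traceId: ctx?.traceId,
    spanId: ctx?.spanId,
  };
}

// One JSON object per line; undefined fields are omitted by JSON.stringify.
function toJsonLine(record: LogRecord): string {
  return JSON.stringify(record);
}
```

Carrying the trace ID on each record is what lets you jump from a slow trace to the exact log lines written while it ran.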

Traces

Traces follow a single request across service boundaries. They tell you *where* the latency is.
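A trace survives a service boundary because the caller passes its trace context along with the request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, formatted as `version-traceId-spanId-flags`. The sketch below parses and re-emits that header by hand. The function names are hypothetical, and in practice the SDK's propagator does this for you.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters, shared by every span in the trace
  spanId: string;  // 16 lowercase hex characters, the caller's span
  sampled: boolean;
}

const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming traceparent header; returns null if it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version !== "00") return null; // this sketch handles version 00 only
  // All-zero IDs are invalid per the W3C spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serialize a context for the next outgoing request.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// A downstream call keeps the trace ID and swaps in the new span's ID.
function childOf(parent: TraceContext, newSpanId: string): TraceContext {
  return { ...parent, spanId: newSpanId };
}
```

Because every service reuses the same trace ID, the backend can stitch the spans into one timeline and show which hop added the latency.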

The collector pattern

The OpenTelemetry Collector is a vendor-agnostic agent that receives, processes, and exports telemetry data. We run it as a DaemonSet on every Kubernetes node:

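receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # applications send OTLP over gRPC to this port

processors:
  batch:
    timeout: 1s                  # flush a partial batch after 1 second
    send_batch_size: 1024        # or as soon as 1024 items are buffered

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # Prometheus scrapes metrics from this port
  otlp:
    endpoint: tempo.example.com:4317
    tls:
      insecure: false            # keep TLS enabled for the connection to Tempo

# Components only take effect once a pipeline references them.
# Assumed routing: traces go to Tempo over OTLP, metrics to the Prometheus endpoint.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]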
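The two `batch` settings in the config above boil down to one rule: export when the buffer reaches `send_batch_size`, or when `timeout` has elapsed since the first buffered item, whichever comes first. The sketch below models that rule with an explicit clock so it is easy to reason about. `Batcher` is a hypothetical class, and the Collector's actual implementation is more involved.

```typescript
// Simplified model of size-or-timeout batching (not the Collector's real code).
class Batcher<T> {
  private buffer: T[] = [];
  private firstItemAt: number | null = null;
  private readonly sendBatchSize: number;
  private readonly timeoutMs: number;
  private readonly exportBatch: (batch: T[]) => void;

  constructor(sendBatchSize: number, timeoutMs: number, exportBatch: (batch: T[]) => void) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
    this.exportBatch = exportBatch;
  }

  // Buffer one item; export immediately if the batch is now full.
  add(item: T, nowMs: number): void {
    if (this.buffer.length === 0) this.firstItemAt = nowMs;
    this.buffer.push(item);
    if (this.buffer.length >= this.sendBatchSize) this.flush();
  }

  // Called periodically; exports a partial batch once the timeout has elapsed.
  tick(nowMs: number): void {
    if (this.firstItemAt !== null && nowMs - this.firstItemAt >= this.timeoutMs) {
      this.flush();
    }
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.firstItemAt = null;
    this.exportBatch(batch);
  }
}
```

A larger batch size means fewer, bigger export requests. A shorter timeout bounds how stale telemetry can be during quiet periods.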

The dashboards that matter

After years of iterating, these are the dashboards every platform team should have:

1. Service Overview — request rate, error rate, latency (P50/P95/P99), CPU, memory

2. Dependency Graph — service-to-service call graph with error rates

3. Deployment Health — metrics before/during/after each deployment

4. Cost by Service — CPU and memory usage per service per namespace
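The Dependency Graph dashboard is essentially an aggregation over spans: group calls by caller and callee, then compute an error rate per edge. Here is a rough sketch of that aggregation. The `CallRecord` shape is an assumption for illustration, since in practice a tracing backend or a Collector component would typically derive these edges from span data.

```typescript
// Hypothetical flattened view of one client span: who called whom, and whether it failed.
interface CallRecord {
  caller: string;
  callee: string;
  isError: boolean;
}

interface EdgeStats {
  calls: number;
  errors: number;
  errorRate: number;
}

// Aggregate individual calls into one entry per "caller -> callee" edge.
function buildDependencyGraph(calls: CallRecord[]): Map<string, EdgeStats> {
  const edges = new Map<string, EdgeStats>();
  for (const call of calls) {
    const key = `${call.caller} -> ${call.callee}`;
    const stats = edges.get(key) ?? { calls: 0, errors: 0, errorRate: 0 };
    stats.calls += 1;
    if (call.isError) stats.errors += 1;
    stats.errorRate = stats.errors / stats.calls;
    edges.set(key, stats);
  }
  return edges;
}
```

Sorting the edges by `errorRate` gives a quick answer to "which dependency is failing" before you open a single trace.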

The MTTR impact

Since implementing full OpenTelemetry instrumentation across our stack, our mean time to resolve (MTTR) for incidents has dropped by 60%. The biggest win: distributed tracing lets us pinpoint the failing service in seconds instead of minutes.
