Teams often say "we have logs" but still struggle to answer basic questions during incidents: what changed, which request path is slow, and whether the bottleneck is the database or a downstream API. Observability works when telemetry is consistent (standard attributes), correlated (trace IDs everywhere), and actionable (dashboards and alerts tied to SLOs).
1) Reference architecture (what runs where)
Services emit OTLP telemetry using OpenTelemetry SDKs (HTTP/gRPC).
OpenTelemetry Collector runs as an agent (per-node) and/or gateway (central).
Metrics: Prometheus with Grafana dashboards.
Traces: Grafana Tempo (or Jaeger).
Logs: Loki with trace_id injected into structured logs for correlation.
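To make the agent/gateway split concrete, here is an illustrative per-node agent config: it receives OTLP locally and forwards everything to the central gateway, which owns redaction and backend routing. The endpoint name otel-gateway is an assumption about your deployment.

```yaml
# otel-collector agent (per node, illustrative)
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  otlp:
    endpoint: otel-gateway:4317  # assumed gateway service name
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```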
2) Instrumentation rules that prevent noisy data
Always set: service.name, service.version, deployment.environment.
Avoid high-cardinality attributes on metrics (full URLs, random IDs).
Keep logs structured (JSON) and include trace_id/span_id.
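A minimal sketch of what such a log line looks like. In a real service the IDs would come from the OpenTelemetry API (for example, trace.getActiveSpan()?.spanContext()); here they are hard-coded for illustration.

```javascript
// Build a structured JSON log line that carries trace context.
// One JSON object per line: Loki parses it directly, and trace_id
// links each line to the matching trace in Tempo.
function structuredLog(level, message, traceCtx, fields = {}) {
  return JSON.stringify({
    level,
    msg: message,
    trace_id: traceCtx.traceId,
    span_id: traceCtx.spanId,
    ...fields,
  });
}

const line = structuredLog("info", "checkout completed", {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
}, { order_id: "ord-123" });

console.log(line);
```

Keeping extra fields flat (order_id rather than nested objects) makes them directly queryable as Loki labels or filters.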
Node.js: minimal OTel bootstrap (conceptual)

```javascript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  // Illustrative name; service.name can also be set via OTEL_RESOURCE_ATTRIBUTES.
  serviceName: "checkout-api",
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

3) Collector configuration: enrich, batch, redact, export
In production you want batching (lower overhead), enrichment (cluster/region/env), redaction (PII), and routing to multiple backends. The Collector reduces per-service config drift and lets platform teams enforce standards.
otel-collector gateway (illustrative)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact, batch]
      exporters: [otlp/tempo]
```

4) SLOs and alerts that reduce fatigue
Example availability SLO: 99.9% of requests succeed (i.e., 5xx rate below 0.1%) over a rolling 30 days.
Latency SLOs per journey (p95): checkout < 300ms, search < 200ms.
Use multiwindow burn-rate alerts (a fast pair that pages and a slow pair that tickets) on the error budget instead of static thresholds on raw resource metrics like CPU.
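An illustrative Prometheus rule for the 99.9% availability SLO, following the multiwindow multi-burn-rate pattern. The metric name http_requests_total and its code label are assumptions about your instrumentation; the 14.4x/6x factors are the commonly used pairings for 1h/5m and 6h/30m windows.

```yaml
groups:
  - name: checkout-availability-slo
    rules:
      # Fast burn: 14.4x the 30-day budget over 1h AND 5m.
      # Fires within minutes of a hard outage; routes to paging.
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: page
      # Slow burn: 6x budget over 6h AND 30m catches gradual leaks;
      # opens a ticket instead of paging.
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h])) > (6 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m])) > (6 * 0.001)
          )
        labels:
          severity: ticket
```

Requiring both the long and short window keeps alerts from firing on a brief spike (short window only) or lingering long after recovery (long window only).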
5) What improves when you do this well (realistic numbers)
In an 18-service Kubernetes platform, standard OTel + Grafana correlation reduced median time-to-diagnosis for P1 incidents from ~42 minutes to ~16 minutes over two quarters. The biggest win was trace-driven triage: engineers could see a slow DB call inside one request path instead of searching logs across services.
Want this level of engineering on your product?
PharmoTech builds high-performance web apps, mobile apps, desktop apps, and supports growth with branding + marketing.
