Teams often say "we have logs" but still struggle to answer basic questions during incidents: what changed, which request path is slow, and whether the bottleneck is the database or a downstream API. Observability works when telemetry is consistent (standard attributes), correlated (trace IDs everywhere), and actionable (dashboards and alerts tied to SLOs).
1) Reference architecture (what runs where)
Services emit OTLP telemetry using OpenTelemetry SDKs (HTTP/gRPC).
OpenTelemetry Collector runs as an agent (per-node) and/or gateway (central).
Metrics: Prometheus with Grafana dashboards.
Traces: Grafana Tempo (or Jaeger).
Logs: Loki with trace_id injected into structured logs for correlation.
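To make the agent/gateway split concrete, here is an illustrative per-node agent config: it receives OTLP locally and forwards everything to the central gateway, which owns redaction and backend routing. The endpoint name otel-gateway is an assumption about your deployment.

```yaml
# otel-collector agent (per node, illustrative)
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  otlp:
    endpoint: otel-gateway:4317  # assumed gateway service name
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```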
2) Instrumentation rules that prevent noisy data
Always set: service.name, service.version, deployment.environment.
Avoid high-cardinality attributes on metrics (full URLs, random IDs).
Keep logs structured (JSON) and include trace_id/span_id.
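A minimal sketch of what such a log line looks like. In a real service the IDs would come from the OpenTelemetry API (for example, trace.getActiveSpan()?.spanContext()); here they are hard-coded for illustration.

```javascript
// Build a structured JSON log line that carries trace context.
// One JSON object per line: Loki parses it directly, and trace_id
// links each line to the matching trace in Tempo.
function structuredLog(level, message, traceCtx, fields = {}) {
  return JSON.stringify({
    level,
    msg: message,
    trace_id: traceCtx.traceId,
    span_id: traceCtx.spanId,
    ...fields,
  });
}

const line = structuredLog("info", "checkout completed", {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
}, { order_id: "ord-123" });

console.log(line);
```

Keeping extra fields flat (order_id rather than nested objects) makes them directly queryable as Loki labels or filters.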
Node.js: minimal OTel bootstrap (conceptual)

```javascript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  // Illustrative name; service.name can also be set via OTEL_RESOURCE_ATTRIBUTES.
  serviceName: "checkout-api",
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

3) Collector configuration: enrich, batch, redact, export
In production you want batching (lower overhead), enrichment (cluster/region/env), redaction (PII), and routing to multiple backends. The Collector reduces per-service config drift and lets platform teams enforce standards.
otel-collector gateway (illustrative)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact, batch]
      exporters: [otlp/tempo]
```

4) SLOs and alerts that reduce fatigue
Example availability SLO: 99.9% of requests succeed (i.e., 5xx rate below 0.1%) over a rolling 30 days.
Latency SLOs per journey (p95): checkout < 300ms, search < 200ms.
Use multiwindow burn-rate alerts (a fast pair that pages and a slow pair that tickets) on the error budget instead of static thresholds on raw resource metrics like CPU.
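An illustrative Prometheus rule for the 99.9% availability SLO, following the multiwindow multi-burn-rate pattern. The metric name http_requests_total and its code label are assumptions about your instrumentation; the 14.4x/6x factors are the commonly used pairings for 1h/5m and 6h/30m windows.

```yaml
groups:
  - name: checkout-availability-slo
    rules:
      # Fast burn: 14.4x the 30-day budget over 1h AND 5m.
      # Fires within minutes of a hard outage; routes to paging.
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: page
      # Slow burn: 6x budget over 6h AND 30m catches gradual leaks;
      # opens a ticket instead of paging.
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h])) > (6 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m])) > (6 * 0.001)
          )
        labels:
          severity: ticket
```

Requiring both the long and short window keeps alerts from firing on a brief spike (short window only) or lingering long after recovery (long window only).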
5) What improves when you do this well (realistic numbers)
In an 18-service Kubernetes platform, standard OTel + Grafana correlation reduced median time-to-diagnosis for P1 incidents from ~42 minutes to ~16 minutes over two quarters. The biggest win was trace-driven triage: engineers could see a slow DB call inside one request path instead of searching logs across services.
Want this level of engineering on your product?
PharmoTech builds high-performance web apps, mobile apps, desktop apps, and supports growth with branding + marketing.
