What is Observability? Metrics, Logs & Traces Explained

The short answer

Observability is the ability to understand the internal state of a system by examining its external outputs. In software engineering, those outputs are primarily metrics, logs, and traces — often called the three pillars of observability.

A system is observable if you can answer the question "what is happening and why?" without deploying new code or adding new instrumentation after something goes wrong.

Observability vs monitoring

Monitoring tells you when something is wrong. Observability tells you why. Monitoring is about checking known failure modes against predefined thresholds. Observability is about having enough data to investigate unknown failure modes — failures you did not anticipate when you set up your dashboards.

In simple systems, monitoring is often sufficient. In distributed systems with many services, databases, queues, and third-party dependencies, monitoring alone leaves too many blind spots. Observability fills those gaps.

Key distinction: Monitoring is reactive — you define what to watch for. Observability is exploratory — you have enough data to answer questions you did not think to ask in advance.
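To make the distinction concrete, here is a minimal Python sketch. The event fields (service, region, status), the 5% threshold, and both function names are assumptions chosen for illustration; they do not come from any particular tool.

```python
from collections import Counter

# Hypothetical request events; the field names are illustrative assumptions.
EVENTS = [
    {"service": "checkout", "region": "eu-west", "status": 500},
    {"service": "checkout", "region": "eu-west", "status": 500},
    {"service": "checkout", "region": "us-east", "status": 200},
    {"service": "search", "region": "eu-west", "status": 200},
    {"service": "search", "region": "us-east", "status": 200},
    {"service": "search", "region": "us-east", "status": 200},
]


def error_rate_alert(events, threshold=0.05):
    """Monitoring: a predefined check against a known failure mode."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def errors_by(events, field):
    """Observability: slice the same events by any field, chosen after the fact."""
    return Counter(e[field] for e in events if e["status"] >= 500)
```

The alert only says that the error rate crossed a line someone drew in advance. Calling errors_by(EVENTS, "region") or errors_by(EVENTS, "service") is the exploratory step: it works only because the raw events kept enough context to be grouped by a field nobody alerted on.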

The three pillars

Metrics

Metrics are numeric measurements collected at regular intervals: CPU usage, request rate, error rate, latency percentiles. They are efficient to store and easy to alert on, but they aggregate information — a spike in p99 latency tells you something is slow, not which requests or why.
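As a rough illustration of how aggregation discards detail, the sketch below computes latency percentiles with the nearest-rank method. The sample latencies are made up, and nearest-rank is only one of several percentile definitions; real metrics systems often estimate percentiles from histograms instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900, 1200]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median looks healthy while p99 reveals a slow tail, but neither number identifies the two slow requests or explains them. That context has to come from logs or traces.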

Common metric formats: Prometheus exposition format, StatsD, OpenTelemetry metrics.
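For a sense of what one of these formats looks like on the wire, the following sketch renders a counter in the Prometheus text exposition format. The function names and the metric are hypothetical, and in practice a client library would normally generate this output for you.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family; `samples` maps a tuple of (label, value) pairs to a count."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, count in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {count}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label values, for illustration only.
text = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    },
)
```

Each sample line is a metric name, a set of labels, and a number. That compactness is why metrics are cheap to store, and also why they cannot carry per-request context.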

Logs

Logs are timestamped records of discrete events. They contain rich context — request IDs, user IDs, error messages, stack traces — that metrics cannot capture. Logs are essential for debugging specific incidents but are expensive to store and slow to query at scale.

Structured logging (JSON rather than plain text) is now standard practice because it makes logs searchable and parseable by log aggregation platforms.
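A minimal sketch of structured logging using only Python's standard logging module is shown below. The JsonFormatter class, the build_logger helper, the "checkout" logger name, and the request_id and user_id fields are all assumptions for illustration; production setups often use a dedicated JSON logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    # Hypothetical context fields picked for this sketch.
    CONTEXT_FIELDS = ("request_id", "user_id")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr when None)."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as build_logger().info("payment declined", extra={"request_id": "req-123", "user_id": "u-42"}) then emits a single JSON object whose fields a log platform can index and filter on, rather than a free-form string that has to be parsed with regular expressions.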

Traces

Traces follow a single request as it flows through a distributed system — from the user's browser through the API gateway, across multiple microservices, to the database and back. Each step is a "span" with its own timing and metadata. Together they form a trace that shows exactly where time was spent and where errors occurred.
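The span model can be sketched in a few lines of Python. The Span and Tracer classes below are hypothetical in-memory helpers written for this article, not the API of any real tracing SDK, and the operation names and sleep calls stand in for real service calls.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float = 0.0
    end: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000


class Tracer:
    """Collects the spans of one trace in memory (illustrative only)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name, **attributes):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, name,
                       attributes=attributes)
        self._stack.append(current)
        current.start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            current.attributes["error"] = repr(exc)  # record where the error occurred
            raise
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


tracer = Tracer()
with tracer.span("GET /checkout"):
    with tracer.span("auth-service"):
        time.sleep(0.01)  # stand-in for a fast downstream call
    with tracer.span("db.query", table="orders"):
        time.sleep(0.03)  # stand-in for a slow query

# The child span with the longest duration shows where the time went.
slowest_child = max((s for s in tracer.spans if s.parent_id), key=lambda s: s.duration_ms)
```

Every span shares the same trace ID and points at its parent, which is what lets a tracing backend reassemble the request as a tree. In a real distributed system the trace ID and parent span ID also have to be propagated across process boundaries, usually in request headers, which this single-process sketch does not attempt.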

Traces are the newest of the three pillars and the hardest to instrument, but they are often the most effective tool for diagnosing latency problems in microservice architectures.

OpenTelemetry

OpenTelemetry (OTel) is an open-source observability framework that provides standardized APIs, SDKs, and tooling for collecting metrics, logs, and traces. It has become the de facto instrumentation standard for many new systems, and Datadog, Grafana, Honeycomb, New Relic, and most other major observability platforms can ingest OpenTelemetry data.

Using OpenTelemetry means your instrumentation is vendor-neutral: you can switch backends, typically by changing exporter or Collector configuration, without re-instrumenting your code.
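The idea behind vendor neutrality can be illustrated without any OpenTelemetry code at all. In the sketch below, the Exporter interface and both exporter classes are hypothetical stand-ins invented for this example; the real OpenTelemetry exporter API is different. The point is only the shape of the design: application code talks to one interface, and the backend is chosen separately.

```python
import json
from typing import Protocol


class Exporter(Protocol):
    """Hypothetical backend interface (not the OpenTelemetry API)."""

    def export(self, event: dict) -> None: ...


class JsonLinesExporter:
    """Stand-in for one backend: stores events as JSON lines."""

    def __init__(self):
        self.lines = []

    def export(self, event):
        self.lines.append(json.dumps(event, sort_keys=True))


class KeyValueExporter:
    """Stand-in for a second backend with a different wire format."""

    def __init__(self):
        self.lines = []

    def export(self, event):
        self.lines.append(" ".join(f"{k}={event[k]}" for k in sorted(event)))


def handle_request(exporter: Exporter, path: str) -> None:
    """Application code: instrumented once, unaware of which backend is configured."""
    exporter.export({"name": "http.request", "path": path, "status": 200})


backend_a = JsonLinesExporter()
backend_b = KeyValueExporter()
handle_request(backend_a, "/checkout")
handle_request(backend_b, "/checkout")
```

The handle_request function is identical in both calls; only the exporter passed in changes. Switching backends is a configuration decision, not a code rewrite.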

Observability in practice

Achieving good observability is an iterative process. Most teams start with basic metrics and logging, add tracing when microservices make debugging too painful, and gradually improve the quality and coverage of their instrumentation based on what they actually need during incidents.

The most important question is not "do we have observability?" but "can we answer questions about our system's behavior when something goes wrong?"