Fundamentals of Observability

This is Part II of Observability Engineering: Achieving Production Excellence.

We can’t understand a complex system if it’s a black box.

Observability aims to understand and explain your system's internal state from its outputs, ideally without adding new metrics.

Structured events are the building blocks of observability. High-cardinality, high-dimensionality, context-rich events facilitate discoverability, enabling a move away from reactive, iterative debugging toward an approach where curiosity is immediately rewarded.

To answer all possible questions with metrics, all metrics would have to be captured at all levels of granularity, which is unrealistic and prohibitively expensive. You'd spend more time with metrics than with the actual software. Furthermore, domain expertise is still necessary to put the metrics in context and make sense of them in light of the question at hand.

When we emit high-context events, we still capture every measurement related to the event, but only those measurements, which is far more manageable. Events can also be compared against an existing data set to find outliers, which is useful for performance analysis.
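As a rough illustration of the paragraph above, here is a minimal Python sketch in which each request produces one wide, structured event, and metric-style aggregates are derived from those events afterwards. The field names (`endpoint`, `user_id`, `build`, `duration_ms`) and all sample values are hypothetical, chosen only for the example.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical events: one wide key-value record per request.
events = [
    {"endpoint": "/cart", "user_id": "u1", "status": 200, "duration_ms": 42.0, "build": "a1"},
    {"endpoint": "/cart", "user_id": "u2", "status": 200, "duration_ms": 38.0, "build": "a1"},
    {"endpoint": "/cart", "user_id": "u3", "status": 500, "duration_ms": 910.0, "build": "b2"},
    {"endpoint": "/home", "user_id": "u1", "status": 200, "duration_ms": 12.0, "build": "a1"},
]


def emit(event):
    """Serialize one event as a JSON line, as it might be sent to a backend."""
    return json.dumps(event, sort_keys=True)


def mean_duration_by(events, key):
    """Derive a metric-style aggregate after the fact, grouped by any field."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e["duration_ms"])
    return {k: mean(v) for k, v in groups.items()}


def outliers(events, threshold_ms):
    """Return the full events (with all their context) above a duration threshold."""
    return [e for e in events if e["duration_ms"] > threshold_ms]


by_endpoint = mean_duration_by(events, "endpoint")  # the usual dashboard view
by_build = mean_duration_by(events, "build")        # a question nobody planned for
slow = outliers(events, 500.0)                      # outliers keep their context
```

Grouping by `build` needed no new instrumentation because the field was already on every event. With pre-aggregated metrics, that breakdown would have had to be chosen before the data was collected.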

Chapter Five concludes that metrics are too low-level and isolated to serve as a building block for true software observability. They should instead be relegated to the job they do efficiently: monitoring low-variability infrastructure and system-level concerns.

Glossary

  • Metric: a pre-aggregated measurement, as a scalar value, collected to represent system state, with optional tags used for grouping and searching.
  • Structured Event: a record of everything that occurred while one particular request interacted with your service, organized and formatted as key-value pairs so it's easily searchable.
  • Distributed Trace: the tracking of interrelated events that occur throughout a distributed backend, usually in the service of a single request.
  • Trace Span: one of the segments that make up a distributed trace. Spans might correspond to network hops between services or to particular areas of measurement. They are typically differentiated as a root span and parent-child spans, and they contain the data needed to stitch the trace together: the trace ID, span ID, parent ID, timestamp, and duration. Further data can be added to a span as a series of tags to be used in custom queries and sampling rules.
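To make the span fields from the glossary concrete, here is a minimal Python sketch of a span record and of how parent IDs stitch spans into a trace. The `Span` class and the helper functions are hypothetical names for illustration, not part of any tracing library. The ID lengths (16 bytes for a trace ID, 8 bytes for a span ID) are an assumption borrowed from common tracing formats.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None marks the root span
    timestamp: float          # start time, seconds since the epoch
    duration_ms: float = 0.0
    tags: dict = field(default_factory=dict)


def start_span(name, parent=None):
    """Create a span, inheriting the trace ID from its parent if there is one."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    return Span(
        name=name,
        trace_id=trace_id,
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        timestamp=time.time(),
    )


def finish(span):
    """Record how long the span lasted."""
    span.duration_ms = (time.time() - span.timestamp) * 1000.0


def children_of(spans, parent):
    """Find the spans whose parent ID points at the given span."""
    return [s for s in spans if s.parent_id == parent.span_id]


root = start_span("GET /checkout")
auth = start_span("auth-service", parent=root)
db = start_span("db-query", parent=auth)
db.tags["db.table"] = "orders"  # hypothetical custom tag
for s in (db, auth, root):
    finish(s)
spans = [root, auth, db]
```

Given only the flat list `spans`, the tree can be rebuilt by following `parent_id` from each span back to the root, which is the stitching a tracing backend performs.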

OpenTelemetry

OpenTelemetry (OTel) is a Cloud Native Computing Foundation incubating project formed by merging the OpenTracing and OpenCensus projects. This happened in 2019, so as a whole OTel is still relatively new. Despite its youth, it has seen rapid adoption in the industry, with technical committees composed of representatives from Google, Lightstep, Microsoft, and Uber. Some benefits of using OpenTelemetry:

  • It is vendor-agnostic and community-supported, so you only have to instrument once to send telemetry data to different products.
  • Consistent language and established semantic conventions help with alignment and ensure that everyone is on the same page.
  • Libraries are widely available, with broad language support.

OTel provides libraries, agents, and other tooling designed for capturing and managing telemetry data across your services.

OpenTelemetry Concepts

  • API: The interface that OTel libraries expose for developers to interact with the OTel system.
  • SDK: The concrete implementation component of OTel that tracks state and batches data for transmission.
  • Tracer: A component in the SDK that tracks which span is currently active in a process. It also enables adding attributes or events to the span or modifying its state.
  • Meter: A component responsible for creating the instruments used to report measurements in your process. It also provides the ability to access and modify measurements, such as by adding or retrieving values at periodic intervals.
  • Context propagation: The SDK deserializes headers on the inbound request to establish the current context for the process, and serializes that context into headers on the requests it passes downstream.
  • Exporter: A plug-in for the SDK that translates OTel in-memory objects into the format required by a specific destination, such as stdout, a log file, Zipkin, Jaeger, Lightstep, or Honeycomb.
  • Collector: A standalone binary process that receives telemetry data in OTLP format, processes it, and sends it to one or more configured destinations.
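
The context propagation idea in the list above can be sketched in a few lines of Python. This is not the OTel SDK's code; in practice the SDK's propagators do this work. The sketch assumes the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, and flags, separated by hyphens), and the `extract` and `inject` functions are hypothetical helpers written for this example. Headers are modeled as a plain dict with lowercase keys.

```python
import re
import secrets

# traceparent layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract(headers):
    """Deserialize inbound headers into a context dict, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def inject(context, span_id, headers):
    """Serialize the context for a downstream call, with this process's span as parent."""
    headers["traceparent"] = f"00-{context['trace_id']}-{span_id}-{context['flags']}"
    return headers


# Sample inbound header value, used here only as test data.
inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
ctx = extract(inbound)

# This process starts its own span, then passes the context downstream.
my_span_id = secrets.token_hex(8)
outbound = inject(ctx, my_span_id, {})
```

The trace ID passes through unchanged, while the parent ID in the outbound header is replaced with the current span's ID, so the downstream service's spans attach to the right parent.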

The next meetup is July 20th, where we cover Part III: Observability For Teams.