Observability at Scale

This is Part IV of Observability Engineering: Achieving Production Excellence.


Build Versus Buy and Return on Investment

This chapter provides solid advice for anyone weighing the build-versus-buy decision, and a useful caution for teams prone to "not invented here" syndrome.

When considering building on open source tools, weigh the full, often hidden costs: recruiting, hiring, and training engineers to develop and maintain a custom solution, plus the opportunity cost of not spending that effort on core business value.
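
As a back-of-the-envelope illustration, the comparison often reduces to simple arithmetic. Every figure in the sketch below is a hypothetical placeholder, not a real salary or vendor quote:

    # Hypothetical build-versus-buy arithmetic; all numbers below are
    # illustrative placeholders, not real staffing or vendor costs.
    ENGINEER_COST_PER_YEAR = 200_000   # fully loaded cost per engineer
    ENGINEERS_TO_BUILD = 3             # build and maintain a custom backend
    VENDOR_COST_PER_YEAR = 150_000     # hypothetical SaaS contract

    build_cost = ENGINEERS_TO_BUILD * ENGINEER_COST_PER_YEAR
    # Opportunity cost: those engineers are not shipping product features.
    # Rough heuristic: the value forgone is at least what they cost.
    opportunity_cost = build_cost

    total_build = build_cost + opportunity_cost
    print(f"Build: ${total_build:,}/yr versus buy: ${VENDOR_COST_PER_YEAR:,}/yr")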

Efficient Data Storage

Storing observability data is challenging, and querying it is even more so: the workload demands near-real-time queries over billions of rows of ultrawide, high-dimensionality, high-cardinality events. This chapter uses Honeycomb's Retriever implementation to elucidate the various tradeoffs. Other publicly available data stores up to the challenge include Google BigQuery, ClickHouse, and Apache Druid.
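
To make the tradeoff concrete, consider why such systems favor column-oriented layouts. The following is a minimal sketch of the idea, assuming nothing about Retriever's actual design: a query that filters on one attribute of an ultrawide event scans only that attribute's column rather than every full row.

    from collections import defaultdict

    # Minimal column-oriented store (illustrative only; not Retriever).
    # Events are wide dicts of attributes; values are stored per column so
    # a filter on one attribute scans a single list, not entire rows.
    class ColumnStore:
        def __init__(self):
            self.columns = defaultdict(list)  # attribute name -> values
            self.count = 0                    # number of events ingested

        def append(self, event: dict):
            for name in set(self.columns) | set(event):
                col = self.columns[name]
                # Backfill columns that first appear mid-ingest, so every
                # column stays aligned with the same row indices.
                while len(col) < self.count:
                    col.append(None)
                col.append(event.get(name))
            self.count += 1

        def scan(self, column: str, predicate):
            # Return the row indices matching a predicate on one column.
            return [i for i, v in enumerate(self.columns.get(column, []))
                    if v is not None and predicate(v)]

    store = ColumnStore()
    store.append({"service": "api", "duration_ms": 412, "status": 500})
    store.append({"service": "web", "duration_ms": 87, "status": 200})
    print(store.scan("duration_ms", lambda ms: ms > 300))  # -> [0]

Production stores layer compression, time-based segmenting, and distributed fan-out on top of this basic layout.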

Cheap and Accurate Enough: Sampling

Your team probably cares most about traces that contain errors or exhibit poor performance. Sampling is an excellent technique for improving the signal-to-noise ratio on the events you care about, drastically reducing storage and query costs and complexity.

Because sampling is so valuable when handling observability data at scale, open source instrumentation libraries such as OpenTelemetry (OTel) increasingly provide sampling capabilities out of the box.
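
For example, OpenTelemetry's SDKs ship with built-in samplers. The sketch below uses the Python SDK's ParentBased and TraceIdRatioBased samplers to keep roughly 10% of traces; the ratio, span name, and attribute are illustrative choices, not recommendations.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Head-based sampling: the keep/drop decision is made at the root span
    # and honored by child spans via ParentBased, so traces stay complete.
    sampler = ParentBased(root=TraceIdRatioBased(0.10))  # keep ~10%
    trace.set_tracer_provider(TracerProvider(sampler=sampler))

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("user.id", "12345")  # illustrative attribute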

There are two sampling strategies to use tactically:

  • Head-based: The sampling decision is made immediately, at the root of the trace, and propagated downstream via headers. Pro: Unnecessary traces are never collected or stored, reducing overhead right at the source. Con: Significant or anomalous traces may be missed, or arrive incomplete if only some services in a distributed system decide to sample a request.
  • Tail-based: The sampling decision occurs at the end of a transaction or request. The system collects all spans related to a trace and then decides whether to keep it based on various criteria. Pro: All meaningful traces are retained, leading to better insights. Con: It is more resource-intensive, and implementation is more complex (a sketch follows this list).
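
To sketch the tail-based approach (the helper names and the 10% keep rate are hypothetical, not any particular collector's API), the core idea is to buffer spans per trace and decide only once the trace is complete:

    import random
    from collections import defaultdict

    KEEP_RATE_FOR_HEALTHY = 0.10        # hypothetical keep rate

    buffers = defaultdict(list)         # trace_id -> buffered spans

    def on_span(span: dict):
        # Buffer every span until its trace finishes.
        buffers[span["trace_id"]].append(span)

    def on_trace_complete(trace_id: str) -> bool:
        spans = buffers.pop(trace_id)
        has_error = any(s.get("status") == "error" for s in spans)
        # Keep all error traces; probabilistically keep healthy ones.
        keep = has_error or random.random() < KEEP_RATE_FOR_HEALTHY
        if keep:
            export(spans)
        return keep

    def export(spans):
        print(f"exporting {len(spans)} spans for trace {spans[0]['trace_id']}")

    on_span({"trace_id": "t1", "status": "error"})
    on_span({"trace_id": "t1", "status": "ok"})
    on_trace_complete("t1")  # always kept: this trace contains an error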

Telemetry Management with Pipelines

More to come.