This covers Part 1 of Observability Engineering: Achieving Production Excellence.
What's a metric?
The metric, introduced in 1988, is the foundational substrate of monitoring: a single number with optional tags for grouping and searching. Metrics are cheap and straightforward, which enables optimized tooling for collecting, storing, shipping, and analyzing them, and they aggregate easily into time-series buckets.
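As a minimal sketch of that idea (all names here are hypothetical, not from any particular metrics library): a metric sample is just a number plus tags, and time-series aggregation is a cheap bucketing step.

```python
from dataclasses import dataclass, field
import time

# Hypothetical sketch of a metric sample: one number with optional tags.
@dataclass
class Metric:
    name: str                                 # e.g. "http.requests"
    value: float                              # the single number
    timestamp: float = field(default_factory=time.time)
    tags: dict = field(default_factory=dict)  # for grouping and searching

def bucket(samples, width_s=60):
    """Aggregate samples into fixed-width time-series buckets by summing."""
    buckets = {}
    for s in samples:
        key = (s.name, int(s.timestamp // width_s))
        buckets[key] = buckets.get(key, 0) + s.value
    return buckets
```

This cheapness is exactly the trade-off the book describes: aggregation is fast and compact, but individual context is discarded at write time.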
The term "observability" was originally coined in 1960 to characterize mathematical control systems, defined as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Extrapolate that to today, and the idea is that if structured events are emitted from throughout the system, they can be dynamically explored with granular control to give a much faster and more accurate diagnosis of the system state.
The book presents observability in software systems as the ability to "understand any system state your application may have gotten itself into, even new states you couldn't have predicted, without shipping custom code to handle it."
The pillars of observability are described as:
- Structured events
- Hypothesis-driven debugging
- Tooling that supports high cardinality, high dimensionality, and explorability
How does it differ from Monitoring?
Monitoring has been conventionally expressed as using logs, metrics, and traces to approximate overall system health. The telemetry is set up ahead of time based on assumptions of how a well-understood system will operate. As such, monitoring is well-suited for well-understood and less volatile systems, such as infrastructure-level analysis.
But with shifts towards continuous delivery and cloud native practices, software grows in complexity. New software systems can have a varied and emergent state space, contingent on innumerable factors. Traditional monitoring and its assumptions may no longer be the best tool for the job.
Debugging with monitoring requires certain assumptions, and getting value from metrics requires knowing the system and how it is supposed to function. The upper limit on effective troubleshooting is constrained by your ability to pre-declare conditions describing what you might be looking for, and building the intuition to do that well takes extensive experience.
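A minimal sketch of that pre-declared model (the threshold names and values are hypothetical): with monitoring, the conditions must be written down before the failure occurs, so only anticipated failure modes can surface.

```python
# Pre-declared alert conditions: someone had to predict these in advance.
THRESHOLDS = {"cpu.utilization": 0.90, "error.rate": 0.05}

def check(metrics: dict) -> list:
    """Return the names of metrics breaching their pre-set thresholds."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

# A failure mode nobody thought to declare (e.g. one user's requests
# hanging on one node) breaches no threshold and stays invisible.
```

The upper bound on troubleshooting here is the quality of the `THRESHOLDS` dictionary, which is the book's point about pre-declared conditions.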
Monitoring is great for infrastructure such as Kafka and message queues. These are "known territory," where you know what to look for. Observability is ideal for "unknown territory," where you might not know what to look for: it's hard to find what you don't know to look for.
What does Observability enable?
Observability through high-cardinality events enables exploring the "state space of system behavior" as a method of investigation. This changes the classic troubleshooting workflow from a reactive model that is heavily reliant on institutional knowledge, experience, and intuition, to a proactive model that rewards curiosity.
It's basically a "back-foot vs front-foot" distinction. Troubleshooting with monitoring is reactive and requires up-front work to define the conditions; observability enables investigation in the moment, exploring conditions against existing data.
How does it work?
Structured events are sets of key-value pairs of arbitrary length. The data should be high-cardinality, because that is what makes events identifiable when debugging a system. For example, you benefit from being able to tie data back to users, timeframes, nodes, processes, batches, etc.
Ideally, events are "wide" enough to carry every significant context variable that could influence the state space of system behavior. This can mean hundreds or even thousands of key-value pairs, enabling drill-down on any combination of those keys.
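To make that concrete, here is a hypothetical wide structured event (all field names and values invented for illustration) and a sketch of the explorability it enables: filtering stored events by any ad-hoc combination of keys.

```python
# A hypothetical wide structured event: arbitrary key-value pairs,
# including high-cardinality fields like user and request IDs.
event = {
    "timestamp": "2024-06-01T12:00:00Z",
    "service": "checkout",
    "endpoint": "/cart/submit",
    "user_id": "u-829341",      # high cardinality: unique per user
    "request_id": "req-ab12f",  # high cardinality: unique per request
    "node": "ip-10-0-4-17",
    "batch_id": 77,
    "duration_ms": 483,
    "status_code": 500,
}

def where(events, **conditions):
    """Slice events by any combination of keys, chosen at query time."""
    return [e for e in events
            if all(e.get(k) == v for k, v in conditions.items())]

# e.g. where(events, status_code=500, node="ip-10-0-4-17")
```

Because no aggregation happened at write time, the question can be formed after the fact, which is the "front-foot" investigation described above.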
Next Book Club meeting is scheduled for June 16th and will cover Part II: Fundamentals of Observability