Observability for Teams

Jul 28, 2023 • 2 min read

This is Part III in Observability Engineering: Achieving Production Excellence

Start with the most significant pain points, and then flesh out your instrumentation iteratively.

A key finding of Accelerate: Building and Scaling High Performing Technology Organizations was that the inverse relationship between speed and quality is a myth: high-performing teams can release high-quality code quickly, and these two qualities are correlated and reinforce each other. Conversely, teams that move slowly tend to experience failures more often and take substantially longer to recover from them.
The key metric for the health and effectiveness of an engineering team is the time elapsed from when code is written to when it is in production. Every team should be tracking this metric and working to improve it.
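As a rough illustration of tracking this metric, the sketch below computes the median time from commit to production. The timestamps are made up, and `lead_times` is a hypothetical helper; a real setup would pull these values from version control and deployment records.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_times(changes):
    """Return the commit-to-production duration for each deployed change."""
    return [deployed - committed for committed, deployed in changes]


# Hypothetical (commit time, production deploy time) pairs.
changes = [
    (datetime(2023, 7, 3, 9, 0), datetime(2023, 7, 3, 9, 45)),
    (datetime(2023, 7, 3, 13, 0), datetime(2023, 7, 3, 15, 0)),
    (datetime(2023, 7, 4, 10, 0), datetime(2023, 7, 4, 10, 30)),
]

durations = lead_times(changes)
median_lead_time = median(durations)
print(median_lead_time)  # 0:45:00
```

The median is used here rather than the mean so that one unusually slow deploy does not dominate the number; that choice is an assumption of this sketch, not something the book prescribes.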
Isolated test-driven development does not reveal whether customers are having a good experience with your service.
Observability should be used early in the software development life cycle, while code is still being developed, to catch defects sooner and reduce the cost of fixing them later. This is what is meant by "shifting observability left."

Threshold alerting covers known unknowns only. Relying on it alone isn't sustainable, because failures in distributed systems are inevitable and unpredictable.
A good alert must reflect immediate user impact, be actionable, be novel, and require investigation rather than rote action.
SLOs decouple the "what" and "why" behind incident alerting.
SLOs are excellent at communicating how to prioritize reliability against feature development. If we aren't hitting our SLOs, the focus ought to be reliability.
There are two types of SLOs: time-based measures (for example, 99th percentile latency below 300 ms over each 5-minute window) and event-based measures (the proportion of events that took less than 300 ms during a given rolling time window).
With a 99% target, a time-based SLO allows me 1 bad minute out of every 100 minutes, while an event-based SLO allows me 1 bad event out of every 100 events.
Prefer event-based SLOs: they provide a more reliable, granular, and precise way to quantify the state of a service. They capture brownouts better, that is, periods when some events fail but not all of them, so availability targets can be measured more reasonably.
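To make the difference concrete, here is a minimal sketch that computes both measures over the same synthetic traffic. The 300 ms threshold and 5-minute window come from the text above; the function names, the nearest-rank percentile, and the traffic shape (a brownout in one window where 30% of requests are slow) are assumptions chosen for illustration.

```python
from collections import defaultdict

THRESHOLD_MS = 300
WINDOW_SECONDS = 300  # 5-minute windows


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def time_based_sli(events):
    """Fraction of windows whose p99 latency is under the threshold."""
    windows = defaultdict(list)
    for timestamp, latency_ms in events:
        windows[timestamp // WINDOW_SECONDS].append(latency_ms)
    good = sum(
        1 for latencies in windows.values()
        if percentile(latencies, 99) < THRESHOLD_MS
    )
    return good / len(windows)


def event_based_sli(events):
    """Fraction of individual events under the threshold."""
    good = sum(1 for _, latency_ms in events if latency_ms < THRESHOLD_MS)
    return good / len(events)


# Synthetic traffic: 10 windows of 100 events each.
# In window 4, the first 30 requests are slow (a partial brownout).
events = []
for window in range(10):
    for i in range(100):
        timestamp = window * WINDOW_SECONDS + i
        slow = window == 4 and i < 30
        events.append((timestamp, 900 if slow else 100))

print(time_based_sli(events))   # 0.9
print(event_based_sli(events))  # 0.97
```

The time-based measure marks the whole brownout window as bad even though 70% of its requests were fine, while the event-based measure counts only the 30 slow requests out of 1,000.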
If SLOs are not being met but customers are not complaining, it may be fine to relax the SLO targets, if doing so frees up capacity for product development elsewhere.
If customers are complaining, relaxing the SLOs further is likely a poor leadership decision.

Stop relying on experience to guess what is happening in a system. It's unreliable and unsustainable.
Observability is not specific to debugging. Debugging's concern is to remove a bug, but it says little about the overall state of a given system. Observability will tell you which systems are good candidates to improve, which may involve debugging but could also involve performance improvements, refactoring or redesigning to achieve a target SLO.

Slack applied observability to its software supply chain, instrumenting the CI pipeline to solve complex problems across the CI workflow that were previously invisible or undetected.
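The general idea of instrumenting a CI step can be sketched as follows. This is not Slack's implementation: `ci_step`, the field names, and the in-memory `emitted` list are all hypothetical stand-ins, and a real pipeline would send these events to a tracing or telemetry backend instead of printing them.

```python
import json
import time
from contextlib import contextmanager

emitted = []  # hypothetical stand-in for a real telemetry backend


@contextmanager
def ci_step(name, **fields):
    """Record one structured event per CI step, including failed steps."""
    event = {"step": name, **fields}
    start = time.monotonic()
    try:
        yield event  # the step body may attach extra fields
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emitted.append(event)
        print(json.dumps(event))


# A step that succeeds and attaches its own context.
with ci_step("unit-tests", branch="main") as event:
    event["tests_run"] = 42

# A step that fails: the event is still emitted before the error propagates.
try:
    with ci_step("deploy", branch="main"):
        raise RuntimeError("registry unreachable")
except RuntimeError:
    pass
```

Emitting one wide event per step, with duration and outcome attached, is what makes slow or flaky stages visible instead of leaving them buried in build logs.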