Velocity depends on reliability.
When platform teams talk about reliability, we often talk about it in terms of uptime, SLOs, and incident counts. But how does the concept of reliability fit in with the larger organization?
Reliability is what enables a focus on delivery as the organization continues to grow — it's a core responsibility of a platform team. So what does reliability unlock for the engineers building on top of platform? The answer, I think, falls into four buckets.
Safety – engineers ship without fear.
When the underlying platform is trustworthy, product engineers move faster. They aren't constantly second-guessing whether a deploy will cause a cascade failure, or whether a schema migration will lock a table for ten minutes. Fear is a tax on velocity. Reducing it is a real contribution, and one you can measure by correlating developer sentiment surveys with per-team delivery metrics.
Examples from my own work: canary deployments; ingestion backpressure systems; migration helpers that make zero-downtime schema changes safe and easy; standardized background job patterns so teams aren't rolling their own fragile in-process hacks; predictable release rollouts.
Signal – platform owns system health before it becomes a product problem.
If product engineers are the ones noticing that something is wrong, the observability layer is too shallow. If they are learning about availability or performance issues from users, their priorities shift mid-sprint and projects slip.
The goal is for platform to be paged before product, and to resolve issues before end users are affected. That means alerts on infra resources, SLOs on core services, health checks on shared APIs, and proactive scaling, redundancy, and hardening.
When platform is on top of it, product teams stay in flow. That's a velocity improvement, even if it never shows up in a sprint retrospective. These improvements can be measured by "number of incidents", "risks retired" or "dev hours saved".
Examples from my own work: infra observability and alerting; service level objectives that match customer expectations; consistent structured logging to enable easy analysis; useful metrics and dashboards that surface the right signal without noise.
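To make "paged before product" a little more concrete, here is a minimal sketch of an SLO burn-rate check, one common way to turn an SLO into an alert. Everything named here is an assumption for illustration: the `SloWindow` shape, the function names, and the paging threshold of 10 are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would last exactly the full SLO period;
    anything above 1.0 means it runs out early.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # allowed failure ratio, e.g. 0.001 for 99.9%
    observed_failure_ratio = window.failed_requests / window.total_requests
    return observed_failure_ratio / error_budget


def should_page_platform(window: SloWindow, slo_target: float,
                         threshold: float = 10.0) -> bool:
    """Page platform on-call when the burn rate crosses an assumed threshold."""
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    noisy_hour = SloWindow(total_requests=10_000, failed_requests=200)
    print(should_page_platform(noisy_hour, slo_target=0.999))
```

The design choice worth noting is that the alert keys off how quickly the budget is disappearing rather than off a raw error count, so a low-traffic blip and a sustained failure are treated differently.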
Speed to resolution – when things break, anyone can navigate it.
Incidents will happen. The question is how much drag they create. Structured playbooks and incident response checklists create the conditions for any engineer to jump in and follow a flowchart to resolution, rather than waiting for the one person who knows the system.
Once the immediate incident has been mitigated, it is platform's time to shine. Incidents can come from anywhere, but platform is at a particular point of leverage to identify and retire common root factors. Good metrics to watch include "mean time to acknowledge" and "mean time to resolve", as well as general uptime or availability metrics.
Examples from my work: incident response checklists; incident response escalation flowchart; rollback playbooks; integrated Slack workflows.
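For readers who have not computed "mean time to acknowledge" and "mean time to resolve" before, the sketch below shows one plausible way to derive them from incident timestamps. The `Incident` record and its field names are hypothetical; a real incident tracker will export something different, and teams differ on whether the clock starts at detection or at first customer impact.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps we need."""
    opened_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident opening and someone acknowledging it."""
    if not incidents:
        raise ValueError("need at least one incident")
    gaps = [(i.acknowledged_at - i.opened_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident opening and being resolved."""
    if not incidents:
        raise ValueError("need at least one incident")
    gaps = [(i.resolved_at - i.opened_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


if __name__ == "__main__":
    history = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
                 datetime(2024, 1, 1, 11, 0)),
        Incident(datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 12, 15),
                 datetime(2024, 1, 2, 12, 30)),
    ]
    print(mean_time_to_acknowledge(history), mean_time_to_resolve(history))
```

Tracking both numbers separately helps distinguish an alerting problem (slow acknowledgement) from a playbook problem (slow resolution).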
Streamline – increase predictability in the delivery process itself.
The first three buckets are about operating software reliably. This one is about shipping it reliably. CI pipelines that are dependable and fast, self-serve preview environments and infra provisioning to unblock development, and automation of toil common to product teams all reduce the friction engineers feel when getting code to production. My favorite metrics here are build times, deploy frequency, onboarding time, and toil reduction in hours per month.
Examples from my work: parallelization of CI workers; parallelization of unit tests; fixing flaky CI tests; introduction of preview environments and infra self-serve capabilities.
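As an illustration of what parallelizing tests across CI workers can involve, here is a minimal sketch that splits test files into shards using historical durations. It uses a greedy "longest test first, onto the least-loaded worker" heuristic, which is one common approach and not necessarily what any given CI system does. The function name, the input format, and the sample timings are all hypothetical.

```python
import heapq


def shard_tests(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Assign tests to workers so that total runtime per worker is roughly even.

    durations maps a test name to its historical runtime in seconds.
    Returns one list of test names per worker.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")

    # Min-heap of (current load in seconds, worker index).
    heap = [(0.0, idx) for idx in range(workers)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(workers)]

    # Place the slowest tests first; ties are broken by name for determinism.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)  # least-loaded worker so far
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))

    return shards


if __name__ == "__main__":
    timings = {"a": 10, "b": 8, "c": 5, "d": 3, "e": 2}
    for worker, tests in enumerate(shard_tests(timings, workers=2)):
        print(worker, tests)
```

Balancing by duration rather than by file count matters because a single slow test can otherwise dominate one worker and set the build time for the whole pipeline.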
Summary
The through-line across all four: platform reliability is a product engineering multiplier. "Keeping the lights on" is the bare minimum – the mandate is to remove the drag that slows down every team that builds on top of you.
If you're a platform or infrastructure EM trying to articulate the value of your work to product stakeholders, this framing might help. It's not about uptime percentages — it's about what your reliability work makes possible for everyone else.