Cody Django Redmond

New job, new dog

Cody Redmond — Sun, 05 Jul 2026 21:45:07 GMT

Happy to announce I'm joining the leadership team at getjobber.com, where I'll be managing a couple of teams in the fintech area. I'm excited about this opportunity, as it's a great match for my skills and experience, with a growing and maturing organization.

We also got a new pup! Shep, a Silken Windhound. I've had the last two months off to spend time with him as he settles into his new life with our family. He's great. Very much a growing puppy, exploring and learning. His baby teeth are so sharp!

Joni already loves him, but it's difficult for her to understand when he's over-tired and bitey, and trying to steal her toys. I found a great dog trainer, and I'm going to a class every Saturday morning, learning how to be the best puppy-coach I can be.

Shipping a voice-native Brand Discovery product

Cody Redmond — Thu, 18 Jun 2026 22:54:59 GMT

I'm an engineering leader, but I'm also a builder. This is a short report on a product I'm taking from hypothesis to beta: BrandSoup.ai, a voice agent that interviews a founder about their brand and produces a strategist-grade report. I'm sharing it to show how good judgment enables development velocity, even with a tiny team.

Situation

My brother-in-law is a brand strategist in Australia. We were talking about how brand strategy is locked inside expensive consultants and intimidating, lengthy questionnaires. Most founders never do the work. We want to deliver a conversation (natural, spoken, adaptive) that extracts the same signal a strategist would, and turns it into a usable framework and PDF that can work as a direct input into Go-To-Market motions.

Hypothesis and assumptions

Hypothesis: founders will go deeper, faster in a spoken conversation than in a form, and an LLM can reliably infer structured answers from that messy dialogue.
Assumptions: beta users are primarily on mobile, on speaker, with no headphones; latency and barge-in matter more than polish; and the underlying brand framework will change frequently as we dial in our target users and better understand their requirements — so nothing should be hardcoded.

Goals

Validate the hypothesis quickly: get real founders talking to it.
Technical feasibility: voice and multi-agent coordination are durable capabilities – this isn't a one-off, so do it right.
Find product-market fit: instrument enough to learn what converts and what doesn't.

Initial user stories

As a founder, I speak with an agent that asks one good question at a time and adapts to what I've already said, working through a set of framework criteria.
As a founder, I get a polished brand report by email within 20 minutes of starting the conversation.
As our brand expert, I edit the criteria framework (pillars, aspects, queries) in a spreadsheet — with no deploy required.
As our PM, I see per-call cost and coverage so we can tune for unit economics.
As a beta partner, I get something that just works on my Android phone.

Stakeholders today are deliberately tight: the brand-manager expert, the PM, and a small set of beta partners. I kept the circle small on purpose, because a loud feedback loop with the wrong people would pull us toward scale before we've earned the right to it. I also sequenced the bets: prove the conversation works on the cheapest, hardest device first (Android, on speaker); make the framework editable by a non-engineer before tuning the model; and defer payments until there's something worth paying for. Each decision protects the team's attention for the one question that matters this quarter — does the hypothesis hold?

Technical challenges

I'll pick three.

1. Two agents, one conversation

The hard part of a voice agent is that thinking and talking compete for the same critical path. My answer is a dual-process blackboard. A Conversational Agent (CA) drives the dialogue. An Analytical Agent (AA) runs in the background after each turn — inferring answers, updating coverage, and writing a fresh StrategicBrief to a shared blackboard. The CA never waits on the AA; it just reads the latest brief at the top of its next turn.

The tradeoff I accepted: the AA is intentionally one turn behind. That's a small loss of immediacy for a large win in responsiveness, and it keeps each process independently testable.

sequenceDiagram
%%{init: {'theme':'base', 'themeVariables': {'background':'transparent','primaryColor':'#ffffff','primaryTextColor':'#1a1a1a','primaryBorderColor':'#333333','lineColor':'#333333','fontSize':'14px','edgeLabelBackground':'#ffffff'}}}%%
    participant U as User (voice)
    participant CA as Conversational Agent
    participant BB as Blackboard
    participant AA as Analytical Agent
    U->>CA: speaks (turn N)
    CA->>BB: read StrategicBrief
    CA->>U: reply (turn N)
    CA-->>AA: fire-and-forget evaluate(turn N)
    AA->>AA: infer answers + coverage
    AA->>BB: write coverage + new StrategicBrief
    Note over CA,AA: AA stays one turn behind,<br/>off the critical path
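
The pattern in the diagram can be sketched in a few lines of asyncio. This is a minimal illustration, not the BrandSoup code: `Blackboard`, `analytical_agent`, `conversational_turn`, and `run_demo` are hypothetical names, and the `asyncio.sleep` calls stand in for model inference and for the user speaking.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StrategicBrief:
    """Snapshot the Analytical Agent publishes for the Conversational Agent."""
    as_of_turn: int = 0
    covered: list[str] = field(default_factory=list)


class Blackboard:
    """Shared state: the AA replaces the brief, the CA only reads it."""

    def __init__(self) -> None:
        self.brief = StrategicBrief()


async def analytical_agent(bb: Blackboard, turn: int, utterance: str) -> None:
    # Stand-in for slow inference that runs off the critical path.
    await asyncio.sleep(0.05)
    bb.brief = StrategicBrief(turn, bb.brief.covered + [utterance])


async def conversational_turn(
    bb: Blackboard, turn: int, utterance: str, background: set
) -> str:
    brief = bb.brief  # read whatever the AA last published; never wait on it
    reply = f"turn {turn}: replying with brief from turn {brief.as_of_turn}"
    # Fire-and-forget: schedule the AA and return the reply immediately.
    task = asyncio.create_task(analytical_agent(bb, turn, utterance))
    background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return reply


async def run_demo() -> list[str]:
    bb, background, replies = Blackboard(), set(), []
    for turn, utterance in enumerate(["we sell coffee", "to busy parents"], start=1):
        replies.append(await conversational_turn(bb, turn, utterance, background))
        await asyncio.sleep(0.1)  # the user is talking; the AA finishes meanwhile
    return replies
```

Running `asyncio.run(run_demo())` shows the one-turn lag: the reply for turn 2 is built from the brief written after turn 1.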

2. Real-time audio

Full-duplex voice on a phone speaker is a feedback nightmare: the mic hears the agent. We went half-duplex while the agent speaks, then restored interruption with a client-side RMS VAD in an audio worklet that emits a barge-in signal. Along the way we killed a class of audio clicks (WAV-header artifacts, PCM chunk misalignment) and worked around iOS ignoring the AudioContext sample-rate hint by reporting the real mic rate and resampling on playback. Every threshold lives in config and a tuning log, so the PM and I can tweak endpointing and barge-in timing without a code change.
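
To make the barge-in idea concrete, here is a rough sketch of an RMS-based detector. The real one runs client-side in an audio worklet; this Python version only illustrates the logic. The function names, the `threshold` of 0.05, and the `min_loud_frames` debounce are assumptions for illustration, not the tuned production values.

```python
import math
from typing import Iterable, Iterator, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def barge_in_events(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.05,   # assumed level above which a frame counts as speech
    min_loud_frames: int = 3,  # assumed debounce: consecutive loud frames required
) -> Iterator[int]:
    """Yield the index of each frame at which a barge-in would be signalled."""
    loud_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            loud_run += 1
            if loud_run == min_loud_frames:
                yield index  # signal once per continuous loud run
        else:
            loud_run = 0  # a quiet frame resets the debounce counter
```

Requiring several consecutive loud frames keeps a single click or cough from interrupting the agent, at the cost of a few frames of added latency.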

3. Configurability as an architectural choice

The fastest team is the one that doesn't redeploy to learn something. So I treated configurability as a first-class requirement, not a nicety. The brand framework — pillars, questions, the strategist's prompts — lives in Google Sheets, cached locally, owned by our brand expert. Voice behavior (Deepgram model and voice, STT endpointing, barge-in and VAD thresholds) is env-driven and tracked in a tuning log. Web config is the same. The payoff is that the PM and the brand expert can run experiments without me in the loop — change a question, swap a voice, tighten endpointing, and watch the next call. That's the difference between a feedback loop measured in minutes and one measured in deploys.
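
As a sketch of what env-driven voice settings can look like, assuming a typed config object loaded once at startup: the variable names (`VOICE_TTS_VOICE`, `VOICE_ENDPOINTING_MS`, `VOICE_BARGE_IN_RMS`) and the defaults are hypothetical, not the project's actual keys.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoiceConfig:
    tts_voice: str
    endpointing_ms: int
    barge_in_rms: float


def load_voice_config(env: Mapping[str, str] = os.environ) -> VoiceConfig:
    """Read tunables from the environment, falling back to illustrative defaults."""
    return VoiceConfig(
        tts_voice=env.get("VOICE_TTS_VOICE", "default-voice"),
        endpointing_ms=int(env.get("VOICE_ENDPOINTING_MS", "300")),
        barge_in_rms=float(env.get("VOICE_BARGE_IN_RMS", "0.05")),
    )
```

Passing the environment in as a mapping keeps the loader easy to test, and parsing into typed fields means a bad value fails at startup rather than mid-call.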

The end-to-end loop

Conversations hold open a WebSocket; everything else (auth, conversation state, criteria progress, report download) is plain HTTP. A completed conversation flushes its confirmed answers to the database. Once enough answers meet the threshold for quality completion, the user can choose to wrap up and generate their report. Report generation (Claude authoring, PDF, email) runs async on Redis + RQ workers, off the request path.

%%{init: {'theme':'base', 'themeVariables': {'background':'transparent','primaryColor':'#ffffff','primaryTextColor':'#1a1a1a','primaryBorderColor':'#333333','lineColor':'#333333','fontSize':'14px','edgeLabelBackground':'#ffffff'}}}%%
sequenceDiagram
    participant B as Browser
    participant API as FastAPI
    participant Q as Redis + RQ
    participant W as Worker
    B->>API: journey complete
    API->>Q: enqueue report job
    API-->>B: 202 accepted
    W->>W: Claude → PDF → email
    W-->>B: delivered (SparkPost)
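
The accept-then-work shape of that flow can be shown with the standard library alone. This is a stand-in, not the RQ API: `queue.Queue` plays the role of Redis + RQ, a thread plays the worker, and `submit_report`, `worker`, and `run_report_demo` are hypothetical names.

```python
import queue
import threading


def submit_report(jobs: queue.Queue, conversation_id: str) -> tuple[int, dict]:
    """Request handler: enqueue the job and answer right away with 202."""
    jobs.put(conversation_id)
    return 202, {"status": "accepted", "conversation_id": conversation_id}


def worker(jobs: queue.Queue, delivered: list[str]) -> None:
    """Background worker: author, render, and send, all off the request path."""
    while True:
        conversation_id = jobs.get()
        if conversation_id is None:  # sentinel: shut down cleanly
            jobs.task_done()
            break
        delivered.append(f"report for {conversation_id}")  # stand-in for LLM, PDF, email
        jobs.task_done()


def run_report_demo() -> tuple[int, list[str]]:
    jobs: queue.Queue = queue.Queue()
    delivered: list[str] = []
    thread = threading.Thread(target=worker, args=(jobs, delivered))
    thread.start()
    status, _body = submit_report(jobs, "conv-1")  # returns before the report exists
    jobs.put(None)  # stop the worker once the queue drains
    thread.join()
    return status, delivered
```

The handler's only job is to record intent and respond; everything slow happens after the 202 has already gone back to the browser.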

Sentry watches all three tiers — browser, API, and worker.

%%{init: {'theme':'base', 'themeVariables': {'background':'transparent','primaryColor':'#ffffff','primaryTextColor':'#1a1a1a','primaryBorderColor':'#333333','lineColor':'#333333','fontSize':'14px','edgeLabelBackground':'#ffffff'}}}%%

flowchart LR
    Web[React + Vite SPA]
    Web <-->|HTTP / WebSocket| API[FastAPI]
    API --> DB[("SQLite:<br/>auth, journey, reports")]
    API -->|enqueue| Q[(Redis + RQ)]
    Q --> W[RQ Worker]
    W --> LLM[Claude]
    W --> PDF[PDF service]
    W --> Mail[SparkPost]
    API <-->|STT / TTS| DG[Deepgram]
    API <-->|streaming| LLM
    Sheets[("Google Sheets:<br/>framework data")] --> API
    Web -.-> Sentry[(Sentry)]
    API -.-> Sentry
    W -.-> Sentry

Try it: brandsoup.ai

On values, revisited

Cody Redmond — Wed, 13 May 2026 22:29:00 GMT

I was asked recently: What are your core values as a software developer and manager?

I've written about values before. But it's been a few years, with challenges and learnings along the way. Was I going to answer the same way? Had anything actually changed?

Kindness and Curiosity

Software is a team sport. The code is rarely the challenging part. Communication, collaboration, coordination: human interactions are at the core of where we work, and how we move systems forward.

Something I've learned is that effective communication relies on openness, and these open lines of communication must be established and nurtured with my reports, peers, and stakeholders. Kindness and curiosity, applied regularly, are what keep that openness alive past the first few months.

Healthy, open communication is the stage for feedback, vulnerability, and how we show up when things are broken or unclear. It's much easier to have a hard conversation when the space for the conversation is already established. When the lines of communication close, a team can quietly fall apart.

Stewardship

This one's about ownership, but with a wide, strategic lens.

As engineering managers, we're not just responsible for a slice of the codebase or our direct reports at a given point in time. We work in complex, dynamic systems. We're navigating terrain, both technical and organizational, and our job is to increase the team's leverage, reduce the risk, and find the optimal positions to win.

This is situational awareness, and it applies equally to the growth of people and systems. It's acknowledging that we're responsible for how systems change over time, and that the criteria for a good decision change with them. This applies to evolving complex software systems, growing engineers, and building high-trust, effective teams.

Stewardship means we're asking: what does this look like in a year? Does this action move us closer to a goal?

Clarity

I almost called this one Intentionality, because that's really what it's about. Clarity is the outcome; intentionality is the practice.

Anyone who's been through a scaling period knows what unmanaged complexity feels like. It creeps in as extra meetings, blurry ownership, metrics that measure the wrong things, systems that no one fully understands anymore. Left alone, it becomes a ceiling.

Clarity is the discipline of pushing back against that. It's choosing language, abstractions, caveats, and targets. It's making the tradeoff explicit instead of letting it make itself. It's being deliberate about what signals you're listening for and what success actually looks like.

Those are still my three. Kindness and Curiosity, Stewardship, Clarity. I don't think this is a departure from what I've written about in the past, just an updated articulation with more emphasis on direction.

Velocity depends on reliability

Cody Redmond — Wed, 29 Apr 2026 20:38:09 GMT

When platform teams talk about reliability, we often talk about it in terms of uptime, SLOs, and incident counts. But how does the concept of reliability fit in with the larger organization?

Reliability is what enables a focus on delivery as the organization continues to grow — it's a core responsibility of a platform team. So what does reliability unlock for the engineers building on top of platform? The answer, I think, falls into four buckets.

Safety – engineers ship without fear.

When the underlying platform is trustworthy, product engineers move faster. They aren't constantly second-guessing whether a deploy will cause a cascade failure, or whether a schema migration will lock a table for ten minutes. Fear is a tax on velocity. Reducing it is a real contribution, which can be measured through a correlation of sentiment surveys and per-team delivery metrics.

Examples from my own work: canary deployments; ingestion backpressure systems; migration helpers that make zero-downtime schema changes safe and easy; standardized background job patterns so teams aren't rolling their own fragile in-process hacks; predictable release rollouts.

Signal – platform owns system health before it becomes a product problem.

If product engineers are the ones noticing that something is wrong, the observability layer is too shallow. If the product engineers are learning about availability or performance issues from users, their priorities may now be shifting mid-sprint, creating project delays.

The goal is for platform to be paged before product. For platform to resolve any issues before end users are affected. That means alerts on infra resources, SLOs on core services, health checks on shared APIs, and proactive scaling, redundancy, and hardening.

When platform is on top of it, product teams stay in flow. That's a velocity improvement, even if it never shows up in a sprint retrospective. These improvements can be measured by "number of incidents", "risks retired" or "dev hours saved".

Examples from my own work: infra observability and alerting; service level objectives that match customer expectations; consistent structured logging to enable easy analysis; useful metrics and dashboards that surface the right signal without noise.

Speed to resolution – when things break, anyone can navigate it

Incidents will happen. The question is how much drag they create. Structured playbooks and incident response checklists enable the conditions where any engineer can jump in and flowchart their way to resolution, rather than waiting for the one person who knows the system.

Once the immediate incident has been mitigated, this is Platform's time to shine. Incidents can come from anywhere, but Platform is at a particular point of leverage to identify and retire common root factors. Good metrics to watch for include "mean time to acknowledge", "mean time to resolve", as well as general uptime or availability metrics.

Examples from my work: incident response checklists; incident response escalation flowchart; rollback playbooks; integrated Slack workflows.
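
For anyone who wants to track the two timing metrics above, here is a small sketch of how they can be computed from incident timestamps. The `Incident` record and its field names are hypothetical, and both means are measured from the moment the incident opened, which is one common convention; some teams measure time to resolve from acknowledgement instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Average a non-empty collection of durations."""
    items = list(deltas)
    if not items:
        raise ValueError("need at least one incident")
    return sum(items, timedelta()) / len(items)


def mtta(incidents: Iterable[Incident]) -> timedelta:
    """Mean time to acknowledge, measured from when the incident opened."""
    return mean_delta(i.acknowledged - i.opened for i in incidents)


def mttr(incidents: Iterable[Incident]) -> timedelta:
    """Mean time to resolve, measured from when the incident opened."""
    return mean_delta(i.resolved - i.opened for i in incidents)
```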

Streamline – increase predictability in the delivery process itself

The first three buckets are about operating software reliably. This one is about shipping it reliably. CI pipelines that are dependable and fast, self-serve preview environments and infra provisioning to unblock development, and automation of toil common to product teams all reduce the friction engineers feel when getting code to production. My favorite metrics here are build times, deploy frequency, onboarding time, and toil reduction in hours per month.

Examples from my work: parallelization of CI workers; parallelization of unit tests; fixing flakey CI tests; introduction of preview environments and infra self-serve capabilities.

Summary

The through-line across all four: platform reliability is a product engineering multiplier. "Keeping the lights on" is the bare minimum – the mandate is to remove the drag that slows down every team that builds on top of you.

If you're a platform or infrastructure EM trying to articulate the value of your work to product stakeholders, this framing might help. It's not about uptime percentages — it's about what your reliability work makes possible for everyone else.

AI documentation dividend

Cody Redmond — Tue, 24 Feb 2026 15:39:58 GMT

A friend recently asked me what qualities make a great engineering manager, and one of the things I mentioned was risk management – specifically, the value of investing in design artifacts like system diagrams, reference architectures, and clear interface boundaries before diving into execution. In my experience, the hard part was never knowing these things were valuable. The hard part was convincing anyone to pay for them.

That dynamic is shifting in an interesting way. As organizations start investing in agentic AI workflows for software development, there's a new top-down pressure to produce exactly the kind of context-rich documentation that engineering managers have been quietly advocating for years. Agentic systems need well-defined boundaries, clear interface contracts, and accurate architectural references to make good decisions autonomously. Leadership is now asking for these artifacts because AI needs them — and that's creating buy-in that was historically very difficult to generate.

The irony isn't lost on me. Human engineers have always needed this context too. The difference is that "invest in documentation" never made it past a quarterly planning meeting, while "enable our AI agents to work effectively" apparently does. I'm not complaining... I'll take the win! But it's worth naming: if your organization is finally building out these maintainability artifacts for AI, make sure your engineers benefit from them just as much.

Annual life update

Cody Redmond — Fri, 02 Jan 2026 00:29:43 GMT

Joni is now three years old, and doing great. The last few years have blown past. It's been such a blessing to see her go from infant to toddler to little kid to "big little kid". I'm happy and healthy. My job is engaging, rewarding, and challenging. Leading a platform team is satisfying, and I'm blessed to be working with some truly exceptional individuals.

Audrey is wrapping up her first year as a human resources generalist, and she likes it. We love where we're living, and our house is still slowly undergoing renovations. My mom and one of my sisters moved nearby, which is nice. My other sister in Australia now has a second baby, and we've been in contact more frequently.

I've been meditating more frequently over the last year, and plan to keep it up. It sure seems appropriate, considering the precarious world situation we're all living through. I also hope to spend a little more time this year with robotics, and with physical activity. And less time with renovations :)

What does it mean to "operationalize" an engineering team?

Cody Redmond — Thu, 01 May 2025 15:45:12 GMT

In startup mode, it's the WHY and the WHAT that see the most consideration. The customer problem, the purpose, the service, the deliverable, etc. The HOW is often overlooked, so long as it's "good enough" and delivered as quickly as possible. A common strategy during this phase is "we'll figure it out as we go" – an ad-hoc approach.

As product-market fit is found, companies often choose to focus on growth, expanding their customer base and acquiring as many customers as possible within their market. This is the "scale-up" mode, and this is where the HOW becomes critical: how to consistently deliver quality service as customers and usage volume grow exponentially.

It's the intentional tackling of the HOW problem that is the focus of "Operations". This is the work of going from ad-hoc processes and reactive firefighting to a structured, process-driven operation that handles scale efficiently. Here's what operationalizing a reliability engineering team typically involves:

Playbooks - Creating standard operating procedures for common incidents, outages, and maintenance tasks
Controls, Automations and self-service - Building tools that enable better control of the systems, or to automate repetitive tasks and allow other teams to help themselves
Observability and monitoring - Implementation of metrics, logging, and alerting systems that provide visibility into system health and simplify troubleshooting
Service Level Objectives (SLOs) - Identifying the critical user journeys, establishing clear reliability targets and tracking performance against them
Refactoring legacy systems - Systematically identifying, updating or replacing early "quick and dirty" solutions with more scalable architectures that can handle increased load and complexity
Incident management framework - Structured approach to handling incidents with clear roles, communication channels, and post-mortems
Capacity planning - Regular forecasting of resource needs based on growth projections
Knowledge management - Documentation systems that capture institutional knowledge and reduce dependency on specific team members
Cross-training and hiring - Ensuring the team has redundancy in skills and is staffed appropriately for scale
On-call rotation - Establishing a fair, sustainable schedule for engineers to handle off-hours incidents with clear escalation paths and support systems to prevent burnout
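
One common way to track performance against an SLO target is an error budget: the amount of unreliability the target still permits in a given window. The sketch below is a generic illustration of that arithmetic, not something taken from a specific team's tooling; the function names and the 30-day window are assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget left; negative means the SLO was missed."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime, so 10.8 minutes of incidents leaves roughly 75% of the budget.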

The term "operations" has roots in both military and business process management, where it means to convert strategic goals into day-to-day operational processes. In software development, the language really gained traction with the rise of cloud operations in the 2010s, influenced by Google's approach to running large-scale SRE teams. It boils down to putting something into operation in a systematic, repeatable way.

Operationalizing is building systems, processes, and team structures that can handle a 10x scale without requiring a 10x headcount. While vertical scaling helps with raw capacity, it doesn't address processes that won't scale, or the exponential spikes or system patterns that can lead to cascading failures and erratic behavior.

In addition to efficiencies, a focus on operations is also highly rewarding for both engineers and end-users. In both cases we're investing in predictability, meaning a better and more consistent experience with fewer surprises. For engineers in particular, this can drastically improve work-life balance and increase flow time, reducing burnout.

Of course, it's not as easy as it seems. Especially when it comes to reworking legacy tech while scaling, or identifying and tuning SLOs. These are areas often underestimated, where experienced senior and staff engineers can make a huge difference.

For my next post, I'll tackle the "HOW" of operations – very meta! By that, I mean the heuristics I use to identify priorities and sequencing. I've found mapping activities, such as value-stream mapping and cognitive load/complexity mapping, to be super-useful in surfacing certain insights, and I think they're worthy of a post. I might also dip into how qualitative/anecdotal data is another useful driver, like the good work being done by the folks over at getdx.com.

What makes a good manager?

Cody Redmond — Mon, 20 May 2024 19:31:45 GMT

I was recently asked this question by a friend considering a move into engineering management. It's been a few years since I've asked myself this question, and I find it interesting how my perspective on this has changed over the years.

Before I begin, I'd like to note that I won't address the manager's responsibilities, which are frequently framed as delivering value to the organization, often by achieving operational objectives while retaining and growing reports. Instead, I'll focus on the qualities that make for a great manager within the context of a team.

Consistent Practices and Behaviours

A manager creates a stable and productive environment by establishing a repeatable set of practices and behaviours. When team members know what to expect, they can focus their energy on solving problems rather than worrying about unpredictability. Practices can (and should) evolve, but there ought to be a core stability that reinforces a psychologically safe environment for work to get done.

For example, I have a weekly team meeting with an agenda that acts as a checklist of our most important responsibilities and allows the team to discuss significant issues that may have come up. I also have 1:1 meetings every two weeks and a monthly retrospective to uncover better ways of working.

I also have a standard method for feedback (both praise and constructive) that I incorporate into my onboarding of any new direct report.

Mark Horstman covers this topic in his book The Effective Manager and the Manager Tools Podcast.
Will Larson has a blog post that covers how to systematize just enough for teams to be effective: https://lethain.com/work-policy-not-exceptions/
Claire Hughes Johnson also recently published a book on this topic, which describes the repeatable set of practices as the manager's "operating system."

Regulated and Open

Software Development can evoke contrasting opinions, perspectives, and dynamics. Managers who successfully manage their own temperaments are far more effective when collaborating with others during times of increased tension.

Managers must effectively work with people not just on their team but throughout the organization. No matter the context, an effective manager leads from a place of responsibility, curiosity, and openness. The Conscious Leadership Group refers to this place as "above the line." Conversely, being "below the line" means being closed off, defensive, and reactive.

The book The Fifteen Commitments of Conscious Leadership describes the qualities of conscious leadership in useful detail and provides methods for moving from "below the line" to "above the line."

Reliable and Effective

A manager is responsible for the standard of quality across their teams. This means modelling professionalism and follow-through. It doesn't mean that you need to be stuffy or uptight; it just means that you know which qualities are most important to build trust and an effective team and work to push that standard higher. Sometimes, I refer to this as "setting the tone."

For example, if I say I will do something, I do it and report back to the team. I use proper spelling, grammar, and full sentences, and I expect that folks come to meetings prepared. I use agendas for meetings, keep them on topic, and end them early instead of running down the clock.

In his book The Art of Leadership: Small Things, Done Well, Michael Lopp highlights that managers must set the highest standard for follow-through, which is crucial in building a high-trust team.
The central lesson of the book The Score Takes Care of Itself is that leaders who set, model, and maintain a high standard will create an environment where success is a natural outcome.
Another fun read that covers this ground is Turn the Ship Around! by David Marquet. I liked this one, too. It goes a little deeper into how shifting language can be a powerful tool in shifting behaviours.

Ambitious

A line manager often links the CEO or VP and the staff, ensuring effective alignment of organizational goals. With clear understanding and articulation, aligning the organizational vision with the personal drivers of your direct reports is possible, which can result in a powerful source of motivation.

In addition to responding to top-down objectives, a great manager will also work from the bottom up, looking for opportunities or innovations that could generate new revenue streams or unlock untapped organizational value.

Lastly, a great manager understands that engineers love hard problems and that setting high targets is a precursor to achieving exceptional outcomes. But just as important is understanding that making unachievable commitments on someone else's behalf is a source of demotivation. In this regard, it's important to maintain awareness of the technical environment in which your team is operating and to collaborate with your team to identify ambitious targets that are also achievable.

The book Drive: The Surprising Truth About What Motivates Us does a great job of unpacking how different people are motivated by different factors. It's worthwhile to check in regularly with your direct reports on the most important factors and then find ways to align those drivers with operational objectives or growth goals.

Risk-Averse

Lastly, I don't see this aspect come up in discussions as often as the others, perhaps because it's usually relegated to project management, a separate area. But from my own experience, I've noticed that a process that identifies and manages risk early enables far greater delivery and fewer surprises.

I like starting with a risk mitigation activity with my team that uncovers the assumptions, dependencies, and bets. Then, to raise our confidence, we validate assumptions and spike in the areas where we are least confident. We then move to system design diagramming and technical designs for significant areas.

If you can do this quickly and repeatably, you will often avoid working on the wrong thing for too long or seeing a project go drastically off-track. A direct report might occasionally feel unfamiliar with the "design up front" approach. They come around when they see that the entire thing can be accomplished within hours, that it generally yields better results, and that it's fun, too.

In the book "Righting Software," Juval Löwy emphasizes the importance of addressing areas of volatility in a codebase and the critical role of project design in software architecture. In particular, he states that complex software benefits from considered design, scoping, and sequencing. The design involves defining the architecture, components and interactions; the scoping ensures the functional and nonfunctional requirements are met and communicated, and the sequencing ensures that the development workflow is optimal for the team, which I often interpret as "allowing for the maximum number of iterations on running code".
The Pragmatic Engineer has lots of content about how engineering managers should be familiar with project management and risk management.

Missing from my bookshelf: the 15 Commitments of Conscious Leadership

Cody Redmond — Thu, 07 Mar 2024 18:33:59 GMT

This book was recently recommended to me by Bryan Dunn, the VP of Product at Crowdbotics. It arrived at my door on Saturday, and I started reading it immediately. My general approach to a new book is to first read the intro and conclusion – this generally gives me a good idea of how well the book reads, and how much of an investment I want to make. I was immediately hooked. The tone and language are friendly and interesting; a little hippy-dippy but not unfocused.

At this point, I'm convinced this is a book that I've been missing. Of course every leader likes to think that they're "conscious" and have a high degree of self-awareness, but this is one of those things that can be perfectly described as a blind spot. Emotions are tricky to navigate, and I'm eager for more management tricks and tools. I'm excited to dedicate my morning time over the next couple of weeks to this book in the hope of picking some up. I'll come back to this post with updates as I go.

Scaling People: Tactics for Management and Company Building

Cody Redmond — Sun, 17 Sep 2023 20:21:35 GMT

In addition to my new role at Crowdbotics, I've also started reading a recently published book on scaling people management. It was written by Claire Hughes Johnson, former COO of Stripe and a former Google executive. I'm excited because it purports to be a pragmatic take on management, complete with templates and workbook material.

Update: I've read this book, and it would have been a lot more eye-opening to me when I started my leadership journey. Although I appreciated the articulation and writing style, I had already come to many of the insights on my own. I will recommend this book to new managers or folks leaning in that direction.

Notes from the introduction:

You need process, and you need it sooner than you realize
A company will not get far without "core processes": strong management and sound operating systems.
Companies and teams must establish a playing field where everyone participates and marks progress.

You know why playing a game is fun? Because it has rules, and you have a way to win. Picture a bunch of people showing up at an athletic field with random equipment and no rules. Someone is going to get hurt. You don't know how to play, you don't know how to score, and you don't know how to win.

Research has found that people who outperform in their fields employ strategies that move them past the autonomous stage of learning, like athletes who use speed workouts to improve their performance.
A combination of core frameworks, such as hiring and planning practices and underlying leadership principles, can help scale an organization.

Workbooks: http://press.stripe.com/scaling-people/workbooks

Consider reading: https://www.amazon.ca/Working-Backwards-Insights-Stories-Secrets

Two more books on Stripe Press that I can thoroughly recommend:

Democratizing Software Development in the Low-Code Industry

Cody Redmond — Sun, 17 Sep 2023 19:12:59 GMT

I'm happy to announce that I'm starting a new leadership position as a Platform Software Engineering Manager at Crowdbotics. I'm working with a fantastic team to simplify and accelerate software development using AI and low-code/no-code approaches.

Crowdbotics was founded in 2017 by Y Combinator alum and Forbes Magazine "30 under 30" Anand Kulkarni. Since then, Crowdbotics has paved the way for low-code software development, enabling anyone to generate production-grade applications in minutes using AI-assisted product requirement analysis, visual building tools and code generation.

Over 20,000 apps have been launched through Crowdbotics, including mission-critical healthcare applications, venture-backed software products earning millions in revenue, learning management platforms, and government tools.

I'll work closely with Product Management to develop and scale the architecture and engineering teams necessary to enable a growth vector in an exciting new domain.

Although Crowdbotics has a small headquarters in Berkeley, CA, it's a globally distributed remote-first company with members across all time zones. My team is primarily on Pacific time, although there are a few members on Eastern time, as well as in Nairobi, Dubai and Nepal!

Observability at Scale

Cody Redmond — Fri, 01 Sep 2023 20:16:53 GMT

This is Part IV in Observability Engineering: Achieving Production Excellence

Build Versus Buy and Return on Investment

This chapter provides solid advice for those who are unfamiliar with the "not invented here" syndrome.

When considering building with open source tools, weigh the full impact of hidden costs like recruiting, hiring, and training to develop and maintain custom solutions and the opportunity costs of not delivering core business value.

Efficient Data Storage

There are many challenges in storing, and especially querying, observability data: queries have real-time requirements across billions of rows of ultrawide events with high dimensionality and high cardinality. This chapter uses Honeycomb's Retriever implementation to illustrate the various tradeoffs. Other publicly available data stores up to the challenge include Google Cloud BigQuery, ClickHouse, and Apache Druid.

Cheap and Accurate Enough: Sampling

Your team is probably more concerned about traces that contain errors or poor performance. Sampling is an excellent technique for improving the signal-to-noise ratio on events you care about, drastically reducing complexity and costs when considering storage and query requirements.

Because sampling is so valuable when handling observability data at scale, it's becoming increasingly common for open-source instrumentation libraries such as OTel to provide sampling logic capabilities.

There are two different sampling strategies to use tactically:

Head-based: The sampling decision is made immediately, when the trace starts, and is propagated downstream via headers. Pro: reduces the overhead of collecting and storing unnecessary traces right at the source. Con: Significant or anomalous traces may be missed because the decision is made before the outcome is known, and traces may be incomplete if only some services in a distributed system decide to sample a request.
Tail-based: The sampling decision occurs at the end of a transaction or request. The system collects all spans related to a trace and then decides whether or not to keep it based on various criteria, such as errors or latency. Pro: Traces that match the criteria are retained, leading to better insights. Con: More resource-intensive; implementation is more complex.
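To make the difference concrete, here's a minimal Python sketch of both strategies. It is not taken from the book or from OTel: the `Span` type, the function names, and the thresholds are all hypothetical, and hashing the trace ID is just one common way (assumed here) to make every service reach the same head-based decision.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, rate: int) -> bool:
    """Head-based: decide up front, using only the trace ID.

    Hashing the ID (rather than rolling a random number) means every
    service that sees the same trace ID makes the same keep/drop call.
    Keeps roughly 1 in `rate` traces.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


def tail_sample(spans: list[Span], slow_ms: float, baseline_rate: int) -> bool:
    """Tail-based: decide only after all spans of a trace are collected."""
    if not spans:
        return False
    # Always keep traces that contain an error or a slow span.
    if any(s.is_error for s in spans):
        return True
    if any(s.duration_ms >= slow_ms for s in spans):
        return True
    # Healthy traces: keep a small deterministic baseline for comparison.
    return head_sample(spans[0].trace_id, baseline_rate)
```

Note the tradeoff in the code itself: `head_sample` needs nothing but an ID, while `tail_sample` needs the whole trace buffered somewhere before it can decide, which is where the extra cost and complexity come from.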

Telemetry Management with Pipelines

More to come.

Simplifying complexity

Cody Redmond — Thu, 03 Aug 2023 20:27:49 GMT

As organizations evolve and software grows, so does complexity. Learning how to identify and curb complexity is a core skill to develop as an engineering manager.

Unconstrained complexity inevitably results in suboptimal outcomes, such as:

A slowdown in a team's ability to accurately assess a problem or deliver a solution
A slowdown in the time it takes to onboard new engineers
An increase in incidents and rework
A drop in morale or an increase in attrition

It's impossible to eliminate complexity: some problem spaces are just naturally complex, and there's no way around it. The trick for an engineering manager is to manage complexity, which means identifying, measuring, and simplifying where possible.

Talking about complexity

Complexity can be found in many forms. It's helpful to be able to identify which aspect of complexity we intend to simplify.

Aspects and measurements

Cognitive load: characterized by an overwhelming diversity of detailed tasks, frequent context switching, and few opportunities for deep work or system optimization, resulting in generally slow delivery and low morale. Measure with developer surveys, 1:1s, and onboarding metrics for new hires.
Process complexity: characterized by too many cooks in the kitchen, too many meetings, too many required signoffs, slow decision-making, unmet requirements, and frequent change orders.
Codebase complexity: characterized by slow builds, slow tests, longer cycle times on code changes and code reviews, and increased rework or deployment rollbacks. Measure with a maintainability index that suits the stage and engineering goals of the organization.
System complexity: characterized by no single person knowing how the system works, no known success metrics, or low-quality metrics; quality or performance feedback often comes from end-users, and debugging or root cause analysis requires multiple people and a significant investment. Measure with build times, time it takes from code written to code in production, time for functional tests to run, rework ratios, and other DORA-inspired metrics.

Assessment and Approaches

The Cynefin framework helps leaders assess their operating context so they can take appropriate action quickly. The four quadrants are Simple, Complicated, Complex, and Chaotic. Simple is characterized by apparent cause-and-effect relationships, where correct answers are based on facts and easily verified; things become much less linear from there. I won't go into the details of each quadrant's characteristics (feel free to read the link or check out the wiki page; it's excellent). But mapping complexities with this framework has often led me to the following actionable behaviours:

Reduce Chaotic domains to Complex by drawing signal out of the noise with Observability and collaborative Event Storming.
Reduce Complex domains to Complicated by introducing abstractions, boundaries, patterns, and workflows.
Make Complicated domains Simple by introducing or improving tooling, access to specialists, or load-shedding via a specialized team to handle inherently complex systems that are core to the organization's revenue streams.
Eliminate Simple tasks via automation or outsourcing.

Conclusion

Divide and Conquer: Complexity in software is typically managed by a "divide and conquer" approach, which can be applied at any level of granularity when considering software systems.
Abstractions, Interfaces and Boundaries: introduce smaller cognitive loads with specialization and bounded contexts modelled on supporting the business's current and future revenue streams. Move teams to support the bounded contexts, and have those teams own the architecture that they depend on and document with C4 patterns.
Refactoring to Design Patterns: Refactor a complicated codebase towards named design patterns to increase understandability, maintainability, and flexibility.
Introduction of Frameworks: If you notice the same sorts of use cases coming up frequently, introducing a framework can have a huge impact. It might require an up-front investment, so be prepared with a considered plan when proposing it to your team.
Enabling Teams: Use a temporary enabling team to aid in strategic refactoring or short-term organizational projects such as GDPR, security improvements, etc. Enabling teams are also great for tackling technical debt or building technical surplus to unlock additional product velocity. They can also be transitioned easily to permanent developer experience teams if desired.
Complex Subsystem Teams: Load-shed core complexity from stream-aligned product teams with the introduction of a team designed to handle a particularly complicated area.

Bonus points:

Establish baseline performance measurements with SLAs, SLOs, and SLIs for services in production: Identifying baseline performance metrics enables a clear focus, especially when things are overwhelming. Knowing the performance and expectations of production services results in understanding the capacity for taking on new objectives and making wise decisions. For example, reducing complexity might be more critical when SLAs are unmet.
Establish high-level goals with Objectives and Quantitative Key Results: Identify big goals and ensure they can be achieved incrementally (i.e. a mix of leading and lagging metrics enables progress to be measured throughout development and not just at the very end). Working with Product Stakeholders to identify the right success metrics early results in a much simpler development process and managed expectations.

Coaching frameworks

Cody Redmond — Sat, 29 Jul 2023 04:30:52 GMT

I received a question a few weeks back about whether I use a coaching framework and, if so, which one. My response was that I use a framework of my own design: a combination of SWOT analysis with OKR-style reporting. At the time, it didn't seem satisfying as an answer. The reality is always a little more nuanced; I've read multiple books on coaching, practiced various systems with my directs and received executive coaching myself. With time and experience, I've dialled in a system that has resulted in professional growth and promotion for many of my reports.

But in retrospect, it's clearer to say that I use the GROW model. It's become widespread in tech, and it's generally the answer that folks want to hear.

GROW Model of Coaching

So what is the GROW model? It's a way of speaking that is intended to result in better non-directive coaching. Why non-directive coaching? Because most leaders naturally employ a directed style of leadership, which takes place primarily through "telling" and leans heavily on authority. This style is prevalent when training a report for a specific, low-variability job.

The challenge of modern teams is a fast-moving world: objectives change, technologies evolve, and it's no longer feasible to expect a manager or leader always to know the best way forward. Directed leadership also has the unfortunate property of stifling ownership and does not build organizational capacity well.

A GROW approach to coaching requires patience to learn and apply but is rewarding and energizing when used successfully. Avoiding leading questions means that the direct report always owns the problem to be solved, and the coaching method is specifically tailored to open perspective and draw out insight.

Goal: Asking what the person wants to achieve from the session and in the immediate future, such as "What do you want when you walk out the door that you don't have now?"
Reality: This means asking probing questions about the current situation in detail. The trick here is to inquire about facts describing reality, such as what, who, when, and where – but explicitly not why, because it is tied to motivations and justifications, which border on judgment and can raise defences, which is counter-productive. A good question here is, "What are the key things we need to know?"
Options: This step is to help broaden the perspective. Often people who are seeking help feel like they have limited options. By opening the floor to creative thinking, more possibilities and perspectives emerge. I learned a trick from Product Management: Diverge for Options, Converge for Decisions. A good question here is, "If you had a magic wand, what would you do?"
Will: Asking "What will you do?" encourages detailing a specific plan. Sometimes the plan might just be to learn more about the situation, to show appreciative inquiry in learning more about a problem or someone's perspective or concerns. Sometimes, it's about preparing for a conversation or drafting a proposal. Another trick here is to ask how confident the person is in their plan and how likely they will be to act on it. Investing in helping someone through a problem, only for them not to feel confident enough to act on it, can be demotivating for all parties. My job as a people manager is to validate that my coaching serves the intended function.

The Effective Manager

This is the book I come back to more than anything. According to Mark Horstman, successful coaching is simply about asking for more. Supporting, mentoring, and coaching is not an end in itself. People managers are intended to drive value for the company. Effective managing is giving frequent feedback on performance, and effective coaching is continuously asking for higher levels of performance.

This may sound aggressive or counter-intuitive, but in a high-trust environment, it's a superpower. Too many managers feel like coaching is something to apply to low-performers, but coaching high-performers is satisfying and can yield much more.

Either way, every direct report should know their performance expectations in their role, and a regular review on the topic should occur. Not every direct will be rushing to exceed, and that's okay. They know what it is and are directly responsible for their career development. They also know that their manager will work with them to achieve their performance goals.

These are the steps for effective coaching that Manager Tools (MT) recommends:

Collaborate to Set a Goal: Describe a behaviour or result to achieve by a date. For example: by (four months from now), you will deliver admin features within one week on average without introducing any regressions. (This is an example of the MT goal structure DBQ – Date Behaviour Quality). MT specifies that coaching goals are long-term behavioural goals – if they can be achieved in less than four months, a frequent feedback model will suffice.
Collaborate to Brainstorm Resources: There are no silver bullets. Go for volume, not accuracy.
Collaborate to Create a Plan: Each step in the plan contains a deadline and behaviour and is completed when the reporting happens to the manager – effectively, the plan is an accountability system that you are directly investing in for the sake of your report. The plan is only for the first few weeks – it's not worth planning four months if they can't make it past the first week.
The Direct Acts and Reports on the Plan: If the system has been set up correctly, we should receive regular updates in the form of task completion emails, and we then briefly discuss the progress each week in 1:1s.

Steps 3 and 4 become iterative toward achieving the goal from step 1. If the direct report fails to accomplish a coaching task, we give the direct report negative feedback. MT gives an example: "When you miss your coaching deadlines, that's more work for later. Can you change that?". Similarly, positive feedback can be given when the direct report completes a coaching deliverable.

I love that Manager Tools willingly leverages what we know about organizational behaviour to set short deadlines on doable tasks to increase the chance of completion. Whenever possible, look for opportunities to observe the direct engaging in the behaviour we want, to provide the direct with feedback on what we observe, and make that a regular, very short-scope task. If coaching is behavioural, then the best way to achieve it is to enable and incentivize frequent practice.

The Coaching Habit

This book claims that seven questions and a coaching habit will foster team autonomy and empowerment. Its central argument is that successful coaching comes from asking questions rather than providing answers. I found the book too long for how little substance it has, but if you are looking for a way to refresh your 1:1 questions, you can look to the seven questions for inspiration.

Observability for Teams

Cody Redmond — Fri, 28 Jul 2023 22:59:53 GMT

This is Part III in Observability Engineering: Achieving Production Excellence

Applying Observability Practices in Your Team

Start with the most significant pain points, and then flesh out your instrumentation iteratively.

Observability-Driven Development

A key finding of Accelerate: Building and Scaling High Performing Technology Organizations was that the inverse relationship between speed and quality is a myth: high-performing teams can release high-quality code quickly, and these two qualities are correlated and reinforce each other. Conversely, failures tend to happen more often for teams that move slowly and take substantially longer to recover.
The key metric for the health and effectiveness of an engineering team is the time elapsed from when code is written to when it is in production. Every team should be tracking this metric and working to improve it.
Isolated test-driven development does not reveal whether customers are having a good experience with your service.
Observability should be used early in the software development life cycle, during the development process, to help catch defects earlier and reduce the cost of fixing them later. This is what is meant by "Shifting Observability Left."
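Since the book argues every team should track the time from code written to code in production, here's a minimal sketch of how that number could be computed. It assumes you can export a commit timestamp and a deploy timestamp per change from your own tooling; the function name and the data shape are hypothetical, and using the median is my choice rather than anything the book prescribes.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to production.

    `changes` is a list of (committed_at, deployed_at) pairs, one per
    change that has reached production.
    """
    seconds = [
        (deployed_at - committed_at).total_seconds()
        for committed_at, deployed_at in changes
    ]
    if not seconds:
        raise ValueError("need at least one deployed change")
    return timedelta(seconds=median(seconds))
```

A median is less sensitive than a mean to the occasional change that sits unreleased for weeks, which makes the trend easier to read week over week.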

Using Service-Level Objectives for Reliability

Threshold alerting is for known unknowns only. This isn't sustainable; distributed systems' failures are inevitable and unpredictable.
A good alert must reflect immediate user impact, be actionable, be novel, and require investigation rather than rote action.
SLOs decouple the "what" and "why" behind incident alerting.
SLOs are excellent at communicating how to prioritize reliability vis-à-vis feature development. If we aren't hitting SLOs, the focus ought to be reliability.
Two types of SLOs: time-based measures (99th percentile latency less than 300 ms over each 5-minute window) and event-based measures (proportion of events that took less than 300 ms during a given rolling time window).
For time-based: with a 99% target, for every 100 minutes, I'm allowed 1 bad minute. For event-based: for every 100 events, I'm allowed one bad event.
Use event-based measures because they provide a more reliable and granular way to quantify the state of a service. They are more precise, and they capture brownouts better, such as when some events fail but not all of them. You can more reasonably measure an SLO with event-based availability targets.
If SLOs are not being met, but customers are also not complaining, then perhaps it's okay to reduce the SLOs, if that could enable product development elsewhere.
If customers complain, it might be a poor leadership decision to reduce the SLOs further.
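The two SLO types above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the book's or any vendor's implementation: the `Event` type and function names are hypothetical, the 300 ms threshold and 5-minute window come from the example above, and the percentile uses a simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    timestamp_s: float
    duration_ms: float


def event_based_sli(events: list[Event], threshold_ms: float = 300.0) -> float:
    """Proportion of events that completed faster than the threshold."""
    if not events:
        return 1.0
    good = sum(1 for e in events if e.duration_ms < threshold_ms)
    return good / len(events)


def time_based_sli(events: list[Event], threshold_ms: float = 300.0,
                   window_s: int = 300, percentile: int = 99) -> float:
    """Proportion of windows whose percentile latency is under the threshold."""
    windows: dict[int, list[float]] = {}
    for e in events:
        windows.setdefault(int(e.timestamp_s // window_s), []).append(e.duration_ms)
    if not windows:
        return 1.0
    good = 0
    for durations in windows.values():
        durations.sort()
        # Nearest-rank percentile: smallest value covering `percentile` % of events.
        rank = max(1, math.ceil(percentile * len(durations) / 100))
        if durations[rank - 1] < threshold_ms:
            good += 1
    return good / len(windows)


def error_budget_remaining(sli: float, target: float = 0.99) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    allowed = 1.0 - target
    used = 1.0 - sli
    return 1.0 - used / allowed
```

This also shows why the event-based measure is more granular: a window where 2% of events are slow and a window where every event is slow both count as exactly one bad window in `time_based_sli`, while `event_based_sli` tells the two situations apart.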

Debugging SLO-Based Alerts

Stop relying on experience to guess what is happening in a system. It's unreliable and unsustainable.
Observability is not specific to debugging. Debugging's concern is to remove a bug, but it says little about the overall state of a given system. Observability will tell you which systems are good candidates to improve, which may involve debugging but could also involve performance improvements, refactoring or redesigning to achieve a target SLO.

Observability and the Software Supply Chain

Slack implemented Observability in the software supply chain, instrumenting the CI pipeline to solve complex problems throughout the CI workflow that were previously invisible or undetected.