A framework for a system design interview

Recently, one of my direct reports was assigned a rate limiting feature to be implemented across our mobile games. They're fairly new to backend engineering, and I was keen to get them more opportunities to practice software design. I was pleased to remember that designing a rate limiter is one of the examples in Alex Xu's wonderful book, System Design Interview, so I brought it from my shelf to my desk.

I've come back to this book often, and I recommend it frequently. The design examples are relevant and readable, full of interesting tidbits that encourage re-reading. But the greatest value for someone new to system design is the framework introduced in the third chapter. This is a recap, with my own twist.

  1. Problem & Scope (5-10 minutes)
    Ask questions, gather information, and clarify requirements and assumptions. Validate assumptions with the interviewer. Once an assumption is stated or validated, write it down on the whiteboard.

    Ask questions to get exact requirements: What specific features? How many users? How fast do we anticipate scaling up? What are anticipated sales over the next year? What is the tech stack? What services do we have internally that can be leveraged for this problem? What is the expected volume in daily active users (DAU)? Performance requirements? Availability requirements? Data lifecycle requirements?

    Post-its are useful for gathering information, as they can be easily grouped and moved, and can also represent events which can be visualized as a sequence, such as in a Big Picture Event Storm.
  2. High-level Design and Buy-in (10-15 minutes)
    Move to a new area on the whiteboard and start workshopping your ideas with topographical diagramming, thinking out loud, and never ignoring the interviewer. The idea is to sketch a diagram of the key areas of architectural significance. This often includes an entity relationship diagram for components and the flows for a few prioritized user stories that represent the scope of the problem to the satisfaction of the interviewer.

    Rough out a quick blueprint. Consider clients, servers, APIs, processors, message queues, data stores, redundancies, caching layers, web servers, load balancers, and CDNs, but include each only if the scope dictates it. Be wary of over-engineering or premature optimization. State what you are willing to ignore for the moment and why, and what might deserve more attention. Check in, and explain the direction of the flow.

    If the problem scope involves system performance and processing for a specific load, propose a system capacity estimate using common scalability rules of thumb to sanity-check that the blueprint will fit the scale constraints.

    If the problem scope is more oriented to product or service development, consider including a high-level look at significant API endpoints, or database tables.
  3. Deep Dive (10-25 minutes)
    At this point the overall goals and feature scope are accounted for, and you have a high-level reference architecture for the overall design. You've briefly discussed each component at a high level. You've checked in for feedback, and have confirmed with the interviewer which areas are trivial and which are more interesting to pursue.

    Feel free to ask the interviewer which areas they would like you to dive deeper into, and be ready to offer your own suggestion and explain why. Every question is different, and there are no simple heuristics here that will work for every question. It's a great area to demonstrate your experience, but be wary of diving too deeply into details that interest you but not the interviewer. You have limited time, so continue to be disciplined and to check in often.
  4. Wrap Up (3-5 minutes)
    You've covered the interesting problems to the satisfaction of the interviewer, and you're running low on time. Before ending the session, acknowledge that your design is not perfect and can still be improved. Recap the tradeoffs that were made, and identify potential bottlenecks and improvements. Describe how the architecture can evolve to support additional operational qualities such as configuration, monitoring, and observability. Consider edge cases or failure cases, and how they could be handled. Consider how the next scale curve could be handled if the system had to grow by another order of magnitude.
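
The capacity estimate suggested in step 2 is usually just a few multiplications, and it can help to see them written out. Below is a minimal sketch in Python of that kind of back-of-the-envelope check. The `Workload` fields, the helper names, and every number in the example are my own placeholders for illustration, not figures from the book or from a real system; in an interview they would come from the assumptions you validated in step 1.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Assumptions gathered during scoping (all values are hypothetical)."""
    daily_active_users: int
    requests_per_user_per_day: int
    peak_to_average_ratio: float   # rough guess at how spiky traffic is
    bytes_stored_per_request: int
    retention_days: int


def average_qps(w: Workload) -> float:
    """Average requests per second, spread evenly over a day."""
    return w.daily_active_users * w.requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(w: Workload) -> float:
    """Peak requests per second, using the assumed peak-to-average ratio."""
    return average_qps(w) * w.peak_to_average_ratio


def storage_bytes(w: Workload) -> int:
    """Total bytes written over the retention window (no replication or overhead)."""
    daily_requests = w.daily_active_users * w.requests_per_user_per_day
    return daily_requests * w.bytes_stored_per_request * w.retention_days


if __name__ == "__main__":
    # Placeholder numbers, chosen only to make the arithmetic easy to follow.
    example = Workload(
        daily_active_users=1_000_000,
        requests_per_user_per_day=20,
        peak_to_average_ratio=2.0,
        bytes_stored_per_request=1_000,
        retention_days=365,
    )
    print(f"average QPS: {average_qps(example):.0f}")
    print(f"peak QPS:    {peak_qps(example):.0f}")
    print(f"storage:     {storage_bytes(example) / 1e12:.1f} TB")
```

The point is not precision. If the peak comes out in the hundreds of requests per second, a single well-provisioned service may be enough; if it comes out in the hundreds of thousands, the blueprint probably needs the load balancers, caches, and queues that were optional a moment ago.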