Lessons learned from experience steering technical design collaboration

Midjourney: A scroll depicting an architecture diagram

At a certain scale and ecosystem complexity, software engineers begin to need a way to socialize their bigger design ideas outside of code review. Sometimes this starts as engineers just informally writing their ideas down in a doc and sharing with their team. After a while, someone realizes everyone keeps asking the same questions, and then software design document templates are born.

I’ve spent a little time in this space at LinkedIn, but I was in charge of Twitter’s technical design process and associated design template artifacts (let’s just call them RFCs) for quite a few years. I’ve spent a lot of time thinking about software design collaboration, and I’ll share some lessons from stewarding this kind of process and culture.

The earlier days at Twitter

Twitter’s RFC templates were modeled closely after Google’s, the result of an influx of senior engineers from Google in the company’s earlier days. The backend RFC Google Doc template was a beast: over time it had accumulated required sections from various “special interests”, and was in the ballpark of 14 pages — before being filled in. Included sections asked the author to describe:

- Multi-region design and site failover considerations

- Compute and other resource requirements for backend services

- Strategies for maintaining availability and graceful degradation

- Testing and validation strategies

- i18n concerns

- Online and offline storage design considerations

- API design

- Runbook and operations information

- Alignment with endorsed technology recommendations

- And more

For a while, Twitter Engineering had a centralized design review forum, with an arm’s-length relationship to the CTO and their advisory group of ICs. Not everything went through this forum, arguably just the big-ish things, but the designs that did go through it required explicit sign-off from a senior engineer who was an architectural authority in the space.

The collection and organization of Engineering-wide RFCs and their review was partially supported by program management, and partially by a loose collection of engineers.

Evolution of the process

I eventually became the de facto owner of the RFC process. Over time, I rolled out a series of changes, which federated some of the design review structures around Engineering, and cleaned up the RFC template. I organized a small committee of owners for the process, which incrementally updated the artifacts and process as necessary.

I also refined the role of the senior IC RFC review group to be design outcome facilitators rather than design approvers. Instead of directly approving the designs, their role was to organize and steer the purpose-built design review committees, ensuring the conversation was moving forward constructively and efficiently (in theory).

Usage of the templates and engagement with the process seemed generally pretty high; these RFCs became one of the very few nearly-ubiquitous artifacts across Engineering.

Adherence to this process was fairly consistent, but one thing never materialized: sufficient resources or attention to ensure that the process was efficient and effective for our developers and their time. We didn’t really track productivity metrics around RFC usage (or completion), or gather continuous feedback from our engineers about their ideas and experience with design review. Given the time investment typically associated with RFC development, we should have.

A few years later, for a variety of reasons, we took another look at how RFCs were being used.

This time, we got data

I orchestrated a series of engineering interviews with 50+ RFC authors, reviewers, and design facilitators, from junior to principal engineers, as well as some cross-functional partners. These interviews were run by our RFC / SDLC team. We set out to answer a series of questions:

  • What were the most frustrating, and the most useful, parts of the RFC process?
  • What kinds of things were more effective to uncover during design review, versus in code review?
  • How could we engage and motivate authors and reviewers in a way that reduced design turnaround time?
  • If we could do anything, what would the design process look like?

We got a lot of feedback — both positive and negative. We came away with a synthesized set of interesting lessons, a few of which are summarized below.

Midjourney: technical puzzle

First, authors and reviewers felt that most of the RFC template was superfluous. The level of detail in most docs was just too high, and much of the content, such as API or schema design or library usage, was better suited to code review than doc review.

There’s a set of valuable, common questions to ask every author of a system under design, which helps clarify the circumstances of what’s being built:

- Problem statement / why do we need to build something?

- Goals of the system presented in design; non-goals which are out of scope

- What success looks like, and how it can be measured

- High-level overview of solution proposal

- Details

- Alternatives considered

- Risks

Basically everything else could be moved to an accompanying checklist. Authors could quickly knock that out, and reviewers could use it to efficiently check whether their concerns were being addressed. Details about code and schema should be moved into code review tools.

Second, it turned out that the design facilitators didn’t actually like their role. This cohort of engineers tended to be more senior, and more platformy, with strong opinions about how systems should be built easily and safely. Having role-based structure around RFC review was good, but they felt that much of the facilitation work could be handled by program management. This included steering async review, convening the review committees, or helping manage stakeholders. Instead, they wanted to focus their time on partnering with authors to refine their designs.

Third, the RFC authors felt that the experience and knowledge gained by running an RFC through a design review, with its structured thought process and senior engineering facilitator, was invaluable. Although it could be lengthy, it helped them build relationships across Engineering, learn how to use novel approaches or systems supporting their work, and understand how the stack worked end-to-end.

For reviewers, often from platform or supporting teams, it helped them understand where their customer teams were going, and what kinds of problems they were trying to solve. In this sense, RFCs, in aggregate, acted as a proxy for customer team strategy.

Lastly, RFC reviewers wanted a better understanding of what kinds of assertions or behaviors in the design were being dictated by a corresponding product spec, versus being made by the RFC author themselves. This would help avoid arguments over product definitions and tradeoffs in the RFC amongst the reviewers in Engineering. Those discussions are better to have with the Product Manager or requirements owner directly.

Regrettably, we could have gathered this feedback years earlier. We made a series of changes to the RFC materials and process based on what we learned. After aligning with our cross-functional stakeholders, we launched the changes to positive reception.

There’s a set of considerations which can help make RFC templates and processes simultaneously more effective and ergonomic for design and review.

RFCs are tools, not outcomes

RFCs are tools for supporting better outcomes. They can help you reach those outcomes through a more maintainable design, a de-risked delivery plan — or realizing that you might not have to build anything at all. They also help teams understand gaps in capabilities provided by a technology stack, and what should be done to close them. These things should be the focus of the process — not the RFC itself.

RFCs themselves deliver little intrinsic value. They account for a snapshot of design considerations at a point in time, support communication and shared understanding, and support the nemawashi of those considerations within the system in question. If all of these things could be captured in code review for simpler changes, then the RFC is just overhead.

The amount of effort invested in any given RFC should be proportional to its likely improvement of outcomes, manifesting primarily through better design — simpler, more efficient, less risk.

Use gating processes carefully

It is far more important that an RFC get written for appropriate projects and reviewed by someone than that it not be written at all simply to avoid some procedural gating function. For the author, even the act of laying out a design and plan “on paper” can help clarify the path forward and logically break up a project into sensible deliverables and milestones.

Some designs necessitate critical, blocking review by either senior engineers, cross-functional partners, or both. Every company should have a process for determining what these are. When implementation risks of poor design far outweigh the cost of heavyweight review, spend the balance of time in design de-risking.

In many cases, this is understanding the difference between reversible and irreversible decisions.

Laundry lists of concerns in RFC templates indicate dysfunction

RFCs are things where symptoms of organizational dysfunction commonly manifest:

  • Ambiguous or non-existent product or customer requirements
  • Ambiguous team or ownership structure for implementations
  • Lack of standards or opinionated technology usage
  • Lack of clarity about architectural layering and abstraction in the ecosystem

If these are the problems tying up engineers in RFC reviews, they should be addressed at the level of the engineering team or the company.

Sometimes, all kinds of technical scale, capacity, and idiosyncratic details of internal platform usage are demanded within RFCs, even for simple projects or product systems. This is a strong indication that your software ecosystem is too complex. It’s offloading concerns to developers — causing an increase in scope and complexity of their designs. They’re being made to account for the things the org should be providing to them.

Put checklists in an actual checklist

RFC authors want flexibility to express their design in ways that feel intuitive to them. But sometimes a design process and its RFC templates boil down to a series of yes/no sections such as:

  • Did you do X?
  • Did you incorporate Y?
  • Did you consult Z?

This is much better represented as a checklist, which also gives the opportunity to build some tooling around these binary answers. It’s also much easier to evaluate whether a checklist contains the end-to-end set of “must-do’s” you’d like every design to think through.
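As a rough illustration of the kind of tooling a checklist makes possible, here is a minimal Python sketch. The `ChecklistItem` type and the two helper functions are hypothetical names invented for this example, not part of any real RFC tool; the questions are the placeholder X/Y/Z items from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    """One yes/no question from the design checklist (hypothetical structure)."""
    question: str
    answer: Optional[bool] = None  # None means the author has not answered yet


def unanswered(items: list[ChecklistItem]) -> list[str]:
    """Return the questions the author still needs to answer."""
    return [item.question for item in items if item.answer is None]


def needs_followup(items: list[ChecklistItem]) -> list[str]:
    """Return the questions answered "no", which reviewers may want to discuss."""
    return [item.question for item in items if item.answer is False]


# Placeholder checklist mirroring the yes/no sections above.
checklist = [
    ChecklistItem("Did you do X?", answer=True),
    ChecklistItem("Did you incorporate Y?", answer=False),
    ChecklistItem("Did you consult Z?"),
]

print(unanswered(checklist))      # questions with no answer yet
print(needs_followup(checklist))  # questions answered "no"
```

Because each answer is a plain boolean, a reviewer (or a script) can see at a glance what is outstanding without reading any prose.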

I’ve found that most authors actually like checklists, because they don’t want to miss anything important. The problem is, they also don’t want to spend time cargo-culting written prose about how they went about doing X or consulting Z.

The RFC reviewer is a constituency involved in this process, too

It’s easy to over-index on the role of the author of the RFC, without realizing that there are usually an order of magnitude more reviewers than authors involved in this process.

Efficient RFC review is supported by reviewers understanding what their responsibilities are in the context of the RFC review. It’s further improved by “no surprises” — reviewers know what to look for and how to understand it. Ensuring a minimally-disruptive and focused role for specialized reviewers makes it easier to secure their consultation in the future.

Reviewers have obligations too: much like code review, their job is to offer constructive feedback that improves the overall design, in a timely fashion. Nitpicking and sluggish review slow the entire process down and cost everybody involved. Standards for review conduct should be established.

This stuff is important; it needs to be owned and improved

If most of your engineers engage in an RFC-like design process that steers how the big technical ideas at the company are formulated and reviewed, it needs to be properly owned.

Engineers don’t like putting up with inadequate code review tools or processes; RFCs aren’t really any different. These processes should be measured so they can be iteratively improved, and better support engineers in their design work. This ultimately helps deliver better, more time-efficient design outcomes for the company.
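To make "measured" a little more concrete, here is a minimal Python sketch of two metrics mentioned earlier: design turnaround time and RFC completion. The records, field layout, and function names are all made up for illustration; a real setup would pull this from wherever RFCs are actually tracked.

```python
from datetime import date
from statistics import median

# Hypothetical records: (RFC title, date review opened, date review closed or None).
rfcs = [
    ("RFC A", date(2023, 1, 2), date(2023, 1, 16)),
    ("RFC B", date(2023, 1, 9), date(2023, 2, 8)),
    ("RFC C", date(2023, 2, 1), None),  # still in review
    ("RFC D", date(2023, 2, 6), date(2023, 2, 13)),
]


def turnaround_days(records):
    """Days from review opening to closing, for completed reviews only."""
    return [
        (closed - opened).days
        for _, opened, closed in records
        if closed is not None
    ]


def completion_rate(records):
    """Fraction of RFCs whose review has been closed."""
    if not records:
        return 0.0
    completed = sum(1 for _, _, closed in records if closed is not None)
    return completed / len(records)


days = turnaround_days(rfcs)
print("median turnaround (days):", median(days))
print("completion rate:", completion_rate(rfcs))
```

Numbers like these are only a starting point; they are most useful alongside the kind of interview feedback described above.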
