On Technical Debt (part 2)
In reviewing my notes from my chat with my friend on technical debt, I realized that my post missed a number of key points we had discussed. I intend to address the rest of them here.
Value Delivery
I already discussed how over-indexing on the strict definition of technical debt can lead people to forget the most important part of what they do: they are trying to solve a problem that creates value for somebody else. I’ve seen this happen as companies grow in size and complexity. When a team is small, it is easy for all people, technical and non-technical, to stay aligned and focused on value creation. But as an operation grows, organizations and departments emerge and technologists end up spending more time with other technologists rather than with business or scientific partners. While this doesn’t have to be the case, they often become more removed from the actual problems that need to be solved, and come to see themselves that way. In such a situation, their focus can shift to delivering software rather than creating value. I’ve seen teams become overly obsessed with software development best practices (which are generally “good”), at the expense of value delivery best practices.
Here’s an example and idea that may seem a bit heretical. It is common for teams to block a pull request merge if anybody formally requests changes on GitHub. This might be a software delivery best practice (we don’t want less-than-perfect code to get merged into production!), but in my view it is certainly not a value delivery best practice. Why? It naturally pushes teams into adversarial positions, even if that’s not the intent. In my view, the willingness of team members to throw out ideas is more important than perfect production code (which is a ludicrous goal to begin with). Of course, if you conduct a code review and see that some code will break a system, you should prevent that from happening, but embracing the power of “yes, and” can unlock a next level of value delivery. So don’t overly focus on software development best practices - make sure you leave room for value delivery best practices too.
Stay Alive
As a startup, the amount of value you are able to create is highly dependent on how long your company survives. A company failing is a real and high risk, so it is critical that you spend your resources focused on achieving the next value inflection point so that you can raise again (or get enough revenue in the door) to stay alive longer. Generally speaking, your infrastructure or lack of technical debt is not a determining factor in achieving those value inflection points - achieving them is dependent on your ability to solve real problems. An overabundance of technical debt will absolutely slow you down and decrease your value creation rate, but is far less likely to lead to company death early on. If you manage to stay alive long enough for your technical debt to have become a real problem, congratulations! You’ve survived longer than most startups.
YAGNI
Coined by Kent Beck, a software engineering legend and creator of Extreme Programming, this acronym stands for “You Aren’t Gonna Need It.” Explained brilliantly by Martin Fowler here, this principle outlines the dangers of building things you think you need based on a presumption of what the future will be. We all know that none of us are omniscient, yet somehow we often think we know the future state far better than we actually do. This can lead to an even more insidious form of technical debt: the creation of technical debt to avoid future technical debt.
A classic example of this is premature abstraction, wherein a team tries to create a general solution before even implementing a specific one. They try to think through the various ways instantiations will be made in the future and design with that future in mind. The abstraction is itself more complex (and harder to reason about) than the implementation they need at the moment, but hey, that’s okay because they’re setting themselves up for future success, right? Wrong. They almost invariably do not understand the future needs as well as they think, and so they build elements that are unnecessary now and unnecessary in the future. What ends up happening is they incur the costs of building features they will never need, the opportunity costs of delay (from not spending their time on other capabilities that truly are needs now), the costs of carrying/supporting those features until the supposed time of need arrives, the costs of tearing down those features once they realize they don’t actually need them (often teams don’t actually do this, creating zombie code that makes it more difficult for new team members to get their bearings on a project), and then the costs of building what they truly need in the end. In an effort to minimize future refactoring (a false form of “technical debt”), they actually incur technical debt.
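To make the contrast concrete, here is a minimal sketch of premature abstraction. Everything in it is hypothetical and invented for illustration (the CSV export task, export_csv, Exporter, CsvExporter, export); it does not come from any real project. The first function is all the team needs today. The registry below it is the kind of general solution that gets built before a second format exists, and it assumes the rows list is non-empty.

```python
import csv
import io


# What the team needs today: turn rows into CSV text.
# Assumption: rows is non-empty and every row has the same keys.
def export_csv(rows: list[dict]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The speculative alternative: a plugin registry for export formats,
# built while CSV is still the only format anyone has asked for.
class Exporter:
    registry: dict[str, type["Exporter"]] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(subclass):
            cls.registry[name] = subclass
            return subclass
        return decorator

    def export(self, rows: list[dict]) -> str:
        raise NotImplementedError


@Exporter.register("csv")
class CsvExporter(Exporter):
    def export(self, rows: list[dict]) -> str:
        return export_csv(rows)


def export(rows: list[dict], fmt: str = "csv") -> str:
    # Look up the exporter by name, instantiate it, and delegate.
    return Exporter.registry[fmt]().export(rows)


rows = [{"name": "widget", "count": 3}]
print(export_csv(rows) == export(rows))  # both paths produce the same CSV
```

Both paths produce identical output, but the second one adds a class hierarchy, a registry, and a lookup step that every new team member has to understand, all in service of formats that may never arrive.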
Timing and Ratios
Another great Kent Beck-ism is “Make it work. Make it right. Make it fast” (in that order). This alone suggests that technical debt is not inherently bad (at first you just have to make it work, correct?). But at what point in time do you shift to making it right, or making it fast? When is too early, when is too late? An example from personal experience is investing in common engineering infrastructure and tooling.
In the early days at Recursion, each engineering team worked closely with different stakeholders to develop some product or capabilities that they needed. They also managed their own infrastructure, CI/CD pipelines, etc. While they generally used common backbones (we didn’t have one team using AWS and another GCP), things weren’t centrally managed or organized very well. As the teams grew in number and complexity, we eventually realized that we needed to invest in an engineering platform team whose job was to accelerate all of the other engineers and data scientists at Recursion. This team built common tooling and infrastructure that we eventually switched all of our teams to use, and we started to see real benefits from this. Teams no longer needed to spend as much time focusing on testing, integration, deployment, etc. and could just spend their time solving the problems that their end users had. For this reason the #eng-infra-and-tools team at Recursion will always be one of my favorites - their impact was huge (and in hindsight, we probably should have invested here a year and a half earlier than we actually did).
But not having this team in place at the beginning was still the “right” call. Why? We created technical debt by having each team manage its own systems, didn’t we? Yes, but when you are a resource-strapped startup, you need to pay attention to your ratios. With one or two stream-aligned teams delivering products to end users, it doesn’t really make sense to have a separate platform team. The platform:stream-aligned team ratio is just too high (1:1 or 1:2). But once you have 4 or 5 stream-aligned teams, the ratio becomes much more reasonable (1:4 or 1:5). Timing and ratios are incredibly important in guiding your decision of “when” to start focusing on technical debt (or creating common infrastructure).
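The ratios above are easier to compare as a share of total engineering capacity. The snippet below is a rough sketch that assumes all teams are about the same size, which is a simplification of mine and not something the Recursion story specifies; platform_share is a hypothetical helper name.

```python
from fractions import Fraction


def platform_share(platform_teams: int, stream_teams: int) -> Fraction:
    """Fraction of all teams doing platform work (assumes equal team sizes)."""
    return Fraction(platform_teams, platform_teams + stream_teams)


# One platform team against a growing number of stream-aligned teams.
for stream_teams in (1, 2, 4, 5):
    share = platform_share(1, stream_teams)
    print(f"1:{stream_teams} -> {float(share):.0%} of teams on platform work")
```

At 1:1, half of your engineering capacity is not directly serving end users; at 1:5 it is roughly a sixth, and that sixth is accelerating five other teams.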
Paying it Down
So once you’ve decided it’s time to pay down your technical debt, where do you start? How do you decide which technical debt issues are worth addressing? First off, it’s important to recognize that rage tech debt paydown is probably not a good idea, though it happens quite often. At some point, a developer snaps at having to deal with the same clunky piece of code or step in some process, drops what they are doing, and hammers away at the problem. This can create additional technical debt, because things aren’t addressed with any kind of plan. Furthermore, it’s entirely reactive, and reactive situations tend to trigger fight-or-flight responses, letting our lizard brains take over and making our work less rational.
A better approach would be to work technical debt paydown into a regular ritual within your company, which is a great way of reinforcing your values (thanks Kellan for teaching me about this years ago!). What does a tech debt ritual look like? While the frequency and details might look different for each organization, you could imagine creating some cadence of regular repeating events (e.g. a 1-week sprint every quarter) to focus exclusively on technical debt paydown. Make it a celebration and a point of focus. Prepare for it, so you can think through the value-to-cost ratio of the paydown. Have team members identify issues that have been bothering them and add them to some kind of tracker with brief (1-3 sentence) write-ups on the issues. Allow the teams to score (simple 1-3 scoring) each issue in terms of value as well as cost. Pick the top N issues (where N is roughly your team headcount divided by your ideal team size), and let the people self-organize. Give them a week and let them have at it. At the end of the sprint, throw a celebration. Have a set of award categories (these will be company specific) and vote for teams to receive the awards.
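The scoring and selection step can be sketched in a few lines. This is one possible reading of the process, not a prescribed method: ranking by value divided by cost, breaking ties in favor of cheaper issues, and rounding N down are all assumptions of mine, and the issue titles and the names DebtIssue and pick_issues are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DebtIssue:
    title: str
    value: int  # 1 (low) to 3 (high)
    cost: int   # 1 (cheap) to 3 (expensive)

    @property
    def ratio(self) -> float:
        return self.value / self.cost


def pick_issues(issues: list[DebtIssue], headcount: int,
                ideal_team_size: int) -> list[DebtIssue]:
    # N is roughly headcount / ideal team size; always pick at least one.
    n = max(1, headcount // ideal_team_size)
    # Highest value-to-cost ratio first; cheaper issues win ties (assumption).
    ranked = sorted(issues, key=lambda issue: (-issue.ratio, issue.cost))
    return ranked[:n]


issues = [
    DebtIssue("Flaky integration tests", value=3, cost=2),
    DebtIssue("Duplicate deploy scripts", value=2, cost=1),
    DebtIssue("Legacy config loader", value=1, cost=3),
    DebtIssue("Slow local builds", value=3, cost=1),
]

for issue in pick_issues(issues, headcount=12, ideal_team_size=4):
    print(f"{issue.title}: ratio {issue.ratio:.2f}")
```

With 12 people and an ideal team size of 4, this picks three issues and leaves the low-value, high-cost one in the tracker for another time.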
The idea here is not to limit technical debt paydown to these tech-debt weeks, but to create a culture where addressing technical debt is viewed as important, where it is valued. From there, you can create an expectation and space for it to also be addressed in a more continuous, ongoing manner. But lest you over-index here and send an unequivocal, uncontextualized message that “technical debt is bad”, I recommend you balance this out with annual hack weeks, which are notorious for creating technical debt alongside immense value (more on these in a different post).
Conclusion
Hopefully these tips will be helpful to somebody out there who is facing the constant question of “to technical debt or not to technical debt”. Just remember that context matters. Technical debt is not inherently evil - it is simply a tool that can be immensely valuable or harmful, depending on how you wield it.