
Technical Debt: the "what", the "why", and the "what the hell do we do about it"

Nydex

One With The Trees
Staff member
Moderator
Donator
Merits
1,285
Sooner or later, technical debt turns into a relevant topic of discussion in every software engineering team. Usually, the sooner, the better.

Tech debt (referred to as TD from here on) is a useful metaphor for the extra cost and friction created when a codebase is left in a less-than-ideal state to gain short-term advantage. It takes many forms - some more dangerous than others - and in all cases is something teams should always keep in mind and work diligently to reduce as much as possible.

To preface this, I want to say that I don't think any sufficiently large development team can completely avoid creating TD, especially not when they're operating within the framework of the modern software market, with tight deadlines, exacting clients who like to change their requirements far too often, and in more recent times - the surge of AI tools used to generate what I like to call "software slop".

Many people like to illustrate the idea of TD by comparing it with interest, since it is a metaphor that maps short-term implementation shortcuts to a future cost - additional effort (interest) required to change or extend the system later. When used wisely, it can be a tool to accelerate validated learning and deliver stuff on time; used carelessly, it becomes a crippling obstacle that grows with time.

The top-level categories TD can be split into are:
  • High-interest
  • Low-interest
Think of high-interest TD as code lacking tests covering system-critical functionality, or code that lacks necessary documentation. This is code that will degrade exceptionally quickly and predictably. Thus, the interest analogy applies - you will "pay back" the damage sooner, because you'll have to fix all the issues sooner.

On the other hand, low-interest code can be depicted as using outdated libraries or something of that nature - it's not immediately a problem, and it's very unclear if it will ever become a problem, but is definitely something that you need to keep an eye on for vulnerabilities and possible future breaking changes that cause incompatibility issues.

The way I like to think about code is that it cannot conceivably be considered "finished" until it has a comprehensive suite of tests, as well as extensive and clear documentation.

Furthermore, TD can be reasoned about using Martin Fowler's "Technical Debt Quadrant":
  • Deliberate + Prudent: planned shortcuts with a repayment plan.
  • Deliberate + Reckless: knowingly rushed work without a plan (dangerous).
  • Inadvertent + Prudent: design choices that later reveal better approaches (learning).
  • Inadvertent + Reckless: accidental cruft from lack of understanding or poor practices.
[Image: Technical Debt Quadrant]

Both planned and unplanned TD are dangerous, but reckless TD doubly so. The idea with planned TD is that you plan to go back and refactor the code. That means you also write it in a way that will allow said refactoring in the future. This often provides a more rigid and predictable timeline of events that your POs/PMs and other Ms will be able to stomach more easily, though I wouldn't expect them to be thrilled about it in any case. This is what deliberate, prudent TD is.

There's also inadvertent prudent TD, which is expressed as the result of learning by working on a project. Even the best developers can't always predict what the best design for a complex project will be. After working on it for some time, it becomes clear where things could have been done better, and this can be used as a powerful learning tool for the future.

When teams underestimate where the design payoff line is and convince themselves they don't have enough time to follow best practices - even though they may well know those practices - the result is deliberate reckless TD. It can also be a good indicator of lazy, careless developers, so it's always worth paying extra attention to the sources of that kind of debt.

And the last kind of TD - inadvertent reckless - is what I would imagine is the most common type of TD around. Anecdotally, it feels right to me - a lot of more junior people like to dive headfirst into coding, because it's fun to do so, without first reasoning about the project requirements and the framework within which they need to implement the solutions, or planning for edge cases and future issues they might bump into. And to go back to the AI tools I mentioned earlier - I think this type of debt will become even more prevalent in the near future.

To expand further on the costs of TD and how it affects teams, you can think of it like this: TD shows up as friction on every future task - it’s not ‘one big rewrite’ cost, it’s thousands of small slowdowns.

High-interest debt corrodes a team's velocity over time. The more of it accumulates, the slower each new feature gets implemented. A project with a hundred tangled, inter-dependent modules will inevitably take longer to work in than a properly structured project, because changing one file breaks 15 other files for no good reason at all. A bug fix takes ten times as long to trace and implement, because debugging is cluttered and confusing, and so on.

There's another very crucial point about this that is often overlooked by managers until it's too late - the talent and morale impact of high-interest TD. As someone who has been tasked with maintaining a legacy repository full of horrible code, I can confirm this from firsthand experience - coding skills tend to atrophy with time, and I've had moments where I've lost almost all passion for the craft. I've found myself having friction with my team because of the mood I'm in, knowing I need to deal with yet another obscure, hardly traceable bug in the legacy mess I've inherited from a bunch of careless contractors and clueless juniors (a group I have admittedly been part of).

So how does one measure and quantify TD?

One of the first things that needs to be charted is touch frequency across the pieces of a system - how often do you modify existing modules or parts of the system? The ones that get the most attention are your highest-value candidates for a refactor.

The next element in the equation is the amount of what Fowler calls "cruft" in those parts of the system - "deficiencies in internal quality that make it harder than it would ideally be to modify and extend the system further". When you chart the amount of cruft on top of your most frequently touched parts of the system, you get a sort of a heatmap with zones that are both frequently changed AND difficult to change. Those are the areas you want to focus your attention on with the highest priority.
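The churn-times-cruft heatmap can be sketched in a few lines. In this hypothetical example, touch counts (which could come from `git log`) and cruft scores (e.g. from a static analyzer) are assumed to arrive as plain dictionaries; the function and file names are made up for illustration:

```python
def rank_hotspots(touch_counts, cruft_scores):
    """Rank modules by touch frequency * cruft score, worst first.

    touch_counts: {module: number of changes over some recent window}
    cruft_scores: {module: internal-quality deficiency score, higher = worse}
    A module missing from either map contributes a score of 0.
    """
    modules = set(touch_counts) | set(cruft_scores)
    scored = {m: touch_counts.get(m, 0) * cruft_scores.get(m, 0) for m in modules}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# billing.py is both frequently changed AND crufty, so it tops the list,
# while the crufty-but-rarely-touched legacy_io.py ranks last.
hotspots = rank_hotspots(
    {"billing.py": 40, "reports.py": 5, "legacy_io.py": 2},
    {"billing.py": 8, "reports.py": 9, "legacy_io.py": 10},
)
```

Tracking the same score over time for a given module also gives you a crude trend line for whether its debt is growing or being paid down.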

To "sell" this to your managers, you can use a sort of interest-to-income comparison - convert the TD heatmap to a simple ratio for prioritization: interest (time lost per month) ÷ benefit (revenue or product priority). When you explain it that way, it will become clear just how important refactoring can be. At that point it's time to start planning for that refactoring create a sort of "debt register" - list the "owed items" with estimated payoff cost and expected reduction in interest. Naturally, you can't be as precise in this as you can be in finance, but rough estimates should carry your point across effectively enough, provided a reasonable person is on the other end of this conversation.

To make them feel more prepared for it, you can leverage an interest rate proxy - calculate how long a change takes in a specific area of the heatmap before and after refactoring. If you've done a good refactor, the difference should be obvious enough to make the value of refactoring clear.

There are many patterns by which you can approach clearing TD. Here are some of them:
  • Refactor as part of the feature - include small refactors in the same ticket. Low coordination cost + immediate ROI (return on investment).
  • Definition of Done includes quality - unit tests, docs, and CI checks required before PR merge. Prevents new reckless debt.
  • Debt backlog & visible register - treat TD items like standard product backlog items with owners and acceptance criteria. Prioritize against feature work.
  • Time-boxed tech-health sprints - scheduled windows where the team focuses on paying down small/medium amounts of debt. Useful when interest rate is continually rising.
  • Automate detection - static analysis, dependency scanners, test coverage thresholds, tools like SonarQube, and others - use automation to keep a baseline as much as possible.
  • Architectural runway & component ownership - allocate product/engineering ownership for key modules so cruft doesn’t accumulate anonymously - helps with traceability and maintenance.
  • Feature toggles & incremental rewrite - for large risky changes, do iterative replacement behind toggles, keeping the system runnable.
  • Refinancing choices - sometimes a rewrite or replacement (big refactor) is justifiable, but only after quantifying cost/risk and ensuring resources exist (don’t rewrite for the sake of rewriting).
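The toggle-guarded incremental rewrite pattern from the list above fits in a few lines: the old and new implementations coexist behind a single seam, and a flag decides which one serves a call, so the system stays runnable throughout the migration. The flag store and invoice functions below are purely hypothetical:

```python
# In practice this would be a real feature-flag service; a dict stands in here.
FLAGS = {"use_new_invoice_engine": False}

def render_invoice_legacy(order):
    # The old implementation being replaced.
    return f"LEGACY:{order['id']}"

def render_invoice_v2(order):
    # The incremental rewrite, developed and shipped alongside the old path.
    return f"V2:{order['id']}"

def render_invoice(order):
    # Single seam all callers go through; flipping the flag (or rolling it
    # out gradually) switches implementations without a big-bang cutover.
    if FLAGS["use_new_invoice_engine"]:
        return render_invoice_v2(order)
    return render_invoice_legacy(order)
```

Once the new path has served production traffic long enough to be trusted, the flag and the legacy function are deleted - leaving the toggle in forever would itself be a new piece of debt.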
A rough formula to use in your decision making process looks something like this:
If area is touched > N times/month and each change costs > M extra hours because of cruft, schedule refactor within next sprint cycle. (Choose N and M to fit your org.)
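That rule can be written down literally as a predicate - a tiny sketch, with N and M as placeholder thresholds to tune per organization:

```python
N_TOUCHES_PER_MONTH = 6  # "touched > N times/month" - example value, tune per org
M_EXTRA_HOURS = 2.0      # "each change costs > M extra hours" - example value

def should_schedule_refactor(touches_per_month, extra_hours_per_change,
                             n=N_TOUCHES_PER_MONTH, m=M_EXTRA_HOURS):
    # Both conditions must hold: the area is hot AND the cruft tax is real.
    return touches_per_month > n and extra_hours_per_change > m
```

A rarely touched module fails the first condition and a clean-but-busy one fails the second, so only genuine heatmap hotspots trigger a scheduled refactor.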

It's important to contextualize how modern organizations deal with this kind of issue. Some of the approaches include (but are definitely not limited to):
  • Visibility & language: use the debt metaphor carefully; clarify whether items are “prudent” or “reckless” and assign expected payoff.
  • Decision ownership: require product + tech sign-off for intentional debt (acceptance of consequences + repayment plan).
  • Reward systems: incentivize maintaining code health - e.g. make part of performance review or team KPIs, include error budget, tech-health score, or reduction in interest.
  • Capacity allocation: dedicate a predictable % of sprint capacity to maintenance (e.g. 10-25%) rather than ad-hoc firefighting. This avoids endless deferral.
  • Postmortems & learning: treat large debt incidents as systemic failures; surface root causes and process changes rather than git-blaming individuals.
  • Hiring and onboarding: preserve and document architectural rationale so new engineers understand tradeoffs.
To conclude this, I would like to summarize TD in a few words like so:
It is an inevitable trade-off in software; it becomes a problem only when it’s unmanaged or reckless. Make debt visible, quantify its cost, treat it like a product decision, and allocate predictable capacity to pay it down - then it becomes a lever, not a liability.

 
Thank you for this post @Nydex, it was a very interesting read. As I lack experience in real-world codebases, there's not much I can say about it. But I have a question: do you think it could be said that there's no such thing as "no TD", but it's more about the difference in what you called the "interest rate"? For example, it could be said that every feature contains a certain degree of TD just by existing. But the difference between an annual 0.01% and a 10% interest rate is enormous.

I was also thinking about implicit TD that may derive from technology choices. For example, a project that chooses Python is going to have a degree of TD coming just from the high speed at which Python breaks compatibility across version changes. Whereas (say) Go won't have that issue. Something similar could be said about the guarantees offered by the compiler and the type system, which could be seen as a form of test coverage: a JavaScript or C codebase needs many more tests to approximate the correctness guarantees obtained by just using OCaml or Rust, and so in that respect (not necessarily in others) the former could be said to have a higher base interest rate baked in.

Just some thoughts!
 
Interesting read. Although I’m not a software engineer, I see there are two complex systems at work:

Reinforcing loop: delivery pressure > shortcuts > more technical debt > lower velocity > higher pressure. A vicious cycle.

Balancing loop: refactoring and maintenance > less debt > higher velocity > reduced pressure > fewer shortcuts. Difficult in a competitive environment because the reinforcing and balancing loops don’t operate in isolation; they interact.

I would say that this means you must deliberately design mechanisms that protect time for refactoring, make debt visible as a cost, and reframe goals from “maximum output now” to “sustainable delivery over time.”
 
I was thinking about this Dijkstra quote, it took me a while to find it:

E.W. Dijkstra said:
The practice is pervaded by the reassuring illusion that programs are just devices like any others, the only difference admitted being that their manufacture might require a new type of craftsmen, viz. programmers. From there it is only a small step to measuring "programmer productivity" in terms of "number of lines of code produced per month". This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

 
Well, technical debt is inherently a property of codebase entropy. Can a piece of code that has absolutely no TD exist? It depends on the context and scale. If you isolate that piece of code to only the unit/module it is relevant to, perfectly cover all of its logic with tests, and describe it carefully with documentation, you could say that piece of code accrues no debt whatsoever. However, debt is also a function of time.

So with time, the language features used in that piece of code can become obsolete with new updates coming to the source language. Also, libraries used in the code can become outdated and create a vulnerability.

So to answer directly - a piece of code can have absolutely no debt at a certain point in time, but the only way to keep it that way is to maintain that piece of code as time goes on. And regarding the choice of tech stack and its relation to TD - I think here we're talking more about feature and architecture tradeoffs rather than actual TD.

Absolutely on point. TD must be transparent to everyone involved in the project. It shouldn't be something that just shows up unannounced. It needs to be treated as an inherent part of the SWE process.

E.W. Dijkstra Archive: On the cruelty of really teaching computing science (EWD 1036)
Dijkstra is a legend for a good reason. A very poignant observation that I couldn't agree with more. The more code you have, the more opportunities for problems arise. In that sense, looking at code as "lines spent" implementing a feature is an exceptionally valuable perspective that can very much redefine how you think about your code.
 
Great write-up. I really wish I could share a link to this directly with my team. Will need to distill and relay instead.
I am a self-taught engineer and still relatively junior at that. Sometimes I feel like maybe my expectations are too high because I'm new and have no other real-world experience to compare to, but the volume and interest rate of TD we have is insane.
The architecture team doesn't help. Maybe this is normal and I'm just a newb, but all they do is build/maintain IdP or other central services rather than provide any level of guidance to individual teams on how to design and implement core architecture for new microservices/features.
Our team has an assigned architect, but in the few years I've been on the team I've never seen him in any of the team meetings or discussions.
As a result we are constantly coding over poor implementations, and when I try to say "hey wait, we should take a step back and fix what they did wrong before we build on it", PM and even Eng leadership to a degree respond basically with "no, we need to ship. GA timeline is more important. We'll tackle that later." And later, there is always a new, more pressing feature.
Idk, I could go on. Like core business-critical modules being 95% composed of a nest of thousand-plus-line SQL procs. But at this point I realize I'm ranting and stuck in negativity. TD is certainly one of those aspects of engineering I was aware of, but I didn't expect it to be so pervasive. I imagine it's relative to the organization and coding ecosystem.
I'm currently working on a component that was an internal tool they want to make a customer-facing product. So I feel like there's an inordinate amount of TD, because the quality bar for internal vs customer-facing is very different. As much as it takes some of the joy out of building, I'm learning to live with it and embrace the importance of designing around it.
And this post definitely helps inform conversations I'll be having with the team in the future.
 
You have no idea how much I relate to this hahaha

I can give you a probable hint about why the arch team doesn't intervene - they either don't deeply understand the impact of accumulating TD, or they don't care enough to do anything about it. Sometimes - quite frequently, actually - it's a mixture of the two, as in "I know this is messed up, but I neither know exactly how to approach fixing it, nor am I paid enough to do so". And it's a sad thing to be a part of. Quite disheartening.

Understand this - being vocal about that kind of issue might make a good few people dislike you. It might put a metaphorical target on your back. It might make some people find you uncomfortable to work with. But those things will only happen in a company where your value as an engineer who cares about their craft would not be fully appreciated anyway.

Being the spearhead of change is what propels you forward. Not being afraid to be vocal about these things is what lands you in the midst of a team of true engineers who are passionate about what they do. I've yet to find myself in a company like that, but I have no doubt in my mind that it will come to me when the time is right.

The fact that you found value in this post already hints that you have your priorities in a good place. Junior or not, you understand why TD is an issue, and that's something which even people in very senior architecture positions often don't understand deeply enough.

Thanks for spending some of your time engaging with this thread. It's good to talk about code quality, especially in the face of all these AI tools coming up.
 
Not a software engineer but a lot of this can be thought about through other engineering fields as well. I will be thinking about this and using this to make my work better.
 