Taking the outside view on code quality
post by adamzerner
(Cross posted on my personal blog.)
Is it worth refactoring yyyymmdd to currentDate? I think that there are two ways to look at it.
You can zoom in and ask yourself questions about whether such a refactor will actually have a business impact. Will it improve velocity? Reduce bugs? Sure, currentDate might be slightly more descriptive, but does it really move the needle? How long does it take to figure out that yyyymmdd refers to a date? A few seconds, maybe? Won't it be pretty obvious given the context? Shouldn't your highly paid, highly intelligent engineers be smart enough to put two and two together? Did we all just waste 30 seconds of our lives talking about this?
The other way of looking at it is to zoom out. How do you feel when you work in codebases where the variable names are slightly confusing? It slows you down, right? Oftentimes you legitimately can't put two and two together. And there are times when it leads to bugs. Right?
It's interesting how two different viewpoints, zoomed in vs. zoomed out, can produce wildly different answers to essentially the same question: do the costs of investing in code quality outweigh the benefits? When you zoom in, e.g. to a single variable name, unless the code is truly awful, it usually doesn't seem worth it. The answer is usually, "it's not that bad, developers will be able to figure it out". But when you zoom out and look at the entirety of a codebase, I think the answer is usually that working in messy codebases will have legitimate, significant impacts on things like velocity and bugs, and it's worth taking the time to do things the right way.
What's going on here? Is this a paradox? Which is the right answer? To answer those questions, let's talk about something called the planning fallacy.
The Denver International Airport opened sixteen months later than scheduled, with a total cost of $4.8 billion, over $2 billion more than expected.
When estimating things, people usually zoom in. "Build an airport in Denver? Well, we just have to do A, B, C, D, E and F. Each should take about six months and $500M, so overall it should be three years and $3B." The problem with this is… well… the problem is that it just never works. You always forget something. And the individual components always end up being more complicated than they seem. Just like when you think dinner will be ready in 30 minutes.
So what can you do instead? Well, how long have similarly sized airports taken to build in the past? Ten years and $10B? Hm, if so, maybe your estimate is off. Sure, your situation is different from those other situations, but you can adjust upwards or downwards using the reference class of the other airports as a starting point. Maybe that brings you from 10 to 8 or 10 to 7, but probably not 10 to 3.
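The arithmetic behind the two estimates can be sketched in a few lines of Python. This is only an illustration using the made-up figures from the paragraphs above; the function names and the 0.75 adjustment factor are assumptions of the sketch, not a real estimation method.

```python
# Inside view: add up the pieces you can think of.
# Outside view: start from how similar projects actually went, then adjust.
# All figures are the illustrative ones from the text, not real airport data.

def inside_view(components: int, years_each: float, cost_each_bn: float) -> tuple[float, float]:
    """Sum the per-component estimates, returning (years, cost in $ billions)."""
    return components * years_each, components * cost_each_bn


def outside_view(reference_years: float, reference_cost_bn: float, adjustment: float) -> tuple[float, float]:
    """Scale the reference-class outcome by a judgment-call adjustment factor."""
    return reference_years * adjustment, reference_cost_bn * adjustment


# Six components (A through F), six months and $500M each.
inside = inside_view(components=6, years_each=0.5, cost_each_bn=0.5)

# Similar airports: ten years and $10B. Assume (hypothetically) that this
# project looks about 25% easier than the typical one.
outside = outside_view(reference_years=10, reference_cost_bn=10, adjustment=0.75)

print(f"inside view:  {inside[0]:.1f} years, ${inside[1]:.1f}B")
print(f"outside view: {outside[0]:.1f} years, ${outside[1]:.1f}B")
```

Even with a fairly generous adjustment, the outside-view number stays well above the inside-view one, which is the gap between "10 to 7" and "10 to 3".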
How does this relate to code quality? Well, I think that something similar is going on. When you zoom in and take the inside view, it looks like everything will be good. But when you zoom out and take the outside view, you realize that messy codebases usually cause significant problems. Is there a good reason to believe that your codebase is a special snowflake where messiness won't cause significant problems? Probably not.
I feel like I'm being a little bit dishonest here. I don't want to hype up the outside view too much. In practice, inside view thinking also has its virtues. And it makes sense to combine inside view thinking with outside view thinking. Doing so is more of an art than a science, and something that I am definitely still developing a feel for.
I think that certain things lend themselves more naturally to inside view thinking, and others lend themselves more naturally to outside view thinking. For example, coming up with startup ideas or scientific theories are both good fits for inside view thinking, IMHO. On the other hand, code quality feels to me like something that is a great fit for the outside view. And so, that's the viewpoint that I favor when I think about whether or not it is worthwhile to invest in.
Comments sorted by top scores.
comment by gjm ·
2021-05-07T13:07:40.444Z · LW(p) · GW(p)
I'm aware that the yyyymmdd thing is only an example, but I'm not sure it's a good example because it's not obvious to me that currentDate is necessarily better.
If this thing is a string describing the current date then there are at least two separate pieces of information you might want the name to communicate. One is that it's the current date rather than some other date. The other is that it's in yyyymmdd format rather than some other format.
Whether currentDate or yyyymmdd is more informative depends on (1) which of those two things is easier to infer from context (e.g., maybe this is a piece of software that does a lot of stuff with dates in string form and they're always yyyymmdd; or maybe the only date it ever has any reason to consider is the current date) and (2) which of them is more important in the bit of code in question (e.g., if what you're doing is working out which month it is, that operation is the same whether you're dealing with today's date or something else, but it depends a lot on the format of the input).
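To make point (2) concrete, here is a minimal Python sketch. The function names are hypothetical, and the two formats are simply the no-separator yyyymmdd one and a slash-separated variant. Working out the month depends entirely on the format of the string and not at all on whether the date happens to be today's.

```python
def month_from_yyyymmdd(yyyymmdd: str) -> int:
    """Month of a date given as eight digits with no separators, e.g. '20210507'."""
    if len(yyyymmdd) != 8 or not yyyymmdd.isdecimal():
        raise ValueError(f"expected 8 digits (yyyymmdd), got {yyyymmdd!r}")
    return int(yyyymmdd[4:6])


def month_from_slashed(yyyy_mm_dd: str) -> int:
    """Month of a date given as 'yyyy/mm/dd', e.g. '2021/05/07'."""
    _year, month, _day = yyyy_mm_dd.split("/")
    return int(month)


# The same slicing works for any date, current or not.
print(month_from_yyyymmdd("20210507"))
print(month_from_yyyymmdd("19991231"))

# A different format needs different code, which is why the format
# is worth communicating somewhere, whether in the name or elsewhere.
print(month_from_slashed("2021/05/07"))
```

Here the parameter names carry the format, so a caller passing the wrong kind of string has at least been warned by the signature.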
It might actually be better in some cases to call the variable something like currentDate_ymd8 (which only makes sense if in your code there are a few different string formats in use for some hopefully-good reason (maybe you need to interoperate with multiple other bits of date-handling software), so that giving them codenames makes sense).
Replies from: adamzerner
↑ comment by adamzerner ·
2021-05-07T16:22:56.937Z · LW(p) · GW(p)
Agreed! FWIW, I did realize that there are those issues with my example and that the post would be improved by using a better one (in addition to using multiple examples instead of just a single one). But I had trouble thinking of good examples and knew of the current one from here.
Replies from: gjm
↑ comment by gjm ·
2021-05-07T18:01:47.160Z · LW(p) · GW(p)
In that example I see that the actual format is yyyy/mm/dd rather than yyyymmdd. I definitely don't like the name yyyymmdd in that case; to me it suggests no separators. (I might advocate for switching to yyyy-mm-dd and using a name like currentDate_iso8601, though that's a bit unwieldy.)
Replies from: adamzerner
comment by gjm ·
2021-05-07T13:15:09.083Z · LW(p) · GW(p)
I'm not sure inside/outside is what's mostly going on when you're on the fence about whether making a minor name improvement is worth it. It seems to me more like the following things:
- Looking at a single decision rather than the policy it implies. (Cf. "How I lost 100 pounds using TDT".)
- Changing things has costs as well as benefits; if you rename the variable there's a (hopefully small) chance that you screw it up somehow and break things. Note that this needs to be considered even when you zoom out, even when you consider policies as well as individual decisions, and even when you take the outside view. (Would you rather work on a stable codebase or one where things keep being renamed as other people decide that some name is better? Would you rather concentrate on fixing bugs and adding features, or would you rather keep having meetings where everyone discusses ten variables they think have slightly the wrong names? Would you rather have bugs turn up every now and then because someone renamed a variable but forgot about one place where it's used, or didn't update a bit of documentation?)
Replies from: adamzerner
↑ comment by adamzerner ·
2021-05-07T22:36:31.643Z · LW(p) · GW(p)
Looking at a single decision rather than the policy it implies.
Hm. So if you look at a single decision like "it isn't worth refactoring this", and then you extrapolate out into the policy it implies ("it isn't worth refactoring for the most part"), you're still left with the question of what to do with your macro-level conclusion of "it isn't worth refactoring for the most part". Is it a good conclusion or a bad one? You could just use a reductio ad absurdum argument of "of course that's a bad conclusion", but I feel like looking at other things in your reference class is (a big part of) the way to go.
Changing things has costs as well as benefits
Yeah, great point. I agree that those are important things to consider.
comment by korin43 ·
2021-05-07T19:03:29.313Z · LW(p) · GW(p)
This is only tangentially related, but in cases like this, the strategy of improving variable names when you're working on a piece of code is significantly more valuable than searching for code to refactor and improve.
It's true that improving a random variable name in your code base is not a big win, but:
- Since you're already looking at this piece of code and presumably making a change, the cost of changing the variable name is lower than if you were changing a random part of the code.
- The fact that you're looking at this piece of code and not a different one is evidence that this is something people are more likely to look at than usual, so the benefit of improving it is higher than improving a randomly chosen variable name.
Because of these two things, the procedure "improve code you're working on" is significantly more valuable than you'd expect if you think the procedure you're following is "improve all the code".
Replies from: adamzerner, ChristianKl
↑ comment by adamzerner ·
2021-05-07T22:26:09.211Z · LW(p) · GW(p)
Oh yeah, that's something I've actually been thinking about recently. Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go". And then if you're lucky, they'll allocate some time for tech debt. But as you say, that's less efficient than "refactor as you go" because you have to load all that context back into your working memory.
All of this is a big part of what I had in mind in writing this post though. If managers/decision makers took the outside view on code quality, maybe they would encourage developers to take their time and refactor as they go rather than having pressure to finish tickets quickly.
Replies from: SatvikBeri
↑ comment by SatvikBeri ·
2021-05-07T23:01:12.953Z · LW(p) · GW(p)
Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go".
I've heard this a lot, but I've worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?
Replies from: adamzerner
↑ comment by adamzerner ·
2021-05-07T23:46:32.901Z · LW(p) · GW(p)
Interesting. My impression is that it's pretty widespread across industries and locations. It's been the case for me in all four companies I've worked at. Two were startups and two were mid-sized, and each was in a different state.
↑ comment by ChristianKl ·
2021-05-07T21:10:48.298Z · LW(p) · GW(p)
Improving code you work on is also good because you likely understand the purpose of the code better when you are working on it than when you look at a random part of your application.
comment by Darmani ·
2021-05-07T05:09:54.996Z · LW(p) · GW(p)
I think it's simpler than this: renaming it is a small upfront cost for gradual long-term benefit. Hyperbolic discounting kicks in. Carmack talks about this in his QuakeCon 2013 talk, saying "humans are bad at integrating small costs over time": https://www.youtube.com/watch?v=1PhArSujR_A
But, bigger picture, code quality is not about things like local variable naming. This is Mistake #4 of the 7 Mistakes that Cause Fragile Code: https://jameskoppelcoaching.com/wp-content/uploads/2018/05/7mistakes-2ndedition.pdf
Replies from: adamzerner
↑ comment by adamzerner ·
2021-05-07T05:31:34.792Z · LW(p) · GW(p)
I think it's simpler than this: renaming it is a small upfront cost for gradual long-term benefit.
Yes, but at some point the cost starts to outweigh the benefit. E.g., going from yyyymmdd to currentDate is worthwhile, but going from there to betterName, or from that to evenBetterName, might not be worthwhile. And so I think you do end up having to ask yourself the question instead of assuming that all code quality improvements are worthwhile. Although I also think there's wisdom in using heuristics rather than evaluating whether each and every case is worthwhile.
But, bigger picture, code quality is not about things like local variable naming. This is Mistake #4 of the 7 Mistakes that Cause Fragile Code: https://jameskoppelcoaching.com/wp-content/uploads/2018/05/7mistakes-2ndedition.pdf
I agree with the big picture point that things that are sort of siloed off aren't as important for code quality. I chose this example because I thought it would be easiest to discuss. However, although I don't think they are as important, or even frequently important, I do think that stuff like local variable names ends up being important often enough. I'm not sure what the right adjective is here, but I guess I can say I find it to be important enough that it's worth paying attention to.
Replies from: Darmani
↑ comment by Darmani ·
2021-05-07T07:47:48.776Z · LW(p) · GW(p)
It's a small upfront cost for gradual long-term benefit. Nothing in that says one necessarily outweighs the other. I don't think there's anything more to be had from this example beyond "hyperbolic discounting."
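For what it's worth, the effect can be sketched numerically. The following Python snippet assumes the standard hyperbolic discount form, value = amount / (1 + k * delay); the discount rate k and all of the time figures are invented purely for illustration.

```python
def hyperbolic_value(amount: float, delay_days: float, k: float) -> float:
    """Perceived present value of `amount` received after `delay_days`."""
    return amount / (1 + k * delay_days)


# Hypothetical numbers: the rename costs 5 minutes today and saves
# 30 seconds once a week for a year (52 readings of the code).
cost_now_min = 5.0
saving_min = 0.5
delays = [7 * week for week in range(1, 53)]

# What the rename really saves, versus what it feels like it saves
# once each future saving is hyperbolically discounted.
actual_benefit = saving_min * len(delays)
felt_benefit = sum(hyperbolic_value(saving_min, d, k=0.1) for d in delays)

print(f"cost now:       {cost_now_min:.1f} min")
print(f"actual benefit: {actual_benefit:.1f} min")
print(f"felt benefit:   {felt_benefit:.1f} min")
```

With these made-up numbers the actual benefit exceeds the upfront cost while the discounted "felt" benefit falls below it. Whether a given rename is really worth it still depends on the real numbers, which the sketch does not claim to know.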
comment by ChristianKl ·
2021-05-07T09:16:13.430Z · LW(p) · GW(p)
My own relationship to naming is more about taste. I want to be a person who doesn't write crappy code but who writes good code, and thus I don't commit code with crappy names.
comment by TruePath ·
2021-05-07T15:40:56.294Z · LW(p) · GW(p)
I feel there is something else going on here too.
Your claimed outside view asks us to compare a clean codebase with an unclean one and I absolutely agree that it's a good case for using currentDate when initially writing code.
But you motivated this by considering refactoring and I think things go off the rails there. If the only issue in your codebase was that you called currentDate yyyymmdd consistently, or even had other consistent weird names, it wouldn't be a mess; it would just have slightly weird conventions. Any coder working on it for a non-trivial length of time would start just reading yyyymmdd as current date in their head.
The codebase is only messy when you inconsistently use a bunch of different names for a concept that aren't very descriptive. But now refactoring faces exactly the same problem that working with the code does: the confusion coders experience seeing the variable and wondering what it does becomes ambiguity, which forces a time-intensive refactor.
Practically, the right move is probably better standards going forward and encouraging coders to fix variable names in any piece of code they touch. But I don't think it's really a good example of divergent intuitions once you are talking about the same things.
Replies from: adamzerner