[link] Cargo Cult Debugging

post by MarkL · 2012-07-09T16:05:32.498Z · LW · GW · Legacy · 14 comments

[...] Here is the right way to address this bug:

  1. Learn more about manifests, so I know what a good one looks like.
  2. Take a look at the one we’re generating for Kiln; see if anything obvious screams out.
  3. If so, dive into the build system [blech] and have it fix up the manifest, or generate a better one, or whatever’s involved here. This part’s a second black box to me, since the Kiln Storage Service is just a py2exe executable, meaning that we might be hitting a bug in py2exe, not our build system.
  4. If not, burn a Microsoft support ticket so I can learn how to get some more debugging info out of the error message.

Here’s the first thing I actually did:

  1. Look at the executable using a dependency checker to see what DLLs it was using, then make sure they were present on Windows 2003.

This is not the behavior of a rational man. [...]

http://bitquabit.com/post/cargo-cult-debugging/

 

14 comments

Comments sorted by top scores.

comment by OrphanWilde · 2012-07-09T17:06:54.568Z · LW(p) · GW(p)

...except that, reading what he did, it makes perfect sense.

He recognized one potential cause of the issue, it was cheap to test, so he tested it.

The best approach to solving problems isn't to look into the most likely cause first, but the most cost-effective one: weigh the probability that a given cause is the issue against the time it takes to test it.

I have issues with character encodings fairly regularly in my job from systems upstream or downstream giving me incorrect information about what to expect from them/send to them. I have a toolbox of preprogrammed solutions. It's cheaper for me to test every tool in my toolbox (takes less than an hour), see what fixes it, and use that, than it is for me to figure out exactly what the system upstream or downstream is doing wrong (takes many hours). On rare occasion, none of my tools work, and I need to debug the problem properly - then I modularize the solution and add it to the toolbox.
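A minimal sketch of what such a toolbox might look like, assuming the problem is text that was decoded with the wrong codec. The repair functions, the `TOOLBOX` list, and the `looks_right` predicate are all hypothetical names for illustration; the commenter's actual tools are not described.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical repairs: each one undoes a single, known mislabeling of text.
def utf8_read_as_latin1(s: str) -> str:
    # UTF-8 bytes that were wrongly decoded as Latin-1.
    return s.encode("latin-1").decode("utf-8")

def utf8_read_as_cp1252(s: str) -> str:
    # UTF-8 bytes that were wrongly decoded as Windows-1252.
    return s.encode("cp1252").decode("utf-8")

TOOLBOX: List[Tuple[str, Callable[[str], str]]] = [
    ("utf8_read_as_latin1", utf8_read_as_latin1),
    ("utf8_read_as_cp1252", utf8_read_as_cp1252),
]

def first_fix(
    broken: str, looks_right: Callable[[str], bool]
) -> Optional[Tuple[str, str]]:
    """Try every cheap repair in order; return (tool name, repaired text)
    for the first one that passes the caller's check, or None if the
    toolbox is exhausted and proper debugging is needed."""
    for name, repair in TOOLBOX:
        try:
            candidate = repair(broken)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this repair does not even apply to the input
        if looks_right(candidate):
            return name, candidate
    return None
```

When `first_fix` returns `None`, that is the "rare occasion" case: debug it properly, then append the new repair to `TOOLBOX`.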

I could spend six hours every time, or one hour nine times out of ten and seven hours the tenth. One is the "proper" way to solve the problem, the other is trying random solutions (cargo cult debugging) and seeing what sticks.
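The arithmetic behind that comparison can be sketched as follows, using the figures from the comment (one hour for the toolbox, six hours for proper debugging, a nine-in-ten hit rate). The function names are made up for illustration.

```python
def expected_hours(toolbox_hours: float, debug_hours: float, hit_rate: float) -> float:
    # The toolbox is always tried first; proper debugging is only paid for
    # on the fraction of problems the toolbox fails to fix.
    return toolbox_hours + (1.0 - hit_rate) * debug_hours

def break_even_hit_rate(toolbox_hours: float, debug_hours: float) -> float:
    # Toolbox-first wins whenever the hit rate exceeds this ratio.
    return toolbox_hours / debug_hours

always_debug = 6.0
toolbox_first = expected_hours(1.0, 6.0, 0.9)

print(f"always debug properly: {always_debug:.1f} h per problem")
print(f"toolbox first:         {toolbox_first:.1f} h per problem on average")
print(f"break-even hit rate:   {break_even_hit_rate(1.0, 6.0):.2f}")
```

With these numbers the toolbox-first strategy averages roughly 1.6 hours per problem, and it would still come out ahead as long as the toolbox fixed more than about one problem in six.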

This article sounds good. In practice, I don't think it measures up.

Because if it takes twenty minutes to check your dependencies, and twenty hours to learn how to read a manifest, you need to be sixty times more certain that it's a problem with the manifest to justify running that test first.
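One way to see where the factor of sixty comes from, as a sketch under stated assumptions: each check is conclusive, the candidate causes are mutually exclusive, and you stop as soon as one check finds the bug. Under those assumptions the expected time is minimized by running checks in decreasing order of probability divided by cost. The probabilities below are invented for illustration; only the twenty-minute and twenty-hour costs come from the comment.

```python
from typing import List, Tuple

Check = Tuple[str, float, float]  # (name, cost in hours, probability it is the cause)

def expected_cost(checks: List[Check]) -> float:
    """Expected hours spent when running checks in the given order."""
    total, p_unsolved = 0.0, 1.0
    for _name, cost, p in checks:
        total += p_unsolved * cost  # this check only runs if nothing earlier hit
        p_unsolved -= p             # assumes mutually exclusive causes
    return total

def cheapest_order(checks: List[Check]) -> List[Check]:
    # Highest probability-per-hour first.
    return sorted(checks, key=lambda c: c[2] / c[1], reverse=True)

deps = ("check DLL dependencies", 20 / 60, 0.05)  # hypothetical probability
manifest = ("learn manifests", 20.0, 0.80)        # hypothetical probability

print([name for name, _, _ in cheapest_order([manifest, deps])])
print(expected_cost([deps, manifest]), expected_cost([manifest, deps]))
```

Even with the manifest sixteen times more likely to be the culprit in this made-up example, the dependency check still goes first, because the cost ratio is 20 hours / 20 minutes = 60. The order only flips once the manifest is more than sixty times as likely.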

Replies from: David_Gerard, buybuydandavis
comment by David_Gerard · 2012-07-09T18:45:58.718Z · LW(p) · GW(p)

Yeah. If checking under the streetlight is cheap, there's little reason not to do that anyway. (Even if the chances of payoff don't outweigh the time, it counts as due diligence to check the obvious stuff. And IME, checking the obvious couldn't-possibly-be stuff gets a win often enough to make a good habit.)

comment by buybuydandavis · 2012-07-09T17:37:59.033Z · LW(p) · GW(p)

"I have a toolbox of preprogrammed solutions. It's cheaper for me to test every tool in my toolbox (takes less than an hour), see what fixes it, and use that, than it is for me to figure out exactly what the system upstream or downstream is doing wrong (takes many hours). On rare occasion, none of my tools work, and I need to debug the problem properly - then I modularize the solution and add it to the toolbox."

A decent way to debug for the short and long term. Your toolbox ends up with tests for the most prevalent problems.

comment by [deleted] · 2012-07-09T17:10:40.015Z · LW(p) · GW(p)

Not unlike the way some pigeons trained B. F. Skinner. Skinner gave pigeons food at random times. The pigeons repeated behaviors (turning, corner-touching, etc.) they made when the food was delivered. Skinner interpreted this as the pigeons adopting a superstition that would magically make food appear. But later research suggested Skinner's interpretation of the error was, itself, an error.

Replies from: gwern
comment by gwern · 2012-07-09T20:58:37.927Z · LW(p) · GW(p)

Really? I hadn't heard that; so what's the right interpretation?

Replies from: None
comment by [deleted] · 2012-07-09T23:46:07.451Z · LW(p) · GW(p)

If I understand the later research, the criticism is that Skinner projected rather than observed repeating behaviors in the pigeons. Skinner was associating the corner-touching with the pigeons more than the pigeons were associating the corner-touching with the food.

Scientific problem solving includes falsification, and that's what I'd offer as a right interpretation / good way to minimize cargo-cult solutions. Non-scientific problem solving includes anything you like and the provisional proof is found only in the proverbial pudding.

Replies from: HumanFlesh
comment by HumanFlesh · 2012-07-16T07:23:47.223Z · LW(p) · GW(p)

Cite please.

Skinner avoided appeals to internal states and demonstrated how schedules of reinforcement affected behaviour.

Replies from: None
comment by [deleted] · 2012-07-16T14:12:53.251Z · LW(p) · GW(p)

http://tinyurl.com/6gu6p2

That goes to the Wikipedia entry on Skinner, sub-section 'Superstition of the Pigeon.'

Replies from: HumanFlesh
comment by HumanFlesh · 2012-07-20T11:59:52.805Z · LW(p) · GW(p)

The link disputes the terminology used to describe superstitious behaviour. It doesn't claim that an arbitrary schedule of reinforcement has no effect on the pigeons' behaviour.

comment by [deleted] · 2012-07-09T16:52:36.464Z · LW(p) · GW(p)

I think he's being a bit harsh on himself - if a possibility is easy to check, it can be worthwhile to check it first, even if it's not the most likely possibility.

comment by Morendil · 2012-07-09T23:51:38.344Z · LW(p) · GW(p)

I'd post the full error message instead of just complaining about its being unhelpful and philosophizing about cargo cults. The point about intellectual laziness is well taken, but the surrounding sermon strikes me as lacking much force.

The first words with which he describes the problem are "doesn't run". This, like its even commoner cousin "doesn't work", is overly generic. Exactly how a failure manifests is often the first vital clue to figuring out the causes.

Googling an error message is a good idea (default assumption: if it happened to me, it might have happened to someone else), but stopping there is a less effective one. Effective debugging involves squeezing all the information out of the failure that you can get: details of the error message, traces in system logs before, during and after the failure, and so on.

The failure might have to do with manifests, or that could be a red herring. Effective debugging avoids spending lots of time on potential red herrings.

The thing he tried, specifically, might not have been the best idea - as I recall Windows emits quite specific error messages when DLL dependencies are not satisfied, so a vague message that "likely has to do with a bad manifest" would not immediately send me in that direction.

Reproducing the failure while varying initial conditions can be useful: try a clean install of Windows 2003, then one that has had a bunch of stuff installed, and characterize the error as systematic or erratic. It also helps to understand the relevant differences between Windows 2003 and other versions - it's axiomatic that the different behaviour has to originate in some difference, but the difference might be in the environment rather than in the executable being produced.

comment by buybuydandavis · 2012-07-09T17:27:10.190Z · LW(p) · GW(p)

XKCD: The General Problem - http://xkcd.com/974/

Got this one on my desk. Doing things "the right way" is the Achilles heel of technical types. To combat this particular bit of insanity, I always try to start from "what's the minimum cost path that allows me to move forward?"

Replies from: RolfAndreassen
comment by RolfAndreassen · 2012-07-09T18:10:52.839Z · LW(p) · GW(p)

I think the comic is describing a rather different problem.

Replies from: buybuydandavis
comment by buybuydandavis · 2012-07-09T20:01:41.282Z · LW(p) · GW(p)

I think it's a rather similar problem and a similar issue, except that what the OP sees as "the right way", I see as over-engineering the solution. OrphanWilde has the better method for real life.