Coding Rationally - Test Driven Development

post by DSimon · 2010-10-01T15:20:47.873Z · LW · GW · Legacy · 84 comments

Contents

  Why Computer Programming Requires Rationality
  Why Test Driven Development Is Rational
  Why Test Driven Development Isn't Perfect
  Why Test Driven Development Isn't Always Appropriate
  Test Driven $BEHAVIOR
84 comments

Computer programming can be a lot of fun, or it can be brain-rendingly frustrating. The transition between these two states often goes something like this:

Paula the Programmer: Computer, using the "Paula's_Neat_Geometry" library, draw a triangle near the top of the screen once per frame.
Cronus the Computer: Sure, no problem.
P: After drawing that triangle, draw a rectangle 50 units below it.
C: Will do, boss.
P: Sweet. Alright, after the rectangle, draw a circle 20 units to the right, then another 20 units to the left.
C: GLARBL GLARBL GLARBL I hear it's amazing when the famous purple stuff wormed in flap-jaw space with the tuning fork does a raw blink on Hari Kiri Rock! I need scissors! 61!1 System error.
P: Crap! Crap crap crap. Um, okay, let's see...

And then Paula must spend the next 45 minutes turning the circle drawing code on and off and figuring out where the wackiness originates from. When the circle code is off, she sees everything work fine. When she turns it back on, she sees everything that she thought she understood so well, that she was previously able to manipulate with the calm joyful deftness of a virtuoso playing a violin, turn into a world of mystery and ice-slick confusion. Something about that request to draw that circle at that particular time and place is exposing a difference between Paula's model of the computer and the computer's reality.

When this happens to a programmer often enough, they begin to realize that even when things seem to be working fine, these differences still probably lurk unseen beneath the surface, waiting invisibly to strike. This is an unsettling feeling. As a technique of rationality, or just because being uncomfortable is unpleasant, they seek diligently to avoid creating these cross-model inconsistencies (known colloquially as "bugs") in their own code, so as to avoid subjecting themselves to GLARBL GLARBL GLARBL moments.

Having a sincere desire to be less wrong in one's thinking is fine, but not enough. One also needs an effective process to follow, a system for making it harder to fool oneself, or at least for noticing when it's happened. Test Driven Development is one such system; not the only one, and not without its practical problems (which will be at most briefly glossed over in this introductory article), but one of my personal favorites, primarily because of the way it makes me feel confident about the quality of my work.

Why Computer Programming Requires Rationality

Computer programming is the process of getting a messy, incomplete, often self-contradictory, and overall badly organized idea out of one's head and explaining it completely and thoroughly to a quite stupid machine that has no common sense whatsoever. This is beneficial for the users of the program, but also for the programmer, because the computer does not have a programmer's human biases, such as mistaking the name of an idea for an understanding of how that idea works.

It has been said that you only truly understand how to do something when you can teach a computer how to do it for you. This doesn't mean that you have to understand the thing perfectly before you can begin programming; the process of programming itself will change and refine the idea in the programmer's mind, chipping away rotten bits and smoothing connections as the idea moves piece-by-piece from the programmer's mind into a harsh reality that doesn't care about how neat something sounds, just whether or not it works.

Through the process of explaining the problem and solution to the computer, the programmer is also explaining it to themselves, checking as they go that the explanation is correct, and adjusting it in their own mind as necessary to make it match what the computer actually does.

In a typical single-person development process, a programmer will think about the problem as a whole, mentally sketch out a framework of the tools and structures they will have to write to make the problem solvable, then begin implementing those tools in whatever order seems most intuitive. At this point, great loud alarm bells should be ringing in the heads of Less Wrong readers, indicating that this is a problematically far-mode way to go about things.

Why Test Driven Development Is Rational

The purpose of Test Driven Development is to formalize and divide into tiny pieces that part right before a programmer starts writing code: the part where they think about what they are expecting the code to do. They are then encouraged to think about each of those small pieces individually, in near-mode, using the following steps:

RED: Figure out what feature you want to add next; make it a small feature, like "draw a triangle". Write a test, a tiny test, a test that only checks for the one new feature, and that will only pass if the feature is working properly. This part can be hard if you didn't really have a clear idea of the feature in the first place, but at least you're dealing with that difficulty now and not when 20 other things in the program already depend on your slightly flawed understanding. Anyways, once you've written the test, run it and make sure it fails in the expected manner, since the feature hasn't actually been implemented yet.

GREEN: Now actually go and write the code to make the test pass. Write as little code as possible, with minimum cleverness, to make this one immediate goal happen. Don't write any code that isn't necessary for making the test pass.

REFACTOR: Huzzah, the test passes! But the code has some bad smells: it's repetitious, it's hard to read, it generally creates a feeling of creeping unease. Make it clean, remove all the duplicated parts, both in the test and the implementation.

BLISS: Run all the prior tests; they should still be green. Feel a sense of serene satisfaction that all your expectations continue to be met; be confident your mental model of the whole program continues to be a pretty close match. If you have a version control system (and you really should), commit your changes to it now with a witty yet descriptive message.
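
To make the cycle concrete, here is a minimal sketch of one pass through RED and GREEN, written with Python's built-in unittest module. Everything in it is invented for illustration (the Canvas class, the Shape record, the decision that "near the top" means y < 100); the point is only that the test states one small expectation, fails before draw_triangle exists, and passes once the simplest possible implementation is written.

```python
import unittest
from dataclasses import dataclass


# GREEN phase: the simplest implementation that could possibly pass the
# test below. In real TDD this code would not exist yet when the test is
# first run and fails (RED).
@dataclass
class Shape:
    kind: str
    x: float
    y: float


class Canvas:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self._shapes = []

    def draw_triangle(self, x, y):
        # Just enough to satisfy the test; no cleverness.
        self._shapes.append(Shape("triangle", x, y))

    def shapes(self):
        return list(self._shapes)


class TriangleTest(unittest.TestCase):
    def test_triangle_is_drawn_near_top_of_screen(self):
        canvas = Canvas(width=800, height=600)

        canvas.draw_triangle(x=400, y=50)

        shapes = canvas.shapes()
        self.assertEqual(len(shapes), 1)
        self.assertEqual(shapes[0].kind, "triangle")
        self.assertLess(shapes[0].y, 100)  # "near the top" pinned down as a number


if __name__ == "__main__":
    unittest.main()
```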

Working piece by tiny piece, your code will become as complicated as you need it to be, but no more so. You are not as likely to waste time creating vast wonderful code castles made of crystal and silver that turn out to be pointless and useless because you were thinking of the wrong abstraction. You are more likely to notice right away if you accidentally break something, because that something shouldn't be there in the first place unless it had a test to justify it, and that test will complain.

TDD is a good anti-akrasia technique for writing tests. Classically, tests are written after the program is working, but such tests are rarely very thorough, because it feels superfluous to write a test that only tells you what you (think you) already know: that the program works.

TDD is also helpful in fighting programming akrasia more generally. You receive continuous feedback that what you are doing is accomplishing something and not breaking anything, and it becomes more difficult to dawdle, since there's always an immediate short-term goal to focus on.

Finally, for me and for many other people who've tried it, TDD makes programming more fun, and more satisfying. There's nothing quite like the feeling of confidence that comes from knowing that your program does just what you think it does.

Or, well, thinking that you know.

Why Test Driven Development Isn't Perfect

Basking innocently in the feeling of the BLISS stage, you check your email and get an angry bug report: when the draw color is set to turquoise, instead of rectangles your program is drawing something that looks vaguely like a silhouette of Carl Friedrich Gauss engaged in a swordfight against a trout. What's going on here? Why wasn't this bug caught by the tests? There's a "Rectangles_Are_Drawable" test, and a "Turquoise_Things_Are_Drawable" test, and they both pass, so how can drawing turquoise rectangles fail?

Something about turquoiseness and rectangleness is lining up just right and causing things to fall apart, and this outcome is certainly not predicted by the programmer's mental model of the program. This means either that something in the program is not actually being tested at all, or (more likely) that one of the tests doesn't test everything the programmer thinks it does. TDD (among its other benefits) does reduce the chance of bugs being created, but doesn't eliminate it, because even within the short near-mode phases of Red-Green-Refactor-Bliss there's still opportunity for us to foul things up. Eliminating all bugs is a grand dream, but not likely to happen in reality as long as the program isn't dead simple (or formally verifiable, but that's a technique for another day).

However, because we can express bugs as testable assumptions, TDD applies just as well to creating bugfixes as it does to adding new features:

RED: Write a new test "Turquoise_Rectangles_Are_Drawable", which sets the color to turquoise, tells the library to draw a rectangle, and makes sure a rectangle and not some other shape was drawn. Run the test; it should fail. If it doesn't, then the bug report was incomplete, and the situation that needs to be set up before Gauss is drawn is more elaborate.

GREEN: Figure out what's making the bug happen. Fix it. Test passes.

REFACTOR: Make the fix pretty.

BLISS: The rest of the program still works as expected (to the degree that your expectations were expressed, anyways). Also, this particular bug will never come back, because if someone does accidentally reintroduce it then the test that checks this newly created expectation will complain. Commit changes with a joke about Gaussian blurring.
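
A hedged sketch of what that RED phase might look like, continuing the illustrative Python example from earlier. The module name paulas_neat_geometry and the set_color, draw_rectangle, and shapes methods are invented stand-ins for whatever the real drawing library provides; the point is that the regression test exercises the specific combination that failed, not each property in isolation.

```python
import unittest

# Hypothetical import: stands in for the drawing library from the article.
from paulas_neat_geometry import Canvas


class TurquoiseRectangleRegressionTest(unittest.TestCase):
    def test_turquoise_rectangles_are_drawable(self):
        canvas = Canvas(width=800, height=600)
        canvas.set_color("turquoise")

        canvas.draw_rectangle(x=100, y=200, width=50, height=30)

        shapes = canvas.shapes()
        self.assertEqual(len(shapes), 1)
        # Each property already has its own passing test; this one pins down
        # the combination that the bug report says goes wrong.
        self.assertEqual(shapes[0].kind, "rectangle")
        self.assertEqual(shapes[0].color, "turquoise")


if __name__ == "__main__":
    unittest.main()
```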

Why Test Driven Development Isn't Always Appropriate

A word of warning: this article is intended to be readable for people who are unfamiliar with programming, which is why simple, easily visualized examples like drawing shapes were used. Unfortunately, in real life, graphics-drawing is just the sort of thing that's hardest to write tests for.

As an extreme example, consider CAPTCHA, software that tries to detect whether a human being or a spambot is trying to get an account on your site by asking them to read back an image of squirrelly-looking text. TDD would at best be minimally useful for this; you could bring in the best OCR algorithms you have available and pass the test if they *cannot* pull text out of the image... but it would be hard to tell if that was because the program was producing properly hard-to-scan images, or because it was producing useless nonsense!

It's part of a larger category of things which are hard to automatically test because their typical operation involves working with a human, and we can't simulate humans very well at all (yet). Any program that's meant to interact with a human, and depend upon that human behaving in a sophisticated human way (or in other words, any program that has a user interface which isn't incredibly simple), will have difficulty being thoroughly tested in a non-brittle way. This problem is exacerbated because user interfaces tend to change significantly as they are subjected to usability testing and rethought, necessitating tedious changes in any tests that depend on their specifics. That doesn't mean TDD isn't applicable to such programs, just that it is more useful when working on their inner machinery than their user-facing shell.
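
To make the machinery/shell distinction concrete, here is a hedged sketch in the same vein as the earlier examples. The paginate function and its rules are invented for illustration; the general idea is pulling logic out of the user interface into a plain function, then testing that function directly while leaving the actual on-screen rendering to human judgment.

```python
import unittest


def paginate(items, page_size):
    """Pure "inner machinery": no screen, no user, trivially testable.

    Splits a list of items into pages; the user-facing shell decides how
    each page actually gets rendered.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


class PaginateTest(unittest.TestCase):
    def test_items_are_split_into_full_pages_plus_remainder(self):
        pages = paginate(["a", "b", "c", "d", "e"], page_size=2)
        self.assertEqual(pages, [["a", "b"], ["c", "d"], ["e"]])

    def test_empty_input_gives_no_pages(self):
        self.assertEqual(paginate([], page_size=3), [])


if __name__ == "__main__":
    unittest.main()
```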

(There are also ways of minimizing this problem in certain sorts of user interface scenarios, but that's beyond the scope of this article.)

Test Driven $BEHAVIOR

It is unfortunate that this technique is not more widely applicable to situations other than computer programming. As a rationalist, I would like the process of improving my beliefs to be like TDD: doing one specific near-mode thing at a time, making checks that can definitively pass or fail, and building up through this process a set of tests/experiments that thoroughly represent and drive changes to the program implementation, aka my model of the world.

The major disadvantage my beliefs have compared to a computerized test suite is that they won't hold still and be counted. I cannot do an on-demand enumeration through every single one of my beliefs and test them individually to make sure they all still hold up; I have to rely on my memories of them, which might well be shifting and splitting up and making a mess of themselves whenever I'm not looking. I can do RED and GREEN phases on particular ideas when they come to mind, but I'm unfortunately unable to do anything like a thorough and complete BLISS phase.

This article has partly been about introducing a coding technique which I think is pretty neat and of relevance to rationalists, but it's also about leading up to this question that I'd like to ask Less Wrong: how can I improve my ability to do Test Driven Thinking?

 


1. This bit of wonderfully silly text is from Konami's Metal Gear Solid 2.

 

84 comments

Comments sorted by top scores.

comment by jimrandomh · 2010-09-28T19:34:39.210Z · LW(p) · GW(p)

Test-driven development is a common subject of affective death spirals, and this post seems to be the product of one. In general, programmers ought to write more unit tests than they usually do, but not everything should be unit tested, and unit testing alone is absolutely not sufficient to ensure good code or good abstractions.

Writing tests is not free; it takes time, and while it often pays for itself, there are plenty of scenarios where it doesn't. Time is a limited resource which is also spent on thinking about abstractions, coding, refactoring, testing by hand, documenting and many other worthy activities. The amount of time required to write a unit test depends on the software environment, and the specific thing being tested. Things that involve interactions crossing out of your program's domain, like user interfaces, tend to be hard to test automatically. The amount of time required to write a test is not guaranteed to be reasonable. The benefits of a test also vary, depending on the thing being tested. Trivially simple code is not worth testing; for example, it would be wrong to test Java getters and setters except as an implicit part of a larger test.

Replies from: matt, Perplexed, DSimon, Morendil, sketerpot
comment by matt · 2010-09-29T10:42:16.864Z · LW(p) · GW(p)

All of the above is true.
Also true is that most of the specific cases where you think you should skip testing first are errors, and you should have started with the test.

(Extensive unit testing without a good mocking and stubbing framework is hard. Testing around external interfaces is also hard (but not hard).)
("most" != "all"; and jim, your beard may be longer than mine, in which case you are assumed to be an exception to the vast over-generalisation I commit above.)

Replies from: jimrandomh
comment by jimrandomh · 2010-09-29T18:25:11.354Z · LW(p) · GW(p)

This seems like it ought to have some numbers attached. The project I'm currently working on is about 30% tests and test scaffolding by line count, and I think that's about right for it. Another project I'm working on has a whole bunch of test data, driven by a small bit of code, for one particularly central and particularly hard algorithm, but no tests at all for the rest of it.

Replies from: taw
comment by taw · 2010-10-04T15:40:08.875Z · LW(p) · GW(p)

1:1 test code to app code ratio is about right usually, for highly testable languages like Ruby. The reason people don't test much outside Ruby world has less to do with testing and more with their language being so bad at it. Have you ever seen properly unit-tested C++ code or layout CSS?

Replies from: DSimon
comment by DSimon · 2010-10-06T19:43:34.917Z · LW(p) · GW(p)

CSS is hard to unit-test because nearly all the places where it can be messed up are detected by a human who says "Hey, this looks really ugly/hard to read/misorganized", a category of problems that is generally hard to write automated tests for. I don't think it's a fault in the language, but the application domain.

C++ is also hard to unit-test, but in that case I agree that it really is part of the language. I enjoy working with C++ and use it for some of my own projects, but if I'm being honest I have to admit that its near-total lack of reflectivity and numerous odd potholes and tripwires makes it much less convenient to do certain sorts of things with it, in-language automated testing being a prominent one of those.

I'm optimistic about Vala, an in-development C#/Javaish language that compiles to GLib-using C and supports native (but language-mediated) access to C libraries, so you get all the performance and platform-specific benefits of working in C/C++, but with modern language features and a noticeable lack of C++'s slowly expanding layers of cruft.

comment by Perplexed · 2010-10-01T18:42:18.809Z · LW(p) · GW(p)

Much of the benefit of systematic testing shows up much later in the maintenance and reuse phases of the software development cycle. But only if maintaining the test code is given just as high a priority and visibility as maintaining the product code.

One test-intensive discipline of software development is "Extreme Programming", which adds pair programming, frequent refactoring, and short release cycles to TDD.

comment by DSimon · 2010-09-29T16:34:05.727Z · LW(p) · GW(p)

I agree with all you've said. I didn't mean to imply by my article that TDD is a panacea, and perhaps I should put a sentence or two in describing that. Mostly the reason I didn't list all the practical exceptions was because I was targeting my article at people with little or no programming experience (hence the lack of pseudocode or particularly realistic examples).

Replies from: DSimon
comment by DSimon · 2010-09-29T16:44:10.497Z · LW(p) · GW(p)

Okay, I've added some text about this to the paragraph starting with "Having a sincere desire..."

comment by Morendil · 2010-09-30T06:19:04.856Z · LW(p) · GW(p)

Test-driven development is a common subject of affective death spirals

There's a post or six in that. :)

comment by sketerpot · 2010-10-01T19:03:58.156Z · LW(p) · GW(p)

One of the best examples of when TDD doesn't apply is when you're writing a version 0 that you expect to throw away, just to get a feel for what you should have done. When you do that, screw extensive test suites and elegance and maintainability; you just want to get something working as quickly as you can, so that you can get quickly to the part where you rewrite from scratch.

Often you can't really know what the issues are until you've already run afoul of them. Exploratory programming is a way of getting to that point, fast. You can bring out the tests when you've got a project (or piece of a project) that you're pretty sure you'll be using for a while.

Replies from: pjeby
comment by pjeby · 2010-10-01T20:03:32.487Z · LW(p) · GW(p)

Exploratory programming is a way of getting to that point, fast. You can bring out the tests when you've got a project (or piece of a project) that you're pretty sure you'll be using for a while.

There are two problems with this idea. First, I've found TDD to be extraordinarily effective at helping break down a problem that I have no idea how to solve. That is, if I don't even know what to sketch, I sometimes start with the tests. (Test-Driven Development By Example has some good examples of when/why you'd do that.)

Second: we can be really bad at guessing whether or not something will get thrown away. Rarely does version 0 really get thrown away, and so by the time you've built up a bunch of code, the odds that you'll go back and write the tests are negligible.

Replies from: thomblake
comment by thomblake · 2010-10-05T14:53:24.805Z · LW(p) · GW(p)

Rarely does version 0 really get thrown away,

That's a choice. Some of us deliberately throw away the first draft for anything that matters, no exceptions.

comment by cousin_it · 2010-10-03T12:15:54.218Z · LW(p) · GW(p)

Like most things in IT, test-driven development has a sweet spot of applicability. Good programmers must recognize when a specific technique would help and when it would hurt. See Joel's article about the five worlds of software development and ask yourself which "worlds" would benefit most from TDD. For example, what percentage of a typical game's code is amenable to TDD at all? What percentage of a typical webapp? A typical shrinkwrap Windows app? A typical device driver?

The answer is that, in general, Extreme Programming practices like TDD were created to make writing internal corporate apps easier and more predictable, but other "worlds" have limited use for them - because these practices solve only the very easy problems in these worlds, but still require you to twist your development process so you no longer know how to solve the harder ones.

At my current job (web UI development and bits and pieces of the backend for an online map) I used TDD exactly once - when I needed a mini-parser for a subset of SQL. But even there the deciding factor was my familiarity with the concept of "parser combinators", not my familiarity with TDD.

Replies from: Morendil
comment by Morendil · 2010-10-03T13:19:18.876Z · LW(p) · GW(p)

these practices solve only the very easy problems in these worlds, but still require you to twist your development process so you no longer know how to solve the harder ones

I'm not following the argument here. Explain how TDD causes you to no longer know how to solve the harder problems in some of these domains?

Also, I'm not sure I buy the "sweet spot" theory. Some techniques have a broad range of applicability rather than a sweet spot: they fail only in some corner cases. I suspect that having lots of focused unit tests is one such technique. And, given that TDD is a more realistic way to end up with lots of unit tests than test-last, I'd be tempted to conclude that TDD also has a broad range of applicability - only slightly narrower than having lots of unit tests.

Of course one big issue with this kind of debate is the almost complete lack of empirical research in these techniques. Anecdotally, I've heard reports of people using TDD to beneficial effect in all the domains mentioned.

Replies from: cousin_it
comment by cousin_it · 2010-10-03T13:33:52.116Z · LW(p) · GW(p)

I'm not following the argument here. Explain how TDD causes you to no longer know how to solve the harder problems in some of these domains?

What I'm saying is really simple. If you follow TDD strictly, you can't even begin writing Google Maps without first writing a huge test harness that simulates mouse events or something similarly stupid. And the more you allow yourself to deviate from TDD, the faster you go.

Replies from: Morendil, Vladimir_Nesov
comment by Morendil · 2010-10-03T16:37:07.270Z · LW(p) · GW(p)

We may have different understandings of "TDD", so I suggest tabooing the term. Can you address your argument above to the description that follows?

The process I know as TDD consists of specifying the actual values of a precondition and a postcondition, prior to writing the third (command) portion of a Hoare triple. The "rules" of this process are

  • I am only allowed to write a Command (C) after writing its pre- and post-conditions (P+Q)
  • I must observe a postcondition failure before writing C
  • I must write the simplest, "wrongest" possible C that satisfies P+Q
  • I must next write the simplest P+Q that shows a deficiency in my code
  • my algorithm should arise as a sequence of such Cs satisfying P+Qs
  • my application code (and ideally test code) should be well-factored at all times

If you follow this process strictly, you can't even begin writing a huge test harness. The process enforces interleaved writing of application code and test code. As I've argued elsewhere, it tends to counter confirmation bias in testing, and it produces a comprehensive suite of unit tests as a by-product. It encourages separation of concerns which is widely regarded as a cornerstone of appropriate design.

Empirically, my observations are that this process reliably results in higher developer productivity, by decreasing across the board the time between introducing a new defect and detecting that defect, which has a huge impact on the total time spent fixing defects. The exceptions to this rule are when developers are too unskilled to produce code that works by intention to start with, i.e. developers of the "tweak it until it seems to work" variety.

Replies from: cousin_it, wnoise
comment by cousin_it · 2010-10-03T21:47:54.005Z · LW(p) · GW(p)

What you're saying is too abstract, I can't understand any of it. What would be the "preconditions and postconditions" for Google Maps? "The tiles must join seamlessly at the edges"? "When the user clicks and drags, all tiles move along with the cursor"? How do you write automated tests for such things?

In a child comment wnoise says that "every bug that is found should have a unit test written". For the record, I don't agree with that either. Take this bug: "In Opera 10.2 the mousewheel delta comes in with the wrong sign, which breaks zooming." (I can't vouch for the exact version, but I do remember that Opera pulled this trick once upon a minor version update.) It's a very typical bug, I get hundreds of those; but how do you write a test for that?

You could say web development is "special" this way. Well, it isn't. Ask game developers what their typical bugs look like. (Ever try writing a 3D terrain engine test-first?) Ask a Windows developer fighting with version hell. Honestly I'm at a loss for words. What kind of apps have you seen developed with TDD start to finish? Anything interesting?

Maybe related: Ron Jeffries (well-known Extreme Programming guru) tries to write a Sudoku solver using TDD which results in a slow motion trainwreck: 1, 2, 3, 4, 5. Compare with Peter Norvig's attempt, become enlightened.

Replies from: Morendil, Richard_Kennaway
comment by Morendil · 2010-10-04T06:26:14.268Z · LW(p) · GW(p)

What would be the "preconditions and postconditions" for Google Maps? "The tiles must join seamlessly at the edges"?

OK, suppose you are writing Google Maps, from scratch. Is the above the first thing you're going to worry about?

No, presumably you're going to apply the usual strategy to tackle a big hairy problem: break it down into more manageable chunks, tackle each chunk in turn, recursing if you have to. Maps has subareas, like a) vector drawing of maps, b) zoomable display of satellite pictures, c) mapping informally specified addresses to GPS coordinates.

So, suppose you decide to start with a), vector draw. Now you feel ready to write some code, maybe something that takes two X,Y pairs and interprets them as a road segment, drawing the road segment to a canvas.

The "precondition" is just that, the fact of having two X,Y pairs that are spatially separated. And the "postcondition" is that the canvas should receive drawing commands to display something in the right line style for a road segment, against a background of the right color, at the right scale.

Well that's perfectly testable, and in fact testable without a sophisticated testing harness.

My point is that if you feel you know enough about a given problem to write a couple lines of code that start solving it, then you have narrowed it down enough to also write a unit test. And so the claim that "TDD requires you to first write a huge test harness" is baseless.

Take this bug: "In Opera 10.2...

The way you tell this, it's a defect originating with the Opera developers, not on your part. You may still want to document your special-casing this version of Opera with a workaround, and a unit test is a good way to document that, but the point of your doing TDD is to help your code be bug-free. Other people's bugs are out of scope for TDD as a process.

More generally, "software development as a whole is a big hairy mess" is also not a very good reason to doubt the principle of TDD. Yes we're starting from a mess, but that's not a valid reason to give up on cleaning the mess.

What kind of apps have you seen developed with TDD start to finish? Anything interesting?

Things like a content management system or a trading backend, to start with my own experience. Or, that I've heard of, a tiny little IDE called Eclipse? Not sure if that warrants "interesting". ;)

Maybe related

Dude, "Ron Jeffries once had a bad hair day" is a spectacularly lame argument from which to try and derive general conclusions about any software development technique. I expect better of you.

Replies from: cousin_it
comment by cousin_it · 2010-10-04T10:27:25.198Z · LW(p) · GW(p)

OK, suppose you are writing Google Maps, from scratch. Is the above the first thing you're going to worry about?

Actually yes - you usually start with drawing a tiled raster map, it's way easier than a vector one. A raster map is just a bunch of IMG tags side by side. But let's go with your scenario of vector drawing, it will serve just fine and maybe I'll learn something:

And the "postcondition" is that the canvas should receive drawing commands to display something in the right line style for a road segment, against a background of the right color, at the right scale.

So the test says "calling this code must result in this exact sequence of calls to the underlying API"? Hah. I have a method like this in one of my maps, but as far as I can remember, every time I tweaked it (e.g. to optimize the use of the different canvas abstractions in different browsers - SVG, VML, Canvas) or fixed bugs in it (like MSIE drawing the line off by one pixel when image dimensions have certain values) - I always ended up changing the sequence of API calls, so I'd need to edit the test every time which kinda defeats the purpose. Basically, this kind of postcondition is lousy. If I could express a postcondition in terms of what actually happens on the screen, that would be helpful, but I can't. What does TDD give me here, apart from wasted effort?

Or, that I've heard of, a tiny little IDE called Eclipse?

Eclipse was developed test-first? I never heard of that and that would be very interesting. Do you have any references?

Replies from: Morendil, Morendil, Morendil, wnoise
comment by Morendil · 2010-10-04T11:25:00.487Z · LW(p) · GW(p)

Gamma described the Eclipse "customized Agile" process in a 2005 keynote speech (pdf). He doesn't explicitly call it test-first, but he emphasizes both the huge number of unit tests and their being written closely interleaved with the production code.

comment by Morendil · 2010-10-04T11:17:18.246Z · LW(p) · GW(p)

Eclipse was developed test-first? I never heard of that and that would be very interesting. Do you have any references?

Look for write-ups of Erich Gamma's work; he's the coauthor with Kent Beck of the original JUnit and one of three surviving members of the Gang of Four. Excerpt from this interview:

Erich Gamma was the original lead and visionary force behind Eclipse’s Java development environment (JDT). He still sits on the Project Management Committee for the Eclipse project. If you’ve never browsed the Eclipse Platform source code, you’re in for a real treat. Design patterns permeate the code, lending an elegant power to concepts like plug-ins and adapters. All of this is backed up by tens of thousands of unit tests. It’s a superb example of state of the art object oriented design, and Erich played a big part in its creation.

Even with this kind of evidence I prefer to add a caveat here, I'm not entirely sure it'd be fair to say that Eclipse was written in TDD "start to finish". It had a history spanning several previous incarnations before becoming what it is today, and I wasn't involved closely enough to know how much of it was written in TDD. Large (application-sized) chunks of it apparently were.

comment by Morendil · 2010-10-04T11:05:32.045Z · LW(p) · GW(p)

So the test says "calling this code must result in this exact sequence of API calls"?

That's one way. It's also possible to draw to an offscreen canvas and pixel-diff expected and actual images. Or if you're targeting SVG you can compare the output XML to an expected value. Which method of expressing the postcondition you use is largely irrelevant.

The salient point is that you're eventually going to end up with a bunch of isolated tests each of which address a single concern, whereas your main vector drawing code, is, of necessity, a cohesive assemblage of sub-computations which is expected to handle a bunch of different cases.

You only need to change the test if one such behavior itself changes in a substantial way: that's more or less the same kind of thing you deal with if you document your code. (Test cases can make for good documentation, so some people value tests as a substitute for documentation which has the added bonus of detecting defects.)

Without tests, what tends to happen is that a change or a tweak to fix an issue affecting one of these cases may very well have side-effects that break one or more of the other cases. This happens often enough that many coding shops have a formal or informal rule of "if the code works, don't touch it" (aka "code freeze").

If your test suite detects one such side-effect, that would otherwise have gone undetected, the corresponding test will have more than paid for its upkeep. The cost to fix a defect you have just introduced is typically a few minutes; the cost to fix the same defects a few days, weeks or months later can be orders of magnitude bigger, rising fast with the magnitude of the delay.

Those are benefits of having comprehensive unit tests; the (claimed) added benefit of TDD is that it tends to ensure the unit tests you get are the right ones.

Again, this whole domain could and should be studied empirically, not treated as a matter of individual developers' sovereign opinions. This thread serves as good evidence that empirical study requires first dispelling some misconceptions about the claims being investigated, such as the opinion you had going in that TDD requires first writing a huge test harness.

Replies from: cousin_it
comment by cousin_it · 2010-10-04T11:44:31.011Z · LW(p) · GW(p)

Wha? I'm not even sure if you read my comment before replying! To restate: the only reason you ever modify the method of drawing a line segment is to change the sequence of emitted API calls (or output XML, or something). Therefore a unit test for that method that nails down the sequence is useless. Or is it me who's missing your point?

The cost to fix a defect you have just introduced is typically a few minutes; the cost to fix the same defects a few days, weeks or months later can be orders of magnitude bigger, rising fast with the magnitude of the delay.

For the record, I don't buy that either. I can fix a bug in our webapp in a minute after it's found, and have done that many times. Why do you believe the cost rises, anyway? Maybe you're living in a different "world" after all? :-)

Thanks for the links about Eclipse, they don't seem to prove your original point but they're still interesting.

Replies from: thomblake, Morendil, Morendil
comment by thomblake · 2010-10-05T15:10:56.417Z · LW(p) · GW(p)

I can fix a bug in our webapp in a minute after it's found

It's still relevant that "a minute after it's found" might be months after it's introduced, possibly after thousands of customers have silently turned away from your software.

comment by Morendil · 2012-09-04T09:07:40.410Z · LW(p) · GW(p)

For the record, I don't buy that either. I can fix a bug in our webapp in a minute after it's found, and have done that many times. Why do you believe the cost rises, anyway?

For the record, cousin_it was entirely right to be wary of the rising-cost-of-defects claim. I believed it was well supported by evidence, but I've since changed my mind.

comment by Morendil · 2010-10-04T12:20:04.006Z · LW(p) · GW(p)

Or is it me who's missing your point?

You want behaviour to be nailed down. If you have to go back and change the test when you change the behaviour, that's a good sign: your tests are pinning down what matters.

What you don't want is to change the test for a behaviour X when you are making code changes to an unrelated behaviour Y, or when you are making implementation changes which leave behaviour unaltered.

If you're special-casing IE9 so that your roads should render as one pixel thicker under some circumstances, say, then your original test will remain unchanged: its job is to ensure that for non-IE9 browsers you still render roads the same.

Why do you believe the cost rises, anyway?

It's one of the few widely-agreed-on facts in software development. See Gilb, McConnell, Capers Jones.

The mechanisms aren't too hard to see: when you've just coded up a subtle defect, the context of your thinking (the set of assumptions you were making) is still in a local cache, you can easily access it again, see where you went off the rails.

When you find a defect later, it's usually "WTF was I thinking here", and you must spend time reconstructing that context. Plus, by that time, it's often the case that further code changes have been piled on top of the original defect.

they don't seem to prove your original point

I wasn't the one with a point originally. You made some assertions in a comment to the OP, and I asked you for a clarification of your argument. You turned out to have, not an argument, but some misconceptions.

I'm happy to have cleared those up, and I'm now tapping out. Given the potential of this topic to induce affective death spirals, it's best to let others step onto the mat now, if they still think this worth arguing.

Replies from: cousin_it
comment by cousin_it · 2010-10-04T12:50:35.547Z · LW(p) · GW(p)

Well, this frustrates me, but I know the frustration is caused by a bug in my brain. I respect your decision to tap out. Thanks! Guess I'll reread the discussion tomorrow and PM you if unresolved questions remain.

comment by wnoise · 2010-10-04T16:00:03.284Z · LW(p) · GW(p)

If I could express a postcondition in terms of what actually happens on the screen, that would be helpful, but I can't.

Why not? There are automated tools to take snapshots of the screen, or window contents.

Replies from: cousin_it
comment by cousin_it · 2010-10-04T16:54:57.049Z · LW(p) · GW(p)

No. Just no. I'd guess that different minor versions of Firefox can give different screenshots of the same antialiased line. And that's not counting all these other browsers.

comment by Richard_Kennaway · 2010-10-04T13:23:07.024Z · LW(p) · GW(p)

Ron Jeffries (well-known Extreme Programming guru) tries to write a Sudoku solver using TDD which results in a slow motion trainwreck: 1, 2, 3, 4, 5. Compare with Peter Norvig's attempt, become enlightened.

The main difference I see between those is that Norvig knew how to solve Sudoku problems before he started writing a program, while Jeffries didn't, and started writing code without any clear idea of what it was supposed to do. In fact, he carries on in that mode throughout the entire sorry story. No amount of doing everything else right is going to overcome that basic error. I also think Jeffries writes at unnecessarily great length, both in English and in code.

Replies from: cousin_it, randallsquared
comment by cousin_it · 2010-10-04T13:51:30.221Z · LW(p) · GW(p)

The problem is, Extreme Programming is promoted as the approach to use when you don't know what your final result will be like. "Embrace Change!" As far as I understand, Jeffries was not being stupid in that series of posts. He could have looked up the right algorithms at any time, like you or me. He was just trying to make an honest showcase for his own methodology which says you're supposed to be comfortable not knowing exactly where you're going. It was an experiment worth trying, and if it worked it would've helped convince me that TDD is widely useful. Like that famous presentation where Ocaml's type system catches an infinite loop.

comment by randallsquared · 2010-10-06T04:09:00.726Z · LW(p) · GW(p)

The main difference I see between those is that Norvig knew how to solve Sudoku problems before he started writing a program, while Jeffries didn't

When you already know exactly how to do something, you've already written the program. After that, you're transliterating the program. The real difficulty in any coding is figuring out how to solve the problem. In some cases, it's appropriate to start writing code as a part of the process of learning how to solve the problem, and in those cases, writing tests first is not going to be especially useful, since you're not sure exactly what the output should be, but it certainly is going to slow down the programming quite a lot.

So, I'll agree that Jeffries should have understood the problem space before writing many tests, but not that understanding the problem space is entirely a pre-coding activity.

comment by wnoise · 2010-10-03T20:09:17.477Z · LW(p) · GW(p)

Thank you for the quite clear specification of what you mean by TDD.

Personally, I love unit tests, and think having lots of them is wonderful. But this methodology is an excessive use of them. It's quite common to both write overly complex code and to write overly general code when that generalization will never be needed. I understand why this method pushes against that, and approve. Nevertheless, dogmatically writing the "wrongest possible" code is generally a huge waste of time. Going through this sort of process can help one learn to see what's vital and not, but once the lesson has been internalized, following the practice is sub-optimal.

Every bug that is found should have a unit test written. There are common blind-spots that programmers make[*], and this catches any repeats. Often these are ones that even following your version of TDD (or any) won't catch. You still need to think of the bad cases to catch them, either via tests, or just writing correct code in the first place.

[*]: Boundary conditions are a common case, where code works just fine for everywhere in the middle of an expected range, but fails at the ends.

comment by Vladimir_Nesov · 2010-10-03T13:36:21.095Z · LW(p) · GW(p)

I agree with this argument, but note that you could write some tests as instructions for human testers. If it's the style of development that's the more important output of TDD for you and not regression tests, you could run those human tests yourself, and discard them afterwards. The discipline is still useful.

Replies from: Morendil
comment by Morendil · 2010-10-03T16:18:32.947Z · LW(p) · GW(p)

You're both losing me, I'm afraid. I wasn't parsing cousin_it's argument as saying "it's the style of development that's the more important output of TDD". How do you get that?

I'll agree that a style of development can be useful in and of itself. The style of development that tends to be useful is one where complex behaviour is composed out of smaller and simpler elements of behaviour, in such a way that it is easy to ascertain the correctness not only of the components but also of the whole.

So we have one question here amenable to empirical inquiry: does the approach known as TDD in fact lead to the style of development outlined above?

But it also seems to me that a suite of tests is a useful thing to have in many circumstances. The empirical question here is, does having a comprehensive test suite in fact yield a practical benefit, if so on which dimensions, and at what cost?

If the tests are of the kind that can be automated, then I see little benefit in having human testers manually follow the same instructions. The outcome is the same - possibly detecting a new defect - but the cost is a million times the cost of computer execution. The main cost of automation is the cost of thinking about the tests, not the cost of typing them in so a computer can run them. So ex hypothesi that cost is incurred anyway.

comment by pjeby · 2010-10-01T17:20:56.914Z · LW(p) · GW(p)

Just a side note, but essentially the entire field of mind hacking as I teach it is based on TDD. Specifically, the testing of autonomous anticipations: i.e., what your "near" brain expects will happen in a concrete situation, and especially the emotional tags it assigns to that expectation.

If you know how to do that type of testing, you can test any self-help technique designed for changing beliefs, and determine in a matter of minutes whether it's actually any good (assuming you're able to execute the technique correctly, of course).

Most of the techniques I teach people for belief change, therefore, are other people's techniques, but ones whose performance I've been able to verify, via what's effectively TDD. I've also used the approach to refine others' techniques, and develop a few of my own.

It is not a perfect solution: this type of TDD is definitely just "unit testing" of individual beliefs, whereas a solution to most real-world problems requires that one also does "acceptance testing" -- i.e. does your actual behavior change along with the belief, and does it stay changed?

Continuing the programming analogy, to fix a bug in a program, you may have to make changes that span multiple code units, and therefore multiple tests. That is, a behavior may be affected by multiple beliefs, at the same or different stages of the behavior. (Fixing a belief that keeps you from even considering public speaking, for example, may not fix a second belief that will cause you to "Bruce" yourself out of the actual speaking engagement.)

However, being able to test at all in a fast feedback cycle is a huge advantage, and I've learned some tricks and heuristics to help find the right "units" ahead of time.

Replies from: wedrifid, DSimon
comment by wedrifid · 2010-10-07T06:18:48.736Z · LW(p) · GW(p)

It is not a perfect solution: this type of TDD is definitely just "unit testing" of individual beliefs, whereas a solution to most real-world problems requires that one also does "acceptance testing" -- i.e. does your actual behavior change along with the belief, and does it stay changed?

I'm glad to see this distinction made. It's great to be able to test in almost real time that a technique is having a desirable effect, but some things you just can't tell until down the track. The obvious example is, as you allude to in a grandchild, whether 2 years down the track you actually have the book written. It's great if my program is getting all green lights but if it turns out that despite that it just isn't working then I need to be able to go back, reassess, understand the problem and hopefully create new unit tests that are more rigorous.

comment by DSimon · 2010-10-06T19:29:58.615Z · LW(p) · GW(p)

This sounds interesting; do you have a link?

Replies from: pjeby
comment by pjeby · 2010-10-06T20:31:21.462Z · LW(p) · GW(p)

This sounds interesting; do you have a link?

Er, a link to what?

Replies from: DSimon
comment by DSimon · 2010-10-06T22:59:42.590Z · LW(p) · GW(p)

To more information about the TDD-based mind hacking that you mentioned.

Replies from: pjeby
comment by pjeby · 2010-10-07T04:58:36.969Z · LW(p) · GW(p)

To more information about the TDD-based mind hacking that you mentioned.

Well, at the moment that's mostly in not-publicly-available stuff, though there are bits and bobs in my blog, and a bit in chapter 4 of the first draft of Thinking Things Done. Most of the meat is in audios and videos of recorded workshops for the Mind Hackers' Guild, though.

Short version: notice what happens in your body in response to a thought. Notice what flashes through your mind right before the response in your body. Observe whether these reactions are both spontaneous (i.e., "you" aren't doing them) and repeatable (i.e., you get roughly the same response to the thought each time).

This is now your "test". To borrow TDD language, you now have a test that's "on red". You can now use any method you like to try to change the output of the test, except for consciously overriding the response. (Since you would then have to do it every time.)

You can make a test for any behavior or circumstance, by thinking about a specific instance of that behavior or circumstance. The key is it has to be near-mode (experiential) thinking, not far-mode (abstraction, categories, not sensory-specific).

Our behaviors are generally driven by our emotional (and physical) anticipations, so if you can change (via memory reconsolidation) the anticipations in question, then the behaviors will generally also change.

Keys to change:

  • You must notice what memory/anticipation is flickering past, in order to alter it (see e.g. Eliezer's getting rid of his anticipation of a serial killer in the bathroom) -- this is a general principle known as memory reconsolidation

  • To observe the fleeting memory or anticipation, it is generally helpful to intensify it, or connect it to its ultimate consequence, e.g. by asking, "What's bad about that?" (Sometimes you will get a flicker of something that seems like a neutral or not-too-bad anticipation, yet it is accompanied by a negative physical response... Often you can pull up the actual anticipation by focusing on the feeling itself.)

  • Sometimes the anticipation is about what happens if you fail. Sometimes it's about if you succeed. Sometimes, it's about something good you'll get by failing, and lose if you succeed. Sometimes, it's the kind of person you think you'll be, or that others will think you are. And not a one of these anticipations will make a lick of sense to your conscious mind, which will insist that these ideas are not yours at all. And that is the best possible evidence that what you're doing is actually working. If you are not surprised by what you discover, you are probably doing something wrong.

  • Corollary: real change is also surprising. When you successfully mindhack, two things usually happen: first, you will surprise yourself by acting differently in some area where you didn't expect it, and then realize that there's a connection you didn't see to the thing you changed. Second, you will begin to forget that you ever did things differently, and what it was like to think the way you did before.

  • There are only a handful of techniques that work, but they are dressed in many superficially different outward forms. ALL of them appear to work via some kind of memory consolidation; they mostly differ in the details of how they get to the memory, and how they get you to either disrupt the trace or lay down a new one. The epistemology of all these techniques is 100% horseshit: nobody really knows how or why they work (even my reconsolidation hypothesis is just that: a hypothesis), so don't let a stupid theory stop you from actually doing something.

  • To get good at this, you need one physical technique and several mental ones. Physical techniques like EFT or the Matherne Speed Trace are good for dealing with strong reactions, and will fix a lot of things. Pick one, and learn it cold by testing. If you can take, say, a physical reaction you get to seeing, smelling, or just imagining a food that disgusts you as your test, and you can make it go away using a physical technique, then you know you learned the technique. If you use it on something else, and it doesn't work, this is now evidence that the technique won't work for what you're trying to fix, rather than evidence that the technique is bad or you don't know how to do it.

  • When you have tried the physical technique on everything you want to change in your life that you can successfully get a strong spontaneous reaction to, you're ready to move up to mental techniques. The physical ones are not so good at changing things like misplaced moral judgments, warped values, crazy definitions, and other kinds of broken brain patterns that you currently don't know you have. (Side note: if you're not a mind hacker, chances are good you have no idea just how crazy your own brain is by default. Just because you have the logical side of things wrapped up tight, doesn't mean your emotional brain isn't a fucking basket case!)

  • For a first mental technique -- or at least, of the ones you can learn inexpensively -- I suggest either the Work of Byron Katie, or the Lefkoe belief-change process. To successfully learn either, though, you need to be rigorous about testing. Identify a response that you have (that your physical technique failed on), and attempt to change it using the technique you're learning. If you can't get that one to change, try another, until you learn the technique itself. Katie and Lefkoe are good examples of intermediate-level mental techniques; I have a few easier-to-learn ones, but the learning materials are more expensive compared to K&L. ;-)

That's pretty much the gamut of what you can do with free or nearly-free materials out there. (Katie and Lefkoe may both be in your local library, while EFT and Matherne are just a Google away.) This is (sort of) the path I followed myself, except that I'm cutting out lots of dead ends and leaving out all the good bits I personally discovered, tweaked, or adapted, but which require too many details to describe here.

(There's also a good reason that the mental-level techniques are generally taught in workshop form, or as workshop transcripts; many things are difficult to grok except by experience, or at least the vicarious "experience" of seeing how someone else's bug or blind spot resembles your own. Even Katie and Lefkoe put a lot of session transcripts in their books.)

A final note: the information in this comment, even if combined with the sources I've mentioned, is not sufficient to prevent you from going down a thousand blind alleys. The only thing that will get you out of them is an absolute desire to succeed, combined with an implacable determination to rigorously test, and to assume that anything that doesn't result in a clear "red" or clear "green" is just confusion and time wasting.

Always define the test first, and get a repeatable red before you try to change. Always recheck the test after you "do" a technique. Assume that if you can't get any test to change using a given technique, it probably means you haven't learned to do the technique correctly. Once you have successfully made some tests change using a given technique, you can then assume that it not working on something probably means you're either applying it to the wrong part of the problem, or it's not applicable to that problem... but only experience will let you discover which is the case.

The changes you make will sometimes "stick" with one application of a technique, just often enough to make it really frustrating when you have a "bug" that keeps coming back. Generally, though, you'll find that your original "test" is still passing, but there is now a problem at some other part of your thought process. Either you missed it the first time, or your brain is generating new bugs in order to address a higher-level need.

If something keeps coming back, you generally have some emotional need for your problem, that has to be addressed with a different level of analytic technique; plain old K&L will probably not cut it. I spent almost two years in hell after deciding to write Thinking Things Done because I had mastered everything I just told you above, except for this final level of technique.

comment by Jonathan_Graehl · 2010-09-28T18:38:31.894Z · LW(p) · GW(p)

I program without TDD.

I only sometimes write automatically evaluated tests.

One thing that appeals to me about the idea of TDD is that perhaps it seems like a puzzle or game to pass some (easy to create) test. That is, the feedback is free once established.

Since I'm good at finding and fixing bugs without an always-run suite of tests, I don't bother creating them at all for most small programs (some of which are easy to implement correctly, but annoying to automatically test comprehensively). At some scale or difficulty, my abilities will fail and I will definitely profit by building some routinely-run tests.

A halfway territory between automated tests and blind hope is to generate routinely produced summaries that can be evaluated easily by a human for correct-feeling. E.g. for graphics, you show the picture. This can eventually morph into a regression test (you freeze the report format and save the output produced from the inputs). If it's just a small step to add some automatic consistency checks as the report is produced, then I of course do so.

Replies from: matt, DSimon
comment by matt · 2010-09-29T10:46:50.122Z · LW(p) · GW(p)

Jonathan - one of the reasons to write unit tests is so that programmers other than you can see what your code is designed to do (for occasional values of "programmers other than you" that include you when you're not at your best). Inheriting someone else's bad tested code is much less painful than inheriting someone else's bad untested code.

(And I humbly observe that a larger proportion of the untested (or sparsely tested) code I've inherited has been bad than the tested code that I've inherited.)

Replies from: Jonathan_Graehl
comment by Jonathan_Graehl · 2010-09-29T20:29:08.588Z · LW(p) · GW(p)

I agree that automated tests are more important in multi-person projects. I do value whatever tests exist in projects I've worked on (including my own solo stuff); I just don't always choose to spend the effort creating them.

comment by DSimon · 2010-09-29T17:17:36.181Z · LW(p) · GW(p)

A halfway territory between automated tests and blind hope is to generate routinely produced summaries that can be evaluated easily by a human for correct-feeling.

Agreed. This is especially important for very human-oriented software (graphical programs) and pieces of software (user interfaces), because those are hard to test without an actual human involved.

comment by PhilGoetz · 2010-10-01T22:46:21.882Z · LW(p) · GW(p)

PLEASE insert a front-page break near the beginning of the post.

Replies from: DSimon
comment by DSimon · 2010-10-05T14:03:01.316Z · LW(p) · GW(p)

Ah, I was wondering why other stories seemed to have that nice intro/body separation. Thanks, I'll do that right away. :-)

Replies from: DSimon
comment by DSimon · 2010-10-05T14:04:24.514Z · LW(p) · GW(p)

And... it seems someone already has. Thanks to whoever did that.

comment by cousin_it · 2010-10-01T15:32:40.083Z · LW(p) · GW(p)

Sorry, I have a meta question. (Will comment on the content later.) When you moved this article from the discussion area to LW proper, did its upvotes get converted at a 1-to-10 rate? :-)

Replies from: DSimon
comment by DSimon · 2010-10-01T16:06:30.284Z · LW(p) · GW(p)

No worries, I was curious about that too. What happened is: All the existing votes stayed at 1, but votes after that time were valued at 10.

Replies from: Perplexed, Morendil
comment by Perplexed · 2010-10-01T18:28:46.758Z · LW(p) · GW(p)

Another (less important) item of curiosity: Are people who upvoted before the move (for 1 point) permitted to upvote again (for 10 points)? Can they back out their 1-point upvotes, generating a -9 point no-op, and then continue to downvote for an additional -10?

PS: I used to work in software test. :)

Replies from: pjeby
comment by pjeby · 2010-10-01T20:06:33.692Z · LW(p) · GW(p)

Another (less important) item of curiosity: Are people who upvoted before the move (for 1 point) permitted to upvote again (for 10 points)? Can they back out their 1-point upvotes, generating a -9 point no-op, and then continue to downvote for an additional -10?

Funny, I was wondering the exact same thing. Likewise, if you publish to main LW first and move to discussion, what effect does that have on the point totals?

comment by Morendil · 2010-10-02T10:15:21.727Z · LW(p) · GW(p)

Another meta question: how do you move an article from discussion to LW?

Replies from: whpearson
comment by whpearson · 2010-10-02T10:47:56.303Z · LW(p) · GW(p)

Edit the post and choose where it is posted to.

comment by DSimon · 2010-09-30T16:55:38.014Z · LW(p) · GW(p)

Okay, I've edited my article again, reducing some of the (retrospectively obvious) over-enthusiasm, adding some more details about anti-akrasia benefits and the potential downfalls of TDD, and also sprinkling in a few more links. I appreciate the comments I've got so far, and further suggestions are very welcome.

Should I submit this article to the LW post queue proper?

comment by JGWeissman · 2010-09-28T21:12:39.851Z · LW(p) · GW(p)

I find it interesting that you used graphics for your examples, which is among the areas that are hardest to automate tests for (there is a lot of complexity behind "verify a rectangle was drawn on the screen" that we don't notice, because our visual cortexes take care of it for us), but you don't address any of the problems with testing in this domain.

This leads me to agree with jimrandomh's affective death spiral theory.

EDIT: Moving the article from discussion has changed its URL and the comments' URL. Corrected link (Original left to illustrate problem). It would be better if articles and comments had URL's that do not indicate their section, and let the section just affect index pages and feeds.

Replies from: DSimon
comment by DSimon · 2010-09-29T16:41:08.556Z · LW(p) · GW(p)

I agree that it's not a particularly realistic example. Do you have any alternate suggestions? I'm trying to target people who are not so familiar with programming, and in doing so I valued simplicity over realism. However, assuming some of the readers go on to try some actual TDD, they might try graphics programming first and discover that it's unpleasantly frustrating to test.

Replies from: JGWeissman
comment by JGWeissman · 2010-09-29T17:43:44.946Z · LW(p) · GW(p)

Good example:

You want a program that can invert matrices. You write a test that calls the program, passing it a matrix. The test multiplies the original matrix by the result from the invert program, and verifies that the product is the identity matrix.

This is a good case to use TDD because the test is simpler than the program you are testing.

Also, an important part is to test edge/corner cases. In the matrix example, there should be a test that passes the invert program a singular matrix and validates that it returns an appropriate error.
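
For illustration, a sketch of those two tests in Python with numpy; invert is the hypothetical function under test and is deliberately left unwritten, since under TDD we want to see both tests fail first:

    import numpy as np

    def invert(m):
        raise NotImplementedError  # not written yet: the tests come first

    def test_product_with_inverse_is_identity():
        m = np.array([[4.0, 7.0], [2.0, 6.0]])
        assert np.allclose(m @ invert(m), np.eye(2))

    def test_singular_matrix_gives_clear_error():
        singular = np.array([[1.0, 2.0], [2.0, 4.0]])  # rows are linearly dependent
        try:
            invert(singular)
        except ValueError:
            pass  # the edge case: a singular matrix should be rejected cleanly
        else:
            assert False, "expected invert() to reject a singular matrix"

The second test in particular pins down the error-handling behavior before any inversion code exists.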

comment by Morendil · 2010-09-28T18:35:20.414Z · LW(p) · GW(p)

TDD is a tactic against confirmation bias - it feels like that should go in there somewhere.

Replies from: Vladimir_Nesov, DSimon
comment by Vladimir_Nesov · 2010-09-28T18:39:28.435Z · LW(p) · GW(p)

Also good for getting regression tests done, i.e. tactic against akrasia.

Replies from: matt
comment by matt · 2010-09-29T10:53:37.831Z · LW(p) · GW(p)

TDD is generally a good anti-akrasia hack - you spend more of your time in near mode doing one-more-thing-after-another (with a little squirt of pleasure on each GREEN), and less in far mode thinking about architecture (and what you're going to have for lunch… and how messy the kitchen is… and…).
(And then, as if by an invisible hand, your architecture ends up being good anyway.)

comment by DSimon · 2010-09-30T17:01:38.235Z · LW(p) · GW(p)

I'm not quite sure where confirmation bias comes into it. Can you go into more detail about this?

Replies from: Morendil
comment by Morendil · 2010-09-30T22:06:54.646Z · LW(p) · GW(p)

Sure. Suppose you've just coded up a method. Your favored hypothesis is that your code is correct. Thus, you will find it harder to think of inputs likely to disconfirm this hypothesis than to think of inputs likely to confirm it.

Wason's 2-4-6 test provides a good illustration of this. You can almost directly map the hypothesis most people come up with to a method that returns "true" if the numbers provided conform to a certain rule. A unit test is a sample input that should cause the method to return false, which you could then check against the actual output of the Wason "oracle" (the actual criterion for being an acceptable triple).

Most people think more readily of confirmatory tests, that is, input data which conforms to their favored hypothesis. This will apply if you have already written the method.
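
As a toy illustration of that mapping in Python (the names and rules are made up; the "oracle" plays the role of Wason's actual acceptance criterion):

    def my_hypothesis(a, b, c):
        return b == a + 2 and c == b + 2   # the rule I *think* I'm implementing

    def wason_oracle(a, b, c):
        return a < b < c                   # the real, more permissive rule

    # Confirmatory input: both rules accept it, so it can't tell them apart.
    assert my_hypothesis(2, 4, 6) == wason_oracle(2, 4, 6)

    # Disconfirming input: my hypothesis rejects it but the oracle accepts it.
    # This is exactly the kind of test confirmation bias makes us slow to write.
    assert my_hypothesis(1, 2, 3) != wason_oracle(1, 2, 3)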

This was noticed in the late '70s by testing authority Glenford Myers, who derived from it the (misleading and harmful) edict "developers should never test their own code".

However, if you have to write the test first, you will avoid confirmation bias. You are less likely to harbor preconceptions as to the rule to be implemented; in fact, you are actively thinking about test inputs that will invalidate the implementation, whatever that happens to be.

Does that help?

Replies from: Douglas_Knight, sketerpot, DSimon
comment by Douglas_Knight · 2010-10-01T17:39:39.385Z · LW(p) · GW(p)

However, if you have to write the test first, you will avoid confirmation bias. You are less likely to harbor preconceptions as to the rule to be implemented; in fact, you are actively thinking about test inputs that will invalidate the implementation, whatever that happens to be.

I don't see any theoretical reason or mechanism why writing the tests first encourages negative tests. Is this supposed to be convincing, or are you making an empirical claim?

DSimon proposes a mechanism:

You're saying that false positive tests are weeded out in TDD because the implementation isn't allowed to have any code to raise errors or return negative states until there's first a test checking for those errors/states.

This is plausible, but I still don't find it convincing. In fact, it seems close to the claim "it's easier to learn to think of errors as positive features, requiring positive tests, than it is to learn to write negative tests," which doesn't really distinguish between writing tests before and after.

Let me propose a hypothesis: perhaps it is easier to learn to write negative tests when switching to TDD because it's easier to adopt habits while making large overhauls to behavior.

Hmm...Going back to the original quote,

You are less likely to harbor preconceptions as to the rule to be implemented

is a good point, especially for, e.g., determining which are the edge cases. So I shouldn't have said there is no theory or mechanism, but I'm not convinced.

Replies from: Morendil
comment by Morendil · 2010-10-02T09:12:59.242Z · LW(p) · GW(p)

Let's use a more concrete example. Since I've recently worked in that domain, say we're implementing a template language like mustache.

We're starting from a blank slate, so our first test might be a test for the basic capability of variable substitution:

input: blah{{$foo}}blah
context: foo=blah
expectation: blahblahblah

Is this format clear? It's language-agnostic, so you could implement that test in Ruby or whatever.

We want to see this test fail. So we have to supply an implementation that is deliberately broken - a simple way to do that is to return an empty string, or perhaps return the exact same string that was passed as input - there are many ways to be broken.
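
A minimal sketch of that first test plus one such deliberately broken starting point, written in Python (expand and the test name are made up for illustration):

    def expand(template, context):
        return ""  # one arbitrary way to be broken: always return the empty string

    def test_simple_variable_substitution():
        assert expand("blah{{$foo}}blah", {"foo": "blah"}) == "blahblahblah"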

At this point an experienced TDDer will notice this arbitrariness, think "we could have started with an even simpler test case", and write the following:

input: blahblah
context: foo=whatever
expectation: blahblah

We've narrowed it down to only one non-arbitrary way to make the test fail: return the empty string. And to make the test pass we'll return the original input.

See how this works? Because I'm not yet thinking about my implementation, but about how my tests pin down the correct implementation, I'm free to come up with non-confirmatory examples. My tests are drawing a box around something, but I'm not yet concerned with the contents of the box.

Now we can go back to the simple variable substitution. An important part of TDD is that one failing test does not allow you, yet, to write the fully general code of your algorithm. You're supposed to write the simplest possible code that causes all tests to pass, erring on the side of "simplistic" code.

So for instance you'd write a small method that did more or less this:

return "blahblahblah" if template.contains("{")
else return "blahblah"

Many of my colleagues make a long face when they realize this is what TDD entails. "Why would you write such a deliberately stupid implementation?" Precisely to keep thinking about better tests, and to hold off on thinking about the fully general implementation.

So now I may want to add the following:

input: blah{{$bar}}blah
context: foo=blah
expectation: blahblah

And maybe this:

input: blah{$foo}blah
context: foo=blah
expectation: blah{$foo}blah

Which are important non-confirmatory test cases. And I want to see these tests fail, because they reveal important differences between a sophisticated enough implementation and my "naive" first attempt.

I will also probably be thinking that even this "basic" capability is starting to look like a fairly complex bit, a complexity which wasn't obvious until I started thinking about all these non-confirmatory test cases. At this point if I were coding this for real I would start breaking down the problem, and maybe narrow the problem down to tokenizing the template:

input: blah
expectation: (type=text, value="blah")

and

input: blah{{$foo}}blah
expectation: (type=text, value="blah"), (type=var, value="foo"),(type=text, value="blah")

(Now I'm not testing the actual output of the program, of course, but an intermediate representation. That's OK, since those are unit tests: they're allowed to examine internals of the code.)

At this point the only line of code I have written is a deliberately simplistic implementation of my expansion algorithm, and I have already gotten a half-dozen important tests out of my thinking.
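
For illustration, here is how one of those tokenizer expectations might read as a unit test; tokenize and the tuple representation are hypothetical, and again the point is to see it fail before writing the real thing:

    def tokenize(template):
        raise NotImplementedError  # deliberately unwritten: we want to see this fail

    def test_tokenize_splits_text_and_variables():
        assert tokenize("blah{{$foo}}blah") == [
            ("text", "blah"),
            ("var", "foo"),
            ("text", "blah"),
        ]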

Replies from: DSimon
comment by DSimon · 2010-10-05T13:51:29.229Z · LW(p) · GW(p)

This is a good explanation. I have one point of difference, though:

input: blah{{$foo}}blah

context: foo=blah

expectation: blahblahblah

return "blahblahblah" if template.contains("{") else return "blahblah"

This implementation has copy&pasted magic values from the test. I've usually thought of these kinds of intermediate implementations as side tracks, because AIUI they are necessarily weeded out right away by the refactor phase of each cycle.

So, my deliberately-stupid implementation might've been:

def substitute(input, context); return input.sub(/\{\{\$.+?\}\}/, context.values.first); end

Which is more complex than the one you suggested, but still I think the least complex one that makes the test pass without copy & paste.

Then, as with your example, this would've led to more tests: ones making sure the right substitution variable is matched to the right key, ones in which more than one substitution variable is supplied, ones in which substitutions are requested for variables that aren't in the context, and so on....

(By the way, how did you get that nice fixed-width font?)

Replies from: wnoise
comment by wnoise · 2010-10-05T15:17:59.012Z · LW(p) · GW(p)

but still I think the least complex one that makes the test pass without copy & paste.

He didn't say "without copy and paste".

Come to think of it, "simplest" varies person to person. By one metric, the "simplest that could work" would just be a huge switch statement mapping the input for a given test to the output for the same test...

(By the way, how did you get that nice fixed-width font?)

http://wiki.lesswrong.com/wiki/Comment_formatting

Enclose with backticks for inline code, and

 start with spaces for blocks.
Replies from: DSimon
comment by DSimon · 2010-10-06T14:09:01.757Z · LW(p) · GW(p)

He didn't say "without copy and paste".

Just copying the expected value from the test into the body of the implementation will make the test go green, but it's completely un-DRY, so you'd have to rip it out and replace it with a non-c&p implementation during the necessary refactor phase anyway.

Wikipedia agrees with me on this, and they cite "Test-Driven Development by Example" by Kent Beck, the original TDD guy.

So, TDD as I learned it discourages c&p from the test. However, Morendil, now you've got me interested in talking about the possible benefits of a c&p-permitted approach: for example, I can see how it might force the programmer to write more sophisticated tests. On the other hand, it might also force them to spend a lot more time on the tests for only minor additional benefit.

comment by sketerpot · 2010-10-01T19:07:09.589Z · LW(p) · GW(p)

On the other hand, if you write the code first and then the test, you'll have a better idea of how to make the code break. If you can put yourself in a sufficiently ruthless frame of mind, I think this is better than writing the test first.

comment by DSimon · 2010-10-01T15:34:43.937Z · LW(p) · GW(p)

Okay, I think I see where you're going, but let me double-check:

You're saying that false positive tests are weeded out in TDD because the implementation isn't allowed to have any code to raise errors or return negative states until there's first a test checking for those errors/states.

So, if an everythingWorkingOkay() function always returns true, it wouldn't pass the test that breaks things and then makes sure it returns false. We know that test exists because ideally, for TDD, that test must be written before any code intended to return false can be added to the function at all.

Whereas, if the programmer writes the code first and the test second, they might well forget to test for negative output, since that possibility won't come to mind as readily.
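
For illustration, a tiny sketch of that false-positive trap in Python (the function and its argument are made up):

    def everything_working_okay(status):
        return True   # a stub like this satisfies every purely confirmatory test

    def test_reports_ok_when_healthy():
        assert everything_working_okay({"disk": "ok"})      # passes vacuously

    def test_reports_failure_when_broken():
        # The negative test: it fails against the always-true stub above,
        # forcing real status-checking code to be written.
        assert not everything_working_okay({"disk": "failed"})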

Replies from: Morendil
comment by Morendil · 2010-10-02T09:17:31.637Z · LW(p) · GW(p)

See this other reply for more detail.

comment by Relsqui · 2010-10-02T08:58:41.572Z · LW(p) · GW(p)

Upvoted for your very precise articulation of what encountering a bug feels like. ;)

I find it interesting that this style of development seems very familiar to me ... except for the bit about automated testing! (Bear with me.) I tend to work on small projects, and because of the way I'm wired (i.e. because I haven't rewired myself out of it), when I'm working on something larger, I can't really keep many pieces of it in my head at once. When I try to, bits which aren't drawing my attention right now will fall out of my brain, and sometimes those bits were important. So I have to focus on small pieces at a time--I'll add the minimum amount of code which should produce a new result, make sure it works, and repeat. The principle of this seems much like the principle behind TDD; I just never learned to write tests for my code.

Another comment about evaluation of tests reminds me of someone I know who's working on a project which evaluates the location and frequency of logging statements in a piece of code. I find this a very cool meta tool. Who debugs the debuggers? (If I find out he's got some public writing about this I'll link it.)

Replies from: DSimon
comment by DSimon · 2010-10-05T14:18:08.166Z · LW(p) · GW(p)

I tend to work on small projects, and because of the way I'm wired (i.e. because I haven't rewired myself out of it), when I'm working on something larger, I can't really keep many pieces of it in my head at once. When I try to, bits which aren't drawing my attention right now will fall out of my brain, and sometimes those bits were important. So I have to focus on small pieces at a time[...]

Ye hairy gods, do NOT attempt to rewire yourself out of this. :-)

I don't know of any programmers who are able to productively hold very large chunks of implementation in their head at once. Writing in small pieces with frequent result-checking is nearly always a good idea, even if your result-checking isn't automated or kept around as regression tests. You should be happy you started out with this habit instead of having to force yourself into it.

Replies from: Relsqui, wnoise
comment by Relsqui · 2010-10-05T17:37:48.130Z · LW(p) · GW(p)

wnoise summed it up well. I'm very pleased to have the habit (and am occasionally confused by people who don't do it). But it's frustrating to have that mental limitation. It makes me feel a bit dumb when I'm talking to Real Actual Programmers. :)

comment by wnoise · 2010-10-05T15:23:27.882Z · LW(p) · GW(p)

The habit is very good. The necessity of it is bad. Larger working memory is incredibly useful for a programmer.

comment by Darmani · 2010-09-28T18:51:21.509Z · LW(p) · GW(p)

It's confusing to think of TDD as its own rationality technique: testing your belief that a piece of code works is not fundamentally different from testing any other belief. Okay, so that part is just running unit tests once. Since whether a piece of code works is a different belief from whether that same code with a few modifications works, and since writing tests is work, for efficiency you need to keep those tests around and rerun them. So, that's unit testing. TDD is just writing your tests beforehand, which makes a difference in the process of designing software, but not really in how confident you should be that your code works.

Something more interesting to think about is how much information a test gives you that your code works. You can often tell just from eyeballing the code whether switching an integer from positive to negative will make a meaningful difference to whether the code produces the intended result.

This really turns into an approximate, poor-man's version of proving code correct, which typically proceeds by breaking down code into its paths and checking each against the mathematical model.

Which reminds me, I have to go prove a few Standard ML functions work by Friday.

Replies from: Vladimir_Nesov, DSimon, DSimon
comment by Vladimir_Nesov · 2010-09-28T19:09:43.907Z · LW(p) · GW(p)

It's confusing to think of TDD as its own rationality technique: testing your belief that a piece of code works is not fundamentally different from testing any other belief.

No rationality technique is "fundamentally different" from fixing any other kind of incorrect belief. You've just made an absolutely general counterargument.

(With TDD, you also test that the code didn't work before the bugfix (in the specific way), and started working as a result of it. It's too easy to fix nonexistent problems.)

comment by DSimon · 2010-09-29T17:10:54.762Z · LW(p) · GW(p)

Something more interesting to think about is how much information a test gives you that your code works.

There are several tools that help to increase that amount, that test your tests in other words. They're far from conclusive or thorough, but they can help a great deal:

  • Coverage testers: This technique involves running your tests and seeing which lines in your program they exercise, or (for more sophisticated versions of this technique) which possible paths through your program are run. If you have full coverage, that's no guarantee that you have thorough tests... but if you have low coverage, that definitely means your tests could be improved, because there are parts of your program that aren't being tested.

  • Mutation testers: These tools will randomly introduce temporary but destructive changes into your code (swapping strings for nonsense, flipping booleans, and so on), and then run the unit tests to make sure they fail. This is far from rigorous, but it can be useful; if a test keeps passing even though the code it covers has been broken, that means there's an opportunity for a bug to be introduced there without being detected. (A toy sketch of the idea follows.)
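
Here is that toy sketch of the mutation-testing idea in Python (not a real tool; the code under test, its deliberately weak test, and the mutation rule are all made up):

    import re

    # The code under test, kept as a string so we can mutate it, plus a weak test.
    ORIGINAL = "def is_adult(age):\n    return True if age >= 18 else False\n"

    def test_passes(source):
        namespace = {}
        exec(source, namespace)                   # define the (possibly mutated) function
        return namespace["is_adult"](21) is True  # weak: only one input is checked

    def make_mutants(source):
        # Flip boolean literals, one occurrence at a time.
        for match in re.finditer(r"\bTrue\b|\bFalse\b", source):
            flipped = "False" if match.group() == "True" else "True"
            yield source[:match.start()] + flipped + source[match.end():]

    for mutant in make_mutants(ORIGINAL):
        if test_passes(mutant):
            print("Mutant survived -- the weak test has a blind spot:")
            print(mutant)

Running this, the mutant that breaks the over-18 branch is killed, but the one that breaks the under-18 branch survives, pointing at the missing negative test.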

(Darmani, you probably already know all this; I'm saying it for whoever else might be reading this discussion.)

comment by DSimon · 2010-09-29T17:16:01.460Z · LW(p) · GW(p)

TDD is just writing your tests beforehand, which makes a difference in the process of designing software, but not really in how confident you should be that your code works.

The advantage I've had with TDD over regular testing is that I find myself going down fewer dead-ends, and so being more confident that whatever I'm writing at the moment is actually helpful.

At any given time I'm supposed to only be thinking about the latest test, so I'm much more likely to be in near mode laying down one brick at a time in just the right spot than in far mode mentally sketching out the whole castle.

comment by lukstafi · 2010-09-29T14:32:38.151Z · LW(p) · GW(p)

We are slowly moving towards languages with contracts (like Spec#), where "unit tests" are replaced by "contract tuning" + "functional tests".
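
For illustration, a rough sketch of the contract idea in plain Python (this is not Spec# syntax; the function and its conditions are made up): the contract states the caller's obligations and the function's guarantee directly on the function, and a checker or the runtime enforces them instead of a separate confirmatory unit test.

    def sqrt_floor(n):
        assert isinstance(n, int) and n >= 0, "precondition: n must be a non-negative integer"
        result = int(n ** 0.5)
        assert result * result <= n < (result + 1) ** 2, "postcondition: result is floor(sqrt(n))"
        return result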

Replies from: khafra
comment by khafra · 2010-09-30T14:05:42.263Z · LW(p) · GW(p)

I know about functional programming, but it took a few tries with the google-fu to get information on contract tuning and Spec#, so I'm posting links:

http://en.wikipedia.org/wiki/Design_by_Contract

http://en.wikipedia.org/wiki/Spec_Sharp

Replies from: DSimon
comment by DSimon · 2010-10-06T19:46:51.432Z · LW(p) · GW(p)

BTW, functional programming and functional tests are not related concepts.

Replies from: khafra
comment by khafra · 2010-10-06T20:23:58.481Z · LW(p) · GW(p)

...and so we meet again, what I know for sure that just ain't so. Thanks for the correction; and "black-box testing" seems googleabler. One thing unclear from wikipedia (and further references, the most in-depth of which considered HVAC) is the difference between a functional test and a unit test.