Experiment Idea Thread - Spring 2011

post by Psychohistorian · 2011-05-06T18:10:45.329Z · score: 28 (29 votes) · LW · GW · Legacy · 54 comments

This is an idea that just occurred to me. We have a large community of people who think about scientific problems recreationally, many of whom are in no position to go around investigating them. Hopefully, however, some other community members are in a position to go around investigating them, or know people who are. The idea here is to allow people to propose relatively specific ideas for experiments, which can be upvoted if people think they are wise, and can be commented on and refined by others. Grouping them together in an easily identifiable, organized way in which people can provide approval and suggestions seems like it may actually help advance human knowledge, and with its high sanity waterline and (kind of) diverse group of readers, this community seems like an excellent place to implement this idea.

These should be relatively practical, with an eye towards providing some aspiring grad student or professor with enough of an idea that they could go implement it. You should explain the general field (physics, AI, evolutionary psychology, economics, psychology, etc.) as well as the question the experiment is designed to investigate, in as much detail as you are reasonably capable of.

If this is a popular idea, a new thread can be started every time one of these reaches 500 comments, or quarterly, depending on its popularity. I expect this to provide help for people refining their understanding of various sciences, and if it ever gets turned into even a few good experiments, it will prove immensely worthwhile.

I think it's best to make these distinct from the general discussion thread because they have a very narrow purpose. I'll post an idea or two of my own to get things started. I'd also encourage people to post not only experiment ideas, but criticism and suggestions regarding this thread concept. I'd also suggest that people upvote or downvote this post if they think this is a good or bad idea, to better establish whether future implementations will be worthwhile. 

54 comments

Comments sorted by top scores.

comment by James_Miller · 2011-05-06T18:35:56.088Z · score: 12 (12 votes) · LW(p) · GW(p)

Does the Dual n-back increase intelligence?

Some evidence indicates that playing the videogame Dual n-back increases working memory and fluid intelligence.

A group of us would first take a memory test. Next, a randomly selected subgroup would play Dual n-back a few hours a week for, say, a month. Then both groups would take another memory test. Next, we would wait, say, two months with no one playing the game. Finally, the two groups would again take a memory test. We could probably still learn a lot even if we omitted the control group.
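A rough sketch of how the analysis might look in R. Everything here is an assumption for illustration: the group size of 20, the score scale, and the variable names are made up, and the scores are simulated rather than taken from any study. It randomly assigns half the volunteers to training and compares the average change in score between the two groups.

```r
# Hypothetical illustration: all numbers below are simulated, not real results.
set.seed(1)
n   <- 20                                  # assumed number of volunteers
ids <- seq_len(n)

# Random assignment: half train with dual n-back, half are controls.
trained <- ids %in% sample(ids, n / 2)

# Simulated memory-test scores at the three time points.
pre      <- rnorm(n, mean = 100, sd = 15)
post     <- pre + rnorm(n, mean = 0, sd = 5)
followup <- pre + rnorm(n, mean = 0, sd = 5)

# Gain scores: change from baseline for each participant.
gain_post     <- post - pre
gain_followup <- followup - pre

# Difference in mean gain between the trained and control groups.
effect_post     <- mean(gain_post[trained])     - mean(gain_post[!trained])
effect_followup <- mean(gain_followup[trained]) - mean(gain_followup[!trained])

# Welch two-sample t-test on the immediate gains.
result <- t.test(gain_post[trained], gain_post[!trained])
print(c(effect_post = effect_post, effect_followup = effect_followup))
print(result$p.value)
```

Comparing gains between groups, rather than looking at the trained group alone, is what separates a training effect from ordinary practice on the memory test itself.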

Here is a free version of the game.

comment by D_Malik · 2011-05-06T20:22:28.761Z · score: 3 (3 votes) · LW(p) · GW(p)

In addition to the memory test, we should also use some fluid intelligence test (like RAPM). It would probably be good to use unspeeded RAPM and other fluid intelligence tests (rather than speeded RAPM, which is controversial).

Also, we should investigate different modes, like multiple stimuli and arithmetic and crab n-back.

comment by Louie · 2011-05-08T18:41:49.545Z · score: 0 (8 votes) · LW(p) · GW(p)

A few of us at Singularity Institute tested Dual N-Back last year. For 1 week, 13 people were tested on dissimilar metrics of intelligence while some of them performed the same kind of Dual N-Back done in the original Jaeggi study.

Conclusion: It doesn't make you smarter.

Bonus: You get better at Dual N-Back though!

Interestingly, at around the same time as we were doing our tests last year, the original researcher "replicated" her own results and published them again using new data. I'm sort of confused. I don't want to say Jaeggi doesn't understand training and practice effects... but I'm struggling to understand how else to explain this.

That said, it would still be cool to see LW folks test IA interventions. I just recommend exploring more promising ones. Perhaps seeking to confirm the results of these studies instead?

comment by AnnaSalamon · 2011-05-08T23:42:57.237Z · score: 12 (12 votes) · LW(p) · GW(p)

Louie, I don't remember the details of this. I thought folks ended up with a very small and not-very-powerful study (such that the effect would have had to be very large to show up anyhow), with the main goal of the "study", such as it was, being to test our procedures for running future potential experiments?

Could you refresh my memory on what tests were run?

Also, speaking about how "the folks at SIAI believe X" based on a small-scale attempt run within the visiting fellows program last summer seems misleading; it may inaccurately seem to folks as though you're speaking for Eliezer, Michael Vassar, or others. I was there (unlike Eliezer or Michael) and I don't recall having the beliefs you mention.

comment by Will_Newsome · 2011-05-10T11:47:20.891Z · score: 1 (3 votes) · LW(p) · GW(p)

This definitely agrees with my memory, too...

I mostly felt compelled to actually sign in and comment because I wanted to point out that designating something as "[t]he opinion of folks at Singularity Institute" or the like is often an annoying simplification/inaccuracy that allows people to thenceforth feel justified in using their social cognition to model a group roughly as they would a person.

comment by gwern · 2011-05-11T14:34:28.577Z · score: 0 (0 votes) · LW(p) · GW(p)

When I asked back in September 2010, you said

They're ad hoc, we've used one for a dual n-back study which ended up yielding insufficient data....We didn't study long enough to get any statistically significant data. Like, not even close....really, there's no information there, no matter how much Bayes magic you use

So at least your memory has been consistent over the last 9 months.

comment by cousin_it · 2011-05-08T18:49:05.151Z · score: 8 (8 votes) · LW(p) · GW(p)

Hmm. If your replication attempt was good science, you could help the world by publishing it. If it wasn't good science, you probably shouldn't update on it very strongly.

comment by Louie · 2011-05-08T20:53:55.820Z · score: -4 (4 votes) · LW(p) · GW(p)

I don't know anyone at Singularity Institute who believes either of those statements.

Did you mean this seriously or is this a level?

EDIT: I was thinking of "good science" as the social process of science. By that standard, it's well known on Less Wrong that lots of junk can be published, that many published results are bogus, and that lots of useful experiments which you could meaningfully update on could not be published.

That's why I thought you must be joking. It's like if my friend was at a gay pride rally and said, "We're just trying to undermine family values and destroy America." I would laugh. It's such a naive and backwards statement given what everyone around him believes, that I could safely assume it was complete sarcasm (even though that sentence could be taken another way).

Also, Anna claims to believe your statements. So I was wrong to say no one at Sing Inst believes them. I can't rightfully claim to speak for others like that so sorry to confuse anyone. In her defense, Anna must be reading "good science" and "publishing" as referring to the theoretical ideals of Science and Journals. In that case, I'd guess lots of people could agree with your statements.

comment by cousin_it · 2011-05-08T20:57:28.049Z · score: 6 (6 votes) · LW(p) · GW(p)

I'm not sure what the opinions of folks at SIAI have to do with this (just mentioning them doesn't constitute a valid argument, I'm not at SIAI, I was speaking seriously), but I can recall a quote from Eliezer expressing a sentiment that's pretty close to my second one:

"What happened to me personally is only anecdotal evidence," Harry explained to her. "It doesn't carry the same weight as a replicated, peer-reviewed journal article about a controlled study with random assignment, many subjects, large effect sizes and strong statistical significance."

-- HP:MOR, Ch.6

But that's not very relevant. What's relevant is this: if I believed in the efficacy of the DNB task, it would be wrong for me to change my opinion substantially after reading your comment on LW. 13 people, 1 week, methodology and results unpublished and unverified?
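To put a rough number on why 13 people is thin evidence, here is a back-of-the-envelope power calculation in R. It uses the normal approximation to a two-sample comparison. The 7/6 split and the effect sizes are my assumptions, since the comment does not say how the group was divided or what effect was expected.

```r
# Approximate power of a two-sided, two-sample comparison (normal approximation).
# d      : assumed standardized effect size (difference in means / SD)
# n1, n2 : group sizes
# alpha  : significance level
approx_power <- function(d, n1, n2, alpha = 0.05) {
  se     <- sqrt(1 / n1 + 1 / n2)   # standard error of the standardized difference
  z_crit <- qnorm(1 - alpha / 2)
  # Chance the test statistic clears the critical value in the direction of the
  # effect (the negligible opposite tail is ignored).
  pnorm(d / se - z_crit)
}

# Assumed split of the 13 participants: 7 trained, 6 controls.
for (d in c(0.2, 0.5, 0.8)) {
  cat(sprintf("d = %.1f  power ~ %.2f\n", d, approx_power(d, 7, 6)))
}
```

Under these assumptions even a conventionally "large" effect (d = 0.8) would be detected well under half the time, so a null result from a study this size says little either way.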

comment by Louie · 2011-05-08T22:12:30.908Z · score: -3 (5 votes) · LW(p) · GW(p)

The opinion of folks at Singularity Institute is VERY relevant when discussing the methodology of studies done at Singularity Institute. They laughed at the thought of going through the motions of collecting the extra evidence and doing the extra rituals with statistics to impress journal editors. They did the Bayesian version instead.

If you want an enormous, controlled, statistically-significant study that's been published in a high-quality, peer reviewed journal, check out this study of brain training software in Nature.

I actually just found this study in the past hour. Their conclusion: Brain Training software doesn't work. But you get better at the games!

They cover Lumosity too, which is funny because I was just looking into them myself. I was a bit concerned when I tried to look up the evidence for how "scientifically proven" Lumosity was (since they claim it ALL OVER THEIR SITE) and I later realized that the extent of their published findings was 2 conference posters that aren't available anywhere online and that were accepted to conferences I've never even heard of.

I think I'm gonna go with the researchers from Nature and Singularity Institute on this one.

comment by wedrifid · 2011-05-08T22:47:50.584Z · score: 2 (2 votes) · LW(p) · GW(p)

If you want an enormous, controlled, statistically-significant study that's been published in a high-quality, peer reviewed journal, check out this study of brain training software in Nature.

The finding is surprising.

The training experience that I have most appreciated is that of pushing my brain towards the state of flow, releasing any stress or rumination and constantly letting go of the attachment to the frustration of failure while also not being frustrated by the fact that I may be frustrated about failure. This has a strong overlap with the process involved in some forms of meditation and is certainly the kind of thing that I would expect to have generalized benefit, albeit not necessarily an improvement on tests of general intelligence. The format of a game with a score to be maximised invokes my rather strong competitive instincts and so is rather more motivating than the abstract thought "I should do meditation because meditation is good for me".

The abstract is too vague for me to tell whether their studies relate directly to the kind of training that I am interested in. I would expect not - since my interest is in things that are rather hard to test! My curiosity is not quite sufficient for me to bypass the feeling of disgust and frustration at the paywall and round up the rest of the document.

comment by wedrifid · 2011-05-08T23:32:44.891Z · score: 1 (1 votes) · LW(p) · GW(p)

This seems to miss the point cousin_it was making.

comment by cousin_it · 2011-05-09T21:04:16.761Z · score: 2 (2 votes) · LW(p) · GW(p)

Re your edit: yes, I was referring to the theoretical ideal of Science. (Don't care about Journals.) If a lot of published science is bogus, IMO the right response is to try to do better and nail down our results with more precision, not less. Especially in a topic as important as intelligence amplification.

That said, I wasn't very convinced by the evidence in favor of the DNB task in the first place. In my mind the jury's still out.

comment by Louie · 2011-05-09T21:49:18.584Z · score: 0 (4 votes) · LW(p) · GW(p)

the right response is to try to do better

Agreed. Better is better.

and nail down our results with more precision, not less

I disagree. For example, you can rule out Dual N-Back as a possible Intelligence Amplification intervention with less precision than Jaeggi used to repeatedly mis-prove it as one. Depends on what you mean by precision I suppose. If you mean more time, effort, people, and statistical significance then precision is not needed. If by precision, you just mean being more right... well, I agree, we should be more right.

Most bogus science is very precise: That's why it looks stronger than it is. Poor methodology and experimental design will still allow someone to prove any correlation with p < 0.05 significance. If I want to disprove someone who published an incorrect result, should I have to expend more time, people, and resources than they used just to over-prove the counter-claim with "more precision" -- even though their claim was never wrong due to "lack of precision" in the first place?
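One concrete way this can happen (an illustration only, not a claim about what went wrong in any particular study) is testing many outcome measures and reporting whichever one clears the bar. Assuming independent tests with no real effect anywhere, the arithmetic in R is short:

```r
alpha <- 0.05
k     <- c(1, 5, 10, 20)        # assumed number of independent outcome measures tested
p_any <- 1 - (1 - alpha)^k      # chance at least one comes out "significant" by luck
print(data.frame(tests = k, p_false_positive = round(p_any, 3)))
```

With 20 such measures the chance of at least one spurious p < 0.05 is a bit under two in three.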

Calling for "more precision" is like calling for "more preparation". It has 100% applause appeal and costs nothing for people to call for. But it costs people actually doing research a lot of time. When you advocate for smarter people to use "more precision", you're also advocating for smarter people to do "less research"... the extra precision comes from somewhere.

Are you actually in favor of smarter people doing less research than they currently do?

comment by Vladimir_Nesov · 2011-05-10T09:48:46.952Z · score: 3 (9 votes) · LW(p) · GW(p)

Are you actually in favor of smarter people doing less research than they currently do?

Downvoted for this piece of empty rhetoric.

comment by Will_Newsome · 2011-05-10T11:55:45.484Z · score: 0 (2 votes) · LW(p) · GW(p)

Upvoted; I think this is a good downvoting policy but hope that whoever uses it takes the time to point out what they perceive as empty rhetoric. (I think the habit of spouting such rhetoric is particularly poisonous and particularly easy to stop, making it rather worth the effort of correcting.)

comment by cousin_it · 2011-05-10T08:56:07.447Z · score: 3 (5 votes) · LW(p) · GW(p)

Let's agree on the interpretation "we should be more right" and skip over the issues of time and costs.

Sometimes a published result can indeed be overturned by a small amount of Bayesian evidence. But that's only possible if you also prove that your methodology was much more right than the original paper's methodology. Right now I have no way of knowing that from your comments. If you add a critique of Jaeggi's study and an explanation why your study was better, that will work for me.

comment by Desrtopa · 2011-05-11T04:10:21.034Z · score: 2 (2 votes) · LW(p) · GW(p)

Are you actually in favor of smarter people doing less research than they currently do?

Considering how much research, given the low levels of confidence warranted by its methodology, is essentially worthless, yes, I am willing to say without reservation that there is dead wood to cut away.

comment by [deleted] · 2011-05-08T19:08:50.396Z · score: 4 (4 votes) · LW(p) · GW(p)

For 1 week, 13 people were tested on dissimilar metrics of intelligence

One week seems very short compared to the studies. I didn't check all, but one mentioned in Wikipedia was 5 weeks and another whose abstract I found was 4 weeks. It seems probable that any effects on the brain would be slow to accumulate. As a point of comparison, it takes a lot longer than one week to assess the value of a particular muscle training program.

comment by steven0461 · 2011-05-08T19:25:24.270Z · score: 0 (0 votes) · LW(p) · GW(p)

Conclusion: It doesn't make you smarter.

How much smarter does it not make you, specifically?

comment by Alicorn · 2011-05-06T18:58:37.753Z · score: 5 (7 votes) · LW(p) · GW(p)

Per this discussion post: do siblings who look alike tend to get along better and help each other more?

Ideally you'd get a bunch of (full, tested genetically if there's budget and this passes the ethics board) sibling pairs, probably controlling for number and variety of extra siblings in the family. Take pictures of all the pairs, get third parties to rate their visual similarity (I can name half a dozen features that don't match between me and my sister but that didn't stop six of her friends from mistaking me for her when I walked into her school building one day, so I'm obviously no judge of the matter), and then measure altruism between the siblings (exchanges of babysitting services, frequency of contact/social support, any material assistance that they provide one another, etc.)

comment by Psychohistorian · 2011-05-06T18:12:27.958Z · score: 4 (4 votes) · LW(p) · GW(p)

Here's an example. I will review the comments on it and use them to develop a sort of standard structure, which I will then incorporate in the top-level post for future threads. So please comment both on the idea and my expression of it, and make suggestions for what basic info should be included, either in response to this or in response to the primary article.

Area: Evolutionary Psychology

Topic: Genetic versus "cultural" influence on mate attraction.

Specific problem: There's a belief that various aspects of physical attraction are genetically determined. It is difficult to separate genetic effects from cultural effects. This is an attempt to try to control for that, to see how (in)substantial the effects of culture are. The underlying idea is that, while different cultures also have different genetic makeups, different times in the same geographic area may see different cultures with much more related genetic makeups.

The actual experiment: Sample a group of people (possibly just one sex per experiment) and obtain their views on physical attractiveness. Show them images of people, or drawings, or ask questions about what physical qualities they would find desirable in a mate. (e.g. An attractive member of the opposite sex would be taller than me - strongly agree, agree, disagree, strongly disagree). Then, and this is the expensive part, use the exact same survey on people's offspring at about the same age. It may be ideal to compare people with aunts and uncles rather than parents, as parents are likely to have a more direct non-genetic effect on preferences.

This is a rather general description, but it should be perfectly adequate for someone in the relevant field to design a very effective and insightful experiment. It could even easily be incorporated as part of a larger experiment tracking qualities between generations.

comment by orthonormal · 2011-05-09T00:22:09.471Z · score: 1 (1 votes) · LW(p) · GW(p)

Have you seen this study yet?

comment by DSimon · 2011-05-07T23:30:38.904Z · score: 0 (0 votes) · LW(p) · GW(p)

This is a very cool idea.

One minor protocol thing: if it's a good idea to limit the sample group based on sex, then it would also be a good idea to limit based on sexual orientation, since the cultural factors that affect opposite-sex attraction are quite different from those that affect same-sex attraction, and there may be a difference in the genetic factors as well.

comment by Morendil · 2011-05-06T21:44:49.325Z · score: 2 (2 votes) · LW(p) · GW(p)

Field: Software Engineering. Issue: what are the determinants of efficiency in getting stuff done that entails writing software.

At the Paris LW meetup, I described to Alexandros the particular subtopic about which I noticed confusion (including my own) - people call it the "10x thesis". According to this, in a typical workgroup of software professionals (people paid to write code), there will be a ten to one ratio between productivities of the best and worst. According to a stronger version, these disparities are unrelated to experience.

The research in this area typically has the following setup: you get a group of N people in one room, and give them the same task to perform. Usually there is some experimental condition that you want to measure the effect of (for instance "using design patterns" vs "not using design patterns"), so you split them into subgroups accordingly. You then measure how long each takes to finish the task.

The "10x" result comes from interpreting the same kind of experimental data, but instead of looking at the effect of the experimental condition, you look at the variance itself. (Historically, this got noticed because it vexed researchers that the variance was almost always swamping out the effects of the experimental conditions.)

The issue that perplexes me is that taking a best-to-worst ratio in each group, in such cases, will give a measurement of variance that is composed of two things: first, how variable the time required to complete a task is intrinsically, and second, how different people in the relevant population (which is itself hard to define) differ in their effectiveness at completing tasks.

When I discussed this with Alexandros I brought up the "ideal experiment" I would want to use to measure the first component: take one person, give them a task, measure how long they take. Repeat N times.

However this experiment isn't valid, because remembering how you solved the task the first time around saves a huge amount of time in successive attempts.

So my "ideal" experiment has to be amended: the same, but you wipe the programmer's short-term memory each time, resetting them to the state they were in before the task. Now this is only an impossible experiment.

What surprised me was Alexandros' next remark: "You can measure the same thing by giving the same task to N programmers, instead".

This seems clearly wrong to me. There are two different probability distributions involved: one is within-subject, the other inter-subject. They do not necessarily have the same shape. What you measure when giving one task to N programmers is a joint probability distribution, the shape of which could be consistent with infinitely many hypotheses about the shape of the underlying distributions.
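A small simulation sketch in R makes the confound concrete. The two scenarios and all the numbers are invented for illustration, and the normal distribution is just a convenience. In one scenario programmers differ a lot but each is perfectly consistent; in the other every programmer is identical but each attempt is noisy.

```r
set.seed(42)
n_programmers <- 1000

# Scenario A: all spread is between programmers (each person is perfectly consistent).
skill_a <- rnorm(n_programmers, mean = 60, sd = 20)   # personal mean time, minutes
times_a <- skill_a                                    # no attempt-to-attempt noise

# Scenario B: all spread is within programmers (everyone has the same mean).
skill_b <- rep(60, n_programmers)
times_b <- skill_b + rnorm(n_programmers, mean = 0, sd = 20)

# One observation per programmer: the two samples have nearly the same spread,
# so this data alone cannot tell the scenarios apart.
print(c(sd_a = sd(times_a), sd_b = sd(times_b)))
```

Both samples come out with a standard deviation near 20 minutes, even though the underlying stories are opposite. Only repeated measurements per person could separate them.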

Thus, my question - what would be a good experimental setup and statistical tools to infer within-subject variation, which cannot be measured, from what we can measure?

Bonus question: am I totally confused about the matter?

comment by Perplexed · 2011-05-06T22:13:58.407Z · score: 8 (8 votes) · LW(p) · GW(p)
  1. Give one task to N programmers.
  2. Give a different task to the same N programmers.
  3. Repeat #2 several times.
  4. Say to self "I'll bet the same guy was a super-programmer on all of those tasks. He just is better at programming".
  5. Repeat #4 several times.
  6. Analyze the data by multiple regression. Independent variables are programmer ids and task ids. Intrinsic variability of tasks falls out of the analysis as unexplained variance, but what you are really interested in is relative performance of programmers over all tasks.

Bonus: I don't think you are confused. But you seem to be assuming that the 10x thesis applies to specific programming tasks (like writing a parser, or a diagram editor, or a pretty-printer). But I think the hypothesis is stronger than that. Some people are better at all types of programming than are lesser mortals. So, you can smooth the noise by aggregating several tasks without losing the 10x signal.

comment by Morendil · 2011-05-07T20:56:31.544Z · score: 0 (0 votes) · LW(p) · GW(p)

Analyze the data by multiple regression

I'd appreciate practical advice on how to do that in R/RStudio. I have data from an empirical study, loaded in RStudio as "29 observations of 8 variables". My variables are "Who, T1, T2, T3 (etc)" where "Who" is programmer id and T1, etc. are the times taken for tasks 1 through 8.

What R command will give me a multiple regression of times over programmer id and task id?

[ETA: OK, I figure what I've got to do is make this a data frame with 3 variables, those being Who, TaskId, Time. Right? Maybe I can figure it out. Worst case, I'll create a spreadsheet organized that way.]
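For what it's worth, the reshaping can probably be done inside R rather than in a spreadsheet. This is a sketch on a made-up toy table (3 programmers, 3 tasks, invented times), not the real data; the column names follow the ones described above.

```r
# Toy wide-format data (hypothetical values): one row per programmer, one column per task.
wide <- data.frame(
  Who = c("A1", "A2", "A3"),
  T1  = c(30, 45, 38),
  T2  = c(12, 20, 15),
  T3  = c(50, 70, 55)
)

# stack() turns the task columns into two columns:
# values (the times) and ind (the name of the column each value came from).
stacked <- stack(wide[, c("T1", "T2", "T3")])

long <- data.frame(
  Who    = factor(rep(wide$Who, times = 3)),  # programmer ids repeat once per task column
  TaskId = stacked$ind,                       # factor with levels T1, T2, T3
  Time   = stacked$values
)
print(long)
```

The result has one row per (programmer, task) pair, which is the shape a regression of Time on the other two columns expects.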

[ETA2: I've done the above, but I don't know how to interpret the results. What do you expect to see - in terms of coefficients of regression?]

comment by Perplexed · 2011-05-07T22:41:08.413Z · score: -1 (1 votes) · LW(p) · GW(p)

I think you need one variable per programmer (value 0 or 1), one variable per task (value 0 or 1), and one variable for time taken to complete the task (real number). So, with 8 tasks and 29 programmers, you have 38 (= 29 + 8 + 1) variables, all but 3 of which are zero for each observation. And you have 232 observations.

Since you have 37 independent variables, you will have 37 regression coefficients (each presumably in units of hours) plus one additional parameter that applies to all observations. The results claim that you get a good estimate of the time required for programmer j to complete task k by adding together the j-th programmer coefficient, the k-th task coefficient and the extra parameter.

comment by Morendil · 2011-05-08T08:10:00.604Z · score: 0 (0 votes) · LW(p) · GW(p)

I'm not seeing why the ProgID and TaskID variables need to be booleans - or maybe R implicitly converts them to that. I've left them in symbolic form.

Here is a subset of the PatMain data massaged (by hand!) into the format I thought would let me get a regression, and the regression results as a comment. I got this into a data frame variable named z2 and ran the commands:

fit = lm(Time ~ .,data=z2)
summary(fit)

I suck at statistics so I may be talking nonsense here, and you're welcome to check my results. The bottom line seems to be that the task coefficients do a much better job of predicting the completion time than do the programmer coefficients, with t-values that suggest you could easily not care about who performs the task with the exception of programmer A6 who was the slowest of the lot.

(For instance the coefficients say that the best prediction for the time taken is "40 minutes", then you subtract 25 minutes if the task is ST2. This isn't a bad approximation, except for programmer A4 who takes 40 minutes on ST2. It's not that A4 is slow - just slow on that task.)
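On the boolean question: as far as I can tell, lm() does the 0/1 conversion itself when a column is a factor. Here is a sketch on a made-up toy data frame (hypothetical values, same three columns as z2) that shows the indicator columns R builds and how the coefficients line up with them.

```r
# Toy long-format data (hypothetical values): Who, TaskId, Time.
z <- data.frame(
  Who    = factor(rep(c("A1", "A2", "A3"), times = 3)),
  TaskId = factor(rep(c("T1", "T2", "T3"), each = 3)),
  Time   = c(30, 45, 38, 12, 20, 15, 50, 70, 55)
)

# lm() expands each factor into 0/1 indicator columns, dropping the first level
# of each factor as the baseline (R's default "treatment" contrasts).
X <- model.matrix(Time ~ Who + TaskId, data = z)
print(colnames(X))

fit <- lm(Time ~ Who + TaskId, data = z)
# (Intercept) is the fitted time for the baseline pair (A1 on T1); every other
# coefficient is an offset in minutes relative to that baseline.
print(coef(fit))
```

So a predicted time is the intercept plus the offset for the programmer plus the offset for the task, with the baseline programmer and baseline task contributing zero.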

comment by Perplexed · 2011-05-08T13:30:51.330Z · score: 0 (0 votes) · LW(p) · GW(p)

You had asked for assistance and expertise on using R/RStudio. Unfortunately, I have never used them.

maybe R implicitly converts them

Judging from your results, I'm sure you are right.

The bottom line seems to be that the task coefficients do a much better job of predicting the completion time than do the programmer coefficients.

Yes, and if you added some additional tasks into the mix - tasks which took hours or days to complete - then programmer ID would seem to make even less difference. This points out the defect in my suggested data-analysis strategy. A better approach might have been to divide each time by the average time for the task (over all programmers), optionally also taking the log of that, and then exclude the task id as an independent variable. After all, the hypothesis is that Achilles is 10x as fast as the Tortoise, not that he takes ~30 minutes less time regardless of task size.
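Written out, the revised analysis would look something like the following. The symbols are my own notation: $t_{jk}$ is the time programmer $j$ takes on task $k$, and $N$ is the number of programmers.

```latex
\bar{t}_k = \frac{1}{N} \sum_{j=1}^{N} t_{jk}, \qquad
r_{jk} = \frac{t_{jk}}{\bar{t}_k}, \qquad
y_{jk} = \log r_{jk} = a_j + \varepsilon_{jk}
```

Here $a_j$ is a per-programmer effect and $\varepsilon_{jk}$ is the unexplained residual. On this scale, "Achilles is 10x as fast as the Tortoise" becomes $a_{\text{Achilles}} - a_{\text{Tortoise}} = -\log 10$, whatever the size of the task.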

comment by Morendil · 2011-05-07T11:27:49.675Z · score: 0 (0 votes) · LW(p) · GW(p)

you seem to be assuming that the 10x thesis applies to specific programming tasks (like writing a parser, or a diagram editor, or a pretty-printer)

Where is that implied in what I wrote above?

Some people are better at all types of programming than are lesser mortals

Are you making that claim, or suggesting that this is what the 10x thesis means?

(Dijkstra once claimed that "the use of COBOL cripples the mind". If true, it would follow that someone who is a great COBOL programmer would be a poor programmer in other languages.)

comment by Perplexed · 2011-05-07T17:11:52.076Z · score: 3 (3 votes) · LW(p) · GW(p)
Some people are better at all types of programming than are lesser mortals

Are you making that claim, or suggesting that this is what the 10x thesis means?

Both.

(Dijkstra once claimed that "the use of COBOL cripples the mind". If true, it would follow that someone who is a great COBOL programmer would be a poor programmer in other languages.)

Amusingly, that does not follow. A great COBOL programmer completes his COBOL tasks in 1/10 the time of lesser folk, and hence becomes 1/10 as crippled.

you seem to be assuming ...

Where is that implied in what I wrote above?

It appears that I somehow misinterpreted your point and thereby somehow offended you. That was not my intention.

You began by mentioning the problem of testing the 10x hypothesis, and then switched to the problem of trying to separate out "how variable the time required to complete a task is intrinsically". That is an odd problem to focus on, and my intuition tells me that it is best approached by identifying that variance as a residual rather than by inventing ideal thought experiments that measure it directly. But if someone else has better ideas, that is great.

comment by Morendil · 2011-05-07T20:43:28.181Z · score: 2 (2 votes) · LW(p) · GW(p)

somehow offended you

No offense taken. Just curious to know. I'm declaring Crocker's Rules in this thread.

You are asserting "some people are better at all types of programming than are lesser mortals". In that case I'd like to know what evidence convinced you, so that I can have a better understanding of "better at".

Some of the empirical data I have looked at contradicted your hypothesis "the same guy was a super-programmer on all of those tasks". In that study, some people finished first on one task and last on some other task. (Prechelt's "PatMain" study.)

the problem of testing the 10x hypothesis

One of my questions is, "is the 10x claim even a testable hypothesis?". In other words, do we know what the world would look like if it was false?

When I've brought this up in one venue, people asked me "well, have you seen any evidence suggesting that all people code at the same rate?" This is dumb. Just because there exists one alternate hypothesis which is obviously false does not immediately confirm the hypothesis being tested.

Rather, the question is "out of the space of possible hypotheses about how people's rates of output when programming differ, how do we know that the best is the one which models each individual as represented by a single numerical value, such that the typical ratio between highest and lowest is one order of magnitude".

This space includes hypotheses where rate of output is mostly explained by experience, which appear facially plausible - yet many versions of the 10x thesis explicitly discard these.

comment by Perplexed · 2011-05-07T23:04:22.683Z · score: 1 (1 votes) · LW(p) · GW(p)

My reasons for believing the 10x hypothesis are mostly anecdotal. I've talked to people who observed Knuth and Harlan Mills in action. I know of the kinds of things accomplished more recently by Torvalds and Hudak. Plus, I have myself observed differences of at least 5x in industrial and college classwork environments.

I looked at the PatMain study. I'm not sure that the tasks there are large enough (roughly 3 hours) to test the 10x hypothesis. Worse, they are program maintenance tasks, and they exclude testing and debugging. My impression is that top programmers achieve their productivity mostly by being better at the design and debugging tasks. That is, they design so that they need less code, and they code so they need dramatically less debugging. So I wouldn't expect PatMain data to back up the 10x hypothesis.

comment by Morendil · 2011-05-08T07:44:32.927Z · score: 0 (0 votes) · LW(p) · GW(p)

My reasons for believing the 10x hypothesis are mostly anecdotal.

Do you see it as a testable hypothesis though, as opposed to an applause light calling out the programming profession as one where remarkable individuals are to be found?

I'm not sure that the tasks there are large enough ... they are program maintenance tasks

You said earlier that a great programmer is good at all types of programming tasks, and program maintenance certainly is a programming task. Why the reversal?

Anyway, suppose you're correct and there are some experimental conditions which make for a poor test of 10x. Then we need to list all such exclusion criteria prior to the experiment, not come up with them a posteriori - or we'll be suspected of excluding the experimental results we don't like.

My impression is that top programmers achieve their productivity mostly by being better at the design and debugging tasks ... they design so that they need less code

Now this sounds as if you're defining "productivity" in such a way that it has less to do with "rate of output". You've just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.

At this point ISTM we still have made surprisingly little headway on the two questions at hand:

  • what kind of claim is the 10x claim - is it a testable hypothesis, and if not, how do we turn it into one
  • what kind of experimental setup will give us a way to check whether 10x is indeed favored among credible alternatives

comment by Perplexed · 2011-05-08T13:56:15.308Z · score: 1 (1 votes) · LW(p) · GW(p)

Do you see it as a testable hypothesis[?]

I believe it can be turned into one. For example, as stated, it doesn't take into account sample or population size. The reductio (N=2) is that it seems to claim the faster of two programmers will be 10x as fast as the slower. There is also a need to clarify and delimit what is meant by task.

You said earlier that a great programmer is good at all types of programming tasks, and program maintenance certainly is a programming task. Why the reversal?

Because you and I meant different things by task. (I meant different types of systems - compilers vs financial vs telephone switching systems for example.) Typing and attending meetings are also programming tasks, but I wouldn't select them out for measurement and exclude other, more significant tasks when trying to test the 10x hypothesis.

Now this sounds as if you're defining "productivity" in such a way that it has less to do with "rate of output". You've just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.

Yes, I have. And I think we are wasting time here. It is easy to refute a scientific hypothesis by uncharitably misinterpreting it so that it cannot possibly be true. So I'm sure you will succeed in doing so without my help.

comment by Morendil · 2011-05-08T15:46:58.614Z · score: 0 (0 votes) · LW(p) · GW(p)

It is easy to refute a scientific hypothesis by uncharitably misinterpreting it so that it cannot possibly be true.

Where specifically have I done that? (Is it the "applause light" part? Do you think it obviously false that the thesis serves as an applause light?)

And I think we are wasting time here.

Are you tapping out? This is frustrating as hell. Crocker's Rules, dammit - feel free to call me an idiot, but please point out where I'm being one!

Without outside help I can certainly go on doubting - holding off on believing what others seem to believe. But I want something more - I want to form positive knowledge. (As one fictional rationalist would have it, "My bottom line is not yet written. I will figure out how to test the magical strength of Muggleborns, and the magical strength of purebloods. If my tests tell me that Muggleborns are weaker, I will believe they are weaker. If my tests tell me that Muggleborns are stronger, I will believe they are stronger. Knowing this and other truths, I will gain some measure of power.")

For example, as stated, it doesn't take into account sample or population size.

Yeah, good catch. The 10x ratio is supposed to hold for workgroup-sized samples (10 to 20). What the source population is, that's less clearly defined. A 1983 quote from Mills refers to "programmers certified by their industrial position and pay", and we could go with that: anyone who gets full time or better compensation for writing code and whose job description says "programmer" or a variation thereof.

We can add "how large is the programmer population" to our list of questions. A quick search turns up an estimate from Watts Humphrey of 3 million programmers in the US about ten years ago.

So let's assume those parameters hold - population size of 3M and sample size of 10. Do we now have a testable hypothesis?

What is the math for finding out what distribution of "productivity" in the overall population gives rise to a typical 10x best-to-worst ratio when you take samples of that size? Is that even a useful line of inquiry?
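One way to start on that question is by simulation. The sketch below assumes, purely for illustration, that productivity is lognormally distributed (nothing in this thread establishes that), and asks what spread of log-productivity makes the median best-to-worst ratio in a workgroup of 10 come out at 10x. The function names are hypothetical. With a population in the millions and samples of 10, the population size itself barely matters, so the sketch just draws independently from the distribution.

```python
import math
import random
import statistics

def median_best_to_worst_ratio(sigma, sample_size=10, trials=2000, seed=0):
    """Median of (best / worst) productivity over many simulated workgroups.

    Assumption: log-productivity is normal with standard deviation `sigma`.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        logs = [rng.gauss(0.0, sigma) for _ in range(sample_size)]
        # best / worst = exp(max log-productivity - min log-productivity)
        ratios.append(math.exp(max(logs) - min(logs)))
    return statistics.median(ratios)

def sigma_for_ratio(target_ratio=10.0, sample_size=10, lo=0.01, hi=5.0):
    """Bisect on sigma until the simulated median ratio matches the target."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if median_best_to_worst_ratio(mid, sample_size) < target_ratio:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    sigma = sigma_for_ratio(10.0, sample_size=10)
    print(f"sigma of log-productivity giving a median 10x ratio: {sigma:.2f}")
```

Swapping in a different distribution (say, one where output is driven mostly by years of experience) and comparing its predictions against real workgroup data would be one way to pit 10x against the alternatives.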

comment by pengvado · 2011-05-11T20:09:19.130Z · score: 3 (3 votes) · LW(p) · GW(p)

The misinterpretation that stood out to me was:

Now this sounds as if you're defining "productivity" in such a way that it has less to do with "rate of output". You've just ruled out, a priori, any experimental setup in which you hand programmers a fixed design and measure the time taken to implement it, for instance.

I'm not sure whether you meant "design" to refer to e.g. internal API or overall program behavior, but they're both relevant in the same way:

The important metric of "rate of output" is how fast a programmer can solve real-world problems. Not how fast they can write lines of code -- LOC is a cost, not an output. Design is not a constant. If Alice implements feature X using 1 day and 100 LOC, and Bob implements X using 10 days and 500 LOC, then Alice was 10x as productive as Bob, and she achieved that productivity by writing less code.

I would also expect that even having a fixed specification of what the program should do would somewhat compress the range of observed productivities compared to what actually happens in the wild. Because translating a problem into a desired program behavior is itself part of the task of programming, and is one of the opportunities for good programmers to distinguish themselves by finding a more efficient design. Although it's harder to design an experiment to test this part of the hypothesis.

comment by thomblake · 2011-05-11T20:18:53.645Z · score: 0 (0 votes) · LW(p) · GW(p)

LOC is a cost, not an output

Yes.

comment by DSimon · 2011-05-07T23:33:24.227Z · score: 1 (1 votes) · LW(p) · GW(p)

A great COBOL programmer completes his COBOL tasks in 1/10 the time of lesser folk, and hence becomes 1/10 as crippled.

That has unfortunately not been my experience with similarly crippling languages. A great programmer finishes their crippling-language tasks much quicker than a poor programmer... and their reward is lots lots more tasks in the crippling language. :-\

comment by wedrifid · 2011-05-08T02:35:09.093Z · score: 2 (2 votes) · LW(p) · GW(p)

That has unfortunately not been my experience with similarly crippling languages. A great programmer finishes their crippling-language tasks much quicker than a poor programmer... and their reward is lots lots more tasks in the crippling language

I've seen this too - if something sucks it can be a good idea to make sure you appear to suck at it!

comment by DSimon · 2011-05-09T19:25:41.298Z · score: 0 (0 votes) · LW(p) · GW(p)

If you do the job badly enough...

comment by twanvl · 2011-05-06T23:14:35.221Z · score: 3 (3 votes) · LW(p) · GW(p)

If being a good or bad programmer is an intrinsic quality that is independent of the task, then you could just give the same subject different tasks to solve. So you take N programmers, and give them all K tasks to solve. Then you can determine the mean difficulty of each task as well as the mean quality of each programmer. Given that, you should be able to infer the variance.

There are some details to be worked out, for example, is task difficulty multiplicative or additive? I.e. if task A is 5 times as hard as task B, will the standard deviation also be 5 times as large? But that can be solved with enough data and proper prior probabilities of different models.
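As a rough sketch of the multiplicative version: if completion time is (task difficulty) / (programmer skill) times some noise, then taking logs turns it into an additive model, and row and column means of the log-times estimate the programmer and task effects. Everything below (function names, the noise model, the numbers in the usage lines) is a made-up illustration, not data from any study.

```python
import math
import random
import statistics

def simulate_times(skills, difficulties, noise_sd=0.2, seed=1):
    """Hypothetical multiplicative model: time = difficulty / skill * noise."""
    rng = random.Random(seed)
    return [[d / s * math.exp(rng.gauss(0.0, noise_sd)) for d in difficulties]
            for s in skills]

def estimate_effects(times):
    """Split log-times into per-programmer and per-task effects.

    times[i][j] is the time programmer i took on task j. Each effect is a
    row (or column) mean of log-time minus the grand mean, so larger
    programmer effects mean slower programmers.
    """
    logs = [[math.log(t) for t in row] for row in times]
    grand = statistics.fmean(x for row in logs for x in row)
    programmer = [statistics.fmean(row) - grand for row in logs]
    task = [statistics.fmean(col) - grand for col in zip(*logs)]
    return programmer, task

if __name__ == "__main__":
    times = simulate_times(skills=[1, 2, 4, 8], difficulties=[1, 3, 5])
    programmer, task = estimate_effects(times)
    ratio = math.exp(max(programmer) - min(programmer))
    print(f"estimated slowest-to-fastest ratio: {ratio:.1f}")
```

Comparing how well this log-scale fit and a plain additive fit explain the residuals would be one way to decide between the multiplicative and additive models.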

comment by JoshuaZ · 2011-05-06T18:49:13.857Z · score: 2 (2 votes) · LW(p) · GW(p)

Things I'd like to see tested which may or may not have been tested before but I haven't seen in the literature.

1) There's a lot of evidence that people are wildly overconfident. The classic version of this is that if you ask people to give a range for something such that they are 90% sure they got it right, and do it for a long list of things (like, say, the populations of various countries), the ranges will contain the true value much less than 90% of the time. Will people calibrate more properly when there is money at stake? (This is something that Mass Driver and I discussed a while back.) The way I'd test this is, after they've given their ranges, to see what bets they are willing to take about their being correct and how closely those bets match their estimated confidence.

2) Are people who have learned about cognitive biases less likely to be subject to them to any substantial degree? The one I'm most curious about is the conjunction fallacy. The obvious way to test this is to take people who have just finished a semester of intro psychology or something similar and see if they show less of a conjunction bias than students who have not done so.

3) Can training make one better at the color-word version of the Stroop interference test?
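For idea 1), the two quantities to compare could be computed roughly as follows. This is only a sketch with hypothetical function names: the first measures how often a subject's stated 90% ranges actually contain the true value, and the second gives the probability of being right at which a bet breaks even, which is the confidence a subject implicitly reveals by accepting it.

```python
def interval_hit_rate(intervals, truths):
    """Fraction of stated (low, high) ranges that contain the true value."""
    hits = sum(low <= truth <= high
               for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)

def break_even_probability(stake, payout):
    """Probability of being right at which risking `stake` to win `payout`
    has zero expected value. Accepting the bet suggests the subject's real
    confidence is at least this high."""
    return stake / (stake + payout)

if __name__ == "__main__":
    # Placeholder numbers for illustration only, not experimental data.
    intervals = [(0, 10), (5, 6), (1, 2)]
    truths = [3, 7, 1.5]
    print(interval_hit_rate(intervals, truths))
    print(break_even_probability(stake=9, payout=1))
```

A well-calibrated subject who claims 90% confidence should be willing to risk 9 to win 1, and should hit on about 90% of their ranges; the experiment would look at how far both numbers drift from that, with and without money involved.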

comment by gwern · 2011-05-07T23:18:21.397Z · score: 4 (4 votes) · LW(p) · GW(p)

3) Can training make one better at the color-word version of the Stroop interference test?

Yes. The Stroop test is, along with spaced repetition, one of the most confirmed and replicated tasks in all of psychology, so it would be deeply surprising if no one had come up with training to make you better at the Stroop test. (Heck, there's plenty of training available for IQ tests - like taking a bunch of IQ tests.)

I'd put a very high confidence on that, but as it happens, I don't have to since I recently saw discussion of one result on Stroop test and meditation:

. After training, subjects were tested on a variety of cognitive and personality tests, including associate learning, word fluency, depression, anxiety, locus of control, and of course Stroop. Results showed that the TM and MF groups together scored significantly higher on associate learning and word fluency than the no-training and relaxation-training groups. Perhaps most surprisingly, over a 36 month period, the survival rate for the TM and MF groups was significantly higher than for the relaxation and no-training groups (p<.00025). But more to the point, both TM and MF scored higher than MR and no-training on the Stroop task (p<.1; one-tailed test).

Or:

Incredibly, behavioral data showed that the standard stroop effect (again, a cost in reaction time when reading incongruent words relative to congruent words) was completely eliminated in terms of both reaction time and accuracy for both the experimental and control groups. [ERP analyses revealed decreased visual activity under suggestions, including suppression of early visual effects commonly known as the P100 and N100, while fMRI showed reductions in a variety of regions including anterior cingulate]. The bottom line, then, is that even strong suggestion is enough to accomplish some amount of deprogramming, as measured through the Stroop task.

comment by JoshuaZ · 2011-05-07T23:32:06.049Z · score: 0 (0 votes) · LW(p) · GW(p)

Thanks.

comment by Matt_Simpson · 2011-05-06T18:59:47.379Z · score: 0 (0 votes) · LW(p) · GW(p)

1) I'm surprised this hasn't already been done. Many economists like to argue that "people are rational when it counts", i.e. when there are stronger incentives. Similar to your proposal, I'm interested in seeing how priming affects decisions with incentives, and to my knowledge, this hasn't been done either (but IIRC it has been done without incentives).

2) IIRC the results have been replicated with economics and/or psychology graduate students (citation needed).

comment by Barry_Cotter · 2011-05-06T20:26:13.456Z · score: 1 (1 votes) · LW(p) · GW(p)

1) Different but related: people who trade stuff a lot suffer much less from the endowment effect. Also, while people are normally crap at randomising, with money at stake they get better very quickly.

comment by JoshuaZ · 2011-05-06T19:04:07.325Z · score: 0 (0 votes) · LW(p) · GW(p)

It is possible that 1) has been done but if so I haven't seen the studies.

comment by Armok_GoB · 2011-05-10T16:37:39.138Z · score: 1 (1 votes) · LW(p) · GW(p)

Oooh, I tend to get these quite often, let's see if I can remember any that are actually workable...

I had this idea for a narrow AI experiment where you have two populations of algorithms, of many different and unrelated types, in a predator-prey-like arms race. One side tries to forge false sensory data (for example images or snippets of music), and the other tries to distinguish those falsifications from human- or nature-supplied data; the first group is scored on how well it fools the second. That's the basic idea. If anyone would be interested in actually trying it out, I've thought a bunch more about the details of how to implement it, possible problems, and further small things you could do to make it work even better than the raw version.
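A toy version of just the scoring loop might look like the sketch below. Everything in it is a stand-in chosen for brevity: "real data" is draws from a standard normal, forgers are random number generators with different parameters, and detectors are simple cutoff rules. A real experiment would use images or audio and far richer algorithms, and would add some selection or training step between rounds.

```python
import random

def real_sample(rng):
    """Stand-in for nature-supplied data: a draw from a standard normal."""
    return rng.gauss(0.0, 1.0)

def make_forger(mu, sigma):
    """A forger is any function that produces a fake sample."""
    return lambda rng: rng.gauss(mu, sigma)

def make_detector(cutoff):
    """A detector returns True when it judges a sample to be real."""
    return lambda x: abs(x) <= cutoff

def score(forgers, detectors, n=2000, seed=0):
    """Score forgers by how often they fool detectors, and detectors by
    how often they classify real and fake samples correctly."""
    rng = random.Random(seed)
    forger_scores = []
    detector_correct = [0] * len(detectors)
    total = 0  # number of samples each detector has judged so far
    for forge in forgers:
        fooled = 0
        for _ in range(n):
            fake, real = forge(rng), real_sample(rng)
            for k, detect in enumerate(detectors):
                fooled += detect(fake)
                detector_correct[k] += (not detect(fake)) + detect(real)
            total += 2
        forger_scores.append(fooled / (n * len(detectors)))
    detector_scores = [c / total for c in detector_correct]
    return forger_scores, detector_scores

if __name__ == "__main__":
    forgers = [make_forger(0.0, 1.0), make_forger(5.0, 1.0)]
    detectors = [make_detector(2.0), make_detector(1.0)]
    print(score(forgers, detectors))
```

The arms race itself would come from repeatedly keeping or retraining the best-scoring members of each population and re-running the scoring.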

comment by orthonormal · 2011-10-13T14:21:00.087Z · score: 0 (0 votes) · LW(p) · GW(p)

Inspired by this article on the effects of gut bacteria, I'd like to know whether chronically obese people starting a new diet would benefit from taking a course of antibiotics at the same time.

Of course, if this becomes widespread, it would increase antibiotic resistance (and other health risks) for a relatively low payoff, but I'm still curious.

comment by Jordan · 2011-05-06T21:17:19.200Z · score: 0 (2 votes) · LW(p) · GW(p)

Field: Electrical Engineering. No idea how practical this is though:

An important problem with increasing the number of cores on a chip is having enough bandwidth between the cores. Some people are working on in-silicon optical channels, which seems promising. Instead of this, would it be possible for the different cores to communicate with each other wirelessly? This requires integrated transmitters and receivers, but I believe both exist.

comment by twanvl · 2011-05-06T23:07:16.195Z · score: 1 (1 votes) · LW(p) · GW(p)

I am not an electrical engineer, but as far as I know, wireless communication requires a relatively large antenna. Also, the bandwidth is likely a lot worse than that of a wire. There is a good reason that people still use wires whenever possible.

comment by Jordan · 2011-05-07T01:36:16.212Z · score: 1 (1 votes) · LW(p) · GW(p)

I should have done some more due diligence before suggesting my idea:

http://www.cs.ucla.edu/~sblee/Papers/mobicom09-wnoc.pdf

Edit: I was originally concerned about bandwidth, but the above article claims

On-chip wireless channel capacity. Because of such low signal loss over on-chip wireless channels and new techniques in generating terahertz signals on-chip [14,31], the on-chip wireless network becomes feasible. In addition, it is possible to switch a CMOS transistor as fast as 500 GHz at 32 nm CMOS [21], thus allowing us to implement a large number of high frequency bands for the on-chip wireless network. Following a rule of thumb in RF design, the maximum available bandwidth is 10% of the carrier frequency. For example, with a carrier frequency of 300 GHz, the data rate of each channel can be as large as 30 Gbps. Using a 32 nm CMOS process, there will be total of 16 available channels, from 100 GHz to 500 GHz, for the on-chip wireless network, and each channel can transmit at 10 to 20 Gbps. In the 1000-core CMPs design, the total aggregate data rate can be as high as 320 Gbps with 16 TX’s and 64 RX’s.
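The arithmetic in that passage can be checked in a few lines. This sketch only restates the quoted rule of thumb and figures; the function names are made up, and treating 1 GHz of bandwidth as roughly 1 Gbps is an assumption the quote makes implicitly.

```python
def channel_rate_gbps(carrier_ghz, bandwidth_fraction=0.10):
    """Rule of thumb from the quote: usable bandwidth is about 10% of the
    carrier frequency (assuming roughly 1 Gbps per GHz of bandwidth)."""
    return carrier_ghz * bandwidth_fraction

def aggregate_rate_gbps(channels, per_channel_gbps):
    """Total data rate if every channel transmits at the same rate."""
    return channels * per_channel_gbps

if __name__ == "__main__":
    print(channel_rate_gbps(300))       # the quote's 300 GHz example
    print(aggregate_rate_gbps(16, 10))  # 16 channels at the low end
    print(aggregate_rate_gbps(16, 20))  # 16 channels at the high end
```

The quoted 320 Gbps ceiling corresponds to all 16 channels running at the upper figure of 20 Gbps each; at 10 Gbps each the total would be 160 Gbps.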