Statistical Prediction Rules Out-Perform Expert Human Judgments

post by lukeprog · 2011-01-18T03:19:33.866Z · LW · GW · Legacy · 199 comments

A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?

The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do. Or, more exactly:

When based on the same evidence, the predictions of SPRs are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction.1

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do. Reaction from the wine-tasting industry to such wine-predicting SPRs has been "somewhere between violent and hysterical."

How does the SPR work? This particular SPR is called a proper linear model, which has the form:

P = w1(c1) + w2(c2) + w3(c3) + ... + wn(cn)

The model calculates the summed result P, which aims to predict a target property such as wine price, on the basis of a series of cues. Above, cn is the value of the nth cue, and wn is the weight assigned to the nth cue.2
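As a minimal sketch of the formula above, the model is just a weighted sum of cue values. The function name `predict` and the example weights and cues are made up for illustration and do not come from any of the cited studies.

```python
def predict(weights, cues):
    """Return P = w1*c1 + w2*c2 + ... + wn*cn for a single case."""
    if len(weights) != len(cues):
        raise ValueError("need exactly one weight per cue")
    return sum(w * c for w, c in zip(weights, cues))


# Hypothetical weights and cue values, chosen only for illustration.
example_weights = [0.5, 2.0, -1.0]
example_cues = [10.0, 3.0, 4.0]
example_p = predict(example_weights, example_cues)  # 0.5*10 + 2.0*3 - 1.0*4 = 7.0
```

All of the judgment lives in the choice of cues and weights; once those are fixed, applying the rule to a new case is mechanical.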

In the wine-predicting SPR, c1 reflects the age of the vintage, and other cues reflect relevant climatic features where the grapes were grown. The weights for the cues were assigned on the basis of a comparison of these cues to a large set of data on past market prices for mature Bordeaux wines.3
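To illustrate how weights can be derived from past data, here is a rough sketch of an ordinary least-squares fit for a single cue plus an intercept. This is an assumption about the general technique, not a reconstruction of the wine model itself: the function `fit_one_cue`, the intercept term, and the age and price figures are all hypothetical. With several cues the same idea becomes a multiple regression.

```python
def fit_one_cue(cue_values, targets):
    """Least-squares fit of target = intercept + weight * cue (approximately)."""
    n = len(cue_values)
    if n != len(targets) or n < 2:
        raise ValueError("need at least two (cue, target) pairs of equal length")
    mean_c = sum(cue_values) / n
    mean_t = sum(targets) / n
    # Spread of the cue, and how the cue moves together with the target.
    s_cc = sum((c - mean_c) ** 2 for c in cue_values)
    s_ct = sum((c - mean_c) * (t - mean_t) for c, t in zip(cue_values, targets))
    if s_cc == 0:
        raise ValueError("cue has no variation, so its weight is undefined")
    weight = s_ct / s_cc
    intercept = mean_t - weight * mean_c
    return intercept, weight


# Hypothetical past data: vintage age in years and a made-up price index.
ages = [5.0, 10.0, 15.0, 20.0, 25.0]
price_index = [12.0, 19.0, 33.0, 41.0, 48.0]
intercept, age_weight = fit_one_cue(ages, price_index)
predicted_for_age_30 = intercept + age_weight * 30.0
```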

There are other ways to construct SPRs, but rather than survey these details, I will instead survey the incredible success of SPRs.

And that is barely scratching the surface.

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).

So why aren't SPRs in use everywhere? Probably, suggest Bishop & Trout, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn't we use them?

Robyn Dawes (2002) drew out the normative implications of such studies:

If a well-validated SPR that is superior to professional judgment exists in a relevant decision making context, professionals should use it, totally absenting themselves from the prediction.

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you're considering, you need not waste your brain power trying to make a careful judgment. Just take an outside view and use the damn SPR.4

 

 

Recommended Reading

 

Notes

1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies "do not form a pocket of predictive excellence in which [experts] could profitably specialize," Grove and Meehl speculated that these 8 outliers may be due to random sampling error.

2 Readers of Less Wrong may recognize SPRs as a relatively simple type of expert system.

3 But, see Anatoly_Vorobey's fine objections.

4 There are occasional exceptions, usually referred to as "broken leg" cases. Suppose an SPR reliably predicts an individual's movie attendance, but then you learn he has a broken leg. In this case it may be wise to abandon the SPR. The problem is that there is no general rule for when experts should abandon the SPR. When they are allowed to do so, they abandon the SPR far too frequently, and thus would have been better off sticking strictly to the SPR, even for legitimate "broken leg" instances (Goldberg 1968; Sawyer 1966; Leli and Filskov 1984).

 

References

Bloom & Brundage (1947). "Predictions of Success in Elementary School for Enlisted Personnel", Personnel Research and Test Development in the Naval Bureau of Personnel, ed. D.B. Stuit, 233-61. Princeton: Princeton University Press.

Carpenter, Gardner, McWeeny, & Emery (1977). "Multistage scoring system for identifying infants at risk of unexpected death", Arch. Dis. Childh., 53: 606−612.

Carroll, Winer, Coates, Galegher, & Alibrio (1988). "Evaluation, Diagnosis, and Prediction in Parole Decision-Making", Law and Society Review, 17: 199-228.

Dawes (1971). "A Case Study of Graduate Admissions: Applications of Three Principles of Human Decision-Making", American Psychologist, 26: 180-88.

Dawes (2002). "The Ethics of Using or Not Using Statistical Prediction Rules in Psychological Practice and Related Consulting Activities", Philosophy of Science, 69: S178-S184.

DeVaul, Jervey, Chappell, Carver, Short, & O'Keefe (1987). "Medical School Performance of Initially Rejected Students", Journal of the American Medical Association, 257: 47-51.

Faust & Ziskin (1988). "The expert witness in psychology and psychiatry", Science, 241: 1143−1144.

Goldberg (1968). "Simple Models of Simple Process? Some Research on Clinical Judgments", American Psychologist, 23: 483-96.

Golding, Limerick, & MacFarlane (1985). Sudden Infant Death. Somerset: Open Books.

Edwards & Edwards (1977). "Marriage: Direct and Continuous Measurement", Bulletin of the Psychonomic Society, 10: 187-88.

Howard & Dawes (1976). "Linear Prediction of Marital Happiness", Personality and Social Psychology Bulletin, 2: 478-80.

Hunter & Hunter (1984). "Validity and utility of alternate predictors of job performance", Psychological Bulletin, 96: 72-98.

Leli & Filskov (1984). "Clinical Detection of Intellectual Deterioration Associated with Brain Damage", Journal of Clinical Psychology, 40: 1435–1441.

Libby (1976). "Man versus model of man: Some conflicting evidence", Organizational Behavior and Human Performance, 16: 1-12.

Lowry (1975). "The identification of infants at high risk of early death", Med. Stats. Report, London School of Hygiene and Tropical Medicine.

Milstein, Wilkinson, Burrow, & Kessen (1981). "Admission Decisions and Performance during Medical School", Journal of Medical Education, 56: 77-82.

Oskamp (1965). "Overconfidence in Case Study Judgments", Journal of Consulting Psychology, 63: 81-97.

Sawyer (1966). "Measurement and Prediction, Clinical and Statistical", Psychological Bulletin, 66: 178-200.

Stillwell, Barron, & Edwards (1983). "Evaluating Credit Applications: A Validation of Multiattribute Utility Weight Elicitation Techniques", Organizational Behavior and Human Performance, 32: 87-108.

Swets, Dawes, & Monahan (2000). "Psychological Science Can Improve Diagnostic Decisions", Psychological Science in the Public Interest, 1: 1–26.

Thornton (1977). "Linear Prediction of Marital Happiness: A Replication", Personality and Social Psychology Bulletin, 3: 674-76.

Wiesner & Cronshaw (1988). "A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview", Journal of Applied Psychology, 61: 275-290.

Wittman (1941). "A Scale for Measuring Prognosis in Schizophrenic Patients", Elgin Papers 4: 20-33.

199 comments

Comments sorted by top scores.

comment by Anatoly_Vorobey · 2011-01-18T16:47:07.175Z · LW(p) · GW(p)

I'm skeptical, and will now proceed to question some of the assertions made/references cited. Note that I'm not trained in statistics.

Unfortunately, most of the articles cited are not easily available. I would have liked to check the methodology of a few more of them.

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do.

The paper doesn't actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices - not to experts' predictions. I think it's fair to say that the claim I quoted overreaches.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the paper was published. The NYTimes article about it which you reference is from 1990 (the paper bizarrely dates it to 1995; I'm not sure what's going on there).

The fact that there's a linear model - not specified precisely anywhere in the article - which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper - 1986 is clearly the worst of the 1980s). NYTimes says "When the dust settles, he predicts, it will be judged the worst vintage of the 1980's, and no better than the unmemorable 1974's or 1969's". The 1995 paper says, more modestly, "We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s". Second, the 1989-1990 vintages are predicted to be "outstanding" (paper), "stunningly good" (NYTimes), and, "adjusted for age, will outsell at a significant premium the great 1961 vintage" (NYTimes).

It's now 16 years later. How do we test these predictions?

First, I've stumbled on a different paper from the primary author, Prof. Ashenfelter, from 2007. Published 12 years later than the one you reference, this paper has substantially the same contents, with whole pages copied verbatim from the earlier one. That, by itself, worries me. Even more worrying is the fact that the 1986 prediction, prominent in the 1990 article and the 1995 paper, is completely missing from the 2007 paper (the data below might indicate why). And most worrying of all is the change of language regarding the 1989/1990 prediction. The 1995 paper says about its prediction that the 1989/1990 will turn out to be outstanding, "Many wine writers have made the same predictions in the trade magazines". The 2007 paper says "Ironically, many professional wine writers did not concur with this prediction at the time. In the years that have followed minds have been changed; and there is now virtually unanimous agreement that 1989 and 1990 are two of the outstanding vintages of the last 50 years."

Uhm. Right. Well, because the claims aren't strong enough, they do not exactly contradict each other, but this change leaves a bad taste. I don't think I should give much trust to these papers' claims.

The data I could find quickly to test the predictions is here. The prices are broken down by the chateaux, by the vintage year, the packaging (I've always chosen BT - bottle), and the auction year (I've always chosen the last year available, typically 2004). Unfortunately, Ashenfelter underspecifies how he came up with the aggregate prices for a given year - he says he chose a package of the best 15 wineries, but doesn't say which ones or how the prices are combined. I used 5 wineries that are specified as the best in the 2007 paper, and looked up the prices for years 1981-1990. The data is in this spreadsheet. I haven't tried to statistically analyze it, but even from a quick glance, I think the following is clear. 1986 did not stabilize as the worst year of the 1980s. It is frequently second- or third-best of the decade. It is always much better than either 1984 or 1987, which are supposed to be vastly better according to the 1995 paper's weather data (see Figure 3). 1989/1990 did turn out well, especially 1990. Still, they're both nearly always less expensive than 1982, which is again vastly inferior in the weather data (it isn't even in the best quarter). Overall, I fail to see much correlation between the weather data in the paper for the 1980s, the specific claims about 1986 and 1989/1990, and the market prices as of 2004. I wouldn't recommend using this SPR to predict market prices.

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It's difficult to tell, but my prediction is "yes". I anticipate overfitting and shoddy methodology. I anticipate huge influence of the selection bias - the authors that publish these kinds of papers will not publish a paper that says "The experts were better than our SPR". And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate modest effect in a very specific situation with a small sample size.

I've looked at your second example:

Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting].

I couldn't find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say "predict marital happiness", it really means "predicts one of the partners' subjective opinion of their marital happiness" - as opposed to e.g. stability of the marriage over time. There's no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak - about 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they're asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.
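For readers who want to see what the two-cue rule and its correlation with a 7-point happiness rating look like mechanically, here is a small sketch. The couples below are invented purely for illustration and are not data from Howard & Dawes; `marital_score` and `pearson_r` are hypothetical helper names.

```python
import math


def marital_score(lovemaking_rate, fighting_rate):
    """Unit-weight rule quoted above: P = lovemaking rate - fighting rate."""
    return lovemaking_rate - fighting_rate


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Invented couples: (lovemaking per month, fights per month, happiness on a 1-7 scale).
couples = [(12, 3, 6), (4, 9, 2), (8, 8, 4), (10, 2, 7), (3, 5, 3)]
scores = [marital_score(love, fight) for love, fight, _ in couples]
ratings = [rating for _, _, rating in couples]
r = pearson_r(scores, ratings)
shared_variance = r ** 2  # for comparison: r = 0.4 gives 0.16, r = 0.8 gives 0.64
```

Squaring the correlation is one common way to read it: a correlation of 0.4 corresponds to about 16% shared variance, while 0.8 corresponds to about 64%.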

Finally, the following claim is the single most objectionable one in your post, to my taste:

If you're hiring, you're probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane - and is highly dangerous advice. I'm not able to view the papers you base it on, but if they're anything like the first and second example, they're far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically over what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn't follow that it's good advice for you to abstain from interviewing - it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

Replies from: bentarm, shokwave, lukeprog, CronoDAS, XiXiDu, shokwave
comment by bentarm · 2011-01-19T00:41:17.744Z · LW(p) · GW(p)

If you're hiring, you're probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane - and is highly dangerous advice... My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

The whole point of this article is that experts often think themselves better than SPRs when actually they perform no better than SPRs on average. Here we have an expert telling us that he thinks he would perform better than an SPR. Why should we be interested?

Replies from: Anatoly_Vorobey
comment by Anatoly_Vorobey · 2011-01-19T11:18:38.171Z · LW(p) · GW(p)

Because I didn't just state a blanket opinion. I dug into the studies, looked for data to test one of them in depth, and found it to be highly flawed. I called into question the methodology employed by the studies, as well as overgeneralizing and overreaching conclusions they're drummed up to support. The evidence that at least some studies are flawed and the methodology is shoddy should make you question the universal claim "... actually they perform no better than SPRs on average". That's why you should be interested.

My personal experience with interviewing is certainly not as important a piece of evidence against the article as the specific criticisms of the studies. It's just another anecdotal data point. That's why I didn't expand on it as much as I did on the wine study, although I do believe it can be made more convincing through further elucidation.

comment by shokwave · 2011-01-18T17:02:11.752Z · LW(p) · GW(p)

My own experience strongly suggests to me that this claim is inane ... it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference.

What evidence do you have that you are better than average?

My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial

"It is difficult to get a man to understand something, when his salary depends upon his not understanding it!"

Replies from: Desrtopa, Dr_Manhattan
comment by Desrtopa · 2011-01-18T17:26:54.812Z · LW(p) · GW(p)

I have heard of one job interview that I felt constituted a useful tool that could not effectively be replaced by resume examination and statistical analysis. A friend of mine got a job working for a company that provides mathematical modeling services for other companies, and his "interview" was a several hour test to create a number of mathematical models, and then explaining to the examiner in layman's terms how and why the models worked.

Most job interviews are really not a demonstration of job skills and aptitude, and it's possible to simply bullshit your way through them. On the other hand, if you have a simple and direct way to test the competence of your applicants, then by all means use it.

Replies from: datadataeverywhere, knb
comment by datadataeverywhere · 2011-01-19T01:54:20.216Z · LW(p) · GW(p)

I'm most familiar with interviews for programming jobs, where an interview that doesn't ask the candidate to demonstrate job-specific skills, knowledge and aptitude is nearly worthless. These jobs are also startlingly prone to resume distortion that can make vastly different candidates look similar, especially recent graduates.

Asking for coding samples and calling previous employers, especially if coupled with a request for code solving a new (requested) problem, could potentially replace interviews. However, judging the quality of code still requires a person, so that doesn't seem to really change things to me.

Replies from: sketerpot
comment by sketerpot · 2011-01-19T22:34:33.365Z · LW(p) · GW(p)

That's what I think of, too, when I hear the phrase "job interview". Is this not typical outside fields like programming?

Replies from: retiredurologist
comment by retiredurologist · 2011-01-19T23:17:14.176Z · LW(p) · GW(p)

I can confirm that such a "job interview" is not common in medicine. The potential employer generally relies on the credentialing process of the medical establishment. Most physicians, upon completing their training, pass a test demonstrating their ability to regurgitate the teachers' passwords, and are recommended to the appropriate certification board as "qualified" by their program director; to do otherwise would reflect badly on the program. Also, program directors are loath to remove a resident/fellow during advanced training because some warm body must show up to do the work, or the professor himself/herself might have to fill in. It is difficult to find replacements for upper level residents; the only common reason such would be available is dismissal/transfer from another program. Consequently, the USA turns out physicians of widely varied skill levels, even though their credentials are similar. In surgical specialities, it is not unusual for a particularly bright individual with all the passwords but very poor technical skills to become a surgical professor.

Replies from: Desrtopa
comment by Desrtopa · 2011-01-19T23:22:15.410Z · LW(p) · GW(p)

My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn't remember what to do.

Replies from: wedrifid
comment by wedrifid · 2011-01-19T23:55:53.574Z · LW(p) · GW(p)

My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn't remember what to do.

The (rumored) student has my respect. I would expect most surgeons to have too much of an ego to admit to that doubt rather than stumble ahead full of hubris. It would be comforting to know that your surgeon acted as if (as opposed to merely believing that) he cared more about the patient than the immediate perception of status loss. (I wouldn't care whether that just meant his thought out anticipation of future status loss for a failed operation overrode his immediate social instincts.)

comment by knb · 2011-01-19T02:32:42.039Z · LW(p) · GW(p)

That isn't an interview, it's a test. Tests are extremely useful. IQ tests are an excellent predictor of job performance, maybe the best one available. Regardless, IQ tests are usually de facto illegal in the US due to disparate impact.

Replies from: Desrtopa
comment by Desrtopa · 2011-01-19T06:24:41.523Z · LW(p) · GW(p)

I put interview in quotes because they called it an interview. Speaking broadly enough, all interviews are tests, but most are unstructured and not very good at examining the relevant predictor variables. Not all tests are interviews, of course, but the part where they had applicants explain their processes in layman's terms might qualify it, at least if you're generous with your definitions.

Of course, it's certainly unclear if not outright incorrect to call it an interview, but that was their choice; possibly they felt that subjecting applicants to a "test" rather than an "interview" projected a less positive image.

comment by Dr_Manhattan · 2011-01-18T21:21:00.487Z · LW(p) · GW(p)

"It is difficult to get a man to understand something, when his salary depends upon his not understanding it!"

I don't think it's fair, as his job is not being an interviewer, but perhaps hiring smart people we can benefit from.

comment by lukeprog · 2011-01-18T19:14:46.712Z · LW(p) · GW(p)

Cool, I'll look into these points.

I made one small change so far. The above article now reads: "Reaction from the wine-tasting industry to such wine-predicting SPRs has been 'somewhere between violent and hysterical.'"

Also, I'll post links to the specific papers when I have time to visit UCLA and grab them.

Psychology is not my field, but my understanding is that the 'interview effect' for unstructured interviews is a very robust finding across many decades. For more, you can listen to my interview with Michael Bishop. But hey, maybe he's wrong!

Update 1: If I read the 1995 study correctly, they judged the accuracy of wine tasters by comparing the price of immature wines to those of mature wines, but I'm not sure. The way I phrased that is from Bishop & Trout, and that is how Bishop recalls it, though it's been several years now since he co-wrote Epistemology and the Psychology of Human Judgment.

comment by CronoDAS · 2011-01-19T07:41:05.383Z · LW(p) · GW(p)

Regarding hiring, I think the keyword might be "unstructured" - what makes an interview an "unstructured" interview?

Replies from: Anatoly_Vorobey
comment by Anatoly_Vorobey · 2011-01-19T08:26:43.778Z · LW(p) · GW(p)

That's what I thought too. The definitions I found searching all say that any interview where you decide what to ask and how to interpret the results is "unstructured". The only "structured" interviews seem to be tests with pre-determined sets of questions, and the candidate's answers judged by formal criteria.

I'm not sure this division of the "interview-space" is all that useful. I would distinguish three categories:

  1. You have an informal chat with me about the nature of the job, my experience, my previous employment, my claims about my aptitude, etc. Your impressions from this chat determine your judgement of my suitability for the job.
  2. You ask me to answer questions or perform tasks that demonstrate my aptitude. It's up to you to choose the tasks, interpret my performance, and guide the whole process.
  3. You give me a pre-determined set of questions/tasks that is the same for all candidates. My answers are mechanically interpreted by whether they coincide with the pre-determined set of correct answers.

If I interpret the definitions I could find correctly, 3 is a "structured" interview, and both 1 and 2 are "unstructured". To my mind, there's a world of difference between 1 and 2, however. 1 is of very limited utility (I want to say "next to worthless", but that'd be too presumptuous), and, quite possibly, does no better than deciding on the basis of the resume alone, though I'd still want to see the data to be convinced. 2, when performed by a trained and calibrated interviewer, is - again, in my own experience - obviously superior both to 1 and to deciding on the basis of the resume alone. Maybe this is somehow unique to the profession I interview for, but I doubt it.

Suppose there's research which demonstrates that in some setting type 1 interviews are worse than using the resume alone. I don't know whether this is the case in the papers cited in this post (I couldn't read them), but I find it plausible. Suppose then that the conclusions drawn are the universal statements "unstructured interviews reliably degrade the decisions of gatekeepers" and "if you're hiring, you're probably better off not doing interviews". I consider such conclusions then to be obviously unsubstantiated, incredibly overreached, and highly dangerous advice.

comment by XiXiDu · 2011-01-18T18:38:45.984Z · LW(p) · GW(p)

The interview example makes sense to me if the usual hiring manager is strongly biased by information that is not crucial. A dossier gives only a little information, but it is the important part. In a face-to-face interview various other factors can play a role (often unconsciously), e.g. smell or the ability to return a look.

Replies from: XiXiDu
comment by XiXiDu · 2011-01-19T13:11:04.269Z · LW(p) · GW(p)

More here. Surely that isn't strong evidence, but it is another indication that if you are not an LW-type person, then information that is not crucial might alter your perception and subsequent decision when doing face-to-face interviews versus dossier-based ruling.

comment by shokwave · 2011-01-18T17:03:15.539Z · LW(p) · GW(p)

Read the Dawes pdf linked in the top post. I can't speak for the other examples, but that one is solid.

edit: my apologies, re-reading I see you discussed the marriage example. What is your opinion on the graduate rating and Hodgkin's disease examples?

Replies from: Perplexed, DanielVarga
comment by Perplexed · 2011-01-19T03:53:30.346Z · LW(p) · GW(p)

that one is solid

Why do you say that? My reaction to that paper was very negative. In large part, it was the anecdotal flavor of the arguments made there, but also because I didn't see the two things I was specifically looking for:

  • Citations of studies in which a linear model was constructed using one set of data, and then compared as to performance against the experts using a different set of data.
  • Failing that, some numbers that would convince me that the failure to test models using different data than was used to construct them just doesn't matter.

Instead, here and in the 1996 study by Grove & Meehl, I find arguments from incredulity - in effect: "Do our critics really think that this really matters? Don't be absurd!". I also notice that this ideology is being promoted by a small number of researchers who repeatedly cite each other's work, and do not cite critics (except as strawmen).

comment by DanielVarga · 2011-01-19T21:42:13.704Z · LW(p) · GW(p)

Like Perplexed, I hated this paper. Of course, it has the very good excuse that it is from 1979. But in 2011, it is sort of expected that you evaluate your model on a second, independent dataset. (My models often crash and burn at this stage.) Did any of these studies do this?
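The check being asked for here can be sketched in a few lines: fit the rule on one set of cases, then score it only on cases it has never seen. Everything below is a hypothetical illustration (invented numbers, a one-cue line as the stand-in model, and made-up helper names `fit_line` and `mean_abs_error`), not a description of what any cited study did.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x (approximately)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def mean_abs_error(model, xs, ys):
    """Average absolute gap between the model's predictions and the outcomes."""
    intercept, slope = model
    return sum(abs(intercept + slope * x - y) for x, y in zip(xs, ys)) / len(xs)


# Invented cases, split so that the model never sees the test half while fitting.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0, 7.0], [9.5, 12.8, 13.1]

model = fit_line(train_x, train_y)
in_sample_error = mean_abs_error(model, train_x, train_y)
out_of_sample_error = mean_abs_error(model, test_x, test_y)
# A fair comparison against experts would use out_of_sample_error only.
```

A model that looks good on `in_sample_error` but poor on `out_of_sample_error` has mostly memorized its construction data, which is the overfitting worry raised in this thread.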

comment by Vladimir_M · 2011-01-19T09:07:13.109Z · LW(p) · GW(p)

Also, if I may be permitted to make a more general criticism in response to this post, I would say that while the article appears to be well-researched, it has demonstrated some of the worst problems I commonly notice on this forum. The same goes for the majority of the comments, even though many are knowledgeable and informative. What I have in mind is the fixation on concocting theories about human behavior and society based on various idées fixes and leitmotifs that are parts of the intellectual folklore here, while failing to notice issues suggested by basic common sense that are likely to be far more important.

Thus the poster notices that these models are not used in practice despite considerable evidence in their favor, and rushes to propose cognitive biases à la Kahneman & Tversky as the likely explanation. This without even stopping to think of two questions that just scream for attention. First, what is the importance of the fact that just about any issue of sorting out people is nowadays likely to be ideologically charged and legally dangerous? Second, what about the fact that these models are supposed to throw some high-status people out of work, and in a way that makes them look like they've been incompetent all along?

Regardless of whether various hypotheses based on these questions have any merit, the fact that someone could write a post without even giving them the slightest passing attention, offering instead a blinkered explanation involving the standard old LW/OB folklore, and still get upvoted to +40 is, in my opinion, indicative of some severe and widespread biases.

Replies from: lukeprog, Miller, FeepingCreature, shokwave
comment by lukeprog · 2011-01-19T15:58:28.740Z · LW(p) · GW(p)

My intent was to summarize the literature on SPRs, not provide an account for why they are not used more widely. I almost didn't include that sentence at all. Surely, more analysis would be important to have in a post intending to discuss the psychological issues involved in our reaction to SPRs, but that was not my subject.

In pointing to cognitive biases as an explanation, I was merely repeating what Bishop & Trout & Dawes have suggested on the matter, not making up my own explanations in light of LW lore.

In fact, the arrows point the other way. Many of the authors cited in my article worked closely with people like Kahneman who are the original academic sources of much of LW lore.

Edit: I've added a clause about the source of the "cognitive biases" suggestion, in case others are tempted to make the same mistaken assumption as you made.

comment by Miller · 2011-01-20T08:01:26.944Z · LW(p) · GW(p)

While this post has +40 upvotes, the majority of the top-voted comments are skeptical of it. I think this represents confusion as to how to upvote, although this is merely a hypothesis. The article surveys a very interesting topic that is right in the sweet spot of interest for this community, and it also appears scholarly; however, the conclusions synthesized by the author strike me as naive, and I suspect that's also the conclusion of the majority. Whether it deserves an upvote is debatable. I downvoted.

Replies from: DanielVarga
comment by DanielVarga · 2011-01-20T17:10:05.233Z · LW(p) · GW(p)

I felt the confusion you are talking about. If readers could be expected to read the top-voted replies (RTFC), then the current distribution of votes would be ideal: The interesting article gets some well-deserved attention, and the skeptical replies give a counterbalance. But if readers don't read the comments, then frankly I think this article got too many upvotes when compared to many others.

Offtopic: Is there a meta thread somewhere discussing the semantics of votes? I am happy that we don't use slashdot's baroque insightful/interesting/funny distinctions, but some consensus about the meaning of +1 would be nice.

Replies from: TheOtherDave, Kevin
comment by TheOtherDave · 2011-01-20T17:26:01.340Z · LW(p) · GW(p)

I don't know about a meta-thread, but the rule of thumb I've seen quoted often is "upvote what you want more of; downvote what you want less of." Karma scores are intended, on this view, as an indicator of how many people (net) want more entries like that.

One implication of this view is that a score of 40 isn't "ten times better" than a score of 4, it just means that many more people want to see posts like this than don't want to.

Of course, this view competes with people's entirely predictable tendency to treat karma as an indicator of the entry's (and the user's) overall worth, or as a game to maximize one's score on, or as a form of reward/punishment.

Equally predictably, this predictable but unintended use of karma far far far outweighs the intended use.

Replies from: Will_Sawin
comment by Will_Sawin · 2011-01-24T16:39:24.531Z · LW(p) · GW(p)

Karma-maximizing is often but not always a good approximation to worth-as-judged-by-community maximizing, which is a good thing to maximize.

Replies from: TheOtherDave, Will_Newsome
comment by TheOtherDave · 2011-01-24T16:48:51.600Z · LW(p) · GW(p)

Yes. The question is how significant the gap between "often" and "always" is.

comment by Will_Newsome · 2011-01-24T16:46:20.270Z · LW(p) · GW(p)

Though if you have a target audience in mind, it is sometimes worth posting things that will be downvoted by the community-at-large.

(I've been doing this a lot recently, though I plan on cutting back and regaining some general rationalist credibility.)

comment by Kevin · 2011-01-24T16:53:52.243Z · LW(p) · GW(p)

The new meta-thread is due, feel free to make it. http://lesswrong.com/lw/1w4/fall_2010_meta_thread/

comment by FeepingCreature · 2011-01-19T13:27:44.098Z · LW(p) · GW(p)

The second reason is invalid unless the actor is self-deluding - a smart actor that faces being put out of work would silently adopt a SPR as his decision-making system without admitting to it. Since the superiority of SPR continues in many fields, either relevant actors are consistently not smart, performance is not a significant contributing criterion to their success, or they're self-deluding, i.e. overrating their own judgment as the poster stated. [edit] I'd guess a combination of the last two.

Replies from: Vladimir_M
comment by Vladimir_M · 2011-01-19T17:51:15.305Z · LW(p) · GW(p)

Yes, I'd say it's a combination of the last two points, with emphasis on the second last.

The critical question is whether maximizing the accuracy of your judgments is a practical way to get ahead in a given profession. Sometimes that is indeed the case, and in such fields we indeed see tremendous efforts to automate as much expert work as possible, often with great success, as in the electronics industry. But in professions that operate as more tightly-knit guilds, adherence to accepted standards is much more important than any objective metrics of effectiveness. Stepping outside of standard work procedures is often treated as a serious infraction with potentially severe consequences. (Especially if your non-standard methodology fails in some particular case, as it will sooner or later, and you can't cover your ass by claiming that you followed all the standard accepted procedures and having your profession back you up organizationally.)

Now, you could try enhancing your work with decision models in secret. But even then, it's hard to do it in a completely secretive way, and moreover, human minds being what they are, most people can achieve professional success only if they are really sincerely convinced of their expertise and effectiveness. Keeping a public facade is hard for everyone except a very small minority of people.

comment by shokwave · 2011-01-19T14:08:34.063Z · LW(p) · GW(p)

First, what is the importance of the fact that just about any issue of sorting out people is nowadays likely to be ideologically charged and legally dangerous? Second, what about the fact that these models are supposed to throw some high-status people out of work, and in a way that makes them look like they've been incompetent all along?

I am not sure what you think the answers to these questions are, but I would say my personal opinion on the matter is that the more ideologically charged and legally dangerous a matter is, the more important accuracy and correctness - at the expense, if necessary, of strongly-held beliefs. I would also say that protecting the reputation of competency enjoyed by high-status people is not an activity that strongly correlates with being right; I predict a small negative correlation, in fact.

Furthermore, there is a selection effect: learning the LW/OB folklore will result in you noticing specific cases of their application, and you are far, far more likely to write a post about that than about any given subject. That is, you see a prevalence of "standard bias explanation" because top-level posters are actively looking for actual cases of bias to discuss.

comment by Vladimir_M · 2011-01-18T21:37:22.303Z · LW(p) · GW(p)

So why aren't SPRs in use everywhere? Probably, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn't we use them?

Without even getting into the concrete details of these models, I'm surprised that nobody so far has pointed out the elephant in the room: in contemporary society, statistical inference about human behavior and characteristics is a topic bearing tremendous political, ideological, and legal weight. [*] Nowadays there exists a firm mainstream consensus that the use of certain sorts of conditional probabilities to make statistical predictions about people is discriminatory and therefore evil, and doing so may result not only in loss of reputation, but also in serious legal consequences. (Note that even if none of the forbidden criteria are built into your decision-making explicitly, that still doesn't leave you off the hook -- just search for "disparate impact" if you don't know what I'm talking about.)

Now of course, making any prediction about people at all necessarily involves one sort of statistical discrimination or another. The boundaries between the types of statistical discrimination that are considered OK and those that are considered evil and risk legal liability are an arbitrary result of cultural, political, and ideological factors. (They would certainly look strange and arbitrary to someone who isn't immersed in the culture that generated them to the point where they appear common-sensical or at least explicable.) Therefore, while your model may well be accurate in estimating the probability of recidivism, job performance, etc., it's unlikely that it will be able to navigate the social conventions that determine these forbidden lines. A lot of the seemingly absurd and ineffective rituals and regulations in modern business, government, academia, etc. exist exactly for the purpose of satisfying these complex constraints, even if they're not commonly thought of as such.

--

[*] Edit: I missed the comment below in which the commenter Student_UK already raised a similar point.

Replies from: Quirinus_Quirrell
comment by Quirinus_Quirrell · 2011-01-19T01:31:54.281Z · LW(p) · GW(p)

If the best way to choose who to hire is with a statistical analysis of legally forbidden criteria, then keep your reasons secret and shred your work. Is that so hard?

Replies from: gwern, Vladimir_M
comment by gwern · 2011-01-19T01:35:17.704Z · LW(p) · GW(p)

A loophole already closed: http://en.wikipedia.org/wiki/Disparate_impact#The_80.25_rule

Replies from: Quirinus_Quirrell
comment by Quirinus_Quirrell · 2011-01-19T02:16:15.219Z · LW(p) · GW(p)

That doesn't close the loophole, it adds a constraint. And it's only significant for those who both hire enough people to be vulnerable to statistical analysis of their hiring practices, and receive too many bad applicants from protected classes. If it is a significant constraint, you want to find that out from the data, not from guesswork, and apply the minimum legally acceptable correction factor.

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

Replies from: SilasBarta, Strange7, gwern, Barry_Cotter
comment by SilasBarta · 2011-01-21T19:21:09.174Z · LW(p) · GW(p)

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

You joke, but the world [1] really is choking with inefficient, kludgey workarounds for the legal prohibition of effective employment screening. For example, the entire higher education market has become, basically, a case of employers passing off tests to universities that they can't legally administer themselves. You're a terrorist if you give an IQ test to applicants, but not if you require a completely irrelevant college degree that requires taking the SAT (or the military's ASVAB or whatever they call it now).

It feels so good to ban discrimination, as long as you don't have to directly face the tradeoff you're making.

[1] Per MatthewW's correction, this should read "Western developed economies" instead of "the world" -- though I'm sure the phenomenon I've described is more general than the form it takes in the West.

Replies from: MatthewW
comment by MatthewW · 2011-01-21T19:25:47.656Z · LW(p) · GW(p)

You say 'the world', but it seems to me you're talking about a region which is a little smaller.

Replies from: GLaDOS, SilasBarta
comment by GLaDOS · 2011-07-23T14:07:06.639Z · LW(p) · GW(p)

I'm not sure the correction is that relevant. The US and the EU together make up about 40% of global GDP (PPP).

Several minor economies with nearly identical conditions and restrictions such as Canada, New Zealand, Australia, South Africa, Norway, Switzerland ... add up to another 3% or so. Most states in Latin America have similar legal prohibitions as well; they are not as well enforced, but avoiding them still imposes costs. This is mentioning nothing of Japan or other developed East Asian economies (though to be fair losses are probably much smaller than in the developed West and perhaps even Latin America).

The other half of the world has a massive opportunity cost due to the mentioned half's described inefficiency. Converting this loss into number of lives or quality of life is a depressing exercise.

Fortunately that is only a problem if you care about humans.

Replies from: MatthewW
comment by MatthewW · 2011-07-27T20:06:33.899Z · LW(p) · GW(p)

Well, I'm in the UK, and there's no law against using IQ-style tests for job applicants here. Is that really the case in the US? (I assume the "You're a terrorist" bit was hyperbole.)

Employers here still often ask for apparently-irrelevant degrees. But admission to university here isn't noticeably based on 'generic' tests like the SAT; it's mostly done on the grades from subject-specific exams. So I doubt employers are treating the degrees as a proxy for SAT-style testing.

comment by SilasBarta · 2011-01-21T19:27:47.202Z · LW(p) · GW(p)

Correction accepted.

comment by Strange7 · 2011-01-21T18:57:59.087Z · LW(p) · GW(p)

Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

You really are new here, aren't you?

http://en.wikipedia.org/wiki/Americans_with_Disabilities_Act_of_1990#Title_III_-_Public_Accommodations_.28and_Commercial_Facilities.29

http://en.wikipedia.org/wiki/Zoning

In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport, and a lack of teleportation-specific case law would not work in your favor, given the judge's access to statements you've already made.

Replies from: Quirinus_Quirrell
comment by Quirinus_Quirrell · 2011-01-22T00:12:17.847Z · LW(p) · GW(p)

In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport,

The Americans with Disabilities Act limits what you can build (every building needs ramps and elevators), not where you can build it. Zoning laws are blacklist-based, not whitelist-based, so extradimensional spaces are fine. More commonly, you can easily find office space in locations that poor people can't afford to live near. And in the unlikely event that race or national origin is the key factor, you get to choose which country or city's demographics you want.

A lack of teleportation-specific case law would not work in your favor, given the judge's access to statements you've already made.

This is the identity under which I speak freely and teach defense against the dark arts. This is not the identity under which I buy office buildings and hire minions. If it was, I wouldn't be talking about hiring strategies.

Replies from: None, Strange7
comment by [deleted] · 2011-01-26T10:48:29.860Z · LW(p) · GW(p)

This is the identity under which I speak freely and teach defense against the dark arts. This is not the identity under which I buy office buildings and hire minions. If it was, I wouldn't be talking about hiring strategies.

Upvoted for having the sense to employ a blindingly obvious strategy that somehow consistently fails to become common sense.

comment by Strange7 · 2011-01-22T01:37:22.348Z · LW(p) · GW(p)

More commonly, you can easily find office space in locations that poor people can't afford to live near.

But that they could, in principle, walk to and from.

comment by gwern · 2011-01-19T15:08:00.332Z · LW(p) · GW(p)

That doesn't close the loophole, it adds a constraint.

Yes, it does close the loophole. You say conceal the cause (intent to discriminate) and you can get away with as much effect (disproportionate exclusion) as you want. Except the law already specifies that the effect is punishable as well as the cause.

So now the best you can do, assuming the populations are equally competent and suited for the job, is 20% discrimination.
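To make the arithmetic behind the 80% figure concrete, here is a minimal sketch of the usual four-fifths comparison: a group's selection rate divided by the selection rate of the most-selected group, checked against 0.8. The function names and the applicant counts are hypothetical, chosen only for illustration, and the check is just the rule of thumb, not the whole legal test.

```python
def selection_rate(hired: int, applicants: int) -> float:
    """Fraction of applicants from one group who were hired."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return hired / applicants


def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Ratio of a group's selection rate to the reference (highest) rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return group_rate / reference_rate


def passes_four_fifths(group_rate: float, reference_rate: float,
                       threshold: float = 0.8) -> bool:
    """True if the group's rate is at least `threshold` of the reference rate."""
    return impact_ratio(group_rate, reference_rate) >= threshold


# Hypothetical counts, for illustration only.
reference = selection_rate(hired=50, applicants=100)  # 0.50
group_a = selection_rate(hired=45, applicants=100)    # 0.45, ratio about 0.9
group_b = selection_rate(hired=30, applicants=100)    # 0.30, ratio about 0.6

print(passes_four_fifths(group_a, reference))  # within the 80% threshold
print(passes_four_fifths(group_b, reference))  # below the 80% threshold
```

With equally suited populations, the lowest rate that still passes is 80% of the reference rate, which is where the "20%" above comes from.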

And of course, in the real world, populations usually differ in their suitability for the job. Blacks tend not to have as many CS degrees as whites, for example. So if you are an employer of CS degrees, you may not be able to get away with any discrimination before you have breached the 20% limit, and may need to discriminate against the non-blacks in order to be compliant.

Besides, it's not like muggles are a protected class.

I would suspect that if the US Muggle legal system had anything to say about it, they would be. If magical-ness is conferred by genes, then it's violating either the general racial guideline or it's violating recent laws (signed by GWB, IIRC) forbidding employer discrimination based on genetics (in the context of genome sequencing, true, but probably general). If it's not conferred by genes, then there may be a general cultural basis on which to sue (Muggles as disabled because they lack an ability necessary for basic functioning in Wizarding society, perhaps).

Replies from: topynate
comment by topynate · 2011-01-19T15:31:21.006Z · LW(p) · GW(p)

You can put degree requirements on the job advertisement, which should act as a filter on applications, something that can't be caught by the 80% rule.

(Of course, universities tend to use racial criteria for admission in the US, something which, ironically, can be an incentive for companies to discriminate based on race even amongst applicants with CS degrees.)

Replies from: gwern
comment by gwern · 2011-01-19T15:46:17.963Z · LW(p) · GW(p)

The 80% rule is only part of it. Again, racist requirements are an obvious loophole you should expect to have been addressed; you can only get away with a little covert discrimination, if any.

From http://en.wikipedia.org/wiki/Disparate_impact#Unintentional_discrimination :

For example, a fire department requiring applicants to carry a 100 lb (50 kg) pack up three flights of stairs. The upper-body strength required typically has an adverse impact on women. The fire department would have to show that this requirement is job-related for the position. This typically requires employers to conduct validation studies that address both the Uniform Guidelines and professional standards.

If you add unnecessary requirements as a stealth filter, how do you show the requirements are job-related?

Replies from: topynate
comment by topynate · 2011-01-19T16:03:17.411Z · LW(p) · GW(p)

I thought we were talking about how to use necessary requirements without risking a suit, not how to conceal racial preferences by using cleverly chosen proxy requirements. But it looks like you can't use job application degree requirements without showing a business need either.

Replies from: Vladimir_M, gwern
comment by Vladimir_M · 2011-01-20T17:42:50.899Z · LW(p) · GW(p)

topynate:

But it looks like you can't use job application degree requirements without showing a business need either.

The relevant landmark case in U.S. law is the 1971 Supreme Court decision in Griggs v. Duke Power Co. The court ruled that not just testing of prospective employees, but also academic degree requirements that have disparate impact across protected groups are illegal unless they are "demonstrably a reasonable measure of job performance."

Now of course, "a reasonable measure of job performance" is a vague criterion, which depends on controversial facts as well as subjective opinion. To take only the most notable example, these people would probably say that IQ tests are a reasonable measure of performance for a great variety of jobs, but the present legal precedent disagrees. This situation has given rise to endless reams of case law and a legal minefield that takes experts to navigate.

At the end, as might be expected, what sorts of tests and academic requirements are permitted to different institutions in practice depends on arbitrary custom and the public perception of their status. The de facto rules are only partly codified formally. Thus, to take again the most notable example, the army and the universities are allowed to use what are IQ tests in all but name, which is an absolute taboo for almost any other institution.

comment by gwern · 2011-01-19T16:35:01.423Z · LW(p) · GW(p)

I thought we were talking about how to use necessary requirements without risking a suit, not how to conceal racial preferences by using cleverly chosen proxy requirements.

I wasn't. I was talking about how the obvious loopholes are already closed or have been heavily restricted (even at the cost of false positives), and hence how Quirrell's comments are naive and uninformed.

But it looks like you can't use job application degree requirements without showing a business need either.

Yes, that doesn't surprise me in the least.

comment by Barry_Cotter · 2011-01-19T14:17:16.585Z · LW(p) · GW(p)

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

My google-fu is not strong enough to find the legal doctrine, but in the US at least, you can be sued for ~implicit discrimination, i.e. if the newspaper you advertise in has a reader population that does not reflect the general population, you're discriminating against the underrepresented population.

Replies from: shokwave
comment by shokwave · 2011-01-19T14:43:26.605Z · LW(p) · GW(p)

i.e. if the newspaper you advertise in has a reader population that does not reflect the general population, you're discriminating against the under represented population.

...I thought this was a joke. Now... not so sure.

Replies from: Barry_Cotter
comment by Vladimir_M · 2011-01-19T02:11:02.563Z · LW(p) · GW(p)

See the last sentence of my first paragraph above (the one in parentheses).

comment by CronoDAS · 2011-01-18T05:00:20.053Z · LW(p) · GW(p)

An interesting story that I think I remember reading:

One study found that relatively inexperienced psychiatrists were more accurate at diagnosing mental illness than experienced ones. This is because inexperienced psychiatrists stuck closely to checklists rather than rely on their own judgment, and whether or not a diagnosis was considered "accurate" was based on how closely the reported symptoms matched the checklist. ;)

Replies from: Miller, Miller
comment by Miller · 2011-01-18T12:27:22.515Z · LW(p) · GW(p)

If we are measuring the accuracy of A vs. B, we are implicitly measuring A against gold standard C, and B against gold standard C. If a better C is not readily available, we may choose to use A or B as an approximation, the choice of which determines our outcome.

Now I wonder:

Are the people that are sympathetic to the hypothesis that computers are better in the cases above (and ignored because of biases) assuming we made the fallacy of using humans as a gold standard?

Are the people that are sympathetic to the hypothesis that humans are better (and ignored because of biases) assuming we made the fallacy of using computers as a gold standard?

The union of which is a lot of upvotes. I can't decide which was meant.

comment by Miller · 2011-01-20T07:47:00.521Z · LW(p) · GW(p)

This is one of the top 3 rated comments on this post. I think you should specify more directly how this anecdote relates to how you interpret the article's intention.

Replies from: prase
comment by prase · 2011-01-20T12:24:31.780Z · LW(p) · GW(p)

He should specify where he has read that.

Replies from: CronoDAS
comment by CronoDAS · 2011-01-21T13:34:18.490Z · LW(p) · GW(p)

I don't remember. I may have actually heard one of my parents talking about it instead of reading it. So consider it an urban legend.

comment by Psy-Kosh · 2011-01-18T05:06:43.825Z · LW(p) · GW(p)

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs (Leli & Filskov 1985; Goldberg 1968).

Now THAT part is just plain embarrassing. I mean, it's truly a mark of shame upon us if we have a tool that we know works, we are given access to the tool, and we still can't do better than the tool itself, unaided. (EDIT: By "we", I mean "the experts in the relevant fields"... which I guess isn't really a "we" as such, but you know what I mean)

Anyways, are there any nice online indexes or whatever of SPRs that make it easy to put in class of problem and have it find a SPR that's been verified to work for that sort of problem?

Replies from: SilasBarta
comment by SilasBarta · 2011-01-18T18:41:49.103Z · LW(p) · GW(p)

Now THAT part is just plain embarrassing. I mean, it's truly a mark of shame upon us if we have a tool that we know works, we are given access to the tool, and we still can't do better than the tool itself, unaided.

Coincidentally, I was planning to write an article "defending" the use of fallacies on Bayesian grounds. A typical passage would go like this:

People say it's fallacious to appeal to authority. However, if you learn that experts believe X, you should certainly update some finite amount in favor of believing X, as experts are, in general, more likely to believe X if it is true than if it is false -- even as you may find many exceptions.

Indeed, it would be quite a strange world if experts were consistently wrong about a given subject matter X, thus making their opinions for X into evidence against X, because they would have to persist in this error, even knowing that their entanglement with X means they only have to invert their pronouncements or remain agnostic to improve accuracy.
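The update described in that passage can be written in odds form. This is a sketch in notation assumed here, not taken from the passage: E stands for "experts believe X".

```latex
\[
\frac{P(X \mid E)}{P(\neg X \mid E)}
  = \frac{P(E \mid X)}{P(E \mid \neg X)} \cdot \frac{P(X)}{P(\neg X)}
\]
```

Whenever experts are more likely to believe X when it is true than when it is false, the likelihood ratio P(E | X) / P(E | ¬X) is greater than 1, so the posterior odds on X exceed the prior odds. If the ratio were below 1, expert belief would count as evidence against X, which is the "strange world" case.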

Well, it seems we actually do live in such a world, where (some classes of) experts make predictable errors, and don't take trivial steps to make their opinions more accurate (and entangled with the subject matter).

Replies from: Psy-Kosh, JoshuaZ
comment by Psy-Kosh · 2011-01-18T20:21:41.492Z · LW(p) · GW(p)

Well, experts still do better than non-experts on average (afaik), just that they seem to totally ignore tools that could let them do a whole lot better, and also apparently can't do much better than the tools themselves, even when they're able to use the tools.

comment by JoshuaZ · 2011-01-18T18:48:30.298Z · LW(p) · GW(p)

Making predictable errors isn't the same thing as their opinions being anti-correlated with reality.

comment by sketerpot · 2011-01-19T23:00:15.198Z · LW(p) · GW(p)

If anybody would like to try some statistical machine learning at home, it's actually not that hard. The tough part is getting a data set. Once that's done, most of the examples in this article are things you could just feed to some software like Weka, press a few buttons, and get a statistical model. BAM!

Let's try an example. Here is some breast cancer diagnostic data, showing a bunch of observations of people with breast cancer (age, size of tumors, etc.) and whether or not the cancer reoccurred after treatment. Can we predict cancer recurrence?

If you look at it with a decision tree, it turns out that you can get about 70% accuracy by observing two of the several factors that were observed, in a very simple decision procedure. You can do a little better by using something more sophisticated, like a naive Bayes classifier. These show us what factors are the most important, and how.

If you're interested, go ahead and play around. It's pretty easy to get started. Obviously, take everything with a grain of salt, but still, basic machine learning is surprisingly easy.

Replies from: DanielVarga
comment by DanielVarga · 2011-01-21T18:43:53.925Z · LW(p) · GW(p)

I second the advice.

Let me brag a bit. Once in a friendly discussion the following question came up: How to predict for an unknown first name whether it is a male or female name. This was in a context of Hungarian names, as all of us were Hungarians. I had a list of Hungarian first names in digital format. The discussion turned into a bet: I said I can write a program in half an hour that tells with at least 70% precision the sex of a first name it never saw before. I am quite fast with writing small scripts. It wasn't even close: It took me 9 minutes to

  • split my sets of 1000 male and 1000 female names into a random 1000-1000 train-test split,
  • split each name into character 1-, 2- and 3-grams. E.g.: Luca was turned into ^L u c a$ ^Lu uc ca$ ^Luc uca$.
  • feed the training data into a command line tool to train a maxent model,
  • test the accuracy of the model on the unseen test data.
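The featurization step in the list above can be sketched as follows. This is a reconstruction from the single Luca example, not the original script: the function name char_ngrams is hypothetical, and the convention of attaching ^ to a gram that starts the name and $ to a gram that ends it is inferred from that example.

```python
def char_ngrams(name: str, max_n: int = 3) -> list[str]:
    """Character 1..max_n-grams with ^ / $ marking the start / end of the name."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(name) - n + 1):
            gram = name[i:i + n]
            if i == 0:
                gram = "^" + gram   # gram begins at the first character
            if i + n == len(name):
                gram = gram + "$"   # gram ends at the last character
            features.append(gram)
    return features


print(" ".join(char_ngrams("Luca")))
```

Each name's feature list would then be written out, one name per line, in whatever input format the chosen training tool expects.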

The model reached an accuracy of 90%. In retrospect, this is not surprising at all. Looking into the linear model, the most important feature it identified was whether the name ends with an 'a'. This trivial model alone reaches some 80% precision for Hungarian names, so if I knew this in advance, I could have won the bet in 30 seconds instead of 9 minutes, with the sed command s/a$/a FEMALE/.
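The trivial rule can also be sketched in a few lines, together with the kind of accuracy check used on the test split. The names predict_gender and accuracy are hypothetical, the MALE default for every other ending is an assumption (the sed command only tags the names ending in 'a'), and the five hand-picked names are far too few to say anything about the real error rate.

```python
def predict_gender(name: str) -> str:
    """Baseline rule: a name ending in 'a' is guessed FEMALE, anything else MALE."""
    return "FEMALE" if name.lower().endswith("a") else "MALE"


def accuracy(labeled_names: list[tuple[str, str]]) -> float:
    """Fraction of (name, label) pairs where the rule's guess matches the label."""
    correct = sum(predict_gender(name) == label for name, label in labeled_names)
    return correct / len(labeled_names)


# Hand-picked illustration only, not a representative sample.
toy_test_set = [
    ("Luca", "FEMALE"),
    ("Anna", "FEMALE"),
    ("Katalin", "FEMALE"),  # a miss: female name not ending in 'a'
    ("Gábor", "MALE"),
    ("Béla", "MALE"),       # a miss: male name ending in 'a'
]

print(accuracy(toy_test_set))
```

The two misses show the two ways such a rule fails, which is where the n-gram model has room to do better.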

Replies from: matt
comment by matt · 2011-01-27T07:03:46.698Z · LW(p) · GW(p)

These sound like powers I should acquire. Could you drop some further hints on:

  • "a command line tool to train a maxent model"
  • how you tested the accuracy of the model (tools that let you do that in the remaining minutes, rather than general principles)
Replies from: DanielVarga
comment by DanielVarga · 2011-01-27T08:34:31.629Z · LW(p) · GW(p)

I used Zhang Le's tool. Note that it is a rather obscure thing, not an industry standard like, say, the huge Weka and Mallet packages. It made the tasks you ask about very easy. Once I had the train and test data featurized,

maxent -m gender.model train.data

built the model and

maxent -p -m gender.model test.data

told me its accuracy on the test data.

comment by CronoDAS · 2011-01-18T04:55:31.251Z · LW(p) · GW(p)

Are some SPRs easy to exploit?

Replies from: shokwave
comment by shokwave · 2011-01-18T05:40:46.248Z · LW(p) · GW(p)

Depends on what you're measuring. I can't see how it would be exploitable for things like predicting wine quality (actually greenhousing your grapes to control temperature and rainfall might just make them better) but definitely a specific SPR for, say, rating dossiers for hiring would be exploitable if you knew or could guess at which cues it's using.

comment by JGWeissman · 2011-01-18T03:49:20.275Z · LW(p) · GW(p)

SPR's sound a lot like the Outside View.

Replies from: lukeprog, shokwave
comment by lukeprog · 2011-01-18T03:55:46.951Z · LW(p) · GW(p)

Oh yes. Added.

'Outside view', I believe, is a term of Kahneman's, and is used in the literature by lots of these people who work on SPRs, for example Dawes.

Kahneman begins his Edge.org master class on thinking by discussing the outside view.

comment by shokwave · 2011-01-18T04:19:29.759Z · LW(p) · GW(p)

SPRs sound like a method to ensure a very accurate outside view.

comment by SK2 (lunchbox) · 2011-01-19T20:44:42.502Z · LW(p) · GW(p)

This is a great article, but it only lists studies where SPRs have succeeded. In fairness, it would be good to know if there were any studies that showed SPRs failing (and also consider publication bias, etc.).

Replies from: lukeprog
comment by lukeprog · 2011-01-20T07:21:01.162Z · LW(p) · GW(p)

Definitely.

Replies from: Miller
comment by Miller · 2011-01-20T07:55:16.433Z · LW(p) · GW(p)

My principal problem with this article is that you appear to promote the idea that these SPRs are being ignored for extremely bad reasons, rather than that they were ignored for decent reasons. So when you say 'definitely' here I have a problem that you are compartmentalizing the arguments and not admitting the problems with your post.

Also, I don't think this is a great article and in proportion to it getting +40 votes I have a poor opinion of this community (or at least its karma system where 0 should be neutral).

edit: My last paragraph here is excessively dramatic and I retract it.

Replies from: lukeprog
comment by lukeprog · 2011-01-20T16:16:55.883Z · LW(p) · GW(p)

Miller,

Does this look like "not admitting the problems with [my] post"?

Replies from: Miller
comment by Miller · 2011-01-20T23:16:04.469Z · LW(p) · GW(p)

It would be more constructive of me if I actually helped find counter-evidence, rather than whinging about your not doing so. I think you've put a lot of effort into updating your position.

comment by Zachary_Kurtz · 2011-01-18T15:37:20.841Z · LW(p) · GW(p)

Does SPR beat prediction markets?

Replies from: sketerpot
comment by sketerpot · 2011-01-19T22:41:23.802Z · LW(p) · GW(p)

If it did, then you could make a lot of money on a prediction market with enough cash in it. This would cause the market to give better answers.

comment by Dmytry · 2012-02-25T22:35:13.599Z · LW(p) · GW(p)

Well, SPRs can plausibly outperform average expertise. That's because most of the expertise is utter and complete sham.

Take the recidivism example...

The judges, or psychologists, or the like, what in the world makes them experts on predicting the criminals? Did they read an unbiased sample of recidivism? Did they do any practice, earning marks for predicting criminals? Anything?

Resounding no. They never in their lives did anything that should have earned them expert status on this task. They did other stuff that puts them first on the list when you're looking for 'experts' on a topic for which there are no experts.

They are about as much experts on this task as a court janitor is an expert on law. He too did not do anything related to law; he just cleaned the courtroom.

comment by Student_UK · 2011-01-18T11:19:19.049Z · LW(p) · GW(p)

I have two concerns about the practical implementation of this sort of thing:

  1. It seems like there are cases where if a rule is being used then people could abuse it. For example, in job applications or admissions to medical schools. A better understanding of how the rule relates to what it predicts would be needed.

If X+Y predicts Z does that mean enhancing X and Y will up the probability of Z? Not necessarily, consider the example of happy marriages. Will having more sex make your relationship happier? Or does the rule work because happy couples tend to have more sex?

  2. It is not true in every case that we equally value all true beliefs, and equally value all false beliefs. Certain rules might work better if we take into consideration a person's race, sex, religion and nationality. But most people find this sort of thing unpalatable because it can lead to the systematic persecution of sub groups, even if it results in more true, and fewer false, beliefs overall. It also might be the case that some of these rules discriminate against groups of people in more subtle ways that won't be immediately obvious.

Of course, neither of these problems means that there won't be perfectly good cases where these rules would improve decision making a lot.

Replies from: shokwave, MatthewW, cousin_it, wedrifid, BillyOblivion
comment by shokwave · 2011-01-18T12:45:54.375Z · LW(p) · GW(p)

Will having more sex make your relationship happier?

Obviously, yes.

Replies from: TheOtherDave, DSimon
comment by TheOtherDave · 2011-01-18T13:44:35.707Z · LW(p) · GW(p)

It probably depends somewhat on with whom you are having it.

Replies from: shokwave
comment by shokwave · 2011-01-18T13:53:46.978Z · LW(p) · GW(p)

True. One of my nodes for "relationship" is consensual; most definitely in that case it would make the relationship much less happy.

comment by DSimon · 2011-01-18T15:49:15.103Z · LW(p) · GW(p)

Well, unless the quality of the sex is causally linked to the quantity, such that having lots and lots of sex (past a certain threshold) makes each individual session disproportionately worse. This is true for a lot of people's libidos.

To put it another way: it's not the frequency of the motion in the ocean, but the amplitude of the waves.

Replies from: shokwave
comment by shokwave · 2011-01-18T16:53:33.482Z · LW(p) · GW(p)

This is true for a lot of people's libidos.

But probably not true for the quantity of sex in almost all relationships, I would bet.

Replies from: datadataeverywhere
comment by datadataeverywhere · 2011-01-19T02:00:50.033Z · LW(p) · GW(p)

Although I agree with you, I feel like I should point out that it is somewhat nonsensical for most relationships to be sub-optimal in this way. If both parties want to have more sex, and they can (otherwise the question wouldn't really be valid), but they don't, that's a little weird, don't you think?

We can talk about optimizing for other things (e.g. careers), but I don't think that's really the issue, since many couples, when explicitly told that they would be happier if they had more sex, just start having more sex, without sacrificing anything that they end up wanting back.

Replies from: wedrifid
comment by wedrifid · 2011-01-19T02:29:42.399Z · LW(p) · GW(p)

Although I agree with you, I feel like I should point out that it is somewhat nonsensical for most relationships to be sub-optimal in this way. If both parties want to have more sex, and they can (otherwise the question wouldn't really be valid), but they don't, that's a little weird, don't you think?

Weird certainly but this is a kind of weirdness that humans are notorious for. We are terrible happiness optimisers. In the case of sex specifically having more of it is not as simple as walking over to the bedroom. For males and females alike you can want to be having more sex, be aware that having more sex would benefit your relationship and still not be 'in the mood' for it. A more indirect approach to the problem of libido and desire is required - the sort of thing that humans are not naturally good at optimising.

Replies from: datadataeverywhere
comment by datadataeverywhere · 2011-01-19T07:48:29.327Z · LW(p) · GW(p)

I agree on every point. I also think part of this is simply that shared knowledge that is not common knowledge (until acknowledged between parties) is much more difficult to act upon.

I think that "okay, we're going to have sex now, because it will make us happier" is a little like "okay, I'm going to the gym now, because it will make me feel better", which may be the same thing you meant about being "in the mood", but I think it's even harder for sex, because we are perhaps less willing to see sex except as immediate gratification.

Replies from: NancyLebovitz
comment by NancyLebovitz · 2011-01-20T00:53:02.193Z · LW(p) · GW(p)

I've heard more than once that having more sex on a schedule in the hopes of having children is a miserable experience for couples with fertility problems.

I don't know whether having more sex in the hopes of being happier (rather than because the people involved want sex more for the fun of it) could have similar side effects.

Replies from: Blueberry
comment by Blueberry · 2011-01-20T10:37:07.086Z · LW(p) · GW(p)

It's fairly common for sex therapists to recommend that couples schedule sex and have sex at all (but not only) scheduled times, on the grounds that people may not be in the mood at first, but enjoy it anyway. While it may be a miserable experience for a few people, I doubt that it is miserable in general (and I'm not sure why it would be).

Replies from: NancyLebovitz
comment by NancyLebovitz · 2011-01-20T16:06:42.208Z · LW(p) · GW(p)

It's certainly possible for people to have akrasia in regards to pleasure, and scheduling can help with that.

I think possible problems come in if a partner (possibly both partners in the case of fertility) really doesn't want to at the moment, but is feeling pressured.

comment by MatthewW · 2011-01-18T18:43:36.473Z · LW(p) · GW(p)

Yes, several of these models look like they're likely to run into trouble of the Goodhart's law type ("Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes").

comment by cousin_it · 2011-01-19T02:15:50.704Z · LW(p) · GW(p)

Will having more sex make your relationship happier?

I think it's safe to say that having less sex will make the relationship less happy, so there is some causality involved.

Replies from: AstroCJ
comment by AstroCJ · 2011-01-19T19:01:20.827Z · LW(p) · GW(p)

What? Are you from the mythical land where every partnering has the same intensity of sex drive?

comment by wedrifid · 2011-01-19T02:23:11.385Z · LW(p) · GW(p)

Not necessarily, consider the example of happy marriages. Will having more sex make your relationship happier?

Yes. Almost certainly. But there are plenty of other examples you could pick from where there is not causality involved (and some for which causality is negative).

comment by BillyOblivion · 2011-01-20T10:04:20.476Z · LW(p) · GW(p)

Will having more sex make your relationship happier?

Having more sex will make ME happier. If my wife finds out though...

comment by Alexandros · 2011-01-18T10:36:05.995Z · LW(p) · GW(p)

While this is promising indeed, it is wise not to forget about Optimization By Proxy that can occur when the thing being optimised is (or is under the control of) an intelligent agent.

comment by jsteinhardt · 2011-01-18T08:25:02.658Z · LW(p) · GW(p)

My gut reaction is that this doesn't demonstrate that SPRs are good, just that humans are bad. There are tons of statistical modeling algorithms that are more sophisticated than SPRs.

Unless, of course, SPR is another word for "any statistical modeling algorithm", in which case this is just the claim that statistical machine learning is a good approach, which anyone as Bayesian as the average LessWronger probably agrees with.

Replies from: shokwave
comment by shokwave · 2011-01-18T08:47:41.072Z · LW(p) · GW(p)

There are tons of statistical modeling algorithms that are more sophisticated than SPRs.

Not in and of itself a good thing. As demonstrated recently, sophisticated statistics can simply allow one to tie oneself in a sophisticated knot that's harder to untie. There is a case to be made for promoting the simplest algorithm that outperforms current methods, and SPRs seem to fit this bill.

As for what SPR stands for, the post makes it pretty clear that they are a class of rules that predict a (desired) property using weighted cues (observable properties). I am not familiar enough with statistical modelling to say if that is a shared goal among all algorithms.

Replies from: jsteinhardt, jsteinhardt
comment by jsteinhardt · 2011-01-18T18:28:08.995Z · LW(p) · GW(p)

The post gives an example of an SPR that uses weighted cues. But he specifically says

This particular SPR is called a proper linear model,

indicating that there are other types of SPRs, and I currently have no idea what those other types might be.

I agree with you that complicated statistical tests can lead to spurious results; simple statistical tests can also lead to spurious results if the person using them doesn't understand them. I naively associate both of these with "the test was designed to correct against a different type of flaw in experimental design than actually occurred".

When the focus of the statistical test is on accurately modeling a given situation, I think it is less difficult to realize when a model choice makes sense and when it doesn't, so more sophisticated approaches will probably do better, since they come closer to carving reality at its joints. This might be an inferential distance error on my part, though, since I have training in this area, so errors that I personally can avoid might not be generally avoidable.

Replies from: shokwave
comment by shokwave · 2011-01-19T06:08:06.158Z · LW(p) · GW(p)

I agree with you for smart people; I do see a lot of value, though, in idiot-proof statistics. Weighted-cue SPRs are almost too simple to screw up.

comment by jsteinhardt · 2011-01-18T18:28:28.291Z · LW(p) · GW(p)

Also, while this isn't super-relevant, given that I already agree with your claim about people confusing themselves, my impression is that the link you gave presents moderate-to-weak evidence against this.

I didn't read the entire article that was linked to discussing the statistical analysis (if there's a particular section you think I should read, please let me know), but my understanding was that in some sense the "experimental procedure" was the issue, not the statistics. In other words, Bem considered potentially hundreds of hypotheses about his data, but only reported on a few, so that p-values of 0.02 are not super-impressive (since out of 100 hypotheses we would expect a few to hit that by chance).

Bem's experiments all basically ask "is this coin biased", which isn't a very complicated question to answer. It is the sophisticated statistics that corrects for the flawed procedure.
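The multiple-comparisons arithmetic behind this can be checked with a few lines of Python. This is a rough sketch under two simplifying assumptions that are not in the comment above: the 100 hypotheses are tested independently, and none of the effects is real. The function name `false_positive_summary` is made up for illustration.

```python
def false_positive_summary(alpha, n_hypotheses):
    """Return (expected false positives, P(at least one false positive)).

    Assumes independent tests in which every null hypothesis is true.
    """
    expected = alpha * n_hypotheses
    prob_at_least_one = 1 - (1 - alpha) ** n_hypotheses
    return expected, prob_at_least_one


# The figures from the comment: a p-value threshold of 0.02, 100 hypotheses.
expected, prob_any = false_positive_summary(alpha=0.02, n_hypotheses=100)
print(f"expected hits at p <= 0.02 by chance: {expected:.1f}")
print(f"chance of at least one such hit: {prob_any:.2f}")
```

Under those assumptions, about two of the 100 hypotheses would reach p ≤ 0.02 by chance alone, which is why a handful of reported results at that level is weak evidence.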

Replies from: shokwave
comment by shokwave · 2011-01-19T06:11:12.304Z · LW(p) · GW(p)

It wasn't a very good example at all. I basically grepped my memory for "idiot statistics" and that one featured strongly. The problem there was not a misuse of statistical tests, it was a misinterpretation of the significance of statistical tests.

comment by Vladimir_M · 2011-01-19T08:42:18.807Z · LW(p) · GW(p)

Besides the legal issues with discrimination and disparate impact, another important issue here is that jobs that involve making decisions about people tend to be high-status. As a very general tendency, the higher-status a profession is, the more its practitioners are likely to organize in a guild-like way and resist intrusive innovations by outsiders -- especially innovations involving performance metrics that show the current standards of the profession in a bad light, or even worse, those that threaten a change in the way their work is done that might lower its status.

Discussions of such cases in medicine are a regular feature on Overcoming Bias, but it exists in a more or less pronounced form in any other high-status profession too. How much it accounts for the specific cases discussed in the above article is a complex question, but this phenomenon should certainly be considered as a plausible part of the explanation.

comment by Dr_Manhattan · 2011-01-18T21:17:54.246Z · LW(p) · GW(p)

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you're considering, you need not waste your brain power trying to make a careful judgment.

Unfortunately linear models for a lot of situations are simply not available. The dozen or so ones in the literature are the exception, not the rule.

Replies from: matt
comment by matt · 2011-01-26T20:49:21.911Z · LW(p) · GW(p)

And those that exist are not always easy to find.
And those that are found are not always easy to use in industry (where sophisticated computer skills are often the things the marketing grad taught erself to do in Excel).

comment by [deleted] · 2011-01-18T12:05:11.682Z · LW(p) · GW(p)

Correct me if I'm wrong, but the SPR is just a linear model, right? Statistics is an underappreciated field in many walks of life. My own field of speciality, experimental design, is treated with downright suspicion by scientists who have not encountered it before, who find the results counter-intuitive (when they have 4 controllable variables in an experiment they want to vary them one at a time, while the best way is to vary all 4 simultaneously...)

Replies from: Nominull, Sly
comment by Nominull · 2011-01-18T22:16:15.111Z · LW(p) · GW(p)

I also find that counter-intuitive, is there a short explanation of why?

comment by Sly · 2011-01-18T21:02:29.592Z · LW(p) · GW(p)

I am curious: could you explain why it is better to vary all 4?

Replies from: None
comment by [deleted] · 2011-01-18T23:51:32.579Z · LW(p) · GW(p)

Briefly: because varying them one at a time assumes that they do not interact, and if they DO interact, you will have gathered no information on said interactions.

Replies from: CronoDAS
comment by CronoDAS · 2011-01-19T07:58:45.035Z · LW(p) · GW(p)

That makes sense... if your inputs are X and Y, and you want to figure out what your output f(X,Y) is, it seems like you'll eventually have to vary X and Y simultaneously in order to tell the difference between f(X,Y) = aXY + c and f(X,Y) = aX + bY + c.

Replies from: None
comment by [deleted] · 2011-01-19T11:09:08.090Z · LW(p) · GW(p)

Quite, although usually you'll have a model f(X,Y) = aXY + bX + cY + d. I'm actually underselling this approach: if I had two variables X and Y, each of which can be observed between -1 and 1, and only two observations to do it in, then we're much better off going with (X,Y) = (-1,1) and (1,-1) rather than (0,1) and (1,0), because we're gathering more information.
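A minimal sketch of the interaction point, assuming the four-term model f(X,Y) = aXY + bX + cY + d with made-up coefficients (a=2, b=1, c=-1, d=0.5) and noise-free observations; the names `columns` and `f` are hypothetical. It compares a one-at-a-time design, where one variable is always held at 0, with a two-level factorial design at the corners.

```python
def columns(points):
    """Regressor columns for f(X, Y) = a*X*Y + b*X + c*Y + d."""
    return {
        "a": [x * y for x, y in points],  # interaction term
        "b": [x for x, _ in points],
        "c": [y for _, y in points],
        "d": [1 for _ in points],         # intercept
    }


def f(x, y, a=2.0, b=1.0, c=-1.0, d=0.5):
    """Hypothetical 'true' response, used only to generate observations."""
    return a * x * y + b * x + c * y + d


# One variable at a time: the other is always held at 0.
one_at_a_time = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
# Two-level factorial: both variables vary together, at the extremes.
factorial = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# The interaction column is all zeros, so 'a' cannot be estimated at all.
ofat_interaction = columns(one_at_a_time)["a"]

# In the factorial design the four columns are mutually orthogonal, so each
# coefficient is the column's dot product with the responses, divided by 4.
responses = [f(x, y) for x, y in factorial]
estimates = {
    name: sum(v * r for v, r in zip(col, responses)) / len(factorial)
    for name, col in columns(factorial).items()
}
print(ofat_interaction)
print(estimates)
```

The one-at-a-time design never observes a point where X and Y are both nonzero, so it carries no information about the XY term; the four corner points recover all four coefficients.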

We always want to design in the location with the most variance, because that's the hardest place to predict. Given that the model we're looking at is linear in both the parameters and the variables, we know the places where we get the most variation will be at the extremes. Obviously we have no information if we think there might be some kind of quadratic terms here, but one of the nice things about design for linear models is you can build your experimentation to iteratively build up information.

Typically in an industrial setting we'll have a few dozen different factors which we think might affect our outcome, so we can design to eliminate down to a handful by using a very basic linear model in a screening experiment, then use a more sophisticated design called a central composite design.

Now if we want a mechanistic model, something based on what we know about the physics of the situation (say we have some differential equations describing the reaction), then designing becomes harder, which is where my research is.

comment by ChristianKl · 2011-01-26T13:29:04.372Z · LW(p) · GW(p)

You speak of incredible success without giving a success rate for the models. The fact that there are a dozen cases where specific models outperformed human reasoning doesn't prove much.

At the moment you recommend other people to use SPRs for their decision making based on "expert judgment". How about providing us with an SPR that tells us for which problems we should use SPRs?

comment by freyley · 2011-01-19T19:13:23.857Z · LW(p) · GW(p)

SPRs can be gamed much more directly than human experts. For example, imagine an SPR in place of all hiring managers. In our current situation, with hiring managers, we can guess at what goes into their decision-making and attempt to optimize for it, but because each manager is somewhat different, we can't know that well. A single SPR that took over for all the managers, or even a couple of very popular ones, would strongly encourage applicants to optimize for the variable most weighted in the equation. Over time this would likely decrease the value of the SPR back to that of a human expert.

This has a name in the literature, but I can't remember it at the moment. You see this problem in, for example, the current obsessive focus on GDP as the only measure of national well-being. Now that we've had that measure for some time, we're able to have countries whose GDP is improving but who suck on lots of other measures, and thus politicians who are proud of what they've done but who are hated by the people.

Yes, in some cases, this would cause us to improve the SPR to the point where it accurately reflected the qualities that go into success. But that's not a proven thing.

That said, I'd really like to see a wiki or other attempting-to-be-complete resource for finding an SPR for any particular application. Anyone got one?

Replies from: Barry_Cotter
comment by Barry_Cotter · 2011-01-19T19:27:41.900Z · LW(p) · GW(p)

This has a name in the literature, but I can't remember it at the moment

Goodhart's Law

A single SPR that took over for all the managers, or even a couple of very popular ones, would strongly encourage applicants to optimize for the variable most weighted in the equation.

W1(Quantitative skills) + W2(Written and Oral Communication Skills) + W3(Ability to work with loose supervision) + W4(Domain Expertise) + W5(Social Skills) + W6(Prestige Markers)

That said, I'd really like to see a wiki or other attempting-to-be-complete resource for finding an SPR for any particular application. Anyone got one?

No, but I imagine that taking a grab basket of plausible correlates of the desired trait and throwing them into a regression function would be a good first draft. Then iterate.
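A first draft along those lines might look like the sketch below. It fits weights for a handful of cues by ordinary least squares, written in plain Python so that no library is assumed. The two cues, the five data rows, and the function name `fit_linear` are all invented for illustration; the targets were generated from 3·c1 + 2·c2 + 1, so the fit should recover those weights.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    rows: list of cue vectors; an intercept column of 1s is appended.
    Returns the cue weights followed by the intercept.
    """
    X = [list(r) + [1.0] for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * t for x, t in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("cues are collinear; weights are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Made-up data: two cues per case, target = 3*c1 + 2*c2 + 1.
cues = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
outcomes = [8, 9, 18, 19, 26]
weights = fit_linear(cues, outcomes)
print(weights)
```

With real data the fit would not be exact, and the "then iterate" step would involve checking the weights on cases held out from the fit.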

comment by JohnD · 2011-01-18T09:40:52.287Z · LW(p) · GW(p)

Hi Luke,

Great post. Will be writing something about the legal uses of SPRs in the near future.

Anyway, the link to the Grove and Meehl study doesn't seem to work for me. It says the file is damaged and cannot be repaired.

Replies from: Douglas_Knight, lukeprog
comment by Douglas_Knight · 2011-01-18T16:12:06.014Z · LW(p) · GW(p)

The copy on Meehl's posthumous website works for me. I found it with Google Scholar.

Replies from: JohnD
comment by JohnD · 2011-01-18T16:19:50.082Z · LW(p) · GW(p)

Ah yes, that seems to work. Thanks

comment by lukeprog · 2011-01-18T17:56:05.627Z · LW(p) · GW(p)

Fixed, thanks.

comment by NancyLebovitz · 2012-02-26T11:24:19.925Z · LW(p) · GW(p)

The thing that makes me twitch about SPRs is a concern that they won't change when the underlying conditions which created their data sets change. This doesn't mean that humans are good at noticing that sort of thing, either. However, it's at least worth thinking about which approach is likely to overshoot worse when something surprising happens. Or whether there's some reason to think that the greater usual accuracy of SPRs leads to enough bigger reserves that the occasional overshoot problem (if such are worse than in a non-SPR system) is compensated for.

comment by LorenzofromOz · 2011-01-21T15:23:35.932Z · LW(p) · GW(p)

On interviews, I had a great deal of success hiring for clerical assistant positions by simply getting the interviewees to do a simple problem in front of us. It turned out to be a great, reliable and easy-to-justify sorter of candidates.

But, of course, it was neither unstructured nor much of an "interview" as such.

Replies from: shokwave
comment by shokwave · 2011-01-21T15:30:52.811Z · LW(p) · GW(p)

Again, test not interview. Their GPA is an average measure of maybe thousands of such simple problems - probably on average more rigorously produced, presented, and corrected than your problem presented in the interview.

Deciding based on a test in person instead of deciding on a number that represents thousands of such individual tests smacks of anecdotal decision-making.

Replies from: eshear, Dmytry
comment by eshear · 2011-01-26T18:25:41.928Z · LW(p) · GW(p)

Unfortunately, GPAs can lie. You cannot be certain of the quality of the problems and evaluation that was averaged to produce the GPA. So running your own test of known difficulty works well to verify what you see on the resume.

For example, I have to hire programmers. We give all incoming programmers a few relatively easy programming problems as part of the interview process because we've found that no matter what the resume says, it's possible that they actually do not know how to program.

Good resume + good interview result is a much stronger indicator than good resume alone.

Replies from: tenshiko
comment by tenshiko · 2011-01-26T18:42:26.081Z · LW(p) · GW(p)

A significant problem is the weighting of certain courses, particularly Advanced Placement ones. A GPA of 3.7, seeming quite respectable to the unaware, can be obtained by work of quality 83%, and that's assuming the class didn't offer extra credit.

Replies from: Wandering_Sophist
comment by Wandering_Sophist · 2011-05-14T19:53:16.982Z · LW(p) · GW(p)

I don't think he is likely to hire programmers straight out of high school.

Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

Replies from: None, tenshiko
comment by [deleted] · 2012-02-26T00:46:43.584Z · LW(p) · GW(p)

Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

A swift googling brings up this forthcoming study of about 900 high schools in Texas:

Despite conventional wisdom to the contrary, grade weighting is not the primary factor driving students to increase their AP course-taking. Moreover, a lack of institutional knowledge about the importance of grade-weighting does not have a practically significant adverse impact on students with low historical participation rates in AP, although low income students are marginally less responsive to increases in the AP grade weight than others. The minimal connection between AP grade weights and course-taking behavior may explain why schools tinker with their weights, making changes in the hopes of finding the sweet spot that elicits the desired student AP-taking rates. The results presented here suggest that there is no sweet spot and that schools should look elsewhere for ways to increase participation in rigorous courses.

comment by tenshiko · 2011-05-14T20:18:15.030Z · LW(p) · GW(p)

But there's still the additional incentive of prestige and signalling, isn't there? That should be enough for the serious scholar. It's a significant problem when non-AP-labelled courses are often passed over for the purpose of a cheap grade boost.

comment by Dmytry · 2012-02-25T23:26:19.314Z · LW(p) · GW(p)

Since when did greater rigour and averaging of more problems imply greater degree of correlation with performance at one specific job?

I call halo effect here. Greater rigour, bigger number, more accurate, more corrected: all these really 'good' qualities of the GPA value spill over into your feeling of how well it'll correlate with performance at a specific job, versus a 'bad', ill-measured value.

Truth is, say, ill measured hand size based on eyeballing can easily correlate better with measured finger length, than body weight measured using ultra high precision scientific scales with accuracy of a milligram (microgram, nanogram, whatever). Just because hammer is a tool you build things with, and butter knife is a kitchen utensil, doesn't make hammer better than butter knife as a screw driver.

Replies from: shokwave
comment by shokwave · 2012-02-26T01:05:47.577Z · LW(p) · GW(p)

Just because hammer is a tool you build things with, and butter knife is a kitchen utensil, doesn't make hammer better than butter knife as a screw driver.

Well, actually...

But more on point, you'd need to justify that the test you give is more correlated than GPA with performance - this is why I support simple programming tests (because they demonstrably are more correlated than academic indicators) but for a 'clerical assistant' position as described above, a specific test doesn't immediately spring to mind, and so it's suspect.

Replies from: Dmytry
comment by Dmytry · 2012-02-26T10:55:06.607Z · LW(p) · GW(p)

You aren't usually looking for 'correlation'; you're looking to screen out the serial job applicant who can't do the job they're applying for (and keeps re-applying to many places)... just ask 'em to do some work similar to what they will be doing, as per LorenzofromOz's method, and you'll at least be assured they can do the work. While with GPA you won't be assured of anything whatsoever.

For the programming, the simplest dumbest check works to screen out those entirely incapable, when screening by PhD would not.

http://www.codinghorror.com/blog/2007/02/why-cant-programmers-program.html

PhD might correlate better with performance than fizzbuzz does (the latter being a binary test of extremely basic knowledge), but PhD does not screen out those who will just waste your time, and fizzbuzz (your personal variation of it) does.
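For reference, the commonly cited form of the test asks for the numbers 1 to 100, with multiples of 3 replaced by "Fizz", multiples of 5 by "Buzz", and multiples of both by "FizzBuzz". A minimal Python sketch of that common version (an interviewer's own variation would differ) is:

```python
def fizzbuzz(n):
    """Return the FizzBuzz output for a single positive integer."""
    if n % 15 == 0:       # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


lines = [fizzbuzz(i) for i in range(1, 101)]
print("\n".join(lines))
```

The point of the test is the floor it sets, not the difficulty: it checks only that a candidate can write a loop and a conditional.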

Replies from: army1987
comment by A1987dM (army1987) · 2012-02-26T12:56:54.249Z · LW(p) · GW(p)

Holy crap... I think I had read about the FizzBuzz thing a while ago, but I didn't remember about the 199 in 200 thing... Would it be possible to sue the institutions issuing those PhD or something? :-)

Replies from: Dmytry
comment by Dmytry · 2012-02-26T13:31:08.261Z · LW(p) · GW(p)

Well, I don't know what % of the CS-related PhDs can't do FizzBuzz, maybe the percentage is rather small. (Also, sue for what? You are not their client. The incapable dude that was given a degree, that's their client. Your over-valuation of this degree as evidence of capability is your own problem)

The issue is that, as Joel explains, the job applicants are a sample extremely biased towards incompetence:

http://www.joelonsoftware.com/items/2005/01/27.html

[Though I would think that the incompetents with degrees would be more able to find an incompetent employer to work at. And PhDs should be able to find a company that hires PhDs for signalling reasons]

The issue with the hiring methods here, is that we easily confuse "more accurate measurement of X" with "stronger correlation to Y", and "stronger correlation to Y" with hiring better staff (the one that doesn't sink your company), usually out of some dramatically different population than the one on which correlation was found.

Furthermore, a 'correlation' is such an inexact measure of how a test relates to performance. Comparing correlations is like comparing apples to oranges by weight. The 'fizzbuzz' style problems measure performance near the absolute floor level, but with very high reliability. Virtually no-one who fails fizzbuzz is a good hire. Virtually no-one who passes fizzbuzz (a unique fizzbuzz, not the popular one) is completely incapable of programming. The degrees correlate to performance at the higher level, but with very low reliability - there are brilliant people with degrees, there are complete incompetents with degrees, and there are brilliant people and incompetents without degrees.

edit: other example:

http://blog.rethinkdb.com/will-the-real-programmers-please-stand-up

Reversing a linked list is a good one unless the candidate already knows how to. See, the issue is that educational institutions don't teach how to think up a way to reverse a linked list. Nor do they test for that. They might teach how to reverse the linked list, then they might test if the person can reverse the linked list. Some people learn to think of a way to solve such problems. Some don't. It's entirely incidental.
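For readers who have not seen the problem, one standard answer is the iterative pointer reversal sketched below in Python. The `Node` class and the helper names are made up for illustration; this is just one common way to do it.

```python
class Node:
    """A singly linked list node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def reverse(head):
    """Reverse the list in place and return the new head."""
    prev = None
    while head is not None:
        nxt = head.next    # remember the rest of the list
        head.next = prev   # point the current node backwards
        prev = head        # the current node becomes the reversed prefix
        head = nxt         # move on to the remembered rest
    return prev


def from_list(values):
    """Build a linked list from a Python list (helper for the demo)."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    """Collect node values into a Python list (helper for the demo)."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


print(to_list(reverse(from_list([1, 2, 3, 4]))))
```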

comment by DanielLC · 2011-01-20T02:35:22.918Z · LW(p) · GW(p)

The post mentions the experts using the results of the SPR. What happens if you reverse it, and give the SPR the prediction of the expert?

Replies from: lukeprog
comment by lukeprog · 2011-01-20T07:20:36.143Z · LW(p) · GW(p)

That's called a 'bootstrapped' SPR. It's one way of building an SPR. And yes, in many cases the SPR ends up being reliably better than the expert judgments that were used to build it.

Replies from: DanielLC
comment by DanielLC · 2011-01-20T22:43:36.354Z · LW(p) · GW(p)

I was wondering more how much better it is than a normal SPR. Also, I wonder what weight it would give to the expert.

comment by Douglas_Knight · 2011-01-18T16:07:15.486Z · LW(p) · GW(p)

Cosma Shalizi has a nice bibliography here

60 years of research

I would like to emphasize this part. It's not just scattered papers back then. Meehl wrote a book surveying the field in 1955.

comment by Miller · 2011-01-18T05:08:39.577Z · LW(p) · GW(p)

AI systems can generally whoop humans when a limited feature set can be discovered that covers the span of a large class of examples to good effect. The challenge is when you seemingly need a new feature for each new example in order to differentiate it from the rest of the examples in that class. Essentially you are saying that the problem can be mapped to a simple function. Some problems can.

Let's imagine we are classifying avian vs. reptile. Our first example might be a gecko, and we might say 'well it's green'. So 'Color is Green' is a clue\feature and that works coincidentally for a few more examples. Then you get a parrot as an example, and you decide to add 'Has a beak'. Then you get the example of a turtle, and so you add 'Has a shell', etc. It seems to me the success of these systems boils down to whether the features can be added at a minimal rate compared to the examples on hand.

Where AI's compete well generally they beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at border or novel cases. This can make it dangerous to use them if the extreme failures are dangerous.

As to why humans can't ensemble with the machines, I suspect that has mostly to do with the humans not being well-trained to do so.

Replies from: shokwave
comment by shokwave · 2011-01-18T05:27:58.594Z · LW(p) · GW(p)

A fair point and good caution against turning SPRs into your hammer for every nail, but irrelevant in the case lukeprog is discussing; we already have the expert system, we already know it works better than the experts, we just aren't using it.

Replies from: Miller
comment by Miller · 2011-01-18T05:50:50.281Z · LW(p) · GW(p)

Irrelevant is excessive. When you say 'system A works better than system B' this implies that system A should be used and this is clear cut. But the notion 'works better' lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% by 40%? It's not as simple as saying .9 * .05 > .1 * .4. The cost of error isn't necessarily linear.

Now why these systems aren't used in ensembles with humans is indeed a great question. I can imagine that in most cases we could also ask 'why don't we double the number of experts who are collaborating on a given problem?' under the presumption that more minds would likely result in a better performance across the board. I wouldn't be surprised if there was a lot of overlap in the answers. Coordination difficulty is likely high up there. Thus,

consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs

possibly becomes the explanation.

Replies from: Vaniver, shokwave
comment by Vaniver · 2011-01-18T07:17:19.954Z · LW(p) · GW(p)

When you say 'system A works better than system B' this implies that system A should be used and this is clear cut. But the notion 'works better' lacks a rigorous definition.

What? These are generally binary decisions, with a known cost to false positives and false negatives, and known rates of false positives and false negatives. It should be trivial to go from that to a utility-valued error score.

Replies from: Miller
comment by Miller · 2011-01-18T12:13:15.325Z · LW(p) · GW(p)

You just presumed away my argument. I claim specifically that the relationship between various classes of errors is not well-defined. This can lead to abuse of the term 'better'.

Please tell me why I should take that as a presumption.

Replies from: shokwave
comment by shokwave · 2011-01-18T12:44:40.367Z · LW(p) · GW(p)

Because those are the class of problems this post discusses.

From the top of the post:

A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?

The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do.

Replies from: kybernetikos, Miller
comment by kybernetikos · 2011-01-19T17:32:18.107Z · LW(p) · GW(p)

A parole board considers the release of a prisoner: Will he be violent again?

I think this is the kind of question that Miller is talking about. Just because a system is correct more often, doesn't necessarily mean it's better.

For example if the human experts allowed more people out who went on to commit relatively minor violent offences and the SPRs do this less often, but are more likely to release prisoners who go on to commit murder then there would be legitimate discussion over whether the SPR is actually better.

I think this is exactly what he is talking about when he says

Where AI's compete well generally they beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at border or novel cases. This can make it dangerous to use them if the extreme failures are dangerous.

Whether or not there is evidence that says this is a real effect I don't know, but to address it what you really need to measure is total utility of outcomes rather than accuracy.

Replies from: Miller
comment by Miller · 2011-01-19T22:03:50.417Z · LW(p) · GW(p)

Yes. You got it, exactly.

comment by Miller · 2011-01-18T12:54:26.403Z · LW(p) · GW(p)

No. I'm talking about classes of errors.

As in, which is better?

  • A test that reports 100 false positives for every 100 false negatives for disease X
  • A test that reports 110 false positives for every 90 false negatives for disease X

The cost of fp vs. fn is not defined automatically. If humans are closer to #1 than #2, and I develop a system like #2, I might define #2 to be better. Then later on down the line I stop talking about how I defined better, and I just use the word better, and no one questions it because hey... better is better, right?

Replies from: shokwave, jimrandomh
comment by shokwave · 2011-01-18T13:41:34.234Z · LW(p) · GW(p)

Which is more costly, false positives or false negatives? This is an easy question to answer.

If false positives, #1 is better. If false negatives, #2. I really do not see what your point is. These problems you bring up are easily solved.
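A rough sketch of the comparison being argued over, using the two tests from the parent comment. The cost figures passed in are placeholders, and where they come from is the contested part. The function names are made up for illustration.

```python
def total_cost(false_pos, false_neg, cost_fp, cost_fn):
    """Total cost of a test's errors, given a per-error cost for each kind."""
    return false_pos * cost_fp + false_neg * cost_fn

# (false positives, false negatives) for the two tests in the parent comment.
tests = {"#1": (100, 100), "#2": (110, 90)}

def better_test(cost_fp, cost_fn):
    """Name of the cheaper test; on an exact tie the first one listed wins."""
    return min(tests, key=lambda name: total_cost(*tests[name], cost_fp, cost_fn))

print(better_test(cost_fp=2, cost_fn=1))  # false positives cost more
print(better_test(cost_fp=1, cost_fn=2))  # false negatives cost more
```

The arithmetic is the easy part. With equal costs the two tests tie at 200, so the ranking depends entirely on the cost ratio chosen.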

Replies from: handoflixue
comment by handoflixue · 2011-01-18T20:26:23.429Z · LW(p) · GW(p)

Which is better: Releasing a violent prisoner, or keeping a harmless one incarcerated? If you can find an answer that 90% of the population agrees on, then I think you've done better than every politician in history.

That people do NOT agree suggests to me that it's hardly a trivial question...

Replies from: shokwave, Miller, handoflixue
comment by shokwave · 2011-01-19T06:19:34.226Z · LW(p) · GW(p)

Releasing a violent prisoner, or keeping a harmless one incarcerated?

How violent, how preventably violent, how harmless, how incarcerated, how long incarcerated? For any specific case with these agreed-upon, I am confident a supermajority would agree.

That people do NOT agree suggests to me that it's hardly a trivial question...

That people don't agree suggests one side is comparing releasing a serial killer to incarcerating a drifter in jail a short while, and the other side is comparing releasing a middle-aged man who in a fit of passion struck his adulterous wife to incarcerating Gandhi for the term of his natural life. More generally, they are deciding based on one specific example they have strongly available to them.

In the state you phrased it, that question is about as answerable as "how long is a piece of string?".

comment by Miller · 2011-01-18T23:00:38.037Z · LW(p) · GW(p)

Yes. Thank you. Since at least one person understood me, I'm gonna jump off the merry-go-round at this point.

comment by handoflixue · 2011-01-18T20:29:56.984Z · LW(p) · GW(p)

(For reference, I realize an expert runs into the same issue, I just think it's unfair to say that the issue is "easily solved".)

comment by jimrandomh · 2011-01-18T14:18:44.815Z · LW(p) · GW(p)

Many tests have a continuous, adjustable parameter for sensitivity, letting you set the trade-off however you want. In that case, we can refrain from judging the relative badness of false positives and false negatives, and use ROCA, which is basically the integral over all such trade-offs. Tests that are going to be combined into a larger predictor are usually measured this way.
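One way to compute the ROC area without fixing a threshold, as a sketch. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as half. The scores and labels below are invented.

```python
def roc_area(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test scores and true labels (1 = has the condition).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(roc_area(scores, labels))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the area is 8/9. A test that ranks every positive above every negative scores 1.0, and a coin flip scores about 0.5.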

Machine learning packages generally let you specify a "cost matrix", which is the cost of each possible confusion. For a 2-valued test, it would be a 2x2 matrix with zeroes on the diagonal, and the cost of A->B and B->A errors in the other two spots. For a test with N possible results, the matrix is NxN, with zeroes on the diagonals, and each (row,col) position is the cost of a mistake that confuses the result corresponding to that row with the result corresponding to that column.
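A sketch of how such a cost matrix gets used, in plain Python rather than any particular package's API. Rows are the true class and columns the predicted class. The counts and costs are invented for illustration.

```python
def accuracy(confusion):
    """Fraction of cases on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

def mean_cost(confusion, cost):
    """Average cost per case: each confusion count weighted by its cost entry."""
    n = len(cost)
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][j] * cost[i][j] for i in range(n) for j in range(n)) / total

# cost[actual][predicted]: zeroes on the diagonal; confusing B for A is 5x worse here.
cost = [[0, 1],
        [5, 0]]

# Hypothetical confusion matrices for two predictors on the same 200 cases.
predictor_a = [[80, 20],
               [10, 90]]
predictor_b = [[70, 30],
               [4, 96]]

for name, conf in (("A", predictor_a), ("B", predictor_b)):
    print(name, accuracy(conf), mean_cost(conf, cost))
```

With these numbers predictor A is more accurate (0.85 against 0.83) but predictor B is cheaper (0.25 against 0.35 per case), which is the gap between "correct more often" and "better" raised elsewhere in this thread.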

comment by shokwave · 2011-01-18T06:31:44.359Z · LW(p) · GW(p)

Keep in mind this is in the conclusion of lukeprog's post:

When there exists a reliable statistical prediction rule for the problem you're considering

Now,

But the notion 'works better' lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% by 40%? It's not as simple as saying .9 * .05 > .1 * .4. The cost of error isn't necessarily linear.

If the cost of error isn't linear, determine what function it follows, then use that function instead of a linear function to compare the relative costs, which will tell you which works better.
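A toy illustration of swapping the cost function. The error profiles are assumptions loosely based on the 90%/10% example: a human who is always off by 0.10, and a machine that is off by 0.05 in 90% of cases and by 0.50 in the rest. The quadratic cost is just one possible nonlinear choice.

```python
def expected_cost(error_profile, cost_fn):
    """Expected cost of a list of (probability, error size) pairs under cost_fn."""
    return sum(prob * cost_fn(err) for prob, err in error_profile)

# Hypothetical error profiles: (probability, size of error).
human = [(1.0, 0.10)]
machine = [(0.9, 0.05), (0.1, 0.50)]

def linear(err):
    return err

def quadratic(err):
    return err ** 2  # large errors are penalised disproportionately

for name, cost_fn in (("linear", linear), ("quadratic", quadratic)):
    print(name, expected_cost(human, cost_fn), expected_cost(machine, cost_fn))
```

Under the linear cost the machine comes out slightly ahead (about 0.095 against 0.10). Under the quadratic cost the ranking flips (about 0.027 against 0.01). The procedure is mechanical once the function is known, and the verdict depends on which function is used.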

Irrelevant is excessive.

I stand by it. The post is saying, given that SPRs work, work better than experts, and don't fail where experts don't, we should use them instead of experts. Your points were that SPRs don't always work, tend not to work in border cases, and might fail in dangerous cases. The first point is only true in cases this post is not concerned with, the second is equally true of experts and SPRs, and the third is also equally true of experts and SPRs.

comment by BT_Uytya · 2011-12-05T18:51:43.235Z · LW(p) · GW(p)

Also, there is an article by Dawes, Faust and Meehl. Despite the fact it was published 7 years prior to House of Cards, it contains some information not described in chapter 3 of House of Cards.

For example, the awesome result by Goldberg: linear models of human judges were more accurate than human judges themselves:

in cases of disagreement, the models were more often correct than the very judges on whom they were based.

comment by metaweta · 2011-01-20T00:17:36.906Z · LW(p) · GW(p)

I think the reason I don't use statistics more often is the difficulty of getting good data sets; and even when there is good data, there are often ethical problems with following it. For example: Bob lives in America, and is seeking to maximize his happiness. Americans who report high levels of spiritual conviction are twice as likely to report being "very happy" as the least religious. Should he become a devout Christian? There's evidence that the happiness comes from holding the majority opinion; should he then strive to believe whatever the polls say is the most common belief in his area?

Another example: Bob has three kids; he knows his wife is cheating on him, but he also knows the effect size of divorce on child outcomes (depression, grades, income, stability of future relationships, etc.) is larger than smoking on lung cancer, aspirin on heart attacks, or cyclosporine on organ transplants. When do the bad effects of staying in the marriage outweigh the bad effects of splitting up?

Replies from: IlyaShpitser
comment by IlyaShpitser · 2011-01-21T00:55:38.641Z · LW(p) · GW(p)

Bob should not become a Christian to become happier for the same reason that he should not stay away from hospitals if he's sick (after all, sick people are a lot more likely to be in a hospital).

comment by PatrickNoonan · 2011-01-19T11:51:08.045Z · LW(p) · GW(p)

People looking for additional resources on this matter should know that such linear models are often called "multi attribute utility models" (MAUT), and that they're discussed extensively in the literature of decision analysis and multi-criteria decision making. They're also used in choice models in the science of marketing.

The word "statistical" in the name used here is a bit of a red herring.

comment by gwern · 2011-06-19T16:15:18.514Z · LW(p) · GW(p)

Atlantic, The Brain on Trial:

In the past, researchers have asked psychiatrists and parole-board members how likely specific sex offenders were to relapse when let out of prison. Both groups had experience with sex offenders, so predicting who was going straight and who was coming back seemed simple. But surprisingly, the expert guesses showed almost no correlation with the actual outcomes. The psychiatrists and parole-board members had only slightly better predictive accuracy than coin-flippers. This astounded the legal community.

So researchers tried a more actuarial approach. They set about recording dozens of characteristics of some 23,000 released sex offenders: whether the offender had unstable employment, had been sexually abused as a child, was addicted to drugs, showed remorse, had deviant sexual interests, and so on. Researchers then tracked the offenders for an average of five years after release to see who wound up back in prison. At the end of the study, they computed which factors best explained the reoffense rates, and from these and later data they were able to build actuarial tables to be used in sentencing.

Which factors mattered? Take, for instance, low remorse, denial of the crime, and sexual abuse as a child. You might guess that these factors would correlate with sex offenders’ recidivism. But you would be wrong: those factors offer no predictive power. How about antisocial personality disorder and failure to complete treatment? These offer somewhat more predictive power. But among the strongest predictors of recidivism are prior sexual offenses and sexual interest in children. When you compare the predictive power of the actuarial approach with that of the parole boards and psychiatrists, there is no contest: numbers beat intuition. In courtrooms across the nation, these actuarial tests are now used in presentencing to modulate the length of prison terms.

comment by lukeprog · 2011-01-31T00:50:55.142Z · LW(p) · GW(p)

Update: Added about 10 more direct PDF links to the original article.

Replies from: paul_watcher
comment by paul_watcher · 2011-02-22T17:10:31.814Z · LW(p) · GW(p)

Thank you for this article. Some people may react to finding that their professional opinion is less accurate than a simple formula, but I get excited instead. It's such a great opportunity to become more accurate, with such comparatively little effort! I'm particularly interested in the medical SPRs; I aim to be a doctor, and if these will help me be better than the average doctor in many cases, then so be it. I suspect that I'll have to use them secretly.

comment by dreeves · 2011-01-20T10:21:14.975Z · LW(p) · GW(p)

Other related reading that I don't think has been mentioned yet:

Ian Ayres (cofounder of stickK.com) has a popular book called Super Crunchers that argues this exact thesis. http://www.amazon.com/Super-Crunchers-Thinking-Numbers-Smart/dp/0553805401

A classic is Tetlock's Expert Political Judgment. http://press.princeton.edu/titles/7959.html

comment by rabidchicken · 2011-01-18T04:37:55.606Z · LW(p) · GW(p)

I cannot help unleashing an evil laugh whenever I discover another tool to aid in world domination. Thank you.

comment by gwern · 2012-11-11T23:29:27.455Z · LW(p) · GW(p)

Another example of this: the US political models did fantastic in predicting all sorts of outcomes on election day 2012, far exceeding all sorts of pundits or people adjusting the numbers based on gut feelings and assumptions, despite often being pretty simple or tantamount to poll averaging.

comment by kwhitefoot · 2011-08-12T10:04:16.620Z · LW(p) · GW(p)

Just felt like saying thank you to lukeprog and all those who commented; this has been a great help to me in deciding what to read about next regarding determination of guaranteed values for the service the department I work in performs.

comment by ChristianKl · 2011-01-26T13:38:04.363Z · LW(p) · GW(p)

Humans use more complex utility functions to evaluate something like martial happiness. If you train a statistical model on a straight numeric value for martial happiness than the model only optimizes towards that specific aspect of happiness.

A good evaluation should test the model that trained on hedonistic happiness rating on something like the likelihood of divorce.

Replies from: wnoise
comment by wnoise · 2011-01-26T19:12:43.887Z · LW(p) · GW(p)

I think you mean "marital" here. (Despite the similarities, love is not a battlefield.)

Replies from: ChristianKl
comment by ChristianKl · 2011-01-27T10:17:11.982Z · LW(p) · GW(p)

Okay, English isn't my first language.

Replies from: komponisto
comment by komponisto · 2011-01-27T10:27:26.540Z · LW(p) · GW(p)

English isn't my first language

You could easily have made the same typo even if it were; we're talking about the mere transposition of two adjacent letters.

(Another example: "casual" vs. "causal", which often trips me up in reading.)

Replies from: army1987, SilasBarta
comment by A1987dM (army1987) · 2012-02-25T11:35:17.904Z · LW(p) · GW(p)

(Another example: "casual" vs. "causal", which often trips me up in reading.)

In Italian that's even worse, since causale does mean ‘causal’ but casuale means ‘random’.

comment by SilasBarta · 2011-01-27T16:32:30.382Z · LW(p) · GW(p)

(Another example: "casual" vs. "causal", which often trips me up in reading.)

Cool, that means you would get the joke about how "women are interested in causal sex"!

Replies from: ata
comment by ata · 2011-01-27T19:29:08.883Z · LW(p) · GW(p)

Is there acausal sex? (Would that be, like, having (phone/cyber)sex with someone in a different Tegmark universe via some form of communication built on UDT acausal trade?)

Replies from: HonoreDB, gwern, wedrifid, andrew-sauer, MugaSofer, David_Gerard
comment by HonoreDB · 2011-01-27T19:53:53.709Z · LW(p) · GW(p)

Acausal sexual reproduction is quite plausible, in a sense. Suppose you were a single woman living in a society with access to sophisticated genetic engineering, and you wanted to give birth to a child that was biologically yours and not do any unnatural optimizing. You could envision your ideal mate in detail, reverse-engineer the genetics of this man, and then create a sperm population that the man could have produced had he existed. I can easily imagine a genetic engineer offering this service: you walk into the office, describe the man's physical attributes, personality, and even life history, and the engineer does the rest as much as is possible (in this society, we know that a plurality of men who played shortstop in Little League have a certain allele, etc.) The child could grow up and meaningfully learn things about the counterfactual father--if you learned that the father was prone to depression, that would mean that you should watch out for that as well.

If the mother really wants to, she can take things further and specify that the man should be the kind of person who would have, had he existed, gone through the analogous procedure (with a surrogate or artificial womb), and that the counterfactual woman he would have specified would have been her. In this case, we can say that the man and the woman have acausally reproduced.

Replies from: cousin_it, Desrtopa, Vaniver
comment by cousin_it · 2011-01-27T20:50:22.943Z · LW(p) · GW(p)

Hmm. So the man has managed to "acausally reproduce", fulfill his utility function, in spite of not existing. You could go further and posit an imaginary couple who would have chosen each other for the procedure - so they succeed in "acausally reproducing", even though neither of them exists. Then when someone tries to write a story about the imaginary couple, the child becomes observable to the writer and starts doing some reproducing of her own :-)

comment by Desrtopa · 2011-01-27T21:41:58.911Z · LW(p) · GW(p)

My interpretation of acausal sexual reproduction would be something more like All You Zombies.

comment by Vaniver · 2011-01-27T20:46:22.592Z · LW(p) · GW(p)

What makes this acausal? That is, when are future inputs modifying present results? Or are you using a different definition of acausal?

Replies from: HonoreDB
comment by HonoreDB · 2011-01-27T21:04:23.658Z · LW(p) · GW(p)

I meant it in the sense of ata's parent comment, although there is a backwards arrow in there: the phenotype is determining the genotype rather than vice versa.

comment by gwern · 2012-02-25T16:04:41.495Z · LW(p) · GW(p)

You simply must read http://commonsenseatheism.com/wp-content/uploads/2010/10/Sinhababu-Possible-Girls.pdf - possibly the most romantic paper I've ever read.

Replies from: wedrifid
comment by wedrifid · 2012-02-25T20:14:33.014Z · LW(p) · GW(p)

That paper is absolutely brilliant! I kept laughing every time a new clearly logically reasoned yet humorous detail was explored.

comment by wedrifid · 2012-02-25T20:15:09.056Z · LW(p) · GW(p)

Is there acausal sex? (Would that be, like, having (phone/cyber)sex with someone in a different Tegmark universe via some form of communication built on UDT acausal trade?)

If you're basing the sex on acausal trade then you should perhaps refer to it as acausal prostitution. Or possibly acausal marriage.

comment by andrew sauer (andrew-sauer) · 2021-03-24T02:43:55.699Z · LW(p) · GW(p)

You might not even need to go to a different Tegmark universe lol, given that multiple people have independently come up with this idea

comment by MugaSofer · 2013-01-15T11:45:58.013Z · LW(p) · GW(p)

  • Simulate agent.
  • Check if it tries to do the same for you.
  • If it does, build it a body and have sex.
  • In a galaxy far far away, an agent simulates you, sees you try to do the same for them.
  • It clones you and has sex.

Does this fit the bill?

comment by mistercow · 2011-01-19T17:14:38.537Z · LW(p) · GW(p)

It's interesting to me that the proper linear model example is essentially a stripped down version of a very simple neural network with a linear activation function.

Replies from: gwern, Will_Sawin
comment by gwern · 2011-01-19T17:22:12.682Z · LW(p) · GW(p)

Is that really true? Couldn't one say that of just about any Turing-complete (or less) model of computation?

'Oh, it's interesting that they are really just a simple unary fixed-length lambda-calculus function with constant-value parameters.'

'Oh, it's interesting that they are really just restricted petri-nets with bounded branching factors.'

'Oh, it's interesting that these are modelable by finite automata.'

etc. (Plausible-sounding gobbledygook included to make the point.)

Replies from: mistercow
comment by mistercow · 2011-01-19T18:22:39.042Z · LW(p) · GW(p)

Yes, sort of, but a) a linear classifier is not a Turing-complete model of computation, and b) there is a clear resemblance that can be seen by merely glancing at the equations.
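The resemblance can be made concrete with a small sketch: the proper linear model from the post next to a single artificial neuron with zero bias and an identity activation. The function names and the numbers are illustrative only.

```python
def proper_linear_model(weights, cues):
    """P = w1*c1 + w2*c2 + ... + wn*cn, as in the post."""
    return sum(w * c for w, c in zip(weights, cues))

def neuron(weights, inputs, bias=0.0, activation=lambda z: z):
    """One artificial neuron: activation(bias + weighted sum of inputs)."""
    return activation(bias + sum(w * x for w, x in zip(weights, inputs)))

weights = [0.5, -1.0, 2.0]   # hypothetical weights
cues = [2.0, 1.0, 0.25]      # hypothetical cue values

print(proper_linear_model(weights, cues))
print(neuron(weights, cues))
```

With the default identity activation and zero bias the two functions compute the same number. Swapping in a nonlinear activation, or stacking several such units, is where a neural network starts to differ from the SPR.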

comment by Will_Sawin · 2011-01-21T04:45:02.141Z · LW(p) · GW(p)

I would argue that neurons, neural nets, SPRs, and everyone else doing linear regression use those techniques because it's the simplest way to aggregate data.

comment by Anatoly_Vorobey · 2011-01-18T16:47:18.554Z · LW(p) · GW(p)

I'm skeptical, and will now proceed to question some of the assertions made/references cited. Note that I'm not trained in statistics.

Unfortunately, most of the articles cited are not easily available. I would have liked to check the methodology of a few more of them.

|For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do.

The paper doesn't actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices - not to experts' predictions. I think it's fair to say that the claim I quoted is overreached.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the paper was published. The NYTimes article about it which you reference is from 1990 (the paper bizarrely dates it to 1995; I'm not sure what's going on there).

The fact that there's a linear model - not specified precisely anywhere in the article - which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper - 1986 is clearly the worst of the 1980s). NYTimes says "When the dust settles, he predicts, it will be judged the worst vintage of the 1980's, and no better than the unmemorable 1974's or 1969's". The 1995 paper says, more modestly, "We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s". Second, the 1989-1990 vintage is predicted to be "outstanding" (paper), "stunningly good" (NYTimes), and, "adjusted for age, will outsell at a significant premium the great 1961 vintage" (NYTimes).

It's now 16 years later. How do we test these predictions?

First, I've stumbled on a different paper from the primary author, Prof. Ashenfelter, from 2007. Published 12 years later than the one you reference, this paper has substantially the same contents, with whole pages copied verbatim from the earlier one. That, by itself, worries me. Even more worrying is the fact that the 1986 prediction, prominent in the 1990 article and the 1995 paper, is completely missing from the 2007 paper (the data below might indicate why). And most worrying of all is the change of language regarding the 1989/1990 prediction. The 1995 paper says about its prediction that the 1989/1990 will turn out to be outstanding, "Many wine writers have made the same predictions in the trade magazines". The 2007 paper says "Ironically, many professional wine writers did not concur with this prediction at the time. In the years that have followed minds have been changed; and there is now virtually unanimous agreement that 1989 and 1990 are two of the outstanding vintages of the last 50 years."

Uhm. Right. Well, because the claims aren't strong enough, they do not exactly contradict each other, but this change leaves a bad taste. I don't think I should give much trust to these papers' claims.

The data I could find quickly to test the predictions is here. The prices are broken down by the chateaux, by the vintage year, the packaging (I've always chosen BT - bottle), and the auction year (I've always chosen the last year available, typically 2004). Unfortunately, Ashenfelter underspecifies how he came up with the aggregate prices for a given year - he says he chose a package of the best 15 wineries, but doesn't say which ones or how the prices are combined. I used 5 wineries that are specified as the best in the 2007 paper, and looked up the prices for years 1981-1990. The data is in this spreadsheet. I haven't tried to statistically analyze it, but even from a quick glance, I think the following is clear. 1986 did not stabilize as the worst year of the 1980s. It is frequently second- or third-best of the decade. It is always much better than either 1984 or 1987, which are supposed to be vastly better according to the 1995 paper's weather data (see Figure 3). 1989/1990 did turn out well, especially 1990. Still, they're both nearly always less expensive than 1982, which is again vastly inferior in the weather data (it isn't even in the best quarter). Overall, I fail to see much correlation between the weather data in the paper for the 1980s, the specific claims about 1986 and 1989/1990, and the market prices as of 2004. I wouldn't recommend using this SPR to predict market prices.

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It's difficult to tell, but my prediction is "yes". I anticipate overfitting and shoddy methodology. I anticipate huge influence of the selection bias - the authors that publish these kinds of papers will not publish a paper that says "The experts were better than our SPR". And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate modest effect in a very specific situation with a small sample size.

I've looked at your second example:

|Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting].

I couldn't find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say "predict marital happiness", it really means "predicts one of the partners' subjective opinion of their marital happiness" - as opposed to e.g. stability of the marriage over time. There's no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak - rate of 0.4. In a follow-up experiment, correlation rate went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they're asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.

Finally, the following claim is the single most objectionable one in your post, to my taste:

|If you're hiring, you're probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane - and is highly dangerous advice. I'm not able to view the papers you base it on, but if they're anything like the first and second example, they're far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically over what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn't follow that it's good advice for you to abstain from interviewing - it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

comment by Vaniver · 2011-01-18T07:19:25.366Z · LW(p) · GW(p)

I was thinking of writing a post about Bishop & Trout when I didn't see it mentioned on this site before, but I'm glad you beat me to it. (Among other things, I lent out my copy and so would have difficulty writing up a review.) It's a great book.

comment by Randaly · 2011-01-18T05:13:48.625Z · LW(p) · GW(p)

Your upload of Dawes's "The Robust Beauty of Improper Linear Models in Decision Making" seems to be broken- at least, I'm not able to access it.

Replies from: shokwave
comment by shokwave · 2011-01-18T05:42:09.261Z · LW(p) · GW(p)

at least, I'm not able to access it.

Neither.

Replies from: lukeprog
comment by lukeprog · 2011-01-18T05:50:41.068Z · LW(p) · GW(p)

Dang. Fixed.

Replies from: shokwave
comment by shokwave · 2011-01-18T07:41:17.889Z · LW(p) · GW(p)

Wow. I highly recommend reading the Dawes pdf; it's illuminating:

Expert doctors coded [variables from] biopsies of patients with Hodgkin's disease and then made an overall rating of the severity of the process. The overall rating did not predict the survival time of the 193 patients, all of whom died. (The correlations of survival time with ratings was virtually 0, some in the wrong direction). The variables that the doctors coded, however, did predict survival time when they were used in a multiple regression model.

In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables ... People are bad at integrating information from diverse and incomparable sources. Proper linear models are good at such integration ...

He then goes on to show that improper linear models still beat human judgment. If your reaction to the top-level post wasn't endorsement of statistical methods for these problems, this pdf is a bunch more evidence that you can use to update your beliefs about statistical methods of prediction.
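For anyone who wants to see the proper/improper distinction concretely, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration and not taken from Dawes: the three cues, the `TRUE_WEIGHTS` values, the noise level, and the sample sizes are invented. The "proper" model fits its weights by least squares on a training sample; the "improper" one just adds up the cues with unit weights, which only makes sense here because the cues are already standardized and all point in the right direction.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_proper(cues, target):
    """Proper linear model: least-squares weights (intercept first)."""
    rows = [[1.0] + list(c) for c in cues]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, cues):
    """Apply fitted weights (intercept first) to each row of cues."""
    return [weights[0] + sum(w * c for w, c in zip(weights[1:], row)) for row in cues]

def unit_weight_predict(cues):
    """Improper linear model: every standardized cue gets weight 1."""
    return [sum(row) for row in cues]

rng = random.Random(0)
TRUE_WEIGHTS = (0.6, 0.3, 0.1)  # hypothetical, chosen only for this demo

def make_sample(n):
    """Synthetic standardized cues and a noisy criterion built from them."""
    cues = [[rng.gauss(0, 1) for _ in TRUE_WEIGHTS] for _ in range(n)]
    target = [sum(w * c for w, c in zip(TRUE_WEIGHTS, row)) + rng.gauss(0, 1)
              for row in cues]
    return cues, target

train_cues, train_y = make_sample(200)
test_cues, test_y = make_sample(2000)

weights = fit_proper(train_cues, train_y)
r_proper = pearson(predict(weights, test_cues), test_y)
r_improper = pearson(unit_weight_predict(test_cues), test_y)

print(f"proper (fitted weights):  r = {r_proper:.3f}")
print(f"improper (unit weights):  r = {r_improper:.3f}")
```

Both correlations are measured on held-out cases, so the fitted model gets no credit for memorizing its training sample. The sketch says nothing about human judges; it only shows how little machinery an improper model needs once someone has picked the cues.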

Replies from: TheOtherDave
comment by TheOtherDave · 2011-01-18T13:49:29.777Z · LW(p) · GW(p)

People are good at picking out the right predictor variables ... People are bad at integrating information from diverse and incomparable sources.

That is a beautiful summary sentence, incidentally, and I am taking it with me as a shorthand "handle" for this whole idea.

I find it works well as a surface-level counter for the (alas, still inappropriately compelling) idea that a dumb algorithm can't get more accurate results than a smart observer.

Replies from: DSimon
comment by DSimon · 2011-01-18T15:46:31.714Z · LW(p) · GW(p)

Another possible metaphor is the pocket calculator.

It can find a number for any expression you can put into it, and in most cases it can do it way faster and more accurately than a human could. However, that doesn't make it a replacement for a human. An intelligent agent like a human is still needed for the crucial part of figuring out what expression would be meaningful to put into it.

Replies from: handoflixue
comment by handoflixue · 2011-01-18T21:49:10.888Z · LW(p) · GW(p)

That is a very helpful metaphor for wrapping my head around both the advantages and limitations of SPR, thank you! :)

comment by Dmytry · 2012-02-26T14:49:45.192Z · LW(p) · GW(p)

Thinking about it, the main critiques I have of this article are:

  • It only lists cases where SPRs 'outperformed' expertise. In most of those cases we are loosely describing as 'experts' people who never did any proper training (with exercises and testing) to perform the task in question.

  • It equates better correlation with "outperforms". These are not the same thing. The maximum correlation happens when you classify into those with less than average risk of recidivism and those with larger than average risk. A parole board is not even supposed to work like this, AFAIK.

  • If some SPR can 'outperform' average HR expertise, it doesn't mean the SPR outperforms the best expertise. An example where it matters: if you are a software start-up founder and your expertise is average, your start-up will almost inevitably fail. Only a small percentage succeed, the top 1% or less. You strive to maximize your chances of making it into the top 1%, not the top 50%.

  • What about ethical issues? Race correlates with criminality, for example.

edit: not fully sure at the moment when maximum correlation happens.
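One way to probe the question of when maximum correlation happens is a small simulation. This is only a sketch under invented assumptions: each person gets a hypothetical true risk drawn uniformly from 0 to 1, the outcome is a coin flip with that probability, and the "classifier" is the above/below-average split described in the second bullet. None of these numbers come from the recidivism studies.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
n = 20000

# Hypothetical true risk per person, uniform on [0, 1).
risk = [rng.random() for _ in range(n)]
# Binary outcome: 1 means the event (e.g. reoffending) happened.
outcome = [1 if rng.random() < p else 0 for p in risk]

# Above/below-average classification of the same people.
mean_risk = statistics.fmean(risk)
split = [1 if p > mean_risk else 0 for p in risk]

r_score = pearson(risk, outcome)   # graded risk score vs outcome
r_split = pearson(split, outcome)  # two-bucket classification vs outcome

print(f"graded score:        r = {r_score:.3f}")
print(f"above/below average: r = {r_split:.3f}")
```

Under these particular assumptions the graded score should come out ahead of the two-bucket split (working it out by hand gives roughly 0.58 versus 0.50), so collapsing a score into above and below average does not maximize correlation here. Other risk distributions could behave differently, and none of this settles whether correlation is the right yardstick for a parole board.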