Consider Representative Data Sets
post by Vladimir_Nesov · 2009-05-06T01:49:21.389Z · LW · GW · Legacy · 15 comments
In this article, I consider the standard biases in drawing factual conclusions that are not driven by emotional reactions, and describe a simple model summarizing what goes wrong with reasoning in these cases, which in turn suggests a way of systematically avoiding this kind of problem.
The following model is used to describe the process of getting from a question to a (potentially biased) answer for the purposes of this article. First, you ask yourself a question. Second, in the context of the question, a data set is presented before your mind, either directly, by you looking at the explicit statements of fact, or indirectly, by associated facts becoming salient to your attention, triggered by the explicit data items or by the question. Third, you construct an intuitive model of some phenomenon, one that allows you to see its properties, as a result of considering the data set. And finally, you pronounce the answer, which is read out as one of the properties of the model you've just constructed.
This description is meant to present mental paintbrush handles: to refer to the things you can see introspectively, and the things you could consciously operate on if you choose to.
Most of the biases in this class may be seen as particular ways in which you pay attention to the wrong data set, one not representative of the phenomenon you model to get the answer you seek. As a result, the intuitive model becomes systematically wrong, and the answer read out from it becomes biased. Below I review the specific biases, identifying the ways in which things go wrong in each particular case, and then summarize the classes of reasoning mistakes that play major roles in these biases, along with the corresponding ways of avoiding them.
Correspondence Bias is a tendency to attribute to a person a disposition to behave in a particular way, based on observing an episode in which that person behaves in that way. The data set that gets considered consists only of the observed episode, while the target model is of the person's behavior in general, in many possible episodes, in many different possible contexts that may influence the person's behavior.
Hindsight bias is a tendency to overestimate the a priori probability of an event that has actually happened. The data set that gets considered overemphasizes the scenario that did happen, while the model that needs to be constructed, of the a priori belief, should be indifferent to which of the options actually gets realized. From this model, you need to read out the probability of the specific event, but which event you'll read out shouldn't figure in the model itself.
Availability bias is a tendency to estimate the probability of an event based on whatever evidence about that event pops into your mind, without taking into account the ways in which some pieces of evidence are more memorable than others, or easier to come by than others. This bias directly consists in considering a mismatched data set, which leads to a distorted model and a biased estimate.
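To make this concrete, here is a minimal sketch in Python, with invented numbers, of how a data set selected by memorability rather than by representativeness distorts the estimate read out from it. The episode labels, weights, and sample sizes are hypothetical illustrations, not real statistics.

```python
import random

random.seed(0)

# Toy world: 5% of episodes are "dramatic", 95% are "mundane" (invented numbers).
episodes = ["dramatic"] * 50 + ["mundane"] * 950

# Assumed memorability: dramatic episodes are ten times easier to recall.
memorability = {"dramatic": 10.0, "mundane": 1.0}

def estimate(sample):
    """The answer 'read out' from the data set: the apparent rate of dramatic episodes."""
    return sum(e == "dramatic" for e in sample) / len(sample)

# Representative data set: a uniform sample of episodes.
representative = random.sample(episodes, 200)

# Availability-style data set: episodes recalled in proportion to memorability.
recalled = random.choices(episodes, weights=[memorability[e] for e in episodes], k=200)

print("true rate:              0.05")
print("representative sample:  %.2f" % estimate(representative))
print("memorability-weighted:  %.2f" % estimate(recalled))
```

No single recalled episode is false, yet the memorability-weighted estimate comes out several times higher than the true rate; the distortion lives entirely in which data set was brought to mind.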
Planning Fallacy is a tendency to overestimate your efficiency in accomplishing a task. The data set you consider consists of the simple cached ways in which you go about accomplishing the task, and lacks the unanticipated problems and more complex ways in which the process may unfold. As a result, the model fails to adequately describe the phenomenon, and the answer comes out systematically wrong.
The Logical Fallacy of Generalization from Fictional Evidence consists in drawing real-world conclusions from statements invented and selected for the purpose of writing fiction. The data set is not at all representative of the real world, and in particular of whatever real-world phenomenon you need to understand to answer your real-world question. Considering this data set leads to an inadequate model, and inadequate answers.
Proposing Solutions Prematurely is dangerous because it introduces weak conclusions into the pool of facts you are considering. As a result, the data set you think about becomes weaker, overly tilted towards premature conclusions that are likely to be wrong and that are less representative of the phenomenon you are trying to model than the facts you started from.
Generalization From One Example is a tendency to pay too much attention to the few anecdotal pieces of evidence you have experienced, and to model some general phenomenon based on them. This is a special case of availability bias, and the way in which the mistake unfolds is closely related to the correspondence bias and the hindsight bias.
Contamination by Priming is a problem that relates to the process by which facts are implicitly introduced into the attended data set. When you are primed with a concept, facts related to that concept come to mind more easily. As a result, the data set selected by your mind becomes tilted towards elements related to that concept, even if it has no relation to the question you are trying to answer. Your thinking becomes contaminated, shifted in a particular direction. The data set in your focus of attention becomes less representative of the phenomenon you are trying to model, and more representative of the concepts you were primed with.
Knowing About Biases Can Hurt People. When you learn about the biases, you obtain a toolset for constructing new statements of fact. Similarly to what goes wrong when you propose solutions to a hard problem prematurely, you contaminate the data set with weak conclusions: allegations against specific data items that don't add to your understanding of the phenomenon you are trying to model, that distract from considering the question, take away whatever relevant knowledge you had, and in some cases even invert it.
A more general technique for avoiding these mistakes consists in making sure that the data set you consider is representative of the phenomenon you are trying to understand. The human brain can't automatically correct for a misleading selection of data, so you need to consciously ensure that you are presented with a balanced selection.
The first mistake is the introduction of irrelevant data items. Focus on the problem; don't let distractions get in the way. The irrelevant data may find its way into your thoughts covertly, through priming effects you don't even notice. Don't let anything distract you, even if you understand that the distraction isn't related to the problem you are working on. Don't construct irrelevant items yourself, as byproducts of your activity. Make sure that the data items you consider are actually related to the phenomenon you are trying to understand. To form accurate beliefs about something, you really do have to observe it. Don't think about fictional evidence, and don't think about facts that look superficially relevant to the question but actually aren't, as in the case of hindsight bias and reasoning by surface analogies.
The second mistake is to consider an unbalanced data set, overemphasizing some aspects of the phenomenon and underemphasizing others. The data needs to cover the whole phenomenon in a representative way for the human mind to process it adequately. There are two sides to correcting this imbalance. First, you may take away the excessive data points, deliberately refusing to consider them, so that your mind gets presented with less evidence, but evidence that is more balanced, more representative of what you are trying to understand. This is similar to what happens when you take an outside view, for example to avoid the planning fallacy. Second, you may generate the correct data items to fill out the rest of the model, from the cluster of evidence you've got. This generation may happen either formally, by using technical models of the phenomenon that allow you to explicitly calculate more facts, or informally, by training your intuition to follow reliable rules for interpreting specific pieces of evidence as aspects of the whole phenomenon you are studying. Together, these feats constitute expertise in the domain, an art of knowing how to make use of data that would only confuse a naive mind. When discarding evidence to correct the imbalance of data, only the parts you don't possess expertise in need to be thrown away, while the parts that you are ready to process may be kept, making your understanding of the phenomenon stronger.
The third mistake is to mix reliable evidence with unreliable evidence. The mind can't tell relevant info from fictional irrelevant info, let alone solid relevant evidence from shaky relevant evidence. If you know some facts for sure, and some facts only through indirect unreliable methods, don't consider the latter at all when forming your initial understanding of the phenomenon. Your own untrained intuition generates weak facts about things in which you don't have domain expertise, for example when you spontaneously think up solutions to a hard problem. You get only wild guesses when the data is too thin for your intuition to retain even minimal reliability once it moves a few steps away from the data. You get weak evidence from applying general heuristics that don't promise exceptional precision, such as knowledge of biases. You get weak evidence from listening to the opinion of the majority, and from listening to virulent memes. However, when you don't have reliable data, you need to start including less reliable evidence in your considerations, but only the best of what you can come up with.
Your thinking shouldn't be contaminated by unrelated facts, shouldn't tumble over from an imbalance in knowledge, and shouldn't get diluted by an abundance of weak conclusions. Instead, your understanding should grow more focused on the relevant details, more comprehensive and balanced, attending to more aspects of the problem, and more technically accurate.
Think representative sets of your best data.
15 comments
Comments sorted by top scores.
comment by Mike Bishop (MichaelBishop) · 2009-05-08T01:55:22.245Z · LW(p) · GW(p)
I think this is a better post than a fair number that get promoted. It probably has more room for improvement as well.
comment by Vladimir_Nesov · 2009-05-07T02:00:32.689Z · LW(p) · GW(p)
So, what's wrong with this article? Is it bad prose, or too hand-wavy assertions, or overly obscure presentation/inferential distance, or too much text, or too obvious a point? Please leave a comment, I really don't understand.
In general, I think that establishing a custom of writing formal, review-like comments would be valuable as feedback, not about the subject of the article, but about its presentation, especially if the article looks bad and there is much for the author to work on improving.
Replies from: jimrandomh, maia, talisman, Eliezer_Yudkowsky, badger, steven0461, MrHen
↑ comment by jimrandomh · 2009-05-07T06:10:03.052Z · LW(p) · GW(p)
I think it's a stylistic issue: there are too many function words. After finishing a draft, do a second pass deleting as many unnecessary words as possible. If changing the tense or person lets you eliminate a few pronouns, do so. I applied my personal editing procedure to your second paragraph, and went from this:
The following model is used to describe the process of getting from a question to a (potentially biased) answer for the purposes of this article. First, you ask yourself a question. Second, in the context of the question, a data set is presented before your mind, either directly, by you looking at the explicit statements of fact, or indirectly, by associated facts becoming salient to your attention, triggered by the explicit data items or by the question. Third, you construct an intuitive model of some phenomenon, one that allows you to see its properties, as a result of considering the data set. And finally, you pronounce the answer, which is read out as one of the properties of the model you've just constructed.
To this:
The following model describes the process of getting from questions to (possibly biased) answers. First, ask a question. That brings a data set to mind, either directly (by looking at explicit statements of fact), or indirectly (by associated facts coming to your attention, triggered by explicit data items or by the question). Then, using that data set, construct an intuitive model of the phenomenon that lets you see its properties. Finally, read out the answer as a property of the model you've constructed.
I didn't change anything in the actual content, but to my mind this reads much better.
Replies from: ciphergoth, Vladimir_Nesov
↑ comment by Paul Crowley (ciphergoth) · 2009-05-07T07:55:40.200Z · LW(p) · GW(p)
This comment made me wish I'd asked this question on every article I've written. I shall do so next time!
↑ comment by Vladimir_Nesov · 2009-05-07T11:04:08.596Z · LW(p) · GW(p)
Thanks, I'll reedit the article in place, following this and other suggestions, and post a comment announcing the second revision.
↑ comment by maia · 2012-04-12T18:32:24.690Z · LW(p) · GW(p)
You don't provide any examples and your text is too dense and abstract. Try using smaller words and shorter sentences. Also, you put too much emphasis on your numbering scheme: You bold "The first mistake" and then you de-emphasize the part where you actually say what that mistake is, and the rest of the paragraph is a wall of text.
It's also, frankly, not very helpful.
The irrelevant data may find its way into your thoughts covertly, through priming effects you don't even notice... Don't think about fictional evidence, and don't think about facts that look superficially relevant to the question but actually aren't
You can't avoid priming effects once you've been primed. "Don't think about it" just won't work.
It seems like the main point of your post is to present biases in a new way: modeling them all as some form of "using the wrong data." I'm skeptical that this is a helpful model, but honestly, I would probably be a lot less skeptical if your post focused on that main point more clearly.
↑ comment by talisman · 2009-05-07T05:21:37.035Z · LW(p) · GW(p)
I think the problem is a combination of:
- length
- density of ideas too low --- long section resummarizing old posts
- prose hard to read, feels somehow flat --- try using shorter paragraphs, varying sentence lengths, using more tangible words and examples
Comparing to Robin's and Eliezer's stuff, the gold standards:
Robin's are generally very short, high-level, and high-density. Easy to read quickly for "what's this about? do I care?" and then reread several times to think carefully about.
Eliezer's are long and lower-density but meticulous and carefully arranged so that the ideas build brick on brick (and also offset length with effective, dramatic prose).
I would suggest trying to write this post Robin-style and see how it comes out: present your key points in as strong, terse and efficient a way as you can, even if you lose some people. Writing long posts seems harder.
Also, try pulling out some individual sentences and reading them out of context. Just to grab one almost at random: "Contamination by Priming is a problem that relates to the process of implicitly introducing the facts in the attended data set." Pretty inscrutable.
Compare to Anna Salamon's description of the same thing: "To sum up the principle briefly: your brain builds you up a self-image. You are the kind of person who says, and does... whatever it is your brain remembers you saying and doing." Even though hers is longer in words, the concepts are clearer and more explicit. The text is bouncier and has more places for the mind to grab onto.
Hope that helps? Good luck!
Replies from: Vladimir_Nesov
↑ comment by Vladimir_Nesov · 2009-05-07T10:47:02.445Z · LW(p) · GW(p)
Thank you, that was helpful. I'll write a shorter summary article in a few days (linking to a revised version of this article).
↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-05-07T05:06:49.173Z · LW(p) · GW(p)
Something about it makes it hard to read - I had to squint and concentrate and even then my eyes kept skipping sentences.
Replies from: Vladimir_Nesov
↑ comment by Vladimir_Nesov · 2009-05-07T10:40:03.505Z · LW(p) · GW(p)
Does that mean your impression is that the text is hard to read and at the same time features redundancy? I guess it's possible that I was unconsciously trying to compensate for the opaque presentation with repetition...
↑ comment by badger · 2009-05-07T04:17:25.576Z · LW(p) · GW(p)
I'm in the same boat as Mr. Hen. I haven't thoroughly read it yet because honestly it looks long. Your recommendations look obvious at first glance, which lessens the incentive to read deeper.
I would have moved the three general mistakes to the beginning and tightened up their description a little more. That would do a better job of drawing people into the article, and then you could describe how the three mistakes manifest themselves in the specific biases. Is there a (representative!) anecdote that could liven it up?
I wonder if it would be worthwhile for some of the less experienced writers to set up an informal draft exchange. It'd be nice to have another set of eyes look over an article before posting it. I don't have plans for any articles in the near future, but if anyone wants me to look over a draft, feel free to pm me.
Replies from: Vladimir_Nesov
↑ comment by Vladimir_Nesov · 2009-05-07T11:44:06.877Z · LW(p) · GW(p)
The third and especially the second mistakes seem nontrivial to me (at least, I only thought about them explicitly and wrote them down about a month ago, which gave me the idea of writing this article).
Replies from: badger
↑ comment by badger · 2009-05-07T17:09:32.108Z · LW(p) · GW(p)
Just to be clear, upon actually reading them, I agree with you. It's just that at first glance they don't look like anything new, and they are buried fairly deep in the article, so they are easy to pass over. That's why I think it might have gone over better if you had led with them.
↑ comment by steven0461 · 2009-05-07T15:53:25.530Z · LW(p) · GW(p)
No strong opinion of the content until I reread, but the writing style seems rather dry and abstract. That's a problem I tend to have myself; probably it comes from not speaking English in daily life.