Suspiciously balanced evidence

gjm

post by gjm · 2020-02-12T17:04:20.516Z · LW · GW · Legacy · 24 comments

What probability do you assign to the following propositions?

"Human activity has caused substantial increases in global mean surface temperature over the last 50 years and barring major policy changes will continue to do so over at least the next 50 years -- say, at least half a kelvin in each case."
"On average, Christians in Western Europe donated a larger fraction of their income last year to non-religious charitable causes than atheists."
"Over the next 10 years, a typical index-fund-like basket of US stocks will increase by more than 5% per annum above inflation."
"In 2040, most global electricity generation will be from renewable sources."

These are controversial and/or difficult questions. There are surely a lot of people who think they know the answers and will confidently proclaim that these propositions are just true or, as the case may be, false. But you are a sophisticated reasoner, able to think in probabilistic terms, and I expect your answers to questions like these mostly lie between p=0.1 and p=0.9 or thereabouts, just like mine. No foolish black-and-white thinking for the likes of us!

(I confess that my estimate for the first of those propositions is above $p = 0.9$. But it's not above $p = 0.99$.)

... But isn't it odd that the evidence should be so evenly balanced? No more than 3 bits or so either way from perfect equality? Shouldn't we expect that the totality of available evidence, if we could evaluate it properly, would make for a much larger imbalance? If we encounter only a small fraction of it (which we'd need to, to explain such evenly balanced results), shouldn't we expect that randomness in the subset we happen to encounter will in some cases make us see a large imbalance even if any given single piece of evidence is about as likely to go one way as the other? What's going on here?
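To make "3 bits or so" concrete, here is a minimal sketch, assuming evidence is measured as the base-2 logarithm of the odds relative to 1:1. The function names are illustrative, not from any library.

```python
import math

def bits_to_probability(bits: float) -> float:
    """Probability reached from even odds after `bits` of net evidence."""
    odds = 2.0 ** bits  # each bit of evidence doubles the odds in favour
    return odds / (1.0 + odds)

def probability_to_bits(p: float) -> float:
    """Net evidence, in bits, separating probability p from even odds."""
    return math.log2(p / (1.0 - p))

# How far from perfect balance are some familiar probabilities?
for p in (0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:.2f}  ->  {probability_to_bits(p):+.2f} bits")
```

On this scale p=0.9 corresponds to odds of 9:1, a little over 3 bits, and p=0.1 to the same amount in the other direction.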

Let me add a bit of further fuel to the fire. I could have tweaked all those propositions somewhat -- more than 3%, or more than 7%, above inflation; more than 40%, or more than 60%, of electricity from renewables, or 2050 instead of 2040. Surely that ought to change the probabilities quite a bit. But the answers I'd have given to the questions would still have probabilities between 0.1 and 0.9, and I bet others' answers would have too. Can things really be finely balanced enough to justify this?

I can think of two "good" explanations (meaning ones that don't require us to be thinking badly) and one not-so-good one.

Good explanation #1: I chose propositions that I know are open to some degree of doubt or controversy. When I referred to "questions like these", you surely understood me to mean ones open to doubt or controversy. So questions where the evidence is, or seems to us to be, much more one-sided were filtered out. (For instance, I didn't ask about young-earth creationism, because I think it's almost certainly wrong and expect most readers here to feel the same way.) ... But isn't it strange that there are so many questions for which the evidence we have is so very balanced?

Good explanation #2: When assessing a question that we know is controversial but that seems one-sided to us, we tend to adjust our probabilities "inward" towards 1:1 as a sort of a nod to the "outside view". I think this, or something like it, is probably a very sensible idea. ... But I think it unlikely that many of us do it in a principled way, not least because it's not obvious how to.

Not-so-good explanation: We have grown used to seeing probability estimates as a sign of clear thought and sophistication, and every time we accompany some opinion with a little annotation "$p ≃ 0.7$" we get a little twinge of pride at how we quantify our opinions, avoid black-and-white thinking, etc. And so it becomes a habit, and we translate an internal feeling of confidence-but-not-certainty into something like "$p ≃ 0.7$" even when we haven't done the sort of evidence-weighing that might produce an actual numerical result.

Now, I'm not sure which of two quite different conclusions I actually want to endorse.

"The temptation to push all probabilities for not-crazy-sounding things into the middle of the possible range is dangerous. We are apt to treat things as substantially-open questions that really aren't; to be timid where we should be bold. Let's overcome our cowardice and become more willing to admit when the evidence substantially favours one position over another."
"Our practice is better than our principles. Empirically, we make lots of mistakes even in cases where numerical evidence-weighing would lead us to probabilities close to 0 or 1. So we should continue to push our probability estimates inward. The challenge is to figure out a more principled way to do it."

Here is a possible approach that tries to combine the virtues of both:

Allow accumulating evidence to push your probability estimates towards 0 or 1; be unashamed of these extreme-sounding probabilities. BUT
Keep a separate estimate of how confident you are that your approach is correct; that your accumulation of evidence is actually converging on something like the right answer. THEN,
When you actually need a probability estimate, bring these together.

Suppose your "internal" probability estimate is $p$, your probability that your approach is correct is $q$, and your probability conditional on your approach being wrong is $r$. Then your overall probability estimate is $q p + (1 - q) r$ and (holding $q, r$ constant) in effect your internal probability estimates are linearly squashed into the interval from $(1 - q) r$ to $(1 - q) r + q$. So, for instance, if you're 90% sure your approach is right and your best guess if your approach is all wrong is that the thing's 75% likely to be true, then your estimates are squashed into the range [7.5%, 97.5%].
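The squashing can be sketched in a few lines of Python. This is only an illustration of the formula above; the function names are made up for the example.

```python
from typing import Tuple

def overall_probability(p: float, q: float, r: float) -> float:
    """Mix the internal estimate p with the fallback r, weighted by trust q in the approach."""
    return q * p + (1.0 - q) * r

def squashed_range(q: float, r: float) -> Tuple[float, float]:
    """Interval that internal estimates in [0, 1] are mapped onto for fixed q and r."""
    return overall_probability(0.0, q, r), overall_probability(1.0, q, r)

# The worked example from the text: q = 0.9, r = 0.75.
low, high = squashed_range(q=0.9, r=0.75)
print(f"range: [{low:.1%}, {high:.1%}]")

# Even a very confident internal estimate stays inside that range.
print(f"internal 0.999 -> overall {overall_probability(0.999, 0.9, 0.75):.4f}")
```

With $q = 0.9$ and $r = 0.75$ this reproduces the [7.5%, 97.5%] range, and shows that no amount of internal confidence can carry the overall estimate past the upper end while $q$ and $r$ stay fixed.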

Cautionary note: There's an important error I've seen people make when trying to do this sort of thing (or encourage others to do it), which is to confuse the propositions "I'm thinking about this all wrong" and "the conclusion I'm fairly sure of is actually incorrect". Unless the conclusion in question is a very specific one, that's likely a mistake; the probability I've called $r$ above matters and surely shouldn't be either 0 or 1.

24 comments

Comments sorted by top scores.

comment by [deleted] · 2020-02-12T23:04:54.229Z · LW(p) · GW(p)

But isn't it strange that there are so many questions for which the evidence we have is so very balanced?

But there aren’t. We just focus on the very tiny subset of questions for which that is the case, because those are the ones that are interesting.

Replies from: matthew-barnett, TAG

↑ comment by Matthew Barnett (matthew-barnett) · 2020-02-13T04:58:34.835Z · LW(p) · GW(p)

Relevant post from Robin Hanson.

↑ comment by TAG · 2020-02-13T08:57:04.986Z · LW(p) · GW(p)

If the evidence is overwhelmingly for or against, we call that a fact or a falsehood, not a question.

comment by gwern · 2020-02-13T17:49:53.565Z · LW(p) · GW(p)

If you look at prediction datasets like PredictionBook or GJP or other calibration datasets (or even prediction markets with their longshot biases), which cover a wide variety of questions (far wider than most policy or political debates, and typically with neutral valence such that most predictors are disinterested), it seems like people are generally uncalibrated in the direction of extremes, not 50%.

So that's evidence against people actually holding beliefs which are biased to be too close to 50%, and suggests something else is going on, like topic filtering or attempting to appear rhetorically reasonable / nonfanatical. (The second definitely seems like a concern. I notice that people seem to really shy away from publicly espousing strong stands like when we were discussing the Amanda Knox case, or putting a 0/100% on PB even when that is super obviously correct just from base rates; there are clear status/signaling dynamics going on there.)

Replies from: Vaniver

↑ comment by Vaniver · 2020-02-13T18:23:33.168Z · LW(p) · GW(p)

On things like PredictionBook, is it easy to compare predictions of the question-asker and others? It seems like the sort of thing where I want to predict facts that I'm ~50% on, but outsiders are more extreme (because I'm more likely to be unusually confused about the question), but I'm not sure how that compares to other effects (like general overconfidence).

comment by PP · 2020-05-19T16:46:25.656Z · LW(p) · GW(p)

I have a confession to make: I can make no sense of probability as degree of belief. The measure-theoretic definition (due to Kolmogoroff?) suits me and I have never progressed beyond it. I can do naught but beg your forgiveness.

Replies from: gjm, Purplehermann

↑ comment by gjm · 2020-05-20T09:33:47.857Z · LW(p) · GW(p)

You don't need anyone's forgiveness. But it turns out that quantifying degrees of belief is useful sometimes, and that representing them as numbers from 0 to 1 that behave like probabilities is a good way to do that. (There are theorems that kinda-sorta say it's the only way to do that, if you want various nice-sounding things to be true, but how much you care about those nice-sounding things is up to you.) So you may be missing out on some useful thinking tools.

↑ comment by Purplehermann · 2020-05-20T09:52:08.304Z · LW(p) · GW(p)

What is the measure theoretic definition?

Replies from: gjm, PP

↑ comment by gjm · 2020-05-20T13:00:56.512Z · LW(p) · GW(p)

A probability measure is a measure $μ$ (on a $σ$-algebra $\mathcal{A}$ on a set $A$) such that $μ(A) = 1$.

A measure on a $σ$-algebra $\mathcal{A}$ is a function $μ : \mathcal{A} \to \mathbb{R}$ with properties like "if $X \cap Y = \emptyset$ then $μ(X \cup Y) = μ(X) + μ(Y)$" etc.; the idea is that the elements of $\mathcal{A}$ are the subsets of $A$ that are well-enough behaved to be "measurable" and then if $X$ is such a subset then $μ(X)$ says how big $X$ is.

A $σ$-algebra on a set $A$ is a set $\mathcal{A}$ of subsets of $A$ that (1) includes all-of-$A$, (2) whenever it includes a set $X$ also includes its complement $A - X$, and (3) whenever it includes all of countably many sets $X_{i}$ also includes their union.

And now probability theory is the study of probability measures. (So the measure-theoretic definition of "probability" would be "anything satisfying the formal properties of a probability measure", just as the mathematician's definition of "vector" is "anything lying in a vector space".)

"Bayesian" probability theory doesn't disagree with any of that; it just says that one useful application for (mostly the more elementary bits of) the theory of probability measures is to reasoning under uncertainty, where it's useful to quantify an agent's beliefs as a probability measure. Here $A$ is the set of ways the world could be; $\mathcal{A}$ is something like the set of sets of ways the world could be that can be described by propositions the agent understands, or the smallest $σ$-algebra containing all of those; $μ$, more commonly denoted $P$ or $\mathbb{P}$ or $\mathbf{P}$ or something of the sort, gives you for any such set of ways the world could be a number quantifying how likely the agent thinks it is that the actual state of the world is in that set.

You can work with probability measures even if you think that it's inappropriate to use them to quantify the beliefs of would-be rational agents. I guess that's PP's position?
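For a finite set the definitions above can be checked mechanically. The following is a rough sketch under the assumption of a hypothetical example (one roll of a fair six-sided die, with every subset counted as measurable); all names are invented for illustration, and on a finite set closure under pairwise unions stands in for countable unions.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical example: the six outcomes of one roll of a fair die.
OUTCOMES = frozenset(range(1, 7))

def power_set(s):
    """All subsets of a finite set: the largest sigma-algebra on it."""
    items = list(s)
    subsets = chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))
    return {frozenset(c) for c in subsets}

def mu(event):
    """Uniform measure: each outcome carries weight 1/6."""
    return Fraction(len(event), len(OUTCOMES))

def is_sigma_algebra(family, whole):
    """Contains the whole set and is closed under complement and pairwise union."""
    return (whole in family
            and all(whole - x in family for x in family)
            and all(x | y in family for x in family for y in family))

def is_probability_measure(measure, family, whole):
    """Total mass 1, non-negative, and additive on disjoint pairs."""
    additive = all(measure(x | y) == measure(x) + measure(y)
                   for x in family for y in family if not (x & y))
    non_negative = all(measure(x) >= 0 for x in family)
    return measure(whole) == 1 and non_negative and additive

EVENTS = power_set(OUTCOMES)
print(is_sigma_algebra(EVENTS, OUTCOMES), is_probability_measure(mu, EVENTS, OUTCOMES))
print(mu(frozenset({2, 4, 6})))  # measure of the event "the roll is even"
```

Nothing in the check says what the numbers mean; reading them as an agent's degrees of belief is the extra step the Bayesian interpretation takes.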

↑ comment by PP · 2020-05-30T16:59:16.579Z · LW(p) · GW(p)

I see that gjm has replied with beautifully typeset text.

comment by romeostevensit · 2020-02-13T01:53:58.198Z · LW(p) · GW(p)

It is literally impossible for untrained humans to pay attention to settled questions for more than a tiny amount of time. We are hardwired to gloss over them; conscious attention uses very expensive compute cycles.

comment by orthonormal · 2020-02-21T21:24:08.471Z · LW(p) · GW(p)

A big part of the answer for me is something like this Scott Alexander post about the probability of X within your model versus the probability that your model is miscalibrated in a relevant way. Given how shaky our models of the world are, this alone makes it hard for me to push past 99% on many questions, especially those that require predicting human decisions.

comment by Urshanabi · 2020-02-13T19:58:57.701Z · LW(p) · GW(p)

I assign a greater than 1% probability for each of these questions that all of the information I've taken in about the situation is completely wrong in some fashion. It's quite possible that I actually have no idea how the stock market, for example, works. There's lots of misinformation floating around each of these questions.

Generally, .1<p<.9 comes from not having a good gears-level model for the situation, for me. .9<p<.99 comes from concerns around the trustworthiness of data or from insufficiently tested models.

comment by Gordon Seidoh Worley (gworley) · 2020-02-12T19:12:21.370Z · LW(p) · GW(p)

An additional possibility: everything already adds up to normality, we're just failing to notice because of how we're framing the question (in this case, whether or not holding middling probability estimates for difficult and controversial statements is correct).

Replies from: gjm

↑ comment by gjm · 2020-02-12T19:46:33.502Z · LW(p) · GW(p)

Can you suggest a better way of framing the question?

(I'm not very sure what sort of adding-up-to-normality you have in mind; are you saying my "good explanation #1" is likely the correct one?)

Replies from: gworley

↑ comment by Gordon Seidoh Worley (gworley) · 2020-02-12T21:30:45.613Z · LW(p) · GW(p)

Yeah, I think #1 sounds right to me, and there is nothing strange about it.

comment by Bunthut · 2020-02-13T19:13:06.418Z · LW(p) · GW(p)

One more good explanation: Numbers are hard. Think of a few things that you would definitely give >99%. I can just about rank them, but I have no idea how many nines are supposed to be there.
And one more not-so-good one: We aren't actually trying to figure it out. We just want to participate in a discussion game, and this one involves numbers called probability, and they are usually between .1 and .9. A good sign that this is happening (and probably in part a reason for the range restriction) is when the "winning criterion" is whether the probability is above or below 50%.

comment by gjm · 2020-02-13T01:24:22.169Z · LW(p) · GW(p)

I'm seeing a lot of replies saying, in effect, "duh, obviously #1: we only bother thinking about the questions whose answers are still somewhat open to doubt". Maybe that's the whole story, but I still don't think so: some questions remain in the category of "questions many people think about and aren't close to certain of the answers to" even as a lot of evidence accumulates.

(But maybe #1 is more of the story than I was inclined to think.)

Replies from: verbiage_ecstatic

↑ comment by verbiage_ecstatic · 2020-02-13T12:18:50.926Z · LW(p) · GW(p)

Here’s a better explanation that captures the intuition behind those responses: a middling probability doesn’t just come from balanced evidence, it also results from absence of evidence (because in the absence of evidence, we revert to our priors, which tend to be middling for events that aren’t dramatically outside the realm of what we consider possible). It’s not suspicious at all that there are millions of questions where we don’t have enough evidence to push us into certainty: understanding the world is hard and human knowledge is very incomplete. It would be suspicious if we couldn’t generate tons of questions with middling probabilities. Most of the examples in the post deal with complex systems that we can’t fully predict yet like the weather or the economy: we wouldn’t expect to be able to generate certainty about questions dealing with those systems.

comment by Dagon · 2020-02-12T20:39:01.214Z · LW(p) · GW(p)

I don't know if it's a good or not-good explanation, but in a lot of discussions of these examples, a different question will be answered than you're asking. But also #1, and also some ok-version-of-not-so-good: we cap our probability estimates with plausibility heuristics. If I'd be surprised but not shocked to see something, it's not less than 5% probability. If I'd be shocked, but not question everything I know, it's not less than 0.1% (for reference classes with a few-per-lifetime incidence of resolution).

"Human activity has caused substantial increases in global mean surface temperature over the last 50 years and barring major policy changes will continue to do so over at least the next 50 years -- say, at least half a kelvin in each case."

Generally, people seem to answer about whether policy changes are possible or useful, even when they claim to be discussing causality and prediction. This topic is rife with arguments-as-soldiers, and it's unsurprising that probability estimates are hard to find.

"On average, Christians in Western Europe donated a larger fraction of their income last year to non-religious charitable causes than atheists."

Wow, never thought about this, but I have trouble with the categorizations - I know a whole lot of semi-Christians (and Jews and Muslims) who could be counted in either camp, and a whole lot of charities which have religious origins but are currently not affiliated. In any case, I'd wonder why you were asking and how my prediction would change any behavior.

"Over the next 10 years, a typical index-fund-like basket of US stocks will increase by more than 5% per annum above inflation."

This one is better. Define the basket and inflation index and I'll make a bet. S&P 500 vs US CPI, I'd say about 15% to beat it by 5% per year (about 63% over the decade). Even so, I'd expect the joint prediction about inflation and return on investment to cause confusion. This is one where my proposed plausibility heuristic is putting a cap on my confidence level. For "reasonable" numbers, I don't think I can be more than 95% confident.

"In 2040, most global electricity generation will be from renewable sources."

The definition of "renewable" makes this one tricky, as do questions about whether to include energy that could be electric or not (heating, for instance). Is nuclear included? I think we could narrow it down to a specific prediction that one could have a prediction market on. I don't know enough to bet, myself. And I won't be shocked in either direction, so my estimates can't be very strong.

comment by Teerth Aloke · 2020-05-20T15:13:40.457Z · LW(p) · GW(p)

There are a few questions on which the evidence is really one-sided - like the common ancestry of humans and chimps.

comment by Daniel V · 2020-02-12T23:28:20.947Z · LW(p) · GW(p)

Good explanation #3: we perceive probabilities differently from their objective values (i.e., a question of calibration). Our responses to questions will be both a function of our "underlying" subjective probabilities and also the mapping of that to the response format. In the link, for example, responding with (p) 10% to 90% feels like being from (w(p)) 20% to 70% sure.
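The linked curve isn't reproduced here, so as a stand-in this sketch assumes the one-parameter Tversky-Kahneman probability weighting function with γ = 0.61. Both the functional form and the parameter value are assumptions chosen for illustration and may not be what the link uses.

```python
def weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman style weighting: overweights small p, underweights large p."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

# Compare stated probabilities with how strongly they are felt under this curve.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f}  ->  w(p) = {weight(p):.2f}")
```

Under these assumptions a stated 10% is felt as a bit under 20% and a stated 90% as a bit over 70%, which is the kind of compression described above.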

comment by jmh · 2020-02-12T23:20:15.048Z · LW(p) · GW(p)

"But isn't it strange that there are so many questions for which the evidence we have is so very balanced?"

What if we asked what types of informational distribution would produce a more skewed balance?

comment by noggin-scratcher · 2020-02-12T23:08:05.691Z · LW(p) · GW(p)

I would expect an enormous bias in terms of how much attention is paid to "interesting" questions - propositions we could sensibly rate near-1 or near-0 are boring because they're so obviously true or so obviously absurd that they feel mundane - like they're hardly a question in the first place.

It'd feel like cheating by way of degenerate examples to enumerate every common object and say "I'm extremely sure that at least one of those exists", but that is still a lot of propositions (even more propositions available if we dare to venture that two of each thing might exist)

Or for another partial explanation: on a question that is in any actual doubt, it's difficult to gather sufficient evidence to push a probability estimate very far into the tails. There are only so many independent observations one can make in a day.
