Using Points to Rate Different Kinds of Evidence

post by ozziegooen · 2023-08-25T20:11:21.269Z · LW · GW · 3 comments

This is a link post for https://forum.effectivealtruism.org/posts/mrAZFnEjsQAQPJvLh/using-points-to-rate-different-kinds-of-evidence

Contents

  Equation
    Initial Points
      Scientific Evidence
      Market Prediction
      Expert Opinion
      Reasoning
      Personal Accounts
      Tradition / Use
    Point Modifiers
    Points, In Practice
  Meta
    Using an Equation for Discussion
      Compare:
    Presumptions
    Agreeing on an evidence-weighing algorithm before direct discussions
  Is this too complicated and speculative?
  Future Work
  Afterword: Quick Attempts by LLMs
    Claude
      Scientific Evidence
      Expert Opinion
      Reasoning
      Records
      Culture
    GPT-4
      Expert Testimony and Opinion
      Anecdotal and Personal Accounts
      Historical and Archival Evidence
      Logical and Theoretical Evidence
      Public Opinion and Mass Media
      Miscellaneous

Epistemic Status: Briefly written. The specific equation here captures my quick intuition; this is meant primarily as a demonstration.

There’s a lot of discussion on the EA Forum and LessWrong about epistemics, evidence, and updating.

I don’t know of many attempts at formalizing our thinking here into concrete tables or equations. Here is one (very rough and simplistic) attempt. I’d be excited to see much better versions.

Equation

Initial Points

Scientific Evidence

20 - A simple math proof proves X

8 - A published scientific study in Economics supporting X

6 - A published scientific study in Psychology supporting X

Market Prediction

14 - Popular stock markets strongly suggest X

11 - Prediction markets claim X, with 20 equivalent hours of research

10 - A poll shows that 90% of LessWrong users believe X

6 - Prediction markets claim X, with one equivalent hour of research

Expert Opinion

8 - An esteemed academic believes X, where it’s directly in their line of work

6 - The author has strong emotions about X

Reasoning

6 - There's a (20-100 node) numeric model that shows X

5 - A reasonable analogy between X and something clearly good/bad

4 - A long-standing proverb

Personal Accounts

5 - The author claims a long personal history that demonstrates X

3 - Someone in the world has strong emotions about X

2 - A clever remark, meme, or tweet

2.3 - An insanely clever remark, meme, or tweet

0 - Believing X is claimed to be personally beneficial

Tradition / Use

12 - Top businesses act as if X

8 - A long-standing social tradition about X

5 - A single statistic about X

Point Modifiers

Is this similar to existing evidence?
Subtract the overlap with evidence already counted; only the new information earns points. This will likely remove most of the evidence’s value.

Is it convenient for the source to believe or say X?
-10% to -90%

Is there a lot of money or effort put behind spreading this evidence? For example, as an advertising campaign? 
+5% to +40%

How credible is the author or source?
-100% to +30%

Do we suspect the source is goodharting on this scale?
-20%

Points, In Practice

Evidence Points, as outlined, are not meant to mimic mathematical bits of information or any other clean, existing unit. I attempted to find a compromise between accuracy and ease of use.
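
To make the moving parts concrete, here’s a minimal sketch in Python of how the tables and modifiers above might combine. The point values are copied from the tables; treating the percentage modifiers as multiplicative and the similarity discount as plain subtraction are assumptions of the sketch, since nothing above pins down exactly how they compose.

```python
# A minimal sketch of the point system above. Table values are copied from
# this post; the composition rule (multiplicative modifiers, subtractive
# similarity discount) is an assumption, not something the post specifies.

INITIAL_POINTS = {
    "simple math proof": 20,
    "economics study": 8,
    "psychology study": 6,
    "prediction market, 20h of research": 11,
    "clever remark, meme, or tweet": 2,
    # ... remaining rows from the tables above ...
}

def score(kind, overlap_points=0.0, convenience=0.0,
          promotion=0.0, credibility=0.0, goodharting=False):
    """Combine an initial point value with the modifiers above.

    overlap_points: points already covered by similar evidence
    convenience:    -0.10 to -0.90 if it's convenient for the source
    promotion:      +0.05 to +0.40 if money/effort pushes the evidence
    credibility:    -1.00 to +0.30 for the source's credibility
    goodharting:    apply the flat -20% if we suspect scale-gaming
    """
    points = max(INITIAL_POINTS[kind] - overlap_points, 0.0)
    multiplier = (1 + convenience) * (1 + promotion) * (1 + credibility)
    if goodharting:
        multiplier *= 0.80
    return points * multiplier

# Example: a psychology study that mostly repeats known results (4 points
# of overlap), from a source for whom the conclusion is convenient (-30%):
print(score("psychology study", overlap_points=4, convenience=-0.30))
# (6 - 4) * 0.7 = 1.4 points
```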


Meta

Using an Equation for Discussion

The equation above is rough, but at least it’s (somewhat) precise and upfront. It encodes a lot of information, and any part of it can easily be argued against.

I think such explicitness could help with epistemic conversations.

Compare:

“Smart people should generally use their inside view, instead of the outside view” vs. “My recommended points scores for inside-view style evidence, and my point scores for outside-view style evidence, are all listed below.”

“Using many arguments is better than one big argument” vs. “I’ve adjusted my point table function to produce higher values when multiple types of evidence are provided. You can see it returns values 30% higher than what others have provided for certain scenarios.” (One hypothetical version of such an adjustment is sketched just after this list.)

“It’s really important to respect top [intellectuals|scientists|EAs]” vs. “My score for respected [intellectuals|scientists|EAs] is 2 points higher than the current accepted average.”

“Chesterton’s Fence is something to pay a lot of attention to” vs. “See my score table for the points from various kinds of traditional practices.”
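
Here is one hypothetical shape for the multiple-evidence-types adjustment mentioned in the second comparison; the log-scaled bonus schedule is purely illustrative, not a claim about the right rate.

```python
import math

def combined_score(points_by_type):
    """Sum points across evidence types, then apply a bonus that grows
    with the number of distinct types. The 0.15 rate is illustrative."""
    base = sum(points_by_type.values())
    n_types = len(points_by_type)
    bonus = 0.15 * math.log2(n_types) if n_types > 1 else 0.0
    return base * (1 + bonus)

# One big argument vs. several smaller ones totalling the same points:
print(combined_score({"numeric model": 12}))                          # 12.0
print(combined_score({"numeric model": 6, "study": 4, "expert": 2}))  # ~14.9
```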

In a better world, different academic schools of thought could have their own neatly listed tables/functions. In an even better world, all of these functions would be forecasts of future evaluators.

Presumptions

This sort of point system makes some presumptions that might be worth flagging. Primarily, it claims that even really poor evidence is evidence.

I often see people dismissing low-information evidence as completely worthless. I think this take is misguided. The antidote to a poor argument isn’t to mistakenly claim that the argument is entirely worthless; it’s typically to provide better counter-evidence.

Agreeing on an evidence-weighing algorithm before direct discussions

In classical debate, after choosing a side, debaters talk up the importance of the sorts of evidence they happen to have and dismiss the sorts their opponents bring up.

This becomes particularly gnarly when a group (like a political interest group) goes through a long list of heated discussions on different topics. In each, they’re likely to gerrymander their effective evidence-point rankings to best serve their argument.

An obvious epistemic improvement would be for parties to declare consistent epistemologies upfront. An even better state might involve parties agreeing on some shared aggregation of their epistemic preferences. Aggregate epistemics, not policy beliefs.
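
As a toy illustration of what agreeing on a shared aggregation could look like: take each party’s point table and combine corresponding entries. The tables below are hypothetical, and the geometric mean is just one plausible rule (it penalizes large disagreements more than a simple average would).

```python
from math import sqrt

# Two parties' point tables for the same evidence types (values hypothetical).
party_a = {"econ study": 10, "prediction market": 12, "proverb": 2}
party_b = {"econ study": 6,  "prediction market": 4,  "proverb": 6}

# Combine with the geometric mean of each pair of scores.
shared = {kind: sqrt(party_a[kind] * party_b[kind]) for kind in party_a}
print(shared)  # {'econ study': ~7.75, 'prediction market': ~6.93, 'proverb': ~3.46}
```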

In some worlds, intellectuals would spend most of their time improving epistemic processes and revealing unbiased evidence. A book about tax reform might be rewarded based on how many total points of evidence it brings up, regardless of which side of the debate those points are on.

Is this too complicated and speculative?

As long-time readers will know, I’m a big fan of attempting to measure highly speculative concepts. My guess is that explicit, speculative models are often preferable to standard text discussions. There’s a potential danger that some people might over-trust these numbers simply because they are numbers. However, the alternative to modeling is often “lots of blog posts with different undefined ontologies and tons of misunderstanding,” so I think this is often a reasonable tradeoff.

One great thing about models is that you can improve them. As we get more evidence and opinions, I’m hopeful that eventually, models emerge that wind up being pretty okay. If you kill mediocre attempts, you likely eventually kill decent ones, too.

Future Work

This is basic now, but I think it illustrates a worthy goal. Potential future work (for someone, likely not us) would include:

Thanks to Nuño Sempere for discussion on this topic. You can also see a short attempt by him on something similar here.


Afterword: Quick Attempts by LLMs

I fed this post to Claude and ChatGPT to have them provide simple takes.

Claude

Scientific Evidence

Expert Opinion

Reasoning

Records

Culture

GPT-4

Experimental Evidence (Evidence from controlled experiments or scientific studies)

Empirical Evidence (Evidence based on observation or experience)

Expert Testimony and Opinion

Anecdotal and Personal Accounts

Historical and Archival Evidence

Logical and Theoretical Evidence

Public Opinion and Mass Media

Miscellaneous

This system offers a broader range of evidence types, especially those under 10 points, to capture the nuances of everyday information sources.

3 comments

Comments sorted by top scores.

comment by Dagon · 2023-08-25T23:21:12.437Z · LW(p) · GW(p)

Upvoted for interesting thinking about alternatives to Bayes' rule (or heuristics for when it's inconvenient to use). I don't think it'll work for me, though - there's enough variance that I don't believe those point values, and the "point modifiers", in cases I can think of, overwhelm the initial point scores anyway.

I think it IS a useful exercise to try to enumerate the trust and surprise values for multiple pieces of evidence for and against a proposition, but I don't think that using a fixed point score for broad categories across different propositions is likely to work.

comment by niplav · 2023-08-28T14:14:02.452Z · LW(p) · GW(p)

I like this!

I would reduce the point score of "A simple math proof proves X" from 20 down to 16. As far as I know, there is no literature on how error-prone mathematical proofs are, but personal history and anecdotes from others suggest errors are common: I have many times found "bugs" in my own proofs (5 points), and someone I know working in formal verification would often update me on the errors they found in published proofs (8 points). I'd give a higher weight (18 points) to formally verified proofs. I'm not deducting more points because of the curious observation that even if a proof is faulty, the result it shows is usually true (just as neural networks want to learn, mathematical proofs want to be correct).

Additionally, proofs purportedly about real-world phenomena are often importance-hacked and one needs to read the exact result pretty carefully, which is often not done.

  • Randomized Controlled Trial (RCT) results: 25 points
  • Meta-analysis of multiple RCTs: 23 points

I find it amusing that GPT-4 considers meta-analyses to be worsening the results they attempt to pool together.

comment by gwern · 2023-08-28T21:26:11.031Z · LW(p) · GW(p)

I find it amusing that GPT-4 considers meta-analyses to be worsening the results they attempt to pool together.

It's possibly an error, but interestingly, it's also in line with more recent meta-analytic thinking: meta-analyses can be worse than individual RCTs because they simply wind up pooling the systematic errors from all the studies, yielding inflated effect sizes and overly narrow CIs compared to the best RCTs (which may have larger sampling error than the meta-analysis, but more than make up for it by having less systematic error).

An example of this would be the Many Labs project, where the well-powered, pre-registered Many Labs replications turned in systematically much smaller effect sizes than not just the original papers but also the meta-analyses of their subsequent literatures. The meta-analyses yielded smaller and better estimates than the p-hacked original papers, true, but were still far from the truth.