How’s that Epistemic Spot Check Project Coming?

post by Elizabeth (pktechgirl) · 2019-12-16T22:50:01.675Z · LW · GW · 16 comments

Contents

  A parable in Three Books.
  Model Based Reading
  How do we Know This?
  How’s it Going with Roam?
16 comments

 

Quick context: Epistemic spot checks started as a process in which I did quick investigations of a few of a book’s early claims to see if it was trustworthy before continuing to read it, in order to avoid wasting time on books that would teach me wrong things. Epistemic spot checks worked well enough for catching obvious flaws (*cou*Carol Dweck*ugh*), but have a number of problems. They emphasize a trust/don’t trust binary over model building, and provability over importance. They don’t handle “severely flawed but deeply insightful” well at all. So I started trying to create something better.

Below are some scattered ideas I’m playing with that relate to this project. They’re by no means fully baked, but it seemed like it might be helpful to share them. This kind of assumes you’ve been following my journey with epistemic spot checks at least a little. If you haven’t, that’s fine; a more polished version of these ideas will come out eventually.

 

A parable in Three Books.

I’m currently attempting to write up an investigation of Children and Childhood in Roman Italy (Beryl Rawson) (affiliate link) (Roam notes). This is very slow going, because CaCiRI doesn’t seem to have a thesis. At least, I haven’t found one, and I’ve read almost half of the content. It’s just a bunch of facts. Often not even syntheses, just “Here is one particular statue and some things about it.” I recognize that this is important work, even the kind of work I’d use to verify another book’s claims. But as a focal source, it’s deadly boring to take notes on and very hard to write anything interesting about. What am I supposed to say? “Yes, that 11 year old did do well (without winning) in a poetry competition and it was mentioned on his funeral altar, good job reporting that.” I want to label this sin “weed based publishing” (as in, “lost in the weeds”, although the fact that I have to explain that is a terrible sign for it as a name).

One particularly bad sign for Children and Childhood in Roman Italy was that I found myself copying multiple sentences at once into my notes. Direct quoting can sometimes mean “there’s only so many ways to arrange these words and the author did a perfectly good job so why bother”, but when it’s frequent, and long, it often means “I can’t summarize or distill what the author is saying”, which can mean the author is being vague, eliding important points, or letting implications do work that should be made explicit. This was easier to notice when I was taking notes in Roam (a workflowy/wiki hybrid), because Roam pushes me to make my bullet points as self-contained as possible (so that when you refer to them in isolation nothing is lost), so it became obvious and unpleasant when I couldn’t split a paragraph into self-contained assertions. Obviously real life is context-dependent and you shouldn’t try to make things more self-contained than they are, but I’m comfortable saying frequent long quotes are a bad sign about a book.

On the other side you have The Unbound Prometheus (David S. Landes) (affiliate link) (Roam notes), which made several big, interesting, important, systemic claims (e.g., “Britain had a legal system more favorable to industrialization than continental Europe’s”, “Europe had a more favorable climate for science than Islamic regions”), none of which it provided support for (in the sections I read; a friend tells me he gets more specific later). I tried to investigate these myself and ended up even more confused: scholars can’t even agree on whether Britain’s patent protections were strong or weak. I want to label this sin “making me make your case for you”.

A Goldilocks book is The Fate of Rome (Kyle Harper) (affiliate link) (Roam notes). Fate of Rome’s thesis is that the peak of the Roman empire corresponds with unusually favorable weather conditions in the Mediterranean. It backs this up with claims about climate archeology, e.g., ice core data (claim 1, 2). This prompted natural and rewarding follow-up questions like “What is ice core data capable of proving?” and “What does it actually show?”. My note-taking system in Roam was superb at enabling investigations of questions like these (my answer).

Based on claims creation, Against the Grain (James Scott) (affiliate link) (Roam notes) is even better. It has both interesting high-level models (“settlement and states are different things that emerged very far apart in time”, “states are entangled with grains in particular”) and very specific claims to back them up (“X was permanently settled in year Y but didn’t develop statehood hallmarks A, B, and C until year Z”). It is very easy to see how that claim supports that model, and the claim is about as easy to investigate as it can be. It is still quite possible that the claim is wrong or more controversial than the author is admitting, but it’s something I’ll be able to determine in a reasonable amount of time. As opposed to Unbound Prometheus, where I still worry there’s a trove of data somewhere that answers all of the questions conclusively and I just failed to find it.

[Against the Grain was started as part of the Forecasting project, which is currently being reworked. I can’t research its claims because that would ruin our ability to use it for the next round, should we choose to do so, so evaluation is on hold.]

If you asked me to rate these books purely on ease-of-reading, the ordering (starting with the easiest) would be:

 

 

Which is also very nearly the order they were published in (Against the Grain came out six weeks before Fate of Rome; the others are separated by decades). It’s possible that the two modern books were no better epistemically but felt so because they were easier to read. It’s also possible it’s a coincidence, or that epistemics have gotten better in the last 50 years.

 

Model Based Reading

As is kind of implied in the parable above, one shift in Epistemic Spot Checks is a new emphasis on extracting and evaluating the author’s models, which includes an emphasis on finding load-bearing facts. I feel dumb for not emphasizing this sooner, but better late than never. I think the real trick here is not recognizing that a book’s models need evaluating, but creating techniques for actually doing so.

 

How do we Know This?

The other concept I’m playing with is that “what we know” is inextricable from “how we know it”. This is dangerously close to logical positivism, which I disagree with, at least as far as my limited understanding of it goes. And yet it’s really improved my thinking when doing historical research.

This is a pretty strong reversal for me. I remember strongly wanting to just be told what we knew in my science classes in college, not the experiments that revealed it. I’m now pretty sure that’s scientism, not science.

 

How’s it Going with Roam?

When I first started taking notes with Roam (note spelling), I was pretty high on it. Two months later, I’m predictably loving it less than I did (it no longer drives me to do real life chores), but still find it indispensable. The big discovery is that the delight it brings me is somewhat book-dependent: it’s great for Against the Grain or The Fate of Rome, but didn’t help nearly so much with Children and Childhood in Roman Italy, because that book was mostly very on-the-ground facts that didn’t benefit from my verification system, and long paragraphs that couldn’t be disambiguated.

I was running into a ton of problems with Roam’s search not handling non-sequential words, but they seem to have fixed that. Search is still not ideal, but it’s at least usable.

Roam is pretty slow. It’s currently a race between their performance improvements and my increasing hoard of #Claims.

16 comments

Comments sorted by top scores.

comment by Pattern · 2019-12-17T18:03:27.348Z · LW(p) · GW(p)
This is very slow going, because CaCiRI doesn’t seem to have a thesis. At least, I haven’t found one, and I’ve read almost half of the content. It’s just a bunch of facts. Often not even syntheses
frequent long quotes are a bad sign about a book.

I could say similar (if more positive) things about mass link posts like this one from SSC - the only way to compress the information they contain is to send the link. (Although if a book's big enough, maybe the information could be sorted by seeming importance, or relevance to other topics.)

I want to label this sin “weed based publishing” (as in, “lost in the weeds”, although the fact that I have to explain that is a terrible sign for it as a name).

Sounds like a newspaper.

Possible name: Has no case?

I want to label this sin “making me make your case for you”.

The obvious 2x2 would be "is there evidence" and "is there a claim" - but evidence + claim isn't a sin, and no evidence + no claim is a different kind of book (if it's a book at all).

Sin list:

1. Unconnected Evidence, without any claims.

It seems that "Unconnected" is correlated with 'not having claims'.

2. (Connected) Claims, without Evidence.

A list of the important traits could use some fleshing out.

a new emphasis on extracting and evaluating the author’s models
“what we know” is inextricable from “how we know it”.

Here it is.

This is a pretty strong reversal for me. I remember strongly wanting to just be told what we knew in my science classes in college, not the experiments that revealed it. I’m now pretty sure that’s scientism, not science.

Incentive-wise, this might have to do with how the knowledge is tested.

If you know the conclusion, you can test it. Knowledge -> application is a natural place to focus, though the best way to do that might involve some degree of experimentation (if in ways more profit concerned than science should be).

Replies from: Pattern
comment by Pattern · 2019-12-17T18:07:57.209Z · LW(p) · GW(p)
“what we know” is inextricable from “how we know it”.

If we know something, but we don't know how we know it, then how can it be verified/disproven?

If we don't know something, but we know how to know it (the color of the sky is found by looking at the sky), then that can be fixed (look at the sky). (Although that starts to get into "what is the sky" - the way you define it affects answers to questions like "what color is it".)

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-12-17T20:46:38.402Z · LW(p) · GW(p)
Although that starts to get into "what is the sky" - the way you define it affects answers to questions like "what color is it".

This feels like some of what I was getting at.

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-12-18T21:05:07.534Z · LW(p) · GW(p)

E.g. say there is a test T for cancer C, for which n% of positives track to real cancer, and the cancer has a d% 5-year risk of death. However, the test preferentially picks up on deadlier forms of the cancer, so given a positive result, your risk of death is higher than n*d/100.

If I say "you have cancer C", you'll assume your 5-year risk of death is d%.

If I say "You have an n% chance of having cancer C", you'll assume your 5-year risk of death is n*d/100 percent.

If I say "You tested positive on test T", you can discover your actual chance of death over 5 years. So knowing the test result, rather than the summary (even the detailed summary), is more informative.

OTOH, your estimates in scenario 3 will be heavily dependent on who gets tested. If the governing body changes the testing recommendations, your chance of death given a positive result on T will change. So knowing "you have an n% chance of cancer" is in some ways a more robust result.
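
A minimal numeric sketch of the three framings, with made-up numbers (n, d, and the deadlier form's share and lethality are all invented for illustration, not taken from anywhere):

```python
# Toy numbers for test T and cancer C; every value here is made up for illustration.
n = 20              # % of positive tests that reflect real cancer
d = 10              # % 5-year risk of death for an average case of C
deadly_share = 0.5  # assumed fraction of test-caught cancers that are the deadlier form
d_deadly = 30       # assumed % 5-year risk of death for that deadlier form

# Framing 1: "you have cancer C" -> you assume the average risk, d = 10%.
# Framing 2: "you have an n% chance of cancer C" -> the naive estimate:
naive_risk = n * d / 100  # 20 * 10 / 100 = 2.0 (%)

# Framing 3: "you tested positive on T" -> you can fold in what the test
# preferentially detects, so the estimate comes out higher than n*d/100:
risk_given_positive = (n / 100) * (deadly_share * d_deadly + (1 - deadly_share) * d)

print(naive_risk, risk_given_positive)  # 2.0 vs 4.0 (%)
```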

comment by John_Maxwell (John_Maxwell_IV) · 2019-12-18T01:08:05.210Z · LW(p) · GW(p)

How are you deciding which books to do spot checks for? My instinct is to suggest finding some overarching question which seems important to investigate, so your project does double duty exploring epistemic spot checks and answering a question which will materially impact the actions of you / people you're capable of influencing, but you're a better judge of whether that's a good idea of course.

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-12-18T05:51:51.402Z · LW(p) · GW(p)

It depends; that is in fact what I'm doing right now, and I've done it before, but sometimes I just follow my interests.

Replies from: John_Maxwell_IV
comment by John_Maxwell (John_Maxwell_IV) · 2019-12-20T06:55:07.709Z · LW(p) · GW(p)

I see, interesting.

Here's another crazy idea. Instead of trying to measure the reliability of specific books, try to figure out what predicts whether a book is reliable. You could do a single spot check for a lot of different books and then figure out what predicts the output of the spot check: whether the author has a PhD/tenure/what their h-index is, company that published the book, editor, length, citation density, quality of sources cited (e.g. # citations/journal prestige of typical paper citation), publication date, # authors, sales rank, amount of time the author spent on the book/how busy they seemed with other things during that time period, use of a ghostwriter, etc. You could code all those features and feed them into a logistic regression and see which were most predictive.
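
A minimal sketch of what that pipeline could look like, assuming one binary spot-check outcome per book; the feature names and data are invented, and scikit-learn's LogisticRegression stands in for "a logistic regression":

```python
# Sketch only: toy data, invented features, far too few books to learn anything real.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

books = pd.DataFrame({
    "author_has_phd":    [1, 0, 1, 1, 0, 1, 0, 0],
    "citation_density":  [3.1, 0.4, 2.2, 1.8, 0.2, 2.9, 0.7, 0.5],  # citations per page (made up)
    "years_since_pub":   [2, 50, 3, 16, 40, 5, 30, 12],
    "passed_spot_check": [1, 0, 1, 1, 0, 1, 0, 0],                  # the label to predict
})

X = books.drop(columns="passed_spot_check")
y = books["passed_spot_check"]

# Hold some books out so we can check whether any signal generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print(dict(zip(X.columns, model.coef_[0])))  # which features carry weight
print(model.score(X_test, y_test))           # out-of-sample accuracy
```

The held-out books are the train/test point raised a couple of comments down; with only a handful of books the coefficients would be pure noise, so this just shows the shape of the idea.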

Replies from: pktechgirl, ozziegooen
comment by Elizabeth (pktechgirl) · 2019-12-26T20:40:52.895Z · LW(p) · GW(p)

I had a pretty visceral negative response to this, and it took me a bit to figure out why.

What I'm moving towards with ESCs is no gods no proxies. It's about digging in deeply to get to the truth. Throwing a million variables at a wall to see what sticks seems... dissociated? It's a search for things you can do instead of digging for information you evaluate yourself.

Replies from: Benito, John_Maxwell_IV, liam-donovan
comment by Ben Pace (Benito) · 2019-12-26T21:32:28.954Z · LW(p) · GW(p)

"No Gods, No Proxies, Just Digging For Truth" is a good tagline for your blog.

comment by John_Maxwell (John_Maxwell_IV) · 2019-12-29T06:27:10.984Z · LW(p) · GW(p)

A "spot check" of a few of a book's claims is supposed to a proxy for the accuracy of the rest of the claims, right?

Of course there are issues to work through. For example, you'd probably want to have a training set and a test set like people always do in machine learning to see if it's just "what sticks" or whether you've actually found a signal that generalizes. And if you published your reasoning then people might game whatever indicators you discovered. (Should still work for older books though.) You might also find that most of the variability in accuracy is per-book rather than per-author or anything like that. (Alternatively, you might find that a book's accuracy can be predicted better based on external characteristics than doing a few spot checks, if individual spot checks are comparatively noisy.) But the potential upside is much larger because it could help you save time deciding what to read on any subject.

Anyway, just an idea.

comment by Liam Donovan (liam-donovan) · 2019-12-26T21:32:37.830Z · LW(p) · GW(p)

What's the difference between John's suggestion and amplifying ESCs with prediction markets? (not rhetorical)

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-12-26T22:16:46.324Z · LW(p) · GW(p)

I don't immediately see how they're related. Are you thinking people participating in the markets are answering based on proxies rather than truly relevant information?

Replies from: liam-donovan
comment by Liam Donovan (liam-donovan) · 2019-12-28T06:23:13.313Z · LW(p) · GW(p)

I'm thinking that if there were liquid prediction markets for amplifying ESCs, people could code bots to do exactly what John suggests and potentially make money. This suggests to me that there's no principled difference between the two ideas, though I could be missing something (maybe you think the bot is unlikely to beat the market?)

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-12-28T19:43:00.876Z · LW(p) · GW(p)

I think I'd feel differently about John's list if it contained things that weren't goodhartable, such as... I don't know, most things are goodhartable. For example, citation density does probably have an impact (not just a correlation) on credence score. But giving truth or credibility points for citations is extremely gameable. A score based on citation density is worthless as soon as it becomes popular because people will do what they would have anyway and throw some citations in on top. Popular authors may not even have to do that themselves. The difference between what John suggested and a prediction market with a citation-count bot is that if that gaming starts to happen, the citation count bot will begin failing (which is an extremely useful signal, so I'd be happy to have citation count bot participating).

Put another way: in a soon-to-air podcast, an author described how reading epistemic spot checks gave them a little shoulder-Elizabeth when writing their own book, pushing them to be more accurate and more justified. That's a fantastic outcome that I'm really proud of, although I'll hold the real congratulations for after I read the book. I don't think a book would be made better by giving the author a shoulder-citation bot, or even a shoulder-complex multivariable function. I suspect some of that is because epistemic spot checks are not a score, they're a process, and demonstrating a process people can apply themselves, rather than a score they can optimize, leads to better epistemics.

A follow-up question is "would shoulder-prediction markets be as useful?". I think they could be, but that would depend on the prediction market being evaluated by something like the research I do, not a function like John suggests. The prediction markets involve multiple people doing and sometimes sharing research; Ozzie has talked about them as a tool for collaborative learning as opposed to competition (I've pinged him and he can say more on that if he likes).

Additionally, John's suggested metrics are mostly correlated with traditional success in academia, and if I thought traditional academic success was a good predictor of truth I wouldn't be doing all this work. That's a testable hypothesis and tests of it might look something like what John suggests, but I would view it as "testing academia", not "discovering useful metrics".


This question has spurred some really interesting and useful thoughts for me, thank you for asking it.

Replies from: ozziegooen
comment by ozziegooen · 2019-12-28T20:31:34.589Z · LW(p) · GW(p)

On them being for "collaborative learning": the specific thing I was thinking of was how good prediction systems should really encourage introspectability and knowledge externalities in order to be maximally cost-effective. I wrote a bit about this here [LW(p) · GW(p)].

comment by ozziegooen · 2019-12-28T20:29:38.297Z · LW(p) · GW(p)

Just chiming in here:

I agree with Liam that amplifying ESCs with prediction markets would be a lot like John's suggestion. I think an elegant approach would be something like setting up prediction markets, and then allowing users to set up their own data science pipelines as they see fit. My guess is that this would be essential if we wanted to predict a lot of books; say, 200 to 1 Million books.

If predictors did a decent job at this, then I'd be, on the whole, excited about it being known and for authors to try to perform better on it, because I believe it would reveal more signal than noise (well, as long as the prediction was done decently, for a vague definition of decent).

My guess is that a strong analysis would only include "number of citations" as being a rather minor feature. If it became evident that authors were trying to actively munchkin[1] things, then predictors should pick up on that, and introduce features for things like "known munchkiner", which would make this quite difficult. The timescales for authors to update and write books seem much longer than the timescales for predictors to recognize what's going on.

[1] I realize that "'munchkining" is a pretty uncommon word, but I like it a lot, and it feels more relevant than powergaming. Please let me know if there's a term you prefer. I think "Goodhart" is too generic, especially if things like "correlational Goodhart" count.