My PhD thesis: Algorithmic Bayesian Epistemology
post by Eric Neyman (UnexpectedValues) · 2024-03-16T22:56:59.283Z · LW · GW · 14 comments
This is a link post for https://arxiv.org/abs/2403.07949
In January, I defended my PhD thesis, which I called Algorithmic Bayesian Epistemology. From the preface:
For me as for most students, college was a time of exploration. I took many classes, read many academic and non-academic works, and tried my hand at a few research projects. Early in graduate school, I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints. I decided that this cluster of problems would be my primary academic focus. I settled on calling the cluster algorithmic Bayesian epistemology: all of the questions I was thinking about involved applying the "algorithmic lens" of theoretical computer science to problems of Bayesian epistemology.
Although my interest in mathematical reasoning about uncertainty dates back to before I had heard of the rationalist community, the community has no doubt influenced and strengthened this interest.
The most striking example of this influence is Scott Aaronson's blog post Common Knowledge and Aumann's Agreement Theorem, which I ran into during my freshman year of college.[1] The post made things click together for me in a way that made me more intellectually honest and humble, and generally a better person. I also found the post incredibly intellectually interesting -- and indeed, Chapter 8 of my thesis is a follow-up to Scott Aaronson's academic paper on Aumann's agreement theorem.
My interest in forecast elicitation and aggregation, while pre-existing, was no doubt influenced by the EA/rationalist-adjacent forecasting community.
And Chapter 9 of the thesis (work I did at the Alignment Research Center) is no doubt causally downstream of the rationalist community.
Which is all to say: thank you! Y'all have had a substantial positive impact on my intellectual journey.
Chapter descriptions
The thesis contains two background chapters followed by seven technical chapters (Chapters 3-9).
In Chapter 1 (Introduction), I try to convey what exactly I mean by "algorithmic Bayesian epistemology" and why I'm excited about it.
In Chapter 2 (Preliminaries), I give some technical background that's necessary for understanding the subsequent technical chapters. It's intended to be accessible to readers with a general college-level math background. While the nominal purpose of Chapter 2 is to introduce the mathematical tools used in later chapters, the topics covered there are interesting in their own right.
Different readers will of course have different opinions about which technical chapters are the most interesting. Naturally, I have my own opinions: I think the most interesting chapters are Chapters 5, 7, and 9, so if you are looking for direction, you may want to tiebreak toward reading those. Here are some brief summaries:
Chapter 3: Incentivizing precise forecasts. You might be familiar with proper scoring rules, which are mechanisms for paying experts for forecasts in a way that incentivizes the experts to report their true beliefs. But there are many proper scoring rules (most famously, the quadratic score and the log score), so which one should you use? There are many perspectives on this question, but the one I take in this chapter is: which proper scoring rule incentivizes experts to do the most research before reporting their forecast? (See also this blog post I wrote explaining the research.)
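To make "proper" concrete, here's a minimal sketch (my own illustration, not notation from the thesis) of the two scoring rules just mentioned, with a brute-force check that reporting your true belief maximizes your expected score under both:

```python
import numpy as np

def quadratic_score(p, outcome):
    """Quadratic (Brier-style) score for forecast p of a binary event."""
    return 1 - (outcome - p) ** 2

def log_score(p, outcome):
    """Logarithmic score for forecast p of a binary event."""
    return np.log(p if outcome == 1 else 1 - p)

def expected_score(rule, belief, report):
    """Expected score of announcing `report` when your true belief is `belief`."""
    return belief * rule(report, 1) + (1 - belief) * rule(report, 0)

belief = 0.7
reports = np.linspace(0.01, 0.99, 99)
for rule in (quadratic_score, log_score):
    best = reports[np.argmax([expected_score(rule, belief, r) for r in reports])]
    print(f"{rule.__name__}: expected score maximized at report {best:.2f}")  # 0.70 both times
```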
Chapter 4: Arbitrage-free contract functions. Now, what if you're trying to elicit forecasts from multiple experts? If you're worried about the experts colluding, your problem is now harder. It turns out that if you use the same proper scoring rule to pay every expert, then the experts can collude to all report the same forecast -- and then redistribute their rewards -- in a way that leaves every expert better off, no matter the outcome, than if they hadn't colluded. (The term for this sort of collusion is "arbitrage".) On the other hand, you now have more flexibility, because you can pay each expert in a way that depends on the other experts' reports. In this chapter, I resolve an open problem by showing how to pay experts in a way that makes such arbitrage impossible.
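A quick numerical illustration of the collusion (my numbers, not the example from the thesis), using the log score:

```python
import numpy as np

def log_score(p, outcome):
    return np.log(p if outcome == 1 else 1 - p)

p_alice, p_bob = 0.4, 0.8   # hypothetical true beliefs
r = 0.6                     # a colluding joint report between the two beliefs

for outcome in (0, 1):
    honest = log_score(p_alice, outcome) + log_score(p_bob, outcome)
    colluding = 2 * log_score(r, outcome)
    print(f"outcome={outcome}: honest total {honest:.3f} < colluding total {colluding:.3f}")
# The colluding total is strictly larger under BOTH outcomes (a consequence of
# the strict concavity of log), so the experts can redistribute the rewards and
# each come out ahead no matter what happens -- that's the arbitrage.
```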
Chapter 5: Quasi-arithmetic pooling. (One of my favorites.) Let's say that you've used a proper scoring rule to elicit forecasts from multiple experts, and now you want to aggregate them. This chapter's basic take is that your method of aggregation ought to depend on the scoring rule you used. For instance, the log scoring rule incentivizes experts to "be careful" around extreme probabilities: an expert incentivized with the log scoring rule cares substantially about the difference between a 1% chance and a 0.01% chance, while an expert incentivized with the quadratic scoring rule basically doesn't care. So if you're using the log scoring rule for eliciting forecasts, it makes sense to "take low probabilities more seriously" when aggregating the forecasts.
In the chapter, I define quasi-arithmetic (QA) pooling with respect to a proper scoring rule to be a particular method of forecast aggregation that depends on the scoring rule. For example, QA pooling with respect to the quadratic score means averaging the forecasts, while QA pooling with respect to the log score means averaging the log odds [? · GW]. I show that QA pooling has a bunch of nice properties and argue that the connection it establishes between forecast elicitation and forecast aggregation is pretty natural and fundamental.
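Here's a minimal sketch of those two special cases. The chapter gives the general definition, where the right transform is derived from the scoring rule; in this sketch I simply hard-code the two transforms:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def qa_pool(forecasts, weights, g, g_inv):
    """QA pooling, sketched: average the forecasts in a transformed space
    determined by the scoring rule, then map back to a probability."""
    return g_inv(np.average([g(p) for p in forecasts], weights=weights))

forecasts, weights = [0.6, 0.7], [0.5, 0.5]
# Quadratic score: the transform is the identity, so QA pooling = plain averaging.
print(qa_pool(forecasts, weights, lambda p: p, lambda x: x))  # 0.65
# Log score: the transform is the log-odds map, so QA pooling = averaging log odds.
print(qa_pool(forecasts, weights, logit, sigmoid))            # ~0.652
```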
Chapter 6: Learning weights for logarithmic pooling. QA pooling allows you to put weights on experts, depending on their past performance/how much you trust them. In Chapter 5, I showed that if the proper scoring rule is bounded, then you can efficiently learn weights for experts, by updating the weights based on the experts' performance, in a way that lets your aggregates be almost as good as if you had known the right set of weights from the beginning. But the log scoring rule isn't bounded! This chapter extends this result to the log scoring rule, under the assumption that the experts' forecasts are calibrated.
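For flavor, here's what the generic "update weights based on performance" loop looks like (a standard exponential-weights sketch of my own, not the chapter's actual algorithm, which has to cope with the log score being unbounded):

```python
import numpy as np

def log_score(p, outcome):
    return np.log(p if outcome == 1 else 1 - p)

def update_weights(weights, forecasts, outcome, eta=0.1):
    """Exponential-weights update: shift weight toward experts whose
    individual forecasts scored well on the realized outcome."""
    scores = np.array([log_score(p, outcome) for p in forecasts])
    new = weights * np.exp(eta * scores)
    return new / new.sum()

w = np.array([0.5, 0.5])
history = [([0.9, 0.3], 1), ([0.8, 0.4], 1), ([0.2, 0.6], 0)]  # (forecasts, outcome)
for forecasts, outcome in history:
    w = update_weights(w, forecasts, outcome)
print(w)  # expert 0 was consistently more accurate, so it ends up with more weight
```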
Chapter 7: Robust aggregation of substitutable signals. (One of my favorites.) Let's say that Alice says there's a 60% chance that it'll rain tomorrow and Bob says there's a 70% chance. How should you aggregate these forecasts into a single, all-things-considered number? The obvious answer is that it depends: if Alice knows strictly more information than Bob, you should say 60%. If Bob knows strictly more than Alice, you should say 70%. If their forecasts are based on disjoint pieces of evidence, your answer should be different from if their forecasts are based on heavily overlapping evidence.
But suppose you just don't know. After all, in practice, you may not have this information. This chapter explores a solution concept called robust aggregation: finding an aggregation strategy that works as well as possible in the worst case over a broad class of possible information structures. The broad class I study here is, roughly speaking, information structures that satisfy informational substitutes, meaning that an additional expert's information is less valuable the more information you already have (i.e., diminishing marginal returns). One highlight of this chapter is a theoretical justification for the practice of extremization: pushing the aggregate forecast away from the prior (see Jaime Sevilla's writeup here [? · GW]).
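For concreteness, one common form of extremization looks like this (a sketch under my own parameterization; the chapter derives which aggregator is worst-case optimal, and the right amount of extremization depends on the information structure):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def extremize(forecasts, prior, d=2.0):
    """Average the experts' log odds, then push the result away from the
    prior's log odds by a factor d > 1 (d = 1 is plain log-odds pooling)."""
    avg = np.mean([logit(p) for p in forecasts])
    return sigmoid(logit(prior) + d * (avg - logit(prior)))

print(extremize([0.6, 0.7], prior=0.5))  # ~0.78, more extreme than the 0.65 average
```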
Chapter 8: When does agreement imply accuracy? Suppose that Alice and Bob disagree about the value of some quantity, because they have different private information. How can they reach agreement? They can of course do so by exchanging all of their information, but that can take a really large amount of communication. Scott Aaronson's 2005 paper shows that Alice and Bob can quickly reach agreement simply by repeatedly exchanging their estimates (Alice tells Bob her estimate; Bob tells Alice his estimate after updating on what Alice just said; Alice tells Bob her estimate after updating on what Bob just said; and so on). However, this agreement could be very superficial: once Alice and Bob agree, the number they agree on might differ substantially from the number they would have agreed on, had they actually exchanged all of their information. This chapter establishes a sufficient condition under which, if Alice and Bob agree, their agreed-upon estimate does reflect the estimate they would reach if they exchanged all of their information.
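A standard toy example (my code, not the chapter's) of how superficial agreement can be:

```python
import itertools

# Alice sees bit a, Bob sees bit b (independent fair coins), and the quantity
# of interest is XOR(a, b). Conditioned only on her own bit, Alice's estimate
# is 0.5, and likewise for Bob -- so they "agree" immediately. Yet pooling
# their information would always pin the answer down exactly.
for a, b in itertools.product([0, 1], repeat=2):
    alice_estimate = bob_estimate = 0.5  # E[a XOR b | own bit] = 0.5 either way
    print(f"a={a}, b={b}: both estimate {alice_estimate}; full information gives {a ^ b}")
```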
Chapter 9: Deductive circuit estimation. (One of my favorites.) This chapter represents work I did at the Alignment Research Center around April-May of last year. In Formalizing the presumption of independence, we asked whether the process of deductive reasoning can be formalized. In this chapter, I explore deductive reasoning in the special case of boolean circuits. Informally, a deductive estimation algorithm takes as input a boolean circuit and an explanation of the circuit's behavior (in some formal language) and outputs an estimate of the fraction of inputs on which the circuit outputs 1. I explore how deductive estimation algorithms ought to behave, and give both positive results ("here's an efficient deductive estimation algorithm that satisfies linearity and respect for proofs") and negative results ("but no efficient deductive estimation algorithm additionally satisfies 0-1 boundedness").
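To give a flavor of the setting, here's the crudest heuristic estimator I can write down (my own toy, not one of the thesis's algorithms): propagate probabilities through the gates, presuming each gate's inputs are independent:

```python
def estimate(node):
    """Heuristically estimate P(node = 1) over uniformly random inputs,
    presuming each gate's inputs are independent. Exact on tree-shaped
    circuits; only an approximation when wires share ancestors."""
    kind = node[0]
    if kind == "input":
        return 0.5
    if kind == "not":
        return 1 - estimate(node[1])
    if kind == "and":
        return estimate(node[1]) * estimate(node[2])
    if kind == "or":
        p, q = estimate(node[1]), estimate(node[2])
        return p + q - p * q
    raise ValueError(f"unknown gate: {kind}")

x = ("input", "x")
print(estimate(("and", x, x)))          # 0.25, but the true fraction is 0.5
print(estimate(("or", x, ("not", x))))  # 0.75, but the true fraction is 1.0
# Explanations of a circuit's behavior (here, that both wires carry the same
# bit) are what let a deductive estimation algorithm correct such errors.
```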
A note on the title
The thesis is called "Algorithmic Bayesian Epistemology". I think most people here know roughly what I mean by "Bayesian epistemology".
The word "algorithmic" is more of a term of art, referring to the "algorithmic lens" of theoretical computer science. Quoting from the introduction:
The relationship between Bayesian epistemology and algorithmic Bayesian epistemology is the same as the relationship between game theory and algorithmic game theory, and as the relationship between mechanism design and algorithmic mechanism design.
Mechanism design -- traditionally a sub-discipline of economics -- asks the question: how can we design systems containing strategic agents pursuing their own incentives, in a way that produces good outcomes? For example, how can we auction off multiple items to multiple bidders in a way that produces the optimal social welfare for the bidders? The traditional answer from economic theory is the Vickrey-Clarke-Groves (VCG) auction, which elicits bids, computes the optimal allocation of items, and charges each bidder based on their externality on the remaining bidders.
Computer scientists find this answer dissatisfying, for a simple reason: computing the optimal allocation is not feasible, from the standpoint of both communication and computation. First, the bidders' preferences may not be compactly representable, in which case it is infeasible to communicate them to the auctioneer. Second, even if the bidders' preferences are compactly representable, actually computing the optimal allocation may still be intractable. And so algorithmic mechanism design asks the question: how can we design a computationally and communicationally tractable auction mechanism that attains a large fraction of the optimal social welfare?
Algorithmic mechanism design belongs to a longstanding tradition in theoretical computer science: considering problems from other disciplines through an algorithmic lens.[2] That is, instead of asking for the optimal solution to a problem, computer scientists ask: what is the best solution that can actually be implemented, given real-world (or real-world-inspired) constraints?
Sometimes, these constraints are computational: what is the best solution that can be found in polynomial time? Other times, the constraints are communicational: what is the best solution if parties are limited in how much they can communicate? Other kinds of constraints are also common. For example:
- Constraints on information. For example, the subfield of online algorithms studies sequential decision making under uncertainty (incomplete information). Often, the goal of an online algorithm is to guarantee a result that is almost as good as the best possible result in hindsight, e.g. the prophet inequality from optimal stopping theory, or no-regret algorithms in online learning.
- Constraints imposed by the strategic behavior of agents in a system. For example, many computer scientists study the price of anarchy: how much the welfare of a system degrades because of self-interested actors, as compared to a welfare-optimizing central planner.
The study of real-world problems through the algorithmic lens has significantly impacted a variety of disciplines, including molecular biology, ecology, neuroscience, quantum physics, and various social sciences.[3]
And so, algorithmic Bayesian epistemology is simply the application of the algorithmic lens to the discipline of Bayesian epistemology. It is perhaps best to define algorithmic Bayesian epistemology by its examples, but to attempt a loose description:
A question belongs to the field of algorithmic Bayesian epistemology if it involves reasoning about uncertainty from a Bayesian perspective, but under constraints that prevent complete assimilation of all existing information.
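Returning to the VCG example above: here's a brute-force sketch of that computation for unit-demand bidders (my own toy encoding -- real preferences need not be this compact, which is exactly the communication obstacle described above):

```python
from itertools import permutations

def vcg(valuations):
    """Tiny VCG sketch: valuations[i][j] is bidder i's value for item j, and
    each bidder wants at most one item. Assign items to maximize total value,
    charging each winner the welfare loss they impose on the other bidders.
    Brute force -- this is the computation that becomes intractable at scale."""
    n = len(valuations)

    def best(excluded=None):
        bidders = [i for i in range(n) if i != excluded]
        return max(
            ((sum(valuations[i][j] for i, j in zip(bidders, perm)),
              dict(zip(bidders, perm)))
             for perm in permutations(range(n), len(bidders))),
            key=lambda t: t[0],
        )

    welfare, alloc = best()
    payments = {}
    for i, item in alloc.items():
        welfare_without_i, _ = best(excluded=i)
        others_welfare_with_i = welfare - valuations[i][item]
        payments[i] = welfare_without_i - others_welfare_with_i
    return alloc, payments

print(vcg([[10, 3], [6, 5]]))  # ({0: 0, 1: 1}, {0: 1, 1: 0})
```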
1. ^ I ran into the post because it was linked from this wonderful Slate Star Codex post, which was also one of the first SSC posts I had run into. I had the classic "oh man, I've found my people!" experience that day.
2. ^ The term "algorithmic lens" was coined at Berkeley by members of the Theory of Computing research group around the year 2000.
3. ^ See Chapter 20 here for a detailed discussion.
14 comments
comment by kave · 2024-04-03T20:21:21.543Z · LW(p) · GW(p)
Curated.
Using Bayes-type epistemology is a core LessWrong topic, and I think this represents a bunch of progress on that front (whether the results are already real-world-ready or just real-world-inspired). I have only engaged with small parts of the thesis, but those parts seem pretty exciting; so far, I particularly like knowing about quasi-arithmetic pooling. It feels like I've become less confused about something that I didn't know I was confused about — the connection between the character of the proper scoring rule and the right ways to aggregate those probabilities.
I also appreciate Eric's work making blogposts explaining more of his thoughts in a friendly way. Hope to see a few more distillations come out of this thesis!
comment by Richard_Ngo (ricraz) · 2024-03-25T17:00:58.006Z · LW(p) · GW(p)
Very cool work! A couple of (perhaps-silly) questions:
- Do these results have any practical implications for prediction markets?
- Which of your results rely on there being a fixed pool of experts who have to forecast a question (as opposed to experts being free to pick and choose which questions they forecast)?
- Do you know if your arbitrage-free contract function permits types of collusion that don't leave all experts better off under every outcome, but do make each of them better off in expectation according to their own credences? (I.e. types of collusion that they would agree to in advance.) Apart from just making side bets.
↑ comment by Eric Neyman (UnexpectedValues) · 2024-03-26T06:49:56.773Z · LW(p) · GW(p)
Great questions!
- I didn't work directly on prediction markets. The one place that my thesis touches on prediction markets (outside of general background) is in Chapter 5, page 106, where I give an interpretation of QA pooling in terms of a particular kind of prediction market called a cost function market. This is a type of prediction market where participants trade with a centralized market maker, rather than having an order book. QA pooling might have implications in terms of the right way to structure these markets if you want to allow multiple experts to place trades at the same time, without having the market update in between. (Maybe this is useful in blockchain contexts if market prices can only update every time a new block is created? I'm just spitballing; I don't really understand how blockchains work.)
- I think that for most contexts, this question doesn't quite make sense, because there's only one question being forecast. The one exception is where I talk about learning weights for experts over the course of multiple questions (in Chapter 5 and especially 6). Since I talk about competing with the best weighted combination of experts in hindsight, the problem doesn't immediately make sense if some experts don't answer some questions. However, if you specify a "default thing to do" if some expert doesn't participate (e.g. take all the other experts' weights and renormalize them to add to 1), then you can get the question to make sense again. I didn't explore this, but my guess is that there are some nice generalizations in this direction.
- I don't! This is Question 4.5.2, on page 94 :) Unfortunately, I would conjecture (70%) that no such contract function exists.
comment by Stephen Bennett (GWS) · 2024-03-28T07:07:07.815Z · LW(p) · GW(p)
Congratulations! I wish we could have collaborated while I was in school, but I don't think we were researching at the same time. I haven't read your actual papers, so feel free to answer "you should check out the paper" to my comments.
For chapter 4: From the high level summary here it sounds like you're offloading the task of aggregation to the forecasters themselves. It's odd to me that you're describing this as arbitrage.
Also, I have frequently seen the scoring rule be used with some intermediary function to determine monetary rewards. For example, when I worked with IARPA on geopolitical forecasting, our forecasters would get financial rewards depending on what percentile they were in relative to other forecasters. One would imagine that this would eliminate the incentive to report the aggregate as your own answer, but there's a reason we (the researcher/platform/website) aggregate individual forecasts! It's actually just more accurate under typical conditions. In theory an individual forecaster could improve that aggregate by forming their own independent forecast before seeing the work of others, and then aggregating, but in practice the impact of an individual forecast is quite small.
I'll have to read about QA pooling; it's surprising to me that you could disincentivize forecasters from reporting the aggregate as their individual forecast.
For chapter 7: It seems to me that under sufficiently pessimistic conditions, there would be no good way to aggregate those two forecasts. For example, if Alice and Bob are forecasting "Will AI cause human extinction in the next 100 years?", they both might individually forecast ~0% for different reasons. Alice believes it is impossible for AI to get powerful enough to cause human extinction, but if it were capable of acting it would kill us all. Bob believes any agent smart enough to be that powerful would necessarily be morally upstanding and believes it's extremely likely that it will be built. Any reasonable aggregation strategy will put the aggregate at ~0% because each individual forecast is ~0%, but if they were to communicate with one another they would likely arrive at a much higher number. I suspect that you address this in the assumptions of the model in the actual paper.
Congrats again, I enjoyed your high level summary and might come back for a more detailed read of your papers.
Replies from: UnexpectedValues↑ comment by Eric Neyman (UnexpectedValues) · 2024-03-28T21:02:39.481Z · LW(p) · GW(p)
Thanks! Here are some brief responses:
From the high level summary here it sounds like you're offloading the task of aggregation to the forecasters themselves. It's odd to me that you're describing this as arbitrage.
Here's what I say about this anticipated objection in the thesis:
For many reasons, the principal may wish to make arbitrage impossible. First, the principal may wish to know whether the experts are in agreement: if they are not, for instance, the principal may want to elicit opinions from more experts. If the experts collude to report an aggregate value (as in our example), the principal does not find out whether they originally agreed. Second, even if the principal only seeks to act based on some aggregate of the experts' opinions, their method of aggregation may be different from the one that experts use to collude. For instance, the principal may have a private opinion on the trustworthiness of each expert and wishes to average the experts' opinions with corresponding weights. Collusion among the experts denies the principal this opportunity. Third, a principal may wish to track the accuracy of each individual expert (to figure out which experts to trust more in the future, for instance), and collusion makes this impossible. Fourth, the space of collusion strategies that constitute arbitrage is large. In our example above, any report in [0.546, 0.637] would guarantee a profit; and this does not even mention strategies in which experts report different probabilities. As such, the principal may not even be able to recover basic information about the experts' beliefs from their reports.
For example, when I worked with IARPA on geopolitical forecasting, our forecasters would get financial rewards depending on what percentile they were in relative to other forecasters.
This would indeed be arbitrage-free, but likely not proper: it wouldn't necessarily incentivize each expert to report their true belief; instead, an expert's optimal report is going to be some sort of function of the expert's belief about the joint probability distribution over the experts' beliefs. (I'm not sure how much this matters in practice -- I defer to you on that.)
It's surprising to me that you could disincentivize forecasters from reporting the aggregate as their individual forecast.
In Chapter 4, we are thinking of experts as having immutable beliefs, rather than beliefs that change upon hearing other experts' beliefs. Is this a silly model? If you want, you can think of these beliefs as each expert's belief after talking to the other experts a bunch. In theory(?) the experts' beliefs should converge (though I'm not actually clear what happens if the experts are computationally bounded); but in practice, experts often don't converge (see e.g. the FRI adversarial collaboration on AI risk [EA · GW]).
It seems to me that under sufficiently pessimistic conditions, there would be no good way to aggregate those two forecasts.
Yup -- in my summary I described "robust aggregation" as "finding an aggregation strategy that works as well as possible in the worst case over a broad class of possible information structures." In fact, you can't do anything interesting in the worst case over all information structures. The assumption I make in the chapter in order to get interesting results is, roughly, that experts' information is substitutable rather than complementary (on average over the information structure). The sort of scenario you describe in your example is the type of example where Alice and Bob's information might be complementary.
comment by AxiomWriter (Disbeliever) · 2024-04-04T06:07:10.844Z · LW(p) · GW(p)
Hello, this looks very interesting to me, so I want to sort of make a mark on it, so that I can find it later. At the moment I can't do much, for personal reasons. I am a newbie here, barely knowing what this "Bayesian thing" is about.
But knowing a bit about building a sound axiom system, I have a feeling some work could be done to make the "table", the axiom system all this stands on, a better one.
My WordPress blog (and I still have to learn how to write one) holds at the moment only my "resurrected" thesis and doctoral thesis from about 30 years ago, in German at that. I had to dig up the old LaTeX files, then make PDF files out of them. Since LaTeX has changed a lot, I could not resurrect the correct pictures yet.
Just marking this for myself, so that I can find it later; at the moment real, deep mathematical thinking is not possible for me, and actually I shouldn't even be on this forum. But this just looked too interesting.
comment by LTM · 2024-04-04T13:11:24.106Z · LW(p) · GW(p)
Really fascinating stuff! I have a (possibly answered) question about how expert updates on other experts' predictions might be valuable.
You discuss the negative impacts of allowing experts to aggregate themselves, or viewing one another's forecasts before initially submitting their own. Might there be value in allowing experts to submit multiple times, each time seeing the submitted predictions of a previous round? The final aggregation scheme would be able to not only assign a credence to each expert, but also gain a proxy for what credence the experts give to one another. In a more practical scenario where experts will talk if not collude, this might give better insight into how expert predictions are being created.
Thanks for taking the time to distill this work into a more approachable format - it certainly made the thesis more manageable!
Replies from: UnexpectedValues↑ comment by Eric Neyman (UnexpectedValues) · 2024-04-11T17:59:09.275Z · LW(p) · GW(p)
Yeah, there's definitely value in experts being allowed to submit multiple times, allowing them to update on other experts' submissions. This is basically the frame taken in Chapter 8, where Alice and Bob update their estimate based on the other's estimate at each step. This is generally the way prediction markets work, and I think it's an understudied perspective (perhaps because it's more difficult to reason about than if you assume that each expert's estimate is static, i.e. does not depend on other experts' estimates).
comment by Sniffnoy · 2024-04-04T00:20:18.835Z · LW(p) · GW(p)
One annoying thing in reading Chapter 3: it states that for l = 2, 4, 8, the optimal scoring rules can be written in terms of elementary functions. However, you only actually give the full formula for the case l = 8 (for l = 2 you give it on half the interval). What are the formulas for the other cases?
(But also, this is really cool, thanks for posting this!)
Replies from: UnexpectedValues↑ comment by Eric Neyman (UnexpectedValues) · 2024-04-11T17:47:42.122Z · LW(p) · GW(p)
Thanks! I think the reason I didn't give those expressions is that they're not very enlightening. See here for l = 2 on (0, 1/2] and here for l = 4 on [1/2, 1).
Replies from: Sniffnoy↑ comment by Sniffnoy · 2024-04-11T20:20:47.635Z · LW(p) · GW(p)
Ha! OK, that is indeed nasty. Yeah, I guess CASes can solve this kind of problem these days, can't they? Well -- I say "these days" as if this hasn't been the case for, like, my entire life; I've just never gotten used to making routine use of them...
comment by Gustavo Lacerda (gustavo-lacerda) · 2024-04-19T04:27:50.723Z · LW(p) · GW(p)
‹‹ I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints ››
This is also true for me, and I loved reading this post for this reason!
Back in the day I applied to study with Joe Halpern because of his work on epistemic logic, and ended up studying Logic in Amsterdam. At some point I got tired of Logic and its contrived puzzles (Muddy Children, etc) and decided to focus on Probability instead.
comment by Gustavo Lacerda (gustavo-lacerda) · 2024-04-19T04:22:11.645Z · LW(p) · GW(p)
Has anyone studied the idea of rewarding people according to how much their input improves the aggregate (whatever algorithm is being used), rather than for their individual accuracy?
comment by Review Bot · 2024-03-18T06:04:58.419Z · LW(p) · GW(p)
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?