[Part 1] Amplifying generalist research via forecasting – Models of impact and challenges

post by jacobjacob, ozziegooen, Elizabeth (pktechgirl), NunoSempere (Radamantis), bgold · 2019-12-19T15:50:33.412Z · score: 53 (13 votes) · LW · GW · 21 comments

Contents

  Overview of the set-up and results
  Models of impact
    Mitigating capacity bottlenecks
    A way for intellectual talent to build and demonstrate their skills
    Exploring new institutions for collaborative intellectual progress
    Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda
    Exploring forecasting with distributions
    Forecasting fuzzy things
    Shooting for unknown unknowns
  Challenges
    Complexity and unfamiliarity of experiment.  
    Trust in evaluations. 
    Correlation between predictions and evaluations. 
    Difficulty of converting mental models into quantitative distributions. 
    Anti-correlation between importance and “outsourceability”  
    Overhead of question generation. 
    A scoring rule that discourages collaboration
  Footnotes
  Participate in future experiments or run your own
  Acknowledgements

This post covers our models of impact and the challenges we encountered in our exploration of amplifying generalist research using forecasting. It is accompanied by a second post [? · GW] with a high-level description of those models, and a more detailed description of the experiment set-up and results.

Many of the world’s most pressing problems require intellectual progress to solve [1]. Finding ways to increase the rate of intellectual progress might be a highly promising way of solving those problems.

One component of this is generalist research: the ability to judge and synthesise claims across many different fields without detailed specialist knowledge of those fields, in order to, for instance, prioritise potential new cause areas or allocate grant funding. Organisations at the EA Leaders Forum expect this to be one of their most in-demand skills over the coming 5 years (2018 survey, 2019 survey [EA · GW]).

In light of this, we recently tested a method of increasing the scale and quality of generalist research, applied to researching the industrial revolution [2], using Foretold.io (an online prediction platform).

In particular, we found that, when faced with claims like:

“Pre-Industrial Britain had a legal climate more favorable to industrialization than continental Europe”

And

“Pre-Industrial Revolution, average French wage was what percent of the British wage?”

a small crowd of forecasters recruited from the EA and rationality communities very successfully predicted the judgements of a trusted generalist researcher, with a benefit-cost ratio of around 73% compared to the original researcher. They also outperformed a group of external online crowdworkers.

Moreover, we believe this method can be scaled to answer many more questions than a single researcher could, as well as to have application in domains other than research, like grantmaking, hiring and reviewing content.

We preliminarily refer to this method as “amplification” given its similarity to ideas from Paul Christiano’s work on Iterated Distillation and Amplification in AI alignment (see e.g. this).

This was an exploratory project whose purpose was to build intuition for several possible challenges. It covered several areas that could be well suited for more narrow, traditional scientific studies later on. As such, the sample size was small and no single result was highly robust.

However, it did lead to several medium-sized takeaways that we think should be useful for informing future research directions and practical applications.

This post begins with a brief overview of our results. We then share some models of why the current project might be impactful and exciting, followed by some challenges this approach faces.

Overview of the set-up and results

(This section gives a very cursory overview of the set-up and results. A detailed report can be found in this post [LW · GW].)

The basic set-up of the project is shown in the following diagram, and described below.

A two-sentence version would be:

Forecasters predicted the conclusions that would be reached by Elizabeth Van Nostrand, a generalist researcher, before she conducted a study on the accuracy of various historical claims. We randomly sampled a subset of research claims for her to actually evaluate. And since we can set that sampling probability arbitrarily low, this method is not bottlenecked by her time.

The below graph shows the evolution of the accuracy of the crowd prediction over time, starting from Elizabeth Van Nostrand’s prior. Predictions were submitted separately by two groups of forecasters: one based on a mailing list with participants interested in participating in forecasting experiments (recruited from effective altruism-adjacent events and other forecasting platforms), and one recruited from Positly, an online platform for crowdworkers.

The y-axis shows the accuracy score on a logarithmic scale, and the x-axis shows how far along the experiment is. For example, 14 out of 28 days would correspond to 50%. The thick lines show the average score of the aggregate prediction, across all questions, at each time-point. The shaded areas show the standard error of the scores, so that the graph might be interpreted as a guess of how the two communities would predict a random new question.

One of our key takeaways from the experiment is that our simple algorithm for aggregating predictions performed surprisingly well in predicting Elizabeth’s research output -- but only for the network-adjacent forecasters.

Another way to understand the performance of the aggregate is to note that the aggregate of network-adjacent forecasters had an average log score of -0.5. To get a rough sense of what that means, it's the score you'd get by being 70% confident in a binary event, and being correct (though note that this binary comparison merely serves to provide intuition; there are technical details that make the comparison to a distributional setting a bit tricky).

By comparison, the crowdworkers and Elizabeth’s priors had a very poor log score of around -4. This is roughly similar to the score you’d get if you predict an event to be ~5% likely, and it still happens.
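To make the two binary comparisons above concrete, here is a minimal sketch of a binary log score. The post does not state which logarithm base was used; the numbers quoted line up best with base 2, so that is the assumption made here, and `log_score` is an illustrative helper rather than anything from Foretold.

```python
import math

def log_score(p_assigned_to_outcome: float, base: float = 2.0) -> float:
    """Log of the probability the forecaster gave to what actually happened.

    The base is an assumption: the post does not say which one was used.
    """
    return math.log(p_assigned_to_outcome, base)

# Being 70% confident in a binary event and being right.
print(round(log_score(0.70), 2))  # -0.51

# Giving an event ~5% and seeing it happen anyway.
print(round(log_score(0.05), 2))  # -4.32
```

With a natural logarithm the same two forecasts would score about -0.36 and -3.0, so the intuition survives a change of base even though the exact figures shift.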

We also calculated a benefit/cost-ratio, as follows:

Benefit/cost ratio = % value provided by forecasters relative to the evaluator / % cost of forecasters relative to the evaluator

We measured “value provided” as the reduction in uncertainty weighted by the importance of the questions on which uncertainty was reduced.
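The ratio above can be sketched in a few lines. Everything here is illustrative: the function names are hypothetical, the inputs are made-up numbers rather than data from the experiment, and "reduction in uncertainty" is left as an abstract per-question number because the post does not specify how it was computed.

```python
def value_provided(importance, uncertainty_reduction):
    """Importance-weighted sum of per-question uncertainty reductions."""
    return sum(w * r for w, r in zip(importance, uncertainty_reduction))

def benefit_cost_ratio(value_forecasters, value_evaluator,
                       cost_forecasters, cost_evaluator):
    """(% value relative to evaluator) / (% cost relative to evaluator)."""
    relative_value = value_forecasters / value_evaluator
    relative_cost = cost_forecasters / cost_evaluator
    return relative_value / relative_cost

# Hypothetical inputs: forecasters deliver 20% of the evaluator's value
# at 50% of the evaluator's cost.
print(benefit_cost_ratio(20.0, 100.0, 50.0, 100.0))  # 0.4
```

Under this definition a negative ratio arises whenever the forecasters' value is negative, i.e. when their predictions leave us more wrong than the starting prior did.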

Results were as follows.

In other words, each unit of resource invested in the network-adjacent forecasters provided 72% as much return as investing it in Elizabeth directly, and each unit invested in the crowdworkers provided negative returns, as they tended to be less accurate than Elizabeth’s prior.

Overall, we tentatively view this as an existence proof of the possibility of amplifying generalist research, and in the future are interested in obtaining more rigorous results and optimising the benefit-cost ratio.

Models of impact

This section summarises some different perspectives on what the current experiment is trying to accomplish and why that might be exciting.

There are several perspectives here given that the experiment was designed to explore multiple relevant ideas, rather than testing a particular, narrow hypothesis.

As a result, the current design is not optimising very strongly for any of these possible uses, and it is also plausible that its impact and effectiveness will vary widely between uses.

To summarise, the models are as follows.

Mitigating capacity bottlenecks

The effective altruism and rationality communities face rather large bottlenecks in many areas, such as allocating funding, delegating research, vetting [EA · GW] talent [EA · GW] and reviewing content [LW · GW].

Prediction platforms (for example as used with the current “amplification” set-up) might be a promising tool to tackle some of those problems, for several reasons. In brief, they might act as a scalable way to outsource intellectual labor.

First, we’re using quantitative predictions and scoring rules. This allows several things.

Contrast receiving 100 predictions and receiving 20 Google docs. The latter would be prohibitively difficult to read through, does not have a straightforward means of aggregation, and might not even be analysable in an “apples to apples” comparison.

However, the big cost we pay to enable these benefits is that we are adding formalism, and constraining people to express their beliefs within the particular formalism/ontology of probabilities and distributions. We discuss this more in the section on challenges below.

Second, we’re using an internet platform. This makes it easier for people from different places to collaborate, and to organise and analyse their contributions. Moreover, given the benefits of quantification noted above, we can freely open the tournament to people without substantial credentials, since we’re not constrained in our capacity to evaluate their work.

Third, we’re using a mechanism specifically designed to overcome capacity bottlenecks. The key to scalability is that forecasters do not know which claims will be evaluated, and so are incentivised to make their honest, most accurate predictions on all of them. This remains true even as many more claims are added (as long as forecasters expect that the rewards for participating will remain similar).
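A minimal sketch of that sampling step, assuming each claim is independently selected for evaluation with the same probability (the post does not describe the exact sampling procedure, and the function names are hypothetical):

```python
import random

def sample_claims_to_evaluate(claims, p_evaluate, seed=None):
    """Pick the claims the evaluator will actually resolve.

    Forecasters never see this selection in advance, so every claim
    has the same chance of counting towards their score.
    """
    rng = random.Random(seed)
    return [claim for claim in claims if rng.random() < p_evaluate]

def expected_evaluations(n_claims, p_evaluate):
    """Expected evaluator workload, measured in claims."""
    return n_claims * p_evaluate

# Ten times as many claims, same expected workload for the evaluator.
print(expected_evaluations(100, 0.10))   # 10.0
print(expected_evaluations(1000, 0.01))  # 10.0
```

The evaluator's time stays fixed because the sampling probability shrinks as the pool of claims grows; what has to scale instead is the number of forecasters and the prize money.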

In effect, we’re shifting the bottleneck from access to a few researchers to access to prize money and competent forecasters. It seems highly implausible that all kinds of intellectual work could be cost-effectively outsourced this way. However, if some work could be outsourced and performed at, say, 10% of the quality, but at only 1% of the cost, that could still be very worthwhile. For example, in trying to review hundreds of factual claims, the forecasting could be used as an initial, wide-sweeping filter that grabs the low-hanging fruit while also identifying which questions are more difficult and will need attention from more senior researchers.

Overall, this is a model for how things might work, but it is as yet highly uncertain whether this technique will actually be effective in tackling bottlenecks of various kinds. We provide some preliminary data from this experiment in the “Cost-effectiveness” section below.

A way for intellectual talent to build and demonstrate their skills

The following seems broadly true to some of us:

If these claims are true, we might use some novel versions of forecasting tournaments as a scalable system to identify and develop epistemic talent. This potential benefit looks quite different from using forecasting tournaments to help us solve novel problems or gain better or cheaper information than we could otherwise.

Currently there is no “driver’s license” for rationality or effective altruism. Demonstrating your abilities requires navigating a system of reading and writing certain blog posts, finding connections to more senior people, and going through work trials tailored to particular organisations. This system does not scale very well, and also often requires social knowledge and an ability to “be in the right place at the right time”, which do not necessarily correlate strongly with pure epistemic ability.

It seems very implausible that open forecasting tournaments could solve the entire problem here. But it seems quite plausible that it could offer improvements on the margin, and become a reliable credentialing mechanism for a limited class of non-trivial epistemic abilities.

For example, EA student groups with members considering cause prioritisation career paths might organise tournaments where their members forecast the conclusions of OpenPhil write-ups, or maintain and update their own distributions over key variables in GiveWell’s cost-effectiveness models.

By running this experiment, writing up the results, and improving the Foretold platform, we hope to provide infrastructure that will allow others interested in this benefit to run their own experiments.

Exploring new institutions for collaborative intellectual progress

Many of our most important current institutions, like governments and universities, run on mechanisms designed hundreds of years ago, before fields like microeconomics and statistics were developed. They suffer from many predictable and well-understood incentive problems, such as poor replication rates of scientific findings following from a need to optimise for publications; the election of dangerous leaders due to the use of provably suboptimal voting systems; or the failure to adequately fund public goods like high-quality explanations of difficult concepts due to the free-rider problem, just to name a few.

The academic literature in economics and mechanism design has a vast backlog of designs for new institutions that could solve these and other problems. One key bottleneck now seems to be implementation.

For example, Ethereum founder Vitalik Buterin has argued that the key skill required is product development: making novel mechanisms with better incentives work in practice (search for “product people” in linked interview).

Similarly, Robin Hanson has argued that there is a large, promising literature on more effective institutions, but “what we need most [... is lots of concrete trials.] To get involved in the messy details of an organization, and just try out different variations until [we] see something that actually works” [4], [5].

Part of the spirit of the current experiment is an attempt to do just this, and, in particular, to do so in the domain of research and intellectual progress.

Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda

The key mechanism underlying this experiment, and its use of prediction and randomisation, is based on ideas from the Iterated Amplification approach to AI alignment. Currently groups at Ought, OpenAI and elsewhere are working on testing the empirical assumptions underlying that theory.

Compared to these groups, the current experiment had a more practical, short-term aim -- to find a “shovel-ready” method of amplifying generalist research that could be applied to make the EA/rationality communities more effective within the coming years.

Nonetheless, potential follow-ups from this experiment might provide useful theoretical insight in that direction.

Exploring forecasting with distributions

Little is known about doing forecasting with full distributions (e.g. “I think this is captured by two normals, with means 5 and 10 and variance 3”) rather than point estimates (e.g. “79%”). Before the launch of Foretold, there wasn’t any software available for easily running such experiments.
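As an illustration of what a distributional forecast like the one quoted above looks like in code, here is a minimal sketch of a two-component normal mixture and its log density at a resolved value. The equal component weights, the use of a natural-log density score, and all function names are assumptions for the example; the post does not describe Foretold's internal representation or exact scoring rule.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def mixture_pdf(x, components):
    """Density of a mixture; components are (weight, mean, variance) triples."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in components)

def log_density_score(resolved_value, components):
    """Log of the density the forecast placed on the value that was realised."""
    return math.log(mixture_pdf(resolved_value, components))

# "Two normals, with means 5 and 10 and variance 3", assumed equally weighted.
forecast = [(0.5, 5.0, 3.0), (0.5, 10.0, 3.0)]

# Resolving at a mode scores better than resolving in the dip between modes,
# and far better than resolving out in the tail.
for x in (5.0, 7.5, 20.0):
    print(x, round(log_density_score(x, forecast), 2))
```

A point estimate such as “79%” collapses all of this into a single number, whereas the mixture lets a forecaster express two distinct hypotheses and how wide each one is.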

This was a quick way of getting data on many questions in distributional forecasting:

Forecasting fuzzy things

A major challenge [? · GW] with prediction markets and forecasting tournaments is the need to concretely specify questions in order to clearly determine who was right and allocate payouts.

Often, this means that these mechanisms are limited to answering questions like:

> “What will the highest performance of an algorithmic benchmark x be at time t?”

Even though what we often care about is something more nebulous, like:

> “How close will we be to AGI at time t?”

The upside of this precision is that it enables us to use quantitative methods to estimate performance, combine predictions, and allocate rewards, as described above.

The current experiments try to get the best of both worlds: the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions. The proposed solution to this problem is simply to have one or many trusted evaluators who decide on the truth of a question, and then have forecasters predict their judgements as opposed to the underlying question [6].

(Previously some of the authors set up the AI Forecasting Resolution Council [LW · GW] to enable such flexible resolution to also be used on AI questions.)

Shooting for unknown unknowns

This is related to the mindset of “prospecting for gold [? · GW]”. To a certain extent, we think that we have a potentially reliable inside view, a certain research taste which is worth following and paying attention to, because we are curious what we might find out.

A drawback with this is that it enables practices like p-hacking/publication bias if results are reported selectively. To mitigate this, all data from this experiment is publicly available here [7].

Challenges

This section discusses some challenges and limitations of the current exploration, as well as our ideas for solving some of them. In particular, we consider:

Complexity and unfamiliarity of experiment.

The current experiment has many moving parts and a large inferential distance [LW · GW]. For example, in order to participate, one would need to understand the mathematical scoring rule, the question input format, the randomisation of resolved questions and how questions would be resolved as distributions.

This makes the set-up challenging to understand for both participants and potential clients who want to use similar amplification set-ups in their own organisations.

We don’t think these things are inherently complicated, but have much work to do on explaining the set-up and making the app generally accessible.

Trust in evaluations.

The extent to which the results are meaningful depends on one’s trust in Elizabeth Van Nostrand’s ability to evaluate questions. We chose Elizabeth for the experiment as she has a reputation for reliable generalist research (through her blog series on “Epistemic Spot Checks” [LW · GW]), and 10+ public blog posts with evaluations of the accuracy of books and papers.

However, the challenge is that this trust often relies on a long history of interactions with her material, in a way which might be hard to communicate to third parties.

For future experiments, we are considering several improvements here.

First, as hinted at above, we can ask forecasters both for their predictions of Elizabeth and for their own personal beliefs. We might then expect that those who can both accurately predict Elizabeth and disagree with her know something she does not, and so will be weighted more highly in the evaluation of the true claim.

Second, we might have set-ups with multiple evaluators; or more elaborate ways of scoring the evaluators themselves (for example based on their ability to predict what they themselves will say after more research).

Third, we might work to have more transparent evaluation processes, for example including systematic rubrics or detailed write-ups of reasoning. We must be careful here not to “throw out the baby with the bathwater”. The purpose of using judges is after all to access subjective evaluations which can’t be easily codified in concrete resolution conditions. However, there seems to be room for more transparency on the margin.

Correlation between predictions and evaluations.

Elizabeth had access to a filtered version of forecaster comments when she made her evaluations. Hence the selection process on evidence affecting her judgements was not independent from the selection process on evidence affecting the aggregate. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic of the experiments.

For future experiments, we’re considering obtaining an objective data-set with clear ground truth, and testing the same set-up without revealing the comments to Elizabeth, to get data on how serious this problem is (or is not).

Difficulty of converting mental models into quantitative distributions.

In order to participate in the experiment, a forecaster has to turn their mental models (represented in whichever way the human brain represents models) into quantitative distributions (which is a format quite unlike that native to our brains), as shown in the following diagram:

Each step in this chain is quite challenging, requires much practice to master, and can result in a loss of information.

Moreover, we are uncertain how the difficulty of this process differs across questions of varying importance. It might be that some of the most important considerations in a domain tend to be confusion-shaped (e.g. “What does it even mean to be aligned under self-improvement when you can’t reliably reason about systems smarter than yourself?”), or very open-ended (e.g. “What new ideas could reliably improve the long-term future?” rather than “How much will saving in index funds benefit future philanthropists?”). Hence filtering for questions that are more easily quantified might select against questions that are more important.

Consider some solutions. For the domains where quantification seems more promising, it seems at least plausible that it should be possible to have some kind of “division of labor” among participants.

For future experiments, we’re looking to better separate “information contribution” and “numerical contribution”, and find ways of rewarding both. Some participants might specialise in research or model-generation, and others in turning that research into distributions.

A challenge here is to appropriately reward users who only submit comments but do not submit predictions. Since one of the core advantages of forecasting tournaments is that they allow us to precisely and quantitatively measure performance, it seems plausible that any solution should try to make use of this fact. (As opposed to, say, using an independent up- and downvoting scheme.) As example mechanisms, one might randomly show a comment to half the users, and reward a comment based on the performance of the aggregate for users who’ve seen it and users who haven’t. Or one might release the comments to forecasters sequentially, and see how much each improves the aggregate. Or one might simply allow users to vote, but weigh the votes of users with a better track-record higher.
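The first of these mechanisms, showing a comment to a random half of the users and comparing the two aggregates, could be sketched roughly as follows. This is only an illustration for a single binary question: the mean-probability aggregate, the base-2 log score, and every function name are assumptions, not a description of anything implemented on Foretold.

```python
import math
import random

def split_in_half(users, seed=None):
    """Randomly assign users to 'sees the comment' and 'does not see it'."""
    rng = random.Random(seed)
    shuffled = list(users)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def aggregate(probabilities):
    """Simple mean aggregate (an assumed stand-in for the real aggregator)."""
    return sum(probabilities) / len(probabilities)

def comment_reward(probs_seen, probs_unseen):
    """Score gain of the aggregate among users who saw the comment.

    Inputs are the probabilities each group assigned to the outcome
    that actually occurred.
    """
    return math.log2(aggregate(probs_seen)) - math.log2(aggregate(probs_unseen))

# Hypothetical forecasts: the group that saw the comment did better.
print(round(comment_reward([0.8, 0.7, 0.9], [0.5, 0.6, 0.4]), 2))  # 0.68
```

A comment that misleads its readers would earn a negative reward under this scheme, which is the property that distinguishes it from a plain upvote count.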

Moreover, in future experiments with Elizabeth we’ll want to pair her up with a “distribution buddy”, whose task is to interview her to figure out in detail what distribution best captures her beliefs, allowing Elizabeth to focus simply on building conceptual models.

Anti-correlation between importance and “outsourceability”

Above we mentioned that the questions easiest to quantify might be anti-correlated with the ones that are most important. It is also plausible that the questions which are easiest to outsource to forecasters are not the same as those on which it is most important to reduce uncertainty. Depending on the shape of these distributions, the experiment might not capture a lot of value. (For illustration, consider an overly extreme example: suppose a venture capitalist tries to amplify their startup investments. The crowd always predicts “no investment”, and turns out to be right in 99/100 cases: the VC doesn’t invest. However, the returns for that one case where the crowd fails and the VC actually would have invested by far dominate the portfolio.)
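The venture capital illustration can be put into numbers. In this sketch the value weights are entirely hypothetical (each "no" decision is worth 1 unit and the single "yes" decision is worth 1000); only the 99/100 structure comes from the text above.

```python
def accuracy(predictions, outcomes):
    """Fraction of questions the crowd got right."""
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)

def share_of_value_captured(predictions, outcomes, value):
    """Fraction of total value sitting on questions the crowd got right."""
    captured = sum(v for p, o, v in zip(predictions, outcomes, value) if p == o)
    return captured / sum(value)

outcomes = [False] * 99 + [True]      # the VC invests in 1 of 100 startups
predictions = [False] * 100           # the crowd always says "no investment"
value = [1.0] * 99 + [1000.0]         # hypothetical importance of each decision

print(accuracy(predictions, outcomes))                                   # 0.99
print(round(share_of_value_captured(predictions, outcomes, value), 2))   # 0.09
```

High accuracy and low captured value can coexist whenever importance is concentrated in the few questions the crowd gets wrong.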

Overhead of question generation.

The act of creating good, forecastable questions is an art in and of itself. If done by the same person or small team which will eventually forecast the questions, one can rely on much shared context and intuition in interpreting the questions. However, scaling these systems to many participants requires additional work in specifying the questions sufficiently clearly. This overhead might be very costly, especially since we think one of the key factors determining the usefulness of a forecasting question is the question itself. How well does it capture something we care about? From experience, writing these questions is hard [LW · GW]. In future we have much work to do to make this process easier.

A scoring rule that discourages collaboration

Participants were scored based on how much they outperformed the aggregate prediction. This scoring approach is similar to the default in prediction markets and major forecasting tournaments. It has the problem that sharing any information via commenting will harm your score (since it will make the performance of other users, and hence the aggregate, better). What’s more, all else remaining the same, doing anything that helps other users will be worse for your score (such as sharing tips and tricks for making better predictions, or pointing out easily fixable mistakes so they can learn from them).
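A small sketch shows why sharing hurts under a rule of this kind. The post does not give the exact formula, so this assumes a score equal to your own log score minus the aggregate's log score on a binary question, with the aggregate taken as a simple mean; the names and numbers are illustrative.

```python
import math

def relative_score(own_p, all_ps):
    """Own log score minus the aggregate's log score (assumed scoring rule).

    own_p and all_ps are probabilities assigned to the outcome that occurred;
    all_ps includes the participant's own forecast.
    """
    aggregate_p = sum(all_ps) / len(all_ps)
    return math.log2(own_p) - math.log2(aggregate_p)

# Before you comment: you are at 90%, two other forecasters sit at 50%.
before = relative_score(0.9, [0.9, 0.5, 0.5])

# After you share your evidence, the others update to 80%.
after = relative_score(0.9, [0.9, 0.8, 0.8])

print(round(before, 2), round(after, 2))  # 0.51 0.11
```

Your own forecast and its accuracy are unchanged, yet your score falls simply because the aggregate you are measured against has improved.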

There are several problems with this approach and how it disincentivises collaboration.

First, it provides an awkward change in incentives for groups who otherwise have regular friendly interactions (such as a team at a company, a university faculty, or members of the effective altruism community).

Second, it causes effort to be wasted as participants must derive the same key insights individually, utilising little division of labor (as any sharing of information will just end up hurting their score on the margin). Having some amount of duplication of work and thinking can of course make the system robust against mistakes -- but we think the optimal amount is far less than the equilibrium under the current scoring rule.

In spite of these theoretical incentives, it is interesting to note that several participants actually ended up writing detailed comments. (Though basically only aimed at explaining their own reasoning, with no collaboration and back-and-forth between participants observed.) This might have been because they knew Elizabeth would see those comments, or for some other reason.

Nonetheless, we are working on modifying our scoring rule in a way which directly incentivises participants to collaborate, and actively rewards helping other users improve their models. We hope to release details of formal models and practical experiments in the coming month.

Footnotes

[1] Examples include: AI alignment, global coordination, macrostrategy and cause prioritisation.

[2] We chose the industrial revolution as a theme since it seems like a historical period with many lessons for improving the world. It was a time of radical change in productivity along with many societal transformations, and might hold lessons for future transformations and our ability to influence those.

[3] For example by averaging predictions and then weighing by past track-record and time until resolution, as done in the Good Judgement Project (among other things).

[4] Some examples of nitty-gritty details we noticed while doing this are:

[5] For a somewhat tangential, but potentially interesting, perspective, see Feynman on making experiments to figure out nitty-gritty details in order to enable other experiments to happen (search for “rats” in the link).

[6] A further direction we’re considering is to allow forecasters to both predict the judgements of evaluators and the underlying truth. We might then expect that those predictors who both accurately forecast the judgement of the evaluator and disagree in their own judgements, might provide valuable clues about the truth.

[7] For the record, before this experiment we ran two similar, smaller experiments (to catch easy mistakes and learn more about the set-up), with about an order of magnitude less total forecasting effort invested. The aggregate from these experiments was quite poor at predicting the evaluations. The data from those experiments can be found here, and more details in Elizabeth’s write-ups here [LW · GW] and here [LW · GW].

Participate in future experiments or run your own

Foretold.io [LW · GW] was built as an open platform to enable more experimentation with prediction-related ideas. We have also made data and analysis calculations from this experiment publicly available.

If you’d like to:

We’d be happy to consider providing advice, operational support, and funding for forecasters. Just comment here or reach out to this email.

If you’d like to participate as a forecaster in future prediction experiments, you can sign up here.

Acknowledgements

Funding for this project was provided by the Berkeley Existential Risk Initiative and the EA Long-term Future Fund.

We thank Beth Barnes and Owain Evans for helpful discussion.

We are also very thankful to all the participants.


21 comments

Comments sorted by top scores.

comment by Richard_Ngo (ricraz) · 2020-01-03T19:19:31.776Z · score: 14 (6 votes) · LW(p) · GW(p)

So the thing I'm wondering here is what makes this "amplification" in more than a trivial sense. Let me think out loud for a bit. Warning: very rambly.

Let's say you're a competent researcher and you want to find out the answers to 100 questions, which you don't have time to investigate yourself. The obvious strategy here is to hire 10 people, get them to investigate 10 questions each, and then pay them based on how valuable you think their research was. Or, perhaps you don't even need to assign them questions - perhaps they can pick their own questions, and you can factor in how neglected each question was as part of the value-of-research calculation.

This is the standard, "freeform" approach; it's "amplification" in the same sense that having employees is always amplification. What does the forecasting approach change?

  • It gives one specific mechanism for how you (the boss) evaluate the quality of research (by comparison with your own deep dive), and rules out all the others. This has the advantage of simplicity and transparency, but has the disadvantage that you can't directly give rewards for other criteria like "how well is this explained". You also can't reward research on topics that you don't do deep dives on.
    • This mainly seems valuable if you don't trust your own ability to evaluate research in an unbiased way. But evaluating research is usually much easier than doing research! In particular, doing research involves evaluating a whole bunch of previous literature.
    • Further, if one of your subordinates thinks you're systematically biased, then the forecasting approach doesn't give them a mechanism to get rewarded for telling you that. Whereas in the freeform approach to evaluating the quality of research, you can take that into account in your value calculation.
  • It gives one specific mechanism for how you aggregate all the research you receive. But that doesn't matter very much, since you're not bound to that - you can do whatever you like with the research after you've received it. And in the freeform approach, you're also able to ask people to produce probability distributions if you think that'll be useful for you to aggregate their research.
  • It might save you time? But I don't think that's true in general. Sure, if you use the strategy of reading everyone's research then grading it, that might take a long time. But since the forecasting approach is highly stochastic (people only get rewards for questions you randomly choose to do a deep dive on) you can be a little bit stochastic in other ways to save time. And presumably there are lots of other grading strategies you could use if you wanted.

Okay, let's take another tack. What makes prediction markets work?

1. Anyone with relevant information can use that information to make money, if the market is wrong.

2. People can see the current market value.

3. They don't have to reveal their information to make money.

4. They know that there's no bias in the evaluation - if their information is good, it's graded by reality, not by some gatekeeper.

5. They don't actually have to get the whole question right - they can just predict a short-term market movement ("this stock is currently undervalued") and then make money off that.

This forecasting setup also features 1 and 2. Whether or not it features 3 depends on whether you (the boss) manage to find that information by yourself in the deep dive. And 4 also depends on that. I don't know whether 5 holds, but I also don't know whether it's important.

So, for the sort of questions we want to ask, is there significant private or hard-to-communicate information?

  • If yes, then people will worry that you won't find it during your deep dive.
  • If no, then you likely don't have any advantage over others who are betting.
  • If it's in the sweet spot where it's private but the investigator would find it during their deep dive, then people with that private information have the right incentives.

If either of the first two options holds, then the forecasting approach might still have an advantage over a freeform approach, because people can see the current best guess when they make their own predictions. Is that visibility important, for the wisdom of crowds to work - or does it work even if everyone submits their probability distributions independently? I don't know - that seems like a crucial question.


Anyway, to summarise, I think it's worth comparing this more explicitly to the most straightforward alternative, which is "ask people to send you information and probability distributions, then use your intuition or expertise or whatever other criteria you like to calculate how valuable their submission is, then send them a proportional amount of money."

comment by bgold · 2020-01-05T18:22:22.198Z · score: 5 (3 votes) · LW(p) · GW(p)

IMO the term "amplification" fits if the scheme 1.) results in a clear efficiency gain and 2.) is scalable. This looks like (delivering equivalent results at a lower cost OR providing better results for an equivalent cost (cost == $$ & time)) AND (~ O(n) scaling costs).

For example if there was a group of people who could emulate [Researcher's] fact checking of 100 claims but do it at 10x speed, then that's an efficiency gain as we're doing the same work in less time. If we pump the number to 1000 claims and the fact checkers could still do it at 10x speed without additional overhead complexity, then it's also scalable. Contrast that with the standard method of hiring additional junior researchers to do the fact checking - I expect it to not be as scalable ("huh we've got all these employees now I guess we need an HR department and perf reviews and...").

It does seem like a fuzzy distinction to me, and I am mildly concerned about overloading a term that already has an association w/ IDA.

comment by ozziegooen · 2020-01-04T17:47:03.170Z · score: 4 (2 votes) · LW(p) · GW(p)

Good points! This covers a lot of ground that we've been thinking about.

So the thing I'm wondering here is what makes this "amplification" in more than a trivial sense.

To be honest, I'm really not sure what word is best here. "Amplification" is the word we used for this post. I've also thought about calling this sort of thing "Proliferation" after "Instillation" here [LW(p) · GW(p)] and have previously referred to this method as Prediction-Augmented Evaluation Systems [LW · GW]. I agree that the employee case could also be considered a kind of amplification according to this terminology. If you have preferences or other ideas for names for this, I'd be eager to hear!

but has the disadvantage that you can't directly give rewards for other criteria like "how well is this explained". You also can't reward research on topics that you don't do deep dives on.

Very true, at least at this stage of development of Foretold. I've written some more thinking on this here [LW(p) · GW(p)]. Traditional prediction markets don't do a good job incentivizing participants to share descriptions and research, but ideally future systems would. There are ways we are working on to improve this with Foretold. A very simple setup would be one that gives people points/money for writing comments that are upvoted by important predictors.

I think it's worth comparing this more explicitly to the most straightforward alternative, which is "ask people to send you information and probability distributions, then use your intuition or expertise or whatever other criteria you like to calculate how valuable their submission is, then send them a proportional amount of money."

This isn't incredibly far from what we're going for, but I think the additional presence of a visible aggregate and the ability for forecasters to learn / compete with each other are going to be useful in expectation. I also would want this to be a very systematized process, because then there is a lot of optimization that could arguably be done. The big downside of forecasting systems is that they are less flexible than free-form solutions, but one big upside is that it may be possible to optimize them in different ways. For instance, eventually there could be significant data science pipelines, and lots of statistics for accuracy and calibration, that would be difficult to attain in free form setups. I think in the short term online forecasting setups will be relatively expensive, but it's possible that with some work they could become significantly more effective for certain types of problems.

I'd definitely agree that good crowdsourced forecasting questions need to be in some sort of sweet spot of "difficult enough to make external-forecasting useful, but open/transparent enough to make external-forecasting possible."

comment by Richard_Ngo (ricraz) · 2020-01-03T19:53:14.660Z · score: 4 (2 votes) · LW(p) · GW(p)

Actually, the key difference between this and prediction markets is that this has no downside risk, it seems? If you can't lose money for bad predictions. So you could exploit it by only making extreme predictions, which would make a lot of money sometimes, without losing money in the other cases. Or by making fake accounts to drag the average down.
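To make this worry concrete, here is a minimal sketch. Everything in it is an assumption for illustration rather than a description of the experiment: a binary question, a hypothetical payout rule that pays the log-score improvement over a baseline forecast but clips losses at zero, and made-up numbers and function names.

```python
import math

def log_score_gain(report, baseline, outcome):
    """Log-score improvement of `report` over `baseline` for a binary outcome."""
    p = report if outcome else 1.0 - report
    q = baseline if outcome else 1.0 - baseline
    return math.log(p) - math.log(q)

def expected_payout(report, baseline, true_prob, clip_losses):
    """Expected payout when the event happens with probability `true_prob`."""
    total = 0.0
    for outcome, weight in ((True, true_prob), (False, 1.0 - true_prob)):
        gain = log_score_gain(report, baseline, outcome)
        if clip_losses:
            gain = max(gain, 0.0)  # hypothetical "no downside" rule
        total += weight * gain
    return total

# Assumed numbers: the true probability and the baseline are both 50%.
honest = expected_payout(0.5, 0.5, 0.5, clip_losses=True)
extreme = expected_payout(0.99, 0.5, 0.5, clip_losses=True)
extreme_unclipped = expected_payout(0.99, 0.5, 0.5, clip_losses=False)
print(honest, extreme, extreme_unclipped)
```

Under these assumptions the extreme report earns more in expectation than the honest one when losses are clipped, and less than zero when they are not, which is the asymmetry described above.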

comment by jacobjacob · 2020-01-05T09:33:35.760Z · score: 8 (2 votes) · LW(p) · GW(p)

It might interest you that there's quite a nice isomorphism between prediction markets and ordinary forecasting tournaments.

Suppose you have some proper scoring rule $S(p_i)$ for predictions $p$ on outcome $i$. For example, in our experiment we used $S(p_i) = \log(p_i)$. Now suppose the $t$:th prediction is paid the difference between their score and the score of the previous participant: $S(p_i^t) - S(p_i^{t-1})$. Then you basically have a prediction market!

To make this isomorphism work, the prediction market must be run by an automated market maker which buys and sells at certain prices which are predetermined by a particular formula.

To see that, let $C(x)$ be the total cost of buying $x_i$ shares in some possibility $i$ (e.g. Yes or No). If the event happens, your payoff will be $x_i - C(x)$ (we're assuming that the shares just pay $1 if the event happens and $0 otherwise). It follows that the cost of buying further shares -- the market price -- is $\partial C / \partial x_i$.

We require that the market prices can be interpreted as probabilities. This means that the prices for all MECE outcomes must sum to 1, i.e. $\sum_i \partial C / \partial x_i = 1$.

Now we set your profit from buying $x_i$ shares in the prediction market be equal to your payout in the forecasting tournament, $x_i - C(x) = S(p_i)$. Finally, we solve for $C$, which specifies how the automated market maker must make its trades. Different scoring rules will give you different $C$. For example, a logarithmic scoring rule will give: $C(x) = \log \sum_i e^{x_i}$.

For more details, see page 54 in this paper, Section 5.3, "Cost functions and Market Scoring Rules".
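As a rough numerical check of this correspondence, here is a minimal sketch. It assumes the log-sum-exp cost function $C(x) = \log \sum_i e^{x_i}$ (the logarithmic market scoring rule with liquidity parameter 1), a two-outcome market, and an illustrative trade; the function names are hypothetical and not taken from any forecasting platform.

```python
import math

def lmsr_cost(shares):
    """C(x) = log(sum_i exp(x_i)), computed in a numerically stable way."""
    m = max(shares)
    return m + math.log(sum(math.exp(s - m) for s in shares))

def lmsr_prices(shares):
    """Market prices dC/dx_i; they are positive and sum to 1."""
    m = max(shares)
    weights = [math.exp(s - m) for s in shares]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed example: outstanding shares for [Yes, No] before and after one trade.
before = [0.0, 0.0]            # prices 50% / 50%
after = [math.log(4.0), 0.0]   # trader buys log(4) Yes shares -> prices 80% / 20%

p_before = lmsr_prices(before)
p_after = lmsr_prices(after)
cost = lmsr_cost(after) - lmsr_cost(before)  # amount paid to the market maker

# Profit if outcome i happens: shares bought in i pay $1 each, minus the cost.
profits = [(after[i] - before[i]) - cost for i in range(2)]
# Log-score difference between the new and the previous forecast on outcome i.
score_diffs = [math.log(p_after[i]) - math.log(p_before[i]) for i in range(2)]

print(p_after, profits, score_diffs)
```

In this example the trader's profit in each outcome matches the difference in log scores between the new forecast and the previous one, which is the payment rule of the forecasting tournament described above.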

comment by ozziegooen · 2020-01-04T17:55:45.586Z · score: 2 (1 votes) · LW(p) · GW(p)

This is why proper scoring rules are important. As long as you are adequately using proper scoring rules, and proper combinations of those scoring rules, then people will be incentivized to predict according to their own beliefs. If we assume that users can't make fake accounts, and are paid in proportion to their performance according to proper scoring rules, then they shouldn't be able to gain expected earnings by providing overconfident answers.

The log-scoring function we use is a proper scoring rule. The potential winnings if you do a great job are very capped due to this scoring rule.
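A minimal sketch of what "proper" means here, assuming a binary question and a forecaster whose honest belief is 70% (both are illustrative assumptions, as is the function name): the expected log score is highest when the reported probability equals the belief, and an overconfident report does worse in expectation.

```python
import math

def expected_log_score(report, belief):
    """Expected log score of reporting `report` when the forecaster believes `belief`."""
    return belief * math.log(report) + (1.0 - belief) * math.log(1.0 - report)

belief = 0.7  # assumed honest belief that the event happens

# Search a grid of possible reports for the one with the highest expected score.
grid = [i / 100 for i in range(1, 100)]
best_report = max(grid, key=lambda r: expected_log_score(r, belief))

honest_score = expected_log_score(belief, belief)
overconfident_score = expected_log_score(0.99, belief)
print(best_report, honest_score, overconfident_score)
```

This only checks the scoring rule in isolation; it says nothing about incentives created by how scores are converted into payments.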

In this specific experiment we had some trust in the participants and no obviously fake accounts. If we scaled this, fake accounts would be an issue, but there are ways around it. I also would imagine that a more robust system would look something like having users begin with little "trust", which they would then build up by providing good forecasts. They would only begin being paid once they had reached some threshold of trust; but above that level the proper scoring rules should generally create reasonable incentives.

comment by Richard_Ngo (ricraz) · 2020-01-04T20:19:09.091Z · score: 4 (2 votes) · LW(p) · GW(p)

I have four concerns even given that you're using a proper scoring rule, which relate to the link between that scoring rule and actually giving people money. I'm not particularly well-informed on this though, so could be totally wrong.

1. To implement some proper scoring rules, you need the ability to confiscate money from people who predict badly. Even when the score always has the same sign, like you have with log-scoring (or when you add a constant to a quadratic scoring system), if you don't confiscate money for bad predictions, then you're basically just giving money to people for signing up, which makes having an open platform tricky.

2. Even if you restrict signups, you get an analogous problem within a fixed population who's already signed up: the incentives will be skewed when it comes to choosing which questions to answer. In particular, if people expect to get positive amounts of money for answering randomly, they'll do so even when they have no relevant information, adding a lot of noise.

3. If a scoring rule is "very capped", as the log-scoring function is, then the expected reward from answering randomly may be very close to the expected reward from putting in a lot of effort, and so people would be incentivised to answer randomly and spend their time on other things.

4. Relatedly, people's utilities aren't linear in money, so the score function might not remain a proper one taking that into account. But I don't think this would be a big effect on the scales this is likely to operate on.

comment by ozziegooen · 2020-01-04T20:44:57.408Z · score: 8 (4 votes) · LW(p) · GW(p)

The fact that we use a "proper scoring rule" definitely doesn't mean that the entire system, including the participants' true utility functions, is really "proper". There is really a fair bit of impropriety. For instance, people also may care about their online reputation, and that won't be captured in the proper scoring rule. The proper scoring rule really helps make sure that one specific aspect of the system is "proper" according to a simplified model. This is definitely subideal, but I think it's still good enough for a lot of things. I'm not sure what type of system would be "perfectly proper".

Prediction markets have their own disadvantages, as participants don't behave as perfectly rational agents there either. So I won't claim that the system is "perfectly aligned", but I will suggest that it seems "decently aligned" compared to other alternatives, with the ability to improve as we (or others with other systems) add further complexity.

If you don't confiscate money for bad predictions, then you're basically just giving money to people for signing up, which makes having an open platform tricky.

What was done in this case was that participants were basically paid a fixed fee for participating, with a second "bonus" that was larger, that was paid in proportion to how they did on said rule. This works in experimental settings where we can filter the participants. It would definitely be more work to make the system totally openly available, especially as the prizes increase in value, much for the reason you describe. We're working to try to figure out solutions that could hold up (somewhat) in these circumstances, but it is tricky, for reasons you suggest and for others.

I'd also point out that having a nice scoring system is one challenge out of many challenges. Having nice probability distribution viewers and editors is difficult. Writing good questions and organizing them, and having software that does this well, is also difficult. This is something that @jacobjacob has been spending a decent amount of time thinking about after this experiment, but I've personally been focusing on other aspects.

At least in this experiment, the scoring system didn't seem like a big bottleneck. The participants who won the most money were generally those who seemed to have given thoughtful and useful probability distributions. Things are much easier when you have an audience who is generally taking things in good faith and who can be excluded from future rounds if it seems appropriate.

comment by Richard_Ngo (ricraz) · 2020-01-04T22:01:34.741Z · score: 6 (3 votes) · LW(p) · GW(p)

Cool, thanks for those clarifications :) In case it didn't come through from the previous comments, I wanted to make clear that this seems like exciting work and I'm looking forward to hearing how follow-ups go.

comment by ozziegooen · 2020-01-04T22:03:25.842Z · score: 2 (1 votes) · LW(p) · GW(p)

Thanks! I really do appreciate the thoughts & feedback in general, and am quite happy to answer questions. There's a whole lot we haven't written up yet, and it's much easier for me to reply to things than lay everything out.

comment by Richard_Ngo (ricraz) · 2020-01-03T19:41:42.546Z · score: 3 (2 votes) · LW(p) · GW(p)

Another point: prediction markets allow you to bet more if you're more confident the market is off. This doesn't, except by betting that the market is further off. Which is different. But idk if that matters very much, you could probably recreate that dynamic by letting people weight their own predictions.

comment by ozziegooen · 2020-01-04T17:50:03.976Z · score: 2 (1 votes) · LW(p) · GW(p)

This is definitely a feature we're considering adding in some form (likely, something like weight/leverage). The current scoring system we are using is quite simple; I expect it to get more sophisticated.

However, one big downside is that sophistication would come with complexity, which could be a lot for some users.

comment by jacobjacob · 2020-01-05T10:27:47.817Z · score: 2 (1 votes) · LW(p) · GW(p)

I'll try to paraphrase you (as well as extrapolating a bit) to see if I get what you're saying:

Say you want some research done. The most straightforward way to do so is to just hire a researcher. This "freeform" approach affords a lot of flexibility in how you delegate, evaluate, communicate, reward and aggregate the research. You can build up subtle, shared intuitions with your researchers, and invest a lot of effort in your ability to communicate nuanced and difficult instructions. You can also pick highly independent researchers who are able to make many decisions for themselves in terms of what to research, and how to research it.
By using "amplification" schemes and other mechanisms, you're placing significant restrictions on your ability to do all of those things. Hence you better get some great returns to compensate.
But looking through various ways you might get these benefits, they all seem at best... fine.
Hence the worry is that despite all the bells-and-whistles, there's actually no magic happening. This is just like hiring a researcher, but a bit worse. This is only "amplification" in a trivial sense.
As a corollary, if your research needs seem to be met by a handful of in-house researchers, this method wouldn't be very helpful to you.

1) Does this capture your views?

2) I'm curious what you think of the sections: "Mitigating capacity bottlenecks [LW · GW]" and "A way for intellectual talent to build and demonstrate their skills [LW · GW]"?

In particular, I didn't feel like your comment engaged with A) the scalability of the approach, compared to the freeform approach, and B) that it might be used as a "game" for young researchers to build skills and reputation, which seems way harder to do with the freeform approach.

comment by Lawrence Phillips · 2020-01-03T21:02:11.083Z · score: 12 (5 votes) · LW(p) · GW(p)

Nice work. A few comments/questions:

  • I think you're being harsh on yourselves by emphasising the cost/benefit ratio. For one, the forecasters were asked to predict Elizabeth Van Nostrand's distributions rather than their mean, right? So this method of scoring would actually reward them for being worse at their jobs, if they happened to put all their mass near the resolution's mean as opposed to predicting the correct distribution. IMO a more interesting measure is the degree of agreement between the forecasters' predictions and Elizabeth's distributions, although I appreciate that that's hard to condense into an intuitive statistic.
  • An interesting question this touches on is "Can research be parallelised?". It would be nice to investigate this more closely. It feels as though different types of research questions might be amenable to different forms of parallelisation involving more or less communication between individual researchers and more or less sophisticated aggregation functions. For example, a strategy where each researcher is explicitly assigned a separate portion of the problem to work on, and at the end the conclusions are synthesised in a discussion among the researchers, might be appropriate for some questions. Do you have any plans to explore directions like these, or do you think that what you did in this experiment (as I understand, ad-hoc cooperation among the forecasters with each submitting a distribution, these then being averaged) is appropriate for most questions? If so, why?
  • About the anticorrelation between importance and "outsourceability": investigating which types of questions are outsourceable would be super interesting. You'd think there'd be some connection between outsourceable questions and parallelisable problems in computer science. Again, different aggregation functions/incentive structures will lead to different questions being outsourceable.
  • One potential use case for this kind of thing could be as a way of finding reasonable distributions over answers to questions that require so much information that a single person or small group couldn't do the research in an acceptable amount of time or correctly synthesise their conclusions by themselves. One could test how plausible this is by looking at how aggregate performance tracks complexity on problems where one person can do the research alone. So an experiment like the one you've done, but on questions of varying complexity, starting from trivial up to the limit of what's feasible.

comment by ozziegooen · 2020-01-04T16:55:27.854Z · score: 5 (3 votes) · LW(p) · GW(p)

Great questions! I'll try to respond to the points in order.

Question 1

Whether forecasters/Elizabeth were predicting her initial distributions or the final mean was a rather confusing distinction. I later wrote some internal notes to think through some implications in more detail. You can see them here.

I have a lot of uncertainty in how to best structure these setups. I think though that for cost effectiveness, Elizabeth's initial distributions should be seen as estimates of the correct value, which is what she occasionally later gave. As such, for cost effectiveness we are interested in how well the forecasters did at estimating this correct value, vs. how well she did at estimating this correct value.

Separately, it's of course apparent that that correct value itself is an estimate, and there's further theoretical work to be done to best say what it should have been estimating, and empirical work to be done to get a sense of how well it holds up against even more trustworthy estimates.

I personally don't regard the cost effectiveness here as that crucial; I'd instead treat much of this experiment as a set of structures that could apply to more important things in other cases. Elizabeth's time was rather inexpensive compared to other people/procedures we may want to use in the future, and we could also spend fixed costs improving the marginal costs of such a setup.

Question 2

We haven't talked about this specific thing, but I could definitely imagine it. The general hope is that even without such a split, many splits would happen automatically. One big challenge is to get the splits right. One may initially think that forecaster work should be split by partitions of questions, but this may be pretty suboptimal. It may be that some forecasters have significant comparative advantages in techniques that span across questions; for instance, some people are great at making mathematical models, and others are great at adjusting the tails of distributions to account for common biases. I think of this more as dividing cognitive work based on trading strategies than questions.

There are a whole ton of possible experiments to be done here, because there are many degrees of freedom. Pursuing these in an effective way is one of our main questions. Of course, if we could have forecasters help forecast which experiments would be effective, then that could help bootstrap a process.

Question 3

We've come up with a few "rubrics" to evaluate how effective a given question or question set will be. The main factors are things like:

  1. Tractability (How much progress for how many resources can be made? What if all the participants are outside the relevant organizations/work?)
  2. Importance (How likely is this information to be valuable for changing important decisions?)
  3. Risk (How likely is it that this work will really anger someone or lead to significant downsides?)

I think it's really easy to spend a lot of money predicting ineffective things if you are not careful. Finding opportunities that are EV-positive is a pretty significant challenge here. I think my general intended strategy is a mix of "try a bunch of things" and "try to set up a system so the predictors themselves could predict the rubric elements or similar for a bunch of things they could predict."

Question 4

Agreed! That said, there are many possible dimensions for "complexity", so there's a lot of theoretical and practical work to be done here.

comment by jacobjacob · 2020-01-05T10:51:07.467Z · score: 3 (2 votes) · LW(p) · GW(p)

Question 3

It seems like Ozzie is answering on a more abstract level than the question was asked. There's a difference between "How valuable will it be to answer question X?" (what Ozzie said) and "How outsourceable is question X?" (what Lawrence's question was related to).

I think that outsourceability would be a sub-property of Tractability.

In more detail, some properties I imagine to affect outsourceability, are whether the question:

1) Requires in-depth domain knowledge/experience

2) Requires substantial back-and-forth between question asker and question answerer to get the intention right

3) Relies on hard-to-communicate intuitions

4) Cannot easily be converted into a quantitative distribution

5) Has independent subcomponents which can be answered separately and don't rely on each other to be answered (related to Lawrence's point about tractability)

comment by Pattern · 2019-12-20T22:30:33.239Z · score: 3 (2 votes) · LW(p) · GW(p)

and each unit invested in the crowdworkers provided negative returns, as they tended to be less accurate than Elizabeth’s prior.

Would they be useful for finding the wrong answer?

comment by Ozzie Gooen (ozzie-gooen) · 2019-12-21T14:50:37.628Z · score: 3 (2 votes) · LW(p) · GW(p)

If you are asking if we could effectively use some transformation on their results to get a useful signal, my strong bet is “maybe, but barely so.” I know there are cases in finance where poor predictors are actually systematically wrong, in ways that a good predictor could use for updating; but I expect that’s for specific reasons we don’t have.

comment by jacobjacob · 2019-12-21T08:11:37.386Z · score: 1 (2 votes) · LW(p) · GW(p)

I'm afraid I don't understand your question, could you clarify?

comment by Richard_Ngo (ricraz) · 2020-01-03T19:21:22.078Z · score: 5 (3 votes) · LW(p) · GW(p)

My interpretation: there's no such thing as negative value of information. If the mean of the crowdworkers' estimates were reliably in the wrong direction (compared with Elizabeth's prior) then that would allow you to update Elizabeth's prior to make it more accurate.

comment by Pattern · 2019-12-22T02:40:49.945Z · score: -1 (2 votes) · LW(p) · GW(p)

An oracle that is always wrong can still be useful.