[Part 1] Amplifying generalist research via forecasting – Models of impact and challenges
Nice work. A few comments/questions:
- I think you're being harsh on yourselves by emphasising the cost/benefit ratio. For one, the forecasters were asked to predict Elizabeth van Norstrand's distributions rather than their mean, right? So this method of scoring would actually reward them for being worse at their jobs: a forecaster who happened to put all their mass near the resolution's mean would score better than one who predicted the correct distribution. IMO a more interesting measure is the degree of agreement between the forecasters' predictions and Elizabeth's distributions, although I appreciate that that's hard to condense into an intuitive statistic.
- An interesting question this touches on is "Can research be parallelised?". It would be nice to investigate this more closely. It feels as though different types of research questions might be amenable to different forms of parallelisation, involving more or less communication between individual researchers and more or less sophisticated aggregation functions. For example, a strategy where each researcher is explicitly assigned a separate portion of the problem to work on, and at the end the conclusions are synthesised in a discussion among the researchers, might be appropriate for some questions. Do you have any plans to explore directions like these, or do you think that what you did in this experiment (as I understand it, ad-hoc cooperation among the forecasters, with each submitting a distribution and these then being averaged) is appropriate for most questions? If so, why?
- About the anticorrelation between importance and "outsourceability": investigating which types of questions are outsourceable would be super interesting. You'd think there'd be some connection between outsourceable questions and parallelisable problems in computer science. Again, different aggregation functions/incentive structures will lead to different questions being outsourceable.
- One potential use case for this kind of thing could be as a way of finding reasonable distributions over answers to questions that require so much information that a single person or small group couldn't do the research in an acceptable amount of time or correctly synthesise their conclusions by themselves. One could test how plausible this is by looking at how aggregate performance tracks complexity on problems where one person can do the research alone. So an experiment like the one you've done, but on questions of varying complexity, starting from trivial up to the limit of what's feasible.
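
To make the first two points a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the five answer bins, the made-up numbers, the choice of log score, KL divergence and total variation as statistics, and equal-weight averaging as the aggregation function. It is not a description of how the experiment was actually scored.

```python
import math

def normalise(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def log_score(forecast, resolved_bin):
    """Log score if the question resolves in a single bin (higher is better)."""
    return math.log(forecast[resolved_bin])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def linear_pool(forecasts):
    """Aggregate forecasts by averaging them bin by bin (equal weights)."""
    n = len(forecasts)
    return [sum(f[i] for f in forecasts) / n for i in range(len(forecasts[0]))]

# Hypothetical reference distribution over five answer bins (mean in bin 2).
reference = normalise([1, 2, 4, 2, 1])
# Forecaster A reproduces the reference shape exactly.
forecaster_a = normalise([1, 2, 4, 2, 1])
# Forecaster B piles nearly all mass onto the mean bin.
forecaster_b = normalise([0.01, 0.01, 0.96, 0.01, 0.01])
pooled = linear_pool([forecaster_a, forecaster_b])

mean_bin = 2
for name, f in [("A", forecaster_a), ("B", forecaster_b), ("pooled", pooled)]:
    print(
        name,
        "log score at mean:", round(log_score(f, mean_bin), 3),
        "KL from reference:", round(kl_divergence(reference, f), 3),
        "TV from reference:", round(total_variation(reference, f), 3),
    )
```

Under these assumptions, B beats A on the score against the mean even though A matches the reference distribution exactly, while the two agreement statistics rank them the other way round. Swapping `linear_pool` for a different aggregation function would be one cheap way to start poking at the parallelisation question.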