A Framework of Prediction Technologies

post by isaduan · 2021-10-03T10:26:34.928Z · LW · GW · 2 comments

Contents

  Summary
      Some warnings of my epistemic status:
  A Framework of Prediction Technologies
    [1] Data Analysis Pipelines
      1.1. Classifiers and regressions
        Four use-cases
      1.2 Agent-based models
        Two types of models, and how AI comes in
    [2] Augmented Human Forecasters
        Two ways of augmentation
    [3] General-purpose AI Prediction Services
        AI forecasters?
        Embeddedness required?
    Notes
2 comments

Summary

This is the first piece of a blog post series that explores how AI prediction services affect the risks of war. It is based on my 10-week summer research project at Stanford Existential Risk Institute.

In this post, I will discuss different prediction technologies - how they might be relevant to nation-state governments and what their development trajectories might look like. The second post borrows from international relations literature to examine what implications follow for the risk of war. The third post concludes with three possible world scenarios and a few final remarks.

Some warnings of my epistemic status:

Throughout, I focus on two broad factors that bargaining accounts of war treat as central:

The amount of private information, which affects the probability of reaching peace agreements;

The cost of wars, including offense-defense balance, international institutions, and the influence of domestic audiences.

However, wars break out for many different reasons, and sometimes for no clear reason at all. Further research that distinguishes between different types of war (e.g. great power fringe wars, nuclear war, high-casualty wars) would likely be more useful, but that is beyond the scope of this analysis.

So, here’s an overarching guess: AI prediction services could gradually reduce the generic risk of war, not by reducing private information in peace bargaining, but by increasing the cost of war relative to peace.


A Framework of Prediction Technologies

Prediction is about generating new information from known information.[1] I understand prediction technologies as modeling tools: they structure known information in ways that generate useful new information. Below I outline a framework for understanding different prediction technologies involving AI.[2]


[1] Data Analysis Pipelines

1.1. Classifiers and regressions

Four use-cases

Development trajectory. My guess is that data-driven classifiers and regressions are likely to see a performance boost in the next 5 years but face diminishing marginal returns to research effort, with wide variance across prediction tasks.

  1. Better-performing models are harder to find under the “stepping-on-toes” effect: increasing the complexity of prediction models yields only marginal gains in accuracy, and those gains may be offset by overfitting to the training data, given many sources of uncertainty such as changing populations, sample selection bias, shifting classifications and concepts, and changing evaluation criteria (see the toy sketch after this list).
  2. There are also practical challenges of data quality, data uncertainty, and data availability: streaming data can cause unpredictable changes to stored data and models, whose quality and structure are unknown until runtime; moreover, regulation and prohibition of surveillance and data use could hamper the quality and availability of data.
  3. Separate from the insufficiency of models and data, complex social systems may exhibit some intrinsic unpredictability: in one simulation, even with unlimited data about users, followers, content, and past performance, a state-of-the-art algorithm could explain less than 50% of the variance in the popularity of tweets. When the perfect-information assumption is relaxed, even a small degree of uncertainty in one variable leads to substantially more restrictive bounds on predictability.[6] This suggests that, for some prediction tasks, a marginal performance improvement can be prohibitively expensive and unlikely to be pursued even when theoretically possible.
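
To make the complexity/overfitting point in (1) concrete, here is a minimal sketch of my own (not from any cited source), using a synthetic, noisy classification task: past a certain point, added model complexity stops improving held-out accuracy and mostly fits noise.

```python
# Toy illustration: on a noisy synthetic task, extra model complexity buys
# little test accuracy and eventually overfits the training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.3, random_state=0)  # flip_y injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:  # None = grow the tree until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```

Training accuracy keeps climbing with depth while test accuracy flattens or falls; the same qualitative pattern is what makes marginal accuracy gains expensive in practice.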

Variance among tasks. Certain prediction tasks can be easier than others:

if skill (the stable, intrinsic attributes of the object under study), rather than luck (systemic randomness independent of the object), contributes most to the outcomes of interest;

if the outcomes of interest are population-level aggregates, so that randomness among individual components balances out (see the toy simulation after this list);

if there is less incentive to misrepresent the outcomes, or if misrepresentation is difficult, e.g. using the vocal characteristics of presidential candidates (possibly an unconscious, difficult-to-manipulate cue) to predict the outcomes of actual elections; [7]

if quasi-experiments are possible, data are abundant, and feedback cycles are short, e.g. using social media data to model individual and collective viewpoints, preferences, sentiments, networks, attention to events, and responses to shocks.
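
The population-level point is essentially the law of large numbers. Here is a toy simulation of my own (the numbers and setup are invented for illustration) in which individual outcomes are dominated by luck, yet group averages track the underlying skill closely:

```python
# Individual outcomes are mostly noise; averaging over many members of a group
# cancels the noise, so the aggregate outcome tracks the underlying skill.
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_members = 200, 500
skill = rng.normal(size=n_groups)                        # one "true" skill per group
noise = rng.normal(scale=5.0, size=(n_groups, n_members))
outcomes = skill[:, None] + noise                        # individual outcomes: mostly luck

# Predicting a single member's outcome from skill: weak relationship.
r_individual = np.corrcoef(skill, outcomes[:, 0])[0, 1]
# Predicting the group-level (mean) outcome: noise largely cancels out.
r_population = np.corrcoef(skill, outcomes.mean(axis=1))[0, 1]
print(f"individual-level r = {r_individual:.2f}, population-level r = {r_population:.2f}")
```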

Future research could distill this in more detail.

1.2 Agent-based models

Two types of models, and how AI comes in

Development trajectory. My guess is that in the next 5 to 10 years, applications of game theory or other multi-agent systems might be restricted to narrow settings where the structure of local strategic interaction is relatively stable.


[2] Augmented Human Forecasters

Two ways of augmentation

Development trajectory. My guess is that its development may be modest in the next 5 years, with substantial potential possibly realized in the next 5 to 20 years, though this is not guaranteed.


[3] General-purpose AI Prediction Services

AI forecasters?

It might be possible to train an AI forecaster to work from raw data on the Internet, the way a human forecaster on CSET’s Foretell or Metaculus does, by training it on many different prediction tasks.

  1. GPT-3, trained on a small sample of resolved binary forecasting questions from Metaculus, seems able to discriminate occurring events from non-occurring ones (a toy evaluation sketch follows after this list).
  2. RL agents using knowledge graphs can build effective, hierarchical representations of the environment in text-based games, and they show an ability to perform common-sense reasoning with temporal and relational awareness.[21]
  3. Memory-augmented neural networks trained on a sequence of similar classification tasks show an ability of meta-learning: they seem to learn an abstract method for obtaining useful representations from raw data and, when presented with a new task, to leverage a small amount of new data to make accurate predictions.
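
As a minimal sketch of my own of how one might score such a system: the model_probability function below is a placeholder standing in for a language-model forecaster, and the two questions are made up purely for illustration.

```python
# Toy evaluation harness: given a model that maps a forecasting question to
# P(event occurs), check how well its probabilities fit resolved outcomes.
import numpy as np

def model_probability(question: str) -> float:
    """Placeholder for a language-model-based forecaster (e.g. a fine-tuned model)."""
    return 0.5  # a real system would return a question-specific probability here

resolved = [
    ("Will country X hold elections by 2022?", 1),  # 1 = event occurred
    ("Will treaty Y be ratified by 2021?", 0),      # 0 = event did not occur
]
probs = np.array([model_probability(q) for q, _ in resolved])
outcomes = np.array([y for _, y in resolved])

brier = np.mean((probs - outcomes) ** 2)  # lower is better; always guessing 0.5 gives 0.25
print(f"Brier score: {brier:.3f}")
```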

Embeddedness required?

However, it could be that general-purpose superhuman prediction capability requires open-ended learning in the real world, or in a simulated environment that sufficiently resembles the real world. If so, AI prediction services would become agent-like. Like DeepMind’s Agent57 and MuZero, they might have intrinsic reward functions, separate from the extrinsic reward provided by the environment, to encourage exploration; they might also have meta-controllers that adaptively adjust the exploration/exploitation trade-off.
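
As a toy illustration of the intrinsic-reward idea (my own sketch, not DeepMind's actual mechanism): an agent's total reward can combine the environment's extrinsic reward with a novelty bonus that shrinks as states become familiar, with a coefficient that a meta-controller could adapt over time. The count-based bonus below is a simple stand-in for the more sophisticated novelty measures used in practice.

```python
# Schematic: extrinsic reward plus a count-based novelty bonus for exploration.
from collections import defaultdict
import math

class NoveltyBonus:
    """Rarely visited states earn a larger intrinsic reward."""
    def __init__(self):
        self.visits = defaultdict(int)

    def __call__(self, state) -> float:
        self.visits[state] += 1
        return 1.0 / math.sqrt(self.visits[state])

def total_reward(extrinsic: float, state, bonus: NoveltyBonus, beta: float = 0.3) -> float:
    # beta sets the exploration/exploitation trade-off; a meta-controller
    # would adapt it over time rather than keep it fixed.
    return extrinsic + beta * bonus(state)

bonus = NoveltyBonus()
print(total_reward(1.0, "state_a", bonus))  # novel state -> larger bonus
print(total_reward(1.0, "state_a", bonus))  # repeated state -> smaller bonus
```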

Whether and how quickly such embedded prediction services arrive might depend on:

  1. The progress of human-machine interaction and the accumulation of feedback data;
  2. The capability of simulated environments to resemble some narrow strategic settings of the real world;
  3. The rate at which relevant real-world settings move into augmented and virtual realities.

Development trajectory. Hard to say. If a superhuman text-based AI forecaster is possible, and there is a consistent effort pushing towards it, then it might happen within the next 15 to 20 years.


Notes


  1. Ajay K. Agrawal, Joshua S. Gans, and Avi Goldfarb, ‘Prediction, Judgment and Complexity: A Theory of Decision Making and Artificial Intelligence’ (National Bureau of Economic Research, 29 January 2018), https://doi.org/10.3386/w24243. ↩︎

  2. Nuño Sempere proposes an alternative framework: [1] human forecasters; [2] transparent data analysis systems; [3] black-box ML systems (e.g. weather forecasting systems); [4] black-box ML systems which have direct effects on the world (e.g. automated trading systems, automated missile detection systems). I benefit from his suggestion and think that his framework is better for understanding the technical side of prediction technologies. But I choose to structure the discussion here using my framework because I find it helpful when thinking about my research question, the geopolitical implications of those technologies. ↩︎

  3. Sathappan Muthiah et al., ‘EMBERS at 4 Years: Experiences Operating an Open Source Indicators Forecasting System’, ArXiv:1604.00033 [Cs], 31 March 2016, http://arxiv.org/abs/1604.00033. ↩︎

  4. See the IARPA’s Mercury program. ↩︎

  5. Tamay Besiroglu, ‘Are Models Getting Harder to Find?’ Master’s Thesis. University of Cambridge, 2020. ↩︎

  6. Travis Martin et al., ‘Exploring Limits to Prediction in Complex Social Systems’, Proceedings of the 25th International Conference on World Wide Web, 11 April 2016, 683–94, https://doi.org/10.1145/2872427.2883001. ↩︎

  7. Irena Pavela Banai, Benjamin Banai, and Kosta Bovan, ‘Vocal Characteristics of Presidential Candidates Can Predict the Outcome of Actual Elections’, Evolution and Human Behavior 38, no. 3 (May 2017): 309–14, https://doi.org/10.1016/j.evolhumbehav.2016.10.012. ↩︎

  8. Milind Tambe, Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned (Cambridge ; New York: Cambridge University Press, 2012). ↩︎

  9. Vincent Conitzer and Tuomas Sandholm, ‘Self-Interested Automated Mechanism Design and Implications for Optimal Combinatorial Auctions’, in Proceedings of the 5th ACM Conference on Electronic Commerce - EC ’04 (the 5th ACM conference, New York, NY, USA: ACM Press, 2004), 132, https://doi.org/10.1145/988772.988793. ↩︎

  10. Stephan Zheng et al., ‘The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies’, ArXiv:2004.13332 [Cs, Econ, q-Fin, Stat], 28 April 2020, http://arxiv.org/abs/2004.13332. ↩︎

  11. See slide 19 of “HFC Proposers’ Day”. ↩︎

  12. Fred Morstatter et al., ‘SAGE: A Hybrid Geopolitical Event Forecasting System’, in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}, Macao, China: International Joint Conferences on Artificial Intelligence Organization, 2019), 6557–59, https://doi.org/10.24963/ijcai.2019/955. ↩︎

  13. Yue Ning et al., ‘Spatio-Temporal Event Forecasting and Precursor Identification’, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Anchorage AK USA: ACM, 2019), 3237–38, https://doi.org/10.1145/3292500.3332291. ↩︎

  14. See, for example, the IARPA’s BETTER program. ↩︎

  15. See Ought’s Elicit. ↩︎

  16. I owe this point to Nuño Sempere. ↩︎

  17. See, for example, Keren Yarhi-Milo, Knowing the Adversary: Leaders, Intelligence, and Assessment of Intentions in International Relations (Princeton University Press, 2014), https://press.princeton.edu/books/hardcover/9780691159157/knowing-the-adversary. ↩︎

  18. Jeffrey A. Friedman, War and Chance: Assessing Uncertainty in International Politics (Oxford University Press, 2019). ↩︎

  19. My wild conjecture inspired by Jonas D Hasbach and Maren Bennewitz, ‘The Design of Self-Organizing Human–Swarm Intelligence’, Adaptive Behavior, 3 July 2021, 10597123211017550, https://doi.org/10.1177/10597123211017550. ↩︎

  20. Fred Morstatter et al., ‘SAGE: A Hybrid Geopolitical Event Forecasting System’. ↩︎

  21. Yunqiu Xu et al., ‘Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-Based Games’, n.d., 13. ↩︎

2 comments


comment by SimonM · 2021-10-04T10:31:08.512Z · LW(p) · GW(p)

tl;dr - I don't believe the Metaculus prediction is materially better than the community median.

Another example is Metaculus Prediction, an ML algorithm that calibrates and weights each forecaster’s prediction after training on forecaster-level predictions and track records. From 2015 to 2021, it outperformed the median forecast in the Metaculus community by 24% on binary questions and by 9% on continuous questions.

This is at best a misleading way of describing the performance of the Metaculus prediction vs the community (median) prediction. 

We can slice the data in any number of ways, and I can't find any way to suggest the Metaculus prediction outperformed the median prediction by 24%. 

Looking at (for all time):

| Brier            | Resolve | Close | All Times |
|------------------|---------|-------|-----------|
| Community median | 0.121   | 0.123 | 0.153     |
| Metaculus        | 0.116   | 0.116 | 0.146     |
| Difference       | 4.3%    | 6.0%  | 4.8%      |

| Log              | Resolve | Close | All   |
|------------------|---------|-------|-------|
| Community median | 0.42    | 0.412 | 0.274 |
| Metaculus        | 0.431   | 0.431 | 0.295 |
| Difference       | 2.6%    | 4.6%  | 7.7%  |

None of these are close to 24%. I also think that given the Metaculus algorithm only came into existence in June 2017, we should really only look at performance more recently. For example, the same table looking at everything from July 2018 onwards looks like:

| Brier            | Resolve | Close | All Times |
|------------------|---------|-------|-----------|
| Community median | 0.107   | 0.105 | 0.147     |
| Metaculus        | 0.108   | 0.113 | 0.156     |
| Difference       | -0.9%   | -7.1% | -5.8%     |

| Log              | Resolve | Close | All    |
|------------------|---------|-------|--------|
| Community median | 0.462   | 0.463 | 0.26   |
| Metaculus        | 0.448   | 0.426 | 0.226  |
| Difference       | -3.0%   | -8.0% | -13.1% |

Now the community median outperforms every time!

For continuous questions the Metaculus forecast has outperformed more consistently out-of-sample, but still by smaller differences than what you've claimed:

| Continuous | Resolve | Close | All  |
|------------|---------|-------|------|
| Community  | 2.26    | 2.22  | 1.69 |
| Metaculus  | 2.32    | 2.32  | 1.74 |
| Difference | 2.7%    | 4.5%  | 3.0% |

| Continuous (July '18 -) | Resolve | Close | All  |
|-------------------------|---------|-------|------|
| Community               | 2.28    | 2.27  | 1.73 |
| Metaculus               | 2.35    | 2.38  | 1.79 |
| Difference              | 3.1%    | 4.8%  | 3.5% |

I would also note that percentage difference here is almost certainly the wrong metric for measuring the difference between Brier scores.

comment by isaduan · 2021-10-09T17:53:56.195Z · LW(p) · GW(p)

Thanks for checking! I think our main difference is that you use data from Metaculus prediction whereas I used Metaculus postdiction, which "uses data from all other questions to calibrate its result, even questions that resolved later." Right now, this gives Metaculus an average log score of 0.519 vs. the community's 0.419 (total questions: 885) for binary questions, 2.43 vs. 2.25 for 537 continuous questions, evaluated at resolve time.