Conversation on forecasting with Vaniver and Ozzie Gooen

post by Bird Concept (jacobjacob), Vaniver, ozziegooen · 2019-07-30T11:16:58.633Z

  Introduction
  Is infrastructure really what forecasting needs? (And clarifying the term “forecasting”)
  Fragility of value and difficulty of capturing important uncertainties in forecasts
  Vaniver’s conceptual model of why forecasting works
  Background on prediction markets and the Good Judgement Project
  Positive cultural externalities of forecasting AI
  Importance of software engineering vs. other kinds of infrastructure 
  Privacy
  Orgs using internal prediction tools, and the action-guidingness of quantitative forecasts
  Vaniver’s steelman of Ozzie
  How to explore the forecasting space
  Importance and neglectedness of forecasting work
  Tractability of forecasting work
  Technical tooling for Effective Altruism
    5-MIN BREAK
  Tractability of forecasting within vs outside EA
  Medium-term goals and lean startup methodology
  Limitations of current forecasting tooling
  Knowledge graphs and moving beyond questions-as-strings 
  Summary of cruxes 
  Ozzie’s conceptual uncertainties

[Cross-posted to the EA Forum]

This is a transcript of a conversation on forecasting between Vaniver and Ozzie Gooen, with an anonymous facilitator (inspired by the double crux technique). The recording was transcribed using a professional service.

We decided to record, transcribe, edit and post it as:

Despite an increase in interest and funding for forecasting work in recent years, there seems to be a disconnect between the mental models of the people working on it and the people who aren’t. I want to move the community’s frontier of insight closer to that of the forecasting subcommunity.
I think this is true for many more topics than forecasting. It’s incredibly difficult to be exposed to the frontier of insight unless you happen to be in the right conversations, for no better reason than that people are busy, preparing transcripts takes time and effort, and there are no standards and unclear expected rewards for doing so. This is an inefficiency in the economic sense. So it seems good to experiment with ways of alleviating it.
This was a high-effort activity where two people dedicated several hours to collaborative, truth-seeking dialogue. Such conversations usually look quite different from comment sections (even good ones!) or most ordinary conversations. Yet there are very few records of actual, mind-changing conversations online, despite their importance in the rationality community.
Posting things publicly online increases the surface area of ideas to the people who might use them, and can have very positive, hard-to-predict effects.

Introduction

Facilitator: One way to start would be to get a bit of both of your senses of the importance of forecasting, maybe Ozzie starting first. Why are you excited about it and what caused you to get involved?

Ozzie: Actually, would it be possible that you start first? Because there are just so many ...

Vaniver: Yeah. My sense is that predicting the future is great. Forecasting is one way to do this. The question for “will this connect to things being better” is the difficult part. In particular, Ozzie had this picture before of, on the one hand, data science-y repeated things that happen a lot, and then on the other hand judgement-style forecasting, a one-off thing where people are relying on whatever models because they can't do the “predict the weather”-style things.

Vaniver: My sense is that most of the things that we care about are going to be closer to the right hand side and also most of the things that we can do now to try and build out forecasting infrastructures aren't addressing the core limitations in getting to these places.

Is infrastructure really what forecasting needs? (And clarifying the term “forecasting”)

Vaniver: My main example here is something like prediction markets are pretty easy to run but they aren't being adopted in many of the places that we'd like to have them for reasons that are not ... “we didn't get a software engineer to build them.” That feels like my core reason to be pessimistic about forecasting as intellectual infrastructure.

Ozzie: Yeah. I wanted to ask you about this. Forecasting is such a big type of thing. For one thing, we have maybe five to ten people doing timelines, direct forecasting, at OpenAI, OpenPhil and AI Impacts. My impression is that you're not talking about that kind of forecasting. You're talking about infrastructural forecasting where we have a formal platform and people making formalised things.

Vaniver: Yeah. When I think about infrastructure, I'm thinking about building tooling for people to do work in a shared space as opposed to individual people doing individual work. If we think about dentistry or something, like what dentists' infrastructure would look like is very different from people actually modifying mouths. It feels to me like OpenAI and similar people are doing more of the direct style work than infrastructure.

Ozzie: Yeah okay. Another question I have is something like a lot of trend extrapolation stuff, e.g. for particular organizations, “how much money do you think they will have in the future?” Or for LessWrong, “how many posts are going to be there in the future?” and things like that. There's a lot of that happening. Would you call that formal forecasting? Or would you say that's not really tied to existing infrastructure and they don't really need infrastructure support?

Vaniver: That's interesting. I noticed earlier I hadn't been including Guesstimate or similar things in this category because that felt to me more like model building tools or something. What do I think now ...

Vaniver: I'm thinking about two different things. One of them is the “does my view change if I count model building tooling as part of this category, or does that seem like an unnatural categorization?” The other thing that I'm thinking about is if we have stuff like the LessWrong team trying to forecast how many posts there will be… If we built tools to make that more effective, does that make good things happen?

Vaniver: I think on that second question the answer is mostly no because it's not clear that it gets them better counterfactual analysis or means they work on better projects or something. It feels closer to ... The thing that feels like it's missing there is something like how them being able to forecast how many posts there will be on LessWrong connects to whether LessWrong is any good.

Fragility of value and difficulty of capturing important uncertainties in forecasts

Vaniver: There was this big discussion that happened recently about what metric the team should be trying to optimize for the quarter. My impression is this operationalization step connected people pretty deeply to the fact that the things that we care about are actually just extremely hard to put numbers on. This difficulty will also be there for any forecasts we might make.

Ozzie: Do you think that there could be value in people in the EA community figuring out how to put numbers on such things? For instance, like groups evaluate these things in the future in formal ways. Maybe not for LessWrong but for other kinds of projects.

Vaniver: Yeah. Here I'm noticing this old LessWrong post… Actually I don't know if this was one specific post, but this claim of the “fragility of value” where it's like “oh yeah, in fact the thing that you care about is this giant mess. If you drill it down to one consideration, you probably screwed it up somehow”. But it feels like even though I don't expect you to drill it down to one consideration, I do think having 12 is an improvement over having 50. That would be evidence of moral progress.

Ozzie: That’s interesting. Even so, the agenda that I've been talking about is quite broad. It's very much a lot of interesting things. A combination of forecasting and better evaluations. For forecasting itself, there are a lot of different ways to do it. That does probably mean that there is more work for us to do back and forth with specific types and their likelihood, which make this a bit challenging. It'll give you a wide conversation.

Ben G: Is it worth going over the double cruxing steps, the general format? I'm sorry. I'm not the facilitator.

Vaniver: Yeah. What does our facilitator think?

Facilitator: I think you're doing pretty good and exploring each other's stuff. Pretty cool... I'm also sharing a sense that forecasting has been replaced with a vague “technology” or something.

Ozzie: I think in a more ideal world we'd have something like a list of every single application and for each one say what are the likelihoods that I think it's going to be interesting, what you think is going to be interesting, etc.

Ozzie: We don't have a super great list like that right now.

Vaniver: I'm tickled because this feels like a very forecasting way to approach the thing where it's like “we have all these questions, let's put numbers on all of them”.

Ozzie: Yeah of course. What I'd like to see, what I'm going for, is a way that you could formally ask forecasters these things.

Vaniver: Yeah.

Ozzie: That is a long shot. I'd say that's more on the experimental side. But if you could get that to work, that'd be amazing. More likely, that is something that is kind of infrequent.

Vaniver’s conceptual model of why forecasting works

Vaniver: When I think about these sorts of things, I try to have some sort of conceptual model of what's doing the work. It seems to me the story behind forecasting is there's a lot of, I'm going to say, intelligence for hire out there and that the thing that we need to build is this marketplace that connects the intelligence for hire and the people who need cognitive work done. The easiest sorts of work for us to use it for are these predictions about the future because it's easy to verify later and ...

Vaniver: I mean the credit allocation problem is easy because everyone who moved the prediction in a good direction gets money and everyone who moved it in the wrong direction loses money. Whereas if we're trying to develop a cancer drug and we do scientific prizes, it may be very difficult to do the credit allocation for “here's a billion dollars for this drug”. Now all the scientists who made some sort of progress along the way figure out who gets what of that money.
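
[Editor’s note: Vaniver does not name a specific mechanism here. As a minimal sketch of how this kind of credit allocation can work, the code below assumes a logarithmic market scoring rule, where each trader is paid according to how far they moved the market probability toward the eventual outcome. The trader names, probabilities, and function names are all hypothetical.]

```python
import math

def log_score_payoff(p_before: float, p_after: float, outcome: bool) -> float:
    """Payoff for moving the market probability from p_before to p_after.

    Positive if the move was toward what actually happened, negative otherwise.
    """
    if outcome:
        return math.log(p_after / p_before)
    return math.log((1 - p_after) / (1 - p_before))

def settle(history, outcome):
    """Credit each trader for their own moves once the question resolves.

    history: list of (trader, probability_after_their_update), starting
    with the opening price.
    """
    payoffs = {}
    for (_, p_prev), (trader, p_new) in zip(history, history[1:]):
        payoffs[trader] = payoffs.get(trader, 0.0) + log_score_payoff(p_prev, p_new, outcome)
    return payoffs

# Hypothetical sequence of market probabilities after each update.
history = [("open", 0.50), ("alice", 0.70), ("bob", 0.60), ("carol", 0.90)]
payoffs = settle(history, outcome=True)

for trader, amount in payoffs.items():
    print(f"{trader}: {amount:+.3f}")
```

If the event happens, the two traders who raised the probability gain and the one who lowered it loses. The payoffs sum to the score of the final price relative to the opening price, so no separate judgement about who deserves what is needed.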

Vaniver: I'm curious how that connects with your conception of the thing. Does that seem basically right or you're like there's this part that you're missing or you would characterize differently or something?

Ozzie: Different aspects about it. One is I think that's one of the possible benefits. Hypothetically, it may be one of the main benefits. But even if it's not an actual benefit, even if it doesn't come out to be true, I think that there are other ways that this type of stuff would be quite useful.

Background on prediction markets and the Good Judgement Project

Ozzie: Also to stand back a little bit, I'm not that excited about prediction markets in a formal way. My impression is that A) they're not very legal in the US, and B), it's very hard to incentivize people to forecast the right questions. Then C), there are issues around a lot of these forecasting systems you have people that want private information and stuff. There's a lot of nasty things with those kinds of systems. They could be used for some portion of this.

Ozzie: The primary area that I'm more interested in is forecasting applications similar to Metaculus and PredictionBook and one that I'm working on right now. More, they're working differently. Basically, people build up good reputations by having good track records. Then there's basically a variety of ways to pay people. The Good Judgement Project does it by basically paying people a stipend. There are around 125 super forecasters who work on specific questions for specific companies. I think you pay like $100,000 to get a group of them.

Ozzie: Just a quick question, are you guys familiar with how they do things in specific? Not many people are.

Ozzie: Maybe one of the most interesting examples of paid forecasters which was similar to this. For them, they basically have the GJP Open where they find the really good forecasters. Then those become the super forecasters. There's about 200 of these; 125 are the ones that they're charging other companies for.

Vaniver: Can you paint me more of a picture of who is buying the forecasting service and what they're doing it for?

Ozzie: Yeah. For one thing, I'll say that this area is pretty new. This is still on the cutting edge and small. OpenPhil bought some of their questions ... I think they basically bought one batch. The questions I know about them asking were things like “what are the chances of nuclear war between the US and Russia?” “What are the chances of nuclear war between different countries?” where one of the main ones was Pakistan and India. Also specific questions about outcomes of interventions that they were sponsoring. OpenPhil already internally does forecasting on most of its grant applications. When a grant is made internally they would have forecasts about how well it's going to do and they track that. That is a type of forecasting.

Ozzie: The other groups that use them are often businesses. There are two buckets in how that's useful. One of them is to drive actual answers. A second one is to get the reasoning behind those answers. A lot of times what happens -- although it may be less useful for EAs -- is that these are companies that maybe do not have optimal epistemologies, but instead have systematic biases. They basically purchase this team of people who do provably well at some of these types of questions. Those people would have discussions about their kinds of reasoning. Then they find their reasoning interesting.

Vaniver: Yeah. Should I be imagining an oil company deciding whether to build a bunch of wells in Ghana and has decided that they just want to outsource the question of what's the political environment in Ghana going to be for the next 10 years?

Ozzie: That may be a good interpretation. Or there'd be the specific question of what's the possibility that there'll be a violent outbreak.

Vaniver: Yeah. This is distinct from Coca Cola trying to figure out which of their new ad campaigns would work best.

Ozzie: This is typically different. They've been focused on political outcomes mostly. That comes in assuming that they were working with businesses. A lot of GJP stuff is covered by NDA so we can't actually talk about it. We don't have that much information.

Ozzie: My impression is that some groups have found it useful and a lot of businesses don't know what to do with those numbers. They get a number like 87% and they don't have ways to directly make that interact with the rest of their system.

Ozzie: That said, there are a lot of nice things about that hypothetically. Of course some of it does come down to the users. A lot of businesses do have pretty large biases. That is a known thing. It's hard to know if you have a bias or not. Having a team of people who has a track record of accuracy is quite nice if you want to get a third party check. Of course another thing for them is that it is just another way to outsource intellectual effort.

Positive cultural externalities of forecasting AI

Facilitator: Vaniver, is this changing your mind on anything essentially important?

Vaniver: The thing that I'm circling around now is a question closer to “in what contexts does this definitely work?” and then trying to build out from that to “in what ways would I expect it to work in the future?”. For example here, Ozzie didn't mention this, but a similar thing that you might do is have pundits just track their predictions or somehow encourage them to make predictions that then feed into some reputation score where it may matter in the future. The people who consistently get economic forecasts right actually get more mindshare or whatever. There's versions of this that rely on the users caring about it and then there are other versions that rely less on this.

Vaniver: The AI related thing that might seem interesting is something like 2.5 years ago Eliezer asked this question at the Asilomar AI conference which was “What's the least impressive thing that you're sure won't happen in two years?” Somebody came back with the response of “We're not going to hit 90% on the Winograd Schema.” [Editor’s note: the speaker was Oren Etzioni] This is relevant because a month ago somebody hit 90% on the Winograd Schema. This turned out to have been 2.5 years after the thing. This person did successfully predict the thing that would happen right after the deadline.

Vaniver: I think many people in the AI space would like there to be this sort of sense of “people are actually trying to forecast near progress”. Or sorry. Maybe I should say medium term progress. Predicting a few years of progress is actually hard. But it's categorically different from three months. You can imagine something where people who are building up the infrastructure to be good at this sort of forecasting does actually make the discourse healthier in various ways and gives us better predictions of the future.

Importance of software engineering vs. other kinds of infrastructure

Vaniver: Also I'm having some question of how much of this is infrastructure and how much of this is other things. For example when we look at the Good Judgement Project I feel like the software engineering is a pretty small part of what they did as compared to the selection effects. It may still be the sort of thing where we're talking about infrastructure, though we're not talking about software engineering.

Vaniver: The fact that they ran this tournament at all is the infrastructure, not the code underneath the tournaments. Similarly, even if we think about a Good Judgment Project for research forecasting in general, this might be the sort of cool thing that we could do. I'm curious how that landed for you.

Ozzie: There's a lot of stuff in there. One thing is that on the question of “can we just ask pundits or experts”, I think my prior is that that would be a difficult thing, specifically in that in “Expert Political Judgment” Tetlock tried to get a lot of pundits to make falsifiable predictions and none of them wanted to ...

Vaniver: Oh yeah. It's bad for them.

Facilitator: Sorry. Can you tell me what you thought were the main points of what Vaniver was just saying then?

Ozzie: Totally. Some of them ...

Facilitator: Yeah. I had a sense you might go "I have a point about everything he might have said so I'll say all of them" as opposed to the key ones.

Ozzie: I also have to figure out what he said in that last bit as opposed to the previous bit. It's one of them. There's a question. Most recent when it comes to the Good Judgment Project, how much of it was technology versus other things that we did?

Ozzie: I have an impression that you're focused on the AI space. You do talk about the AI space a lot. It's funny because I think we're both talking a bit on points that help the other side, which is kind of nice. You mentioned one piece where prediction was useful in the AI space. My impression is that you're skeptical about whether we could get a lot more wins like that, especially if we tried to do it with a more systematic effort.

Vaniver: I think I actually might be excited about that instead of skeptical. We run into similar problems as we did with getting pundits to predict things. However, the things that're going on with professors and graduates and research scientists is very different from the thing that's going on with pundits and newspaper editors and newspaper readers.

Vaniver: Also it ties into the ongoing question of “is science real?” that the psychology replication stuff is connected to. Many people in computer science research in particular are worried about bits of how machine learning research is too close to engineering or too finicky in various ways. So I could imagine a "Hey, will this paper replicate?"-market catching on in computer science. I imagine getting from that to a “What State-of-the-Arts will fall when?”-thing. That also seems quite plausible that we could make that happen.

Ozzie: I have a few points now that connect to that. On pundits and experts, I think we probably agree that pundits often can be bad. Also experts often are pretty bad at forecasting it seems. That's something that's repeatable.

Ozzie: For instance in the AI expert surveys, a lot of the distributions don't really make sense with each other. But the people who do seem to be pretty good are the specific class of forecasters, specifically ones that we have evidence for, that's really nice. We only have so many of them right now but it is possible that we can get more of them.

Ozzie: It would be nice for more pundits to be more vocal about this stuff. I think Kelsey at Vox with their Future Perfect group is talking about making predictions. They've done some. I don't know how much we'll end up doing.

Privacy

Ozzie: When it comes to the AI space, there are questions about “what would interesting projects look like right now?” I've actually been dancing around AI in part because I could imagine a bad world or possibly a bad world where we really help make it obvious what research directions are exciting and then we help speed up AI progress by five years and that could be quite bad. Though, managing to do that in an interesting way could be important.

Ozzie: There are other questions about privacy. There's the question of “is this interesting?”, and the question of "conditional on it being kind of interesting, should we be private about it?" We're right now playing for that first question.

Orgs using internal prediction tools, and the action-guidingness of quantitative forecasts

Ozzie: Some other things I'd like to bring into this discussion is that a lot of it right now is already being systemized. They say when you are an entrepreneur or something and try to build a tool it's nice to find that there are already internal tools. A lot of these groups are making internal systematic predictions at this point. They're just not doing it using very formal methods.

Ozzie: Some examples: OpenPhil formally specifies a few predictions for grants. OpenAI also has a setup for internal forecasting. These are people at OpenAI who are ML experts basically. That's a decent sized thing.

Ozzie: There are several other organizations that are using internal forecasting for calibration. It's just a fun game that forces them to get a sense of what calibration is like. Then for that there are questions of “How useful is calibration?”, “Does it give you better calibration over time?”
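
[Editor’s note: for readers unfamiliar with calibration, the sketch below shows one common way to check it. Forecasts are grouped by stated probability, and each group's average stated probability is compared with how often those events actually happened. A Brier score is included as a single summary number. The example data and function names are made up for illustration and do not describe any organization's actual setup.]

```python
from collections import defaultdict

def calibration_table(forecasts, n_bins=10):
    """Group (probability, outcome) pairs into equal-width probability bins.

    Returns {bin_index: (mean_stated_probability, observed_frequency, count)}.
    A well-calibrated forecaster has the first two numbers close in every bin.
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, outcome))
    table = {}
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(1 for _, happened in items if happened) / len(items)
        table[idx] = (mean_p, freq, len(items))
    return table

def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes (lower is better)."""
    forecasts = list(forecasts)
    return sum((p - (1.0 if happened else 0.0)) ** 2 for p, happened in forecasts) / len(forecasts)

# Made-up track record: (stated probability, whether the event happened).
example = [(0.9, True), (0.9, True), (0.9, False),
           (0.2, False), (0.2, False), (0.2, True)]

for idx, (mean_p, freq, count) in calibration_table(example, n_bins=5).items():
    print(f"bin {idx}: said {mean_p:.2f}, happened {freq:.2f} of the time (n={count})")
print(f"Brier score: {brier_score(example):.3f}")
```

Tracking this over time is one way to approach the question of whether the calibration game actually improves calibration.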

Ozzie: Right now none of them seem to be using PredictionBook. We could also talk a bit about ... I think that thing is nice and shows a bit of promise. It may be that there are some decent wins to be done by making better tools for those people which right now aren’t using any specific tools because they looked at them and found them to be inadequate. It's also possible that even if they did use those tools it'd be a small win and not a huge win. That's one area where there could be some nice value. But it's not super exciting so I don't know if you want to push back against that and say "there'll be no value in that."

Vaniver: There I'm sort of confused. What are the advantages to making software as a startup where you make companies' internal prediction tools better? This feels similar to Atlassian or something where it's like "yeah, we made their internal bug reporting or other things better". It's like yeah, sure, I can see how this is valuable. I can see how I’d make them pay for it. But I don't see how this is ...

Vaniver: ...a leap towards the utopian goals if we take something like Futarchy or ... in your initial talk you painted some pictures of this is how in the future if you had much more intelligence or much more sophisticated systems you could do lots of cool things. [Editor’s note: see Ozzie’s sequence “Prediction-Driven Collaborative Reasoning Systems” for background on this] The software as a service vision doesn't seem like it gets us all that much closer and also feels like it's not pushing at the hardest bit which is something like “getting companies to adopt it”-thing. Or maybe what I think there is something like that the organizations themselves have to be structured very differently. It feels like there's some social tech.

Ozzie: When you say very differently, do you mean very differently? Right now they're already doing some predictions. Do you mean very differently for like predictions would be a very important aspect of the company? Because right now it is kind of small.

Vaniver: My impression is something like going back to your point earlier about looking back at answers like 87% and they won't really know what to do with it. Similarly, I was in a conversation with Oli earlier about whether or not organizations had beliefs or world models. There's some extent to which the organization has a world model that doesn't live in a person's head. It's going to be something like its beliefs are these forecasts on all these different questions and also the actions that the organization takes is just driven by those forecasts without having a human in the loop, where it feels to me right now often the thing that will happen is some executive will be unsure about a decision. Maybe they'll go out to the forecasters. The forecasters will come back with 87%. Now the executive is still making the decision using their own mind. Whether or not that “87%” lands as “the actual real number 0.87” or something else is unclear, or not sensibly checked, or something. Does that make sense?

Ozzie: Yeah. Everything's there. Let's say that ... 87% example is something that A) comes up if you're a bit naïve about what you want and B), comes up depending on how systematic your organization is with using numbers for things. If you happen to have a model of what the 87% is, that could be quite valuable. We see different organizations are on different parts of the spectrum. Probably the one that's most intense about this is GiveWell. GiveWell has their multiple gigantic sheets of lots of forecasts essentially. It's possible that it'll be hard to make tooling that'll be super useful to them. I've been talking with them. There's experiments to be tried there. They're definitely in the case that as specific things change they may change decisions and they'll definitely change recommendations.

Ozzie: Basically they have this huge model where people estimate a bunch of parameters about moral decision making and a lot of other parameters about how well the different interventions are going to do. Out of all of that comes recommendations for what the highest expected values are.
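
[Editor’s note: the sketch below is a toy illustration of the general shape Ozzie describes, where moral-weight parameters and intervention-specific estimates are combined into an expected value per dollar and then ranked. It is not GiveWell's model. Every name and number is hypothetical and chosen only to show how changing one parameter can change the recommendation.]

```python
def expected_value_per_dollar(spec, weights):
    """Probability-weighted value of an intervention's outcomes, per dollar spent."""
    value_if_success = sum(weights[kind] * amount for kind, amount in spec["outcomes"].items())
    return spec["p_success"] * value_if_success / spec["cost"]

def rank(interventions, weights):
    """Return (name, score) pairs sorted from highest to lowest expected value."""
    scores = {name: expected_value_per_dollar(spec, weights)
              for name, spec in interventions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical moral weights: how much one unit of each outcome is worth.
moral_weights = {"life_saved": 100.0, "income_doubling_year": 1.0}

# Hypothetical interventions with forecasted success chances and outcomes.
interventions = {
    "intervention_a": {"cost": 1000.0, "p_success": 0.8,
                       "outcomes": {"life_saved": 0.2}},
    "intervention_b": {"cost": 1000.0, "p_success": 0.5,
                       "outcomes": {"income_doubling_year": 30.0}},
}

print(rank(interventions, moral_weights))

# Halving one moral weight flips which intervention comes out on top.
revised_weights = dict(moral_weights, life_saved=50.0)
print(rank(interventions, revised_weights))
```

In a model like this a forecast is not a number an executive has to interpret. It is an input, and a changed input mechanically produces a changed ranking.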

Ozzie: That said, they are also in the domain that's probably the most certain of all the EA groups in some ways. They're able to do that more. I think OpenAI is probably a little bit... I haven't seen their internal models but my guess is that they do care a lot about the specifics of the numbers and also are more reasonable about what to do with them.

Ozzie: I think the 87% example is a case of most CEOs don't seem to know what a probability distribution is but I think the EA groups are quite a bit better.

Vaniver: When I think about civilization as a whole, there’s a disconnect between groups that think numbers are real and groups that don't think numbers are real. There's some amount of “ah, if we want our society to be based on ‘numbers are real’, somehow we need the numbers-are-real-orgs to eat everyone else. Or successfully infect everyone else.”

Vaniver’s steelman of Ozzie

Vaniver: What's up?

Facilitator: Vaniver, given what you can see from all the things you discussed and touched on in the forecasting space, I wonder if you had some sense of the thing Ozzie is working on. If you imagine yourself actually being Ozzie and doing the things that he's doing, I'm curious what are the main things that feel like you don't actually buy about what he's doing.

Vaniver: Yeah. One of the things ... maybe this is fair. Maybe this isn't. I've rounded it up to something like personality difference where I'm imagining someone who is excited about thinking about this sort of tool and so ends up with “here's this wide range of possibilities and it was fun to think about all of them, but of the wide range, here's the few that I think are actually good”.

Vaniver: When I imagine dropping myself into your shoes, there's much more of the ... for me, the “actually good” is the bit that's interesting (though I want to consider much of the possibility space for due diligence). I don't know if that's actually true. Maybe you're like, "No. I hated this thing but I came into it because it felt like the value is here."

Ozzie: I'm not certain. You're saying I wasn't focused on ... this was a creative ... it was enjoyable to do and then I was trying to rationalize it?

Vaniver: Not necessarily rationalize but I think closer to the exploration step was fun and creative. Then the exploitation step of now we're actually going to build a project for these two things was guided by the question of which of these will be useful or not useful.

How to explore the forecasting space

Vaniver: When I imagine trying to do that thing, my exploration step looks very different. But this seems connected to this because there's still some amount of everyone having different exploration steps that are driven by their interests. Then also you should expect many people to not have many well-developed possibilities outside of their interests.

Vaniver: This may end up being good to the extent that people do specialize in various ways. If we just randomly reassigned jobs to everyone, productivity would go way down. But this thing where the interests matter, where you should actually only explore things that you find interesting, makes sense. There's a different thing where I don't think I see the details of Ozzie's strategic map for something in the sense of “Here's the long term north star type things that are guiding us.” The one bit that I've seen that was medium term was the “yep, we could do the AI part testing stuff but it is actually unclear whether this is speeding up capabilities more than it's useful”. How many years is a “fire alarm for general intelligence” worth? [Editor’s note: Vaniver is referring to this post by Eliezer Yudkowsky] Maybe the answer to that is “0” because we won't do anything useful with the fire alarm even if we had it.

Facilitator: To make sure I followed, the first step was: you have a sense of Ozzie exploring a lot of the space initially and now it's exploiting some of the things you think may be more useful. But you wouldn't have explored it that way yourself potentially because you wouldn't really have felt differently that there would have been something especially useful to find if you continued exploring?

Facilitator: Secondly, you're also not yet sufficiently sold on the actual medium term things to think that the exploiting strategies are worth taking?

Vaniver: “Not yet sold” feels too strong. I think it's more that I don't see it. Not being sold implies something like ... I would normally say I'm not sold on x when I can see it but I don't see the justification for it yet where here I don't actually have a crisp picture of what seven year success looks like.

Facilitator: Ozzie which one of those feels more like "Argh, I just want to tell Vaniver what I'm thinking now"?

Ozzie: So on exploration and exploitation. On the one hand, not that much time or resource is going into this yet. Maybe a few full-time months like to think about it and then several for making webapps. Maybe that was too much. I think it wasn't.

Ozzie: The amount of variety of types of proposals that are on the table right now compared to when I started I'm pretty happy with for like a few months of thinking. Especially since for me to get involved in AI would have taken quite a bit more time of education and stuff. It did seem like a few cheap wins at this point. I still kind of feel like that.

Importance and neglectedness of forecasting work

Ozzie: I also do get the sense that this area is still pretty neglected.

Vaniver: Yeah. I guess in my mind "neglected" means both that people aren't working on it and that people should be working on it. Is that true for you also?

Ozzie: There are three aspects. Importance, tractable, and neglected. It could be neglected but not important. I'm just saying here that it's neglected.

Vaniver: Okay. You are just saying that people aren't working on it.

Ozzie: Yeah. You can talk about then the questions of importance and tractability.

Facilitator: I feel like there are a lot of things that one can do. One can, like, try to start a group house in Cambridge, one can try and teach rationality at the FHI. Forecasting ... something about "neglected" doesn't feel like it quite gets at the thing because the space is sufficiently vast.

Ozzie: Yeah. The next part would be importance. I obviously think that it's higher in importance than a lot of the other things that seem similarly neglected. Let's say basically the ratio of importance in importance, neglected and tractable was pretty good for forecasting. I'm happy to spend a while getting into that.

Tractability of forecasting work

Vaniver: I guess I actually don't care all that much about the importance because I buy if we could ... in my earlier framing, we move everyone to a "numbers-are-real" organization. That would be excellent. The thing that I feel most doomy about is something like the tractability where it feels like most of the wins that people were trying to get before turned out to be extremely difficult and not really worth it. I'm interested in seeing the avenues that you think are promising in this regard.

Ozzie: Yeah. It's an interesting question. I think a lot of people have the notion that we've had tons and tons of attempts at forecasting systems since Robin Hanson started talking about prediction markets. All of those have failed, therefore prediction markets have failed, and it's not worth spending another person on it; it's like a heap of dead bodies.

Ozzie: From the viewpoint that I have, it definitely doesn't look that way. For one thing, the tooling. If you actually look at a lot of the tooling that's been done, a lot of it is still pretty basic. One piece of evidence for that is the fact that almost no EA organizations are using it themselves.

Ozzie: That could also be that it's really hard to make good tooling. If you look at it, basically if you look at non-prediction market systems, in terms of prediction markets there were also a few attempts. But the area is kind of illegal. Like I said, there are issues with prediction markets.

Ozzie: If you look at non prediction market tournament applications. Basically you have a few. The GJP doesn't make their own. They've used Cultivate Labs. Now they're starting to try and make their own systems as well. But the GJP people are mostly political scientists and stuff, not developers.

Ozzie: A lot of experiments they've done are political. It's not like engineering questions about how there'd be an awesome engineering infrastructure. My take on that is if you put some really smart engineer/entrepreneur in that type of area, I'd expect them to generally have a very different approach.

Vaniver: There's a saying from Nintendo: "if your game is not fun with programmer art, it won't be fun in the final product" or something. Similarly, I can buy that there's some minimum level of tooling that we need for these sorts of forecasts to be sensible at all. But it feels to me that if I expected forecasting to be easy in the relevant ways, the shitty early versions would have succeeded without us having to build later good versions.

Ozzie: There's a question of what "enough" is. They definitely have succeeded to some extent. PredictionBook has been used by Gwern and a lot of other people. Some also use their own setups and Metaculus and stuff... So you can actually see a decent amount of activity. I don't see many other areas that have nearly that level of experimentation. There are very few other areas that are being used to the extent that predictions are used that we could imagine as future EA web apps.

Vaniver: The claim that I'm hearing there is something like “I should be comparing PredictionBook and Metaculus and similar things to reciprocity.io or something, as this is just a web app made in their spare time and if it actually sees use that's relevant”.

Ozzie: I think that there's a lot of truth to that, though it may not exactly be the case. Maybe we're a bit past reciprocity.

Vaniver: Beeminder also feels like it's in this camp to me, although less EA-specific.

Ozzie: Yeah. Or like Anki.

Ozzie: Right.

Technical tooling for Effective Altruism

Ozzie: There's one question which is A), do we think that there's room for technical tooling around Effective Altruism? B) if there is, what are the areas that seems exciting? I don't see many other exciting areas. Of course, that is another question. If you think ... that's not exactly depending forecasting... but more like, if you don't like forecasting, what do you like? Because there's a conclusion that we just don't like EA tools and there's almost nothing in the space. Because there's not much more that seems obviously more exciting. But there's a very different side to the argument.

Vaniver: Yeah. It's interesting because on the one hand I do buy the frame of it might make sense to just try to make EA tools and then to figure out what the most promising EA tool is. Then also I can see the thing going in the reverse direction which is something like if none of the opportunities for EA tools are good then people shouldn't try it. Also if we do in fact come up with 12 great opportunities for EA tools this should be a wave of EA grants or whatever.

Vaniver: I would be excited about something double crux-shaped. But I worry this runs into the problem that argument mapping and mind mapping have all run into before. There's something that's nice about doing a double crux which makes it grounded out in the trace that one particular conversation takes as opposed to actually trying to represent minds. I feel like most of the other EA tools would be ... in my head it starts as silly one-offs. I'm thinking of things like for the 2016 election there was a vote-swapping thing to try to get third party voters in swing states to vote for whatever party in exchange for third party votes in safe states. I think Scott Aaronson promoted it but I don't think he made it. But it feels to me like that sort of thing. We may end up seeing lots of things like that where it's like “if we had software engineers ready to go, we would make these projects happen”. Currently I expect it's sufficient that people do that just for the glory of having done it. But the Beeminder style things are more like, “oh yeah, actually this is the sort of thing where if it's providing value then we should have people working for it and the people will be paid by the value they're providing”. Though that move is a bit weird because that doesn't quite capture how LessWrong is being paid for...

Ozzie: Yeah. Multiple questions on that. This could be a long winding conversation. One would be “should things like this be funded by the users or by other groups?”

Ozzie: One thing I'd say that ... I joined 80000 Hours about four years ago. I worked with them to help them with their application and decided at that point that it should be much less of an application and more of like a blog. I helped them scale it down.

Ozzie: I was looking for other opportunities to make big EA apps. At that point there was not much money. I kind of took a detour and I'm coming back to it in some ways. In a way I've experienced this with Guesstimate, which has been used a bit. Apps from Effective Altruism have advantages and disadvantages. One disadvantage is that writing software is an expensive thing. An advantage is that it's very tractable. By tractable I mean you could say “if I spent $200,000 and three engineer years I could expect to get this thing out”. Right now we are in a situation where we do have hypothetically a decent amount of money if it could beat a specific bar. The programmers don't even have to be these intense EAs (although it is definitely helpful).

5-MIN BREAK

Tractability of forecasting within vs outside EA

Ozzie: I feel like we both kind of agree, that, hypothetically, if a forecasting system was used and people decided it was quite useful, and we could get to the point that EA orgs were making decisions in big ways with it, that could be a nice thing to have. But there’s disagreement about whether that’s an existing possibility, and whether existing evidence shows us that won’t happen.

Vaniver: I’m now also more excited about the prospects of this for the EA space. Where I imagine a software engineer coming out of college saying “My startup idea is prediction markets”, and my response is “let’s do some market research!” But in the EA space the market research is quite different, because people are more interested in using the thing, and there’s more money for crazy long-shots… or not crazy long-shots, but rather, “if we can make this handful of people slightly more effective, there are many dollars on the line”.

Ozzie: Yeah.

Vaniver: It’s similar to a case where you have this obscure tool for Wall Street traders, and even if you only sell to one firm you may just pay for yourself.

Ozzie: I’m skeptical whenever I hear an entrepreneur saying “I’m doing a prediction market thing”. It’s usually crypto related. Interestingly most prediction platforms don’t predict their own success, and that kind of tells you something…

(Audience laughter)

Vaniver: Well this is just like the prediction market on “will the universe still exist”. It turns out it’s just asymmetric who gets paid out.

Medium-term goals and lean startup methodology

Facilitator: Vaniver, your earlier impression was you didn’t have a sense what medium term progress would look like?

Vaniver: It’s important to flag that I changed my mind. When I think about forecasting as a service for the EA space, I’m now more optimistic, compared to when I think of it as a service on the general market. It’s not surprising OpenPhil bought a bunch of Good Judgement forecasters. Whereas it would be a surprise if Exxon bought GJP questions.

Vaniver: Ozzie do you have detailed visions of what success looks like in several years?

Ozzie: I have multiple options. The way I see is that… when lots of YC startups come out they have a sense that “this is an area that seems kind of exciting”. We kind of have evidence that it may be interesting, and also that it may not be interesting. We don’t know what success looks like for an organisation in this space, though hopefully we’re competent and we could work quickly to figure it out. And it seems things are exciting enough for it to be worth that effort.

Ozzie: So AirBnB and the vast majority of companies didn’t have a super clear idea of how they were going to be useful when they started. But they do have good inputs, and a vague sense of what kind of cool outputs would be.

Ozzie: There’s evidence that statistically this seems to be what works in startup land.

Ozzie: Some of the evidence against. There was a question of “if you have a few small things that are working but are not super exciting, does that make it pretty unlikely you’ll see something in this space?”

Ozzie: It would be hard to make a strong argument that YC wouldn’t find any companies in such cases. They do fund things without any evidence of success.

Vaniver: But also if you’re looking for moonshots, mild success the first few times is evidence against “the first time it just works and everything goes great”.

Limitations of current forecasting tooling

Ozzie: Of course in that case your question is exactly what it is that’s been tried. I think there are arguments that there are more exciting things on the horizon which haven’t been tried.

Ozzie: Now we have PredictionBook, Metaculus, and hypothetically Cultivate Labs and another similar site. Cultivate Labs does enterprise gigs, and are used by big companies like Exxon for ideation and similar things. They’re a YC company and have around 6 people. But they haven’t done amazingly well. They’re pretty expensive to use. At this point you’d have to spend around $400 for one instance per month. And even then you get a specific enterprise-y app that’s kind of messy.

Ozzie: Then if you actually look at the amount of work done on PredictionBook and Metaculus, it’s not that much. PredictionBook might have had 1-2 years of engineering effort, around 7 years ago. People think it’s cool, but not a serious site really. As for Metaculus, I have a lot of respect for their team. That project was probably around 3-5 engineering years.

Ozzie: They have a specific set of assumptions I kind of disagree with. For example, everyone has to post their questions in one main thread, and separate communities only exist by having subdomains. They’re mostly excited about setting up those subdomains for big projects.

Ozzie: So if a few of us wanted to experiment with “oh, let’s make a small community, have some privacy, and start messing around with questions” it’s hard to do that...

Vaniver: So what would this be for? Who wants their own instances? MMO guilds?

Audience: Here’s one example of the simplest thing you currently cannot do. (Or could not do around January 1st 2019.) Four guys are hanging out, and they wonder “When will people next climb Mount Everest?” They then just want to note down their distributions for this and get some feedback, without having to specify everything in a Google doc or a spreadsheet which doesn’t have distributions.

Facilitator: Which bit breaks?

Audience: You cannot create small private channels for multiple people which take 5 minutes to set up where everyone records custom distributions.

Vaniver: So I see what you can’t do. What I want is the group that wants to do it. For example, one of my housemates loves these sites, but also is the sort of nerd that loves these kinds of sites in general. So should I just imagine there’s some MIT fraternity where everyone is really into forecasting so they want a private domain?

Ozzie: I’d say there's a lot of uncertainty. A bunch of groups may be interested, and if a few are pretty good and happen to be excited, that would be nice. We don’t know who those are yet, but we have ideas. There are EA groups now. A lot of them are kind of already doing this; and we could enable them to do it without having to pay $400-$1000 per month; or in a way that could make stuff public knowledge between groups… For other smaller EA groups that just wanted to experiment the current tooling would create some awkwardness.

Ozzie: If we want to run experiments on interesting things to forecast, e.g. “how valuable is this thing?” or stuff around evaluation or LessWrong posts, we’d have to set up a new instance for each. Or maybe we could have one instance and use it for all experiments, but that would force a single privacy setting for all those experiments.

Ozzie: Besides that, at this point, I raised some money and spent like $11,000 to get someone to program. So a lot of this tooling work is already done and these things are starting to be experimented with.

Knowledge graphs and moving beyond questions-as-strings

Ozzie: In the medium-term there’s a lot of other interesting things. With the systems right now, a lot of them assume all questions are strings. So if you’re going to have 1000 questions, they’re impossible to understand and for other people to get value from. So if you wanted to organise something like, “every EA org, how much money and personnel would they have each year for the coming 10 years” it would be impossible with current methods.

Vaniver: Instead we’d want like a string prefix combined with a list of string postfixes?

Ozzie: There are many ways to do it. I’m experimenting with using a formal knowledge graph where you have formal entities.

Vaniver: So there would be a pointer to the MIRI object instead of a string?

Ozzie: Yeah, and that would include information about how to find information about it from Wikipedia, etc. So if someone wanted to set up an automated system to do some of this they could. Combining this with bot support would enable experiments with data scientists and ML people to basically augment human forecasts with AI bots.

Vaniver: So, bot support here is like participants in the market (I’ll just always call a “forecast-aggregator” a market)? Somehow we have an API where they can just ingest questions and respond with distributions?

Ozzie: Even without bots, just organising structured questions in this way makes it easier for both participants and observers to get value.
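Editor’s sketch (not part of the conversation): one way to picture "pointers to formal entities instead of strings" is below. It is a minimal illustration under stated assumptions, not Ozzie's actual system; the `Entity` and `Question` classes, the `org:` identifiers and the `build_questions` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    """A formal entity in the graph (hypothetical schema)."""
    id: str
    name: str


@dataclass(frozen=True)
class Question:
    """A structured question: a pointer to an entity, a property, and a year."""
    entity: Entity
    prop: str
    year: int

    def as_text(self) -> str:
        # The human-readable string is derived from the structure, not stored.
        return f"What will the {self.prop} of {self.entity.name} be in {self.year}?"


def build_questions(entities, props, years):
    """Expand every (entity, property, year) combination into one question."""
    return [Question(e, p, y) for e, p, y in product(entities, props, years)]


# Illustrative entities only; the ids are made up for this sketch.
orgs = [Entity("org:miri", "MIRI"), Entity("org:80k", "80,000 Hours")]
questions = build_questions(orgs, ["budget", "headcount"], range(2020, 2030))
print(len(questions))  # 2 orgs x 2 properties x 10 years = 40
print(questions[0].as_text())
```

Because each question carries the entity and property as data, a bot or an observer can filter, group or bulk-answer questions without parsing free text.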

Summary of cruxes

Facilitator: Yeah, I don’t know… You chatted for a while, I’m curious what feels like some of the things you’ll likely think a bit more about, or things that seem especially surprising?

Ozzie: I got the sense that we agreed on more things than I was kind of expecting to. It seems lots of it now may be fleshing out what the mid-term would be, and seeing if there’s parts of it you agree are surprisingly useful, or if it does seem like all of them are long-shots?

Vaniver: When I try to summarise your cruxes, what would change your mind about forecasting, it feels like 1) if you thought there was a different app/EA tool to build, you would bet on that instead of this.

Ozzie: I’d agree with that.

Vaniver: And 2) if the track-record of attempts were more like… I don’t know what word to use, but maybe like “sophisticated” or “effortful”? If there were more people who were more competent than you and failed, then you’d decide to give up on it.

Ozzie: I agree.

Vaniver: I didn’t get the sense that there were conceptual things about forecasting that you expected to be surprised by. In my mind, getting data scientists to give useful forecasts, even if the questions are in some complicated knowledge graph or something, seems moderately implausible. Maybe I could transfer that intuition, but maybe the response is “they’ll just attempt to do base-rate forecasting, and it’s just an NLP problem to identify the right baserates”

Vaniver: Does it feel like it’s missing some of your cruxes?

Facilitator: Ozzie, can you repeat the ones he did say?

Audience: Good question.

Ozzie: I’m bad at this part. Now I’m a bit panicked because I feel like I’m getting cornered or something.

Vaniver: My sense was… 1) if there are better EA tools to build, you’d build them instead. 2) if better tries had failed, it would feel less tractable. And 3) Absence of conceptual uncertainties that we could resolve now. It feels it’s not like “Previous systems are bad because they got the questions wrong” or “Question/answer is not the right format”. It’s closer to “Previous systems are bad because their question data structure doesn’t give us the full flexibility that we want”.

Vaniver: Maybe that’s a bad characterization of the automation and knowledge graph stuff.

Ozzie: I’d definitely agree with the first two, although the first one is a bit more expansive than tools. If there was e.g. a programming tool I’d be better for and had higher EV, I’d do that instead. Number two, on tries, I agree if there were one or two other top programming teams who tried a few of these ideas and were very creative about it, and failed, and especially if they had software we could use now! (I’d feel much better about not having to make software.) Then for three, the absence of conceptual uncertainties. I don’t know exactly how to pin this down.

Facilitator: I don’t know if we should follow this track.

Vaniver: I’m excited about hearing what Ozzie’s conceptual uncertainties are.

Facilitator: Yeah, I agree actually.

Ozzie’s conceptual uncertainties

Ozzie: I think the way I’m looking at this problem is one where there are many different types of approaches that could be useful. There are many kinds of people who could be doing the predicting. There are many kinds of privacy. Maybe there would be more EAs using it, or maybe we want non-EAs of specific types. And within EA vs non-EA, there are many different kinds of things we might want to forecast. There are many creative ways of organising questions such that forecasting leads to an improved amount of accuracy. And I have a lot of uncertainty about this entire space, and what areas will be useful and what won’t.

Ozzie: I think I find it unlikely that absolutely nothing will be useful. But I do find it very possible that it’ll just be too expensive to find out useful things.

Vaniver: If it turned out nothing was useful, would it be the same reason for different applications, or would it be “we just got tails on every different application?”

Ozzie: If it came out people just hate using the tooling, then no matter what application you use it for it will kind of suck.

Ozzie: For me a lot of this is a question of economics. Basically, it requires some cost to both build the system and then get people to do forecasts; and then to make the question and do the resolution. In some areas the cost will be higher than value, and in some the value will be higher than the cost. It kind of comes down to a question of efficiency. Though, it’s hard to know, because there’s always the question of maybe if I would have implemented this feature things would have been different?

Vaniver: That made me think of something specific. When we look at the success stories, they are things like weather and sports, where for sports you had to do some amount of difficult operationalisation, but you sort of only had to do it once. The step I expect to be hard across most application domains is the “I have a question, and now I need to turn it into a thing-that-can-be-quantitatively-forecasted” and then I became kind of curious if we could get relatively simple NLP systems that could figure out the probability that a question is well-operationalised or not. And have some sort of automatic suggestions of like “ah, consider these cases” or whatever, or “write the question this way rather than that way”.

Ozzie: From my angle, you could kind of call those “unique questions”, where the marginal cost per question is pretty high. I think that if we were in any ecosystem where things were tremendously useful, the majority of questions would not be like this.

Vaniver: Right so if I ask about the odds I will still be together with my partner a while from now, I’d be cloning the standard “will this relationship last?” question and substituting new pointers?

Ozzie: Yeah. And a lot of questions would be like “GDP for every country for every year” so there could be a large set of question templates in the ecosystem. So you don’t need any fancy NLP; you could get pretty far with trend analysis and stuff.

Ozzie: On the question of whether data scientists would be likely to use it, that comes down to funding and incentive structures.

Ozzie: If you go on upwork and pay $10k to a data scientist they could give you a decent extrapolation system, and you could then just build that into a bot and hypothetically just keep pumping out these forecasts as new data come in. Pipelines like that already exist. What this would be doing is to provide infrastructure to help support them basically.
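Editor’s sketch (not part of the conversation): the "extrapolation system built into a bot" idea can be illustrated with a plain least-squares trend line. This is a deliberately crude stand-in, assuming one numeric series per entity; the function names and the data are made up, and a real forecasting bot would need a far better uncertainty model.

```python
from statistics import mean, stdev


def extrapolate(years, values, target_year):
    """Fit a least-squares line and return (mean, sd) for the target year.

    The sd is the spread of the residuals, a rough proxy for uncertainty
    that ignores how error grows with the forecast horizon.
    """
    x_bar, y_bar = mean(years), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values)) / sum(
        (x - x_bar) ** 2 for x in years
    )
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(years, values)]
    return intercept + slope * target_year, stdev(residuals)


def bot_forecasts(series_by_entity, target_year):
    """Produce one (mean, sd) forecast per entity; rerun as new data arrives."""
    return {
        name: extrapolate(years, values, target_year)
        for name, (years, values) in series_by_entity.items()
    }


# Made-up numbers purely for illustration.
data = {"Country A": ([2015, 2016, 2017, 2018], [100.0, 102.0, 104.0, 106.0])}
forecast_mean, forecast_sd = bot_forecasts(data, 2020)["Country A"]
print(forecast_mean, forecast_sd)
```

Paired with templated questions such as "GDP for every country for every year", a loop like `bot_forecasts` is what would let one script answer a whole family of questions at once.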

END OF TRANSCRIPT

At this point the conversation opened up to questions from the audience.

While this conversation was inspired by the double-crux technique, there is a large variation in how such sessions might look. Even when both participants retain the spirit of seeking the truth and changing their minds in that direction, some disagreements dissipate after less than an hour, others take 10+ hours to resolve and some remain unsolved for years. It seems good to have more public examples of genuine truth-seeking dialogue, but at the same time it should be noted that such conversations might look very different from this one.

18 comments

Comments sorted by top scores.

comment by Raemon · 2019-07-30T18:41:54.615Z · LW(p) · GW(p)

I had written up this summary of my takeaways [LW(p) · GW(p)] (after observing this conversation in realtime, plus some related conversations). This is fairly opinionated, rather than a strict summary. Seems maybe better to just list it entirely here:

Epistemic Status: quite rough, I didn't take very good notes and was summarizing the salient bits after the fact. Apologies for anything I got wrong here, grateful for Ozzie and Vaniver clarifying some things in the comments.

Just spent a weekend at the Internet Intellectual Infrastructure Retreat. One thing I came away with was a slightly better sense of forecasting and prediction markets, and how they might be expected to unfold as an institution.

I initially had a sense that forecasting, and predictions in particular, was sort of "looking at the easy to measure/think about stuff, which isn't necessarily the stuff that's connected to the stuff that matters most."

Tournaments over Prediction Markets

Prediction markets are often illegal or sketchily legal. But prediction tournaments are not, so this is how most forecasting is done.

The Good Judgment Project

Held an open tournament, the winners of which became "Superforecasters". Those people now... I think basically work as professional forecasters, who rent out their services to companies, NGOs and governments that have a concrete use for knowing how likely a given country is to go to war, or something. (I think they'd been hired sometimes by Open Phil?)

Vague impression that they mostly focus on geopolitics stuff?

High Volume and Metaforecasting

Ozzie described a vision where lots of forecasters are predicting things all the time, which establishes how calibrated they are. This lets you do things like "have one good forecaster with a good track record make lots of predictions. Have another meta-forecaster evaluate a small sample of their predictions to sanity check that they are actually making good predictions", which could get you a lot of predictive power for less work than you'd expect.
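Editor’s sketch (not part of the original comment): the "check a small sample" idea can be illustrated by scoring a random subset of one forecaster's resolved predictions. This swaps in a plain Brier score for the meta-forecaster's own judgment, which is an assumption of this example rather than anything Ozzie specified; the data and function names are made up.

```python
import random


def brier_score(predictions):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)


def audit_sample(predictions, sample_size, seed=0):
    """Score a random subset instead of the full record (a cheap spot check)."""
    rng = random.Random(seed)
    sample = rng.sample(predictions, min(sample_size, len(predictions)))
    return brier_score(sample)


# Hypothetical resolved predictions: (stated probability, outcome as 0 or 1).
history = [(0.9, 1), (0.8, 1), (0.7, 0), (0.2, 0), (0.1, 0), (0.6, 1)]
full = brier_score(history)
spot_check = audit_sample(history, sample_size=3)
print(full, spot_check)
```

Lower scores are better, and always answering 0.5 scores 0.25. The spot check is noisier than the full score, which is the trade-off the sampling approach accepts in exchange for less evaluation work.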

This seemed interesting, but I still had some sense of "But how do you get all these people making all these predictions? The prediction markets I've seen don't seem to accomplish very interesting things, for reasons Zvi discussed here [LW · GW]." Plus I'd heard that sites like Metaculus end up mostly being more about gaming the operationalization rules than actually predicting things accurately.

Automation

One thing I hadn't considered is that Machine Learning is already something like high volume forecasting, in very narrow domains (i.e. lots of bots predicting which video you'll click on next). One of Ozzie's expectations is that over time, as ML improves, it'll expand the range of things that bots can predict. So some of the high volume can come from automated forecasters.

Neural nets and the like might also be able to assist in handling the tricky "operationalization bits", where you take a vague prediction like "will country X go to war against country Y" and turn that into the concrete observations that would count for such a thing. Currently this takes a fair amount of overhead on Metaculus. But maybe at some point this could get partly automated.

(there wasn't a clear case for how this would happen AFAICT, just 'i dunno neural net magic might be able to help.' I don't expect neural-net magic to help here in the next 10 years but I could see it helping in the next 20 or 30. I'm not sure if it happens much farther in advance than "actual AGI" though)

I [think] part of the claim was that for both the automated-forecasting and automated-operationalization, it's worth laying out tools, infrastructure and/or experiments now that'll set up our ability to take advantage of them later.

Sweeping Visions vs Near-Term Practicality, and Overly Narrow Ontologies

An aesthetic disagreement I had with Ozzie was:

My impression is that Ozzie is starting with lots of excitement for forecasting as a whole, and imagining entire ecosystems built out of it. And... I think there's something important and good about people being deeply excited for things, exploring them thoroughly, and then bringing the best bits of their exploration back to the "rest of the world."

But when I look at the current forecasting ecosystem, it looks like the best bits of it aren't built out of sweeping infrastructural changes, they're built of small internal teams building tools that work for them, or consulting firms of professionals that hire themselves out. (Good Judgment project being one, and the How To Measure Anything guy being another)

The problem with large infrastructural ecosystems is this general problem you also find on Debate-Mapping sites – humans don't actually think in clean boxes that are easy to fit into database tables. They think in confused thought patterns that often need to meander, explore special cases, and don't necessarily fit whatever tool you built for them to think in.

Relatedly: every large company I've worked at has built internal tools of some sort, even for domains that seem like they sure ought to be able to be automated and sold at scale. Whenever I've seen someone try to purchase enterprise software for managing a product map, it's either been a mistake, or the enterprise software has required a lot of customization before it fit the idiosyncratic needs of the company.

Google sheets is really hard to beat as a coordination tool (but a given google sheet is hard to scale)

So for the immediate future I'm more excited by hiring forecasters and building internal forecasting teams than ecosystem-type websites.

Replies from: jacobjacob, Vaniver, ozziegooen

↑ comment by Bird Concept (jacobjacob) · 2019-07-31T09:40:11.006Z · LW(p) · GW(p)

Factual correction:

Those people now... I think basically work as professional forecasters.

I don't think any of the superforecasters are full-time on forecasting, instead doing it as a contractor gig; mostly due to lack of demand for the services.

Vague impression that they mostly focus on geopolitics stuff?

Yes, the initial tournaments were sponsored by IARPA, who cared about that, and Tetlock's earlier work in the 90's and 00's also considered expert political forecasting.

Replies from: Raemon

↑ comment by Raemon · 2019-07-31T21:25:08.735Z · LW(p) · GW(p)

I don't think any of the superforecasters are full-time on forecasting, instead doing it as a contractor gig; mostly due to lack of demand for the services.

Good to know. I'd still count contractors as professionals though.

↑ comment by Vaniver · 2019-07-30T20:32:40.865Z · LW(p) · GW(p)

(there wasn't a clear case for how this would happen AFAICT, just 'i dunno neural net magic might be able to help.' I don't expect neural-net magic to help here in the next 10 years but I could see it helping in the next 20 or 30. I'm not sure if it happens much farther in advance than "actual AGI" though)

I thought Ozzie's plan here was closer to "if you have a knowledge graph, you can durably encode a lot of this in ways that transfer between questions", and you can have lots of things where you rapidly build out a suite of forecasts with quantifiers and pointers. I thought "maybe NLP will help you pick out bad questions" but I think this is more "recognizing common user errors" than it is "understanding what's going on."

Replies from: Raemon, ozziegooen

↑ comment by Raemon · 2019-07-30T21:26:37.770Z · LW(p) · GW(p)

Nod, I definitely expect I missed some details, and defer to you or Ozzie on a more precise picture.

↑ comment by ozziegooen · 2019-07-30T20:56:39.026Z · LW(p) · GW(p)

Yep. I don't think any/much NLP is necessary for a lot of interesting work, if things are organized well with knowledge graphs. I haven't thought much about operationalizing questions using ML, but have been thinking that by focussing on questions that could be scaled (like GDP/population of every country for every year), we could get a lot of useful information without a huge amount of operationalization work.
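
To make the "questions that could be scaled" idea concrete, here is a minimal sketch. It assumes a hypothetical `Question` record and `expand_template` helper; neither comes from an existing forecasting platform, and the metrics, countries, and years are illustrative placeholders. The point is only that the wording is operationalized once and then reused for every (metric, country, year) combination.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Question:
    """One concrete, forecastable question generated from a template (hypothetical)."""
    metric: str
    country: str
    year: int
    text: str


def expand_template(metrics, countries, years):
    """Expand every (metric, country, year) combination into a Question.

    The wording lives in a single f-string, so the operationalization
    work is done once per template rather than once per question.
    """
    return [
        Question(
            metric=m,
            country=c,
            year=y,
            text=f"What will the {m} of {c} be in {y}?",
        )
        for m, c, y in product(metrics, countries, years)
    ]


# Illustrative inputs only; not a real dataset.
questions = expand_template(
    metrics=["GDP", "population"],
    countries=["Kenya", "Japan", "Brazil"],
    years=range(2020, 2025),
)
print(len(questions))       # 2 metrics * 3 countries * 5 years
print(questions[0].text)
```

Because each question keeps its metric, country, and year as structured fields rather than only as a string, forecasts could in principle be compared or aggregated across countries and years.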

↑ comment by ozziegooen · 2019-07-30T21:07:08.086Z · LW(p) · GW(p)

I think it would probably take a while to figure out the specific cruxes of our disagreements.

On your "aesthetic disagreement", I'd point out that there are, say, three types of forecasting work with respect to organizations.

1. Organization-specific, organization-unique questions. These are questions such as, "Will this specific initiative be more successful than this other specific initiative?" Each one needs to be custom made for that organization.
2. Organization-specific, standard questions. These are questions such as, "What is the likelihood that employee X will leave in 3 months"; where this question can be asked at many organizations and compared as such. A specific instance is unique to an organization, but the more general question is quite generic.
3. Inter-organization questions. These are questions such as, "Will this common tool that everyone uses get hacked by 2020?". Lots of organizations would be interested.

I think right now organizations are starting traditional judgemental forecasting for type (1), but there are already several standard tools for type (2). For instance, there are several startups that help businesses forecast key variables like engineering timelines, sales, revenue, and HR issues (e.g. https://www.liquidplanner.com/).

I think type (3) is most exciting to me; that's where PredictIt and Metaculus are currently. Getting the ontology right is difficult, but possible. Wikipedia and Wikidata are two successful (in my mind) examples of community efforts with careful ontologies that are useful to many organizations; I see many future public forecasting efforts in a similar vein. That said, I have a lot of uncertainty, so would like to see everything tried more.

I could imagine, in the "worst" case, that the necessary team for this could just be hired. You may be able to do some impressive things with just 5 full time equivalents, which isn't that expensive in the scheme of things. The existing forecasting systems don't seem to have that many full time equivalents to me (almost all forecasters are very part time).

comment by Vaniver · 2019-07-30T20:37:53.295Z · LW(p) · GW(p)

An idea that I had later which I didn't end up saying to Ozzie at the retreat was something like a "Good Judgment Project for high schoolers", in the same way that there are math contests and programming contests and so on. I would be really interested in seeing what happens if we can identify people who would be superforecasters as adults when they're still teens or in undergrad or whatever, and then point them towards a career in forecasting / have them work together to build up the art, and this seems like a project that's "normal" enough to catch on while still doing something interesting.

Replies from: romeostevensit, ozziegooen

↑ comment by romeostevensit · 2019-07-31T01:18:45.957Z · LW(p) · GW(p)

Today I wondered how much of forecasting plus data science is just people realizing that insurance isn't the only use of actuarial methods.

↑ comment by ozziegooen · 2019-07-30T20:50:02.864Z · LW(p) · GW(p)

Yea, I've been thinking about this too, though more for college students. I think that hypothetically forecasting could be a pretty cool team activity; perhaps different schools/colleges could compete with each other. Not only would people develop track records, but the practice of getting good at forecasting seems positive for epistemics and similar.

comment by ryan_b · 2019-07-31T21:20:22.984Z · LW(p) · GW(p)

I approve of this write-up, and would like to see more of this kind of content.

I feel like the most neglected part of forecasting is how it relates to anything else. The working assumption is that if it works well and is widely available, it will enable a lot of really cool stuff; I agree with this assumption, but I don't see much effort to bridge the gap between 'cool stuff' and 'what is currently happening'. I suspect that the reason more isn't being invested in this area is that we mostly won't use it regardless of how well it works.

There are other areas where we know how to achieve good, or at least better, outcomes in the form of best practices, like software and engineering. I think it is uncontroversial to claim that most software or engineering firms do not follow most best practices, most of the time.

But that takes effort, and so you might reason that perhaps trying to predict what will happen is more common when the responsibility is enormous and rewards are fabulous, to the tune of billions of dollars or percentage points of GDP. Yet that is not true - mostly people doing huge projects don't bother to try [LW · GW].

Perhaps then a different standard, where hundreds of thousands of lives are on the line and where nations hang in the balance. Then, surely, the people who make decisions will think hard about what is going to happen. Alas, even for wars it is not the case [LW · GW].

When we know the right thing to do, we often don't do it; and whether the rewards are great or terrible, we don't try to figure out if we will get them or not. The people who would be able to make the best use of forecasting in general follow a simpler rule: predict success, then do whatever they would normally do.

There's an important ambiguity at work, and the only discussion of it I have read is in the book Prediction Machines. This book talks about what the overall impact of AI will be, and they posit that the big difference will be a drop in cost of prediction. The predictions they talk about are mostly of the routine sort, like how much inventory is needed or expected number of applications, which is distinct from the forecasting questions of GJP. But the point they made that I thought was valuable is how deeply entwined predictions and decisions are in our institutions and positions, and how this will be a barrier to taking advantage of the new trends for businesses. We will have to rethink how decisions are made once we separate out the prediction component.

So what I would like to see from forecasting platforms, companies, and projects is a lot more specifics about how forecasting relates to the decisions that need to be made, and how it improves them. As it stands, forecasting infrastructure probably looks a lot like a bridge to nowhere from the perspective of its beneficiaries.

Replies from: jacobjacob, ozziegooen

↑ comment by Bird Concept (jacobjacob) · 2019-08-01T11:20:40.691Z · LW(p) · GW(p)

Glad that you found the write-up useful!

I might disagree with your other points. There's the idea that forecasting is only valuable if it's decision-relevant, or action-guiding, and so far no forecasting org has solved this problem. But I think this is the wrong bar to beat. Making something action-guiding is really hard -- and lots of things which we do think of as important don't meet this bar.

For example, think of research. Most people at e.g. FHI don't set out to write documents that will change how Bostrom takes decisions. Rather, they seek out something that they're curious about, or that seems interesting, or just generally important... and mostly just try to have true beliefs, more than having impactful actions. They're doing research, not decisions.

Most people think essays are important and enabling people to do essays better has high impact. But if you pick a random curated LW post and ask what decision was improved as a result, I think you'll be disappointed (though not as disappointed as with forecasting questions). And this is fine. Decision-making takes in a large number of inputs, considerations, emotions, etc. which influence it in strange, non-linear ways. It's mostly just a fact about human decision-making being complex, rather than a fact about essays being useless.

So I'm thinking that the evidence that should suggest to us that forecasting is valuable is not hearing an impactful person say "I saw forecast X which caused me to change decision Y", but rather "I saw forecast X which changed my mind about topic Y". Then, downstream, there might be all sorts of actions which changed as a result, and the forecast-induced mind-change might be one out of a hundred counterfactually important inputs. Yet we shouldn't propose an isolated demand for rigor that forecasting do the credit assignment problem any better than the other 99 inputs.

Replies from: bgold, ryan_b

↑ comment by Ben Goldhaber (bgold) · 2019-08-09T21:09:29.327Z · LW(p) · GW(p)

It seems true that there are a lot of ways to utilize forecasts. In general forecasting tends to have an implicit and unstated connection to the decision making process - I think that has to do w/ the nature of operationalization ("a forecast needs to be on a very specific thing") and because much of the popular literature on forecasting has come from business literature (e.g. How to Measure Anything).

That being said I think action-guidingness is still the correct bar to meet for evaluating the effect it has on the EA community. I would bite the bullet and say blogs should also be held to this standard, as should research literature. An important question for an EA blog - say, LW :) - is what positive decisions it's creating (yes there are many other good things about having a central hub, but if the quality of intellectual content is part of it that should be trackable).

If in aggregate many forecasts can produce the same type of guidance or better as many good blog posts, that would be really positive.

Replies from: jacobjacob

↑ comment by Bird Concept (jacobjacob) · 2019-08-10T18:54:01.910Z · LW(p) · GW(p)

I wonder whether you have any examples, or concrete case studies, of things that were successfully action-guiding to people/organisations? (Beyond forecasts and blog-posts, though those are fine too.)

Replies from: bgold

↑ comment by Ben Goldhaber (bgold) · 2019-08-11T16:52:29.913Z · LW(p) · GW(p)

From a 2 min brainstorm of "info products" I'd expect to be action guiding:

Metrics and dashboards reflecting the current state of the organization.
Vision statements ("what do we as an organization do and thus what things should we consider as part of our strategy")
Trusted advisors
Market forces (e.g. prices of goods)

One concrete example is from when I worked in a business intelligence role. What executives wanted was extremely trustworthy reliable data sources to track business performance over time. In a software environment (e.g. all the analytic companies constantly posting to Hacker News) that's trivial, but in a non-software environment that's very hard. It was very action-guiding to be able to see if your last initiative worked, because if it did you could put a lot more money into it and scale it up.

↑ comment by ryan_b · 2019-08-01T16:21:25.217Z · LW(p) · GW(p)

I agree with you about the non-decision value of forecasting. My claim is that the decision value of forecasting is neglected, rather than that decisions are the only value. I strongly feel that neglecting the decisions aspect is leaving money on the table. From Ozzie:

My impression is that some groups have found it useful and a lot of businesses don't know what to do with those numbers. They get a number like 87% and they don't have ways to directly make that interact with the rest of their system.

I will make a stronger claim and say that the decisions aspect is the highest value aspect of forecasting. From the megaproject management example: Bent Flyvbjerg (of Reference Class Forecasting fame) estimates that megaprojects account for ~8% of global GDP. The time and budget overruns cause huge amounts of waste, and eyeballing his budget overrun numbers it looks to me like ~3% of global GDP is waste. I expect the majority of that can be resolved with good forecasting; by comparison with modelling of a different system which tries to address some of the same problems, I'd say 2/3 of that waste.

So I currently expect that if good forecasting became the norm only in projects of $1B or more, excluding national defense, it would conservatively be worth ~2% of global GDP.
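
The back-of-the-envelope arithmetic above can be written out as follows. This is only a sketch of the estimate as stated: the 8%, 3%, and 2/3 figures are the rough, eyeballed numbers from this comment rather than independently checked data, and the variable names are made up for illustration.

```python
# Rough figures taken from the comment above (all are eyeballed estimates).
megaproject_share_of_gdp = 0.08   # megaprojects as a fraction of global GDP
waste_share_of_gdp = 0.03         # overrun waste as a fraction of global GDP
fraction_recoverable = 2 / 3      # share of that waste good forecasting might remove

# Implied waste rate within megaproject spending itself.
implied_overrun_rate = waste_share_of_gdp / megaproject_share_of_gdp

# Estimated value of good forecasting, as a fraction of global GDP.
value_share_of_gdp = waste_share_of_gdp * fraction_recoverable

print(f"Implied waste within megaprojects: {implied_overrun_rate:.1%}")
print(f"Estimated value of good forecasting: {value_share_of_gdp:.1%} of global GDP")
```

The estimate is linear in each input, so halving the recoverable fraction to 1/3 would halve the result to roughly 1% of global GDP.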

Looking at the war example, we can consider a single catastrophic decision: disbanding the Iraqi military. I expect reasonable forecasting practices would have suggested that when you stop paying a lot of people who are in possession of virtually all of the weaponry, they will have to find other ways to get by: selling the weapons and their fighting skills, for example. This decision allowed an insurgency to unfold into a full-blown civil war, costing some 10^5 lives, displacing 10^6 people, and causing moderately intense infrastructure damage.

Returning to the business example from the write-up, if one or more projects were to succeed in delivering this kind of value, I expect a lot more resources would be available for pursuing the true-beliefs aspect of forecasting. I would go as far as to say it would be a very strong inducement for people who do not currently care about having true beliefs to start doing so, in the most basic big pile of utility sense.

↑ comment by ozziegooen · 2019-08-01T13:12:34.236Z · LW(p) · GW(p)

So what I would like to see from forecasting platforms, companies, and projects is a lot more specifics about how forecasting relates to the decisions that need to be made, and how it improves them

My general impression is that there's a lot of creative experimentation to be done here. Right now there doesn't seem to be that much of this kind of exploration. In general though, there really aren't that many people focusing on forecasting infrastructure work of these types.

comment by ozziegooen · 2019-07-30T21:11:24.812Z · LW(p) · GW(p)

I just want to note that this transcript is probably kind of hard to read without much more context. Before this I gave a short pitch on my ideas, which is not included here.

Much of this thinking comes from work I've been doing, especially in the past few months since joining RSP. I've written up some of my thoughts on LessWrong, but have yet to write up most of it. There's a decent amount, and it takes a while to organize and write it.

Recently my priority has been to build a system and start testing out some of the ideas. My impression was that this would be more promising than writing up thoughts for other people to hopefully eventually do. I hope to announce some of that work shortly.

Happy to answer any quick questions here. Also happy to meet/chat with others who are specifically excited about forecasting infrastructure work.

https://www.lesswrong.com/s/YX6dCo6NSNQJDEwXR [? · GW]
