How superforecasting could be manipulated

post by AllAmericanBreakfast · 2020-04-17T06:47:51.289Z · LW · GW · 4 comments


  How superforecasting could be manipulated
  Testing calibration is not a fix
  Why this matters


What if Philip Tetlock wanted to game his own superforecasting system? He runs his own consulting firm. The information he's selling is not available to the public. How would he do it in a way that isn't violating any of the rules he's set up for forecasting?

I'm not saying he is, not at all. I respect Tetlock greatly, admire the work he's done, and seek to learn from it. In fact, it's in the spirit he brings to his enterprise that I'm questioning it. After all, his work is based on calling out the wiggle room "expert" commentators give themselves, and trying to get rid of it. I'm examining whether there might yet be a little bit of wiggle room left in the system.

How superforecasting could be manipulated

My understanding is that Tetlock designates as superforecasters the top 2% of participants in his tournaments. He evaluates their accuracy to see whether it stays steady or regresses to the mean, and finds that superforecasters really are better than the average person at predicting complex world events.

In theory, the way the system could be manipulated is in the selection of questions.

Superforecasters are eventually set apart from regular forecasters. For example, the questions answered at Good Judgment, Tetlock's professional consulting firm, are not visible to the public. I have to assume they're not (all) mirrored at Good Judgment Open, his open-to-all platform for forecasting. Hence, even a client has no way to compare how superforecasters do on these questions against the average Joanne.

How could their accuracy be inflated, even while answering unambiguous questions and being evaluated in a purely algorithmic fashion?

By choosing primarily questions that are in fact easy to answer for a professional, yet appear difficult to the typical client.

Here are two versions of the same question:

"Will Donald Trump and Kim Jong Un meet on or before Dec. 31, 2020?"
"Will Donald Trump and Kim Jong Un meet in Seoul on or before Dec. 31, 2020?"

Most readers here are savvy enough to realize that the odds of these world leaders meeting in Seoul are lower than them meeting at all, and familiar enough with the conjunction fallacy to know that "in Seoul" may actually make the second version seem more plausible to some readers. A professional superforecaster (or question writer for a global risk analysis firm) would know that Seoul, being the capital of South Korea, is an unlikely location for Dear Leader's next meeting. So by posing this question, you give the superforecasters an easy A.

Include a large number of these seems-hard-but-actually-easy questions, and you can inflate the average accuracy of your team. Include a few genuinely tough questions, like the number of Coronavirus cases in a month's time, and when your team gets them right, you can trumpet it. When they get other truly uncertain questions wrong, well, you can't get them all.
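To make the inflation concrete, here is a minimal sketch. It assumes accuracy is scored with something like a Brier score (squared error between the forecast and the 0/1 outcome, where lower is better), and every number in it is hypothetical: 90 "gimme" questions with a true 2% chance, and 10 genuine coin flips. The forecasters are assumed to be perfectly honest and well-informed in both cases; only the question mix changes.

```python
def brier(forecast, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2


def expected_brier(forecast, true_prob):
    """Expected Brier score if the event really occurs with probability true_prob."""
    return true_prob * brier(forecast, 1) + (1 - true_prob) * brier(forecast, 0)


def mean_expected_brier(questions):
    """Average expected Brier score over a list of (forecast, true_prob) pairs."""
    return sum(expected_brier(f, p) for f, p in questions) / len(questions)


# Hypothetical question sets (illustrative numbers, not from any real tournament).
easy = [(0.02, 0.02)] * 90  # seems hard to the client, easy for a professional
hard = [(0.50, 0.50)] * 10  # genuinely uncertain

hard_only_score = mean_expected_brier(hard)
mixed_score = mean_expected_brier(easy + hard)

print(f"hard questions only: {hard_only_score:.4f}")
print(f"90 easy + 10 hard:   {mixed_score:.4f}")
```

Under these assumptions the forecasters know exactly as much about the hard questions as before, yet the headline score improves several-fold just because the easy questions dominate the average.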

I want to reiterate that I am not accusing Tetlock or any other group of forecasters of doing this, consciously or unconsciously. In fact, on their analytics page, Good Judgment advertises that clients can choose the questions:

Clients... can... pose their own questions to the Superforecasters... And, if we don't have your topic on our current subscription list, our expert question team will help frame your mission-critical questions to get the answers you need.

GJ might do this with perfect integrity. And clients using this service have an incentive to pose questions where an increase in accuracy has genuine value.

But an unscrupulous firm might get a lot of practice at the additional skill of guiding clients to frame questions that appear uncertain but are in fact easy to answer. If the incentives of those hiring the consulting firm diverge from the overall interests of the company - if those purchasing the analytics aren't directly invested in getting answers to the most pressing questions - if the object at the level of the transaction is to have reassuringly advanced methodology backing up your report - then heck, maybe it's in the interests of the client, as in whichever manager is doing the hiring, to have the consultant spit back a bunch of highly confident answers that almost all turn out to be objectively accurate. It makes it really easy to defend your choice of analytics firm later on.

Testing calibration is not a fix

After Nate Silver sort-of failed to predict Trump's Presidential win, he checked to see whether his predictions are well-calibrated. Turns out, they are. When he predicts an event has a 70% chance of occurring (rounded to the nearest 5%), 71% of the time, it does.

If an unscrupulous forecasting firm wanted to hack this, could they?

Let's say they have a batch of 100 questions. 99 are easy questions - so unlikely that on each of them, the superforecasters predict the event has a 1% chance of occurring. One, though, is a hard question, with a 50% chance, a coin flip.

All they have to do is predict that the hard question also has a 1% chance of occurring. Without the hard question, they might have gotten 1-2 questions wrong, maybe even 3. With the hard question, they now have a 50% chance of getting one more question wrong. Even with some bad luck, when they predict a 1% chance of an event occurring, it actually occurs 4% of the time. And doesn't that still look pretty darn good?
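The arithmetic of this hack can be sketched as follows, taking the numbers above at face value (99 events with a true 1% chance, one with a true 50% chance, all forecast at 1%) and assuming the questions resolve independently. The function names are illustrative only.

```python
from math import comb


def expected_bucket_rate(true_probs):
    """Expected fraction of events that occur among questions given the same forecast."""
    return sum(true_probs) / len(true_probs)


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_total_at_most(limit, n_easy=99, p_easy=0.01, p_hard=0.5):
    """Chance that at most `limit` events occur across the easy questions plus one hard one."""
    total = 0.0
    for k in range(limit + 1):
        p_k = binom_pmf(k, n_easy, p_easy)
        total += p_k * (1 - p_hard)  # hard event does not occur
        if k + 1 <= limit:
            total += p_k * p_hard  # hard event occurs and the total is still within limit
    return total


rate_easy_only = expected_bucket_rate([0.01] * 99)
rate_with_hidden_hard = expected_bucket_rate([0.01] * 99 + [0.50])
chance_of_4_or_fewer = prob_total_at_most(4)

print(f"expected rate, easy only:        {rate_easy_only:.4f}")
print(f"expected rate, hard one hidden:  {rate_with_hidden_hard:.4f}")
print(f"chance the '1%' bucket shows <= 4 hits: {chance_of_4_or_fewer:.3f}")
```

In expectation the "1%" bucket resolves at about 1.5%, and under these assumptions the chance of seeing four or fewer hits out of 100 is comfortably above 95%. A calibration chart with a single bucket like this gives little sign that one forecast was off by a factor of fifty.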

Why this matters

First, it suggests to me that if the superforecasting system can be hacked in this way, then we're back to square one. We are trusting an accuracy assessment of experts that is manipulable. It might be more difficult to manipulate than that of the pundits who never really forecast anything. But at least with the pundits, we know to be suspicious.

The superforecasters still have an aura of Truth about them. It can get to the point where if we find that our analysis disagrees with the superforecasters, we are inclined to doubt ourselves almost automatically, with the same sort of fundamentalism with which we might subscribe to the efficient market hypothesis.

The advantage we amateurs have is that we're likely to pick questions with high uncertainty and high importance. Why would I bother looking around for questions I can easily guess the answer to, unless I was trying to impress other people with my clairvoyance?

I don't think a good strategy is to accuse other people of using this strategy to artificially inflate their accuracy score. But it's hard for me to imagine this isn't already happening as the superforecasting meme spreads.

I'm not sure I have a well-drawn-out way to deal with this, but here are the sketches.

If someone posts a list of their predictions, you choose the ones from their list that you personally feel the most uncertain about. Make a public, dated statement that those are your uncertainties. When the questions are resolved, calibrate their accuracy based only on the questions that you had genuine uncertainty about.
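One way this procedure could look in practice is sketched below. Everything here is an assumption for illustration: the `Prediction` record, the 30% to 70% band used to define "genuine uncertainty", the use of a Brier score as the accuracy measure, and the four made-up questions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    question: str
    their_forecast: float  # probability stated by the forecaster being evaluated
    my_forecast: float     # my own probability, recorded publicly and dated beforehand
    outcome: int           # 1 if the event occurred, 0 otherwise


def genuinely_uncertain(pred, low=0.3, high=0.7):
    """True if my own prior forecast fell inside the (hypothetical) uncertainty band."""
    return low <= pred.my_forecast <= high


def mean_brier(preds):
    """Mean squared error of the forecaster's probabilities; None for an empty list."""
    if not preds:
        return None
    return sum((p.their_forecast - p.outcome) ** 2 for p in preds) / len(preds)


# Made-up track record, for illustration only.
record = [
    Prediction("Q1 (looked easy to me)", 0.02, 0.05, 0),
    Prediction("Q2 (looked easy to me)", 0.97, 0.95, 1),
    Prediction("Q3 (I was unsure)", 0.10, 0.50, 1),
    Prediction("Q4 (I was unsure)", 0.80, 0.60, 1),
]

score_all = mean_brier(record)
score_uncertain = mean_brier([p for p in record if genuinely_uncertain(p)])

print(f"score on all questions:          {score_all:.4f}")
print(f"score on my uncertain questions: {score_uncertain:.4f}")
```

The key design choice is that `my_forecast` is fixed before resolution, so the filter cannot be adjusted after the fact to flatter or punish the forecaster.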

A better way, of course, would be to submit that question to a formal forecasting tournament, and see whether the median prediction was closer to 1% or 99% (highly unlikely/likely), or closer to 50% (highly uncertain). But that's not typically tractable.

Betting is also an OK strategy, but the failure mode there is that if you can make up your own questions, then find an idiot to bet with, you can accomplish the same thing. It only really works if an outsider would respect both people involved. I do respect Warren Buffett's bet with Protégé Partners LLC about the relative long-term performance of an index fund vs. a hedge fund, because it was high-profile, with not just financial but reputational stakes for both involved.

I do plan, personally, on being on watch against people putting on airs because of the sheer number of correct predictions they've made. Based on this, I'm downgrading my heuristic for interpreting superforecaster predictions to "soft evidence" unless I can see their models and evidence, or can find some other way to evaluate whether the overall system they're participating in is really being run with integrity.

In the end, it's a choice between believing the reason and evidence you can personally evaluate, and accepting a consulting firm's analysis because of what the CEO tells you about how they run the company. If you're in a long-term relationship with the consultants, maybe you can get a sense of whether they're steering you right over the long run. But when superforecasters predicted a 3% chance of 200,000 Coronavirus cases by mid-March, maybe it's time to downgrade our confidence in superforecasters, rather than in ourselves.


Comments sorted by top scores.

comment by lifelonglearner · 2020-04-17T17:58:43.906Z · LW(p) · GW(p)

Would most of your concerns be alleviated if Tetlock just made all or most of the questions public? More generally, it seems fine if you push for a norm of evaluating people only on public predictions, e.g. those made on Metaculus.

comment by AllAmericanBreakfast · 2020-04-17T18:26:47.933Z · LW(p) · GW(p)

Not really. Here’s a cartoon example to explain why:

You don’t know math beyond basic algebra and don’t know the difference between calculus and linear algebra. Somebody gives you a linear algebra problem to solve. You hire me to get the answer.

To attract your business, I tell you the answers to 99 calculus I problems and one linear algebra problem. I get the calculus right and the linear algebra problem wrong, but you don’t know the difference. If I tell you that I “have 99% accuracy on advanced math problems posed to me,” then I’m not lying.

But does this mean I’m 99% likely to give you the right answer to your linear algebra problem? No, of course not!

Let’s say I show you the problems I solved to obtain my 99% accuracy score. Do you have the ability to tell the difference between the calc and linear algebra problems? Not without some pretty thorough investigation, and you might make mistakes or not trust yourself because you’re not an expert.

What if my results were compared to a group of mathematicians answering the same problems? They’d get 100% vs my 99%. That doesn’t look like a big difference. You still have no way to know that I’m going to get your linear algebra problems wrong every time unless you ask me a lot of them.

Maybe I lose your business eventually, but I can still keep looking for clients and broadcasting my general accuracy on self-selected questions. If you ask 3 questions and I get them all wrong, I lose your business and you stop asking those hard questions. The business I keep continues asking gimmes, which I keep getting right. If new hard-question-askers keep trusting my overall accuracy without being able to discern between the hard and easy questions, I’ll keep misleading them.

Now in this cartoon, any mathematician could come along and demolish my forecasting firm by pointing out that I was mixing two different types of questions and systematically failing the linear algebra.

There’s no easy way to do that with real world forecasting.

The only way to deal with this problem is if an adversarial question-asker selects the questions posed to the forecasters.

So I am positing that in the absence of a formal system for doing that, you must be the adversarial question asker by only counting as evidence of accuracy the questions you genuinely feel uncertain about.

comment by Jeremiah820 · 2020-06-09T13:52:42.048Z · LW(p) · GW(p)

I have never commented on LW before, so I'm not sure what the policy is on links, but I wrote a whole post explaining my worries about superforecasting. On the off chance that it serves as a useful appendage to this post here's the link:

comment by ChristianKl · 2020-06-09T22:27:27.243Z · LW(p) · GW(p)

"Black swan" seems to me a very useful concept, and anything that you would ask GJP to forecast and put a probability on is by definition not a black swan. Taleb is concerned about events that are actually black swans. I think it's unhelpful to confuse them with modeled low-probability events the way you do in your post. I think Taleb actually has a valuable point, but I don't think that you argued that point.

If you have probabilities, you can easily do a utilitarian benefit calculation to find that you shouldn't engage in an action with a 5% probability of failure if the impact of the failure is too bad. That's very simple math.

It's much easier to do that utilitarian calculation with probabilities than when you have reports that round up the 95% to "almost certain", the way the US, for example, handled WMDs in Iraq. That bad decision led them to run the tournament that GJP won.

I think you ignore that, in practice, listening to superforecasters forces current decision makers to accept much higher uncertainty than they currently face.