Bayesian Model Testing Comparisons

post by johnswentworth · 2019-07-20

We've now seen three different methods for model comparison: the exact Bayesian calculation, the Laplace approximation, and the Bayesian information criterion (BIC).

Based on the derivations, we can make some predictions:

- The Laplace approximation should closely track the exact calculation.
- The BIC should pick up an error which grows with the number of parameters k but stays roughly constant in the number of data points N, since it ignores terms which scale with k but not N.
- Precise Bayesian calculation should therefore matter most when N is small and k is large.

Let's test these predictions out with some dice-rolling simulations.

The graphs below show the difference in bits of evidence (i.e. $\log_2 \frac{P(\text{data}|\text{biased model})}{P(\text{data}|\text{unbiased model})}$, or its approximations) between a biased model and an unbiased model, as a function of the number of data points in each simulation, for each of the three methods. A positive difference indicates that the method favors the biased model; a negative difference indicates that the method favors the unbiased model.
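To make that quantity concrete, here's a minimal sketch of the exact calculation for a coin. It assumes a uniform Beta(1,1) prior on the biased model's parameter; the priors in the actual simulations may differ.

```python
import numpy as np
from scipy.special import betaln

def bits_of_evidence_exact(n_heads, n_total):
    """Exact log2 Bayes factor: biased-coin model vs. unbiased model.

    Biased model: p unknown with a uniform Beta(1,1) prior, so the
    marginal likelihood is the Beta function B(n_heads+1, n_tails+1).
    Unbiased model: p fixed at 1/2, likelihood (1/2)^n_total.
    """
    log_biased = betaln(n_heads + 1, n_total - n_heads + 1)
    log_unbiased = n_total * np.log(0.5)
    return (log_biased - log_unbiased) / np.log(2)  # nats -> bits
```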

First up, let's compare a simulation with a 60-40 biased coin to a simulation with an unbiased coin. Hopefully, our methods will favor the biased model in the biased coin simulation and the unbiased model in the unbiased coin simulation.
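The post's simulation code isn't shown, but a sketch of the driver loop might look like this (reusing `bits_of_evidence_exact` from above; the seed and simulation length are arbitrary):

```python
rng = np.random.default_rng(0)

def evidence_curve(p_true, n_max=1000):
    """Flip a coin with bias p_true, tracking bits of evidence after each flip."""
    flips = rng.random(n_max) < p_true
    heads = np.cumsum(flips)
    totals = np.arange(1, n_max + 1)
    return np.array([bits_of_evidence_exact(h, t)
                     for h, t in zip(heads, totals)])

biased_curve = evidence_curve(0.6)    # 60-40 biased coin
unbiased_curve = evidence_curve(0.5)  # fair coin
```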

Here's what's going on in those plots: all three methods favor the biased model in the biased-coin simulation and the unbiased model in the unbiased-coin simulation, and with only one free parameter, both approximations stay close to the exact calculation.

Because the BIC is a large-N approximation, ignoring terms which scale with k but not N, we'd expect BIC to perform worse as we crank up the number of parameters k. Let's try that: here's another pair of simulations with a 100-sided die (the biased die has 1/200 weight on half the faces and 3/200 on the other half).
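For reference, here's a sketch of the BIC version of the comparison for an m-sided die, assuming the standard form $\ln P(\text{data}|M) \approx \ln\hat{L} - \frac{k}{2}\ln N$ (the post's exact conventions may differ):

```python
def bic_bits_of_evidence(counts):
    """BIC-approximate log2 Bayes factor: biased-die vs. unbiased-die model.

    counts: observed count for each of the m faces.
    Biased model: face probabilities free, so k = m - 1 parameters.
    Unbiased model: all faces equal at 1/m, so k = 0.
    """
    counts = np.asarray(counts)
    N, m = counts.sum(), len(counts)
    nonzero = counts[counts > 0]                          # 0*log(0) = 0
    log_lik_hat = np.sum(nonzero * np.log(nonzero / N))   # multinomial MLE
    log_biased = log_lik_hat - 0.5 * (m - 1) * np.log(N)  # BIC penalty
    log_unbiased = N * np.log(1.0 / m)
    return (log_biased - log_unbiased) / np.log(2)
```

With m = 100 there are k = 99 free parameters, so the terms the BIC drops (roughly constant in N but growing with k) are far larger than in the one-parameter coin case.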

This time, the BIC has a very large error - hundreds of bits of evidence in favor of an unbiased model, regardless of whether the die is biased or not. That said, after the first few data points, the BIC's error mostly stays constant; recall that the terms ignored by the BIC are all roughly constant with respect to N. Meanwhile, the Laplace approximation agrees wonderfully with the exact calculation. (However, note that the Laplace approximation is absent in the leftmost part of each plot - for these models, it isn't well-defined until we've seen at least one of each outcome.)
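To make the "not well-defined" caveat concrete, here's a minimal sketch of the Laplace approximation for the coin case, again assuming a uniform prior (the die case is analogous, with a $k \times k$ Hessian determinant in place of the scalar second derivative):

```python
def laplace_bits_of_evidence(n_heads, n_total):
    """Laplace-approximate log2 Bayes factor for the coin models.

    Approximates the biased model's log evidence as
    ln L(p_hat) + (1/2) ln(2*pi) - (1/2) ln H, where H is the second
    derivative of the negative log posterior at the MAP estimate p_hat.
    """
    h, t = n_heads, n_total
    if h == 0 or h == t:
        return np.nan  # MAP sits on the boundary; the Gaussian fit fails
    p_hat = h / t  # MAP = MLE under a uniform prior
    log_lik_hat = h * np.log(p_hat) + (t - h) * np.log(1 - p_hat)
    hess = h / p_hat**2 + (t - h) / (1 - p_hat)**2
    log_biased = log_lik_hat + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)
    log_unbiased = t * np.log(0.5)
    return (log_biased - log_unbiased) / np.log(2)
```

The boundary check is exactly why the Laplace curve is missing at the start of each plot: until both outcomes have appeared, the maximum sits at $p = 0$ or $p = 1$, where a Gaussian approximation around the peak breaks down.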

Finally, notice that the exact calculation itself gives pretty reasonable probabilities in general, and in particular for small N. When the number of data points is small, it's always pretty close to zero, i.e. roughly indifferent between the models. In the high-k simulations, the exact solution gave reliably correct answers after a few hundred data points, and was roughly indifferent before that. Compare that to the BIC, which gave a very confident wrong answer in the biased case and only worked its way back to the correct answer after around 3000 data points. The moral of this story is: precise Bayesian calculations are more important when N is smaller and k is larger. We'll come back to that theme later.

Next post will add cross-validation into the picture, reusing the simulations above.
