From Laplace to BIC

post by johnswentworth · 2019-07-19T16:52:58.087Z

Contents

  BIC Derivation
  When Does BIC Work?
  2 comments

The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we'll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.

The BIC is pretty simple:

$$BIC = \ln P[\text{data}|\theta^*] - \frac{k}{2}\ln N$$

where $\theta^*$ is the value of the model's $k$ free parameters which maximizes $P[\text{data}|\theta]P[\theta]$, and $N$ is the number of data points. Thus: $BIC \approx \ln P[\text{data}]$. Using this magic number, we can compare any two models we like.
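As a concrete illustration, here's a minimal sketch (my own example, not from the post) computing the BIC in this sign convention for a biased-coin model fit by maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: N flips of a biased coin, modeled with
# k = 1 free parameter (the bias).
data = rng.binomial(1, 0.7, size=1000)
N, k = len(data), 1

theta_star = data.mean()  # maximum-likelihood estimate of the bias
log_lik = np.sum(data * np.log(theta_star) + (1 - data) * np.log(1 - theta_star))

bic = log_lik - (k / 2) * np.log(N)  # higher is better in this convention
print(bic)
```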

Let's derive that.

BIC Derivation

As usual, we'll start from $P[\text{data}] = \int P[\text{data}|\theta]\,P[\theta]\,d\theta$. (Caution: don't forget that what we really care about is $P[\text{model}|\text{data}]$; we can jump to $P[\text{data}|\text{model}]$ only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have $N$ independent data points $d_1, \dots, d_N$, all with the same unobserved parameters $\theta$ - e.g. $N$ die rolls with the same unobserved biases. In that case, we have

$$P[\text{data}] = \int \left(\prod_{i=1}^N P[d_i|\theta]\right) P[\theta]\,d\theta$$
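To make that integral concrete, here's a sketch (my example, with an assumed uniform prior) computing $P[\text{data}]$ for a one-parameter biased coin by direct numerical quadrature, checked against the closed-form Beta-function answer available for this particular model:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=50)
heads, N = data.sum(), len(data)

# P[data] = integral over theta of (prod_i P[d_i|theta]) * P[theta],
# with a uniform prior P[theta] = 1 on [0, 1].
evidence, _ = quad(lambda t: t**heads * (1 - t)**(N - heads), 0, 1, epsabs=0)
print(np.log(evidence))

# Check: for this model the integral is exactly the Beta function B(heads+1, N-heads+1).
print(betaln(heads + 1, N - heads + 1))
```

For one or two parameters this brute-force integral is cheap; the approximations below earn their keep as the number of parameters grows.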

Next, apply Laplace approximation and take the log:

$$\ln P[\text{data}] \approx \ln P[\text{data}|\theta^*] + \ln P[\theta^*] + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln\det H$$

where the Hessian matrix $H$ is given by

$$H = -\frac{\partial^2}{\partial\theta^2}\left(\ln P[\text{data}|\theta] + \ln P[\theta]\right)\bigg|_{\theta=\theta^*} = -\sum_{i=1}^N \frac{\partial^2}{\partial\theta^2}\ln P[d_i|\theta]\bigg|_{\theta=\theta^*} - \frac{\partial^2}{\partial\theta^2}\ln P[\theta]\bigg|_{\theta=\theta^*}$$

Because the data points are independent, the log likelihood splits into a sum, so $H$ is (up to the prior term) a sum of $N$ comparable contributions.
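Here's the same coin example pushed through the Laplace formula (my sketch: a numerical maximization and a finite-difference Hessian, with a uniform prior so that $\ln P[\theta^*] = 0$):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=50)
heads, N = data.sum(), len(data)
k = 1  # one free parameter

def neg_log_post(t):
    # -ln(P[data|theta] * P[theta]); uniform prior, so ln P[theta] = 0
    return -(heads * np.log(t) + (N - heads) * np.log(1 - t))

opt = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_star = opt.x

# Hessian (here a 1x1 matrix, i.e. a scalar) by central finite differences
eps = 1e-5
H = (neg_log_post(theta_star + eps) - 2 * neg_log_post(theta_star)
     + neg_log_post(theta_star - eps)) / eps**2

laplace = -opt.fun + (k / 2) * np.log(2 * np.pi) - 0.5 * np.log(H)
exact = betaln(heads + 1, N - heads + 1)  # closed-form ln P[data] for this model
print(laplace, exact)
```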
Now for the main trick: how does each term scale as the number of data points $N$ increases?

Let's go ahead and write $H$ as $N\bar{H}$, where $\bar{H} = \frac{1}{N}H$ is roughly constant with respect to $N$ (it's an average of the $N$ per-data-point contributions above), to pull out the $N$-dependence. Then, if we can remember how determinants of $k \times k$ matrices scale:

$$\det(H) = \det(N\bar{H}) = N^k\det(\bar{H})$$

so we can re-write our Laplace approximation as

$$\ln P[\text{data}] \approx \ln P[\text{data}|\theta^*] - \frac{k}{2}\ln N + C$$

where $C = \ln P[\theta^*] + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln\det\bar{H}$ contains all the terms which are roughly constant with respect to $N$. The first two terms are the BIC: the log likelihood grows roughly linearly with $N$, the $\frac{k}{2}\ln N$ penalty grows logarithmically, and $C$ stays put.
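If the determinant-scaling step feels slippery, a quick numerical check (my example) confirms it for a random positive-definite $k \times k$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 3, 1000
A = rng.normal(size=(k, k))
H_bar = A @ A.T + k * np.eye(k)  # a random positive-definite k x k matrix

# det(N * H_bar) should equal N^k * det(H_bar)
print(np.linalg.det(N * H_bar), N**k * np.linalg.det(H_bar))
```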

In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don't scale up as the number of data points increases.
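We can watch this happen numerically. In the sketch below (my coin example again, uniform prior, so the exact $\ln P[\text{data}]$ has a closed form), both the exact log evidence and the BIC grow with $N$, while the gap between them settles down to a constant - precisely the $C$ we dropped:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
k = 1
for N in [100, 1000, 10000, 100000]:
    data = rng.binomial(1, 0.7, size=N)
    h = int(data.sum())
    theta = h / N  # maximum-likelihood bias
    log_lik = h * np.log(theta) + (N - h) * np.log(1 - theta)

    exact = betaln(h + 1, N - h + 1)  # exact ln P[data] under a uniform prior
    bic = log_lik - (k / 2) * np.log(N)
    print(f"N={N:6d}  exact={exact:12.2f}  BIC={bic:12.2f}  gap={bic - exact:6.3f}")
```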

When Does BIC Work?

What conditions need to hold for BIC to work? Let's go back through the derivation and list out the key assumptions behind our approximations:

- Our priors over the models themselves are close enough to be swamped by the evidence, so comparing $P[\text{data}|\text{model}]$ stands in for comparing $P[\text{model}|\text{data}]$.
- The data points are independent given the parameters $\theta$.
- The integrand $P[\text{data}|\theta]P[\theta]$ is dominated by a single peak around $\theta^*$ which is well-approximated by a Gaussian - the usual requirement for Laplace approximation.
- $N$ is large enough that the constant terms $C$ are negligible next to the terms which grow with $N$.

That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends how big $C$ is - $N$ needs to be exponentially larger than that. We'll see an example in the next post.

Next post will talk more about relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works/fails.

2 comments


comment by Bucky · 2019-07-24T22:24:41.147Z

In removing the constant terms $C$, I think we're removing all of the widths of the peak in the various dimensions (the $\frac{1}{2}\ln\det\bar{H}$ term). So in the case where the widths are radically different between the models, this would mean that $N$ would need to be even larger for BIC to be a useful approximation.

The widths issue might come up, for example, when an additional parameter is added which splits the data into 2 populations with drastically different population sizes - the small population is likely to have a wider peak.

Is that right?

comment by johnswentworth · 2019-07-25T01:45:39.803Z

That is exactly correct.