# Do Bayesians like Bayesian model averaging?

This is a question post.

Pages 4-5 of the International Society for Bayesian Analysis Newsletter Vol. 5 No. 2 contain a satirical interview with Thomas Bayes. In one part of the interview, they appear to criticise the idea of model averaging:

MC: So what did you like most about Merlise Clyde's paper, Tom?
Bayes: Well, I thought it was really clever how she pretended that the important question was how to compute model averages. So nobody noticed that the real question is whether it makes any sense in the first place.

What’s going on here? I thought Bayesians liked model averaging because it allows us to marginalise over the unknown model:

$$p(x \mid D) = \sum_i p(x \mid M_i, D)\, p(M_i \mid D)$$

where $M_i$ represents the i-th model and $D$ represents the data.
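As a minimal sketch of this marginalisation, assume a toy coin-flip problem with two candidate models. Both models and the data are illustrative assumptions, not taken from the newsletter or from Clyde's paper: M1 fixes the heads probability at 0.5, while M2 puts a uniform Beta(1, 1) prior on it. Both evidences are available in closed form, so the posterior model probabilities and the model-averaged prediction for the next flip can be computed exactly.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Hypothetical data: 14 heads in 20 flips (made up for illustration).
heads, tails = 14, 6

# Log evidence log p(D | M_i) for the observed sequence of flips.
log_ev_m1 = (heads + tails) * log(0.5)                        # M1: theta fixed at 0.5
log_ev_m2 = log_beta(heads + 1, tails + 1) - log_beta(1, 1)   # M2: theta ~ Beta(1, 1)

# Posterior model probabilities p(M_i | D), assuming equal prior model probabilities.
shift = max(log_ev_m1, log_ev_m2)
w1, w2 = exp(log_ev_m1 - shift), exp(log_ev_m2 - shift)
post_m1, post_m2 = w1 / (w1 + w2), w2 / (w1 + w2)

# Per-model posterior predictive probability that the next flip is heads.
pred_m1 = 0.5
pred_m2 = (heads + 1) / (heads + tails + 2)   # mean of Beta(heads + 1, tails + 1)

# Model-averaged prediction: sum_i p(next is heads | M_i, D) * p(M_i | D).
pred_bma = pred_m1 * post_m1 + pred_m2 * post_m2

print(f"p(M1 | D) = {post_m1:.3f}, p(M2 | D) = {post_m2:.3f}")
print(f"p(next flip is heads | D) = {pred_bma:.3f}")
```

The averaged prediction always lies between the two per-model predictions, pulled towards whichever model has the higher posterior probability.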

answer by Jan Christian Refsgaard · 2021-08-02T21:11:44.353Z

I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior (and both priors cover the region where the likelihood is concentrated), then their posterior predictive distributions would be practically identical but their Bayes factor would be 1/10. A model average (assuming a uniform prior on p(M_i)) would therefore favor the narrow prior by a factor of 10, whereas the predictive approach would correctly conclude that they describe the data equally well and thus that the models should be weighted equally.
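The factor of 10 can be checked numerically. The sketch below assumes a simple setup that is not from the thread: normally distributed data with known standard deviation, an unknown mean `mu`, and two uniform priors on `mu` whose widths differ by a factor of 10. The data are simulated, and the evidence is approximated on a grid with the trapezoid rule.

```python
import numpy as np

def log_evidence_and_post_mean(y, half_width, sigma=1.0, grid_size=200_001):
    """Grid approximation for y_i ~ Normal(mu, sigma), mu ~ Uniform(-half_width, half_width)."""
    mu = np.linspace(-half_width, half_width, grid_size)
    n = y.size
    # Log-likelihood of the whole data set at every grid value of mu.
    log_lik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
               - (np.sum(y**2) - 2 * mu * np.sum(y) + n * mu**2) / (2 * sigma**2))
    log_joint = log_lik - np.log(2 * half_width)   # add the log prior density
    shift = log_joint.max()                        # stabilise the exponentials
    dens = np.exp(log_joint - shift)
    # Trapezoid-rule weights for integrating over mu.
    weights = np.full(grid_size, mu[1] - mu[0])
    weights[[0, -1]] *= 0.5
    integral = np.sum(dens * weights)
    log_evidence = shift + np.log(integral)
    post_mean = np.sum(mu * dens * weights) / integral
    return log_evidence, post_mean

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=50)        # simulated data, illustration only

log_ev_narrow, mean_narrow = log_evidence_and_post_mean(y, half_width=5.0)
log_ev_wide, mean_wide = log_evidence_and_post_mean(y, half_width=50.0)

bayes_factor = np.exp(log_ev_wide - log_ev_narrow)  # wide prior relative to narrow
print(f"Bayes factor (wide / narrow): {bayes_factor:.4f}")
print(f"posterior means: narrow = {mean_narrow:.4f}, wide = {mean_wide:.4f}")
```

The posterior for `mu` is essentially the same under both priors, so any prediction-based comparison sees two equivalent models, while the evidence ratio only reflects the ratio of the prior widths.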

Finally, model averaging is usually conceptually wrong, and the problem can be solved by making a larger model that encompasses all potential models, such as a hierarchical model that partially pools between the group-level and subject-level models. Gelman's 8 schools data is a good example: there are 8 schools and 2 simple models, one with 1 parameter (all schools are the same) and one with 8 (every school is a special snowflake). Then there is the hierarchical model with 9 parameters, one for each school and one for how much to pool the estimates towards the group mean. Gelman's radon dataset is also good for learning about hierarchical models.
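To make the 8 schools example concrete, here is a minimal sketch using the published data (estimated effects and standard errors). It is a simplification rather than a full fit: the between-school standard deviation `tau`, which controls how much pooling happens, is fixed by hand instead of being given a prior and inferred, and the group mean gets a flat prior. The function name `partial_pooling` is my own.

```python
import numpy as np

# Eight schools data (Rubin 1981; Gelman et al., Bayesian Data Analysis):
# estimated treatment effects and their standard errors.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def partial_pooling(y, se, tau):
    """Posterior means of the school effects, conditional on between-school sd tau.

    Model: y_j ~ Normal(theta_j, se_j), theta_j ~ Normal(mu, tau), flat prior on mu.
    mu is replaced by its conditional posterior mean given tau.
    """
    mu_hat = np.sum(y / (se**2 + tau**2)) / np.sum(1.0 / (se**2 + tau**2))
    shrink = (1.0 / tau**2) / (1.0 / se**2 + 1.0 / tau**2)  # weight on the group mean
    return shrink * mu_hat + (1.0 - shrink) * y

# tau near 0 reproduces "all schools are the same"; a huge tau reproduces
# "every school is separate"; intermediate values pool partially.
for tau in (1e-6, 5.0, 1e6):
    print(f"tau = {tau:g}:", np.round(partial_pooling(y, se, tau), 1))
```

Both simple models appear as limiting cases of the single hierarchical model, which is why no averaging between them is needed.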

That seems to be a bit of a conundrum: we need $p(x \mid D)$ but we can’t compute it? If we can’t compute $p(x \mid D)$, then what hope is there for statistics?

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-08-03T04:35:36.098Z

I am a little confused by what x is in your statement, and by why you think we can't compute the likelihood or posterior predictive. In most real problems we can't compute the posterior, but we can draw from it and thus approximate it via MCMC.

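Sorry! Bad notation... What I meant was that we can’t compute the conditional posterior predictive density $p(\tilde{y} \mid \tilde{x}, D)$, where $D$ is the observed data. We can compute $p(\tilde{y} \mid \tilde{x}, D, M)$, where $M$ is some model, approximately using MCMC by drawing samples from the parameter space of $M$, i.e. we can approximate the integral below using MCMC:

$$p(\tilde{y} \mid \tilde{x}, D, M) = \int_{\Theta} p(\tilde{y} \mid \tilde{x}, \theta, M)\, p(\theta \mid D, M)\, d\theta$$

where $\Theta$ is the parameter space of $M$. But the quantity that we are interested in is

$$p(\tilde{y} \mid \tilde{x}, D)$$

not $p(\tilde{y} \mid \tilde{x}, D, M)$ for a specific model, i.e. we need to marginalise over the unknown model. How can we do this?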

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-08-03T20:54:43.392Z

You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can calculate how well the posterior predictive describes the data to get a feel for how bad our model is :)

Ignoring the practical problems of Bayesian model averaging, isn’t assuming that either M1, M2, or M3 is true better than assuming that some model M is true? So Bayesian model averaging is always better, right (if it is practically possible)?

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-08-04T04:17:27.149Z

If there are 3 competing models, then ideally you can make a larger model where each submodel is realized by specific parameter combinations.

If M2 is simply M1 with an extra parameter b2, then you should put a stronger prior on b2 being zero in M2. If M3 is M1 with one parameter transformed, then you should have a parameter interpolating between the untransformed and transformed versions, so that you can learn that, say, 40-90% interpolation describes the data better.
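A minimal sketch of the interpolation idea, under assumptions of my own: the two competing models differ in whether a predictor enters as `x` or as `log(x)`, the noise standard deviation is known, and an interpolation weight `w` with a uniform prior on [0, 1] connects them. The data are simulated and the posterior for `w` is computed on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustration only): the truth sits 70% of the way from the
# untransformed predictor x to the transformed predictor log(x).
true_w, noise_sd = 0.7, 0.5
x = rng.uniform(1.0, 10.0, size=200)
y = (1 - true_w) * x + true_w * np.log(x) + rng.normal(0.0, noise_sd, size=x.size)

# Encompassing model: y ~ Normal((1 - w) * x + w * log(x), noise_sd), w ~ Uniform(0, 1).
# w = 0 recovers the untransformed model and w = 1 the transformed one.
w_grid = np.linspace(0.0, 1.0, 1001)
mean = (1 - w_grid[:, None]) * x[None, :] + w_grid[:, None] * np.log(x)[None, :]
log_post = -0.5 * np.sum(((y[None, :] - mean) / noise_sd) ** 2, axis=1)  # flat prior on w

post = np.exp(log_post - log_post.max())
post /= post.sum()                      # normalise over the grid

post_mean = np.sum(w_grid * post)
cdf = np.cumsum(post)
lo, hi = w_grid[np.searchsorted(cdf, 0.05)], w_grid[np.searchsorted(cdf, 0.95)]
print(f"posterior mean of w: {post_mean:.3f}, 90% interval: [{lo:.3f}, {hi:.3f}]")
```

Here the data pick out an intermediate value of `w` that neither of the two original models could represent, which is the advantage of the encompassing model over averaging the two endpoints.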

If it's impossible to translate between models like this then you can do model averaging, but it's a sign of you not understanding your data.

comment by Radford Neal · 2021-08-04T19:19:26.288Z

Yes, this is usually the right approach - use a single, more complex, model that has the various models you were considering as special cases.  It's likely that the best parameters of this extended model won't actually turn out to be one of the special cases.  (But note that this approach doesn't necessarily eliminate the need for careful consideration of the prior, since unwise priors for a single complex model can also cause problems.)

However, there are some situations where discrete models make sense.  For instance, you might be analysing old Roman coins, and be unsure whether they were all minted in one mint, or in two (or three, ...) different mints. There aren't really any intermediate possibilities between one mint and two.  Or you might be studying inheritance of two genes, and be considering two models in which they are either on the same chromosome or on different chromosomes.

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-08-05T20:50:10.457Z

Good points, but can't you still solve the discrete problem with a single model and a stick-breaking prior on the number of mints?

comment by Radford Neal · 2021-08-05T21:16:53.994Z

If you're thinking of a stick-breaking prior such as a Dirichlet process mixture model, they typically produce an infinite number of components (which would be mints, in this case), though of course only a finite number will be represented in your finite data set.  But we know that the number of mints producing coins in the Roman Empire was finite.  So that's not a reasonable prior (though of course you might sometimes be able to get away with using it anyway).
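The point about infinitely many components can be seen in a small sketch of the stick-breaking construction. The concentration parameter `alpha = 2.0` and the cut-off of 10 sticks are arbitrary choices for illustration: however many sticks are broken off, some of the stick always remains for further components.

```python
import numpy as np

def stick_breaking_weights(alpha, num_sticks, rng):
    """First num_sticks mixture weights of a stick-breaking construction."""
    v = rng.beta(1.0, alpha, size=num_sticks)                  # fraction broken off each time
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))   # stick left before each break
    return v * remaining[:-1], remaining[-1]

rng = np.random.default_rng(2)
weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=10, rng=rng)

print("first 10 component weights:", np.round(weights, 3))
print(f"mass left for components 11, 12, ...: {leftover:.4f}")
```

Because the leftover mass is positive after any finite number of breaks, the prior never rules out an additional mint, which is what makes it a mismatch for a quantity known to be finite.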

Ahhh... that makes a lot of sense.

I can't say what O'Hagan had in mind, but the reasons I have to be skeptical of results involving Bayesian model averaging are that model averaging makes sense only if you've been very, very careful in setting up the models, and you've also been very, very careful in specifying the prior distributions these models use for their parameters.  For some problems, being very, very careful may be beyond the capacity of human intellect.

Regarding the models:  For complex problems, it may be that none of the models you have defined represents the real phenomenon well, even approximately.  But the posterior model probabilities used in Bayesian model averaging assume that the true model is among those being considered.

If that's true (and the models have reasonable priors over their parameters), then model averaging - and its limit of model selection when the posterior probability of one model is close to one - is a sensible thing to do.  That's because the true model is always the best one to use, regardless of your purpose in doing inference.

However, if you're actually using a set of models that are all grossly inadequate, then which of these terrible models is best to use (or what weights it's best to average them with) depends on your purpose. For example, with non-linear regression models relating y to x, you might be interested in predicting y at new values of x that are negative, or in predicting y at new values of x that are positive. If you've got the true model, it's good for both positive and negative x.  But if all you've got are bad models, it may be that the one that's best for negative x is not the same as the one that's best for positive x. Bayesian model averaging takes no account of your purpose, and so can't possibly do the right thing when none of the models are good.
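A toy version of this situation, with a made-up truth and two deliberately bad candidate models (all of it assumed for illustration): the data follow y = |x| plus noise, model A predicts y = x and model B predicts y = -x. Which model is better depends entirely on where the predictions are needed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical truth: y = |x| + noise. Both candidate models are badly wrong.
x = rng.uniform(-3.0, 3.0, size=1000)
y = np.abs(x) + rng.normal(0.0, 0.2, size=x.size)

models = {
    "A: y = x": lambda x: x,
    "B: y = -x": lambda x: -x,
}

def mse(pred, target):
    """Mean squared prediction error."""
    return float(np.mean((pred - target) ** 2))

neg, pos = x < 0, x > 0
errors = {
    name: {
        "x < 0": mse(f(x[neg]), y[neg]),
        "x > 0": mse(f(x[pos]), y[pos]),
        "all x": mse(f(x), y),
    }
    for name, f in models.items()
}
for name, err in errors.items():
    print(name, {k: round(v, 2) for k, v in err.items()})
```

A single set of weights computed from the overall fit cannot be the right choice for both the negative-x and the positive-x prediction tasks, since each task calls for putting all the weight on a different model.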

Regarding priors:  The problem is not priors for the models themselves (assuming there aren't huge numbers of them), but rather priors for the parameters within each of the models.  (Note that different models may have different sets of parameters, so these priors are not necessarily parallel between models.) Once you have a fairly large amount of data, it's often the case that the exact prior for parameters that you choose isn't crucial for inference within a model - the posterior distribution for parameters may vary little over a wide class of reasonable priors (that aren't highly concentrated in some small region). You can often even get away with using an "improper" prior, such as a uniform distribution over the real numbers (which doesn't actually exist, of course).

But for computing model probabilities for use in Bayesian model averaging, the priors used for the parameters of each model are absolutely crucial.  Using an overly-vague prior in which probability is spread over a wide range of parameters that mostly don't fit the data very well will give a lower model probability than if you used a more well-considered prior, that put less probability on parameters that don't fit the data well (and that weren't really plausible even a priori). Using an improper prior for parameters will generally result in the model probability being zero, since there's zero prior probability for parameters that fit the data.
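The effect of vagueness on the model probability can be sketched in closed form for a simple case. The setup is an assumption for illustration: data y_i ~ Normal(mu, sigma) with known sigma and a Normal(0, tau) prior on mu. The evidence then factors into a term that does not depend on tau times the density of the sample mean under Normal(0, sqrt(sigma^2 / n + tau^2)), so only that second factor is computed. The summary values `ybar`, `n` and `sigma` are made up.

```python
import numpy as np

def log_evidence_factor(ybar, n, sigma, tau):
    """Log of the tau-dependent factor of the evidence.

    Model: y_i ~ Normal(mu, sigma), mu ~ Normal(0, tau). The factor is the
    density of ybar under Normal(0, sqrt(sigma^2 / n + tau^2)).
    """
    var = sigma**2 / n + tau**2
    return -0.5 * np.log(2 * np.pi * var) - ybar**2 / (2 * var)

ybar, n, sigma = 0.3, 50, 1.0            # hypothetical summary of the data
taus = [1.0, 10.0, 100.0, 1e3, 1e6]      # increasingly vague priors on mu
log_ev = [log_evidence_factor(ybar, n, sigma, tau) for tau in taus]

for tau, le in zip(taus, log_ev):
    print(f"prior sd {tau:>9g}: log evidence factor = {le:8.3f}")
```

Once the prior is much wider than the region supported by the data, every further tenfold widening costs roughly a factor of 10 in evidence, and in the improper limit of infinite width the evidence goes to zero.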

Especially when the parameter space is high-dimensional, it can be quite difficult to fully express your prior knowledge about which parameter values are plausible. With a lot of thought, you maybe can do fairly well.  But if you need to think hard, how can you tell whether you thought equally hard for each model? And, just thinking equally hard isn't even really enough - you need to have actually pinned down the prior that really expresses your prior knowledge, for every one of the models.  Most people doing Bayesian model averaging haven't done that.

comment by Kenny · 2021-08-02T16:16:22.487Z

The purpose of having priors is to compensate for lack of data, so that at least you are closer to the true model a posteriori, and to train faster, since model averaging would take longer than training a single model. Also, it's not that the true model is within the ensemble of models, but that you know beforehand that getting the true model is rather difficult, because of lack of data or just the sheer complexity and parameter count of the true model. If you have enough data, playing around with different priors wouldn't make any meaningful difference. I think when people talk about the true model, what they really mean is how close they are to the true model. There isn't really a way to know. Take a coin flip for example. You only have 50-50 if your flips are perfect, your coin is perfectly uniform, there is zero wind, etc. These details are neglected because they aren't really important theoretically, but the true model isn't theoretically perfect either, since it is supposed to reflect reality.