Interesting talk on Bayesians and frequentists
post by jsteinhardt · 2010-10-23T04:10:27.684Z · LW · GW · Legacy · 19 comments
I recently started watching an interesting lecture by Michael Jordan on Bayesians and frequentists; he's a pretty successful machine learning expert who takes both views in his work. You can watch it here: http://videolectures.net/mlss09uk_jordan_bfway/. I found it interesting because his portrayal of frequentism is quite different from the standard portrayal on lesswrong. It isn't about whether probabilities are frequencies or beliefs; it's about trying to get a good model versus trying to get rigorous guarantees of performance across a class of scenarios. So I wonder why the meme on lesswrong is that frequentists think probabilities are frequencies; in practice it seems to be more about how you approach a given problem. In fact, frequentists seem more "rational", as they're willing to use any tool that solves a problem instead of constraining themselves to methods that obey Bayes' rule.
In practice, it seems that while Bayes is the main tool for epistemic rationality, instrumental rationality should often be frequentist at the top level (with epistemic rationality, guided by Bayes, in turn guiding the specific application of a frequentist algorithm).
For instance, in many cases, once I have a sufficiently constrained search space, I should be willing to try different things until one of them works, without worrying about understanding why the specific thing I did worked (think shooting a basketball, or riffle shuffling a deck of cards). In practice, it seems like epistemic rationality is important for constraining a search space, and after that some sort of online learning algorithm can be applied to find the optimal action within that space. Of course, this breaks down when you only get one chance to do something, or when extreme precision is required, but those conditions are not common in everyday life.
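As a toy illustration of what I mean (nothing from the lecture, and the names here are made up), here is roughly what "try things within a constrained search space" could look like as code; `actions` is the already-constrained set of options and `try_action` is a hypothetical function that reports whether an attempt worked:

```python
import random

def trial_and_error(actions, try_action, n_trials=100, epsilon=0.1):
    """Toy online learner over a pre-constrained set of actions: mostly repeat
    whatever has worked best so far, occasionally try something else at random.
    It never needs to know *why* the winning action works."""
    successes = {a: 0 for a in actions}
    attempts = {a: 0 for a in actions}

    def success_rate(a):
        return successes[a] / attempts[a] if attempts[a] else 0.0

    for _ in range(n_trials):
        explore = random.random() < epsilon
        a = random.choice(actions) if explore else max(actions, key=success_rate)
        attempts[a] += 1
        successes[a] += 1 if try_action(a) else 0
    return max(actions, key=success_rate)
```

All of the epistemic work is in choosing `actions` in the first place; the loop itself is a blunt instrument that only cares about what works.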
The main point of this thread is to raise awareness of the actual distinction between Bayesians and frequentists, and of why it's actually reasonable to be both, since it seems like lesswrong is strongly Bayesian and there isn't much discussion of the fact that other methods exist.
19 comments
comment by ata · 2010-10-23T05:05:26.750Z · LW(p) · GW(p)
I think the fundamental insight of Bayesianism is that Bayes' Theorem is the law of inference, not (just) a normative law but a descriptive law — that frequentist methods and other statistical algorithms that make no mention of Bayes aren't cleverly circumventing it, they're implicitly using it. Any time you use some data to generate a belief about some proposition, if you use a method whose output is systematically correlated with reality at all, then you are using Bayes, just with certain assumptions and simplifications mixed in.
The failing of frequentism is not in the specific methods it uses — it is perfectly true that we need simplified methods in order to do much useful inference — but in its claim of "objectivity" that really consists of treating its assumptions and simplifications as though they don't exist, and in its reliance on experimenters' intuition in deciding which methods should be used (considering that different methods make different assumptions that lead to different results). Frequentist methods aren't (all) bad, frequentist epistemology is.
If I remember correctly, it is perfectly possible to create Bayesian formulations of most frequentist methods; of course, they will often still talk about things that Bayesians don't usually care about, like P-values, but they will nevertheless reveal the deductively-valid Bayes-structure of the path from your data to that result. Revealing frequentist methods' hidden structure is important because it lets us understand why they work — when they do work — and it lets us predict when they won't be as useful.
↑ comment by JohnDavidBustard · 2010-10-24T21:56:34.092Z · LW(p) · GW(p)
From what I understand, in order to apply Bayesian approaches in practical situations it is necessary to make assumptions which have no formal justification, such as the choice of prior distribution or the local similarity of analogue measures (so that similar but not exact predictions can be informative). This changes the problem without necessarily solving it. In addition, it doesn't address the issue of AI problems not based on repeated experience, e.g. automated theorem proving. The advantage of statistical approaches such as SVMs is that they produce practically beneficial results with limited parameters. With parameter search techniques they can achieve fully automated predictions that often have good experimental results. Regardless of whether Bayesianism is the law of inference, if such approaches cannot be applied automatically they are fundamentally incomplete and only as valid as the assumptions they are used with. If Bayesian approaches carry a fundamental advantage over these techniques, why is this not reflected in their practical performance on real-world AI problems such as face recognition?
Oh and bring on the down votes you theory loving zealots :)
↑ comment by jsteinhardt · 2010-10-25T02:20:14.762Z · LW(p) · GW(p)
Bayesian approaches tend to be more powerful than other statistical techniques in situations where there is a relatively limited supply of data. This is because Bayesian approaches, being model-based, tend to have a richer structure that allows them to take advantage of more of the structure of the data; a second reason is that Bayes allows for the explicit integration of prior assumptions and is therefore usually a more aggressive form of inference than most frequentist methods.
I tried to find a good paper demonstrating this (the problem is sometimes called "learning from one example"); unfortunately I only came across this PhD thesis --- http://www.cs.umass.edu/~elm/papers/thesis.pdf --- although there is certainly a lot of work being done on generalizing from one, or a small number of, examples.
↑ comment by JohnDavidBustard · 2010-10-25T14:41:51.336Z · LW(p) · GW(p)
Thanks for the reference; it is good to get down to some more specific examples.
Most AI techniques are model-based by necessity: it is not possible to generalise from samples unless the samples are used to inform the shape of a model, which then determines the properties of other samples. In effect, AI is model fitting. Bayesian techniques are one scheme for updating a model from data. I call them incomplete because they leave a lot of the intelligence in the hands of the user.
For example, in the referenced thesis the author designs a model of transformations on handwritten letters that (thanks to the author's intelligence) is similar to the set of transformations applied to numeric characters. The primary reason the technique is effective is that the author has constructed a good transformation. The only way to determine whether this is true is through experimentation; I doubt the Bayesian updating is contributing significantly to the results, and if another scheme such as an SVM were chosen I would expect it to produce similar recognition results.
The point is that the legitimacy or otherwise of the model parameter updating scheme is relatively insignificant in comparison to the difficulty of selecting a good model in the first place. As far as I am aware, since there is a potentially infinite set of models, Bayesian techniques cannot be applied to select between them, leaving the real intelligence to be provided by the user in the form of the model. In contrast, SVMs are an attempt to construct experimentally useful models from samples, and so are much closer to being intelligent in the sense of being able to produce good results with limited human interaction. However, neither technique addresses the fundamental difficulty of replicating the intelligence used by the author in creating the transformation in the first place. Fixating on a particular approach to model updating when model selection is not addressed is to miss the point; it may be meaningful for gambling problems, but for real AI challenges the difference it makes appears irrelevant to actual performance.
I would love to discuss what the real challenges of GAI are and explore ways of addressing them, but the posts on LW often seem to focus on seemingly obscure game theory or gambling-based problems which don't appear to be bringing us closer to a real solution. If the model selection problem can't be addressed, then there is no way to guarantee that, whatever we want an AI to value, it won't create an internal model that finds something merely similar (like paperclips) and decide to optimise for that instead.
Silently downvoting criticism of Bayesian probability without justification is not helpful either.
↑ comment by jsteinhardt · 2010-10-25T22:18:53.602Z · LW(p) · GW(p)
Model selection is definitely one of the biggest conceptual problems in GAI right now (I would say that planning once you have a model is of comparable importance / difficulty). I think the way to solve this sort of problem is by having humans carefully pick a really good model (flexible enough to capture even unexpected situations while still structured enough to make useful predictions). Even with SVMs you are implicitly assuming some sort of structure on the data, because you usually transform your inputs into some higher-dimensional space consisting of what you see as useful features in the data.
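To make that implicit-structure point concrete, here is a minimal sketch (my own toy example, not any particular library's API) of the kind of hand-chosen feature map that smuggles modelling assumptions into a supposedly assumption-free method:

```python
import numpy as np

def poly_features(x):
    """Map a 2-D input into a hand-picked higher-dimensional feature space.
    A linear separator in this space corresponds to a quadratic boundary in the
    original space -- that choice is itself a modelling assumption."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
```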
Even though picking the model is the hard part, using Bayes by default seems like a good idea because it is the only general method I know of for combining all of my assumptions without having to make additional arbitrary choices about how everything should fit together. If there are other methods, I would be interested in learning about them.
What would the "really good model" for a GAI look like? Ideally it should capture our intuitive notions of what sorts of things go on in the world without imposing constraints that we don't want. Examples of these intuitions: superficially similar objects tend to come from the same generative process (so if A and B are similar in ways X and Y, and C is similar to both A and B in way X, then we would expect C to be similar to A and B in way Y as well); temporal locality and spatial locality underlie many types of causality (so if we are trying to infer an input-output relationship, it should be highly correlated over inputs that are close in space/time); and, as a more concrete example, linear momentum tends to persist over short time scales. A lot of work has been done in the past decade on formalizing such intuitions, leading to nonparametric models such as Dirichlet processes and Gaussian processes. See for instance David Blei's class on Bayesian nonparametrics (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/index.html) or Michael Jordan's tutorial on Dirichlet processes (http://www.cs.berkeley.edu/~jordan/papers/pearl-festschrift.pdf).
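As a rough illustration of the "similar objects share a generative process" intuition, here is a minimal sketch of the Chinese Restaurant Process that underlies Dirichlet process mixtures (my own toy code, not taken from either of the references above); the number of clusters is not fixed in advance, but new objects still preferentially join clusters that are already large:

```python
import random

def crp_assignments(n_objects, alpha=1.0):
    """Sample cluster assignments from a Chinese Restaurant Process: each new
    object joins an existing cluster with probability proportional to that
    cluster's size, or starts a new cluster with probability proportional to alpha."""
    sizes = []         # sizes[k] = number of objects already in cluster k
    assignments = []
    for _ in range(n_objects):
        weights = sizes + [alpha]
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(sizes):
            sizes.append(1)    # open a new cluster
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments
```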
I'm beginning to think that a top-level post on how Bayes is actually used in machine learning would be helpful. Perhaps I will make one when I have a bit more time. Also, does anyone happen to know how to collapse URLs in posts (e.g. the equivalent of an <a href="...">test</a> link in HTML)?
↑ comment by JohnDavidBustard · 2010-10-26T14:50:41.995Z · LW(p) · GW(p)
A high-level post on its use would be very interesting.
I think my main criticism of the Bayes approach is that it leads to the kind of work you are suggesting, i.e. have a person construct a model and then have a machine calculate its parameters.
I think that much of what we value in intelligent people is their ability to form the model themselves. By focusing on parameter updating we aren't developing the AI techniques necessary for intelligent behavior. In addition, because correct updating does not guarantee good performance (the model properties dominate), we will always have to judge methods based on experimental results.
Because we always come back to experimental results, whatever general AI strategy we develop, its structure is more likely to be one that searches for new ways to learn (with Bayesian model updating and SVMs as examples) and validates these strategies using experimental data (replicating the behaviour of the AI field as a whole).
I find it useful to think about how people solve problems and examine the huge gulf between specific learning techniques and these approaches. For example, to replicate a Bayesian AI researcher, an AI needs to take a small amount of data and an incomplete informal model of the process that generates it (e.g. based on informal metaphors of physical processes the author is familiar with), then find a way of formalising this informal model (so that its behaviour under all conditions can be calculated), possibly doing some theorem proving to investigate properties of the model. It then applies potentially standard techniques to determine the model's parameters and judges its worth based on experiment (potentially repeating the whole process if it doesn't work).
By focusing on Bayesian approaches we aren't developing techniques that can replicate this kind of lateral, creative thinking. Saying there is only one valid form of inference is absurd because it doesn't address these problems.
I feel that trying to force our problems to suit our tools is unlikely to make much progress. For example, unless we can model (and therefore largely solve) all of the problems we want an AI to address, we can't create a "Really Good Model".
Rather than manually developing formalisations of specific forms of similarity, we need an algorithm to learn different types of similarity and then construct the formalisation itself (or not, as I don't think we actually formalise our notions of similarity and yet can still solve problems).
Automated theorem proving is a good example where the problems are well defined yet unique, so any algorithm that can construct proofs needs to see meta-patterns in other proofs and apply them. This brings home the difficulty of identifying what it means for things to be similar, and also emphasises the incompleteness of a probabilistic approach: the proof that the AI is trying to construct has never been encountered before, so in order for it to benefit from experience it needs to invent a type of similarity to map the current problem to past ones.
↑ comment by jsteinhardt · 2010-10-27T01:59:02.389Z · LW(p) · GW(p)
But even "learning to learn" is done in the context of a model, it's just a higher-level model. There are in fact models that allow experience gained in one area to generalize to other areas (by saying that the same sorts of structures that are helpful for explaining things in one area should be considered in that other area). Talking about what an AI researcher would do is asking much more out of an AI than one would ask out of a human. If we could get an AI to even be as intelligent as a 3-year-old child then we would be more or less done. People don't develop sophisticated problem solving skills until at least high school age, so it seems difficult to believe that such a problem is fundamental to AGI.
Another reference, this time on learning to learn, although unfortunately it is behind a pay barrier (Tenenbaum, Goodman, Kemp, "Learning to learn causal models").
It appears that there is also a book on more general (mostly non-Bayesian) techniques for learning to learn: Sebastian Thrun's book. I found the latter just by googling, so I have no idea what's actually in it beyond what I got from skimming the chapter descriptions. It's also not available online.
↑ comment by jsalvatier · 2010-11-02T15:26:51.186Z · LW(p) · GW(p)
Is model selection really a big problem? I thought that there was a conceptually simple way to incorporate this into a model (just add a model index parameter), though it might be computationally tricky sometimes. As JohnDavidBustard points out below, the real difficulty seems to be model creation, though I suppose you can frame this as model selection if you have some prior over a broad enough category of models (say, all Turing machines).
↑ comment by jsteinhardt · 2010-11-04T03:51:11.361Z · LW(p) · GW(p)
It depends on what you mean by model selection. If you mean e.g. figuring out whether to use quadratics or cubics, then the standard solution that people cite is to use Bayesian Occam's razor, i.e. compute
p(Cubic | Data)/p(Quadratic | Data) = p(Data | Cubic)/p(Data | Quadratic) * p(Cubic)/p(Quadratic)
where we compute the probabilities on the right-hand side by marginalizing over all cubics and quadratics. But the number you get out of this will depend strongly on how quickly the tails decay on your distributions over cubics and quadratics, so I don't find this particularly satisfying. (I'm not alone in this, although there are people who would disagree with me or propose various methods for choosing the prior distributions appropriately.)
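To make the tail-sensitivity point concrete, here is a rough Monte Carlo sketch of the marginal likelihoods entering the ratio above (my own toy code; the zero-mean Gaussian prior on coefficients, `prior_scale`, and the fixed noise level are all assumptions made for illustration):

```python
import numpy as np
from scipy import stats

def log_marginal_likelihood(x, y, degree, prior_scale, noise_sd=0.1, n_samples=20000):
    """Monte Carlo estimate of log p(Data | model) for a polynomial model,
    averaging the likelihood over coefficients drawn from the prior."""
    rng = np.random.default_rng(0)
    X = np.vander(x, degree + 1)                      # design matrix, shape (n, degree + 1)
    coefs = rng.normal(0.0, prior_scale, size=(n_samples, degree + 1))
    preds = coefs @ X.T                               # shape (n_samples, n)
    log_liks = stats.norm.logpdf(y, loc=preds, scale=noise_sd).sum(axis=1)
    m = log_liks.max()
    return m + np.log(np.mean(np.exp(log_liks - m)))  # log-mean-exp for stability

# log Bayes factor for cubic vs quadratic on data (x, y):
# log_marginal_likelihood(x, y, 3, s) - log_marginal_likelihood(x, y, 2, s)
```

Varying `prior_scale` (i.e. how heavy the prior's tails are) can swing the resulting Bayes factor substantially, which is the sense in which the answer depends on the prior rather than just the data.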
If you mean something else, like figuring out what specific model to pick out from your entire space (e.g. picking a specific function to fit your data), then you can run into problems like having to compare probability masses to probability densities, or comparing measures with different dimensionality (e.g. densities on the line versus the plane); a more fundamental issue is that picking a specific model potentially ignores other features of your posterior distribution, like how concentrated the probability mass is about that model.
I would say that the most principled way to get a single model out at the end of the day is variational inference, which basically attempts to set parameters in order to minimize the relative entropy between the distribution implied by the parameters and the actual posterior distribution. I don't know a whole lot about this area, other than a couple papers I read, but it does seem like a good way to perform inference if you'd like to restrict yourself to considering a single model at a time.
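For what it's worth, here is a rough sketch of the quantity being optimized (my own illustration, assuming a one-dimensional Gaussian variational family for simplicity); minimizing this negative ELBO over the variational parameters is equivalent to minimizing the relative entropy between q and the true posterior, since the two objectives differ only by the constant log-evidence term:

```python
import numpy as np
from scipy import stats

def neg_elbo(q_mean, q_sd, log_joint, n_samples=5000, seed=0):
    """Monte Carlo estimate of the negative ELBO, E_q[log q(theta) - log p(theta, data)],
    for a Gaussian variational distribution q(theta). Because
    -ELBO = KL(q || posterior) - log p(data), minimizing this over (q_mean, q_sd)
    minimizes the relative entropy to the actual posterior."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(q_mean, q_sd, size=n_samples)
    log_q = stats.norm.logpdf(theta, q_mean, q_sd)
    return float(np.mean(log_q - log_joint(theta)))

# log_joint is assumed to return log p(theta, data) for a vector of theta samples,
# e.g. a Gaussian log-prior plus an i.i.d. log-likelihood summed over the data.
```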
↑ comment by jsalvatier · 2010-11-04T15:45:05.455Z · LW(p) · GW(p)
OK, so you're saying that a big problem in model selection is coming up with good prior distributions for different classes of models, specifically those with different tail decays (it sounds like you think it could also be that the standard Bayes framework is missing something). This is an interesting idea which I had heard about before, but didn't understand till now. Thank you for telling me about it.
I would say that when you have a somewhat dispersed posterior it is simply misleading to pick any specific model+parameters as your fit. The correct thing to do is average over possible models+parameters.
It's only when you have a relatively narrow posterior, or when the error bars on the estimate you give for some parameter or prediction don't matter, that it's OK to select a single model.
↑ comment by jsteinhardt · 2010-11-04T22:29:24.389Z · LW(p) · GW(p)
I think I basically agree with you on that; whenever feasible the full posterior (as opposed to the maximum-likelihood model) is what you should be using. So instead of using "Bayesian model selection" to decide whether to pick cubics or quadratics, and then fitting the best cubic or the best quadratic depending on the answer, the "right" thing to do is to just look at the posterior distribution over possible functions f, and use that to get a posterior distribution over f(x) for any given x.
The problem is that this is not always reasonable for the application you have in mind, and I'm not sure if we have good general methods for coming up with the right way to get a good approximation. But certainly an average over the models is what we should be trying to approximate.
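Concretely, the target of the approximation is something like the following (a hypothetical sketch; `models`, `weights`, and `.predict` are stand-ins rather than any particular library's API):

```python
def predictive_mean(x_star, models, weights):
    """Posterior-averaged prediction at a new input x_star: each candidate model's
    prediction is weighted by its posterior probability, instead of committing to
    the single best-scoring model."""
    return sum(w * m.predict(x_star) for m, w in zip(models, weights))
```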
↑ comment by jsteinhardt · 2010-10-23T07:24:47.589Z · LW(p) · GW(p)
But my understanding is that there are also frequentist methods that have no reasonable Bayesian interpretation (for instance because they don't satisfy coherence -- http://en.wikipedia.org/wiki/Coherence_(philosophical_gambling_strategy)) but have a rigorous guarantee on performance. Unfortunately, I can't think of any good examples off the top of my head, although Jordan gave SVMs as one; I don't know enough about them to know if that is actually a reasonable example or not.
↑ comment by DanielLC · 2011-08-30T06:24:50.035Z · LW(p) · GW(p)
But my understanding is that there are also frequentist methods that have no reasonable Bayesian interpretation
This is impossible. See Searching for Bayes-Structure. It may be difficult to find a reasonable Bayesian interpretation, and it may only approximate said interpretation, but if it's at all useful, it will have one.
↑ comment by Academian · 2011-08-30T14:36:00.085Z · LW(p) · GW(p)
It may be difficult to find a reasonable Bayesian interpretation, and it may only approximate said interpretation, but if it's at all useful, it will have one.
Observation: This theory that you've stated here --- that any useful frequentist method will have a Bayesian interpretation --- doesn't serve much in the way of controlled anticipation. Because there is so much flexibility in choosing priors and a loss function, the fact that "every useful frequentist method will be a Bayes method in disguise" doesn't tell us much about what frequentist methods will turn out to be useful.
It seems to me that the wisdom to treat beliefs as anticipation-controllers is more general, and I think more important, than the choice of Bayesian vs frequentist inference methods. Each school has its own heuristic for quantifying this wisdom.
As for Bayesian vs Frequentist interpretations of what the word "probability" means, I think that's a different (and sillier) debate.
↑ comment by DanielLC · 2011-08-30T17:16:28.782Z · LW(p) · GW(p)
This theory that you've stated here --- that any useful frequentist method will have a Bayesian interpretation --- doesn't serve much in the way of controlled anticipation.
A frequentist tool only works insofar as it approximates a Bayesian approach. As such, given the domain in which it works well, you can prove that it approximates the Bayesian answer.
For example, if you're trying to find the probability of a repeatable event ending in success, the frequentist method says to use success/total. The Bayesian approach with a Jeffreys prior gives (success + 0.5)/(total + 1), and a uniform prior gives (success + 1)/(total + 2). It can be shown that, with a sufficient number of successes and failures, these will work out similarly. It's well known that with very few successes or very few failures, the frequentist version doesn't work very well.
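For concreteness, here is a minimal sketch of the estimators being compared (my own illustration; the uniform prior gives Laplace's rule of succession, while the Jeffreys prior gives the +0.5 version):

```python
def frequentist_estimate(successes, total):
    return successes / total              # undefined for total == 0, extreme for small samples

def laplace_estimate(successes, total):
    # posterior mean under a uniform Beta(1, 1) prior (Laplace's rule of succession)
    return (successes + 1) / (total + 2)

def jeffreys_estimate(successes, total):
    # posterior mean under the Jeffreys Beta(1/2, 1/2) prior
    return (successes + 0.5) / (total + 1)

# With 0 successes out of 3 trials: the frequentist estimate is 0.0, the Laplace
# estimate is 0.2, and the Jeffreys estimate is 0.125 -- the Bayesian estimates
# avoid assigning zero probability after a short run of failures.
```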
↑ comment by jsteinhardt · 2011-09-01T12:53:02.184Z · LW(p) · GW(p)
the frequentist method says to use success/total
This is false (as explained in the linked-to video). If nothing else, the frequentist answer depends on the loss function (as does the Bayesian answer, although the posterior distribution is a way of summarising the answer simultaneously for all loss functions).
I think you're taking the frequentist interpretation of what a probability is and trying to forcibly extend it to the entire frequentist decision theory. As far as the "frequentist interpretation of probability" goes, I have never met a single statistician who even explicitly identified "probabilities as frequencies" as a possible belief to hold, much less claimed to hold it themselves. As far as I can tell, this whole "probabilities as frequencies" thing is unique to LessWrong.
Everyone I've ever met who identified as a frequentist meant "not strictly Bayesian". Whenever a method was identified as frequentist, it either meant "not strictly Bayesian" or else that it was adopting the decision theory described in Michael Jordan's lecture.
In fact, the frequentist approach (not as you've defined it, but as the term is actually used by statisticians) is used precisely because it works extremely well in certain circumstances (for instance, cross-validation). This is, I believe, what Mike is arguing for when he says that a mix of Bayesian and frequentist techniques is necessary.
↑ comment by jsteinhardt · 2011-09-01T12:38:13.557Z · LW(p) · GW(p)
Thanks for the link. That is a good point. I agree that every useful method has to have some amount of information-theoretic overlap with Bayes, but that overlap could be small and still be useful; we reach most conclusions only after there is overwhelming evidence in favor of them, so one could do as well as humans while only having a small amount of mutual information with proper Bayesian updating (or indeed without ever even working with a Bayesian model).