A Fervent Defense of Frequentist Statistics

post by jsteinhardt · 2014-02-18T20:08:48.833Z · LW · GW · Legacy · 129 comments


[Highlights for the busy: de-bunking standard "Bayes is optimal" arguments; frequentist Solomonoff induction; and a description of the online learning framework. Note: cross-posted from my blog.]

Short summary. This essay makes many points, each of which I think is worth reading, but if you are only going to understand one point I think it should be “Myth 5” below, which describes the online learning framework as a response to the claim that frequentist methods need to make strong modeling assumptions. Among other things, online learning allows me to perform the following remarkable feat: if I’m betting on horses, and I get to place bets after watching other people bet but before seeing which horse wins the race, then I can guarantee that after a relatively small number of races, I will do almost as well overall as the best other person, even if the number of other people is very large (say, 1 billion), and their performance is correlated in complicated ways.

If you’re only going to understand two points, then also read about the frequentist version of Solomonoff induction, which is described in “Myth 6”.

Main article. I’ve already written one essay on Bayesian vs. frequentist statistics. In that essay, I argued for a balanced, pragmatic approach in which we think of the two families of methods as a collection of tools to be used as appropriate. Since I’m currently feeling contrarian, this essay will be far less balanced and will argue explicitly against Bayesian methods and in favor of frequentist methods. I hope this will be forgiven as so much other writing goes in the opposite direction of unabashedly defending Bayes. I should note that this essay is partially inspired by some of Cosma Shalizi’s blog posts, such as this one.

This essay will start by listing a series of myths, then debunk them one-by-one. My main motivation for this is that Bayesian approaches seem to be highly popularized, to the point that one may get the impression that they are the uncontroversially superior method of doing statistics. I actually think the opposite is true: I think most statisticians would for the most part defend frequentist methods, although there are also many departments that are decidedly Bayesian (e.g. many places in England, as well as some U.S. universities like Columbia). I have a lot of respect for many of the people at these universities, such as Andrew Gelman and Philip Dawid, but I worry that many of the other proponents of Bayes (most of them non-statisticians) tend to oversell Bayesian methods or undersell alternative methodologies.

If you are like me from, say, two years ago, you are firmly convinced that Bayesian methods are superior and that you have knockdown arguments in favor of this. If this is the case, then I hope this essay will give you an experience that I myself found life-altering: the experience of having a way of thinking that seemed unquestionably true slowly dissolve into just one of many imperfect models of reality. This experience helped me gain more explicit appreciation for the skill of viewing the world from many different angles, and of distinguishing between a very successful paradigm and reality.

If you are not like me, then you may have had the experience of bringing up one of many reasonable objections to normative Bayesian epistemology, and having it shot down by one of many “standard” arguments that seem wrong but not for easy-to-articulate reasons. I hope to lend some reprieve to those of you in this camp, by providing a collection of “standard” replies to these standard arguments.

I will start with the myths (and responses) that I think will require the least technical background and be most interesting to a general audience. Toward the end, I deal with some attacks on frequentist methods that I believe amount to technical claims that are demonstrably false; doing so involves more math. Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

The Myths:

Myth 1: Bayesian methods are optimal. Presumably when most people say this they are thinking of either Dutch-booking or the complete class theorem. Roughly what these say are the following:

Dutch-book argument: Every coherent set of beliefs can be modeled as a subjective probability distribution. (Roughly, coherent means “unable to be Dutch-booked”.)

Complete class theorem: Every non-Bayesian method is worse than some Bayesian method (in the sense of performing deterministically at least as poorly in every possible world).

Let’s unpack both of these. My high-level argument regarding Dutch books is that I would much rather spend my time trying to correspond with reality than trying to be internally consistent. More concretely, the Dutch-book argument says that if for every bet you force me to take one side or the other, then unless I’m Bayesian there’s a collection of bets that will cause me to lose money for sure. I don’t find this very compelling. This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!” This just doesn’t seem worth the quant’s time: it will slow down the code substantially, and instead the quant should be writing the next program to make thousands more dollars. This is basically what Dutch-booking arguments seem like to me.

Moving on, the complete class theorem says that for any decision rule, I can do better by replacing it with some Bayesian decision rule. But this injunction is not useful in practice, because it doesn’t say anything about which decision rule I should replace it with. Of course, if you hand me a decision rule and give me infinite computational resources, then I can hand you back a Bayesian method that will perform better. But it still might not perform well. All the complete class theorem says is that every local optimum is Bayesian. To be a useful theory of epistemology, I need a prescription for how, in the first place, I am to arrive at a good decision rule, not just a locally optimal one. And this is something that frequentist methods do provide, to a far greater extent than Bayesian methods (for instance by using minimax decision rules such as the maximum-entropy example given later). Note also that many frequentist methods do correspond to a Bayesian method for some appropriately chosen prior. But the crucial point is that the frequentist told me how to pick a prior I would be happy with (also, many frequentist methods don’t correspond to a Bayesian method for any choice of prior; they nevertheless often perform quite well).

Myth 2: Bayesian methods are optimal except for computational considerations. We already covered this in the previous point under the complete class theorem, but to re-iterate: Bayesian methods are locally optimal, not globally optimal. Identifying all the local optima is very different from knowing which of them is the global optimum. I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

Myth 3: We can deal with computational constraints simply by making approximations to Bayes. I have rarely seen this borne out in practice. Here’s a challenge: suppose I give you data generated in the following way. There are a collection of vectors {x_1}, {x_2}, {\ldots}, {x_{10,000}}, each with {10^6} coordinates. I generate outputs {y_1}, {y_2}, {\ldots}, {y_{10,000}} in the following way. First I globally select {100} of the {10^6} coordinates uniformly at random, then I select a fixed vector {u} such that those {100} coordinates are drawn from i.i.d. Gaussians and the rest of the coordinates are zero. Now I set {y_n = u^{\top}x_n} (i.e. {y_n} is the dot product of {u} with {x_n}). You are given {x} and {y}, and your job is to infer {u}. This is a completely well-specified problem; the only task remaining is computational. I know people who have solved this problem using Bayesian methods with approximate inference. I have respect for these people, because doing so is no easy task. I think very few of them would say that “we can just approximate Bayesian updating and be fine”. (Also, this particular problem can be solved trivially with frequentist methods.)
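As a concrete sketch of that last parenthetical (my illustration, not the author's code), here is L1-regularized least squares (the Lasso) recovering {u}. The dimensions are scaled down from the problem statement so the example runs quickly, and the penalty strength is an arbitrary illustrative choice.

```python
# A sketch of the frequentist approach to the sparse-recovery challenge above:
# L1-regularized least squares (the Lasso). Dimensions are scaled down from the
# text (10^6 coordinates, 10,000 examples, 100 nonzeros) so the example runs fast;
# the penalty strength alpha is an arbitrary illustrative choice.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_coords, n_nonzero = 500, 2000, 20

# Sparse ground-truth vector u: a few Gaussian coordinates, the rest exactly zero.
u_true = np.zeros(n_coords)
support = rng.choice(n_coords, size=n_nonzero, replace=False)
u_true[support] = rng.normal(size=n_nonzero)

X = rng.normal(size=(n_samples, n_coords))  # the vectors x_n
y = X @ u_true                              # outputs y_n = u . x_n (noiseless)

model = Lasso(alpha=0.01, max_iter=10000).fit(X, y)
u_hat = model.coef_

recovered = np.flatnonzero(np.abs(u_hat) > 1e-2)
print(f"{len(set(recovered) & set(support))} of {n_nonzero} true coordinates found")
print("max coefficient error:", np.abs(u_hat - u_true).max())
```

The point is just that the sparsity structure is handled directly by the penalty; no posterior over the coordinates is ever represented.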

A particularly egregious example of this is when people talk about “computable approximations to Solomonoff induction” or “computable approximations to AIXI” as if such notions were meaningful.

Myth 4: the prior isn’t a big deal because Bayesians can always share likelihood ratios. Putting aside the practical issue that there would in general be an infinite number of likelihood ratios to share, there is the larger issue that for any hypothesis {h}, there is also the hypothesis {h'} that matches {h} exactly up to now, and then predicts the opposite of {h} at all points in the future. You have to constrain model complexity at some point, the question is about how. To put this another way, sharing my likelihood ratios without also constraining model complexity (by focusing on a subset of all logically possible hypotheses) would be equivalent to just sharing all sensory data I’ve ever accrued in my life. To the extent that such a notion is even possible, I certainly don’t need to be a Bayesian to do such a thing.

Myth 5: frequentist methods need to assume their model is correct or that the data are i.i.d. Understanding the content of this section is the most important single insight to gain from this essay. For some reason it’s assumed that frequentist methods need to make strong assumptions (such as Gaussianity), whereas Bayesian methods are somehow immune to this. In reality, the opposite is true. While there are many beautiful and deep frequentist formalisms that answer this, I will choose to focus on one of my favorites, which is online learning.

To explain the online learning framework, let us suppose that our data are {(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)}. We don’t observe {y_t} until after making a prediction {z_t} of what {y_t} will be, and then we receive a penalty {L(y_t, z_t)} based on how incorrect we were. So we can think of this as receiving prediction problems one-by-one, and in particular we make no assumptions about the relationship between the different problems; they could be i.i.d., they could be positively correlated, they could be anti-correlated, they could even be adversarially chosen.

As a running example, suppose that I’m betting on horses and before each race there are {n} other people who give me advice on which horse to bet on. I know nothing about horses, so based on this advice I’d like to devise a good betting strategy. In this case, {x_t} would be the {n} bets that each of the other people recommend, {z_t} would be the horse that I actually bet on, and {y_t} would be the horse that actually wins the race. Then, supposing that {y_t = z_t} (i.e., the horse I bet on actually wins), {L(y_t, z_t)} is the negative of the payoff from correctly betting on that horse. Otherwise, if the horse I bet on doesn’t win, {L(y_t, z_t)} is the cost I had to pay to place the bet.

If I’m in this setting, what guarantee can I hope for? I might ask for an algorithm that is guaranteed to make good bets — but this seems impossible unless the people advising me actually know something about horses. Or, at the very least, one of the people advising me knows something. Motivated by this, I define my regret to be the difference between my penalty and the penalty of the best of the {n} people (note that I only have access to the latter after all {T} rounds of betting). More formally, given a class {\mathcal{M}} of predictors {h : x \mapsto z}, I define

\displaystyle \mathrm{Regret}(T) = \frac{1}{T} \sum_{t=1}^T L(y_t, z_t) - \min_{h \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^T L(y_t, h(x_t))

In this case, {\mathcal{M}} would have size {n} and the {i}th predictor would just always follow the advice of person {i}. The regret is then how much worse I do on average than the best expert. A remarkable fact is that, in this case, there is a strategy such that {\mathrm{Regret}(T)} shrinks at a rate of {\sqrt{\frac{\log(n)}{T}}}. In other words, I can have an average score within {\epsilon} of the best advisor after {\frac{\log(n)}{\epsilon^2}} rounds of betting.

One reason that this is remarkable is that it does not depend at all on how the data are distributed; the data could be i.i.d., positively correlated, negatively correlated, even adversarial, and one can still construct an (adaptive) prediction rule that does almost as well as the best predictor in the family.

To be even more concrete, if we assume that all costs and payoffs are bounded by {\$1} per round, and that there are {1,000,000,000} people in total, then an explicit upper bound is that after {28/\epsilon^2} rounds, we will be within {\epsilon} dollars on average of the best other person. Under slightly stronger assumptions, we can do even better, for instance if the best person has an average variance of {0.1} about their mean, then the {28} can be replaced with {4.5}.
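For concreteness, here is a minimal sketch (my illustration, not from the original post) of the exponential-weights ("Hedge") strategy that achieves this kind of bound for per-round losses in {[0, 1]}; the loss matrix below is synthetic, and the learning rate is a standard textbook choice rather than anything assumed in the text.

```python
# A sketch of the exponential-weights ("Hedge") strategy behind these regret
# bounds, for n experts with per-round losses in [0, 1]. The loss matrix here is
# synthetic; nothing about the guarantee requires it to be i.i.d.
import numpy as np

rng = np.random.default_rng(0)
n_experts, T = 1000, 5000
eta = np.sqrt(8 * np.log(n_experts) / T)    # standard learning-rate choice

losses = rng.uniform(0, 1, size=(T, n_experts))
losses[:, 0] = rng.uniform(0, 0.3, size=T)  # make expert 0 consistently good

weights = np.ones(n_experts)
my_loss = 0.0
for t in range(T):
    p = weights / weights.sum()             # current distribution over experts
    my_loss += p @ losses[t]                # (expected) loss of following p
    weights *= np.exp(-eta * losses[t])     # downweight experts that did badly

best = losses.sum(axis=0).min()
print("average regret:", (my_loss - best) / T)
print("sqrt(log(n)/T) for comparison:", np.sqrt(np.log(n_experts) / T))
```

Note that the update only ever multiplies weights by a function of the observed losses, which is why no assumption about how the losses are generated is needed.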

It is important to note that the betting scenario is just a running example, and one can still obtain regret bounds under fairly general scenarios; {\mathcal{M}} could be continuous and {L} could have quite general structure; the only technical assumption is that {\mathcal{M}} be a convex set and that {L} be a convex function of {z}. These assumptions tend to be easy to satisfy, though I have run into a few situations where they end up being problematic, mainly for computational reasons. For an {n}-dimensional model family, typically {\mathrm{Regret}(T)} decreases at a rate of {\sqrt{\frac{n}{T}}}, although under additional assumptions this can be reduced to {\sqrt{\frac{\log(n)}{T}}}, as in the betting example above. I would consider this reduction to be one of the crowning results of modern frequentist statistics.

Yes, these guarantees sound incredibly awesome and perhaps too good to be true. They actually are that awesome, and they are actually true. The work is being done by measuring the error relative to the best model in the model family. We aren’t required to do well in an absolute sense, we just need to not do any worse than the best model. So as long as at least one of the models in our family makes good predictions, we will as well. This is really what statistics is meant to be doing: you come up with everything you imagine could possibly be reasonable, and hand it to me, and then I come up with an algorithm that will figure out which of the things you handed me was most reasonable, and will do almost as well as that. As long as at least one of the things you come up with is good, then my algorithm will do well. Importantly, due to the {\log(n)} dependence on the dimension of the model family, you can actually write down extremely broad classes of models and I will still successfully sift through them.

Let me stress again: regret bounds are saying that, no matter how the {x_t} and {y_t} are related, no i.i.d. assumptions anywhere in sight, we will do almost as well as any predictor {h} in {\mathcal{M}} (in particular, almost as well as the best predictor).

Myth 6: frequentist methods can only deal with simple models and need to make arbitrary cutoffs in model complexity. A naive perusal of the literature might lead one to believe that frequentists only ever consider very simple models, because many discussions center on linear and log-linear models. To dispel this, I will first note that there are just as many discussions that focus on much more general properties such as convexity and smoothness, and that can achieve comparably good bounds in many cases. But more importantly, the reason we focus so much on linear models is because we have already reduced a large family of problems to (log-)linear regression. The key insight, and I think one of the most important insights in all of applied mathematics, is that of featurization: given a non-linear problem, we can often embed it into a higher-dimensional linear problem, via a feature map {\phi : X \rightarrow \mathbb{R}^n} ({\mathbb{R}^n} denotes {n}-dimensional space, i.e. vectors of real numbers of length {n}). For instance, if I think that {y} is a polynomial (say cubic) function of {x}, I can apply the mapping {\phi(x) = (1, x, x^2, x^3)}, and now look for a linear relationship between {y} and {\phi(x)}.
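As a small sketch (assuming scikit-learn, purely for illustration), the cubic example looks like this: expand {x} into {\phi(x) = (1, x, x^2, x^3)} and run ordinary linear regression on the expanded features.

```python
# A sketch of featurization: fit a cubic relationship with an ordinary linear
# model by first mapping x to phi(x) = (1, x, x^2, x^3).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(0, 0.1, size=200)

phi = PolynomialFeatures(degree=3).fit_transform(x)   # columns: 1, x, x^2, x^3
model = LinearRegression(fit_intercept=False).fit(phi, y)
print("coefficients on (1, x, x^2, x^3):", model.coef_.round(2))
```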

This insight extends far beyond polynomials. In combinatorial domains such as natural language, it is common to use indicator features: features that are {1} if a certain event occurs and {0} otherwise. For instance, I might have an indicator feature for whether two words appear consecutively in a sentence, whether two parts of speech are adjacent in a syntax tree, or for what part of speech a word has. Almost all state of the art systems in natural language processing work by solving a relatively simple regression task (typically either log-linear or max-margin) over a rich feature space (often involving hundreds of thousands or millions of features, i.e. an embedding into {\mathbb{R}^{10^5}} or {\mathbb{R}^{10^6}}).

A counter-argument to the previous point could be: “Sure, you could create a high-dimensional family of models, but it’s still a parameterized family. I don’t want to be stuck with a parameterized family, I want my family to include all Turing machines!” Putting aside for a second the question of whether “all Turing machines” is a well-advised model choice, this is something that a frequentist approach can handle just fine, using a tool called regularization, which after featurization is the second most important idea in statistics.

Specifically, given any sufficiently quickly growing function {\psi(h)}, one can show that, given {T} data points, there is a strategy whose average error is at most {\sqrt{\frac{\psi(h)}{T}}} worse than any estimator {h}. This can hold even if the model class {\mathcal{M}} is infinite dimensional. For instance, if {\mathcal{M}} consists of all probability distributions over Turing machines, and we let {h_i} denote the probability mass placed on the {i}th Turing machine, then a valid regularizer {\psi} would be

\displaystyle \psi(h) = \sum_i h_i \log(i^2 \cdot h_i)

If we consider this, then we see that, for any probability distribution over the first {2^k} Turing machines (i.e. all Turing machines with description length {\leq k}), the value of {\psi} is at most {\log((2^k)^2) = k\log(4)}. (Here we use the fact that {\psi(h) \leq \sum_i h_i \log(i^2)}, since {h_i \leq 1} and hence {h_i\log(h_i) \leq 0}.) This means that, if we receive roughly {\frac{k}{\epsilon^2}} data points, we will achieve error within {\epsilon} of the best Turing machine that has description length {\leq k}.
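Spelled out (for a distribution {h} supported on the first {2^k} machines, so that {i \leq 2^k} and {\log h_i \leq 0}), the bound on {\psi} is

\displaystyle \psi(h) = \sum_i h_i \left(2\log i + \log h_i\right) \leq \sum_i h_i \cdot 2\log(2^k) = 2k\log 2 = k\log(4)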

Let me note several things here:

Interestingly, in the case that the prediction error is given by the negative log probability assigned to the truth, then the corresponding strategy that achieves the error bound is just normal Bayesian updating. But for other measurements of error, we get different update strategies. Although I haven’t worked out the math, intuitively this difference could be important if the universe is fundamentally unpredictable but our notion of error is insensitive to the unpredictable aspects.
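To spell the correspondence out (a standard observation; the notation here is mine, not the post's): write {w_{i,t}} for the weight on model {h_i} at time {t}, and let {h_i(y_t)} denote the probability that model {i} assigned to the observed outcome. With the loss {L(y_t, h_i) = -\log h_i(y_t)} and learning rate {1}, the exponential-weights update is

\displaystyle w_{i,t+1} \propto w_{i,t}\, e^{-L(y_t, h_i)} = w_{i,t}\, h_i(y_t)

which is exactly Bayes' rule: prior weight times likelihood. For other losses the multiplier {e^{-\eta L}} is no longer a likelihood, and the resulting update differs from Bayesian conditioning.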

Myth 7: frequentist methods hide their assumptions while Bayesian methods make assumptions explicit. I’m still not really sure where this came from. As we’ve seen numerous times so far, a very common flavor among frequentist methods is the following: I have a model class {\mathcal{M}}, I want to do as well as any model in {\mathcal{M}}; or put another way:

Assumption: At least one model in {\mathcal{M}} has error at most {E}.
Guarantee: My method will have error at most {E + \epsilon}.

This seems like a very explicit assumption with a very explicit guarantee. On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.
Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold). So given that that assumption doesn’t hold but Bayesian methods still often perform well in practice, I would say that Bayesian methods are making some other sort of “assumption” that is far less explicit (indeed, I would be very interested in understanding what this other, more nebulous assumption might be).

Myth 8: frequentist methods are fragile, Bayesian methods are robust. This is another one that’s straightforwardly false. First, since frequentist methods often rest on weaker assumptions they are more robust if the assumptions don’t quite hold. Secondly, there is an entire area of robust statistics, which focuses on being robust to adversarial errors in the problem data.

Myth 9: frequentist methods are responsible for bad science. I will concede that much bad science is done using frequentist statistics. But this is true only because pretty much all science is done using frequentist statistics. I’ve heard arguments that using Bayesian methods instead of frequentist methods would fix at least some of the problems with science. I don’t think this is particularly likely, as I think many of the problems come from mis-application of statistical tools or from failure to control for multiple hypotheses. If anything, Bayesian methods would exacerbate the former, because they often require more detailed modeling (although in most simple cases the difference doesn’t matter at all). I don’t think being Bayesian guards against multiple hypothesis testing. Yes, in some sense a prior “controls for multiple hypotheses”, but in general the issue is that the “multiple hypotheses” are never written down in the first place, or are written down and then discarded. One could argue that being in the habit of writing down a prior might make practitioners more likely to think about multiple hypotheses, but I’m not sure this is the first-order thing to worry about.

Myth 10: frequentist methods are unprincipled / hacky. One of the most beautiful theoretical paradigms that I can think of is what I would call the “geometric view of statistics”. One place that does a particularly good job of showcasing this is Shai Shalev-Shwartz’s PhD thesis, which was so beautiful that I cried when I read it. I’ll try (probably futilely) to convey a tiny amount of the intuition and beauty of this paradigm in the next few paragraphs, although focusing on minimax estimation, rather than online learning as in Shai’s thesis.

The geometric paradigm tends to emphasize a view of measurements (i.e. empirical expected values over observed data) as “noisy” linear constraints on a model family. We can control the noise by either taking few enough measurements that the total error from the noise is small (classical statistics), or by broadening the linear constraints to convex constraints (robust statistics), or by controlling the Lagrange multipliers on the constraints (regularization). One particularly beautiful result in this vein is the duality between maximum entropy and maximum likelihood. (I can already predict the Jaynesians trying to claim this result for their camp, but (i) Jaynes did not invent maximum entropy; (ii) maximum entropy is not particularly Bayesian (in the sense that frequentists use it as well); and (iii) the view on maximum entropy that I’m about to provide is different from the view given in Jaynes or by physicists in general [edit: EHeller thinks this last claim is questionable, see discussion here].)

To understand the duality mentioned above, suppose that we have a probability distribution {p(x)} and the only information we have about it is the expected value of a certain number of functions, i.e. the information that {\mathbb{E}[\phi(x)] = \phi^*}, where the expectation is taken with respect to {p(x)}. We are interested in constructing a probability distribution {q(x)} such that no matter what particular value {p(x)} takes, {q(x)} will still make good predictions. In other words (taking {\log p(x)} as our measurement of prediction accuracy) we want {\mathbb{E}_{p'}[\log q(x)]} to be large for all distributions {p'} such that {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. Using a technique called Lagrangian duality, we can both find the optimal distribution {q} and compute its worst-case accuracy over all {p'} with {\mathbb{E}_{p'}[\phi(x)] = \phi^*}. The characterization is as follows: consider all probability distributions {q(x)} that are proportional to {\exp(\lambda^{\top}\phi(x))} for some vector {\lambda}, i.e. {q(x) = \exp(\lambda^{\top}\phi(x))/Z(\lambda)} for some {Z(\lambda)}. Of all of these, take the {q(x)} with the largest value of {\lambda^{\top}\phi^* - \log Z(\lambda)}. Then {q(x)} will be the optimal distribution and the accuracy for all distributions {p'} will be exactly {\lambda^{\top}\phi^* - \log Z(\lambda)}. Furthermore, if {\phi^*} is the empirical expectation given some number of samples, then one can show that {\lambda^{\top}\phi^* - \log Z(\lambda)} is proportional to the log likelihood of {q}, which is why I say that maximum entropy and maximum likelihood are dual to each other.

This is a relatively simple result but it underlies a decent chunk of models used in practice.
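As an illustration of the characterization above (mine, not from the post), here is a toy computation: over the support {\{0, \ldots, 5\}} with a single feature {\phi(x) = x} and a measured mean {\phi^*}, we recover the maximum-entropy distribution by maximizing {\lambda^{\top}\phi^* - \log Z(\lambda)} numerically.

```python
# A sketch of the duality above on a toy problem: over x in {0,...,5} with
# phi(x) = x and measured mean phi_star, find q(x) proportional to
# exp(lambda * phi(x)) by maximizing lambda * phi_star - log Z(lambda).
import numpy as np
from scipy.optimize import minimize_scalar

xs = np.arange(6).astype(float)   # support {0, 1, ..., 5}
phi = xs                          # feature phi(x) = x
phi_star = 1.5                    # the measured expectation E[phi(x)]

def neg_objective(lam):
    log_Z = np.log(np.exp(lam * phi).sum())
    return -(lam * phi_star - log_Z)   # negate because scipy minimizes

res = minimize_scalar(neg_objective)
q = np.exp(res.x * phi)
q /= q.sum()

print("q(x):", q.round(4))
print("E_q[phi(x)] (matches phi_star at the optimum):", round((q * phi).sum(), 4))
print("worst-case accuracy  lambda*phi_star - log Z:", round(-res.fun, 4))
```

At the optimum the constraint {\mathbb{E}_q[\phi(x)] = \phi^*} holds automatically, which is the maximum-entropy / maximum-likelihood duality in miniature.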

Myth 11: frequentist methods have no promising approach to computationally bounded inference. I would personally argue that frequentist methods are more promising than Bayesian methods at handling computational constraints, although computationally bounded inference is a very cutting edge area and I’m sure other experts would disagree. However, one point in favor of the frequentist approach here is that we already have some frameworks, such as the “tightening relaxations” framework discussed here, that provide quite elegant and rigorous ways of handling computationally intractable models.

 

References

(Myth 3) Sparse recovery: Sparse recovery using sparse matrices
(Myth 5) Online learning: Online learning and online convex optimization
(Myth 8) Robust statistics: see this blog post and the two linked papers
(Myth 10) Maximum entropy duality: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

129 comments


comment by lukeprog · 2014-02-10T19:03:57.106Z · LW(p) · GW(p)

I would love to know which parts of this post Eliezer disagrees with, and why.

Replies from: Eliezer_Yudkowsky, Benquo
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-12T21:44:25.494Z · LW(p) · GW(p)

Don't have time for a real response. Quickly and ramblingly:

1) The point of Bayesianism isn't that there's a toolbox of known algorithms like max-entropy methods which are supposed to work for everything. The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there's supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a "resistant concept" which simply cannot sink in for many people.

2) I did initially try to wade into the math of the linear problem (and wonder if I'm the only one who did so, unless others spotted the x-y inversion but didn't say anything), trying to figure out how I would solve it even though that wasn't really relevant for reasons of (1), but found that the exact original problem specified may be NP-hard according to Wikipedia, much as my instincts said it should be. And if we're allowed approximate answers then yes, throwing a standard L1-norm algorithm at it is pretty much what I would try, though I might also try some form of expectation-maximization using the standard Bayesian L2 technique and repeatedly truncating the small coefficients and then trying to predict the residual error. I have no idea how long that would take in practice. It doesn't actually matter, because see (1). I could go on about how for any given solution I can compute its Bayesian likelihood assuming Gaussian noise, and so again Bayes functions well as a background epistemology which gives us a particular minimization problem to be computed by whatever means, and if we have no background epistemology then why not just choose a hundred random 1s, etc., but lack the time for more than rapid rambling here. Jacob didn't say what he thought an actual frequentist or Bayesian approach would be, he just said the frequentist approach would be easy and that the Bayesian one was hard.

(3) Having made a brief effort to wade into the math and hit the above bog, I did not attempt to go into Jacob's claim that frequentist statistics can transcend i.i.d. But considering the context in which I originally complained about the assumptions made by frequentist guarantees, I should very much like to see explained concretely how Jacob's favorite algorithm would handle the case of "You have a self-improving AI which turns out to maximize smiles, in all previous cases it produced smiles by making people happy, but once it became smart enough it realized that it ought to preserve your bad generalization and faked its evidence, and now that it has nanotech it's going to tile the universe with tiny smileyfaces." This is the Context Change Problem I originally used to argue against trying for frequentist-style guarantees based on past AI behavior being okay or doing well on other surface indicators. I frankly doubt that Jacob's algorithm is going to handle it. I really really doubt it. Very very roughly, my own notion of an approach here would be a Bayesian-viewpoint AI which was learning a utility function and knew to explicitly query model ambiguity back to the programmers, perhaps using a value-of-info calculation. I should like to hear what a frequentist viewpoint on that would sound like.

(4) Describing the point of likelihood ratios in science would take its own post. Three key ideas are (a) instead of "negative results" we have "likelihood ratios favoring no effect over 5% effect" and so it's now conceptually simpler to get rid of positive-result bias in publication; (b) if we compute likelihood ratios on all the hypotheses which are actually in play then we can add up what many experiments tell us far more easily and get far more sensible answers than with present "survey" methods; and (c) having the actual score be far below expected log score for the best hypothesis tells us when some of our experiments must be giving us bogus data or having been performed under invisibly different conditions, a huge problem in many cases and something far beyond the ability of present "survey" methods to notice or handle.

EDIT: Also everything in http://lesswrong.com/lw/mt/beautiful_probability/

Replies from: IlyaShpitser, jsteinhardt, Unnamed, lukeprog
comment by IlyaShpitser · 2014-02-13T14:40:47.033Z · LW(p) · GW(p)

when a frequentist algorithm works, there's supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a "resistant concept" which simply cannot sink in for many people.

Perhaps the reason this is not sinking in for many people is because it is not true.


Bayes assumes you can write down your prior, your likelihood and your posterior. That is what we need to get Bayes theorem to work. If you are working with a statistical model where this is not possible*, you cannot really use the standard Bayesian story, yet there still exist ways of attacking the problem.

(*) Of course, "not possible in principle" is different from "we don't know how to yet." In either case, I am not really sure what the point of an official Bayesian epistemology explanation would be.


This idea that there is a standard Bayesian explanation for All The Things seems very strange to me. Andrew Gelman has a post on his blog about how to define "identifiability" if you are a Bayesian:

(http://andrewgelman.com/2014/02/12/think-identifiability-bayesian-inference/)

This is apparently a tricky (or not useful) concept to define within that framework. Which is a little weird, because it is both a very useful concept, and a very clear one to me.

Gelman is a pretty prominent Bayesian. Either he is confused, or I am confused, or his view on the stuff causal folks like me work on is so alien that it is not illuminating. The issue seems to me to be cultural differences between frameworks.

Replies from: Emile
comment by Emile · 2014-02-14T13:44:33.432Z · LW(p) · GW(p)

Do you have a handy example of a frequentist algorithm that works, for which there is no Bayesian explanation?

Replies from: IlyaShpitser
comment by IlyaShpitser · 2014-02-14T13:54:05.763Z · LW(p) · GW(p)

I wouldn't say "no Bayesian explanation," but perhaps "a Bayesian explanation is unknown to me, nor do I see how this explanation would illuminate anything." But yes, I gave an example elsewhere in this thread. The FCI algorithm for learning graph structure in the non-parametric setting with continuous valued variables, where the correct underlying model has the following independence structure:

A is independent of B and C is independent of D (and nothing else is true).

Since I (and to my knowledge everyone else) do not know how to write the likelihood for this model, I don't know how to set up the standard Bayesian story here.

comment by jsteinhardt · 2014-02-19T07:06:35.910Z · LW(p) · GW(p)

Eliezer,

The point of Bayesianism is to provide a coherent background epistemology which underlies everything; when a frequentist algorithm works, there's supposed to be a Bayesian explanation of why it works. I have said this before many times but it seems to be a "resistant concept" which simply cannot sink in for many people.

First, I object to the labeling of Bayesian explanations as a "resistant concept". I think it's not only uncharitable but also wrong. I started out with exactly the viewpoint that everything should be explained in terms of Bayes (see one of my earliest and most-viewed blog posts if you don't believe me). I moved away from this viewpoint slowly as the result of accumulated evidence that this is not the most productive lens through which to view the world.

More to the point: why is it that you think that everything should have a Bayesian explanation? One of the most-cited reasons why Bayes should be an epistemic ideal is the various "optimality" / Dutch book theorems, which I've already argued against in this post. Do you accept the rebuttals I gave, or disagree with them?

My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems. But frequentist statistics (not to mention pretty much any successful paradigm) has also been useful for solving a large number of problems, some of which Bayesian statistics cannot solve, as I have demonstrated in this post. The mere fact that a tool is extremely useful does not mean that it should be elevated to a universal normative standard.

but found that the exact original problem specified may be NP-hard according to Wikipedia, much as my instincts said it should be

We've already discussed this in one of the other threads, but I'll just repeat here that this isn't correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.

I could go on about how for any given solution I can compute its Bayesian likelihood assuming Gaussian noise, and so again Bayes functions well as a background epistemology

The point of this example was to give a problem that, from a modeling perspective, was as convenient for Bayes as possible, but that was computationally intractable to solve using Bayesian techniques. I gave other examples (such as in Myth 5) that demonstrate situations where Bayes breaks down. And I argued indirectly in Myths 1, 4, and 8 that the prior is actually a pretty big deal and has the capacity to cause problems in ways that frequentists have ways of dealing with.

I should very much like to see explained concretely how Jacob's favorite algorithm would handle the case of "You have a self-improving AI which turns out to maximize smiles, in all previous cases it produced smiles by making people happy, but once it became smart enough it realized that it ought to preserve your bad generalization and faked its evidence, and now that it has nanotech it's going to tile the universe with tiny smileyfaces."

I think this is a very bad testing ground for how good a technique is, because it's impossible to say whether something would solve this problem without going through a lot of hand-waving. I think your "notion of how to solve it" is interesting but has a lot of details to fill in, and it's extremely unclear how it would work, especially given that even for concrete problems that people work on now, an issue with Bayesian methods is overconfidence in a particular model. I should also note that, as we've registered earlier, I don't think that what you call the Context Change Problem is actually a problem that an intelligent agent would face: any agent that is intelligent enough to behave at all functionally close to the level of a human would be robust to context changes.

However, even given all these caveats, I'll still try to answer your question on your own terms. Short answer: do online learning with an additional action called "query programmer" that is guaranteed to always have some small negative utility, say -0.001, that is enough to outweigh any non-trivial amount of uncertainty but will eventually encourage the AI to act autonomously. We would need some way of upper-bounding the regret of other possible actions, and of incorporating this utility constraint into the algorithm, but I don't think the amount of fleshing out is any more or less than that required by your proposal.

[WARNING: The rest of this comment is mostly meaningless rambling.]

I want to stress again that the above paragraph is only a (sketch of) an answer to the question as you posed it. But I'd rather sidestep the question completely and say something like: "OK, if we make literally no assumptions, then we're completely screwed, because moving any speck of dust might cause the universe to explode. Being Bayesian doesn't make this issue go away, it just ignores it.

So, what assumptions can we be reasonably okay with making that would help us solve the problem? Maybe I'd be okay assuming that the mechanism that takes in my past actions and returns a utility is a Turing machine of description length less than 10^15. But unfortunately that doesn't help me much, because for every Turing machine M, there's one of not that much longer description length that behaves identically to M up until I'm about to make my current decision, and then penalizes my current decision with some extraordinary large amount of disutility. Note that, again, being Bayesian doesn't deal with this issue, it just assigns it low prior probability.

I think the question of exactly what assumptions one would be willing to make, that would allow one to confidently reason about actions with potentially extremely discontinuous effects, is an important and interesting one, and I think one of the drawbacks of "thinking like a Bayesian" is that it draws attention away from this issue by treating it as mostly solved (via assigning a prior)."

Replies from: Eliezer_Yudkowsky, Wei_Dai, Vaniver
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-19T18:08:46.515Z · LW(p) · GW(p)

My guess is that you would still be in favor of Bayes as a normative standard of epistemology even if you rejected Dutch book arguments, and the reason why you like it is because you feel like it has been useful for solving a large number of problems.

Um, nope. What it would really take to change my mind about Bayes is seeing a refutation of Dutch Book and Cox's Theorem and Von Neumann-Morgenstern and the complete class theorem, combined with seeing some alternative epistemology (e.g. Dempster-Shafer) not turn out to completely blow up when subjected to the same kind of scrutiny as Bayesianism (the way DS brackets almost immediately go to [0-1] and fuzzy logic turned out to be useless etc.)

Neural nets have been useful for solving a large number of problems. It doesn't make them good epistemology. It doesn't make them a plausible candidate for "Yes, this is how you need to organize your thinking about your AI's thinking and if you don't your AI will explode".

some of which Bayesian statistics cannot solve, as I have demonstrated in this post.

I am afraid that your demonstration was not stated sufficiently precisely for me to criticize. This seems like the sort of thing for which there ought to be a standard reference, if there were such a thing as a well-known problem which Bayesian epistemology could not handle. For example, we have well-known critiques and literature claiming that nonconglomerability is a problem for Bayesianism, and we have a chapter of Jaynes which neatly shows that they all arise from misuse of limits on infinite problems. Is there a corresponding literature for your alleged reductio of Bayesianism which I can consult? Now, I am a great believer in civilizational inadequacy and the fact that the incompetence of academia is increasing, so perhaps if this problem was recently invented there is no more literature about it. I don't want to be a hypocrite about the fact that sometimes something is true and nobody has written it up anyway, heaven knows that's true all the time in my world. But the fact remains that I am accustomed to somewhat more detailed math when it comes to providing an alleged reductio of the standard edifice of decision theory. I know your time is limited, but the real fact is that I really do need more detail to think that I've seen a criticism and be convinced that no response to that criticism exists. Should your flat assertion that Bayesian methods can't handle something and fall flat so badly as to constitute a critique of Bayesian epistemology, be something that I find convincing?

We've already discussed this in one of the other threads, but I'll just repeat here that this isn't correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.

Okay. Though I note that you haven't actually said that my intuitions (and/or my reading of Wikipedia) were wrong; many NP-hard problems will be easy to solve for a randomly generated case.

Anyway, suppose a standard L1-penalty algorithm solves a random case of this problem. Why do you think that's a reductio of Bayesian epistemology? Because the randomly generated weights mean that a Bayesian viewpoint says the credibility is going as the L2 norm on the non-zero weights, but we used an L1 algorithm to find which weights were non-zero? I am unable to parse this into the justifications I am accustomed to hearing for rejecting an epistemology. It seems like you're saying that one algorithm is more effective at finding the maximum of a Bayesian probability landscape than another algorithm; in a case where we both agree that the unbounded form of the Bayesian algorithm would work.

What destroys an epistemology's credibility is a case where even in the limit of unbounded computing power and well-calibrated prior knowledge, a set of rules just returns the wrong answer. The inherent subjectivity of p-values as described in http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequently_subjective/ is not something you can make go away with a better-calibrated prior, correct use of limits, or unlimited computing power; it's the result of bad epistemology. This is the kind of smoking gun it would take to make me stop yammering about probability theory and Bayes's rule. Showing me algorithms which don't on the surface seem Bayesian but find good points on a Bayesian fitness landscape isn't going to cut it!

Replies from: jsteinhardt, thomascolthurst
comment by jsteinhardt · 2014-02-20T04:16:23.699Z · LW(p) · GW(p)

Eliezer, I included a criticism of both complete class and Dutch book right at the very beginning, in Myth 1. If you find them unsatisfactory, can you at least indicate why?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-20T04:34:15.400Z · LW(p) · GW(p)

Your criticism of Dutch Book is that it doesn't seem to you useful to add anti-Dutch-book checkers to your toolbox. My support of Dutch Book is that if something inherently produces Dutch Books then it can't be the right epistemological principle because clearly some of its answers must be wrong even in the limit of well-calibrated prior knowledge and unbounded computing power.

The complete class theorem I understand least of the set, and it's probably not very much entwined with my true rejection so it would be logically rude to lead you on here. Again, though, the point that every local optimum is Bayesian tells us something about non-Bayesian rules producing intrinsically wrong answers. If I believed your criticism, I think it would be forceful; I could accept a world in which for every pair of a rational plan with a world, there is an irrational plan which does better in that world, but no plausible way for a cognitive algorithm to output that irrational plan - the plans which are equivalent of "Just buy the winning lottery ticket, and you'll make more money!" I can imagine being shown that the complete class theorem demonstrates only an "unfair" superiority of this sort, and that only frequentist methods can produce actual outputs for realistic situations even in the limit of unbounded computing power. But I do not believe that you have leveled such a criticism. And it doesn't square very much with my current understanding that the decision rules being considered are computable rules from observations to actions. You didn't actually tell me about a frequentist algorithm which is supposed to be realistic and show why the Bayesian rule which beats it is beating it unfairly.

If you want to hit me square in the true rejection I suggest starting with VNM. The fact that our epistemology has to plug into our actions is one reason why I roll my eyes at the likes of Dempster-Shafer or frequentist confidence intervals that don't convert to credibility distributions.

Replies from: Luke_A_Somers, jsteinhardt
comment by Luke_A_Somers · 2014-02-21T22:24:27.585Z · LW(p) · GW(p)

I could accept a world in which for every pair of a rational plan with a world, there is an irrational plan which does better in that world, but no plausible way for a cognitive algorithm to output that irrational plan

We already live in that world.

(The following is not evidence, just an illustrative analogy) Ever seen Groundhog Day? Imagine him skipping the bulk of the movie and going straight to the last day. It is straight wall to wall WTF but it's very optimal.

comment by jsteinhardt · 2014-02-20T04:44:05.576Z · LW(p) · GW(p)

One of the criticisms I raised is that merely being able to point to all the local optima is not a particularly impressive property of an epistemological theory. Many of those local optima will be horrible! (My criticism of VNM is essentially the same.)

Many frequentist methods, such as minimax, also provide local optima, but they provide local optima which actually have certain nice properties. And minimax provides a complete decision rule, not just a probability distribution, so it plugs directly into actions.

comment by thomascolthurst · 2014-03-12T15:05:03.222Z · LW(p) · GW(p)

FYI, there are published counterexamples to Cox's theorem. See for example Joseph Halpern's at http://arxiv.org/pdf/1105.5450.pdf.

Replies from: Vaniver
comment by Vaniver · 2014-03-12T17:32:07.776Z · LW(p) · GW(p)

You need to not include the period in your link, like so.

comment by Wei Dai (Wei_Dai) · 2015-07-01T23:35:30.455Z · LW(p) · GW(p)

Short answer: do online learning with an additional action called "query programmer" that is guaranteed to always have some small negative utility, say -0.001, that is enough to outweigh any non-trivial amount of uncertainty but will eventually encourage the AI to act autonomously.

This short answer is too short for me to understand, unfortunately. Do you think there is enough potential merit in this idea to try to understand it better or further develop it? (I've been learning about online learning recently in an effort to understand/evaluate Paul Christiano's recent "AI control" ideas. If you have your own ideas also based on online learning, I'd love to try to understand them while the online learning stuff is fresh in my mind.)

Replies from: John_Maxwell_IV
comment by John_Maxwell (John_Maxwell_IV) · 2018-08-06T07:31:44.070Z · LW(p) · GW(p)

Here is a control idea based on online learning--I think I independently generated something similar to what Jacob describes.

comment by Vaniver · 2014-02-19T18:03:45.201Z · LW(p) · GW(p)

We've already discussed this in one of the other threads, but I'll just repeat here that this isn't correct. With overwhelmingly high probability a Gaussian matrix will satisfy the restricted isometry property, which implies that appropriately L1-regularized least squares will return the exact solution.

I do wonder if it would have been better to include something along the lines of "with probability 1" to the claim that non-Bayesian methods can solve it easily. Compressed sensing isn't magic, even though it's very close.

any agent that is intelligent enough to behave at all functionally close to the level of a human would be robust to context changes.

Humans get tripped up by context changes very frequently. It's not obvious to me where you think this robustness would come from.

Replies from: Dentin
comment by Dentin · 2014-02-19T21:14:22.121Z · LW(p) · GW(p)

Compressed sensing isn't even magic, if you're halfway versed in signal processing. I understood compressed sensing within 30 seconds of hearing a general overview of it, and there are many related analogs in many fields.

Replies from: Vaniver
comment by Vaniver · 2014-02-19T21:40:25.853Z · LW(p) · GW(p)

Compressed sensing isn't even magic

The convex optimization guys I know are all rather impressed by compressed sensing, but that may be because they specialize in doing L1 and L2 problems, and so compressed sensing makes the things they're good at even more important.

comment by Unnamed · 2014-02-13T07:04:54.875Z · LW(p) · GW(p)

(c) having the actual score be far below expected log score for the best hypothesis tells us when some of our experiments must be giving us bogus data or having been performed under invisibly different conditions, a huge problem in many cases and something far beyond the ability of present "survey" methods to notice or handle.

The standard meta-analysis toolkit does include methods of looking at the heterogeneity in effect sizes. (This is fresh in my mind because it actually came up at yesterday's CFAR colloquium regarding some academic research that we were discussing.)

I do not know how the frequentist approach compares to the Bayesian approach in this case.

comment by lukeprog · 2014-02-12T23:53:53.354Z · LW(p) · GW(p)

Thanks!

comment by Benquo · 2014-02-10T19:21:47.106Z · LW(p) · GW(p)

Seconded.

comment by SilentCal · 2014-02-11T23:34:14.124Z · LW(p) · GW(p)

I don't have a technical basis for thinking this, but I'm beginning to suspect that as time goes on, more and more frequentist methods will be proven to be equivalent or good approximations to the ideal Bayesian approach. If that happens, (Edit: Hypothetical) Bayesians who refused to use those methods on ideological grounds would look kind of silly in hindsight, as if relativistic physics came first and a bunch of engineers refused to use Newtonian equations for decades until someone proved that they approximate the truth well at low speeds.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-19T06:21:55.642Z · LW(p) · GW(p)

Who are these mysterious straw Bayesians who refuse to use algorithms that work well and could easily turn out to have a good explanation later? Bayes is epistemological background not a toolbox of algorithms.

Replies from: SilentCal, Fhyve, buybuydandavis
comment by SilentCal · 2014-02-19T20:26:26.010Z · LW(p) · GW(p)

After a careful rereading of http://lesswrong.com/lw/mt/beautiful_probability/, the 747 analogy suggests that, once you understand the difference between an epistemological background and a toolbox, it might be a good idea to use the toolbox. But I didn't really read it that way the first time, so I imagine others might have made a similar mistake. I'll edit my post to make the straw Bayesians hypothetical, to make it clear that I'm making a point to other LW readers rather than criticizing a class of practicing statisticians.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-19T20:50:51.951Z · LW(p) · GW(p)

I'd actually forgotten I'd written that. Thank you for reminding me!

comment by Fhyve · 2014-02-22T21:07:37.781Z · LW(p) · GW(p)

Bayes is epistemological background not a toolbox of algorithms.

I disagree: I think you are lumping two things together that don't necessarily belong together. There is Bayesian epistemology, which is philosophy, describing in principle how we should reason, and there is Bayesian statistics, something that certain career statisticians use in their day to day work. I'd say that frequentism does fairly poorly as an epistemology, but it seems like it can be pretty useful in statistics if used "right". It's nice to have nice principles underlying your statistics, but sometimes ad hoc methods and experience and intuition just work.

Replies from: army1987
comment by A1987dM (army1987) · 2014-02-23T09:22:47.019Z · LW(p) · GW(p)

Yes, but the sounder the epistemology is, the harder it is to [ETA: accidentally] misuse the tools. Cue all the people misunderstanding what p-values mean...

comment by buybuydandavis · 2014-02-24T03:12:18.771Z · LW(p) · GW(p)

The fundamental confusion going on here comes from peculiar terminology.

jsteinhardt writes:

Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method

So every algorithm that isn't obviously Bayesian is labeled Frequentist, while in fact what we have are two epistemological frameworks, and a zillion and one algorithms that we throw at data that don't neatly fit into either framework.

comment by IlyaShpitser · 2014-02-10T09:09:22.029Z · LW(p) · GW(p)

Great post! It would be great if you had cites for various folks claiming myth k. Some of these sound unbelievable!


"Frequentist methods need to assume their model is correct."

This one is hilarious. Does anyone say this? Multiply robust methods (Robins/Rotnitzky/et al) aren't exactly Bayesian, and their entire point is that you can get a giant piece of the likelihood arbitrarily wrong and your estimator is still consistent.

Replies from: jsteinhardt, jsteinhardt
comment by jsteinhardt · 2014-02-11T05:56:43.182Z · LW(p) · GW(p)

Great post!

I thought you might enjoy it :). Your comments on LessWrong provided me with some of the motivation for writing it.

comment by TheMajor · 2014-02-10T07:43:46.990Z · LW(p) · GW(p)

I am confused about your use of the word optimal. In particular in the sentences

Bayesian methods are optimal (except for computational considerations).

and

Bayesian methods are locally optimal, not global optimal.

are you talking about the same sort of 'optimal'? From wikipedia (here and here) I found the rigorous definition of the word 'optimal' in the second sentence, which can be written in terms of your utility function (a decision rule is optimal if there is no other decision rule which will always give you at least as much utility and in at least one world will give you more utility).

Also, I agree with many of your myths, namely 3, 8, 9 and 11. I was rather surprised to see that these things even needed to be mentioned: I don't see why making good trade-offs between truth and computation time should be 'simple' (3), as you mentioned, the frequentist tests are chosen precisely with robustness in mind (8), bad science is more than getting your statistics wrong (9) (small sidenote: while it might be true that scientists can get confused by frequentist statistics, which might corrupt their science, I don't think the problem would be smaller when using a different form of statistics, and I therefore think it would not be correct to attribute this bad science to frequentism), and we know from practice that Bayesianism (not frequentism) is the method which has the most problems with computational bounds (11).

However, I think it is important to make a distinction between the validity of Bayesianism and the application of Bayesianism. I recall reading on lesswrong (although I cannot find the post at this moment) that the relation between Bayesianism and frequentism should be seen like the relation between Quantum Mechanics and classical physics (although QM has lots of experimental data to support it, so it is rightfully more accepted than Bayesianism). Like QM, Bayesianism is governed by simple mathematical rules (Schrodinger's equation and Bayes' theorem), which will give the right answer when supplied with the correct initial conditions. However, to fly a plane we do not invoke QM, and similarly, in most practical instances, we will not invoke Bayes to estimate a parameter. Instead we use approximations (classical physics/frequentism), which will not give the exact answer but will give a good approximation thereof (as you mention: a method close to the global optimum, although I am still unclear what we are optimising for there). The key point is that these approximations are correct only insofar as they approximate the answer that would be given by the correct theory. If classical physics and QM disagree, then QM is correct. Similarly, if we have a parameter estimate obtained by a Bayesian algorithm and one obtained by a frequentist algorithm, the Bayesian one is going to give the correct subjective probability. But the correct algorithms are (nearly?) impossible to implement, so we stick with the approximations. This is why physicists still use and teach classical physics, and why I personally endorse many frequentist tools. The difference between validity and application seems to be lost in myths 4-7 and 10:

  • 4: Strictly speaking the only way to truly share your arguments for having a certain degree of belief in a hypothesis would be to share all sensory data that is dependent on the hypothesis (after all, this is how evidence works). This is clearly not feasible, but it would be the correct thing to do if we only care about being correct. You explain in this myth that this does not lead to a simple and quick algorithm. But this is not an argument against validity, it is an argument against a possible application.
  • 5: Again this whole myth deals with application. The myth you debunk states that the approximations made when turning degrees of belief into an actual strategy must be bad, and you debunk this by giving an algorithm that gets very good results. But this is not an argument that distinguishes between Bayesianism and frequentism, it merely states that there are easy-to-compute (in a relative sense) algorithms that get close to the correct answer, which we know how to find in theory but not in practice. (In case you are wondering: the approximation takes place in the step where you simplify your utility function to the Regret function, and your prior is whatever decision rule you use for the first horse.)
  • 6: This myth hinges on the word 'simple'. Frequentist methods can deal with many complicated problems, and a lot of high quality work has been done to increase the scope of the tools of frequentism. Saying that only simple models can be dealt with would be an insult. However, as mentioned above, these methods are all approximations, and each method is valid only if the approximations made are satisfied. So while frequentist methods can deal with many complicated models, it is important to realise that the scope of each method is limited.
  • 7+10: Myth 10 seems to be a case of confusion by the people using the tools. Frequentist methods (derived from approximations) come with boundaries, such as limitations on the type of model that can be distilled from data or limitations on the meaning of the outcome of the algorithm (it might answer a different question than the one you hoped to answer). If you break one of these limitations it is not surprising that the results are wacky. This is not a problem of frequentism, provided the tools are explained properly. If the tools are not explained properly then problems arise. Your explanation (we have a class M and a solution E, and we look for a simple approximation which will give E+epsilon) is very clear. Problems arise when the class M is not specified, or the existence of E is unclear. I would like to classify this as an error of the practitioners of frequentism, rather than an error of the method.

Lastly I would like to make a small note that the example on myth 10 is very similar to something called the Boltzmann distribution from statistical physics, discovered in the 19th century. Here the function phi is the energy divided by the temperature.

Edit: during the writing of this post it seems that other people have already made this remark on myth 10. I agree that physicists would probably not interpret this as a game played between nature and the predictor.

Replies from: V_V, jsteinhardt
comment by V_V · 2014-02-10T16:19:36.017Z · LW(p) · GW(p)

However, I think it is important to make a distinction between the validity of Bayesianism and the application of Bayesianism. I recall reading on lesswrong (although I cannot find the post at this moment) that the relation between Bayesianism and frequentism should be seen like the relation between Quantum Mechanics and classical physics

Quantum Mechanics isn't consistent with General Relativity, our best explanation of gravity. Despite decades of trying, neither can be formulated as an approximation of the other. Even if one day physicists finally figure out a "Theory of Everything", it would still be a model. It would be epistemically incorrect to claim it was "exact".

Curiously, there is one interpretation of QM known as Quantum Bayesianism, which holds that wavefunctions are subjective and they are the fundamental concepts for reasoning about the world, and subjective probability distributions arise as approximations of wavefunctions under decoherence. That is, Bayesianism itself might be an approximation of a "truer" epistemic theory!

My humble opinion is that there is no ultimately "true" epistemic theory. They are all just models of what humans do to gain knowledge of the world. Some models can work better than others, often within certain bounds, but none of them is perfect.

Replies from: IlyaShpitser, jsteinhardt
comment by IlyaShpitser · 2014-02-10T18:11:49.967Z · LW(p) · GW(p)

I am very interested in Quantum Bayesianism (in particular Leifer's work) because one of the things we have to do to be "quantum Bayesians" is figure out a physically neutral description of quantum mechanics, that is, a description of quantum mechanics that doesn't use physical jargon like 'time.' In particular, physicists I believe describe spacelike and timelike separated entanglement differently.

That is, a Bell inequality violation system (that is where B and C are space separated) has this graph

A -> B <-> C <- D

(where famously, due to Bell inequality violation, there is no hidden variable corresponding to the bidirected arc connecting B and C).

But the same system can arise in a temporally sequential model which looks like this:

A -> B -> D -> C, with B <-> C

where an appropriate manipulation of the density matrix corresponding to this system ought to give us the Bell system above. In classical probability we can do this. In other words, in classical probability the notion of "probabilistic dependence" is abstracted away from notions like time and space.


Also we have to figure out what "conditioning" even means. Can't be Bayesian if we don't condition, now can we!

Replies from: V_V
comment by V_V · 2014-02-10T18:51:17.765Z · LW(p) · GW(p)

where an appropriate manipulation of the density matrix corresponding to this system ought to give us the Bell system above. In classical probability we can do this. In other words, in classical probability the notion of "probabilistic dependence" is abstracted away from notions like time and space.

Yes, but the notion of Bayesian inference, where you start with a prior and build a sequence of posteriors, updating as evidence accumulates, has an intrinsic notion of time. I wonder if that's enough for Quantum Bayesianism (I haven't read the original works, so I don't really know much about it).

Replies from: IlyaShpitser
comment by IlyaShpitser · 2014-02-10T20:57:54.861Z · LW(p) · GW(p)

The temporal order for sequential computation of posteriors is just our interpretation; it is not a part of the formalism. If we get pieces of evidence e1, e2, ..., ek in temporal order, we could do Bayesian updating in the temporal order, or the reverse of the temporal order, and the formalism still works (that is, our overall posterior will be the same, because all the updates commute). And that's because Bayes' theorem says nothing about time anywhere.
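As a quick illustration of that commutativity, here is a minimal sketch with made-up numbers (nothing below is from the comment):

import numpy as np

# Two hypotheses, two pieces of evidence; all likelihoods are arbitrary.
prior = np.array([0.5, 0.5])      # P(H1), P(H2)
lik_e1 = np.array([0.8, 0.3])     # P(e1 | H1), P(e1 | H2)
lik_e2 = np.array([0.2, 0.6])     # P(e2 | H1), P(e2 | H2)

def update(p, lik):
    # One Bayesian update: multiply by the likelihood and renormalize.
    p = p * lik
    return p / p.sum()

post_12 = update(update(prior, lik_e1), lik_e2)   # e1 first, then e2
post_21 = update(update(prior, lik_e2), lik_e1)   # e2 first, then e1
print(post_12, post_21)                           # identical posteriors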

comment by jsteinhardt · 2014-02-10T18:02:21.191Z · LW(p) · GW(p)

My humble opinion is that there is no ultimately "true" epistemic theory. They are all just models of what humans do to gain knowledge of the world. Some models can work better than others, often within certain bounds, but none of them is perfect.

Exactly!

comment by jsteinhardt · 2014-02-10T08:07:56.066Z · LW(p) · GW(p)

Thanks for your comments. One thing you say a few times throughout your comment is "frequentist methods are an approximation to Bayes". I wouldn't agree with this. I think Bayesian and frequentist methods are often trying to do different things (although in many practical instances their usage overlaps). In what sense do you believe that Bayes is the "correct" answer?

At the beginning of your comment, I would have used "admissible" rather than "optimal" to describe the definition you gave:

a decision rule is optimal if there is no other decision rule which will always give you at least as much utility and in at least one world will give you more utility

I don't see how the online learning algorithm in myth 5 can be interpreted as an approximation to Bayes. The guarantee I'm getting just seems way better and more awesome than what Bayes provides. I also don't think it's right to say that "regret is an approximation to utility". Regret is an alternative formulation to utility that happens to lead to a set of very fruitful results, one of which I explained under myth 5.

Replies from: TheMajor
comment by TheMajor · 2014-02-10T08:40:30.824Z · LW(p) · GW(p)

While writing this answer I realised I forgot an important class of exceptions, namely the typical school example of hypothesis testing. My explanation now consists of multiple parts.

To answer the first question: the Bayesian method gives the "correct" answer in the sense that it optimises the expectation of your utility function. If you choose a utility function like log(p) this means that you will find your subjective probabilities. I also think Bayesianism is "correct" in the philosophical sense (which is a property of the theory), but I believe there are many posts on lesswrong that can explain this better than I can.

  • The approximation made can often be rewritten in terms of a particular choice of utility function (or risk function, which is more conventional according to wikipedia). As you mentioned, choosing the Regret function for cost and a non-silly prior (for example whichever one you are using) will yield a Bayesian algorithm for your problem. Unfortunately I haven't looked at the specific algorithm in detail, but if admissible solutions are Bayesian algorithms, why would a Bayesian approach using your data not outperform the frequentist algorithm (and therefore produce at least as good asymptotic behaviour)? Also I would like to leave open the possibility that the algorithm you mention actually coincides with a Bayesian algorithm. Sometimes a different approach (frequentism/Bayesianism) can lead to the same conclusion (method).

  • Suppose I find myself in a situation in which I have several hypotheses and a set of data. The thing I'm interested in is the probability of each hypothesis given the data (in other words, finding out which hypothesis is correct). In frequentism there is no such thing as a 'probability of the hypothesis', after all a hypothesis is either true or false and we don't know which. So as a substitution frequentists consider the other conditional probability, the probability of seeing this data or worse provided the hypothesis is true, where worse must be defined beforehand. I'd say this is a wrong approach, a very very wrong approach. My opinion is that frequentists have adopted an incorrect worldview which leads them to dismiss and answer the wrong questions in this case. Here I expect pure conflict rather than some Bayesian approach which will coincide with frequentist methods.

I hope this explains how Bayesian and frequentist methods overlap and seem to disagree sometimes, and how many instances of frequentist algorithms should be compared to Bayesian algorithms with a properly chosen utility function.

Replies from: IlyaShpitser, private_messaging
comment by IlyaShpitser · 2014-02-10T14:33:33.725Z · LW(p) · GW(p)

Say I am interested in distinguishing between two hypotheses for p(a,b,c,d) (otherwise unrestricted):

hypothesis 1: "A is independent of B, C is independent of D, and nothing else is true"

hypothesis 2: "no independences hold"

Frequentists can run their non-parametric marginal independence tests. What is the (a?) Bayesian procedure here? As far as I can tell, for unrestricted densities p(a,b,c,d) no one knows how to write down the likelihood for H1. You can do a standard Bayesian setup here in some cases, e.g. if p(a,b,c,d) is multivariate normal, in which case H1 corresponds to a (simple) Gaussian ancestral graph model. Maybe one can do some non-parametric Bayes thing (???). It's not so simple to set up the right model sometimes, which is what Bayesian methods generally need.

Replies from: johnswentworth
comment by johnswentworth · 2014-02-11T05:56:14.434Z · LW(p) · GW(p)

You should check out chapter 20 of Jaynes' Probability Theory, which talks about Bayesian model comparison.

We wish to calculate P[H1 | data] / P[H2 | data] = P[data | H1] / P[data | H2] * P[H1] / P[H2].

For Bayesians, this problem does not involve "unrestricted densities" at all. We are given some data and presumably we know the space from which it was drawn (e.g. binary, categorical, reals...). That alone specifies a unique model distribution. For discrete data, symmetry arguments mandate a Dirichlet model prior with the categories given by all possible outcomes of {A,B,C,D}. For H2, the Dirichlet parameters are updated in the usual fashion and P[data | H2] calculated accordingly.

For H1, our Dirichlet prior is further restricted according to the independencies. The resulting distribution is not elegant (as far as I can tell), but it does exist and can be updated. For example, if the variables are all binary, then the Dirichlet for H2 has 16 categories. We'll call the 16 frequencies X0000, X0001, X0010, ... with parameters a0000, a0001, ... where the XABCD are the probabilities which the model given by X assigns to each outcome. Already, the Dirichlet for H2 is constrained to {X | sum(X) = 1, X > 0} within R^16. The Dirichlet for H1 is exactly the same function, but further constrained to the space {X | sum(X) = 1, X > 0, X00.. / X10.. = X01.. / X11.., X..00 / X..10 = X..01 / X..11} within R^16. This is probably painful to work with (analytically at the very least), but is fine in principle.

So we have P[data | H1] and P[data | H2]. That just leaves the prior probabilities for each model. At first glance, it might seem that H1 has zero prior, since it corresponds to a measure-zero subset of H2. But really, we must have SOME prior information lending H1 a nonzero prior probability or we wouldn't bother comparing the two in the first place. Beyond that, we'd have to come up with reasonable probabilities based on whatever prior information we have. Given no other information besides the fact that we're comparing the two, it would be 50/50.

Of course this is all completely unscalable. Fortunately, we can throw away information to save computation. More specifically, we can discretize and bin things much like we would for simple marginal independence tests. While it won't yield the ideal Bayesian result, it is still the ideal result given only the binned data.
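As a rough sketch of the unconstrained half of that computation (fabricated counts and a symmetric Dirichlet prior of my choosing; the constrained H1 integral is exactly the painful part and is not attempted here):

import numpy as np
from scipy.special import gammaln

# P[data | H2] for binary (A,B,C,D): a Dirichlet-multinomial marginal
# likelihood over the 16 joint outcomes. The counts below are fake.
rng = np.random.default_rng(0)
alpha = np.ones(16)                              # symmetric Dirichlet prior
counts = rng.multinomial(200, np.ones(16) / 16)  # fabricated observed counts

def log_evidence(counts, alpha):
    # log [ B(alpha + counts) / B(alpha) ]
    return (gammaln(alpha.sum()) - gammaln((alpha + counts).sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

print(log_evidence(counts, alpha))   # log P[data | H2], up to the multinomial coefficient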

I am a bit curious about the non-parametric tests used for H1. I am familiar with tests for whether A and B are independent, and of course they can be applied between C and D, but how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent? It is precisely this difficulty which makes the Bayesian likelihood calculation of H1 such a mess, and I am curious how frequentist methods approach it.

My apologies for the truly awful typesetting, but this is not the evening on which I learn to integrate tex in lesswrong posts.

Replies from: IlyaShpitser
comment by IlyaShpitser · 2014-02-11T08:24:27.608Z · LW(p) · GW(p)

Thanks for this post.

The resulting distribution is not elegant (as far as I can tell).

In the binary case, the saturated model can be parameterized by p(S = 0) for S any non-empty subset of { a,b,c,d }. The submodel corresponding to H1 is just one where p({a,b} = 0) = p({a}=0)p({b}=0), and p({c,d} = 0) = p({c}=0)p({d}=0).

For Bayesians, this problem does not involve "unrestricted densities" at all.

I am sorry, Bayesians do not get to decide what my problem is. My problem involves unrestricted densities by definition. I don't think you get to keep your "fully general formalism" chops if you suddenly start redefining my problem for me.

how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent?

This is a good question. I don't know a good answer to this that does not involve dealing with the likelihood in some way.

Replies from: johnswentworth
comment by johnswentworth · 2014-02-11T17:03:52.356Z · LW(p) · GW(p)

Sorry, I didn't mean to be dismissive of the general densities requirement. I mean that data always comes with a space, and that restricts the density. We could consider our densities completely general to begin with, but as soon as you give me data to test, I'm going to look at it and say "Ok, this is binary?" or "Ok, these are positive reals?" or something. The space gives the prior model. Without that information, there is no Bayesian answer.

I guess you could say that this isn't fully general because we don't have a unique prior for every possible space, which is a very valid point. For the spaces people actually deal with we have priors, and Jaynes would probably argue that any space of practical importance can be constructed as the limit of some discrete space. I'd say it's not completely general, because we don't have good ways of deriving the priors when symmetry and maximum entropy are insufficient. The Bayesian formalism will also fail in cases where the priors are non-normalizable, which is basically the formalism saying "Not enough information."

On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer. I would expect such methods to rely on additional information which would yield a proper prior.

Regarding that ugly distribution, that parameterization is basically where the constraints came from. Remember that the Dirichlets are distributions on the p's themselves, so it's an hierarchical model. So yes, it's not hard to write down the subspace corresponding to that submodel, but actually doing an update on the meta-level distribution over that subspace is painful.

Replies from: IlyaShpitser
comment by IlyaShpitser · 2014-02-11T17:27:52.864Z · LW(p) · GW(p)

I mean that data always comes with a space, and that restricts the density.

Sorry I am confused. Say A,B,C,D are in [0,1] segment of the real line. This doesn't really restrict anything.

For the spaces people actually deal with we have priors.

I deal with this space. I even have a paper in preparation that deals with this space! So do lots of people that worry about learning graphs from data.

On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer.

People use variations of the FCI algorithm, which from a Bayesian point of view is a bit of a hack. The asymptopia version of FCI assumes a conditional independence oracle, and then tells you what the model is based on what the oracle says. In practice, rather than using an oracle, people do a bunch of hypothesis tests for independence.


Regarding that ugly distribution

You are being so mean to that poor distribution. You know, H1 forms a curved exponential family if A,B,C,D are discrete. That's sort of the opposite of ugly. I think it's beautiful! H1 is an instance of Thomas Richardson's ancestral graph models, with the graph:

A <-> B <-> C <-> D <-> A

Replies from: johnswentworth
comment by johnswentworth · 2014-02-12T03:27:37.079Z · LW(p) · GW(p)

Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals, distributions over R^n, distributions over words starting with the letter k, distributions over Turing machines, distributions over elm trees more than 4 years old in New Hampshire, distributions over bizarre mathematical objects that I can't even think of... That's a LOT of prior information. It's a continuous space, so we can't apply a maximum entropy argument directly to find our prior. Typically we use the beta prior for [0,1] due to a symmetry argument, but that admittedly is not appropriate in all cases. On the other hand, unless you can find dependencies after running the data through the continuous equivalent of a pseudo-random number generator, you are definitely utilizing SOME additional prior information (e.g. via smoothness assumptions). When the Bayesian formalism does not yield an answer, it's usually because we don't have enough prior info to rule out stuff like that.

I think we're still talking past each other about the distributions. The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]. Perhaps the notation p[A,B,C,D ; X] is more familiar? Anyway, the hypothesis H1 corresponds to a subset of possible values of X. The beautiful distribution you talk about is p[A,B,C,D | X], which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph. Under that parameterization, X would be the lambda vector specifying the exponential model. Unfortunately, p[X] is the ugly one, and that elegant parameterization for p[A,B,C,D | X] will probably make p[X] even uglier.

It is much prettier for DAGs. In that case, we'd have one beta distribution for every possible set of inputs to each variable. X would then be the set of parameters for all those beta distributions. We'd get elegant generative models for numerical integration and life would be sunny and warm. So the simple use case for FCI is amenable to Bayesian methods. Latent variables are still a pain, though. They're fine in theory, just integrate over them when calculating the posterior, but it gets ugly fast.

Replies from: IlyaShpitser
comment by IlyaShpitser · 2014-02-12T08:56:01.657Z · LW(p) · GW(p)

Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals

???

There are easy to compute bijections from R to [0,1], etc.

The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]

Yes, parametric Bayes does this. I am giving you a problem where you can't write down p(A,B,C,D | X) explicitly and then asking you to solve something frequentists are quite happy solving. Yes I am aware I can do a prior for this in the discrete case. I am sure a paper will come of it eventually.

Latent variables are still a pain, though.

The whole point of things like the beautiful distribution is you don't have to deal with latent variables. By the way the reason to think about H1 is that it represents all independences over A,B,C,D in this latent variable DAG:

A <- u1 -> B <- u2 -> C <- u3 -> D <- u4 -> A

where we marginalize out the ui variables.


which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph

I think you might be confusing undirected and bidirected graph models. The former form linear exponential families and can be parameterized via cliques, the latter form curved exponential families, and can be parameterized via connected sets.

Replies from: labachevskij
comment by labachevskij · 2014-02-19T12:44:49.478Z · LW(p) · GW(p)

There are easy to compute bijections from R to [0,1], etc.

This is not true, there are bijections between R and (0,1), but not the closed interval.

Anyway, there are more striking examples: for example, if you know that A, B, C, D are in a discrete finite set, it restricts your choices quite a lot.

Replies from: V_V, nshepperd, IlyaShpitser
comment by V_V · 2014-02-19T14:19:03.806Z · LW(p) · GW(p)

This is not true, there are bijections between R and (0,1), but not the closed interval.

No.

comment by nshepperd · 2014-02-19T17:04:50.930Z · LW(p) · GW(p)

Did you mean to say continuous bijections? Obviously adding two points wouldn't change the cardinality of an infinite set, but "easy to compute" might change.

Replies from: labachevskij
comment by labachevskij · 2014-02-21T09:52:50.932Z · LW(p) · GW(p)

You're right, I meant continuous bijections, as the context was a transformation of a probability distribution.

comment by IlyaShpitser · 2014-02-19T12:49:16.747Z · LW(p) · GW(p)

This is not true, there are bijections between R and (0,1), but not the closed interval.

You are right, apologies.

comment by private_messaging · 2014-02-12T10:24:41.670Z · LW(p) · GW(p)

In frequentism there is no such thing as a 'probability of the hypothesis', after all a hypothesis is either true or false and we don't know which. So as a substitution frequentists consider the other conditional probability, the probability of seeing this data or worse provided the hypothesis is true, where worse must be defined beforehand. I'd say this is a wrong approach, a very very wrong approach.

That's not a substitution, and it's the probability of seeing the data provided the hypothesis is false, not true.

It gives the upper bound on the risk that you're going to believe in a wrong thing if you follow the strategy of "do experiments, believe the hypothesis if confirmed".

Mostly we want to update all probabilities until they're very close to 0 or to 1, because the uncertainty leads to loss of expected utility in future decision making.

In frequentism there is no such thing as a 'probability of the hypothesis'

Yeah, and in Bayesianism, any number between 0 and 1 will do - there's still no such thing as a specific "probability of the hypothesis", merely a change to an arbitrary number.

edit: it's sort of like arguing that worst-case structural analysis of a building or a bridge is a "very very wrong approach", and contrasting it with some approach where you make up priors about the quality of concrete, and end up shaving a very very small percent off the construction cost, while building a weaker bridge which bites you in the ass eventually anyway when something unexpected happens to the bridge.

comment by Cyan · 2014-02-11T19:14:50.817Z · LW(p) · GW(p)

I've been thinking about what program, exactly, is being defended here, and I think a good name for it might be "prior-less learning". To me, all procedures under the prior-less umbrella have a "minimax optimality" feel to them. Some approaches search for explicitly minimax-optimal procedures; but even more broadly, all such approaches aim to secure guarantees (possibly probabilistic) that the worst-case performance of a given procedure is as limited as possible within some contemplated set of possible states of the world. I have a couple of things to say about such ideas.

First, for the non-probabilistically guaranteed methods: these are relatively few and far between, and for any such procedure it must be ensured that the loss that is being guaranteed is relevant to the problem at hand. That said, there is only one possible objection to them, and it is the same as one of my objections to prior-less probabilistically guaranteed methods. That objection applies generically to the minimaxity of the prior-less learning program: when strong prior information exists but is difficult to incorporate into the method, the results of the method can "leave money on the table", as it were. Sometimes this can be caught and fixed, generally in a post hoc and ad hoc way; sometimes not.

For probabilistically-guaranteed methods, there is an epistemic gap -- in principle -- in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world). But it's obvious that this is merely an in-principle objection -- after all, many such techniques can be and have been successfully applied to learn true things about the world. The important question is then: does the heretofore implicit principle justifying the bridging of this gap differ significantly from the principle justifying Bayesian learning?

Replies from: jsteinhardt, alex_zag_al
comment by jsteinhardt · 2014-02-12T07:41:49.813Z · LW(p) · GW(p)

Thanks a lot for the thoughtful comment. I've included some of my own thoughts below / also some clarifications.

First, for the non-probabilistically guaranteed methods: these are relatively few and far between

Do you think that online learning methods count as an example of this?

when strong prior information exists but is difficult to incorporate into the method, the results of the method can "leave money on the table", as it were

I think this is a valid objection, but I'll make two partial counter-arguments. The first is that, arguably, there may be some information that is not easy to incorporate as a prior but is easy to incorporate under some sort of minimax formalism. So Bayes may be forced to leave money on the table in the same way.

A more concrete response is that, often, an appropriate regularizer can incorporate similar information to what a prior would incorporate. I think the regularizer that I exhibited in Myth 6 is one example of this.

For probabilistically-guaranteed methods...

I think it's important to distinguish between two (or maybe three) different types of probabilistic guarantees; I'm not sure whether you would consider all of the below "probabilistic" or whether some of them count as non-probabilistic, so I'll elaborate on each type.

The first, which I presume is what you are talking about, is when the probability is due to some assumed distribution over nature. In this case, if I'm willing to make such an assumption, then I'd rather just go the full-on Bayesian route, unless there's some compelling reason like computational tractability to eschew it. And indeed, there exist cases where, given distributional assumptions, we can infer the parameters efficiently using a frequentist estimation technique, while the Bayesian analog runs into NP-hardness obstacles, at least in some regimes. But there are other instances where the Bayesian method is far cheaper computationally than the go-to frequentist technique for the same problem (e.g. generative vs. discriminative models for syntactic parsing), so I only mean to bring this up as an example.

The second type of guarantee is in terms of randomness generated by the algorithm, without making any assumptions about nature (other than that we have access to a random number generator that is sufficiently independent from what we are trying to predict). I'm pretty happy with this sort of guarantee, since it requires fairly weak epistemic commitments.

The third type of guarantee is somewhat in the middle: it is given by a partial constraint on the distribution. As an example, maybe I'm willing to assume knowledge of certain moments of the distribution. For sufficiently few moments, I can estimate them all accurately from empirical data, and I can even bound the error to within high probability, making no assumption other than independence of my samples. In this case, as long as I'm okay with making the independence assumption, then I consider this guarantee to be pretty good as well (as long as I can bound the error introduced into the method by the inexact estimation of the moments, which there are good techniques for doing). I think the epistemic commitments for this type of method are, modulo making an independence assumption, not really any stronger than those for the second type of method, so I'm also fairly okay with this case.

Replies from: TsviBT
comment by TsviBT · 2014-02-12T17:31:09.205Z · LW(p) · GW(p)

there may be some information that is not easy to incorporate as a prior but is easy to incorporate under some sort of minimax formalism

If you can cook up examples of this, that would be helpful.

comment by alex_zag_al · 2014-02-17T07:41:55.588Z · LW(p) · GW(p)

For probabilistically-guaranteed methods, there is an epistemic gap -- in principle -- in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world).

Well, if you believe post-data probabilities reflect real knowledge, then that's a start. Because, you can think of pre-data probabilities as more conservative versions of post-data probabilities. That is, if pre-data calculations tell you to be sure of something, you can probably be at least that sure, post-data.

The example that's guiding me here is a confidence interval. When you derive a confidence interval, you're really calculating the probability that some parameter of interest R will be between two estimators E1 and E2.

P(E1 <= R <= E2) = .95

Post-data, you just calculate E1 and E2 from the data and call that your 95% confidence interval. So you're still using the pre-data probability that R is between those two estimators.
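A minimal simulation of that pre-data reading, using a made-up normal-mean example (nothing below comes from the original comment):

import numpy as np

# The random interval [E1, E2] contains the fixed parameter R in about
# 95% of repeated experiments; that pre-data coverage is the guarantee.
rng = np.random.default_rng(0)
R, sigma, n, trials = 3.0, 1.0, 25, 100000
covered = 0
for _ in range(trials):
    data = rng.normal(R, sigma, n)
    half_width = 1.96 * sigma / np.sqrt(n)
    e1, e2 = data.mean() - half_width, data.mean() + half_width
    covered += (e1 <= R <= e2)
print(covered / trials)   # close to 0.95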

I know of two precise senses in which the pre-data probabilities are conservative, when you use them in this way.

Sense the first: Let H be the hypothesis that E1 <= R <= E2. H is probably true, so you're probably going to get evidence in favor of it. The post-data probability, then, will probably be higher than the pre-data probability.

So, epistemically... I don't know. If you're doing many experiments, this explains why using pre-data probabilities is a conservative strategy: in most experiments, you're underestimating the probability that the parameter is between the estimators. Or, you can view this as logical uncertainty about a post-data probability that you don't know how to calculate: you think that if you did the calculation, it would probably make you more, rather than less sure that the parameter is between the estimators.

Another precise sense in which the pre-data probabilities are more conservative is that pre-data probability distributions have higher entropy than post-data ones, on average.

Let's say R and D are random variables. Let H(R) be the entropy of the probability distribution of R, likewise for D. That is,

H(D) = E[-log P(D)]

I hope this notation is clear... see, usually I'd write P(D=d), but when it's in an expectation operator, I want to make it clear that D is a random variable that the expectation operator is integrating over, so I write things like E[P(D)] (the expected value of P(D=d) when d is randomly selected).

Define the conditional entropy as follows: H(R | D=d) = E[-log P(R | D=d) | D=d]

The theorem, then, is this: E[H(R | D)] <= H(R)

(I don't have a free reference on hand, but it's theorem 9.3.2 in Sheldon Ross's "A First Course in Probability")

So, imagine that R is a paRameter and D is some Data. And note that the expectation is not conditional on D, all this is in the pre-data state of knowledge. So what this theorem means is that, before seeing the data, the expected value of the post-data entropy is below the current entropy.
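A quick numerical sanity check of that inequality, on an arbitrary discrete joint distribution of my own construction:

import numpy as np

# Verify E[H(R | D)] <= H(R) for a random joint distribution p(r, d).
rng = np.random.default_rng(1)
p = rng.random((4, 5))
p /= p.sum()                 # joint distribution over (R, D)
p_r = p.sum(axis=1)          # marginal of R
p_d = p.sum(axis=0)          # marginal of D

def entropy(q):
    q = q[q > 0]
    return -(q * np.log(q)).sum()

H_R = entropy(p_r)
H_R_given_D = sum(p_d[j] * entropy(p[:, j] / p_d[j]) for j in range(p.shape[1]))
print(H_R_given_D <= H_R)    # True: conditioning reduces entropy on average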

This one's a little weirder to interpret, but it clearly seems to be saying something relevant. As a statement about doing many independent experiments, it means that the average pre-data distribution entropy is higher than the average post-data distribution entropy, so when you use the pre-data probabilities, you're taking them from a higher-entropy distribution. So that's a sense in which you could call it a conservative strategy: it tends to use a probability distribution that's too spread out. As a statement about logical uncertainty, when you haven't calculated the post-data probabilities, I guess it could mean that your best estimate of the post-data entropy is lower than the entropy of the pre-data distribution. Which means, if your best estimate is near true, you're using a distribution that's too spread out, not too concentrated.

So that's what I've got. I think there's a lot more to be said here. I haven't read about this topic, I'm just putting together some stuff that I've observed incidentally, so I would appreciate a reference. But what it adds up to is that using pre-data probabilities is a conservative strategy.

And the reason that's important is because conservative strategies can be really useful for science. Sometimes you wanna gather evidence until you've got enough that you can publish and say that you've proved something with confidence. Conservative calculations can often show what you want to show, which is that your evidence is sufficient.

comment by JGWeissman · 2014-02-11T19:49:07.802Z · LW(p) · GW(p)

Regarding myth 5 and the online learning, I don't think the average regret bound is as awesome as you claim. The bound is sqrt((log n) / T). But if there are really no structural assumptions, then you should be considering all possible models, and the number of possible models for T rounds is exponential in T, so the bound ends up being 1, which is the worst possible average regret using any strategy. With no assumptions of structure, there is no meaningful guarantee on the real accuracy of the method.

The thing that is awesome about the bounds guarantee is that if you assume some structure, and choose a subset of possible models based on that structure, you know you get increased accuracy if your structural assumptions hold.

So this method doesn't really avoid relying on structural assumptions, it just punts the question of which structural assumption to make to the choice of models to run the method over. This is pretty much the same as Bayesian methods putting the structural assumptions in the prior, and it seems that choosing a collection of models is an approximation of choosing a prior, though less powerful because instead of assigning models probabilities in a continuous range, it just either includes the model or doesn't.
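For concreteness, here is a sketch of the kind of algorithm behind the sqrt((log n) / T) bound (multiplicative weights over n fixed models); the losses below are arbitrary made-up numbers, not anything from the post:

import numpy as np

# Multiplicative weights ("Hedge") over n fixed models, losses in [0, 1].
rng = np.random.default_rng(2)
n, T = 1000, 5000
eta = np.sqrt(np.log(n) / T)          # learning rate suggested by the bound
weights = np.ones(n)
alg_loss, model_loss = 0.0, np.zeros(n)

for t in range(T):
    p = weights / weights.sum()       # current distribution over models
    losses = rng.random(n)            # could just as well be adversarial
    alg_loss += p @ losses            # expected loss of the algorithm
    model_loss += losses
    weights *= np.exp(-eta * losses)  # downweight models that did badly

avg_regret = (alg_loss - model_loss.min()) / T
print(avg_regret, np.sqrt(np.log(n) / T))   # regret sits at or below the bound's scale

With n fixed this shrinks as T grows; the point above is that letting the model class blow up with T keeps the right-hand side from shrinking.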

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-12T07:14:33.827Z · LW(p) · GW(p)

and the number of possible models for T rounds is exponential in T

??? Here n is the number of other people betting. It's a constant.

If you wanted to, you could create "super-people" that mix and match the bets of other people depending on the round. Then the number of super-people grows exponentially in T, and without further assumptions you can't hope to be competitive with such "super-people". If that's what you're saying, then I agree with that.

And I agree with the broader point that in general you need to make structural assumptions to make progress. The thing that's awesome about the regret bound is that it does well even in the presence of correlated, non-i.i.d., maybe even adversarial data, and even if the "true hypothesis" isn't in the family of models we consider.

Replies from: JGWeissman
comment by JGWeissman · 2014-02-12T20:27:49.455Z · LW(p) · GW(p)

and the number of possible models for T rounds is exponential in T

??? Here n is the number of other people betting. It's a constant.

Within a single application of online learning, n is a constant, but that doesn't mean we can't look at the consequences of it having particular values, even values that vary with other parameters. But, you seem to be agreeing with the main points that if you use all possible models (or "super-people") the regret bound is meaningless, and that in order to reduce the number of models so it is not meaningless, while also keeping a good model that is worth performing almost as well as, you need structural assumptions.

even if the "true hypothesis" isn't in the family of models we consider

I agree you don't need the model that is right every round, but you do need the model to be right in a lot of rounds. You don't need a perfect model, but you need a model that is as correct as you want your end results to be.

maybe even adversarial data

I think truly adversarial data gives a result that is within the regret bounds, as guaranteed, but still uselessly inaccurate because the data is adversarial against the collection of models (unless the collection is so large you aren't really bounding regret).

comment by Cyan · 2014-02-10T05:24:40.491Z · LW(p) · GW(p)

What you have labelled Cox's theorem is not Cox's theorem; it's some version of the Dutch book argument and/or Savage's theorem. Cox's theorem is more like this. (I am the author of those blog posts.)

Also, you've misstated the complete class theorem -- it's actually weaker than you claim. It says that every estimator is either risk-equivalent to some Bayesian minimum posterior expected loss estimator (BMPELE) or has worse risk in at least one possible world. Conversely, no non-BMPELE risk-dominates any BMPELE. (Here "risk" is statistical jargon for expected loss, where the expectation is taken with respect to the sampling distribution.)

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T05:38:15.569Z · LW(p) · GW(p)

Thanks. What do you think I should call it instead of Cox's theorem? Should I just call it "Dutch book argument"?

For the complete class theorem, is the beef with my use of "strictly worse" when I really mean "weakly worse" / "at least as bad"? That was me being sloppy and I'll fix it now, but let me know if there's a further issue.

Replies from: Cyan
comment by Cyan · 2014-02-10T06:02:56.701Z · LW(p) · GW(p)

Yeah, stick with "Dutch book".

For the complete class theorem, is the beef with my use of "strictly worse" when I really mean "weakly worse" / "at least as bad"? That was me being sloppy and I'll fix it now, but let me know if there's a further issue.

Yup, "strictly worse" overstates things. Basically, the complete class theorem says the class of Bayesian minimum posterior expected loss estimators is the risk-Pareto frontier of the set of all estimators.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T06:46:19.365Z · LW(p) · GW(p)

Thanks!

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-12T00:12:56.022Z · LW(p) · GW(p)

I assume you mean in the "infer u" problem? Or am I missing something?

Also, is there a good real-world problem which this reflects?

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-12T07:46:59.045Z · LW(p) · GW(p)

Yes, I mixed up x and y, good catch. It's not trivial for me to fix this while maintaining wordpress-compatibility, but I'll try to do so in the next few days.

This problem is called the "compressed sensing" problem and is most famously used to speed up MRI scans. However it has also had a multitude of other applications, see here: http://en.wikipedia.org/wiki/Compressed_sensing#Applications.

Replies from: Leon, Eliezer_Yudkowsky
comment by Leon · 2014-02-15T00:07:49.191Z · LW(p) · GW(p)

Many L1 constraint-based algorithms (for example the LASSO) can be interpreted as producing maximum a posteriori Bayesian point estimates with Laplace (= double exponential) priors on the coefficients.
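For concreteness, the standard correspondence (the notation here is mine, not Leon's): with Gaussian noise y = X u + e, e ~ N(0, sigma^2 I), and independent Laplace priors p(u_j) proportional to exp(-|u_j| / b), the MAP estimate is

argmax_u p(u | y) = argmin_u [ ||y - X u||^2 / (2 sigma^2) + ||u||_1 / b ]

which is the LASSO objective with regularization weight lambda = sigma^2 / b.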

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-15T03:05:25.915Z · LW(p) · GW(p)

Yes, but in this setting maximum a posteriori (MAP) doesn't make any sense from a Bayesian perspective. Maximum a posteriori is supposed to be a point estimate of the posterior, but in this case, the MAP solution will be sparse, whereas the posterior given a Laplace prior will place zero mass on sparse solutions. So the MAP estimate doesn't even qualitatively approximate the posterior.

Replies from: Leon
comment by Leon · 2014-02-16T09:39:05.501Z · LW(p) · GW(p)

Ah, good point. It's like the prior, considered as a regularizer, is too "soft" to encode the constraint we want.

A Bayesian could respond that we rarely actually want sparse solutions -- in what situation is a physical parameter identically zero? -- but rather solutions which have many near-zeroes with high probability. The posterior would satisfy this I think. In this sense a Bayesian could justify the Laplace prior as approximating a so-called "spike-and-slab" prior (which I believe leads to combinatorial intractability similar to the fully L0 solution).

Also, without L0 the frequentist doesn't get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.

[FWIW I like the logical view of probability, but don't hold a strong Bayesian position. What seems most important to me is getting the semantics of both Bayesian (= conditional on the data) and frequentist (= unconditional, and dealing with the unknowns in some potentially nonprobabilistic way) statements right. Maybe there'd be less confusion -- and more use of Bayes in science -- if "inference" were reserved for the former and "estimation" for the latter.]

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-16T22:27:58.408Z · LW(p) · GW(p)

Also, without L0 the frequentist doesn't get fully sparse solutions either. The shrinkage is gradual; sometimes there are many tiny coefficients along the regularization path.

See this comment. You actually do get sparse solutions in the scenario I proposed.

Replies from: Leon
comment by Leon · 2014-02-17T01:30:25.483Z · LW(p) · GW(p)

Cool; I take that back. Sorry for not reading closely enough.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-12T19:16:12.459Z · LW(p) · GW(p)

Okay, I'm somewhat leaving my expertise here and going on intuition, but I would be somewhat surprised if the problem exactly as you stated it turned out to be solvable by a compressed-sensing algorithm as roughly described on Wikipedia. I was trying to figure out how I'd approach the problem you stated, using techniques I already knew about, but it seemed to me more like a logical constraint problem than a stats problem, because we had to end up with exactly 100 nonzero coefficients and the 100 coefficients had to exactly fit the observations y, in what I assume to be an underdetermined problem when treated as a linear problem. (In fact, my intuitions were telling me that this ought to correspond to some kind of SAT problem and maybe be NP-hard.) Am I wrong? The Wikipedia description talks about using L1-norm style techniques to implement an "almost all coefficients are 0" norm, aka "L0 norm", but it doesn't actually say the exact # of coefficients are known, nor that the observations are presumed to be noiseless.

Replies from: V_V, Vaniver
comment by V_V · 2014-02-12T20:39:24.593Z · LW(p) · GW(p)

You minimize the L1-norm consistently with correct prediction on all the training examples. Because of the way the training examples were generated, this will yield at most 100 non-zero coefficients.

It can be proved that the problem is solvable in polynomial time due to a reduction to linear programming. Writing X for the m-by-n matrix of the x_i (here m = 10,000 and n = 10^6) and y for the vector of observations, the problem is

minimize sum_j |u_j| subject to X u = y

You can further manipulate it to get rid of the absolute value. For each coefficient u_j introduce two variables u_j+ >= 0 and u_j- >= 0, with u_j = u_j+ - u_j- and |u_j| = u_j+ + u_j-:

minimize sum_j (u_j+ + u_j-) subject to X (u+ - u-) = y, u+ >= 0, u- >= 0

which is a standard linear program.
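A scaled-down sketch of this reduction (toy sizes of my own choosing, solved with scipy's linear programming routine; illustrative code, not code from the thread):

import numpy as np
from scipy.optimize import linprog

# Shrunken version of the problem: n coefficients, k nonzero, m observations.
rng = np.random.default_rng(3)
n, k, m = 200, 10, 120
u_true = np.zeros(n)
u_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
X = rng.normal(size=(m, n))
y = X @ u_true

# minimize sum(u_plus + u_minus)  s.t.  X (u_plus - u_minus) = y, both >= 0
c = np.ones(2 * n)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
u_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(u_hat - u_true)))   # typically ~0: exact recovery at these sizes

With m comfortably above k*log(n), the L1 solution typically coincides with the sparse one, which is the equivalence discussed further down the thread.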

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2014-02-12T21:14:14.692Z · LW(p) · GW(p)

Further reading shows that http://en.wikipedia.org/wiki/Sparse_approximation is supposed to be NP-hard, so it can't be that the L1-norm minimum produces the "L0-norm" minimum every time.

http://statweb.stanford.edu/~donoho/Reports/2004/l1l0approx.pdf which is given as the Wikipedia reference for L1 producing L0 under certain conditions, only talks about near-solutions, not exact solutions.

Also Jacob originally specified that the coefficients were drawn from a Gaussian and nobody seems to be using that fact.

Replies from: V_V, jsteinhardt
comment by V_V · 2014-02-12T21:50:57.867Z · LW(p) · GW(p)

Further reading shows that http://en.wikipedia.org/wiki/Sparse_approximation is supposed to be NP-hard, so it can't be that the L1-norm minimum produces the "L0-norm" minimum every time.

Yes, but if I understand correctly it occurs with probability 1 for many classes of probability distributions (including this one, I think).

http://statweb.stanford.edu/~donoho/Reports/2004/l1l0approx.pdf which is given as the Wikipedia reference for L1 producing L0 under certain conditions, only talks about near-solutions, not exact solutions.

It says that if the L0-pseudonorm solution has an error epsilon, then the L1-norm solution has error up to C*epsilon, for some positive C. In the exact case, epsilon is zero, hence the two solutions are equivalent.

Also Jacob originally specified that the coefficients were drawn from a Gaussian and nobody seems to be using that fact.

You don't really need the fact for the exact case. In the inexact case, you can use it in the form of an additional L2-norm regularization.

Replies from: sflicht
comment by sflicht · 2014-02-19T01:24:14.476Z · LW(p) · GW(p)

Note that in the inexact case (i.e. observation error) this model (the Lasso) fits comfortably in a Bayesian framework. (Double exponential prior on u.) Leon already made this point below and jsteinhardt replied

comment by jsteinhardt · 2014-02-13T05:35:57.993Z · LW(p) · GW(p)

As long as the matrix formed by the x_i satisfies the "restricted isometry property", the optimization problem given by V_V above will recover the optimal solution. For m >> k*log(n), (where in this case m = 10,000, k = 100, and n = 10^6), a random Gaussian matrix will satisfy this property with overwhelmingly large probability. I should probably have checked that the particular numbers I gave will work out, although I'm pretty sure they will, and if not we can replace 10,000 with 20,000 and the point still stands.

comment by Vaniver · 2014-02-12T20:31:20.152Z · LW(p) · GW(p)

My impression of compressed sensing is that you have a problem where you don't know it's exactly 100 coefficients; you're looking for the result that uses the minimum number of coefficients. Looking for the minimum takes more work than looking for one that's exactly 100, and you're right that this is binary optimization, which is a giant pain.

The beautiful compressed sensing result is that for a broad class of problems, the solution to the L1-norm minimum problem, which is much easier to solve, is also the solution to the L0-norm minimum problem. (For anyone not familiar, the L1 norm means "add the absolute values of the coefficients," and the L0 norm means "count the number of non-zero coefficients," which is not actually a norm.) In this particular context, I would look for the L1 minimum norm, and either it's less than or equal to 100 coefficients, in which case I'm okay, or it's more than 100 coefficients, in which case (so long as the conditions of the compressed sensing theorem apply) I know the problem as stated is infeasible, and I've found the least infeasible solution.

Replies from: Dentin, V_V
comment by Dentin · 2014-02-19T21:32:56.725Z · LW(p) · GW(p)

Compressed sensing in signal processing may help give an intuitive overview of what's going on. Consider a stream of data samples S, collected periodically at a rate R samples per second. According to the sampling theorem, we can perfectly reconstruct a continuous signal from those samples up to a frequency of R/2. So for samples taken every 1/100th of a second, we can perfectly reconstruct signals from 0 to 50 hz.

Now, take two different subsets of S and compare them:

  • if we take a subset of S at fixed intervals, we can only reconstruct part of the original signal perfectly. The part that we can reconstruct is at a lower frequency: for example, if we take every other sample from S, we can only reconstruct perfectly up to a limit of 25 hz. Frequencies above that cannot be uniquely determined.

  • if we take a completely random subset of S, we can reconstruct the entire frequency range, but we can not reconstruct it perfectly. This is similar to holography, where cutting the hologram in half still reproduces the whole image, but at lower resolution.

We use fixed subsampling when we know our signal is band limited, and that we can safely throw away some band of frequencies.

The use of random sampling (compressed sensing) is when we have a signal that is not band limited, but is sparse enough that we can still accurately describe using a smaller number of data points. The more frequency sparse the signal is, the more accurate our estimate will be. For most compressed sensing applications, the signal can be very sparse indeed, and the number of needed samples can be quite low.

comment by V_V · 2014-02-12T21:08:30.239Z · LW(p) · GW(p)

Looking for the minimum takes more work than looking for one that's exactly 100, and you're right that this is binary optimization, which is a giant pain.

Why binary optimization?

Replies from: Vaniver
comment by Vaniver · 2014-02-12T21:24:57.109Z · LW(p) · GW(p)

I think I was thinking of it as including m binary variables, which I'll call z_j, which indicate whether or not the u_j are nonzero, and then an L0 minimization is that the cost function is the sum of the z_j or the L0 constraint is a constraint that their sum must equal 100. (Standard caveat than L0 is not actually a norm, which I should edit in to the previous post.)
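Spelled out (my notation; M here is just some bound on |u_j| that I am introducing for the illustration), that direct L0 formulation is the mixed-integer program

minimize sum_j z_j subject to y = X u, -M * z_j <= u_j <= M * z_j, z_j in {0, 1}

(or, for the constrained variant, require sum_j z_j = 100). The integrality of the z_j is what makes the exact problem hard, in contrast with the L1 relaxation.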

The SAT problem Eliezer mentioned is binary (well, boolean), and it felt awkward to claim that it's a SAT problem when that could be mistaken as a question on the other SAT, so I decided to switch names without moving too far in concept-space. Your explanation seems much neater than mine, though.

[edit]I should be clear that I mean that looking for the L0 minimization directly is at least as hard as doing it the binary optimization way, but that the L1 minimization problem, which indirectly solves the L0 minimization problem only sometimes (read: almost always), is polynomial time, i.e. much easier to solve, because it's linear.

Replies from: V_V
comment by V_V · 2014-02-12T22:21:56.551Z · LW(p) · GW(p)

I should be clear that I mean that looking for the L0 minimization directly is at least as hard as doing it the binary optimization way, but that the L1 minimization problem, which indirectly solves the L0 minimization problem only sometimes (read: almost always) is polynomial time, i.e. much easier to solve, because it's linear.

Ok.

comment by Cyan · 2014-02-10T06:14:19.557Z · LW(p) · GW(p)

The biggest Bayesian objection to so-called "classical" statistics -- p-values and confidence intervals, not the online-learning stuff with non-probabilistic guarantees -- is that they provide the correct answer to the wrong question. For example, confidence intervals are defined as random intervals with certain properties under the sampling distribution. These properties are "pre-data" guarantees; the confidence interval procedure offers no guarantees about the one specific interval one calculates from the actual realized data one observes.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T06:44:36.138Z · LW(p) · GW(p)

I'm personally pretty comfortable with such "pre-data" guarantees as long as they're sufficiently high probability (e.g. if they hold with probability 99.9999%, I'm not too concerned that I might be unlucky for this specific interval). But I'm not necessarily that interested in defending p-values. I don't dislike them, and I think they can be quite useful in some situations, but they're not the best thing ever.

I do, however, think that concentration bounds are really good, which I would consider to be a sort of conceptual descendant of p-values (but I have no idea if that's actually how they were developed historically).

comment by V_V · 2014-02-10T15:56:02.228Z · LW(p) · GW(p)

Nice post, but it would be preferable if you provided references, especially for the mathematical claims.

Replies from: jsteinhardt, Gunnar_Zarncke
comment by jsteinhardt · 2014-02-10T18:51:10.751Z · LW(p) · GW(p)

Thanks, I added some references at the end (although they're almost certainly incomplete).

Replies from: V_V
comment by V_V · 2014-02-10T19:06:49.339Z · LW(p) · GW(p)

Thank you.

comment by Benquo · 2014-02-10T19:43:40.895Z · LW(p) · GW(p)

Great post! I learned a lot.

I agree that "frequentists do it wrong so often" is more because science is done by humans than due to any flaw in frequentist techniques. I also share your expectation that increased popularity of Bayesian techniques with more moving parts is likely to lead to more, not less, motivated selective use.

The online learning example, though, seems very unambitious in its loss function. Ideally you'd do better than any individual estimator in the underlying data. Let's say I was trying to use answers on the SAT test to predict college GPAs. A method that is no less predictive than the best individual question can still be pretty bad. A simple aggregation like the count of correct answers will very likely do better.

There are lots of statistical techniques that sound almost magical in their robustness, until you notice that this comes at the price of not narrowing our uncertainty very much. (Although some frequentist statistical measures are not bad at this either).

Replies from: jsteinhardt, jsalvatier
comment by jsteinhardt · 2014-02-11T02:00:45.724Z · LW(p) · GW(p)

Great post! I learned a lot.

Thanks! Glad you enjoyed it.

The online learning example, though, seems very unambitious in its loss function. Ideally you'd do better than any individual estimator in the underlying data.

I think this is good intuition; I'll just point out that the example I gave is much simpler than what you can actually ask for. For instance, if I want to be competitive with the best linear combination of at most k of the predictors, then I can do this with k log(n)/epsilon^2 rounds. If I want to be competitive with the best overall combination that uses all n predictors, I can do this with n/epsilon^2 rounds. The guarantees scale pretty gracefully with the problem instance.

(Another thing I'll point out in passing is that you're perfectly free to throw in "fraction of correct answers" as an additional predictor. That doesn't address the core of your point, but I think the preceding paragraph does.)

comment by jsalvatier · 2014-02-18T19:54:55.370Z · LW(p) · GW(p)

I agree that "frequentists do it wrong so often" is more because science is done by humans than due to any flaw in frequentist techniques. I also share your expectation that increased popularity of Bayesian techniques with more moving parts is likely to lead to more, not less, motivated selective use.

Another important reason is that basic frequentist statistics is quite complicated and non-intuitive for the vast majority of even highly educated and mathematically literate people (engineers for example). Bayesian statistics is dramatically simpler and more intuitive on the basic level.

A practitioner who knows basic Bayesian statistics can easily invent new models and know how to solve them conceptually (though often not practically). A practitioner who knows basic frequentist statistics can not.

comment by alex_zag_al · 2014-02-10T08:17:31.970Z · LW(p) · GW(p)

On the other hand, an argument I hear is that Bayesian methods make their assumptions explicit because they have an explicit prior. If I were to write this as an assumption and guarantee, I would write:

Assumption: The data were generated from the prior.

Guarantee: I will perform at least as well as any other method.

While I agree that this is an assumption and guarantee of Bayesian methods, there are two problems that I have with drawing the conclusion that “Bayesian methods make their assumptions explicit”. The first is that it can often be very difficult to understand how a prior behaves; so while we could say “The data were generated from the prior” is an explicit assumption, it may be unclear what exactly that assumption entails. However, a bigger issue is that “The data were generated from the prior” is an assumption that very rarely holds; indeed, in many cases the underlying process is deterministic (if you’re a subjective Bayesian then this isn’t necessarily a problem, but it does certainly mean that the assumption given above doesn’t hold).

In a post addressing a crowd where a lot of people read Jaynes, you're not addressing the Jaynesian perspective on where priors come from.

When E. T. Jaynes does statistics, the assumptions are made very clear.

The Jaynesian approach is this:

  • your prior knowledge puts constraints on the prior distribution,
  • and the prior distribution is the distribution of maximum entropy that meets all those constraints.

The assumptions are explicit because the constraints are explicit.

As an example, imagine that you're measuring something you've measured before. In a lot of situations, you'll end up with constraints on the mean and variance of your prior distribution. This is because you reason in the following way:

  • You think that your best estimate for the next measurement is the average of your previous measurements.
  • Your previous estimates have given you a sense of the general scale of the errors you get. You express this as an estimate of the squared difference between your estimated next measurement and your real next measurement.

If we take "best estimate" to mean "minimum expected least squared error", then these are constraints on the mean and variance of the prior distribution. If you maximize entropy subject to these constraints, you get a normal distribution. And that's your prior.

The assumptions are quite explicit here. You've assumed that these measurements are measuring some unchanging quantity in an unbiased way, and that the magnitude of the error tends to stay the same. And you haven't assumed any other relationship between the next measurement and the previous ones. You definitely didn't assume any autocorrelation, for example, because you didn't use the previous measurement in any of your estimates/constraints.

But since we're not assuming that the data are generated from the prior, we don't have the corresponding guarantee. Which brings up the question, what do we have? What's the point of this?

A paper I'm reading right now (Portilla & Simoncelli, 2000, "A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients") puts it this way:

Guarantee: "The maximum entropy density is optimal in the sense that it does not introduce any constraints on the [prior] besides those [we assumed]."

It's not a performance guarantee. Just a guarantee against hidden assumptions, I guess?

Despite apparently having nothing to do with performance, it makes a lot of sense from within Jaynes's perspective. To him, probability theory is logic: figuring out which conclusions are supported by your premises. It just extends deductive logic, allowing you to see to what degree each conclusion is supported by your premises.

And, I think your criticism of the Dutch Book argument applies here: is your time better spent making sure you don't have any hidden assumptions, or writing more programs to make more money? But it definitely makes your assumptions explicit, that's just part of the whole "prob theory as logic" paradigm.

(But I don't really understand what it means to not introduce any constraints besides those assumed, or why maxent is the procedure that achieves this. That's why I quoted the guarantee from someone who I assume does understand)

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T08:31:07.545Z · LW(p) · GW(p)

The assumptions are explicit because the constraints are explicit.

What does it mean to "assume that the prior satisfies these constraints"? As you already seem to indicate later in your comment, the notion of "a prior satisfying a constraint" is pretty nebulous. It's unclear what concrete statement about the world this would correspond to. So I still don't think this constitutes a particularly explicit assumption.

And, I think your criticism of the Dutch Book argument applies here: is your time better spent making sure you don't have any hidden assumptions, or writing more programs to make more money?

I'm responding to arguments that others have raised in the past that Bayesian methods make assumptions explicit while frequentist methods don't. If I then show that frequentist methods also make explicit assumptions, it seems like a weird response to say "oh, well who cares about explicit assumptions anyways?"

Replies from: alex_zag_al
comment by alex_zag_al · 2014-02-10T08:38:40.362Z · LW(p) · GW(p)

it seems like a weird response to say "oh, well who cares about explicit assumptions anyways?"

Yeah, sorry. I was getting a little off topic there. It's just that in your post, you were able to connect the explicit assumptions being true to some kind of performance guarantee. Here I was musing on the fact that I couldn't. It was meant to undermine my point, not to support it.

What does it mean to "assume that the prior satisfies these constraints"?

?? The answer to this is so obvious that I think I've misunderstood you. In my example, the constraints are on moments of the prior density. In many other cases, the constraints are symmetry constraints, which are also easy to express mathematically.

But then you bring up concrete statements about the world? Are you asking how you get from your prior knowledge about the world to constraints on the prior distribution?

EDIT: you don't "assume a constraint"; a constraint follows from an assumption. Can you re-ask the question?

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T08:54:45.530Z · LW(p) · GW(p)

Yeah, sorry. I was getting a little off topic there. It's just that in your post, you were able to connect the explicit assumptions being true to some kind of performance guarantee. Here I was musing on the fact that I couldn't. It was meant to undermine my point, not to support it.

Ah my bad! Now I feel silly :).

But then you bring up concrete statements about the world? Are you asking how you get from your prior knowledge about the world to constraints on the prior distribution?

So the prior is this thing you start with, and then you get a bunch of data and update it and get a posterior. In general it's pretty unclear what constraints on the prior will translate to in terms of the posterior. Or at least, I spent a while musing about this and wasn't able to make much progress. And furthermore, when I look back, even in retrospect it's pretty unclear how I would ever test if my "assumption" held if it was a constraint on the prior. I mean sure, if there's actually some random process generating my data, then I might be able to say something, but that seems like a pretty rare case... sorry if I'm being unclear, hopefully that was at least somewhat more clear than before. Or it's possible that I'm just nitpicking pointlessly.

Replies from: alex_zag_al
comment by alex_zag_al · 2014-02-10T09:18:16.785Z · LW(p) · GW(p)

Hmm. Considering that I was trying to come up with an example to illustrate how explicit the assumptions are, the assumptions aren't that explicit in my example are they?

Prior knowledge about the world --> mathematical constraints --> prior probability distribution

The assumptions I used to get the constraints are that the best estimate of your next measurement is the average of your previous ones, and that the best estimate of its squared deviation from that average is some number s^2, maybe the variance of your previous observations. But those aren't states of the world, those are assumptions about your inference behavior.

Then I added later that the real assumptions are that you're making unbiased measurements of some unchanging quantity mu, and that the mechanism of your instrument is unchanging. These are facts about the world. But these are not the assumptions that I used to derive the constraints, and I don't show how they lead to the former assumptions. In fact, I don't think they do.

Well. Let me assure you that the assumptions that lead to the constraints are supposed to be facts about the world. But I don't see how that's supposed to work.

comment by Sniffnoy · 2014-02-20T05:49:24.235Z · LW(p) · GW(p)

Where is all this "local optimum" / "global optimum" stuff coming from? While I'm not familiar with the complete class theorem, going by the rough statement given in the article... local vs. global optima is just not the issue here, and is entirely the wrong language to talk about this here?

That is to say, talking about a local maximum requires A. that things are being measured with respect to some total order (though I suppose this could be relaxed to a partial order, but then you'd have to be clear whether you meant "local maximum" or just "locally maximal"; I don't know whether that is standard terminology) and B. some sort of topological structure on the domain, so that you can say what's near a given position. The statement of the complete class theorem, as given, doesn't say anything about any sort of topological structure, or any total order.

Rather, it's a statement about a partial order (or preorder? Since people seem to use preorders in decision theory). And as I said above, the definition of "local maximum" could certainly be relaxed to that case, but that's not relevant here because there's just no localness anywhere in it. Rather, it's simply saying, "Every maximal element is Bayesian."

In particular, this implies that if there is a maximum element, it must be Bayesian, as certainly a maximum element must be maximal. Of course there's no guarantee that there is a maximum element, but I suppose you're considering the case where the partial order is extended to a total (pre)order with some maximum element.

Hell, even in the original post, with the local/global language that makes no sense, the logic is wrong: If we assume that the notions of "local optimum" and "global optimum" make sense here, well, a global maximum certainly is a local maximum! So if every local maximum is Bayesian, every global maximum is Bayesian.

None of this takes away from your point that knowing there exists a better Bayesian method doesn't tell you how to find it, let alone find it with bounded resources. And that just because a maximum, if it exists, must be Bayesian, doesn't imply there is anything good about other Bayesian points, and you may well be better off with a known-good frequentist method. But as best I can tell, all the stuff about local optima is just nonsense, and really just distracts from the underlying point. (So basically, you're wrong about Myth #2.)

comment by RolfAndreassen · 2014-02-10T07:24:55.366Z · LW(p) · GW(p)

The sense in which your online-learning example is frequentist escapes me.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T07:47:28.721Z · LW(p) · GW(p)

In what sense is it Bayesian?

I said at the beginning:

Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, although also a fair amount of overlap (e.g. I’m throwing in statistical learning theory with minimax estimation, which certainly have a lot of overlap in ideas but were also in some sense developed by different communities).

I realize that using the label in this way is in some sense very inaccurate and a gross over-simplification, but on the other hand being purely Bayesian means you are likely to reject all of these techniques, so I think it's fair for the purposes of this post. And I think a large subset of the audience of this post may not even be aware of the distinctions between classical statistics, statistical learning theory, etc. Also I know both statistical learning theorists and classical statisticians who are happy to adopt the label of frequentist.

Replies from: VAuroch
comment by VAuroch · 2014-02-11T01:32:43.952Z · LW(p) · GW(p)

Without elaboration about what that strategy is, it sounds a lot like Bayesian updating based on others' beliefs, i.e. I see that this person is betting more successfully than me, so I update my method to reflect theirs more closely.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-11T02:12:38.965Z · LW(p) · GW(p)

I see that this person is betting more successfully than me; I update my method to reflect theirs more closely

This seems like a description of pretty much all statistical methods ever. Okay that's not quite true, but I think "update parameters to get closer to the truth over time" is a pretty generic property for an algorithm to have.

Now, in the case of horse-racing, it is true that what you maintain is a probability distribution over the other people. However, there are a few key differences. The first is that when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.

The other difference is that I update my weight of person i by a multiplicative factor of, say, (1+a)^(-L(i,t)), where a is some constant and L(i,t) is the loss of person i in round t. The multiplicative factor is certainly reminiscent of Bayesian updating, and you could maybe find a way to interpret the updates as Bayesian updates under some well-motivated model, but it's at least not obvious to me how to do so.

As an aside, in the horse-racing case, this algorithm is called the weighted majority algorithm, and Eliezer spent an entire blog post talking about how he doesn't like it, so at least someone other than me doesn't think that it's "just being Bayesian implicitly".
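For concreteness, here is a minimal sketch of that randomized, multiplicatively updated scheme (mine, not jsteinhardt's actual code; it assumes Python/numpy, the loss matrix and the constant a are placeholders, and losses are assumed to lie in [0, 1]):

```python
import numpy as np

def randomized_weighted_majority(losses, a=0.1, rng=None):
    """losses[t, i] is the loss of expert i in round t; shape (T, n)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    T, n = losses.shape
    w = np.ones(n)                       # one weight per expert
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                  # randomize according to the current weights
        i = rng.choice(n, p=p)           # follow expert i for this round
        total_loss += losses[t, i]
        w *= (1 + a) ** (-losses[t])     # the multiplicative update described above
    return total_loss

# Roughly, E[total_loss] <= (1 + a) * (best expert's total loss) + log(n) / a.
```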

Replies from: trist
comment by trist · 2014-02-11T04:27:46.862Z · LW(p) · GW(p)

And how do you expect to do better by using the weighted majority algorithm? Not in the worst case, but on average?

Replies from: DavidS, jsteinhardt
comment by DavidS · 2014-02-20T16:22:28.850Z · LW(p) · GW(p)

Suppose there are two experts and two horses. Expert 1 always predicts horse A, expert 2 always predicts horse B, and the truth is that the winning horse cycles ABABABABABA... The frequentist randomizes the choice of expert according to the weights; the Bayesian always chooses the expert who currently has more successes, and flips a coin when the experts are tied. (Disclaimer: I am not saying that these are the only possible strategies consistent with these philosophies, I am just saying that these seem like the simplest possible instantiations of "when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.")

If the frequentist starts with weights (1,1), then we will have weights (1/2^k, 1/2^k) half the time and (1/2^k, 1/2^{k+1}) half the time. In the former case, we will guess A and B with equal probability and have a success rate of 50%; in the latter case, we will guess A (the most recent winner) with probability 2/3 and B with probability 1/3, for a success rate of 33%. On average, we achieve 5/12 = 41.7% success, i.e. a failure rate of 7/12 = 0.583 against the experts' failure rate of 0.5. Note that 0.583/0.5 = 1.166... < 1.39, as predicted.

On half of the races, expert 1 has one more correct guess than expert 2, so the Bayesian will lose every one of these bets. On the other half, the Bayesian is guessing at random, so his or her total success rate is 25%. Note that the experts are each getting 50% success rates, and that 0.75/0.5 = 1.5 < 2.41, as we expect.

As is usually the case, softening the penalty factor from 1/2 to (for example) 0.9 would yield slower convergence but better ultimate behavior. In the limit where the penalty factor goes to 1, the frequentist achieves the same 50% that the "experts" do, while the Bayesian is still stuck at 25%.

Now, of course, one would hope that no algorithm would be so Sphexish as to ignore pure alternation like this. But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts' models.
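A quick simulation bears these numbers out. The sketch below is mine, not DavidS's; it assumes Python/numpy, uses the penalty factor 1/2 from the example, and should land near 5/12 for the randomized scheme and 1/4 for the follow-the-leader ("Bayesian") scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
winners = np.arange(T) % 2            # 0 = horse A, 1 = horse B, alternating
predictions = np.array([0, 1])        # expert 0 always says A, expert 1 always says B

# Randomized weighted majority: halve the weight of any expert who guessed wrong.
w = np.array([1.0, 1.0])
rwm_correct = 0
for t in range(T):
    i = rng.choice(2, p=w / w.sum())
    rwm_correct += predictions[i] == winners[t]
    w[predictions != winners[t]] *= 0.5
    w /= w.max()                      # rescale to avoid underflow

# "Follow the leader": pick the expert with more past successes, coin flip on ties.
scores = np.array([0, 0])
ftl_correct = 0
for t in range(T):
    i = np.argmax(scores) if scores[0] != scores[1] else rng.integers(2)
    ftl_correct += predictions[i] == winners[t]
    scores[predictions == winners[t]] += 1

print(rwm_correct / T)   # ~5/12 = 0.417
print(ftl_correct / T)   # ~1/4  = 0.25
```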

Replies from: Oscar_Cunningham, trist
comment by Oscar_Cunningham · 2014-02-20T18:32:01.724Z · LW(p) · GW(p)

A monkey who picked randomly between the experts would do better than both the "frequentist" and the "Bayesian". Maybe that should worry us...

Replies from: DavidS
comment by DavidS · 2014-02-20T21:19:18.843Z · LW(p) · GW(p)

Well, I was trying to make the simplest possible example. We can of course add the monkey to our pool of experts. But part of the problem of machine learning is figuring out how long we need to watch an expert fail before we go to the monkey.

comment by trist · 2014-02-21T07:22:38.923Z · LW(p) · GW(p)

Consider all the possible outcomes of the races. Averaged over all of them, any algorithm will be right half the time (in expectation, for the non-deterministic ones); on any particular subset of those outcomes, some algorithms (other than random guessing) will do better than others. We're looking for algorithms that do well in the subsets that match up to reality.

The more randomness in an algorithm, the less the algorithm varies across those subsets. By doing better in subsets that don't match reality, the weighted majority algorithm does worse in the subsets that do, which are the ones we care about. There are algorithms that do better in reality, and they have less randomness. (Now if none can be reduced from giant lookup tables, that'd be interesting...)

But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts' models.

How often are the models both perfectly contradictory and equal to chance? How often is reality custom tailored to make the algorithm fail?

Those are the cases you're protecting against, no? I imagine there are more effective ways.

comment by jsteinhardt · 2014-02-11T07:02:45.127Z · LW(p) · GW(p)

I just presented you with an incredibly awesome algorithm, indeed one of the most awesome algorithms I can think of. I then showed how to use it to obtain a frequentist version of Solomonoff induction that is superior to the Bayesian version. Your response is to repeat the Bayesian party line. Is there really no respect for truth and beauty these days?

But okay, I'll bite. Better than what? What is the "average" case here?

Replies from: VAuroch
comment by VAuroch · 2014-02-11T08:03:18.231Z · LW(p) · GW(p)

Well, I'm not familiar enough with Solomonoff induction to check your assertion that the frequentist induction is better, but your second question is easy. The average case would clearly be calculating an expected regret rather than a bound. The proof is accurate, but it's measuring a slightly-wrong thing.

EDIT: Looking at the Blum paper, Blum even acknowledges the motivation for EY's objection as a space for future work. (Conclusion 5.2.)

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-11T17:18:16.152Z · LW(p) · GW(p)

Expectation with respect to what distribution?

Replies from: VAuroch
comment by VAuroch · 2014-02-11T21:01:05.580Z · LW(p) · GW(p)

The distribution of the 'expert advisors' or whatever they actually are, their accuracy, and the actual events being predicted. I recognize this is difficult to compute (maybe Solomonoff hard), and bounding the error is a good, very-computable proxy. But it's just a proxy; we care about the expected result, not the result assuming that the universe hates us and wants us to suffer.

If we had a bound for the randomized case, but no bound for the deterministic one, that would be different. But we have bounds for both, and they're within a small constant multiple of each other. We're not opening ourselves up to small-probability large risk, so we just want the best expectation of value. And that's going to be the deterministic version, not the randomized one.

comment by rstarkov · 2014-03-01T14:27:24.645Z · LW(p) · GW(p)

One useful definition of Bayesian vs Frequentist that I've found is the following. Suppose you run an experiment; you have a hypothesis and you gather some data.

  • if you try to obtain the probability of the data, given your hypothesis (treating the hypothesis as fixed), then you're doing it the frequentist way
  • if you try to obtain the probability of the hypothesis, given the data you have, then you're doing it the Bayesian way.

I'm not sure whether this view holds up to criticism, but if so, I sure find the latter much more interesting than the former.
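As a toy illustration of that distinction (mine, not rstarkov's; it assumes Python with scipy, and the data of 8 heads in 10 flips and the two candidate hypotheses are made up):

```python
from scipy.stats import binom

heads, flips = 8, 10
hypotheses = {"fair (p=0.5)": 0.5, "biased (p=0.7)": 0.7}

# Frequentist-flavored quantity: P(data | hypothesis), with the hypothesis held fixed.
likelihoods = {name: binom.pmf(heads, flips, p) for name, p in hypotheses.items()}

# Bayesian-flavored quantity: P(hypothesis | data), which additionally requires a prior.
prior = {name: 0.5 for name in hypotheses}                 # uniform prior (an assumption)
evidence = sum(likelihoods[h] * prior[h] for h in hypotheses)
posterior = {h: likelihoods[h] * prior[h] / evidence for h in hypotheses}

print(likelihoods)   # roughly {fair: 0.044, biased: 0.233}
print(posterior)     # roughly {fair: 0.16,  biased: 0.84}
```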

Replies from: waveman
comment by waveman · 2020-08-02T12:31:05.021Z · LW(p) · GW(p)

Interesting that, very often, people interpret a frequentist result as though it were Bayesian. E.g. that there is a 90% chance the true value is within the confidence interval. This is so common in medical research that it may possibly be the majority interpretation.

comment by orthonormal · 2014-02-20T05:03:57.890Z · LW(p) · GW(p)

This seems analogous to the situation where there’s some quant at Jane Street, and they’re about to run code that will make thousands of dollars trading stocks, and someone comes up to them and says “Wait! You should add checks to your code to make sure that no subset of your trades will lose you money!”

I would be quite surprised if Wall Street quant firms didn't use unit tests of this sort before letting a new algorithm play with real money. And in fact, I can imagine a promising-sounding algorithm flaming out by making a cycle of Dutch-booked trades...

Replies from: Luke_A_Somers
comment by Luke_A_Somers · 2014-02-21T21:50:33.698Z · LW(p) · GW(p)

I think the point is, it doesn't matter if the parts of your model that you're not thinking about are wrong.

A fine quant algorithm could have an incomplete, inconsistent model that is Dutch-bookable only in regions outside of its main focus. Before it would go out and make those Dutch-booking trades, though, it would focus on those areas, get itself consistent there, and make good (or no) trades instead.

comment by adam_strandberg · 2014-02-19T04:46:55.852Z · LW(p) · GW(p)

I am deeply confused by your statement that the complete class theorem only implies that Bayesian techniques are locally optimal. If for EVERY non-Bayesian method there's a better Bayesian method, then the globally optimal technique must be a Bayesian method.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-19T07:14:45.901Z · LW(p) · GW(p)

There is a difference between "the globally optimal technique is Bayesian" and "a Bayesian technique is globally optimal". In the latter case, we still have to choose from an infinitely large family of techniques (one for each choice of prior). Bayes doesn't help me know which of these I should choose. In contrast, there are frequentist techniques (e.g. minimax) that will give me a full prescription of what I ought to do. Those techniques can in many (but not all) cases be interpreted in terms of a prior, but "choose a prior and update" wasn't the advice that led me to that decision; rather, it was "play the minimax decision rule".

As I said in my post:

I would much rather have someone hand me something that wasn’t a local optimum but was close to the global optimum, than something that was a local optimum but was far from the global optimum.

comment by private_messaging · 2014-02-11T09:48:11.453Z · LW(p) · GW(p)

Excellent post.

I especially love the concrete and specific example in #3

Mostly, whenever you see people discuss something "Bayesian" or "frequentist" here, or "updates", it just makes you appreciate the fact that Gladwell made up a new term for whatever he intuited an eigenvalue was, which was not what an eigenvalue is.

comment by Gunnar_Zarncke · 2014-02-10T22:27:23.202Z · LW(p) · GW(p)

What he doesn't tell you is that myth 5 (online learning) applies McDiarmid's inequality to derive the more precise bound

$$R(f_n) \;\le\; R_{\mathrm{emp}}(f_n) + \sqrt{\frac{\log N + \log(1/\delta)}{2n}}$$

Here R(f_n) is the true risk, R_emp(f_n) the empirical risk measured on the sample, N the number of functions in the class, and delta the failure probability (the bound holds with probability at least 1 - delta).

This is derived e.g. in Introduction to Statistical Learning Theory, 2004, Bousquet et al

And you can get even better convergence if you know more about the function class, e.g. its VC dimension (roughly, the largest number of points the class can shatter, a measure of its capacity).
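For a sense of scale, here is a quick numerical reading of that bound (my own sketch, assuming Python; the values of N, n, and delta are arbitrary):

```python
import math

def risk_gap_bound(N, n, delta):
    """Bound on the gap between true and empirical risk for a finite class of N
    predictors, n i.i.d. samples, and failure probability delta."""
    return math.sqrt((math.log(N) + math.log(1.0 / delta)) / (2 * n))

print(risk_gap_bound(N=1000, n=10_000, delta=0.01))   # about 0.024
```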

EDIT: LaTeX math in comment fixed.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-11T01:51:47.040Z · LW(p) · GW(p)

Thanks for adding the reference. Are you sure McDiarmid's inequality applies in the online learning case, though? The inequality you wrote down looks like a uniform convergence result, which as far as I'm aware still requires an i.i.d. assumption somewhere (although uniform convergence results are also super-awesome; I was even considering including them in my post but removed them due to length reasons).

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2014-02-11T10:43:37.105Z · LW(p) · GW(p)

Yes, the formula is for the i.i.d. case. See section 3.4 in the reference.

comment by EHeller · 2014-02-10T07:01:16.858Z · LW(p) · GW(p)

The view of maximum entropy you are providing is basically the same as what physicists would present. If your function phi were the system Hamiltonian (and the expectation of phi, the energy), then the function q you set up would be the Boltzmann distribution, etc.

If it's different from the view presented by Jaynes, I'm not sure I see why.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T07:06:02.204Z · LW(p) · GW(p)

Do physicists think of maximum entropy as the Nash equilibrium of a game between the predictor and nature? I asked two physicists about this and they both said it was new to them, but perhaps it was just a fluke...

Replies from: EHeller
comment by EHeller · 2014-02-10T07:12:41.597Z · LW(p) · GW(p)

That would be different, but it's also not what you presented, unless it's hidden in your phrase "one can show."

Jaynes argued that finding the maximum likelihood of q was equivalent to maximizing entropy, which was the best you can do just knowing the extrinsic variables (phi^*), which seems to be the result you also presented.

Edit: I should point out that most physicists probably aren't in the Jaynes camp (which I'll call subjective thermodynamics), but I'd assume most theorists have stumbled upon it at least once.

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T07:26:40.240Z · LW(p) · GW(p)

A particular sentence I'd point to is

We are interested in constructing a probability distribution q(x) such that no matter what particular value p(x) takes, q(x) will still make good predictions.

This is the thing I don't think physicists (or Jaynes, though I haven't really read Jaynes) do. But if I'm wrong I'll edit the post to reflect that.

Replies from: EHeller
comment by EHeller · 2014-02-10T07:38:19.641Z · LW(p) · GW(p)

If you rephrased that as "no matter what microstate compatible with the extrinsic observations the system is in, our model makes good predictions" I think you'd find that most physicists would recognize that as standard thermodynamics, and also that (like me) they wouldn't recognize it as a game :).

Replies from: jsteinhardt
comment by jsteinhardt · 2014-02-10T07:41:54.722Z · LW(p) · GW(p)

Okay, I've edited the relevant part of the original post to link to this comment thread.

comment by waveman · 2020-08-02T12:47:35.266Z · LW(p) · GW(p)
(Myth 5) Online learning: Online learning and online convex optimization

Can you be more specific about which result(s) within the 84 pages of this document substantiate your claims? Actually it would be better to have a reference to a result in a paper in a peer reviewed journal.

comment by twanvl · 2014-02-19T11:58:11.538Z · LW(p) · GW(p)

As I understand it, the big difference between Bayesian and frequentist methods is in what they output. A frequentist method gives you a single prediction $z_t$, while a Bayesian method gives you a probability distribution over the predictions, $p(z_t)$. If your immediate goal is to minimize a known (or approximable) loss function, then frequentist methods work great. If you want to combine the predictions with other things as part of a larger whole, then you really need to know the uncertainty of your prediction, and ideally you need the entire distribution.

For example, when doing OCR, you have some model of likely words in a text, and a detector that tells you what character is present in an image. To combine the two, you would use the probability of the image containing a certain character and multiply it by the probability of that character appearing at this point in an English sentence. Note that I am not saying that you need to use a fully Bayesian model to detect characters, just that you somehow need to estimate your uncertainty and be able to give alternative hypotheses.
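A bare-bones sketch of that combination (mine, not twanvl's; it assumes Python/numpy, and both the detector output and the language-model probabilities are made up):

```python
import numpy as np

chars = ["c", "e", "o"]
p_image    = np.array([0.20, 0.35, 0.45])   # detector: P(char | image), leans toward "o"
p_language = np.array([0.50, 0.40, 0.10])   # language model: P(char | context)

combined = p_image * p_language             # multiply the two sources of evidence
combined /= combined.sum()                  # renormalize over the candidate characters
print(dict(zip(chars, np.round(combined, 3))))   # "e" and "c" now dominate, not "o"
```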

In summary, combining multiple models is where Bayesian reasoning shines. You can easily paste multiple models together and expect to get a sensible result. On the other hand, for getting the best result efficiently, state of the art frequentist methods are hard to beat. And as always, the best thing is to combine the two as appropriate.

comment by SilentCal · 2014-02-20T00:51:17.512Z · LW(p) · GW(p)

I've come to the conclusion that the strength and weakness of the Bayesian method is that it's simultaneously your experimental procedure and your personal decision-making method (given a utility function, that is). This makes it more powerful, but can also make getting a good answer more demanding.

The strength is that it makes mathematical the things that are done informally in frequentism. This is the way in which Bayes makes explicit the assumptions that frequentism hides. For example, suppose a frequentist does an experiment that shows that broccoli causes cancer with 99% significance. Frequentist: "If broccoli didn't cause cancer, there would be a 1% chance of a result like that happening. That's an absolute fact, none of your Bayesian assumptions. So let it henceforth be Science that broccoli causes cancer." Bayesian: "Wait, that last sentence wasn't justified! In order to take any action based on your results, you have to invoke some kind of prior. What would you do if you got the same result with 'the transit of Venus' in the place of 'broccoli'? You could just kind of wing it, but then you're hiding assumptions."

The weakness is that a human is liable to put too many zeroes in the 'absurd' prior that the transit of Venus causes cancer and end up ignoring overwhelming evidence if it turns out that this effect is somehow real. In an open scientific setting, sharing the likelihood ratio might solve the problem here, but if you give your machine learning algorithm a bad prior, that might not get noticed. And if you forget to give your rock-paper-scissors bot a nonzero prior that the opponent has its source code, it can get beaten 100% of the time by an opponent who can simulate it (since it'll have no reason to revert to random).

Frequentists might do better by implicitly using a 'one-size-fits-all' prior that beats what we would write down (because we're not perfect Bayesians). This is the only advantage of frequentism I can see apart from computational concerns.

comment by Daniel_Burfoot · 2014-02-12T05:06:24.087Z · LW(p) · GW(p)

Thanks for the thought-provoking post. How would you answer this criticism of the frequentist paradigm:

  • Frequentist methods cannot make statements about actual data, only about abstract properties of the data.

For example, suppose I generated a data set of 1000 samples by weighing a bunch of people. So I just have 1000 numbers x1, x2, ... x1000 representing their weight in kilograms. Given this data set, frequentist analysis can make various claims about abstract properties such as its mean or variance, or about whether it could possibly have been drawn from a certain null-hypothesis distribution. But say I simply want to know P(x > 200), i.e. the probability that when I go out and weigh a new person, the result will be greater than 200 kg. I claim that frequentist methods simply cannot answer this question, and that any method that claims to answer it is actually a Bayesian method in disguise; in particular, it is implicitly using a prior.