Teaching Bayesianism

post by JQuinton · 2012-06-08T20:18:35.085Z · LW · GW · Legacy · 45 comments

I've had a bit of success with getting people to understand Bayesianism at parties and such. I'm posting the thought experiment I came up with to see if it can be improved, or if an entirely different thought experiment would be grasped more intuitively in that context:

Say there is a jar that is filled with dice. There are two types of dice in the jar: one is an 8-sided die with the numbers 1 - 8, and the other is a trick die with a 3 on every face. The jar contains the two types in equal proportion. If a friend of yours grabbed a die from the jar at random, rolled it, and told you that the number that landed was a 3, is it more likely that they grabbed the 8-sided die or the trick die?

I originally came up with this idea to explain falsifiability, which is why I didn't go with, say, the example in the better-known article on Bayesianism: any number besides a 3 refutes the possibility that the trick die was picked, which lets me talk about hypotheses that explain too much contradictory data. From there I vary the setup: I increase the number of sides on the die (say, a hypothetical 50-sided die), the types of dice in the jar (100-sided, 6-sided, trick die), and the proportions of dice in the jar (90% of the dice are 200-sided but a 3 is rolled, etc.). Again, I've been discussing this at parties where alcohol is flowing and cognition is impaired, yet people understand it, so I figure that if it works there it can be understood intuitively by many people.
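For anyone who wants to check the numbers or generate new party variants, here's a minimal simulation sketch in Python (the jar encoding and function name are just illustrative choices):

```python
import random

def posterior_by_simulation(dice, observed=3, trials=100_000):
    """dice: list of (weight, faces) pairs describing the jar contents.
    Estimates the probability that each die type was the one drawn,
    given that the reported roll equals `observed`."""
    counts = [0] * len(dice)
    weights = [w for w, _ in dice]
    for _ in range(trials):
        i = random.choices(range(len(dice)), weights=weights)[0]
        if random.choice(dice[i][1]) == observed:
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]

# The party version: an even split of 8-sided dice and trick dice (3 on all faces).
print(posterior_by_simulation([(1, range(1, 9)), (1, [3] * 8)]))
# roughly [0.11, 0.89]; the trick die is 8 times more likely
```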

45 comments

Comments sorted by top scores.

comment by private_messaging · 2012-06-08T21:30:01.805Z · LW(p) · GW(p)

It's very straightforward in the frequentist interpretation: half the people pick the normal die and one in 8 of them rolls a 3, so 1/16 of the original people roll a 3 off the normal die, while 1/2 roll a 3 off the trick die, for a total of 9/16 rolling a 3. The 1/16 with the normal die, out of the 9/16 that roll a 3: there's your probability, 1/9. Trivial stuff people should be able to reinvent if they skipped or forgot it; 4th or 5th grade math at most. Too bad there's no good training at an early enough age.
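Spelling the same counting out exactly (a quick sketch in Python, with exact fractions standing in for the imagined population):

```python
from fractions import Fraction

# 16 equally likely (die, face) outcomes: 8 faces of the normal die, 8 of the trick die.
outcomes = [("normal", face) for face in range(1, 9)] + [("trick", 3)] * 8

threes = [die for die, face in outcomes if face == 3]  # 9 of the 16 outcomes show a 3
print(Fraction(threes.count("normal"), len(threes)))   # 1/9
```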

Train people to think straight and Bayes will pop up; train people to do Bayes and they'll think wrong with Bayes.

edit: actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts away the philosophical detail of whether probability is a degree of belief or the limiting frequency of long-run trials.

Replies from: pragmatist, othercriteria, David_Gerard
comment by pragmatist · 2012-06-09T11:44:53.537Z · LW(p) · GW(p)

There are practically relevant considerations that emerge from the philosophical distinction between Bayesianism and frequentism. If you have an epistemic conception of probability, then it makes sense to talk about the probability distribution of a theoretical parameter, such as the mean of some variable in a population. If you're a frequentist, though, this usually does not make sense. The variable itself has a relative frequency associated with each of its values, but it makes no sense to talk of the relative frequency of the mean of the variable. So on a frequentist conception of probability, you wouldn't assign a probability distribution over the theoretical parameter. The parameter is not treated as a random variable.

The upshot is that frequentists focus on likelihood functions, which a Bayesian would describe as giving the probability of observed data conditional on the theoretical parameter. Bayesians, on the other hand, look for posterior probability distributions, the distribution of the theoretical parameter conditional on observed data. Bayesians report the epistemic probability that the parameter value is in a certain interval, given the data. For frequentists, the interval is the random variable, and the unknown parameter is treated as fixed. So the probabilities they report are the relative frequency (in an ensemble of experiments) with which the confidence interval constructed using observed data will contain the theoretical parameter.

So the philosophical difference has important methodological consequences in statistics. These are explored at length in many advanced textbooks in theoretical statistics, such as this one. The distinction isn't just a local trope; it's an important foundational distinction in statistics.

Replies from: private_messaging
comment by private_messaging · 2012-06-09T15:35:54.725Z · LW(p) · GW(p)

If you have an epistemic conception of probability, then it makes sense to talk about the probability distribution of a theoretical parameter, such as the mean of some variable in a population.

Please go on with an example of how it is practically relevant, i.e. where frequentism fails.

Bayesianism here, with Solomonoff induction as a prior, is identical to frequentism over Turing machines anyway (or at least it should be; if you make mistakes it won't be).

As for the local trope, it seems to be a complete misunderstanding of books such as the one you linked.

Replies from: pragmatist, army1987
comment by pragmatist · 2012-06-09T19:20:10.092Z · LW(p) · GW(p)

Please go on with an example of how it is practically relevant, i.e. where frequentism fails.

For an actual scientific example of Bayesian and frequentist methods yielding different results when applied to the same problem, see Wagenmakers et al.'s criticisms [PDF] of Bem's precognition experiments.

Here's a toy example that (according to Bayesians, at least) illustrates a defect of frequentist methodology. You draw two random values from a uniform distribution with unknown mean m and known width 1. Let these values be v1 and v2, with v1 < v2. If you did this experiment repeatedly, then you would expect that 50% of the time, the interval (v1, v2) would include the population mean m. So according to the frequentist, this is the 50% confidence interval.

Suppose that on a particular run of the experiment, you get v1 = 0.1 and v2 = 1.0. For this particular data, the Bayesian would say that given our model, there is a 100% chance of the mean lying in the interval (v1, v2). The consistent frequentist, however, cannot say this. She can't talk about the probability of the mean lying in the interval; she can only talk about the relative frequency with which the interval (considered as a random variable) will contain the mean, and this remains 50%. So she will say that the interval (0.1, 1.0) is a 50% confidence interval. The Bayesian charge is that by refusing to conditionalize on the actual data available to her, the frequentist has missed important information: specifically, that the mean of the distribution is definitely between 0.1 and 1.0.
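A quick numerical sketch of both claims (Python; my own check, not part of the original example): the unconditional coverage of (v1, v2) is 50%, while simple deduction pins the mean down once the data are in hand:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.55                        # the "unknown" mean, hidden from the statistician
draws = rng.uniform(m - 0.5, m + 0.5, size=(100_000, 2))
v1, v2 = draws.min(axis=1), draws.max(axis=1)

# Unconditional coverage: about 0.5, which is all the frequentist reports.
print(((v1 < m) & (m < v2)).mean())

# For the single run with v1 = 0.1, v2 = 1.0, deduction alone gives
# m in (v2 - 0.5, v1 + 0.5) = (0.5, 0.6), so (0.1, 1.0) certainly contains m.
```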

Replies from: Dreaded_Anomaly, private_messaging
comment by Dreaded_Anomaly · 2012-06-10T04:05:33.460Z · LW(p) · GW(p)

A similar example is given in Wei_Dai's post Frequentist Magic vs. Bayesian Magic.

comment by private_messaging · 2012-06-10T08:30:13.910Z · LW(p) · GW(p)

precognition experiments.

...

The Bayesian charge is that by refusing to conditionalize on the actual data available to her, the frequentist has missed important information: specifically, that the mean of the distribution is definitely between 0.1 and 1.0.

Still sounds like a silly terminology collision. Like if in physics you had right-hand-rulers and left-hand-rulers, and some would charge that the direction of the magnetic field is all wrong, while each party simply means a different thing by 'magnetic field' (and a few people associated with Insane Clown Posse sometimes straight-up calculate it wrong).

edit: ohh, what you wrote is even worse than the sense I accidentally read into it (I misread uniform as normal and got confused afterwards). Picking a case where people screw up the math, as a strawman. Stupid, very stupid. And boring.

Nowhere does it follow from seeing probability as a limit over an infinite number of trials (frequentism) that the mean of that distribution with unknown mean wouldn't be restricted to a specific range. Say you draw one number x from this distribution with width 1. It immediately follows that values of the unknown parameter of the generator of the number that fall outside (x-0.5, x+0.5) are not possible. As you keep drawing more, you narrow down the set of possible values. [I am banning the word 'mean' because there is the mean that is a property of the system we are studying, and there is the mean of the model we are creating: two different things.]

Replies from: pragmatist
comment by pragmatist · 2012-06-11T00:50:05.438Z · LW(p) · GW(p)

Nowhere does it follow from seeing probability as a limit over an infinite number of trials (frequentism) that the mean of that distribution with unknown mean wouldn't be restricted to a specific range.

In the particular case I gave, of course frequentists could produce an argument that the mean must be in the given range. But this could not be a statistical argument, it would have to be a deductive logical argument. And the only reason a deductive argument works here is that the posterior of the mean being in the given range is 1. If it were only slightly less than 1, 0.99 say, there would be no logical argument either. In that case, the frequentist could not account for the fact that we should be extremely confident the mean is in that range without implicitly employing Bayesian methods. Frequentist methods neglect an important part of our ordinary inductive practices.

Now look, I don't think frequentists are idiots. If they encountered the situation I describe in my toy example, they would of course conclude that the mean is in the interval [0.1, 1.0]. My point is that this is not a conclusion that follows from frequentist statistics. This doesn't mean frequentist methodology precludes this conclusion; it just does not deliver it. A frequentist who came to this conclusion would implicitly be employing Bayesian methods. In fact, I don't think there is any such creature as a pure frequentist (at least, not anymore). There are people who have a preference for frequentist methodology, but I doubt any of them would steadfastly refuse to assign probabilities to theoretical parameters in all contexts.

I expect most scientists and statisticians are pluralists, willing to apply whichever method is pragmatically suited to a particular problem. I'm in the same boat. I'm hardly a Bayesian zealot, and I'm not arguing that actual applied statisticians divide into perfect frequentists and perfect Bayesians. What I'm arguing against is your initial claim that no significant methodological distinction follows from conceiving of probabilities as epistemic vs. conceiving of them as relative frequencies. There are significant methodological distinctions, these distinctions are acknowledged by virtually all practicing statisticians, and applying the different methodologies can lead to different conclusions in certain situations.

If your objection to LW is that Bayesianism shouldn't be regarded as the one true logic of induction, to the exclusion of all others, then I'm with you brother. I don't agree with Eliezer's Bayesian exclusionism either. But this is not the objection you raised. You seemed to be claiming that the distinction between Bayesian and frequentist methods is somehow idiosyncratic to this community, and this is just wrong.

(Incidentally, I am sorry you find my example boring and stupid, but your quarrel is not with me. It is with Morris DeGroot, in whose textbook I first encountered the example. I mention this as a counter to the view you seem to be espousing, that all respectable statisticians agree with you and only "sloppy philosophers" disagree.)

Replies from: othercriteria
comment by othercriteria · 2012-06-11T02:01:42.393Z · LW(p) · GW(p)

In the particular case I gave, of course frequentists could produce an argument that the mean must be in the given range. But this could not be a statistical argument, it would have to be a deductive logical argument.

The frequentists do have an out here: conditional inference. Obviously, (v2+v1)/2 is sufficient for m, so they don't need any other information for their inference. But it might occur to them to condition on the ancillary statistic v2-v1. In repeated trials where v2-v1 = 0.9, the interval (v1,v2) always contains m.

Edit: As pragmatist mentions below, this is wrong wrong wrong. The minimal sufficient statistic is (v1,v2), although it is true that v2-v1 is ancillary and moreover it is the ancillary complement to the sample mean. That I was working with order statistics (and the uniform distribution!) is a sign that I shouldn't just grope for the sample mean and say good enough.
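A quick simulation sketch (my own check) of the conditional claim: restricted to trials where the observed spread v2-v1 exceeds 0.5 (which includes spreads near 0.9), the interval always covers m:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 0.0                                    # true mean, hidden from the analyst
draws = rng.uniform(m - 0.5, m + 0.5, size=(500_000, 2))
v1, v2 = draws.min(axis=1), draws.max(axis=1)
spread = v2 - v1                           # the ancillary statistic

wide = spread > 0.5                        # condition on the ancillary
print(((v1[wide] < m) & (m < v2[wide])).mean())  # 1.0: conditional coverage is 100%
```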

Replies from: pragmatist
comment by pragmatist · 2012-06-11T04:22:21.904Z · LW(p) · GW(p)

True, but is there any motivation for the frequentist to condition on the ancillary statistic, besides relying on Bayesian intuitions? My understanding is that the usual mathematical motivation for conditioning on the ancillary statistic is that there is no sufficient statistic of the same dimension as the parameter. That isn't true in this case.

ETA: Wait, that isn't right... I made the same assumption you did, that the sample mean is obviously sufficient for m in this example. But that isn't true! I'm pretty sure in this case the minimal sufficient statistic is actually two-dimensional, so according to what I wrote above, there is a mathematical motivation to condition on the observed value of the ancillary statistic. So I guess the frequentist does have an out in this case.

comment by A1987dM (army1987) · 2012-06-09T15:59:36.632Z · LW(p) · GW(p)

That's not what frequentists actually do. See e.g. Probability Theory: The Logic of Science by E.T. Jaynes.

Replies from: private_messaging
comment by private_messaging · 2012-06-09T16:37:59.502Z · LW(p) · GW(p)

What is not what frequentists actually do?

Replies from: army1987
comment by A1987dM (army1987) · 2012-06-09T17:21:57.186Z · LW(p) · GW(p)

Reasoning “over Turing machines” and thence getting the same results (or even using the same tools) as Bayesians.

Replies from: private_messaging
comment by private_messaging · 2012-06-10T08:53:13.587Z · LW(p) · GW(p)

Where did you get your ideas of statistics, may I ask? "What frequentists do" and "what Bayesians do" isn't even part of mathematics; in mathematics you learn the formulae, where they come from, and actually see how either works. You can't teach that in a post; you'd need several years of study.

You learn "what frequentists do" and "what Bayesians do" predominantly from people who can't actually do any interesting math and instead resort to some sort of confused meta for amusement.

Also: nobody actually uses Solomonoff induction; it's uncomputable.

Frequentism is seeing probability as the limit of infinitely many trials. Nothing more, nothing less. You can do trials on a Turing machine if you wish. If you read LessWrong you'd think frequentism were some blatant rejection of the Bayes rule or something. LessWrong seems to be predominantly focused on the meta of claiming itself to be less wrong than someone else, usually wrongly.

comment by othercriteria · 2012-06-09T23:20:09.667Z · LW(p) · GW(p)

actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts away the philosophical detail of whether probability is a degree of belief or the limiting frequency of long-run trials.

Where do you get the idea that it's a local trope? Knowledgeable and well-respected people in the field consider these foundational issues important, e.g., Brad Efron and Andrew Gelman.

You can make an argument that the philosophical details wash out as long as you're operating on a fully specified probability space. In that sense, probability is just sort of syntactic manipulation. But once you start thinking about statistics, where the events and probabilities have some semantic/denotative connection with the real world, you need to care about where the probability space you're working with comes from.

Replies from: private_messaging
comment by private_messaging · 2012-06-09T23:34:31.596Z · LW(p) · GW(p)

Okay, here's a problem for you. Not a neat probability problem. A rectangular die has sides of length 1cm, 1.1cm, and 1.2cm; it is made of 316 stainless steel; the edges and corners are rounded to a radius of 1mm; it is dropped onto a 10cm-thick steel plate made of the same type of steel and bounces several times. What would you do to find the probabilities of it landing on each side?

Clearly there is no disagreement that 1) agents may represent their uncertainty with probabilities, and 2) a physical system such as a die works like a hash function of its initial state: for a perfect die, very nearly exactly 1/6 of the initial state space gets transformed into each number, and after several bounces, points in the state space that get transformed into different numbers are separated by less than attoradians of initial angle and attoradians per second of initial angular velocity. The effect of small deviations from a symmetrical shape can be estimated from physical considerations. The outcomes of any games can be found by starting from physics and counting over the states that are consistent with the observations that took place; that is likewise not controversial.

Nobody respected disagrees that there exists such a property of physical systems that incorporate chaos (they act as a hash function, essentially); nobody respected disagrees that you can also have degrees of belief that shouldn't be Dutch-bookable; and a bunch of sloppy philosophers who don't really understand either are very confused, going on about Bayesianism this, Bayesianism that, "Good Bayesian", "spoke fluent Bayesian", I kid you not. The latter sort of stuff seems to be the local-ish trope.

edit: to summarize, we probably just need two different words: one for the property of chaotic physical systems (or hash functions or the like), and another for degrees of belief, which only have to obey certain properties among themselves to avoid a Dutch book or the like. The argument over which should be called 'probability' is pretty silly. Anything with dice in it falls straight into the chaotic-physical-systems category.

Replies from: othercriteria
comment by othercriteria · 2012-06-10T00:53:14.085Z · LW(p) · GW(p)

Ignoring, temporarily, everything but the first paragraph, there are two ways I might proceed.

Acting as a frequentist, I would suppose that die rolls could be modeled as independent identically distributed draws from a multinomial distribution with fixed but unknown parameters. (The independence, and to a lesser degree the identically distributed, assumption could also be verified, although this gets a bit tricky.) I would roll the die some fixed number of times (possibly determined according to an a priori calculation of statistical power) and take the MLE as a point estimate of the unknown parameters. I would report this parameter as the probability of the die landing on the various sides. I might also report a 95% confidence region for the estimate, which is not to be interpreted as containing the true probabilities 95% of the time (it either does or does not, with certainty).

Acting as a Bayesian, I would assume the same data model, but I would also place a prior distribution on the unknown parameter. A natural prior in this case is the Dirichlet distribution, which is conjugate to the multinomial distribution. I would also use the same data collection approach, although the Bayesian formulation makes it easy to work with the special case of observing a single roll. Given the model likelihood and the prior distribution, Bayes' law tells me the new posterior distribution to which I should update to represent my uncertainty over the unknown parameter. I would continue to roll the die and update until the posterior distribution is sufficiently concentrated according to some reasonable stopping criterion. I would then report the posterior mean (or maybe the MAP estimate) as the probability of the die landing on the various sides. I would also report a 95% credible region for the estimate, to which I would give a 95% credence of containing the truth (although under questioning, I would probably be evasive/unclear about exactly what that means). I would also need to communicate a justification for my prior distribution and, ideally, evidence that the inference is not overly sensitive to it. I ought to just report the posterior distribution itself, but people tend to find it easier to base decisions on point estimates.
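A sketch of both analyses in Python (the counts, the uniform Dirichlet prior, and the reporting choices are all illustrative assumptions, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([9, 12, 8, 11, 10, 10])      # hypothetical observed side counts

# Frequentist: MLE point estimate of the multinomial parameters.
mle = counts / counts.sum()

# Bayesian: uniform Dirichlet(1,...,1) prior; conjugacy gives posterior
# Dirichlet(counts + 1), whose mean is the reported point estimate.
alpha = np.ones(6)
posterior_mean = (counts + alpha) / (counts.sum() + alpha.sum())

# A 95% credible interval for P(side 1), read off posterior samples.
samples = rng.dirichlet(counts + alpha, size=100_000)[:, 0]
print(mle[0], posterior_mean[0], np.percentile(samples, [2.5, 97.5]))
```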

There are obvious similarities to these two inferential approaches, but they are answering slightly different questions using vastly different methods.

Replies from: private_messaging
comment by private_messaging · 2012-06-10T08:16:17.971Z · LW(p) · GW(p)

Suppose you are denied experimentation and denied an extremely powerful computer (e.g. you can only do <100 simulated trials but want reasonable accuracy), or need high accuracy in limited time. I was more interested in what you would do to try to solve something like this analytically, finding the probabilities for the 3 distinct sides.

The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is, in principle, the same as wilfully putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here; you'll end up losing games to someone who solves it better. Maybe you guys need an "Overcoming Bayes" blog.

Replies from: othercriteria
comment by othercriteria · 2012-06-10T15:23:23.468Z · LW(p) · GW(p)

The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is, in principle, the same as wilfully putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here; you'll end up losing games to someone who solves it better.

Statisticians, by and large, don't lose sleep over this problem. Even in your not-quite-fair die problem, the calculations involved are really hard. It wasn't made explicit in my comment, but I wasn't even assuming that opposite sides have equal probability, because some subtle error in the setup could break the symmetry. In the Bayesian case, I considered mentioning a mixture model that would take advantage of the symmetry if the data supported it. In KDD Cup types of problems, nobody is worried that a domain expert will show up with a winning solution that doesn't even need to see the training data (why would it, if it were maximally physically justified?).

putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here; you'll end up losing games to someone who solves it better. Maybe you guys need an "Overcoming Bayes" blog.

Bayesians have made peace with bias. In fact, decision rules that are both Bayes and unbiased have zero risk, which is a nice way of saying that they don't exist in non-trivial situations. Noorbaloochi and Meeden (1983) have to go through definitional contortions to establish a positive connection between being Bayes and unbiased.

Bias is what lets you get good inferential performance in small-sample regimes. If I observe side counts (2, 0, 1, 3, 2, 2), I'd be okay with my estimator inferring equal side probabilities, because that will be closer to the truth than the unbiased estimator which guesses (0.2, 0.0, 0.1, 0.3, 0.2, 0.2); ten rolls is not enough data to tell me that I should never see a "2". On the other hand, with side counts (200, 0, 100, 300, 200, 200), something closer to the unbiased estimator seems like a good idea. As long as the estimator is asymptotically unbiased, you can even still have consistency.
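Concretely (a sketch; add-one shrinkage toward the uniform distribution is just one convenient biased estimator):

```python
import numpy as np

def mle(counts):
    c = np.asarray(counts, dtype=float)
    return c / c.sum()                          # unbiased, but noisy at small sample sizes

def shrunk(counts):
    c = np.asarray(counts, dtype=float) + 1.0   # add-one shrinkage toward uniform
    return c / c.sum()

print(mle([2, 0, 1, 3, 2, 2]), shrunk([2, 0, 1, 3, 2, 2]))
# With 100x the data the two estimators nearly agree: the bias washes out.
print(mle([200, 0, 100, 300, 200, 200]), shrunk([200, 0, 100, 300, 200, 200]))
```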

Unlike cognitive bias, we have control over our statistical bias, and we should not be squeamish about using it to learn about the parts of the world that are too hard to model with the complete accuracy that would make statistics unnecessary.

Replies from: private_messaging
comment by private_messaging · 2012-06-10T20:14:08.508Z · LW(p) · GW(p)

The point of the not-quite-fair die example was to demonstrate where 'probabilities' come from. A fair die, after several bounces, maps the initial state space onto the final side-up states in a particular way, so that 1/6 of even a very tiny part (hypervolume) of the initial state space maps to each side-up final state. A not-totally-fair die is somewhat biased away from that. Any problem involving dice can be solved from first principles, all the way from this, through selection of the parts of the initial state that are compatible with observation, to the answer.

With regard to statisticians not losing sleep over that: there are a zillion examples in practice where you have to deal with, e.g., electric current, or temperature, or illumination, or any other fundamentally statistical property, and you have limited computational power. A lot of my work is doing this for illumination; I have to compute illumination at a huge number of points on the screen (and no, you can't brute-force it even if you had 1000x the computing power, not to mention that when there's 1000x the power you'll have tighter constraints on error and time). I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice. So much the better for me that some folks just don't understand that you shouldn't get to choose some arbitrary numbers. Yes, in various really fuzzy problems you can do whatever you subjectively please. But to see this as fundamental: that's quite seriously silly.

There are many methods for finding the resulting distribution; one particular method involves sampling the initial state more regularly than at random (e.g. a grid with jittering), so that you get error that improves much faster than 1/sqrt(N); it can in principle be used for die simulation, and is used in practice on similar problems that are less messy (molecular dynamics comes to mind). I generally find that nowadays a lot of very important insights are within the more applied fields; the knowledge has not yet propagated into this meta-ish land of arguing mostly over terminology and not having to be maximally correct against the golden standard of reality.
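A toy one-dimensional illustration of the jittered-grid idea (a sketch; the actual die problem is high-dimensional and far messier):

```python
import numpy as np

def f(x):                       # a smooth test integrand whose true mean is 0.5
    return np.sin(np.pi * x) ** 2 + x - 0.5

rng = np.random.default_rng(0)
n, repeats = 256, 2000
plain = np.empty(repeats)
jittered = np.empty(repeats)
for r in range(repeats):
    plain[r] = f(rng.random(n)).mean()                          # plain Monte Carlo
    jittered[r] = f((np.arange(n) + rng.random(n)) / n).mean()  # one sample per stratum

print(plain.std(), jittered.std())  # jittered error falls far below the 1/sqrt(N) rate
```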

Replies from: othercriteria
comment by othercriteria · 2012-06-10T22:58:54.156Z · LW(p) · GW(p)

Any problem involving dice can be solved from first principles, all the way from this, through selection of the parts of the initial state that are compatible with observation, to the answer.

You're sketching out a methodology for solving forward problems (given model, determine observations), which is fine but it's not what motivates statisticians. Statisticians are generally concerned with the backward/inverse problem (given observations, determine model).

In reality, we're not presented with complete and accurate technical specifications for the die/table/thrower system we encounter. All we get to see is the sequence of sides that landed on top. If we're playing a game that uses the die, it's of interest to know how this sequence will continue into the future.

One general approach to figuring this out might involve inferring technical specifications. Maybe if we're really clever, we can figure out what grade of steel the die is made of just from the observed side counts. Less ambitiously, we might try to recover the relative side lengths and rounding radius. With all this information, we can then simulate forward to estimate the sequence of future throws. The number of parameters involved here may number in the tens or hundreds, or into the millions if we want to capture all the physiological details of a human thrower. It's also not quite clear whether a system like this would even converge to any stationary long-term behavior from which limiting relative frequencies could be calculated.

Another approach is to ignore all the detail, assume independent identically distributed tosses, and just try to learn the five parameters (P(side 1), ..., P(side 5); P(side 6) = 1 - P(side 1) - ... - P(side 5)). Forward simulation in this case is just repeated sampling from the learned distribution.

Moreover, let's suppose that (effective) independence emerges from the technical specification model. Then we have a huge identifiability problem; all those hundreds of parameters are just providing a redundant parameterization of the iid model. We can't hope to learn all of the parameters from the data we get to observe.

I guess as long as you want to stick to forward problems, you can invoke Occam and deny that probability even exists. But don't assume that your understanding carries over to inverse problems. Probability is a useful technical tool there, and applying it to real problems requires translation/operationalization. Two different frameworks for this are frequentism and Bayesianism.

I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice.

If you want to put your money where your mouth is, I have a proposal. Take a die of your choosing, or manufacture one according to your own specifications; it doesn't have to be remotely fair. Also supply a plate onto which it can be tossed if you desire. Do whatever measurements you want on them. Then convey them to a mutually-accepted third party. The third party rolls the die 200 times, according to instructions you publicly post, and then publicly posts the first half of the sequence of rolls and a hash of the second half of the sequence. We both predict the side counts in the second half of the sequence and post the predictions publicly. The third party reveals the second half of the sequence (which can be checked against the hash) and whoever was closer to the true side counts (in squared error distance) wins. The loser pays the winner some mutually-accepted amount, plus or minus half the die/plate shipping expenses as appropriate to split that cost.
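A sketch of the commitment step (any standard cryptographic hash would do; the salt is there so a short roll sequence can't be brute-forced from the hash, and the sequence shown is of course hypothetical):

```python
import hashlib
import secrets

second_half = "3154266231544312612525316123"  # hypothetical: the last 100 rolls go here
salt = secrets.token_hex(16)                  # revealed only along with the sequence

commitment = hashlib.sha256((salt + second_half).encode()).hexdigest()
print(commitment)  # posted publicly; anyone can verify it after the reveal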

Replies from: private_messaging
comment by private_messaging · 2012-06-11T06:13:42.125Z · LW(p) · GW(p)

I am an applied mathematician who actually works on finding the values of probabilistic quantities in less computing time than straightforward numerical experimentation would take. Probability is not just statistics.

Insofar as what you think Bayesians do deviates from what I know has to be done, you have a wrong idea of what Bayesians do (or, giving you the benefit of the doubt at the expense of others, you are referring to some "Bayesians" who are plain wrong), or something like that, but the discussion is too fuzzy for me to tell which. (Ditto for frequentists.)

The point of frequentism is seeing probability as the frequency in an infinite number of trials. The point of my die example is to demonstrate that, physically, probability simply comes in as frequency, via a function from initial phase space to final phase space that maps, for a fair die, 1/6 of the initial phase space to each final side-up state; this is an objective property of the system that has to be adequately captured by whatever methods you are using. And I do not give the slightest damn if you don't know that in practice (not for dice, but for many other systems) you have to find probabilities bottom-up from e.g. the laws of physics. If you are given a steel die to physically experiment with, there again are much better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!). Of course I won't bother making an example with an actual die for you; the point is the principle, and I've done such solutions before with things that unfortunately don't make great examples.

edit: also, on science: the reason we do 'probability of data given model' is that science follows a strategy of committing to only rarely (with a certain probability) throwing out a valid model. 'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the 'we live inside a Turing machine' model). Experimental physicists publish the probability of the data given the model; people can then combine that with their priors if they want.

Replies from: othercriteria
comment by othercriteria · 2012-06-13T20:51:52.692Z · LW(p) · GW(p)

If you are given a steel die to physically experiment with, there again are much better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!).

The world often isn't nice enough to give us the steel die. Figuratively, the steel die may be inside someone's skull, thousands of years in the past, millions of light-years away, or you may have five slightly different dice and really want to learn about the properties of all dice.

I do understand the O(N^(-1/2)) convergence of errors. I spend a lot of time working on problems where even consistency isn't guaranteed (i.e., nonparametric problems where the "number of parameters" grows in some sense with the amount of data) and finding estimators with such convergence properties would be great there.

'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the 'we live inside a Turing machine' model).

It's perfectly well-defined. It's just subjective in a way that makes you (and a great number of informed, capable, and thoughtful statisticians) apparently very uneasy. There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer, in spite of choice of prior, given enough data. You probably wouldn't be happy with rates of convergence for these methods, because they tend to be slower and harder to obtain than for, e.g., MLE estimation of iid normally-distributed data.

Experimental physicists publish the probability of the data given the model; people can then combine that with their priors if they want.

They might well do this. As a frequentist, this is a natural step in establishing confidence intervals and such, after they have estimated the quantity of interest by choosing the model that maximizes the probability of the data. This choice may not look like "Standard Model versus something else" but it probably looks like "semi-empirical model of the system with parameter 1 = X" where X can range over some reasonable interval.

unless you count stuff like 'Solomonoff induction as a prior'

I don't see what role Solomonoff induction plays in a discussion of frequentism versus Bayesianism. I never mentioned it, I don't know enough about it to use it, and I agree with you that it shows up on LW more as a mantra than as an actual tool.

Replies from: private_messaging
comment by private_messaging · 2012-06-14T07:32:06.213Z · LW(p) · GW(p)

The world often isn't nice enough to give us the steel die.

The point is that the probability with a die comes in as frequency (the fraction of initial phase space). Yes, sometimes nature doesn't give you the die; that does not invalidate the fact that probability exists as an objective property of a physical process, as per frequentism (related to how the process maps initial phase space to final phase space); methods employing subjectivity have to try to conform to this objective property as closely as possible (e.g. by trying to know more about how the system works). Bayesianism is not opposed to this, unless we are speaking of some terribly broken Bayesianism.

'Probability of model given the data' is not well defined,

It's perfectly well-defined.

Nope. Only the change to the probability of a model given the data is well defined. The probability itself isn't; you can pick an arbitrary starting point.

There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer,

The notion of 'true answer' is frequentist....

edit: Recall that the original argument was about the trope of Bayesianism being opposed to frequentism etc. here. The point with Solomonoff induction is that once you declare something like it a source of priors, all the math you'll be doing should be completely identical to frequentist math (when the frequencies are within Turing machines fed random tape, and the math is done as in my top-level comment for the die), as long as you don't simply screw your math up. The point of the die example was that no Bayesian worth their salt denies that there is a property of a chaotic process, namely what fraction of the initial phase space gets mapped where, because there really is this property.

comment by David_Gerard · 2012-06-09T10:43:38.251Z · LW(p) · GW(p)

actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"?

Eliezer got this straight from Jaynes.

Replies from: Dreaded_Anomaly
comment by Dreaded_Anomaly · 2012-06-10T04:07:27.168Z · LW(p) · GW(p)

And he wrote a sizable post about the conflict: Frequentist Statistics are Frequently Subjective

The process of throwing away the actual experimental result, and substituting a class of possible results which contains the actual one - that is, deliberately losing some of your information - introduces a dose of real subjectivity.

Edit: didn't mean to retract this, hit the button by accident.

Replies from: private_messaging
comment by private_messaging · 2012-06-10T09:03:22.371Z · LW(p) · GW(p)

Another example of him having poor knowledge and going on, confused and irrelevant, for pages. LW is very effective at driving away anyone who has a clue by appealing to well-loved incorrectness.

Replies from: wedrifid
comment by wedrifid · 2012-06-10T19:51:26.543Z · LW(p) · GW(p)

LW is very effective at driving away anyone who has a clue by appealing to well-loved incorrectness.

In the name of Cryonics, Bayesianism, MWI, FAI, FOOMing, physical realism, and whatever other ideas lesswrong folks endorse but you have a problem with, I banish you!

Did it work?

Replies from: private_messaging
comment by private_messaging · 2012-06-10T22:39:54.792Z · LW(p) · GW(p)

Didn't work on me yet, 'cause I'm bored.

It's not even about which ideas are endorsed, but how. E.g. I like MWI, okay? The arguments in favour of MWI here are utter crap, confused and irrelevant, and they go on for pages. Then Bayes: irrelevant and confusing, and for pages again every time, with a poor-style explanation to top it off, which is not even Bayesian but frequentist (and poor style in terms of the ratio of useful stuff to wankery). The general pattern of 'endorsing' an idea on LW consists of a very poor understanding of what the idea is and how it differs from the rest, plus some wankery slash signalling like "look at me, I can choose the correct idea in some way that you ought to understand if you are smart enough". If you read LW you would think that e.g. frequentists are people who reject the Bayes rule and can't solve a problem like the one in the OP, or that non-many-worlders are people who believe collapse literally happens when you look at stuff (which may well be the case among those 'non-many-worlders' who don't know jack about quantum mechanics). Then cryonics: experts have said many times that current methods lose a lot of important information irreversibly, and you just go on with some crap about a super-intelligence that will deduce the missing info from a life story, never mind algorithmic complexity or anything. Supposedly, because you're a 'rationalist', you don't need to know anything or even think it over taking the subtleties into account; you'll be more correct without the subtleties than anyone would be with them. (If it worked like that, it'd be awesome.)

Replies from: pragmatist
comment by pragmatist · 2012-06-11T00:56:17.823Z · LW(p) · GW(p)

If you read LW you would think that e.g. frequentists are people who reject the Bayes rule and can't solve a problem like the one in the OP...

You'd only arrive at this conclusion if you didn't read very carefully. No one claims that frequentists reject Bayes' rule. But Bayes' rule only applies when we have coherent conditional probabilities. Frequentists deny that one always has conditional probabilities of the form P(data | parameter), because in many cases they deny that the parameter can be treated as the value of a random variable. So the difference in methods comes not from a disagreement about Bayes' rule itself, but a disagreement about when this rule is applicable.

Replies from: othercriteria
comment by othercriteria · 2012-06-11T02:12:16.108Z · LW(p) · GW(p)

Frequentists deny that one always has conditional probabilities of the form P(data | parameter), because in many cases they deny that the parameter can be treated as the value of a random variable.

If this is really what you mean, can you clarify it? Are you talking about going from P(data ; parameter) to P(data | parameter) by abuse of notation and then taking the conditioning seriously?

Replies from: pragmatist
comment by pragmatist · 2012-06-11T03:58:32.431Z · LW(p) · GW(p)

I'm not sure what you mean by "abuse of notation". I don't think P(data ; parameter) and P(data | parameter) are the same thing. The former is a member of a family of distributions indexed by parameter value, the latter is a conditional distribution. I do think that, from a Bayesian point of view, the former determines the latter.

As a Bayesian, you treat the parameter value m as the value of an unobserved random variable M. The observed data y is the value of a random variable Y. Your model,

P(Y = y; m),

can be used to straightforwardly derive the conditional distribution

P(Y = y | M = m) = P(Y = y; m).

In conjunction with your prior distribution for M, this gives you the posterior probability of the parameter value being m.

I'm not a statistician, so I might be using notation in an unorthodox manner here, but I don't think there's anything wrong with the content of what I said. Is there?

Replies from: private_messaging
comment by private_messaging · 2012-06-11T06:32:16.886Z · LW(p) · GW(p)

Frequentists deny that one always has conditional probabilities of the form P(data | parameter), because in many cases they deny that the parameter can be treated as the value of a random variable.

What cases? How does what you said follow from the view that probability is the limit in infinitely many trials?

This post doesn't clarify that. I'm still not sure what you mean exactly, or on what basis you determined what 'frequentists' do (a survey of the literature? some actual issue with interpreting probability as a limit over many trials?).

Replies from: pragmatist
comment by pragmatist · 2012-06-11T07:22:30.893Z · LW(p) · GW(p)

Suppose I'm performing an experiment whose purpose is to estimate the value of some physical constant, say the fine structure constant. Can you make sense of assigning a probability distribution to this parameter from a frequentist perspective? The probability of the constant being in some range would presumably be the limit of the relative frequency of that range as the number of trials goes to infinity, but what could a "trial" possibly be in this case?

Replies from: private_messaging
comment by private_messaging · 2012-06-11T07:31:32.952Z · LW(p) · GW(p)

Let's see how the Bayesians here propose to assign a probability distribution to something like that: Solomonoff induction, a 'universal prior'. Trials of random tape on Turing machines (which you can do by considering all possible tapes). The logic that follows should be identical: as you 'update your beliefs' you select the states compatible with the evidence, as per the top-level comment in this thread; mathematically, the Bayes rule.

Not convinced that this issue is something specific to frequentism.

comment by [deleted] · 2012-06-08T21:20:25.489Z · LW(p) · GW(p)

I'm not sure how much Bayes is getting through here - people are primed to associate "3" with the trick die, so you can get the right answer via availability or similarity heuristics. I'm trying to think of a way to reformulate this without the priming while keeping it close-to-simple - maybe s/"told you that the number that landed was a 3"/"told you that the number that landed was odd"

Replies from: evand, JQuinton
comment by evand · 2012-06-09T22:44:46.487Z · LW(p) · GW(p)

Make the jar half four-sided dice, and half 20-sided dice.

comment by JQuinton · 2012-06-13T15:16:59.246Z · LW(p) · GW(p)

This is a good point I hadn't thought of. Maybe I'm really just priming people to get the right answer and my skills at thought experiments aren't as good as I thought :/

comment by wedrifid · 2012-06-09T23:52:34.833Z · LW(p) · GW(p)

I've had a bit of success with getting people to understand Bayesianism at parties and such

Oooh, I know this one. If the blonde is a 9 and you have a 30% chance with her, and her friends are all 7s and you have a 40% chance with them, but that chance drops to 15% if you have already been rejected by the blonde, which girl should you approach?

comment by Nighteyes5678 · 2012-06-08T22:05:26.700Z · LW(p) · GW(p)

It may be useful to actually type out how you use the above thought experiment to explain Bayes. That would make it more useful for those of us still confused or unsure about what Bayes means (hey, I'm a newbie, be nice), and it would help people critique how well the example teaches the theorem.

For example, why is it better to ask, "If a 3 is rolled, is it more likely to be an 8-sided die or not?" than to ask, "If a random die is rolled, is it more likely to be a 3 or not?"

Replies from: witzvo
comment by witzvo · 2012-06-09T04:22:46.800Z · LW(p) · GW(p)

For example, why is it better to ask, "If a 3 is rolled, is it more likely to be an 8-sided die or not?" than to ask, "If a random die is rolled, is it more likely to be a 3 or not?"

Good question. The second question is just a straightforward probability question. The first question asks you to condition on evidence ("if the randomly chosen die is rolled and comes up 3") and infer "backward" to what this tells you about the die. That's why Bayesian reasoning applies.

The reasoning goes like this: before I roll the die, the two kinds of dice are equally likely.

  • P(die is 8 sided)=P(die is only 3's)=1/2

Then I roll the die and see a three. The conditional probability of this if the die is eight-sided is 1/8. The conditional probability of this if the die is only 3's is 1.

  • P(3|die is 8 sided)=1/8
  • P(3|die is only 3's)=1

The Bayesian update is to multiply out the probability of observing the evidence in the two cases:

  • P(3 and die is 8 sided)=1/2*1/8=1/16
  • P(3 and die is only 3's)=1/2*1=1/2

And then renormalize:

  • P(die is 8 sided | 3)=(1/16)/(1/2+1/16)=1/9=0.111
  • P(die is only 3's | 3)=(1/2)/(1/2+1/16)=8/9=0.888
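The same arithmetic as a tiny function (a sketch; it just multiplies priors by likelihoods and renormalizes):

```python
def bayes_update(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

print(bayes_update([0.5, 0.5], [1/8, 1.0]))  # [0.111..., 0.888...]
```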

JQuinton mentioned that he uses this to argue about falsifiability. I'd like to hear that explained more. I think the example is meant to show that a hypothesis that "can explain anything" (the 8 sided die), should lose probability if we obtain evidence that is "better explained" by the more specific hypothesis (the 3's only die).

Replies from: JQuinton
comment by JQuinton · 2012-06-13T15:33:28.363Z · LW(p) · GW(p)

JQuinton mentioned that he uses this to argue about falsifiability. I'd like to hear that explained more. I think the example is meant to show that a hypothesis that "can explain anything" (the 8 sided die), should lose probability if we obtain evidence that is "better explained" by the more specific hypothesis (the 3's only die).

Yes, that's correct. The thing I was trying to illustrate is that some hypotheses are more falsifiable than others. A hypothesis that can explain too much data (e.g. a 1,000-sided die) would lose probability to a more restricted hypothesis, like a 6-sided die, if only the numbers 1 - 6 are rolled. The complement to that is that if the numbers 7 - 1,000 are rolled, this refutes the idea that the 6-sided die was rolled. Accounting for too much data and falsifiability are two sides of the same coin; explaining too much data tends toward unfalsifiability.
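A quick sketch of that dynamic (the die sizes are hypothetical; each roll of 1 - 6 shifts belief toward the smaller die, and a single higher roll kills it outright):

```python
def update(posterior, roll, sizes):
    # likelihood of `roll` under an n-sided die: 1/n if roll <= n, else 0
    joint = [p * (1 / n if roll <= n else 0) for p, n in zip(posterior, sizes)]
    total = sum(joint)
    return [j / total for j in joint]

sizes = [6, 1000]
posterior = [0.5, 0.5]
for roll in [4, 2, 6, 1, 3]:            # all consistent with the 6-sided die
    posterior = update(posterior, roll, sizes)
print(posterior)                        # the 6-sided hypothesis now dominates
print(update(posterior, 7, sizes))      # one roll of 7 refutes it: [0.0, 1.0]
```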

comment by witzvo · 2012-06-09T05:06:36.205Z · LW(p) · GW(p)

If I understand the point you're trying to make, you might try an example with curve fitting. If some data in a scatterplot is well explained by a line plus noise, then that's a better explanation than trying to draw ever more complicated curves that go through all the data exactly. Of course, identifying the very best model that has a few wiggles and less unexplained scatter is actually pretty tricky [c.f. AIC/BIC/cross-validation/splines].
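A minimal illustration with made-up data (a sketch: a straight line plus noise, fit by a line and by a wiggly degree-9 polynomial, compared on held-out points):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(0, 0.2, x.size)   # truth: a line plus noise

x_new = np.linspace(0.025, 0.975, 200)       # held-out points in the same range
y_new = 2 * x_new + 1

for degree in (1, 9):
    coef = np.polyfit(x, y, degree)
    rmse = np.sqrt(np.mean((np.polyval(coef, x_new) - y_new) ** 2))
    print(degree, rmse)  # the degree-9 fit chases the noise and typically predicts worse
```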

comment by Vaniver · 2012-06-09T04:27:16.987Z · LW(p) · GW(p)

I wonder if doing the Monty Hall / Monty Fall / Monty Crawl would be a better way to teach probabilistic reasoning. (You'd probably want to start off with the fall, then move to the crawl, then move to the hall, if you want people to get them right.)

comment by RobertLumley · 2012-06-08T22:33:12.103Z · LW(p) · GW(p)

at parties

You must be going to different parties than I am...

comment by [deleted] · 2012-06-10T05:37:09.564Z · LW(p) · GW(p)

This video explained Bayesianism in a way that finally enabled me to "get it". Might wanna try lifting some of the examples from it.