Jaynesian interpretation - How does “estimating probabilities” make sense?

post by Haziq Muhammad (haziq-muhammad) · 2021-07-21T21:36:27.691Z · LW · GW · 34 comments

This is a question post.

In Professor Jaynes’ theory of probability, probability is a degree of plausibility assigned to a proposition given some knowledge, not a physical property of the thing itself.

However, I see people treating the probability of heads in a coin flip as a parameter that needs to be estimated. Even Professor Jaynes gives the impression that he is “estimating the probability” or looking for “the most plausible probability of heads” on page 164 of his book.

How does the idea of “estimating a probability from data” or finding the “most probable probability of heads in a coin flip given some data” make sense from this paradigm?

Thank you for your time

Answers

answer by deepthoughtlife · 2021-07-21T21:58:58.855Z · LW(p) · GW(p)

I don't know the theory itself, but from your description it seems likely to be a matter of ease of thinking. 'What should I believe is the likelihood that the result of a coin flip is heads?' isn't any different in meaning from 'estimating the probability of heads from data' or 'how plausible is heads?' as far as our actions go. We have formal ways of doing the middle of the three easily, so it is easier to think of it that way, and we have built up intuitions about coin flips that rely on it.

Whether or not it is a physical property, it is easier to describe properties of individual things rather than of large combinations of things and actions. If his description of how the evidence should be weighed includes large parts of his theory, it could still be a valuable example.

comment by Dagon · 2021-07-21T22:27:54.985Z · LW(p) · GW(p)

That's my take as well.  "estimating the probability" really means "calculating the plausibility based on this knowledge".

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T04:34:43.300Z · LW(p) · GW(p)

I believe, mathematically, your claim can be expressed as:

$P(H \mid D) = \operatorname{argmax}_\theta P(\theta \mid D)$

where $\theta$ is the ”probability“ parameter of the Bernoulli distribution, H represents the proposition that heads occurs, and D represents our data. The left side of this equation is the plausibility based on knowledge and the right side is Professor Jaynes’ ‘estimate of the probability’. How can we prove this statement?

Edit:

Latex is being a nuisance as usual :) The right side of the equation is the argmax with respect to theta of P(theta | data)
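
A quick numeric way to probe the claim (a sketch, not a proof; plain Python, with made-up data of 12 heads and 8 tails and an assumed flat Beta(1, 1) prior):

```python
# Compare the two sides of the proposed identity for a Beta-Bernoulli model.
# Left side:  P(H | D), which for this model works out to the posterior mean.
# Right side: argmax over theta of P(theta | D), i.e. the posterior mode (MAP).
heads, tails = 12, 8
a, b = 1 + heads, 1 + tails            # Beta(13, 9) posterior

p_heads_given_data = a / (a + b)       # 13/22 ~ 0.591
map_estimate = (a - 1) / (a + b - 2)   # 12/20 = 0.600

print(p_heads_given_data, map_estimate)  # close, but not equal in general
```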

Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T07:41:05.279Z · LW(p) · GW(p)

I think argmax is not the way to go, as the beta posterior from a binomial likelihood is only symmetric when the coin is fair. If you want a point estimate, the mean of the distribution is better: it will always be closer to 50/50 than the mode, and thus more conservative. With argmax you are essentially ignoring all the uncertainty in theta and thus overestimating the probability.
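
A quick check of that mean-vs-mode claim (a sketch in plain Python; the parameter pairs are made up):

```python
# For a Beta(a, b) posterior with a, b > 1, the mean is never farther
# from 0.5 than the mode (the argmax of the density).
for a, b in [(3, 2), (8, 4), (13, 9), (40, 20)]:
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    assert abs(mean - 0.5) <= abs(mode - 0.5)
    print(f"Beta({a},{b}): mode={mode:.3f}, mean={mean:.3f}")
```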

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T08:47:17.381Z · LW(p) · GW(p)

What is the theoretical justification behind taking the mean? Argmax feels more intuitive to me because it is literally “the most plausible value of theta”. In either case, whether we use argmax or the mean, can we prove that it is equal to P(H|D)?

Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T11:33:45.752Z · LW(p) · GW(p)

If I have a distribution of two kids and a professional boxer, and a random one of them is going to hit me, then argmax tells me that I will always be hit by a kid. Sure, if you draw from the distribution only once, then argmax will beat the mean in 2/3 of the cases, but it's much worse at answering what will happen if I draw 9 hits (argmax = no boxer hits, mean = 3 hits from the boxer).
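
(A quick simulation of that example; a sketch with numpy, assuming an equal 1/3 chance per person:)

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 replays of "9 hits, each from kid A, kid B, or the boxer".
draws = rng.integers(0, 3, size=(100_000, 9))  # category 2 = the boxer
boxer_hits = (draws == 2).sum(axis=1)

print(boxer_hits.mean())         # ~3.0, the mean's answer (9 * 1/3)
print((boxer_hits == 0).mean())  # ~0.026, so "no boxer hits" is actually rare
```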

This distribution is skewed, like the beta distribution, and is therefore better summarized by the mean than the mode.

In Bayesian statistics, argmax on sigma will often lead to sigma = 0 if you assume that sigma follows an exponential distribution, thus leading you to conclude that there is no variance in your sample.

The variance is also lower around the mean than around the mode, if that counts as a theoretical justification :)

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T07:36:13.720Z · LW(p) · GW(p)

Disclaimer: Subjective Bayesian
 

Here is how we evil subjective Bayesians think about it:

Prior:

Let's imagine two people, Jaynes and an alien. Jaynes knows that most coins are fair and has a Beta(20, 20) prior; the alien does not know this and puts the 'objective' Beta(1, 1) prior, which is uniform over all frequencies.

Data:

The data comes up 12 heads and 8 tails.

Posterior:

Jaynes has a narrow posterior, Beta(32, 28), and the alien a broader one, Beta(13, 9); Jaynes' posterior is also closer to 50/50.

If Jaynes does not have access to the data that formed his prior, or cannot explain it well, then what he believes about the coin and what the alien believes about the coin are both 'rational', as each is the posterior from his own prior and the shared data.
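
(A small sketch of the two posteriors; scipy assumed, numbers as above:)

```python
from scipy import stats

heads, tails = 12, 8

jaynes = stats.beta(20 + heads, 20 + tails)  # Beta(32, 28)
alien = stats.beta(1 + heads, 1 + tails)     # Beta(13, 9)

for name, posterior in [("Jaynes", jaynes), ("alien", alien)]:
    lo, hi = posterior.interval(0.95)
    print(f"{name}: mean={posterior.mean():.3f}, "
          f"95% interval=({lo:.3f}, {hi:.3f})")
```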

How to think about it:

Jaynes can publish a paper with the Beta(13, 9) posterior, because that is what skeptical people with weak priors will believe, while himself believing in Beta(32, 28).

To make it more concrete: Pfizer used a Beta(0.7, 1) prior for their COVID vaccine, but had they truly believed that prior they would have gone back to the drawing board instead of starting a phase 3 trial. The FDA is like the alien in the above example, with a very broad prior allowing most outcomes; the Pfizer scientists are like Jaynes: they had all this data suggesting it should work pretty well, so they may have believed in Beta(5, 15) or whatever.

The other thing to notice is that the coin's frequency is represented by a distribution and not a scalar, because they are both unsure about the 'real' frequency.

Does this help or am I way off?

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T10:43:32.044Z · LW(p) · GW(p)

I am very grateful for your answer but I have a few contentions from my paradigm of objective Bayesianism

  1. You have replaced probability with a physical property: “frequency”. I have also seen other people use terms like bias-weighting, fairness, center of mass, etc., which are all properties of the coin, to sidestep this question. I have nothing against theta being a physical property such that P(heads|theta=alpha) = alpha. In fact, it would make a ton of sense to me if this actually were the case. But the issue is when people say that theta is a probability and treat it as if it were a physical property. I presume you don't view probabilities as physical properties. Even subjective Bayesians are not that evil...
  2. “If Jaynes does not have access to the data that formed his prior, or cannot explain it well, then what he believes about the coin and what the alien believes about the coin are both 'rational', as each is the posterior from his own prior and the shared data.” If Professor Jaynes did not have access to the data that formed his prior, his prior would have been the same as the alien's and they would have ended up with the same posterior. There is no such thing as a “personal prior”. I invite you to the light side: read Professor Jaynes' book; it is absolutely brilliant.
Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T11:54:10.772Z · LW(p) · GW(p)

I may be too bad at philosophy to give a satisfying answer, and it may turn out that I actually do not know and am simply too dumb to realize that I should be confused about this :)

  1. There is a frequency of the coin in the real world; let's say it has $\theta = 0.5$.
    1. Because I am not omniscient, there is a distribution over $\theta$; it's parameterized by some prior, which we ignore (let's not fight about that :)), and some data x. Thus in my head there exists a probability distribution $P(\theta \mid x)$.
    2. The probability distribution in my head is a distribution, not a scalar: I don't know what $\theta$ is, but I may be 95% certain that it's between 0.4 and 0.6.
  2. I think there are problems with objective priors, but I am honored to have met an objective Bayesian in the wild, so I would love to try to understand you. I am Jan Christian Refsgaard on the University of Bayes and Bayesian Conspiracy Discord servers. My main critique is the 'invariance' of some priors under some transformations, but that is a very weak critique and my epistemology is very underdeveloped. Also, I just bought Jaynes' book :) and will read it when I find a study group, so who knows, maybe I will be an objective Bayesian a year from now :)
Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T15:34:02.697Z · LW(p) · GW(p)

Response to point one: I do find that satisfactory from a philosophical perspective, but only because theta refers to a real-world property called frequency and not the probability of heads. My question to you is this: if you have a point estimate of theta, or if you find the exact real-world value of theta (perhaps by measuring it with an ACME frequency-o-meter), what does it tell you about the probability of heads?

Response to point two: The honour is mine :) If you ever create a study group or discord server for the book, then please count me in

Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-23T06:37:38.274Z · LW(p) · GW(p)

In Bayesian statistics there are two distributions which I think we are conflating here, because they happen to have the same value.

The posterior $P(\theta \mid y)$ describes our uncertainty about $\theta$ given data $y$ (and prior information), so it's how sure we are of the frequency of the coin.

The posterior predictive is our prediction for new coin flips $\tilde{y}$ given old coin flips $y$:

$P(\tilde{y} \mid y) = \int P(\tilde{y} \mid \theta) \, P(\theta \mid y) \, d\theta$

For the simple Bernoulli distribution coin example, the following issue arises: the parameter $\theta$, the posterior predictive and the posterior all have the same value, but they are different things.
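
(A quick numeric check of that coincidence; a sketch with scipy, using the 12-heads/8-tails data from earlier and a flat Beta(1, 1) prior:)

```python
from scipy import stats
from scipy.integrate import quad

posterior = stats.beta(1 + 12, 1 + 8)  # Beta(13, 9) posterior over theta

# Posterior predictive: P(new flip = H | y) = integral of theta * p(theta | y)
predictive, _ = quad(lambda t: t * posterior.pdf(t), 0, 1)

print(posterior.mean())  # 13/22 = 0.5909..., a summary of belief about theta
print(predictive)        # the same number, but a prediction about new data
```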

Here is an example where they are different:

Here $\theta$ is not a coin's frequency but the logistic intercept of some binary outcome with predictor variable x. Let's imagine an evil Nazi scientist poisoning people; then we could make a logistic model of y (alive/dead) such as $y \sim \text{Bernoulli}(\text{logit}^{-1}(\theta + \beta x))$. Let's imagine that x is how much poison you ate above/below the average poison level, and that we have $\theta = 0$, so on average half died.

Now we have:

$\theta = 0$: the value if we were omniscient.

The posterior of $\theta$: because we are not omniscient there is error, so it is a distribution around 0.

Predictions for two different y with uncertainty: the posterior predictive $P(\tilde{y} \mid y, x)$, evaluated at two different values of x.
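
(A minimal numeric sketch of this; numpy assumed, and the posterior for $\theta$, the effect size, and the two doses are all made-up numbers:)

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_logit(z):
    return 1 / (1 + np.exp(-z))

beta = 2.0                                    # assumed known poison effect
theta_samples = rng.normal(0.0, 0.1, 10_000)  # stand-in posterior for theta

# Posterior predictive P(death | x): average over the uncertainty in theta
# instead of plugging in a single point estimate.
for x in (0.0, 1.0):  # average dose, and one unit above average
    p_death = inv_logit(theta_samples + beta * x)
    print(f"x={x}: P(death) = {p_death.mean():.3f} (sd {p_death.std():.3f})")
```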

Does this help?

 

I will PM you when we start reading Jaynes. We are currently reading Regression and Other Stories, but in about 20 weeks (done if we do one chapter per week) there is a good chance we will do Jaynes.

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T09:48:49.229Z · LW(p) · GW(p)

To calculate the posterior predictive you need to calculate the posterior, and to calculate the posterior you need to calculate the likelihood (in most problems). For the coin-flipping example, what is the probability of heads and what is the probability of tails, given that the frequency is equal to some value theta? You might accuse me of being completely devoid of intuition for asking this question, but please bear with me...

Sounds good. I thought nobody was interested in reading Professor Jaynes’ book anymore. It’s a shame more people don’t know about him

Replies from: jan-christian-refsgaard, jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-23T21:26:58.115Z · LW(p) · GW(p)

Regarding reading Jaynes: my understanding is that it's good for intuition but bad for applied statistics, because it does not teach you modern Bayesian tools such as WAIC (the widely applicable information criterion) or HMC (Hamiltonian Monte Carlo), so you should first do one of the applied books. I also think Jaynes has nothing about causality.

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T23:09:04.799Z · LW(p) · GW(p)

I'm afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur philosopher and logician, and I couldn't help finding more non-sequiturs in statistics textbooks than plot-holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope/trust in statistics before I read Professor Jaynes' book; that's why I respect the man so much. Now I have the intuition, but I am still trying to reconcile it with what I read in the applied literature. I do sometimes find it frustrating that I am worrying about philosophical nuances and intricacies while others are applying their (perhaps less coherent) knowledge of statistics to solve problems, but I guess it is worth it :)

comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-23T21:21:50.041Z · LW(p) · GW(p)

Given 1. your model and 2. the magical absence of uncertainty in theta, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution over y (coin flip outcomes), not theta (which describes the frequency).

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T23:20:16.978Z · LW(p) · GW(p)

I think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.

This is what I think “estimating the probability” means:

We define theta to be a real-world, objective, physical parameter/quantity s.t. P(H|theta=alpha) = alpha and P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not know or care what it is. I don't think it is appropriate to say that theta is “frequency”, for this reason:

  1. “frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you measure temperature.

But we do not need to dispute this, as theta being “frequency” is unnecessary.

Using the above definitions, we can compute the likelihood, then the posterior, and then the posterior predictive, which represents the probability of heads given data from previous flips.

Is the above accurate?

So Bayesians who say that theta is the probability of heads and compute a point estimate of the parameter theta and say that they have “estimated the probability” are just frequentists in disguise?

I also do not think it is possible for there to be error in a probability, e.g. p(theta|y), as it is something we assign. I thought only frequentists allow the possibility of there being error in a probability, because of their interpretation of probability as something physical and existent like temperature? A statement like “the temperature in this room is 10 + e degrees Celsius, where e is a ‘small’ number” makes complete sense to me.

comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T04:19:46.265Z · LW(p) · GW(p)

Appreciate your reply. I think the source of my confusion is the idea of there being uncertainty in the degree of plausibility that we assign given our knowledge, i.e. uncertainty in our own degree of belief. This feels a bit unnatural to me because this quantity is not an external/physical, unknown quantity but one that we assign given our knowledge. If we were to think of probabilities as physical properties that are unknown, then it would make sense to me that there can be uncertainty in their values. How would you reconcile this?

Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T07:26:18.498Z · LW(p) · GW(p)

The probability is an external/physical thing because your brain is physical, but I take your point.

I think the we/our distinction arises because we have different priors

Replies from: TAG, haziq-muhammad
comment by TAG · 2021-07-22T08:43:07.737Z · LW(p) · GW(p)

The probability is an external/physical thing because your brain is physical

That's a very misleading way of looking at it.

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T08:51:56.946Z · LW(p) · GW(p)

These subjective Bayesians... :) I feel the same way about that statement. Could you please elaborate?

Replies from: jan-christian-refsgaard
comment by Jan Christian Refsgaard (jan-christian-refsgaard) · 2021-07-22T12:55:37.532Z · LW(p) · GW(p)

Uncertainty is a statement about my brain, not the real world. If you replicate the initial conditions then it will always land either heads or tails, so even if the coin is “fair”, then for that exact toss maybe $P(H) = 1$. The uncertainty comes from me being stupid and thus being unable to predict the next coin toss.

Also, there are two things we are uncertain about: we are uncertain about $\theta$ (the coin's frequency) and we are uncertain about $\tilde{y}$, the next coin toss.

comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T08:55:10.836Z · LW(p) · GW(p)

So you are saying that “we” are uncertain about the degree of belief/plausibility that our brain is going to assign? Then who are “we” exactly? Apologies for being glib, but I really don't understand.

Also, it is a crime to have different priors given the same information, according to us objective Bayesians, so that can't be the issue.

answer by Maxwell Peterson · 2021-07-22T11:03:55.950Z · LW(p) · GW(p)

Jaynes has a wonderful section in the same book where he discusses coin-flipping in depth. He flips a pickle jar lid in his kitchen in different ways to demonstrate how the method of flipping is critical - I love this whole section - and ends by saying that it’s a “problem of mechanics, highly complicated”. Section 10.3 (p317), How to cheat at coin and die tossing.

I’d thought he talked about this kind of “probability of a probability” thing in the chapter on the A_p distribution, and page 560 does have that phrase (though later on the page he says “The term ‘probability of a probability’ misses the point”…), but reading it again now, it seems like I didn’t really understand this section. But give pages 560-563 a shot anyway.

comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T15:17:12.647Z · LW(p) · GW(p)

Thank you so much for telling me about A_p distribution! This is exactly what I have been looking for.

“Pending a better understanding of what that means, let us adopt a cautious notation that will avoid giving possibly wrong impressions. We are not claiming that P(Ap|E) is a ‘real probability’ in the sense that we have been using that term; it is only a number which is to obey the mathematical rules of probability theory. Perhaps its proper conceptual meaning will be clearer after getting a little experience using it. So let us refrain from using the prefix symbol p; to emphasize its more abstract nature, let us use the bare bracket symbol notation (Ap|E) to denote such quantities, and call it simply ‘the density for Ap, given E’.” - Page 554 of Professor Jaynes’ book

The idea of the A_p distribution not being a real probability distribution but obeying the mathematical rules of probability theory is far too nuanced and intricate for me to be able to understand. 

I was reading an article on this site about the A_p distribution, Probability, knowledge, and meta-probability [LW · GW], and a commenter wrote: 

“I think a much better approach is to assign models to the problem (e.g. "it's a box that has 100 holes, 45 open and 65 plugged, the machine picks one hole, you get 2 coins if the hole is open and nothing if it's plugged."), and then have a probability distribution over models. This is better because keeps probabilities assigned to facts about the world.

It's true that probabilities-of-probabilities are just an abstraction of this (when used correctly), but I've found that people get confused really fast if you ask them to think in terms of probabilities-of-probabilities. (See every confused discussion of "what's the standard deviation of the standard deviation?")“

I would appreciate your thoughts on this. My current understanding of A_p distributions in light of this comment and in the context of coin flipping is this: 

$A_p$ is defined to be a proposition such that $P(H \mid A_p, I) = p$ and $P(T \mid A_p, I) = 1 - p$, where $H$ and $T$ represent heads and tails and $I$ represents the background information. This is similar to the definition Professor Jaynes gives on page 554 of his book.

Let $D$, the data, be $\{H, H\}$.

Using this definition, the posterior is $P(A_p \mid D, I) \propto P(D \mid A_p, I) \, P(A_p \mid I)$.

Assuming the background information $I$ is indifferent to the $A_p$'s:

$P(A_p \mid D, I) \propto P(D \mid A_p, I) = p^2$

$\operatorname{argmax}_p \, p^2 = 1$

Therefore, in the set of propositions $\{A_p : 0 \le p \le 1\}$, the most plausible proposition given our data is $A_1$. Each member of this set of propositions is called a model. The probability of heads given the most plausible model is 1.0

Is this a correct understanding?

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-22T22:08:03.225Z · LW(p) · GW(p)

I don't think so. Like you, I don't really understand this $A_p$ stuff philosophically. But the step where you drop the prior $P(A_p \mid I)$ to obtain $P(A_p \mid D, I) \propto P(D \mid A_p, I)$ is, I think, not warranted. Dropping the prior term outright like that... I don't think there are many cases where that's acceptable. Doing so does not reflect a state of low knowledge, but instead a state of pretty strong knowledge. To give intuition on what I mean:

Contrast with the prior that reflects the state of knowledge “All I know is that H is possible and T is possible”. This is closer to Jaynes' example about whether there's life on Mars. The prior that reflects that state of knowledge is Beta(1, 1), which, after two heads come up, becomes Beta(3, 1). The mean of Beta(3, 1) is 3/4 = 0.75. This is much less than the 1.0 you arrive at.
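
(A quick check of those numbers; a sketch, scipy assumed:)

```python
from scipy import stats

posterior = stats.beta(1 + 2, 1 + 0)  # Beta(3, 1) after seeing two heads

print(posterior.mean())  # 0.75, not 1.0
# The mode (argmax) of Beta(3, 1) is still at p = 1, which is where the
# disagreement between the two summaries comes from.
```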


A prior that gives 1.0 after the data H,H might be something like:

"This coin is very unfair in a well-known, specific way: It either always gives heads, always gives tails, or gives heads and tails alternating: 'H,T,H,T...'." 

Under that prior, the data HH would give you a probability of near-1 that H is next. But that's a prior that reflects definite, strong knowledge of the coin.
Maybe this argument changes given the nature of $A_p$, which again I don't really understand. But whatever it is, I don't think it's valid to assume the prior away.
 

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-22T22:14:20.255Z · LW(p) · GW(p)

Ah, wait, I misunderstood. You're interested in the mode, huh - that's why you're taking the argmax. In my Beta(3,1) example, the mode is also 1. So no problem there. I was focused on the mean in my previous comment. I still think dropping the prior is bad but now I'm not sure how to argue the point...

 

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-22T22:30:16.444Z · LW(p) · GW(p)

I dropped the prior for two reasons:

  1. I assumed the background information to be indifferent to the A_p's.
  2. We do not explicitly talk about the nature of the A_p's. Prof. Jaynes defines A_p as a proposition such that P(A|A_p, E) = p. In my example, A_p is defined as a proposition such that P(H|A_p, I) = p. No matter what prior information we have, it is going to be indifferent to the A_p's, by virtue of the fact that we don't know what A_p represents.

Is this justification valid?

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-22T23:19:31.654Z · LW(p) · GW(p)

Isn’t A_p the distribution over how often the coin will come up heads, or the probability of life on Mars? If so… there’s no way those things could be indifferent to the background information. A core tenet of the philosophy outlined in this book is that when you ignore prior information without good cause, things get wacky and fall apart. This is part of desiderata iii from chapter 2: “The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains.”

(Then Jaynes ignores information in later chapters because it doesn’t change the result… so this desideratum is easier said than done… but yeah)

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T02:20:58.282Z · LW(p) · GW(p)

“[…] A_p the distribution over how often the coin will come up heads […]” - I understood A_p to be a sort of distribution over models; we do not know/talk about the model itself, but we know that if a model A_p is true, then the probability of heads is equal to p, by definition of A_p. Perhaps the model A_p is the proposition “the centre of mass of the coin is at p” or “the bias-weighting of the coin is p”, but we do not care as long as the resulting probability of heads is p. So how can the prior not be indifferent when we do not know the nature of each proposition A_p in a set of mutually exclusive and exhaustive propositions?

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-23T05:22:25.729Z · LW(p) · GW(p)

I can’t see anything wrong in what you’ve said there, but I still have to insist without good argument that dropping P(A_p|I) is incorrect. In my vague defense, consider the two A_p distributions drawn on p558, for the penny and for Mars. Those distributions are as different as they are because of the different prior information. If it was correct to drop the prior term a priori, I think those distributions would look the same?

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T10:26:45.545Z · LW(p) · GW(p)

You are right; dropping priors in the A_p distribution is probably not a general rule. Perhaps the propositions don't always need to be interpretable for us to be able to impose priors? For example, people impose priors over the parameter space of a neural network, which is certainly not interpretable. But the topic of Bayesian neural networks is beyond me.

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-23T20:54:26.217Z · LW(p) · GW(p)

It seems like in practice, when there’s a lot of data, people like Jaynes and Gelman are happy to assign low-information (or “uninformative”) priors, knowing that with a lot of data the prior ends up getting washed away anyway. So just slapping a uniform prior down might be OK in a lot of real-world situations. This is I think pretty different than just dropping the prior completely, but gets the same job done.

Replies from: maxwell-peterson
comment by Maxwell Peterson (maxwell-peterson) · 2021-07-23T20:59:32.598Z · LW(p) · GW(p)

Now I’m doubting myself >_> is it pretty different?? Anyone lurking reading this who knows whether uniform prior is very different than just dropping the prior term?

Replies from: haziq-muhammad
comment by Haziq Muhammad (haziq-muhammad) · 2021-07-23T22:55:19.390Z · LW(p) · GW(p)

I believe it is the same thing. A uniform prior means your prior is a constant function, i.e. P(A_p|I) = x, where x is a real number with the usual caveats. So if you have a uniform prior, you can drop it (from a safe height, of course). But perhaps the more seasoned Bayesians disagree? (Where are they when you need them?)
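
(A tiny numeric check of this; a sketch with numpy and a made-up grid over p:)

```python
import numpy as np

p = np.linspace(0, 1, 1001)
likelihood = p**2             # P(D | A_p, I) for the data {H, H}
prior = np.full_like(p, 0.5)  # uniform prior; any constant works

posterior = likelihood * prior
posterior /= posterior.sum()  # normalize on the grid

likelihood_only = likelihood / likelihood.sum()

print(np.allclose(posterior, likelihood_only))  # True: the constant cancels
```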
