## Posts

Jan Christian Refsgaard's Shortform 2021-06-02T10:45:42.286Z
The Reebok effect 2021-05-21T17:11:57.076Z
Book Review of 5 Applied Bayesian Statistics Books 2021-05-21T10:23:51.672Z
Is there a way to preview comments? 2021-05-10T09:26:32.706Z
Prediction and Calibration - Part 1 2021-05-08T19:48:16.847Z
a visual explanation of Bayesian updating 2021-05-08T19:45:46.756Z

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Question about Test-sets and Bayesian machine learning · 2021-08-09T20:02:00.164Z · LW · GW

Wild Speculation:

I am 70% confident that if we were smarter then we would not need it.

If you have some data where you (magically) know the likelihood and prior, then you would have some uncertainty from the sampling and some from the parameters in the model; this would then change the form of the posterior, for example from a normal to a t-distribution, to account for this extra uncertainty.

In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we will usually model the residual errors as a normal distribution and thus lose some of the uncertainty; thus our residual errors are different in and out of sample.

Practical Reason

Also, a model with more* parameters will always have smaller residual errors (unless you screw up the prior), and thus the in-sample predictions will seem better.

Modern Bayesians have found two ways to solve this issue

1. WAIC: uses information theory to see how well the posterior predictive distribution captures the generative process, and penalizes for the effective number of parameters.
2. PSIS-LOO: does a very fast version of LOO-CV where for each observation $y_i$ you factor that observation's contribution out of the posterior, to get an out-of-sample posterior predictive estimate for $y_i$.

Bayesian models, just like frequentist models, are vulnerable to overfitting if they have many parameters and weak priors.

*Some models have parameters which constrain other parameters, thus what I mean is "effective" parameters according to the WAIC or PSIS-LOO estimation; parameters with strong priors are very constrained and count as much less than 1.
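As a sketch of what the WAIC penalty computes, here is a minimal numpy version working from a matrix of pointwise log-likelihoods (the coin-flip data, seed, and Beta(1, 1) prior are all invented for illustration):

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S draws x N observations) matrix of pointwise
    log-likelihood values. Returns (elpd_waic, p_waic), where p_waic
    is the effective number of parameters used as the penalty."""
    # lppd: log pointwise predictive density, averaged over posterior draws
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
    # penalty: per-observation variance of the log-likelihood across draws
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return lppd - p_waic, p_waic

# toy data: 20 coin flips, posterior draws from the conjugate Beta posterior
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=20)
theta = rng.beta(1 + y.sum(), 1 + (1 - y).sum(), size=4000)
log_lik = np.where(y == 1, np.log(theta)[:, None], np.log(1 - theta)[:, None])

elpd, p_eff = waic(log_lik)  # p_eff penalizes the single fitted parameter
```

The penalty is what keeps the comparison honest: a model with a strong prior on its extra parameter pays much less than 1 per parameter here.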

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-05T20:50:10.457Z · LW · GW

Good points, but couldn't you still solve the discrete problem with a single model and a stick-breaking prior on the number of mints?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-04T04:17:27.149Z · LW · GW

If there are 3 competing models, then ideally you can make a larger model where each submodel is realized by specific parameter combinations.

If M2 is simply M1 with an extra parameter b2, then you should have a strong prior on b2 being zero in M2; if M3 is M1 with one parameter transformed, then you should have a parameter interpolating between the untransformed and transformed versions, so you can learn that, for example, a 40-90% interpolation describes the data better.

If it's impossible to translate between models like this, then you can do model averaging, but it's a sign that you don't understand your data.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-03T20:54:43.392Z · LW · GW

You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can calculate how well the posterior predictive describes the data to get a feel for how bad our model is :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-03T04:35:36.098Z · LW · GW

I am a little confused by what x is in your statement, and by why you think we can't compute the likelihood or posterior predictive. In most real problems we can't compute the posterior analytically, but we can draw from it and thus approximate it via MCMC.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-02T21:11:44.353Z · LW · GW

I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior, then their posterior predictive distributions would be identical but their Bayes factor would be 1/10, so a model average (assuming a uniform prior on p(M_i)) would favor the narrow prior by a factor of 10, where the predictive approach would correctly conclude that they describe the data equally well and thus that the models should be weighted equally.

Finally, model averaging is usually conceptually wrong and can be avoided by making a larger model that encompasses all potential models, such as a hierarchical model to partially pool between the group- and subject-level models. Gelman's 8 schools data is a good example: there are 8 schools and there are 2 simple models, one with 1 parameter (all schools are the same) and one with 8 (every school is a special snowflake), and then the hierarchical model with 9 parameters: one for each school and one for how much to pool the estimates towards the group mean. Gelman's radon dataset is also good for learning about hierarchical models.
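To see what partial pooling does, here is a minimal numpy sketch using the standard 8 schools estimates; it assumes the between-school scale tau is fixed rather than learned, which is a simplification of the full hierarchical model:

```python
import numpy as np

# Gelman's 8 schools: estimated effects and standard errors
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def partial_pool(y, sigma, tau):
    """Posterior mean of each school's effect for a known between-school
    scale tau: a precision-weighted compromise between each school's own
    estimate and the pooled grand mean."""
    w = 1.0 / (sigma ** 2 + tau ** 2)
    mu = np.sum(w * y) / np.sum(w)                 # pooled grand mean
    shrink = sigma ** 2 / (sigma ** 2 + tau ** 2)  # pull towards mu
    return mu + (1 - shrink) * (y - mu)

no_pool = partial_pool(y, sigma, tau=1e6)    # ~ the 8-parameter model
full_pool = partial_pool(y, sigma, tau=0.0)  # ~ the 1-parameter model
partial = partial_pool(y, sigma, tau=5.0)    # the 9-parameter compromise
```

The two "simple" models fall out as the extreme values of tau, which is exactly why the hierarchical model makes averaging over them unnecessary.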

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-26T18:05:17.959Z · LW · GW

I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. And I agree there are a lot of ad hocisms in probability theory, which means everything is wrong in the logic sense as some of the assumptions are broken, but a solid modern Bayesian approach has many fewer ad hocisms and also teaches you to build advanced models in less than 400 pages.

HMC is a sampling approach to approximating the posterior which in practice is superior to analytical methods, because it actually accounts for correlations between predictors and other things which are usually assumed away.

WAIC is information theory on distributions which allows you to say that model A is better than model B because the extra parameters in B are fitting noise; basically minimum description length on steroids for out-of-sample uncertainty.

Also, I studied biology, which is the worst: I can perform experiments and thus do not have to think about causality, and I do not expect my model to account for half of the signal even if it's 'correct'.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-26T15:44:41.602Z · LW · GW

I think the above is accurate.

I disagree with the last part, but I think there are two sources of confusion:

1. Frequentist vs Bayesian is in principle about priors, but in practice about point estimates vs distributions
   1. Good frequentists use distributions and bad Bayesians use point estimates such as Bayes factors; a good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
2. But the leap from theta to probability of heads I think is an intuitive leap that happens to be correct but unjustified.

Philosophically, the posterior predictive is actually frequentist; allow me to explain:
Frequentists are people who estimate a parameter and then draw fake samples from that point estimate and summarize them in confidence intervals; to justify this they imagine parallel worlds and whatnot.

Bayesians are people who assume a prior distribution from which the parameter is drawn; they thus have both prior and likelihood uncertainty, which gives posterior uncertainty: the uncertainty of the parameters in their model. When a Bayesian wants to use his model to make predictions, he integrates the model parameters out and thus has a predictive distribution of new data given data*. Because this is a distribution of the data, like the frequentist's sampling function, we can actually draw from it multiple times to compute summary statistics, much like the frequentist, and calculate things such as a "Bayesian p-value" which describes how likely the model is to have generated our data; here the goal is for the p-value to be high, because that suggests the model describes the data well.

*In the real world they do not integrate out theta; they draw it 10,000 times and use those samples as a stand-in distribution, because the math is too hard for complex models.
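The footnote's recipe can be sketched in a few lines of numpy (the 12 heads / 8 tails data, the seed, and the flat Beta(1, 1) prior are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
heads, tails = 12, 8  # toy data, Beta(1, 1) prior assumed

# draw theta from the conjugate Beta posterior ...
theta = rng.beta(1 + heads, 1 + tails, size=10_000)

# ... then simulate the next flip for each draw: this integrates the
# parameter uncertainty out and leaves the posterior predictive
y_new = rng.binomial(1, theta)

p_heads = y_new.mean()  # approximates E[theta | data] = 13/22
```

The draws of `y_new` are exactly the "fake samples" a frequentist would recognize, except each one came from a different plausible theta.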

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-23T21:26:58.115Z · LW · GW

Regarding reading Jaynes: my understanding is it's good for intuition but bad for applied statistics, because it does not teach you modern Bayesian tools such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-23T21:21:50.041Z · LW · GW

Given 1. your model and 2. the magical absence of uncertainty in theta, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution of y (coin flip outcomes), not theta (which describes the frequency).

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-23T06:37:38.274Z · LW · GW

In Bayesian statistics there are two distributions which I think we are conflating here, because they happen to have the same value.

The posterior $p(\theta \mid y)$ describes our uncertainty about $\theta$, given data (and prior information), so it's how sure we are of the frequency of the coin.

The posterior predictive $p(\tilde{y} \mid y)$ is our prediction for new coin flips $\tilde{y}$ given old coin flips $y$.

For the simple Bernoulli coin example, the following issue arises: the parameter $\theta$, the posterior predictive and the posterior all have the same value, but they are different things.

Here is an example where they are different:

Here $\theta$ is not a coin but the logistic intercept of some binary outcome with predictor variable $x$. Let's imagine an evil Nazi scientist poisoning people; then we could make a logistic model of $y$ (alive/dead) such as $y \sim \text{Bernoulli}(\text{logit}^{-1}(\theta + bx))$. Let's imagine that $x$ is how much poison you ate above/below the average poison level, and that we have $\theta = 0$, so on average half died.

Now we have:

$\theta = 0$: the value if we were omniscient

The posterior of $\theta$: a distribution around 0, because we are not omniscient and there is error

Predictions for two different $y$ with uncertainty: the posterior predictive $p(\tilde{y} \mid x, y)$ evaluated at two different values of $x$

Does this help?

I will PM you when we start reading Jaynes; we are currently reading Regression and Other Stories, but in about 20 weeks (done if we do 1 chapter per week) there is a good chance we will do Jaynes.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T12:55:37.532Z · LW · GW

Uncertainty is a statement about my brain, not the real world. If you replicate the initial conditions then it will always land either Heads or Tails, so even if the coin is "fair", maybe $\theta$ is not exactly 0.5; the uncertainty comes from me being stupid and thus being unable to predict the next coin toss.

Also, there are two things we are uncertain about: we are uncertain about $\theta$ (the coin's frequency) and we are uncertain about $\tilde{y}$, the next coin toss.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T11:54:10.772Z · LW · GW

I may be too bad at philosophy to give a satisfying answer, and it may turn out that I actually do not know and am simply too dumb to realize that I should be confused about this :)

1. There is a frequency of the coin in the real world, let's say it has $\theta = 0.5$
   1. Because I am not omniscient, there is a distribution over $\theta$; it's parameterized by some prior which we ignore (let's not fight about that :)) and some data $x$, thus in my head there exists a probability distribution $p(\theta \mid x)$
   2. The probability distribution in my head is a distribution, not a scalar; I don't know what $\theta$ is, but I may be 95% certain that it's between 0.4 and 0.6
2. I think there are problems with objective priors, but I am honored to have met an objective Bayesian in the wild, so I would love to try to understand you. I am Jan Christian Refsgaard on the University of Bayes and Bayesian Conspiracy discord servers. My main critique is the 'invariance' of some priors under some transformations, but that is a very weak critique and my epistemology is very underdeveloped; also, I just bought Jaynes's book :) and will read it when I find a study group, so who knows, maybe I will be an objective Bayesian a year from now :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T11:33:45.752Z · LW · GW

If I have a distribution of 2 kids and a professional boxer, and a random one is going to hit me, then argmax tells me that I will always be hit by a kid. Sure, if you draw from the distribution only once, then argmax will beat the mean in 2/3 of the cases, but it's much worse at answering what will happen if I take 9 hits (argmax = nothing, mean = 3 hits from a boxer).

This distribution is skewed, like the beta distribution, and is therefore better summarized by the mean than the mode.

In Bayesian statistics, argmax on $\sigma$ will often lead to $\sigma = 0$ if you assume that $\sigma$ follows an exponential distribution, thus it will lead you to conclude that there is no variance in your sample.

The expected squared error is also lower around the mean than around the mode, if that counts as a theoretical justification :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T07:41:05.279Z · LW · GW

I think argmax is not the way to go, as the Beta posterior under a binomial likelihood is only symmetric when the coin is fair. If you want a point estimate, the mean of the distribution is better: it will always be closer to 50/50 than the mode and is thus more conservative. With argmax you are essentially ignoring all the uncertainty of theta and thus overestimating the probability.
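As a toy illustration of mean vs argmax (the Beta(13, 9) posterior, i.e. 12 heads and 8 tails on a flat Beta(1, 1) prior, is just an example):

```python
# toy posterior: Beta(13, 9) after 12 heads and 8 tails on a Beta(1, 1) prior
a, b = 13, 9

mean = a / (a + b)            # 13/22 ~ 0.591: closer to 50/50, conservative
mode = (a - 1) / (a + b - 2)  # 12/20 = 0.600: the argmax (MAP) estimate
```

The gap between the two grows as the posterior gets more skewed, i.e. the further the coin is from fair or the less data you have.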

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T07:36:13.720Z · LW · GW

Disclaimer: Subjective Bayesian

Here is how we evil subjective Bayesian think about it

Prior:

Let's imagine two people, Jaynes and an alien. Jaynes knows that most coins are fair and has a Beta(20, 20) prior; the alien does not know this, and puts the 'objective' Beta(1, 1) prior, which is uniform over all frequencies.

Data:

The data comes up 12 heads and 8 tails

Posterior:

Jaynes has a narrow posterior, Beta(32, 28), and the alien a broader one, Beta(13, 9); Jaynes's posterior is also closer to 50/50.

If Jaynes does not have access to the data that formed his prior, or cannot explain it well, then what he believes about the coin and what the alien believes about the coin are both 'rational', as each is the posterior from his personal prior and the shared data.

Jaynes can publish a paper with the Beta(13, 9) posterior, because that is what skeptical people with weak priors will believe, while himself believing in Beta(32, 28).

To make it more concrete: Pfizer used a Beta(0.7, 1) prior for their COVID vaccine, but had they truly believed that prior they would have gone back to the drawing board instead of starting a phase 3 trial. The FDA is like the alien in the above example, with a very broad prior allowing most outcomes; the Pfizer scientists are like Jaynes: they had all this data suggesting it should work pretty well, so they may have believed in something like Beta(5, 15) or whatever.

The other thing to notice is that the coin's frequency is represented by a distribution and not a scalar, because they are both unsure about the 'real' frequency.
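To put numbers on the two posteriors, a small Python sketch (closed-form Beta mean and standard deviation; the 12 heads / 8 tails data are from the example above):

```python
# closed-form Beta moments, no libraries needed
def beta_mean(a, b):
    return a / (a + b)

def beta_sd(a, b):
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5

heads, tails = 12, 8
jaynes = (20 + heads, 20 + tails)  # Beta(32, 28)
alien = (1 + heads, 1 + tails)     # Beta(13, 9)

# Jaynes's strong prior keeps him both closer to 50/50 and more certain:
# mean ~0.533 vs ~0.591, sd ~0.064 vs ~0.102
```

Both posteriors will converge as the shared data grows, which is the sense in which the disagreement is about priors, not rationality.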

Does this help or am I way off?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-22T07:26:18.498Z · LW · GW

The probability is an external/physical thing because your brain is physical, but I take your point.

I think the we/our distinction arises because we have different priors

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jan Christian Refsgaard's Shortform · 2021-06-02T10:45:42.594Z · LW · GW

Cholera is the devil!

The National Center for Biotechnology Information has a Taxonomy database.

Q: What do you think taxid=666 is?

A: Vibrio cholerae, coincidence? I think not!

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Book Review of 5 Applied Bayesian Statistics Books · 2021-05-25T07:36:22.733Z · LW · GW

I loved that example as well; I have heard it elsewhere described as "the law of small numbers", where small subsets have higher variance and therefore more frequent extreme outcomes. I think it's particularly good, as the most important part of the Bayesian paradigm is the focus on uncertainty.

The appendix on HMC is also a very good supplement to gain a deeper understanding of the algorithm after having read the description in another book first.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Reebok effect · 2021-05-22T16:42:22.959Z · LW · GW

I think we agree and are talking past each other, my original statement was "Most statisticians would agree with you. Unless..."

So we agree that there is more power in 3/5 than 2/3, and we happen to have divergent intuitions about what random Joe finds most persuasive, my intuition is rather weak so I would gladly update it towards 3/5 sounding more impressive to random people, if you feel strongly about it.

Most likely what the marketing folks have done is gotten a list of the top 100 runners in different running disciplines and then reported "the most impressive top X in list Y".

We both agree that the reported statistic is inflated, which is the major thesis; we simply disagree about how much information can be recovered, because we have different "impressiveness-sounding" heuristics.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Book Review of 5 Applied Bayesian Statistics Books · 2021-05-22T16:18:55.230Z · LW · GW

Good point!

original: Applied Bayesian Statistics - Which book to read?

1. Applied Bayesian Statistics - Which book should you read?
2. Literature Review of 5 Applied Bayesian Statistics Books.
3. Book Review of 5 Applied Bayesian Statistics Books.

I picked 3; if other people have strong feelings, feel free to suggest other titles.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Reebok effect · 2021-05-22T16:02:55.196Z · LW · GW

I disagree: if Reebok produced 64% of all shoes in the world and only 3/5 of top athletes used them, and this furthermore was the best statistic the marketing department could produce, then it's strong evidence that they are overhyped.

But I think you understood me as saying something different, words are hard :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Reebok effect · 2021-05-22T15:57:52.703Z · LW · GW

If Reebok wanted to report a valid statistic they would report something like "11% of the top 100 wear our shoes". I think a much smaller number than top 100 was picked exactly because that was where the effect was the most exaggerated. ChristianKl shares your intuition that 3/5 sounds more impressive than 2/3. I also agree that reporting the top 4 would seem even more fishy, though it could be spun as "75% of quarter-finalists" in knockout tournament sports.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Best Textbooks on Every Subject · 2021-05-22T09:23:08.844Z · LW · GW

I have written a review of the 5 most popular Applied Bayesian Statistics books

Where I recommended:

• Statistical Rethinking
  • Up to speed fast, no integrals, very intuitive approach.
• Doing Bayesian Data Analysis
  • This is the easiest book. If your goal is only to create simple models and you aren't interested in understanding the details, then this is the book for you.
• A Student’s Guide to Bayesian Statistics
  • This book has the opposite focus of the Dog book. Here the author slowly goes through the philosophy of Bayes with an intuitive mathematical approach.
• Regression and Other Stories
  • Good if you want a slower and more thorough approach where you also learn the frequentist perspective.
• Bayesian Data Analysis
  • The most advanced text, very math heavy; best as a second book after reading one or two of the others, unless you are already a statistician.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Book Review of 5 Applied Bayesian Statistics Books · 2021-05-22T09:15:23.312Z · LW · GW

I will write a post shilling for myself, thanks. I was waiting for the post to be 'liked'; if it had gotten -10 karma then there would have been no use in shilling for it :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Reebok effect · 2021-05-22T09:11:34.821Z · LW · GW

Most statisticians would agree with you, unless Reebok's expected market share were between 2/3 and 3/5, of course. Though I expect that most laymen would have Phil's intuition; in any case, the general point, that the statement leaks information, remains :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Reebok effect · 2021-05-21T19:10:18.491Z · LW · GW

What's wrong with it? I am linking to the source material; should I only link if it's a 100% copy?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Is there a way to preview comments? · 2021-05-10T10:52:29.767Z · LW · GW

Nice, thanks! Now my only issue is this: I would prefer pure markdown for writing posts and the new WYSIWYG for commenting. But after playing with it, it seems that it remembers your settings for posts, so you can switch back before clicking create post, and then you are golden :). Thanks a lot.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on a visual explanation of Bayesian updating · 2021-05-10T09:10:46.149Z · LW · GW

I am well aware that nobody asked for this, but here is the proof that the posterior is $\text{Beta}(\alpha + h, \beta + t)$ for the beta-Bernoulli model, with $h$ heads and $t$ tails. We start from Bayes' theorem:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

Then we plug in the definition for the Bernoulli likelihood and Beta prior:

$$p(\theta \mid y) = \frac{\theta^h (1-\theta)^t \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}}{p(y)}$$

Let's collect the powers in the numerator, and things that do not depend on $\theta$ in the denominator:

$$p(\theta \mid y) = \frac{\theta^{\alpha+h-1}(1-\theta)^{\beta+t-1}}{B(\alpha, \beta)\, p(y)}$$

Here comes the conjugation shenanigans. If you squint, the top of the distribution looks like the top of a Beta distribution:

$$\text{Beta}(\theta; a, b) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)}, \quad a = \alpha + h,\; b = \beta + t$$

Let's continue the shenanigans: since the numerator looks like the numerator of a Beta distribution, we know that it would be a proper Beta distribution if we changed the denominator like this:

$$p(\theta \mid y) = \frac{\theta^{\alpha+h-1}(1-\theta)^{\beta+t-1}}{B(\alpha+h, \beta+t)} = \text{Beta}(\alpha + h, \beta + t)$$

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on a visual explanation of Bayesian updating · 2021-05-09T06:22:34.727Z · LW · GW

Mine was the same; I became a Bayesian statistician 4 years ago. I gave a talk about Bayesian statistics and this figure was what made it click for most students (including myself), so I wanted to share it.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on a visual explanation of Bayesian updating · 2021-05-09T06:19:11.730Z · LW · GW

The order does not matter. You can see that by focusing on $(\alpha + h, \beta + t)$, which always equals the prior counts plus the total number of heads and tails; you can also see it from the conjugation rule, where you end with $\text{Beta}(\alpha + h, \beta + t)$ no matter the order.

If you wanted the order to matter, you could down-weight earlier shots or widen the uncertainty between the updates, so the previous posterior becomes a slightly wider prior, to capture the extra uncertainty from the passage of time.
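The order-independence is easy to check numerically; in this sketch (function and variable names invented) a Beta(1, 1) prior is updated with the same flips in different orders:

```python
import numpy as np

def update(prior, flips):
    """Conjugate Beta update: add the heads to alpha and the tails to beta."""
    a, b = prior
    flips = np.asarray(flips)
    return a + flips.sum(), b + (1 - flips).sum()

flips = [1, 0, 1, 1, 0]                      # 3 heads, 2 tails
post_forward = update((1, 1), flips)         # Beta(4, 3)
post_backward = update((1, 1), flips[::-1])  # same: Beta(4, 3)

# updating one flip at a time in a shuffled order also gives Beta(4, 3)
post_seq = (1, 1)
for f in np.random.default_rng(2).permutation(flips):
    post_seq = update(post_seq, [f])
```

Only the totals enter the posterior, which is exactly why the order can't matter for this model.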

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-08T19:18:28.904Z · LW · GW

You may be disappointed: unless you make 40+ predictions per week it will be hard to detect weekly drift. The Bernoulli distribution has a much higher variance than the normal distribution, so the uncertainty estimate of the calibration is correspondingly wide (high uncertainty of data -> high uncertainty of regression parameters). My post 3 will be a hierarchical model which may suit your needs better, but it will maybe be a month before I get around to making that model.

If there are many people like you, then we may try to make a hackish model that down-weights older predictions, as they are less predictive of your current calibration than newer ones, but I will have to think long and hard to make that into a full Bayesian model, so I am making no promises.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on What do the reported levels of protection offered by various vaccines mean? · 2021-05-05T21:36:58.431Z · LW · GW

It almost means 3. It means the Vaccine Efficacy (VE) is 95%.

VE is calculated this way:

$$VE = 1 - \frac{N_v}{N_c}$$

where $N_v$ is the number of sick people in the vaccine group and $N_c$ is the number of sick people in the control group.

So if 100 got sick in the control group and 5 in the vaccine group, then:

$$VE = 1 - \frac{5}{100} = 0.95$$

So it's a 95% reduction in your probability of getting COVID :)

Note that the number reported is sometimes the mode and sometimes the mean of the distribution, but beta/binomial distributions are skewed so the mean is often lower than the mode. I have written a blogpost where I redo the Pfizer analysis
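As a two-line sketch of the VE calculation above (function name invented, equal-sized trial arms assumed):

```python
def vaccine_efficacy(sick_vaccine, sick_control):
    """VE = 1 - ratio of sick counts (assumes equal-sized trial arms)."""
    return 1 - sick_vaccine / sick_control

ve = vaccine_efficacy(5, 100)  # the worked example above: 0.95
```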

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-03T15:17:17.702Z · LW · GW

I have tried to add a paragraph about this, because I think it's a good point, and it's unlikely that you were the only one who got confused. Next weekend I will finish part 2, where I make a model that can track calibration independent of prediction; in that model, 61/100 correct at 60% confidence will give a better posterior for the calibration parameter than 100/100 at 60%, though the likelihood of the 100/100 will of course still be the highest.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-03T05:00:32.315Z · LW · GW

I have gotten 10 votes, the sum of which is 4. All of you who disliked the post: can you please comment so I know why?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-03T04:57:54.161Z · LW · GW

You mean the N'th root of 2, right? Which is what I called the null predictor and divided Scott's predictions by in the code:

random_predictor = 0.5 ** len(y)

which is equivalent to $2^{-N}$, where $N$ is the total number of predictions.
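For completeness, here is a small self-contained version of that comparison (the outcomes and predicted probabilities are made up for illustration):

```python
import numpy as np

# made-up outcomes (1 = happened) and predicted probabilities
y = np.array([1, 1, 0, 1, 0])
p = np.array([0.8, 0.7, 0.3, 0.9, 0.4])

# likelihood of the observed outcomes under the predictions
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# the null predictor assigns 50% to every outcome
random_predictor = 0.5 ** len(y)

skill = likelihood / random_predictor  # > 1 means better than coin flipping
```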

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-03T04:51:28.892Z · LW · GW

You are absolutely right: any framework that punishes you for being right would be bad. My point is that improving your calibration helps a surprising amount and is much more achievable than "just git good", which is what improving prediction requires.

I will try to put your point into the draft when I am off work, thanks.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-02T19:52:22.617Z · LW · GW

Thanks, and also thanks for pointing out that I had used inconsistent notation in a few places; since everything is the Bernoulli distribution, I have changed everything to Bernoulli notation.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Prediction and Calibration - Part 1 · 2021-05-02T09:58:44.405Z · LW · GW

I have not been consistent with my probability notation: I sometimes use upper case P and sometimes lower case p. In future posts I will try to use the same notation as Andrew Gelman, which is $\Pr$ for things that are probabilities (numbers), such as $\Pr(y = 1)$, and $p$ for distributions, such as $p(\theta)$. However, since this is my first post, I am afraid that editing it will waste the moderators' time, as they will have to read it again to check for trolling; what is the proper course of action?