Posts

Use Normal Predictions 2022-01-09T15:01:44.934Z
Genetics: It sometimes skips a generation is a terrible explanation! 2022-01-01T15:12:03.277Z
The Genetics of Space Amazons 2021-12-30T22:14:14.201Z
Jan Christian Refsgaard's Shortform 2021-06-02T10:45:42.286Z
The Reebok effect 2021-05-21T17:11:57.076Z
Book Review of 5 Applied Bayesian Statistics Books 2021-05-21T10:23:51.672Z
Is there a way to preview comments? 2021-05-10T09:26:32.706Z
Prediction and Calibration - Part 1 2021-05-08T19:48:16.847Z
a visual explanation of Bayesian updating 2021-05-08T19:45:46.756Z

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on List of Probability Calibration Exercises · 2022-01-24T17:21:31.375Z · LW · GW

It would be nice if you wrote a short paragraph for each link, "requires download", "questions are from 2011", or you sorted the list somehow :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-16T22:04:07.597Z · LW · GW

Yes, you can change your future μ by being smarter and your future σ by being better calibrated. My rule assumes you don't get smarter and therefore have to adjust only your future σ.

If you actually get better at prediction you could argue you would need to update less than the RMSE estimate suggests :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-16T21:59:18.572Z · LW · GW

I agree with both points

If you are new to continuous predictions then you should focus on the 50% interval, as it gives you the most information about your calibration. If you are skilled and use, for example, a t-distribution, then you have σ for the trunk and ν for the tail. Even then few predictions should land in the tails, so most data should provide more information about how to adjust σ than about how to adjust ν.

Hot take: I think the focus on 95% is an artifact of us focusing on p<0.05 in frequentist statistics.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-16T21:47:50.585Z · LW · GW

Our ability to talk past each other is impressive :)

would have been an easier way to illustrate your point). I think this is actually the assumption you're making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].

Yes, this is almost the assumption I am making. The general point of this post is to assume that all your predictions follow a Normal distribution, with μ as "guessed" and with a σ that is different from what you guessed, and then use the RMSE to get a point estimate for the counterfactual σ you should have used. And as you point out, if the (counterfactual) estimate matches the σ you guessed, then the point estimate suggests you are well calibrated.

In the post counter factual is

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-14T13:18:02.077Z · LW · GW

Thanks! I am planning on writing a few more in this vein. Currently I have some rough drafts of:

• 30% Done, How to calibrate normal predictions
  • a defence of my calibration scheme, and an explanation of how Metaculus does it.
• 10% Done, How to make overdispersed predictions
  • like this one, for the logistic and t distribution.
• 70% Done, How to calibrate binary predictions
  • like this one, but gives a posterior over the calibration by doing a logistic regression with your predictions as "x" and outcome as "y"

I can't promise they will be as good as this one, but if they are not terrible then I would like them to be turned into a sequence :), how do I do this?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-14T10:56:03.482Z · LW · GW

Yes, you are right, but under the assumption that the errors are normally distributed, I am right:

If:

Then Which is much less than 1.

proof:

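import numpy as np
from scipy import stats

# Mixture of two zero-mean normals: 22% with sd 0.5 and 78% with sd 1.1
x1 = stats.norm(0, 0.5).rvs(22 * 10000)
x2 = stats.norm(0, 1.1).rvs(78 * 10000)
x12 = np.concatenate([x1, x2])
print(np.mean(x12 ** 2))    # mean squared error, close to 1
print(np.median(x12 ** 2))  # median squared error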

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-13T15:36:13.159Z · LW · GW

I am making the simple observation that the median error is less than one because the mean squared error is one.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-13T06:10:22.504Z · LW · GW

That's also how I conceptualize it: you have to change your intervals because you are too stupid to make better predictions. If the prediction was always spot on then sigma should be 0, and then my scheme does not make sense

If you suck like me and get a prediction very close then I would probably say: that sometimes happens :) note I assume the average squared error should be 1, which means most errors are less than 1, because (0² + 2²)/2 = 2 > 1
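The claim that most errors are smaller than 1 when the average squared error is 1 can be checked directly for the standard normal. This is a minimal sketch using only Python's standard library, and it assumes the standardized errors really are N(0, 1):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of standardized errors z with |z| < 1
share_within_one = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Median of z^2: half of all |z| fall below the 75th percentile of z
median_squared_error = standard_normal.inv_cdf(0.75) ** 2

print(f"share of errors with |z| < 1: {share_within_one:.3f}")
print(f"median squared error: {median_squared_error:.3f}")
```

Roughly two thirds of the errors are smaller than 1 in absolute value, and the median squared error is well below the mean squared error of 1.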

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-13T05:59:40.985Z · LW · GW

I agree, most things are not normally distributed, and my calibration rule answers how to rescale to a normal. Metaculus uses the CDF of the predicted distribution, which is better if you have lots of predictions. My scheme gives an actionable number faster by making assumptions that are wrong, but if you, like me, have intervals that seem off by almost a factor of 2, then your problem is not the tails but the entire region :), so the trade-off seems worth it.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-12T17:54:16.995Z · LW · GW

Agreed. More importantly, the two distributions have different kurtosis, so their tails are very different a few sigmas away

I do think the Laplace distribution is a better beginner distribution because of its fat tails, but advocating for people to use a distribution they have never heard of seems like too tough a sell :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-12T17:49:39.314Z · LW · GW

My original opening statement got trashed for being too self-congratulatory, so the current one is a hot fix :). So I agree with you!

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Genetics: It sometimes skips a generation is a terrible explanation! · 2022-01-12T12:33:49.973Z · LW · GW

Me too, I learned about this from another disease and thought, that's probably how it works for colorblindness as well.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-10T08:45:51.019Z · LW · GW

I would love to have you as a reviewer of my second post, as there I will try to justify why I think this approach is better. You can even super dislike it before I publish if you still feel like that when I present my strongest arguments, or maybe convince me that I am wrong so I don't publish part 2 and make a partial retraction for this post :). There is a decent chance you are right as you are the stronger predictor of the two of us :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-10T08:36:44.875Z · LW · GW

Can I use this image for my "part 2" post, to explain how "pros" calibrate their continuous predictions, and how it stacks up against my approach? I will add you as a reviewer before publishing so you can make corrections in case I accidentally straw man or misunderstand you :)

I will probably also make a part 3 titled "Try t predictions" :), that should address some of your other critiques about the normal being bad :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-10T07:36:11.625Z · LW · GW

Note 1 for JenniferRM: I have updated the text so it should alleviate your confusion, if you have time, try to re-read the post before reading the rest of my comment, hopefully the few changes should be enough to answer why we want RMSE=1 and not 0.
Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the below text is, how do I make my point clear without making the post much longer?

With binary predictions you can cheat and predict 50/50 as you point out... You can't cheat with continuous predictions as there is no "natural" midpoint.

The insight you are missing is this:

1. I "try" to convert my predictions to the Normal N(0, 1) using the predicted mean and error.
2. The variance of the unit Normal is 1: Var(N(0, 1)) = 1^2 = 1
3. If my calculated variance deviates from that of the unit Normal, then that is evidence that I am wrong. I am making the implicit assumption that I cannot make "better point predictions" (change μ) and thus am forced to only update my future uncertainty interval (σ).
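The three steps above can be sketched in a few lines of Python. The outcomes below are made up purely for illustration, the helper name `widening_factor` is hypothetical, and the prediction pairs are the ones listed in this comment plus a version with sigmas 10 times smaller:

```python
import math

def widening_factor(predictions, outcomes):
    """RMSE of the z-scores: about 1 if calibrated, above 1 if intervals were too narrow."""
    z_scores = [(x - mu) / sigma for (mu, sigma), x in zip(predictions, outcomes)]
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))

outcomes = [51.0, 25_000.0]                  # made-up outcomes, for illustration only
narrow = [(54.0, 3.0), (15_000.0, 5_000.0)]  # (mu, sigma) with sigmas 10 times smaller
wide = [(54.0, 30.0), (15_000.0, 50_000.0)]  # the 10-times-wider predictions

factor_narrow = widening_factor(narrow, outcomes)
factor_wide = widening_factor(wide, outcomes)

# Rescale each predicted sigma by the factor estimated from its own set of predictions
recalibrated_narrow = [sigma * factor_narrow for _, sigma in narrow]
recalibrated_wide = [sigma * factor_wide for _, sigma in wide]

print(factor_narrow, factor_wide)
print(recalibrated_narrow, recalibrated_wide)
```

With these made-up outcomes the narrow predictions give a factor above 1 (widen) and the wide ones a factor below 1 (narrow), and both sets end up with the same recalibrated sigmas, which is why the target is an RMSE of 1 rather than 0.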

To make it concrete, if I had predicted (sigma here is 10 times wider than in the post):

• Biden ~ N(54, 30)
• COVID ~ N(15.000, 50.000)

then the math would give . Both the post predictions and the "10 times wider predictions in this comment" implies the same "recalibrated" :

(On a side note, I hate Brier scores and prefer the Bernoulli likelihood, because Brier says that predicting 0% or 2% on something that happens 1% of the time is 'equally wrong' (same squared error)... whereas the Bernoulli says you are an idiot for saying 0% when it can actually happen)
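The side note can be verified with a short calculation. This is a sketch of the expected Brier score and the expected negative Bernoulli log-likelihood for an event with a true frequency of 1%; the function names are illustrative, not from any library:

```python
import math

def expected_brier(p, q):
    """Expected squared error when predicting p for an event with true frequency q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

def expected_log_loss(p, q):
    """Expected negative Bernoulli log-likelihood when predicting p, true frequency q."""
    def term(weight, prob):
        if weight == 0:
            return 0.0
        if prob == 0:
            return math.inf  # an event predicted as impossible actually happens
        return -weight * math.log(prob)
    return term(q, p) + term(1 - q, 1 - p)

true_frequency = 0.01
for prediction in (0.0, 0.01, 0.02):
    print(
        prediction,
        expected_brier(prediction, true_frequency),
        expected_log_loss(prediction, true_frequency),
    )
```

Predicting 0% and predicting 2% get the same expected Brier score, while the log-likelihood treats the 0% prediction as infinitely bad.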

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-10T05:44:36.055Z · LW · GW

The big ask is making normal predictions; calibrating them can be done automatically. Here is a quick example using Google Sheets.

I totally agree with both your points. This comment from a Metaculus user has some good objections to "us" :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T22:42:41.021Z · LW · GW

I am sorry if I have straw manned you, and I think your above post is generally correct. I think we are coming from two different worlds.

You are coming from Metaculus, where people make a lot of predictions, where having 50+ predictions is the norm, and thus looking at a U(0, 1) gives a lot of intuitive evidence of calibration.

I come from a world where people want to improve in all kinds of ways, and one of them is prediction. Few people write more than 20 predictions down a year, and when they do they more or less ALWAYS make dichotomous predictions. I expect many of my readers to be terrible at predicting, just like myself.

You are reading a post with the message "raise the sanity waterline from 2% to 5% of your level" and asking "why is this better than making 600 predictions and looking at their inverse CDF", and the answer is: it's not, but it's still relevant because most people do not make 600 predictions and do not know what an inverse CDF is. I am even explaining what a normal distribution is because I do not expect my audience to know...

1. You are absolutely correct they probably do not share an error distribution. But I am trying to get people from knowing 1 distribution to knowing 2.

2. Scott Alexander makes a "when I predict this" then "it really means that" every year for his binary predictions. This gives him an intuitive feel for "I should adjust my odds up/down by x". I am trying to do the same for Normal distribution predictions, so people can check their predictions.

3. I agree your methodology is superior :). All I propose is that people sometimes make continuous predictions, and if they want to start doing that and track how much they suck, then I give them instructions to quickly get a number for how well it is going.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T20:58:49.814Z · LW · GW

TLDR for our disagreement:

SimonM: Transforming to Uniform distribution works for any continuous variable and is what Metaculus uses for calibration
Me: the variance trick from this post is better if your variables are from a Normal distribution, or something close to a Normal.
SimonM: Even for a Normal the Uniform is better.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T20:36:48.600Z · LW · GW

I don't know what s.f is, but the interval around 1.73 is obviously huge, with 5-1-0 data points it's quite narrow if your predictions are drawn from N(1, 1.73), that is what my next post will be about. There might also be a smart way to do this using the Uniform, but I would be surprised if its dispersion is smaller than a chi^2 distribution :) (changing the mean is cheating, we are talking about calibration, so you can only change your dispersion)
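How wide the interval around an estimated factor is for different numbers of predictions can be illustrated with a small simulation. This is only a sketch: it assumes perfectly calibrated normal predictions (true factor 1) and uses Monte Carlo draws instead of exact chi^2 quantiles, and the function name `rmse_interval` is made up for this example:

```python
import math
import random

def rmse_interval(n_predictions, draws=20_000, seed=0):
    """Central 95% interval of the z-score RMSE when the forecaster is perfectly calibrated."""
    rng = random.Random(seed)
    rmses = sorted(
        math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(n_predictions)) / n_predictions)
        for _ in range(draws)
    )
    return rmses[int(0.025 * draws)], rmses[int(0.975 * draws)]

intervals = {n: rmse_interval(n) for n in (2, 10, 50)}
for n, (low, high) in intervals.items():
    print(f"{n} predictions: RMSE between {low:.2f} and {high:.2f}")
```

With only 2 predictions the interval is very wide, and it tightens quickly as more predictions are added.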

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T18:56:47.893Z · LW · GW

Hard disagree, From two data points I calculate that my future intervals should be 1.73 times wider, converting these two data points to U(0,1) I get

[0.99, 0.25]

How should I update my future predictions now?
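For reference, the two summaries can be related to each other if both predictions are assumed to be normal: the U(0, 1) values map back to z-scores through the inverse normal CDF, and the RMSE of those z-scores is the widening factor. A sketch, using the two rounded values quoted above (so the result only approximately matches 1.73):

```python
import math
from statistics import NormalDist

standard_normal = NormalDist(0, 1)
uniform_values = [0.99, 0.25]  # the two U(0, 1) values quoted above (rounded)

# Map each U(0, 1) value back to a z-score, assuming normal predictions
z_scores = [standard_normal.inv_cdf(u) for u in uniform_values]

# RMSE of the z-scores: the factor by which future intervals would be widened
widening_factor = math.sqrt(sum(z * z for z in z_scores) / len(z_scores))

print([round(z, 2) for z in z_scores])
print(round(widening_factor, 2))
```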

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T18:31:01.487Z · LW · GW

you are missing the step where I am transforming arbitrary distribution to U(0, 1)

medium confident in this explanation: Because the squares of random variables from the same distribution follow a gamma distribution, and it's easier to see violations from a gamma than from a uniform. If the majority of your predictions are from weird distributions then you are correct, but if they are mostly from normal or unimodal ones, then I am right. I agree that my solution is a hack that would make no statistician proud :)

Edit: Intuition pump, a T(0, 1, 100) obviously looks very normal, so transforming to U(0,1) and then to N(0, 1) will create basically the same distribution. The square of a bunch of normals is Chi^2, so the Chi^2 is the best distribution for detecting violations. Obviously there is a point where this approximation sucks and U(0, 1) still works

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T18:11:00.362Z · LW · GW

changed to "Making predictions is a good practice, writing them down is even better."

does anyone have a better way of introducing this post?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T18:02:29.604Z · LW · GW

(Edit: the above post has 10 up votes, so many people feel like that, so I will change the intro)

You have two critiques:

1. Scott Alexander evokes tribalism

2. We predict more than people outside our group holding everything else constant

1. I was not aware of it, and I will change if more than 40% agree

Remove reference to Scott Alexander from the intro: [poll]{Agree}{Disagree}

2. I think this is true, but I have no hard facts. More importantly, if you think I am wrong, or if this also evokes tribalism, it should likewise be removed...

Also Remove "We rationalists are very good at making predictions" from the intro: [poll]{Agree}{Disagree}

If I remove both then I need a new intro :D

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T17:50:02.796Z · LW · GW

This is a good point, but you need less data to check whether your squared errors are close to 1 than whether your inverse CDF looks uniform, so if the majority of predictions are normal I think my approach is better.

The main advantage of SimonM/Metaculus is that it works for any continuous distribution.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T17:17:56.352Z · LW · GW

Agreed 100% on 1), and with 2) I think my point is "start using the normal predictions as a gateway drug to overdispersed and model based predictions"

I stole the idea from Gelman and simplified it for the general community, I am mostly trying to raise the sanity waterline by spreading the gospel of predicting on the scale of the observed data. All your critiques of normal forecasts are spot on.

Ideally everybody would use mixtures of over-dispersed distributions or models when making predictions to capture all sources of uncertainty

It is my hope that by educating people in continuous prediction the Metaculus trade-off you mention will slowly start to favor the continuous predictions because people find it as easy as binary prediction... but this is probably a pipe dream, so I take your point

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T16:46:09.239Z · LW · GW

You could make predictions from a t distribution to get fatter tails, but then the "easy math" for calibration becomes more scary... You can then take the "quartile" from the t distribution and ask what sigma in the normal that corresponds to. That is what I outlined/hinted at in the "Advanced Techniques 3"

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Use Normal Predictions · 2022-01-09T16:40:05.202Z · LW · GW

Good Points, Everything is a conditional probability, so you can simply make conditional normal predictions:

Let A = Biden alive

Let B = Biden vote share

Then the normal probability is conditional on him being alive and does not count otherwise :)

Another solution is to make predictions from a T-distribution to get fatter tails, and then use "Advanced trick 3" to transform it back to a normal when calculating your calibration.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Genetics: It sometimes skips a generation is a terrible explanation! · 2022-01-02T01:00:07.457Z · LW · GW

I think this was by parents, so they are forgiven :). Your story is pretty crazy, but there is so much to know as a doctor that most of it becomes rules of thumb (maps vs buttons) until called out like you did

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Genetics: It sometimes skips a generation is a terrible explanation! · 2022-01-01T23:10:11.208Z · LW · GW

Fair point. I think my target audience is people like me who heard this saying about colorblindness (or other classical Mendelian diseases that run in families)

I have added a disclaimer towards the end :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Genetics: It sometimes skips a generation is a terrible explanation! · 2022-01-01T22:50:35.749Z · LW · GW

I am not sure I follow, I am confused about whether the 60/80 family refers to both parents, and what is meant by "off-beat" and "snap-back", I am also confused about what the numbers mean is it 60/80 of the genes or 60/80 of the coding region (so only 40 genes)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Genetics: It sometimes skips a generation is a terrible explanation! · 2022-01-01T22:40:49.670Z · LW · GW

I totally agree, technically it's a correct observation, but it's also what I was taught by adults when I asked as a kid, and therefore I wanted to correct it as the real explanation is very short and concise.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on This Year I Tried To Teach Myself Math. How Did It Go? · 2022-01-01T14:26:09.771Z · LW · GW

That is hard to believe, you seem so smart at the UoB discord and your podcast :), thanks for sharing

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on This Year I Tried To Teach Myself Math. How Did It Go? · 2022-01-01T14:22:54.897Z · LW · GW

The University of Bayes Discord (UoB) has study groups for Bayesian statistics which might be relevant to you. The newest study group is doing Statistical Rethinking 2022 as the lectures get posted to YouTube. It requires less math than you have demonstrated in your post.

If you want a slightly more rigorous path to Bayesian statistics, then I would advise reading Lambert or Gelman. See here for more info.

If you want to take the mathematician's approach and learn probability theory first, then the book Probability 110 by Blitzstein is pretty good; the study group at UoB is, as of writing, halfway through that book.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2022-01-01T01:36:35.558Z · LW · GW

Totally agree, it's also Christian's critique of the idea :)... Maybe it could be relevant for aliens on a smaller planet as they could leave their planet more easily, and would thus be less advanced than us when we become space faring :)... Or a scifi where the different tech trees progress differently, like steampunk

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2021-12-31T19:51:40.157Z · LW · GW

Then maybe it only works for harem anime in space :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on From Considerations to Probabilities · 2021-12-31T17:45:05.394Z · LW · GW

A lot of your latex is not rendered correctly...

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2021-12-31T16:26:57.860Z · LW · GW

Agreed, but then you don't get cool space amazons :). It could be an extra fail safe mechanism :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2021-12-31T16:24:27.527Z · LW · GW

Good point. In principle the X chromosome already has this issue when you get it from your father. If the A chromosome is simply a normal X chromosome with an insertion of a set of proteins that blocks silencing, then you can still have recombination. If we assume the Amazon proteins are all located in the same LD region, then mechanically everything is as in the post, but we do not have the Muller's ratchet problem

Also the A only recombines with X as AY is female and therefore never mates with an AX or AY

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2021-12-31T09:27:04.774Z · LW · GW

When the space ship lands there is a 1% chance that no males are among the first 16 births (0.75^16 ≈ 0.01)

Luckily males are fertile for longer, so if the second generation had no men the first generation still works

If the A had a mutation such that AX did not have a 50% chance of passing on an A, then the gender ratio would be even more extreme. If the last man dies, then an AY female could probably artificially inseminate a female.

You can update the matrix and do the dot product to see how those different rules pan out. If you have a specific ratio you want to try then I can calculate it for you. Calculating a target gender ratio will require a mathematician, as this is a Markov process and their transition matrices are hard to calculate from a target steady state if you are a mere mortal like me

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on The Genetics of Space Amazons · 2021-12-30T23:04:07.495Z · LW · GW

1:10 was a good guess, but unfortunately the amazon gene only gets us to 1:3

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Question about Test-sets and Bayesian machine learning · 2021-08-09T20:02:00.164Z · LW · GW

Wild Speculation:

I am 70% confident that if we were smarter then we would not need it.

If you have some data for which you (magically) know the likelihood and prior, then you would have some uncertainty from the parameters in the model and some from the parameters; this would then change the form of the posterior, for example from a normal to a t-distribution, to account for this extra uncertainty.

In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we will usually model the residual errors as a normal distribution and thus lose some of the uncertainty; thus our residual errors are different in and out of sample.

Practical Reason

Also, a model with more* parameters will always have smaller residual errors (unless you screw up the prior), and thus the in-sample predictions will seem better

Modern Bayesians have found two ways to solve this issue

1. WAIC: uses information theory to see how well the posterior predictive distribution captures the generative process, and penalizes for the effective number of parameters.
2. PSIS-LOO: does a very fast version of LOO-CV where for each y_i you factor that y_i's contribution out of the posterior to get an out-of-sample posterior predictive estimate for y_i.

Bayesian models, just like Frequentist models, are vulnerable to overfitting if they have many parameters and weak priors.

*Some models have parameters which constrain other parameters, thus what I mean is "effective" parameters according to the WAIC or PSIS-LOO estimation; parameters with strong priors are very constrained and count as much less than 1.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-05T20:50:10.457Z · LW · GW

Good points, but can't you still solve the discrete problem with a single model and a stick breaking prior on the number of mints, right?

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-04T04:17:27.149Z · LW · GW

If there are 3 competing models then ideally you can make a larger model where each submodel is realized by specific parameter combinations.

If M2 is simply M1 with an extra parameter b2, then you should have a stronger prior on b2 being zero in M2. If M3 is M1 with one parameter transformed, then you should have a parameter interpolating between the two forms, so you can learn that, say, 40-90% interpolation describes the data better.

If it's impossible to translate between models like this then you can do model averaging, but it's a sign of you not understanding your data.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-03T20:54:43.392Z · LW · GW

You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can calculate how well the posterior predictive describes the data to get a feel for how bad our model is :)

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-03T04:35:36.098Z · LW · GW

I am a little confused by what x is in your statement, and by why you think we can't compute the likelihood or posterior predictive. In most real problems we can't compute the posterior, but we can draw from it and thus approximate it via MCMC

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Do Bayesians like Bayesian model Averaging? · 2021-08-02T21:11:44.353Z · LW · GW

I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior, then their posterior predictive distributions would be identical but their Bayes factor would be 1/10, so a model average (assuming a uniform prior on p(M_i)) would favor the narrow prior by a factor of 10, where the predictive approach would correctly conclude that they describe the data equally well and thus that the models should be weighted equally.
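The broader-prior example can be reproduced numerically. The sketch below is a toy setup, not taken from any of the linked posts: a normal likelihood with known sigma = 1, made-up data, a uniform prior on the mean that is either 20 or 200 units wide, and plain midpoint-rule integration instead of a sampler:

```python
import math

def likelihood(mu, data, sigma=1.0):
    """Normal likelihood of the data for a given mean mu (sigma assumed known)."""
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)
        for y in data
    )
    return math.exp(log_lik)

def evidence_and_posterior_mean(data, half_width, n_grid=20_000):
    """Midpoint-rule integration over a Uniform(-half_width, half_width) prior on mu."""
    step = 2 * half_width / n_grid
    prior_density = 1 / (2 * half_width)
    evidence = 0.0
    weighted_mu = 0.0
    for i in range(n_grid):
        mu = -half_width + (i + 0.5) * step
        weight = likelihood(mu, data) * prior_density * step
        evidence += weight
        weighted_mu += mu * weight
    return evidence, weighted_mu / evidence

data = [0.3, -0.5, 1.2, 0.8, 0.1]  # hypothetical observations
narrow_evidence, narrow_mean = evidence_and_posterior_mean(data, half_width=10)
wide_evidence, wide_mean = evidence_and_posterior_mean(data, half_width=100)

bayes_factor = narrow_evidence / wide_evidence
print(f"Bayes factor (narrow vs wide): {bayes_factor:.3f}")
print(f"posterior means: {narrow_mean:.4f} vs {wide_mean:.4f}")
```

The two posteriors agree, yet the marginal likelihoods differ by the ratio of the prior widths, which is the sensitivity described above.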

Finally, model averaging is usually conceptually wrong, and the problem can instead be solved by making a larger model that encompasses all potential models, such as a hierarchical model that partially pools between the group and subject level models. Gelman's 8 schools data is a good example: there are 8 schools and there are 2 simple models, one with 1 parameter (all schools are the same) and one with 8 (every school is a special snowflake), and then the hierarchical model with 9 parameters, one for each school and one for how much to pool the estimates towards the group mean. Gelman's radon dataset is also good for learning about hierarchical models.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-26T18:05:17.959Z · LW · GW

I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. And I agree there are a lot of ad hocisms in probability theory, which means everything is wrong in the logic sense as some of the assumptions are broken, but a solid modern Bayesian approach has far fewer ad hocisms and also teaches you to build advanced models in less than 400 pages.

HMC is a sampling approach to solving the posterior which in practice is superior to analytical methods, because it actually accounts for correlations in predictors and other things which are usually assumed away.

WAIC is information theory on distributions which allows you to say that model A is better than model B because the extra parameters in B are fitting noise, basically minimum description length on steroids for out of sample uncertainty.

Also I studied biology which is the worst, I can perform experiments and thus do not have to think about causality and I do not expect my model to account for half of the signal even if it's 'correct'

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-26T15:44:41.602Z · LW · GW

I think the above is accurate.

I disagree with the last part, but it has two sources of confusion

1. Frequentist vs Bayesian is in principle about priors, but in practice about point estimates vs distributions
   1. Good Frequentists use distributions and bad Bayesians use point estimates such as Bayes factors; a good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
2. But the leap from theta to probability of heads I think is an intuitive leap that happens to be correct but unjustified.

Philosophically, the posterior predictive is actually frequentist, allow me to explain:
Frequentists are people who estimate a parameter and then draw fake samples from that point estimate and summarize them in confidence intervals; to justify this they imagine parallel worlds and whatnot.

Bayesians are people who assume a prior distribution from which the parameter is drawn. They thus have both prior and likelihood uncertainty, which gives posterior uncertainty, which is the uncertainty of the parameters in their model. When a Bayesian wants to use their model to make predictions, they integrate their model parameters out and thus have a predictive distribution of new data given data*. Because this is a distribution of the data, like the Frequentist's sampling function, we can actually draw from it multiple times to compute summary statistics much like the Frequentists, and calculate things such as a "Bayesian p-value" which describes how likely the model is to have generated our data. Here the goal is for the p-value to be high because that suggests that the model describes the data well.

*In the real world they do not integrate out theta, they draw it 10,000 times and use those samples as a stand-in distribution because the math is too hard for complex models
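The sampling shortcut from the footnote can be sketched for a coin. Everything here is an illustrative assumption: the data (7 heads in 10 flips), the flat Beta(1, 1) prior, and the choice of "at least as many heads as observed" as the test statistic for the Bayesian p-value:

```python
import random

rng = random.Random(1)

heads, flips = 7, 10                          # hypothetical observed data
alpha, beta = 1 + heads, 1 + flips - heads    # Beta(1, 1) prior gives a Beta(8, 4) posterior

n_draws = 10_000
replicated_heads = []
for _ in range(n_draws):
    theta = rng.betavariate(alpha, beta)      # one posterior draw of theta
    # Simulate a new data set of the same size from that theta
    replicated_heads.append(sum(rng.random() < theta for _ in range(flips)))

# The replicated data sets stand in for the posterior predictive distribution of y
predictive_mean = sum(replicated_heads) / n_draws
bayesian_p_value = sum(h >= heads for h in replicated_heads) / n_draws

print(f"posterior predictive mean number of heads: {predictive_mean:.2f}")
print(f"Bayesian p-value, P(replicated heads >= observed): {bayesian_p_value:.2f}")
```

The list `replicated_heads` is a distribution over data (numbers of heads), not over theta, which is the distinction the comment is drawing.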

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-23T21:26:58.115Z · LW · GW

Regarding reading Jaynes, my understanding is it's good for intuition but bad for applied statistics because it does not teach you modern Bayesian stuff such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.

Comment by Jan Christian Refsgaard (jan-christian-refsgaard) on Jaynesian interpretation - How does “estimating probabilities” make sense? · 2021-07-23T21:21:50.041Z · LW · GW

Given 1. your model and 2. the magical no uncertainty in theta, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution of y (coin flip outcomes), not theta (which describes the frequency)