Comments sorted by top scores.
comment by Steveot · 2021-05-21T11:29:28.993Z · LW(p) · GW(p)
Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how $P$ and $Q$ depend on $\lambda$ (if they do), and how to interpret the final inequality in this light? In particular, I am wondering because the bound seems to be best when $\lambda \to \infty$. Minor comment: I think there is a small typo in one of the formulas?
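For orientation, one common way PAC-Bayes-$\lambda$ bounds are written is sketched below; this is an assumed general form for bounded losses, not necessarily the exact inequality from the post.

```latex
% Assumed standard PAC-Bayes-\lambda form (losses in [0,1]): with probability
% at least 1-\delta over the n samples, for every posterior Q and fixed \lambda > 0,
\mathbb{E}_{h\sim Q}\!\left[L(h)\right]
  \;\le\; \mathbb{E}_{h\sim Q}\!\left[\hat{L}_n(h)\right]
  \;+\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{1}{\delta}}{\lambda}
  \;+\; \frac{\lambda}{8n}
```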
Replies from: zachary-robertson↑ comment by Past Account (zachary-robertson) · 2021-05-21T14:23:42.924Z · LW(p) · GW(p)
The term $Q$ is meant to be a posterior distribution after seeing data. If you have a good prior you could take $Q = P$. However, note the empirical loss could then be high. You want a trade-off between the cost of updating the prior (the KL term) and the loss reduction.
For example, say we have a neural network. Then our prior $P$ would be the initialization and the posterior $Q$ would be the distribution of outputs from SGD.
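To make the trade-off concrete, here is a small numerical sketch (an illustration with made-up loss numbers, assuming isotropic Gaussian prior/posterior and the standard PAC-Bayes-$\lambda$ form of the bound, not code from the post): keeping $Q = P$ makes the KL term vanish but leaves the empirical loss high, while moving $Q$ to the SGD weights lowers the loss at the price of a nonzero KL term.

```python
# Toy sketch of the prior-vs-posterior trade-off (illustration only):
# prior P = N(w_init, sigma^2 I), posterior Q = N(mu_q, sigma^2 I), plugged into
#   E_Q[L] <= E_Q[L_hat] + (KL(Q||P) + ln(1/delta)) / lambda + lambda / (8 n).
# The empirical-loss numbers below are made up purely to show the trade-off.
import numpy as np

def kl_isotropic_gaussians(mu_q, mu_p, sigma):
    """KL(N(mu_q, sigma^2 I) || N(mu_p, sigma^2 I)) = ||mu_q - mu_p||^2 / (2 sigma^2)."""
    return float(np.sum((mu_q - mu_p) ** 2) / (2 * sigma ** 2))

def pac_bayes_lambda_bound(emp_loss, kl, n, delta, lam):
    """Right-hand side of the assumed PAC-Bayes-lambda bound for losses in [0, 1]."""
    return emp_loss + (kl + np.log(1 / delta)) / lam + lam / (8 * n)

rng = np.random.default_rng(0)
n, delta, sigma = 10_000, 0.05, 0.1
w_init = rng.normal(size=1000)                  # "prior mean": the initialization
w_sgd = w_init + 0.05 * rng.normal(size=1000)   # "posterior mean": weights after SGD (toy)

# Two candidate posteriors: stay at the prior (Q = P) vs. move to the SGD solution.
candidates = {
    "Q = P (no update)":   (w_init, 0.45),  # KL = 0 but empirical loss stays high (made-up)
    "Q around SGD weights": (w_sgd, 0.05),  # lower empirical loss, but now KL > 0 (made-up)
}
for name, (mu_q, emp_loss) in candidates.items():
    kl = kl_isotropic_gaussians(mu_q, w_init, sigma)
    lam = np.sqrt(8 * n * (kl + np.log(1 / delta)))  # lambda minimizing this form of the bound
    print(f"{name}: KL = {kl:.1f}, bound = {pac_bayes_lambda_bound(emp_loss, kl, n, delta, lam):.3f}")
```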
(Btw thanks for the correction)
Replies from: Steveot
comment by Charlie Steiner · 2021-05-22T22:02:50.517Z · LW(p) · GW(p)
I'm still confused about the part where you use the Hoeffding inequality - how are the lambda in that step and the lambda in the loss function "the same lambda"?
Replies from: zachary-robertson↑ comment by Past Account (zachary-robertson) · 2021-05-22T23:15:52.850Z · LW(p) · GW(p)
Because the $\lambda$ introduced in the Hoeffding step is, by construction, the same $\lambda$ that appears in the loss function. They are the same. Does that help?
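A minimal sketch of where that single $\lambda$ enters, assuming the usual change-of-measure (Donsker-Varadhan) plus Hoeffding's-lemma route, which may differ from the post's exact derivation:

```latex
% Hoeffding's lemma for a loss in [0,1], any fixed h and fixed \lambda > 0:
\mathbb{E}_S\!\left[e^{\lambda\left(L(h) - \hat{L}_n(h)\right)}\right] \;\le\; e^{\lambda^2/(8n)}
% Change of measure (Donsker--Varadhan) with that same \lambda, plus Markov's
% inequality over the sample, then gives: with probability at least 1-\delta,
% simultaneously for all posteriors Q,
\lambda\, \mathbb{E}_{h\sim Q}\!\left[L(h) - \hat{L}_n(h)\right]
  \;\le\; \mathrm{KL}(Q\,\|\,P) + \frac{\lambda^2}{8n} + \ln\frac{1}{\delta}
% The \lambda multiplying the loss on the left is literally the \lambda from the
% exponential moment above; no new variable is ever introduced.
```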