post by [deleted]

Comments sorted by top scores.

comment by Steveot · 2021-05-21T11:29:28.993Z

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how P and Q depend on the data (if they do) and how to interpret the final inequality in this light? In particular, I am wondering because the bound seems to be best when Q = P. Minor comment: I think there is a small typo?

Replies from: zachary-robertson
comment by Past Account (zachary-robertson) · 2021-05-21T14:23:42.924Z

The term Q is meant to be a posterior distribution after seeing the data. If you have a good prior you could take Q = P. However, note that the loss under P could be high. You want a trade-off between the cost of updating the prior and the loss reduction.
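
For reference, one standard form of the bound under discussion (the deleted post may have stated a different variant): with $L$ the true risk, $\hat L$ the empirical risk on $n$ i.i.d. samples, a loss in $[0,1]$, and a fixed $\lambda > 0$, with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$,

$$\mathbb{E}_{h \sim Q}[L(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat L(h)] \;+\; \frac{\mathrm{KL}(Q \,\|\, P) + \ln(1/\delta)}{\lambda} \;+\; \frac{\lambda}{8n}.$$

Taking Q = P zeroes the KL term but leaves whatever empirical loss P happens to have, which is the trade-off described above.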

For example, say we have a neural network. Then our prior would be the initialization and the posterior would be the distribution of outputs from SGD.
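
A minimal numerical sketch of that picture (hypothetical numbers, not from the original post): treat the prior as a diagonal Gaussian centred at the initial weights, the posterior as a diagonal Gaussian centred at the SGD solution, and plug the resulting KL into the bound above.

```python
import numpy as np

# Hypothetical setup: diagonal Gaussian prior centred at the initialization,
# diagonal Gaussian "posterior" centred at the weights SGD ends up with.
rng = np.random.default_rng(0)
d = 1000                                    # number of network parameters
w_init = rng.normal(0.0, 0.1, d)            # prior mean: the initialization
w_sgd = w_init + rng.normal(0.0, 0.05, d)   # posterior mean: after (mock) SGD
sigma_p, sigma_q = 0.1, 0.05                # prior / posterior std deviations

# KL( N(w_sgd, sigma_q^2 I) || N(w_init, sigma_p^2 I) ) for diagonal Gaussians
kl = 0.5 * np.sum(
    (sigma_q**2 + (w_sgd - w_init) ** 2) / sigma_p**2
    - 1.0
    + 2.0 * np.log(sigma_p / sigma_q)
)

n, delta, lam = 50_000, 0.05, 5_000.0       # sample size, confidence, fixed lambda
empirical_loss = 0.08                       # assumed empirical loss of the posterior

# PAC-Bayes-lambda style bound for a loss in [0, 1]:
#   E_Q[L] <= E_Q[L_hat] + (KL(Q||P) + ln(1/delta)) / lam + lam / (8 n)
bound = empirical_loss + (kl + np.log(1.0 / delta)) / lam + lam / (8.0 * n)
print(f"KL(Q||P) = {kl:.1f}, bound on true loss = {bound:.3f}")
```

Moving the posterior further from the initialization (or shrinking its variance) buys a lower empirical loss but inflates the KL term, which is exactly the trade-off in question.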

(Btw thanks for the correction)

Replies from: Steveot
comment by Steveot · 2021-05-24T13:05:13.694Z

Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high probability (i.e., as you say, the randomness is over the data), while the learning bound or loss reduction is given for the posterior Q.
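
Spelled out with the usual quantifier structure (notation may differ from the deleted post), with $S \sim \mathcal{D}^n$ the training sample:

$$\Pr_{S \sim \mathcal{D}^n}\!\left[\ \forall Q:\ \mathbb{E}_{h \sim Q}[L(h)] \le \mathbb{E}_{h \sim Q}[\hat L_S(h)] + \frac{\mathrm{KL}(Q \,\|\, P) + \ln(1/\delta)}{\lambda} + \frac{\lambda}{8n}\ \right] \ge 1 - \delta.$$

The $1-\delta$ is over the draw of $S$, and because the inequality holds for all $Q$ simultaneously, it applies in particular to a posterior chosen after seeing $S$.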

comment by Charlie Steiner · 2021-05-22T22:02:50.517Z

I'm still confused about the part where you use the Hoeffding inequality - how is the lambda in that step and the lambda in the loss function "the same lambda"?

Replies from: zachary-robertson
comment by Past Account (zachary-robertson) · 2021-05-22T23:15:52.850Z

Because . They are the same. Does that help?
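
For readers without the deleted post: in the usual derivation (which the post presumably followed) the $\lambda$ enters exactly once, in the exponential moment that Hoeffding's lemma controls. For a loss in $[0,1]$,

$$\mathbb{E}_{S}\,\mathbb{E}_{h \sim P}\!\left[e^{\lambda\,(L(h) - \hat L_S(h))}\right] \le e^{\lambda^2/(8n)},$$

and the change-of-measure (Donsker–Varadhan) step plus Markov's inequality turn this into: with probability at least $1-\delta$, for all $Q$,

$$\lambda\left(\mathbb{E}_{Q}[L] - \mathbb{E}_{Q}[\hat L_S]\right) \le \mathrm{KL}(Q \,\|\, P) + \ln(1/\delta) + \frac{\lambda^2}{8n}.$$

Dividing by $\lambda$ gives the bound above, so the $\lambda$ in the final bound (and in any objective built from it) is the same parameter used in the Hoeffding step.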