0 comments

Comments sorted by top scores.

comment by Steveot · 2021-05-21T11:29:28.993Z · LW(p) · GW(p)

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how and $^L (π)$ depend on $π_{0}$ (if they do) and how to interpret the final inequality in this light? Particularly I am wondering because the bound seems to be best when $π = π_{0}$ . Minor comment: I think $n = m$ ?

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2021-05-21T14:23:42.924Z · LW(p) · GW(p)

The term is meant to be a posterior distribution after seeing data. If you have a good prior you could take $π = π_{0}$ . However, note $L (π)$ could be high. You want trade-off between the cost of updating the prior and the loss reduction.

Example, say we have a neural network. Then our prior would be the initialization and the posterior would be the distribution of outputs from SGD.

(Btw thanks for the correction)

Replies from: Steveot

↑ comment by Steveot · 2021-05-24T13:05:13.694Z · LW(p) · GW(p)

Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high probability (i.e., as you say, $π_{0}$ is the data), while the learning bound or loss reduction is given for $π$ .

comment by Charlie Steiner · 2021-05-22T22:02:50.517Z · LW(p) · GW(p)

I'm still confused about the part where you use the Hoeffding inequality - how is the lambda in that step and the lambda in the loss function "the same lambda"?

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2021-05-22T23:15:52.850Z · LW(p) · GW(p)

Because . They are the same. Does that help?