What happens to variance as neural network training is scaled? What does it imply about "lottery tickets"?

post by abramdemski · 2020-07-28T20:22:14.066Z · score: 23 (5 votes) · LW · GW · 1 comment

This is a question post.



Daniel Kokotajlo asks whether the lottery ticket hypothesis implies the scaling hypothesis.

The way I see it, this depends on the distribution from which the "lottery tickets" are drawn.

However, a long tail also suggests to me that variance in results would continue to be relatively high as a network is scaled: bigger networks are hitting bigger jackpots, but since even bigger jackpots are within reach, the payoff of scaling remains chaotic.

(This could all benefit from a more mathematical treatment.)
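As a very rough first step in that direction, here is a toy simulation of the intuition above. Everything in it is an assumption made for illustration, not a claim about real networks: a network of "size" n is modelled as n independent ticket draws, its quality is the best ticket it drew, and the two ticket distributions (a standard normal and a Pareto with shape 1.5) are arbitrary stand-ins for "short-tailed" and "long-tailed". The spread is reported as an interquartile range because a Pareto with this shape has infinite variance.

```python
import random
import statistics

def normal_ticket(rng):
    # Assumed short-tailed ticket quality: standard normal.
    return rng.gauss(0.0, 1.0)

def pareto_ticket(rng):
    # Assumed long-tailed ticket quality: Pareto with shape 1.5 (arbitrary choice).
    return rng.paretovariate(1.5)

def spread_of_best(n_tickets, draw, trials=400, seed=0):
    """Return (median, interquartile range) of the best ticket over repeated runs."""
    rng = random.Random(seed)
    # Each trial is one "training run": draw n_tickets tickets, keep the best one.
    bests = [max(draw(rng) for _ in range(n_tickets)) for _ in range(trials)]
    q1, q2, q3 = statistics.quantiles(bests, n=4)  # quartile cut points
    return q2, q3 - q1

for n in (10, 100, 1000):
    normal_median, normal_iqr = spread_of_best(n, normal_ticket)
    pareto_median, pareto_iqr = spread_of_best(n, pareto_ticket)
    print(f"tickets={n:5d}  "
          f"normal: median={normal_median:.2f} iqr={normal_iqr:.2f}  "
          f"pareto: median={pareto_median:.1f} iqr={pareto_iqr:.1f}")
```

Under these assumptions, the run-to-run spread of the best normal ticket should shrink slowly as n grows, while the spread of the best Pareto ticket should keep growing along with its median. That is the "payoff of scaling remains chaotic" picture, in the toy model only.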

So: what do we know about NN training? Does it suggest we are living in extremistan or mediocristan?

Note: a major conceptual difficulty in answering this question is representing NN quality in the right units. For example, an accuracy metric -- which necessarily falls between 0% and 100% -- must yield "diminishing returns", and cannot follow a "long-tailed distribution". Take that same metric and send it through an inverse sigmoid (the logit function), and now you might not have diminishing returns, and could have a long-tailed distribution. But we can transform data all day. The analysis shouldn't be too ad hoc. So it's not immediately clear how to measure this.
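A small sketch of the units problem, using made-up accuracy values chosen only for illustration: each step removes 90% of the remaining errors, which looks like sharply diminishing returns in raw accuracy but like roughly constant progress after the inverse sigmoid (logit).

```python
import math

def logit(p):
    # Inverse of the sigmoid: maps (0, 1) onto the whole real line.
    return math.log(p / (1.0 - p))

# Hypothetical accuracies; each step cuts the error rate by a factor of 10.
accuracies = [0.9, 0.99, 0.999, 0.9999]

# Improvement from one step to the next, in raw and in logit units.
raw_gains = [b - a for a, b in zip(accuracies, accuracies[1:])]
logit_gains = [logit(b) - logit(a) for a, b in zip(accuracies, accuracies[1:])]

print("raw gains:  ", [round(g, 4) for g in raw_gains])
print("logit gains:", [round(g, 3) for g in logit_gains])
```

The raw gains shrink by about a factor of ten at each step, while the logit gains stay close to ln(10). Neither unit is obviously the right one, which is the difficulty described above.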


answer by dsj · 2020-07-28T22:59:14.970Z · score: 9 (2 votes) · LW(p) · GW(p)

One assumption that I think might be implicit in your question is that the number of lottery tickets is linear in model size. But it seems plausible to me that it’s exponential in network depth.
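One way such a gap could arise, offered purely as an illustration and not necessarily what is meant here: suppose a "ticket" were identified with a single input-to-output path through a fully connected network. The functions `weight_count` and `path_count` below are hypothetical helpers for that toy picture, assuming `depth` weight layers joining `depth + 1` layers of `width` units each.

```python
def weight_count(width, depth):
    # Each of the `depth` weight layers is a width-by-width matrix.
    return depth * width * width

def path_count(width, depth):
    # A path picks one unit in each of the depth + 1 unit layers.
    return width ** (depth + 1)

width = 4
for depth in (2, 4, 8):
    print(f"depth={depth}  weights={weight_count(width, depth)}  "
          f"paths={path_count(width, depth)}")
```

In this picture the weight count grows linearly with depth while the path count grows exponentially, so "model size" and "number of tickets" can come apart quickly.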

1 comment


comment by romeostevensit · 2020-07-29T02:05:57.756Z · score: 2 (1 votes) · LW(p) · GW(p)

One related question is which sub-tasks showed surprise jackpots in GPT-3 versus GPT-2.