Does the lottery ticket hypothesis suggest the scaling hypothesis?
post by Daniel Kokotajlo (daniel-kokotajlo)
This is a question post.
The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I'm also interested in defining it differently, as whatever Gwern is talking about.
answer by abramdemski
The implication depends on the distribution of lottery tickets. If there is a short-tailed distribution, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.
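To make the tail argument concrete, here is a toy sketch, not a model of real networks. It assumes each "ticket" has an independent scalar quality and that a network is only as good as its best ticket; both are simplifications of mine, as are the names `mean_best_ticket`, `short_tailed`, and `long_tailed`. A Gaussian stands in for the short-tailed case and a Pareto with shape 2 for the long-tailed case.

```python
import random
import statistics


def mean_best_ticket(sample_ticket, n_tickets, n_trials=300, seed=0):
    """Average quality of the best of n_tickets draws, over n_trials repeats."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(sample_ticket(rng) for _ in range(n_tickets))
        for _ in range(n_trials)
    )


def short_tailed(rng):
    # Assumption: Gaussian ticket quality as a stand-in for "short-tailed".
    return rng.gauss(0.0, 1.0)


def long_tailed(rng):
    # Assumption: Pareto (shape 2) ticket quality as a stand-in for "long-tailed".
    return rng.paretovariate(2.0)


def gains(values):
    """Improvement in best-ticket quality from each 10x increase in tickets."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


sizes = [10, 100, 1000]  # number of tickets, i.e. a proxy for network size
short = [mean_best_ticket(short_tailed, n) for n in sizes]
long_ = [mean_best_ticket(long_tailed, n) for n in sizes]

for n, s, l in zip(sizes, short, long_):
    print(f"tickets={n:5d}  short-tailed best={s:7.2f}  long-tailed best={l:7.2f}")

print("gain per 10x, short-tailed:", gains(short))
print("gain per 10x, long-tailed: ", gains(long_))
```

Under these assumptions the expected best Gaussian ticket grows only about as fast as sqrt(2 ln n), so each 10x in tickets buys a small and shrinking improvement, while the expected best Pareto ticket with shape α grows roughly like n^(1/α), so each 10x buys more than the last. Which regime real networks are in is exactly the empirical question.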
I ask a question here about what's true in practice.
Comments sorted by top scores.
comment by gwern
I wouldn't say the scaling hypothesis is purely about Transformers. Quite a few of my examples are RNNs, and it's unclear how much of a difference there is between RNNs and Transformers anyway. Transformers just appear to be a sweet spot in terms of power while still being efficiently optimizable on contemporary GPUs. CNNs for classification definitely get better with scale and do things like disentangle & transfer & become more robust as they get bigger (example from today), but I don't know whether they start exhibiting any meta-learning specifically.