How would the Scaling Hypothesis change things?

post by Aryeh Englander (alenglander) · 2021-08-13T15:42:03.449Z · LW · GW · 3 comments

This is a question post.


The Scaling Hypothesis roughly says that current Deep Learning techniques, given ever more computing power, data, and perhaps some relatively minor improvements, will scale all the way to human-level AI and beyond. Let's suppose for the sake of argument that the Scaling Hypothesis is correct. How would that change your forecasts or perspectives on anything related to the future of AI?

Answers

answer by lorepieri (Lorenzo Rex) · 2021-08-14T22:42:46.819Z · LW(p) · GW(p)
  • Would your forecasts for AI timelines shorten significantly?

Yes, by 10-20 years, in particular for the first human-level AGI, which I currently forecast for 2045-2060.

  • Would your forecasts change for the probability of AI-caused global catastrophic / existential risks?

Not by much; I assign a low probability to AI existential risk.

  • Would your focus of research or interests change at all?

Yes, in the same way that classic computer vision has been made pretty much obsolete by deep learning, apart from a few pockets and simple use cases.

  • Would it perhaps even change your perspective on life?

Yes, positively. We would reach the commercialisation of AGI faster than expected, shortening the gap to a post-scarcity society.

 

That said, I don't believe in the scaling hypothesis. Even though NNs appear capable of simulating arbitrarily complex behaviours, I think we will soon hit a wall of diminishing returns, making it impractical to reach the first AGI this way.

3 comments


comment by Charlie Steiner · 2021-08-13T22:25:51.851Z · LW(p) · GW(p)

I already believe in the scaling hypothesis; I just don't think we're in a world that's going to get to test it until after transformative AI is built by people who've continued to make progress on algorithms and architecture.

Perhaps there's an even stronger hypothesis that I'm more skeptical about: that you could actually get decent data-efficiency out of current architectures if they were just really, really big. (My standards for "decent" involve beating what is currently thought of as the scaling law for dataset size for transformers doing text prediction.) I think this would greatly increase the importance I'd place on politics / policy ASAP, because then we'd already be living in a world where a sufficiently large project would be transformative.
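For concreteness, the dataset-size scaling law referred to here is roughly the power law reported by Kaplan et al. (2020) for transformer language models, L(D) ≈ (D_c/D)^α_D. The sketch below is only an illustration of what "beating the scaling law" would mean, using approximate constants from that paper; it is not anything from the comment itself.

```python
# Illustrative sketch (not from the original comment): the Kaplan et al. (2020)
# dataset-size scaling law for transformer LMs, L(D) ~ (D_c / D)^alpha_D.
# Constants are approximate values reported in that paper.
ALPHA_D = 0.095   # exponent for dataset size
D_C = 5.4e13      # "critical" dataset size, in tokens (approximate)

def predicted_loss(num_tokens: float) -> float:
    """Cross-entropy loss predicted by the power law at a given dataset size."""
    return (D_C / num_tokens) ** ALPHA_D

for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"{tokens:.0e} tokens -> predicted loss ~ {predicted_loss(tokens):.2f}")

# "Decent data-efficiency" in the sense above would mean reaching a given loss
# with substantially fewer tokens than this curve predicts.
```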

Replies from: gwern
comment by gwern · 2021-08-13T22:51:11.694Z · LW(p) · GW(p)

which is that you could actually get decent data-efficiency out of current architectures if they were just really really big?

You mean in some way other than the improvements on zero/few-shotting/meta-learning we already see from stuff like Dactyl or GPT-3 where bigger=better?

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-08-14T04:18:56.906Z · LW(p) · GW(p)

Here's maybe an example of what I'm thinking:

GPT-3 can zero-shot add numbers (to the extent that it can) because it's had to predict a lot of numbers getting added. And it's way better than GPT-2, which could only sometimes add 1- or 2-digit numbers (citation just for clarity).

In a "weak scaling" view, this trend (such as it is) would continue - GPT-4 will be able to do more arithmetic, and will basically always carry the 1 when adding 3-digit numbers, and is starting to do notably well at adding 5-digit numbers, though it still often fails to carry the 1 across multiple places there. In this picture adding more data and compute is analogous to doing interpolation better and between rarer examples. After all, this is all that's necessary to make the loss go down.

In a "strong scaling" view, the prediction function that gets learned isn't just expected to interpolate, but to extrapolate, and extrapolate quite far with enough data and compute. And so maybe not GPT-4, but at least GPT-5 would be expected to "actually learn addition," in the sense that even if we scrubbed all 10+ digit addition from the training data, it would effortlessly (given an appropriate prompt) be able to add 15-digit numbers, because at some point the best hypothesis for predicting addition-like text involves a reliably-extrapolating algorithm for addition.

You mean in some way other than the improvements on zero/few-shotting/meta-learning we already see from stuff like Dactyl or GPT-3 where bigger=better?

So in short, how much better is bigger? I think the first case is more likely for a lot of different sorts of tasks, and I think it will still lead to super-impressive performance but with really bad data efficiency. I'm also fairly convinced by Steve's arguments [AF · GW] that humans have architectural/algorithmic reasons for better data efficiency.