LW/ACX Saturday (8/5/23) The Scaling Hypothesis - by Gwern

post by Michael Michalchik (michael-michalchik) · 2023-08-04T04:15:31.218Z


LW/ACX Saturday (8/5/23) The Scaling Hypothesis - by Gwern
 

https://docs.google.com/document/d/1Cqb2q5OVWkHcFyFTegG563JGnTrxU_Ahoh-iLF6J_KE/edit?usp=sharing
 
Hello Folks!

We are excited to announce the 37th Orange County ACX/LW meetup, happening this Saturday and most Saturdays thereafter.


 

Host: Michael Michalchik

Email: michaelmichalchik@gmail.com (For questions or requests)

Location: 1970 Port Laurent Place, Newport Beach, CA 92660

Date: Saturday, Aug 5, 2023

Time: 2 PM

Conversation Starters:

https://gwern.net/scaling-hypothesis



 

Walk & Talk: We usually have an hour-long walk and talk after the meeting starts. Two mini-malls with hot takeout food are easily accessible nearby. Search for Gelson's or Pavilions in the zip code 92660.

Share a Surprise: Tell the group about something unexpected that changed your perspective on the universe.

Future Direction Ideas: Contribute ideas for the group's future direction, including topics, meeting types, activities, etc.



 

There is only one reading this week because it is pretty long. Here are summaries from ChatGPT and Claude 2, respectively.

Claude's summary:



 

Why Does Pretraining Work? 


 

- The pretraining thesis argues language models can reach human-level language understanding through next-word/character prediction on large, diverse text corpora. This may seem counterintuitive - how can simply predicting the next token lead to intelligence? But predicting the next token is an extremely difficult task that implicitly requires learning all aspects of language.


 

- As the model trains, it progresses through stages: first learning letter frequencies, then common words and phrases, then syntax and grammar, and eventually more complex phenomena like logic, causality, and reasoning. With enough data, the only way to minimize loss is true understanding.


 

- Pretraining leverages the wisdom of the crowd - implicit human knowledge encoded in textual data. Humans generate highly structured data reflecting capacities like reasoning. A model must learn these to best predict text.


 

- Each decimal place of loss reduction represents learning nuanced linguistic phenomena. The last fractions require modeling rare reasoning scenarios correctly every time. Tiny model improvements in rare cases translate to big real-world benefits.


 

- If a model can predict arbitrary text as well as a human (achieving <0.7 bits of loss per character), this essentially indicates human-level language intelligence. The model would have to be able to reason, use common sense, and so on, to succeed.


 

- GPT-3's loss of ~1.73 bits per character shows progress but indicates that current models still have a gap compared to human performance; larger scale is still needed (see the short bits-per-character calculation after this list).
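To make these loss figures concrete, here is a minimal sketch in Python (my own toy example, not code or data from Gwern's essay; the sample text and the unigram "model" are illustrative assumptions) that scores a trivial next-character predictor in bits per character and converts the figures quoted above into per-character perplexities:

```python
import math
from collections import Counter

text = "the cat sat on the mat. the cat sat on the hat."

# Toy "model": unigram character frequencies estimated from the text itself.
counts = Counter(text)
total = sum(counts.values())
unigram_prob = {ch: n / total for ch, n in counts.items()}

# Loss = average negative log2-probability the model assigned to each actual character.
bits_per_char = -sum(math.log2(unigram_prob[ch]) for ch in text) / len(text)
print(f"toy unigram model: {bits_per_char:.2f} bits/char "
      f"(perplexity ~{2 ** bits_per_char:.1f} choices per character)")

# The figures quoted in the summary above, converted to per-character perplexity.
for label, bpc in [("GPT-3 (~1.73 bits/char)", 1.73),
                   ("estimated human level (<0.7 bits/char)", 0.70)]:
    print(f"{label}: ~{2 ** bpc:.2f} effectively equally likely choices per character")
```

The conversion is the useful part: ~1.73 bits/char corresponds to roughly 3.3 effectively equally likely choices per character, while <0.7 bits/char corresponds to roughly 1.6, so closing that gap means resolving most of the remaining uncertainty about what comes next.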


 

The Last Bits are Deepest


 

- As pretraining progresses, the model exhausts simple statistical patterns and must learn more sophisticated phenomena to reduce loss.


 

- The last fractions of a bit of loss reduction require correctly handling edge cases involving reasoning, logic, common sense, etc. Tiny model improvements on rare events can translate to big real-world gains.


 

- For example, a model may learn that objects don't change state randomly. Fixing the few cases where it incorrectly predicts that a dead person comes back to life could have a disproportionate impact, reflecting an understanding of causality.


 

- The hardest bits force the model to go beyond surface patterns and capture something closer to true abstract understanding. They produce generalizable knowledge.


 

- So, while the last bits of loss are the hardest to reduce, they are also the most valuable. The model must squeeze out every bit of intelligence to successfully handle the full diversity of linguistic scenarios (a rough back-of-the-envelope illustration follows this list).
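As a rough back-of-the-envelope illustration of that point (every number below is an assumption chosen for illustration, not a figure from the essay):

```python
# Why the last fractions of a bit matter: rare reasoning-heavy characters barely
# move the average loss even when the model is badly wrong on every one of them.
rare_fraction = 0.001        # assumption: 0.1% of characters sit in reasoning-heavy spots
extra_bits_when_wrong = 3.0  # assumption: the model wastes ~3 extra bits on each of them

hidden_loss = rare_fraction * extra_bits_when_wrong
print(f"Getting all of those cases right removes only ~{hidden_loss:.3f} bits/char "
      f"from the average loss, yet that is exactly where causality and reasoning live.")
```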


 

Reasons for Doubt


 

- The pretraining thesis seems almost too good to be true. Just scale up prediction on text, and all capabilities emerge? This sounds like alchemy, not science.


 

- It's hard to prove pretraining would work in practice. The model might need infeasible amounts of compute or data. Key algorithms could be fundamentally limited. The model may get stuck in simple patterns.


 

- But GPT-3 provides strong evidence that pretraining and the scaling hypothesis are, in fact, valid. Its surprising new capabilities demonstrate that solving the pretraining task does lead to emergent reasoning abilities, despite no explicit supervision.


 

- Almost no one quantitatively predicted the precise capabilities of GPT-3 in advance. So even skeptics must update their views in light of these empirical results. The proponents of the scaling hypothesis have been proven right so far.
