Understanding the diffusion of large language models: summary

post by Ben Cottier (ben-cottier) · 2023-01-16T01:37:40.416Z · LW · GW · 1 comments


Comments sorted by top scores.

comment by cfoster0 · 2023-01-16T03:06:24.663Z · LW(p) · GW(p)

The only other explicit replication attempt I am aware of has not succeeded; this is the GPT-NeoX project by the independent research collective EleutherAI.

FWIW, if the desire was there, I think EleutherAI definitely could have replicated a full 175B parameter GPT-3 equivalent. They had/have sufficient compute, knowledge, etc. From what I recall from being part of that community, some reasons for stopping at 20B were:

  • Many folks wanted to avoid racing forward and avoid advancing timelines. Not training a big model "just because" was one way to do that.
  • 20B parameter models already stretch what you can fit on any single-GPU setup, and 175B parameter models require a whole bunch of GPUs just to run inference, so they're ~useless for most researchers (see the back-of-the-envelope sketch after this list).
  • Nobody wanted to lead that project, when push came to shove. GPT-Neo, GPT-J, and GPT-NeoX-20B all took lots of individual initiative to get done, and most of the individuals who led those projects had other priorities by the time the compute for a 175B parameter model training run became available.
  • Putting out the best possible 175B parameter model would've required gathering more data, given DeepMind's Chinchilla findings (see the second sketch below), which would have been a PITA.
  • Eventually, Meta put out a replication (OPT-175B), which made working on another replication seem silly.
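
For concreteness, here is a back-of-the-envelope sketch of the GPU point above. It assumes 16-bit weights (2 bytes per parameter), ignores activation and KV-cache overhead, and takes 80 GB as the memory of a single large accelerator purely as an illustrative assumption, so real requirements are somewhat higher:

```python
# Rough memory footprint for serving a dense transformer, weights only,
# assuming 16-bit (2-byte) parameters. Activations and KV cache add more.
BYTES_PER_PARAM = 2          # fp16 / bf16
GPU_MEMORY_GB = 80           # e.g. one 80 GB A100-class card (assumption)

for n_params in (20e9, 175e9):
    weights_gb = n_params * BYTES_PER_PARAM / 1e9
    gpus_needed = -(-weights_gb // GPU_MEMORY_GB)   # ceiling division
    print(f"{n_params/1e9:.0f}B params: ~{weights_gb:.0f} GB of weights, "
          f">= {gpus_needed:.0f} GPU(s) just to hold them")
```

Weights alone for a 20B model (~40 GB) fit, tightly, on one large card, while a 175B model (~350 GB) needs a multi-GPU node before you even account for activations, which is the gap the comment is pointing at.

And a similarly rough sketch of the Chinchilla point: the commonly quoted rule of thumb from the Chinchilla paper is roughly 20 training tokens per parameter (an approximation, not an exact constant), so a compute-optimal 175B model wants far more data than the ~300B tokens GPT-3 was trained on:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
TOKENS_PER_PARAM = 20        # approximate ratio from the Chinchilla findings

n_params = 175e9
optimal_tokens = TOKENS_PER_PARAM * n_params
gpt3_tokens = 300e9          # roughly what GPT-3 was trained on

print(f"Compute-optimal tokens for 175B params: ~{optimal_tokens/1e12:.1f}T")
print(f"That is ~{optimal_tokens/gpt3_tokens:.0f}x GPT-3's ~300B training tokens")
```

That works out to ~3.5T tokens, roughly an order of magnitude more data than GPT-3 used, which is why "gathering more data" was a real cost and not just an inconvenience.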
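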