chinchilla's wild implications

post by nostalgebraist · 2022-07-31T01:18:28.254Z · LW · GW · 128 comments

Contents

  1. the scaling law
    plugging in real models
  2. are we running out of data?
    web scrapes
      MassiveWeb
      The GLaM/PaLM web corpus
    domain-specific corpora
      Code
      Arxiv
      Books
    "all the data we have"
  what is compute? (on a further barrier to data scaling)
  appendix: to infinity
128 comments

(Colab notebook here.)

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.[1]

The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion.  In particular:

Some things to note at the outset:

1. the scaling law

The paper fits a scaling law for LM loss $L$, as a function of model size $N$ and data size $D$.

Its functional form is very simple, and easier to reason about than the $L(N, D)$ law from the earlier Kaplan et al papers.  It is a sum of three terms:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

The first term only depends on the model size.  The second term only depends on the data size.  And the third term is a constant.

You can think about this as follows.

An "infinitely big" model, trained on "infinite data," would achieve loss $E$.  To get the loss for a real model, you add on two "corrections":

  1. one for the fact that the model only has $N$ parameters, not infinitely many
  2. one for the fact that the model only sees $D$ training examples, not infinitely many

Here's the same thing, with the constants fitted to DeepMind's experiments on the MassiveText dataset[3].

$$L(N, D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69$$

plugging in real models

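Gopher is a model with 280B parameters, trained on 300B tokens of data.  What happens if we plug in those numbers?

$$L(280 \cdot 10^9,\ 300 \cdot 10^9) = \underbrace{0.052}_{\text{finite model}} + \underbrace{0.251}_{\text{finite data}} + \underbrace{1.69}_{\text{irreducible}} = 1.993$$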

What jumps out here is that the "finite model" term is tiny.

In terms of the impact on LM loss, Gopher's parameter count might as well be infinity.  There's a little more to gain on that front, but not much.

Scale the model up to 500B params, or 1T params, or 100T params . . . and the most this can ever do for you is a 0.052 reduction in loss[4].

Meanwhile, the "finite data" term is not tiny.  Gopher's training data size is very much not infinity, and we can go a long way by making it bigger.
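To make the plug-in step concrete, here is a minimal Python sketch of the fitted law applied to Gopher. The constants are the values reported for the Chinchilla paper's parametric fit (an assumption as far as this text goes), and the function and variable names are illustrative rather than taken from the paper or the Colab notebook.

```python
# Fitted constants from the Chinchilla paper's parametric loss fit (assumed here).
A, ALPHA = 406.4, 0.34   # "finite model" term
B, BETA = 410.7, 0.28    # "finite data" term
E = 1.69                 # irreducible loss


def loss_terms(n_params, n_tokens):
    """Return (finite-model term, finite-data term, irreducible term)."""
    return A / n_params**ALPHA, B / n_tokens**BETA, E


gopher_model, gopher_data, gopher_const = loss_terms(280e9, 300e9)
gopher_loss = gopher_model + gopher_data + gopher_const

# Sending N to infinity drives the first term to zero and leaves the others
# untouched, so the first term is also the most that extra params can buy.
max_gain_from_params = gopher_model

print(f"finite model: {gopher_model:.3f}")
print(f"finite data:  {gopher_data:.3f}")
print(f"total loss:   {gopher_loss:.3f}")
```

The "finite data" term is several times larger than the "finite model" term, which is the whole point: at 300B tokens, nearly all of the reducible loss sits on the data side.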


Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

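It's 70B params, trained on 1.4T tokens of data.  Let's plug that in:

$$L(70 \cdot 10^9,\ 1400 \cdot 10^9) = \underbrace{0.083}_{\text{finite model}} + \underbrace{0.163}_{\text{finite data}} + \underbrace{1.69}_{\text{irreducible}} = 1.936$$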

Much better![5]

Without using any more compute, we've improved the loss by 0.057.  That's bigger than Gopher's entire "finite model" term!

The paper demonstrates that Chinchilla roundly defeats Gopher on downstream tasks, as we'd expect.

Even that understates the accomplishment, though.  At least in terms of loss, Chinchilla doesn't just beat Gopher.  It beats any model trained on Gopher's data, no matter how big.
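The "no matter how big" claim is easy to check numerically. The sketch below assumes the paper's fitted constants and compares Chinchilla with Gopher and with a hypothetical Gopher whose "finite model" term has been driven all the way to zero. The names are illustrative.

```python
# Fitted constants from the Chinchilla paper's parametric loss fit (assumed here).
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params, n_tokens):
    """Predicted LM loss for a model with n_params trained on n_tokens."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


gopher = loss(280e9, 300e9)
chinchilla = loss(70e9, 1.4e12)

# A Gopher with infinitely many params keeps only the data and constant terms.
gopher_infinite_params = B / 300e9**BETA + E

print(f"gopher:                 {gopher:.4f}")
print(f"chinchilla:             {chinchilla:.4f}")
print(f"gopher, infinite params: {gopher_infinite_params:.4f}")
print(f"chinchilla's edge over gopher: {gopher - chinchilla:.4f}")
```

Under this equation, even the infinite-params Gopher lands slightly above Chinchilla, because its 300B-token data term alone outweighs both of Chinchilla's correction terms combined.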

To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data.  (GPT-3 did it, and everyone else followed.)

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got[6].

People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far as Chinchilla did.


Here's where the various models lie on a contour plot of LM loss (per the equation), with $N$ and $D$ on the axes.

Only PaLM is remotely close to Chinchilla here.  (Indeed, PaLM does slightly better.)

PaLM is a huge model.  It's the largest one considered here, though MT-NLG is a close second.  Everyone writing about PaLM mentions that it has 540B parameters, and the PaLM paper does a lot of experiments on the differences between the 540B PaLM and smaller variants of it.

According to this scaling law, though, PaLM's parameter count is a mere footnote relative to PaLM's training data size.

PaLM isn't competitive with Chinchilla because it's big.  MT-NLG is almost the same size, and yet it's trapped in the pinkish-purple zone on the bottom-left, with Gopher and the rest.

No, PaLM is competitive with Chinchilla only because it was trained on more tokens (780B) than the other non-Chinchilla models.  For example, this change in data size constitutes 85% of the loss improvement from Gopher to PaLM.

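Here's the precise breakdown for PaLM:

$$L(540 \cdot 10^9,\ 780 \cdot 10^9) = \underbrace{0.042}_{\text{finite model}} + \underbrace{0.192}_{\text{finite data}} + \underbrace{1.69}_{\text{irreducible}} = 1.924$$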
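For anyone who wants to reproduce the split between "more params" and "more data" in the Gopher-to-PaLM improvement, here is a rough sketch. It assumes the paper's fitted constants, and the attribution is simply the per-term difference in the equation. The names are illustrative.

```python
# Fitted constants from the Chinchilla paper's parametric loss fit (assumed here).
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def model_term(n_params):
    return A / n_params**ALPHA


def data_term(n_tokens):
    return B / n_tokens**BETA


palm_loss = model_term(540e9) + data_term(780e9) + E
gopher_loss = model_term(280e9) + data_term(300e9) + E

# The loss is a sum, so the improvement splits cleanly into two pieces.
gain_from_params = model_term(280e9) - model_term(540e9)
gain_from_data = data_term(300e9) - data_term(780e9)
data_share = gain_from_data / (gain_from_params + gain_from_data)

print(f"PaLM loss: {palm_loss:.3f}")
print(f"share of Gopher -> PaLM improvement due to data: {data_share:.0%}")
```

Nearly doubling the parameter count buys only a small slice of the improvement, while going from 300B to 780B tokens accounts for the rest.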

PaLM's gains came with a great cost, though.  It used way more training compute than any previous model, and its size means it also takes a lot of inference compute to run.

Here's a visualization of loss vs. training compute (loss on the y-axis and in color as well):

Man, we spent all that compute on PaLM, and all we got was the slightest edge over Chinchilla!

Could we have done better?  In the equation just above, PaLM's terms look pretty unbalanced.  Given that compute, we probably should have used more data and trained a smaller model.

The paper tells us how to pick optimal values for params and data, given a compute budget.  Indeed, that's its main focus.

If we use its recommendations for PaLM's compute, we get the point "palm_opt" on this plot:

Ah, now we're talking!


"palm_opt" sure looks good.  But how would we train it, concretely?

Let's go back to the $N$-vs.-$D$ contour plot world.

I've changed the axis limits here, to accommodate the massive data set you'd need to spend PaLM's compute optimally.

How much data would that require?  Around 6.7T tokens, or ~4.8 times as much as Chinchilla used.

Meanwhile, the resulting model would not be nearly as big as PaLM.  The optimal compute law actually puts it at 63B params[7].
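The 63B / 6.7T figures can be recovered with a short calculation. This is a sketch under two assumptions that this post does not spell out: training compute is approximated as $C \approx 6ND$ FLOPs, and the optimum is the closed-form minimiser of the fitted loss under that constraint (the paper's "Approach 3"). The function name is hypothetical.

```python
# Fitted constants from the Chinchilla paper's parametric loss fit (assumed here).
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28


def compute_optimal(flops):
    """Minimise A/N^alpha + B/D^beta subject to flops = 6 * N * D.

    Substituting D = (flops / 6) / N and setting the derivative with respect
    to N to zero gives N = G * (flops / 6) ** (beta / (alpha + beta)).
    """
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    nd = flops / 6                      # the product N * D fixed by the budget
    n_opt = g * nd ** (BETA / (ALPHA + BETA))
    d_opt = nd / n_opt
    return n_opt, d_opt


# PaLM's budget under the 6ND approximation: 540B params, 780B tokens.
palm_flops = 6 * 540e9 * 780e9
n_opt, d_opt = compute_optimal(palm_flops)

print(f"palm_opt params: {n_opt / 1e9:.0f}B")
print(f"palm_opt tokens: {d_opt / 1e12:.1f}T")
print(f"tokens relative to Chinchilla's 1.4T: {d_opt / 1.4e12:.1f}x")
```

Because the fitted exponents are close to each other, the optimal split grows params and data at similar rates as compute grows, which is why an 8.6x smaller model ends up paired with an 8.6x larger dataset here.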

Okay, so we just need to get 6.7T tokens and . . . wait, how exactly are we going to get 6.7T tokens?  How much text data is there, exactly?

2. are we running out of data?

It is frustratingly hard to find an answer to this question.

The main moral I want to get across in this post is that the large LM community has not taken data scaling seriously enough.

LM papers are meticulous about $N$ -- doing all kinds of scaling analyses on models of various sizes, etc.  There has been tons of smart discussion about the hardware and software demands of training high-$N$ models.  The question "what would it take to get to 1T params? (or 10T?)" is on everyone's radar.

Yet, meanwhile:

As a particularly egregious example, here is what the LaMDA paper says about the composition of their training data:

The pre-training data, called Infiniset, is a combination of dialog data from public dialog data and other public web documents. It consists of 2.97B documents and 1.12B dialogs with 13.39B utterances. The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data [11]; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents. The total number of words in the dataset is 1.56T.

"Dialogs data from public forums"?  Which forums?  Did you use all the forum data you could find, or only 0.01% of it, or something in between?  And why measure words instead of tokens -- unless they meant tokens?

If people were as casual about scaling $N$ as this quotation is about scaling $D$, the methods sections of large LM papers would all be a few sentences long.  Instead, they tend to look like this (excerpted from ~3 pages of similar material):

From the PaLM paper

...anyway.  How much more data could we get?

This question is complicated by the fact that not all data is equally good.

(This messy Google sheet contains the calculations behind some of what I say below.)

web scrapes

If you just want a lot of text, the easiest way to get it is from web scrapes like Common Crawl.

But these are infamously full of garbage, and if you want to train a good LM, you probably want to aggressively filter them for quality.  And the papers don't tell us how much total web data they have, only how much filtered data.

MassiveWeb

The training dataset used for Gopher and Chinchilla is called MassiveText, and the web scrape portion of it is called MassiveWeb.  This data originates in a mysterious, unspecified web scrape[8], which is funneled through a series of filters, including quality heuristics and an attempt to only keep English text.

MassiveWeb is 506B.  Could it be made bigger, by scaling up the original web scrape?  That depends on how complete the original web scrape was -- but we know nothing about it.

The GLaM/PaLM web corpus

PaLM used a different web scrape corpus.  It was first used in this paper about "GLaM," which again did not say anything about the original scraping process, only describing the quality filtering they did (and not in much detail).

The GLaM paper says its filtered web corpus is 143B tokens.  That's a lot smaller than MassiveWeb.  Is that because of the filtering?  Because the original scrape was smaller?  Dunno.

To further complicate matters, the PaLM authors used a variant of the GLaM dataset which made multilingual versions of (some of?) the English-only components.

How many tokens did this add?  They don't say[9].

We are told that 27% (211B) of PaLM's training tokens came from this web corpus, and we are separately told that they tried to avoid repeating data.  So the PaLM version of the GLaM web corpus is probably at least 211B, versus the original 143B.  (Though I am not very confident of that.)

Still, that's much smaller than MassiveWeb.  Is this because they had a higher quality bar (which would be bad news for further data scaling)?  They do attribute some of PaLM's success to quality filtering, citing the ablation on this in the GLaM paper[10].

It's hard to tell, but there is this ominous comment, in the section where they talk about PaLM vs. Chinchilla:

Although there is a large amount of very high-quality textual data available on the web, there is not an infinite amount. For the corpus mixing proportions chosen for PaLM, data begins to repeat in some of our subcorpora after 780B tokens, which is why we chose that as the endpoint of training. It is unclear how the “value” of repeated data compares to unseen data for large-scale language model training[11].

The subcorpora that start to repeat are probably the web and dialogue ones.

Read literally, this passage seems to suggest that even the vast web data resources available to Google Research (!) are starting to strain against the data demands of large LMs.  Is that plausible?  I don't know.

domain-specific corpora

We can speak with more confidence about text in specialized domains that's less common on the open web, since there's less of it out there, and people are more explicit about where they're getting it.

Code

If you want code, it's on Github.  There's some in other places too, but if you've exhausted Github, you probably aren't going to find orders of magnitude of additional code data.  (I think?)

We've more-or-less exhausted Github.  It's been scraped a few times with different kinds of filtering, which yielded broadly similar data sizes:

(The text to token ratios vary due to differences in how whitespace was tokenized.)

All of these scrapes contained a large fraction of the total code available on Github (in the Codex paper's case, just the python code).

Generously, there might be ~1T tokens of code out there, but not vastly more than that.

Arxiv

If you want to train a model on advanced academic research in physics or mathematics, you go to Arxiv.

For example, Arxiv was about half the training data for the math-problem-solving LM Minerva.

We've exhausted Arxiv.  Both the Minerva paper and the Pile use basically all of Arxiv, and it amounts to a measly 21B tokens.

Books

Books?  What exactly are "books"?

In the Pile, "books" means the Books3 corpus, which means "all of Bibliotik."  It contains 196,640 full-text books, amounting to only 27B tokens.

In MassiveText, a mysterious subset called "books" has 560B tokens.  That's a lot more than the Pile has!  Are these all the books?  In . . . the world?  In . . . Google books?  Who even knows?

In the GLaM/PaLM dataset, an equally mysterious subset called "books" has 390B tokens.

Why is the GLaM/PaLM number so much smaller than the MassiveText number?  Is it a tokenization thing?  Both of these datasets were made by Google, so it's not like the Gopher authors have special access to some secret trove of forbidden books (I assume??).

If we want LMs to learn the kind of stuff you learn from books, and not just from the internet, this is what we have.

As with the web, it's hard to know what to make of it, because we don't know whether this is "basically all the books in the world" or just some subset that an engineer pulled at one point in time[13].

"all the data we have"

In my spreadsheet, I tried to make a rough, erring-on-generous estimate of what you'd get if you pooled together all the sub-corpora mentioned in the papers I've discussed here.

I tried to make it an overestimate, and did some extreme things like adding up both MassiveWeb and the GLaM/PaLM web corpus as though they were disjoint.

The result was ~3.2T tokens, or a bit under half of the ~6.7T tokens that palm_opt would need.

Recall that this already contains "basically all" of the open-source code in the world, and  "basically all" of the theoretical physics papers written in the internet era -- within an order of magnitude, anyway.  In these domains, the "low-hanging fruit" of data scaling are not low-hanging at all.

what is compute? (on a further barrier to data scaling)

Here's another important comment from the PaLM paper's Chinchilla discussion.  This is about barriers to doing a head-to-head comparison experiment:

If the smaller model were trained using fewer TPU chips than the larger model, this would proportionally increase the wall-clock time of training, since the total training FLOP count is the same. If it were trained using the same number of TPU chips, it would be very difficult to maintain TPU compute efficiency without a drastic increase in batch size. The batch size of PaLM 540B is already 4M tokens, and it is unclear if even larger batch sizes would maintain sample efficiency.

In LM scaling research, all "compute" is treated as fungible.  There's one resource, and you spend it on params and steps, where compute = params * steps.

But params can be parallelized, while steps cannot.

You can take a big model and spread it (and its activations, gradients, Adam buffers, etc.) across a cluster of machines in various ways.  This is how people scale up $N$ in practice.

But to scale up $D$, you have to either:

Thus, it is unclear whether the "compute" you spend in high-$D$ models is as readily available (and as bound to grow over time) as we typically imagine "compute" to be.
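A back-of-the-envelope sketch of why this matters: optimizer steps happen one after another, so at a fixed batch size the number of sequential steps grows linearly with the token count. The 4M-token batch size is the figure quoted from the PaLM paper above; holding it constant for the whole run is a simplifying assumption made here, and the function name is hypothetical.

```python
BATCH_TOKENS = 4e6  # batch size quoted for PaLM 540B, in tokens (held constant here)


def sequential_steps(n_tokens, batch_tokens=BATCH_TOKENS):
    """Number of optimizer steps needed to see n_tokens once."""
    return n_tokens / batch_tokens


palm_steps = sequential_steps(780e9)       # PaLM's actual token count
palm_opt_steps = sequential_steps(6.7e12)  # the ~6.7T tokens palm_opt would need

print(f"PaLM:     {palm_steps:,.0f} steps")
print(f"palm_opt: {palm_opt_steps:,.0f} steps")
print(f"ratio:    {palm_opt_steps / palm_steps:.1f}x")
```

Adding machines does not shorten that chain of steps. It only helps if the batch size can grow with them, which is exactly the thing the PaLM authors are unsure about.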

If LM researchers start getting serious about scaling up data, no doubt people will think hard about this question, but that work has not yet been done.

appendix: to infinity

Earlier, I observed that Chinchilla beats any Gopher of arbitrary size.

The graph below expands on that observation, by including two variants of each model:

(There are two x-axes, one for data and one for params.  I included the latter so I have a place to put the infinite-data models without making an infinitely big plot.  

The dotted line is Chinchilla, to emphasize that it beats infinite-params Gopher.)

The main takeaway IMO is the size of the gap between ∞ data models and all the others.  Just another way of emphasizing how skewed these models are toward $N$, and away from $D$.
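The numbers behind a comparison like this can be sketched as follows, for the three models whose sizes are given in this post. The constants are the paper's fitted values (assumed), the "infinite" variants just drop the corresponding term, and the names are illustrative.

```python
import math

# Fitted constants from the Chinchilla paper's parametric loss fit (assumed here).
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69

# (params, training tokens) as stated in the post.
MODELS = {
    "gopher": (280e9, 300e9),
    "chinchilla": (70e9, 1.4e12),
    "palm": (540e9, 780e9),
}


def loss(n_params, n_tokens):
    """Predicted loss; passing math.inf makes the matching term exactly zero."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


rows = {}
for name, (n, d) in MODELS.items():
    rows[name] = {
        "actual": loss(n, d),
        "inf_params": loss(math.inf, d),
        "inf_data": loss(n, math.inf),
    }

for name, row in rows.items():
    print(f"{name:10s} actual={row['actual']:.3f} "
          f"inf_params={row['inf_params']:.3f} inf_data={row['inf_data']:.3f}")
```

Every infinite-data variant sits well below every real model, while the infinite-params variants barely move, which is the skew the plot is meant to show.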

  1. ^
  2. ^

    See their footnote 2

  3. ^

    See their equation (10)

  4. ^

    Is 0.052 a "small" amount in some absolute sense?  Not exactly, but (A) it's small compared to the loss improvements we're used to seeing from new models, and (B) small compared to the improvements possible by scaling data.

    In other words, (A) we have spent a few years plucking low-hanging fruit much bigger than this, and (B) there are more such fruit available.

  5. ^

    The two terms are still a bit imbalanced, but that's largely due to the "Approach 3 vs 1/2" nuances mentioned above.

  6. ^

    Caveat: Gopher and Chinchilla were trained on the same data distribution, but these other models were not.  Plugging them into the equation won't give us accurate loss values for the datasets they used.  Still, the datasets are close enough that the broad trend ought to be accurate.

  7. ^

    Wait, isn't that smaller than Chinchilla?

    This is another Approach 3 vs. 1/2 difference.

    Chinchilla was designed with Approaches 1/2.  Using Approach 3, like we're doing here, gives you a Chinchilla of only 33B params, which is lower than our palm_opt's 63B.

  8. ^

    Seriously, I can't find anything about it in the Gopher paper.  Except that it was "collected in November 2020."

  9. ^

    It is not even clear that this multilingual-ization affected the web corpus at all.

    Their datasheet says they "used multilingual versions of Wikipedia and conversations data."  Read literally, this would suggest they didn't change the web corpus, only those other two.

    I also can't tell if the original GLaM web corpus was English-only to begin with, since that paper doesn't say.

  10. ^

    This ablation only compared filtered web data to completely unfiltered web data, which is not a very fine-grained signal.  (If you're interested, EleutherAI has done more extensive experiments on the impact of filtering at smaller scales.)

  11. ^

    They are being a little coy here.  The received wisdom by now is that repeating data is really bad for LMs and you should never do it.  See this paper and this one.
    EDIT 11/15/22: but see also the Galactica paper, which casts significant doubt on this claim.

  12. ^

    The Pile authors only included a subset of this in the Pile.

  13. ^

    The MassiveText datasheet says only that "the books dataset contains books from 1500 to 2008," which is not especially helpful.

128 comments

Comments sorted by top scores.

comment by Ivan Vendrov (ivan-vendrov) · 2022-07-31T04:31:48.058Z · LW(p) · GW(p)

Thought-provoking post, thanks.

One important implication is that pure AI companies such as OpenAI, Anthropic, Conjecture, Cohere are likely to fall behind companies with access to large amounts of non-public-internet text data like Facebook, Google, Apple, perhaps Slack. Email and messaging are especially massive sources of "dark" data, provided they can be used legally and safely (e.g. without exposing private user information). Taking just email, something like 500 billion emails are sent daily, which is more text than any LLM has ever been trained on (admittedly with a ton of duplication and low quality content).

Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.

Replies from: Thirkle, yitz, hackpert, TrevorWiesinger
comment by Thirkle · 2022-07-31T22:44:42.078Z · LW(p) · GW(p)

Another implication is that centralised governments with the ability to aggressively collect and monitor citizens' data, such as China, could be major players.

A government such as China has no need to scrape data from the Internet, while being mindful of privacy regulations and copyright. Instead they can demand 1.4 billion people's data from all of their domestic tech companies. This includes everything such as emails, texts, WeChat, anything that the government desires.

comment by Yitz (yitz) · 2022-07-31T17:56:05.028Z · LW(p) · GW(p)

I suspect that litigation over copyright concerns with LLMs could significantly slow timelines, although it may come with the disadvantage of favoring researchers who don’t care about following regulations/data use best practices.

comment by hackpert · 2022-08-06T17:23:41.007Z · LW(p) · GW(p)

I mean Microsoft for one seems fully invested in (married to) OpenAI and will continue to be for the foreseeable future, and Outlook/Exchange is probably the biggest source of "dark" data in the world, so I wouldn't necessarily put OpenAI on the same list as the others without strong traditional tech industry partnerships.

Replies from: ChristianKl
comment by ChristianKl · 2022-08-06T20:54:26.253Z · LW(p) · GW(p)

Allowing OpenAI to use Microsoft's customer data to train the model essentially means releasing confidential customer information to the public. I doubt that's something that Microsoft is willing to do.

comment by trevor (TrevorWiesinger) · 2022-07-31T18:04:17.163Z · LW(p) · GW(p)

Another implication is that federated learning, data democratization efforts, and privacy regulations like GDPR are much more likely to be critical levers on the future of AI than previously thought.

And presumably data poisoning as well? This sort of thing isn't easily influenced because it's deep in the turf of major militaries, but it would definitely be good news in the scenario that data becomes the bottleneck.

comment by Scott Alexander (Yvain) · 2022-08-01T23:19:32.367Z · LW(p) · GW(p)

Thanks for posting this, it was really interesting. Some very dumb questions from someone who doesn't understand ML at all:

1. All of the loss numbers in this post "feel" very close together, and close to the minimum loss of 1.69. Does loss only make sense on a very small scale (like from 1.69 to 2.2), or is this telling us that language models are very close to optimal and there are only minimal remaining possible gains? What was the loss of GPT-1?

2. Humans "feel" better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?

3. Does this disprove claims that "scale is all you need" for AI, since we've already maxed out scale, or are those claims talking about something different?

Replies from: nostalgebraist, jack-armstrong
comment by nostalgebraist · 2022-08-02T01:28:35.711Z · LW(p) · GW(p)

(1)

Loss values are useful for comparing different models, but I don't recommend trying to interpret what they "mean" in an absolute sense.  There are various reasons for this.

One is that the "conversion rate" between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.

Early in training, when the model's progress looks like realizing "huh, the word 'the' is more common than some other words", these simple insights correspond to relatively large decreases in loss.  Once the model basically kinda knows English or whatever the language is, it's already made most of the loss progress it's going to make, and the further insights we really care about involve much smaller changes in loss.  See here for more on this by gwern.

(2)

No one really knows, but my money is on "humans are actually better at this through some currently-unknown mechanism," as opposed to "humans are actually bad at this exact thing."

Why do I think this?

Well, the reason we're here talking about this at all is that LMs do write text of spookily high quality, even if they aren't as good as humans at it.  That wasn't always true.  Before the transformer architecture was invented in 2017, LMs used to be nowhere near this good, and few people knew or talked about them except researchers.

What changed with the transformer?  To some extent, the transformer is really a "smarter" or "better" architecture than the older RNNs.  If you do a head-to-head comparison with the same training data, the RNNs do worse.

But also, it's feasible to scale transformers much bigger than we could scale the RNNs.  You don't see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.

So, even though all these models take tons of data to train, we could make the transformers really big and still train them on the tons-of-data they require.  And then, because scaling up $N$ really does help, you get a model good enough that you and I are here talking about it.

That is, I don't think transformers are the best you can do at language acquisition.  I suspect humans are doing something better that we don't understand yet.  But transformers are easy to scale up really big, and in ML it's usually possible for sheer size to compensate for using a suboptimal architecture.

(P.S. Buck says in another thread that humans do poorly when directly asked to do language modeling -- which might mean "humans are actually bad at this exact thing," but I suspect this is due to the unfamiliarity of the task rather than a real limitation of humans.  That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.)

(3)

This is sort of a semantic issue.

"Scaling" was always a broader concept than just scaling in model size.  In this post and the paper, we're talking about scaling with respect to model size and also with respect to data, and earlier scaling papers were like that too.  The two types of scaling look similar in equations.

So "data scale" is a kind of scale, and always has been.

On the other hand, the original OpenAI/Kaplan scaling paper found kinda the opposite result from this one -- model size was what mattered practically, and the data we currently have would be enough for a long time.

People started to conflate "scaling" and "scaling in model size," because we thought the OpenAI/Kaplan result meant these were the same thing in practice.  The way the "scale is all you need" meme gets used, it has this assumption kind of baked in.

There are some things that "scaling enthusiasts" were planning to do that might change in light of this result (if the result is really true) -- like specialized hardware or software that only helps for very large models.  But, if we can get much larger-scale data, we may be able to just switch over to a "data scaling world" that, in most respects, looks like the "parameter scaling world" that the scaling enthusiasts imagined.

Replies from: Buck, beth-barnes, iceman, Owain_Evans
comment by Buck · 2022-08-02T15:55:45.235Z · LW(p) · GW(p)

That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.

 

I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).

EDIT: These results are now posted here [LW · GW].

Replies from: paulfchristiano, nostalgebraist, Owain_Evans
comment by paulfchristiano · 2022-08-02T16:24:15.274Z · LW(p) · GW(p)

I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there's definitely no way I was going to get close to GPT-2.

comment by nostalgebraist · 2022-08-02T18:05:24.857Z · LW(p) · GW(p)

I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.

There are some tasks where performance improves with practice but plateaus within one hour.  I'm thinking of relatively easy video games.  Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies.  But most interesting things that humans "can do" take much longer to learn than this.

Here are some things that humans "can do," but require >> 1 hour of practice to "do," while still requiring far less exposure to task-specific example data than we're used to in ML:

  • Superforecasting
    • Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour?  I would guess so, but I'm not sure)
  • Playing video/board/card games of nontrivial difficulty or depth
  • Speaking any given language, even when learned during the critical language acquisition period
  • Driving motor vehicles like cars (arguably) and planes (definitely)
  • Writing good prose, for any conventional sense of "good" in any genre/style
  • Juggling
  • Computer programming (with any proficiency, and certainly e.g. competitive programming)
  • Doing homework-style problems in math or physics
  • Acquiring and applying significant factual knowledge in academic subjects like law or history

The last 3 examples are the same ones Owain_Evans mentioned [LW(p) · GW(p)] in another thread, as examples of things LMs can do "pretty well on."

If we only let the humans practice for an hour, we'll conclude that humans "cannot do" these tasks at the level of current LMs either, which seems clearly wrong (that is, inconsistent with the common-sense reading of terms like "human performance").

Replies from: Buck
comment by Buck · 2022-08-02T18:17:14.121Z · LW(p) · GW(p)

Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult and plausibly impossible with any amount of practice.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-08-22T07:49:20.519Z · LW(p) · GW(p)

The human brain internally is performing very similar computations to transformer LLMs - as expected from all the prior research indicating strong similarity between DL vision features and primate vision - but that doesn't mean we can immediately extract those outputs and apply them towards game performance.

comment by Owain_Evans · 2022-08-02T16:24:49.100Z · LW(p) · GW(p)

It could be useful to look at performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level at a foreign language. E.g. You might find GPT-3 is at a level on 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.

Replies from: mateusz-baginski
comment by Mateusz Bagiński (mateusz-baginski) · 2023-08-25T13:11:48.622Z · LW(p) · GW(p)

I think I remember seeing somewhere that LLMs learn more slowly on languages with "more complex" grammar (in the sense of their loss decreasing more slowly per the same number of tokens) but I can't find the source right now.

comment by Beth Barnes (beth-barnes) · 2022-08-02T02:37:16.495Z · LW(p) · GW(p)

Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next word prediction (maybe around the performance of a 12-layer model).

comment by iceman · 2022-08-02T15:08:39.137Z · LW(p) · GW(p)

What changed with the transformer? To some extent, the transformer is really a "smarter" or "better" architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.

But also, it's feasible to scale transformers much bigger than we could scale the RNNs. You don't see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.

You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren't following it. It's an attempt to train an RNN like a transformer. Initial numbers look pretty good.

comment by Owain_Evans · 2022-08-02T16:16:46.580Z · LW(p) · GW(p)

A few points:

  1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
  2. In some ways, the models' ability to learn from data is far superior to humans. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans after their earlier years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
  3. RNNs are much worse than transformers at in-context learning. It's not just a difference in generative text quality. See this study by DeepMind: https://twitter.com/FelixHill84/status/1524352818261499911
Replies from: jose-miguel-cruz-y-celis
comment by Jose Miguel Cruz y Celis (jose-miguel-cruz-y-celis) · 2023-05-22T20:35:06.469Z · LW(p) · GW(p)

I'm curious where you got the claim that "models trained mostly on English text are still pretty good at Spanish". Do you have a reference?

comment by wickemu (jack-armstrong) · 2022-08-03T16:04:32.676Z · LW(p) · GW(p)

2. Humans "feel" better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?

Why do we say that we need less training data? Every minute instant of our existence is a multisensory point of data from before we've even exited the womb. We spend months, arguably years, hardly capable of anything at all yet still taking and retaining data. Unsupervised and mostly redundant, sure, but certainly not less than a curated collection of Internet text. By the time we're teaching a child to say "dog" for the first time they've probably experienced millions of fragments of data on creatures of various limb quantities, hair and fur types, sizes, sounds and smells, etc.; so they're already effectively pretrained on animals before we first provide a supervised connection between the sound "dog" and the sight of a four-limbed hairy creature with long ears on a leash.

I believe that humans exceed the amount of data ML models have by multiple orders of magnitude by the time we're adults, even if it's extremely messy.

Replies from: Lanrian, jose-miguel-cruz-y-celis
comment by Lukas Finnveden (Lanrian) · 2022-08-14T18:18:52.764Z · LW(p) · GW(p)

There are about a billion seconds in 30 years. Chinchilla was trained on 1.4 trillion tokens. So for a human adult to have as much data as Chinchilla would require us to process the equivalent of ~1400 tokens per second. I think that's something like 2 kilobytes per second.
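
The arithmetic here is easy to check. A minimal sketch, assuming a 365-day year; the names are just for illustration, and the bytes-per-token conversion is left out because the comment doesn't say which encoding it has in mind:

```python
# Rough check of the "tokens per second over 30 years" estimate.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

def tokens_per_second(total_tokens, years):
    """Average token rate needed to see total_tokens within the given number of years."""
    return total_tokens / (years * SECONDS_PER_YEAR)

seconds_30y = 30 * SECONDS_PER_YEAR          # a bit under one billion seconds
chinchilla_tokens = 1.4e12                   # training tokens, as stated in the comment
rate = tokens_per_second(chinchilla_tokens, 30)

print(f"seconds in 30 years: {seconds_30y:,}")
print(f"tokens per second:   {rate:,.0f}")
```

Using the exact second count rather than a round billion gives a rate slightly above 1400 tokens per second, which doesn't change the order of magnitude.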

Inputs to the human brain are probably dominated by vision. I'm not sure how many bytes per second we see, but I don't think it's many orders of magnitude higher than 2 kB.

Replies from: ChristianKl
comment by ChristianKl · 2022-08-14T18:29:28.091Z · LW(p) · GW(p)

I'm not sure how many bytes per second we see, but I don't think it's many orders of magnitude higher than 2 kB.

That depends a lot on how you count. A quick Googling suggests that the optic nerve has 1.7 million nerve fibers.

If you think about a neuron firing rate of 20 Hz that gives you 34 MB per second.

Replies from: Lanrian
comment by Lukas Finnveden (Lanrian) · 2022-08-14T22:13:11.741Z · LW(p) · GW(p)

(If 1 firing = 1 bit, that should be 34 megabit ~= 4 megabyte.)

This random article (which I haven't fact-checked in the least) claims a bandwidth of 8.75 megabit ~= 1 megabyte. So that's like 2.5 OOMs higher than the number I claimed for Chinchilla. So yeah, it does seem like humans get more raw data.

(But I still suspect that Chinchilla gets more data if you adjust for (un)interestingness. Where totally random data and easily predictable/compressible data are uninteresting, and data that is hard-but-possible to predict/compress is interesting.)
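
For anyone who wants to redo the unit conversions in this exchange, here is a small sketch. It assumes 1 firing = 1 bit (as in the parenthetical above) and decimal megabytes; the variable names are mine:

```python
import math

def bits_to_megabytes(bits):
    """Convert bits to decimal megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return bits / 8 / 1e6

# Optic nerve estimate: 1.7 million fibers, 20 Hz, 1 bit per firing (assumption).
optic_bits_per_s = 1.7e6 * 20
optic_mb_per_s = bits_to_megabytes(optic_bits_per_s)

# Comparison of the article's 8.75 Mbit/s figure with the ~2 kB/s token-rate estimate.
article_bytes_per_s = 8.75e6 / 8
chinchilla_bytes_per_s = 2e3
ooms = math.log10(article_bytes_per_s / chinchilla_bytes_per_s)

print(f"optic nerve: {optic_bits_per_s / 1e6:.0f} Mbit/s = {optic_mb_per_s:.2f} MB/s")
print(f"article figure vs. 2 kB/s: {ooms:.1f} orders of magnitude")
```

The gap comes out somewhere between 2.5 and 3 orders of magnitude, consistent with the rough figure quoted above.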

comment by Jose Miguel Cruz y Celis (jose-miguel-cruz-y-celis) · 2023-05-23T05:07:19.427Z · LW(p) · GW(p)

I did some calculations with a bunch of assumptions and simplifications, but here's a high-end, back-of-the-envelope estimate of the data and "tokens" a 30-year-old human would have "trained" on:

  • Visual data: 130 million photoreceptor cells firing at 10 Hz = 1.3 Gbit/s = 162.5 MB/s; over 30 years (approx. 946,080,000 seconds) = 153 Petabytes
  • Auditory data: humans can hear frequencies up to 20,000 Hz, and high-quality audio is sampled at 44.1 kHz, which satisfies the Nyquist-Shannon sampling theorem. Assuming 16 bits per sample (CD quality) × 2 channels (stereo) = 1.41 Mbit/s = .18 MB/s; over 30 years = .167 Petabytes
  • Tactile data: 4 million touch receptors providing 8 bits/s (assuming they account for temperature, pressure, pain, hair movement, vibration) = 5 MB/s; over 30 years = 4.73 Petabytes
  • Olfactory data: we can detect up to 1 trillion smells. Assuming we process 1 smell every second and each smell is represented as its own piece of data, i.e. log2(1 trillion) ≈ 40 bits/s = 0.0000050 MB/s; over 30 years = .000004 Petabytes
  • Taste data: 10,000 receptors, assuming a unique identifier for each basic taste (sweet, sour, salty, bitter and umami), i.e. log2(5) ≈ 2.3 bits, rounded up to 3 = 30 kbit/s = 0.00375 MB/s; over 30 years = .00035 Petabytes

    This amounts to 153 + .167 + 4.73 + .000004 + .00035 = 158.64 Petabytes. Assuming 5 bytes per token (i.e. 5 characters), this amounts to 31,728 T tokens.

    This is of course a high estimate and most of this data will clearly have huge compression capacity, but I wanted to get a rough estimate of a high upper bound.

    Here's the google sheet if anyone wants to copy it or contribute
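
The totals above can be recomputed from the per-second rates. This is only a sketch: it takes the MB/s figures exactly as stated in the comment (not from the linked sheet, which I haven't seen), uses decimal units, and the dictionary and function names are mine.

```python
# Recompute the 30-year sensory-data estimate from the stated MB/s rates.
SECONDS_30_YEARS = 30 * 365 * 24 * 60 * 60  # 946,080,000

# Per-second rates in MB/s, copied from the comment as stated (not independently verified).
RATES_MB_PER_S = {
    "visual": 162.5,
    "auditory": 0.18,
    "tactile": 5.0,
    "olfactory": 0.000005,
    "taste": 0.00375,
}
BYTES_PER_TOKEN = 5  # the comment's assumption: 5 characters per token

def petabytes_over_30_years(mb_per_s):
    """Total data in decimal petabytes accumulated at mb_per_s over 30 years."""
    return mb_per_s * 1e6 * SECONDS_30_YEARS / 1e15

per_sense_pb = {sense: petabytes_over_30_years(r) for sense, r in RATES_MB_PER_S.items()}
total_pb = sum(per_sense_pb.values())
total_tokens = total_pb * 1e15 / BYTES_PER_TOKEN

for sense, pb in per_sense_pb.items():
    print(f"{sense:>9}: {pb:.6f} PB")
print(f"    total: {total_pb:.2f} PB, about {total_tokens / 1e12:,.0f} T tokens")
```

Run this way, the total lands at roughly 158.6 PB, matching the sum above, since vision dominates. Two of the small entries don't quite reproduce: taste comes out near 0.0035 PB rather than .00035, and 4 million receptors at 8 bits/s would be 4 MB/s rather than 5. Neither changes the headline number, and the thread doesn't say which inputs the sheet actually used.
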
comment by IL · 2022-07-31T18:26:15.708Z · LW(p) · GW(p)

When you exhaust all the language data from text, you can start extracting language from audio and video.

As far as I know the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation for how much data is in there:

  • According to some 2019 article I found, in every minute 50 hours of video are uploaded to YouTube. If we assume this was the average for the last 15 years, that gets us 200 billion minutes of video.
  • An average conversation has 150 words per minute, according to a Google search. That gets us 30T words, or 30T tokens if we assume 1 token per word (is this right?)
  • Let's say 1% of that is actually useful, so that gets us 300B tokens, which is... a lot less than I expected.

So it seems like video doesn't save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).

Replies from: nostalgebraist, sbowman
comment by nostalgebraist · 2022-08-02T00:17:01.621Z · LW(p) · GW(p)

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

In any case, yeah, this does not seem like a huge amount of data.  But there's enough order-of-magnitude fuzziness in the estimate that it does seem like it's worth someone's time to look into more seriously.
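
The corrected figure, and what it does to the token estimate, can be reproduced with a few lines. A sketch under the thread's own assumptions (50 hours uploaded per minute for 15 years, 150 words per minute, 1 token per word, 1% useful); the function names are hypothetical:

```python
# Rough YouTube speech-token estimate, using the assumptions from this thread.
MINUTES_PER_YEAR = 365 * 24 * 60

def youtube_minutes(years, upload_hours_per_minute):
    """Total minutes of video uploaded, given a constant upload rate."""
    return years * MINUTES_PER_YEAR * upload_hours_per_minute * 60

def useful_tokens(video_minutes, words_per_minute=150, tokens_per_word=1.0, useful_fraction=0.01):
    """Tokens of usable speech; every default is a guess taken from the thread."""
    return video_minutes * words_per_minute * tokens_per_word * useful_fraction

minutes = youtube_minutes(years=15, upload_hours_per_minute=50)
all_tokens = useful_tokens(minutes, useful_fraction=1.0)
tokens_1pct = useful_tokens(minutes)

print(f"video minutes:      {minutes / 1e9:.1f}B")
print(f"all speech tokens:  {all_tokens / 1e12:.2f}T")
print(f"1% useful tokens:   {tokens_1pct / 1e9:.1f}B")
```

With ~24B minutes instead of 200B, every downstream number shrinks by the same factor of about 8, so the useful-fraction and speech-fraction discounts matter even more than in the original estimate.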

comment by Sam Bowman (sbowman) · 2022-08-02T00:04:31.800Z · LW(p) · GW(p)

I agree that this points in the direction of video becoming increasingly important.

But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don't have the scaling laws, it seems pretty clear that there's a ton of information in the non-language parts of videos that'd be useful to a general-purpose agent—almost certainly more than in the language parts. (Of course, it'll take more computation to extract the same amount of useful information from video than from text.)

comment by MSRayne · 2022-07-31T13:32:27.370Z · LW(p) · GW(p)

Does this imply that AGI is not as likely to emerge from language models as might have been thought? To me it looks like it's saying that the only way to get enough data would be to have the AI actively interacting in the world - getting data itself.

Replies from: nostalgebraist, MathiasKirkBonde
comment by nostalgebraist · 2022-07-31T16:36:27.312Z · LW(p) · GW(p)

I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with [LW · GW].

I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.

If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior.   The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).

All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, "for free."  But once it's exhausted, producing more is slow.


IMO, most people are currently overestimating the potential of large generative models -- including image models like DALLE2 -- because of this fact.

There was all this massive data already sitting around from human activity (the web, Github, "books," Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.

When our compute finally began to catch up with our data, we effectively spent all the "stored-up potential energy" in that data all at once, and then confused ourselves into thinking that compute was the only necessary input for the reaction.

But now compute has finally caught up with data, and it wants more.  We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of "data startup capital."


I suppose the AI's interactions with the world could involve soliciting more data of the kind it needs to improve (ie active learning), which is much more valuable per unit than generic data.

I would still be surprised if this approach could get much of anywhere without requiring solicitation-from-humans on a massive scale, but it'd be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.

Replies from: MSRayne, clone of saturn, Evan R. Murphy
comment by MSRayne · 2022-07-31T21:54:52.254Z · LW(p) · GW(p)

It seems to me that the key to human intelligence is nothing like what LMs do anyway; we don't just correlate vast quantities of text tokens. They have meanings. That is, words correlate to objects in our world model, learned through lived experience, and sentences correspond to claims about how those objects are related to one another or are changing. Without being rooted in sensory, and perhaps even motor, experience, I don't think general intelligence can be achieved. Language by itself can only go so far.

comment by clone of saturn · 2022-08-08T08:00:55.700Z · LW(p) · GW(p)

Language models seem to do a pretty good job at judging text "quality" in a way that agrees with humans. And of course, they're good at generating new text. Could it be useful for a model to generate a bunch of output, filter it for quality by its own judgment, and then continue training on its own output? If so, would it be possible to "bootstrap" arbitrary amounts of extra training data?

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2022-08-08T22:53:25.177Z · LW(p) · GW(p)

It might be even better to just augment the data with quality judgements instead of only keeping the high-quality samples. This way, quality can have the form of a natural language description instead of a one-dimensional in-built thing, and you can later prime the model for an appropriate sense/dimension/direction of quality, as a kind of objective, without retraining.

comment by Evan R. Murphy · 2022-08-01T06:24:49.034Z · LW(p) · GW(p)

But now compute has finally caught up with data, and it wants more. We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of "data startup capital."

We may be running up against text data limits on the public web. But the big data companies got that name for a reason. If they can tap into the data of a Gmail, Facebook Messenger or YouTube then they will find tons more fuel for their generative models.

comment by MathiasKB (MathiasKirkBonde) · 2022-07-31T15:23:30.245Z · LW(p) · GW(p)

I don't think the real world is good enough either.

The fact that humans experience a strong Tetris effect suggests to me that the brain is constantly generating and training on synthetic data.

Replies from: yitz
comment by Yitz (yitz) · 2022-07-31T17:57:51.399Z · LW(p) · GW(p)

Aka dreams?

comment by Roman Leventov · 2022-08-01T10:14:54.802Z · LW(p) · GW(p)

My two cents contra updates towards longer or more uncertain AGI timelines given the information in this post:

  • The training of language models is many orders of magnitude less efficient than the training of the human brain, which acquires comparable language comprehension and generation ability on a tiny fraction of the text corpora discussed in this post. So we can expect more innovations that improve the training efficiency. Even one such innovation, improving the training efficiency (in terms of data) by a single order of magnitude, would probably ensure that the total size of publicly available text data is not a roadblock on the path to AGI, even if it currently is one. I think the probability that we will see at least one such innovation in the next 5 years is quite high, more than 10%.
  • Perhaps DeepMind's Gato is already a response to the realisation that "there is not enough text", explained in this post. So they train Gato on game replays, themselves generated programmatically, using RL agents. They can generate practically unlimited amounts of training data in this way.  The speculation is then probably that at some scale, Gato will generalise the knowledge acquired in games to text, or that pre-training on games and other tasks will enable much more efficient training on text (à la few-shot learning in current LMs).
comment by Jay Bailey · 2022-07-31T15:19:32.983Z · LW(p) · GW(p)

I am curious about this "irreducible" term in the loss. Apologies if this is covered by the familiarity with LM scaling laws mentioned as a prerequisite for this article.

When you say "irreducible", does that mean "irreducible under current techniques" or "mathematically irreducible", or something else?

Do we have any idea what a model with, say, 1.7 loss (i.e., a model almost arbitrarily big in compute and data, but with the same 1.69 irreducible) would look like?

Replies from: nostalgebraist
comment by nostalgebraist · 2022-07-31T15:56:32.620Z · LW(p) · GW(p)

When you say "irreducible", does that mean "irreducible under current techniques" or "mathematically irreducible", or something else?

Closer to the former, and even more restrictive: "irreducible with this type of model, trained in this fashion on this data distribution."

Because language is a communication channel, there is presumably also some nonzero lower bound on the loss that any language model could ever achieve.  This is different from the "irreducible" term here, and presumably lower than it, although little is known about this issue.

Do we have any idea what a model with, say, 1.7 loss (i.e., a model almost arbitrarily big in compute and data, but with the same 1.69 irreducible) would look like?

Not really, although section 5 of this post [LW · GW] expresses some of my own intuitions about what this limit looks like.

Keep in mind, also, that we're talking about LMs trained on a specific data distribution, and only evaluating their loss on data sampled from that same distribution.

So if an LM achieved 1.69 loss on MassiveText (or a scaled-up corpus that looked like MassiveText in all respects but size), it would do very well at mimicking all the types of text present in MassiveText, but that does not mean it could mimic every existing kind of text (much less every conceivable kind of text).

Replies from: yitz
comment by Yitz (yitz) · 2022-07-31T18:06:25.729Z · LW(p) · GW(p)

Do we have a sense of what the level of loss is in the human brain? If I’m understanding correctly, if the amount of loss in a model is known to be finitely large, then it will be incapable of perfectly modeling the world in principle (implying that to such a model physics is non-computable?)

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-07-31T22:24:49.338Z · LW(p) · GW(p)

Theoretically we could measure it by having humans play "the language model game" where you try to predict the next word in a text, repeatedly. How often you would get the next word wrong is a function of your natural loss. Of course, you'd get better at this game as you went along, just like LMs do, so what we'd want to measure is how well you'd do after playing for a few days.

There might have been a psychological study that resembles this. (I don't know.) We could probably also replicate it via citizen science: create a website where you play this game, and get people to play it. My prediction is that DL LMs are already far superior to even the best humans at this game. (Note that this doesn't mean I think DL is smarter than humans.)

Replies from: yitz
comment by Yitz (yitz) · 2022-08-01T03:50:38.186Z · LW(p) · GW(p)

Such a game already exists! See https://rr-lm-game.herokuapp.com/whichonescored2 and https://rr-lm-game.herokuapp.com/. I’ve been told humans tend to do pretty badly at the games (I didn’t do too well myself), so if you feel discouraged playing and want a similar style of game that’s perhaps a bit more fun (if slightly less relevant to the question at hand), I recommend https://www.redactle.com/. Regardless, I guess I’m thinking of loss (in humans) in the more abstract sense of “what’s the distance between the correct and human-given answer [to an arbitrary question about the real world]?” If there’s some mathematically necessary positive amount of loss humans must have at a minimum, that would seemingly imply that there are fundamental limits to the ability of human cognition to model reality.

Replies from: Buck, JBlack
comment by Buck · 2022-08-01T20:14:23.377Z · LW(p) · GW(p)

Yes, humans are way worse than even GPT-1 at next-token prediction, even after practicing for an hour.
EDIT: These results are now posted here [LW · GW]

Replies from: yitz
comment by Yitz (yitz) · 2022-08-02T04:41:00.764Z · LW(p) · GW(p)

Is there some reasonable-ish way to think about loss in the domain(s) that humans are (currently) superior at? (This might be equivalent to asking for a test of general intelligence, if one wants to be fully comprehensive)

comment by JBlack · 2022-08-01T08:58:44.474Z · LW(p) · GW(p)

The scoring for that first game is downright bizarre. The optimal strategy for picking probabilities does not reflect the actual relative likelihoods of the options, but says "don't overthink it". In order to do well, you must overthink it.

Replies from: Buck, yitz
comment by Buck · 2022-08-01T20:13:54.729Z · LW(p) · GW(p)

(I run the team that created that game. I made the guess-most-likely-next-token game and Fabien Roger made the other one.)

The optimal strategy for picking probabilities in that game is to say what your probability for those two next tokens would have been if you hadn't updated on being asked about them. What's your problem with this?

It's kind of sad that this scoring system is kind of complicated. But I don't know how to construct simpler games such that we can unbiasedly infer human perplexity from what the humans do.

comment by Yitz (yitz) · 2022-08-01T17:54:35.520Z · LW(p) · GW(p)

Yeah, if anyone builds a better version of this game, please let me know!

comment by Julian Schrittwieser (julian-schrittwieser) · 2022-07-31T12:29:15.615Z · LW(p) · GW(p)

An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training then it has been trained for one "epoch".

In my experience scaling continues for quite a few epochs over the same dataset; only if the model has more parameters than the dataset tokens and training for >10 epochs does overfitting kick in and scaling break down.

Replies from: nostalgebraist, gwern
comment by nostalgebraist · 2022-07-31T17:08:44.346Z · LW(p) · GW(p)

This distinction exists in general, but it's irrelevant when training sufficiently large LMs.

It is well-established that repeating data during large LM training is not a good practice.  Depending on the model size and the amount of repeating, one finds that it is either

  1. a suboptimal use of compute (relative to training a bigger model for 1 epoch), or 
  2. actively harmful, as measured by test loss or loss on out-of-distribution data

with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and surrounding discussion).

For more, see

  • references linked in footnote 11 [LW(p) · GW(p)] of this post, on how repeating data can be harmful
  • my earlier post here [LW · GW], on how repeating data can be compute-inefficient even when it's not harmful
  • this report on my own experience finetuning a 6.1B model, where >1 epoch was harmful
Replies from: p.b., ErickBall, tao-lin, dalasnoin
comment by p.b. · 2022-08-01T10:24:23.857Z · LW(p) · GW(p)

I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at Deepmind and OpenAI.

I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, to then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe seems to not be massive (when talking about scaling a couple of OOMs) and they keep the number of training tokens constant.

I really can't tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.

comment by ErickBall · 2022-08-04T16:58:33.143Z · LW(p) · GW(p)

So do you think, once we get to the point where essentially all new language models are trained on essentially all existing language data, it will always be more compute efficient to increase the size of the model rather than train for a second epoch?

This would seem very unintuitive and is not directly addressed by the papers you linked in footnote 11, which deal with small portions of the dataset getting repeated.

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-04T18:30:56.098Z · LW(p) · GW(p)

You're right, the idea that multiple epochs can't possibly help is one of the weakest links in the post.  Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first -- I'm kinda confused by this too.

After thinking more about it, I agree that the repeated-data papers don't provide much evidence that multiple epochs are harmful.

For example, the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once.  In their most extreme case,

  • half of the training tokens are never repeated during training, and 
  • the other half of training tokens are some (smaller) portion of the original dataset, repeated 2 or more times

But this effectively lowers the total size of the model's training dataset -- the number of training tokens is held constant (100B), so the repeated copies are taking up space that would otherwise be used for fresh data.  For example, if the repeated tokens are repeated 2 times, then we are only using 3/4 of the data we could be (we select 1/2 for the unrepeated part, and then select 1/4 and repeat it twice for the other part).

We'd expect this to hurt the model, and to hurt larger models more, which explains some fraction of the observed effect.
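
To make the "3/4 of the data" point concrete, here is a tiny sketch of how much distinct data a fixed token budget covers when part of it is spent on repeats. The function name and parameters are mine, not the paper's:

```python
def unique_data_fraction(repeated_token_fraction, repeats):
    """
    Fraction of a fixed training-token budget that is distinct data, when
    repeated_token_fraction of the budget consists of a subset seen `repeats` times.
    """
    fresh = 1 - repeated_token_fraction
    distinct_in_repeated_part = repeated_token_fraction / repeats
    return fresh + distinct_in_repeated_part

budget_tokens = 100e9  # the constant training-token budget mentioned above
for repeats in (2, 10, 100):
    frac = unique_data_fraction(0.5, repeats)
    print(f"{repeats:>3} repeats: {frac:.3f} of budget is distinct "
          f"({frac * budget_tokens / 1e9:.1f}B tokens)")
```

As the repeat count grows, the distinct fraction approaches the unrepeated half, so in this setup a model never sees less than half of the budget as fresh data, but it can come close to losing the other half entirely.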

I think there's a much stronger case that multiple epochs are surprisingly unhelpful for large models, even if they aren't harmful.  I went over that case in this post [LW · GW].  (Which was based on the earlier Kaplan et al papers, but I think the basic result still holds.)

However, multiple epochs do help, just less so as N grows... so even if they are negligibly helpful at GPT-3 size or above, they still might be relevantly helpful at Chinchilla size or below.  (And this would then push the compute optimal N even further down relative to Chinchilla, preferring smaller models + more steps.)

It would be really nice to see an extension of the Chinchilla experiment that tried multiple epochs, which would directly answer the question.

I'm not sure what I'd expect the result to be, even directionally.  Consider that if you are setting your learning rate schedule length to the full length of training (as in Chinchilla), then "doing a 2-epoch run" is not identical to "doing a 1-epoch run, then doing another epoch."  You'll have a higher LR during the first epoch than the 1-epoch run would have had, which would have been suboptimal if you had stopped at the first epoch.
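
The learning-rate point can be illustrated with a toy schedule. This sketch assumes a cosine decay from a peak LR down to 10% of peak over the full run, with no warmup; the step counts and the `cosine_lr` helper are made up for illustration and are not taken from any particular training setup:

```python
import math

def cosine_lr(step, total_steps, max_lr=1.0, min_lr_ratio=0.1):
    """Cosine decay from max_lr to max_lr * min_lr_ratio over total_steps (no warmup)."""
    progress = min(step / total_steps, 1.0)
    min_lr = max_lr * min_lr_ratio
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

steps_per_epoch = 1000  # hypothetical

# Learning rate at the end of the first pass over the data, under each schedule.
lr_one_epoch_run = cosine_lr(steps_per_epoch, total_steps=1 * steps_per_epoch)
lr_two_epoch_run = cosine_lr(steps_per_epoch, total_steps=2 * steps_per_epoch)

print(f"end of epoch 1, schedule sized for 1 epoch:  {lr_one_epoch_run:.2f} x peak")
print(f"end of epoch 1, schedule sized for 2 epochs: {lr_two_epoch_run:.2f} x peak")
```

In the 2-epoch schedule the LR is still mid-decay when the first epoch ends, so the checkpoint at that point is not comparable to a finished 1-epoch run.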

Replies from: ErickBall
comment by ErickBall · 2022-08-04T19:45:53.366Z · LW(p) · GW(p)

Thanks, that's interesting... the odd thing about using a single epoch, or even two epochs, is that you're treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of pre-training and learning rate. The very first step, starting from random weights, presumably can't extract high level knowledge very well because the model is still trying to learn low level trends like word frequency. So if the first batch has valuable high level patterns and you never revisit it, it's effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn't too bad though.

comment by Tao Lin (tao-lin) · 2022-08-02T12:10:48.721Z · LW(p) · GW(p)

This paper is very unrepresentative - it seems to test 1 vs 64-1,000,000 repeats of data, not 1 vs 2-10 repeats as you would use in practice.

comment by Simon Lermen (dalasnoin) · 2022-08-01T06:12:48.262Z · LW(p) · GW(p)

I can't access the wandb link, maybe you have to change the access rules.

I was interested in the report on fine-tuning a model for more than 1 epoch, even though finetuning is obviously not the same as training. 

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-01T13:23:57.818Z · LW(p) · GW(p)

It should work now, sorry about that.

comment by gwern · 2022-07-31T21:05:58.095Z · LW(p) · GW(p)

only if the model has more parameters than the dataset tokens and training for >10 epochs does overfitting kick in and scaling break down.

That sounds surprising. You are claiming that you observe the exact same loss, and downstream benchmarks, if you train a model on a dataset for 10 epochs as you do training on 10x more data for 1 epoch?

I would have expected some substantial degradation in efficiency such that the 10-epoch case was equivalent to training on 5x the data or something.

Replies from: gwern
comment by gwern · 2022-08-02T01:42:25.755Z · LW(p) · GW(p)

Twitter points me to an instance of this with T5, Figure 6/Table 9: at the lowest tested level of 64 repeats, there is slight downstream benchmark harm but still a lot less than I would've guessed.

Not sure how strongly to take this: those benchmarks are weak, not very comprehensive, and wouldn't turn up harm to interesting capabilities like few-shots or emergent ones like inner-monologues; but on the other hand, T5 is also a pretty strong model-family, was SOTA in several ways at the time & the family regularly used in cutting-edge work still, and so it's notable that it's harmed so little.

comment by harsimony · 2022-08-01T19:24:46.973Z · LW(p) · GW(p)

Some other order-of-magnitude estimates on available data, assuming words roughly equal tokens:

Wikipedia: 4B English words, according to this page.

Library of Congress: from this footnote I assume there are at most 100 million books worth of text in the LoC and from this page assume that books are 100k words, giving 10T words at most.

Constant writing: I estimate that a typical person writes at most 1000 words per day, with maybe 100 million people writing this amount of English on the internet. Over the last 10 years, these writers would have produced 370T words.

Research papers: this page estimates ~4m papers are published each year. At 10k words per paper, with 100 years of research, this amounts to 4T words total.

So it looks like 10T words is an optimistic order-of-magnitude estimate of the total amount of data available.
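
These order-of-magnitude estimates are simple products, so they are easy to recompute or tweak. A sketch using only the inputs stated above (the dictionary keys are mine, and a 365-day year is assumed):

```python
# Order-of-magnitude word counts, using the inputs from the estimates above.
estimates_words = {
    "wikipedia": 4e9,                               # stated directly
    "library_of_congress": 100e6 * 100e3,           # books * words per book
    "constant_writing": 1000 * 100e6 * 365 * 10,    # words/day * writers * days/yr * years
    "research_papers": 4e6 * 10e3 * 100,            # papers/yr * words/paper * years
}

for source, words in estimates_words.items():
    print(f"{source:>20}: {words / 1e12:8.3f}T words")
```

The "constant writing" product comes out at about 365T, which rounds to the figure quoted above.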

I assume the importance of a large quantity of clean text data will lead to the construction of a text database of ~1T tokens and that this database (or models trained on it) will eventually be open-sourced.

From there, it seems like really digging in to the sources of irreducible error will be necessary for further scaling. I would guess that a small part of this is "method error" (training details, context window, etc.) but that a significant fraction comes from intrinsic text entropy. Some entropy has to be present, or else text would have no information value.

I would guess that this irreducible error can probably be broken down into:

  1. Uncertainty about the specific type of text the model is trying to predict (e.g. it needs some data to figure out that it's supposed to write in modern English, then more data to learn that the writing is flamboyant/emotional, then more to learn that there is a narrative structure, then more to determine that it is a work of fiction etc.). The model will always need some data to specify which text-generating sub-model to use. This error can be reduced with better prompts (though not completely eliminated)

  2. Uncertainty about location within the text. For example, even if the model had memorized a specific play by Shakespeare, if you asked it to do next-word prediction on a random paragraph from the text, it would have trouble predicting the first few words simply because it hasn't determined which paragraph it has been given. This error should go away when looking at next-word prediction after the model has been fed enough data. Better prompts and a larger context window should help.

  3. Uncertainty inherent to the text. This relates to the actual information content of the text, and should be irreducible. I'm not sure about the relative size of this uncertainty compared to the other ones, but this paper suggests an entropy of ~10 bits/word in English (which seems high?). I don't know how entropy translates into training loss for these models. Memorization of key facts (or database access) can reduce the average information content of a text.

EDIT: also note that going from 10T to 100T tokens would only reduce the loss by 0.045, so it may not be worthwhile to increase dataset size beyond the 10T order-of-magnitude.
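The 0.045 figure can be checked directly from the data term of the scaling law. This is a minimal sketch that assumes the fitted data-term constants reported in the Chinchilla paper (B = 410.7, beta = 0.28); the function name `data_term` is just illustrative.

```python
# Assumption: data-term constants as fitted in the Chinchilla paper.
B = 410.7
BETA = 0.28

def data_term(tokens):
    """Finite-data correction to the loss, B / D**beta."""
    return B / tokens ** BETA

# Loss reduction from growing the dataset from 10T to 100T tokens,
# holding the model-size term fixed.
reduction = data_term(10e12) - data_term(100e12)
print(f"reduction in loss: {reduction:.3f}")
```

The same function can be used to compare any two dataset sizes, since the model-size term and the constant term cancel in the difference.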

Replies from: peter-hrosso
comment by Peter Hroššo (peter-hrosso) · 2022-08-21T05:17:03.147Z · LW(p) · GW(p)

Uncertainty about location within the text

I think the models are evaluated on inputs that fill their whole context window, i.e. ~1024 tokens long. I doubt there are many parts in Shakespeare's plays with the same 1024 tokens repeated.

Replies from: harsimony
comment by harsimony · 2022-08-21T17:52:28.680Z · LW(p) · GW(p)

Oh I didn't realize! Thanks for clarifying. Uncertainty about location probably doesn't contribute much to the loss then.

comment by Alex_Altair · 2023-04-05T17:22:51.913Z · LW(p) · GW(p)

I have some thoughts that are either confusions, or suggestions for things that should be differently emphasized in this post (which is overall great!).

The first is that, as far as I can tell, these scaling laws are all determined empirically, as in, they literally trained a bunch of models with different parameters and then fit a curve to the points. This is totally fine, that's how a lot of things are discovered, and the fits look good to me, but a lot of this post reads as though the law is a Law. For example:

At least in terms of loss, Chinchilla doesn't just beat Gopher.  It beats any model trained on Gopher's data, no matter how big.

This is not literally true, because saying "any model" could include totally different architectures that obey nothing like the empirical curves in this paper.

I'm generally unclear on what the scope of the empirical discovery is. (I'm also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don't use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?

 

It also feels like the discussion over "have we used all the data" is skimming over what the purpose of a language model is, or what loss even means. To make an analogy for comparison, consider someone saying "the US census has gathered all possible data on the heights of US citizens. To get a more accurate model, we need to create more US citizens."

I would say that the point of a language model is to capture all statistical irregularities in language. If we've used all the data, then that's it, we did it. Creating more data will be changing the actual population that we are trying to run stats on, it will be adding more patterns that weren't there before.

I can imagine a counter argument to this that says, the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn't the population, it is itself a sampling, and the LLMs are trying to evaluate the regularities from that sample.

Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data make number go down, without really talking about what it means.

Replies from: nostalgebraist
comment by nostalgebraist · 2023-04-06T17:50:36.546Z · LW(p) · GW(p)

I'm generally unclear on what the scope of the empirical discovery is. (I'm also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don't use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?

The answer to each of these questions is either "yes" or "tentatively, yes."

But the evidence doesn't come from the Chinchilla paper.  It comes from the earlier Kaplan et al papers, to which the Chinchilla paper is a response/extension/correction:

If you want to understand this post better, I'd recommend reading those papers, or a summary of them.

This post, and the Chinchilla paper itself, are part of the "conversation" started by the Kaplan papers.  They implicitly take some of the results from the Kaplan papers for granted, e.g.

  • "Scaling Laws for Neural Language Models" found that architectural "shape" differences, like width vs. depth, mattered very little compared to N and D.  So, later work tends to ignore these differences.
  • Even if they got some of the details wrong, the Kaplan papers convinced people that LM loss scales in a very regular, predictable manner.  It's empirical work, but it's the kind of empirical work where your data really does look like it's closely following some simple curve -- not the kind where you fit a simple curve for the sake of interpretation, while understanding that there is a lot of variation it cannot capture.

    So, later work tends to be casual about the distinction between "the curve we fit to the data" and "the law governing the real phenomena."  (Theoretical work in this area generally tries to explain why LM loss might follow a simple power law -- under the assumption it really does follow such a law -- rather than trying to derive some more complicated, real-er functional form.)

I would say that the point of a language model is to capture all statistical irregularities in language. [...]

I can imagine a counter argument to this that says, the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn't the population, it is itself a sampling, and the LLMs are trying to evaluate the regularities from that sample.

Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data make number go down, without really talking about what it means.

Hmm, I think these days the field views "language modeling" as a means to an end -- a way to make something useful, or something smart.

We're not trying to model language for its own sake.  It just so happens that, if you (say) want to make a machine that can do all the stuff ChatGPT can do, training a language model is the right first step.

You might find models like DALLE-2 and Stable Diffusion a helpful reference point.  These are generative models -- what they do for images is (handwaving some nuances) very close to what LMs do for text.  But the people creating and using these things aren't asking, "is this a good/better model of the natural distribution of text-image pairs?"  They care about creating pictures on demand, and about how good the pictures are.

Often, it turns out that if you want a model to do cool and impressive things, the best first step is to make a generative model, and make it as good as you can.  People want to "make number go down," not because we care about the number, but because we've seen time and time again that when it goes down, all the stuff we do care about gets better.

This doesn't fully address your question, because it's not clear that the observed regularity ("number goes down -- stuff gets better") will continue to hold if we change the distribution we use to train the generative model.  As an extreme example, if we added more LM training data that consisted of random numbers or letters, I don't think anyone would expect that to help.

However, if we add data that's different but still somehow interesting, it does tend to help -- on the new data, obviously, but also to some extent on the old data as well.  (There's another Kaplan scaling paper about that, for instance.)

And at this point, I'd feel wary betting against "more data is better (for doing cool and impressive things later)," as long as the data is interestingly structured and has some relationship to things we care about.  (See my exchange with gwern here [LW(p) · GW(p)] from a few years ago -- I think gwern's perspective more than mine has been borne out over time.)

Replies from: Alex_Altair
comment by Alex_Altair · 2023-04-06T19:16:19.503Z · LW(p) · GW(p)

Thanks! This whole answer was understandable and clarifying for me.

comment by ESRogs · 2022-08-01T23:42:44.845Z · LW(p) · GW(p)

Can you get anywhere with synthetic data? What happens if you train a model on its own output?

comment by Legionnaire · 2022-08-01T19:58:37.281Z · LW(p) · GW(p)

We're not running out of data to train on, just text.

Why did I not need 1 trillion language examples to speak (debatably) intelligently? I'd suspect part of the reason is inherited training examples from my ancestors, but more importantly, language output is only the surface layer.

In order for language models to get much better, I suspect they need to be training on more than just language. It's difficult to talk intelligently about complex subjects if you've only ever read about them. Especially if you have no eyes, ears, or any other sense data. The best language models are still missing crucial context/info which could be gained through video, audio, and robotic IO.

Combined with this post, this would also suggest our hardware can already train more parameters than we need to in order to get much more intelligent models, if we can get that data from non text sources.

comment by Dirichlet-to-Neumann · 2022-07-31T06:43:50.070Z · LW(p) · GW(p)

Interesting and thought provoking.

"It's hard to tell, but there is this ominous comment, in the section where they talk about PaLM vs. Chinchilla:". In the context of fears about AI alignment, I would say "hopeful" rather than "ominous" !

comment by Raemon · 2022-08-14T07:43:46.813Z · LW(p) · GW(p)

Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities. 

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments plausibly speeds up AI timelines by 1-4 weeks on average.

It seems important to be able to talk about that and model the world, but I'm wondering if posts like this should live behind a "need to log-in" filter, maybe with a slight karma-gate, so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

nostalgebraist, I'm curious how you would feel about that.

Replies from: Kaj_Sotala, nostalgebraist, lechmazur, Chris_Leong
comment by Kaj_Sotala · 2022-08-14T09:37:36.128Z · LW(p) · GW(p)

so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people as theoretical philosophers doing their own thing and being entirely out of touch with what real AI is like. They might take alignment concerns a bit more seriously if they find it easy to actually find competent AI discussion on LW / Alignment Forum.

comment by nostalgebraist · 2022-08-14T18:44:27.863Z · LW(p) · GW(p)

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments plausibly speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Consider that

  • The Chinchilla paper was released over four months ago, on 3/29/22.
  • It did not take long for the paper to get noticed among people interested in ML scaling, including here on LW. 

I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months.  The paper is not exactly a secret, and it's not even especially difficult to read as these things go.

More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations.  There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere.  (See Connor's comments here [LW · GW].)

comment by Lech Mazur (lechmazur) · 2022-08-14T08:32:07.975Z · LW(p) · GW(p)

I don't have a strong opinion on hiding nostalgebraist's post behind a login gate. But as a data point, I'm not affiliated with any major labs and I don't currently work on LLMs but in other AI fields and I still read the Chinchilla paper before seeing this post (and I wasn't surprised by its results), so hiding the post wouldn't have made much of a difference for me.

However, I am very surprised that a report like ELK is publicly available for any web crawl to access. I think that if you query a future LLM that has this report in its training data and prompt it with a query related to hiding its intentions, you'll get much better results. Is this desirable? It seems to provide a shortcut to a lot of knowledge that an LLM would need to reason about by itself without access.

Replies from: Raemon
comment by Raemon · 2022-08-14T16:57:22.705Z · LW(p) · GW(p)

Yeah a few people have also brought up this concern recently. Will think about it.

comment by Chris_Leong · 2023-12-05T03:11:05.985Z · LW(p) · GW(p)

This is very tricky. On one hand, this may actually Streisand effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to log-in gated LW resources, this would probably enhance our community status/prestige which might actually increase our influence.

Overall, I'd lean towards carefully experimenting with a log-in filter, with the notion of abandoning this idea if it doesn't seem to be achieving its goals.

comment by RyanCarey · 2022-08-04T07:49:44.933Z · LW(p) · GW(p)

It would be useful to have a more descriptive title, like "Chinchilla's implications for data bottlenecks" or something.

comment by Rodrigo Heck (rodrigo-heck-1) · 2022-07-31T06:14:35.165Z · LW(p) · GW(p)

A possible avenue to explore is to expand these models to multilingual data. There is perhaps a lot of high-quality text uniquely available in other languages (news, blogs, etc.).  Anyways, IMO this effort should probably be directed less at acquiring the largest amount of data and more at acquiring high-quality data. Chinchilla's scaling law doesn't include quality as a distinctive property, but we have reasons to believe that more challenging texts are much more informative and can compensate for low-data environments.

Replies from: Frederik
comment by Tom Lieberum (Frederik) · 2022-07-31T08:01:18.588Z · LW(p) · GW(p)

I'd like to propose not talking publicly about ways to "fix" this issue. Insofar as these results spell trouble for scaling up LLMs, this is a good thing! 
Infohazard (meta-)discussions are thorny by their very nature and I don't want to discourage discussions around these results in general, e.g. how to interpret them or whether the analysis has merits. 

Replies from: nostalgebraist, leon-lang
comment by nostalgebraist · 2022-07-31T21:44:43.509Z · LW(p) · GW(p)

I disagree, but I'm not sure how relevant my opinion is, since I'm far less worried about "AGI ruin" to begin with than the median LWer.  That said, here's my thinking:

First, there's no universally agreed-upon line between "discussing whether the analysis has merits" and "giving the capabilities people free ideas."  Where a person draws this line depends on how obvious they think the ideas are, or how obvious they think they will be to the capabilities people.

Second, there are costs to not talking about things.  It's useful for alignment research to have a correct sense of where capabilities research is headed, and where it isn't headed.  If alignment researchers talk more to one another than to "capabilities people" (true IME), and they practice self-censorship like this, they'll end up with some importantly wrong beliefs.

Also, and perhaps worse -- if alignment researchers never voice their own secret capabilities ideas in fora where "capabilities people" can hear, then they'll never receive feedback about these ideas from the people who know what it would be like to apply them in the real world.  Alignment researchers may end up with private stockpiles of "secret tricks" in their heads which are actually either misguided or obvious, and this disconnect will be a further source of false beliefs.

So, to motivate your concern, we need to imagine a world where

  1. commenters on LW are proficient enough at capabilities research that they can make non-obvious advances in blog comments, in a way that "moves the needle" of capabilities research, and
  2. this is worth the false-belief downsides of self-censorship (say, because commenters on LW are sufficiently informed about capabilities research that they will not form false beliefs anyway)

This seems far from the real situation, IMO.  Based on what I see, "alignment researchers don't understand capabilities research well enough" seems like far more of a live threat to alignment than "alignment researchers are too good at capabilities research, and keep accidentally pushing the field forward in blog comments."  (At least using alignment-interested folks on LW as a proxy for "alignment researchers," and that's who we're deciding norms for anyway.)

Like, take this post as an example.  I was motivated to write this post because I felt like the Chinchilla paper wasn't understood well on LW.

It seems like people have heard of Chinchilla, but mentally categorized it as a simple "sudden jump" in overall capabilities that otherwise left everything the same, rather than as a result that demands reconsideration of basic background assumptions.  I still saw people treating LM param counts like they were interchangeable with LM quality/scariness (and with LM training compute).  People would ask things like "what would it cost (in compute spending) to train a 10T parameter Chinchilla?", which is a bizarre way to frame things if you grok what Chinchilla is.

I don't think I'm presenting some novel insight in this post.  Mostly, I'm just reiterating what the papers say.  I expect any serious capabilities researcher in this area to have read these papers and internalized them at the same depth I have (or more).  But people on LW hadn't done that, and more generally people "interested in AI" who don't closely read all these papers hadn't done that.  So I wrote an explainer.

The LW reaction to new ML results typically looks this way to me.  Obviously "LW" is not a monolith and there are plenty of people here who do seriously internalize papers like this.  But the "general trend of the conversation," insofar as there is such a thing, repeatedly strikes me as over-focused on concrete impressive-sounding results (esp. those that sound impressive out of context), and under-focused on more theoretical advances that sound boring on paper but change the whole rules of the game.  The conversation "keeps up" with ML in the sense that it takes note of the decontextualized top-line results in new papers, but it often lacks a mechanistic appreciation of how it all fits together.

Anyway, this strikes me as a much bigger deal for alignment quantitatively at the current frontier than the risk of accidentally handing over free advances to the capabilities people.

Replies from: gwern, conor-sullivan
comment by gwern · 2022-07-31T23:23:50.363Z · LW(p) · GW(p)

People would ask things like "what would it cost (in compute spending) to train a 10T parameter Chinchilla?", which is a bizarre way to frame things if you grok what Chinchilla is.

That wasn't an alignment researcher, though (was it? I thought Tomás was just an interested commenter), and it's a reasonable question to ask when no one's run the numbers, and when you get an answer like 'well, it'd take something like >5000x more compute than PaLM', that's a lesson learned.
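For anyone who wants to run those numbers, here is a back-of-the-envelope sketch. It assumes the common approximation that training compute is about 6 × params × tokens, and a rule of thumb of roughly 20 training tokens per parameter for a compute-optimal model; both are simplifications I am supplying, not the paper's exact projection. PaLM is taken as 540B parameters trained on 780B tokens.

```python
def train_flops(params, tokens):
    """Approximate training compute: C ~ 6 * N * D (a standard rough estimate)."""
    return 6.0 * params * tokens

# Assumption: ~20 tokens per parameter for a compute-optimal model.
TOKENS_PER_PARAM = 20.0

big_params = 10e12                         # hypothetical 10T-parameter model
big_tokens = TOKENS_PER_PARAM * big_params  # tokens it would need under the rule of thumb
big_flops = train_flops(big_params, big_tokens)

palm_flops = train_flops(540e9, 780e9)      # PaLM: 540B params, 780B tokens

ratio = big_flops / palm_flops
print(f"tokens needed: {big_tokens:.2e}")
print(f"compute relative to PaLM: {ratio:,.0f}x")
```

With these rough inputs the ratio lands in the thousands, the same ballpark as the figure quoted above; the exact multiple depends on which of the paper's fits is used for the tokens-per-parameter ratio.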

At least among the people I've talked to, it seems reasonably well understood that Chinchilla had major implications, meant an immediate capabilities jump and cheaper deployment, and even more importantly meant parameter scaling was dead, and data and then compute were the bottleneck (which is also what I've said bluntly in my earlier comments), and this was why Chinchilla was more important than more splashy stuff like PaLM*. (One capability researcher, incidentally, wasn't revising plans but that's because he wasn't convinced Chinchilla was right in the first place! AFAIK, there has been no dramatic followup to Chinchilla on par with GPT-3 following up Kaplan et al, and in fact, no one has replicated Chinchilla at all, much less run a full scaling law sweep and inferred similar scaling laws, so there is still some doubt there about how real Chinchilla is or how accurate or generalizable its scaling laws are - quite aside from the usual issues like hilariously vague descriptions of datasets.)

I also agree with Tom that if one had thoughts about Chinchilla and data sampling and brand new scaling dynamics catapulting immediately into arms races, it is increasingly approaching the point where a reasonable person might decide to move discussions to more private channels, and for that reason the public discussions of Chinchilla might be very basic and of the 'could we train a 10t parameter Chinchilla' sort.

* PaLM and DALL-E 2 etc helping drown out Chinchilla is an example of what I've referred to about how the boom-bust clustering of DL research publications can be quite harmful to discussions.

Replies from: Bjartur Tómas
comment by Tomás B. (Bjartur Tómas) · 2022-08-01T00:20:53.494Z · LW(p) · GW(p)

That wasn't an alignment researcher, though (was it? I thought Tomás was just an interested commenter)

Yep. Just an interested layman.

comment by Lone Pine (conor-sullivan) · 2022-07-31T22:44:52.948Z · LW(p) · GW(p)

What are the public domain internet places where one can learn more about capabilities, or see discussions of capabilities frameworks? Here's what I'm aware of:

  • LessWrong
  • Twitter (but specifically who idk, I avoid Twitter for mental health reasons.)
  • ArXiv (comp-sci)
  • YouTube: MLST, Two Minute Papers, Yannic Kilcher, some conference talks
  • A little bit on reddit (r/mlscaling, u/gwern)

All-in-all, there's not that much heavy discussion online. I've been told that these discussions really happen in-person, in the Bay Area and in DeepMind London offices. LessWrong actually ends up having the best discussion (in the capabilities space.)

(Since someone is likely to complain about seeking out more capabilities information, well yes it's risky, but I'm more in agreement with nostalgebraist that this level of discussion is probably harmless, and that it's better we keep an accurate and up-to-date understanding of the situation and technology.)

Replies from: Hyperion
comment by Hyperion · 2022-08-01T19:28:49.578Z · LW(p) · GW(p)

Mostly Discord servers in my experience: EleutherAI is a big well known one but there are others with high concentrations of top ML researchers.

comment by Leon Lang (leon-lang) · 2022-07-31T11:42:06.468Z · LW(p) · GW(p)

I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:

  • It seems unlikely that comments on lesswrong speed up capabilities research since the thoughts are probably just a subset of what the scaling teams know, and lesswrong is likely not their highest signal information source anyway.
  • Even from a safety perspective, it seems important to know which problems in capabilities research can be alleviated, since this will give a clearer picture of timelines.
  • I think we should have strong reasons before discouraging topics of discussion since lesswrong is not only a place for instrumental rationality but also epistemic rationality -- maybe even more so.

That said, lesswrong is de facto one of the best places to discuss AI safety since the alignment forum is invite-only. Thus, it seems that there should be some discussion around which tradeoffs to make on LW between "figuring out what's true" and "not spreading info hazards". 

Replies from: RobbBB, Frederik
comment by Rob Bensinger (RobbBB) · 2022-08-03T23:58:40.962Z · LW(p) · GW(p)

I disagree with the reasoning in this reply to Tom (and in nostalgebraist's reply). If stuff like this is net-positive to post about on LW, the chain of reasoning to arrive at that conclusion seems to me like it has to look different from the reasoning in these comments. E.g.:

  • "It seems unlikely that comments on lesswrong speed up capabilities research" - If "unlikely" here meant "only 40% likely", then it would obviously be a bad idea to post a capabilities insight. The degree of unlikeliness obviously matters, and it has to be weighed against the expected benefit of sharing the insight.
  • At the policy level, "How does this weigh against the expected benefits?" has to take into account that the quality and rarity of LWers' insights is likely to vary a lot by individual and across time; and it has to take into account that the risk level of LW posts is very correlated with the benefit level. In the worlds where there's not much future risk of a LWer someday posting a dangerous capabilities insight, there's also less future benefit to LW posts, since we're probably not generating many useful ideas in general (especially about AGI and AGI alignment).
  • "it seems important to know which problems in capabilities research can be alleviated" - What specific safety progress does this enable? (Maybe there's something, but 'it seems safety-relevant because it's a fact about ML' seems to prove too much. What's the actual implicit path by which humanity ends up safely navigating the AGI transition?)
  • 'We should require a high bar before we're willing to not-post potentially-world-destroying information to LW, because LW has a strong commitment to epistemic rationality' seems like an obviously terrible argument to me. People should not post stuff to the public Internet that destroys the world just because the place they're posting is a website that cares about Bayesianism and belief accuracy.

(Note that this is a high-level point I'm making about the kind of arguments being made here, not about the object-level question.)

Replies from: leon-lang, cata
comment by Leon Lang (leon-lang) · 2022-08-06T10:16:01.034Z · LW(p) · GW(p)

Thanks for your answer! 

In the worlds where there's not much future risk of a LWer someday posting a dangerous capabilities insight, there's also less future benefit to LW posts, since we're probably not generating many useful ideas in general (especially about AGI and AGI alignment).

This seems correct, though it's still valuable to flesh out that it seems possible to have LW posts that are helpful for alignment but not for capabilities: namely, such posts that summarize insights from capabilities research that are known to ~all capabilities people while known to few alignment people. 

The main reason I shifted more to your viewpoint now is that capabilities insights might influence people who do not yet know a lot about capabilities to work on that in the future, instead of working on alignment. Therefore, I'm also not sure if Marius' heuristic "Has company-X-who-cares-mostly-about-capabilities likely thought about this already?" for deciding whether something is infohazardy is safe.

'We should require a high bar before we're willing to not-post potentially-world-destroying information to LW, because LW has a strong commitment to epistemic rationality' seems like an obviously terrible argument to me. People should not post stuff to the public Internet that destroys the world just because the place they're posting is a website that cares about Bayesianism and belief accuracy.

Yes, that seems correct (though I'm a bit unhappy about you bluntly straw-manning my position). I think after reflection I would phrase my point as follows:
"There is a conflict between Lesswrong's commitment to epistemic rationality on the one hand, and the commitment to restrict info hazards on the other hand. Lesswrong's commitment to epistemic rationality exists for good reasons, and should not be given up lightly. Therefore, whenever we restrict discussion and information about certain topics, we should have thought about this with great care."

I don't yet have a fleshed-out view on this, but I did move a bit in Tom's direction.

comment by cata · 2022-08-04T00:14:30.133Z · LW(p) · GW(p)

'We should require a high bar before we're willing to not-post potentially-world-destroying information to LW, because LW has a strong commitment to epistemic rationality' seems like an obviously terrible argument to me.

I think that argument is good if you expand out its reasoning. The reason we have a strong commitment to epistemic rationality is because learning and teaching true things is almost always very good. You need to establish a fair chunk of probable bad to outweigh it.

comment by Tom Lieberum (Frederik) · 2022-07-31T12:39:09.558Z · LW(p) · GW(p)

Thanks for your reply! I think I basically agree with all of your points. I feel a lot of frustration around the fact that we don't seem to have adequate infohazard policies to address this. It seems like a fundamental trade-off between security and openness/earnestness of discussion does exist though. 

It could be the case that this community is not the correct place to enforce these rules, as there does still exist a substantial gap between "this thing could work" and "we have a working system". This is doubly true in DL where implementation details matter a great deal.

Replies from: marius-hobbhahn
comment by Marius Hobbhahn (marius-hobbhahn) · 2022-07-31T15:27:56.836Z · LW(p) · GW(p)

My tentative heuristic for whether you should publish a post that is potentially infohazardy is "Has company-X-who-cares-mostly-about-capabilities likely thought about this already?". It's obviously non-trivial to answer that question but I'm pretty sure most companies who build LLMs have looked at Chinchilla and come to similar conclusions as this post. In case you're unsure, write up the post in a google doc and ask someone who has thought more about infohazards whether they would publish it or not. 

Also, I think Leon underestimates how fast a post can spread even if it is just intended for an alignment audience on LW. 

comment by deepthoughtlife · 2022-08-02T19:00:45.067Z · LW(p) · GW(p)

It would be quite easy to automatically generate all of the math and logic you could ever want for these models. Far more than you could possibly ever want to train it on (wouldn't want to make it a math-only bot, probably). I could easily program a computer to come up with effectively infinite correct math problems. There are quintillions of 64-bit addition problems alone... (actually an immense underestimate: there are 18.4 quintillion 64-bit numbers alone). Subtraction, multiplication, division, algebra, trig, calculus, statistics, etc; AND, OR, NOT, XOR, NAND, NOR, syllogisms, first order logic, second order, etc.
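As a minimal sketch of what such a generator could look like (the function name `addition_problem` and the text format are just illustrative choices, not anything a real training pipeline is known to use):

```python
import random

def addition_problem(rng, bits=64):
    """Return one correct addition problem on two random unsigned integers."""
    a = rng.getrandbits(bits)
    b = rng.getrandbits(bits)
    return f"{a} + {b} = {a + b}"

rng = random.Random(0)  # seeded only so the sample is reproducible
problems = [addition_problem(rng) for _ in range(3)]
for line in problems:
    print(line)

# Size of the space: 2**64 possible operands, so 2**128 ordered pairs.
n_operands = 2 ** 64
n_ordered_pairs = n_operands ** 2
print(f"{n_operands:,} operands, {n_ordered_pairs:.3e} ordered pairs")
```

Python integers are arbitrary precision, so the sums are exact even when they exceed 64 bits. The same pattern would extend to the other operations listed above by swapping the operator and the answer computation.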

Multi-modality stuff would be helpful too such as music videos. Movements, sounds, sung words, subtitles, etc, and then also include other performances of the same song by the same and different people, marked in some way as being the same song. This is the sort of thing that would make it start to understand how ideas work.

Want it to understand what the word 'falling' really means? Mark a bunch of videos of things falling with 'falling', expound on 'falling' in math, show its relations in logic, and use it in many sentences. Even a small number of items will allow it to start bootstrapping meaning (humans only need less than 10,000 words for things relating to the world, and can largely start the process in a new language with 1,000.). What colors look like, what vowels, consonants, etc. sound like, what objects people use in everyday life, basic physics, etc, all easy and with a ton of data available.

This might be a very large endeavor...but if it isn't already, will soon be much cheaper than the training cost. (And by these training laws, make the overall cost much less for much higher performance.) And yes, I think this will help it even if all it ever does is generate text completions. I am a firm believer that grounding is at least one of the absolutely necessary things for AI to become 'Intelligence'.

Side note: I don't really believe in the AI hype machine. Definitely not for the near future, at least. We haven't even reached what people were claiming about GPT-3 yet.

Replies from: Houshalter
comment by Houshalter · 2022-08-07T08:19:05.650Z · LW(p) · GW(p)

The Pile includes 7GB of math problems generated by DeepMind basically as you describe. I don't believe the models trained on it can do any of them, but my testing wasn't properly done.

Replies from: deepthoughtlife
comment by deepthoughtlife · 2022-08-07T13:07:35.862Z · LW(p) · GW(p)

I am unsurprised it includes them, since it is an obvious thing. 7GB sounds like a crazy amount of math problems...but is only a tiny amount compared to what could be generated. Chinchilla was all about how they need more data, and it would be an easy way to increase that (correctly).

That they don't understand math after 7GB of examples is obviously related to the current extremely primitive state of logic in all such models. The big question is whether it would still fail to understand math and logic at 100x that amount. If it could learn basic abstract reasoning, that would massively improve its performance at all tasks. Since math and logic are literally just languages that express an understanding of the easiest (context-independent) of relations between things, a continued failure would prove modern techniques wholly unsuited to real AI. I suspect if it was 700GB of math, it wouldn't fail so hard at math, but who knows? 

(GPT-J even fails at things like '2 + 2 =' on half the prompts I try, often giving strange results like '0' or '5' even with a temperature of 0, though often that is because it doesn't even realize it is math, assuming that '2 + 2 =' is somehow a programming thing even though the similarity is entirely superficial. Even when it knows it is doing math, it will often get the answer right at first, and then switch to '2 + 2 = 0'.).

Replies from: Houshalter
comment by Houshalter · 2022-08-07T22:57:31.262Z · LW(p) · GW(p)

Human beings can not do most math without pencil and paper and a lot of pondering. Whereas there are a number of papers showing specialized transformers can do math and code at a more sophisticated level than I would have expected before seeing the results.

Replies from: deepthoughtlife, sharmake-farah
comment by deepthoughtlife · 2022-08-08T16:19:28.753Z · LW(p) · GW(p)

I literally noted that GPT-J, which uses said 7GB of math (assuming that number is right), usually fails at '2 + 2 ='. People can do several digit addition without pencil and paper. '763 + 119 =' probably doesn't require pencil and paper to get '882'. We do require it for many step algorithms, but this is not that.  'Dumb' computers do 64-bit addition trivially  (along with algebra, calculus, etc.). I haven't seen specialized math models, but I'm dumbfounded that general models don't do math way better.

I haven't tried coding using 'AI' tools, so have no real opinion on how well it compares to basic autocomplete.

comment by Noosphere89 (sharmake-farah) · 2022-08-07T23:09:55.131Z · LW(p) · GW(p)

The basic problem of arithmetic is this: You can't be informal in math, and every single step needs to be checked. Language, while complicated, can allow a degree of informality, as long as you can communicate well. Math does not allow this.

Replies from: deepthoughtlife
comment by deepthoughtlife · 2022-08-08T16:36:28.630Z · LW(p) · GW(p)

You kind of can be informal though?

Suppose, 5x  - 2 = 3b +9, thus

x = (3b + 11)/5 or b = (5x -11)/3 

If b = 2, then

x = 17/5

If x = 2, then 

b = -1/3

This is obviously correct math, but formally you would do each step separately. The steps don't necessarily need to be checked either, because it is an easy enough one that you can just check the result.
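"Just check the result" can be made concrete with exact rational arithmetic. A small sketch, where the helper names `x_from_b`, `b_from_x` and `holds` are made up for illustration:

```python
from fractions import Fraction

def x_from_b(b):
    """Solve 5x - 2 = 3b + 9 for x."""
    return (3 * Fraction(b) + 11) / 5

def b_from_x(x):
    """Solve 5x - 2 = 3b + 9 for b."""
    return (5 * Fraction(x) - 11) / 3

def holds(x, b):
    """Check the original equation directly, without redoing the algebra."""
    return 5 * x - 2 == 3 * b + 9

print(x_from_b(2))  # 17/5
print(b_from_x(2))  # -1/3
print(holds(x_from_b(2), 2), holds(2, b_from_x(2)))  # True True
```

Substituting the answers back into the starting equation verifies both shortcuts without checking any intermediate step.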

Math is a language, just a rigorous one, where it is simple to be right or wrong. It is a simple way to abstract away things that don't matter, and talk about the underlying relations. Math is a subset of language with easier relations. For something with a pure general intelligence, math is probably much easier than a normal language.

I hold that we are story telling intelligences [and consciousness is us telling ourselves our own story as we compose it] that have been generalized through a deep understanding of the patterns in stories, which is why normal languages are easier for us -they were made to tell stories. (I also hold that your story of math is technically incorrect.)

comment by mgalle · 2022-08-02T12:05:50.130Z · LW(p) · GW(p)

I know of two independently developed LLMs in two languages where the conclusion of the developers is that "we run out of data in our language".  One of them is trying to scale by going multilingual.

Where to look next? There is lots of untapped data in speech (radio shows, youtube, etc): that amount could make a difference in my opinion.

comment by Lech Mazur (lechmazur) · 2022-08-14T08:58:01.767Z · LW(p) · GW(p)

This paper came out recently: https://arxiv.org/abs/2207.14502 . It shows a way to work around the lack of sufficient training data for generating computer programs by "generating synthetic programming puzzles and solutions, verified for correctness by a Python interpreter." We can think of analogous generation for data-limited general LLMs and there are some possibilities.

comment by LGS · 2022-08-06T23:34:33.564Z · LW(p) · GW(p)

Great post.

I have a question. Suppose we want to create a decent language model which is as small as possible -- small enough to run on a cell phone, say. We could try to compensate for this by scaling data to infinity. Now, we may run out of data, but if we do, we can generate more data artificially using a much larger LM. For example, consider training something BERT-sized using artificial data generated by PaLM (assume we have a very high compute budget in the training phase).

How well should we expect this to perform? If we plug into the above, it seems like 100M parameters (the size of BERT base, I think?) is hopelessly small and will never get anywhere, whereas at 1B we might approach "almost GPT3" given infinite data, and with 10B we have a realistic shot -- did I do this right? What's the right loss to put in from the limited data, given the data is not actually limited (it's generated by PaLM) but it's low quality (it's generated by PaLM instead of being "real")?

Also, is 1B parameters equal to around 4GB of storage? What's the conversion? Could we imagine a 1B model to be implementable on high-end cell phones in a few years from now? Or would it be too slow to do a forward pass without fancy TPUs?

Replies from: lechmazur
comment by Lech Mazur (lechmazur) · 2022-08-07T01:58:12.681Z · LW(p) · GW(p)

You're describing a data augmentation variant of the teacher-student knowledge distillation. It can work well.

16 bits/parameter is most commonly supported but 8-bit quantization can also be used.
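For the storage conversion asked about above, the arithmetic is just parameters times bits per parameter. A rough sketch that counts weights only and uses decimal gigabytes (10^9 bytes); the function name `model_size_gb` is hypothetical, and activations or runtime overhead are deliberately ignored:

```python
def model_size_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal GB (weights only)."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8):
    print(f"{bits:>2} bits/param -> {model_size_gb(1e9, bits):.1f} GB for 1B params")
```

So 4GB for 1B parameters corresponds to 32-bit weights; at 16 bits it is about 2GB, and at 8 bits about 1GB.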

Performance does not depend only on the number of parameters but also on the architecture. 

High-end smartphones commonly have special-purpose processors for neural networks, so their performance is not bad.

 

comment by Zvi · 2022-08-04T17:20:43.535Z · LW(p) · GW(p)

Scott Alexander asked things related to this, but it still seems worth being more explicit: what would this perfect 1.69 loss model be like in practice if we got there?

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-04T18:44:54.944Z · LW(p) · GW(p)

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI.  Both of these people are right, conditional on their other beliefs.

The key distinction here is that "1.69 loss" may not be the best achievable loss on this dataset.  It's just an estimate of the best loss achievable by this kind of model.

The question "what would a model be like, if it got the best achievable loss, period?" is more interesting, but nothing in this post or these papers really touches on it.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-08-06T17:59:05.443Z · LW(p) · GW(p)

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.

comment by aogara (Aidan O'Gara) · 2022-08-03T03:31:50.325Z · LW(p) · GW(p)

It's worth noting that Ajeya's BioAnchors report estimates that TAI will require a median of 22T data points, nearly an order of magnitude more than the available text tokens as estimated here. See here [LW · GW] for more. 

comment by traviswfisher@gmail.com · 2022-08-03T01:48:25.076Z · LW(p) · GW(p)

Another interesting corpus (though problematic for legal reasons) would be sci-hub.  Quick googling gives estimates of around 50 million research articles; the average research article runs around 4000 words, and sci-hub is estimated to contain about 69% of all research articles published in peer-reviewed journals.  That would put sci-hub at about 50 million * 4000 = 200B tokens and the whole scientific journal literature at an estimated 290B tokens.
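The back-of-envelope estimate can be spelled out as follows. The inputs are the rough figures quoted above, and the sketch keeps the comment's simplification of treating one word as one token; with a subword tokenizer the token count would typically come out somewhat higher.

```python
# Rough inputs taken from the estimate above; all are approximate.
articles_on_scihub = 50_000_000
words_per_article = 4_000
scihub_coverage = 0.69  # fraction of peer-reviewed articles sci-hub is estimated to hold

# Simplifying assumption: one word is counted as one token.
scihub_tokens = articles_on_scihub * words_per_article
all_journal_tokens = scihub_tokens / scihub_coverage

print(f"sci-hub:         {scihub_tokens / 1e9:.0f}B")
print(f"all literature:  {all_journal_tokens / 1e9:.0f}B")
```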

comment by maxnadeau · 2022-08-01T14:53:34.152Z · LW(p) · GW(p)

Confusion:

You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."

But earlier you write:

"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

It's 70B params, trained on 1.4T tokens of data"

300B vs. 1.4T. Is this an error?

Replies from: nostalgebraist, Buck
comment by nostalgebraist · 2022-08-01T15:45:33.884Z · LW(p) · GW(p)

Hmm, yeah, I phrased that point really badly.  I'll go back and rewrite it.

A clearer version of the sentence might read:

"Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size."

For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size.  This is what I meant when I said that PaLM only got a "small" boost from its larger size.
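The split can be reproduced from the fitted law. A sketch, assuming the fitted constants reported by Hoffmann et al. (E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28) and the commonly cited sizes of 280B params / 300B tokens for Gopher and 540B params / 780B tokens for PaLM; the helper names are made up for illustration:

```python
# Fitted constants from the Chinchilla paper (assumed here, not re-derived).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def model_term(n_params):
    """'Finite model' correction to the loss."""
    return A / n_params**ALPHA

def data_term(n_tokens):
    """'Finite data' correction to the loss."""
    return B / n_tokens**BETA

gopher = (280e9, 300e9)  # (params, tokens)
palm = (540e9, 780e9)

gain_model = model_term(gopher[0]) - model_term(palm[0])
gain_data = data_term(gopher[1]) - data_term(palm[1])
total = gain_model + gain_data

print(f"from data:       {gain_data / total:.1%}")
print(f"from model size: {gain_model / total:.1%}")
```

Under these assumptions the data share comes out close to the 85% quoted above, with the remainder coming from model size.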

EDIT: rewrote and expanded this part of the post.

comment by Buck · 2022-08-01T15:45:50.285Z · LW(p) · GW(p)

I think that in that first sentence, OP is comparing PaLM to other large LMs rather than to Chinchilla.

comment by __nmca__ (lesswrong@nmca.dev) · 2022-07-31T20:36:37.869Z · LW(p) · GW(p)

Great post. The reason "no one was focused on data" was an error in the original OAI scaling laws paper, which was discovered by Hoffmann et al.

comment by anon135711 · 2022-12-27T18:59:16.439Z · LW(p) · GW(p)

Super interesting post.  I'm a bit late to the party, but I work in the space and the obvious reason nobody will say specifically what data they used is that their lawyers won't let them.  I've literally had conversations with product counsel about how specific we can be in a paper / blog post about our data sources.

Copyright and privacy law is approximately 3 centuries behind the state of technology, and there are not clear rules about what data you're allowed to use for what.  Someone like Google or Microsoft won't just say "we scraped Quora for a Q&A dataset" because scraping Quora is technically against their terms of service, and it's all user generated data so probably there's some PII buried in there somewhere.  By admitting they used Quora data, they might get sued by Quora, or Quora users, or EU regulators might decide that since there are EU users on Quora that means the dataset is subject to GDPR wipeout rules and no model trained on data that's been deleted by a user can be stored for more than 7 days... etc. etc.  I don't think this is a scientific community failing nearly so much as it's a regulatory failing.

comment by cubefox · 2022-08-12T01:09:55.890Z · LW(p) · GW(p)

A comment from hacker news on this piece:

The reason that language models require large amounts of data is because they lack grounding. When humans write a sentence about.. let's say "fire", we can relate that word to visual, auditory and kinesthetic experiences built from a coherent world model. Without this world model the LM needs a lot of examples, essentially it has to remember all the different contexts in which the word "fire" appears and figure out when it's appropriate to use this word in a sentence [...]

In other words, language models need so much more language data than humans because they have no symbol grounding, and they have no symbol grounding because they lack a world model. This hypothesis would predict that required text data shrinks when multi modal models form world models and associate words with sensory data (e.g. from being trained on video data).

comment by metachirality · 2023-12-24T02:30:35.311Z · LW(p) · GW(p)

Probably not the most important thing ever, but this is really pleasing to look at, from the layout to the helpful pictures, which makes it an absolute joy to read.

Also pretty good at explaining Chinchilla scaling too I guess.

comment by mbazzani · 2023-03-28T18:51:32.622Z · LW(p) · GW(p)

Judging from this, might privacy regulations be one of the best ways to slow down AI development? Privacy is a widely accepted mainstream issue, so it should be a lot easier to advocate for. I think it would be a lot easier for regular people to understand and get behind privacy regulation than DL regulation. On the other hand, it's not neglected and therefore less important on the margin.

Replies from: The Hype Doesn't Help
comment by The Hype Doesn't Help · 2023-03-28T18:55:19.263Z · LW(p) · GW(p)

Why do you want regular people who aren't qualified to get involved? I can't think of any instance where unqualified people brought something productive to the table regarding any issue. Once you become qualified, sure, but before then, why? Qualified people will end up having to sift through garbage generated by the unqualified, thus making them less likely to continue to be engaged because it feels like a waste of time to them. You don't need me to point out the obvious example of this do you now?

Replies from: mbazzani
comment by mbazzani · 2023-03-28T19:17:31.463Z · LW(p) · GW(p)

To give a short, very bad, but sort-of meaningful summary of my ideas: Even idiots have resources. It might help to give a concrete example of a plausible-ish archetype of something that might happen. I don't necessarily think this exact thing will happen, but it may help to clarify what I'm thinking.

  1. Suppose 5% of Americans would be willing to vote for political candidates based purely on their privacy regulation promises, if they were properly persuaded (or donate to privacy nonprofits, or contribute in some other way).

  2. Privacy regulations could meaningfully restrict data access and therefore slow down the progress of deep learning capabilities.

  3. Suppose a significant portion of those would never be persuaded of AI X-Risk arguments and would never contribute meaningfully to alignment work otherwise.

  4. If those three facts are true, I think it would be net positive to advocate for privacy regulation directly, rather than telling people about x-risks, since there are more people who are receptive to privacy arguments than x-risk arguments. Obviously this would have to require careful consideration of your audience. If you think you're talking to thoughtful people who could recognize the importance of alignment and contribute to it, then it is clearly better to actually tell them about alignment directly.

Does this chain of thought seem reasonable to you? If not, what do you think is missing or wrong?

comment by tickybob · 2022-09-20T17:26:26.318Z · LW(p) · GW(p)

There is an old (2013) paper from Google here that mentions training an ngram model on 1.3T tokens: ("Our second-level distributed language model uses word 4-grams. The English model is trained on a 1.3 × 10^12 token training set"). An even earlier 2006 blog post here also references a 1T word corpus.

This number is more than 2x as big as MassiveWeb, more than a decade old, and not necessarily the whole web even back then. So I would be quite surprised if the MassiveWeb 506B token number represents a limit of what's available on the web. My guess would be that there's at least an order of magnitude more tokens available in a full web scrape. Though a lot does depend on how much the "quality filter" throws out.

And if this does represent a limit of what's on the web, then as other posters have said, email is much larger than the web. Though I question whether anyone would be reckless enough to train an LLM on everyone's private emails without consent, it seems like a potential privacy disaster.

comment by p.b. · 2022-08-02T13:38:01.725Z · LW(p) · GW(p)

Some more questions: 

Meanwhile, the resulting model would not be nearly as big as PaLM.  The optimal compute law actually puts it at 63B params.

How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got[6].

These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down? 

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-02T17:43:22.883Z · LW(p) · GW(p)

How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?

See the footnote attached to that sentence.

These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down? 

Great question, with a complicated answer.

First, one of the assumptions you're making is not quite right.  By "trained differently" I imagine you're referring to a difference in learning rate schedules, since that was the fundamental difference between the earlier scaling papers (Kaplan et al) and the Chinchilla paper (Hoffmann et al).

Then, it sounds like you're imagining:

  1. Kaplan et al chose learning rate schedules in a particular way
  2. Models like GPT-3 and Gopher did learning rate schedules in the same way, so they got the same scaling law
  3. Hoffmann et al chose their learning rate schedules in a different way from previous authors, so they got a different scaling law

But (2) here is not true.  Kaplan et al chose their schedules in an unusual way that doesn't adapt to the number of training steps, while in practice (and in GPT-3, etc.) people always adapt their schedules to the number of steps like Hoffmann et al do.

"Wait," you say -- "if that's true, then shouldn't GPT-3 and Gopher agree with the Hoffmann et al law, not the Kaplan et al law?  Why didn't those papers observe a breakdown in the Kaplan et al law?"

Well, one of the implications of the Kaplan et al law is that for compute-optimal training, you should spend basically all your marginal compute on larger models, while increasing the number of training tokens (batch size * steps) more slowly.

Following this rule, people kept training on ~300B tokens or so, while raising N (the parameter count) with compute.  So when they plotted loss-vs.-compute, they were effectively just plotting loss-vs.-N.

But if you're just looking at loss-vs.-N for a constant number of training tokens, and that number is reasonably close to the one Kaplan et al used to set their LR schedule (so that yours is close to theirs) -- then the Kaplan et al law is a lot, uh, less wrong.

The problem with the Kaplan law was an incorrect estimate of how loss varied with steps/data.  And as a result, picking param/step/data combinations that were suboptimal given a compute budget.

But if you follow its suboptimal recommendations, they tell you not to vary steps/data much.  The law is wrong about what happens if you vary steps/data, but it also tells you not to do that, so you won't notice it being wrong.
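To see what a constant-token sweep looks like under the Hoffmann et al fit, here is a sketch that holds the data at 300B tokens and only grows the model. The constants are the paper's fitted values, taken as given, and the 70B params / 1.4T tokens figures for Chinchilla are the ones quoted earlier in this thread; the variable names are illustrative.

```python
# Fitted constants from the Chinchilla paper (assumed, not re-derived here).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted LM loss under the fitted law."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

FIXED_TOKENS = 300e9  # roughly what the pre-Chinchilla models trained on

for n_params in (1e9, 10e9, 100e9, 1e12, 1e15):
    print(f"params = {n_params:.0e}  loss = {loss(n_params, FIXED_TOKENS):.3f}")

# Limit of the sweep as the model grows without bound: only the data term remains.
floor_at_fixed_tokens = E + B / FIXED_TOKENS**BETA
chinchilla = loss(70e9, 1.4e12)
print(f"floor at 300B tokens: {floor_at_fixed_tokens:.3f}")
print(f"Chinchilla:           {chinchilla:.3f}")
```

Along this sweep the loss keeps falling smoothly with model size, so nothing looks broken from the inside; the floor set by the data term only becomes visible once the token count is varied as well.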

Replies from: p.b.
comment by p.b. · 2022-08-03T09:45:00.354Z · LW(p) · GW(p)

Is the difference mostly the learning rate schedule? I read it was also AdamW and it is at least conceivable that AdamW somehow gets better results for smaller models using more data but maxes out on the benefits of model size quicker than just plain Adam. So it could in theory be the case that scaling continues for the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that's not very plausible and for different learning rate schedules it is maybe even less plausible.

Another way to phrase the question: Are the old and the new scaling laws roughly compatible? I.e. do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params distribution? I interpret your answer as that being roughly the case for the current models, but maybe not when you extrapolate further along the old scaling laws? 

If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.  

Replies from: gwern
comment by gwern · 2022-08-03T14:37:46.634Z · LW(p) · GW(p)

(That is, the 'W' in AdamW stands for 'weight decay', that is, a lasso-like regularization trying to shrink the size of weights and reducing 'how wiggly' and complex the curves a given set of weights can compute, biasing towards smoother simpler curves requiring less data to estimate well. Per the famous variance/bias tradeoff, regularization can help with small data and hurt with large data, so with large data approaches, often a key fix is removing regularization - and these models are the largest of data. In principle, an informative prior like regularization ought to 'wash out' in the limit and not be a problem even if they are wrong, but in practice this doesn't seem to quite work out, perhaps because these approaches aren't Bayesian enough for that to happen or because you have other bad hyperparameters or something is not quite implemented right or it is all correct in the limit but training dynamics go wrong... "Neural nets want to work" so you can have pretty serious bugs and still have a NN which seems to be training as well as it should yet fall far short of its true potential.)

Replies from: carlos-ramon-guevara
comment by CRG (carlos-ramon-guevara) · 2022-08-03T21:20:56.098Z · LW(p) · GW(p)

WD is not really about regularisation nowadays, so it's not surprising that it helps at all model sizes. Layernorm in transformers makes WD affect mostly the effective LR of the weights. (Except the final linear, the absolute scale of the weights doesn't matter, since you have a final LN), and so the actual effect of wd is keeping the update/weight ratio bigger over training. (In fact, you can substitute WD in normed nets for an exponentially increasing LR schedule).
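The scale-invariance point can be checked numerically. A toy sketch in plain Python, assuming a layer norm with no learned gain or bias and eps set to zero (with a small positive eps the invariance is only approximate); `layer_norm` and `matvec` are hand-written helpers, not any library's API:

```python
import math

def layer_norm(xs, eps=0.0):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

w = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75], [-2.0, 1.0, 0.5]]
x = [1.0, 2.0, 3.0]
scaled_w = [[10.0 * wij for wij in row] for row in w]  # same direction, 10x the norm

out = layer_norm(matvec(w, x))
out_scaled = layer_norm(matvec(scaled_w, x))

# The two outputs agree up to floating-point error.
print(max(abs(a - b) for a, b in zip(out, out_scaled)))
```

Since multiplying the weights by a positive constant leaves the normalized output unchanged, shrinking them with weight decay changes how large each update is relative to the weights rather than what function the layer can represent.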

Replies from: gwern, not-relevant
comment by gwern · 2022-08-04T00:32:48.870Z · LW(p) · GW(p)

Yes, that's part of what I mean about regularization having weird effects and interactions in practice. If it was a Bayesian informative prior which is the nice theoretical interpretation of penalized regression stuff, you would not expect it to be equivalent to rescaling the LR and discover that you had in effect lowered the LR permanently or something, as opposed to washing out & simply requiring you to spend more data to overcome your poor choice of prior. In a scaling law context, you'd expect it to be a change in the constant, not exponent or parameterization. (At least, it's certainly not obvious to me that that's what WD would be equivalent to, and if AdamW and weight decay worked like one assumed they did, the Hutter group wouldn't have so many papers about fixing it.)

comment by Not Relevant (not-relevant) · 2022-08-03T21:54:06.721Z · LW(p) · GW(p)

Has this WD unimportance as regularization been written about somewhere? As a possible counterpoint, in a recent paper on the grokking phenomenon, the authors found that grokking only occurs when training with WD. Otherwise, once the model reached zero training loss, it would barely have a gradient to follow, and thus stop building better representations that improve prediction OOD.

comment by Aiyen · 2022-08-05T00:22:27.668Z · LW(p) · GW(p)

On the MMLU benchmark, Chinchilla five-shot reported 67.6% accuracy; how does one convert this to loss or vice versa?  More to the point, what loss would the human expert 89.8% correspond to?  It would be very interesting to see how much compute that scaling law predicts would be necessary to produce human expert level losses with optimal data availability, or with as much data as is likely available to such a project. 

comment by Sushrut Karnik (sushrut-karnik) · 2023-01-05T04:42:19.191Z · LW(p) · GW(p)

How many tokens would we have if we transcribed the audio of as many youtube videos as possible? (After a lot of filters I imagine)

comment by JosephY · 2022-08-29T22:33:53.130Z · LW(p) · GW(p)

Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-30T15:55:05.137Z · LW(p) · GW(p)

Yes, it's a function of the data, as well as the model architecture / training routine.  See my reply in this thread [LW(p) · GW(p)].

Also, the value of the irreducible loss isn't important for the conclusions discussed in this post.  What we care about is how loss varies with data and parameter count.

Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.

comment by awlego · 2022-08-02T09:17:32.662Z · LW(p) · GW(p)

I would expect the outcome of this to drive capabilities research more towards "learning to learn". Goal being to improve the amount of knowledge that is extracted from each observed piece of data.

comment by Houshalter · 2022-08-07T07:36:25.420Z · LW(p) · GW(p)

They fit a simplistic model where the two variables were independent and the contribution of each decays exponentially. This leads to the shocking conclusion that the two inputs are independent and decay exponentially...

I mean the model is probably fine for its intended purpose; finding the rough optimal ratio of parameters and data for a given budget. It might mean that current models have suboptimal compute budgets. But it doesn't imply anything beyond that, like some hard limit to scaling given our data supply.

If the big tech companies really want to train a giant model, but run out of data (unlikely)... well it may not be compute optimal, but there is nothing stopping them from doing multiple passes over the same data. If they even get to the point that it starts to overfit (unlikely), there's a plethora of regularization methods to try.

Replies from: nostalgebraist
comment by nostalgebraist · 2022-08-07T19:02:36.557Z · LW(p) · GW(p)

What specific claims in the post do you disagree with?

See this post [LW · GW] for why multiple epochs will probably not work nearly as well as training on additional data.

Replies from: Houshalter
comment by Houshalter · 2023-11-13T06:54:14.211Z · LW(p) · GW(p)

I'm not sure what my exact thoughts were back then. I was/am at least skeptical of the specific formula used as it seems arbitrary. It is designed intentionally to have certain properties like exponentially diminishing returns. So it's not exactly a "wild implication" that it has these properties.

I recently fit the Chinchilla formula to the data from the first LLaMA paper: https://i.imgur.com/u1Tm5EU.png

This was over an unrelated disagreement elsewhere about whether Chinchilla's predictions still held or made sense. As well as the plausibility of training tiny models to far greater performance.

First, the new parameters are wildly different than the old ones. Take that for what you will, but they are hardly set in stone. Second even with the best fit, the formula still doesn't really match the shape of the observed curves. I think it's just not the right curve.

As for reusing data, I've seen sources claim that reusing data in language models up to four times had no negative effect. And up to like 40 times was possible before it really stopped helping. I think LLMs currently do not use much regularization and other tricks that were done in other fields when data was limited. Those might push it further.
If data became truly scarce, there may be other tricks to extend the data we have further. You also have all of the data from the people that talk to these things all day and upvote and downvote their responses. (I don't think anyone has even tried making an AI that intentionally asks users questions about things it wants to learn more about, like a human would do.)