## Posts

larger language models may disappoint you [or, an eternally unfinished draft] 2021-11-26T23:08:56.221Z
the scaling “inconsistency”: openAI’s new insight 2020-11-07T07:40:06.548Z
on “learning to summarize” 2020-09-12T03:20:08.333Z
interpreting GPT: the logit lens 2020-08-31T02:47:08.426Z
is gpt-3 few-shot ready for real applications? 2020-08-03T19:50:09.740Z
[updated] how does gpt2′s training corpus capture internet discussion?  not well 2020-07-27T22:30:07.909Z
Why is pseudo-alignment "worse" than other ways ML can fail to generalize? 2020-07-18T22:54:50.957Z
GPT-3: a disappointing paper 2020-05-29T19:06:27.589Z
covid-19 notes, 4/19/20 2020-04-20T05:30:01.873Z
mind viruses about body viruses 2020-03-28T04:20:02.674Z
human psycholinguists: a critical appraisal 2019-12-31T00:20:01.330Z
“embedded self-justification,” or something like that 2019-11-03T03:20:01.848Z
When does rationality-as-search have nontrivial implications? 2018-11-04T22:42:01.452Z

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2022-01-03T17:37:18.845Z · LW · GW

There is some point at which it’s gaining a given capability for the first time though, right? [...]

So my read of the de-noising argument is that at current scaling margins we shouldn’t expect new capabilities—is that correct?

Not quite.

If you define some capability in a binary yes-no way, where it either "has it" or "doesn't have it" -- then yes, there are models that "have it" and those that "don't," and there is some scale where models start "having it."

But this apparent "switch flip" is almost always an artifact of the map, not a part of the territory.

Suppose we operationalize "having the capability" as "scoring better than chance on some test of the capability."  What we'll find is that models smoothly move from doing no better than chance, to doing 1% better than chance, to doing 2% better . . . (numbers are meant qualitatively).

If we want, we can point at the model that does 1% better than chance and say "it got the capability right here," but (A) this model doesn't have the capability in any colloquial sense of the term, and (B) if we looked harder, we could probably find an intermediate model that does 0.5% better, or 0.25%...

(By the time the model does better than chance at something in a way that is noticeable to a human, it will typically have been undergoing a smooth, continuous increase in performance for a while already.)
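To make the "artifact of the map" point concrete, here is a minimal sketch. The accuracy numbers are made up for illustration (like the percentages above, they are meant qualitatively), and `first_scale_with_capability` is a hypothetical helper. The same smooth curve yields a different "scale where the capability appears" for each choice of threshold.

```python
CHANCE = 0.25  # assumed chance accuracy on a hypothetical 4-way multiple-choice test

# Made-up accuracies that rise smoothly with parameter count (illustrative only).
accuracy_by_scale = {
    10**7: 0.2500,
    10**8: 0.2525,
    10**9: 0.26,
    10**10: 0.29,
    10**11: 0.40,
    10**12: 0.62,
}

def first_scale_with_capability(margin):
    """Smallest scale whose accuracy beats chance by at least `margin`."""
    for scale in sorted(accuracy_by_scale):
        if accuracy_by_scale[scale] - CHANCE >= margin:
            return scale
    return None

# The "switch flip" moves around depending on where the line is drawn.
for margin in (0.005, 0.03, 0.1):
    print(margin, first_scale_with_capability(margin))
```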

Comment by nostalgebraist on GPT-3: a disappointing paper · 2021-12-24T19:45:45.190Z · LW · GW

The later post still reiterates the main claims from this post, though.

• This post: "Few-shot learning results are philosophically confusing and numerically unimpressive; the GPT-3 paper was largely a collection of few-shot learning results, therefore the paper was disappointing"
• The later post: "Few-shot learning results are philosophically confusing and numerically unimpressive; therefore we don't understand GPT-3's capabilities well and should use more 'ecological' methods instead"

Many commenters on this post disagreed with the part that both posts share ("Few-shot learning results are philosophically confusing and numerically unimpressive").

Comment by nostalgebraist on GPT-3: a disappointing paper · 2021-12-17T22:14:12.285Z · LW · GW

This post holds up well in hindsight.  I still endorse most of the critiques here, and the ones I don't endorse are relatively unimportant.  Insofar as we have new evidence, I think it tends to support the claims here.

In particular:

• Framing few-shot learning as "meta-learning" has caused a lot of confusion.  This framing made little sense to begin with, for the reasons I note in this post, and there is now some additional evidence against it.
• The paper does very little to push the envelope of what is possible in NLP, even though GPT-3 is probably capable of pushing the envelope.  The paper spends very little time getting GPT-3 to do new things that were not previously possible.  Instead it spends most of its time reproducing BERT results.
"Our 175B model can do as well as BERT" is an underwhelming punchline, and if anything an inadvertent argument against prompting as a technique.

Both these points are still not appreciated as broadly as I think they ought to be.

I'm not sure how much lasting value this post has.  My recent post here covers the same ground more carefully.

I'm not sure if this is relevant, but this post received some very critical comments, leading me to seriously question the value of continuing to write posts like this on LW.  See here for a discussion about this with a reader of my blog.  I did continue to write posts like this, and they have been well received, even when they reiterated my arguments here.  I am curious what explains this difference, and have no good hypotheses.

Some points here I no longer endorse:

• I no longer care whether the new model "deserves" the name GPT-3, and I shouldn't have mixed this inane gripe with serious critiques.  (I had forgotten at the time, but when GPT-2 was first announced, I made a similar objection to its name.)
• "this is about the least interesting transformer paper one can imagine in 2020" is just wrong, even as hyperbole.
• The paper crosses a larger scale gulf than earlier ones, and it's valuable to know what happens as you scale that much, even if what happens is "nothing especially novel."
• Related: I had a vague impression from other work that "scaled-up transformers are fundamentally like smaller ones."  Here, I acted more confident of that claim than I had reason to be, and also assumed it was an established consensus, which it (clearly!) wasn't.  I still think this claim is true, but it's a point of contention even today.
• I didn't understand that OpenAI "really meant it" about few-shot results.
• That is, I assumed that "obviously" no one would use few-shot as a practical technique, and thought OpenAI was exhibiting these results to illuminate the model's properties, where in fact OpenAI really believes in a future where we interact with LMs entirely through natural language prompting.
• The release of the API (some time after this post) blindsided me.
• I had the same "obviously this isn't practical" response to GPT-2 zero-shot results, though when I go back and read the GPT-2 paper, it's clear the authors "really mean it" there too.

Comment by nostalgebraist on EfficientZero: How It Works · 2021-12-17T20:32:21.101Z · LW · GW

I don't think humans would put much probability on such sequences, even conditionally: we'd think that at some point the sequence would stop, because why would there be such gibberish?

I think the intuition behind your remark "why would there be such gibberish?" actually goes most of the way to explaining the repetition trap.

The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.

Once you're already in one, I don't think it's really so obvious that the repetition should eventually stop.  Yes, that's what a human writer would do -- but a human writer wouldn't have produced the conditioning sequence to begin with.

We start out with a prior that puts high weight on "this belongs to some natural genre of text," and low weight on "this belongs to a weird hyper-repetitive 'genre' of text."  But eventually, after enough bad predictions from the former and enough accurate predictions from the latter, we really ought to yield to the evidence and update. Eventually it should become clear that the question "why would there be such gibberish?" has some answer, since we keep observing "such gibberish" and not anything else.

But why does LM sampling enter the trap to begin with?  I think there needs to be some "initial misstep," where a sampled token makes the text just a bit too repetitive.  This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.

In other words, repetition is special because it's a way of going off-distribution where there is, nonetheless, a single "obvious" way to continue the text, and continuing it thus will keep you in the same off-distribution region.  Whereas most ways of going off-distribution are just confusing, and don't have a legible structure the LM would have learned from in-distribution training.

I would expect scale to lower the probability of the "initial mistake," and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).  I don't expect scale to make LMs stop assigning high likelihood to repetitive continuations of unnaturally repetitive prefixes, since I think that's a conditionally correct judgment.

For practical purposes, I've found my custom sampler to be a pretty effective solution, though sometimes the LM still "wins the fight," as in this amusing example.

Do you have a source for iGPTs not exhibiting the repetition trap?  Not that I don't believe you, I just would have expected otherwise, so I'm curious.

Comment by nostalgebraist on Hard-Coding Neural Computation · 2021-12-13T22:47:59.338Z · LW · GW

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

1. the coordinates of
2. the coordinates of , i.e. the basis in which the nonlinearity operates

?

If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by , or which vectors/semes they get mapped to by .

But your labels don't correspond to either of these interpretations.  Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by , nor mapped to "yum" by .

This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case.  In transformers, (2) is typically larger by a factor of 4.   The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:

with some arbitrary choices about which multiplicative constants to absorb into  and  vs. which to absorb into .

Comment by nostalgebraist on More Christiano, Cotra, and Yudkowsky on AI progress · 2021-12-09T20:00:56.625Z · LW · GW

I agree with Eliezer's recommendation to double-check results in papers that one finds surprising.

So, I looked into the claim of a 10x - 100x gain for transformers, using Table 2 from the paper.  Detailed results are in this Colab.

Briefly, I don't think the claim of 10x - 100x is well supported.  Depending on what exactly you compute, you get anywhere from "no speedup" to "over 300x speedup."  All the estimates you can make have obvious problems, and all show a massive gap between French and German.

In detail:

• The appearance of a large speedup is heavily affected by the fact that previous SOTAs were ensembles, and ensembling is a very inefficient way to spend compute.
• In terms of simple BLEU / compute, the efficiency gain from transformers looks about 10x smaller if we compare to non-ensembled older models.
• Simple BLEU / compute is not a great metric because of diminishing marginal returns.
• By this metric, the small transformer is ~6x "better" than the big one!
• By this metric, small transformer has a speedup of ~6x to ~40x, while big transformer has a speedup of ~1x to ~6x.
• We can try to estimate marginal returns by comparing sizes for transformers, and ensembled vs. not for older methods.
• This gives a speedup of ~5x for German and ~100x to ~300x for French
• But this is not an apples-to-apples comparison, as the transformer is scaled while the others are ensembled.

I imagine this question has been investigated much more rigorously outside the original paper.  The first Kaplan scaling paper does this for LMs; I dunno who has done it for MT, but I'd be surprised if no one has.

EDIT: something I want to know is why ensembling was popular before transformers, but not after them.  If ensembling older models was actually better than scaling them, that would weaken my conclusion a lot.

I don't know if ensembling vs. scaling has been rigorously tested, either for transformers or older models.

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-12-09T01:24:42.144Z · LW · GW

It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.

All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."

• They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
• They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that there is no directional error, but that the RL/SL tasks have the right structure: there is an optimization procedure which is "leaving money on the table" if the capability is present yet ends up unused.

Comment by nostalgebraist on Is GPT-3 already sample-efficient? · 2021-12-05T16:27:03.219Z · LW · GW

Whoops, fixed.

Comment by nostalgebraist on Is GPT-3 already sample-efficient? · 2021-12-05T01:35:38.745Z · LW · GW

I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 "Curie."

In my experience, doing more than a single epoch is always harmful when finetuning GPT-J.

I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule.  I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect.  Training beyond the first epoch only helped on text that had been accidentally duplicated between train and val, and was harmful elsewhere.  In other words, it was "helpful" for exact memorization, but harmful for generalization.
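A minimal sketch of the kind of check that separates these two effects, assuming exact-match duplication between splits. The helper names, documents, and loss values below are all hypothetical; the actual analysis described above inspected per-token validation loss, and its tooling is not specified here.

```python
def split_val_by_leakage(train_texts, val_texts):
    """Split validation texts into (leaked, clean) by exact match with train."""
    train_set = {t.strip() for t in train_texts}
    leaked, clean = [], []
    for text in val_texts:
        (leaked if text.strip() in train_set else clean).append(text)
    return leaked, clean

def mean(values):
    """Arithmetic mean, or NaN for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")

# Toy data: hypothetical documents and hypothetical per-document val losses.
train = ["post A", "post B", "post C"]
val = ["post B", "post D", "post E"]
val_loss = {"post B": 0.4, "post D": 2.1, "post E": 1.9}

leaked, clean = split_val_by_leakage(train, val)
print("leaked:", leaked, "mean loss:", mean(val_loss[t] for t in leaked))
print("clean: ", clean, "mean loss:", mean(val_loss[t] for t in clean))
```

Tracking the two means separately over training steps would show whether later epochs help only on the leaked portion. Near-duplicates would need fuzzier matching than this exact-match sketch.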

I have a wandb report here with some plots of this phenomenon.  I'm still not sure whether it's an indication of the sample efficiency associated with the ~6B scale, a quirk of GPT-J specifically, or (less plausibly) a quirk or bug in the codebase used to tune it.

I did this work before OpenAI released their finetuning feature, and was surprised to find them defaulting to 4 epochs.  Especially given that their feature has a relatively tiny maximum dataset size.  My gut feeling is that 4 epochs is way too many, given a large model and only 2.5M tokens.

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-12-03T20:19:06.841Z · LW · GW

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't fixed in advance, but could be, i.e. we could define a task "write convincingly for N tokens.")  So, why does generation quality scale so smoothly?

Part of the answer is that single-token success is not fully binary: there are choices that are suboptimal / "weird" without constituting instant failure.  Due to the "delusion" phenomenon, weird choices can pile on themselves and lead to failure, but "weirdness" is a continuous variable so this effect can scale more gradually.

But also, part of the answer must be that generation is relatively easy, with single-token success probabilities very close to 1 even for small models.

(Why is generation easy, when it potentially includes every other task as a subtask?  Well, it samples other tasks in proportion to their frequency in natural text, which ≈ their relative volume in pre-training data, which ≈ how easy they are for the model.)

This shows how the relevance of the argument depends on the success probabilities living in the right "transitional regime," like your 90% vs 99% vs 99.9%.  More precisely, the argument is relevant at the point where, for a given task and set of model scales, the scaling moves us across this range.  I suppose by continuity this has to happen somewhere for any multi-step task, which makes me wonder whether we could "induce" discontinuous scaling for any task by forcing it to be done in a multi-step way.

Last thought: this might explain why one-step arithmetic scales discontinuously.  Suppose it can only be done by some sequential multi-step algorithm (and that this is not true of most tasks).  Presumably the model implements the steps along the "time axis" of successive layers.  The model has some failure probability at each step, and the argument goes through.
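The multi-step argument can be sketched numerically, under the simplifying assumption (not claimed in the discussion above) that steps fail independently and that any single failure sinks the whole task. The step count of 20 is arbitrary.

```python
def multistep_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps succeed."""
    return p_step ** n_steps

# Per-step accuracies from the "transitional regime" discussed above.
for p in (0.90, 0.99, 0.999):
    print(f"per-step {p:.3f} -> 20-step success {multistep_success(p, 20):.3f}")
```

Under these assumptions, modest per-step gains in the 90% to 99.9% range move the 20-step task from mostly failing to mostly succeeding, while per-step gains far from that range barely change the multi-step outcome.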

I wonder if your thoughts [on abstract reasoning] have changed since Codex was released after you originally drafted this post.

I didn't update much on Codex.  Part of that was because I'd already seen this paper, which strikes me as a comparably impressive feat of abstraction in the code generation domain.

Also, the Codex model available in the API feels very much like GPT in the way it "reasons," and is roughly what I'd expect from a GPT extended to code.  It has that same quality where it frequently but not predictably does the right thing, where I often see it doing many separate things right but I can't rely on it doing any one of them stably across all contexts.  As with GPT, I get the best results when I stop asking "does it know X or not?" and instead ask "can I express X in a form likely to be common in the training data?"

I'm interested in knowing more about your reasons for thinking that little will come of scaled LLMs' abstract reasoning capabilities.

[...] While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is [...], why does this imply that they cannot also learn the “real” patterns? What exactly are the "real" patterns?

This is going to get speculative and hand-wavey.  I don't know what abstract reasoning really is, any more than anyone does.  But I have some ideas :)

First, something I have noticed since I started working with these models is that my own mind contains a module much like GPT, and this module plays a role in my reasoning process.

When I reflect on my own thought processes, they often look like a game played between a GPT-like "babbler" and an evaluating "critic."

The babbler produces an interior monologue that sounds like my own voice, but (unlike when I'm speaking out loud) is only lightly conditioned at best on things like "concepts I want to express."  Instead, it just . . . says words that sound like me, making some argument with the confidence I'd have if I actually believed it, but it's not trying to express an idea I already have -- it's just generating text that sounds like me.

I let the babbler run for a while, and then I step back and assess the monologue, asking "does this make sense? is this really a new idea? does this prove too much? can I think of counterexamples?"  Like generating code and then checking if it compiles.  Most babbler-monologues are rejected by the critic, at which point the babbler tries again, conditioned (in some way I don't understand) on the critic's rejection.

Most of my actually-believed-ideas originated in this game, I think.  Also, I often do a short-range, purely linguistic variant of this when I'm writing: I ask the babbler for the next word or phrase, and there are several rounds of "no that doesn't work" before I pick one.  Even my mathematical reasoning is often like this, though it also involves other babbler-like modules that eg generate mental imagery which can be interpreted (by the critic) as expressing a mathematical argument.

Now, I highly doubt this is the only way that one can do abstract reasoning.  (I don't even think that all humans do it like this.)  However, this is the source of my intuitions about the components involved in "true abstract reasoning" and how it differs from what LMs tend to do.

When I do "true abstract reasoning" as described above, there is a distinction between timesteps of candidate generation (inner loop), timesteps of candidate evaluation (outer loop), and timesteps of actually selecting the next idea (increments on some passes of the outer loop but not others).  This seems important for avoiding "delusive" effects.

I have to run the babbler for a while to even get a coherent idea that's possible to assess.  By that point, the babbler is already conditioning on its earlier output in a self-deluding way.  Unlike in GPT, though, these earlier outputs are not irrevocably written in stone at the moment we receive the later outputs; the critic is free to reject the entire sequence.  With GPT, by the time it would be possible to notice "hey, I'm making a bad argument," it's already ... making a bad argument, and there's no going back.
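As a rough sketch of the loop structure described above (and only that: it is not a claim about how minds or any real system work), a generate-then-evaluate procedure might look like this. `babble`, `critic`, and the toy task of "find a multiple of 7" are hypothetical stand-ins.

```python
import random

def generate_and_filter(babble, critic, max_tries=100):
    """Outer loop: ask the babbler for whole candidates until the critic accepts one.

    A rejected candidate is discarded in full and only recorded as context
    for the next attempt; nothing it said is kept as committed output.
    """
    rejected = []
    for _ in range(max_tries):
        candidate = babble(rejected)   # inner loop: produce a complete candidate
        if critic(candidate):          # evaluation happens after generation
            return candidate, rejected
        rejected.append(candidate)
    return None, rejected

rng = random.Random(0)

def toy_babble(rejected):
    """Propose a number not tried before (loosely 'conditioned' on rejections)."""
    options = [n for n in range(1, 50) if n not in rejected]
    return rng.choice(options)

def toy_critic(candidate):
    """Accept only candidates that pass the check."""
    return candidate % 7 == 0

idea, discarded = generate_and_filter(toy_babble, toy_critic)
print("accepted:", idea, "after discarding", len(discarded), "candidates")
```

The contrast with plain autoregressive sampling is that here a bad candidate never becomes part of the final output, whereas a sampled token is committed as soon as it is produced.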

(I think there's an analogy here to AlphaZero/MuZero's value head vs. its MCTS rollouts, where GPT is like the value head / "intuitive hunches," lacking the slower search wrapper.)

Of course, in principle, you could imagine bundling this entire procedure inside an LM.  Indeed, any sufficiently good LM would eventually have to solve the problems this procedure is designed to solve.  Why don't I expect transformer LMs to develop this structure internally?

One reason: the existence of my babbler seems like (weak) evidence that it's better to use an LM inside a bigger non-LM algorithm.

My babbler itself feels very much like a likelihood-trained causal generative model, with the same virtuosity at surface mimicry, and the same lack of conditioning latents besides its own output.  I suspect that making these kinds of models comes naturally to the cerebral cortex, and that if the brain could just implement reasoning end-to-end with such a model, it would have done it that way.

A second reason is ... okay, this is a whole separate point and the comment's already long.  I'll try to make this brief.

I think transformer LMs do a lot of what they do through a kind of "compressed memorization" of very large amounts of data.  Early on, they learn many different ways that text is regular; some of this may look like "truly learning (eg syntactic) rules."  This low-level knowledge allows them to store training sequences in a vastly compressed form.  Then, a lot of what they do in training is actual memorization of the data, in a compressed and noisy/interleaved form.  Inference looks like mapping the input to the compressed space, and then doing a shallow-ish ensemble in that space over a massive number of texts the input is "reminiscent of" along various dimensions.  The huge model capacity allows for a huge ensemble, so many superficial patterns cancel out in the ensemble, while deeper patterns stack.

This perspective is inspired by the way logit lens looks in later layers, by this paper which is similar to logit lens, and also by work like this showing you can extract exact strings from trained models that were only seen a few times in training.

The key point here is that you can compress things you can't yet abstractively understand, using easier things you do understand.  I can't use abstractive summarization to compress (say) Grothendieck's EGA, since I don't understand it . . . but I can still run gzip on it, and that goes a long way!  Hence, the frontier of the model's apparent abstractive capability will outrun its actual abstractive capability: this frontier consists of texts the model can't compress via facility with their content, but can simply memorize in bulk using easier compression.
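A small illustration of compressing without understanding, using Python's standard `gzip` module. The sample sentence is a made-up placeholder, not a quotation from EGA, and its heavy repetition exaggerates the effect compared with real prose.

```python
import gzip

def compression_ratio(text: str) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

# Placeholder text; gzip exploits its surface regularity with no notion of meaning.
sample = "Let X be a scheme locally of finite type over S. " * 200
ratio = compression_ratio(sample)
print(f"compressed to {ratio:.1%} of original size")
```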

In something like your list sorting example, I suspect the model doesn't "have" an internal list sorter that looks anything like an algorithm.  Instead, it has heavily compressed memories of many actual programming tutorials that included short example lists in unsorted and sorted form, and taking an average over these will usually "sort" a short list of small numbers -- with help from low-level abstract operations like "greater than over small numbers," but without any idea that a list can be arbitrary length / can contain any ordered type.

(EDIT to clarify: the context-dependence and flakiness of the capability is how we can tell it's coming from the compressed ensemble.  Contrast with the reliability of something like English syntax, which I believe is part of the compressor itself.  This is my distinction between abstraction that's "real" and "fake")

Anyway, I think transformers are very good at this kind of compressive memorization -- but not nearly as good at doing other kinds of computational work, like search or (obviously?) recursion.  Like, whenever I think about how to "program" some routine using attn+FFs, I tend to despair.  Even simple things often need to be spread across >1 layer/"step" or >1 head, and the number of heads/layers in huge models feels tiny relative to the diversity of abstraction we expect out of them.  (See this paper for some actual transformer "programs.")

This is hand-wavey, but my intuition is that the "right abstractions for abstraction" are hard to fit in a transformer or similar modern NN, while memorizing instances of abstraction-use is far cheaper.  And yes, eventually, at large enough scale, the models will have to do the right thing or they'll stop progressing.  But there is so much more left to smoothly learn with memorization that I think this architectural deficit will be invisible for a long time, over which LMs will continue to (unreliably) imitate abstraction better and better.

Comment by nostalgebraist on Visible Thoughts Project and Bounty Announcement · 2021-12-01T17:33:23.279Z · LW · GW

Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.

I've fine-tuned GPT models on a bunch of different datasets of different sizes, although not this particular dataset (which doesn't exist yet).

Below I list some key things to note.  Also see here for related discussion.  These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.

• GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude.
• In terms of subjective response, you don't need much data to get GPTs to the level of "hey, it kinda gets it!".
• You may need several orders of magnitude more data to reach the point of saturation where the model can't improve with additional data.
• Incomplete mastery usually looks more like "randomly failing X% of the time" than "understanding X% of the content of the task," which can make it difficult to assess quality (or quality differences) at a glance.

For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot.  My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report.

The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven't reached diminishing returns yet.  This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-11-30T21:01:57.767Z · LW · GW

I'd predict that trying to infer the necessary prompt with the reversing trick wouldn't work for small models anyhow, and would be a waste of time compared to directly editing/controlling the model.

Also, even if one had a reversed model available, it would not be trivial to generate useful prompts with it.

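The goal is (roughly) to find a prompt that maximizes $P(\text{answer} \mid \text{prompt})$.  But the reversed model gives you

$$P(\text{prompt} \mid \text{answer}) = \frac{P(\text{answer} \mid \text{prompt}) \, P(\text{prompt})}{P(\text{answer})}$$

The answer is a constant so we can ignore $P(\text{answer})$, but $P(\text{prompt})$ is problematic: we don't care about the likelihood of the prompt, since we plan to condition on it.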

Moreover, we need a way to communicate what the prompt is supposed to mean, and a single answer isn't a sufficient latent for that.  (Consider "...2002? Barack Obama"  "who was the Illinois State senator for the 13th district in the year...")

Prompt-finetuning resolves the ambiguity by averaging over multiple answers, which could work here, but would require an unusual sampling technique (average likelihood over multiple prompts?).

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-11-30T18:28:33.452Z · LW · GW

I don't know.

I poked around on Google Scholar for a bit, trying to answer these questions, and managed to learn the following:

• The term "few-shot learning" seems to have become widespread sometime around 2017.
• Before 2017, it's hard to find usage of "few-shot" but easy to find usage of "one-shot."  (Example, example)
• The "one-shot" term dates back at least as far as 2003.  Work before 2010 tends to lump what we would call "one-shot" and "few-shot" into a single category, as in this paper (2006) and this one (2003).

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-11-30T17:33:48.080Z · LW · GW

There is no known way to "reverse" an LM like that.

(Well, besides the brute force method, where you generate a preceding token by looping over all possible values for that token.  GPT's vocab has ~50k tokens, so this is ~50k times slower than forward sampling.)
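A toy sketch of that brute-force idea, assuming a tiny made-up bigram model stands in for the forward LM. `BIGRAM`, `VOCAB`, and both function names are hypothetical. The point is only that choosing one preceding token costs one forward scoring pass per vocabulary entry.

```python
import math

# Toy forward (left-to-right) bigram LM; all probabilities here are made up.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.3, "cat": 0.1},
    "the": {"the": 0.1, "a": 0.2, "cat": 0.7},
    "a":   {"the": 0.2, "a": 0.3, "cat": 0.5},
    "cat": {"the": 0.4, "a": 0.4, "cat": 0.2},
}
VOCAB = ["the", "a", "cat"]

def sequence_log_prob(tokens):
    """Log-likelihood of a token sequence under the forward model."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += math.log(BIGRAM[prev][tok])
        prev = tok
    return logp

def best_previous_token(suffix):
    """Score every vocab entry as the token preceding `suffix`; keep the best.

    This needs len(VOCAB) forward evaluations for a single backward step.
    """
    scores = {cand: sequence_log_prob([cand] + suffix) for cand in VOCAB}
    return max(scores, key=scores.get), scores

token, scores = best_previous_token(["cat"])
print(token, scores)
```

With a 3-token vocabulary the overhead is trivial; with a vocabulary of ~50k tokens, each backward step would need ~50k forward scorings. Sampling from the normalized scores instead of taking the maximum would be the analogue of ordinary sampling.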

There are some LMs that naturally work in both directions.  Namely, masked language models (eg BERT), as opposed to causal language models (eg GPT).  Rather than taking a substring as input, masked language models take a complete string, but with some positions randomly masked or corrupted, and it's trained to undo these changes.

However, these models are mostly used for things other than text generation; it's possible to make them write text, but the resulting text tends to be lower-quality than what you can get from a comparably sized GPT.

Comment by nostalgebraist on larger language models may disappoint you [or, an eternally unfinished draft] · 2021-11-28T18:15:10.682Z · LW · GW

When I talk about self-contradiction in this post, I'm talking about the model contradicting itself in the span of a single context window.  In other words, when the contradicting fact is "within the last n tokens."

Comment by nostalgebraist on My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage) · 2021-10-21T04:13:40.760Z · LW · GW

But on a set of very reasonable priors, we would expect your most meaningful and spiritually significant head-moment to be correlated with and causally linked to some kind of unusual thing happening outside your head.  An activity, an interaction with other people, a novel observation.

This doesn't feel plausible at all to me.  (This is one of two key places where I disagree with your framing)

Like, this is a huge category: "experiences that don't involve anything unusual happening around you."  It includes virtually all of the thinking we do -- especially the kind of thinking that demands concentration.  For most (all?) of us, it includes moments of immense terror and immense joy.  Fiction writers commonly spend many hours in this state, "just sitting there" and having ideas and figuring out how they fit together, before they ever commit a single word of those ideas to (digital) paper.  The same goes for artists of many other kinds.  This is where theorems are proven, where we confront our hidden shames and overcome them, (often) where we first realize that we love someone, or that we don't love someone, or . . .

The other place where I disagree with your framing: it seems like you are modeling human minds at a kind of coarse resolution, where people have mostly-coherent beliefs, with a single global "map" or world model that all the beliefs refer to,  and the beliefs have already been (at least approximately) "updated" to reflect all the person's actual experiences, etc.

That coarse-grained model is often helpful, but in this case, I think things make more sense if you "zoom in" and model human minds as very complicated bundles of heuristics, trying to solve a computationally expensive problem in real time, with lots of different topic-specific maps that sometimes conflict, and a lot of reliance on simplifying assumptions that we don't always realize we're making.

And indeed, this is much of why (just) thinking can be so interesting and meaningful: it gives us the ability to process information slower than realtime, digesting it with less aggressive reliance on cheap heuristics.  We "turn things over in our heads," disabling/re-enabling different heuristics, flipping through our different maps, etc.

I think a part of what psychedelics do is to produce a more intense version of "turning things over in one's head," disabling some of the more-ingrained heuristics that you usually forget about, getting you to apply a style of thinking X to a topic Y when you'd always normally think of Y in style Z, changing which things you mentally bin together vs. split apart.  This can yield real insights that are outside of your normal "search space," but even if not, it exposes you to a lot of potential ways of looking at things that you can use later if you deem them valuable.

(I have used psychedelics a number of times, and I have the impression that some of this use led to personal growth, although it might have been growth that would have occurred soon anyway.  I did find these experiences "meaningful," mostly in a way unrelated to "having breakthroughs" or "learning/realizing things" during the experience -- more to do with the cognitive/emotional presentation-of-new-possibilities I described in the previous paragraph.  And for the "art-like" aspect of the experience, the way I'd call a moving work of fiction or music "meaningful to me.")

Comment by nostalgebraist on My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage) · 2021-10-18T00:11:33.554Z · LW · GW

As far as I can tell, normal corporate management is much worse than Leverage

Your original post drew a comparison between MIRI and Leverage, the latter of which has just been singled out for intense criticism.

If I take the quoted sentence literally, you're saying that "MIRI was like Leverage" is a gentler critique than "MIRI is like your regular job"?

If the intended message was "my job was bad, although less bad than the jobs of many people reading this, and instead only about as bad as Leverage Research," why release this criticism on the heels of a post condemning Leverage as an abusive cult?  If you believe the normally-employed among LessWrong readers are being abused by sub-Leverage hellcults, all the time, that seems like quite the buried lede!

Sorry for the intense tone, it's just ... this sentence, if taken seriously, reframes the entire post for me in a big, weird, bad way.

Comment by nostalgebraist on My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage) · 2021-10-17T22:43:46.198Z · LW · GW

I didn't mention it in the comment, but having a larger pool of researchers is not only useful for doing "ordinary" work in parallel -- it also increases the rate at which your research community discovers and accumulates outlier-level, irreplaceable genius figures of the Euler/Gauss kind.

If there are some such figures already in the community, great, but there are presumably others yet to be discovered.  That their impact is currently potential, not actual, does not make its sacrifice any less damaging.

Comment by nostalgebraist on My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage) · 2021-10-17T21:01:48.911Z · LW · GW

First, thank you for writing this.

Second, I want to jot down a thought I've had for a while now, and which came to mind when I read both this and Zoe's Leverage post.

To me, it looks like there is a recurring phenomenon in the rationalist/EA world where people...

• ...become convinced that the future is in their hands: that the fate of the entire long-term future ("the future light-cone") depends on the success of their work, and the work of a small circle of like-minded collaborators
• ...become convinced that (for some reason) only they, and their small circle, can do this work (or can do it correctly, or morally, etc.) -- that in spite of the work's vast importance, in spite of the existence of billions of humans and surely at least thousands with comparable or superior talent for this type of work, it is correct/necessary for the work to be done by this tiny group
• ...become less concerned with the epistemic side of rationality -- "how do I know I'm right? how do I become more right than I already am?" -- and more concerned with gaining control and influence, so that the long-term future may be shaped by their own (already-obviously-correct) views
• ...spend more effort on self-experimentation and on self-improvement techniques, with the aim of turning themselves into a person capable of making world-historic breakthroughs -- if they do not feel like such a person yet, they must become one, since the breakthroughs must be made within their small group
• ...become increasingly concerned with a sort of "monastic" notion of purity or virtue: some set of traits which few-to-no people possess naturally, which are necessary for the great work, and which can only be attained through an inward-looking process of self-cultivation that removes inner obstacles, impurities, or aversive reflexes ("debugging," making oneself "actually try")
• ...suffer increasingly from (understandable!) scrupulosity and paranoia, which compete with the object-level work for mental space and emotional energy
• ...involve themselves in extreme secrecy, factional splits with closely related thinkers, analyses of how others fail to achieve monastic virtue, and other forms of zero- or negative-sum conflict which do not seem typical of healthy research communities
• ...become probably less productive at the object-level work, and at least not obviously more productive, and certainly not productive in the clearly unique way that would be necessary to (begin to) justify the emphasis on secrecy, purity, and specialness

I see all of the above in Ziz's blog, for example, which is probably the clearest and most concentrated example I know of the phenomenon.  (This is not to say that Ziz is wrong about everything, or even to say Ziz is right or wrong about anything -- only to observe that her writing is full of factionalism, full of concern for "monastic virtue," much less prone to raise the question "how do I know I'm right?" than typical rationalist blogging, etc.)  I got the same feeling reading about Zoe's experience inside Leverage.  And I see many of the same things reported in this post.

I write from a great remove, as someone who's socially involved with parts of the rationalist community, but who has never even been to the Bay Area -- indeed, as someone skeptical that AI safety research is even very important!  This distance has the obvious advantages and disadvantages.

One of the advantages, I think, is that I don't even have inklings of fear or scrupulosity about AI safety.  I just see it as a technical/philosophical research problem.  An extremely difficult one, yes, but one that is not clearly special or unique, except possibly in its sheer level of difficulty.

So, I expect it is similar to other problems of that type.  Like most such problems, it would probably benefit from a much larger pool of researchers: a lot of research is just perfectly-parallelizable brute-force search, trying many different things most of which will not work.

It would be both surprising news, and immensely bad news, to learn that only a tiny group of people could (or should) work on such a problem -- that would mean applying vastly less parallel "compute" to the problem, relative to what is theoretically available, and that when the problem is forbiddingly difficult to begin with.

Of course, if this were really true, then one ought to believe that it is true.  But it surprises me how quick many rationalists are to accept this type of claim, on what looks from the outside like very little evidence.  And it also surprises me how quickly the same people accept unproven self-improvement techniques, even ideas that look like wishful thinking ("I can achieve uniquely great things if I just actually try, something no one else is doing..."), as substitutes for what they lose by accepting insularity.  Ways to make up for the loss in parallel compute by trying to "overclock" the few processors left available.

From where I stand, this just looks like a hole people go into, which harms them while -- sadly, ironically -- not even yielding the gains in object-level productivity it purports to provide.  The challenge is primarily technical, not personal or psychological, and it is unmoved by anything but direct attacks on its steep slopes.

(Relevant: in grad school, I remember feeling envious of some of my colleagues, who seemed able to do research easily, casually, without any of my own inner turmoil.  I put far more effort into self-cultivation, but they were far more productive.  I was, perhaps, "trying hard to actually try"; they were probably not even trying, just doing.  I was, perhaps, "working to overcome my akrasia"; they simply did not have my akrasia to begin with.

I believe that a vast amount of good technical research is done by such people, perhaps even the vast majority of good technical research.  Some AI safety researchers are like this, and many people like this could do great AI safety research, I think; but they are utterly lacking in "monastic virtue" and they are the last people you will find attached to one of these secretive, cultivation-focused monastic groups.)

Comment by nostalgebraist on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-17T03:05:02.260Z · LW · GW

I agree with the critiques you make of specific papers (in section 2), but I'm less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.

IMO, "underclaiming" is ubiquitous in academic papers across many fields -- including fields unrelated to NLP or ML, and fields where there's little to no hype to manage.  Why do academics underclaim?  Common reasons include:

1. An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method.
Anyone who's read papers in ML, numerical analysis, statistical inference, computer graphics, etc. is familiar with this phenomenon; there's a reason this tweet is funny.
2. An incentive to frame one's own work as solving a real, practically relevant problem which is not adequately addressed by existing approaches.  This is related to #1, but tends to affect motivating discussion, whereas #1 affects the presentation of results.
3. General sloppiness about citations.  Academics rarely do careful background work on the papers they cite, especially once it becomes "conventional" to cite a particular paper in a particular context.  Even retracted papers often go on being cited year after year, with no mention made of the retraction.

I suspect 1+2+3 above, rather than hype management, explains the specific mistakes you discuss.

For example, Zhang et al 2020 seems like a case of #2.  They cite Jia and Liang as evidence about a problem with earlier models, a problem they are trying to solve with their new method.  It would be strange to "manage hype" by saying NLP systems can't do X, and then in the same breath present a new system which you claim does X!

Jang and Lukasiewicz (2021) is also a case of #2, describing a flaw primarily in order to motivate their own proposed fix.

Meanwhile, Xu et al 2020 seems like #3: it's a broad review paper on "adversarial attacks" which gives a brief description of Jia and Liang 2017 alongside brief descriptions of many other results, many of them outside NLP.  It's true that the authors should not have used the word "SOTA" here, but it seems more plausible that this is mere sloppiness (they copied other, years-old descriptions of the Jia and Liang result) rather than an attempt to push a specific perspective about NLP.

I think a more useful framing might go something like:

• We know very little about the real capabilities/limits of existing NLP systems.  The literature does not discuss this topic with much care or seriousness; people often cite outdated results, or attach undue significance to correct-but-narrow philosophical points about limitations.
• This leads to some waste of effort, as people work on solving problems that have already been solved (like trying to "fix" Jia and Liang issues as if it were still 2017).  Note that this is a point NLP researchers ought to care about, whether they are interested in AI safety or not.
• This is also bad from an AI safety perspective.
• We should study the capabilities of existing systems, and the likely future trajectory of those capabilities, with more care and precision.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2021-05-01T01:27:16.410Z · LW · GW

Ah, I think we miscommunicated.

I meant "gelu(x) achieves its maximum curvature somewhere near x=0."

People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|.  In this sense gelu is like relu.

It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.

This is indeed unique to relu -- I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops.  Obviously you can't simulate a differentiable function with that trick.
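A quick numerical check of the "maximum curvature somewhere near x=0" claim, as a sketch.  It assumes the exact (erf-based) definition gelu(x) = x * Phi(x), with Phi the standard normal CDF, and uses the standard plane-curve curvature formula; the grid range and spacing are arbitrary choices.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * Phi(x)

def gelu_d1(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    return Phi(x) + x * phi(x)

def gelu_d2(x):
    # second derivative: 2 * phi(x) + x * phi'(x) = phi(x) * (2 - x^2)
    return phi(x) * (2.0 - x * x)

def curvature(x):
    # curvature of the graph y = gelu(x)
    return abs(gelu_d2(x)) / (1.0 + gelu_d1(x) ** 2) ** 1.5

grid = [i / 100.0 for i in range(-500, 501)]
x_max_d2 = max(grid, key=lambda x: abs(gelu_d2(x)))
x_max_curv = max(grid, key=curvature)
```

On this grid the second derivative peaks at x=0, the geometric curvature peaks slightly to the left of 0, and both are tiny for large |x|, where gelu is nearly linear.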

Comment by nostalgebraist on GPT-3: a disappointing paper · 2021-04-26T21:37:51.589Z · LW · GW

I'm confused -- the paper you link is not about better prompts for GPT-3.  It's about a novel fine-tuning methodology for T5.  GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.

The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings.

Because of this, I sometimes refer to GPT-3 as "quantifying the cost (in additional scale) imposed by choosing a GPT-style model."  That is, the following should be roughly competitive w/ each other:

• BERT/T5 at param count N
• GPT at param count ~100 * N

See my comments near the bottom here.

Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design.

Your comment claims that the "SOTA" within that line of work is close to the overall SOTA on SuperGLUE -- which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks.  However, I'd need to see a reference that actually establishes this.

Comment by nostalgebraist on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-03-04T06:27:39.237Z · LW · GW

Most complexity measures give roughly similar values for the (relative) complexity of most objects

I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those for space B on the translated B-objects.  Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting the exactly right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory.

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R:

1. The standard normal,
2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R

According to basic information theory, these are equally simple/compressible.  (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)

But in rate-distortion theory, (1) is way more simple/compressible than (2).  If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1].  Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.

If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities.  When you remove the metric, they're identical.  So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.
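Here is a small discretized sketch of the chopped-up normal example.  It is a stand-in, not a real rate-distortion calculation: it uses far fewer than a trillion pieces, measures "metric-free" complexity by the entropy of piece occupancy, and measures "metric-aware" cost by the variance (the squared error of the best constant guess).  The piece width, the spacing, and all helper names are assumptions.

```python
import math
import random
from collections import Counter

WIDTH = 0.1                # width of each piece on the x axis
K_MIN, K_MAX = -100, 99    # pieces covering [-10, 10)
SPACING = 1000.0           # hypothetical gap between relocated pieces

def piece(x):
    """Index of the piece containing x (clamped to the covered range)."""
    return max(K_MIN, min(K_MAX, math.floor(x / WIDTH)))

def entropy_bits(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
slots = list(range(K_MAX - K_MIN + 1))
rng.shuffle(slots)
slot_of = {k: slots[k - K_MIN] for k in range(K_MIN, K_MAX + 1)}

def scramble(x):
    """Translate the piece containing x to its own far-away slot."""
    k = piece(x)
    return slot_of[k] * SPACING + (x - k * WIDTH)

samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
scrambled = [scramble(x) for x in samples]

# Metric-free view: only the probabilities of the pieces matter.
h_plain = entropy_bits(Counter(piece(x) for x in samples))
h_scrambled = entropy_bits(Counter(round(y / SPACING) for y in scrambled))

# Metric-aware view: how far off a crude reconstruction can be.
var_plain = variance(samples)
var_scrambled = variance(scrambled)
```

The two entropies agree, since scrambling only relabels the pieces, while the variances differ by many orders of magnitude.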

Comment by nostalgebraist on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-02-22T05:22:36.922Z · LW · GW

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.

Consider:

• If $A, B$ are finite sets, then there are a finite number of functions $f: A \to B$.   Let's write $B^A$ for the finite set of such functions.
• The authors view the counting measure on $B^A$ -- where every function is equally likely -- as "unbiased."
• This choice makes sense if $A, B$ are truly unstructured collections of objects with no intrinsic meaning.
• However, if there is some extra structure on them like a metric, it's no longer clear that "all functions are equally likely" is the right reference point.
• Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of $B^A$ relative to the counting measure.
• Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (eg continuous ones) will look like a simplicity prior.

This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.

Let's see how this goes wrong in the Shannon entropy argument from this paper.

• The authors consider (a quantity equivalent to) the fraction of inputs in $A$ for which a given function outputs $1$.
• They consider a function simpler if this fraction is close to 1 or 0, because then it's easier to compress.
• With the counting measure, "most" functions output $1$ about half of the time.  (Like the binomial distribution -- there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
• To learn binary functions with an NN, they encode the inputs as binary vectors.  They study what happens when you feed these to either (A) a linear model, or (B) a ReLU stack, with random weights.
• It turns out that the functions expressed by these models are much more likely than the counting measure to assign a single label ($0$ or $1$) to most inputs.
• Why?
• For a random function on an input space of size $2^n$, you need to roll $2^n$ independent random variables.  Each roll affects only one input element.
• But when you encode the inputs as vectors of length $n$ and feed them into a model, the layers of the model have weights that are also $n$-vectors.  Each of their components affects many input elements at once, in the same direction.  This makes it likely for the judgments to clump towards $0$ or $1$.
• For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to $1$.  This happens a fraction $2^{-n}$ of the time.  But only one boolean function maps every input to $1$, so the counting measure would give this probability $2^{-2^n}$.
• This doesn't seem like a special property of neural nets.  It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that "use" the structure in their labeling rule.  "Using" the structure means any decision you make about how to treat one input element has implications for others (because they're close to it, or point in the same direction, or something).  Thus you have fewer independent decisions to make, and there's a higher probability they all push in the same direction.
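The all-positive-weights example can be checked by brute force for small n.  A minimal sketch, assuming Gaussian random weights, a threshold at zero (label 1 when w·x >= 0), and n = 4; the function names are made up for illustration and this is not the paper's setup.

```python
import itertools
import random

def linear_labels(weights, inputs):
    """Label each boolean input 1 if w.x >= 0, else 0 (threshold at zero)."""
    return tuple(int(sum(w * x for w, x in zip(weights, xs)) >= 0) for xs in inputs)

def fraction_all_ones(n, trials, rng):
    """Fraction of random linear models that map every boolean input to 1."""
    inputs = list(itertools.product([0, 1], repeat=n))
    hits = 0
    for _ in range(trials):
        weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if all(linear_labels(weights, inputs)):
            hits += 1
    return hits / trials

n = 4
rng = random.Random(0)
observed = fraction_all_ones(n, 20000, rng)
all_positive = 2.0 ** -n       # chance that every weight is positive
counting = 2.0 ** -(2 ** n)    # probability of one specific function under the counting measure
```

Here the constant-1 function shows up in roughly a 2^-n fraction of draws, thousands of times more often than the 2^-(2^n) the counting measure would assign it, without anything neural-net-specific going on.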

Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity.  Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).

"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way.  What does it mean for a function to respond to binary vectors in a predictable way?  It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them.  But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-26T17:31:30.885Z · LW · GW

I don't think this step makes sense:

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.

In the picture, it looks like there's something special about having a 1:1 ratio of data to params.  But this is a coincidence due to the authors' choice of units.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens.  If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!

To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps.  This depends on your choice of units.  And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems "have the same scaling law."  Scaling is about relationships between differences, not relationships between absolute magnitudes.
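To make the units point concrete, here is a sketch that locates the cusp (the point where the two terms of the loss are equal) under two definitions of "one data point."  It assumes the Kaplan et al form L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D; the constants are illustrative assumptions, and `balanced_data` is a hypothetical helper.

```python
import math

ALPHA_N = 0.076   # assumed exponents, for illustration only
ALPHA_D = 0.103
N_C = 6.4e13      # assumed scale constant for parameters
D_C = 1.8e13      # assumed scale constant for data, measured in tokens

def balanced_data(n_params, d_c):
    """Data size at which (N_c/N)^(alpha_N/alpha_D) equals D_c/D."""
    return d_c * (n_params / N_C) ** (ALPHA_N / ALPHA_D)

TOKENS_PER_CONTEXT = 1e3   # "one data point" = one forward pass of ~1e3 tokens
n = 1e15

d_tokens = balanced_data(n, D_C)                         # data counted in tokens
d_contexts = balanced_data(n, D_C / TOKENS_PER_CONTEXT)  # data counted in contexts

ratio_tokens = d_tokens / n
ratio_contexts = d_contexts / n

# How the cusp moves when N grows 10x, in each unit system.
growth_tokens = balanced_data(10 * n, D_C) / d_tokens
growth_contexts = balanced_data(10 * n, D_C / TOKENS_PER_CONTEXT) / d_contexts
```

The D/N ratio at the cusp changes by a factor of 1e3 when the unit changes, while the growth factor (the thing the scaling law actually pins down) is the same in both unit systems.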

On the larger topic, I'm pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for "a data point" is.  This is mostly for "Could a Neuroscientist Understand a Microprocessor?"-type reasons.  I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T23:28:51.776Z · LW · GW

Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.

Here is what the actual visualization looks like.  More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.

https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png

In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:

• If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis.  That is, in this regime, N doesn't matter and loss is effectively a function of D alone.
• This is L(D).
• It looks like the color changes you see if you move horizontally through the upper left region.
• Likewise, in the lower right region, D doesn't matter and loss depends on N alone.
• This is L(N).
• It looks like the color changes you see if you move vertically through the lower right region.

To restate my earlier claims...

If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower).  So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).

This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.

On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive.  For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.

When I said that it's intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach.  And that's going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.

Asking "what could we do with a N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot.  It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region ... or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.

In Ajeya's work, this question means "let's assume we're using an N=1e15 model, and then let's assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let's figure out how big D has to be to get there."

So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".

What feels weird to me -- which you touched on above -- is the way this lets the scaling relations "backseat drive" the definition of sufficient quality for AGI.  Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it... we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T16:51:41.514Z · LW · GW

You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

This is a subtle and confusing thing about the Kaplan et al papers.  (It's also the subject of my post that I linked earlier, so I recommend you check that out.)

There are two things in the papers that could be called "optimal compute budgeting" laws:

• A law that assumes a sufficiently large dataset (ie effectively infinite dataset), and tells you how to manage the tradeoff between steps $S$ and params $N$.
• The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size $D$ vs params $N$.

I said the $D$ vs $N$ law was "not a heuristic for managing compute" because the $S$ vs $N$ law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.

However, the $D$ vs $N$ law does tell you about how to spend compute in an indirect way, for the exact reason you say, that $D$ is related to how long you train.  Comparing the two laws yields the "breakdown" or "kink point."

Do you agree or disagree? ... I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?

Sorry, why do you expect I disagree?  I think I agree.  But also, I'm not really claiming the scaling laws say or don't say anything about the brain, I'm just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems).  We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.

Perhaps it would help me if I could visualize it in two dimensions

This part is 100% qualitatively accurate, I think.  The one exception is that there are two "optimal compute" lines on the plot with different slopes, for the two laws referred to above.  But yeah, I'm saying we won't be on either of those lines, but on the L(N) or the L(D) line.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T02:35:48.985Z · LW · GW

The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.

The scaling laws from the Kaplan et al papers do tell you this.

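The relevant law is L(N, D), for the early-stopped test loss given parameter count N and data size D.  It has the functional form

$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$

with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.103$.

The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.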

This is not exactly a heuristic for managing compute (since D is not dependent on compute, it's dependent on how much data you can source).  It's more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.

You always can train models that are "too large" on datasets that are "too small" according to the heuristic, and they won't diverge or do poorly or anything.  They just won't improve much upon the results of smaller models.

In terms of the above, you are setting N and then asking what D ought to be.  If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected."  Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to that N rather than using a smaller model to get almost identical performance.
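To make the "too large a model on too small a dataset" point concrete, here is a minimal sketch that evaluates the Kaplan et al form of L(N, D). The constants are the fitted values as I recall them from that paper (treat them as assumptions, not as numbers from this comment), and the function names are made up for illustration.

```python
# Assumed fit constants for L(N, D), recalled from Kaplan et al (not stated in the comment).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13


def loss(n_params, n_tokens):
    """Early-stopped test loss L(N, D) under the assumed functional form."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def balanced_data(n_params):
    """Dataset size at which the two terms are equal, i.e. D proportional to N^(alpha_N/alpha_D)."""
    return D_C * (n_params / N_C) ** (ALPHA_N / ALPHA_D)


# Growing the model 1000x on a small dataset barely moves the loss...
small_data = 1e9
print(loss(1e9, small_data), loss(1e12, small_data))

# ...while the same 1000x growth on a large dataset helps a lot.
large_data = 1e13
print(loss(1e9, large_data), loss(1e12, large_data))

print(f"balanced D for N=1e12: {balanced_data(1e12):.3g}")
```

Nothing diverges in the small-data case; the larger model just fails to improve much on the smaller one, which is all the heuristic is warning about.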

I find it more intuitive to think about the following, both discussed in the papers:

• L(D), the N → ∞ limit of L(N, D)
• meaning: the peak data efficiency possible with this model class
• L(N), the D → ∞ limit of L(N, D)
• meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
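As a sketch of where these two limits come from, assume the Kaplan et al form $L(N, D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}$, where $N_c$ and $D_c$ are fitted constants not given in this comment. Dropping whichever term vanishes leaves a pure power law in the other variable:

```latex
\begin{align*}
L(D) &= \lim_{N \to \infty} L(N, D)
      = \left( \frac{D_c}{D} \right)^{\alpha_D} \\
L(N) &= \lim_{D \to \infty} L(N, D)
      = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} \right]^{\alpha_D}
      = \left( \frac{N_c}{N} \right)^{\alpha_N}
\end{align*}
```

So each limit is governed by a single exponent, which is part of why I find them easier to reason about than the balanced regime.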

If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.

Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold).  Ajeya's approach essentially assumes that we'll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.

I'm not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits that threshold.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2021-01-16T01:54:02.450Z · LW · GW

I wrote this post about a year ago.  It now strikes me as an interesting mixture of

1. Ideas I still believe are true and important, and which are (still) not talked about enough
2. Ideas that were plausible at the time, but are much less so now
3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time

In category 1 (true, important, not talked about enough):

• GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
• Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do.  Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
• Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.

In category 2 (plausible then but not now):

• "The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried."
• I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020.
• The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data
• The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math
• I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties
• I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence.
• I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.

In category 3 (misleading):

• I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
• I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
• ConvNets do bake in (a very simple kind of) prior knowledge.
• But, though LSTMs and transformers are more "structured" than fully connected nets, the structure is not intended to encode prior knowledge.
• Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
• I was aware of the vast gap between "more structure than the literal minimum possible" and "the kind of structure Marcus wanted," but conflated the two.  Possibly because I thought the resulting irony was appealing, and/or because it suggested the disagreement was illusory, which was emotionally appealing.

In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.

Comment by nostalgebraist on Fourth Wave Covid Toy Modeling · 2021-01-10T22:02:50.680Z · LW · GW

Rt can go below one in Zvi's model.  It just takes an even higher rate of new infections.

Here's the same picture, with the horizontal axis extended so this is visible: https://64.media.tumblr.com/008005269202c21313ef5d5db6a8a4c6/83a097f275903c4c-81/s2048x3072/7b2e6e27f1fb7ad57ac0dcc6bd61fce77a18a2c1.png

Of course, in the real world, Rt dips below one all the time, as you can see in the colored points.

As a dramatic example, Zvi's model is predicting the future forward from 12/23/20.  But a mere week before that date, Rt was below one!

Comment by nostalgebraist on Fourth Wave Covid Toy Modeling · 2021-01-07T23:31:13.423Z · LW · GW

Thanks!  This is exactly the kind of toy model I thought would help move these discussions forward.

The part I'm most suspicious of is the model of the control system.  I have written a Colab notebook exploring the issue in some detail, but briefly:

• If you run the control system model on the past (2020), it vastly over-predicts R.
• This is true even in the very recent past, when pandemic fatigue should have "set in."
• Of course, by your assumptions, it should over-predict past R to some extent.  Because we now have pandemic fatigue, and didn't then.
• However:
• It seems better to first propose a model we know can match past data, and then add a tuning term/effect for "pandemic fatigue" for future prediction.
• Because this model can't predict even the very recent past, it's not clear it models anything we have observed about pandemic fatigue (ie the observations leading us to think pandemic fatigue is happening).
• Instead, it effectively assumes a discontinuity at 12/23/20, where a huge new pandemic fatigue effect turns on.  This effect only exists in the future; if it were turned on in the past, it would have swamped all other factors.

To get a sense of scale, here is one of the plots from my notebook:

https://64.media.tumblr.com/823e3a2f55bd8d1edb385be17cd546c7/673bfeb02b591235-2b/s640x960/64515d7016eeb578e6d9c45020ce1722cbb6af59.png

The colored points show historical data on R vs. the 6-period average, with color indicating the date.

• The first thing that stands out is that these two variables are not even approximately in a one-to-one relationship.
• The second thing that stands out is that, if you were to fit some one-to-one relationship anyway, it would be very different from the toy model here.
• Third thing: the toy model's baseline R is anchored to the "top of a hill" on a curve that has been oscillating quickly.  With an exponent of zero, it would stay stuck at the top of the recent hills, i.e. it would still over-predict the recent past.  (With a positive exponent, it shoots above those hills.)

More general commentary on the issue:

• It seems like you are
1. ... first, assuming that the control system sets R to infections
2. ... then, observing that we still have R~1 (as always), despite a vast uptick in infections
3. ... then, concluding that the control system has drastically changed all of a sudden, because that's the only way to preserve the assumption (1)
• Whereas, it seems more natural to take (3) as evidence that (1) was wrong.

In other words, you are looking at a mostly constant R (with a slight sustained recent upswing), and concluding that this lack of a change is actually the result of two large changes that cancel out:

1. Control dynamics that should make R go down
2. A new discontinuity in control dynamics that conspires to exactly cancel #1, preserving a ~constant R

When R has been remarkably constant the whole time, I'm suspicious of introducing a sudden "blast" of large changes in opposing directions that net out to R still staying constant.  What evidence is there for this "blast"?

(The recent trajectory of R is not evidence for it, as discussed above: it's impossible to explain recent R with these forces in play.  They have to have suddenly appeared, like a mean Christmas present.)

My model of the R/cases trends is something like:

• "R is always ~1 with noise/oscillations"
• "cases are exponential in R, so when the noise/oscillations conspire upwards for a while, cases blow up"

The missing piece is what sets the noise/oscillations, because if we can control that we can help.  However, any model of the noise/oscillations must calibrate them so it reproduces 2020's tight control around R~1.

This tight control was a surprise and is hard to reproduce in a model, but if our model doesn't reproduce it, we will go on being surprised by the same thing that surprised us before.
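A minimal sketch of the "R is always ~1, but cases are exponential in R" picture, assuming the simplest possible update rule (cases multiply by R once per generation). The function name and the particular R values are made up for illustration; this is not Zvi's model or the notebook's.

```python
def simulate_cases(initial_cases, r_values):
    """Multiply cases by R once per generation and return the whole trajectory."""
    trajectory = [initial_cases]
    for r in r_values:
        trajectory.append(trajectory[-1] * r)
    return trajectory


# R stays within 10% of 1 in both runs; only the ordering of the wobbles differs.
alternating = [1.1, 1 / 1.1] * 5        # ups and downs interleaved
sustained = [1.1] * 5 + [1 / 1.1] * 5   # five ups in a row, then five downs

print(max(simulate_cases(1000, alternating)))  # peak stays near 1100
print(max(simulate_cases(1000, sustained)))    # peak is roughly 1610
```

Both runs end where they started and never leave the neighborhood of R = 1, but the run where the upward wobbles line up has a much higher peak. That is the sense in which the thing to model is what sets the noise/oscillations.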

Comment by nostalgebraist on DALL-E by OpenAI · 2021-01-05T22:51:07.839Z · LW · GW

Very interesting!

The approach to images here is very different from Image GPT.  (Though this is not the first time OpenAI has written about this approach -- see the "Image VQ" results from the multi-modal scaling paper.)

In Image GPT, an image is represented as a 1D sequence of pixel colors.  The pixel colors are quantized to a palette of size 512, but still represent "raw colors" as opposed to anything more abstract.  Each token in the sequence represents 1 pixel.

In DALL-E, an image is represented as a 2D array of tokens from a latent code.  There are 8192 possible tokens.  Each token in the sequence represents "what's going on" in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).

(Caveat: The mappings from pixels-->tokens and tokens-->pixels are contextual, so a token can influence pixels outside "its" 8x8 region.)

This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT.  Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training.  Like BPE, it chunks the "raw" signal (pixels here, characters in BPE) into larger, more meaningful units.

This is like a vocabulary of 8192 "image words."  DALL-E "writes" a 32x32 array of these image words, and then a separate network "decodes" this discrete array to a 256x256 array of pixel colors.
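The arithmetic behind these numbers, as a quick sketch. The variable names are mine, and the one-token-per-pixel comparison is a hypothetical encoding of the same 256x256 image, not a claim about the resolution Image GPT actually used.

```python
import math

image_side = 256    # pixels per side
grid_side = 32      # latent codes per side
vocab_size = 8192   # possible values for each latent token

pixels_per_token_side = image_side // grid_side   # side of the region each token "owns"
dalle_tokens = grid_side ** 2                     # sequence length for one image
per_pixel_tokens = image_side ** 2                # length if every pixel were its own token
bits_per_token = math.log2(vocab_size)            # information capacity of one token

print(pixels_per_token_side, dalle_tokens, per_pixel_tokens, bits_per_token)
```

The latent code shortens the sequence by a factor of 64 relative to one token per pixel, which is the "head start" that chunking buys.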

Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text.  Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.

As with BPE, you get a head start over modeling the raw signal.  As with BPE, the chunking may ultimately be a limiting factor.  Although the chunking process here is differentiable (a neural auto-encoder), so it ought to be adaptable in a way BPE is not.

(Trivia: I'm amused that one of their visuals allows you to ask for images of triangular light bulbs -- the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)

Comment by nostalgebraist on Covid 12/31: Meet the New Year · 2021-01-05T03:48:30.517Z · LW · GW

Many of the same thoughts were in my mind when I linked that study on the previous post.

----

IMO, it would help clarify arguments about the "control system" a lot to write down the ideas in some quantitative form.

As I wrote here:

I always see [rates of compliance, lockdown fatigue, which kinds of restrictions are actually followed, etc.] discussed in very qualitative, intuitive terms.  We talk of cases, tests, fatality rates, and reproduction numbers quantitatively.  We look at tables and charts of these numbers, we compare projections of them.

But when the conversation turns to lockdown compliance, the numbers vanish, the claims range over broad and poorly specified groups (instead of percentages and confidence intervals we get phrases like “most people,” or merely “people”), and everything is (as far as I can tell) based on gut feeling.

Even a simple toy model could help, by separating intuitions about the mechanism from those about outcomes.  If someone argues that a number will be 1000x or 0.001x the value the toy model would predict, that suggests either

• (a) the number is wrong or
• (b) the toy model missed some important factor with a huge influence over the conclusions one draws

Either (a) or (b) would be interesting to learn.

----

One basic question I don't feel I have the answer to: do we know anything about how powerful the control system is?

Roughly, "the control system" is an explanation for the fact that R stays very close to 1 in many areas. It oscillates up and down, but it never gets anywhere near as low as 0, or anywhere near as high as the uncontrolled value of ~4.5.

As long as this trend holds, it's like we're watching the temperature of my room when I've got the thermostat set to 70F.  Sure enough, the temperature stays close to 70F.

This tells you nothing about the maximum power of my heating system.  In colder temperatures, it'd need to work harder, and at some low enough temperature T, it wouldn't be able to sustain 70F inside.  But we can't tell what that cutoff T is until we reach it.  "The indoor temperature right now oscillates around 70F" doesn't tell you anything about T.

Doesn't this argument work just as well for the "control system"?  A toy model could answer that question.
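Here is a minimal sketch of the thermostat version of that toy model, assuming a heater that can raise the indoor temperature by at most some fixed number of degrees above the outdoor temperature. The function and the `max_heating` parameter are hypothetical, chosen only to make the argument checkable.

```python
def indoor_temp(outdoor, setpoint=70.0, max_heating=40.0):
    """Steady-state indoor temperature (F) for a heater with limited power.

    The heater holds the setpoint if it can; otherwise it runs flat out and
    the room settles at outdoor + max_heating. It never cools the room.
    """
    return max(outdoor, min(setpoint, outdoor + max_heating))


mild_days = [30, 40, 50, 60]
weak = [indoor_temp(t, max_heating=40.0) for t in mild_days]
strong = [indoor_temp(t, max_heating=80.0) for t in mild_days]
print(weak == strong)  # identical readings: both heaters hold 70F on mild days

# Only a cold enough day separates the two systems.
print(indoor_temp(0, max_heating=40.0), indoor_temp(0, max_heating=80.0))
```

On every mild day the two heaters produce identical observations, so watching the room sit at 70F cannot distinguish them. The analogous question for the "control system" is whether we have ever observed conditions harsh enough to reveal its limit.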

Comment by nostalgebraist on Covid 12/24: We’re F***ed, It’s Over · 2020-12-25T04:01:26.889Z · LW · GW

I'm confused by your pessimism about England's Tier 4 restrictions:

So basically, if you’re outside where it’s safe, they’ll harass you and maybe worse. Whereas if you stay inside, technically it’s not allowed but in practice it’s a lot less likely anything happens to you, unless the anything in question is ‘you catch Covid-19.’ The rules are porous enough that they aren’t enforceable against the things that are risky but enforceable enough to shut down the relatively safe actions that keep people sane. And with weird exceptions for remarkably large indoor gatherings for certain events that are textbook superspreaders.

All of which is what our model expects to see, and none of which seems likely to be remotely sufficient if the new strain is as infectious as they estimate.

Tier 4's bundle of restrictions is almost identical to those from England's "second lockdown" in November.  (See e.g. here.)  But you write as though you believe the "second lockdown" was impactful:

[...] the context of England being under lockdown conditions that had previously turned the tide [...]

How effective are these kinds of measures at controlling things (a) before the new strain and (b) with the new strain?

This heavily discussed paper from Dec 23 addresses question (b), using the same model the authors previously applied to question (a) in this paper.  These papers are worth reading and I won't attempt to summarize them, but some relevant points:

• The authors argued for the "second lockdown" in the 2nd linked paper on the basis of its projected impacts on mobility, thus R, thus etc.
• The 2nd linked paper was later updated with data from November, showing that their model did quite well at predicting the effect on mobility, R, etc.
• The 1st linked paper (on new strain) approximates Tier 4 as being equivalent to "second lockdown" in its effects
• The 1st linked paper (on new strain) is worth reading in its entirety as it provides some (provisional) quantitative backing to intuitions about the impact of various measures (Tier 4 / Tier 4 + school closures / Tier 4 + school closures + XYZ amount of vaccination)

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T00:39:49.199Z · LW · GW

I don't think you're completely missing something.  This is the active learning approach, which gwern also suggested -- see that thread for more.

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T00:36:45.626Z · LW · GW

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Sure -- my point was to contrast two cases

1. a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
2. the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free.  This includes domains the researchers would never have come up with themselves.

If we switch to manually hunting down large specialized datasets, this will definitely help, but we're no longer getting broad domain coverage for free.  At best we get broad domain coverage through manual researcher effort and luck, at worst we don't get it at all.

I see your point about active learning "telling us" when we need more data -- that's especially appealing if it can point us to specific domains where more coverage would help.

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-08T23:19:23.114Z · LW · GW

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?

IIUC, this is trying to make L(D) fall faster by making every data point more impactful (at lowering test loss).  This will help if

1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them
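A minimal sketch of the filtering idea in the quoted question: score each document with a cheap model and drop the ones it already predicts too easily. A character-unigram model stands in for the "small GPT" here, purely so the example is self-contained; the function names and the threshold are hypothetical.

```python
import math
from collections import Counter


def unigram_scorer(corpus):
    """Fit a character-unigram model and return a per-document loss function."""
    counts = Counter(ch for doc in corpus for ch in doc)
    total = sum(counts.values())

    def score(doc):
        # Mean negative log-likelihood per character, in nats.
        return -sum(math.log(counts[ch] / total) for ch in doc) / len(doc)

    return score


def drop_easy_examples(corpus, score, threshold):
    """Keep only documents whose loss under the cheap model exceeds the threshold."""
    return [doc for doc in corpus if score(doc) > threshold]


corpus = ["aaaaaaaa", "aaaaaaab", "abcdefgh"]
score = unigram_scorer(corpus)
print(drop_easy_examples(corpus, score, threshold=1.0))
```

Note that the filter shrinks the dataset, which is exactly why condition (2) matters: it only helps if there is more raw data than you could train on anyway.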

I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).

With text data, though, I'm concerned that (2) will fail soon.

The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc.  The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T05:21:20.726Z · LW · GW

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author.  (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc.  (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far.  If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities?  A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character?  (Note that the character is a much better answer to the question "who does this sound like?")  And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words.  You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world.  And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.
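A toy check of the conservation-of-expected-evidence point, assuming a made-up joint distribution over (next token, token after that). If the predictor's next-token probabilities are already the correct marginals, then "looking ahead" with its own distribution and averaging the resulting posteriors just returns the same numbers.

```python
import math

# Hypothetical joint distribution over (next token, token after that).
joint = {
    ("the", "cat"): 0.3,
    ("the", "dog"): 0.2,
    ("a", "cat"): 0.1,
    ("a", "dog"): 0.4,
}


def next_token_marginal(joint, token):
    """p(next = token), summing out the following token."""
    return sum(p for (nxt, _), p in joint.items() if nxt == token)


def lookahead_average(joint, token):
    """Average of p(next = token | after), weighted by the model's own p(after)."""
    afters = {after for _, after in joint}
    total = 0.0
    for after in afters:
        p_after = sum(p for (_, a), p in joint.items() if a == after)
        p_next_given_after = joint.get((token, after), 0.0) / p_after
        total += p_after * p_next_given_after
    return total


for token in ("the", "a"):
    print(token, next_token_marginal(joint, token), lookahead_average(joint, token))
```

The two columns agree (up to floating point), so lookahead over the model's own beliefs cannot, in expectation, shift a next-step prediction that was already optimal.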

If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet.  But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search would take a lot of parameters to implement internally.

Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T02:47:29.410Z · LW · GW

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
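As a rough illustration of how fast this compounds, assume (unrealistically) that each step's guess is right independently with the same probability. The function name and the numbers are made up; real text is not independent across steps, so this only conveys the shape of the decay.

```python
def chained_accuracy(per_step_accuracy, steps):
    """Chance of matching the true continuation exactly for `steps` tokens in a row."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10, 20):
    print(steps, chained_accuracy(0.5, steps))
```

Even a predictor that gets every other token exactly right has well under a 1% chance of matching the next 10 tokens, under this assumption.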

Weather is like this because of chaotic dynamics.  Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.  So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at.  (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go.  It's more about trying to guess what the author of this text will do in the very next step.  The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.

To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in.  (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

Comment by nostalgebraist on on “learning to summarize” · 2020-09-13T17:58:13.740Z · LW · GW

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)
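A minimal sketch of that parenthetical, assuming undiscounted, deterministic rewards and a hard horizon cutoff. The function name is hypothetical.

```python
def horizon_returns(rewards, horizon):
    """Sum of rewards visible from each step when looking `horizon` steps ahead."""
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]


episode = [0, 0, 0, 0, 1]  # a single reward at the end of a 5-step episode

print(horizon_returns(episode, horizon=5))  # every step sees the final reward
print(horizon_returns(episode, horizon=3))  # the first 2 steps see nothing
print(horizon_returns(episode, horizon=1))  # only the last step sees the reward
```

With the horizon 2 steps shorter than the episode, the first 2 steps get a value of exactly 0, so shortening the horizon throws away signal without buying anything.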

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.

Comment by nostalgebraist on on “learning to summarize” · 2020-09-12T21:06:35.673Z · LW · GW

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

• at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
• when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.
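A minimal sketch of the "most obvious thing," assuming annotations come as a per-token label of +1 (good), -1 (bad), or 0 (unlabeled). The function name and label encoding are hypothetical, and this is not anything from the OpenAI paper.

```python
import math


def annotated_token_loss(token_probs, labels):
    """Loss that falls when 'good' tokens get more probable and 'bad' ones less.

    token_probs: probability the LM assigned to each token that actually appears.
    labels: +1 for tokens marked good, -1 for tokens marked bad, 0 for unlabeled.
    """
    return -sum(label * math.log(p) for p, label in zip(token_probs, labels))


# Suppose the third token is the one marked "bad" (e.g. " Canada" after "Paris, the capital of").
labels = [0, 1, -1]
before = annotated_token_loss([0.9, 0.6, 0.5], labels)
after = annotated_token_loss([0.9, 0.6, 0.1], labels)
print(before, after)  # the loss drops once the bad token becomes less probable
```

One reason to look for better-behaved variants: the "bad" term is unbounded below, since pushing a bad token's probability toward zero lowers this loss without limit.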

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

Comment by nostalgebraist on "Learning to Summarize with Human Feedback" - OpenAI · 2020-09-09T08:20:42.136Z · LW · GW

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)

----

IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

• First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
• Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.

----

As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.

----

The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T16:22:58.045Z · LW · GW

Interesting topic! I'm not confident this lens would reveal much about it (vs. attention maps or something), but it's worth a try.

I'd encourage you to try this yourself with the Colab notebook, since you presumably have more experience writing this kind of prompt than I do.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T00:02:08.712Z · LW · GW

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.

I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?
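A quick numerical illustration of the near-orthogonality point, as a sketch only. It assumes random Gaussian directions as a stand-in for the vectors in question, which is not a claim about how GPT's embeddings are actually arranged. The helper names (`random_unit_vector`, `max_abs_cosine`) and the count of 60 vectors are made up for the example; the count is kept small so the pure-Python loop runs quickly.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and rescale it to L2 norm 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst

rng = random.Random(0)
N_EMB = 1600  # embedding width mentioned above
vectors = [random_unit_vector(N_EMB, rng) for _ in range(60)]
worst = max_abs_cosine(vectors)
print(f"largest |cosine| among {len(vectors)} random directions: {worst:.3f}")
```

Typical cosines between random directions shrink like 1/sqrt(N_emb), so even the worst pair here stays far from parallel. Tolerating a small nonzero cosine is what lets the number of usable directions exceed the dimension.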

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T22:11:25.371Z · LW · GW
One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model.

That's a great idea!

One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.
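The linear-probe suggestion quoted at the top of this comment could be sketched roughly as below, with one probe per layer. Everything here is a toy assumption: the "activations" are synthetic (a per-token direction plus Gaussian noise), the noise levels per layer are invented, and the probe is plain softmax regression trained by SGD. With real GPT-2 activations one would swap in the extracted hidden states for `sample`.

```python
import math
import random

def sample(rng, directions, noise, n):
    """Draw n (activation, token) pairs: token direction plus Gaussian noise."""
    pairs = []
    for _ in range(n):
        token = rng.randrange(len(directions))
        x = [d + rng.gauss(0.0, noise) for d in directions[token]]
        pairs.append((x, token))
    return pairs

def scores(W, x):
    """Linear scores, one per candidate input token."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train_probe(data, n_tokens, dim, epochs=20, lr=0.02):
    """Softmax regression (a linear classifier) fit by plain SGD."""
    W = [[0.0] * dim for _ in range(n_tokens)]
    for _ in range(epochs):
        for x, token in data:
            s = scores(W, x)
            m = max(s)
            exps = [math.exp(v - m) for v in s]
            z = sum(exps)
            for k in range(n_tokens):
                grad = exps[k] / z - (1.0 if k == token else 0.0)
                W[k] = [w - lr * grad * xi for w, xi in zip(W[k], x)]
    return W

def accuracy(W, data):
    """Fraction of held-out activations whose input token is recovered."""
    hits = 0
    for x, token in data:
        s = scores(W, x)
        hits += s.index(max(s)) == token
    return hits / len(data)

rng = random.Random(0)
N_TOKENS, DIM = 5, 8
noise_by_layer = [0.1, 1.0, 3.0]  # made up: pretend deeper layers keep less of the input
probe_accuracy = []
for noise in noise_by_layer:
    directions = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_TOKENS)]
    train = sample(rng, directions, noise, 200)
    held_out = sample(rng, directions, noise, 200)
    W = train_probe(train, N_TOKENS, DIM)
    probe_accuracy.append(accuracy(W, held_out))
print(probe_accuracy)
```

Plotting `probe_accuracy` against layer index is the "does accuracy rise or fall with depth" check from the quoted suggestion. Inspecting which components of `W` carry the most weight would be the starting point for the basis-element question.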

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:52:47.645Z · LW · GW

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:01:55.087Z · LW · GW

Good idea, I'll do that.

I know I'd run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it's smaller.

Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like "plasma" are "kept around" for later use.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T18:52:54.949Z · LW · GW
Maybe lm_head was set to be equal to wte transpose?

Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.

• In OpenAI's tensorflow code, see lines 154 and 171 of src/model.py. The variable "wte" is defined on 151, then re-used on 171.
• In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)

Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the "head" used to convert the final layer outputs into predictions. The intent is easy swapping between different "heads" with the same "body" beneath.

This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.

Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn't a critique of the latter; it's trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.
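For readers who want the weight tying spelled out, here is a minimal sketch with toy sizes. It is not the OpenAI or huggingface code: the function names `embed` and `unembed` are made up, and position embeddings and the transformer blocks are left out entirely. The only point is that one matrix `wte` serves both the input lookup and the output projection.

```python
import random

rng = random.Random(0)
N_VOCAB, N_EMB = 6, 4  # toy sizes; the real model uses about 50K and up to 1600

# One shared matrix of shape (N_VOCAB, N_EMB).
wte = [[rng.gauss(0.0, 0.02) for _ in range(N_EMB)] for _ in range(N_VOCAB)]

def embed(token_id):
    """Input side: look up a row of wte."""
    return wte[token_id]

def unembed(h):
    """Output side: logits = h @ wte^T, reusing the same matrix."""
    return [sum(h_k * w_k for h_k, w_k in zip(h, row)) for row in wte]

h = embed(3)         # stand-in for a final-layer activation
logits = unembed(h)  # one logit per vocabulary entry
print(logits)
```

A framework that keeps a separate "head" object has to store or alias a second matrix to get the same behaviour, which is the abstraction described above.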

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T15:19:43.991Z · LW · GW
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question "do the activations assign high or low probability to the input token?" One can answer the same question by plotting logits or ranks with the input layer included.)
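To make the comparison concrete, here is a small sketch of computing KL against both the first and the final layer. The per-layer logits are invented, the vocabulary has three tokens, and the direction KL(reference || layer) is an assumption for illustration. This is not the notebook's `show_token_progress` implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Made-up per-layer logits over a 3-token vocabulary (first row = input layer).
layer_logits = [
    [4.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.5, 2.0, 0.0],
    [0.0, 4.0, 0.0],
]
dists = [softmax(row) for row in layer_logits]
kl_vs_first = [kl_divergence(dists[0], d) for d in dists]
kl_vs_final = [kl_divergence(dists[-1], d) for d in dists]
for i, (a, b) in enumerate(zip(kl_vs_first, kl_vs_final)):
    print(f"layer {i}: KL(first||layer)={a:.3f}  KL(final||layer)={b:.3f}")
```

Because KL is not a metric, a middle layer being far from the input in this sense says nothing by itself about its distance to the output, so both columns have to be computed.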

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values.

It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts and scales the activations to mean 0 and unit variance (up to the learned gain and bias) before passing them to the next layer.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T04:23:51.125Z · LW · GW

Interesting, but not (I think?) the direction I was headed in.

I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

• The model does poorly at position i, assigning very low probability to the true token residing at i+1.
• To retain a clear view of the input sequence, the model now needs to "keep around" the true token at i+1, since its own guess is a poor proxy.
• But early layers don't know that: they can't "look up" and notice the poor prediction. So they just treat i+1 like any other position. (I.e. there's no way to implement a selective "copy when we got it wrong" mechanism)
• In late layers, position i+1 has been converted into a guess about i+2 by the earlier layers, so we can't rely on it to tell us what really occupied i+1.
• And position i has been converted to a bad guess about position i+1, so if we use it as a proxy for i+1 we'll do poorly.

My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)
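In code, the "replace (or interpolate)" step might look like the sketch below. The function name, the mixing weight `alpha`, and the toy vectors are all hypothetical. Where in the sampling loop such a blend would be applied is not specified above and is left open here.

```python
def blend_with_true_token(h_late, true_next_embedding, alpha):
    """Interpolate a late-layer activation toward the embedding of the token
    that actually came next. alpha=0 leaves h_late alone; alpha=1 replaces it."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * h + alpha * e
            for h, e in zip(h_late, true_next_embedding)]

h_late = [0.2, -1.0, 0.5]     # toy late activation at position i (a guess about i+1)
true_next = [1.0, 0.0, -0.5]  # toy embedding of the token actually at i+1
blended = blend_with_true_token(h_late, true_next, alpha=0.5)
print(blended)
```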

But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well.

I have no idea how it does it -- indeed the connection structure feels weirdly adverse to such operations -- but apparently it does. So it's probably premature to assume it can't do this well, and attempt to "help it out" with extra tricks.