the scaling “inconsistency”: openAI’s new insight 2020-11-07T07:40:06.548Z
on “learning to summarize” 2020-09-12T03:20:08.333Z
interpreting GPT: the logit lens 2020-08-31T02:47:08.426Z
is gpt-3 few-shot ready for real applications? 2020-08-03T19:50:09.740Z
[updated] how does gpt2′s training corpus capture internet discussion?  not well 2020-07-27T22:30:07.909Z
Why is pseudo-alignment "worse" than other ways ML can fail to generalize? 2020-07-18T22:54:50.957Z
GPT-3: a disappointing paper 2020-05-29T19:06:27.589Z
covid-19 notes, 4/19/20 2020-04-20T05:30:01.873Z
mind viruses about body viruses 2020-03-28T04:20:02.674Z
human psycholinguists: a critical appraisal 2019-12-31T00:20:01.330Z
“embedded self-justification,” or something like that 2019-11-03T03:20:01.848Z
When does rationality-as-search have nontrivial implications? 2018-11-04T22:42:01.452Z


Comment by nostalgebraist on interpreting GPT: the logit lens · 2021-05-01T01:27:16.410Z · LW · GW

Ah, I think we miscommunicated.

I meant "gelu(x) achieves its maximum curvature somewhere near x=0."

People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|.  In this sense gelu is like relu.
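
A quick numerical check of the curvature claim, as a rough sketch: it uses the exact erf-based form of gelu and a finite-difference second derivative (the grid, the step size `h`, and the helper names are my own choices, not anything from the thread).

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    return max(0.0, x)

def curvature(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate of the second derivative f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Scan a grid on [-6, 6] and find where |gelu''| is largest.
grid = [i / 100.0 for i in range(-600, 601)]
x_peak = max(grid, key=lambda x: abs(curvature(gelu, x)))

# Far from zero, gelu is nearly linear and nearly equal to relu.
gap_at_6 = abs(gelu(6.0) - relu(6.0))
```

Here `x_peak` is where the curvature is largest on the grid, and `gap_at_6` measures how far gelu is from relu once |x| is large.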

It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.

This is indeed unique to relu -- I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops.  Obviously you can't simulate a differentiable function with that trick.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2021-04-26T21:37:51.589Z · LW · GW

I'm confused -- the paper you link is not about better prompts for GPT-3.  It's about a novel fine-tuning methodology for T5.  GPT-3 only appears in the paper as a reference/baseline to which the new method is compared.

The use of a BERT / T5-style model (denoising loss + unmasked attn) is noteworthy because these models reliably outperform GPT-style models (LM loss + causally masked attn) in supervised settings.

Because of this, I sometimes refer to GPT-3 as "quantifying the cost (in additional scale) imposed by choosing a GPT-style model."  That is, the following should be roughly competitive w/ each other:

  • BERT/T5 at param count N
  • GPT at param count ~100 * N

See my comments near the bottom here.

Separately, I am aware that people have gotten much better performance out of GPT-3 by putting some effort into prompt design, vs. the original paper which put basically no effort into prompt design.

Your comment claims that the "SOTA" within that line of work is close to the overall SOTA on SuperGLUE -- which I would readily believe, since GPT-3 was already pretty competitive in the paper and dramatic effects have been reported for prompt design on specific tasks.  However, I'd need to see a reference that actually establishes this.

Comment by nostalgebraist on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-03-04T06:27:39.237Z · LW · GW

Most complexity measures give roughly similar values for the (relative) complexity of most objects


I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those for space B on the translated B-objects.  Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting exactly the right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory.

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R:

  1. The standard normal,
  2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R

According to basic information theory, these are equally simple/compressible.  (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)

But in rate-distortion theory, (1) is way more simple/compressible than (2).  If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1].  Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.

If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities.  When you remove the metric, they're identical.  So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.
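
Here is a small discretized sketch of the two distributions, under assumptions of my own: the normal is truncated to [-5, 5] and cut into 1000 pieces rather than a trillion, the pieces are moved to shuffled slots spaced 1e6 apart, and variance (the squared error of the best single-point guess) stands in as a crude proxy for what rate-distortion theory cares about. It is an illustration of the distinction, not a rate-distortion calculation.

```python
import math
import random

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Distribution (1): a standard normal on [-5, 5], cut into equal-width pieces.
n_pieces = 1000
width = 10.0 / n_pieces
left_edges = [-5.0 + i * width for i in range(n_pieces)]
probs = [normal_cdf(a + width) - normal_cdf(a) for a in left_edges]
total = sum(probs)
probs = [p / total for p in probs]  # renormalize after dropping the tails

# Distribution (2): the same pieces, each moved to an arbitrary far-apart slot.
rng = random.Random(0)
slots = list(range(n_pieces))
rng.shuffle(slots)

dist_1 = {a + width / 2: p for a, p in zip(left_edges, probs)}
dist_2 = {1e6 * s + width / 2: p for s, p in zip(slots, probs)}

def entropy_bits(dist) -> float:
    """Shannon entropy of the piece probabilities; ignores where pieces sit."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def variance(dist) -> float:
    """Squared error of the best single-point guess; depends on the metric."""
    mean = sum(x * p for x, p in dist.items())
    return sum(p * (x - mean) ** 2 for x, p in dist.items())

h_1, h_2 = entropy_bits(dist_1), entropy_bits(dist_2)
var_1, var_2 = variance(dist_1), variance(dist_2)
```

The entropies agree because they only see the probabilities, while the variances differ by many orders of magnitude because they see the locations.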

Comment by nostalgebraist on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian · 2021-02-22T05:22:36.922Z · LW · GW

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.


  • If X, Y are finite sets, then there are a finite number of functions f: X -> Y.   Let's write Y^X for the finite set of such functions.
  • The authors view the counting measure on Y^X -- where every function is equally likely -- as "unbiased."
  • This choice makes sense if X, Y are truly unstructured collections of objects with no intrinsic meaning.
  • However, if there is some extra structure on them like a metric, it's no longer clear that "all functions are equally likely" is the right reference point.
  • Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of Y^X relative to the counting measure.
  • Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (eg continuous ones) will look like a simplicity prior.

This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.

Let's see how this goes wrong in the Shannon entropy argument from this paper.

  • The authors consider (a quantity equivalent to) the fraction of inputs for which a given function outputs 1.
  • They consider a function simpler if this fraction is close to 1 or 0, because then it's easier to compress.
  • With the counting measure, "most" functions output 1 about half of the time.  (Like the binomial distribution -- there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
  • To learn binary functions with an NN, they encode the inputs as binary vectors.  They study what happens when you feed these to either (A) a linear model, or (B) a ReLU stack, with random weights.
  • It turns out that the functions expressed by these models are much more likely than the counting measure to assign a single label (0 or 1) to most inputs.
  • Why?
    • For a random function on an input space of size 2^n, you need to roll 2^n independent random variables.  Each roll affects only one input element.
    • But when you encode the inputs as vectors of length n and feed them into a model, the layers of the model have weights that are also n-vectors.  Each of their components affects many input elements at once, in the same direction.  This makes it likely for the judgments to clump towards 0 or 1.
    • For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to 1.  This happens a fraction 1/2^n of the time.  But only one boolean function maps every input to 1, so the counting measure would give this probability 1/2^(2^n).
    • This doesn't seem like a special property of neural nets.  It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that "use" the structure in their labeling rule.  "Using" the structure means any decision you make about how to treat one input element has implications for others (because they're close to it, or point in the same direction, or something).  Thus you have fewer independent decisions to make, and there's a higher probability they all push in the same direction.
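
The "all weights positive" example can be checked by brute force. This is a minimal sketch under my own assumptions: n = 4 input bits, Gaussian weights, and the convention that an input is labeled 1 when w . x >= 0 (so the all-zeros input also maps to 1).

```python
import itertools
import random

n = 4  # input bits, so 2**n inputs and 2**(2**n) boolean functions
inputs = list(itertools.product([0, 1], repeat=n))

def linear_labels(weights):
    """Label each input 1 if w . x >= 0, else 0 (no bias or threshold term)."""
    return tuple(
        int(sum(w * x for w, x in zip(weights, inp)) >= 0) for inp in inputs
    )

rng = random.Random(0)
trials = 20000
all_ones = tuple([1] * len(inputs))
hits = 0
for _ in range(trials):
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if linear_labels(w) == all_ones:
        hits += 1

freq_linear = hits / trials        # how often random weights give "always 1"
p_all_positive = 1 / 2 ** n        # chance that every weight is positive
p_counting = 1 / 2 ** (2 ** n)     # one function out of 2**(2**n)
```

With these choices, `freq_linear` should land near `p_all_positive`, which is thousands of times larger than `p_counting`.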

Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity.  Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).

"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way.  What does it mean for a function to respond to binary vectors in a predictable way?  It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them.  But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.

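To make the LZ point concrete, here is a rough sketch that scores three boolean functions on 10-bit inputs by the phrase count of a simple LZ78-style parse of their output strings. Note the assumptions: the parse below is a stand-in I picked for illustration, not the exact LZ variant the authors use, and the three functions are hypothetical examples.

```python
import random

def lz78_phrase_count(s: str) -> int:
    """Number of phrases in a simple LZ78-style parse of s."""
    seen = set()
    phrase = ""
    count = 0
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)  # count a trailing partial phrase

n = 10
inputs = range(2 ** n)  # enumerate inputs by reading their bits as a binary number

constant = "".join("0" for _ in inputs)
first_bit = "".join(str(x >> (n - 1)) for x in inputs)  # output depends on one bit
rng = random.Random(0)
random_fn = "".join(rng.choice("01") for _ in inputs)

scores = {
    "constant": lz78_phrase_count(constant),
    "first_bit": lz78_phrase_count(first_bit),
    "random": lz78_phrase_count(random_fn),
}
```

A function whose output is determined by a single input bit scores close to the constant function and well below a random one, which is the sense in which the binary enumeration already favors functions that "use" the bit structure.
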
Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-26T17:31:30.885Z · LW · GW

I don't think this step makes sense:

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.

In the picture, it looks like there's something special about having a 1:1 ratio of data to params.  But this is a coincidence due to the authors' choice of units.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens.  If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!

To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps.  This depends on your choice of units.  And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems "have the same scaling law."  Scaling is about relationships between differences, not relationships between absolute magnitudes.
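
To see the unit-dependence numerically, here is a sketch using the L(N, D) functional form from Kaplan et al. The constants are the fit values as I recall them from that paper, so treat them as assumptions; `TOKENS_PER_WINDOW` is the hypothetical "one forward pass" unit from the paragraph above.

```python
# Assumed fit constants for L(N, D), with D measured in tokens.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C_TOKENS = 6.4e13, 1.8e13

def loss(n_params: float, d: float, d_c: float) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + d_c / d) ** ALPHA_D

def balanced_data(n_params: float, d_c: float) -> float:
    """The D at which the two terms inside the bracket are equal."""
    return d_c * (n_params / N_C) ** (ALPHA_N / ALPHA_D)

TOKENS_PER_WINDOW = 1e3  # hypothetical alternative unit for "one data point"

n = 1e12
d_tokens = balanced_data(n, D_C_TOKENS)
d_windows = balanced_data(n, D_C_TOKENS / TOKENS_PER_WINDOW)

# Same physical dataset and same loss, but D/N differs by the unit factor.
loss_tokens = loss(n, d_tokens, D_C_TOKENS)
loss_windows = loss(n, d_windows, D_C_TOKENS / TOKENS_PER_WINDOW)
ratio_tokens = d_tokens / n
ratio_windows = d_windows / n
```

Changing units rescales D and D_c together, so the loss is untouched while the D/N ratio at the cusp moves by a factor of 1000.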

On the larger topic, I'm pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for "a data point" is.  This is mostly for "Could a Neuroscientist Understand a Microprocessor?"-type reasons.  I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T23:28:51.776Z · LW · GW

Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.

Here is what the actual visualization looks like.  More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.

In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:

  • If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis.  That is, in this regime, N doesn't matter and loss is effectively a function of D alone.
    • This is L(D).
    • It looks like the color changes you see if you move horizontally through the upper left region.
  • Likewise, in the lower right region, D doesn't matter and loss depends on N alone.
    • This is L(N).
    • It looks like the color changes you see if you move vertically through the lower right region.

To restate my earlier claims... 

If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower).  So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
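
As a sanity check on those numbers, here is a sketch that plugs them into the Kaplan et al. functional form for L(N, D). The constants are the paper's fit values as I recall them, so they are assumptions, and the absolute loss values should not be taken too seriously.

```python
ALPHA_N, ALPHA_D = 0.076, 0.103   # assumed fit exponents
N_C, D_C = 6.4e13, 1.8e13         # assumed fit constants, D in tokens

def loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

small_both = loss(1e12, 1e12)
big_n_only = loss(1e15, 1e12)   # N scaled up 1000x, D held fixed
big_both = loss(1e15, 1e15)     # N and D scaled up together

gain_from_n_alone = small_both - big_n_only
gain_from_adding_d = big_n_only - big_both
```

Under these assumed constants, `gain_from_n_alone` is positive but several times smaller than `gain_from_adding_d`.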

This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.

On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive.  For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.

When I said that it's intuitive to think about L(D) and L(N), I meant that I care about which target losses we can reach.  And that's going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.

Asking "what could we do with a N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot.  It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region ... or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.

In Ajeya's work, this question means "let's assume we're using an N=1e15 model, and then let's assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let's figure out how big D has to be to get there."

So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".

What feels weird to me -- which you touched on above -- is the way this lets the scaling relations "backseat drive" the definition of sufficient quality for AGI.  Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it... we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T16:51:41.514Z · LW · GW

You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

This is a subtle and confusing thing about the Kaplan et al papers.  (It's also the subject of my post that I linked earlier, so I recommend you check that out.)

There are two things in the papers that could be called "optimal compute budgeting" laws:

  • A law that assumes a sufficiently large dataset (ie effectively infinite dataset), and tells you how to manage the tradeoff between steps S and params N.
  • The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.

I said the D vs N law was "not a heuristic for managing compute" because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.

However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train.  Comparing the two laws yields the "breakdown" or "kink point."

Do you agree or disagree? ... I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?

Sorry, why do you expect I disagree?  I think I agree.  But also, I'm not really claiming the scaling laws say or don't say anything about the brain, I'm just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems).  We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.

Perhaps it would help me if I could visualize it in two dimensions

This part is 100% qualitatively accurate, I think.  The one exception is that there are two "optimal compute" lines on the plot with different slopes, for the two laws referred to above.  But yeah, I'm saying we won't be on either of those lines, but on the L(N) or the L(D) line.

Comment by nostalgebraist on Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain · 2021-01-25T02:35:48.985Z · LW · GW

The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.


The scaling laws from the Kaplan et al papers do tell you this.

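The relevant law is L(N, D), for the early-stopped test loss given parameter count N and data size D.  It has the functional form

$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$

with $N_c$, $D_c$, $\alpha_N$, $\alpha_D$ fitted constants.

The result that you should scale $D \propto N^{\alpha_N / \alpha_D}$ comes from trying to keep the two terms in this formula about the same size.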
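
Spelling that step out, assuming the Kaplan et al. functional form $L(N, D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}$ (the symbols $N_c$, $D_c$, $\alpha_N$, $\alpha_D$ are that paper's fitted constants): setting the two terms inside the bracket equal gives

```latex
\left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} = \frac{D_c}{D}
\quad \Longrightarrow \quad
D = D_c \left( \frac{N}{N_c} \right)^{\alpha_N / \alpha_D}
\;\propto\; N^{\alpha_N / \alpha_D}
```

so the balanced dataset size grows as a power of N, with the constant of proportionality set by $D_c$ and $N_c$ (and hence by the units chosen for D).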

This is not exactly a heuristic for managing compute (since D is not dependent on compute, it's dependent on how much data you can source).  It's more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.

You always can train models that are "too large" on datasets that are "too small" according to the heuristic, and they won't diverge or do poorly or anything.  They just won't improve much upon the results of smaller models.

In terms of the above, you are fixing N and then asking what D ought to be.  If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected."  Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to that N rather than using a smaller model to get almost identical performance.

I find it more intuitive to think about the following, both discussed in the papers:

  • L(D), the N -> infinity limit of L(N, D)
    • meaning: the peak data efficiency possible with this model class
  • L(N), the D -> infinity limit of L(N, D)
    • meaning: the scaling of loss with parameters when not data-constrained but still using early stopping

If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.

Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold L*).  Ajeya's approach essentially assumes that we'll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.

I'm not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L*.

See also my post here.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2021-01-16T01:54:02.450Z · LW · GW

I wrote this post about a year ago.  It now strikes me as an interesting mixture of

  1. Ideas I still believe are true and important, and which are (still) not talked about enough
  2. Ideas that were plausible at the time, but are much less so now
  3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time

In category 1 (true, important, not talked about enough):

  • GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
  • Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do.  Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
  • Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.

In category 2 (plausible then but not now):

  • "The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried."
    • I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020.
    • The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data
    • The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math
    • I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties
  • I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence.
    • I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.

In category 3 (misleading):

  • I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
    • I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
    • ConvNets do bake in (a very simple kind of) prior knowledge.
    • But, though LSTMs and transformers are more "structured" than fully connected nets, the structure is not intended to encode prior knowledge.
    • Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
    • I was aware of the vast gap between "more structure than the literal minimum possible" and "the kind of structure Marcus wanted," but conflated the two.  Possibly because I thought the resulting irony was appealing, and/or because it suggested the disagreement was illusory and was thus emotionally appealing.

In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.

Comment by nostalgebraist on Fourth Wave Covid Toy Modeling · 2021-01-10T22:02:50.680Z · LW · GW

Rt can go below one in Zvi's model.  It just takes an even higher rate of new infections.

Here's the same picture, with the horizontal axis extended so this is visible:

Of course, in the real world, Rt dips below one all the time, as you can see in the colored points.

As a dramatic example, Zvi's model is predicting the future forward from 12/23/20.  But a mere week before that date, Rt was below one!

Comment by nostalgebraist on Fourth Wave Covid Toy Modeling · 2021-01-07T23:31:13.423Z · LW · GW

Thanks!  This is exactly the kind of toy model I thought would help move these discussions forward.

The part I'm most suspicious of is the model of the control system.  I have written a Colab notebook exploring the issue in some detail, but briefly:

  • If you run the control system model on the past (2020), it vastly over-predicts R.
    • This is true even in the very recent past, when pandemic fatigue should have "set in."
  • Of course, by your assumptions, it should over-predict past R to some extent.  Because we now have pandemic fatigue, and didn't then.
  • However:
    • It seems better to first propose a model we know can match past data, and then add a tuning term/effect for "pandemic fatigue" for future prediction.
    • Because this model can't predict even the very recent past, it's not clear it models anything we have observed about pandemic fatigue (ie the observations leading us to think pandemic fatigue is happening).
    • Instead, it effectively assumes a discontinuity at 12/23/20, where a huge new pandemic fatigue effect turns on.  This effect only exists in the future; if it were turned on in the past, it would have swamped all other factors.

To get a sense of scale, here is one of the plots from my notebook:

The colored points show historical data on R vs. the 6-period average, with color indicating the date.

  • The first thing that stands out is that these two variables are not even approximately in a one-to-one relationship.
  • The second thing that stands out is that, if you were to fit some one-to-one relationship anyway, it would be very different from the toy model here.
  • Third thing: the toy model's baseline R is anchored to the "top of a hill" on a curve that has been oscillating quickly.  With an exponent of zero, it would stay stuck at the top of the recent hills, i.e. it would still over-predict the recent past.  (With a positive exponent, it shoots above those hills.)

More general commentary on the issue:

  • It seems like you are
    1. ... first, assuming that the control system sets R as a function of infections
    2. ... then, observing that we still have R~1 (as always), despite a vast uptick in infections
    3. ... then, concluding that the control system has drastically changed all of a sudden, because that's the only way to preserve the assumption (1)
  • Whereas, it seems more natural to take (3) as evidence that (1) was wrong.

In other words, you are looking at a mostly constant R (with a slight sustained recent upswing), and concluding that this lack of a change is actually the result of two large changes that cancel out:

  1. Control dynamics that should make R go down
  2. A new discontinuity in control dynamics that conspires to exactly cancel #1, preserving a ~constant R

When R has been remarkably constant the whole time, I'm suspicious of introducing a sudden "blast" of large changes in opposing directions that net out to R still staying constant.  What evidence is there for this "blast"?

(The recent trajectory of R is not evidence for it, as discussed above: it's impossible to explain recent R with these forces in play.  They have to have suddenly appeared, like a mean Christmas present.)

My model of the R/cases trends is something like:

  • "R is always ~1 with noise/oscillations"
  • "cases are exponential in R, so when the noise/oscillations conspire upwards for a while, cases blow up"
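
A toy version of those two bullets, with made-up numbers (the oscillation amplitude, the length of the upward run, and the starting case count are all hypothetical):

```python
import math

def simulate_cases(initial_cases: float, r_values):
    """Each generation multiplies cases by that generation's R."""
    cases = [initial_cases]
    for r in r_values:
        cases.append(cases[-1] * r)
    return cases

# R oscillates around 1 with a small amplitude, spending equal time above and below.
generations = 40
balanced = [1.0 + 0.1 * math.sin(2 * math.pi * t / 10) for t in range(generations)]

# Same scale of deviation, but it happens to stay on the high side for a while.
upward_run = [1.1] * 10 + [1.0] * 30

flat_final = simulate_cases(1000.0, balanced)[-1]
surge_final = simulate_cases(1000.0, upward_run)[-1]
```

In both runs R never leaves the range 0.9 to 1.1, but `surge_final` ends up at more than 2.5x the starting count while `flat_final` stays close to it.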

The missing piece is what sets the noise/oscillations, because if we can control that we can help.  However, any model of the noise/oscillations must calibrate them so it reproduces 2020's tight control around R~1.

This tight control was a surprise and is hard to reproduce in a model, but if our model doesn't reproduce it, we will go on being surprised by the same thing that surprised us before.

Comment by nostalgebraist on DALL-E by OpenAI · 2021-01-05T22:51:07.839Z · LW · GW

Very interesting!

The approach to images here is very different from Image GPT.  (Though this is not the first time OpenAI has written about this approach -- see the "Image VQ" results from the multi-modal scaling paper.)

In Image GPT, an image is represented as a 1D sequence of pixel colors.  The pixel colors are quantized to a palette of size 512, but still represent "raw colors" as opposed to anything more abstract.  Each token in the sequence represents 1 pixel.

In DALL-E, an image is represented as a 2D array of tokens from a latent code.  There are 8192 possible tokens.  Each token in the sequence represents "what's going on" in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).

(Caveat: The mappings from pixels-->tokens and tokens-->pixels are contextual, so a token can influence pixels outside "its" 8x8 region.)

This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT.  Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training.  Like BPE, it chunks the "raw" signal (pixels here, characters in BPE) into larger, more meaningful units.

This is like a vocabulary of 8192 "image words."  DALL-E "writes" a 32x32 array of these image words, and then a separate network "decodes" this discrete array to a 256x256 array of pixel colors.
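
The arithmetic behind these numbers, as a small sketch. The per-pixel sequence length is only a point of comparison at this resolution; I am not claiming Image GPT was run on 256x256 images.

```python
import math

image_side = 256      # pixels per side
code_side = 32        # latent tokens per side
vocab_size = 8192     # possible values per latent token

pixels_per_token_side = image_side // code_side   # side of each token's "home" region
latent_sequence_length = code_side * code_side     # tokens per image in the latent code
pixel_sequence_length = image_side * image_side    # tokens if every pixel were a token
bits_per_latent_token = math.log2(vocab_size)      # information capacity of one token
```

So the latent code is a sequence of 1024 tokens of 13 bits each, 64x shorter than a one-token-per-pixel sequence at the same resolution.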

Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text.  Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.

As with BPE, you get a head start over modeling the raw signal.  As with BPE, the chunking may ultimately be a limiting factor.  However, the chunking process here is differentiable (a neural auto-encoder), so it ought to be adaptable in a way BPE is not.

(Trivia: I'm amused that one of their visuals allows you to ask for images of triangular light bulbs -- the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)

Comment by nostalgebraist on Covid 12/31: Meet the New Year · 2021-01-05T03:48:30.517Z · LW · GW

Many of the same thoughts were in my mind when I linked that study on the previous post.


IMO, it would help clarify arguments about the "control system" a lot to write down the ideas in some quantitative form.

As I wrote here:

I always see [rates of compliance, lockdown fatigue, which kinds of restrictions are actually followed, etc.] discussed in very qualitative, intuitive terms.  We talk of cases, tests, fatality rates, and reproduction numbers quantitatively.  We look at tables and charts of these numbers, we compare projections of them.

But when the conversation turns to lockdown compliance, the numbers vanish, the claims range over broad and poorly specified groups (instead of percentages and confidence intervals we get phrases like “most people,” or merely “people”), and everything is (as far as I can tell) based on gut feeling.

Even a simple toy model could help, by separating intuitions about the mechanism from those about outcomes.  If someone argues that a number will be 1000x or 0.001x the value the toy model would predict, that suggests either 

  • (a) the number is wrong or 
  • (b) the toy model missed some important factor with a huge influence over the conclusions one draws

Either (a) or (b) would be interesting to learn.


One basic question I don't feel I have the answer to: do we know anything about how powerful the control system is?

Roughly, "the control system" is an explanation for the fact that R stays very close to 1 in many areas. It oscillates up and down, but it never gets anywhere near as low as 0, or anywhere near as high as the uncontrolled value of ~4.5.

As long as this trend holds, it's like we're watching the temperature of my room when I've got the thermostat set to 70F.  Sure enough, the temperature stays close to 70F.

This tells you nothing about the maximum power of my heating system.  In colder temperatures, it'd need to work harder, and at some low enough temperature T, it wouldn't be able to sustain 70F inside.  But we can't tell what that cutoff T is until we reach it.  "The indoor temperature right now oscillates around 70F" doesn't tell you anything about T.

Doesn't this argument work just as well for the "control system"?  A toy model could answer that question.
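
To make the thermostat analogy concrete, here is a minimal sketch with entirely made-up numbers (the function name, the `max_power` parameter, and the temperatures are all hypothetical). It is only meant to show why observations near the setpoint cannot distinguish a weak controller from a strong one:

```python
def indoor_temp(outdoor, setpoint=70.0, max_power=40.0):
    """Steady-state indoor temperature (F) for a heater that can add at most
    `max_power` degrees over the outdoor temperature. Hypothetical toy model."""
    needed = max(0.0, setpoint - outdoor)   # heating required to reach the setpoint
    supplied = min(needed, max_power)       # the heater saturates at max_power
    return outdoor + supplied

mild_days = (60, 50, 45)
weak = [indoor_temp(t, max_power=25.0) for t in mild_days]
strong = [indoor_temp(t, max_power=60.0) for t in mild_days]
print(weak, strong)   # identical readings in mild weather

# Only a cold enough day reveals the difference between the two heaters.
print(indoor_temp(0, max_power=25.0), indoor_temp(0, max_power=60.0))
```

An analogous toy model for R would replace "outdoor temperature" with the uncontrolled reproduction number and "max_power" with the largest reduction that behavior change can deliver; that mapping is a suggestion, not something established here.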

Comment by nostalgebraist on Covid 12/24: We’re F***ed, It’s Over · 2020-12-25T04:01:26.889Z · LW · GW

I'm confused by your pessimism about England's Tier 4 restrictions:

So basically, if you’re outside where it’s safe, they’ll harass you and maybe worse. Whereas if you stay inside, technically it’s not allowed but in practice it’s a lot less likely anything happens to you, unless the anything in question is ‘you catch Covid-19.’ The rules are porous enough that they aren’t enforceable against the things that are risky but enforceable enough to shut down the relatively safe actions that keep people sane. And with weird exceptions for remarkably large indoor gatherings for certain events that are textbook superspreaders.

All of which is what our model expects to see, and none of which seems likely to be remotely sufficient if the new strain is as infectious as they estimate.

Tier 4's bundle of restrictions is almost identical to those from England's "second lockdown" in November.  (See e.g. here.)  But you write as though you believe the "second lockdown" was impactful:

[...] the context of England being under lockdown conditions that had previously turned the tide [...] 

How effective are these kinds of measures at controlling things (a) before the new strain and (b) with the new strain?

This heavily discussed paper from Dec 23 addresses question (b), using the same model the authors previously applied to question (a) in this paper.  These papers are worth reading and I won't attempt to summarize them, but some relevant points:

  • The authors argued for the "second lockdown" in the 2nd linked paper on the basis of its projected impacts on mobility, thus R, thus etc.
  • The 2nd linked paper was later updated with data from November, showing that their model did quite well at predicting the effect on mobility, R, etc.
  • The 1st linked paper (on new strain) approximates Tier 4 as being equivalent to "second lockdown" in its effects
  • The 1st linked paper (on new strain) is worth reading in its entirety as it provides some (provisional) quantitative backing to intuitions about the impact of various measures (Tier 4 / Tier 4 + school closures / Tier 4 + school closures + XYZ amount of vaccination)

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T00:39:49.199Z · LW · GW

I don't think you're completely missing something.  This is the active learning approach, which gwern also suggested -- see that thread for more.

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-09T00:36:45.626Z · LW · GW

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Sure -- my point was to contrast two cases

  1. a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
  2. the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free.  This includes domains the researchers would never have come up with themselves.

If we switch to manually hunting down large specialized datasets, this will definitely help, but we're no longer getting broad domain coverage for free.  At best we get broad domain coverage through manual researcher effort and luck, at worst we don't get it at all.

I see your point about active learning "telling us" when we need more data -- that's especially appealing if it can point us to specific domains where more coverage would help.

Comment by nostalgebraist on the scaling “inconsistency”: openAI’s new insight · 2020-11-08T23:19:23.114Z · LW · GW

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?


IIUC, this is trying to make L(D) fall faster by making every data point more impactful (at lowering test loss).  This will help if

  1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
  2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them

I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).

With text data, though, I'm concerned that (2) will fail soon.

The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc.  The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.
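
For concreteness, the "throw out datapoints which are too easily predicted" filter could be sketched as below. This is a toy under stated assumptions: a character unigram model stands in for "a small GPT," and the function names, corpus, and threshold rule are all made up for illustration:

```python
import math
from collections import Counter

def unigram_bits_per_char(corpus):
    """Toy stand-in for a small LM: a character unigram model fit on the corpus.
    Returns a scorer giving the average bits per character for a text."""
    counts = Counter("".join(corpus))
    total = sum(counts.values())

    def score(text):
        return -sum(math.log2(counts[c] / total) for c in text) / len(text)

    return score

def keep_hard_examples(corpus, score, threshold):
    """Drop datapoints the scorer predicts too easily (loss below threshold)."""
    return [text for text in corpus if score(text) >= threshold]

corpus = ["aaaaaaaaaa", "aaaaaaaaab", "the quick brown fox", "zqxj vkwy"]
score = unigram_bits_per_char(corpus)
threshold = sorted(score(t) for t in corpus)[len(corpus) // 2]   # keep roughly the harder half
kept = keep_hard_examples(corpus, score, threshold)
print(kept)
```

Note that the filtered set is smaller than the original, which is exactly why condition (2) matters: the filter only helps if there is enough data left after downsampling.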

Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T05:21:20.726Z · LW · GW

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?


No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this 

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author.  (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc.  (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far.  If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities?  A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character?  (Note that the character is a much better answer to the question "who does this sound like?")  And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words.  You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world.  And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet.  But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search would take a lot of parameters to implement internally.

Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T02:47:29.410Z · LW · GW

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics.  Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.  So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at.  (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go.  It's more about trying to guess what the author of this text will do in the very next step.  The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.
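
A crude way to put numbers on "almost sure to diverge after a few more steps": assume, purely for illustration, that each next-token guess matches the author independently with the same probability. Real text is not like this, and the 0.5 figure is made up, but the geometric decay is the point:

```python
def prob_still_matching(p_step, n_steps):
    """Probability that n consecutive next-token guesses all match the author,
    under the simplistic assumption that each guess is right independently
    with probability p_step."""
    return p_step ** n_steps

for n in (1, 5, 10, 20):
    print(n, prob_still_matching(0.5, n))
```

Even a predictor that is right half the time at each step is below a one-in-a-thousand chance of tracking the author ten steps out.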

To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in.  (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

Comment by nostalgebraist on on “learning to summarize” · 2020-09-13T17:58:13.740Z · LW · GW

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
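
The effect of the horizon with a single final reward can be checked with a small sketch. The function name and the 5-step episode are made up, and returns are undiscounted for simplicity:

```python
def truncated_returns(rewards, horizon):
    """Undiscounted return at each step t, counting only the rewards
    received in steps t, t+1, ..., t+horizon-1."""
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]

episode = [0.0, 0.0, 0.0, 0.0, 1.0]   # one reward, at the end of the episode

full = truncated_returns(episode, horizon=5)    # horizon equals the episode length
short = truncated_returns(episode, horizon=3)   # horizon is n=2 steps shorter
one = truncated_returns(episode, horizon=1)     # only the last step sees any reward

print(full, short, one)
```

With the horizon 2 steps shorter than the episode, the first 2 steps get a return of exactly 0, and with a horizon of 1 only the final token is credited.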

Comment by nostalgebraist on on “learning to summarize” · 2020-09-12T21:06:35.673Z · LW · GW
I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

  • at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
  • when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.
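
A minimal sketch of "the most obvious thing," plus one possible better-behaved variant. Both function names, the label encoding (+1 good, -1 bad, 0 unannotated), and the example probabilities are assumptions for illustration, not anything from the paper:

```python
import math

def signed_likelihood_loss(token_probs, labels):
    """token_probs[i] is the model's probability for the token actually at position i.
    Minimizing this raises log-prob for 'good' tokens and lowers it for 'bad' ones.
    Note it is unbounded below: the 'bad' term goes to -infinity as p -> 0."""
    return -sum(label * math.log(p) for p, label in zip(token_probs, labels))

def bounded_variant_loss(token_probs, labels):
    """One conceptually similar alternative: penalize -log(1 - p) on 'bad' tokens,
    so each term is non-negative and the loss is bounded below by 0."""
    total = 0.0
    for p, label in zip(token_probs, labels):
        if label > 0:
            total -= math.log(p)
        elif label < 0:
            total -= math.log(1.0 - p)
    return total

# "Paris, the capital of Canada": only the last token is annotated, as bad.
probs = [0.2, 0.6, 0.5, 0.4, 0.9, 0.05]   # made-up per-token probabilities
labels = [0, 0, 0, 0, 0, -1]
print(signed_likelihood_loss(probs, labels), bounded_variant_loss(probs, labels))
```

In both versions, the only way to reduce the loss on this example is to make " Canada" less probable in this context, which is the stated preference.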

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

Comment by nostalgebraist on "Learning to Summarize with Human Feedback" - OpenAI · 2020-09-09T08:20:42.136Z · LW · GW

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)


IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.


As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.


The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T16:22:58.045Z · LW · GW

Interesting topic! I'm not confident this lens would reveal much about it (vs. attention maps or something), but it's worth a try.

I'd encourage you to try this yourself with the Colab notebook, since you presumably have more experience writing this kind of prompt than I do.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T00:02:08.712Z · LW · GW

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.

I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?
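
The "many more nearly orthogonal vectors than dimensions" claim is easy to poke at numerically. The sketch below uses random unit vectors, which is only an illustration (nothing here is taken from GPT's actual embeddings, and 2400 is an arbitrary count chosen to exceed 1600 while keeping memory modest):

```python
import numpy as np

rng = np.random.default_rng(0)
n_emb, n_vectors = 1600, 2400   # more vectors than dimensions

# Random directions in embedding space, normalized to unit length
vecs = rng.standard_normal((n_vectors, n_emb))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Pairwise cosine similarities, ignoring each vector's similarity with itself
cos = vecs @ vecs.T
np.fill_diagonal(cos, 0.0)

print("max |cosine| between distinct vectors:", np.abs(cos).max())
print("typical |cosine|:", np.abs(cos).mean())
```

No set of more than 1600 vectors in this space can be exactly orthogonal, but random ones already come close, since a typical cosine is on the order of 1/sqrt(N_emb).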

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T22:11:25.371Z · LW · GW
One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model.

That's a great idea!

One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:52:47.645Z · LW · GW

Post has now been updated with a long-ish addendum about this topic.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:01:55.087Z · LW · GW

Good idea, I'll do that.

I know I'd run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it's smaller.

Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like "plasma" are "kept around" for later use.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T18:52:54.949Z · LW · GW
Maybe lm_head was set to be equal to wte transpose?

Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.

  • In OpenAI's tensorflow code, see lines 154 and 171 of src/ The variable "wte" is defined on 151, then re-used on 171.
  • In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)

Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the "head" used to convert the final layer outputs into predictions. The intent is easy swapping between different "heads" with the same "body" beneath.

This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.
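
In schematic form, the weight tying looks like the sketch below. This is a NumPy toy with tiny made-up sizes, not the OpenAI or huggingface code; it ignores position embeddings and the final layer norm, and the transformer body is replaced by an identity placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_emb = 50, 8   # tiny stand-ins for ~50K and 1600

wte = rng.standard_normal((n_vocab, n_emb))   # the single token embedding matrix

token_ids = np.array([3, 17, 42])
h_in = wte[token_ids]     # input side: look up rows of wte
h_out = h_in              # placeholder for the transformer body (identity here)
logits = h_out @ wte.T    # output side: the same matrix, transposed

print(logits.shape)       # one row of vocab-sized logits per position
```

A separate "head" would amount to swapping `wte.T` for an independently initialized matrix of the same shape.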

Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn't a critique of the latter; it's trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T15:19:43.991Z · LW · GW
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question "do the activations assign high or low probability to the input token?" One can answer the same question by plotting logits or ranks with the input layer included.)
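
For readers who want to see the two comparisons side by side without the notebook, here is a self-contained sketch. The per-layer logits are made up (not real GPT-2 activations), and the direction of the KL, with the reference distribution as the first argument, is a choice made here rather than a statement about what the notebook does:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over a 3-token vocabulary at four successive layers
layer_logits = [[4.0, 0.0, 0.0], [2.0, 1.5, 0.0], [0.5, 3.0, 0.0], [0.0, 4.0, 0.0]]
dists = [softmax(l) for l in layer_logits]

kl_vs_first = [kl(dists[0], d) for d in dists]    # comparison point: the input
kl_vs_final = [kl(dists[-1], d) for d in dists]   # comparison point: the output

print(kl_vs_first)
print(kl_vs_final)
print(kl(dists[0], dists[1]), kl(dists[1], dists[0]))   # KL is not symmetric
```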

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values.

It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and variance 1 (before the learned gain and bias) before passing them to the next layer.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T04:23:51.125Z · LW · GW

Interesting, but not (I think?) the direction I was headed in.

I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

  • The model does poorly at position i, assigning very low probability to the true token residing at i+1.
  • To retain a clear view of the input sequence, the model now needs to "keep around" the true token at i+1, since its own guess is a poor proxy.
  • But early layers don't know that: they can't "look up" and notice the poor prediction. So they just treat i+1 like any other position. (I.e. there's no way to implement a selective "copy when we got it wrong" mechanism)
  • In late layers, position i+1 has been converted into a guess about i+2 by the earlier layers, so we can't rely on it to tell us what really occupied i+1.
  • And position i has been converted to a bad guess about position i+1, so if we use it as a proxy for i+1 we'll do poorly.

My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)

But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well.

I have no idea how it does it -- indeed the connection structure feels weirdly adverse to such operations -- but apparently it does. So it's probably premature to assume it can't do this well, and attempt to "help it out" with extra tricks.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-08-31T23:57:17.016Z · LW · GW
Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...

Not sure I understand the distinction, could you rephrase?

If by "last slot" you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.

If by "last slot" you mean the token N+1 given tokens (1, 2, ... N), then no, that's not how GPT works. If you put in tokens (1, 2, ... N), you always get guesses for tokens (2, 3, ..., N+1) in response. This is true even if all you care about is the guess for N+1.
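
The offset-by-one setup can be written out explicitly. The token strings below are made up; the point is only the alignment between input slots and targets:

```python
tokens = ["The", " cat", " sat", " on", " the", " mat"]   # tokens 1..N+1, with N = 5

inputs = tokens[:-1]    # tokens 1..N are fed to the model
targets = tokens[1:]    # tokens 2..N+1 are what the N output slots try to predict

for i, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(f"slot {i}: last visible token {x!r} -> guess for token {i + 1}: {y!r}")
```

Every slot produces a guess, so asking only for token N+1 just means discarding the first N-1 outputs.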

Comment by nostalgebraist on is gpt-3 few-shot ready for real applications? · 2020-08-08T02:06:12.657Z · LW · GW

People do this a lot with BERT, and it has its own problems -- the first section of this recent paper gives a good overview.

Then of course there is plenty of work trying to mitigate those problems, like that paper . . . but there are still various ways of doing so, with no clear consensus. So a more general statement of few-shot's promise might be "you don't have to worry about which fine-tuning setup you're going to use, out of the many available alternatives, all of which have pitfalls."

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-08-01T22:27:38.348Z · LW · GW

To be fair, it's not an apples-to-apples comparison.

GPT-3 few-shot learning gets to use less data. (Although much of superGLUE has tiny train sets, so this gap isn't as big as it sounds.) And with GPT-3 you don't have the storage overhead of a separate trained model for every task.

Back when I wrote this post, I really did not realize that OpenAI was serious about few-shot learning as a practical, competitive approach. I had assumed it was meant as a conceptual demonstration of meta-learning, or a new way to probe what LMs "know."

In other words, I implicitly assumed "oh, of course they aren't planning [something like the OpenAI API], it'd be uncharitable to assume they actually think this is a practical approach." Now it's clear that they do think that, which makes for a different conversation than the one I had expected here. (I'm still bearish on the approach, though.)

Comment by nostalgebraist on Are we in an AI overhang? · 2020-07-29T06:28:57.475Z · LW · GW

They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)

At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than one with 1024 tokens, if the two have equally many parameters. And that later tokens are harder to predict, and hence require more parameters if you want to reach some given loss threshold.

I'll have to think more about this and what it might mean for their other scaling laws... at the very least, it's an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-03T03:24:24.245Z · LW · GW

On the reading of the graphs:

All I can say is "I read them differently and I don't think further discussion of the 'right' way to read them would be productive."

Something that might make my perspective clear:

  • when I first read this comment, I thought "whoa, that 'phase change' point seems fair and important, maybe I just wasn't looking for that in the graphs"
  • and then I went back and looked at the graphs and thought "oh, no, that's obviously not distinguishable from noise; that's the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that Squad V2 graph looks like the other 5 reading comp graphs except with more noise," etc. etc.

I don't expect this will convince you I'm right, but the distance here seems more about generic "how to interpret plots in papers" stuff than anything interesting about GPT-3.

On this:

I can't think of a coherent model where both of these claims are simultaneously true; if you have one, I'd certainly be interested in hearing what it is.

Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them "less noisily" as the scale grows.

The intended connotation of my stance that "fine-tuning will outperform few-shot" is not "haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!" If anything, it's the opposite:

  • I think transformers have some limits (e.g. physical / spatial stuff). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
  • I think fine-tuning has shown itself to be a remarkably effective way to "get at" this knowledge for downstream tasks -- even with small data sets, not far in scale from the "data sets" used in few-shot.
  • So, I don't understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at two orders of magnitude lower, impresses me far more than the few-shot results).

Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.

So, this all feels a bit like being a dog owner who reads a new paper "demonstrating dogs' capacity for empathy with humans," is unimpressed w/ its methodology, and finds themselves arguing over what concrete model of "dog empathy" they hold and what it predicts for the currently popular "dog empathy" proxy metrics, with a background assumption that they're some sort of dog-empathy-skeptic.

When in fact -- they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.

I've already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even if I've seen them do that kind of thing in real life . . . should I be impressed?

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T19:53:18.684Z · LW · GW

I agree with you about hype management in general, I think. The following does seem like a point of concrete disagreement:

It sounds like you expected "GPT" to mean something more like "paradigm-breaker" and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.

If the paper had not done few-shot learning, and had just reviewed LM task performance / generation quality / zero-shot (note that zero-shot scales up well too!), I would agree with you.

However, as I read the paper, it touts few-shot as this new, exciting capability that only properly emerges at the new scale. I expected that, if any given person found the paper impressive, it would be for this purported newness and not only "LM scaling continues," and this does seem to be the case (e.g. gwern, dxu). So there is a real, object-level dispute over the extent to which this is a qualitative jump.

I'm not sure I have concrete social epistemology goals except "fewer false beliefs" -- that is, I am concerned with group beliefs, but only because they point to which truths will be most impactful to voice. I predicted people would be overly impressed with few-shot, and I wanted to counter that. Arguably I should have concentrated less on "does this deserve the title GPT-3?" and more heavily on few-shot, as I've done more recently.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T15:48:59.151Z · LW · GW
Are there bits of evidence against general reasoning ability in GPT-3? Any answers it gives that it would obviously not give if it had a shred of general reasoning ability?

In the post I gestured towards the first test I would want to do here -- compare its performance on arithmetic to its performance on various "fake arithmetics." If #2 is the mechanism for its arithmetic performance, then I'd expect fake arithmetic performance which

  • is roughly comparable to real arithmetic performance (perhaps a bit worse but not vastly so)
  • is at least far above random guessing
  • more closely correlates with the compressibility / complexity of the formal system than with its closeness to real arithmetic
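
A minimal sketch of how such a comparison might be set up. Everything here is an assumption made for illustration: the family of operations (affine maps mod 100, chosen so that "real" and "fake" arithmetic have the same form and bounded answers), the `@` symbol, and the `complete` callable, which is a hypothetical stand-in for whatever model is being queried.

```python
import random

def make_op(p, q, r, modulus):
    """Return the binary operation a @ b = (p*a + q*b + r) mod modulus."""
    def op(a, b):
        return (p * a + q * b + r) % modulus
    return op

def few_shot_prompt(op, n_shots, modulus, seed, symbol="@"):
    """Build n_shots solved examples plus one unsolved query.

    Returns (prompt_text, expected_answer_string).
    """
    rng = random.Random(seed)
    pairs = [(rng.randrange(modulus), rng.randrange(modulus))
             for _ in range(n_shots + 1)]
    lines = [f"{a} {symbol} {b} = {op(a, b)}" for a, b in pairs[:-1]]
    qa, qb = pairs[-1]
    lines.append(f"{qa} {symbol} {qb} =")
    return "\n".join(lines), str(op(qa, qb))

def accuracy(complete, op, n_trials, n_shots=20, modulus=100):
    """Score a text-completion function (hypothetical) on generated prompts."""
    hits = 0
    for seed in range(n_trials):
        prompt, answer = few_shot_prompt(op, n_shots, modulus, seed)
        hits += complete(prompt).strip() == answer
    return hits / n_trials

real_add = make_op(1, 1, 0, 100)   # ordinary addition, mod 100
fake_add = make_op(3, 7, 5, 100)   # same form, but not familiar arithmetic
```

Comparing `accuracy(complete, real_add, ...)` against `accuracy(complete, fake_add, ...)` for several fake operations of varying complexity would be one way to probe the three predictions above.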

BTW, I want to reiterate that #2 is about non-linguistic general reasoning, the ability to few-shot learn generic formal systems with no relation to English. So the analogies and novel words results seem irrelevant here, although word scramble results may be relevant, as dmtea says.


There's something else I keep wanting to say, because it's had a large influence on my beliefs, but is hard to phrase in an objective-sounding way . . . I've had a lot of experience with GPT-2:

  • I was playing around with fine-tuning soon after 117M was released, and jumped to each of the three larger versions shortly after its release. I have done fine-tuning with at least 11 different text corpora I prepared myself.
  • All this energy for GPT-2 hobby work eventually converged into my tumblr bot, which uses a fine-tuned 1.5B with domain-specific encoding choices and a custom sampling strategy ("middle-p"), and generates 10-20 candidate samples per post which are then scored by a separate BERT model optimizing for user engagement and a sentiment model to constrain tone. It's made over 5000 posts so far and continues to make 15+ / day.

So, I think I have a certain intimate familiarity with GPT-2 -- what it "feels like" across the 4 released sizes and across numerous fine-tuning / sampling / etc strategies on many corpora -- that can't be acquired just by reading papers. And I think this makes me less impressed with arithmetic and other synthetic results than some people.

I regularly see my own GPT-2s do all sorts of cool tricks somewhat similar to these (in fact the biggest surprise here is how far you have to scale to get few-shot arithmetic!), and yet there are also difficult-to-summarize patterns of failure and ignorance which are remarkably resistant to scaling across the 117M-to-1.5B range. (Indeed, the qualitative difference across that range is far smaller than I had expected when only 117M was out.) GPT-2 feels like a very familiar "character" to me by now, and I saw that "character" persist across the staged release without qualitative jumps. I still wait for evidence that convinces me 175B is a new "character" and not my old, goofy friend with another lovely makeover.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T04:25:36.775Z · LW · GW
what, in my view, are the primary implications of the GPT-3 paper--namely, what it says about the viability of few-shot prediction as model capacity continues to increase

This seems like one crux of our disagreement. If I thought the paper showed a clear trend, with room to grow, toward much greater few-shot learning performance with even bigger models, I would be more impressed with "few-shot + large LM" as an approach.

I don't think it shows that. The clearest evidence on this subject, IMO, is the many plots in their Appendix H. On a large fraction of the individual downstream tasks, few-shot learning has either

  • a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC, MultiRC, ReCoRD, PhysicalQA, OpenBookQA, at least 5 of the 6 reading comprehension tasks, ANLI)
  • a very noisy trend where, due to noise, returns to scale might be large but might just as well be near zero (examples: BoolQ, CB, WSC)

The scaling trend is more encouraging on certain downstream tasks (COPA, ARC, Winogrande, many of the MT tasks), on "less downstream" tasks that essentially probe language modeling skill in a different way (cloze/completion), and on synthetic tasks.

On average, there is a trend toward slow but steady growth with scale (Fig 1.3), but this masks the great across-task variance catalogued above. The scaling picture for few-shot is very different from the scaling picture for LM loss itself, which as catalogued in another OpenAI paper is remarkably smooth and predictable, and which (as GPT-3 shows) continues smoothly to 175B.

I find it difficult to express just what I find unimpressive here without further knowledge of your position. (There is an asymmetry: "there is value in this paper" is a there-exists-an-x claim, while "there is no value in this paper" is a for-all-x claim. I'm not arguing for-all-x, only that I have not seen any x yet.)

All I can do is enumerate and strike out all the "x"s I can think of. Does few-shot learning look promising in the scaling limit?

  • As a tool for humans: no, I expect fine-tuning will always be preferred.
  • As a demonstration that transformers are very generic reasoners: no, we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic).
  • As an AGI component: no. Because few-shot learning on most tasks shows no clear scaling trend toward human level, any role of transformers in AGI will require more effective ways of querying them (such as fine-tuning controlled by another module), or non-transformer models.
Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T01:57:20.820Z · LW · GW

Since I'm not feigning ignorance -- I was genuinely curious to hear your view of the paper -- there's little I can do to productively continue this conversation.

Responding mainly to register (in case there's any doubt) that I don't agree with your account of my beliefs and motivations, and also to register my surprise at the confidence with which you assert things I know to be false.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-30T20:50:31.785Z · LW · GW

Perhaps I wasn't clear -- when I cited my experience as an ML practitioner, I did so in support of a claim about whether the stated capabilities of GPT-3 sound useful, not as a point about what those capabilities are.

I don't think the practical value of very new techniques is impossible to estimate. For example, the value of BERT was very clear in the paper that introduced it: it was obvious that this was a strictly better way to do supervised NLP, and it was quickly and widely adopted.

(I suppose it's conceivable that few-shot learning with a large model is "secretly useful" in some way not conveyed in the paper, but that's true of any paper, so if this proves anything then it proves too much.)

A smell test: what do you think your past experience would have predicted about the performance of a 175B-parameter model in advance?

Above I argued this question was orthogonal to my point, but to answer it anyway: I'd certainly predict better performance on LM tasks, as a simple extrapolation of the existing "biggening" research (GPT-2 at 1.5B parameters, Megatron-LM at 8.3B, T5 at 11B, T-NLG at 17B).

For downstream tasks, I'd expect similar scaling: certainly with fine-tuning (given T5's success on SuperGLUE) though GPT-3 was not fine-tuned, and also with unsupervised approaches (zero-shot, few-shot) given the reported scaling of GPT-2 zero-shot with model size (GPT-2 Fig 1).

I also would have predicted that fine-tuning still out-performs unsupervised approaches by a large margin on most tasks, a gap we observe with unsupervised GPT-3 vs. fine-tuned smaller models (presumably comparing to fine-tuned 175B models would yield an even larger gap).

I alluded to all this in the post, as did the GPT-3 authors in their paper: the results demonstrate that existing trends continue up to 175B. As Daniel Kokotajlo says, the new observation confirms an already familiar, though previously untested, prediction.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T21:01:55.886Z · LW · GW

It sounds like you think I'm nitpicking relatively minor points while ignoring the main significance of the paper. What do you think that main significance is?

I can see an argument that the value of few-shot LM prediction is its potential flexibility as a generic tool -- it can presumably do many tasks that are not standard benchmarks, weren't in the paper, etc.

Given my past ML experience, this just doesn't sound that promising to me, which may be our disconnect. In practical work I tend to find that a few days' work preparing a supervised dataset on my exact problem domain beats anything I can produce without that dataset. Few-shot learning apparently trades that few days of work for another non-zero time investment (finding the right prompt and output-reading methodology), generally worse performance, and (pending distillation successes) vastly larger compute requirements.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T19:57:18.240Z · LW · GW

If one ignores the "GPT-3" terminology, then yeah, it's a perfectly decent scaling-up-transformers paper similar to the others that have come out in the last few years. (A paper with some flaws, but that's not surprising.)

But, I would be very surprised if there isn't a lot of hype about this paper -- hype largely due to the "GPT-3" term, and the inappropriate expectations it sets. People are naturally going to think "GPT-3" is as much of a step forward as "GPT-2" was, and it isn't. I take a critical tone here in an effort to cut that hype off at the pass.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-13T17:36:34.443Z · LW · GW

I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."

Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."

However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.

If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T20:07:54.391Z · LW · GW

The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."

If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.

If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.

I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.

But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.

It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T19:32:47.925Z · LW · GW

For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."

The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: its contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t the weights on its outgoing connections is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.
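
The zero-input case can be checked numerically. The following is a small sketch, not Marcus' own setup: the network shape, tanh activation, squared-error loss, and learning rate are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid = 64, 4, 3
X = rng.normal(size=(n, d_in))
X[:, 2] = 0.0                      # input node 2 is always zero during training
y = rng.normal(size=(n, 1))

W1 = rng.normal(size=(d_in, d_hid))
b1 = np.zeros(d_hid)
W2 = rng.normal(size=(d_hid, 1))
W1_init = W1.copy()

for _ in range(100):
    # Forward pass: one tanh hidden layer, linear output.
    h = np.tanh(X @ W1 + b1)
    err = h @ W2 - y
    # Backward pass for the loss (1 / 2n) * sum(err**2), via the chain rule.
    grad_W2 = h.T @ err / n
    grad_pre = (err @ W2.T) * (1 - h ** 2)
    grad_W1 = X.T @ grad_pre / n   # row 2 multiplies by X[:, 2], which is all zeros
    grad_b1 = grad_pre.mean(axis=0)
    W1 -= 0.1 * grad_W1
    b1 -= 0.1 * grad_b1
    W2 -= 0.1 * grad_W2

# The weights leaving the always-zero node never move from their initial values.
zero_node_unchanged = np.array_equal(W1[2], W1_init[2])
other_nodes_changed = not np.array_equal(W1[0], W1_init[0])
```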

Then observe that, if the node takes any nonzero constant value during training, its connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)
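
A quick numerical illustration of the bias equivalence, under assumed shapes and an arbitrary constant of 2.5: dropping the constant node and folding its weights into the bias gives identical hidden-layer inputs, and the gradient for the constant node's weights is just a scaled copy of the bias gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, c = 5, 4, 3, 2.5
X = rng.normal(size=(n, d_in))
X[:, 2] = c                         # input node 2 is stuck at the constant c

W = rng.normal(size=(d_in, d_hid))
b = rng.normal(size=d_hid)

# Hidden-layer inputs with the constant node present.
pre_full = X @ W + b

# Drop the constant node and fold its contribution into the bias instead.
keep = [0, 1, 3]
pre_folded = X[:, keep] @ W[keep] + (b + c * W[2])

# For any upstream gradient G, the constant node's weight gradient
# is exactly c times the bias gradient.
G = rng.normal(size=(n, d_hid))     # stand-in for an arbitrary upstream gradient
grad_W = X.T @ G
grad_b = G.sum(axis=0)
```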

The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to $f^{-1}(c)$, where $f$ is the activation function and $c$ is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.
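
As a sanity check on the constant-output case, here is a sketch assuming a sigmoid activation and a target constant of 0.8 (both chosen only for illustration). With zero weights and the bias set to the inverse activation of the target, the output is right on every input and the squared-error gradient gives no push away from that solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

target = 0.8                        # the constant the output node always takes
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))        # arbitrary inputs; the output ignores them

w = np.zeros(4)                     # weights set to zero
b = logit(target)                   # bias set to the inverse activation of the target
out = sigmoid(X @ w + b)

# Gradient of mean squared error w.r.t. the weights: zero up to rounding,
# so training has no reason to move off this solution.
err = out - target
grad_w = X.T @ (err * out * (1 - out)) / len(X)
```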

None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T18:56:36.124Z · LW · GW

Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.

But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.

If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).

Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.

In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-02T18:53:34.981Z · LW · GW
In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.

For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
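
To make the two constraints concrete, here is a toy 1D sketch (layer names, sizes, and the unpadded windowing are assumptions for illustration). A convolution has both locality and weight sharing, a "locally connected" layer keeps locality but drops sharing, and a dense layer has neither.

```python
import numpy as np

def conv1d(x, kernel):
    """Locality + weight sharing: one kernel reused at every position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def locally_connected1d(x, kernels):
    """Locality without weight sharing: a separate kernel per position."""
    k = kernels.shape[1]
    return np.array([x[i:i + k] @ kernels[i] for i in range(len(x) - k + 1)])

def dense(x, W):
    """Neither constraint: every output sees every input with its own weight."""
    return W @ x

T, k = 8, 3                          # sequence length and window size
n_out = T - k + 1
rng = np.random.default_rng(0)
x = rng.normal(size=T)
kernel = rng.normal(size=k)
kernels = rng.normal(size=(n_out, k))
W = rng.normal(size=(n_out, T))

param_counts = {
    "conv": kernel.size,
    "locally_connected": kernels.size,
    "dense": W.size,
}
```

The parameter counts are ordered conv < locally connected < dense, but that ordering hides the point made above: the locally connected layer and a layer that shares weights across positions without a locality constraint relax different constraints, so they are not just two points on one "amount of bias" axis.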

As a sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.

On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T08:35:43.544Z · LW · GW

Thanks, the floor/ceiling distinction is helpful.

I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:

  • any resource-bound agent will have ceilings, so an account of embedded rationality needs a "theory of having good ceilings"
  • a "theory of having good ceilings" would be different from the sorts of "theories" we're used to thinking about, involving practical concerns at the fundamental desiderata level rather than as a matter of implementing an ideal after it's been specified

In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.

However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.

I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T07:14:45.998Z · LW · GW

I don't understand Thing #1. Perhaps, in the passage you quote from my post, the phrase "decision procedure" sounds misleadingly generic, as if I have some single function I use to make all my decisions (big and small) and we are talking about modifications to that function.

(I don't think that is really possible: if the function is sophisticated enough to actually work in general, it must have a lot of internal sub-structure, and the smaller things it does inside itself could be treated as "decisions" that aren't being made using the whole function, which contradicts the original premise.)

Instead, I'm just talking about the ordinary sort of case where you shift some resources away from doing X to thinking about better ways to do X, where X isn't the whole of everything you do.

Re: Q/A/A1, I guess I agree that these things are (as best I can tell) inevitably pragmatic. And that, as EY says in the post you link, "I'm managing the recursion to the best of my ability" can mean something better than just "I work on exactly N levels and then my decisions at level N+1 are utterly arbitrary." But then this seems to threaten the Embedded Agency programme, because it would mean we can't make theoretically grounded assessments or comparisons involving agents as strong as ourselves or stronger.

(The discussion of self-justification in this post was originally motivated by the topic of external assessment, on the premise that if we are powerful enough to assess a proposed AGI in a given way, it must also be powerful enough to assess itself in that way. And contrapositively, if the AGI can't assess itself in a given way then we can't assess it in that way either.)

Comment by nostalgebraist on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T06:00:20.806Z · LW · GW

I don't see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. "cheap win" strategies that are easily defeated by serious players and hence never come up in serious play.