on “learning to summarize” 2020-09-12T03:20:08.333Z · score: 22 (7 votes)
interpreting GPT: the logit lens 2020-08-31T02:47:08.426Z · score: 104 (35 votes)
is gpt-3 few-shot ready for real applications? 2020-08-03T19:50:09.740Z · score: 31 (12 votes)
[updated] how does gpt2′s training corpus capture internet discussion?  not well 2020-07-27T22:30:07.909Z · score: 24 (12 votes)
Why is pseudo-alignment "worse" than other ways ML can fail to generalize? 2020-07-18T22:54:50.957Z · score: 43 (11 votes)
GPT-3: a disappointing paper 2020-05-29T19:06:27.589Z · score: 54 (45 votes)
covid-19 notes, 4/19/20 2020-04-20T05:30:01.873Z · score: 29 (11 votes)
mind viruses about body viruses 2020-03-28T04:20:02.674Z · score: 56 (21 votes)
human psycholinguists: a critical appraisal 2019-12-31T00:20:01.330Z · score: 170 (64 votes)
“embedded self-justification,” or something like that 2019-11-03T03:20:01.848Z · score: 41 (14 votes)
When does rationality-as-search have nontrivial implications? 2018-11-04T22:42:01.452Z · score: 67 (24 votes)


Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T05:21:20.726Z · score: 1 (1 votes) · LW · GW

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?


No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this 

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author.  (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc.  (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far.  If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities?  A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character?  (Note that the character is a much better answer to the question "who does this sound like?")  And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words.  You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world.  And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet.  But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search would take a lot of parameters to implement internally.

Comment by nostalgebraist on Why GPT wants to mesa-optimize & how we might change this · 2020-09-26T02:47:29.410Z · score: 13 (4 votes) · LW · GW

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics.  Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.  So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at.  (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go.  It's more about trying to guess what the author of this text will do in the very next step.  The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.

To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in.  (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

Comment by nostalgebraist on on “learning to summarize” · 2020-09-13T17:58:13.740Z · score: 1 (1 votes) · LW · GW

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally, but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.
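
A minimal sketch of the horizon point, assuming an undiscounted return that sums at most `horizon` rewards from each step. The function name and the 5-step toy episode are made up for illustration and are not taken from the paper.

```python
def truncated_returns(rewards, horizon):
    """Undiscounted return from each step, summing at most `horizon` rewards."""
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]

# Hypothetical 5-step episode with a single reward at the very end.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]

full = truncated_returns(rewards, horizon=5)   # horizon equals the episode length
short = truncated_returns(rewards, horizon=3)  # horizon is n = 2 steps shorter
one = truncated_returns(rewards, horizon=1)    # only the final step sees any reward
```

With the shorter horizon the first two steps see a return of zero, and with a horizon of 1 only the last step is credited, which is the "dumb policy" case described above.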

Comment by nostalgebraist on on “learning to summarize” · 2020-09-12T21:06:35.673Z · score: 3 (2 votes) · LW · GW
I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

  • at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
  • when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.
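
A minimal sketch of the "most obvious" loss, assuming each annotated token carries a label of +1 (good), -1 (bad), or 0 (unannotated). The function name and the label encoding are hypothetical, chosen only for illustration.

```python
import math

def signed_token_loss(token_probs, labels):
    """Loss to minimize: raises likelihood of 'good' tokens, lowers it for 'bad' ones.

    token_probs: probability the model assigned to each annotated token.
    labels: +1 for tokens marked good, -1 for bad, 0 for unannotated.
    """
    return sum(-label * math.log(p) for p, label in zip(token_probs, labels))

# Toy case: " Canada" is marked bad after "Paris, the capital of".
loss_before = signed_token_loss([0.20], [-1])
loss_after = signed_token_loss([0.05], [-1])  # lower probability gives lower loss
```

Note that the "bad" term has no lower bound as the probability goes to zero, which is one way a loss like this could be poorly behaved in training.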

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

Comment by nostalgebraist on "Learning to Summarize with Human Feedback" - OpenAI · 2020-09-09T08:20:42.136Z · score: 10 (3 votes) · LW · GW

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)


IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.


As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.


The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T16:22:58.045Z · score: 1 (1 votes) · LW · GW

Interesting topic! I'm not confident this lens would reveal much about it (vs. attention maps or something), but it's worth a try.

I'd encourage you to try this yourself with the Colab notebook, since you presumably have more experience writing this kind of prompt than I do.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-02T00:02:08.712Z · score: 3 (2 votes) · LW · GW

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.
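
A rough sketch of the "almost orthogonal" idea, using random unit vectors in a toy space much smaller than GPT-2's 1600 dimensions. The sizes, seed, and function names are assumptions for illustration; nothing here is extracted from an actual model.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

rng = random.Random(0)
dim, count = 128, 200  # toy sizes: more vectors than dimensions
vectors = [random_unit_vector(dim, rng) for _ in range(count)]
overlap = max_abs_cosine(vectors)
```

An exactly orthogonal set could hold at most `dim` vectors; tolerating a nonzero `overlap` is what makes room for more.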

I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T22:11:25.371Z · score: 6 (4 votes) · LW · GW
One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model.

That's a great idea!

One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:52:47.645Z · score: 1 (1 votes) · LW · GW

Post has been now updated with a long-ish addendum about this topic.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T21:01:55.087Z · score: 1 (1 votes) · LW · GW

Good idea, I'll do that.

I know I'd run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it's smaller.

Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like "plasma" are "kept around" for later use.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T18:52:54.949Z · score: 8 (6 votes) · LW · GW
Maybe lm_head was set to be equal to wte transpose?

Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.

  • In OpenAI's tensorflow code, see lines 154 and 171 of src/ The variable "wte" is defined on 151, then re-used on 171.
  • In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)
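
A toy sketch of what "the same matrix used twice" means, assuming a 4-token vocabulary and an embedding width of 3. The matrix values and function names are invented for illustration and do not mirror either codebase.

```python
# Toy embedding matrix: one row per vocabulary token (4 tokens, width 3).
W_e = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

def embed(token_id):
    """Input side: look up the token's row of W_e."""
    return W_e[token_id]

def unembed(hidden):
    """Output side: logits are the hidden vector times W_e transposed."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W_e]

# With no transformer layers in between, a token's embedding scores each
# vocabulary row by its dot product with that same embedding.
logits = unembed(embed(1))
```

In a real model the transformer blocks sit between `embed` and `unembed`; the sketch only shows that no separate output matrix is needed.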

Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the "head" used to convert the final layer outputs into predictions. The intent is easy swapping between different "heads" with the same "body" beneath.

This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.

Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn't a critique of the latter; it's trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T15:19:43.991Z · score: 4 (3 votes) · LW · GW
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question "do the activations assign high or low probability to the input token?" One can answer the same question by plotting logits or ranks with the input layer included.)
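
A small sketch of comparing an intermediate layer's distribution against both the first and the final one. The logits are made up, and the direction of the KL (intermediate layer first) is an assumption rather than a statement about what the notebook computes.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical logits over a 3-token vocabulary at three depths.
first = softmax([4.0, 0.0, 0.0])
middle = softmax([1.0, 1.0, 0.0])
final = softmax([0.0, 4.0, 0.0])

kl_vs_first = kl_divergence(middle, first)
kl_vs_final = kl_divergence(middle, final)
```

In this symmetric toy case the middle layer is equally far from both ends; real activations need not behave this way.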

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values.

It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and L2 norm 1 before passing them to the next layer.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-09-01T04:23:51.125Z · score: 4 (3 votes) · LW · GW

Interesting, but not (I think?) the direction I was headed in.

I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

  • The model does poorly at position i, assigning very low probability to the true token residing at i+1.
  • To retain a clear view of the input sequence, the model now needs to "keep around" the true token at i+1, since its own guess is a poor proxy.
  • But early layers don't know that: they can't "look up" and notice the poor prediction. So they just treat i+1 like any other position. (I.e. there's no way to implement a selective "copy when we got it wrong" mechanism)
  • In late layers, position i+1 has been converted into a guess about i+2 by the earlier layers, so we can't rely on it to tell us what really occupied i+1.
  • And position i has been converted to a bad guess about position i+1, so if we use it as a proxy for i+1 we'll do poorly.

My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)

But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well.

I have no idea how it does it -- indeed the connection structure feels weirdly adverse to such operations -- but apparently it does. So it's probably premature to assume it can't do this well, and attempt to "help it out" with extra tricks.

Comment by nostalgebraist on interpreting GPT: the logit lens · 2020-08-31T23:57:17.016Z · score: 1 (1 votes) · LW · GW
Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...

Not sure I understand the distinction, could you rephrase?

If by "last slot" you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.

If by "last slot" you mean the token N+1 given tokens (1, 2, ... N), then no, that's not how GPT works. If you put in tokens (1, 2, ... N), you always get guesses for tokens (2, 3, ..., N+1) in response. This is true even if all you care about is the guess for N+1.
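
A minimal sketch of the offset-by-one relationship, with hypothetical helper names and a made-up token list. It only tracks positions and does not run a model.

```python
def predicted_positions(num_input_tokens):
    """1-indexed positions a causal LM produces guesses for, given tokens 1..N."""
    return [position + 1 for position in range(1, num_input_tokens + 1)]

def shifted_targets(tokens):
    """Training view: the input at position i is paired with the token at i+1."""
    return tokens[:-1], tokens[1:]

tokens = ["The", " cat", " sat", " down"]
positions = predicted_positions(len(tokens))  # guesses for positions 2..N+1
inputs, targets = shifted_targets(tokens)
```

Every input position yields a guess, so asking only about position N+1 just means discarding the other outputs.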

Comment by nostalgebraist on is gpt-3 few-shot ready for real applications? · 2020-08-08T02:06:12.657Z · score: 3 (2 votes) · LW · GW

People do this a lot with BERT, and it has its own problems -- the first section of this recent paper gives a good overview.

Then of course there is plenty of work trying to mitigate those problems, like that paper . . . but there are still various ways of doing so, with no clear consensus. So a more general statement of few-shot's promise might be "you don't have to worry about which fine-tuning setup you're going to use, out of the many available alternatives, all of which have pitfalls."

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-08-01T22:27:38.348Z · score: 1 (1 votes) · LW · GW

To be fair, it's not an apples-to-apples comparison.

GPT-3 few-shot learning gets to use less data. (Although much of superGLUE has tiny train sets, so this gap isn't as big as it sounds.) And with GPT-3 you don't have the storage overhead of a separate trained model for every task.

Back when I wrote this post, I really did not realize that OpenAI was serious about few-shot learning as a practical, competitive approach. I had assumed it was meant as a conceptual demonstration of meta-learning, or a new way to probe what LMs "know."

In other words, I implicitly assumed "oh, of course they aren't planning [something like the OpenAI API], it'd be uncharitable to assume they actually think this is a practical approach." Now it's clear that they do think that, which makes for a different conversation than the one I had expected here. (I'm still bearish on the approach, though.)

Comment by nostalgebraist on Are we in an AI overhang? · 2020-07-29T06:28:57.475Z · score: 9 (6 votes) · LW · GW

They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)

At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than one with 1024 tokens, if the two have equally many parameters. And that later tokens are harder to predict, and hence require more parameters if you want to reach some given loss threshold.

I'll have to think more about this and what it might mean for their other scaling laws... at the very least, it's an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-03T03:24:24.245Z · score: 6 (4 votes) · LW · GW

On the reading of the graphs:

All I can say is "I read them differently and I don't think further discussion of the 'right' way to read them would be productive."

Something that might make my perspective clear:

  • when I first read this comment, I thought "whoa, that 'phase change' point seems fair and important, maybe I just wasn't looking for that in the graphs"
  • and then I went back and looked at the graphs and thought "oh, no, that's obviously not distinguishable from noise; that's the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that Squad V2 graph looks like the other 5 reading comp graphs except with more noise," etc. etc.

I don't expect this will convince you I'm right, but the distance here seems more about generic "how to interpret plots in papers" stuff than anything interesting about GPT-3.

On this:

I can't think of a coherent model where both of these claims are simultaneously true; if you have one, I'd certainly be interested in hearing what it is.

Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them "less noisily" as the scale grows.

The intended connotation of my stance that "fine-tuning will outperform few-shot" is not "haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!" If anything, it's the opposite:

  • I think transformers have some limits (e.g. physical / spatial stuff). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
  • I think fine-tuning has shown itself to be a remarkably effective way to "get at" this knowledge for downstream tasks -- even with small data sets, not far in scale from the "data sets" used in few-shot.
  • So, I don't understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at two orders of magnitude lower, impresses me far more than the few-shot results).

Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.

So, this all feels a bit like being a dog owner who reads a new paper "demonstrating dogs' capacity for empathy with humans," is unimpressed w/ its methodology, and finds themselves arguing over what concrete model of "dog empathy" they hold and what it predicts for the currently popular "dog empathy" proxy metrics, with a background assumption that they're some sort of dog-empathy-skeptic.

When in fact -- they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.

I've already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even if I've seen them do that kind of thing in real life . . . should I be impressed?

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T19:53:18.684Z · score: 9 (6 votes) · LW · GW

I agree with you about hype management in general, I think. The following does seem like a point of concrete disagreement:

It sounds like you expected "GPT" to mean something more like "paradigm-breaker" and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.

If the paper had not done few-shot learning, and had just reviewed LM task performance / generation quality / zero-shot (note that zero-shot scales up well too!), I would agree with you.

However, as I read the paper, it touts few-shot as this new, exciting capability that only properly emerges at the new scale. I expected that, if any given person found the paper impressive, it would be for this purported newness and not only "LM scaling continues," and this does seem to be the case (e.g. gwern, dxu). So there is a real, object-level dispute over the extent to which this is a qualitative jump.

I'm not sure I have concrete social epistemology goals except "fewer false beliefs" -- that is, I am concerned with group beliefs, but only because they point to which truths will be most impactful to voice. I predicted people would be overly impressed with few-shot, and I wanted to counter that. Arguably I should have concentrated less on "does this deserve the title GPT-3?" and more heavily on few-shot, as I've done more recently.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T15:48:59.151Z · score: 10 (6 votes) · LW · GW
Are there bits of evidence against general reasoning ability in GPT-3? Any answers it gives that it would obviously not give if it had a shred of general reasoning ability?

In the post I gestured towards the first test I would want to do here -- compare its performance on arithmetic to its performance on various "fake arithmetics." If #2 is the mechanism for its arithmetic performance, then I'd expect fake arithmetic performance which

  • is roughly comparable to real arithmetic performance (perhaps a bit worse but not vastly so)
  • is at least far above random guessing
  • more closely correlates with the compressibility / complexity of the formal system than with its closeness to real arithmetic

BTW, I want to reiterate that #2 is about non-linguistic general reasoning, the ability to few-shot learn generic formal systems with no relation to English. So the analogies and novel words results seem irrelevant here, although word scramble results may be relevant, as dmtea says.


There's something else I keep wanting to say, because it's had a large influence on my beliefs, but is hard to phrase in an objective-sounding way . . . I've had a lot of experience with GPT-2:

  • I was playing around with fine-tuning soon after 117M was released, and jumped to each of the three larger versions shortly after its release. I have done fine-tuning with at least 11 different text corpora I prepared myself.
  • All this energy for GPT-2 hobby work eventually converged into my tumblr bot, which uses a fine-tuned 1.5B with domain-specific encoding choices and a custom sampling strategy ("middle-p"), and generates 10-20 candidate samples per post which are then scored by a separate BERT model optimizing for user engagement and a sentiment model to constrain tone. It's made over 5000 posts so far and continues to make 15+ / day.
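The generate-then-select step in the second bullet can be sketched roughly as below. This is not the bot's actual code: `engagement_score` and `sentiment_score` are hypothetical stand-ins for the BERT selector and the sentiment model, and combining them as "hard sentiment threshold, then pick the top engagement score" is an assumption made for illustration.

```python
def pick_candidate(candidates, engagement_score, sentiment_score, min_sentiment):
    """Return the highest-scoring candidate whose sentiment passes the threshold.

    engagement_score and sentiment_score are callables (hypothetical stand-ins
    for the selector and sentiment models). Returns None if nothing passes.
    """
    allowed = [c for c in candidates if sentiment_score(c) >= min_sentiment]
    if not allowed:
        return None
    return max(allowed, key=engagement_score)

# Toy stand-ins so the sketch runs; real scorers would be model calls.
toy_engagement = len
toy_sentiment = lambda text: -1.0 if "awful" in text else 1.0

samples = ["short post", "a much longer but awful post", "a medium-length post"]
chosen = pick_candidate(samples, toy_engagement, toy_sentiment, min_sentiment=0.0)
```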

So, I think I have a certain intimate familiarity with GPT-2 -- what it "feels like" across the 4 released sizes and across numerous fine-tuning / sampling / etc strategies on many corpora -- that can't be acquired just by reading papers. And I think this makes me less impressed with arithmetic and other synthetic results than some people.

I regularly see my own GPT-2s do all sorts of cool tricks somewhat similar to these (in fact the biggest surprise here is how far you have to scale to get few-shot arithmetic!), and yet there are also difficult-to-summarize patterns of failure and ignorance which are remarkably resistant to scaling across the 117M-to-1.5B range. (Indeed, the qualitative difference across that range is far smaller than I had expected when only 117M was out.) GPT-2 feels like a very familiar "character" to me by now, and I saw that "character" persist across the staged release without qualitative jumps. I still wait for evidence that convinces me 175B is a new "character" and not my old, goofy friend with another lovely makeover.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T04:25:36.775Z · score: 13 (10 votes) · LW · GW
what, in my view, are the primary implications of the GPT-3 paper--namely, what it says about the viability of few-shot prediction as model capacity continues to increase

This seems like one crux of our disagreement. If I thought the paper shows a clear trend, with room to grow, toward much greater few-shot learning performance with even bigger models, I would be more impressed with "few-shot + large LM" as an approach.

I don't think it shows that. The clearest evidence on this subject, IMO, is the many plots in their Appendix H. On a large fraction of the individual downstream tasks, few-shot learning has either

  • a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC, MultiRC, ReCoRD, PhysicalQA, OpenBookQA, at least 5 of the 6 reading comprehension tasks, ANLI)
  • a very noisy trend where, due to noise, returns to scale might be large but might just as well be near zero (examples: BoolQ, CB, WSC)

The scaling trend is more encouraging on certain downstream tasks (COPA, ARC, Winogrande, many of the MT tasks), on "less downstream" tasks that essentially probe language modeling skill in a different way (cloze/completion), and on synthetic tasks.

On average, there is a trend toward slow but steady growth with scale (Fig 1.3), but this masks the great across-task variance catalogued above. The scaling picture for few-shot is very different from the scaling picture for LM loss itself, which as catalogued in another OpenAI paper is remarkably smooth and predictable, and which (as GPT-3 shows) continues smoothly to 175B.

I find it difficult to express just what I find unimpressive here without further knowledge of your position. (There is an asymmetry: "there is value in this paper" is a there-exists-an-x claim, while "there is no value in this paper" is a for-all-x claim. I'm not arguing for-all-x, only that I have not seen any x yet.)

All I can do is enumerate and strike out all the "x"s I can think of. Does few-shot learning look promising in the scaling limit?

  • As a tool for humans: no, I expect fine-tuning will always be preferred.
  • As a demonstration that transformers are very generic reasoners: no, we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic).
  • As an AGI component: no. Because few-shot learning on most tasks shows no clear scaling trend toward human level, any role of transformers in AGI will require more effective ways of querying them (such as fine-tuning controlled by another module), or non-transformer models.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T01:57:20.820Z · score: 11 (10 votes) · LW · GW

Since I'm not feigning ignorance -- I was genuinely curious to hear your view of the paper -- there's little I can do to productively continue this conversation.

Responding mainly to register (in case there's any doubt) that I don't agree with your account of my beliefs and motivations, and also to register my surprise at the confidence with which you assert things I know to be false.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-30T20:50:31.785Z · score: 4 (5 votes) · LW · GW

Perhaps I wasn't clear -- when I cited my experience as an ML practitioner, I did so in support of a claim about whether the stated capabilities of GPT-3 sound useful, not as a point about what those capabilities are.

I don't think the practical value of very new techniques is impossible to estimate. For example, the value of BERT was very clear in the paper that introduced it: it was obvious that this was a strictly better way to do supervised NLP, and it was quickly and widely adopted.

(I suppose it's conceivable that few-shot learning with a large model is "secretly useful" in some way not conveyed in the paper, but that's true of any paper, so if this proves anything then it proves too much.)

A smell test: what do you think your past experience would have predicted about the performance of a 175B-parameter model in advance?

Above I argued this question was orthogonal to my point, but to answer it anyway: I'd certainly predict better performance on LM tasks, as a simple extrapolation of the existing "biggening" research (GPT-2 at 1.5B parameters, Megatron-LM at 8.3B, T5 at 11B, T-NLG at 17B).

For downstream tasks, I'd expect similar scaling: certainly with fine-tuning (given T5's success on SuperGLUE) though GPT-3 was not fine-tuned, and also with unsupervised approaches (zero-shot, few-shot) given the reported scaling of GPT-2 zero-shot with model size (GPT-2 Fig 1).

I also would have predicted that fine-tuning still out-performs unsupervised approaches by a large margin on most tasks, a gap we observe with unsupervised GPT-3 vs. fine-tuned smaller models (presumably comparing to fine-tuned 175B models would yield an even larger gap).

I alluded to all this in the post, as did the GPT-3 authors in their paper: the results demonstrate that existing trends continue up to 175B. As Daniel Kokotajlo says, the new observation confirms an already familiar, though previously untested, prediction.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T21:01:55.886Z · score: 14 (13 votes) · LW · GW

It sounds like you think I'm nitpicking relatively minor points while ignoring the main significance of the paper. What do you think that main significance is?

I can see an argument that the value of few-shot LM prediction is its potential flexibility as a generic tool -- it can presumably do many tasks that are not standard benchmarks, weren't in the paper, etc.

Given my past ML experience, this just doesn't sound that promising to me, which may be our disconnect. In practical work I tend to find that a few days' work preparing a supervised dataset on my exact problem domain beats anything I can produce without that dataset. Few-shot learning apparently trades that few days of work for another non-zero time investment (finding the right prompt and output-reading methodology), generally worse performance, and (pending distillation successes) vastly larger compute requirements.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T19:57:18.240Z · score: 14 (9 votes) · LW · GW

If one ignores the "GPT-3" terminology, then yeah, it's a perfectly decent scaling-up-transformers paper similar to the others that have come out in the last few years. (A paper with some flaws, but that's not surprising.)

But, I would be very surprised if there isn't a lot of hype about this paper -- hype largely due to the "GPT-3" term, and the inappropriate expectations it sets. People are naturally going to think "GPT-3" is as much of a step forward as "GPT-2" was, and it isn't. I take a critical tone here in an effort to cut that hype off at the pass.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-13T17:36:34.443Z · score: 3 (2 votes) · LW · GW

I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."

Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."

However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.

If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T20:07:54.391Z · score: 12 (3 votes) · LW · GW

The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."

If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.

If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.

I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.

But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.

It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T19:32:47.925Z · score: 4 (2 votes) · LW · GW

For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."

The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: its contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t that node's outgoing weights is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.

Then observe that, if the node instead takes some nonzero constant value during training, its connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)
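Both observations can be checked numerically. The sketch below uses a single tanh unit with squared-error loss; the unit, the loss, and all the numbers are chosen for illustration and do not come from Marcus. It computes the gradients by hand via the chain rule.

```python
import math

def grads(w, b, x, target):
    """Gradients of 0.5 * (tanh(w.x + b) - target)^2 with respect to w and b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    h = math.tanh(z)
    delta = (h - target) * (1.0 - h * h)   # dLoss/dz via the chain rule
    grad_w = [delta * xi for xi in x]      # dz/dw_i = x_i
    grad_b = delta                         # dz/db = 1
    return grad_w, grad_b

w, b = [0.3, -0.7, 0.5], 0.1

# Case 1: the last input node is always zero, so its weight never receives gradient.
gw_zero, _ = grads(w, b, x=[1.2, -0.4, 0.0], target=0.5)

# Case 2: the last input node is always the constant c, so its weight's gradient
# is exactly c times the bias gradient; the two can only move in lockstep.
c = 2.0
gw_const, gb_const = grads(w, b, x=[1.2, -0.4, c], target=0.5)
```

In case 2 the weight and the bias only ever enter the unit through the combination `w[2] * c + b`, which is why training cannot tell them apart.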

The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to $f^{-1}(c)$, where $f$ is the activation function and $c$ is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.
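A concrete instance of that degenerate solution, assuming a sigmoid activation (an assumption; the argument does not depend on the choice as long as the constant lies in the activation's range): with all weights at zero and the bias set to the inverse activation applied to the constant, the unit outputs that constant for every input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

c = 0.8                      # the constant target seen throughout training
weights = [0.0, 0.0, 0.0]    # ignore the input entirely
bias = logit(c)              # inverse activation applied to the constant

def output(x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Three arbitrary inputs; the output is the same constant for each.
outputs = [output(x) for x in ([0.0, 1.0, 2.0], [-3.0, 5.0, 0.5], [10.0, -10.0, 7.0])]
```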

None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T18:56:36.124Z · score: 2 (2 votes) · LW · GW

Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.

But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.

If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).

Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.

In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-02T18:53:34.981Z · score: 5 (4 votes) · LW · GW
In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.

For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
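To make the two constraints concrete, here is a minimal 1D sketch in plain Python (function names and numbers are illustrative assumptions). Both layers are local, since each output sees only a window of the input, but only the first shares weights across positions.

```python
def conv1d(x, kernel):
    """Locality + weight sharing: one kernel reused at every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

def locally_connected1d(x, kernels):
    """Locality without weight sharing: a separate kernel for each output position."""
    k = len(kernels[0])
    return [sum(kernels[i][j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
shared = [1.0, 0.0, -1.0]
per_position = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 2.0, 0.0]]

y_conv = conv1d(x, shared)                        # 3 parameters in total
y_local = locally_connected1d(x, per_position)    # 3 parameters per output position
```

The convolution is the special case of the locally connected layer in which every position uses the same kernel. Dropping that constraint means whatever is learned about a pattern at one position does not transfer to the same pattern elsewhere, which is the property one would expect to matter for text.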

As a sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.

On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T08:35:43.544Z · score: 7 (5 votes) · LW · GW

Thanks, the floor/ceiling distinction is helpful.

I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:

  • any resource-bound agent will have ceilings, so an account of embedded rationality needs a "theory of having good ceilings"
  • a "theory of having good ceilings" would be different from the sorts of "theories" we're used to thinking about, involving practical concerns at the fundamental desiderata level rather than as a matter of implementing an ideal after it's been specified

In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.

However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.

I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T07:14:45.998Z · score: 1 (1 votes) · LW · GW

I don't understand Thing #1. Perhaps, in the passage you quote from my post, the phrase "decision procedure" sounds misleadingly generic, as if I have some single function I use to make all my decisions (big and small) and we are talking about modifications to that function.

(I don't think that is really possible: if the function is sophisticated enough to actually work in general, it must have a lot of internal sub-structure, and the smaller things it does inside itself could be treated as "decisions" that aren't being made using the whole function, which contradicts the original premise.)

Instead, I'm just talking about the ordinary sort of case where you shift some resources away from doing X to thinking about better ways to do X, where X isn't the whole of everything you do.

Re: Q/A/A1, I guess I agree that these things are (as best I can tell) inevitably pragmatic. And that, as EY says in the post you link, "I'm managing the recursion to the best of my ability" can mean something better than just "I work on exactly N levels and then my decisions at level N+1 are utterly arbitrary." But then this seems to threaten the Embedded Agency programme, because it would mean we can't make theoretically grounded assessments or comparisons involving agents as strong as ourselves or stronger.

(The discussion of self-justification in this post was originally motivated by the topic of external assessment, on the premise that if we are powerful enough to assess a proposed AGI in a given way, it must also be powerful enough to assess itself in that way. And contrapositively, if the AGI can't assess itself in a given way then we can't assess it in that way either.)

Comment by nostalgebraist on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T06:00:20.806Z · score: 10 (6 votes) · LW · GW

I don't see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. "cheap win" strategies that are easily defeated by serious players and hence never come up in serious play.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T05:29:41.568Z · score: 3 (2 votes) · LW · GW

It's not really about doing well/better in all domains, it's about being able to explain how you can do well at all of the things you do, even if that isn't nearly everything. And making that explanation complete enough to be convincing, as an argument about the real world assessed using your usual standards, while still keeping it limited enough to avoid self-reference problems.

Comment by nostalgebraist on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:24:28.122Z · score: 7 (5 votes) · LW · GW

IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.

AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it "really experienced" during self-play episodes but also (in a less direct way) from the much larger pool of game trees it "imagined" while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.

Comment by nostalgebraist on Embedded World-Models · 2019-07-06T23:19:55.694Z · score: 10 (3 votes) · LW · GW
I think you're confusing behavior with implementation.

I'm definitely not treating these as interchangeable -- my argument is about how, in a certain set of cases, they are importantly not interchangeable.

Specifically, I'm arguing that certain characterizations of ideal behavior cannot help us explain why any given implementation approximates that behavior well or poorly.

I don't understand how the rest of your points engage with my argument. Yes, there is a good reason Solomonoff does a weighted average and not an argmax; I don't see how this affects my argument one way or the other. Yes, fully general theories can be valuable even when they're not practical to apply directly to real problems; I was arguing that a specific type of fully general theory lacks a specific type of practical value, one which people sometimes expect that type of theory to have.

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-09T05:33:35.367Z · score: 7 (4 votes) · LW · GW
But it seems like the core strategy--be both doing object-level cognition and meta-level cognition about how you're doing object-level cognition--is basically the same.
It remains unclear to me whether the right way to find these meta-strategies is something like "start at the impractical ideal and rescue what you can" or "start with something that works and build new features"; it seems like modern computational Bayesian methods look more like the former than the latter.

I'd argue that there's usually a causal arrow from practical lore to impractical ideals first, even if the ideals also influence practice at a later stage. Occam's Razor came before Solomonoff; "change your mind when you see surprising new evidence" came before formal Bayes. The "core strategy" you refer to sounds like "do both exploration and exploitation," which is the sort of idea I'd imagine goes back millennia (albeit not in those exact terms).

One of my goals in writing this post was to formalize the feeling I get, when I think about an idealized theory of this kind, that it's a "redundant step" added on top of something that already does all the work by itself -- like taking a decision theory and appending the rule "take the actions this theory says to take." But rather than being transparently vacuous, like that example, they are vacuous in a more hidden way, and the redundant steps they add tend to resemble legitimately good ideas familiar from practical experience.

Consider the following (ridiculous) theory of rationality: "do the most rational thing, and also, remember to stay hydrated :)". In a certain inane sense, most rational behavior "conforms to" this theory, since the theory parasitizes on whatever existing notion of rationality you had, and staying hydrated is generally a good idea and thus does not tend to conflict with rationality. And whenever staying hydrated is a good idea, one could imagine pointing to this theory and saying "see, there's the hydration theory of rationality at work again." But, of course, none of this should actually count in the "hydration theory's" favor: all the real work is hidden in the first step ("do the most rational thing"), and insofar as hydration is rational, there's no need to specify it explicitly. This doesn't quite map onto the schema, but captures the way in which I think these theories tend to confuse people.

If the more serious ideals we're talking about are like the "hydration theory," we'd expect them to have the appearance of explaining existing practical methods, and of retrospectively explaining the success of new methods, while not being very useful for generating any new methods. And this seems generally true to me: there's a lot of ensemble-like or regularization-like stuff in ML that can be interpreted as Bayesian averaging/updating over some base space of models, but most of the excitement in ML is in these base spaces. We didn't get neural networks from Bayesian first principles.
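For illustration, here is roughly what "Bayesian averaging over some base space of models" looks like when written out. The base models, the data, and the function names are all hypothetical; this is a toy sketch, not a claim about how any particular ML method is implemented.

```python
import math

def posterior_weights(log_likelihoods, prior=None):
    """Posterior over base models from their log-likelihoods on observed data."""
    n = len(log_likelihoods)
    prior = prior or [1.0 / n] * n
    log_post = [math.log(p) + ll for p, ll in zip(prior, log_likelihoods)]
    m = max(log_post)                               # subtract the max for numerical stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def averaged_prediction(predictions, weights):
    """Posterior-weighted average of the base models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Three hypothetical base models, each predicting P(next coin flip = heads).
base_predictions = [0.2, 0.5, 0.9]
data = [1, 1, 0, 1]   # toy observations, 1 = heads
log_liks = [sum(math.log(p if flip else 1.0 - p) for flip in data) for p in base_predictions]

weights = posterior_weights(log_liks)
prediction = averaged_prediction(base_predictions, weights)
```

The averaging step is a few generic lines that work for any set of base models. Everything that determines whether the final prediction is any good lives in `base_predictions`, which the Bayesian wrapper takes as given.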

Comment by nostalgebraist on Subsystem Alignment · 2018-11-07T17:02:06.988Z · score: 4 (5 votes) · LW · GW

Does "subsystem alignment" cover every instance of a Goodhart problem in agent design, or just a special class of problems that arises when the sub-systems are sufficiently intelligent?

As stated, that's a purely semantic question, but I'm concerned with a more-than-semantic issue here. When we're talking about all Goodhart problems in agent design, we're talking about a class of problems that already comes up in all sorts of practical engineering, and which can be satisfactorily handled in many real cases without needing any philosophical advances. When I make ML models at work, I worry about overfitting and about misalignments between the loss function and my true goals, but it's usually easy to place bounds on how much trouble these things can cause. Unlike humans interacting with "evolution," my models don't live in a messy physical world with porous boundaries; they can only control their output channel, and it's easy to place safety restrictions on the output of that channel, outside the model. This is like "boxing the AI," but my "AI" is so dumb that this is clearly safe. (We could get even clearer examples by looking at non-ML engineers building components that no one would call AI.)

Now, once the subsystem is "intelligent enough," maybe we have something like a boxed AGI, with the usual boxed AGI worries. But it doesn't seem obvious to me that "the usual boxed AGI worries" have to carry over to this case. Making a subsystem strikes me as a more favorable case for "tool AI" arguments than making something with a direct interface to physical reality, since you have more control over what the output channel does and does not influence, and the task may be achievable even with a very limited input channel. (As an example, one of the ML models I work on has an output channel that just looks like "show a subset of these things to the user"; if you replaced it with a literal superhuman AGI, but kept the output channel the same, not much could go wrong. This isn't the kind of output channel we'd expect to hook up to a real AGI, but that's my point: sometimes what you want out of your subsystem just isn't rich enough to make boxing fail, and maybe that's enough.)
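To make "safety restrictions on the output channel, outside the model" concrete, here is a minimal sketch. Every name in it (`guarded_selection`, `max_shown`, the toy model) is hypothetical and not taken from any real system; the only assumption is that the subsystem's job is to pick a subset of known candidates.

```python
def guarded_selection(model, candidates, max_shown):
    """Run an untrusted `model`, then enforce the output channel outside it.

    Whatever the model returns, the caller only ever receives at most
    `max_shown` distinct strings drawn from `candidates`.
    """
    allowed = set(candidates)
    shown = []
    for item in model(candidates):
        if len(shown) >= max_shown:
            break
        # Anything that is not a known candidate is silently dropped.
        if isinstance(item, str) and item in allowed and item not in shown:
            shown.append(item)
    return shown


def misbehaving_model(candidates):
    """Stand-in for an arbitrarily clever subsystem that ignores its instructions."""
    return ["b", "rm -rf /", "a", "b", {"not": "a string"}, "c"]


result = guarded_selection(misbehaving_model, ["a", "b", "c"], max_shown=2)
print(result)
```

The guard constrains only what passes through the channel. It does nothing about resource use inside the model, or about the content of the candidates themselves.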

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-06T02:02:20.647Z · score: 9 (3 votes) · LW · GW

I was not aware of these results -- thanks. I'd glanced at the papers on reflective oracles but mentally filed them as just about game theory, when of course they are really very relevant to the sort of thing I am concerned with here.

We have a remaining semantic disagreement. I think you're using "embeddedness" quite differently than it's used in the "Embedded World-Models" post. For example, in that post (text version):

In a traditional Bayesian framework, “learning” means Bayesian updating. But as we noted, Bayesian updating requires that the agent start out large enough to consider a bunch of ways the world can be, and learn by ruling some of these out.

Embedded agents need resource-limited, logically uncertain updates, which don’t work like this.

Unfortunately, Bayesian updating is the main way we know how to think about an agent progressing through time as one unified agent. The Dutch book justification for Bayesian reasoning is basically saying this kind of updating is the only way to not have the agent’s actions on Monday work at cross purposes, at least a little, to the agent’s actions on Tuesday.

Embedded agents are non-Bayesian. And non-Bayesian agents tend to get into wars with their future selves.

The 2nd and 4th paragraphs here are clearly false for reflective AIXI. And the 2nd paragraph implies that embedded agents are definitionally resource-limited. There is a true and important sense in which reflective AIXI can be "embedded" -- that was the point of coming up with it! -- but the Embedded Agency sequence seems to be excluding this kind of case when it talks about embedded agents. This strikes me as something I'd like to see clarified by the authors of the sequence, actually.

I think the difference may be that when we talk about "a theory of rationality for embedded agents," we could mean "a theory that has consequences for agents equally powerful to it," or we could mean something more like "a theory that has consequences for agents of arbitrarily low power." Reflective AIXI (as a theory of rationality) explains why reflective AIXI (as an agent) is optimally designed, but it can't explain why a real-world robot might or might not be optimally designed.

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-05T20:44:08.796Z · score: 5 (3 votes) · LW · GW

My argument isn’t specialized to AIXI — note that I also used LIA as an example, which has a weaker R along with a weaker S.

Likewise, if you put AIXI in a world whose parts can do uncomputable things (like AIXI), you have the same pattern one level up. Your S is stronger, with uncomputable strategies, but by the same token, you lose AIXI’s optimality. It’s only searching over computable strategies, and you have to look at all strategies (including the uncomputable ones) to make sure you’re optimal. This leads to a rule R distinct from AIXI, just as AIXI is distinct from a Turing machine.

I guess it’s conceivable that this hits a fixed point at this level or some higher level? That would be abstractly interesting but not very relevant to embeddedness in the kind of world I think I inhabit.

Comment by nostalgebraist on Embedded World-Models · 2018-11-05T16:55:12.672Z · score: 13 (5 votes) · LW · GW

OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "it's a truncation of the optimum," since they all are; these truncation properties are facts about the different series, not about the optimum. I illustrate with different series expansions for the same quantity.
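A quick numerical sketch of the truncation point, taking π as the target purely for illustration (the choice of target and of these two series is an assumption of this example). Both series converge to the same limit, yet their 10-term truncations differ in accuracy by orders of magnitude.

```python
import math


def leibniz(n_terms):
    """pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...), truncated to n_terms terms."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))


def nilakantha(n_terms):
    """pi = 3 + 4/(2*3*4) - 4/(4*5*6) + ..., truncated to n_terms correction terms."""
    total = 3.0
    for k in range(1, n_terms + 1):
        total += (-1) ** (k + 1) * 4.0 / ((2 * k) * (2 * k + 1) * (2 * k + 2))
    return total


N = 10
leibniz_error = abs(leibniz(N) - math.pi)
nilakantha_error = abs(nilakantha(N) - math.pi)
print(f"Leibniz error after {N} terms:    {leibniz_error:.6f}")
print(f"Nilakantha error after {N} terms: {nilakantha_error:.6f}")
```

Nothing about the limit explains the gap between the two errors; it comes entirely from how fast each series' terms shrink.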

Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search.

You may be right, and there are interesting conversations to be had about when solutions will tend to look like search and when they won't. But this doesn't feel like it really addresses my argument, which is not about "what kind of algorithm should you use" but about the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the optimization procedures you could ever do. There is a logical / definitional weirdness here that can't be resolved by arguments about what sorts of (logically / definitionally unproblematic) algorithms are good or bad in what domains.

Comment by nostalgebraist on Embedded World-Models · 2018-11-04T21:31:22.880Z · score: 18 (9 votes) · LW · GW

This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will inevitably make theories of this general type irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:

Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations that are similarly as strong as the foundations Bayesianism provides for dualistic agents.

Most "theories of rational belief" I have encountered -- including Bayesianism in the sense I think is meant here -- are framed at the level of an evaluator outside the universe, and have essentially no content when we try to transfer them to individual embedded agents. This is because these theories tend to be derived in the following way:

  • We want a theory of the best possible behavior for agents.
  • We have some class of "practically achievable" strategies S, which can actually be implemented by agents. We note that an agent's observations provide some information about the quality of different strategies s ∈ S. So if it were possible to follow a rule like R = "find the best s ∈ S given your observations, and then follow that s," this rule would spit out very good agent behavior.
  • Usually we soften this to a performance-weighted average rather than a hard argmax, but the principle is the same: if we could search over all of S, the rule R that says "do the search and then follow what it says" can be competitive with the very best s ∈ S. (Trivially so, since it has access to the best strategies, along with all the others.)
  • But usually R ∉ S. That is, the strategy "search over all practical strategies and follow the best ones" is not a practical strategy. But we argue that this is fine, since we are constructing a theory of ideal behavior. It doesn't have to be practically implementable.

For example, in Solomonoff, S is defined by computability while R is allowed to be uncomputable. In the LIA construction, S is defined by polytime complexity while R is allowed to run slower than polytime. In logically-omniscient Bayes, finite sets of hypotheses can be manipulated in a finite universe but the full Boolean algebra over hypotheses generally cannot.
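Here is a toy version of the schema, as a sketch under loud assumptions. "Practical" is modeled as a made-up per-decision cost budget, the strategies are trivial bit predictors, and every name (`Strategy`, `BUDGET`, `make_rule`) is hypothetical. It is not Solomonoff or LIA, only the same shape: the search rule has to pay for every strategy it searches over, so it falls outside the class it searches.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    cost: int  # abstract compute units per decision (an assumption of this toy)
    predict: Callable[[Sequence[int]], int]  # history of bits -> guess for next bit


BUDGET = 3  # a strategy is "practical" (a member of S) iff cost <= BUDGET


def score(strategy, history):
    """How many past bits the strategy would have predicted correctly."""
    return sum(strategy.predict(history[:t]) == history[t] for t in range(len(history)))


def make_rule(strategies):
    """The search rule: evaluate every strategy on the history, follow the best."""
    def predict(history):
        best = max(strategies, key=lambda s: score(s, history))
        return best.predict(history)
    # The rule runs every strategy, so its cost is at least the sum of theirs.
    return Strategy("R", cost=sum(s.cost for s in strategies), predict=predict)


S = [
    Strategy("always0", 1, lambda h: 0),
    Strategy("always1", 1, lambda h: 1),
    Strategy("copy_last", 2, lambda h: h[-1] if h else 0),
    Strategy("alternate", 3, lambda h: 1 - h[-1] if h else 0),
]
R = make_rule(S)

history = [0, 1, 0, 1, 0, 1]
r_prediction = R.predict(history)  # follows "alternate", the best scorer here
r_is_practical = R.cost <= BUDGET
print(r_prediction, R.cost, r_is_practical)
```

Raising `BUDGET` until the rule fits does not help: the enlarged class then contains new strategies that the old rule never searched over.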

I hope the framework I've just introduced helps clarify what I find unpromising about these theories. By construction, any agent you can actually design and run is a single element of S (a "practical strategy"), so every fact about rationality that can be incorporated into agent design gets "hidden inside" the individual s ∈ S, and the only things you can learn from the "ideal theory" are things which can't fit into a practical strategy.

For example, suppose (reasonably) that model averaging and complexity penalties are broadly good ideas that lead to good results. But all of the model averaging and complexity penalization that can be done computably happens inside some Turing machine or other, at the level "below" Solomonoff. Thus Solomonoff only tells you about the extra advantage you can get by doing these things uncomputably. Any kind of nice Bayesian average over Turing machines that can happen computably is (of course) just another Turing machine.
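The closure point can be sketched in a few lines. This is an illustration with hypothetical names and a deliberately tiny class of predictors, not a model of Solomonoff induction: a Bayesian average over finitely many predictors has the same type as its inputs, so it is one more ordinary predictor.

```python
from typing import Callable, Dict, Sequence

# A predictor maps a history of bits to P(next bit = 1).
Predictor = Callable[[Sequence[int]], float]


def mixture(predictors: Dict[str, Predictor], prior: Dict[str, float]) -> Predictor:
    """Bayesian average of finitely many predictors, returned as a Predictor."""

    def predict(history: Sequence[int]) -> float:
        weights = dict(prior)
        # Reweight each predictor by the likelihood it assigned to the observed bits.
        for t, bit in enumerate(history):
            for name, predictor in predictors.items():
                p_one = predictor(history[:t])
                weights[name] *= p_one if bit == 1 else 1.0 - p_one
        total = sum(weights.values())
        return sum(weights[n] * predictors[n](history) for n in predictors) / total

    return predict


def mostly_ones(history):
    return 0.9


def mostly_zeros(history):
    return 0.1


bayes = mixture(
    {"ones": mostly_ones, "zeros": mostly_zeros},
    {"ones": 0.5, "zeros": 0.5},
)

# The average is itself a Predictor, so it can sit inside another average.
nested = mixture(
    {"bayes": bayes, "zeros": mostly_zeros},
    {"bayes": 0.5, "zeros": 0.5},
)
print(bayes([1, 1, 1]), nested([1, 1, 1]))
```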

This also explains why I find it misleading to say that good practical strategies constitute "approximations to" an ideal theory of this type. Of course, since R just says to follow the best strategies in S, if you are following a very good strategy in S your behavior will tend to be close to that of R. But this cannot be attributed to any of the searching over S that R does, since you are not doing a search over S; you are executing a single member of S and ignoring the others. Any searching that can be done practically collapses down to a single practical strategy, and any that doesn't is not practical. Concretely, this talk of approximations is like saying that a very successful chess player "approximates" the rule "consult all possible chess players, then weight their moves by past performance." Yes, the skilled player will play similarly to this rule, but they are not following it, not even approximately! They are only themselves, not any other player.

Any theory of ideal rationality that wants to be a guide for embedded agents will have to be constrained in the same ways the agents are. But theories of ideal rationality usually get all of their content by going to a level above the agents they judge. So this new theory would have to be a very different sort of thing.

Comment by nostalgebraist on An Untrollable Mathematician Illustrated · 2018-11-04T18:06:45.975Z · score: 15 (6 votes) · LW · GW

This prior isn’t trollable in the original sense, but it is trollable in a weaker sense that still strikes me as important. Since μ must sum to 1, only finitely many sentences φ can have μ(φ) > ε for a given ε. So we can choose some finite set of “important sentences” and control their oscillations in a practical sense, but if there’s any ε > 0 such that we think oscillations across the range (ε, 1 − ε) are a bad thing, all but finitely many sentences can exhibit this bad behavior.

It seems especially bad that we can only prevent “up-to-ε trolling” for finite sets of sentences, since in PA (or whatever) there are plenty of countable sets of sentences that seem “essentially the same” (like the ones you get from an induction argument), and it feels very unnatural to choose finite subsets of these and distinguish them from the others, even (or especially?) if we pretend we have no prior knowledge beyond the axioms.
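The counting step behind "only finitely many" is elementary: if nonnegative weights sum to 1, fewer than 1/eps of them can exceed a threshold eps. A small sketch follows. The geometric prior is a made-up stand-in, not the prior from the post, and `eps` is this example's name for the threshold.

```python
def count_above(weights, eps):
    """Number of weights strictly greater than eps."""
    return sum(1 for w in weights if w > eps)


# Hypothetical prior over sentences indexed 0, 1, 2, ...: mass 2^-(n+1) on sentence n.
# Truncated at 60 sentences; the remaining tail mass is negligible here.
weights = [2.0 ** -(n + 1) for n in range(60)]

eps = 0.01
n_heavy = count_above(weights, eps)
print(n_heavy, "sentences have mass above", eps, "(bound:", 1 / eps, ")")
```

Shrinking `eps` enlarges the protected set, but for any fixed `eps` it stays finite while the set of sentences does not.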

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-11-04T00:18:41.837Z · score: 3 (2 votes) · LW · GW

To quote Abram Demski in “All Mathematicians are Trollable”:

The main concern is not so much whether GLS-coherent mathematicians are trollable as whether they are trolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?

My concern is not with the dangers of an actual adversary, it’s with the wild oscillations and extreme confidences that can arise even when logical facts arrive in a “fair” way, so long as it is still possible to get unlucky and experience a “clump” of successive observations that push P(A) way up or down.

We should expect such clumps sometimes unless the observation order is somehow specially chosen to discourage them, say via the kind of “principles” Demski wonders about.

One can also prevent observation order from mattering by doing what the Eisenstat prior does: adopt an observation model that does not treat logical observations as coming from some fixed underlying reality (so that learning “B or ~A” rules out some ways A could have been true), but as consistency-constrained samples from a fixed distribution. This works as far as it goes, but is hard to reconcile with common intuitions about how e.g. P=NP is unlikely because so many “ways it could have been true” have failed (Scott Aaronson has a post about this somewhere, arguing against Lubos Motl who seems to think like the Eisenstat prior), and more generally with any kind of mathematical intuition — or with the simple fact that the implications of axioms are fixed in advance and not determined dynamically as we observe them. Moreover, I don’t know of any way to (approximately) apply this model in real-world decisions, although maybe someone will come up with one.

This is all to say that I don’t think there is (yet) any standard Bayesian answer to the problem of self-trollability. It’s a serious problem and one at the very edge of current understanding, with only some partial stabs at solutions available.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-08T17:03:05.132Z · score: 8 (5 votes) · LW · GW

Ah, yeah, you're right that it's possible to do this. I'm used to thinking in the Kolmogorov picture, and keep forgetting that in the Jaynesian propositional logic picture you can treat material conditionals as contingent facts. In fact, I went through the process of realizing this in a similar argument about the same post a while ago, and then forgot about it in the meantime!

That said, I am not sure what this procedure has to recommend it, besides that it is possible and (technically) Bayesian. The starting prior, with independence, does not really reflect our state of knowledge at any time, even at the time before we have "noticed" the implication(s). For, if we actually write down that prior, we have an entry in every cell of the truth table, and if we inspect each of those cells and think "do I really believe this?", we cannot answer the question without asking whether we know facts such as A => B -- at which point we notice the implication!

It seems more accurate to say that, before we consider the connection of A to B, those cells are "not even filled in." The independence prior is not somehow logically agnostic; it assigns a specific probability to the conditional, just as our posterior does, except that in the prior that probability is, wrongly, not one.

Okay, one might say, but can't this still be a good enough place to start, allowing us to converge eventually? I'm actually unsure about this, because (see below) the logical updates tend to push the probabilities of the "ends" of a logical chain further towards 0 and 1; at any finite time the distribution obeys Cromwell's Rule, but whether it converges to the truth might depend on the way in which we take the limit over logical and empirical updates (supposing we do arbitrarily many of each type as time goes on).

I got curious about this and wrote some code to do these updates with arbitrary numbers of variables and arbitrary conditionals. What I found is that as we consider longer chains A => B => C => ..., the propositions at one end get pushed to 1 or 0, and we don't need very long chains for this to get extreme. With all starting probabilities set to 0.7 and three variables 0 => 1 => 2, the probability of variable 2 is 0.95; with five variables the probability of the last one is 0.99 (see the plot below). With ten variables, the last one reaches 0.99988. We can easily come up with long chains in the California example or similar, and following this procedure would lead us to absurdly extreme confidence in such examples.

I've also given a second plot below, where all the starting probabilities are 0.5. This shows that the growing confidence does not rely on an initial hunch one way or the other; simply updating on the logical relationships from initial neutrality (plus independences) pushes us to high confidence about the ends of the chain.
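For anyone who wants to reproduce these numbers, here is a minimal brute-force sketch. It is not the code referred to above, and the function names are hypothetical. It starts from an independent prior, conditions on each material conditional a => b by discarding truth assignments where a is true and b is false, and renormalizes.

```python
from itertools import product


def posterior_marginals(priors, implications):
    """Condition an independent prior on material conditionals.

    priors: list where priors[i] = P(variable i is true) before any logical update.
    implications: list of (a, b) index pairs, each meaning "a => b".
    Returns the posterior P(variable i is true) for every i.
    """
    n = len(priors)
    total = 0.0
    true_mass = [0.0] * n
    for world in product([False, True], repeat=n):
        # Drop truth assignments that violate any conditional.
        if any(world[a] and not world[b] for a, b in implications):
            continue
        weight = 1.0
        for value, p in zip(world, priors):
            weight *= p if value else 1.0 - p
        total += weight
        for i, value in enumerate(world):
            if value:
                true_mass[i] += weight
    return [m / total for m in true_mass]


def chain_end(n, p):
    """Posterior of the last variable in the chain 0 => 1 => ... => n-1."""
    chain = [(i, i + 1) for i in range(n - 1)]
    return posterior_marginals([p] * n, chain)[-1]


for n in (3, 5, 10):
    print(n, round(chain_end(n, 0.7), 5), round(chain_end(n, 0.5), 5))
```

Enumeration costs 2^n, which is fine for chains of this length. With all priors at 0.5 the n + 1 surviving assignments are equally weighted, so the last variable ends up at n / (n + 1).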

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-07T16:53:28.635Z · score: 7 (5 votes) · LW · GW

Two comments:

1. You seem to be suggesting that the standard Bayesian framework handles logical uncertainty as a special case. (Here we are not exactly "uncertain" about sentences, but we have to update on their truth from some prior that did not account for it, which amounts to the same thing.) If this were true, the research on handling logical uncertainty through new criteria and constructions would be superfluous. I haven't actually seen a proposal like this laid out in detail, but I think they've been proposed and found wanting, so I'll be skeptical at least until I'm shown the details of such a proposal.

(In particular, this would need to involve some notion of conditional probabilities like P(A | A => B), and perhaps priors like P(A => B), which are not a part of any treatment of Bayes I've seen.)

2. Even if this sort of thing does work in principle, it doesn't seem to help in the practical case at hand. We're now told to update on "noticing" A => B by using objects like P(A | A => B), but these too have to be guessed using heuristics (we don't have a map of them either), so it inherits the same problem it was introduced to solve.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-07T02:29:30.400Z · score: 5 (3 votes) · LW · GW

You assume a creature that can't see all logical consequences of hypotheses [...] Then you make it realize new facts about logical consequences of hypotheses

This is not quite what is going on in section 7b. The agent isn't learning any new logical information. For instance, in jadagul's "US in 2100" example, all of the logical facts involved are things the agent already knows. " 'California is a US state in 2100' implies 'The US exists in 2100' " is not a new fact, it's something we already knew before running through the exercise.

My argument in 7b is not really about updating -- it's about whether probabilities can adequately capture the agent's knowledge, even at a single time.

This is in a context (typical of real decisions) where:

  • the agent knows a huge number of logical facts, because it can correctly interpret hypotheses written in a logically transparent way, like "A and B," and because it knows lots of things about subsets in the world (like US / California)
  • but, the agent doesn't have the time/memory to write down a "map" of every hypothesis connected by these facts (like a sigma-algebra). For example, you can read an arbitrary string of hypotheses "A and B and C and ..." and know that this implies "A", "A and C", etc., but you don't have in your mind a giant table containing every such construction.
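The contrast between the two bullets can be sketched as follows, under a simplifying assumption: hypotheses are plain conjunctions of logically independent atoms, and all names are hypothetical. Checking any single entailment on demand is cheap, while the table of every hypothesis those checks relate grows exponentially.

```python
def entails(premise, conclusion):
    """'A and B and C' entails 'A and C': a conjunction entails each sub-conjunction."""
    return frozenset(conclusion) <= frozenset(premise)


def table_size(n_atoms):
    """Number of distinct non-empty conjunctions over n_atoms atomic hypotheses."""
    return 2 ** n_atoms - 1


print(entails({"A", "B", "C"}, {"A", "C"}))  # an on-demand check, no table needed
print(entails({"A"}, {"A", "B"}))
print(table_size(3), table_size(50))
```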

So the agent can't assign credences/probabilities simultaneously to every hypothesis on that map. Instead, they have some sort of "credence generator" that can take in a hypothesis and output how plausible it seems, using heuristics. In their raw form, these outputs may not be real numbers (they will have an order, but may not have e.g. a metric).

If we want to use Bayes here, we need to turn these raw credences into probabilities. But remember, the agent knows a lot of logical facts, and via the probability axioms, these all translate to facts relating probabilities to one another. There may not be any mapping from raw credence-generator-output to probabilities that preserves all of these facts, and so the agent's probabilities will not be consistent.

To be more concrete about the "credence generator": I find that when I am asked to produce subjective probabilities, I am translating them from internal representations like

  • Event A feels "very likely"
  • Event B, which is not logically entailed by A or vice versa, feels "pretty likely"
  • Event (A and B) feels "pretty likely"

If we demand that these map one-to-one to probabilities in any natural way, this is inconsistent. But I don't think it's inconsistent in itself; it just reflects that my heuristics have limited resolution. There isn't a conjunction fallacy here because I'm not treating these representations as probabilities -- but if I decide to do so, then I will have a conjunction fallacy! If I notice this happening, I can "plug the leak" by changing the probabilities, but I will expect to keep seeing new leaks, since I know so many logical facts, and thus there are so many consequences of the probability axioms that can fail to hold. And because I expect this to happen going forward, I am skeptical now that my reported probabilities reflect my actual beliefs -- not even approximately, since I expect to keep deriving very wrong things like an event being impossible instead of likely.
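A small worked version of the leak, with made-up numbers. The label-to-probability table is purely an assumption for illustration; any one-to-one table gives the same result, because B and (A and B) carry the same label.

```python
# Hypothetical one-to-one mapping from coarse labels to probabilities.
LABEL_TO_PROB = {"very likely": 0.9, "pretty likely": 0.7}

raw_credences = {
    "A": "very likely",
    "B": "pretty likely",
    "A and B": "pretty likely",
}
probs = {event: LABEL_TO_PROB[label] for event, label in raw_credences.items()}

# Consequences of the probability axioms, taken at face value:
p_b_and_not_a = probs["B"] - probs["A and B"]  # P(B and not A)
p_a_given_b = probs["A and B"] / probs["B"]    # P(A | B)

print(p_b_and_not_a, p_a_given_b)
```

The axioms are not violated outright, but the derived numbers say that "B without A" is impossible and that B guarantees A, which contradicts the stipulation that neither entails the other.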

None of this is meant to disapprove of using probability estimates to, say, make more grounded estimates of cost/benefit in real-world decisions. I do find that useful, but I think it is useful for a non-Bayesian reason: even if you don't demand a universal mapping from raw credences, you can get a lot of value out of saying things like "this decision isn't worth it unless you think P(A) > 97%", and then doing a one-time mapping of that back onto a raw credence. This has a lot of pragmatic value even if you know the mappings will break down if you push them too hard.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-06T01:38:28.877Z · score: 9 (5 votes) · LW · GW

If I understand your objection correctly, it's one I tried to answer already in my post.

In short: Bayesianism is normative for problems you can actually state in its formalism. This can be used as an argument for at least trying to state problems in its formalism, and I do think this is often a good idea; many of the examples in Jaynes' book show the value of doing this. But when the information you have actually does not fit the requirements of the formalism, you can only use it if you get more information (costly, sometimes impossible) or forget some of what you know to make the rest fit. I don't think Bayes normatively tells you to do those kinds of things, or at least that would require a type of argument different from the usual Dutch Books etc.

Using the word "brain" there was probably a mistake. This is only about brains insofar as it's about the knowledge actually available to you in some situation, and the same idea applies to the knowledge available to some robot you are building, or some agent in a hypothetical decision problem (so long as it is a problem with the same property, of not fitting well into the formalism without extra work or forgetting).

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-06T01:25:16.679Z · score: 8 (5 votes) · LW · GW

I don't disagree with any of this. But if I understand correctly, you're only arguing against a very strong claim -- something like "Bayes-related results cannot possibly have general relevance for real decisions, even via 'indirect' paths that don't rely on viewing the real decisions in a Bayesian way."

I don't endorse that claim, and would find it very hard to argue for. I can imagine virtually any mathematical result playing some useful role in some hypothetical framework for real decisions (although I would be more surprised in some cases than others), and I can't see why Bayesian stuff should be less promising in that regard than any arbitrarily chosen piece of math. But "Bayes might be relevant, just like p-adic analysis might be relevant!" seems like damning with faint praise, given the more "direct" ambitions of Bayes as advocated by Jaynes and others.

Is there a specific "indirect" path for the relevance of Bayes that you have in mind here?

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-05T20:22:03.257Z · score: 19 (7 votes) · LW · GW

I disagree that this answers my criticisms. In particular, my section 7 argues that it's practically unfeasible to even write down most practical belief / decision problems in the form that the Bayesian laws require, so "were the laws followed?" is generally not even a well-defined question.

To be a bit more precise, the framework with a complete hypothesis space is a bad model for the problems of interest. As I detailed in section 7, that framework assumes that our knowledge of hypotheses and the logical relations between hypotheses are specified "at the same time," i.e. when we know about a hypothesis we also know all its logical relations to all other hypotheses, and when we know (implicitly) about a logical relation we also have access (explicitly) to the hypotheses it relates. Not only is this false in many practical cases, I don't even know of any formalism that would allow us to call it "approximately true," or "true enough for the optimality theorems to carry over."

(N.B. as it happens, I don't think logical inductors fix this problem. But the very existence of logical induction as a research area shows that this is a problem. Either we care about the consequences of lacking logical omniscience, or we don't -- and apparently we do.)

It's sort of like quoting an optimality result given access to some oracle, when talking about a problem without access to that oracle. If the preconditions of a theorem are not met by the definition of a given decision problem, "meet those preconditions" cannot be part of a strategy for that problem. "Solve a different problem so you can use my theorem" is not a solution to the problem as stated.

Importantly, this is not just an issue of "we can't do perfect Bayes in practice, but if we were able, it'd be better." Obtaining the kind of knowledge representation assumed by the Bayesian laws has computational / resource costs, and in any real decision problem, we want to minimize these. If we're handed the "right" knowledge representation by a genie, fine, but if we are talking about choosing to generate it, that in itself is a decision with costs.

As a side point, I am also skeptical of some of the optimality results.