Comments
This is really remarkable! Thanks for posting, I don't know when I would have seen it otherwise.
Partway through, I Googled a couple passages to quickly confirm that it wasn't some "modern blog post in an old-timey style" -- it feels more specific and prescient to me than I remember Erewhon being (though Gwern's comment makes me think I should go back and look again, or check Darwin Among the Machines).
Evans' predictions on Trost's arguments and overall vibe are also impressive, IMO. (E.g. "if it were not for your incurable dilettanteism in science as in all other things--if you had once understood the action of any delicate machine..." ~= "actually training models gives you the insight that AI risk isn't a thing.") I wonder how much Evans-vs.-Trost-style conversation was already going on.
Interesting how this part turned out:
how do I know that they may not be ultimately made to carry, or may not in themselves evolve, conditions of self-supply, self-repair, and reproduction
It doesn't seem like Evans (or others at the time?) anticipated that instead of individual machines reproducing like humans, they could just run the factories / mines / etc. that produce them. Which is an odd oversight, since by that time machines were certainly used in factories to make more machines -- it seems like the "reproduction via external construction" prediction should have been a lot easier to make than the "reasoning machines" prediction.
"Heaven forbid! They seem to be flying about in the air with other germs, and have found a sort of nidus among my melancholy fancies. Nobody really holds them. They bear the same relation to real belief as walking on the head for a show does to running away from an explosion or walking fast to catch the train."
😬
The sidebar that shows all comments by author is incredibly useful (to me)!
I don't know how long ago it was put in, but when I noticed it, it made it waaaaay easier for me to parse through big conversation trees, get a sense for what people are thinking, and zero in on threads I want to read in detail.
Thanks to whoever had that idea and implemented it!
Ajeya "Cotra", not "Corta" :)
Thanks for the info, I'm reading through your posts now! I'm sorry your experience was / still is so terrible. Knock on wood, I'm not having as bad a time so far -- I wonder if the most recent booster helped me, or if it's just luck (different strain, different immune system, etc.).
Especially good to know how easy it was to pass to your spouse -- I'll do my best to take that into account.
(I strongly agree w/ your post on Paxlovid, by the way -- it was a game changer for how bad my symptoms were, I'm very glad I could get it.)
In case it's useful, I have a threat model writeup here: https://www.danieldewey.net/risk/case.html. (I haven't linked it from many places, and I'm trying to spread it around when I see the chance.)
If I wanted to explain these results, I think I would say something like:
GPT-3 has been trained to predict what the next token would be if the prompt appeared in its dataset (text from the internet). So, if GPT-3 has learned well, it will "talk as if symbols are grounded" when it predicts that the internet-text would "talk as if symbols are grounded" following the given prompt, and not if not.
It's hard to use this explanation to predict what GPT-3 will do on edge cases, but this would lead me to expect that GPT-3 will more often "talk as if symbols are grounded" when the prompt is a common prose format (e.g. stories, articles, forum posts), and less often when the prompt is most similar to non-symbol-groundy things in the dataset (e.g. poetry) or not that similar to anything in the dataset.
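To make "predicting the next token" a bit more concrete, here's a minimal sketch of the kind of scoring I have in mind. It uses GPT-2 via the Hugging Face transformers library as a stand-in (GPT-3's weights aren't public), and the prompt/continuation strings are made-up placeholders, not examples from your post. The explanation above is basically the claim that continuations which look like plausible internet text after the prompt get higher scores:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    # Simplification: assumes tokenizing prompt and prompt+continuation separately
    # lines up at the boundary (usually true when the continuation starts with a space).
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The logits at position t are the model's prediction for token t+1,
    # so we only sum the scores of the continuation's tokens.
    for t in range(prompt_len - 1, full_ids.shape[1] - 1):
        total += log_probs[0, t, full_ids[0, t + 1]].item()
    return total

# Made-up comparison: the continuation that reads more like ordinary internet prose
# after this prompt should usually get the higher (less negative) score.
prompt = "The next morning, she got up and"
print(continuation_logprob(prompt, " went in to work as usual."))
print(continuation_logprob(prompt, " dissolved into a cloud of punctuation marks."))
```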
I think your examples here broadly fit that explanation, though it feels like a shaky just-so story:
- If I saw the first "undead worker" prompt on a webpage, I would think "hm, normal stories don't have this kind of weird repetition -- is this a poem or a metaphor or something? A joke?" I wouldn't think it was 97% to continue going to work, but I wouldn't be surprised if she did -- maybe 30%-50%?
- The second undead worker prompt looks a lot more like a normal kind of story, so I'm not that surprised that GPT was more likely to continue it the way a story would continue, i.e. in a more symbol-groundy way -- if I saw that text on the internet, I would still think there was a reasonable chance that it's some kind of joke, but not as high a chance as with the first prompt.
- IDK about the map thing -- this looks like a case where GPT just hadn't seen enough training text in the general vicinity of the prompt to do very well? It's definitely interesting that it figured out the command format, but didn't seem to figure out the layout of the situation.
I don't see how to test this theory, but it seems like it has to be kind of tautologically correct -- predicting the next token is what GPT-3 was trained to do, right?
Maybe to find out how adept GPT-3 is at continuing prompts that depend on common knowledge about common objects, or object permanence, or logical reasoning, you could create prompts that are as close as possible to what appears in the dataset, then see if it fails those prompts more than average? I don't think there's a lot we can conclude from unusual-looking prompts.
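In case it helps, here's a rough sketch of what that test might look like, again using GPT-2 via transformers as a stand-in model. The prompts and "acceptable" keywords here are invented placeholders; a real test set would need many more cases, written to look as much like typical internet prose as possible:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Invented examples: ordinary-looking prompts whose natural continuation depends on
# object permanence or common knowledge about common objects.
cases = [
    ("She put her keys in her coat pocket, hung the coat by the door, and went to bed. "
     "The next morning she looked for her keys in her", ["coat", "pocket"]),
    ("He dropped the glass on the tile floor and it", ["broke", "shatter", "smash"]),
]

failures = 0
for prompt, acceptable in cases:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=10, do_sample=False)
    continuation = tokenizer.decode(output_ids[0, input_ids.shape[1]:])
    print(repr(continuation))
    # Crude keyword check for whether the continuation respects the common-sense fact.
    if not any(word in continuation.lower() for word in acceptable):
        failures += 1

print(f"failure rate: {failures}/{len(cases)}")
```

The interesting number would then be whether this failure rate is worse than the model's failure rate on otherwise-similar prompts that don't hinge on common-sense facts.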
I'm curious what you think of this -- maybe it misses the point of your post?
*(I'm not sure exactly what you mean when you say "symbol grounding", but I'm taking it to mean something like "the words describe objects that have common-sense properties, and future words will continue this pattern".)
Nice work everyone! Very excited to see the results.
Thank you for writing this! I usually have to find a few different angles to look at a paper from before I feel like I understand it, and this kind of thing is super helpful.
Thanks, Richard!
I do think both of those cases fit into the framework fine (unless I'm misunderstanding what you have in mind):
- In the first case, we're training a model in an environment. As it gets more capable, it reaches a point where it can find new, harmful behaviors in some set of situations. Our worries are now that (1) we can't recognize that behavior as harmful, or (2) we don't visit those situations during training, but they do in fact come up in practice (distribution shift). If we say "but the version of the model we had yesterday, before all this additional training, didn't behave badly in this situation!", that just seems like sloppy training work -- it's not clear why we should expect the behavior of an earlier version of a model to bind a later version.
- In the second case, it sounds like you're imagining us watching evolution and thinking "let's evolve humans that are reproductively fit, but aren't dangerous to other species." We train the humans a lot in the ancestral environment, and see that they don't hurt other species much. But then, the humans change the environment a lot, and in the new situations they create, they hurt other species a lot. In this case, I think it's pretty clear that the distribution has shifted. We might wish we'd done something earlier to certify that humans wouldn't hurt animals a lot under any circumstance, or we'd deployed humans in some sandbox so we could keep the high-level distribution of situations the same, or dealt with high-level distribution shift some other way.
In other words, if we imagine a model misbehaving in the wild, I think it'll usually either be the case that (1) it behaved that way during training but we didn't notice the badness (evaluation breakdown), or (2) we didn't train it on a similar enough situation (high-level distribution shift).
As we move further away from standard DL training practices, we could see failure modes that don't fit into these two categories -- e.g. there could be some bad fixed-point behaviors in amplification that aren't productively thought of as "evaluation breakdown" or "high-level distribution shift." But these two categories do seem like the most obvious ways that current DL practice could produce systematically harmful behavior, and I think they take up a pretty large part of the space of possible failures.
(ETA: I want to reiterate that these two problems are restatements of earlier thinking, esp. by Paul and Evan, and not ideas I'm claiming are new at all; I'm using my own terms for them because "inner" and "outer" alignment have different meanings for different people.)
I'm really enjoying Project Hail Mary, the new book from The Martian author Andy Weir, and I think other LW readers might as well.
Avoid spoilers harder than you normally would -- there are a lot of spoilers online that are easy to hit by accident.
Why you might like it:
- Lots of figuring things out on the fly: math, science, and general hypothesizing / problem-solving exercises. Fun to stop and try to figure them out on your own, or to just relax and watch the character solve them.
- Requires a lot less physics knowledge than some similar books if you want to "play along at home" (e.g. I could do most of what I've seen so far, vs. something like Egan's Orthogonal, which needs more physics than I know)
- Nice escapism about how Earth responds to global threats :)