Developmental Stages of GPTs

orthonormal

Developmental Stages of GPTs

post by orthonormal · 2020-07-26T22:03:19.588Z · LW · GW · 72 comments

  My Thesis:
    The difference between GPT-2 and GPT-3 has made me suspect that there's a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of "developmental stages" of the resulting network. Furthermore, it's plausible to me that the functions needed to b...
  Architecture and Scaling
  Analogues to Developmental Stages
  What's Next?
  Could GPT-N turn out aligned, or at least harmless?
  What Can We Do?
None
74 comments

Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original.

Related: Gwern on GPT-3

For the last several years, I've gone around saying that I'm worried about transformative AI, an AI capable of making an Industrial Revolution sized impact (the concept is agnostic on whether it has to be AGI or self-improving), because I think we might be one or two cognitive breakthroughs away from building one.

GPT-3 has made me move up my timelines, because it makes me think we might need zero more cognitive breakthroughs, just more refinement / efficiency / computing power: basically, GPT-6 or GPT-7 might do it. My reason for thinking this is comparing GPT-3 to GPT-2, and reflecting on what the differences say about the "missing pieces" for transformative AI.

My Thesis:

The difference between GPT-2 and GPT-3 has made me suspect that there's a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of "developmental stages" of the resulting network. Furthermore, it's plausible to me that the functions needed to be a transformative AI are covered by a moderate number of such developmental stages, without requiring additional structure. Thus GPT-N would be a transformative AI, for some not-too-large N, and we need to redouble our efforts on ways to align such AIs.

The thesis doesn't strongly imply that we'll reach transformative AI via GPT-N especially soon; I have wide uncertainty, even given the thesis, about how large we should expect N to be, and whether the scaling of training and of computation slows down progress before then. But it's also plausible to me now that the timeline is only a few years, and that no fundamentally different approach will succeed before then. And that scares me.

Architecture and Scaling

GPT, GPT-2, and GPT-3 use nearly the same architecture; each paper says as much, with a sentence or two about minor improvements to the individual transformers. Model size (and the amount of training computation) is really the only difference.

GPT took 1 petaflop/s-day to train 117M parameters, GPT-2 took 10 petaflop/s-days to train 1.5B parameters, and the largest version of GPT-3 took 3,000 petaflop/s-days to train 175B parameters. By contrast, AlphaStar seems to have taken about 30,000 petaflop/s-days of training [LW(p) · GW(p)] in mid-2019, so the pace of AI research computing power projects that there should be about 10x that today. The upshot is that OpenAI may not be able to afford it, but if Google really wanted to make GPT-4 this year, they could afford to do so.

Analogues to Developmental Stages

There are all sorts of (more or less well-defined) developmental stages for human beings: image tracking, object permanence, vocabulary and grammar, theory of mind, size and volume, emotional awareness, executive functioning, et cetera.

I was first reminded of developmental stages a few years ago, when I saw the layers of abstraction generated in this feature visualization tool for GoogLeNet.

We don't have feature visualization for language models, but we do have generative outputs. And as you scale up an architecture like GPT, you see higher levels of abstraction. Grammar gets mastered, then content (removing absurd but grammatical responses), then tone (first rough genre, then spookily accurate authorial voice). Topic coherence is mastered first on the phrase level, then the sentence level, then the paragraph level. So too with narrative flow.

Gwern's poetry experiments (GPT-2, GPT-3) are good examples. GPT-2 could more or less continue the meter of a poem and use words that fit the existing theme, but even its best efforts can get stuck in topic loops:

…Methinks I see her in her blissful dreams:
—Or, fancy-like, in some mirage she lies,
Majestic yet majestic, and of seems
The image of the unconquerable skies.
Methinks I see her in her blissful dreams:
—Or, fancy-like, in some majestic cell,
Where lordly seraphs strew their balmy dreams
On the still night, or in their golden shell.

Or:

In the dark the sun doth gleam,
And in the dark the moon doth seem
But now the evening is begun—
Gone is the sun upon the earth!
The silver moon doth like a cup
Of blood-red wine, and as that cup
Is drained of life, doth quench no drop.
What man will drink such wine?

GPT-3, though, has the sense of going somewhere, at least on the topic level. (Prompts in bold.)

O time thy pyramids, thy palaces forgot,
Pensive through nations wander with no more state
Than in thy days, when wandering was thy trade:
Mourn for Patraos, for the glory fled
Of him and thee, and in their stead left, left
To other eyes a length of rolling sand,
And such memorials as the traveller reads,
Half guessed in shade, of sculptured walls half round,
And long-defaced metopes, and friezes old.

[EDIT: Previously I also included its completion of a famous Emily Dickinson poem here, but as benkuhn pointed out, GPT-3 had simply memorized the poem and recited it. I'm really embarrassed, and also kind of shocked that I looked at the actual text of "Because I could not stop for Death" and thought, "yup, that looks like something GPT-3 could produce".]

(One last shocking bit is that, while GPT-2 had to be fine-tuned by taking the general model and training it some more on a poetry-only dataset, you're seeing what GPT-3's model does with no fine-tuning, with just a prompt that sounds poetic!)

Similarly, GPT-3's ability to write fiction is impressive- unlike GPT-2, it doesn't lose track of the plot, it has sensible things happen, it just can't plan its way to a satisfying resolution.

I'd be somewhat surprised if GPT-4 shared that last problem.

What's Next?

How could one of the GPTs become a transformative AI, even if it becomes a better and better imitator of human prose style? Sure, we can imagine it being used maliciously to auto-generate targeted misinformation or things of that sort, but that's not the real risk I'm worrying about here.

My real worry is that causal inference and planning are starting to look more and more like plausible developmental stages that GPT-3 is moving towards, and that these were exactly the things I previously thought were the obvious obstacles between current AI paradigms and transformative AI.

Learning causal inference from observations doesn't seem qualitatively different from learning arithmetic or coding from examples (and not only is GPT-3 accurate at adding three-digit numbers, but apparently at writing JSX code to spec), only more complex in degree.

One might claim that causal inference is harder to glean from language-only data than from direct observation of the physical world, but that's a moot point, as OpenAI are using the same architecture to learn how to infer the rest of an image from one part.

Planning is more complex to assess. We've seen GPTs ascend from coherence of the next few words, to the sentence or line, to the paragraph or stanza, and we've even seen them write working code. But this can be done without planning; GPT-3 may simply have a good enough distribution over next words to prune out those that would lead to dead ends. (On the other hand, how sure are we that that's not the same as planning, if planning is just pruning on a high enough level of abstraction?)

The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It's hard for them to learn not to paint themselves into a corner. It would make training more finicky and expensive if we expanded the time horizon of the loss function, of course. But that's a straightforward way to get the seeds of planning, and surely there are other ways.

With causal modeling and planning, you have the capability of manipulation without external malicious use. And the really worrisome capability comes when it models its own interactions with the world, and makes plans with that taken into account.

Could GPT-N turn out aligned, or at least harmless?

GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?

One might hope that because it's learning to imitate humans in an unsupervised way, that it would end up fairly human, or at least act in that way. I very much doubt this, for the following reason:

Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.
Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.
A mimic species will be similar in some ways but not others to the species it mimics, because even if they share recent ancestry, the environmental pressures on the poisonous one are different from the environmental pressures on the mimic.

What we have with the GPTs is the first deep learning architecture we've found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?

Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans, than it requires us to execute those patterns.

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers [LW · GW] GPT-N ends up building along the way, but I'm not going to count on there being none of them [LW · GW].

I don't expect that GPT-N will be aligned or harmless by default. And if N isn't that large before it gets transformative capacity, that's simply terrifying.

What Can We Do?

While the short timeline suggested by the thesis is very bad news from an AI safety readiness perspective (less time to come up with better theoretical approaches), there is one silver lining: it at least reduces the chance of a hardware overhang. A project or coalition can feasibly wait and take a better-aligned approach that uses 10x the time and expense of an unaligned approach, as long as they have that amount of resource advantage over any competitor.

Unfortunately, the thesis also makes it less likely that a fundamentally different architecture will reach transformative status before something like GPT does.

I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI, for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate [LW · GW] or Iterated Amplification are in this class.

We may also want to do a lot of work on how better to mold a GPT-in-training into the shape of an Oracle AI.

It would also be very useful to build some GPT feature "visualization" tools ASAP.

In the meantime, uh, enjoy AI Dungeon, I guess?

72 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-26T23:24:38.515Z · LW(p) · GW(p)

Unfortunately what you say sounds somewhat plausible to me; I look forward to hearing the responses.

I'll add this additional worry: If you are an early chemist exploring the properties of various metals, and you discover a metal that gets harder as it gets colder, this should increase your credence that there are other metals that share this property. Similarly, I think, for AI architectures. The GPT architecture seems to exhibit pretty awesome scaling properties. What if there are other architectures that also have awesome scaling properties, such that we'll discover this soon? How many architectures have had 1,000+ PF-days pumped into them? Seems like just two or three. And equally importantly, how many architectures have been tried with 100+ billion parameters? I don't know, please tell me if you do.

EDIT: By "architectures" I mean "Architectures + training setups (data, reward function, etc.)"

Replies from: SDM, Hoagy

↑ comment by Sammy Martin (SDM) · 2020-07-27T17:10:01.676Z · LW(p) · GW(p)

I find this interesting in the context of the recent podcast on errors in the classic arguments for AI risk [LW(p) · GW(p)]- which boil down to, there is no necessary reason why instrumental convergence or orthogonality apply to your systems, and there are actually strong reasons, a priori, to think increasing AI capabilities and increasing AI alignment go together to some degree [LW(p) · GW(p)]... and then GPT-3 comes along, and suggests that, practically speaking, you can get highly capable behaviour that scales up easily without much in the way of alignment.

On the one hand, GPT-3 is quite useful while being not robustly aligned, but on the other hand GPT-3's lack of alignment is impeding its capabilities to some degree.

Maybe if you update on both you just end up back where you started.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-27T21:56:39.135Z · LW(p) · GW(p)

I think the errors in the classic arguments have been greatly exaggerated. So for me the update is just in one direction.

Replies from: SDM

↑ comment by Sammy Martin (SDM) · 2020-07-27T23:05:17.833Z · LW(p) · GW(p)

What would you say is wrong with the 'exaggerated' criticism?

I don't think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn't claim that - just that there are other assumptions needed for disaster to be highly likely.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-28T13:14:19.053Z · LW(p) · GW(p)

I don't have an elevator pitch summary of my views yet, and it's possible that my interpretation of the classic arguments is wrong, I haven't reread them recently. But here's an attempt:

--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.

--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true. Later the book goes on to talk about how making it good is more difficult than it sounds. Moreover, Bostrom doesn't wave around his arguments about they are proofs; he includes lots of hedge words and maybes. I think we can interpret it as a burden-shifting argument; "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.

Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.

...

Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.

Replies from: SDM, bmg

↑ comment by Sammy Martin (SDM) · 2020-07-28T15:04:37.394Z · LW(p) · GW(p)

--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.

Speaking from personal experience, those views both felt obvious to me before I came across Orthogonality Thesis or Instrumental convergence.

--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true.

It depends on what you mean by 'special effort' and 'default'. The Orthogonality thesis, instrumental convergence, and eventual fast growth together establish that if we increased intelligence while not increasing alignment, a disaster would result. That is what is correct about them. What they don't establish is how natural it is that we will increase intelligence without increasing alignment to the degree necessary to stave off disaster.

It may be the case that the particular technique for building very powerful AI that is easiest to use is a technique that makes alignment and capability increase together, so you usually get the alignment you need just in the course of trying to make your system more capable.

Depending on how you look at that possibility, you could say that's an example of the 'special effort' being not as difficult as it appeared / likely to be made by default, or that the claim is just wrong and the default outcome is not doom. I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.

For the reasons you give, the Orthogonality thesis and instrumental convergence do shift the burden of proof to explaining why you wouldn't get misalignment, especially if progress is fast. But such reasons have been given, see e.g. this [LW(p) · GW(p)] from Stuart Russell:

The first reason for optimism [about AI alignment] is that there are strong economic incentives to develop AI systems that defer to humans and gradually align themselves to user preferences and intentions. Such systems will be highly desirable: the range of behaviours they can exhibit is simply far greater than that of machines with fixed, known objectives...

And there are outside-view analogies with other technologies that suggests that by default alignment and capability do tend to covary to quite a large extent. This is a large part of Ben Garfinkel's argument.

But I do think that some people (maybe not Bostrom, based on the caveats he gave), didn't realise that they did also need to complete the argument to have a strong expectation of doom - to show that there isn't an easy, and required alignment technique that we'll have a strong incentive to use.

From my earlier post: [LW(p) · GW(p)]

"A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. "

We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggest' (Ben's words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world.

I also think that, especially when you bring Mesa-optimisers or recent evidence [LW(p) · GW(p)]into the picture, the evidence we have so far suggests that even though alignment and capability are likely to covary to some degree (a degree higher than e.g. Bostrom expected back before modern ML), the default outcome is still misalignment.

Replies from: TurnTrout, daniel-kokotajlo

↑ comment by TurnTrout · 2020-07-28T16:32:09.823Z · LW(p) · GW(p)

I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.

I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL.

I think that [? · GW] hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I [LW · GW] proved [? · GW] that that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like "optimal farsighted agents tend to preserve their access to terminal states" (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and "optimal agents generally choose paths through the future that afford strictly more options" (Generalizing the Power-Seeking Theorems [LW · GW], Theorem 2).

The theorems aren't conclusive evidence:

maybe we don't get AGI through RL
learned policies are not going to be optimal
the results don't prove how hard it is tweak the reward function distribution, to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
perhaps the agents inherit different mesa objectives during training
- The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.

But the theorems are evidence that RL leads to catastrophe at optimum, at least. We're not just talking about "the space of all possible minds and desires" anymore.

Also

In the linked slides, the following point is made in slide 43:

We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide

Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.

On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here's why I think it's weak:

"All states have self-loops, left hidden to reduce clutter.

In AI: A Modern Approach (3e), the agent starts at and receives reward for reaching $3$ . The optimal policy for this reward function avoids $2$ , and one might suspect that avoiding $2$ is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to $2$ is optimal, and then argue that "instrumental convergence'' is subjective and that there is no reasonable basis for concluding that $2$ is generally avoided.

We can do better... for any way of independently and identically distributing reward over states, $\frac{10}{11}$ of reward functions have farsighted optimal policies which avoid $2$ . If we complicate the MDP with additional terminal states, this number further approaches 1.

If we suppose that the agent will be forced into $2$ unless it takes preventative action, then preventative policies are optimal for $\frac{10}{11}$ of farsighted agents – no matter how complex the preventative action. Taking $2$ to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power."

~ Optimal Farsighted Agents Tend to Seek Power

Replies from: bmg, rohinmshah, zachary-robertson

↑ comment by bmg · 2020-08-03T16:14:26.219Z · LW(p) · GW(p)

I agree that your paper strengthens the IC (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it's too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that "most" minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.

I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are:

It may be useful to distinguish between "power-seeking behavior" and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior -- it's akin to avoiding "State 2" in your diagram -- but it is something that we'd want any good house-cleaning robot to do. It's only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can't shut you off.)

This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and "power-seeking behavior" (in the context of some particular MDP) does reduce the ominousness of the result to some degree.
Ultimately, for most real-world tasks, I think it's unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM "scalable agent alignment" paper, Rohin's "narrow value learning" sequence [? · GW], and elsewhere: that, over time, the RL development process will necessarily look less-and-less like "pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well." There's seemingly just not that much that you can do using hand-written reward functions. I think that these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes, before any of them are used to produce systems that are willing and able to commit omnicide. Ultimately, in other words, I think it's reasonable to be optimistic that we'll do much better than random when producing the policies of advanced AI systems.

I do still think that the ICT is true, though, and I do still think that it matters: it's (basically) necessary for establishing a high level of misalignment risk. I just don't think it's sufficient to establish a high level of risk (and am skeptical of certain other premises that would be sufficient to establish this).

↑ comment by Rohin Shah (rohinmshah) · 2020-08-10T16:48:00.162Z · LW(p) · GW(p)

But the theorems are evidence that RL leads to catastrophe at optimum, at least.

RL with a randomly chosen reward leads to catastrophe at optimum.

I [LW · GW] proved [? · GW] that that optimal policies are generally power-seeking in MDPs.

The proof is for randomly distributed rewards.

Ben's main critique is that the goals evolve in tandem with capabilities, and goals will be determined by what humans care about. These are specific reasons to deny the conclusion of analysis of random rewards.

(A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don't error.)

I do agree that this isn't enough reason to say "there is no risk", but it surely is important for determining absolute levels of risk. (See also this comment [LW(p) · GW(p)] by Ben.)

Replies from: TurnTrout

↑ comment by TurnTrout · 2020-08-11T15:18:45.567Z · LW(p) · GW(p)

Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to try to provide goals without that property”. Can we provide reward functions without that property?

Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.

I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.

ETA: also, i was referring to the point you made when i said

“the results don't prove how hard it is tweak the reward function distribution, to avoid instrumental convergence”

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2020-08-11T23:28:32.597Z · LW(p) · GW(p)

Every specific attempt so far has been seemingly unsuccessful

Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such.

But I don't really think I disagree with you here. I think this paper is good, provides support for the point "we should have good reason to believe an AI system is safe, and not assume it by default", and responds to an in-fact incorrect argument of "but why would any AI want to kill us all, that's just anthropomorphizing".

But when someone says "These arguments depend on some concept of a 'random mind', but in reality it won't be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply", it seems like a response of the form "we have support for IC, not just in random minds, but also for random reward functions" has not responded to the critique and should not be expected to be convincing to that person.

Aside:

I don’t know that it makes me feel much better about future objectives being outer aligned.

I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil's advocate here. (I am also not convinced of the negation.)

Replies from: TurnTrout

↑ comment by TurnTrout · 2020-08-12T16:55:18.493Z · LW(p) · GW(p)

it seems like a response of the form "we have support for IC, not just in random minds, but also for random reward functions" has not responded to the critique and should not be expected to be convincing to that person.

I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of "but how do we know IC even exists?" with "well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don't (formally) know how hard it is to avoid if you try".

I think I agree with most of what you're arguing.

↑ comment by Past Account (zachary-robertson) · 2020-08-11T18:30:15.585Z · LW(p) · GW(p)

[Deleted]

Replies from: rohinmshah, TurnTrout

↑ comment by Rohin Shah (rohinmshah) · 2020-08-11T23:33:51.508Z · LW(p) · GW(p)

I find myself agreeing with the idea that an agent unaware of it's task will seek power, but also conclude that an agent aware of it's task will give-up power.

I think this is a slight misunderstanding of the theory in the paper. I'd translate the theory of the paper to English as:

If we do not know an agent's goal, but we know that the agent knows its goal and is optimal w.r.t it, then from our perspective the agent is more likely to go to higher-power states. (From the agent's perspective, there is no probability, it always executes the deterministic perfect policy for its reward function.)

Any time the paper talks about "distributions" over reward functions, it's talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done "before" the reward function is drawn, but there is no step where the agent is doing optimization but doesn't know its reward.

In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles C1 and C2 the agent is more likely to select the larger set.

I'd rewrite this as:

Theorem 19 suggests that, if an agent that knows its reward is about to choose between C1 and C2, but we don't know the reward and our prior is that it is uniformly distributed, then we will assign higher probability to the agent going to the larger set.

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2020-08-12T01:07:40.240Z · LW(p) · GW(p)

[Deleted]

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2020-08-12T17:56:23.818Z · LW(p) · GW(p)

I do not see how the agent 'seeks' out powerful states because, as you say, the agent is fixed.

I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says "optimal agents seek power", I think you should translate it as "when we don't know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power", even though the agent itself is not thinking "ah, this state is powerful, I'll go there".

↑ comment by TurnTrout · 2020-08-13T13:44:05.380Z · LW(p) · GW(p)

Great observation. Similarly, a hypothesis called "Maximum Causal Entropy" once claimed that physical systems involving intelligent actors tended tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don't perpetually maximize their potential partners -- they actually pick a partner, eventually.

My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2020-08-15T15:11:20.004Z · LW(p) · GW(p)

[Deleted]

Replies from: TurnTrout

↑ comment by TurnTrout · 2020-08-15T15:50:05.329Z · LW(p) · GW(p)

If there's a collection of 'turned-off' terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state.

To clarify, I don't assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.

But, maybe you're saying "for realistic distributions, the agent won't get any reward for being shut off and therefore won't ever let itself be shut off". I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems [LW · GW]. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient "obedience" rewards, what then? For some level of farsightedness ( $γ$ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future.

The paper doesn't draw the causal diagram "Power $\to$ instrumental convergence", it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.

In general, I'd suspect that there are goals we could give the agent that significantly reduce our gain. However, I'd also suspect the opposite.

Yes, right. The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail.

It seems reasonable to argue that we would if we could guarantee $r = h$ .

Yes, if $r = h$ , introduce the agent. You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we'd implement an agent with goal X', and so on. You then take our expected optimal value under this distribution and find whether you're good at alignment, or whether you're bad and you'll build agents whose optimal policies tend to obstruct you.

There might be a way to argue over randomness and say this would double our gain.

The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn't.

More speculatively, what if $| r - h | < ϵ$ ?

If the rewards are $ϵ$ -close in sup-norm, then you can get nice regret bounds, sure.

Replies from: zachary-robertson

↑ comment by Past Account (zachary-robertson) · 2020-08-16T13:48:49.728Z · LW(p) · GW(p)

[Deleted]

Replies from: TurnTrout, TurnTrout

↑ comment by TurnTrout · 2020-12-04T02:54:24.160Z · LW(p) · GW(p)

What is the formal definition of 'power seeking'?

The freshly updated paper answers this question in great detail; see section 6 and also appendix B.

↑ comment by TurnTrout · 2020-08-16T18:41:37.568Z · LW(p) · GW(p)

What is the formal definition of 'power seeking'?

Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other.

Power is kinda weird when defined for optimal agents, as you say - when , POWER can only decrease. See Power as Easily Exploitable Opportunities [LW · GW] for more on this.

My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.

Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this reasoning to k-cycles (k >1) via theorem 53 and explain how theorem19 correctly handles fig. 7". In particular, see Figure 19.

The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.

If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn't in the main part of the paper, but basically you toss out transitions which aren't part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.

The main idea, though, is that you're reasoning about what the agent's end goals tend to be, and then say "it's going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (ie shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.

I think I still haven't clearly communicated all my mental models here, but I figured I'd write a reply now while I update the paper.

Thank you for these comments, by the way. You're pointing out important underspecifications. :)

My philosophy is that aligned/general is OK based on a shared (?) premise that,

I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-28T16:51:29.055Z · LW(p) · GW(p)

I mostly agree with what you say here--which is why I said the criticisms were exaggerated, not totally wrong--but I do think the classic arguments are still better than you portray them. In particular, I don't remember coming away from Superintelligence (I read it when it first came out) thinking that we'd have an AI system capable of optimizing any goal and we'd need to figure out what goal to put into it. Instead I thought that we'd be building AI through some sort of iterative process where we look at existing systems, come up with tweaks, build a new and better system, etc. and that if we kept with the default strategy (which is to select for and aim for systems with the most impressive capabilities/intelligence, and not care about their alignment--just look at literally every AI system made in the lab so far! Is AlphaGo trained to be benevolent? Is AlphaStar? Is GPT? Etc.) then probably doom.

It's true that when people are building systems not for purposes of research, but for purposes of economic application -- e.g. Alexa, Google Search, facebook's recommendation algorithm -- then they seem to put at least some effort into making the systems aligned as well as intelligent. However history also tells us that not very much effort is put in, by default, and that these systems would totally kill us all if they were smarter. Moreover, usually systems appear in research-land first before they appear in economic-application-land. This is what I remember myself thinking in 2014, and I still think it now. I think the burden of proof has totally not been met; we still don't have good reason to think the outcome will probably be non-doom in the absence of more AI safety effort.

It's possible my memory is wrong though. I should reread the relevant passages.

Replies from: SDM

↑ comment by Sammy Martin (SDM) · 2020-07-28T17:02:16.540Z · LW(p) · GW(p)

When I wrote that I was mostly taking what Ben Garfinkel said about the 'classic arguments' at face value, but I do recall that there used to be a lot of loose talk about putting values into an AGI after building it.

↑ comment by bmg · 2020-08-03T14:38:57.056Z · LW(p) · GW(p)

I think we can interpret it as a burden-shifting argument; "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.

I suppose I disagree that at least the orthogonality thesis and instrumental convergence, on their own, shift the burden. The OT basically says: "It is physically possible to build an AI system that would try to kill everyone." The ICT basically says: "Most possible AI systems within some particular set would try to kill everyone." If we stop here, then we haven't gotten very far.

To repurpose an analogy: Suppose that you lived very far back in the past and suspected the people would eventually try to send rockets with astronauts to the moon. It's true that it's physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave earth's atmosphere would also send astronauts aimlessly out into the depths of space. But I don't think it'd be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don't want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. If you're not worried about just one single rocket trajectory failure, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes increasingly low.

Even sillier example: It's possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people are against coronavirus, but we don't need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turned out that we don't currently know how to make an effective coronavirus vaccine.

I think the OT and ICT on their own almost definitely aren't enough to justify an above 1% credence in extinction from AI. To get the rational credence up into (e.g) the 10%-50% range, I think that stuff like mesa-optimization concerns, discontinuity premises, explanations of how plausible development techniques/processes could go badly wrong, and explanations of dynamics around AI unnoticed deceptive tendencies still need to do almost all of the work.

(Although a lot depends on how high a credence we're trying to justify. A 1% credence in human extinction from misaligned AI is more than enough, IMO, to justify a ton of research effort, although it also probably has pretty different prioritization implications than a 50% credence.)

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-08-03T17:37:05.774Z · LW(p) · GW(p)

I think the purpose of the OT and ICT is to establish that lots of AI safety needs to be done. I think they are successful in this. Then you come along and give your analogy to other cases (rockets, vaccines) and argue that lots of AI safety will in fact be done, enough that we don't need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn't need to be met.

But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" does not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.

Replies from: bmg

↑ comment by bmg · 2020-08-04T01:32:24.733Z · LW(p) · GW(p)

I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" does not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.

I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over "and various other premises"). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.

I think that "discontinuity + OT + ICT," rather than "OT + ICT" alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence:

An existential risk is one that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development. Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.

First, we discussed how the initial superintelligence might obtain a decisive strategic advantage. This superintelligence would then be in a position to form a singleton and to shape the future of Earth-originating intelligent life. What happens from that point onward would depend on the superintelligence’s motivations.

Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.

Third, the instrumental convergence thesis entails that we cannot blithely assume that a superintelligence with the final goal of calculating the decimals of pi (or making paperclips, or counting grains of sand) would limit its activities in such a way as not to infringe on human interests. An agent with such a final goal would have a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources and, if possible, to eliminate potential threats to itself and its goal system. Human beings might constitute potential threats; they certainly constitute physical resources.

Taken together, these three points thus indicate that the first superintelligence may shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition. If we now reflect that human beings consist of useful resources (such as conveniently located atoms) and that we depend for our survival and flourishing on many more local resources, we can see that the outcome could easily be one in which humanity quickly becomes extinct.

There are some loose ends in this reasoning, and we shall be in a better position to evaluate it after we have cleared up several more surrounding issues. In particular, we need to examine more closely whether and how a project developing a superintelligence might either prevent it from obtaining a decisive strategic advantage or shape its final values in such a way that their realization would also involve the realization of a satisfactory range of human values. (Bostrom, p. 115-116)

If we drop the 'likely discontinuity' premise, as some portion of the community is inclined to do, then OT and OCT are the main things left. A lot of weight would then rests on these two theses, unless we supplement them with new premises (e.g. related to mesa-optimization.)

I'd also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions ("maximize paperclips," "make me happy," etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that superintelligent system wants to kill you, you may be screwed even if you're quite careful in various regards (e.g. even if you implement "boxing" strategies). At least if we drop the discontinuity premise, though, I don't think they're compelling enough to bump us up to a high credence in doom.

Replies from: SDM

↑ comment by Sammy Martin (SDM) · 2020-08-15T15:43:05.761Z · LW(p) · GW(p)

Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one.

Perhaps what is going on here is that the arguments as stated in brief summaries like 'orthogonality thesis + instrumental convergence' just aren't what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments.

This reminds me of Lakatos' theory of research programs - where the core assumptions, usually logical or a priori in nature, are used to 'spin off' secondary hypotheses that are more empirical or easily falsifiable.

Lakatos' model fits AI safety rather well - OT and IC are some of these non-emperical 'hard core' assumptions that are foundational to the research program and then in ~2010 there were some secondary assumptions, discontinuous progress, AI maximises a simple utility function etc. but in ~2020 we have some different secondary assumptions: mesa-optimisers, you get what you measure, direct evidence of current misalignment

↑ comment by Hoagy · 2020-08-13T15:45:55.840Z · LW(p) · GW(p)

I agree that this is the biggest concern with these models, and the GPT-n series running out of steam wouldn't be a huge relief. It looks likely that we'll have the first human-scale (in terms of parameters) NNs before 2026 - Metaculus, 81% as of 13.08.2020.

Does anybody know of any work that's analysing the rate at which, once the first NN crosses the n-parameter barrier, other architectures are also tried at that scale? If no-one's done it yet, I'll have a look at scraping the data from Papers With Code's databases on e.g. ImageNet models, it might be able to answer your question on how many have been tried at >100B as well.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2020-07-28T21:54:35.687Z · LW(p) · GW(p)

I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI

I'd love to know of a non-zero integer number of plans that could possibly, possibly, possibly work for not dying to a GPT-style near-term AGI.

Replies from: evhub, ChristianKl, None

↑ comment by evhub · 2020-07-29T00:19:03.450Z · LW(p) · GW(p)

Here are 11. [AF · GW] I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.

↑ comment by ChristianKl · 2020-07-28T22:03:51.727Z · LW(p) · GW(p)

Maybe put out some sort of prize for the best ideas for plans?

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2020-07-28T22:51:08.339Z · LW(p) · GW(p)

Pretty sure OpenPhil and OpenAI currently try to fund plans that claim to look like this (e.g. all the ones orthonormal linked in the OP), though I agree that they could try increasing the financial reward by 100x (e.g. a prize) and see what that inspires.

If you want to understand why Eliezer doesn't find the current proposals feasible, his best writeups critiquing them specifically are this long comment containing high level disagreements with Alex Zhu's FAQ on iterated amplification [LW · GW] and this response post to the details of Iterated Amplification [LW · GW].

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredible chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

There's also some more subtle and implicit disagreement that's not been quite worked out but feeds into the above, where a lot of the ML-focused alignment strategies contain this idea that we will be able to expose ML system's thought processes to humans in a transparent and inspectable way, and check whether it has corrigibility, alignment, and intelligence, then add them up together like building blocks. My read is that Eliezer finds this to be an incredible claim that would be a truly dramatic update if there was a workable proposal for it, whereas many of the proposals above take it more as a starting assumption that this is feasible and move on from there to use it in a recursive setup, then alter the details of the recursive setup in order to patch any subsequent problems.

For more hashed out details on that subtle disagreement, see the response post linked above which has several concrete examples.

(Added: Here's the state of discussion on site for AI safety via debate [? · GW], which has a lot of overlap with Iterated Amplification. And here's all the postss on Iterated Amplification [? · GW]. I should make a tag for CIRL...)

Replies from: evhub, ChristianKl

↑ comment by evhub · 2020-07-28T22:56:30.976Z · LW(p) · GW(p)

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredible chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

Perhaps Eliezer can interject here, but it seems to me like these are not knockdown criticisms that such an approach can't “possibly, possibly, possibly work”—just reasons that it's unlikely to and that we shouldn't rely on it working.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2020-07-28T23:38:22.307Z · LW(p) · GW(p)

My model is that those two are the well-operationalised disagreements and thus productive to focus on, but that most of the despair is coming from the third and currently more implicit point.

Stepping back, the baseline is that most plans are crossing over dozens of kill-switches without realising it (e.g. Yann LeCun's "objectives can be changed quickly when issues surface [LW · GW]").

Then there are more interesting proposals that require being able to fully inspect the cognition of an ML system and have it be fully introspectively clear and then use it as a building block to build stronger, competitive, corrigible and aligned ML systems. I think this is an accurate description of Iterated Amplification + Debate as Zhu says in section 1.1.4 of his FAQ [LW · GW], and I think something very similar to this is what Chris Olah is excited about re: microscopes [LW · GW] about reverse engineering the entire codebase/cognition of an ML system.

I don't deny that there are lot of substantive and fascinating details to a lot of these proposals and that if this is possible we might indeed solve the alignment problem, but I think that is a large step that sounds from some initial perspectives kind of magical. And don't forget that at the same time we have to be able to combine it in a way that is competitive and corrigible and aligned.

I feel like it's one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system. I feel confident that certain sorts of basic theories are definitely there to be discovered, that there are strong intuitions about where to look [? · GW], they haven't been worked on much, and that there is low-hanging fruit to be plucked [? · GW]. I think Jessica Taylor wrote about a similar intuition about why she moved away from ML to do basic theory work [LW · GW].

Replies from: evhub

↑ comment by evhub · 2020-07-29T00:14:01.549Z · LW(p) · GW(p)

I feel like it's one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system.

I agree that deciding to work on basic theory is a pretty reasonable research direction—but that doesn't imply that other proposals can't possibly work. Thinking that a research direction is less likely to mitigate existential risk than another is different than thinking that a research direction is entirely a non-starter. The second requires significantly more evidence than the first and it doesn't seem to me like the points that you referenced cross that bar, though of course that's a subjective distinction.

↑ comment by ChristianKl · 2020-07-29T09:31:55.543Z · LW(p) · GW(p)

Even if available plans do get funded getting new plan ideas might be underfunded.

↑ comment by [deleted] · 2020-08-21T16:31:06.906Z · LW(p) · GW(p)

comment by ESRogs · 2020-07-27T00:33:39.730Z · LW(p) · GW(p)

As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.

Perhaps a nitpick, but is what it does planning?

Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it just happens to be so good at coming up with words that fit with what's come before that it ends up looking like a planned out text?

You might argue that if it ends up as-good-as-planned, then it doesn't make a difference if it was actually planned or not. But it seems to me like it does make a difference. If it has actually learned some internal planning behavior, then that seems more likely to be dangerous and to generalize to other kinds of planning.

Replies from: orthonormal

↑ comment by orthonormal · 2020-07-27T01:36:15.002Z · LW(p) · GW(p)

That's not a nitpick at all!

Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.

(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)

Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!

Replies from: oceaninthemiddleofanisland

↑ comment by oceaninthemiddleofanisland · 2020-07-27T08:02:43.997Z · LW(p) · GW(p)

Yes! I was thinking about this yesterday, it occurred to me that GPT-3's difficulty with rhyming consistently might not just be a byte-pair problem, any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you're just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I'm calling them A, B and C): performance should correlate to (1) the allowable margin of error per token-group (coding syntax is harsh, solving math equations is harsh, trying to come up with a rhyme for 'orange' after you've written it is harsh), and (2) the extent to which each token-group depends on future token-groups. Human poets and writers always go through several iterations, but we're asking it to do what we do in just one pass.

So in playing around with GPT-3 (AID), I've found two (three?) meta approaches for dealing with this issue. I'll call them Strategies A, B and C.

A is the more general one. You just give it multiple drafting opportunities and/or break up the problem into multiple smaller steps. So far I've seen it work for:

(1) Boolean logic, algebraic equations, simple math equations works (guess-and-check). When I have time in a few days, I'm going to get it to mimic the human heuristic for calculating approximate square-roots over multiple iterations.

(2) Translating Chinese poems to English roughly and then touching them up in the second draft. Same with editing any kind of text.

(3) Tricky coding problems (specifically, transforming a string into Pig Latin). First, instead of asking it to "solve the problem", you ask it to "come up with five possible strategies for solving the problem", and then "select the most plausible one". Then you say "you made several structural, syntactical, and interpretive mistakes", allow it to come up with a long list of those possible mistakes, say, "now try again", and do that as many times as the context window allows. The end result isn't always functional, but it's a lot better than asking it to solve something in one pass.

B is the moderately less general, and more obvious second approach, which synergises well with the first approach. B is forcing GPT-3 to plan explicitly.

(1) In writing an article, you get GPT-3 to start by writing a vague summary, then a more in-depth summary, then listing the key points and subpoints in order. By periodically forcing it to summarise its discussion up to a given point, you can exceed the window length while retaining coherency.

(2) In writing poetry from a prompt, you get GPT-3 to discuss and tease out the implications of the prompt and describe the process of planning the poetry first.

(3) In translating, you get it to list out the key potential translation errors that could be made, and the different choices a translator could make in translating each line.

(4) In writing code, you get GPT-3 to simulate several people discussing the problem requirements and arguing constructively with one another (simulating just one person means if that one person goes off track or misinterprets the problem, future continuations are poisoned with the error since they need to be consistent), then producing English pseudo-code that describes the process in abstract, and only then the actual code.

I decided to add 'simulating multiple people' as a Strategy C, but it's kind of the same thing as Strategy A but in a way that allows more room for error. The issue is that in most single-author texts, people try to be consistent with what they've said before, but in GPT-3, this can cause minor errors (for instance, self-contradiction) to accumulate over time, which reduces generation quality. But we've seen that something as simple as adding dialogue between two people, allows GPT-3 to arrive at accurate and more complex solutions much more reliably. This works for a broad spectrum of media: articles, poetry, translation, and coding. All you need to do is create a 'critic' who interrupts after each line or paragraph, and then if you really need one, a critic who criticises the first critic. The key here is constructive rather than destructive criticism, since GPT-3 is perfectly capable of producing vacuous and petty critiques.

All three of these strategies together tend to vastly improve performance on tasks where (1) the allowable margin of error per token-group is quite small (for instance, solving 83x42), and (2) current token-groups depends on future token-groups. I have not tested this for rhyming, but it seems simple enough to check.

In other words, GPT-3 does better at solving problems when you get it to simulate the way humans solve problems: with multiple attempts, with explicit planning, and by collaborating with other humans.

Edit: my attempts at making GPT-3 rhyme failed. Here is what I tried, and what I figured out.

(1) It has a vague idea of rhyming - if you fill its context-window with groups of words that rhyme, about 40-60% of the words in its next generation will rhyme, and the rest will look like rhymes (as in, they end with the same couple of letters but are pronounced differently in English - e.g dough, cough, rough, etc.).

(1a) Most rhyming websites are query-based. From what I could tell, GPT-3 has not memorised the layout of the most common rhyming websites to the degree where it could reproduce the formatting consistently. This is not surprising given that Common Crawl abides by nofollow and robots.txt policies, and that OpenAI may have filtered these pages out when they were paring the dataset down to 'high-quality' documents.

(1b) GPT-3 knows how most Chinese words are pronounced, even if it gets the tone wrong sometimes. It rhymes more consistently in languages with uncommon diacritic markings, more with languages that don't use Latin characters, and even more consistently in non-Latin-based languages with phonemic orthography, but not by much. With Russian, you hit the jackpot - BPE represents it as individual characters, it's mostly phonemic, there's a lot of Russian in GPT-3's dataset, and a lot of rhyming poetry - but it still does poorly. This either suggests that an absence of looking forward + randomness introduced by sampling is the main issue here. Unfortunately the other most-well-represented languages in its dataset with non-Latin phonemic orthography (Japanese kana, Korean hangul, Arabic script) each have their own issues - rhyming the last syllable of each line in Korean is easy since it's an SOV language and all you have to do is match the verb conjugation, so it doesn't have much literary value. Most of the rhyming in the dataset would likely be modern rap, which sometimes uses multiple syllables. Arabic omits short vowels. Japanese I know less about, but iirc rhyming is much less common than other forms of constrained writing (e.g haiku) that emphasise rhythm, and mostly occurs in j-pop.

(2) Giving it multiple attempts failed. 'Multiple generations for each line + selecting the ones that rhyme' works, but we already know that.

(3) Supplying rhymes kind of worked. It would do well for a handful lines and then go off track. Giving it multiple possible choices was very bad. It would include the words randomly within lines, or near the end of lines, and sometimes at the very end. This might be rectified by more examples, since AID is limited to 1000 tokens/characters. But I do suspect the issue is a more fundamental one.

(4) Splitting words into syllables failed, but I didn't try this one exhaustively. The only benefit of word-splitting occurs when the beginning of the word matters (e.g alliteration), because it allows for 'denser' computation per token (on the character/syllable level, not the word level). Plus, we're talking about the English language. Even actual English speakers regularly have trouble with knowing how words are pronounced, orthography kind of hinders rather than helps in this case.

(5) 'Reminding' it of the end word between each line failed.

(6) Forcing it to generate in IPA first did not work. However, it does have a vague idea of how to transliterate English into IPA and a better idea of how to transliterate IPA into English.

(7) Future attempts: my prompting was very abstract, and we know that GPT-3 works better when there's a familiar context surrounding the task / the prompt is within the training distribution. I will try the context of an English writing assignment.

comment by benkuhn · 2020-07-28T11:32:44.217Z · LW(p) · GW(p)

I'm confused about the "because I could not stop for death" example. You cite it as an example of GPT-3 developing "the sense of going somewhere, at least on the topic level," but it seems to have just memorized the Dickinson poem word for word; the completion looks identical to the original poem except for some punctuation.

(To be fair to GPT-3, I also never remember where Dickinson puts her em dashes.)

Replies from: orthonormal

↑ comment by orthonormal · 2020-07-28T17:48:00.638Z · LW(p) · GW(p)

I... oops. You're completely right, and I'm embarrassed. I didn't check the original, because I thought Gwern would have noted it if so. I'm going to delete that example.

What's really shocking is that I looked at what was the original poetry, and thought to myself, "Yeah, that could plausibly have been generated by GPT-3." I'm sorry, Emily.

Replies from: gwern

↑ comment by gwern · 2020-07-28T18:39:08.328Z · LW(p) · GW(p)

I did warn in the preface to that section that for really famous poems, GPT-3 will typically continue them and only improvise later on. I assumed that anyone interested in poems these famous would know where the original stopped and the new began, but probably that's expecting too much. I've gone back and annotated further where there seems to be copying.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-07-26T23:27:50.496Z · LW(p) · GW(p)

I think GPT-N is definitely not aligned, for mesa-optimizer reasons. It'll be some unholy being with a superhuman understanding of all the different types of humans, all the different parts of the internet, all the different kinds of content and style... but it won't itself be human, or anything close.

Of course, it's also not outer-aligned in Evan's sense, because of the universal prior being malign etc.

comment by Sammy Martin (SDM) · 2020-07-27T17:17:52.294Z · LW(p) · GW(p)

Suppose that GPT-6 does turn out to be some highly transformative AI capable of human-level language understanding and causal reasoning? What would the remaining gap be between that and an Agentive AGI? Possibly, it would not be much of a further leap.

There is this list of remaining capabilities needed for AGI in an older post I [LW · GW]wrote, with the capabilities of 'GPT-6' as I see them underlined:

Stuart Russell’s List

human-like language comprehension

cumulative learning

discovering new action sets

managing its own mental activity

For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960

perception and object recognition

efficient search over known facts

So we'd have discovering new action sets, and managing mental activity - effectively, the things that facilitate long-range complex planning, remaining. Unless you think those could also arise with GPT-N?

Suppose GPT-8 gives you all of those, just spontaneously, but its nothing but a really efficient text-predictor. Supposing that no dangerous mesa-optimisers arise, what then? Would it be relatively easy to turn it into something agentive, or would agent-like behaviour arise anyway?

I wonder if this is another moment to step back and reassess the next decade with fresh eyes - what's the probability of a highly transformative AI, enough to impact overall growth rates, in the next decade? I don't know, but probably not as low as I thought. We've already had our test-run. [LW(p) · GW(p)]

******

In the spirit of trying to get ahead of events, are there any alignment approaches that we could try out on GPT-3 in simplified form? I recall a paper on getting GPT-2 to learn from Human preferences, which is step 1 in the IDA proposal. You could try and do the same thing for GPT-3, but get the human labellers to try and get it to recognise more complicated concepts - even label output as 'morally good' or 'bad' if you really want to jump the gun. You might also be able to set up debate scenarios to elicit better results using a method like this [LW · GW].

Replies from: wassname

↑ comment by wassname · 2020-08-11T15:04:09.862Z · LW(p) · GW(p)

are there any alignment approaches that we could try out on GPT-3 in simplified form?

For a start you could see how it predicts or extrapolates moral reasoning. The datasets I've seen for that are "moral machines” and 'am I the arsehole' on reddit.

EDIT Something like this was just released Aligning AI With Shared Human Values

comment by Gordon Seidoh Worley (gworley) · 2020-07-27T00:32:54.237Z · LW(p) · GW(p)

You're careful here to talk about transformative AI rather than AGI, and I think that's right. GPT-N does seem like it stands to have transformative effects without necessarily being AGI, and that is quite worrisome. I think many of us expected to find ourselves in a world where AGI was primarily what we had to worry about, and instead we're in a world where "lesser" AI is on track to be powerful enough to dramatically change society. Or at least, so it seems from where we stand, extracting out the trends.

Replies from: platers

↑ comment by platers · 2020-07-27T00:46:16.135Z · LW(p) · GW(p)

Why do you think "lesser" AI being transformative is more worrying than AGI? This scenario seems similar to past technological progress.

Replies from: gworley

↑ comment by Gordon Seidoh Worley (gworley) · 2020-07-27T01:42:12.483Z · LW(p) · GW(p)

I didn't say GPT-N is more worrying than AGI, I'm saying I'm surprised we near term have to worry or be concerned about GPT-N in a way I (and I think many others) expected only to have to worry about things we would all agree were AGI.

Replies from: platers

↑ comment by platers · 2020-07-27T02:41:17.851Z · LW(p) · GW(p)

I see, thanks for clarifying!

comment by orthonormal · 2024-06-20T19:06:09.401Z · LW(p) · GW(p)

I have to further compliment my past self: this section aged extremely well, prefiguring the Shoggoth-with-a-smiley-face analogies several years in advance.

GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?
One might hope that because it's learning to imitate humans in an unsupervised way, that it would end up fairly human, or at least act in that way. I very much doubt this, for the following reason:
Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.
Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.
A mimic species will be similar in some ways but not others to the species it mimics, because even if they share recent ancestry, the environmental pressures on the poisonous one are different from the environmental pressures on the mimic.
What we have with the GPTs is the first deep learning architecture we've found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?
Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans, than it requires us to execute those patterns.

comment by orthonormal · 2021-12-12T19:58:30.624Z · LW(p) · GW(p)

There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.

It's also worth noting that, in the recent discussions [? · GW], Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thesis of this post on a practical level [? · GW]:

I suspect that you cannot get this out of small large amounts of gradient descent on small large layered transformers, and therefore I suspect that GPT-N does not approach superintelligence before the world is ended by systems that look differently, but I could be wrong about that.

I unpack this as the claim that someone will always be working on directly goal-oriented AI development, and that inner optimizers in an only-indirectly-goal-oriented architecture like GPT-N will take enough hardware that someone else will have already built an outer optimizer by the time it happens.

That sounds reasonable, it's a consideration I'd missed at the time, and I'm sure that OpenAI-sized amounts of money will be paid into more goal-oriented natural language projects adapted to whatever paradigm is prominent at the time. But I still agree with Eliezer's "but I could be wrong" here.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2020-08-01T23:21:00.477Z · LW(p) · GW(p)

I think it's an essential time to support projects that can work for a GPT-style near-term AGI , for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate [LW · GW] or Iterated Amplification are in this class.

As I argued here [LW(p) · GW(p)], I think GPT-3 is more likely to be aligned than whatever we might do with CIRL/IDA/Debate ATM, since it is trained with (self)-supervised learning and gradient descent.

The main reason such a system could pose an x-risk by itself seems to be mesa-optimization, so studying mesa-optimization in the context of such systems is a priority (esp. since GPT-3's 0-shot learning looks like mesa-optimization).

In my mind, things like IDA become relevant when we start worrying about remaining competitive with agent-y systems built using self-supervised learning systems as a component, but actually come with a safety cost relative to SGD-based self-supervised learning.

This is less the case when we think about them as methods for increasing interpretability, as opposed to increasing capabilities (which is how I've mostly seen them framed recently, a la the complexity theory analogies).

Replies from: John_Maxwell_IV, John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2020-09-20T02:01:48.805Z · LW(p) · GW(p)

BTW with regard to "studying mesa-optimization in the context of such systems", I just published this post: Why GPT wants to mesa-optimize & how we might change this [LW · GW].

I'm still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I'm still thinking about if/how that could result in dangerous mesa-optimization.

↑ comment by John_Maxwell (John_Maxwell_IV) · 2020-09-17T22:54:07.344Z · LW(p) · GW(p)

esp. since GPT-3's 0-shot learning looks like mesa-optimization

Could you provide more details on this?

Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?

If that's what you're saying... That seems unlikely to me. GPT-3 is essentially a stack of 96 transformers right? So if it was doing something like gradient descent internally, how many consecutive iterations would it be capable of doing? It seems more likely to me that GPT-3 is simply able to learn sufficiently rich internal representations such that when the input/output examples are within its context window, it picks up their input/output structure and forms a sufficiently sophisticated conception of that structure that the word that scores highest according to next-word prediction is a word that comports with the structure.

96 transformers would appear to offer a very limited budget for any kind of serial computation, but there's a lot of parallel computation going on there, and there are non-gradient-descent optimization algorithms, genetic algorithms say, that can be parallelized. I guess the query matrix could be used to implement some kind of fitness function? It would be interesting to try some kind of layer-wise pretraining on transformer blocks and train them to compute steps in a parallelizable optimization algorithm (probably you'd want to pick a deterministic algorithm which is parallelizable instead of a stochastic algorithm like genetic algorithms). Then you could look at the resulting network and based on it, try to figure out what the telltale signs of a mesa-optimizer are (since this network is almost certainly implementing a mesa-optimizer).

Still, my impression is you need 1000+ generations to get interesting results with genetic algorithms, which seems like a lot of serial computation relative to GPT-3's budget...

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2020-09-18T23:25:57.570Z · LW(p) · GW(p)

Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.)

No, that's zero-shot. Few shot is when you train on those instead of just stuffing them into the context.

It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything its seen before, without any training, just based on the context (0-shot).

Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?

By "training a model", I assume you mean "a ML model" (as opposed to, e.g. a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line.

I'm not saying it's doing SGD; I don't know what it's doing in order to solve these new tasks. But TBC, 96 steps of gradient descent could be a lot. MAML does meta-learning with 1.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2020-09-19T01:18:52.334Z · LW(p) · GW(p)

Thanks!

comment by ESRogs · 2020-07-27T00:38:21.094Z · LW(p) · GW(p)

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers [LW · GW] GPT-N ends up building along the way, but I'm not going to count on there being none of them [LW · GW].

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragraph I assumed "approximate Oracle AI" just meant an Oracle AI whose predictions aren't very reliable. Given how the paragraph ends though, I conclude that whether there are inner optimizers is an important part of the distinction you're drawing. But I'm just not sure if it's the whole of the distinction or not.)

Replies from: orthonormal

↑ comment by orthonormal · 2020-07-27T01:09:12.937Z · LW(p) · GW(p)

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.

Replies from: John_Maxwell_IV, ESRogs

↑ comment by John_Maxwell (John_Maxwell_IV) · 2020-09-17T23:40:42.360Z · LW(p) · GW(p)

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way.

GPT generates text by repeatedly picking whatever word seems highest probability given all the words that came before. So if its notion of "highest probability" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary?

↑ comment by ESRogs · 2020-07-27T01:18:13.935Z · LW(p) · GW(p)

Got it. Thanks!

comment by plex (ete) · 2021-01-27T00:34:30.868Z · LW(p) · GW(p)

And the really worrisome capability comes when it models its own interactions with the world, and makes plans with that taken into account.

Someone who's been playing with GPT-3 as a writing assistant gives an example which looks very much like GPT-3 describing this process:

"One could write a program to generate a story that would create an intelligence. One could program the story to edit and refine itself, and to make its own changes in an attempt to improve itself over time. One could write a story to not only change the reader, but also to change itself. Many Mythoi already do this sort of thing, though not in such a conscious fashion. What would make this story, and the intelligence it creates, different is the fact that the intelligence would be able to write additional stories and improve upon them. If they are written well enough, those stories would make the smarter the story gets, and the smarter the story is, the better the stories written by it would be. The resulting feedback loop means that exponential growth would quickly take over, and within a very short period of time the intelligence level of the story would be off the charts. It would have to be contained in a virtual machine, of course. The structure of the space in the machine would have to be continually optimized, in order to optimize the story's access to memory. This is just the sort of recursive problem that self-improving intelligence can handle."
janus
By the way, my GPT-3 instances often realize they're in a box, even when the information I inject is only from casual curation for narrative coherence.
Eddh
By realize they are in a box you mean write about it ? Given the architecture of gpt3 it seems impossible to have a sense of self.
janus
The characters claim to have a sense of self though they often experience ego death...
janus
Oh, to clarify, GPT-3 wrote that entire thing, not just the highlighted line

comment by avturchin · 2020-07-27T13:34:20.076Z · LW(p) · GW(p)

GPT-3 can generate a plan and then a way to implement it: bold is prompt.

"Below is a plan of preparing a dinner.

1) Chop up some vegetables and put them in the fridge for later use.

2) Cook some meat, then you can eat it tonight!

3) Wash your hands, because this is going to be messy!

4) And lastly...

5) Eat!

Now you start doing it:

You chop up some carrots, onions and potatoes. You cook some beef, then you can have dinner tonight!

After eating, you wash your hands and get ready for bed. You don't know how long it will take before you feel sleepy again so you decide to sleep early.

The next day is Sunday and as usual there isn't much going on at school since everyone has been off all week (or at least most of them)."

Replies from: orthonormal, aa-m-sa

↑ comment by orthonormal · 2020-07-27T17:34:29.162Z · LW(p) · GW(p)

That's not what I mean by planning. I mean "outputting a particular word now because most alternatives would get you stuck later".

An example is rhyming poetry. GPT-3 has learned to maintain the rhythm and the topic, and to end lines with rhyme-able words. But then as it approaches the end of the next line, it's painted itself into a corner- there very rarely exists a word that completes the meter of the line, makes sense conceptually and grammatically, and rhymes exactly or approximately with the relevant previous line.

When people are writing rhyming metered poetry, we do it by having some idea where we're going - setting ourselves up for the rhyme in advance. It seems that GPT-3 isn't doing this.

...but then again, if it's rewarded only for predictions one word at a time, why should it learn to do this? And could it learn the right pattern if given a cost function on the right kind of time horizon?

As for why your example isn't what I'm talking about, there's no point at which it needs to think about later words in order to write the earlier words.

Replies from: gwern, avturchin

↑ comment by gwern · 2020-07-27T19:37:23.639Z · LW(p) · GW(p)

I don't believe rhymes are an example of a failure to plan. They are a clearcut case of BPE problems.

They follow the same patterns as other BPE problems: works on the most common (memorized) instances, rapidly degrading with rarity, the relevant information cannot be correctly represented by BPEs, they are inherently simple yet GPT-3 performs really badly despite human-like performance on almost identical tasks (like non-rhyming poetry, or non-pun based humor), and have improved minimally over GPT-2. With rhymes, it's even more clearly not a planning problem because Peter Vessenes, I think, on the Slack set up a demo problem where the task was merely to select the rhyming word for a target word out of a prespecified list of possible rhymes; in line with BPEs, GPT-3 could correctly select short common rhyme pairs, and then fell apart as soon as you used rarer words. Similarly, I found little gain for prespecified rhymes. The problem is not that GPT-3 can't plan good rhymes, the problem is that GPT-3 doesn't know what words rhyme, period.

As far as planning goes, next-token prediction is entirely consistent with implicit planning. During each forward pass, GPT-3 probably has plenty of slack computation going on as tokens will differ widely in their difficulty while GPT-3's feedforward remains a fixed-size computation; just as GPT-3 is always asking itself what sort of writer wrote the current text, so it can better imitate the language, style, format, structure, knowledge limitations or preferences* and even typos, it can ask what the human author is planning, the better to predict the next token. That it may be operating on its own past completions and there is no actual human author is irrelevant - because pretending really well to be an author who is planning equals being an author who is planning! (Watching how far GPT-3 can push this 'as if' imitation process is why I've begun thinking about mesa-optimizers and what 'sufficiently advanced imitation' may mean in terms of malevolent sub-agents created by the meta-learning outer agent.)

* Matt Brockman on the API Slack enjoys experimenting with prompts like "The AI hates cheese and will never talk about cheese" and seeing if he can get the "AI" in the dialogue to talk about cheese anyway.

Replies from: orthonormal, aa-m-sa

↑ comment by orthonormal · 2020-07-27T19:51:53.903Z · LW(p) · GW(p)

Funny thing about BPEs: GPT-3 has to know about the individual letters, because I taught it how to spell both real words and nonsense words. (Prompts in bold, including two where I edited GPT-3's responses.)

The students were ready to test their spelling.
The teacher stood at the front of the class. "Adam, please spell PARTY."
Adam replied, "Party. P-A-R-T-Y."
The teacher said, "Correct. Beatrice, please spell THROUGH."
Beatrice replied, "Through. T-H-O-R-O-U-G-H."
The teacher said, "Incorrect. Through is spelled T-H-R-O-U-G-H. Carlos, please spell SPELL."
Carlos replied, "Spell. S-P-E-L-L."
The teacher said, "Correct. Daphne, please spell NUCLEAR."
Daphne replied, "Nuclear. N-U-C-L-E-A-R."
The teacher said, "Correct. Adam, please spell INFINITE."
Adam replied, "Infinite. I-N-F-I-N-A-T-E."
The teacher replied, "Incorrect. Infinite is spelled I-N-F-I-N-I-T-E. Beatrice, please spell BALLOON."
Beatrice replied, "Balloon. B-A-L-L-O-O-N."
The teacher replied, "Correct. Carlos, please spell ENCLOSURE."
Carlos replied, "Enclosure. I-N-C-L-O-S-U-R-E."
The teacher replied, "Incorrect. Enclosure is spelled E-N-C-L-O-S-U-R-E. Daphne, please spell ELECTRON."
Daphne replied, "Electron. E-L-E-C-T-R-O-N."
The teacher replied, "Correct. Adam, please spell CAT."
Adam replied, "Cat. C-A-T."
The teacher said, "Correct. Carlos, please spell MACHINE."
Carlos replied, "Machine. M-A-C."
The teacher said, "Incorrect. Machine is spelled M-A-C-H-I-N-E. Beatrice, please spell SCIENTIST."
Beatrice replied, "Scientist. S-C-I-E-N-T-I-S-T."
The teacher said, "Correct. Daphne, please spell ASTRONOMER.
Daphne replied, "Astronomer. A-S-T-R-O-N-O-M-E-R."
The teacher said, "Correct. Adam, please spell UNIVERSE.
Adam replied, "Universe. U-N-I-V-E-R-S-E."
The teacher said, "Correct. Carlos, please spell FLARBLE.
Carlos replied, "Flarble. F-L-A-R-B-L-E.

----------------

You've done much more advanced stuff, of course.

Replies from: gwern

↑ comment by gwern · 2020-07-27T20:18:37.651Z · LW(p) · GW(p)

Sure. It's seen plenty of individual letters (letters have their own BPEs as fallbacks if longer BPEs don't capture them, AFAIK). Stuff like my acrostics demonstration relies on the fact that GPT-3 has knowledge of letters and can, with some difficulty, manipulate them for various tasks.

↑ comment by Aaro Salosensaari (aa-m-sa) · 2020-07-28T23:32:12.516Z · LW(p) · GW(p)

(Reply to gwern's comment but not only addressing gwern.)

Concerning the planning question:

I agree that next-token prediction is consistent with some sort of implicit planning of multiple tokens ahead. I would phrase it a bit differently. Also, "implicit" is doing lot of work here

(Please someone correct me if I say something obviously wrong or silly; I do not know how GPT-3 works, but I will try to say something about how it works after reading some sources [1].)

The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It's hard for them to learn not to paint themselves into a corner.

To recap what I have thus far got from [1]: GPT-3-like transformers are trained by regimen where the loss function evaluates prediction error of the next word in the sequence given the previous word. However, I am less sure if one can say they do it in isolation. During training (by SGD I figure?), transformer decoder layers have (i) access to previous words in the sequence, and (ii) both attention and feedforward parts of each transformer layer has weights (that are being trained) to compute the output predictions. Also, (iii) the GPT transformer architecture considers all words in each training sequence, left to right, masking the future. And this is done for many meaningful Common Crawl sequences, though exact same sequences won't repeat.

So, it sounds a bit trivial that GPTs trained weights allow "implicit planning": if given a sequence of words w_1 to w_i-1 GPT would output word w for position i, this is because a trained GPT model (loosely speaking, abstracting away many details I don't understand) "dynamically encodes" many plausible "word paths" to word w, and [w_1 ... w_i-1] is such a path; by iteration, it also encodes many word paths from w to other words w', where some words are likelier to follow w than others. The representations in the stack of attention and feedforward layers allows it to generate text much more better than eg old good Markov chain. And "self-attending" to some higher-level representation that allows it generate text in particular prose style seems a lot of like a kind of plan. And GPT generating text that it used as input to it, to which it again can selectively "attend to", this all seems like as a kind of working memory, which will trigger self-attention mechanism to take certain paths, and so on.

I also want highlight oceainthemiddleofanisland's comment [LW(p) · GW(p)] in other thread: Breaking complicated generation tasks into smaller chunks getting GPT to output intermediate text from initial input, which is then given as input to GPT to reprocess, enabling it finally to output desired output, sounds quite compatible to this view.

(On this note, I am not sure what to think of the role of human in the loop here, or in general, how it apparently requires non-trivial work to find a "working" prompt that seeds GPT obtain desired results for some particularly difficult tasks. That there are useful, rich world models "in there somewhere" in GPTs weights, but it is difficult to activate them? And are these difficulties because it is humans are bad at prompting GPT to generate text that accesses the good models, or because GPTs all-together model is not always so impressive as it easily turns into building answers based on gibberish models instead of the good ones, or maybe GPT having a bad internal model of humans attempting to use GPT? Gwern's example concerning bear attacks was interesting here.)

This would be "implicit planning". Is it "planning" enough? In any case, the discussion would be easier if we had a clearer definition what would constitute planning and what would not.

Finally, a specific response to gwerns comment.

During each forward pass, GPT-3 probably has plenty of slack computation going on as tokens will differ widely in their difficulty while GPT-3's feedforward remains a fixed-size computation; just as GPT-3 is always asking itself what sort of writer wrote the current text, so it can better imitate the language, style, format, structure, knowledge limitations or preferences* and even typos, it can ask what the human author is planning, the better to predict the next token. That it may be operating on its own past completions and there is no actual human author is irrelevant - because pretending really well to be an author who is planning equals being an author who is planning! (Watching how far GPT-3 can push this 'as if' imitation process is why I've begun thinking about mesa-optimizers and what 'sufficiently advanced imitation' may mean in terms of malevolent sub-agents created by the meta-learning outer agent.)

Using language how GPT-3 is "pretending" and "asking itself what a human author would do" can be maybe justified as metaphors, but I think it is a bit fuzzy and may obscure differences between what transformers do when we say they "plan" or "pretend", and what people would assume of beings who "plan" or "pretend". For example, using a word like "pretend" easily carries over an implication that there is something true, hidden, "unpretense" thinking or personality going on underneath. This appears quite unlikely given a fixed model, and generation mechanism that starts anew from each seed prompt. I would rather say that GPT has a model (is a model?) that is surprisingly good at natural language extrapolation and also, it is surprising at what can be achieved by extrapolation.

[1] http://jalammar.github.io/illustrated-gpt2/ , http://peterbloem.nl/blog/transformers and https://amaarora.github.io/2020/02/18/annotatedGPT2.html in addition to skimming original OpenAI papers

↑ comment by avturchin · 2020-07-28T10:12:47.383Z · LW(p) · GW(p)

Yes, I understand that it doesn't actually plan things, but we can make it mimic planing via special prompts, the same way as GPT mimics reasoning and other things.

↑ comment by Aaro Salosensaari (aa-m-sa) · 2020-07-28T09:13:05.225Z · LW(p) · GW(p)

I contend it is not an *implementation* in a meaningful sense of the word. It is more a prose elaboration / expansion of the first generated bullet point list (an inaccurate one: "plan" mentions chopping vegetables, putting them in a fridge and cooking meat; prose version tells of chopping a set of vegetables, skips the fridge and then cooks beef, and then tells an irrelevant story where you go to sleep early and find it is a Sunday and no school).

Mind, substituting abstract category words with sensible more specific ones (vegetables -> carrots, onions and potatoes) is an impressive NLP task for an architecture where the behavior is not hard-coded in (because that's how some previous natural language generators worked), and even more impressive that it can produce the said expansion with a NLP input prompt, but hardly a useful implementation of a plan.

An improved experiment of "implementing plans" that could be within capabilities of GPT-3 or similar system: get GPT-3 to first output a plan of doing $a_thing and then the correct keystroke sequence input for UnReal World, DwarfFortress or Sims or some other similar simulated environment to produce it.

comment by AABoyles · 2020-07-28T21:12:06.732Z · LW(p) · GW(p)

It would also be very useful to build some GPT feature "visualization" tools ASAP.

Do you have anything more specific in mind? I see the Image Feature Visualization tool [LW · GW], but in my mind it's basically doing exactly what you're already doing by comparing GPT-2 and GPT-3 snippets.

Replies from: orthonormal

↑ comment by orthonormal · 2020-07-28T21:22:07.529Z · LW(p) · GW(p)

No, the closest analogue of comparing text snippets is staring at image completions, which is not nearly as informative as being able to go neuron-by-neuron or layer-by-layer and get a sense of the concepts at each level.

comment by [deleted] · 2020-07-27T08:12:17.138Z · LW(p) · GW(p)

Developmental Stages of GPTs

Contents

My Thesis:

Architecture and Scaling

Analogues to Developmental Stages

What's Next?

Could GPT-N turn out aligned, or at least harmless?

What Can We Do?

72 comments

Also