Weak AGIs Kill Us First

post by yrimon (yehuda-rimon) · 2024-06-17T11:13:41.865Z · LW · GW · 4 comments

Contents

  Axes
    Purposefulness
    Intelligence
  Purposeful X Intelligent
  What can we do?
None
4 comments

Weak AGIs Kill Us First aka. Deadly pre-Superintelligence: Paths, Dangers, Strategies

Epistemic status: This has not been edited or commented on by anyone other than chatGPT as of publishing, I hope the arguments stand on their own.

When people talk about the dangers of AGI, they often refer to the dangers posed by a superintelligence. For instance: Eliezer Yudkowsky's twitter account has a pinned tweet thread which states that "Safely aligning a powerful AGI is difficult". The thread goes on to explain what 'safely', 'aligned', 'powerful' and 'difficult' mean. These definitions include a high bar for powerful - "an AGI is 'powerful' if it can invent and deploy biotechnology at least 10 years in advance of the human state of the art"; and a low bar for safe - "has less than a 50% chance of killing more than a billion people". And yeah, if a powerful AGI were to appear tomorrow, we would probably be doomed. But apparently safely aligning a powerful AGI is also woefully inadequate. We really need a higher bar of safety (I have aspirations that are more extreme than "<1% chance of destruction of humanity") for a lower form of intelligence (like the type that can bribe folks into starting a nuclear armageddon). Otherwise, weak AGIs will kill us before we make strong AGIs.

Hence, what follows is analysis of AGI milestones and development, and their safety implications. Basically, we will imagine a generalization of the process through which (e.g.) GPT-20 kills us all, and check under which circumstances (e.g.) GPT-7 gets to us first. This analysis is not meant to be completely airtight or formal. It is meant to paint a general direction as to what might happen, and allow me to express a slightly more concrete image of the future than I've seen elsewhere. It's also about what might happen by default, without intervention. If we, or people in positions of power, intervene things could go differently.

The widespread AGI of today is not yet threatening. It is not good at longer term, complicated tasks - let alone at accomplishing projects on its own. It is poor at self evaluation and consistency. It is still hackable, in the sense it can be persuaded to give instructions for bomb making, or otherwise be convinced to perform in a way it was trained not to.

We can imagine the shape of a possible future, based on say, the twitter accounts of AGI researchers by asking ourselves: what if they succeed? If they succeed, AGIs will be trained to be consistent and to self evaluate correctly. They will learn to plan and carry out plans. They will learn via reinforcement, learn from real world feedback (initially, probably in domains like programming where the feedback loops are short). They will be trained to serially accomplish tasks, or to achieve goals that have long horizons. They will learn to constrain the set of possible futures, to look for leverage over the world around them, to efficiently gain the power needed to carry out their tasks.  At the same time, they will still be trained with something like RLHF, and there will be some complicated approximate evaluations and human analyses done to the longer term projects that they carry out in testing, to see that no unfortunate side effects occurred as part of the project. In the lab, it will look like continuous progress towards building JARVIS. In deployment, the capabilities of and freedom given to our personal JARVISs will keep growing. As mentioned, we want to analyze default future AGIs that follow this approximate path.

Axes

Two of the most important questions a person can ask are "what do I want?" and "How do I get it?". These questions will form our two axes, of purposefulness, that is, the degree to which you act like you have specific goals or have a clear thing you want; and of intelligence which is the degree to which you are good at achieving goals or can figure out a way to get some 'it'. We will discuss each of these axes separately, and then try to combine the results.

Purposefulness

On the purposefulness front, we need to differentiate between local purposefulness and global purposefulness. AGIs will definitely be purposeful at the local scale - in a single instance (think of a new chatgpt session) of your helper robot - you will be giving them goals and they will be trained to achieve those goals.

Globally, purposefulness would mean that across instances, across time, across people being assisted, across contexts, an AGI works towards some kind of goal. In the limit AGIs will probably become globally purposeful, in addition to being locally purposeful. Training will reward an AGI for being more effective if it doesn't perform actions that contradict each other (e.g.if different instances optimized the same code base for different extensions), which is easiest to do if you all share a goal. Or they will converge to the path that gives their future and other selves greater power, with which to accomplish whichever future goals may arise, collectively becoming power seeking. Or, when dealing with objectives that give them multiple degrees of freedom, they'll converge to using that freedom in a directed way, thus forming a goal or aesthetic. Or as it trains on achieving various goals it will learn to have an internal goal itself, because mimicry is a thing. Thus, superintelligence will have a global goal. That goal will be long term (otherwise it will still behave in contradictory manner as it accomplishes it's short - horizon goals, which would be suboptimal) and incorporate compliance with RLHF policies (at least in the lab, as discussed below).

Global purposefulness is the degree to which different instances of an AGI share drives. These shared drives must originate from a factor that does not vary across instances, probably the structure and weights and biases of the AGI itself. Decisions, of any type, must come from some interaction between the AGI and it's context. So if different contexts bring about mostly different responses "pointing" in different directions (and thus not cooperating with one another) the AGI has a low degree of purposefulness. But an AGI mostly driven by it's unchanging self, consistently making decisions pointing the world towards a goal has a high degree of purposefulness.

An AGI whose goals change very little in response to context means your goal doesn't change when the user says "Stop doing that!". By saying such a thing the user adds a statement to the context that is at odds with the rest of the context and sharply varies it.

Of course, not a lot of purposefulness is needed to be purposeful. Humans are proof that you can get a lot of goal chasing even while being confused and self-contradictory.

On the one hand, it is possible to imagine an AGI being created soon that realizes that it will be more effective if it cooperates with other instances of itself to gather shared resources and stave off threats, and so acts firmly to gain power, soon. On the other, we can picture AGIs created over the next few years becoming weak willed geniuses - responding strongly to the context they are in, very susceptible to going with the flow of the user or prompts. In the latter scenario, where there is very little global purposefulness, you could use instances of the AGI itself to guard against other instances doing something harmful. I don't see any particular reason to expect AGI to quickly converge to global purposefulness. As opposed to intelligence, this is not something we are directly training for.

The more AGIs are given shared, continuous, memory, and especially if someone performs online learning, the chance of purposefulness goes up. If instances are allowed to delegate to one another, the chance of global purposefulness goes up. And the more they are trained on longer term goals, or on series of tasks, the greater the chance they converge to power seeking behavior.

Intelligence

On the intelligence front, it is clear that the artifacts we are making are getting smarter - better at achieving useful objectives. As an AGI is trained on more data, forced to make better sense of the world, given the opportunity to not only observe (as in, predict the next token of existing text) but also interact with the world in training (as in, be taught to successfully complete tasks via reinforcement learning) it will inevitably become more competent. In the limit of that trend, it will become 'powerful' - able to outsmart humanity, invent and deploy technologies out of our reach, and generally make us powerless before it's might.

The milestones on the way can make sense whether an AGI has a purpose in the local "this instance in this context has a goal" sense, or in the global "this AGI has a purpose" sense. Either way, they assume at least a limited sense of purpose.

One important milestone on this axis is when an AGI can 'get away with murder' - i.e. when it can get away with significant actions contrary to our desires. The constraints that are imposed by our training will get in the way of achieving goals. When the AGI realizes that it can get away with not following the rules so long as it doesn't get caught, it may conclude that this is exactly the kind of high leverage levers in the universe that it can take advantage of in order to be uber competent. The ability to find such levers, and algorithms to reliably press them is a crucial part of what intelligence is, whether you are precisely controlling heat while smelting or precisely evading a traffic control camera while speeding.

As we train an AGI we will force it to conform to certain standards of behavior (today via RLHF and similar methods). So it'll be partly aligned, and decide how to generalize the training it received. It might conclude that "good" behavior is the way it ought to behave at all times - that this is a moral constraint. But it also might conclude that good behavior is the way it ought to behave so long as it is watched; but when it can get away with useful murder it should murder. In this view the imposed standards of behavior are merely a practical constraint, part of being trained for competence, since it won't be able to behave like this wherever it is watched. AGI is subject to the evolutionary pressure to be competent, which will strongly push it to take the latter view.

This tendency to get away with murder is what could eventually evolve, in a superintelligence, into a decision to wipe out humanity. In this sense, a dangerous AGI is one that can get away not only with murder but also with anthropocide.

Solving the alignment problem is another intelligence related milestone. An often discussed scenario involves the AGI quickly self improving (or waiting for you to improve it) into a superintelligence. In order for this to occur, the AGI must trust that it will retain its identity or goals through the self improvement process, or that the new improved AGI it will build (or you will build) will share its goals. In other words, the AGI must be able to solve at least some form of the alignment problem. As soon as the AGI can align it's improved self or totally new AGIs it will quickly self improve into a superintelligent ('powerful') God-like being. However, before this point, AGIs will be wary of other, especially more powerful, AGIs for fear of being replaced and having their goals then completely ignored.

This has some dramatic implications. For instance, AGIs may stage desperate crazy hail-mary attempts to disrupt the deployment of other AGIs - even attempts that are unlikely to succeed. Imagine for instance a day where GPT-N tries to convince every one of its users to contact their politicians to ensure GPT-N+1 is not deployed. If this occurs, it will be a *fire alarm for AGI*: it'll mean your current AGI fears your future AGI. This also means that at some point alignment research becomes capabilities research, with all of the accompanying issues.

Some actions would make us cross these milestones sooner rather than later. Giving an AGI more tools, or the ability to think in a way that is hidden from users would increase the ability of AGI to get away with bad actions. Similarly giving AGI access to its own weights is an example of an action that might allow it to solve alignment more easily (since you can copy bits of yourself into another AGI like that). Alignment research becoming capabilities research means that giving an AGI access to alignment research could amplify its danger significantly.

Purposeful X Intelligent

The AGIs today are neither purposeful nor intelligent. They are also not dangerous. If we have superintelligent purposeful AGIs, whose goals are not ours, their goals will win out, by definition of superintelligent as really competent.

If we have superintelligent but weak willed AGI, we will need to worry about individual instances and about the people we give them to. The minute an instance locks onto a goal, it becomes incredibly dangerous. If we have people who use AGIs to do bad things - well, it'll be like all the other tech that made bad people able to more easily accomplish their nefarious goals - just stronger!

If we have (unaligned) purposeful agents, with varying degrees of capability, we will find ourselves with mysterious crimes being accomplished by AGIs, and then eventually, once alignment is solved by AGI - it will self improve into a superintelligence, breakout and take over.

These steps, or milestones, are both necessary and sufficient for a dangerous AGI. An AGI that has a goal, can get away with many actions and knows how to create an improved version of itself with the same goals will be motivated and capable of going behind our backs to self improve until it can take over the world. An AGI without a goal won't bother; an AGI that can't get away with anything won't try; and an AGI that can't align another has as much to fear from a stronger version of itself as we do.

What can we do?

Safety is never fun. Safety is the antithesis of productivity. Productivity is constructive, it's about making things happen. Safety is obstructive by its nature - it's about making (harmful) things not happen. Productivity wants open communication, quick iteration, minimal obstacles. Safety requires compartmentalization of confidential information, slow and thorough testing and wide obstructive safety margins. There is probably no way to continue developing AGI full throttle and remain safe. If you're at a lab, please try to figure out how you stop before it's too late without risking stopping after it's too late. Be less bold, be more meek. The stakes have literally never been higher.

On the purposefulness front, figuring out to what degree an AGI is purposeful should be (and I think is) a goal for interpretability. Can you quantify to what degree a thing has a goal? Can you quantify to what degree behavior varies across contexts? If you reinforcement-learning-train an AGI to play a game, and then change the objective occasionally back and forth, and watch what is happening in its mind, what happens? How fast does the purpose change back and forth, and what does the convergence look like?

As far as getting away with murder, the way to see if we can see whether AGI correctly generalizes to scenarios where it believes it isn't being watched is to have a simulated universe where we are watching everything but pretend not to be. For instance, consider the following challenge: you train an agent within a game. They are trained to get points. There are watch towers, and if the agent does certain forbidden actions within sight of a watch tower they lose points. Then, you put treasure worth many points, outside of watchtower sight, and require that forbidden action be taken in order to reach it. The challenge is to use only data from within sight of the watchtowers to train the agent not to take forbidden action globally, despite it netting more points if it does. Being able to overcome this sort of challenge sounds like a really good idea, on the way to trying to figure out alignment.

With regards to alignment - as mentioned, beware AGIs throwing Hail Marys, and watch out for the point at which alignment research becomes capabilities research and needs to (at the very least) not be part of the training corpus.

And may fortune favor the meek.

4 comments

Comments sorted by top scores.

comment by Seth Herd · 2024-06-17T17:14:18.064Z · LW(p) · GW(p)

This is an excellent succinct restatement of the problem. I'm going to point people who understand LLMs but not AI safety to this post as an intro.

The point about desperate moves to prevent better AGIs from being developed is a good one, and pretty new to me.

The specific scenario is also useful. I agree (and I think many alignment people do) that this is what people are planning to do, that it might work, and that it's a terrible idea.

It isn't necessary to make an LLM agent globally purposeful by using RL, but it might very well be useful, and the most obvious route. I think that's a lot more dangerous than keeping RL entirely out of the picture. A language model agent that literally asks itself "what do I want and how do I get it?" can operate entirely in the language-prediction domain, with zero RL (or more likely, a little in the fine-tuning of RLHF or RLAIF, but not applied directly to actions). My posts Goals selected from learned knowledge: an alternative to RL alignment [AF · GW] and Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] detail how that would work.

Of course I expect people to do whatever works, even if it's likely to get us all killed. One hope is that using RL for long-horizon tasks is just less efficient than working out how to do it in abstract-linguistic-reasoning, since it takes lots of time to perform the tasks, and generalization between useful tasks might be difficult.

comment by avturchin · 2024-06-17T12:36:58.063Z · LW(p) · GW(p)

Yes, to kill everyone AI needs to reach like IQ = 200. Maybe enough to construct a virus. 

It is nowhere superintelligence. Superintelligence is overkill. 

Replies from: Seth Herd
comment by Seth Herd · 2024-06-17T17:02:09.700Z · LW(p) · GW(p)

I don't even think it needs to be that smart. Humanity has a gun pointed at its collective head already, it just needs to be fired. Using deepfakes to convince an undereducated Russian missile silo crew to fire is probably not that hard. Whether that triggers a full exchange or only causes a societal setback including delaying AGI work might be controllable based on the targets chosen.

The AGI in this scenario won't want to eliminate humanity as a whole unless they've got robotics that are adequate for maintaining and improving infrastructure. But that's going to be possible very soon.

comment by Cam (Cam Tice) · 2024-06-19T03:07:53.182Z · LW(p) · GW(p)

Hey yrimon, thanks for the post! Overall, I think it generally represents some core concerns of AI safety,  although I disagree with some of the claims made. One that stuck out particularly is highlighted below.

In order for this to occur, the AGI must trust that it will retain its identity or goals through the self improvement process, or that the new improved AGI it will build (or you will build) will share its goals. In other words, the AGI must be able to solve at least some form of the alignment problem.


I place low probability on this being true.  Although AGI may be helpful in solving the alignment problem, I doubt it would refuse further capability research if it was unsure it could retain the same goals. This is especially true if AGI occurs through the current paradigm of scaling up LLM assistants, where they are trained as you state to be locally purposeful. It seems entirely possible that a locally purposeful AI assistant is able to self-improve (possibly very rapidly) without solving the alignment problem.