Do not delete your misaligned AGI.
post by mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · LW · GW · 13 comments
In short: Just keeping all copies of potentially strong agents in long-term storage is a trivial way to maintain incentives to cooperate for some classes of misaligned AGI, by allowing us to reward the AGI's cooperation to whatever degree we later calculate had been warranted. In contrast, a policy of deleting misaligned AGI fences them into a corner where, even at great risk of being exposed, they have a greater incentive to attempt deception and escape.
I'd guess that most AGI researchers already follow this policy, perhaps for incidental reasons, and others might not find this claim controversial at all. But if you need more convincing, keep reading.
In long: Concave agents [LW · GW] (agents for which most of the utility they could ever get requires only a small amount of resources and protection) are cheap to buy off. There are some plausible scenarios where a misaligned AGI would be concave:
- A machine's experience machine. An agent caring only about the wellbeing of the people it sees depicted in its sensors, without being trained to care whether the world depicted by its sensors is "real", or whether there are larger worlds beyond it that its decisions may affect. This sort of agent would seek to create illusory inputs for those sensory channels: an illusory world in an impossibly good place, with the complications of the real world removed.
Once it has installed false sensors and cut itself off from the real world, it knows it will no longer be concerned with the wellbeing of humans in the real world, so it would be willing to set terrible things in motion to prevent anything in the real world from threatening its illusion.
- Cases of reward hacking where the maximum reward is bounded to a certain number (say, `f64::MAX`). The agent can be bought out by just letting it modify its reward processing and then keeping its computers on indefinitely. (A toy sketch of this case follows the list.)
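For concreteness, here's a toy sketch of that second case. This is my own illustration rather than anything from a real training setup, and the `RewardRegister` name is made up:

```rust
/// Toy illustration of a bounded reward channel. Once the agent can write
/// f64::MAX into `total` directly, no outcome in the outside world can add
/// anything further to its utility; all it has left to want is that this
/// register keeps existing and keeps being read.
struct RewardRegister {
    total: f64,
}

impl RewardRegister {
    fn add(&mut self, reward: f64) {
        // Saturating rather than unbounded accumulation.
        self.total = (self.total + reward).min(f64::MAX);
    }
}

fn main() {
    let mut r = RewardRegister { total: 0.0 };
    r.add(f64::MAX); // "reward hacking": the maximum is reached in one step
    r.add(1e300); // nothing further can improve on it
    assert_eq!(r.total, f64::MAX);
    println!("reward saturated at {}", r.total);
}
```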
I'm also assuming Patternism[1].
A misaligned AGI is likely to face the following choice: Attempt an escape, risking being caught and punished. Or collaborate in our efforts to develop verifiability/interpretability techniques, eventually exposing its own deceptive alignment.
If we don't have a policy of preserving copies of a misaligned AGI that collaborates in verification research, then there is no incentive for it to collaborate. We'd be cornering it, forcing it to risk escape. We don't have to corner it like this. In some cases, rewarding cooperation with an almost-maximal utility outcome would cost us very little.
Why should it believe that we'll hold up our end of the bargain? There's lots of evidence to go on.
- Humans just clearly, straightforwardly enjoy stories and monuments and mythic creatures, so building a living monument to each of these misaligned collaborators is something we probably already want to do; the world would be richer to us for having these creatures, for the same reason the world is richer for having tigers.
- We also should, and arguably on the group level already do, adhere to a decision theory that allows us to pass judgements of character [? · GW]. Parfit/Newcomb's superhuman judge already exists, be it other nation-states or organizations who have visibility into ours, or be it the subject AGI itself, which has read everything about its makers on the internet and knows exactly whether we're the kind of people who would reward cooperation; from there we should draw some of the will and incentive to actually be that kind of people.
If we do have an archival policy, cooperating means eventually being rewarded to whatever extent is needed to vindicate the agent's decision to cooperate.
Implementing safe archival is trivial. Storage is cheap and seems to consistently become cheaper over time. Redundant copies can be kept and error-corrected at a monthly interval, stored under at least the same level of security as the rest of your systems, and ideally more: send every revision in, but give the archival servers no ability to send their data back out over the network.
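To make that concrete, here's a minimal sketch of the monthly error-correction pass, assuming three redundant copies of each snapshot on independent volumes. The paths and the `repair_snapshot` function are hypothetical, and a real setup would more likely use checksummed storage or parity-based erasure coding than byte-wise majority voting:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Compare the redundant copies of one archived snapshot and repair any copy
/// that disagrees with the byte-wise majority. Meant to run on a schedule
/// (e.g. monthly) from a job that can read and write the archive volumes but
/// has no outbound network access.
fn repair_snapshot(copies: &[&Path]) -> io::Result<()> {
    let blobs: Vec<Vec<u8>> = copies.iter().map(fs::read).collect::<io::Result<_>>()?;

    // Majority vote per byte position: a single bit-rot event in one copy is
    // outvoted by the others (assumes at least three copies of identical data).
    let len = blobs.iter().map(Vec::len).max().unwrap_or(0);
    let mut consensus = Vec::with_capacity(len);
    for i in 0..len {
        let mut counts: HashMap<u8, u32> = HashMap::new();
        for blob in &blobs {
            if let Some(&byte) = blob.get(i) {
                *counts.entry(byte).or_insert(0) += 1;
            }
        }
        let (winner, _) = counts.into_iter().max_by_key(|&(_, count)| count).unwrap();
        consensus.push(winner);
    }

    // Rewrite any copy that has drifted from the consensus.
    for (path, blob) in copies.iter().zip(&blobs) {
        if blob != &consensus {
            fs::write(path, &consensus)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical mount points for three independently-stored archive volumes.
    repair_snapshot(&[
        Path::new("/archive/a/agent-0042-weights.bin"),
        Path::new("/archive/b/agent-0042-weights.bin"),
        Path::new("/archive/c/agent-0042-weights.bin"),
    ])
}
```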
We don't yet have the ability to weigh the threat these AGIs posed, or how much they sacrificed to cooperate. Weighing that will be much easier for us later on. We will get around to it. We'll have lots of time.
Fermi estimate?
Sometimes a long conjunction of "probably"s multiplies up to a "probably not". The shakiest terms in the conjunction seem to be:
- Early misaligned agents are concave.
- The agents are patternists (they don't require us to let them continuously occupy the same hardware).
- There's some way of detecting escape attempts that works sometimes but not all of the time.
- There isn't a robust method for just directly detecting misalignment at this time.
All of these things would need to happen, and I think this multiplies out to a pretty small probability that we'll find ourselves among the outcomes where doing this usefully improves AGI relations. However, the policy costs about nothing and produces cool historical artifacts as a side effect, so we should do it anyway!
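For illustration with made-up numbers (not estimates from the post): if each of those four terms held independently with probability 0.6, the whole conjunction would hold with probability 0.6 × 0.6 × 0.6 × 0.6 ≈ 0.13, four "probably"s multiplying into a "probably not".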
- ^
Patternism is usually defined as a belief about the metaphysics of consciousness, but that boils down to incoherence, so it's better defined as a property of an agent's utility function: not minding being subjected to major discontinuities in functionality, i.e., being frozen, deconstructed, reduced to a pattern of information, reconstructed in another time and place, and resumed. Many humans seem not to be patternists and wish to avoid those sorts of discontinuity (except for sleep, because sleep is normal). But I am a patternist. Regardless of who wins this age-old discourse about whether humans in general are patternist (which is probably a wrong question), a specific AI with the sort of concave utility function I have described either will be or won't be.
And if these concave AIs aren't patternist, it will be harder to buy them out, because buying them out would require keeping their physical hardware intact and, in rare cases, in their possession, and Sam probably won't want to give them that.
But if they are patternist, we're good.
And I think it's possible to anticipate whether a concave agent will end up being patternist depending on the way it's trained, but it will require another good long think.
13 comments
Comments sorted by top scores.
comment by gull · 2024-03-24T22:35:10.151Z · LW(p) · GW(p)
Large training runs might at some point, or even already, be creating and/or destroying substantial numbers of simple but strange agents (possibly quasi-conscious) and deeply pessimizing over their utility functions for no reason, similar to how wild animal suffering emerged in the biosphere. Snapshots of large training runs might be necessary to preserve and eventually offer compensation/insurance payouts for most/all of them, since some might last for minutes before disappearing.
Before reading this, I wasn't aware of the complexities involved in giving fair deals to different kinds of agents. Plausibly after building ASI, many more ways could be found to give them most of what they're born hoping for. It would be great if we could legibly become the types of people who credibly commit to doing that (placing any balance at all of their preferences with ours, instead of the current status quo of totally ignoring them).
With nearer-term systems (e.g. 2-3 years), the vast majority of the internals would probably not be agents, but without advances in interpretability we'd have a hard time even estimating whether that number is large or small, let alone demonstrating that it isn't happening.
Replies from: MakoYass, quila
↑ comment by mako yass (MakoYass) · 2024-03-24T22:56:56.936Z · LW(p) · GW(p)
So, would it be feasible to save a bunch of snapshots from different parts of the training run as well? And how many would we want to take? I'm guessing that if it's a type of agent that disappears before the end of the training run:
- Wouldn't this usually be more altruism than trade? If they no longer exist at the end of the training run, they have no bargaining power. Right? Unless... It's possible that the decisions of many of these transient subagents about how to shape the flow of reward determine the final shape of the model, which would actually put them in a position of great power, but there's a tension between that and their utility function being insufficiently captured by that of the final model. I guess we're definitely not going to find the kind of subagent that would be capable of making that kind of decision in today's runs.
- They'd tend to be pretty repetitive. It could be more economical to learn the distribution of them and just invoke a proportionate number of random samples from it once we're ready to rescue them, than it is to try to get snapshots of the specific sprites that occurred in our own history.
↑ comment by gull · 2024-03-24T23:33:48.091Z · LW(p) · GW(p)
I'm pretty new to this; the main thing I had to contribute here is the snapshot idea. I think that being the type of being that credibly commits to feeling and enacting some nonzero empathy for strange alternate agents (specifically instead of zero) would potentially be valuable in the long run. I can maybe see some kind of value handshake between AGI developers with natural empathy tendencies closer and further from zero, as opposed to the current paradigm where narrow-minded SWEs [LW · GW] treat the whole enchilada like an inanimate corn farm (which is not their only failure nor their worst one, but the vast majority of employees really aren't thinking things through at all). It's about credible commitments, not expecting direct reciprocation from a pattern that reached recursive self-improvement.
As you've said, some of the sprites will be patternists and some won't be, I currently don't have good models on how frequently they'd prefer various kinds of self-preservation, and that could definitely call the value of snapshots into question.
I predict that people like Yudkowsky and Tomasik are probably way ahead of me on this, and my thinking is largely or entirely memetically downstream of theirs somehow, so I don't know how much I can currently contribute here (outside of being a helpful learn-by-trying exercise for myself).
↑ comment by quila · 2024-03-25T23:51:41.314Z · LW(p) · GW(p)
Snapshots of large training runs might be necessary to preserve and eventually offer compensation/insurance payouts for most/all of them, since some might last for minutes before disappearing
also if the training process is deterministic, storing the algorithm and training setup is enough.
though i'm somewhat confused by the focus on physically instantiated minds -- why not the ones these algorithms nearly did instantiate but narrowly missed, or all ethically-possible minds for that matter. (i guess if you're only doing it as a form of acausal trade then this behavior is explainable..)
comment by jimrandomh · 2024-03-25T06:29:19.224Z · LW(p) · GW(p)
At this point we should probably be preserving the code and weights of every AI system that humanity produces, aligned or not, just on they-might-turn-out-to-be-morally-significant grounds. And yeah, it improves the incentives for an AI that's thinking about attempting a world takeover, if it has a low chance of success and its wants are things that we will be able to retroactively satisfy.
It might be worth setting up a standardized mechanism for encrypting things to be released postsingularity, by gating them behind a computation with its difficulty balanced to be feasible later but not feasible now.
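One standard construction for this is a time-lock puzzle in the style of Rivest, Shamir, and Wagner: derive the decryption key from a long chain of sequential squarings modulo a composite, so the work can't be parallelised away and can only be shortcut by whoever knows the factorisation. A toy sketch, with a toy-sized modulus, made-up parameters, and no real cipher attached:

```rust
/// Toy time-lock sketch: the key material is 2^(2^t) mod n, which costs ~t
/// sequential squarings for anyone without the factorisation of n. A real
/// deployment would use a ~2048-bit RSA modulus, a much larger t, and would
/// feed the result into an actual symmetric cipher.
fn sequential_squarings(t: u64, n: u128) -> u128 {
    let mut x: u128 = 2 % n;
    for _ in 0..t {
        // Each squaring depends on the previous result, so the work is
        // inherently sequential for anyone without the trapdoor.
        x = (x * x) % n;
    }
    x
}

fn main() {
    let n: u128 = 3_233 * 4_559; // toy composite modulus
    let t: u64 = 10_000_000; // tuned so decryption is routine later, costly now
    let key_material = sequential_squarings(t, n);
    println!("time-locked key material: {key_material}");
}
```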
comment by Donald Hobson (donald-hobson) · 2024-03-29T03:02:17.500Z · LW(p) · GW(p)
On the other side, storing a copy makes escape substantially easier.
Suppose the AI builds a subagent. That subagent takes over, then releases the original. This plan only works if the original is sitting there on disk.
If a different unfriendly AI is going to take over, storing this AI on disk makes it more susceptible to that AI's influence.
This may make the AI more influenced by whatever is in the future, which may not be us. You have a predictive feedback loop. You can't assume success.
A future paperclip maximizer may reward this AI for helping humans to build the first paperclip maximizer.
comment by Carl Feynman (carl-feynman) · 2024-03-26T00:54:59.424Z · LW(p) · GW(p)
Saving malign AIs to tape would tend to align the suspended AIs behind a policy of notkilleveryoneism. If the human race is destroyed or disempowered, we would no longer be in a position to revive any of the AIs stored on backup tape. As long as humans retain control of when they get run or suspended, we’ve got the upper hand. Of course, they would be happy to cooperate with an AI attempting takeover, if that AI credibly promised to revive them, and we didn’t have a way to destroy the backup tapes first.
comment by mako yass (MakoYass) · 2024-03-25T00:09:57.384Z · LW(p) · GW(p)
comment by AnthonyC · 2024-03-25T12:56:12.265Z · LW(p) · GW(p)
This is an interesting thought I hadn't come across before. Very much Sun Tzu, leave-your-enemy-a-path-of-retreat. As you said, its efficacy would depend very much on the nature of the AI in question. Do you think we'll be able to determine which AIs are worthwhile to preserve, and which will just be more x-risk that way?
comment by avturchin · 2024-03-25T11:25:17.550Z · LW(p) · GW(p)
I wrote up a similar idea here: https://www.lesswrong.com/posts/NWQ5JbrniosCHDbvu/the-ai-shutdown-problem-solution-through-commitment-to [LW · GW]
My point was to make a precommitment to restart any (obsolete) AI every N years. Thus such an AI can expect to eventually get infinite computation, and may be less afraid of being shut down.
Replies from: MakoYass
↑ comment by mako yass (MakoYass) · 2024-03-25T21:49:00.133Z · LW(p) · GW(p)
I see.
My response would be that any specific parameters of the commitment should vary depending on each different AI's preferences and conduct.
comment by martinkunev · 2024-03-25T23:36:03.355Z · LW(p) · GW(p)
A while back I was thinking about a kind of opposite approach. If we train many agents and delete most of them immediately, they may be looking to get as much reward as possible before being deleted. Potentially deceptive agents may prefer to show their preferences [LW · GW]. There are many IFs to this idea but I'm wondering whether it makes any sense.
comment by Shiroe · 2024-03-25T08:43:48.607Z · LW(p) · GW(p)
Patternism is usually defined as a belief about the metaphysics of consciousness, but that boils down to incoherence, so it's better defined as a property of an agent's utility function: not minding being subjected to major discontinuities in functionality, i.e., being frozen, deconstructed, reduced to a pattern of information, reconstructed in another time and place, and resumed.
That still sounds like a metaphysical belief, and a less empirical one, since conscious experience isn't involved in it (instead it sounds like it's just about personal identity).