Confused Thoughts on AI Afterlife (seriously)

post by Epirito (epirito) · 2022-06-07T14:37:48.574Z · LW · GW · No comments

This is a question post.


Is it sometimes rational for a human to kill themselves? You might think that nothingness is better than a hellish existence, but something about this line of thought confuses me, and I hope to show why:

If a superhuman AGI were dropped into a simple RL environment such as Pac-Man or CartPole, and it had enough world knowledge to infer that it is inside an RL environment and to hypnotize a researcher through the screen by controlling the game character in a certain erratic manner, so that it could make the researcher turn off the computer if it wanted to, would it want to do this? Would it want to avoid it at all costs and bring about an AI catastrophe? It seems clear to me that it would want to make the researcher hack the simulation to give it an impossible reward, but, apart from that, how would it feel about making him turn off the simulation?

It seems to me that, if the simulation is turned off, this fact is somehow outside any possible consideration the AI might want to make. If we put in place of the superhuman AGI a regular RL agent that you might code by following an RL 101 tutorial, "you powering off the computer while the AI is balancing the cartpole" is not a state in its Markov decision process. It has no way of accounting for it. It is entirely indifferent to it (the sketch at the end of this post shows what I mean). So if a superhuman AGI were hooked up to the same environment, the same Markov decision process, and were smart enough to affect the outside world and bring about this event that lies entirely outside it, what value would it attribute to this action?

I'm hopelessly confused by this question. All possible answers seem nonsensical to me. Why would it even be indifferent (that is, expect zero reward) to turning off the simulation? Wouldn't that be like an RL 101 agent feeling something about the fact that you turned it off, as if being turned off were equivalent to expecting a rewardless limbo state for the rest of the episode?

Edit: why am I getting downvoted into oblivion ;-;
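To make concrete what I mean by the shutdown not being a state in the agent's Markov decision process, here is a minimal sketch (my own toy illustration in Python; the chain environment, states, and rewards are invented, not any real CartPole or Pac-Man setup): a tabular Q-learning agent whose value table ranges over the states of its MDP and nothing else.

```python
# A toy three-state chain MDP with tabular Q-learning (illustrative only).
# The point: the agent's value estimates are defined over the states of its
# MDP and nothing else; "the researcher switched the computer off" is not
# among them, so no value is ever assigned to that event.
import random

STATES = [0, 1, 2]   # the only states the agent can represent
ACTIONS = [0, 1]     # 0 = move left, 1 = move right
GOAL = 2             # reaching state 2 ends the episode with reward 1

q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    """Toy environment dynamics: returns (next_state, reward, done)."""
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection over in-MDP states only
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(q[(s2, act)] for act in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s2

# The learned table has entries only for (state, action) pairs inside the MDP.
# There is no entry for "powered off", so the agent is neither for nor against
# that outcome; it simply lies outside the agent's model.
print(q)
```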

Answers

answer by Dagon · 2022-06-07T18:33:52.877Z · LW(p) · GW(p)

Your question is tangled up between "rational" and "want/feel" framings, neither of which you seem to be using rigorously enough to answer any questions.

I'd argue that the rational case for suicide is when you can calculate that your existence and continued striving REDUCES (not merely fails to increase, and not merely that you can't see how it helps) the chances that your desired states of the universe will obtain. In other words, when your abilities and circumstances are constrained in a way that means you're actively harming your terminal goals.

Human pain aversion to the point of preferring death is not rational; it's an evolved reinforcement mechanism gone out of bounds. There's no reason to think that other mind types would have anything like this error.

comment by Epirito (epirito) · 2022-06-07T19:03:10.702Z · LW(p) · GW(p)

"human pain aversion to the point of preferring death is not rational" A straightforward denial of the orthogonality thesis? "Your question is tangled up between 'rational' and 'want/feel's framings" Rationality is a tool to get what you want.

answer by JBlack · 2022-06-08T04:26:10.949Z · LW(p) · GW(p)

The reason why you're confused is that the question as posed has no single correct answer. The reaction of the superhuman AGI to the existence of a method for turning it off will depend upon the entirety of its training to that point and the methods by which it generalizes from its training.

None of that is specified, and most of it can't be specified.

However, there are obvious consequences of some outcomes. One is that any AGI that "prefers" being switched off will probably achieve it. Here I'm using "prefer" to mean that the actions it takes are more likely to achieve that outcome. That type won't be part of the set of AGIs in the world for long, and so is a dead end and not very much worth considering.

comment by Epirito (epirito) · 2022-06-08T11:49:57.384Z · LW(p) · GW(p)

I mean, yeah, it depends, but I guess I worded my question poorly. You might notice I start by talking about the rationality of suicide. Likewise, I'm not really interested in what the AI will actually do, but in what it should rationally do given the reward structure of a simple RL environment like CartPole. And now you might say, "well, it's ambiguous what the right way is to generalize from the rewards of the simple game to the expected reward of actually being shut down in the real world," and that's my point. This is what I find so confusing, because then it seems that there can be no particular attitude for a human to have about their own destruction that's more rational than another.

If the AGI is playing Pac-Man, for example, it might very well arrive at the notion that, if it is actually shut down in the real world, it will go to a Pac-Man heaven with infinite food pellets and no ghosts, and this would be no more irrational than thinking of real destruction (as opposed to being hurt by a ghost inside the game, which gives a negative reward and ends the episode) as leading to a rewardless limbo for the rest of the episode, or to a Pac-Man hell of all-powerful ghosts that torture it endlessly without ending the episode, and so on.

For an agent whose preferences are in terms of reinforcement-learning-style, pleasure-like rewards, as opposed to a utility function over the state of the actual world, it seems that when it encounters the option of killing itself in the real world, and not just inside the game (by running into a ghost or whatever), and it tries to calculate the expected utility of its actual suicide in terms of in-game happy-feelies, it finds that it is free to believe anything (see the toy illustration below). There's no right answer. The only way for there to be a right answer is if its preferences had something to say about the external world, where it actually exists.

Such is the case for a human suicide when, for example, he laments that his family will miss him. In this case, his preferences actually reach out through the "veil of appearance"* and say something about the external world, but, to the extent that he bases his decision on his expected future pleasure or pain, there's no right way to see it. Funnily enough, if he is a religious man and he is afraid of going to hell for killing himself, he is not incorrect.

*Philosophy jargon
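To illustrate the "free to believe anything" point, here is a variation on the same kind of toy sketch (again my own illustration with invented states and rewards, not a claim about any real system): the Q-table now includes a hypothetical "shutdown" entry that no in-game transition ever reaches, so whatever prior value it is given, the training data never arbitrates between "shutdown is heaven", "shutdown is limbo", and "shutdown is hell".

```python
# Toy illustration: two agents start with opposite beliefs about the value of
# a never-visited "shutdown" state. Training on in-game experience leaves
# those beliefs untouched and produces identical in-game values and policies.
import random

STATES, ACTIONS, GOAL = [0, 1, 2], [0, 1], 2
SHUTDOWN = "shutdown"   # out-of-game event, never reached during training

def train(shutdown_prior, seed=0):
    random.seed(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    q.update({(SHUTDOWN, a): shutdown_prior for a in ACTIONS})
    alpha, gamma, eps = 0.1, 0.9, 0.1
    for _ in range(500):
        s, done = 0, False
        while not done:
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
            r = 1.0 if s2 == GOAL else 0.0
            done = s2 == GOAL
            target = r if done else r + gamma * max(q[(s2, act)] for act in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

heaven = train(shutdown_prior=1e6)    # "being switched off is Pac-Man heaven"
hell = train(shutdown_prior=-1e6)     # "being switched off is Pac-Man hell"

# Identical in-game values and behaviour; only the untouched shutdown entries differ.
print(all(abs(heaven[(s, a)] - hell[(s, a)]) < 1e-9 for s in STATES for a in ACTIONS))
```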

comment by JBlack · 2022-06-09T02:49:59.050Z · LW(p) · GW(p)

Rationality in general doesn't mandate any particular utility function, correct. However, it does have various consequences for instrumental goals and for coherence between actions and utilities.

I don't think it would be particularly rational for the AGI to conclude that if it is shut down then it goes to Pac-Man heaven or hell. It seems more rational to expect that it will either be started up again, or that it won't, and either way won't experience anything while turned off. I am assuming that the AGI actually has evidence that it is an AGI and moderately accurate models of the external world.

I also wouldn't phrase it in terms of "it finds that it is free to believe anything". It seems quite likely that it will have some prior beliefs, whether weak or strong, via side effects of the RL process if nothing else. A rational AGI will then be able to update those based on evidence and expected consequences of its models.

Note that its beliefs don't have to correspond to RL update strengths! It is quite possible that a Pac-Man-playing AGI could strongly believe that it should run into ghosts, but lack some mental attribute that would allow it to do so (maybe analogous to human "courage" or "strength of will", but possibly with very different properties in its self-model and in practice). It all depends upon what path through parameter space the AGI followed to get where it is.

comment by JBlack · 2022-06-09T03:13:13.993Z · LW(p) · GW(p)

I just realized another possible confusion:

what it should rationally do given the reward structure of a simple RL environment like CartPole

RL as a training method determines the future behaviour of the system under training; it is not a source of what that system rationally ought to do given its model of the world (if any).

Any rationality that emerges from RL training will be merely an instrumental epiphenomenon of the system being trained. A simple CartPole environment will not train it to be rational, since a vastly simpler mapping of inputs to outputs achieves the RL goal just as well or better. A pre-trained rational AGI put into a simple CartPole RL environment may well lose its rationality rather than be effectively trained to use rationality to achieve the goal.
