Would I think for ten thousand years?

post by Stuart_Armstrong · 2019-02-11T19:37:53.591Z · score: 25 (9 votes) · LW · GW · 12 comments

Some AI safety ideas delegate key decisions to our idealised selves. This is sometimes phrased as "allowing versions of yourself to think for ten thousand years", or similar sentiments.

Occasionally, when I've objected to these ideas, it's been pointed out that any attempt to construct a safe AI design would involve a lot of thinking, so therefore there can't be anything wrong with delegating this thinking to an algorithm or an algorithmic version of myself.

But there is a tension between "more thinking" in the sense of "solve specific problems" and in the sense of "change your own values [LW · GW]".

An unrestricted "do whatever a copy of Stuart Armstrong would have done after he thought about morality for ten thousand years" seems to positively beg for value drift (worsened by the difficulty in defining what we mean by "a copy of Stuart Armstrong [...] thought [...] for ten thousand years").

A more narrow "have ten copies of Stuart think about these ten theorems for a subjective week each and give me a proof or counter-example" seems much safer.

In between those two extremes, how do we assess the degree of value drift and its potential importance to the question being asked? Ideally, we'd have a theory of human values [LW · GW] to help distinguish the cases. Even without that, we can use some common sense on issues like length of thought, nature of problem, bandwidth of output, and so on.

12 comments

comment by Wei_Dai · 2019-02-12T01:04:25.864Z · score: 13 (6 votes) · LW · GW

Why do you think this problem needs to be solved now? Couldn't the idealized version of yourself spend the first few years figuring out how best to protect against value drift during the rest of the available time? It seems to me that a more urgent problem is, given that a person thinking alone for even a few years would likely go crazy, how do we set up the initial social dynamics for a group of virtual humans?

comment by Stuart_Armstrong · 2019-02-12T10:54:38.624Z · score: 10 (3 votes) · LW · GW

Because I've already found problems with these systems over the past few years, problems that other people did not expect. If one of them had been put into such a setup back then, I expect it would have failed. Sure, if the current me were put in the system, maybe I could find a few more problems and patch them, because I expect to find them.

But I wouldn't trust many others, and I barely trust myself. Because the difference is large between what the setup will be in practice, and what current research is in practice. The more we can solve these issues ahead of time, the more we can delegate.

comment by Wei_Dai · 2019-02-13T06:21:44.588Z · score: 3 (1 votes) · LW · GW

Because I’ve already found problems with these systems in the past few years, problems that other people did not expect there to be.

I don't know which problems/systems you're referring to. Maybe you could cite these in the post to give more motivation?

Because the difference is large between what the setup will be in practice, and what current research is in practice.

What are the most important differences that you foresee?

comment by Stuart_Armstrong · 2019-02-13T14:07:36.126Z · score: 7 (2 votes) · LW · GW

I don't know which problems/systems you're referring to. Maybe you could cite these in the post to give more motivation?

The main one is when I realised the problems with CEV: https://www.lesswrong.com/posts/vgFvnr7FefZ3s3tHp/mahatma-armstrong-ceved-to-death

The others are mainly oral, with people coming up with plans that involve simulating humans for long periods of time, me doing the equivalent of saying "have you considered value drift" and (often) the reaction from the other revealing that no, they had not considered value drift.

Because the difference is large between what the setup will be in practice, and what current research is in practice.

What are the most important differences that you foresee?

The most important differences I foresee are the unforeseen :-) I mean that seriously, because anything that is easy to foresee will possibly be patched before implementation.

But if we look at how research happens nowadays, it has a variety of different approaches and institutional cultures, certain levels of feedback both from within the AI safety community and the surrounding world, grounding our morality and keeping us connected to the flow of culture (such as it is).

Most of the simulation ideas do away with that. If someone suggested that the best idea for AI safety would be to lock up AI safety researchers in an isolated internet-free house for ten years and see what they came up with, we'd be all over the flaws in this plan (and not just the opportunity costs). But replace that physical, grounded idea with a similar one that involves "simulation", and suddenly people flip into far mode and are more willing to accept it. In practice, a simulation is likely to be far more alien and alienating than just locking people up in a house. We have certain levels of control in a simulation that we wouldn't have in reality, but even that could hurt - I'm not sure how I would react if I knew my mind, emotions, and state of tiredness were open to manipulation.

So what I'm mainly trying to say is that using simulations (or predictions about simulations) to do safety work is a difficult and subtle project, and needs to be thoroughly planned out with, at minimum, a lot of psychologists and some anthropologists. I think it can be done, but not glibly and not easily.

comment by Wei_Dai · 2019-02-14T20:48:24.360Z · score: 4 (2 votes) · LW · GW

The others are mainly oral, with people coming up with plans that involve simulating humans for long periods of time, me doing the equivalent of saying “have you considered value drift” and (often) the reaction from the other revealing that no, they had not considered value drift.

Ah, value drift has been on my mind for so long that it's surprising to me that people could be thinking about simulating humans for long periods of time without thinking about value drift. Thanks for the update!

The most important differences I foresee are the unforeseen :-) I mean that seriously, because anything that is easy to foresee will possibly be patched before implementation.

I guess my perspective here is that pretty soon we'll be forced to live in a real environment that will be quite alien / drift-inducing [LW · GW] already, so maybe it wouldn't be so hard to construct a virtual environment that would be better in comparison, so the risk-minimizing thing to do would be to put yourself in such an environment as soon as possible and then work on further risk reduction from there. (See this recent news as another sign pointing to that coming soon.)

Most of the simulation ideas do away with that.

Yeah I agree that getting the social aspect right is probably the hardest part, and we might need more than a small group of virtual humans to do that.

So what I’m mainly trying to say is that using simulations (or predictions about simulations) to do safety work is a difficult and subtle project, and needs to be thoroughly planned out with, at minimum, a lot of psychologists and some anthropologists. I think it can be done, but not glibly and not easily.

I think this framing makes sense.

comment by Stuart_Armstrong · 2019-02-12T11:02:42.836Z · score: 2 (1 votes) · LW · GW

Also, on a more minor note: if I try to preserve myself from value drift using only the resources I'd have in the simulation, I expect to fail. Social dynamics might work, though, so we do need to think about those.

comment by Raemon · 2019-02-12T02:08:05.951Z · score: 2 (1 votes) · LW · GW

I agree with both individual points but... for the second point, can't you pass the recursive buck almost as easily there?

At least "what should I have thought about already for outsourcing questions to emulations?" seems like a pretty good first question to ask.

comment by Wei_Dai · 2019-02-12T08:45:53.975Z · score: 3 (1 votes) · LW · GW

for the second point, can’t you pass the recursive buck almost as easily there?

How so? If you set up a group of virtual humans to think about some problem, you have to decide, at least initially, who to bring into the group, how they can interact with each other, how the final output gets determined (if they don't all agree on one answer), and under what circumstances the rules can be changed. If you do it wrong, you could get bad social dynamics before the group can figure out how to fix or improve the setup.

comment by avturchin · 2019-02-11T22:17:32.418Z · score: 8 (7 votes) · LW · GW

What worries me more is not value drift, but the hardening of values in the wrong position.

We can see examples of people whose values formed during their youth; these values didn't evolve as they aged, but instead became a rigid, self-supporting system disconnected from reality. These old-schoolers have no new wisdom to offer.

Obviously, brain aging plays a role here, but it is not only cellular aging; there is also "informational aging", that is, a hardening of Pavlovian reflexes between thoughts. Personally, I've found that I have the same thought in my mind every time I start eating, which is annoying (it is not related to food: it is basically a number).

comment by Raemon · 2019-02-12T00:34:49.892Z · score: 3 (2 votes) · LW · GW

In most cases my thought is "well, what's the alternative?"

I'm either doing what I would have done after thinking for N years, or I'm committing to a course of action after thinking less than N years. The former risks value drift, the latter risks... well, not having had as much time to think, which isn't obviously better than value drift.

I do think there's a few variations that seem like improvements, like:

  • run X copies of myself, slightly randomizing their starting conditions and running them for a range of times (maybe as wide as "1 week" to "10,000 years"). Before revealing their results to me, reveal how convergent they were. If there's high convergence, I'm probably less worried about the answer.
  • make sure simulated me can only think about certain classes of things ("solve this problem with these constraints"). I'm more worried about value drift from "10,000 year me who lived life generally" than "10,000 year me who just thought about this one problem. Unless the problem was meta-ethics, in which case I probably want some kind of value drift."
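The first variation above can be sketched as a toy procedure. Everything here is hypothetical: `deliberate` is a stand-in for "an emulated copy thinking about the problem", and the 0.8 agreement threshold is an arbitrary illustration of "high convergence", not anything proposed in the thread.

```python
import random
from collections import Counter

def deliberate(seed: int, duration_weeks: int) -> str:
    """Stand-in for a copy of me thinking for some subjective time.
    Purely illustrative: the answer depends weakly on the starting seed."""
    rng = random.Random(seed)
    return "proof" if rng.random() < 0.9 else "counter-example"

def run_copies(n: int, durations: list) -> tuple:
    """Run n copies with slightly randomized starting conditions, and
    measure how convergent their answers are before revealing any answer."""
    answers = [deliberate(seed=i, duration_weeks=durations[i % len(durations)])
               for i in range(n)]
    # Find the most common answer and what fraction of copies gave it.
    (modal_answer, count), = Counter(answers).most_common(1)
    convergence = count / n
    # Only reveal the modal answer if the copies largely agree.
    return convergence, (modal_answer if convergence >= 0.8 else None)

convergence, answer = run_copies(10, durations=[1, 52, 520_000])  # weeks
print(f"convergence={convergence:.0%}, revealed answer={answer}")
```

The point of the design is the information-flow constraint: the convergence score is computed (and could be inspected) before any individual copy's answer is released, so a low score can abort the reveal entirely.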

comment by ofer · 2019-02-12T07:45:15.931Z · score: 1 (1 votes) · LW · GW

In most cases my thought is "well, what's the alternative?"

Perhaps we humans should do the thinking ourselves for 10,000 years (passing the task from one generation to the next until aging is solved), instead of deferring to some "idealized" digital versions of ourselves.

This would require preventing existential catastrophes, during those 10,000 years, via "conventional means" (e.g. stabilizing the world to some extent).

comment by avturchin · 2019-02-12T10:30:59.468Z · score: 1 (3 votes) · LW · GW

Also, thinking about one's own meaning of life is an important part of human activity, and if we delegate this to an AI, we will destroy a significant part of human values. In other words, if AI makes philosophers unemployed, they will not be happy. Or else we will not accept the AI's answer, and will continue to search for the ultimate goal.

One more thought: the more one argues for the creation of her own copies for some task, the more one should suspect that she is already in such a task-solving simulation. Welcome to our value-solving matrix!