Time in Machine Metaethics

post by Razmęk Massaräinen · 2018-03-31T15:02:55.295Z · LW · GW · 1 comments

Contents

  Main points 
  John's timeless self-delusion 
  Temporal consistency in AI 
  Temporal consistency in humans 
  Convergence through wireheading 
  Terms and assumptions I used 
  The main idea 
1 comment

Main points

I'm not discussing worst-case AGI scenarios here. I tried to imagine the best-case scenario under certain conditions, but failed.

In the 'Terms and assumptions' section below I tried (though probably also failed) to clarify the things I assumed.

John's timeless self-delusion

I no longer remember exactly why I once wrote this story, but I'll use it to illustrate this post.

Suppose John wants to murder Bill, and has to escape arrest by the thought police, who will come and check on him tomorrow morning. Luckily or not, John is also a meditation expert who can modify his own perceptions and wishes. John could have dropped his intent toward Bill if he wanted to. However, to fulfill his desire at the time and make sure Bill gets killed, he instead chooses to meditate himself into a state of insanity, in which he believes that Bill is a superhuman who loves playing the surprise catch-the-brick-with-your-head game in the mornings, and was his best friend all along. The next morning, after passing the thought police's routine check, John drops a brick on poor Bill's head from a roof, and finds himself confused and unhappy ever after.

Originally, the point of this story wasn't that John is a poorly designed runaway AI that deludes itself and then does bad things. John was intended to be human, and a subject of experience with his own qualia. (Although one arguably doesn't exclude the other.) The point, probably, was to demonstrate how difficult it is to describe preference-based decision-making when preferences change over time.

I found myself lacking a vocabulary for distinguishing between future and past selves, for describing various degrees of cooperation between them, and for talking about subjective utility. One could say that John the Killer got what he wanted, while John the Brick Thrower is going to lead a miserable life; or one could refer to them as the same John and ask whether he is, overall, happy with his decisions - but that question doesn't quite make sense. (We could probably quantify his actual subjective experience over some time span instead, if we tried.) From the outside, however, we could model the 'overall John' as an evolving, intelligent behavior-executing system. Then we'd talk about outcomes, since it's the outcomes we observe and care about. If the thought police knew that John had a way to tamper with his intentions or his perception of the world, and that he wouldn't cooperate with his future self to achieve his present-time goals, they arguably could have done better at predicting the outcome (maybe by setting him up for a marshmallow test sometime earlier).

Temporal consistency in AI

There seems to be a consistent opinion that an AGI should preserve its utility function over the course of time. However, it also has to change itself, process new information, and update in order to self-improve. Arguably, maintaining exactly the same utility function may be impossible under such conditions. The problem is that the AGI probably won't know everything about the future, either.

Because we'll be observing the AGI from the outside anyway, we could think of it as an intelligent behavior-executor and try to make descriptive norms about what we want it to do. As a behavior-executor, the AGI should act as if it has preferences about its own future behavior, about predicting its own behavior correctly, etc. - basically, it should model its future self. Its ability to do so with absolute certainty may be limited by raw computational power and by lack of information about future outcomes - at time T, you may be limited to guessing about what you'll do and know at time T+N, and until that time you have no way to know for sure. (That may be twice as true for a seed AI and significant values of N.) I'd further speculate that a seed AI may take extra time before updating, to compensate for its limited computational power, in order to mathematically prove that the update still serves its present goals.
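
One way to write such a condition down (an illustrative formalization of my own, not taken from anywhere else: U_t is the agent's utility function at time t, A_t the current agent, and A_{t+1} the proposed successor) is that the successor is expected, by the current utility function, to do at least as well as leaving things unchanged:

    \mathbb{E}\big[\, U_t(\text{outcome}) \mid \text{run } A_{t+1} \,\big] \;\ge\; \mathbb{E}\big[\, U_t(\text{outcome}) \mid \text{keep running } A_t \,\big]

Proving something like this before self-modifying is one reading of 'cooperating with your future self': it doesn't require knowing exactly what A_{t+1} will do, only bounding, by present lights, how well it can be expected to do.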

Or maybe modeling its future self with sufficient accuracy, while controlling for accumulating errors, will do the job. The number of computations needed to make a decision, considering all the time points in the future, when the utility function can change at each next time point (and probably not in a predictable way - the future is uncertain), may grow exponentially (I hope my future self gets a chance to speculate on this more extensively). Also, the value of an action can come out very different depending on how far into the future one looks, and a perfectly rational agent would have to calculate the outcomes of self-modification over an infinite number of points in time. A toy sketch of both worries follows below.
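
To make this concrete, here is a toy sketch (illustrative numbers only, and a deliberately brute-force planner): if the utility function can shift to any of `branches_per_step` variants at each of `horizon` future time points, enumerating all possibilities means an exponential number of branches, and the same action can look good or bad depending on how far ahead its consequences are summed.

    # Toy illustration: how many utility-function trajectories a brute-force
    # planner would face if its utility can shift unpredictably at each step,
    # and how an action's value depends on the planning horizon.
    # All numbers are made up for illustration.

    def num_trajectories(branches_per_step: int, horizon: int) -> int:
        """Each future time point multiplies the possibilities by `branches_per_step`."""
        return branches_per_step ** horizon

    for horizon in (1, 5, 10, 20, 30):
        print(horizon, num_trajectories(3, horizon))
    # 1 3
    # 5 243
    # 10 59049
    # 20 3486784401
    # 30 205891132094649

    def action_value(rewards, horizon):
        """Total value of an action whose payoff stream is truncated at `horizon`."""
        return sum(rewards[:horizon])

    rewards = [10, -3, -3, -3, -3, -3]   # pays off now, costs later
    print(action_value(rewards, 1))      # 10 -> looks great with a short horizon
    print(action_value(rewards, 6))      # -5 -> looks bad with a longer horizon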

I'm also tempted to ask whether the AGI's utility may need to be flexible to a degree for other reasons - just in case it has to be improved at some point - but I see no reason a superintelligent AI couldn't make everyone in the future happy, even if it was designed with a rather fuzzy understanding of what happiness is. Our present selves might wish that understanding were better, but that's about as irrelevant as the fact that John the Killer never gets to rejoice in completing his goal.

Temporal consistency in humans

On the other hand, human behavior, as well as human utility, is ever-changing and state-dependent, and not in ways that are well understood at present. Humans have preferences about their future selves, and cooperate with their future selves to a limited extent, with hyperbolic discounting involved. Preferences about the future would mean different things for humans and for AI. While an artificial intelligence, like ourselves, may be prone to ontological noise, in principle it could be guaranteed to adhere to a decision theory and to be as rational as possible, even with limited computational power. With humans, that's harder to guarantee, and we often have a hard time figuring out what we want. Each individual holds multiple desires simultaneously, and they do not really form a hierarchical structure of terminal goals and sub-goals. How preferences are represented in human hardware, and how they translate into actual behavior, is also not entirely clear for now. On the most basic level, however, they seem to be encoded within a temporal-difference reinforcement model as reward prediction error signals, though there also seems to be a degree of randomness involved, as in the random utility model. Thus, arguably, subjective experience cannot be divorced from the continuity of its past experience and from subjective uncertainty about future states. A human predicting their future state of mind with 100% certainty is a paradox; you have to actually perform the computations to know the outcome.
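
As a rough sketch of the temporal-difference picture mentioned above (textbook TD(0) with a noise term thrown in to gesture at random-utility-style variability - an illustration, not a claim about how the brain actually implements it):

    import random

    # Minimal temporal-difference (TD(0)) value update. The "reward prediction
    # error" is the gap between the predicted value of a state and what actually
    # happened (reward plus discounted prediction for the next state).
    # The Gaussian noise is only a gesture at random-utility-style variability.

    def td_update(value, state, next_state, reward, alpha=0.1, gamma=0.95, noise=0.05):
        """Perform one TD(0) update in place; return the (noisy) reward prediction error."""
        predicted = value[state]
        target = reward + gamma * value[next_state]
        rpe = target - predicted + random.gauss(0.0, noise)  # reward prediction error
        value[state] = predicted + alpha * rpe
        return rpe

    value = {"cue": 0.0, "outcome": 0.0}
    for _ in range(200):
        td_update(value, "cue", "outcome", reward=1.0)
    print(value["cue"])  # converges toward the expected reward (about 1.0), up to noise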

To speak of an AI 'aligned with human values' - that is, presumably, something that takes humans' utility into consideration when it acts - we're going to need a better understanding of what exactly makes subjective experience positive, and that seems like a rather urgent question. (Having a superintelligent AI discover that for us is rather scary.)

Then again, utility may differ from one person to another, and may well be both state-dependent and time-dependent. Without having human utility figured out, it's hard to know for sure whether it is possible for a human to have an endless cycle of positive subjective experiences - or whether we are stuck in a zero-sum game and it isn't, not really. While there is a lack of consensus about the nature of consciousness and the Hard Problem, the notion of a flow of time could be inseparable from qualia and continuous perception. After all, the cognitive processes of each individual are continuous from birth to death.

Convergence through wireheading

Given the above, I will refrain from speculating in detail about what a superintelligent AI would do if tasked with something along the lines of making everyone equally happy (especially given our currently limited understanding of happiness), but I'll share a few more runaway thoughts nevertheless:

The prospect of conflicts between individuals' preferences about objective reality being resolved through wireheading actually looms even larger: the scenario in which reality gets transformed to the tastes not of the entire humanity, but of one individual or a subgroup (presumably those who build AGI first), seems far more likely to begin with.

Terms and assumptions I used

I assume there is a multitude of humans who are subjects of experience (qualia), and that they share a single objective reality.

By human utility, I mean the things that trigger the reward system and are subjectively pleasant, under the assumption that these are the same thing viewed from different angles.

By wireheading, I mean not only wireheading in the classic sense of administering external stimulation directly to the brain's reward system, but also Yampolskiy's 'delusion boxes', or running a simulation tailored to an individual's optimal positive experience - anything that increases their utility over time while decoupling them from the actual state of reality. Such a simulation may abide by Yudkowsky's Laws of Fun - or their opposite, if something went south after all.

I use a naive notion of time - that it flows in one direction. Reality can be seen as deterministic, but there is no way to predict its future state with absolute certainty, and no way to build a perfect model of it from within it.

I'm also making the stretchy assumption that maximizing utility can be translated into executing complex behavior, and vice versa - that executing behavior translates into maximizing some complex utility. The only difference is which viewpoint makes the description simpler.
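
The standard construction behind the behavior-to-utility direction (my summary, not something argued in the post) is that any fixed policy can be trivially recast as maximizing some utility function, namely one that rewards exactly the behavior the policy would have produced:

    \text{For any policy } \pi \text{ mapping histories to actions, define } \quad
    U_\pi(h) = \begin{cases} 1 & \text{if every action in the history } h \text{ is the one } \pi \text{ prescribes} \\ 0 & \text{otherwise} \end{cases}

An agent that maximizes U_\pi behaves exactly like \pi. The resulting utility function is about as complex as the behavior itself, which is why the only real question is which viewpoint gives the simpler description.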

The main idea

To get our priorities right, let's figure out the really hard problem of consciousness first, and find answers to it, while AGI is not yet built. (My hunch is that time has to be involved somehow.)

1 comment


comment by Gordon Seidoh Worley (gworley) · 2018-04-08T22:15:46.907Z · LW(p) · GW(p)

A couple of comments.

First, I agree with you that trying to ignore time or designing agents that are time-independent is probably not that sensible, since time is a fundamental aspect of the world created by our experience of it. I think the current reason for preferring this, though, is that it offers a way to solve some tricky decision theory problems that need solving if we were to build a maximizing agent and want it to be aligned.

Second, on the issue of the project of determining what makes us happy or how humans measure value, I recommend the work of the Qualia Research Institute. I'm not sure they have the answer yet, but they are working on the issue.

Finally, as to the hard problem of consciousness, my own suspicion is that this problem doesn't really exist and instead masks a different problem we might call the hard problem of existence, i.e. why does anything exist? I actually think we might have an answer for that, too, but it's not very satisfying. I've not taken the time to write up my thoughts on this issue specifically but you can find the foundations of where my thinking will go on this topic in my introduction to noematology.