Figuring out what Alice wants, part I

stuart_armstrong

post by Stuart_Armstrong · 2018-07-17T13:59:35.395Z · LW · GW · 8 comments

  The theory: model fragments
  What model fragments look like
8 comments

This is a very preliminary two-part post sketching out the direction I'm taking my research now (second post here [LW · GW]). I'm expecting and hoping that everything in here will get superseded quite quickly. This has obvious connections to classical machine intelligence research areas (such as interpretability). I'd be very grateful for any links to papers or people related to the ideas of this post.

The theory: model fragments

I've presented the theoretical argument for why we cannot deduce the preferences of an irrational agent, and a practical example [LW · GW] of that difficulty. I'll be building on that example to illustrate some algorithms that produce the same actions, but where we nonetheless can feel confident deducing different preferences.

I've mentioned a few ideas for "normative assumptions": the assumptions that we, or an AI, could use to distinguish between different possible preferences even if they result in the same behaviour. I've mentioned things such as regret [LW · GW], humans stating their values with more or less truthfulness, human narratives, how we categorise our own emotions (those last three are in this post [LW · GW]), or the structure [LW · GW] of the human algorithm.

Those all seem rather ad hoc, but they are all trying to do the same thing: home in on human judgement about rationality and preferences. But what is this judgement? This judgement is defined to be the internal models [LW · GW] that humans use to assess situations. These models, about ourselves and about other humans, often agree with each other [LW · GW] from one human to the next (for instance, most people agree that you're less rational when you're drunk).

Calling them models might be a bit of an exaggeration, though. We often only get a fragmentary or momentary piece of a model - "he's being silly", "she's angry", "you won't get a promotion with that attitude". These are called to mind, thought upon, and then swiftly dismissed.

So what we want to access is the piece of the model that the human used to judge the situation. Now, these model fragments can often be contradictory, but we can deal with that problem later [LW · GW].

Then all the normative assumptions noted above are just ways of defining these model fragments, or accessing them (via emotion, truthful description, or regret). Regret is a particularly useful emotion, as it indicates a divergence between what was expected in the model, and what actually happened (similarly to temporal difference learning).
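To make the analogy with temporal difference learning concrete, here is a minimal sketch of a one-step TD error: the gap between what the model expected and what actually happened. This only illustrates the analogy; it is not a claim about how regret is computed in humans, and the function name, the discount factor and the numbers are all assumptions chosen for the example.

```python
def td_error(reward: float, value_next: float, value_current: float,
             gamma: float = 0.9) -> float:
    """One-step temporal-difference error.

    Positive: things went better than the model expected.
    Negative: things went worse (the regret-like case).
    """
    return reward + gamma * value_next - value_current


# Hypothetical numbers: the model valued the current state at 1.0,
# but the agent got no reward and ended up in a state it values at 0.0.
disappointment = td_error(reward=0.0, value_next=0.0, value_current=1.0)

# Hypothetical numbers: the model expected nothing, but a reward arrived.
pleasant_surprise = td_error(reward=1.0, value_next=0.0, value_current=0.0)

print(disappointment, pleasant_surprise)
```

A large negative error plays the role of regret here: it flags a place where the internal model's expectation and the outcome came apart, which is exactly the kind of evidence about the model that we are after.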

So I'll broadly categorise methods of learning human model fragments into three categories:

Direct access to the internal model.
Regret and surprise as showing mismatches between model expectation and outcomes.
Privileged output (e.g. certain human statements in certain circumstances are taken to be true-ish statements about the internal model).

The first method violates algorithmic equivalence and extensionality: two algorithms with identical outputs can nevertheless use different models. The other two methods do respect algorithmic equivalence, once we have defined what behaviours correspond to regret/surprise, or what situations humans can be expected to respond truthfully to. In the process of defining those behaviours and situations, however, we are likely to use introspection and our own models: a sober, relaxed, rational human confiding privately in an impersonal computer is more likely to be truthful than a precariously employed worker on stage in front of their whole office.
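The point about extensionality can be illustrated with a toy sketch: two agents that always output the same action, while their internal reward models are opposite. Everything here (the `Agent` class, the option names, the reward numbers) is hypothetical and chosen only for illustration; it is not the example from the earlier posts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    reward: Dict[str, float]  # internal model fragment: what the agent values
    planner: Callable         # how the agent turns values into a choice

    def act(self, options: List[str]) -> str:
        # The planner picks an option according to the internal reward model.
        return self.planner(options, key=lambda option: self.reward[option])


OPTIONS = ["chocolate", "salad"]

# A rational agent that prefers chocolate and maximises its reward.
rational_chocolate_lover = Agent({"chocolate": 1.0, "salad": 0.0}, max)

# An anti-rational agent that prefers salad but minimises its reward.
antirational_salad_lover = Agent({"chocolate": 0.0, "salad": 1.0}, min)

# Identical behaviour...
same_action = (rational_chocolate_lover.act(OPTIONS)
               == antirational_salad_lover.act(OPTIONS))

# ...but different internal models, visible only with direct access.
same_model = rational_chocolate_lover.reward == antirational_salad_lover.reward

print(same_action, same_model)
```

Any method that only looks at `act` treats the two agents as the same algorithm; only a method that reads `reward` directly can tell them apart, which is why direct access breaks algorithmic equivalence.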

What model fragments look like

The second post will provide examples of the approach, but here I'll just list the kinds of things that we can expect as model fragments:

Direct statements about rewards ("I want chocolate now").
Direct statements about rationality ("I'm irrational around them").
An action is deemed better than another ("you should start a paper trail, rather than just rely on oral instructions").
An action is seen as good (or bad), compared with some implicit set of standard actions ("compliment your lover often").
Similarly to actions, observations/outcomes can be treated as above ("the second prize is actually better", "it was unlucky you broke your foot").
An outcome is seen as surprising ("that was the greatest stock market crash in history"), or the action of another agent is seen as that ("I didn't expect them to move to France").

A human can think these things about themselves or about other agents; the most complicated variants are assessing the actions of one agent from the perspective of another agent ("if she signed the check, he'd be in a good position").

Finally, there are meta, meta-meta, and higher-order versions of these, as we model other agents modelling us. All of these give a partial indication of our models of rationality and reward, about ourselves and about other humans.
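As a rough sketch of how the fragment types listed above might be recorded, here is one possible data structure. The class names, field names and the choice of fields are all assumptions made for illustration, not a proposed formalism; the two example utterances are taken from the list above, and "Alice" as the holder is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FragmentKind(Enum):
    REWARD_STATEMENT = "direct statement about rewards"
    RATIONALITY_STATEMENT = "direct statement about rationality"
    ACTION_COMPARISON = "one action deemed better than another"
    ACTION_VS_STANDARD = "action judged against implicit standard actions"
    OUTCOME_JUDGEMENT = "observation or outcome judged good or bad"
    SURPRISE = "outcome or another agent's action seen as surprising"


@dataclass(frozen=True)
class ModelFragment:
    kind: FragmentKind
    utterance: str                     # the fragment as it might be voiced
    holder: str                        # who entertains the fragment
    subject: str                       # whose action, reward or rationality it concerns
    perspective: Optional[str] = None  # agent from whose position it is assessed
    meta_level: int = 0                # 0 = plain, 1 = meta, 2 = meta-meta, ...


fragments = [
    ModelFragment(FragmentKind.REWARD_STATEMENT, "I want chocolate now",
                  holder="Alice", subject="Alice"),
    ModelFragment(FragmentKind.ACTION_VS_STANDARD,
                  "if she signed the check, he'd be in a good position",
                  holder="Alice", subject="she", perspective="he"),
]

# Fragments in which the holder is modelling someone other than themselves.
about_others = [f for f in fragments if f.subject != f.holder]
print(len(about_others))
```

Separating `holder`, `subject` and `perspective` is one way to capture the more complicated variants, where one agent's action is assessed from another agent's position; `meta_level` is a crude placeholder for the meta and meta-meta versions.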

8 comments

Comments sorted by top scores.

comment by habryka (habryka4) · 2018-07-02T18:30:55.473Z · LW(p) · GW(p)

Moved back to drafts, given that I am 70% confident that this is still a draft (or maybe it's some kind of game where I am supposed to figure out what Alice wants based on the sentence fragments in this post, feel free to move it back in that case).

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-07-03T09:00:23.316Z · LW(p) · GW(p)

Ooops! Sorry, this is indeed a draft.

comment by Gordon Seidoh Worley (gworley) · 2018-08-01T22:33:41.327Z · LW(p) · GW(p)

Herein I'm thinking about this and the sequel post and trying to understand why you might be interested in this since it doesn't feel to me like you spell it out.

It seems we might care about model fragments if we think we can't build complete models of other agents/things but can instead build partial models. The "we" building these models might be literally us, but also an AI or a composite agent like humanity. Having a theory of what to do with these model fragments is useful if we want to address at least two questions, then, that we might be worried about around these parts: how do we decide an AI is safe based on our fragmentary models of it, and how does an AI model humanity based on its fragmentary models of humans.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-08-04T21:27:52.803Z · LW(p) · GW(p)

I'm looking at how humans model each other based on their fragmentary models, and using this to get to their values.

Replies from: gworley

↑ comment by Gordon Seidoh Worley (gworley) · 2018-08-06T20:01:54.029Z · LW(p) · GW(p)

Thinking a bit more, it seems a big problem we may face in using model fragments is that they are fragments and we will have to find a way to stitch them together so that they fill the gaps between the models, perhaps requiring something like model interpolation. Of course, maybe this isn't necessary if we think of fragments as mostly overlapping (although probably inconsistent in the overlaps) or of new fragments to fill gaps as available on demand if we discover we need them and don't have them.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-08-07T13:27:40.596Z · LW(p) · GW(p)

For contradictions: https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately [LW · GW]

Replies from: gworley

↑ comment by Gordon Seidoh Worley (gworley) · 2018-08-07T17:37:54.312Z · LW(p) · GW(p)

I suspect dealing adequately with contradictions will be significantly more complicated than you propose, but haven't written about that in depth yet. When I get around to addressing what I view as necessary in this area (practicing moral particularism that will be robust to false positives) I definitely look forward to talking with you more about it.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2018-08-08T17:06:05.135Z · LW(p) · GW(p)

I agree with you to some extent. That post is mainly a placeholder that tells me that the contradictions problem is not intrinsically unsolvable, so I can put it aside while I concentrate on this problem for the moment.
