## Posts

Toy model piece #5: combining partial preferences 2019-09-12T03:31:25.295Z · score: 10 (2 votes)
Toy model piece #4: partial preferences, re-re-visited 2019-09-12T03:31:08.628Z · score: 9 (1 votes)
Is my result wrong? Maths vs intuition vs evolution in learning human preferences 2019-09-10T00:46:25.356Z · score: 19 (6 votes)
Simple and composite partial preferences 2019-09-09T23:07:26.358Z · score: 11 (2 votes)
Best utility normalisation method to date? 2019-09-02T18:24:29.318Z · score: 15 (5 votes)
Reversible changes: consider a bucket of water 2019-08-26T22:55:23.616Z · score: 25 (24 votes)
Toy model piece #3: close and distant situations 2019-08-26T22:41:17.500Z · score: 10 (2 votes)
Problems with AI debate 2019-08-26T19:21:40.051Z · score: 21 (10 votes)
Gratification: a useful concept, maybe new 2019-08-25T18:58:15.740Z · score: 17 (7 votes)
Under a week left to win $1,000! By questioning Oracle AIs. 2019-08-25T17:02:46.921Z · score: 14 (3 votes)
Toy model piece #2: Combining short and long range partial preferences 2019-08-08T00:11:39.578Z · score: 15 (4 votes)
Preferences as an (instinctive) stance 2019-08-06T00:43:40.424Z · score: 20 (6 votes)
Practical consequences of impossibility of value learning 2019-08-02T23:06:03.317Z · score: 23 (11 votes)
Very different, very adequate outcomes 2019-08-02T20:31:00.751Z · score: 13 (4 votes)
Contest: $1,000 for good questions to ask to an Oracle AI 2019-07-31T18:48:59.406Z · score: 68 (28 votes)
Toy model piece #1: Partial preferences revisited 2019-07-29T16:35:19.561Z · score: 12 (3 votes)
Normalising utility as willingness to pay 2019-07-18T11:44:52.272Z · score: 16 (4 votes)
Intertheoretic utility comparison: examples 2019-07-17T12:39:45.147Z · score: 13 (3 votes)
Indifference: multiple changes, multiple agents 2019-07-08T13:36:42.095Z · score: 16 (3 votes)
Self-confirming prophecies, and simplified Oracle designs 2019-06-28T09:57:35.571Z · score: 6 (3 votes)
Apocalypse, corrupted 2019-06-26T13:46:05.548Z · score: 20 (12 votes)
Research Agenda in reverse: what *would* a solution look like? 2019-06-25T13:52:48.934Z · score: 36 (14 votes)
Research Agenda v0.9: Synthesising a human's preferences into a utility function 2019-06-17T17:46:39.317Z · score: 61 (16 votes)
Preference conditional on circumstances and past preference satisfaction 2019-06-17T15:30:32.580Z · score: 11 (2 votes)
For the past, in some ways only, we are moral degenerates 2019-06-07T15:57:10.962Z · score: 29 (9 votes)
To first order, moral realism and moral anti-realism are the same thing 2019-06-03T15:04:56.363Z · score: 17 (4 votes)
Conditional meta-preferences 2019-06-03T14:09:54.357Z · score: 6 (3 votes)
Uncertainty versus fuzziness versus extrapolation desiderata 2019-05-30T13:52:16.831Z · score: 20 (5 votes)
And the AI would have got away with it too, if... 2019-05-22T21:35:35.543Z · score: 75 (30 votes)
By default, avoid ambiguous distant situations 2019-05-21T14:48:15.453Z · score: 31 (8 votes)
Oracles, sequence predictors, and self-confirming predictions 2019-05-03T14:09:31.702Z · score: 21 (7 votes)
Self-confirming predictions can be arbitrarily bad 2019-05-03T11:34:47.441Z · score: 45 (17 votes)
Nash equilibriums can be arbitrarily bad 2019-05-01T14:58:21.765Z · score: 36 (16 votes)
Defeating Goodhart and the "closest unblocked strategy" problem 2019-04-03T14:46:41.936Z · score: 44 (14 votes)
Learning "known" information when the information is not actually known 2019-04-01T17:56:17.719Z · score: 15 (5 votes)
Relative exchange rate between preferences 2019-03-29T11:46:35.285Z · score: 12 (3 votes)
Being wrong in ethics 2019-03-29T11:28:55.436Z · score: 22 (5 votes)
Models of preferences in distant situations 2019-03-29T10:42:14.633Z · score: 11 (2 votes)
The low cost of human preference incoherence 2019-03-27T11:58:14.845Z · score: 21 (8 votes)
"Moral" as a preference label 2019-03-26T10:30:17.102Z · score: 16 (5 votes)
Partial preferences and models 2019-03-19T16:29:23.162Z · score: 13 (3 votes)
Combining individual preference utility functions 2019-03-14T14:14:38.772Z · score: 12 (4 votes)
Mysteries, identity, and preferences over non-rewards 2019-03-14T13:52:40.170Z · score: 14 (4 votes)
A theory of human values 2019-03-13T15:22:44.845Z · score: 29 (8 votes)
Example population ethics: ordered discounted utility 2019-03-11T16:10:43.458Z · score: 14 (5 votes)
Smoothmin and personal identity 2019-03-08T15:16:28.980Z · score: 20 (10 votes)
Preferences in subpieces of hierarchical systems 2019-03-06T15:18:21.003Z · score: 11 (3 votes)
mAIry's room: AI reasoning to solve philosophical problems 2019-03-05T20:24:13.056Z · score: 64 (21 votes)
Simplified preferences needed; simplified preferences sufficient 2019-03-05T19:39:55.000Z · score: 31 (12 votes)
Finding the variables 2019-03-04T19:37:54.696Z · score: 30 (7 votes)

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-12T02:42:10.374Z · score: 2 (1 votes) · LW · GW

I like this analogy. Probably not best to put too much weight on it, but it has some insights.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T18:59:03.995Z · score: 2 (1 votes) · LW · GW

And whether those programs could then perform well if their opponent forces them into a very unusual situation - one that would never have appeared in a chessmaster's game.

If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing the trap? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML be able to deal with it?

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:50:33.181Z · score: 2 (1 votes) · LW · GW

PS: the other title I considered was "Why do people feel my result is wrong", which felt too condescending.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:34:12.187Z · score: 5 (3 votes) · LW · GW

I agree we're not as good as we think we are. But there are a lot of things we do agree on that seem trivial: e.g. "this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm". We have far, far more agreement than random agents would.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:23:01.918Z · score: 2 (1 votes) · LW · GW

Hehe - I don't normally do this, but I feel I can indulge once ^_^

> having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well

How do you know this?

Moravec's paradox again. Chessmasters didn't easily program chess programs; and those chess programs didn't generalise to games in general.

> Should we turn this into one of those concrete ML experiments?

That would be good. I'm aiming to have a lot more practical experiments from my research project, and this could be one of them.

Comment by stuart_armstrong on Best utility normalisation method to date? · 2019-09-07T23:09:29.936Z · score: 2 (1 votes) · LW · GW

Hum... It seems that we can stratify here. Let represent the values of a collection of variables that we are uncertain about, and that we are stratifying on.

When we compute the normalising factor for utility under two policies and , we normally do it as:

• , with .

And then we replace with .

Instead we might normalise the utility separately for each value of :

• Conditional on , then , with .

The problem is that, since we're dividing by the , the expectation of is not the same .

Is there an obvious improvement on this?

Note that here, total utilitarianism gets less weight in large universes, and more in small ones.

I'll think more...
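Since the inline symbols in this comment were lost, here is a hypothetical reconstruction of the two schemes being compared. It assumes the normalising factor is the absolute difference in expected utility between two reference policies (called `pi1` and `pi2` here), and that stratifying means computing that factor separately for each value `z` of the uncertain variable; all names and numbers are illustrative assumptions, not the post's notation.

```python
def normalise_global(u_by_z, p_z, pi1, pi2):
    """Normalise by one global factor: |E[u | pi1] - E[u | pi2]|,
    with the expectation taken over the stratifying variable z."""
    # u_by_z[z][policy] = E[u | Z=z, policy]; p_z[z] = P(Z=z)
    e1 = sum(p_z[z] * u_by_z[z][pi1] for z in p_z)
    e2 = sum(p_z[z] * u_by_z[z][pi2] for z in p_z)
    n = abs(e1 - e2)
    return {z: {pi: u / n for pi, u in us.items()} for z, us in u_by_z.items()}

def normalise_stratified(u_by_z, pi1, pi2):
    """Normalise separately within each stratum z."""
    out = {}
    for z, us in u_by_z.items():
        n = abs(us[pi1] - us[pi2])  # per-stratum normalising factor
        out[z] = {pi: u / n for pi, u in us.items()}
    return out
```

With a high-spread utility (e.g. a total-utilitarian one in a "large" universe), the stratified factor is larger there, so that utility gets relatively less weight in large strata - matching the remark above. Note also that the stratified version's overall expectation differs from the global one, which is the problem flagged in the comment.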

Comment by stuart_armstrong on Problems with AI debate · 2019-09-06T01:13:47.636Z · score: 3 (2 votes) · LW · GW

How about a third AI that gives a (hidden) probability about which one you'll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI gives good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.
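One way to operationalise "good calibration over multiple experiments" - a sketch only; the binning approach and function name are my own, not part of the proposal:

```python
def calibration_check(predictions, outcomes, n_bins=10):
    """Bucket the third AI's hidden probabilities and compare each bucket's
    mean predicted probability with the observed frequency of the predicted
    outcome (here: 'the human was convinced by argument A')."""
    bins = [[] for _ in range(n_bins)]
    for p, hit in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, hit))
    report = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(h for _, h in b) / len(b)
            report.append((round(mean_p, 2), round(freq, 2), len(b)))
    return report
```

If, across many debates, the mean prediction in each bucket tracks the observed frequency (and the predictions are spread out, i.e. discriminate between cases), we have evidence the hidden probabilities are trustworthy.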

Comment by stuart_armstrong on Best utility normalisation method to date? · 2019-09-03T16:38:35.640Z · score: 3 (2 votes) · LW · GW

Er, this normalisation system may well solve that problem entirely. If prefers option (utility ), with second choice (utility ), and all the other options as third choice (utility ), then the expected utility of the random dictator is for all (as gives utility , and gives utility for all ), so the normalised weighted utility to maximise is:

• .

Using (because scaling doesn't change expected utility decisions), the utility of any , , is , while the utility of is . So if , the compromise option will get chosen.

Don't confuse the problems of the random dictator, with the problems of maximising the weighted sum of the normalisations that used the random dictator (and don't confuse the other way, either; the random dictator is immune to players' lying, this normalisation is not).
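A toy numerical version of the argument above - the agent count `n` and the compromise utility `c_val` are made-up parameters, and the normalisation is reduced to an equal per-agent scaling, a simplification of the scheme in the post:

```python
def best_option(n, c_val):
    """n agents: agent i gets utility 1 from its own favourite option o_i,
    c_val from the shared compromise option, and 0 otherwise.  Maximising
    the sum of (equally normalised) utilities picks the compromise
    exactly when n * c_val > 1, i.e. c_val > 1/n."""
    totals = {f"o{i}": 1.0 for i in range(n)}     # each o_i: one agent at 1
    totals["compromise"] = n * c_val              # every agent at c_val
    return max(totals, key=totals.get)
```

So even a weakly liked compromise wins once there are enough agents - unlike under the random dictator itself, which would never pick it. That is the distinction the comment is drawing.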

Comment by stuart_armstrong on How to Make Billions of Dollars Reducing Loneliness · 2019-09-01T21:16:13.728Z · score: 6 (4 votes) · LW · GW

But community isn't about friends; it's about a background level of acquaintances you're comfortable with.

Comment by stuart_armstrong on September Bragging Thread · 2019-08-31T22:02:04.809Z · score: 44 (24 votes) · LW · GW

I finished the research agenda on constructing a preference utility function for any given human, and presented the ideas to CHAI and MIRI. Woot!

Comment by stuart_armstrong on The Very Repugnant Conclusion · 2019-08-31T13:25:42.902Z · score: 2 (1 votes) · LW · GW

Something like or or in general (for decreasing, continuous ) could work, I think.

Comment by stuart_armstrong on Why so much variance in human intelligence? · 2019-08-30T20:19:02.054Z · score: 6 (3 votes) · LW · GW

I'd say that intelligence variations are more visible in (modern) humans, not that they're necessarily larger.

Let's go back to the tribal environment. In that situation, humans want to mate, to dominate/be admired, to have food and shelter, and so on. Apart from a few people with mental defects, the variability in outcome is actually quite small - most humans won't get ostracised, many will have children, only very few will rise to the top of the hierarchy (and even there, tribal environments are more egalitarian than most, so the leader is not that much different from the others). So we might say that the variation in human intelligence (or social intelligence) is low.

Fast forward to an agricultural empire, or to the modern world. Now the top minds can become god emperor, invading and sacking other civilizations, or can be part of projects that produce atom bombs and lunar rockets. The variability of outcomes is huge, and so the variability in intelligence appears to be much higher.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-30T19:18:55.010Z · score: 6 (3 votes) · LW · GW

That's an excellent summary.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:39:40.606Z · score: 3 (2 votes) · LW · GW

> One might think just doing ontology doesn't involve making preference choice but making some preferences impossible to articulate it is in fact a partial preference choice.

Yep, that's my argument: some (but not all) aspects of human preferences have to be included in the setup somehow.

> it's more reasonable for a human to taste a salt level differnce, it's more plausible to say "I couldn't know" about radioactivity

I hope you don't taste every bucket of water before putting it away! ^_^

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:36:29.769Z · score: 2 (1 votes) · LW · GW

In the later part of the post, it seems you're basically talking about entropy and similar concepts? And I agree that "reversible" is kinda like entropy, in that we want to be able to return to a "macrostate" that is considered indistinguishable from the starting macrostate (even if the details are different).

However, as in the bucket example above, the problem is that, for humans, what "counts" as the same macrostate can vary a lot. If we need a liquid, any liquid, then replacing the bucket's contents with purple-tinted alcohol is fine; if we're thinking of the bath water of the dear departed husband, then any change to the contents is irreversible. Human concepts of "acceptably similar" don't match up with entropic ones.

> there needs to be an effect that counts as "significant".

Are you deferring this to the human judgement of what is significant? If so, we agree - human judgement needs to be included in some way in the definition.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:29:09.841Z · score: 4 (2 votes) · LW · GW

> Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward.

Yep, I agree :-)

> I generally think that impact measures don't have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.

Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches, require some but not all of human preferences. "some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-29T02:26:05.230Z · score: 2 (1 votes) · LW · GW

Good idea.

Comment by stuart_armstrong on Gratification: a useful concept, maybe new · 2019-08-29T02:22:47.592Z · score: 2 (1 votes) · LW · GW

> intrinsic motivation

That might be the concept I'm looking for. I'll think whether it covers exactly what I'm trying to say...

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T21:31:27.141Z · score: 4 (2 votes) · LW · GW

Ok, we strongly disagree on your simple constraints being enough. I'd need to see these constraints explicitly formulated before I had any confidence in them. I suspect (though I'm not certain) that the more explicit you make them, the trickier you'll see it is.

And no, I don't want to throw IRL out (this is an old post), I want to make it work. I got this big impossibility result, and now I want to get around it. This is my current plan: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-27T21:23:33.678Z · score: 2 (1 votes) · LW · GW

Very worthwhile concern, and I will think about it more.

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T18:48:07.628Z · score: 2 (1 votes) · LW · GW

We may not be disagreeing any more. Just to check, do you agree with both these statements:

1. Adding a few obvious constraints rules out many different R, including the ones in the OP.

2. Adding a few obvious constraints is not enough to get a safe or reasonable R.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-27T16:27:02.182Z · score: 5 (3 votes) · LW · GW

I've added an edit to the post, to show the problem: sometimes, the robot can't kick the bucket, sometimes it must. And only human preferences distinguish these two cases. So, without knowing these preferences, how can it decide?

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T16:19:48.161Z · score: 2 (1 votes) · LW · GW

Rejecting any specific R is easy - one bit of information (at most) per specific R. So saying "humans have preferences, and they are not always rational or always anti-rational" rules out R(1), R(2), and R(3). Saying "this apparent preference is genuine" rules out R(4).

But it's not like there are just these five preferences and once we have four of them out of the way, we're done. There are many, many different preferences in the space of preferences, and many, many of them will be simpler than R(0). So to converge to R(0), we need to add huge amounts of information, ruling out more and more examples.

Basically, we need to include enough information to define R(0) - which is what my research project is trying to do. What you're seeing as "adding enough clear examples" is actually "hand-crafting R(0) in totality".

For more details see here: https://arxiv.org/abs/1712.05812

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-27T03:09:29.726Z · score: 4 (3 votes) · LW · GW

> kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.

In this specific setup, no. But sometimes kicking the bucket is fine; sometimes kicking the metaphorical equivalent of the bucket is necessary. If the AI is never willing to kick the bucket - ie never willing to take actions that might, for certain utility functions, cause huge and irreparable harm - then it's not willing to take any action at all.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-26T23:04:31.210Z · score: 6 (5 votes) · LW · GW

Your presentation had an example with randomly selected utility functions in a block world, that resulted in the agent taking less-irreversible actions around a specific block.

If we have randomly selected utility functions in the bucket-and-pool world, this may include utilities that care about the salt content or the exact molecules, or not. Depending on whether or not we include these, we run the risk of preserving the bucket when we need not, or kicking it when we should preserve it. This is because the "worth" of the water being in the bucket varies depending on human preferences, not on anything intrinsic to the design of the bucket and the pool.

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-26T21:11:08.689Z · score: 7 (3 votes) · LW · GW

> we have plenty of natural assumptions to choose from.

You'd think so, but nobody has defined these assumptions in anything like sufficient detail to make IRL work. My whole research agenda is essentially a way of defining these assumptions, and it seems to be a long and complicated process.

Comment by stuart_armstrong on Gratification: a useful concept, maybe new · 2019-08-26T19:24:33.116Z · score: 2 (1 votes) · LW · GW

Basically yes. My take on 2) is that identity-affirming things can be somewhat pleasurable - but they're unlikely to be the most pleasurable thing the human could do at that moment. So they can be valued for something else than pure pleasure.

And you can get other examples where someone, say, is truthful, even if that causes them more pain than a simple lie would.

Comment by stuart_armstrong on Gratification: a useful concept, maybe new · 2019-08-26T18:47:04.617Z · score: 3 (2 votes) · LW · GW

Cool, thanks. I see the torture example as being closer to hedonism (or rather, anti-hedonism), though.

Comment by stuart_armstrong on Under a week left to win $1,000! By questioning Oracle AIs. · 2019-08-25T20:06:01.446Z · score: 2 (1 votes) · LW · GW

Thanks, good idea.

Comment by stuart_armstrong on Solving the Doomsday argument · 2019-08-25T17:09:24.615Z · score: 4 (2 votes) · LW · GW

These are valid points, but we have wandered a bit away from the initial argument, and we're now talking about numbers that can't be compared (my money is on TREE(3) being smaller in this example, but that's irrelevant to your general point), or ways of truncating in the infinite case. But we seem to have solved the finite-and-comparable case.

Now, back to the infinite case. First of all, there may be a correct decision even if probabilities cannot be computed. If we have a suitable utility function, we may decide simply not to care about what happens in universes that are of the type 5, which would rule them out completely.

Or maybe the truncation can be improved slightly. For example, we could give each observer a bubble of radius 20 mega-light years, which is defined according to their own subjective experience: how many individuals do they expect to encounter within that radius, if they were made immortal and allowed to explore it fully. Then we truncate by this subjective bubble, or something similar.

But yeah, in general, the infinite case is not solved.

Comment by stuart_armstrong on Solving the Doomsday argument · 2019-08-19T00:41:53.266Z · score: 4 (2 votes) · LW · GW

If we set aside infinity, which I don't know how to deal with, then the SIA answer does not depend on utility bounds - unlike my anthropic decision theory post. Q1: "How many copies of people (currently) like me are there in each universe?" is well-defined in all finite settings, even huge ones.

> Incidentally, when you say there are “not many” copies of me in universes 3 and 4, then you presumably mean “not a high proportion, compared to the vast total of observers”

No, I mean not many, as compared with how many there are in universes 1 and 2. Other observers are not relevant to Q1.

I'll reiterate my claim that different anthropic probability theories are "correct answers to different questions": https://www.lesswrong.com/posts/nxRjC93AmsFkfDYQj/anthropic-probabilities-answering-different-questions

Comment by stuart_armstrong on Toy model piece #1: Partial preferences revisited · 2019-08-18T12:53:57.324Z · score: 3 (2 votes) · LW · GW

Yep, sorry, I saw -3, -2, -1, etc... and concluded you weren't doing the 2 jumps; my bad!

> Then somehow the work is just postponed to the point where we try to combine partial preferences?

Yes. But unless we have other partial preferences or meta-preferences, then the only reasonable way of combining them is just to add them, after weighting.

I like your reciprocal weighting formula. It seems to have good properties.

Comment by stuart_armstrong on Solving the Doomsday argument · 2019-08-15T01:27:08.215Z · score: 4 (2 votes) · LW · GW

> "How many copies of people like me are there in each universe?"

Then as long as your copies know that 3K has been observed, and excluding simulations and such, the answers are "(a lot, a lot, not many, not many)" in the four universes (I'm interpreting "die off before spreading through space" as "die off just before spreading through space"). This is the SIA answer, since I asked the SIA question.

Comment by stuart_armstrong on Toy model piece #1: Partial preferences revisited · 2019-08-15T01:21:08.790Z · score: 2 (1 votes) · LW · GW

Each partial preference is meant to represent a single mental model inside the human, with all preferences weighted the same (so there can't be "extremely weak" preferences, compared with other preferences in the same partial preference). Things like "increased income is better", "more people smiling is better", "being embarrassed on stage is worse".

We can imagine a partial preference with more internal structure, maybe internal weights, but I'd simply see that as two separate partial preferences. So we'd have the utilities you gave to through to for one partial preference (actually, my formula doubles the numbers you gave), and , , for the other partial preference - which has a very low weight by assumption. So the order of and is not affected.

EDIT: I'm pretty sure we can generalise my method for different weights of preferences, by changing the formula that sums the squares of utility differences.

Comment by stuart_armstrong on Categorial preferences and utility functions · 2019-08-11T00:28:40.644Z · score: 3 (2 votes) · LW · GW

Neat! Though I should mention that my current version of partial preferences does not assume all cycles are closed - the constrained optimisation can be seen as trying to get "as close as possible" to that, given non-closed cycles.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-10T18:20:29.019Z · score: 2 (1 votes) · LW · GW

What's the set of answers, and how are they assessed?

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-09T17:33:40.363Z · score: 2 (1 votes) · LW · GW

1. I encourage you to submit other ideas anyway, since your ideas are good.
2. Not sure yet about how all these things relate; will maybe think of that more later.

Comment by stuart_armstrong on Partial preferences and models · 2019-08-09T17:30:02.550Z · score: 5 (3 votes) · LW · GW

Hey there! Thanks for your long comment - but, alas, this model of partial preferences is obsolete :-( Because of other problems with this, I've replaced it with the much more general concept of a preorder. This can express all the things we want to express, but is a lot less intuitive for how humans model things. I may come up with some alternative definition at some point (less general than a preorder, but more general than this post). Thanks for the comment in any case.

Comment by stuart_armstrong on Preferences as an (instinctive) stance · 2019-08-07T21:26:05.377Z · score: 2 (1 votes) · LW · GW

> I find it hard to imagine that you're actually denying that you or I have things that, colloquially, one would describe as preferences, and exist in an objective sense.

I deny that a generic outside observer would describe us as having any specific set of preferences, in an objective sense. This doesn't bother me too much, because it's sufficient that we have preferences in a subjective sense - that we can use our own empathy modules and self-reflection to define, to some extent, our preferences.

> a brain is ultimately many fewer assumptions (to the pre-industrial Norse people)

"Realistic" preferences make ultimately fewer assumptions (to actual humans) than "fully rational" or other preference sets. The problem is that this is not true for generic agents, or AIs.

We have to get the human empathy module into the AI first - not so it can predict us (it can already do that through other means), but so that its decomposition of our preferences is the same as ours.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-07T21:14:17.537Z · score: 2 (1 votes) · LW · GW

Can you make this a bit more general, rather than just for the specific example?

Comment by stuart_armstrong on Contest: \$1,000 for good questions to ask to an Oracle AI · 2019-08-07T21:13:33.853Z · score: 8 (3 votes) · LW · GW

For low bandwidth, you have to specify the set of answers that are available (and how they would be checked).

Comment by stuart_armstrong on Practical consequences of impossibility of value learning · 2019-08-06T00:45:51.175Z · score: 2 (1 votes) · LW · GW

I tried to answer in more detail here: https://www.lesswrong.com/posts/f5p7AiDkpkqCyBnBL/preferences-as-an-instinctive-stance (hope you didn't mind; I used your comment as a starting point for a major point I wanted to clarify).

But I admit to being confused now, and not understanding what you mean. Preferences don't exist in the territory, so I'm not following you, sorry! :-(

Comment by stuart_armstrong on Practical consequences of impossibility of value learning · 2019-08-05T17:34:50.289Z · score: 2 (1 votes) · LW · GW

Saying that an agent has a preference/reward R is an interpretation of that agent (similar to the "intentional stance" of seeing it as an agent, rather than a collection of atoms). And the (p,R) and (-p,-R) interpretations are (almost) equally complex.

Comment by stuart_armstrong on Practical consequences of impossibility of value learning · 2019-08-04T23:24:51.399Z · score: 2 (1 votes) · LW · GW

I disagree. I think that if we put a complexity upper bound on human rationality, and assume noisy rationality, then we will get values that are "meaningless" from your perspective.

I'm trying to think of ways we could test this...

Comment by stuart_armstrong on Practical consequences of impossibility of value learning · 2019-08-04T19:52:52.939Z · score: 2 (1 votes) · LW · GW

> Imagine [...] distinguish.

It's because of concerns like this that we have to solve the symbol grounding problem for the human we are trying to model; see, eg, https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically

But that doesn't detract from the main point: that simplicity, on its own, is not sufficient to resolve the issue.

Comment by stuart_armstrong on Practical consequences of impossibility of value learning · 2019-08-04T19:49:07.470Z · score: 4 (2 votes) · LW · GW

> If you assume that human values are simple (low Kolmogorov complexity) and that human behavior is quite good at fulfilling those values, then you can deduce non-trivial values for humans.

And you will deduce them wrong. "Human values are simple" pushes you towards "humans have no preferences", and if by "human behavior is quite good at fulfilling those values" you mean something like noisy rationality, then it will go very wrong, see for example https://www.lesswrong.com/posts/DuPjCTeW9oRZzi27M/bounded-rationality-abounds-in-models-not-explicitly-defined

And if instead you mean a proper accounting of bounded rationality, of the difference between anchoring bias and taste, of the difference between system 1 and system 2, of the whole collection of human biases... well, then, yes, I might agree with you. But that's because you've already put all the hard work in.

Comment by stuart_armstrong on Very different, very adequate outcomes · 2019-08-04T19:25:15.968Z · score: 5 (3 votes) · LW · GW

I've been thinking of a rather naive form of preference utilitarianism, of the sort "if the human agrees to it or chooses it, then it's ok". In particular, you can end up with some forms of depression where the human is miserable, but isn't willing to change.

I'll clarify that in the post.

Comment by stuart_armstrong on Very different, very adequate outcomes · 2019-08-04T19:23:21.198Z · score: 2 (1 votes) · LW · GW

I'm finding these "is the correct utility function" hard to parse. Humans have a bit of and a bit of . But we are underdefined systems; there is no specific value of that is "true". We can only assess the quality of using other aspects of human underdefined preferences.

> This seems way too handwavy.

It is. Here's an attempt at a more formal definition: humans have collections of underdefined and somewhat contradictory preferences (using preferences in a more general sense than preference utilitarianism). These preferences seem to be stronger in the negative sense than in the positive: humans seem to find the loss of a preference much worse than the gain. And the negative is much more salient, and often much more clearly defined, than the positive.

Given that maximising one preference tends to put the values of others at extreme values, human overall preferences seem better captured by a weighted mix of preferences (or a smooth min of preferences) than by any single preference, or small set of preferences. So it is not a good idea to be too close to the extremes (extremes being places where some preferences have weight put on them).

Now there may be some sense in which these extreme preferences are "correct", according to some formal system. But this formal system must reject the actual preferences of humans today; so I don't see why these preferences should be followed at all, even if they are correct.

Ok, so the extremes are out; how about being very close to the extremes? Here is where it gets wishy-washy. We don't have a full theory of human preferences. But, according to the picture I've sketched above, the important thing is that each preference gets some positive traction in our future. So, yes to might not mean much (and smooth min might be better anyway). But I believe I could say:

• There are many weighted combinations of human preferences that are compatible with the picture I've sketched here. Very different outcomes, from the numerical perspective of the different preferences, but all falling within an "acceptability" range.

Still a bit too handwavy. I'll try and improve it again.
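The "weighted mix or smooth min" idea above can be made slightly more concrete. A minimal sketch of a smooth minimum over preference utilities - the exact functional form and the sharpness parameter `k` are illustrative assumptions, not a definition from the research agenda:

```python
import math

def smoothmin(values, k=1.0):
    """Soft minimum: -(1/k) * log(mean(exp(-k * v))).
    The result lies between min(values) and min(values) + log(n)/k, so it
    approaches the hard minimum as k grows; for moderate k, raising any
    component (not just the worst-off one) raises the combined value, so
    every preference keeps some positive traction."""
    n = len(values)
    return -(1.0 / k) * math.log(sum(math.exp(-k * v) for v in values) / n)
```

Maximising such a combination avoids the extremes: pushing one preference to a huge value while another collapses is always dominated by improving the worst-off preferences.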

Comment by stuart_armstrong on Very different, very adequate outcomes · 2019-08-02T22:24:51.781Z · score: 4 (2 votes) · LW · GW

https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison

Or we could come up with a normalisation method by having people rank the intensity of their preferences versus the intensity of their enjoyments. It doesn't have to be particularly good, just give non-crazy results.

Comment by stuart_armstrong on Toy model piece #1: Partial preferences revisited · 2019-07-31T16:41:28.465Z · score: 4 (2 votes) · LW · GW

Thanks, corrected a few typos.

> Why must a preorder decompose into disjoint ordered chains?

They don't have to; I'm saying that sensible partial preferences (eg ) should do so. I then see how I'd deal with sensible preorders, then generalise to all preorders in the next section.

> How do cycles vanish in ? Can you work through the example where the partial preference expressed by the human is ?

Note that what you've written is impossible as means but not . A preorder is transitive, so the best you can get is .

Then projecting down (via ) to will project all these down to the same element. That's why there are no cycles, because all cycles go to points.

Then we need to check some math. Define on by iff .

This is well defined (independently of which and we use to represent and ), because if , then , so, by transitivity, . The same argument works for .

We now want to show the is a partial order on . It's transitive, because if and , then , and the transitivity in implies and hence .

That shows it's a preorder. To show partial order, we need to show there are no cycles. So, if and , then and , hence, by definition of , . So it's a partial order.
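The quotient construction above can be checked mechanically. A small sketch (my own illustration, not code from the post): a preorder is given as an explicit set of pairs, and `quotient` collapses each equivalence class x ~ y (i.e. x ≤ y and y ≤ x) to a single point, so that cycles project down to points and the induced relation is a partial order.

```python
from itertools import product

def transitive_closure(pairs, elems):
    """Reflexive-transitive closure of a relation over elems."""
    rel = set(pairs) | {(x, x) for x in elems}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(rel), repeat=2):
            if b == c and (a, d) not in rel:
                rel.add((a, d))
                changed = True
    return rel

def quotient(pairs, elems):
    """Collapse each equivalence class {y : x <= y and y <= x} to a point;
    the induced relation on classes is a partial order."""
    rel = transitive_closure(pairs, elems)
    cls = {x: frozenset(y for y in elems
                        if (x, y) in rel and (y, x) in rel)
           for x in elems}
    q_rel = {(cls[a], cls[b]) for (a, b) in rel}
    return cls, q_rel
```

For the cycle a ≤ b ≤ c ≤ a, all three elements land in one class, and the quotient relation has no cycle between distinct classes - exactly the antisymmetry argument above.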