Comment by stuart_armstrong on By default, avoid ambiguous distant situations · 2019-05-23T12:01:55.615Z · score: 2 (1 votes) · LW · GW

It's locking in the moral/preference status quo; once that's done, non-Pareto overall gains are fine.

Even when locking in that status quo, it explicitly trades off certain values against others, so there is no "only Pareto" restriction.

I have a research agenda to be published soon that will look into these issues in more detail.

Comment by stuart_armstrong on [AN #56] Should ML researchers stop running experiments before making hypotheses? · 2019-05-23T11:31:55.901Z · score: 2 (1 votes) · LW · GW

Relevant to Robin Hanson's point: I've argued here that agents betraying their principals happens in politics all the time, sometimes with disastrous results. By restricting to the economic literature on this problem, we're only looking at a small subset of "agency problems", and implicitly assuming that institutions are sufficiently strong to detect and deter bad behaviour of very powerful AI agents - which is not at all evident.

Comment by stuart_armstrong on By default, avoid ambiguous distant situations · 2019-05-22T22:19:09.406Z · score: 8 (3 votes) · LW · GW

Main difference between brainwashing and creating: the pre-brainwashed person had preferences about their future selves that are partially satisfied by reverting back.

Comment by stuart_armstrong on By default, avoid ambiguous distant situations · 2019-05-22T22:17:17.985Z · score: 7 (2 votes) · LW · GW

had magicked house elves into existence [...] should we change them?

I'm explicitly arguing that even though we might not want to change them, we could still prefer they not exist in the first place.

should we change from our world to one where people are not culturally molded to appreciate any of our current values?

I'm trying to synthesise actual human values, not hypothetical other values that other beings might have. So in this process, our current values (or our current meta-preferences for our future values) get a special place. If we had different values currently, the synthesis would be different. So that change would be, from our perspective, a loss.

And the AI would have got away with it too, if...

2019-05-22T21:35:35.543Z · score: 45 (13 votes)
Comment by stuart_armstrong on By default, avoid ambiguous distant situations · 2019-05-22T11:14:57.473Z · score: 4 (2 votes) · LW · GW

Interesting take; I'll consider it more...

By default, avoid ambiguous distant situations

2019-05-21T14:48:15.453Z · score: 23 (5 votes)
Comment by stuart_armstrong on Are you in a Boltzmann simulation? · 2019-05-21T12:10:33.048Z · score: 4 (2 votes) · LW · GW

Basically that the "dust minds" are all crazy, because their internal beliefs correspond to nothing in reality, and there is no causality for them, except by sheer coincidence.

See also this old post: https://www.lesswrong.com/posts/295KiqZKAb55YLBzF/hedonium-s-semantic-problem

My main true reason for rejecting BBs of most types is this causality breakdown: there's no point computing the probability of being a BB, because your decision is irrelevant in those cases. In longer-lived Boltzmann Simulations, however, causality matters, so you should include them.

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-05-21T12:03:26.755Z · score: 4 (2 votes) · LW · GW

Upvoted for the useful comment, but my mind works completely the opposite to this - only through seeing the math does the formalism make sense to me. I suspect many lesswrongers are similar in that respect, but it's interesting to see that not all are.

(also, yes, I could make my posts easier to follow, I admit that; one day, when I have more time, I will work on that)

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-05-21T11:55:03.137Z · score: 2 (1 votes) · LW · GW

Mainly people describing their own subjective experience in ways that make me think "hey, that's just like me - and I haven't told anyone about it!" Or me modelling you as having a subjective experience close to my own, using this model to predict your actions, and being reasonably accurate.

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-05-20T12:40:45.998Z · score: 3 (2 votes) · LW · GW

Has the author read Nagel's Seeing Like a Bat?

Yep (I read "What is it like to be a bat?")! Indeed, that started the thought process that led to the above post.

Nagel's biggest failing, as I see it, is that he makes everything boolean. "a bat's consciousness is inaccessible to a human" differs in degree, not kind, from "a human's consciousness is inaccessible to another human". There are observations that could convince me that a human has or has not true insight into a bat's consciousness.

Comment by stuart_armstrong on Are you in a Boltzmann simulation? · 2019-05-20T12:31:29.168Z · score: 4 (2 votes) · LW · GW

Mostly the symbol grounding posts: https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically https://www.lesswrong.com/posts/ix3KdfJxjo9GQFkCo/web-of-connotations-bleggs-rubes-thermostats-and-beliefs https://www.lesswrong.com/posts/XApNuXPckPxwp5ZcW/bridging-syntax-and-semantics-with-quine-s-gavagai

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-05-14T17:09:35.878Z · score: 4 (2 votes) · LW · GW

Cool. Glad it was useful!

Comment by stuart_armstrong on Self-confirming predictions can be arbitrarily bad · 2019-05-07T10:59:41.881Z · score: 3 (2 votes) · LW · GW

And such a predictor cannot unlock new corners of strategy space, or generate self-reinforcing predictions, because the past sequence on which it's trained won't have those features.

See my last paragraph above; I don't think we can rely on predictors not unlocking new corners of strategy space, because it may be able to learn gradually how to do so.

Comment by stuart_armstrong on Self-confirming predictions can be arbitrarily bad · 2019-05-07T10:57:21.641Z · score: 2 (1 votes) · LW · GW

Exactly ^_^

https://www.lesswrong.com/posts/i2dNFgbjnqZBfeitT/oracles-sequence-predictors-and-self-confirming-predictions

Comment by stuart_armstrong on Self-confirming predictions can be arbitrarily bad · 2019-05-04T02:17:42.326Z · score: 3 (2 votes) · LW · GW

If you want to avoid changing distances, set the outcome as £P+1 for P less than a million, and £P-1 for P greater than or equal to a million (for example).
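As a quick illustration (my own sketch, using exactly the rule stated above), this outcome function keeps the gap between prediction and outcome at £1 for every prediction, which is the sense in which it avoids changing distances:

```python
def outcome(prediction: int) -> int:
    # Donation that results from announcing `prediction`, in pounds,
    # using the piecewise rule suggested above.
    return prediction + 1 if prediction < 1_000_000 else prediction - 1

# The prediction error |outcome(P) - P| equals 1 for every P: the rule never
# stretches or shrinks the distance between prediction and outcome.
assert all(abs(outcome(p) - p) == 1 for p in range(0, 2_000_000, 997))
```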

Comment by stuart_armstrong on Self-confirming predictions can be arbitrarily bad · 2019-05-04T02:13:47.301Z · score: 2 (1 votes) · LW · GW

A self-confirming prediction is what an oracle that was a naive sequence predictor (or that was rewarded on results) would give. https://www.lesswrong.com/posts/i2dNFgbjnqZBfeitT/oracles-sequence-predictors-and-self-confirming-predictions

The donor example was to show how such a predictor could end up moving you far in the positive or negative direction. If you were optimising for income rather than accuracy, the choice is obvious.

The £(P±1) is a continuous model of a discontinuous reality. The model has a self-confirming prediction, and it turns out "reality" (the discretised version) has one too. Unless derivatives get extremely high, a self-confirming prediction in a continuous model implies a close-to-self-confirming prediction in the discretised model.
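As a hedged sketch of that last implication (the linear interpolation and the bisection search are my own choices, not from the post): a continuous model that agrees with the £(P±1) rule at whole pounds has an exact self-confirming prediction at £999,999.50, and the whole-pound predictions on either side of it are off by only £1, i.e. close-to-self-confirming in the discretised model.

```python
def discrete_outcome(p: int) -> int:
    # The discretised "reality": the donation jumps by +/- 1 pound around a million.
    return p + 1 if p < 1_000_000 else p - 1

def continuous_outcome(x: float) -> float:
    # Continuous model: linear interpolation of the discrete rule between whole pounds.
    lo = int(x)
    f_lo, f_hi = discrete_outcome(lo), discrete_outcome(lo + 1)
    return f_lo + (f_hi - f_lo) * (x - lo)

# Bisection for a self-confirming prediction (fixed point) near the discontinuity.
a, b = 999_999.0, 1_000_000.0            # outcome minus prediction is +1 at a, -1 at b
for _ in range(50):
    m = (a + b) / 2
    if continuous_outcome(m) - m > 0:
        a = m
    else:
        b = m
print((a + b) / 2)                       # ~999999.5: exactly self-confirming in the model

# The nearest whole-pound predictions are only close-to-self-confirming in "reality":
for p in (999_999, 1_000_000):
    print(p, abs(discrete_outcome(p) - p))    # off by exactly 1 pound
```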

Oracles, sequence predictors, and self-confirming predictions

2019-05-03T14:09:31.702Z · score: 21 (7 votes)
Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-03T11:40:37.349Z · score: 4 (3 votes) · LW · GW

It is still strange to see a game with only one round and no collusion land pretty close to the optimal, while its repeated version (the dollar auction) seems to deviate badly from the Pareto outcome.

It is a bit strange. It seems this is because in the dollar auction, you can always make your position slightly better unilaterally, in a way that will make it worse once the other player reacts. Iterate enough, and all value is destroyed. But in a one-round game, you can't slide down that path, so you pick by looking at the overall picture.
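A toy simulation of that escalation dynamic (the 5-cent increments and the stopping cap are arbitrary choices of mine): each raise is a genuine unilateral improvement given the other player's current bid, yet the bids sail far past the value of the prize.

```python
# Toy dollar auction: both bidders pay their bid, the higher bid wins the $1.
# At each step the losing bidder raises, because that single move improves
# their payoff given the other bidder's current bid -- yet iterating the moves
# destroys value for both.
STEP, PRIZE = 0.05, 1.00
bids = [0.05, 0.10]           # player 0 and player 1's current bids
loser = 0                     # player 0 is currently being outbid

while True:
    current_payoff = -bids[loser]            # losing: pay your bid, win nothing
    raised_bid = bids[1 - loser] + STEP
    payoff_if_raise = PRIZE - raised_bid     # winning: pay the new bid, get the prize
    if payoff_if_raise <= current_payoff or raised_bid > 5.00:
        break                                # stop once raising no longer helps (or cap the toy)
    bids[loser] = raised_bid                 # the unilateral "improvement"...
    loser = 1 - loser                        # ...which the other player then answers
print(bids)   # both bids end up far above the $1 prize: all value destroyed
```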

Self-confirming predictions can be arbitrarily bad

2019-05-03T11:34:47.441Z · score: 41 (13 votes)
Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-03T09:19:12.380Z · score: 2 (1 votes) · LW · GW

There is no mixed Nash equilibrium in the TD example above (see the proof above).

Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-02T13:08:30.052Z · score: 6 (4 votes) · LW · GW

you'd expect it to end as Prisoner's Dilemma, no?

I think a key difference is that in PD, (Defect, Cooperate) is a Pareto outcome (you can't make it better for the cooperator without making it worse for the defector), while (0, 0) is far from the Pareto boundary. So people can clearly see that naming numbers around 0 is a massive loss, and they focus on avoiding that loss rather than optimising their game against the other player.
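For concreteness, here is a check on the standard Traveller's Dilemma (claims in the range 2–100 with a bonus/penalty of 2; this may differ in details from the post's variant): the unique pure-strategy Nash equilibrium sits at the very bottom of the range, even though both players naming the maximum would leave each of them vastly better off.

```python
from itertools import product

LOW, HIGH, BONUS = 2, 100, 2

def payoff(mine: int, theirs: int) -> int:
    """Standard Traveller's Dilemma payoff for the player naming `mine`."""
    if mine < theirs:
        return mine + BONUS      # undercutting earns a bonus
    if mine > theirs:
        return theirs - BONUS    # overbidding pays a penalty
    return mine

def best_responses(theirs):
    scores = {mine: payoff(mine, theirs) for mine in range(LOW, HIGH + 1)}
    best = max(scores.values())
    return {mine for mine, s in scores.items() if s == best}

# A pure Nash equilibrium: each claim is a best response to the other.
nash = [(a, b) for a, b in product(range(LOW, HIGH + 1), repeat=2)
        if a in best_responses(b) and b in best_responses(a)]
print(nash)                       # [(2, 2)]: both players get 2
print(payoff(HIGH, HIGH))         # 100: the cooperative outcome both would prefer
```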

Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-02T10:30:07.082Z · score: 4 (2 votes) · LW · GW

I think another key difference between PD and traveller/AFL is that in the PD variant, (n2, n1) is a Pareto outcome - you can't improve the first player's outcome without making the second one worse off. However, in the other problem, (0,0) is very very far from being Pareto.

Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-02T07:39:36.869Z · score: 9 (4 votes) · LW · GW

That is true, but I meant it as "as close as you want to the worst possible outcome, and as far as you want from the best mutual outcome".

Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-02T06:53:54.136Z · score: 2 (1 votes) · LW · GW

That's useful; I added a link to the other game in the main text (as far as I can tell, I came up with this independently).

Comment by stuart_armstrong on Nash equilibriums can be arbitrarily bad · 2019-05-01T22:01:44.898Z · score: 4 (2 votes) · LW · GW

Cool! I prefer my example, though; it feels more intuitive (and has a single equilibrium).

Nash equilibriums can be arbitrarily bad

2019-05-01T14:58:21.765Z · score: 33 (14 votes)
Comment by stuart_armstrong on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-10T07:32:25.855Z · score: 2 (1 votes) · LW · GW

"I'm not sure which of these approaches will work out so let's research them simultaneously and then implement whichever one seems most promising later"

That, plus "this approach has progressed as far as it can, there remains uncertainty/fuzziness, so we can now choose to accept the known loss to avoid the likely failure of maximising our current candidate without fuzziness". This is especially the case if, like me, you feel that human values have diminishing marginal utility in resources. Even without that, the fuzziness can be an acceptable cost, if we assign a high probability to losses from Goodhart-like effects when we maximise the wrong thing without fuzziness.

There's one other aspect I should emphasise: AIs drawing boundaries we have no clue about (as they do now between pictures of cats and dogs). When an AI draws boundaries between acceptable and unacceptable worlds, we can't describe this as reducing human uncertainty: the AI is constructing its own concepts, finding patterns in human examples. Trying to make those boundaries work well is, to my eyes, not well described in any Bayesian framework.

It's very possible that we might get to a point where we could say "we expect that this AI will synthesise a good measure of human preferences. The measure itself has some light fuzziness/uncertainty, but our knowledge of it has a lot of uncertainty".

So I'm not sure that uncertainty or even fuzziness are necessarily the best ways of describing this.
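A toy sketch of the trade-off described at the start of this comment (the functions and numbers are entirely my own construction, not the post's formalism): the "true" values have diminishing returns, the synthesised proxy is a fuzzy set of candidate weightings, and maximising the worst case over that set accepts a small known loss while avoiding the large Goodhart-style loss from maximising one narrow candidate.

```python
import math

BUDGET = 10.0
splits = [i / 100 * BUDGET for i in range(101)]    # how much of the budget goes to A

def true_value(a: float) -> float:
    # Stand-in for actual human values: diminishing marginal returns in both A and B,
    # with a trade-off that no single linear candidate captures exactly.
    return math.sqrt(a) + 0.8 * math.sqrt(BUDGET - a)

def candidate(w: float):
    # One candidate synthesised utility: a linear weighting of A against B.
    return lambda a: w * a + (1 - w) * (BUDGET - a)

fuzzy_set = [candidate(w) for w in (0.3, 0.4, 0.5, 0.6, 0.7)]

narrow_best = max(splits, key=candidate(0.7))                          # maximise one narrow candidate
robust_best = max(splits, key=lambda a: min(u(a) for u in fuzzy_set))  # worst case over the fuzzy set

print(round(true_value(narrow_best), 2))             # 3.16 -- the Goodhart-style disaster
print(round(true_value(robust_best), 2))             # 4.02 -- a small known loss...
print(round(max(true_value(a) for a in splits), 2))  # 4.05 -- ...relative to the true optimum
```

The point is only the shape of the comparison: hard-maximising one candidate throws away everything the other candidates cared about, while respecting the fuzziness costs only a little by the true values' lights.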

Comment by stuart_armstrong on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-09T15:14:24.650Z · score: 2 (1 votes) · LW · GW

In this approach, does the uncertainty/fuzziness ever get resolved (if so how?), or is the AI stuck with a "fuzzy" utility function forever? If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don't) from the perspective of our real values?

The fuzziness will never get fully resolved. This approach is to deal with Goodhart-style problems without optimising leading to disaster; I'm working on other approaches that could allow the synthesis of the actual values.

Comment by stuart_armstrong on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-08T16:30:43.199Z · score: 2 (1 votes) · LW · GW

I did read it. The main difference is that I don't assume that humans know their utility function, or that "observing it over time" will converge on a single point. The AI is expected to draw boundaries between concepts; boundaries that humans don't know and can't know (just as image recognition neural nets do).

What I term uncertainty might better be phrased as "known (or learnt) fuzziness of a concept or statement". It differs from uncertainty in the Jessica sense in that knowing absolutely everything about the universe, about logic, and about human brains, doesn't resolve it.

Comment by stuart_armstrong on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-04T08:54:52.293Z · score: 2 (1 votes) · LW · GW

This seems interesting but I don't really understand what you're proposing.

The last section is more aspirational and underdeveloped; the main point is noticing that Goodhart can be defeated in certain circumstances, and speculating how that could be extended. I'll get back to this at a later date (or others can work on it!)

Also, Jessica Taylor's A first look at the hard problem of corrigibility went over a few different ways that an AI could formalize the fact that humans are uncertain about their utility functions, and concluded that none of them would solve the problem of corrigibility.

This is not a design for corrigible agents (if anything, it's more a design for low impact agents). The aim of this approach is not to have an AI that puts together the best utility function, but one that doesn't go maximising a narrow one, has wide enough uncertainty to include a decent utility function among the possibilities, and doesn't behave too badly.

Comment by stuart_armstrong on Defeating Goodhart and the "closest unblocked strategy" problem · 2019-04-04T08:26:56.070Z · score: 2 (1 votes) · LW · GW

Yep, those could work as well. I'm most worried about human errors/uncertainties on distribution shifts (ie we write out a way of dealing with distribution shifts, but don't correctly include our uncertainty about the writeup).

Defeating Goodhart and the "closest unblocked strategy" problem

2019-04-03T14:46:41.936Z · score: 25 (9 votes)

Learning "known" information when the information is not actually known

2019-04-01T17:56:17.719Z · score: 13 (4 votes)

Relative exchange rate between preferences

2019-03-29T11:46:35.285Z · score: 12 (3 votes)

Being wrong in ethics

2019-03-29T11:28:55.436Z · score: 22 (5 votes)

Models of preferences in distant situations

2019-03-29T10:42:14.633Z · score: 11 (2 votes)
Comment by stuart_armstrong on The low cost of human preference incoherence · 2019-03-28T03:01:06.562Z · score: 2 (1 votes) · LW · GW

I feel it's more like "you want something sweet, except between 2 and 3 pm". In that case, one solution is for the shop to only stock sweet things, and not let you in between 2 and 3 (or just ignore you during that time).

The low cost of human preference incoherence

2019-03-27T11:58:14.845Z · score: 19 (7 votes)
Comment by stuart_armstrong on The low cost of human preference incoherence · 2019-03-27T11:56:27.456Z · score: 4 (2 votes) · LW · GW

Note that it "routes around" those preferences, but it also satisfies them; ultimately, the AI should prevent humans from dying of terrorist attacks and from other causes.

"Moral" as a preference label

2019-03-26T10:30:17.102Z · score: 14 (4 votes)
Comment by stuart_armstrong on Simplified preferences needed; simplified preferences sufficient · 2019-03-21T10:38:48.173Z · score: 3 (2 votes) · LW · GW

The counter-examples are of that type because the examples are often of that type - presented formally, so vulnerable to a formal solution.

If you're saying that " utility on something like turning on a yellow light" is not a reasonable utility function, then I agree with you, and that's the very point of this post - we need to define what a "reasonable" utility function is, at least to some extent ("partial preferences..."), to get anywhere with these ideas.

Comment by stuart_armstrong on Simplified preferences needed; simplified preferences sufficient · 2019-03-20T08:21:34.698Z · score: 2 (1 votes) · LW · GW

I'm trying to figure out why we have this difference.

My judgements come mainly from trying to make corrigibility/impact measures etc... work, and having similar problems in all cases.

Comment by stuart_armstrong on Partial preferences and models · 2019-03-20T08:20:11.715Z · score: 2 (1 votes) · LW · GW

Those are very normal preferences; they refer to states of the outside world, and we can estimate whether that state is met or not. Just because it's potentially manipulative, doesn't mean it isn't well-defined.

Partial preferences and models

2019-03-19T16:29:23.162Z · score: 13 (3 votes)
Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-19T13:11:39.379Z · score: 4 (2 votes) · LW · GW

Gödel's theorem: there are true propositions which can't be proved by AI (and an explanation could be counted as a type of proof).

That's what I'm fearing, so I'm trying to see if the concept makes sense.

Comment by stuart_armstrong on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-19T10:32:22.950Z · score: 3 (2 votes) · LW · GW

The min-max normalisation of https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison can be seen as the formalisation of normalising on effort (it normalises on what you could achieve if you dedicated yourself entirely to one goal).
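A rough toy rendering of that kind of normalisation (the numbers are invented, and treating the "min" end as the worst achievable expected value is my reading; see the linked post for the actual definitions): each candidate utility is rescaled so that 0 is the worst you could achieve for it and 1 is what full dedication to it would achieve, and only then are the candidates compared or combined.

```python
# Expected value of each candidate utility under each available policy
# (toy numbers; rows are policies, keys are utility candidates).
policies = {
    "work_on_U1": {"U1": 10.0, "U2": 1.0},
    "work_on_U2": {"U1": 2.0,  "U2": 50.0},
    "compromise": {"U1": 7.0,  "U2": 30.0},
}

def minmax_normalise(name: str):
    values = [p[name] for p in policies.values()]
    lo, hi = min(values), max(values)   # worst and best you could do for this goal
    return lambda v: (v - lo) / (hi - lo)

norms = {name: minmax_normalise(name) for name in ("U1", "U2")}

# After normalisation, "dedicating yourself entirely" to either goal scores 1 for it,
# so the raw scale of U2 (ten times larger) no longer dominates the comparison.
for policy, vals in policies.items():
    total = sum(norms[u](vals[u]) for u in norms)
    print(policy, round(total, 3))      # the compromise policy now comes out on top
```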

Comment by stuart_armstrong on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-19T10:30:12.537Z · score: 4 (2 votes) · LW · GW

Indeed.

We tried to develop a whole theory to deal with these questions, didn't find any nice answer: https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-19T10:25:50.839Z · score: 2 (1 votes) · LW · GW

We could also be living in such a hellworld now, but not know it.

Indeed. But you've just described it to us ^_^

What I'm mainly asking is "if we end up in some world, and no honest AI can describe to us how this might be a hellworld, is it automatically not a hellworld?"

Comment by stuart_armstrong on A theory of human values · 2019-03-18T16:52:58.108Z · score: 2 (1 votes) · LW · GW

There is one way of doing metaphilosophy along these lines, which is to "run (simulated) William MacAskill until he thinks he's found a good metaphilosophy" or to "find a description of metaphilosophy to which WA would say 'yes'."

But what the system I've sketched would most likely do is come up with something to which WA would say "yes, I can kinda see why that was built, but it doesn't really fit together as I'd like and has some ad hoc and object-level features". That's the "adequate" part of the process.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-18T16:15:31.667Z · score: 2 (1 votes) · LW · GW

The question of this post is whether there exist indescribable hellworlds - worlds that are bad, but where it cannot be explained to humans how/why they are bad.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-18T09:27:38.353Z · score: 2 (1 votes) · LW · GW

But you seem to have described these hells quite well - enough for us to clearly rule them out.

Comment by stuart_armstrong on A theory of human values · 2019-03-15T13:06:39.909Z · score: 4 (2 votes) · LW · GW

In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human's incomplete and inconsistent preferences and meta-preferences

Indeed that's what I'm trying to do. The reasons are that utility functions are often more portable (easier to extend to new situations) and more stable (less likely to change under self-improvement).

Comment by stuart_armstrong on A theory of human values · 2019-03-14T16:30:24.473Z · score: 2 (1 votes) · LW · GW

I would less concerned if this was used on someone like William MacAskill [...] but a lot of humans have seemingly terrible meta-preferences

In those cases, I'd give more weight to the preferences than the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I'm working on writing up right now (partially thanks to your very comment here, thanks!)

or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

Moral realism is ill-defined, and some allow that humans and AI would have different types of morally true facts. So it's not too much of a stretch to assume that different humans might have different morally true facts from each other, so I don't see this as being necessarily a problem.

Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So "one single universal true morality" does not necessarily contradict "contingent choices in figuring out your own preferences".

Comment by stuart_armstrong on A theory of human values · 2019-03-14T16:08:42.551Z · score: 2 (1 votes) · LW · GW

My aim is to find a decent synthesis of human preferences. If someone has a specific metaethics and compelling reasons why we should follow that metaethics, I'd then defer to that. The fact I'm focusing my research on the synthesis is because I find that possibility very unlikely (the more work I do, the less coherent moral realism seems to become).

But, as I said, I'm not opposed to moral realism in principle. Looking over your post, I would expect that if 1, 4, 5, or 6 were true, that would be reflected in the synthesis process. Depending on how I interpret it, 2 would be partially reflected in the synthesis process, and 3 maybe very partially.

If there were strong evidence for 2 or 3, then we could either a) include them in the synthesis process, or b) tell humans about them, which would include them in the synthesis process indirectly.

Since I see the synthesis process as aiming for an adequate outcome, rather than an optimal one (which I don't think exists), I'm actually ok with adding in some moral-realism or other assumptions, as I see this as making a small shift among adequate outcomes.

As you can see in this post, I'm also ok with some extra assumptions in how we combine individual preferences.

There's also some moral-realism-for-humans variants, which assume that there are some moral facts which are true for humans specifically, but not for agents in general; this would be like saying there is a unique synthesis process. For those variants, and some other moral realist claims, I expect the process of figuring out partial preferences and synthesising them, will be useful building blocks.

But mainly, my attitude to most moral realist arguments, is "define your terms and start proving your claims". I'd be willing to take part in such a project, if it seemed realistically likely to succeed.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

You may not be the most typical of persons :-) What I mean is that if we cut people's lifetimes by a third, or had a vicious totalitarian takeover, or made everyone live in total poverty, then people would find either of these outcomes quite bad, even if we increased lifetimes/democracy/GDP to compensate for the loss along one axis.

Combining individual preference utility functions

2019-03-14T14:14:38.772Z · score: 12 (4 votes)

Mysteries, identity, and preferences over non-rewards

2019-03-14T13:52:40.170Z · score: 14 (4 votes)
Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-14T13:01:52.751Z · score: 2 (1 votes) · LW · GW

I think that's the style of repugnance that'd be a practical danger: vast amounts of happy-but-simple minds.

Yep, that does seem a risk. I think that's what the "muzak and potatoes" formulation of repugnance is about.

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-14T12:57:30.806Z · score: 2 (1 votes) · LW · GW

Hum, not entirely sure what you're getting at...

I'd say that always "looks like ", in the sense that there is a continuity in the overall ; small changes to our knowledge of and make small changes to our estimate of .

I'm not really sure what stronger condition you could want; after all, when , we can always write

as:

  • .

We could equivalently define that way, in fact (it generalises to larger sets of equal utilities).

Would that formulation help?

A theory of human values

2019-03-13T15:22:44.845Z · score: 27 (7 votes)
Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-13T12:49:02.542Z · score: 2 (1 votes) · LW · GW

Is there a natural extension for infinite population? It seems harder than most approaches to adapt.

None of the population ethics have decent extensions to infinite populations. I have a very separate idea for infinite populations here. I suppose the extension of this method to infinite population would use the same method as in that post, but use instead of (where and are the limsup and liminf of utilities, respectively).

I'm always suspicious of schemes that change what they advocate massively based on events a long time ago in a galaxy far, far away - in particular when it can have catastrophic implications. If it turns out there were 3^^^3 Jedi living in a perfect state of bliss, this advocates for preventing any more births now and forever.

You can always zero out those utilities by decree, and only consider utilities that you can change. There are other patches you can apply. By talking this way, I'm revealing the principle I'm most willing to sacrifice: elegance.

Do you know a similar failure case for total utilitarianism? All the sadistic/repugnant/very-repugnant... conclusions seem to be comparing highly undesirable states - not attractor states. If we'd never want world A or B, wouldn't head towards B from A, and wouldn't head towards A from B (since there'd always be some preferable direction), does an A-vs-B comparison actually matter at all?

If A is repugnant and C is now, you can get from C to A by doing improvements (by the standard of total utilitarianism) every step of the way. Similarly, if B is worse than A on that standard, there is a hypothetical path from B to A which is an "improvement" at each step (most population ethics have this property, but not all - you need some form of "continuity").
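A made-up numerical walk along such a path (the specific numbers are arbitrary): each composite step adds lives barely worth living and then equalises with a tiny efficiency gain, so the total always goes up by total-utilitarian lights, while the average slides toward the repugnant range.

```python
# Toy walk from C (roughly "now") towards a repugnant world A, where every
# single step strictly increases total utility (the total-utilitarian standard).
population, avg_utility = 1_000, 50.0

for step in range(5):
    old_total = population * avg_utility
    # First, add many new lives that are barely worth living (total goes up).
    population += 20_000
    total = old_total + 20_000 * 0.1
    # Then equalise everyone's utility, with a tiny efficiency gain (total goes up again).
    total *= 1.01
    avg_utility = total / population
    assert total > old_total
    print(f"step {step}: population {population:>7}, average utility {avg_utility:6.2f}")
```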

It's possible that the distribution of matter in the universe that maximises total utility is a repugnant one; in that case, a sufficiently powerful AI may find a way to reach it.

In general, I'd be interested to know whether you think an objective measure of per-person utility even makes sense.

a) I don't think it makes sense in any strongly principled way, b) I'm trying to build one anyway :-)

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-13T12:21:29.381Z · score: 2 (1 votes) · LW · GW

a can be prioritized over b just by the ordering, even though they have identical utility.

Nope. Their ordering is only arbitrary as long as they have exactly the same utility. As soon as a policy would result in one of them having higher utility than the other, their ordering is no longer arbitrary. So, ignoring other people: if one has strictly higher utility, the corresponding term in the sum is determined; if they are equal, it can be either term (and the two terms are equal).

(I can explain in more detail if that's not enough?)
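A small check of the claim above (the ascending-order convention and the value of γ are my assumptions; see the post for the actual definition): when two people have exactly the same utility, the two possible tie-breaks give them different discount weights but identical totals, so the ordering between them is immaterial; once one of them is strictly higher, the sort fixes who gets which weight.

```python
GAMMA = 0.9   # per-rank discount (an assumed value)

def odu(ranking):
    # Ordered discounted utility of an explicit ranking: the utility at rank i
    # is weighted by GAMMA**i.
    return sum(GAMMA ** rank * u for rank, (_, u) in enumerate(ranking))

others = [("c", 3.0), ("d", 12.0)]

# Equal utilities for a and b: the two tie-breaks swap which of them gets the
# larger weight, but the totals are identical, so the ordering doesn't matter.
tie_break_1 = sorted(others + [("a", 5.0), ("b", 5.0)], key=lambda x: x[1])
tie_break_2 = sorted(others + [("b", 5.0), ("a", 5.0)], key=lambda x: x[1])
assert odu(tie_break_1) == odu(tie_break_2)

# Strictly different utilities: the sort pins down who gets which weight, and
# (with ascending order and GAMMA < 1) the person who is behind gets the
# larger, less-discounted weight.
ranked = sorted(others + [("a", 5.0), ("b", 5.2)], key=lambda x: x[1])
print([name for name, _ in ranked])   # ['c', 'a', 'b', 'd']
```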

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-11T20:47:24.142Z · score: 4 (2 votes) · LW · GW

EDIT: I realised I wasn't clear that the sum was over everyone that ever lived. I've clarified that in the post.

Killing people with future lifetime non-negative utility won't help, as they will still be included in the sum.

Another issue is that two individuals with the same unweighted utility can become victims of the ordering

No. If the two utilities are identical, then either ordering gives the same sum. The ordering between identical utilities won't matter for the total sum, and the individual that is currently behind will be prioritised.

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-11T18:09:27.305Z · score: 3 (2 votes) · LW · GW

EDIT: I realised I wasn't clear that the sum was over everyone that ever lived. I've clarified that in the post.

Actually, it recommends killing only people whose future lifetime utility is about to go negative, as the sum is over all humans in the world, in total.

You're correct on the "not creating" incentives.

Now, this doesn't represent what I'd endorse (I prefer more asymmetry between life and death), but it's good enough as an example for most cases that come up.

Example population ethics: ordered discounted utility

2019-03-11T16:10:43.458Z · score: 14 (5 votes)
Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-03-10T16:39:10.824Z · score: 2 (1 votes) · LW · GW

Added a link to orthonormal's sequence, thanks!

The Boolean was a simplification of "a certain pattern of activation in the neural net", corresponding to seeing purple. The Boolean was tracking the changes in a still-learning neural net caused by seeing purple.

So there are parts of mAIry's brain that are activating as never before, causing her to "learn" what purple looks like. I'm not too clear on how that can be distinguished from a "non-verbal belief": what are the key differentiating features?

Smoothmin and personal identity

2019-03-08T15:16:28.980Z · score: 20 (10 votes)

Preferences in subpieces of hierarchical systems

2019-03-06T15:18:21.003Z · score: 11 (3 votes)

mAIry's room: AI reasoning to solve philosophical problems

2019-03-05T20:24:13.056Z · score: 60 (18 votes)
Comment by stuart_armstrong on Thoughts on Human Models · 2019-03-05T19:42:52.671Z · score: 13 (4 votes) · LW · GW

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

Most of these require at least partial specification of human preferences, hence partial modelling of humans: https://www.lesswrong.com/posts/sEqu6jMgnHG2fvaoQ/partial-preferences-needed-partial-preferences-sufficient

Simplified preferences needed; simplified preferences sufficient

2019-03-05T19:39:55.000Z · score: 29 (11 votes)

Finding the variables

2019-03-04T19:37:54.696Z · score: 28 (6 votes)

Syntax vs semantics: alarm better example than thermostat

2019-03-04T12:43:58.280Z · score: 12 (3 votes)

Decelerating: laser vs gun vs rocket

2019-02-18T23:21:46.294Z · score: 22 (6 votes)

Humans interpreting humans

2019-02-13T19:03:52.067Z · score: 10 (2 votes)

Anchoring vs Taste: a model

2019-02-13T19:03:08.851Z · score: 11 (2 votes)

Would I think for ten thousand years?

2019-02-11T19:37:53.591Z · score: 25 (9 votes)

"Normative assumptions" need not be complex

2019-02-11T19:03:38.493Z · score: 11 (3 votes)

Wireheading is in the eye of the beholder

2019-01-30T18:23:07.143Z · score: 25 (10 votes)

Can there be an indescribable hellworld?

2019-01-29T15:00:54.481Z · score: 19 (8 votes)

How much can value learning be disentangled?

2019-01-29T14:17:00.601Z · score: 22 (6 votes)

A small example of one-step hypotheticals

2019-01-28T16:12:02.722Z · score: 14 (5 votes)

One-step hypothetical preferences

2019-01-23T15:14:52.063Z · score: 8 (5 votes)

Synthesising divergent preferences: an example in population ethics

2019-01-18T14:29:18.805Z · score: 13 (3 votes)

The Very Repugnant Conclusion

2019-01-18T14:26:08.083Z · score: 27 (15 votes)

Anthropics is pretty normal

2019-01-17T13:26:22.929Z · score: 28 (11 votes)

Solving the Doomsday argument

2019-01-17T12:32:23.104Z · score: 12 (6 votes)

The questions and classes of SSA

2019-01-17T11:50:50.828Z · score: 11 (3 votes)

In SIA, reference classes (almost) don't matter

2019-01-17T11:29:26.131Z · score: 17 (6 votes)

Anthropic probabilities: answering different questions

2019-01-14T18:50:56.086Z · score: 19 (7 votes)

Anthropics: Full Non-indexical Conditioning (FNC) is inconsistent

2019-01-14T15:03:04.288Z · score: 22 (5 votes)

Hierarchical system preferences and subagent preferences

2019-01-11T18:47:08.860Z · score: 19 (3 votes)

Latex rendering

2019-01-09T22:32:52.881Z · score: 10 (2 votes)

No surjection onto function space for manifold X

2019-01-09T18:07:26.157Z · score: 22 (6 votes)

What emotions would AIs need to feel?

2019-01-08T15:09:32.424Z · score: 15 (5 votes)

Anthropic probabilities and cost functions

2018-12-21T17:54:20.921Z · score: 16 (5 votes)

Anthropic paradoxes transposed into Anthropic Decision Theory

2018-12-19T18:07:42.251Z · score: 19 (9 votes)

A hundred Shakespeares

2018-12-11T23:11:48.668Z · score: 31 (12 votes)

Bounded rationality abounds in models, not explicitly defined

2018-12-11T19:34:17.476Z · score: 12 (6 votes)

Figuring out what Alice wants: non-human Alice

2018-12-11T19:31:13.830Z · score: 14 (5 votes)