Comment by stuart_armstrong on Partial preferences needed; partial preferences sufficient · 2019-03-21T10:38:48.173Z · score: 3 (2 votes) · LW · GW

The counter-examples are of that type because the examples are often of that type - presented formally, so vulnerable to a formal solution.

If you're saying that " utility on something like turning on a yellow light" is not a reasonable utility function, then I agree with you, and that's the very point of this post - we need to define what a "reasonable" utility function is, at least to some extent ("partial preferences..."), to get anywhere with these ideas.

Comment by stuart_armstrong on Partial preferences needed; partial preferences sufficient · 2019-03-20T08:21:34.698Z · score: 2 (1 votes) · LW · GW

I'm trying to figure out why we have this difference.

My judgements come mainly from trying to make corrigibility/impact measures etc... work, and having similar problems in all cases.

Comment by stuart_armstrong on Partial preferences and models · 2019-03-20T08:20:11.715Z · score: 2 (1 votes) · LW · GW

Those are very normal preferences; they refer to states of the outside world, and we can estimate whether that state is met or not. Just because it's potentially manipulative doesn't mean it isn't well-defined.

Partial preferences and models

2019-03-19T16:29:23.162Z · score: 13 (3 votes)
Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-19T13:11:39.379Z · score: 4 (2 votes) · LW · GW

Gödel's theorem: there are true propositions which can't be proved by the AI (and an explanation could be counted as a type of proof).

That's what I'm fearing, so I'm trying to see if the concept makes sense.

Comment by stuart_armstrong on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-19T10:32:22.950Z · score: 3 (2 votes) · LW · GW

The min-max normalisation of https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison can be seen as the formalisation of normalising on effort (it normalises on what you could achieve if you dedicated yourself entirely to one goal).
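For concreteness, here's a minimal sketch of that idea (my own illustration, not code from the linked post; the policy names, `expected_value` table and numbers are all invented): each utility function gets rescaled by the worst and best expected values that a fully dedicated agent could achieve for it, so its normalised range is [0, 1].

```python
# Minimal sketch (illustrative only): min-max normalisation of utility
# functions over a finite policy set. The policy names and the expected-value
# table below are invented for the example.

def min_max_normalise(utility, policies, expected_value):
    """Rescale `utility` so the best dedicated policy scores 1 and the worst scores 0."""
    values = [expected_value(utility, pi) for pi in policies]
    lo, hi = min(values), max(values)
    return lambda pi: (expected_value(utility, pi) - lo) / (hi - lo)

policies = ["all_in_on_A", "all_in_on_B", "compromise"]
ev_table = {
    ("u_A", "all_in_on_A"): 10, ("u_A", "all_in_on_B"): 0, ("u_A", "compromise"): 6,
    ("u_B", "all_in_on_A"): 0,  ("u_B", "all_in_on_B"): 8, ("u_B", "compromise"): 5,
}
expected_value = lambda u, pi: ev_table[(u, pi)]

norm_A = min_max_normalise("u_A", policies, expected_value)
norm_B = min_max_normalise("u_B", policies, expected_value)
print(norm_A("compromise"), norm_B("compromise"))  # 0.6 0.625
```

The "effort" reading is that the denominator measures what full dedication to that single goal could achieve, relative to the worst case.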

Comment by stuart_armstrong on Is there a difference between uncertainty over your utility function and uncertainty over outcomes? · 2019-03-19T10:30:12.537Z · score: 2 (1 votes) · LW · GW

Indeed.

We tried to develop a whole theory to deal with these questions, but didn't find any nice answer: https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-19T10:25:50.839Z · score: 2 (1 votes) · LW · GW

We could also be living in such a hellworld now, without knowing it.

Indeed. But you've just described it to us ^_^

What I'm mainly asking is "if we end up in some world, and no honest AI can describe to us how this might be a hellworld, is it automatically not a hellworld?"

Comment by stuart_armstrong on A theory of human values · 2019-03-18T16:52:58.108Z · score: 2 (1 votes) · LW · GW

There is one way of doing metaphilosophy along these lines, which is "run (simulated) William MacAskill until he thinks he's found a good metaphilosophy" or "find a description of metaphilosophy to which WA would say 'yes'."

But what the system I've sketched would most likely do is come up with something to which WA would say "yes, I can kinda see why that was built, but it doesn't really fit together as I'd like, and it has some ad hoc and object-level features". That's the "adequate" part of the process.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-18T16:15:31.667Z · score: 2 (1 votes) · LW · GW

The question of this post is whether there exist indescribable hellworlds - worlds that are bad, but where it cannot be explained to humans how/why they are bad.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-03-18T09:27:38.353Z · score: 2 (1 votes) · LW · GW

But you seem to have described these hells quite well - enough for us to clearly rule them out.

Comment by stuart_armstrong on A theory of human values · 2019-03-15T13:06:39.909Z · score: 4 (2 votes) · LW · GW

In this post, Stuart seems to be trying to construct an extrapolated/synthesized (vNM or vNM-like) utility function out of a single human's incomplete and inconsistent preferences and meta-preferences

Indeed that's what I'm trying to do. The reasons are that utility functions are often more portable (easier to extend to new situations) and more stable (less likely to change under self-improvement).

Comment by stuart_armstrong on A theory of human values · 2019-03-14T16:30:24.473Z · score: 2 (1 votes) · LW · GW

I would be less concerned if this was used on someone like William MacAskill [...] but a lot of humans have seemingly terrible meta-preferences

In those cases, I'd give more weight to the preferences than the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I'm working on writing up right now (partially thanks to your very comment here, thanks!)

or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

Moral realism is ill-defined, and some versions allow that humans and AIs would have different types of morally true facts. So it's not too much of a stretch to assume that different humans might have different morally true facts from each other; I don't see this as necessarily being a problem.

Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So "one single universal true morality" does not necessarily contradict "contingent choices in figuring out your own preferences".

Comment by stuart_armstrong on A theory of human values · 2019-03-14T16:08:42.551Z · score: 2 (1 votes) · LW · GW

My aim is to find a decent synthesis of human preferences. If someone has a specific metaethics and compelling reasons why we should follow that metaethics, I'd then defer to that. I'm focusing my research on the synthesis because I find that possibility very unlikely (the more work I do, the less coherent moral realism seems to become).

But, as I said, I'm not opposed to moral realism in principle. Looking over your post, I would expect that if 1, 4, 5, or 6 were true, that would be reflected in the synthesis process. Depending on how I interpret it, 2 would be partially reflected in the synthesis process, and 3 maybe very partially.

If there were strong evidence for 2 or 3, then we could either a) include them in the synthesis process, or b) tell humans about them, which would include them in the synthesis process indirectly.

Since I see the synthesis process as aiming for an adequate outcome, rather than an optimal one (which I don't think exists), I'm actually ok with adding in some moral-realism or other assumptions, as I see this as making a small shift among adequate outcomes.

As you can see in this post, I'm also ok with some extra assumptions in how we combine individual preferences.

There are also some moral-realism-for-humans variants, which assume that there are some moral facts which are true for humans specifically, but not for agents in general; this would be like saying there is a unique synthesis process. For those variants, and some other moral realist claims, I expect the process of figuring out partial preferences and synthesising them will provide useful building blocks.

But mainly, my attitude to most moral realist arguments is "define your terms and start proving your claims". I'd be willing to take part in such a project, if it seemed realistically likely to succeed.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

You may not be the most typical of persons :-) What I mean is that if we cut people's lifetimes to a third, or had a vicious totalitarian takeover, or made everyone live in total poverty, then people would find any of these outcomes quite bad, even if we increased lifetimes/democracy/GDP to compensate for the loss along one axis.

Combining individual preference utility functions

2019-03-14T14:14:38.772Z · score: 11 (3 votes)

Mysteries, identity, and preferences over non-rewards

2019-03-14T13:52:40.170Z · score: 14 (4 votes)
Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-14T13:01:52.751Z · score: 2 (1 votes) · LW · GW

I think that's the style of repugnance that'd be a practical danger: vast amounts of happy-but-simple minds.

Yep, that does seem a risk. I think that's what the "muzak and potatoes" formulation of repugnance is about.

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-14T12:57:30.806Z · score: 2 (1 votes) · LW · GW

Hum, not entirely sure what you're getting at...

I'd say that it always "looks like a discounted sum of the individual utilities", in the sense that the overall total is continuous: small changes to our knowledge of the individual utilities make small changes to our estimate of the total.

I'm not really sure what stronger condition you could want; after all, when two individuals have exactly the same utility, we can always rewrite the two terms they contribute as:

  • the average of their two discount weights, applied to each of them.

We could equivalently define the ordered discounted utility that way, in fact (it generalises to larger sets of equal utilities); there's a sketch below.

Would that formulation help?
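To make that concrete, here's a small sketch (my own illustration; I'm assuming the convention of sorting utilities ascending, so the worst-off get the least-discounted weights, and the discount factor `GAMMA` is made up). Swapping two people with equal utility, or averaging the discount weights across any block of ties, gives the same total:

```python
# Sketch of ordered discounted utility with an assumed convention:
# sort utilities ascending (worst-off first) and weight the n-th lowest
# by GAMMA**n, n starting at 1. GAMMA is a made-up discount factor.

GAMMA = 0.9

def odu(utilities, gamma=GAMMA):
    ordered = sorted(utilities)  # worst-off first, so they get the largest weight
    return sum(gamma ** (i + 1) * u for i, u in enumerate(ordered))

def odu_tie_averaged(utilities, gamma=GAMMA):
    """Equivalent formulation: average the discount weights across each
    block of equal utilities instead of picking an arbitrary order."""
    ordered = sorted(utilities)
    total, i = 0.0, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        avg_weight = sum(gamma ** (k + 1) for k in range(i, j)) / (j - i)
        total += avg_weight * sum(ordered[i:j])
        i = j
    return total

us = [3.0, 5.0, 5.0, 7.0]
print(round(odu(us), 6), round(odu_tie_averaged(us), 6))  # same total
```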

A theory of human values

2019-03-13T15:22:44.845Z · score: 26 (6 votes)
Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-13T12:49:02.542Z · score: 2 (1 votes) · LW · GW

Is there a natural extension for infinite population? It seems harder than most approaches to adapt.

None of the population ethics have decent extensions to infinite populations. I have a very separate idea for infinite populations here. I suppose the extension of this method to infinite populations would use the same method as in that post, but with the utilities replaced by quantities built from their limsup and liminf.

I'm always suspicious of schemes that change what they advocate massively based on events a long time ago in a galaxy far, far away - in particular when it can have catastrophic implications. If it turns out there were 3^^^3 Jedi living in a perfect state of bliss, this advocates for preventing any more births now and forever.

You can always zero out those utilities by decree, and only consider utilities that you can change. There are other patches you can apply. By talking this way, I'm revealing the principle I'm most willing to sacrifice: elegance.

Do you know a similar failure case for total utilitarianism? All the sadistic/repugnant/very-repugnant... conclusions seem to be comparing highly undesirable states - not attractor states. If we'd never want world A or B, wouldn't head towards B from A, and wouldn't head towards A from B (since there'd always be some preferable direction), does an A-vs-B comparison actually matter at all?

If A is repugnant and C is now, you can get from C to A by doing improvements (by the standard of total utilitarianism) every step of the way. Similarly, if B is worse than A on that standard, there is a hypothetical path from B to A which is an "improvement" at each step (most population ethics have this property, but not all - you need some form of "continuity").
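A toy numeric version of that path (numbers invented for illustration, not from the post): each step adds extra people at a small positive utility and then equalises with a small bonus, so total utility strictly increases at every step, yet after enough steps you end up with a vast population of lives barely worth living.

```python
# Toy stepwise path (invented numbers): every step is a strict improvement
# by total utility, yet the endpoint is a vast population of barely-positive lives.

population, avg_utility = 1_000, 100.0

for step in range(30):
    total_before = population * avg_utility
    added, added_avg = population, 1.0      # "mere addition": new people, barely happy
    new_population = population + added
    # equalise everyone, with a small efficiency bonus per person
    new_avg = (population * avg_utility + added * added_avg) / new_population + 0.5
    assert new_population * new_avg > total_before   # total utility went up
    population, avg_utility = new_population, new_avg

print(population, round(avg_utility, 2))  # enormous population, average utility ~2
```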

It's possible that the total-ut-maximising distribution of matter in the universe is a repugnant one; in that case, a sufficiently powerful AI may find a way to reach it.

In general, I'd be interested to know whether you think an objective measure of per-person utility even makes sense.

a) I don't think it makes sense in any strongly principled way, b) I'm trying to build one anyway :-)

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-13T12:21:29.381Z · score: 2 (1 votes) · LW · GW

a can be prioritized over b just by the ordering, even though they have identical utility.

Nope. Their ordering is only arbitrary as long as they have exactly the same utility. As soon as a policy would result in one of them having higher utility than the other, their ordering is no longer arbitrary: ignoring other people, whichever of the two ends up with the lower utility takes the less-discounted slot in the sum. Only when their utilities are exactly equal can either take either slot, and then the two possible terms are equal anyway.

(I can explain in more detail if that's not enough?)

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-11T20:47:24.142Z · score: 4 (2 votes) · LW · GW

EDIT: I realised I wasn't clear that the sum was over everyone that ever lived. I've clarified that in the post.

Killing people whose future lifetime utility is non-negative won't help, as they will still be included in the sum.

Another issue is that two individuals with the same unweighted utility can become victims of the ordering

No. If two individuals have identical utilities, then swapping their positions in the ordering leaves the sum unchanged. The ordering between identical utilities won't matter for the total sum, and the individual that is currently behind will be prioritised.

Comment by stuart_armstrong on Example population ethics: ordered discounted utility · 2019-03-11T18:09:27.305Z · score: 3 (2 votes) · LW · GW

EDIT: I realised I wasn't clear that the sum was over everyone that ever lived. I've clarified that in the post.

Actually, it recommends killing only people whose future lifetime utility is about to go negative, as the sum is over everyone who has ever lived.

You're correct on the "not creating" incentives.

Now, this doesn't represent what I'd endorse (I prefer more asymmetry between life and death), but it's good enough as an example for most cases that come up.

Example population ethics: ordered discounted utility

2019-03-11T16:10:43.458Z · score: 14 (5 votes)
Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2019-03-10T16:39:10.824Z · score: 2 (1 votes) · LW · GW

Added a link to orthonormal's sequence, thanks!

The Boolean was a simplification of "a certain pattern of activation in the neural net", corresponding to seeing purple. The Boolean was tracking the changes in a still-learning neural net caused by seeing purple.

So there are parts of mAIry's brain that are activating as never before, causing her to "learn" what purple looks like. I'm not too clear on how that can be distinguished from a "non-verbal belief": what are the key differentiating features?

Smoothmin and personal identity

2019-03-08T15:16:28.980Z · score: 20 (10 votes)

Preferences in subpieces of hierarchical systems

2019-03-06T15:18:21.003Z · score: 11 (3 votes)

mAIry's room: AI reasoning to solve philosophical problems

2019-03-05T20:24:13.056Z · score: 34 (13 votes)
Comment by stuart_armstrong on Thoughts on Human Models · 2019-03-05T19:42:52.671Z · score: 6 (3 votes) · LW · GW

Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment.

Most of these require at least partial specification of human preferences, hence partial modelling of humans: https://www.lesswrong.com/posts/sEqu6jMgnHG2fvaoQ/partial-preferences-needed-partial-preferences-sufficient

Partial preferences needed; partial preferences sufficient

2019-03-05T19:39:55.000Z · score: 27 (9 votes)
Comment by stuart_armstrong on Bridging syntax and semantics, empirically · 2019-03-04T19:50:08.470Z · score: 2 (1 votes) · LW · GW

The finding variables post is now up: https://www.lesswrong.com/posts/pHHhyZX5zwvwNqDXm/finding-the-variables

Finding the variables

2019-03-04T19:37:54.696Z · score: 28 (6 votes)

Syntax vs semantics: alarm better example than thermostat

2019-03-04T12:43:58.280Z · score: 12 (3 votes)
Comment by stuart_armstrong on Decelerating: laser vs gun vs rocket · 2019-02-19T19:20:24.515Z · score: 2 (1 votes) · LW · GW

Thanks!

Comment by stuart_armstrong on Decelerating: laser vs gun vs rocket · 2019-02-19T18:41:14.090Z · score: 3 (2 votes) · LW · GW

Hum, what does this gain over sending out all the probes in one clump from the start?

Decelerating: laser vs gun vs rocket

2019-02-18T23:21:46.294Z · score: 22 (6 votes)
Comment by stuart_armstrong on Why we need a *theory* of human values · 2019-02-18T18:03:46.789Z · score: 2 (1 votes) · LW · GW

We have a much clearer understanding of the pressures we are under now than of the pressures simulated versions of ourselves would be under in the future. Also, we agree much more strongly with the values of our current selves than with the values of possible simulated future selves.

Consequently, we should try and solve early the problems with value alignment, and punt technical problems to our future simulated selves.

How are we currently in a better position to influence the outcome?

It's not particularly a question of influencing the outcome, but of reaching the right solution. It would be a tragedy if our future selves had great influence, but pernicious values.

Comment by stuart_armstrong on Alignment Newsletter #45 · 2019-02-14T12:09:58.671Z · score: 20 (1 votes) · LW · GW

Just want to thank you for doing these newsletters...

Humans interpreting humans

2019-02-13T19:03:52.067Z · score: 10 (2 votes)

Anchoring vs Taste: a model

2019-02-13T19:03:08.851Z · score: 11 (2 votes)
Comment by stuart_armstrong on Would I think for ten thousand years? · 2019-02-13T14:07:36.126Z · score: 7 (2 votes) · LW · GW

I don't know which problems/systems you're referring to. Maybe you could cite these in the post to give more motivation?

The main one is when I realised the problems with CEV: https://www.lesswrong.com/posts/vgFvnr7FefZ3s3tHp/mahatma-armstrong-ceved-to-death

The others are mainly oral, with people coming up with plans that involve simulating humans for long periods of time, me doing the equivalent of saying "have you considered value drift?", and (often) the other person's reaction revealing that no, they had not considered value drift.

Because the difference is large between what the setup will be in practice, and what current research is in practice.

What are the most important differences that you foresee?

The most important differences I foresee are the unforeseen ones :-) I mean that seriously, because anything that is easy to foresee will possibly be patched before implementation.

But if we look at how research happens nowadays, it has a variety of different approaches and institutional cultures, certain levels of feedback both from within the AI safety community and the surrounding world, grounding our morality and keeping us connected to the flow of culture (such as it is).

Most of the simulation ideas do away with that. If someone suggested that the best idea for AI safety would be to lock up AI safety researchers in an isolated internet-free house for ten years and see what they came up with, we'd be all over the flaws in this plan (and not just the opportunity costs). But replace that physical, grounded idea with a similar one that involves "simulation", and suddenly people flip into far mode and are more willing to accept it. In practice, a simulation is likely to be far more alien and alienating than just locking people up in a house. We have certain levels of control in a simulation that we wouldn't have in reality, but even that could hurt - I'm not sure how I would react if I knew my mind and emotions and state of tiredness were open to manipulation.

So what I'm mainly trying to say is that using simulations (or predictions about simulations) to do safety work is a difficult and subtle project, and needs to be thoroughly planned out with, at minimum, a lot of psychologists and some anthropologists. I think it can be done, but not glibly and not easily.

Comment by stuart_armstrong on Would I think for ten thousand years? · 2019-02-12T11:02:42.836Z · score: 2 (1 votes) · LW · GW

Also, on a more minor note: if I try to preserve myself from value drift using only the resources I'd have in the simulation, I expect to fail. Social dynamics might work though, so we do need to think about those.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-02-12T10:59:20.880Z · score: 7 (2 votes) · LW · GW

That they make some sensible points, but that they're wrong when they push them too far (and that they are mixing factual truths with preferences a lot). Christians do have their own "truths", if we interpret these truths as values, which is what they generally are. "It is a sin to engage in sex before marriage" vs "(some) sex can lead to pregnancy". If we call both of these "truths", then we have a confusion.

Comment by stuart_armstrong on Would I think for ten thousand years? · 2019-02-12T10:54:38.624Z · score: 10 (3 votes) · LW · GW

Because I've already found problems with these systems in the past few years - problems that other people did not expect to be there. If one of them had been put into such a setup back then, I expect that it would have failed. Sure, if current me was put in the system, maybe I could find a few more problems and patch them, because I expect to find them.

But I wouldn't trust many others, and I barely trust myself. Because the difference is large between what the setup will be in practice, and what current research is in practice. The more we can solve these issues ahead of time, the more we can delegate.

Would I think for ten thousand years?

2019-02-11T19:37:53.591Z · score: 25 (9 votes)

"Normative assumptions" need not be complex

2019-02-11T19:03:38.493Z · score: 11 (3 votes)
Comment by stuart_armstrong on Why we need a *theory* of human values · 2019-02-11T13:15:59.536Z · score: 9 (3 votes) · LW · GW

I think that a model of an individual's preferences is likely to be better represented by taking multiple approaches, where each fails differently.

I agree. But what counts as a failure? Unless we have a theory of what we're trying to define, we can't define failure beyond our own vague intuitions. But once we have a better theory, defining failure becomes a lot easier.

Comment by stuart_armstrong on Why we need a *theory* of human values · 2019-02-10T09:52:34.222Z · score: 3 (2 votes) · LW · GW

Cheers!

Comment by stuart_armstrong on Why is this utilitarian calculus wrong? Or is it? · 2019-02-04T08:21:14.181Z · score: 1 (2 votes) · LW · GW

Even without self-flagellation, if your marginal utility per $ is much lower, and you don't use your own surplus in a fungible way to donate/buy more, donating can be much higher impact than trade. First of all, you have more freedom to target donations than trades; and even if we ignore that, capturing all your money is better for the producer than just capturing the producer surplus (and the marginal utility of the consumer surplus to you is sufficiently low that adding it on doesn't bring the surpluses to a higher number).
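A toy version of that comparison, with invented numbers: a $10 trade that gives the producer $2 of surplus and you $3 of consumer surplus, versus donating the $10 outright, when your marginal utility per dollar is a tenth of the producer's.

```python
# Toy comparison (invented numbers): a $10 trade vs donating the $10 outright,
# measured in units where the producer's marginal utility per dollar is 1.0.

price = 10.0              # what you spend either way
producer_surplus = 2.0    # producer's gain from the trade
consumer_surplus = 3.0    # your gain from the trade
your_mu_per_dollar = 0.1  # your marginal utility per dollar, much lower than theirs

value_of_trade = producer_surplus * 1.0 + consumer_surplus * your_mu_per_dollar
value_of_donation = price * 1.0   # the producer captures the whole transfer

print(round(value_of_trade, 2), value_of_donation)  # 2.3 vs 10.0
```

The exact numbers don't matter; the point is that when your marginal utility per dollar is low, capturing the whole transfer dominates the sum of the two surpluses.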

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-02-01T13:51:29.291Z · score: 2 (1 votes) · LW · GW

I haven't seen anyone on Less Wrong argue against CEV as a vision for how the future of humanity should be determined.

Well, now you've seen one ^_^ : https://www.lesswrong.com/posts/vgFvnr7FefZ3s3tHp/mahatma-armstrong-ceved-to-death

I've been going on about the problems with CEV (specifically with extrapolation) for years. This post could also be considered a CEV critique: https://www.lesswrong.com/posts/WeAt5TeS8aYc4Cpms/values-determined-by-stopping-properties

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-02-01T13:48:27.256Z · score: 2 (1 votes) · LW · GW

Shame :-(

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-02-01T13:47:55.946Z · score: 4 (2 votes) · LW · GW

My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / utility function learned via ambitious value learning, and restricting explanations/questions/manipulation by that, may work.

Yep. As so often, I think these things are not fully value agnostic, but don't need full human values to be defined.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-02-01T13:45:50.128Z · score: 2 (1 votes) · LW · GW

possibility that explanation can be usefully defined in a way that distinguishes it from manipulation?

I think explanation can be defined (see https://agentfoundations.org/item?id=1249 ). I'm not confident "explanation with no manipulation" can be defined.

Comment by stuart_armstrong on Wireheading is in the eye of the beholder · 2019-02-01T13:43:57.783Z · score: 2 (1 votes) · LW · GW

Mainly agree, but I'll point out that addicts at different moments can prefer not to have heroin - in fact, as an addict of much more minor things (e.g. News), I can testify that I've done things I knew I didn't want to do at every moment of the process (before, during, and after).

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-02-01T13:41:28.836Z · score: 2 (1 votes) · LW · GW

Given some definition of corrigibility, yes.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-31T09:16:50.701Z · score: 4 (2 votes) · LW · GW

Humans have beliefs and values twisted together in all kinds of odd ways. In practice, increasing our understanding tends to go along with having a more individualist outlook, a greater power to impact the natural world, less concern about difficult-to-measure issues, and less respect for traditional practices and group identities (and often the creation of new group identities, and sometimes new traditions).

Now, I find those changes to be (generally) positive, and I'd like them to be more common. But these are value changes, and I understand why people with different values could object to them.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-31T09:04:05.184Z · score: 2 (1 votes) · LW · GW

Thanks for writing that post; have you got much in terms of volunteers currently?

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-30T22:12:42.509Z · score: 2 (1 votes) · LW · GW

The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.

Indeed, by "friendly AI" I meant "an AI friendly for us". So yes, I was showing a contrived example of an AI that was friendly, and low impact, from our perspective, but that was not, as you said, universally friendly (or universally low impact).

something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.

In my experience so far, we need to include our values, in part, to define "reasonable" utility functions.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-01-30T21:38:56.309Z · score: 2 (1 votes) · LW · GW

In many areas, we have no terminal values until the problem is presented to us, then we develop terminal values (often dependent on how the problem was phrased) and stick to them. Eg the example with Soviet and American journalists visiting each other's countries.

Comment by stuart_armstrong on Wireheading is in the eye of the beholder · 2019-01-30T20:50:15.341Z · score: 5 (3 votes) · LW · GW

Maybe we can define wireheading as a subset of goodharting, in a way similar to what you're defining.

However, we need the extra assumption that putting the reward on the maximal level is not what we actually desire; the reward function is part of the world, just as the AI is.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-30T20:31:59.835Z · score: 2 (1 votes) · LW · GW

This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible.

? I don't see that. What's the argument?

(If you want to say that we can't define friendly behaviour without using our values, then I would agree ^_^ but I think you're trying to argue something else).

Wireheading is in the eye of the beholder

2019-01-30T18:23:07.143Z · score: 25 (10 votes)
Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-01-30T18:10:46.779Z · score: 4 (2 votes) · LW · GW

Assuming that we can come up with a reasonable definition of suffering

Checking whether there is a large amount of suffering in a deliberately obfuscated world seems hard, or impossible if a superintelligence has done the obfuscating.

Comment by stuart_armstrong on Can there be an indescribable hellworld? · 2019-01-30T18:08:22.951Z · score: 2 (1 votes) · LW · GW

Can you imagine sitting through a ten-year lecture without your values changing? Can you imagine sitting through that lecture without your values changing somewhat in reaction to the content?

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-30T18:06:23.405Z · score: 2 (1 votes) · LW · GW

This is based more on experience than on a full formal argument (yet). Take an AI that, according to our preferences, is low impact and still does stuff. Then there is a utility function for which that "does stuff" is the single worst and highest impact thing the AI could have done (you just trivially define a utility function that only cares about that "stuff").
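A minimal sketch of that construction (my own illustration, with invented names, and assuming a worst-case impact measure over a set of utility functions): an indicator utility that only cares about the AI's particular "stuff" makes that action maximally impactful.

```python
# Sketch of the contrived case: an indicator utility that only cares about the
# AI's particular "stuff" makes that action maximally impactful under a
# worst-case impact measure over arbitrary utility functions. Names invented.

def indicator_utility(target_outcome):
    """A utility function that only cares whether `target_outcome` occurred."""
    return lambda world: 1.0 if world == target_outcome else 0.0

def worst_case_impact(action_outcome, baseline_outcome, utilities):
    """Largest change any utility in the set sees between baseline and action."""
    return max(abs(u(action_outcome) - u(baseline_outcome)) for u in utilities)

baseline = "nothing_happens"
action = "ai_makes_a_cup_of_tea"           # intuitively low-impact "stuff"

benign = [lambda w: 0.5]                   # a utility indifferent to the tea
print(worst_case_impact(action, baseline, benign))                    # 0.0

adversarial = benign + [indicator_utility(action)]
print(worst_case_impact(action, baseline, adversarial))               # 1.0 (maximal)
```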

Now, that's a contrived case, but my experience is that problems like that come up all the time in low impact research, and that we really need to include - explicitly or implicitly - a lot of our values/preferences directly, in order to have something that satisfies low impact.

Comment by stuart_armstrong on How much can value learning be disentangled? · 2019-01-30T18:02:46.537Z · score: 3 (2 votes) · LW · GW

Certain groups (most prominently religious ones) see secular education systems as examples of indoctrination. I'm not saying that it's impossible to distinguish manipulation from coercion, just that we have to use part of our values when making the judgement.

Can there be an indescribable hellworld?

2019-01-29T15:00:54.481Z · score: 18 (7 votes)
Comment by stuart_armstrong on Assuming we've solved X, could we do Y... · 2019-01-29T14:17:41.625Z · score: 3 (2 votes) · LW · GW

Hey there!

I've given a longer answer here: https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled

How much can value learning be disentangled?

2019-01-29T14:17:00.601Z · score: 22 (6 votes)

A small example of one-step hypotheticals

2019-01-28T16:12:02.722Z · score: 14 (5 votes)

One-step hypothetical preferences

2019-01-23T15:14:52.063Z · score: 13 (4 votes)

Synthesising divergent preferences: an example in population ethics

2019-01-18T14:29:18.805Z · score: 13 (3 votes)

The Very Repugnant Conclusion

2019-01-18T14:26:08.083Z · score: 27 (15 votes)

Anthropics is pretty normal

2019-01-17T13:26:22.929Z · score: 28 (11 votes)

Solving the Doomsday argument

2019-01-17T12:32:23.104Z · score: 12 (6 votes)

The questions and classes of SSA

2019-01-17T11:50:50.828Z · score: 11 (3 votes)

In SIA, reference classes (almost) don't matter

2019-01-17T11:29:26.131Z · score: 17 (6 votes)

Anthropic probabilities: answering different questions

2019-01-14T18:50:56.086Z · score: 19 (7 votes)

Anthropics: Full Non-indexical Conditioning (FNC) is inconsistent

2019-01-14T15:03:04.288Z · score: 22 (5 votes)

Hierarchical system preferences and subagent preferences

2019-01-11T18:47:08.860Z · score: 19 (3 votes)

Latex rendering

2019-01-09T22:32:52.881Z · score: 10 (2 votes)

No surjection onto function space for manifold X

2019-01-09T18:07:26.157Z · score: 22 (6 votes)

What emotions would AIs need to feel?

2019-01-08T15:09:32.424Z · score: 15 (5 votes)

Anthropic probabilities and cost functions

2018-12-21T17:54:20.921Z · score: 16 (5 votes)

Anthropic paradoxes transposed into Anthropic Decision Theory

2018-12-19T18:07:42.251Z · score: 19 (9 votes)

A hundred Shakespeares

2018-12-11T23:11:48.668Z · score: 31 (12 votes)

Bounded rationality abounds in models, not explicitly defined

2018-12-11T19:34:17.476Z · score: 12 (6 votes)

Figuring out what Alice wants: non-human Alice

2018-12-11T19:31:13.830Z · score: 12 (4 votes)

Assuming we've solved X, could we do Y...

2018-12-11T18:13:56.021Z · score: 34 (14 votes)

Why we need a *theory* of human values

2018-12-05T16:00:13.711Z · score: 63 (24 votes)

Humans can be assigned any values whatsoever…

2018-11-05T14:26:41.337Z · score: 43 (12 votes)

Standard ML Oracles vs Counterfactual ones

2018-10-10T20:01:13.765Z · score: 15 (5 votes)

Wireheading as a potential problem with the new impact measure

2018-09-25T14:15:37.911Z · score: 25 (8 votes)

Bridging syntax and semantics with Quine's Gavagai

2018-09-24T14:39:55.981Z · score: 20 (7 votes)

Bridging syntax and semantics, empirically

2018-09-19T16:48:32.436Z · score: 25 (7 votes)

Web of connotations: Bleggs, Rubes, thermostats and beliefs

2018-09-19T16:47:39.673Z · score: 20 (9 votes)

Are you in a Boltzmann simulation?

2018-09-13T12:56:08.283Z · score: 19 (10 votes)

Petrov corrigibility

2018-09-11T13:50:51.167Z · score: 21 (9 votes)

Boltzmann brain decision theory

2018-09-11T13:24:30.016Z · score: 10 (5 votes)

Disagreement with Paul: alignment induction

2018-09-10T13:54:09.844Z · score: 33 (12 votes)