## Posts

$100/$50 rewards for good references 2021-12-03T16:55:56.764Z
Morally underdefined situations can be deadly 2021-11-22T14:48:10.819Z
General alignment plus human values, or alignment via human values? 2021-10-22T10:11:38.507Z
Beyond the human training distribution: would the AI CEO create almost-illegal teddies? 2021-10-18T21:10:53.146Z
Classical symbol grounding and causal graphs 2021-10-14T18:04:32.452Z
Preferences from (real and hypothetical) psychology papers 2021-10-06T09:06:08.484Z
Force neural nets to use models, then detect these 2021-10-05T11:31:08.130Z
AI learns betrayal and how to avoid it 2021-09-30T09:39:10.397Z
Sigmoids behaving badly: arXiv paper 2021-09-20T10:29:20.736Z
Immobile AI makes a move: anti-wireheading, ontology change, and model splintering 2021-09-17T15:24:01.880Z
Reward splintering as reverse of interpretability 2021-08-31T22:27:30.625Z
What are biases, anyway? Multiple type signatures 2021-08-31T21:16:59.785Z
What does GPT-3 understand? Symbol grounding and Chinese rooms 2021-08-03T13:14:42.106Z
Reward splintering for AI design 2021-07-21T16:13:17.917Z
Bayesianism versus conservatism versus Goodhart 2021-07-16T23:39:18.059Z
Underlying model of an imperfect morphism 2021-07-16T13:13:10.483Z
Anthropic decision theory for self-locating beliefs 2021-07-12T14:11:40.715Z
Generalised models: imperfect morphisms and informational entropy 2021-07-09T17:35:21.039Z
Practical anthropics summary 2021-07-08T15:10:44.805Z
Anthropics and Fermi: grabby, visible, zoo-keeping, and early aliens 2021-07-08T15:07:30.891Z
The SIA population update can be surprisingly small 2021-07-08T10:45:02.803Z
Anthropics in infinite universes 2021-07-08T06:56:05.666Z
Non-poisonous cake: anthropic updates are normal 2021-06-18T14:51:43.143Z
The reverse Goodhart problem 2021-06-08T15:48:03.041Z
Dangerous optimisation includes variance minimisation 2021-06-08T11:34:04.621Z
The underlying model of a morphism 2021-06-04T22:29:49.635Z
SIA is basically just Bayesian updating on existence 2021-06-04T13:17:20.590Z
The blue-minimising robot and model splintering 2021-05-28T15:09:54.516Z
Human priors, features and models, languages, and Solmonoff induction 2021-05-10T10:55:12.078Z
Anthropics: different probabilities, different questions 2021-05-06T13:14:06.827Z
Consistencies as (meta-)preferences 2021-05-03T15:10:50.841Z
Why unriggable *almost* implies uninfluenceable 2021-04-09T17:07:07.016Z
A possible preference algorithm 2021-04-08T18:25:25.855Z
If you don't design for extrapolation, you'll extrapolate poorly - possibly fatally 2021-04-08T18:10:52.420Z
Which counterfactuals should an AI follow? 2021-04-07T16:47:42.505Z
Toy model of preference, bias, and extra information 2021-03-24T10:14:34.629Z
Preferences and biases, the information argument 2021-03-23T12:44:46.965Z
Why sigmoids are so hard to predict 2021-03-18T18:21:51.203Z
Connecting the good regulator theorem with semantics and symbol grounding 2021-03-04T14:35:40.214Z
Cartesian frames as generalised models 2021-02-16T16:09:20.496Z
Generalised models as a category 2021-02-16T16:08:27.774Z
Counterfactual control incentives 2021-01-21T16:54:59.309Z
Short summary of mAIry's room 2021-01-18T18:11:36.035Z
Syntax, semantics, and symbol grounding, simplified 2020-11-23T16:12:11.678Z
The ethics of AI for the Routledge Encyclopedia of Philosophy 2020-11-18T17:55:49.952Z
Extortion beats brinksmanship, but the audience matters 2020-11-16T21:13:18.822Z
Humans are stunningly rational and stunningly irrational 2020-10-23T14:13:59.956Z
Knowledge, manipulation, and free will 2020-10-13T17:47:12.547Z
Dehumanisation *errors* 2020-09-23T09:51:53.091Z

Comment by Stuart_Armstrong on Nearest unblocked strategy versus learning patches · 2021-11-25T12:17:10.873Z · LW · GW

I agree with you; this is an old post that I don't really agree with any more.

Comment by Stuart_Armstrong on Morally underdefined situations can be deadly · 2021-11-23T13:53:08.898Z · LW · GW

The things avoided seem like they increase, not decrease risk.

Yes, but it might be that the means needed to avoid them - maybe heavy-handed AI interventions? - could be even more dangerous.

Comment by Stuart_Armstrong on Morally underdefined situations can be deadly · 2021-11-23T13:51:36.723Z · LW · GW

The second one (though I think there is some overlap with the first).

Comment by Stuart_Armstrong on Morally underdefined situations can be deadly · 2021-11-23T13:50:02.571Z · LW · GW

It is indeed our moral intuitions that are underdefined, not the states of the universe.

Comment by Stuart_Armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2021-11-22T14:17:09.780Z · LW · GW

You need to add some assumptions to make it work. For example, I believe the following works:

"In second order arithmetic, we can prove that NP1 implies NF, where NP1 is the statement 'there exists no first order proof of the conjecture' and NF is the statement 'the conjecture isn't false'."

Comment by Stuart_Armstrong on Research Agenda v0.9: Synthesising a human's preferences into a utility function · 2021-11-21T14:54:51.584Z · LW · GW

Because our preferences are inconsistent, and if an AI says "your true preferences are ", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are , which are different in subtle ways".

Comment by Stuart_Armstrong on General alignment plus human values, or alignment via human values? · 2021-10-25T15:17:54.739Z · LW · GW

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not in an "on balance, things are ok" sense, but as a genuinely low impact AI that ensures that we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

Comment by Stuart_Armstrong on General alignment plus human values, or alignment via human values? · 2021-10-25T13:48:37.843Z · LW · GW

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

Comment by Stuart_Armstrong on General alignment plus human values, or alignment via human values? · 2021-10-25T09:58:29.426Z · LW · GW

My thought is that when deciding to take a morally neutral act with tradeoffs, the AI needs to be able to balance the positives and negatives to get a reasonably acceptable tradeoff, and hence needs to know both positive and negative human values to achieve that.

Comment by Stuart_Armstrong on General alignment plus human values, or alignment via human values? · 2021-10-22T15:13:12.087Z · LW · GW

I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel such an AI would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Comment by Stuart_Armstrong on Force neural nets to use models, then detect these · 2021-10-19T16:25:26.149Z · LW · GW

Thanks! Lots of useful thoughts here.

Comment by Stuart_Armstrong on AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism · 2021-10-11T09:30:19.641Z · LW · GW

Cool, thanks; already in contact with them.

Comment by Stuart_Armstrong on AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism · 2021-10-08T14:28:23.615Z · LW · GW

Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?

Comment by Stuart_Armstrong on Generalised models as a category · 2021-10-08T13:52:18.163Z · LW · GW

For the moment, I'm going to be trying to resolve practical questions of model splintering, and then I'll see if this formalism turns out to be useful for them.

Comment by Stuart_Armstrong on Force neural nets to use models, then detect these · 2021-10-06T09:18:38.874Z · LW · GW

Vertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way.

Think model-based learning versus Q-learning. Anything that's more Q-learning is not model based.

Comment by Stuart_Armstrong on Force neural nets to use models, then detect these · 2021-10-05T14:00:45.032Z · LW · GW

I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer.

I don't think it has an easy yes or no answer (at least without some thought as to what constitutes a model within the mess of human reasoning) and I'm sure that even if it does, it's not straightforward.

since we probably won't have those kinds of real-time-brain-scanning technologies, right?

One hope would be that, by the time we have those technologies, we'd know what to look for.

Comment by Stuart_Armstrong on Force neural nets to use models, then detect these · 2021-10-05T13:50:48.067Z · LW · GW

Interesting idea. I might use that; thanks!

Comment by Stuart_Armstrong on Beyond fire alarms: freeing the groupstruck · 2021-09-28T17:40:08.284Z · LW · GW

No, it isn't easy to read independently and grasp the argument. A conclusion that also served as a summary would start something like this: "Eliezer used the metaphor of a fire alarm for people realising the AI alignment problem. However, that metaphor is misleading for a number of reasons. First of all..."

Starting with "fear shame" in the very first sentence means it's not a summary conclusion.

Comment by Stuart_Armstrong on Beyond fire alarms: freeing the groupstruck · 2021-09-27T16:41:32.427Z · LW · GW

Could you write a stripped-down version of this, making just the key few points?

Comment by Stuart_Armstrong on Example population ethics: ordered discounted utility · 2021-09-13T11:52:45.815Z · LW · GW

Hey there!

I haven't been working much on population ethics (I'm more wanting to automate the construction of values from human preferences so that an AI could extract a whole messy theory from it).

My main thought on these issues is to set up a stronger divergence between killing someone and not bringing them into existence. For example, we could restrict preference-satisfaction to existing beings (and future existing beings). So if they don't want to be killed, that counts as a negative if we do that, even if we replace them with someone happier.

This has degenerate solutions too - it incentivises producing beings that are very easy to satisfy and that don't mind being killed. But note that "create beings that score max on this utility scale, even if they aren't conscious or human" is a failure mode for average and total utilitarianism as well, so this isn't a new problem.

Comment by Stuart_Armstrong on Resolving human values, completely and adequately · 2021-08-17T23:44:37.616Z · LW · GW

Thanks! Glad you got good stuff out of it.

I won't edit the post, due to markdown and latex issues, but thanks for pointing out the typos.

Comment by Stuart_Armstrong on What does GPT-3 understand? Symbol grounding and Chinese rooms · 2021-08-04T19:51:51.635Z · LW · GW

The multiplication example is good, and I should have thought about it and worked it into the post.

Comment by Stuart_Armstrong on What does GPT-3 understand? Symbol grounding and Chinese rooms · 2021-08-04T07:21:27.479Z · LW · GW

I have only very limited access to GPT-3; it would be interesting if others played around with my instructions, making them easier for humans to follow, while still checking that GPT-3 failed.

Comment by Stuart_Armstrong on Stuart_Armstrong's Shortform · 2021-07-21T10:44:00.569Z · LW · GW

Here are a few examples of model splintering in the past:

1. The concept of honour, which includes concepts such as "nobility of soul, magnanimity, and a scorn of meanness [...] personal integrity [...] reputation [...] fame [...] privileges of rank or birth [...] respect [...] consequence of power [...] chastity". That is a grab-bag of different concepts, but in various times and social situations, "honour" was seen as a single, clear concept.
2. Gender. We're now in a period where people are questioning and redefining gender, but gender has been splintering for a long time. In middle class Victorian England, gender would define so much about a person (dress style, acceptable public attitudes, genitals, right to vote, right to own property if married, whether they would work or not, etc...). In other times (and in other classes of society, and other locations), gender is far less informative.
3. Consider a Croat, communist, Yugoslav nationalist in the 1980s. They would be clear in their identity, which would be just one thing. Then the 1990s come along, and all these aspects come into conflict with each other.

Here are a few that might happen in the future; the first two could result from technological change, while the last could come from social change:

1. A human subspecies created who want to be left alone without interactions with others, but who are lonely and unhappy when solitary. This splinters preferences and happiness (more than they are today), and changes the standard assumptions about personal freedom and
2. A brain, or parts of a human brain, that loop forever through feelings of "I am happy" and "I want this moment to repeat forever". This splinters happiness-and-preferences from identity.
3. We have various ages of consent and responsibility; but, by age 21, most people are taken to be free to make decisions, are held responsible for their actions, and are seen to have a certain level of understanding about their world. With personalised education, varying subcultures, and more precise psychological measurements, we might end up in a world where "maturity" splinters into lots of pieces, with people having different levels of autonomy, responsibility, and freedom in different domains - and these might not be particularly connected with their age.

Comment by Stuart_Armstrong on The topic is not the content · 2021-07-20T08:06:59.141Z · LW · GW

A very good point.

I'd add the caveat that a key issue in a job is not just the content, but who you interact with. Eg a graduate student job in a lab can be very interesting even if the work is mindless, because of the people you get to interact with.

Comment by Stuart_Armstrong on Dangerous optimisation includes variance minimisation · 2021-07-16T18:27:19.372Z · LW · GW

This is a variant of my old question:

• There is a button at your table. If you press it, it will give you absolute power. Do you press it?

Most people answer no. Followed by:

• Hitler is sitting at the same table, and is looking at the button. Now do you press it?

Comment by Stuart_Armstrong on The SIA population update can be surprisingly small · 2021-07-15T09:01:02.573Z · LW · GW

Nope, that's not the model. Your initial expected population is . After the anthropic update, your probabilities of being in the boxes are , , and (roughly , , and ). The expected population, however is . That's an expected population update of 3.27 times.

Note that, in this instance, the expected population update and the probability update are roughly equivalent, but that need not be the case. Eg if your prior odds are about the population being , , or , then the expected population is roughly , the anthropic-updated odds are , and the updated expected population is roughly . So the probability boost to the larger population is roughly (, but the boost to the expected population is roughly .
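The specific figures in the comment above did not survive extraction, so the following is only a minimal sketch with made-up numbers of how the two quantities can come apart. It assumes that the anthropic update reweights each hypothesis in proportion to its population; the function name `sia_population_update` and the example populations and priors are hypothetical, not taken from the post.

```python
def sia_population_update(populations, priors):
    """Reweight each hypothesis by its population, then renormalise.

    Assumption (not from the original comment): the anthropic update
    multiplies each prior by the population under that hypothesis.
    """
    weights = [n * p for n, p in zip(populations, priors)]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # The sum of n * p is exactly the prior expected population.
    prior_mean = total
    posterior_mean = sum(n * q for n, q in zip(populations, posteriors))
    return posteriors, prior_mean, posterior_mean


# Hypothetical hypotheses: populations of 1, 10 and 100, with most
# prior weight on the smallest one.
populations = [1, 10, 100]
priors = [0.9, 0.09, 0.01]

posteriors, prior_mean, posterior_mean = sia_population_update(populations, priors)

# Boost to the probability of the largest population.
probability_boost = posteriors[-1] / priors[-1]
# Boost to the expected population.
population_boost = posterior_mean / prior_mean

print(f"probability boost to largest population: {probability_boost:.2f}")
print(f"expected population boost: {population_boost:.2f}")
```

With these made-up numbers the probability of the largest population rises by a factor of about 36, while the expected population rises by a factor of about 14, so the two boosts need not match.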

Comment by Stuart_Armstrong on The SIA population update can be surprisingly small · 2021-07-13T16:36:25.156Z · LW · GW

Anthropic updates do not increase the probability of life in general; they increase the probability of you existing specifically (which, since you've observed many other humans and heard about a lot more, is roughly the same as the probability of any current human existing), and this might have indirect effects on life in general.

So they do not distinguish between "simple life is very hard, but getting from that to human-level life is very easy" and "simple life is very easy, but getting from that to human-level life is very hard". So panspermia remains at its prior, relative to other theories of the same type (see here).

However, panspermia gets a boost from the universe seeming empty, as some versions of panspermia would make humans unexpectedly early (since panspermia needs more time to get going); this means that these theories avoid the penalty from the universe seeming empty, a much larger effect than the anthropic update (see here).

Comment by Stuart_Armstrong on The SIA population update can be surprisingly small · 2021-07-13T10:37:04.585Z · LW · GW

Yep. Though I've found that, in most situations, the observation "we don't see anyone" has a much stronger effect than the anthropic update. It's not always exactly comparable, as anthropic updates are "multiply by and renormalise", while observing no-one is "multiply by and renormalise" - but generally I find the second effect to be much stronger.

Comment by Stuart_Armstrong on The SIA population update can be surprisingly small · 2021-07-09T16:30:09.595Z · LW · GW

I adapted the presumptuous philosopher for densities, because we'd been using densities in the rest of the post. The argument works for total population as well, moving from an average population of (for some ) to an average population of roughly .

Comment by Stuart_Armstrong on Practical anthropics summary · 2021-07-09T16:26:31.427Z · LW · GW

If there are no exact duplicates, FNC=SIA whatever the reference class is.

Comment by Stuart_Armstrong on Non-poisonous cake: anthropic updates are normal · 2021-06-19T21:43:39.326Z · LW · GW

More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

Comment by Stuart_Armstrong on Non-poisonous cake: anthropic updates are normal · 2021-06-19T14:16:56.705Z · LW · GW

Thanks! The typo is now corrected.

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-14T12:23:14.181Z · LW · GW

Cheers, these are useful classifications.

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-10T11:31:57.229Z · LW · GW

Almost equally hard to define. You just need to define , which, by assumption, is easy.

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-09T10:25:51.288Z · LW · GW

By Goodhart's law, this set has the property that any will with probability 1 be uncorrelated with outside the observed domain.

If we have a collection of variables , and , then is positively correlated in practice with most expressed simply in terms of the variables.

I've seen Goodhart's law as an observation or a fact of human society - you seem to have a mathematical version of it in mind. Is there a reference for that?

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-09T10:17:11.832Z · LW · GW

It seems I didn't articulate my point clearly. What I was saying is that V and V' are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can't be because of the complexity (since the complexity is equal) nor because we are maximising a proxy (because both have the same proxy).

So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It's not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And it might not always be valid (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about them less in some circumstances.

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-09T10:15:08.353Z · LW · GW

It seems I didn't articulate my point clearly. What I was saying is that V and V' are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can't be because of the complexity (since the complexity is equal) nor because we are maximising a proxy (because both have the same proxy).

So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It's not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And it might not always be valid (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about them less in some circumstances.

Comment by Stuart_Armstrong on Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI · 2021-06-09T10:03:08.212Z · LW · GW

Thanks for writing this.

For myself, I know that power dynamics are important, but I've chosen to specialise down on the "solve technical alignment problem towards a single entity" and leave those multi-agent concerns to others (eg the GovAI part of the FHI), except when they ask for advice.

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-08T19:32:52.454Z · LW · GW

V and V' are symmetric; indeed, you can define V as 2U-V'. Given U, they are as well defined as each other.
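The symmetry can be written out explicitly. This is only a restatement of the sentence above, using the same symbols U, V and V':

$$V = 2U - V' \iff V' = 2U - V, \qquad V' - U = -(V - U).$$

So each of V and V' is the mirror image of the other around U, and, given U, specifying either one specifies the other.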

Comment by Stuart_Armstrong on The reverse Goodhart problem · 2021-06-08T17:21:12.142Z · LW · GW

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

Comment by Stuart_Armstrong on SIA is basically just Bayesian updating on existence · 2021-06-07T09:48:19.692Z · LW · GW

Ah, understood. And I think I agree.

Comment by Stuart_Armstrong on SIA is basically just Bayesian updating on existence · 2021-06-06T14:07:12.581Z · LW · GW

SIA is the Bayesian update on knowing your existence (ie if they were always going to ask if dadadarren existed, and get a yes or no answer). The other effects come from issues like "how did they learn of your existence, and what else could they have learnt instead?" This often does change the impact of learning facts, but that's not a specifically anthropic problem.

Comment by Stuart_Armstrong on SIA is basically just Bayesian updating on existence · 2021-06-04T15:35:54.773Z · LW · GW

Depends; when you constructed your priors, did you already take that fact into explicit account? You can "know" things, but not have taken them into account.

Comment by Stuart_Armstrong on Anthropics: different probabilities, different questions · 2021-06-04T11:29:33.735Z · LW · GW

So regardless of how we describe the difference between T1 and T2, SIA will definitely think that T1 is a lot more likely once we start colonising space, if we ever do that.

SIA isn't needed for that; standard probability theory will be enough (as our becoming grabby is evidence that grabbiness is easier than expected, and vice-versa).

I think there's a confusion with SIA and reference classes and so on. If there are no other exact copies of me, then SIA is just standard Bayesian update on the fact that I exist. If theory T_i has prior probability p_i and gives a probability q_i of me existing, then SIA changes its probability to q_i*p_i (and renormalises).

Effects that increase the expected number of other humans, other observers, etc... are indirect consequences of this update. So a theory that says life in general is easy also says that me existing is easy, so gets boosted. But "Earth is special" theories also get boosted: if a theory claims life is very easy but only on Earth-like planets, then it also gets boosted.
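The update described above (multiply each prior p_i by the probability q_i of me existing, then renormalise) can be sketched in a few lines. This is an illustrative sketch only: the function name `sia_update`, the three theories, and all the numbers are hypothetical, chosen to show that a "life is easy everywhere" theory and an "easy only on Earth-like planets" theory get the same boost when they give the same probability of me existing.

```python
def sia_update(priors, existence_probs):
    """Bayesian update on 'I exist': posterior_i is proportional to q_i * p_i."""
    weights = [p * q for p, q in zip(priors, existence_probs)]
    total = sum(weights)
    if total == 0:
        raise ValueError("every theory assigns zero probability to my existence")
    return [w / total for w in weights]


# Hypothetical theories (labels and numbers are illustrative assumptions):
#   T_0: life is hard everywhere
#   T_1: life is easy everywhere
#   T_2: life is easy, but only on Earth-like planets
priors = [0.5, 0.3, 0.2]            # p_i
existence_probs = [0.01, 0.1, 0.1]  # q_i: probability of me existing under T_i

posteriors = sia_update(priors, existence_probs)

for i, (p, post) in enumerate(zip(priors, posteriors)):
    print(f"T_{i}: prior {p:.2f} -> posterior {post:.3f} (ratio {post / p:.2f})")
```

In this sketch T_1 and T_2 are boosted by the same ratio, because the update only looks at q_i and not at how many other observers a theory predicts.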

Comment by Stuart_Armstrong on Anthropics: different probabilities, different questions · 2021-06-04T11:20:27.874Z · LW · GW

In this process, I never have to consider "this awakening" as a member of any reference class. Do you think "keeping the score" this way invalid?

Different ways of keeping the score give different answers. So, no, I don't think that's invalid.

Comment by Stuart_Armstrong on Anthropics: different probabilities, different questions · 2021-06-02T12:41:29.420Z · LW · GW

In the classical sleeping beauty problem, if I guess the coin was tails, I will be correct in 50% of the experiments, and in 67% of my guesses. Whether you score by "experiments" or by "guesses" gives a different optimal performance.
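The two scores can be checked by direct enumeration. The sketch below assumes the standard setup (a fair coin, one awakening on heads, two on tails) and a guesser who says "tails" at every awakening; the function name `score_tails_guesser` is hypothetical.

```python
from fractions import Fraction


def score_tails_guesser():
    """Score an always-guess-tails strategy per experiment and per guess."""
    half = Fraction(1, 2)
    # Assumed standard setup: heads gives one awakening, tails gives two.
    awakenings = {"heads": 1, "tails": 2}

    # Per experiment: the guesser is right exactly when the coin is tails.
    per_experiment = half

    # Per guess: one guess is made at each awakening.
    expected_guesses = sum(half * n for n in awakenings.values())
    expected_correct = half * awakenings["tails"]
    per_guess = expected_correct / expected_guesses

    return per_experiment, per_guess


per_experiment, per_guess = score_tails_guesser()
print(f"correct per experiment: {per_experiment}")
print(f"correct per guess: {per_guess}")
```

The same strategy scores 1/2 when counted by experiments and 2/3 when counted by guesses, which is the 50% versus 67% split mentioned above.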