## Posts

ACDT: a hack-y acausal decision theory 2020-01-15T17:22:48.676Z · score: 44 (12 votes)
Predictors exist: CDT going bonkers... forever 2020-01-14T16:19:13.256Z · score: 34 (17 votes)
Preference synthesis illustrated: Star Wars 2020-01-09T16:47:26.567Z · score: 17 (6 votes)
12020: a fine future for these holidays 2019-12-25T15:01:33.788Z · score: 40 (17 votes)
When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors 2019-12-19T13:55:28.954Z · score: 19 (6 votes)
Oracles: reject all deals - break superrationality, with superrationality 2019-12-05T13:51:27.196Z · score: 20 (3 votes)
A test for symbol grounding methods: true zero-sum games 2019-11-26T14:15:14.776Z · score: 23 (9 votes)
Breaking Oracles: superrationality and acausal trade 2019-11-25T10:40:18.062Z · score: 23 (7 votes)
Ultra-simplified research agenda 2019-11-22T14:29:41.227Z · score: 36 (14 votes)
Analysing: Dangerous messages from future UFAI via Oracles 2019-11-22T14:17:43.075Z · score: 24 (11 votes)
Platonic rewards, reward features, and rewards as information 2019-11-12T19:38:10.685Z · score: 21 (6 votes)
All I know is Goodhart 2019-10-21T12:12:53.248Z · score: 28 (5 votes)
Full toy model for preference learning 2019-10-16T11:06:03.746Z · score: 20 (6 votes)
Toy model #6: Rationality and partial preferences 2019-10-02T12:04:53.048Z · score: 11 (2 votes)
Stuart_Armstrong's Shortform 2019-09-30T12:08:13.617Z · score: 9 (1 votes)
Toy model piece #5: combining partial preferences 2019-09-12T03:31:25.295Z · score: 12 (3 votes)
Toy model piece #4: partial preferences, re-re-visited 2019-09-12T03:31:08.628Z · score: 11 (2 votes)
Is my result wrong? Maths vs intuition vs evolution in learning human preferences 2019-09-10T00:46:25.356Z · score: 19 (6 votes)
Simple and composite partial preferences 2019-09-09T23:07:26.358Z · score: 11 (2 votes)
Best utility normalisation method to date? 2019-09-02T18:24:29.318Z · score: 15 (5 votes)
Reversible changes: consider a bucket of water 2019-08-26T22:55:23.616Z · score: 25 (24 votes)
Toy model piece #3: close and distant situations 2019-08-26T22:41:17.500Z · score: 10 (2 votes)
Problems with AI debate 2019-08-26T19:21:40.051Z · score: 22 (11 votes)
Gratification: a useful concept, maybe new 2019-08-25T18:58:15.740Z · score: 17 (7 votes)
Under a week left to win $1,000! By questioning Oracle AIs. 2019-08-25T17:02:46.921Z · score: 14 (3 votes)
Toy model piece #2: Combining short and long range partial preferences 2019-08-08T00:11:39.578Z · score: 15 (4 votes)
Preferences as an (instinctive) stance 2019-08-06T00:43:40.424Z · score: 20 (6 votes)
Practical consequences of impossibility of value learning 2019-08-02T23:06:03.317Z · score: 23 (11 votes)
Very different, very adequate outcomes 2019-08-02T20:31:00.751Z · score: 13 (4 votes)
Contest: $1,000 for good questions to ask to an Oracle AI 2019-07-31T18:48:59.406Z · score: 68 (28 votes)
Toy model piece #1: Partial preferences revisited 2019-07-29T16:35:19.561Z · score: 12 (3 votes)
Normalising utility as willingness to pay 2019-07-18T11:44:52.272Z · score: 16 (4 votes)
Intertheoretic utility comparison: examples 2019-07-17T12:39:45.147Z · score: 13 (3 votes)
Indifference: multiple changes, multiple agents 2019-07-08T13:36:42.095Z · score: 16 (3 votes)
Self-confirming prophecies, and simplified Oracle designs 2019-06-28T09:57:35.571Z · score: 6 (3 votes)
Apocalypse, corrupted 2019-06-26T13:46:05.548Z · score: 20 (12 votes)
Research Agenda in reverse: what *would* a solution look like? 2019-06-25T13:52:48.934Z · score: 35 (15 votes)
Research Agenda v0.9: Synthesising a human's preferences into a utility function 2019-06-17T17:46:39.317Z · score: 61 (16 votes)
Preference conditional on circumstances and past preference satisfaction 2019-06-17T15:30:32.580Z · score: 11 (2 votes)
For the past, in some ways only, we are moral degenerates 2019-06-07T15:57:10.962Z · score: 29 (9 votes)
To first order, moral realism and moral anti-realism are the same thing 2019-06-03T15:04:56.363Z · score: 18 (5 votes)
Conditional meta-preferences 2019-06-03T14:09:54.357Z · score: 6 (3 votes)
Uncertainty versus fuzziness versus extrapolation desiderata 2019-05-30T13:52:16.831Z · score: 21 (6 votes)
And the AI would have got away with it too, if... 2019-05-22T21:35:35.543Z · score: 75 (30 votes)
By default, avoid ambiguous distant situations 2019-05-21T14:48:15.453Z · score: 31 (8 votes)
Oracles, sequence predictors, and self-confirming predictions 2019-05-03T14:09:31.702Z · score: 21 (7 votes)
Self-confirming predictions can be arbitrarily bad 2019-05-03T11:34:47.441Z · score: 45 (17 votes)
Nash equilibriums can be arbitrarily bad 2019-05-01T14:58:21.765Z · score: 36 (16 votes)

Comment by stuart_armstrong on ACDT: a hack-y acausal decision theory · 2020-01-17T09:16:08.047Z · score: 2 (1 votes) · LW · GW

That's annoying - thanks for pointing it out. Any idea what the issue is?

Comment by stuart_armstrong on ACDT: a hack-y acausal decision theory · 2020-01-16T11:20:33.586Z · score: 4 (2 votes) · LW · GW

I don't quite see why the causality is this flexible and arbitrary.

In stories and movies, people often find that the key tool/skill/knowledge they need to solve the problem, is something minor they picked up some time before.

The world could work like this, so that every minor thing you spent any time on would have a payoff at some point in the future. Call this a teleological world.

This world would have a different "causal" structure to our own, and we'd probably not conceive traditional CDT agents as likely in this world.

Comment by stuart_armstrong on Predictors exist: CDT going bonkers... forever · 2020-01-16T11:00:50.882Z · score: 3 (2 votes) · LW · GW

I'm claiming that this post is conflating an error in constructing an accurate world-map with an error in the decision theory.

The problem is not that CDT has an inaccurate world-map; the problem is that CDT has an accurate world map, and then breaks it. CDT would work much better with an inaccurate world-map, one in which its decision causally affects the prediction.

See this post for how you can hack that: https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory

Comment by stuart_armstrong on Predictors exist: CDT going bonkers... forever · 2020-01-15T18:58:03.861Z · score: 5 (3 votes) · LW · GW

You may like this, then: https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory

Comment by stuart_armstrong on Predictors exist: CDT going bonkers... forever · 2020-01-15T17:29:33.810Z · score: 5 (3 votes) · LW · GW

I'm using CDT as it's formally stated (in, eg, the FDT paper).

The best defence I can imagine from a CDT proponent: CDT is decision theory, not game theory. Anything involving predictors is game theory, so doesn't count.

Comment by stuart_armstrong on Predictors exist: CDT going bonkers... forever · 2020-01-15T09:17:48.360Z · score: 2 (1 votes) · LW · GW

CDT would fight the hypothetical, and refuse to admit that perfect predictors of their own actions exist (the CDT agent is perfectly fine with perfect predictors of other people's actions).

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-13T12:20:54.572Z · score: 2 (1 votes) · LW · GW

I think normalisation doesn't fit in the convex-concave picture. Normalisation is to avoid things like being seen as the same as .

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-13T12:18:16.695Z · score: 2 (1 votes) · LW · GW

I like that way of seeing it.

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-11T17:37:38.151Z · score: 2 (1 votes) · LW · GW

Ok, have corrected it now; the negative-sum formulation was wrong, sorry.

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-10T16:11:26.821Z · score: 2 (1 votes) · LW · GW

You are correct; I was unclear (and wrong in that terminology). I will rework the post slightly.

Comment by stuart_armstrong on Preference synthesis illustrated: Star Wars · 2020-01-10T08:57:18.600Z · score: 2 (1 votes) · LW · GW

Does this reflect how much you want to watch each of the movies?

That's one of the features I considered.

Comment by stuart_armstrong on Preference synthesis illustrated: Star Wars · 2020-01-10T08:56:30.617Z · score: 3 (2 votes) · LW · GW

I don't consider it a core Star Wars movie; if I had, it would be just below Revenge of the Sith.

Comment by stuart_armstrong on Preference synthesis illustrated: Star Wars · 2020-01-09T20:41:41.037Z · score: 2 (1 votes) · LW · GW

"I want to be a more generous person": what would you classify that as? Or "I want to want to write"?

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-08T11:41:01.215Z · score: 2 (1 votes) · LW · GW

Negative vs positive vs zero sum is all relative to what we take to be the default outcome.

I take the default as "no effort is made to increase or decrease any of the reward functions".

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-08T11:04:34.021Z · score: 2 (1 votes) · LW · GW

Negative sum vs zero sum (vs positive sum, in fact) depend on defining some "default state", against which the outcome is compared. A negative sum game can become a positive sum game if you just give all the "players" a fixed bonus (ie translate the default state). Default states are somewhat tricky and often subjective to define.
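A minimal sketch of the point about default states, under stated assumptions: the outcome `(3, -5)`, the reward units, and the helper names `total_gain` and `classify` are all hypothetical and chosen only for illustration. It shows the same outcome being labelled negative, zero, or positive sum purely as a function of which default it is compared against.

```python
def total_gain(outcome, default):
    """Sum of each player's gain relative to the chosen default state."""
    return sum(o - d for o, d in zip(outcome, default))


def classify(outcome, default):
    """Label the outcome as negative, zero or positive sum versus the default."""
    gain = total_gain(outcome, default)
    if gain > 0:
        return "positive sum"
    if gain < 0:
        return "negative sum"
    return "zero sum"


# Hypothetical two-player outcome, in arbitrary reward units.
outcome = (3, -5)

# Measured against a default where both players get 0.
print(classify(outcome, default=(0, 0)))
# Same outcome, with the default shifted down by 2 for each player
# (equivalent to handing each player a fixed bonus of 2).
print(classify(outcome, default=(-2, -2)))
```

Nothing about the outcome itself changes between the two calls; only the reference point moves, which is why the label is relative rather than intrinsic.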

Now, you said "the best states for one of the rewards are bad for the other". "Bad" compared with what? I'm taking as a default something like "you make no effort to increase (or decrease) either reward".

So, my informal definition of "zero sum" is "you may choose to increase either or (roughly) independently of each other, from a fixed budget". Weakly positive sum would be "the more you increase , the easier it gets to increase (and vice versa) from a fixed budget"; strongly positive sum would be "the more you increase , the more increases (and vice versa)".

Negative sum would be the opposite of this ("easier"->"harder" and "increases"->"decreases").

The reason I distinguish weak and strong, is that if we add diminishing returns, this reduces the impact of weak negative sum, but can't solve strong negative sum.

Does this help, or add more confusion?

Comment by stuart_armstrong on Normative assumptions: regret · 2020-01-03T15:24:33.779Z · score: 3 (2 votes) · LW · GW

I saw the 's as normalised to 1 or zero, and the coefficients as giving them weights. So instead of , I'd write (given the behaviour and assumptions).

But this is an old post, and is mainly superseded by new ones, so I wouldn't spend too much time on it.

Comment by stuart_armstrong on Toy model piece #4: partial preferences, re-re-visited · 2019-12-30T19:02:49.006Z · score: 2 (1 votes) · LW · GW

Does this imply I prefer X apples to Y bananas and Z pears, where Y+Z=X?

If it's just for a single fruit, I'd decompose that preference into two separate ones? Apple vs Banana, Apple vs Pear.

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2019-12-30T18:57:21.294Z · score: 2 (1 votes) · LW · GW

Generally, spending resources is zero-sum, not negative sum.

Comment by stuart_armstrong on 12020: a fine future for these holidays · 2019-12-29T12:01:08.212Z · score: 2 (1 votes) · LW · GW

Nope, wasn't consciously echoing that ^_^

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2019-12-29T12:00:22.043Z · score: 2 (1 votes) · LW · GW

Renormalise those, so that slowly diminishing returns don't dominate.

Comment by stuart_armstrong on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2019-12-29T11:59:22.892Z · score: 2 (1 votes) · LW · GW

Flo's summary for the Alignment Newsletter:

I like that summary!

I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

Resource constraints situations can be positive sum (consider most of the economy). The real problem is between antagonistic preferences, eg maximising flourishing lives vs negative utilitarianism, where a win for one is a loss for the other.

Note that this post considers the setting where we have uncertainty over the true reward function, but we can't learn about the true reward function. If you can gather information about the true reward function, which seems necessary to me (see Human-AI Interaction), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Yes, if you're in a learning process and treat it as if you weren't in a learning process, things will go wrong ^_^

Comment by stuart_armstrong on 12020: a fine future for these holidays · 2019-12-26T15:33:58.898Z · score: 7 (4 votes) · LW · GW

That's actually the fourth Mercury, rebuilt for tourism.

Happy New Year!

Comment by stuart_armstrong on 12020: a fine future for these holidays · 2019-12-26T08:24:29.212Z · score: 2 (1 votes) · LW · GW

No idea, sorry - I got my copies printed locally.

Comment by stuart_armstrong on ozziegooen's Shortform · 2019-12-18T15:34:22.420Z · score: 4 (2 votes) · LW · GW

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-12-06T20:34:23.931Z · score: 4 (2 votes) · LW · GW

(sorry, have now corrected link to https://www.lesswrong.com/posts/6XCTppoPAMdKCPFb4/oracles-reject-all-deals-break-superrationality-with-1 )

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-12-06T20:32:02.473Z · score: 2 (1 votes) · LW · GW

Apologies, have now corrected the link.

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-12-06T14:09:01.316Z · score: 2 (1 votes) · LW · GW

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-12-06T13:57:48.929Z · score: 4 (2 votes) · LW · GW

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-12-06T13:57:26.927Z · score: 2 (1 votes) · LW · GW

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-12-06T13:56:58.569Z · score: 2 (1 votes) · LW · GW

Hum - my approach here seems to have a similarity to your idea.

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-12-06T13:56:04.030Z · score: 2 (1 votes) · LW · GW

Comment by stuart_armstrong on A Critique of Functional Decision Theory · 2019-12-02T14:02:50.697Z · score: 25 (7 votes) · LW · GW

I have to say, I find these criticisms a bit weak. Going through them:

III. FDT sometimes makes bizarre recommendations

I'd note that successfully navigating Parfit's hitchhiker also involves violating "Guaranteed Payoffs": you pay the driver at a time when there is no uncertainty, and where you get better utility from not doing so. So I don't think Guaranteed Payoffs is that sound a principle.

Your bomb example is a bit underdefined, since the predictor is predicting your actions AND giving you the prediction. If the predictor is simulating you and asking "would you go left after reading a prediction that you are going right", then you should go left; because, by the probabilities in the setup, you are almost certainly a simulation (this is kind of a "counterfactual Parfit hitchhiker" situation).

If the predictor doesn't simulate you, and you KNOW they said to go right, you are in a slightly different situation, and you should go right. This is akin to waking up in the middle of the Parfit hitchhiker experiment, when the driver has already decided to save you, and deciding whether to pay them.

IV. FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it

This section is incorrect, I think. In this variant, the contents of the boxes are determined not by your decision algorithm, but by your nationality. And of course two-boxing is the right decision in that situation!

the case for one-boxing in Newcomb’s problem didn’t seem to stem from whether the Predictor was running a simulation of me, or just using some other way to predict what I’d do.

But it does depend on things like this. There's no point in one-boxing unless your one-boxing is connected with the predictor believing that you'd one-box. In a simulation, that's the case; in some other situations where the predictor looks at your algorithm, that's also the case. But if the predictor is predicting based on nationality, then you can freely two-box without changing the predictor's prediction.
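To make the dependence concrete, here is a rough sketch under stated assumptions: the payoffs (1,000,000 and 1,000), the accuracy figure, and the function names `eu_linked` and `eu_unlinked` are illustrative and not taken from the discussion. It compares a predictor whose prediction tracks your choice with one whose prediction is fixed by something unrelated to your choice, such as nationality.

```python
BIG = 1_000_000   # assumed content of the opaque box when one-boxing is predicted
SMALL = 1_000     # assumed content of the transparent box


def eu_linked(action, accuracy):
    """Expected utility when the prediction matches the action with
    probability `accuracy` (e.g. the predictor simulates your algorithm)."""
    p_full = accuracy if action == "one-box" else 1 - accuracy
    return p_full * BIG + (SMALL if action == "two-box" else 0)


def eu_unlinked(action, p_full):
    """Expected utility when the opaque box is full with probability
    `p_full` regardless of the action (e.g. prediction by nationality)."""
    return p_full * BIG + (SMALL if action == "two-box" else 0)


if __name__ == "__main__":
    for action in ("one-box", "two-box"):
        print(action, eu_linked(action, accuracy=0.99), eu_unlinked(action, p_full=0.5))
```

In the linked case one-boxing comes out ahead for any reasonably accurate predictor; in the unlinked case two-boxing is ahead by exactly the small-box amount whatever `p_full` is, since the choice no longer moves the prediction.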

V. Implausible discontinuities

There's nothing implausible about discontinuity in the optimal policy, even if the underlying data is continuous. If p is the probability that we're in a smoking lesion vs a Newcomb problem, then as p changes from 0 to 1, the expected utility of one-boxing falls and the expected utility of two-boxing rises. At some point, the optimal action will jump discontinuously from one to the other.
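A small numerical sketch of that jump, with every number assumed for illustration: `p` is used here as a label for the probability of being in the lesion-like case, which is modelled crudely as "box contents are independent of the choice", and `BIG`, `SMALL` and `FIXED` are made-up payoffs. Both expected utilities are continuous (in fact linear) in `p`, yet the best action flips at a single threshold.

```python
# Illustrative payoffs only; none of these numbers come from the discussion.
BIG = 1_000_000    # opaque box when the predictor expects one-boxing
SMALL = 1_000      # transparent box
FIXED = 500_000    # assumed expected opaque-box content when it ignores the choice


def expected_utilities(p):
    """Return (EU of one-boxing, EU of two-boxing), where p is the
    probability that the box contents are independent of the choice."""
    eu_one = (1 - p) * BIG + p * FIXED
    eu_two = (1 - p) * SMALL + p * (FIXED + SMALL)
    return eu_one, eu_two


def optimal_action(p):
    """Pick the action with the higher expected utility at this p."""
    eu_one, eu_two = expected_utilities(p)
    return "one-box" if eu_one >= eu_two else "two-box"


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.99, 0.9995, 1.0):
        print(p, expected_utilities(p), optimal_action(p))
```

With these particular payoffs the two lines cross at p = 0.999; below that the policy says one-box, above it two-box, even though nothing in the inputs changes abruptly.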

VI. FDT is deeply indeterminate

I agree FDT is indeterminate, but I don't agree with your example. Your two calculators are clearly isomorphic, just as if we used a different numbering system for one versus the other. Talking about isomorphic algorithms avoids worrying about whether they're the "same" algorithm.

And in general, it seems to me, there’s no fact of the matter about which algorithm a physical process is implementing in the absence of a particular interpretation of the inputs and outputs of that physical process.

Indeed. But since you and your simulation are isomorphic, you can look at what the consequences are of you outputting "two-box" while your simulation outputs "deux boites" (or "one-box" and "une boite"). And {one-box, une boite} is better than {two-box, deux boites}.

But why did I use those particular interpretations of me and my simulation's physical processes? Because those interpretations are the ones relevant to the problem at hand. Me and my simulation will have a different weight, consume different amounts of power, are run at different times, and probably at different speeds. If those were relevant to the Newcomb problem, then the fact we are different becomes relevant. But since they aren't, we can focus in on the core of the matter. (You can also consider the example of playing the prisoner's dilemma against an almost-but-not-quite-identical copy of yourself.)

Comment by stuart_armstrong on My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms · 2019-11-28T10:17:25.901Z · score: 3 (2 votes) · LW · GW

Excellent and well worked on, suggesting many different interesting ideas and research avenues.

Comment by stuart_armstrong on 2018 AI Alignment Literature Review and Charity Comparison · 2019-11-28T10:16:53.610Z · score: 4 (2 votes) · LW · GW

Very thorough, and it's very worthwhile that posts like this are made.

Comment by stuart_armstrong on Bottle Caps Aren't Optimisers · 2019-11-28T10:16:06.048Z · score: 2 (1 votes) · LW · GW

It's helped me hone my thinking on what is and isn't an optimiser (and a wireheader, and so on, for associated concepts).

Comment by stuart_armstrong on Paradoxes in all anthropic probabilities · 2019-11-26T17:17:19.963Z · score: 2 (1 votes) · LW · GW

Was that unexpected?

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-11-26T10:50:23.867Z · score: 4 (2 votes) · LW · GW

I don't think that would work, but it's worth thinking about in case it does...

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-25T21:29:09.792Z · score: 2 (1 votes) · LW · GW

Cheers!

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-11-25T16:53:20.660Z · score: 2 (1 votes) · LW · GW

Comment by stuart_armstrong on Breaking Oracles: superrationality and acausal trade · 2019-11-25T16:52:23.991Z · score: 5 (2 votes) · LW · GW

Well, that message only works if it actually produces an UFAI within the required timespan, and if the other Oracle would have its message not read. There are problems, but the probability is not too high, initially (though this depends on the number of significant figures in its message).

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-25T11:22:00.119Z · score: 2 (1 votes) · LW · GW

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-25T10:40:37.940Z · score: 2 (1 votes) · LW · GW

You might want to quote this as well: https://www.lesswrong.com/posts/42z4k8Co5BuHMBvER/hyperrationality-and-acausal-trade-break-oracles

Comment by stuart_armstrong on Ultra-simplified research agenda · 2019-11-25T10:23:20.107Z · score: 3 (2 votes) · LW · GW

I'm actually willing to believe that methods used for cat pictures might work for human theory of mind - if trained on that data (and this doesn't solve the underdefined problem).

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-22T15:46:57.933Z · score: 2 (1 votes) · LW · GW

Yep. If they do acausal trade with each other.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-11-22T14:18:45.600Z · score: 2 (1 votes) · LW · GW

Some thoughts on that idea: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-11-22T14:18:28.457Z · score: 3 (2 votes) · LW · GW

Some thoughts on this idea, thanks for it: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-22T11:52:05.750Z · score: 2 (1 votes) · LW · GW

I consider wireheading to be a special case of proxy alignment in a mesaoptimiser.

I agree. I've now added this line, which I thought I'd put in the original post, but apparently missed out:

Note, though, that the converse is true: every example of wireheading is a Goodhart curse.

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-22T11:48:21.172Z · score: 2 (1 votes) · LW · GW

But really, what's the purpose of trying to distinguish wireheading from other forms of reward hacking?

Because mitigations for different failure modes might not be the same, depending on the circumstances.

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-21T17:49:48.599Z · score: 3 (2 votes) · LW · GW

Where "measurement channel" means not just one specific channel, but anything that has the properties of a measurement channel.

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-11-08T13:59:38.293Z · score: 2 (1 votes) · LW · GW

Indeed. It might be possible to construct that complex bias function, from the policy, in a simple way. But that claim needs to be supported, and the fact that it hasn't been found so far (I repeat that it has to be simple) is evidence against it.