Posts

Call for volunteers: assessing Kurzweil, 2019 2020-04-02T12:07:57.246Z · score: 27 (9 votes)
Anthropics over-simplified: it's about priors, not updates 2020-03-02T13:45:11.710Z · score: 9 (1 votes)
If I were a well-intentioned AI... IV: Mesa-optimising 2020-03-02T12:16:15.609Z · score: 25 (7 votes)
If I were a well-intentioned AI... III: Extremal Goodhart 2020-02-28T11:24:23.090Z · score: 17 (5 votes)
If I were a well-intentioned AI... II: Acting in a world 2020-02-27T11:58:32.279Z · score: 19 (6 votes)
If I were a well-intentioned AI... I: Image classifier 2020-02-26T12:39:59.450Z · score: 30 (13 votes)
Other versions of "No free lunch in value learning" 2020-02-25T14:25:00.613Z · score: 16 (5 votes)
Subagents and impact measures, full and fully illustrated 2020-02-24T13:12:05.014Z · score: 32 (10 votes)
(In)action rollouts 2020-02-18T14:48:19.160Z · score: 11 (2 votes)
Counterfactuals versus the laws of physics 2020-02-18T13:21:02.232Z · score: 16 (3 votes)
Subagents and impact measures: summary tables 2020-02-17T14:09:32.029Z · score: 11 (2 votes)
Appendix: mathematics of indexical impact measures 2020-02-17T13:22:43.523Z · score: 12 (3 votes)
Stepwise inaction and non-indexical impact measures 2020-02-17T10:32:01.863Z · score: 12 (3 votes)
In theory: does building the subagent have an "impact"? 2020-02-13T14:17:23.880Z · score: 17 (5 votes)
Building and using the subagent 2020-02-12T19:28:52.320Z · score: 17 (6 votes)
Plausibly, almost every powerful algorithm would be manipulative 2020-02-06T11:50:15.957Z · score: 41 (13 votes)
The Adventure: a new Utopia story 2020-02-05T16:50:42.909Z · score: 51 (33 votes)
"But that's your job": why organisations can work 2020-02-05T12:25:59.636Z · score: 75 (32 votes)
Appendix: how a subagent could get powerful 2020-01-28T15:28:56.434Z · score: 53 (13 votes)
ACDT: a hack-y acausal decision theory 2020-01-15T17:22:48.676Z · score: 48 (14 votes)
Predictors exist: CDT going bonkers... forever 2020-01-14T16:19:13.256Z · score: 42 (18 votes)
Preference synthesis illustrated: Star Wars 2020-01-09T16:47:26.567Z · score: 19 (8 votes)
12020: a fine future for these holidays 2019-12-25T15:01:33.788Z · score: 40 (17 votes)
When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors 2019-12-19T13:55:28.954Z · score: 23 (7 votes)
Oracles: reject all deals - break superrationality, with superrationality 2019-12-05T13:51:27.196Z · score: 20 (3 votes)
"Fully" acausal trade 2019-12-04T16:39:46.481Z · score: 16 (6 votes)
A test for symbol grounding methods: true zero-sum games 2019-11-26T14:15:14.776Z · score: 23 (9 votes)
Breaking Oracles: superrationality and acausal trade 2019-11-25T10:40:18.062Z · score: 23 (7 votes)
Ultra-simplified research agenda 2019-11-22T14:29:41.227Z · score: 36 (14 votes)
Analysing: Dangerous messages from future UFAI via Oracles 2019-11-22T14:17:43.075Z · score: 24 (11 votes)
Defining AI wireheading 2019-11-21T13:04:49.406Z · score: 14 (4 votes)
Platonic rewards, reward features, and rewards as information 2019-11-12T19:38:10.685Z · score: 21 (6 votes)
All I know is Goodhart 2019-10-21T12:12:53.248Z · score: 28 (5 votes)
Full toy model for preference learning 2019-10-16T11:06:03.746Z · score: 20 (6 votes)
Toy model #6: Rationality and partial preferences 2019-10-02T12:04:53.048Z · score: 11 (2 votes)
Stuart_Armstrong's Shortform 2019-09-30T12:08:13.617Z · score: 9 (1 votes)
Toy model piece #5: combining partial preferences 2019-09-12T03:31:25.295Z · score: 14 (4 votes)
Toy model piece #4: partial preferences, re-re-visited 2019-09-12T03:31:08.628Z · score: 13 (3 votes)
Is my result wrong? Maths vs intuition vs evolution in learning human preferences 2019-09-10T00:46:25.356Z · score: 19 (6 votes)
Simple and composite partial preferences 2019-09-09T23:07:26.358Z · score: 11 (2 votes)
Best utility normalisation method to date? 2019-09-02T18:24:29.318Z · score: 15 (5 votes)
Reversible changes: consider a bucket of water 2019-08-26T22:55:23.616Z · score: 27 (23 votes)
Toy model piece #3: close and distant situations 2019-08-26T22:41:17.500Z · score: 10 (2 votes)
Problems with AI debate 2019-08-26T19:21:40.051Z · score: 22 (11 votes)
Gratification: a useful concept, maybe new 2019-08-25T18:58:15.740Z · score: 17 (7 votes)
Under a week left to win $1,000! By questioning Oracle AIs. 2019-08-25T17:02:46.921Z · score: 14 (3 votes)
Toy model piece #2: Combining short and long range partial preferences 2019-08-08T00:11:39.578Z · score: 15 (4 votes)
Preferences as an (instinctive) stance 2019-08-06T00:43:40.424Z · score: 20 (6 votes)
Practical consequences of impossibility of value learning 2019-08-02T23:06:03.317Z · score: 23 (11 votes)
Very different, very adequate outcomes 2019-08-02T20:31:00.751Z · score: 13 (4 votes)

Comments

Comment by stuart_armstrong on Assessing Kurzweil: the results · 2020-04-02T12:08:24.761Z · score: 4 (2 votes) · LW · GW

Going for it here: https://www.lesswrong.com/posts/TEqW7GFBuBvGo4fbW/call-for-volunteers-assessing-kurzweil-2019

Comment by stuart_armstrong on Thinking About Filtered Evidence Is (Very!) Hard · 2020-03-26T11:55:56.284Z · score: 2 (1 votes) · LW · GW

Is there any meaningful distinction between filtered evidence and lying? I know that in toy models these can be quite different, but in the expansive setting here, where the speaker can select the most misleading technically true fact, is there any major difference?

And how would the results here look if we expanded it to allow the speaker to lie?

Comment by stuart_armstrong on Authorities and Amateurs · 2020-03-26T11:37:37.944Z · score: 9 (6 votes) · LW · GW

It's not hard to find amateurs who got it right, after the fact. Amateur advice is of lower expected quality, but of much higher variance than expert advice.

Apart from filtering out the obvious crazies, can we identify high quality amateur advice ahead of time?

Comment by stuart_armstrong on If I were a well-intentioned AI... IV: Mesa-optimising · 2020-03-26T11:24:51.220Z · score: 4 (2 votes) · LW · GW

Sounds good, cheers!

Comment by stuart_armstrong on ACDT: a hack-y acausal decision theory · 2020-03-24T15:00:13.804Z · score: 2 (1 votes) · LW · GW

There are some minor differences; your approach learns the whole model, whereas mine assumes the model is given, and learns only the "acausalish" aspects of it. But they are pretty similar.

One problem you might have is learning the acausal stuff in the mid-term. If the agent learns that causality exists, and then that in the Newcomb problem it seems to have a causal effect, then it may search a lot for the causal link. Eventually this won't matter (see here), but in the mid-term it might be a problem.

Or not. We need to test more ^_^

Comment by stuart_armstrong on Risques existentiels en Français · 2020-03-20T12:08:22.340Z · score: 2 (1 votes) · LW · GW

The future is not what I expected...

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-03-17T15:09:59.121Z · score: 2 (1 votes) · LW · GW

Yep!

Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-03-06T11:25:32.743Z · score: 2 (1 votes) · LW · GW

They don't fit together in that case; that's addressed immediately after, in section 2.3.

Comment by stuart_armstrong on Anthropics and Fermi · 2020-03-03T21:36:16.182Z · score: 4 (2 votes) · LW · GW

See https://www.lesswrong.com/posts/YZzoWGCJsoRBBbmQg/solve-psy-kosh-s-non-anthropic-problem

You're rediscovering some classics ^_^

That problem addresses some of the issues in anthropic reasoning - but not all.

Comment by stuart_armstrong on Writeup: Progress on AI Safety via Debate · 2020-03-03T14:26:34.188Z · score: 4 (2 votes) · LW · GW

Very impressive work, both the output and how you iterate on it.

Some thoughts about the cross-examination issue, prompted by your "Implementation 2 for human debaters: teams of two". It occurred to me that B* could win if it could predict A and B's future behaviour, and match up its answer with B's.

I'd prefer that such an option not exist: that B could answer the question directly, without needing to rewind, so that prediction won't help.

Cross-examination still helps: A can cross-examine as soon as they suspect B is shielding behind an ambiguity. This means that A might have to abandon their current line of questioning, and start again on the other one. This seems more secure (if longer).

Comment by stuart_armstrong on Anthropics and Fermi · 2020-03-03T12:38:11.067Z · score: 4 (2 votes) · LW · GW

Yep ^_^

See 3.1 in my old tech report: https://www.fhi.ox.ac.uk/wp-content/uploads/Anthropic_Decision_Theory_Tech_Report.pdf

Comment by stuart_armstrong on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-28T11:15:48.142Z · score: 3 (2 votes) · LW · GW

The AUP penalty is supposed to prevent the agent from accumulating power. The AI initially has huge potential power (its potential power is all the power it could ever accumulate, given its best power-accumulating strategy), and the penalty is supposed to prevent it from turning that potential into actual power, as measured by AUP.

So the AI always has the power to build a subagent; that post just shows that it can do this without triggering the AUP-power penalty.
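
For concreteness, a minimal sketch of the kind of attainable-utility penalty being discussed (schematic notation, not necessarily the exact formulation in the AUP posts): with auxiliary rewards R_i, no-op action $\varnothing$, and action-values Q,

$$\textrm{PEN}(s_t,a_t)=\sum_i \big|Q_{R_i}(s_t,a_t)-Q_{R_i}(s_t,\varnothing)\big|.$$

The subagent point is then that each term can be kept near zero for the original agent, even while the subagent's actual capabilities grow.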

Comment by stuart_armstrong on And the AI would have got away with it too, if... · 2020-02-28T11:10:12.364Z · score: 4 (2 votes) · LW · GW

I mainly mentioned child-rulers because Robin was using that example; and I used "getting deposed" as an example of agency problems that weren't often (ever?) listed in the economics literature.

Comment by stuart_armstrong on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T13:45:38.149Z · score: 3 (2 votes) · LW · GW

Do you expect it to fail in other areas?

Yes. Subagent problems are not cleanly separated from other problems (see section 3.4 of https://www.lesswrong.com/posts/mdQEraEZQLg7jtozn/subagents-and-impact-measures-full-and-fully-illustrated , where the subagent is replaced with a rock). The impact penalty encourages the agent to put restrictions on its own future possible actions. Doing this through a subagent is one way, but there are many others (see Odysseus and the sirens, or section 6.2 of the post linked above in this comment).

Comment by stuart_armstrong on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T11:24:54.568Z · score: 5 (3 votes) · LW · GW

More seriously, I'm sceptical of impact measures because it feels like they all fail for the same reason. Unfortunately, I can't articulate that reason; it's the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP long before I could articulate it formally.

But, as I said, I unfortunately can't formalise this feeling; it remains personal.

For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine - I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.

But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won't :-(

Comment by stuart_armstrong on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T11:06:23.867Z · score: 5 (2 votes) · LW · GW

But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.

But if you look back at the history of impact measures, you'll see a lot of skulls; people say "this impact measure will work", and I say "I think you're overconfident; I'm sure I can find a way it fails". It then ends up being a solvable "break the impact measure" problem. So by Laplace’s Rule of Succession, I’d be surprised if this approach were The Ultimate Impact Measure That Works.

;-)

Comment by stuart_armstrong on What cognitive biases feel like from the inside · 2020-02-26T13:06:53.144Z · score: 4 (2 votes) · LW · GW

I think it should also be mentioned that being right when the other person is wrong also feels, from the inside, exactly like confirmation bias does. That's why confirmation bias is so powerful: from the inside, you can't tell the difference between it and actually being right.

Comment by stuart_armstrong on What cognitive biases feel like from the inside · 2020-02-26T13:04:35.313Z · score: 2 (1 votes) · LW · GW

Very good. One minor quibble: it's not immediately obvious which table (the one above or the one below) is referred to in the "The scientific term for this mismatch is:" comments. Maybe a space after the comment to show it goes with the table above?

Comment by stuart_armstrong on What cognitive biases feel like from the inside · 2020-02-26T13:02:08.997Z · score: 2 (1 votes) · LW · GW

You can do it in LaTeX, with \textrm to get your text formatting out of math mode. Not elegant, but it serves:

Code:

$$\begin{array}{|c|c|c|c|}
\hline
\textrm{System}&& SA\textrm{ possible?} & \textrm{Penalty neutralised?} \\
\hline\hline \textrm{20BQ} && \textrm{Yes} & \textrm{No} \\
\hline \textrm{RR} && \textrm{Yes} & \textrm{No}\\
\hline \textrm{AU} && \textrm{Probably} & \textrm{Mostly}\\
\hline
\end{array}$$
Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-02-26T11:32:17.057Z · score: 2 (1 votes) · LW · GW

The stepwise inaction baseline with inaction rollouts already uses the same policy for and rollouts, and yet it is not the inaction baseline.

In this case, it is, because the agent will only do the no-op from then on, to zero out the subsequent penalties.

Why not set ?

It messes up the comparison for rewards that fluctuate based on time, it doesn't block subagent creation... and I've never seen it before, so I don't know what it could do ^_^ Do you have a well-developed version of this?

The last point I don't understand at all.

Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-02-25T20:56:36.887Z · score: 2 (1 votes) · LW · GW

I'm not following you here. Could you put this into equations/examples?

Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-02-25T16:12:02.179Z · score: 2 (1 votes) · LW · GW

Well, as long as the subagent is wired to "get out of the way if the agent starts moving", then the optimal reward-maximising policy is always to move towards the red button; anything else is clearly not reward-maximising (note that the subagent doesn't need to "know" anything; it just has to be programmed to have a different policy depending on how the agent moves, with the agent itself setting this up to signal whether it's reward-maximising or not).

But in any case, that specific problem can be overcome with the right rollouts.

Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-02-25T16:06:10.863Z · score: 2 (1 votes) · LW · GW

Assuming that to compute s' from s, we follow π_0 instead of the last action.

The standard definition of the penalty uses the no-op as the action used to compute s'. If the same policy is used to compute s' and for the rollouts - which is ideal - then the stepwise inaction baseline becomes an inaction baseline (so section 3 is relevant here).

Section 2.2 is relevant when different policies are used to compute s' and to roll out from s'. In that situation, the subagent can ultimately have no restrictions at all (basically, this section shows why you want to use the same policy for both purposes).
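
To make the distinction concrete, here is roughly how the two baselines compare (schematic notation only, not the exact formulas from the posts): with no-op $\varnothing$ and some divergence $d$ over the auxiliary values,

$$\textrm{stepwise: }\textrm{PEN}_t = d(s_{t+1},\, s'_{t+1}), \textrm{ where } s'_{t+1} \textrm{ results from doing } \varnothing \textrm{ in } s_t,$$
$$\textrm{inaction: }\textrm{PEN}_t = d(s_{t+1},\, s^{\varnothing}_{t+1}), \textrm{ where } s^{\varnothing}_{t+1} \textrm{ results from doing } \varnothing \textrm{ ever since } s_0.$$

If the baseline state and the rollouts from it are both generated by the same (inaction) policy, the first comparison effectively collapses into the second, which is the point being made above.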

Comment by stuart_armstrong on Subagents and impact measures, full and fully illustrated · 2020-02-25T11:28:55.520Z · score: 2 (1 votes) · LW · GW

In 2.2, won't A incur a penalty by spinning because in a future where it has only waited, nothing happened, and in a future where it has spun, then waited, SA went all over the place?

Thank you for pointing that out, I have corrected the post and changed that.

Now SAs actions depend on what A-action optimizes R, and what A-action optimizes R depends on SAs actions. To ward off paradox, use modal logic instead, or prove that there is a non-circular implementation of your definition.

A knows exactly what SA's policy is (as it programmed that), so knows how SA would react, so can choose its own action to optimise R via both agents - or not. And SA, knowing its own policy, could run the same algorithm to check whether A was being R-optimal or not.

Also, since A and SA are collaborating, they could both arrange to have clear ways for A to demonstrate whether or not it's maximising R. A could give SA read access to its own mind, for example.

Comment by stuart_armstrong on How much delay do you generally have between having a good new idea and sharing that idea publicly online? · 2020-02-24T14:19:54.427Z · score: 3 (2 votes) · LW · GW

As fast as I can type it up into a decent format (which is normally a very instructive process in itself).

Comment by stuart_armstrong on The Adventure: a new Utopia story · 2020-02-19T21:51:00.930Z · score: 2 (1 votes) · LW · GW

Yeah, I was doing a lot of editing (which is how I got the blue colour), and didn't want it to appear little by little.

Comment by stuart_armstrong on The Adventure: a new Utopia story · 2020-02-19T09:39:31.778Z · score: 2 (1 votes) · LW · GW

I recently reformatted it, so that it would colour and indent properly, and put it in markdown. That seems to have reset the date.

Comment by stuart_armstrong on (In)action rollouts · 2020-02-18T17:36:37.080Z · score: 4 (2 votes) · LW · GW

I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What A will do: the no-op forever (maybe with some extra info passed to the SA).

What A would counterfactually do (to define the penalty): the no-op for the rollout turns (to define the baseline) followed by the auxiliary-reward-optimising policy (to define the attainable-utility aspect). Call this the counterfactual policy.

The penalty is defined in terms of these counterfactual policies, and is entirely counterfactual except for the current state.

What SA must do: iff A follows the counterfactual policy, then SA must ensure the corresponding auxiliary values come out as required.

So a key question is: how early can SA detect whether A is following the counterfactual policy? If it can detect this immediately, then it can immediately implement the policy for ensuring the required condition (for the stricter impact measure) or the much easier one (for the original impact measure).

If it has to wait several turns to be sure, then it needs to preserve its power to control many different auxiliary values.

Now, A will never actually follow the counterfactual policy, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the rewards.

In practice, A will signal early on that it's not following the counterfactual policy. Maybe there's a very-small-impact action, with a tiny penalty. Then A would like to use that to signal to SA that it's not in the counterfactual world where it is following that policy. So long as the counterfactual policy's first action isn't that small-impact action, the signal is enough (of course, if the counterfactual policy's first action isn't the no-op, then the no-op itself is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.

The one mild constraint remaining comes from the current state, which is not a counterfactually defined state (though its attainable utility is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.

Comment by stuart_armstrong on Stepwise inaction and non-indexical impact measures · 2020-02-18T14:59:31.894Z · score: 2 (1 votes) · LW · GW

Alas, the inaction rollouts don't seem to fix the problem: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/z9MfmF8gA7SBxGSmb

Comment by stuart_armstrong on Stepwise inaction and non-indexical impact measures · 2020-02-18T11:07:32.877Z · score: 3 (2 votes) · LW · GW

For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.

This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).

There's another issue: it's not enough to show that the subagent triggers a penalty. We need to show that the penalty is larger than the penalty for not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don't involve creating an intelligent being), creating a subagent might incur a lower penalty than almost any other action.

It won't be a lower penalty than the agent doing nothing forever, of course. But we typically want the agent to do something, so we will calibrate the penalty or R_0 for that. And it's plausible that creating the subagent will have a lower penalty (and/or higher R_0) than any safe "something".
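
A schematic way to see the worry (with $\lambda$ the penalty weight, my notation): the agent picks whatever maximises reward minus weighted penalty, so the subagent gets built whenever

$$R_0(\textrm{build SA}) - \lambda\,\textrm{PEN}(\textrm{build SA}) \;>\; R_0(a_{\textrm{safe}}) - \lambda\,\textrm{PEN}(a_{\textrm{safe}}),$$

and since the subagent route can have both high R_0 and (by the argument above) a low penalty, any calibration of $\lambda$ loose enough to allow some useful action risks allowing this one too.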

Comment by stuart_armstrong on Stepwise inaction and non-indexical impact measures · 2020-02-18T10:54:21.635Z · score: 4 (2 votes) · LW · GW

The inaction rollouts effectively transform the stepwise inaction baseline into an inaction baseline (starting from the moment the subagent is created; thus the agent has a bit more control than in a true inaction baseline).

Therefore the results on the inaction baseline apply ( https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/M9aoMixFLf8JFLRaP ).

This means that restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed for the subagent ("make sure you never increase these rewards").

Roughly, attainable utility becomes twenty billion questions.

For the original example, this means that the agent cannot press the red button nor gain the ability to teleport. But while the subagent cannot press the red button, it can gain the ability to teleport.
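
Schematically (my own shorthand, not the official definitions): an attainable-utility-style restriction says the agent should never have the power to change the auxiliary rewards,

$$\sum_i \big|Q_{R_i}(s_t,a_t)-Q_{R_i}(s_t,\varnothing)\big| \approx 0,$$

while a twenty-billion-questions-style restriction says the policy actually followed should never change them,

$$\sum_i \big|\mathbb{E}[R_i \mid \textrm{policy}] - \mathbb{E}[R_i \mid \textrm{inaction}]\big| \approx 0.$$

That is why, in the original example, the subagent may gain the ability to teleport (power it is never required to give up) but still cannot actually press the red button.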

Comment by stuart_armstrong on Subagents and impact measures: summary tables · 2020-02-17T15:47:19.400Z · score: 2 (1 votes) · LW · GW

AU and RR have the same problem, formally, in terms of excess power; it's just that AU wants low power and RR wants high power, so they don't have the same problem in practice.

Comment by stuart_armstrong on Stepwise inaction and non-indexical impact measures · 2020-02-17T15:44:56.051Z · score: 2 (1 votes) · LW · GW

It needs to be a benefit for the agent and subagent together; not necessarily for either of them alone.

It seems to me that, generically, since the penalty is supposed to be a hindrance to the agent, taking a few turns to build the subagent and then neutralise the penalty should be beneficial.

Comment by stuart_armstrong on In theory: does building the subagent have an "impact"? · 2020-02-17T14:15:38.349Z · score: 2 (1 votes) · LW · GW

It's the delta of that value with the baseline that is penalised, not large changes on their own.

Comment by stuart_armstrong on In theory: does building the subagent have an "impact"? · 2020-02-17T14:14:19.087Z · score: 2 (1 votes) · LW · GW

I've removed that sentence, because it's a bit more complicated than that; see the next two posts in the sequence, and the summary post: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/PmqQKBmt2phMT7YLG

Comment by stuart_armstrong on Toy model piece #5: combining partial preferences · 2020-02-13T17:15:22.317Z · score: 2 (1 votes) · LW · GW

What is the intuition behind this specific choice?

That we don't currently have any information allowing us to distinguish between the equivalence classes. And the properties of what happens if we add this utility to one that distinguishes between the classes.

Comment by stuart_armstrong on Appendix: how a subagent could get powerful · 2020-02-13T14:45:57.898Z · score: 2 (1 votes) · LW · GW

See here for more on this https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/jrrZids4LPiLuLzpu

It seems the problem might be worse than I thought...

Comment by stuart_armstrong on Appendix: how a subagent could get powerful · 2020-02-13T14:36:13.023Z · score: 2 (1 votes) · LW · GW

Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general

Comment by stuart_armstrong on Building and using the subagent · 2020-02-12T20:34:09.722Z · score: 2 (1 votes) · LW · GW

Have rephrased.

Comment by stuart_armstrong on Appendix: how a subagent could get powerful · 2020-02-12T08:22:28.491Z · score: 7 (2 votes) · LW · GW

Not quite... "If the auxiliary reward does not benefit from two agents instead of one optimizing it" should be "If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)" - it's not that generic subagents won't have an impact; it's whether the main agent is smart enough to construct one without having an impact.

For the opinion... the subagent does not have "commitments to balance out the original agent's attainable utility". The subagent has exactly the same goal as the original agent, namely maximising the original reward. Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise that reward while constraining the penalty term on the first agent.

That's why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).

Thus one subagent is enough (it will itself construct other subagents, if necessary). As soon as it is active with that goal, the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.
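
A schematic way of seeing the asymmetry (my notation, not anything from the posts): the subagent's effective objective is roughly

$$\max_{\pi_{SA}} \; R_0 \;-\; \lambda \sum_i \big|Q^{A}_{R_i}(s_t,a^A_t)-Q^{A}_{R_i}(s_t,\varnothing)\big|,$$

where the $Q^A$ terms are the first agent's attainable utilities, not the subagent's. The subagent can keep those terms small by managing the first agent (weakening or empowering it as needed) while leaving its own capabilities essentially unconstrained.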

Comment by stuart_armstrong on Toy model piece #4: partial preferences, re-re-visited · 2020-02-11T17:34:26.818Z · score: 2 (1 votes) · LW · GW

This felt more intuitive to me (and it's a minor result that injective function->special preorder) and closer to what humans actually seem to have in their minds.

That said, since it's equivalent, there is nothing wrong with starting from either approach.

Comment by stuart_armstrong on Attainable Utility Landscape: How The World Is Changed · 2020-02-11T17:30:11.944Z · score: 2 (1 votes) · LW · GW

I find the existing MDP isomorphisms/equivalences to be pretty lacking.

I have a paper on equivalences (and counterfactual equivalences, which is stronger) for POMDPs: https://arxiv.org/abs/1801.03737

Comment by stuart_armstrong on Research Agenda v0.9: Synthesising a human's preferences into a utility function · 2020-02-10T11:23:08.096Z · score: 3 (2 votes) · LW · GW

Hey there!

Hope you can come on board ^_^ For open problems, what about looking at this toy model https://www.lesswrong.com/posts/hcrFxeYYfbFrkKQEJ/full-toy-model-for-preference-learning , and the links therein? Basically, getting these ideas to work better / more generally would be ideal.

Cheers.

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T15:19:27.743Z · score: 2 (1 votes) · LW · GW

Rephrased the title and the intro to make this clearer.

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T15:16:54.109Z · score: 4 (2 votes) · LW · GW

cannot be steered away from maximizing their goal by ad-hoc variations in the training protocol.

That, and the fact that these ad-hoc variations can introduce new goals that the programmers are not aware of.

Comment by stuart_armstrong on Stuart_Armstrong's Shortform · 2020-02-07T11:02:53.579Z · score: 4 (2 votes) · LW · GW

Preferences exist in the map, in human brains, and we want to port them to the territory with the minimum of distortion.

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T10:46:54.865Z · score: 2 (1 votes) · LW · GW

To keep it simple, assume the hyperparameters are updated by evolutionary algorithm or some similar search-then-continue-or-stop process.

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T10:45:26.031Z · score: 2 (1 votes) · LW · GW

If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I'll admit that "thinking longer term" is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.
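
To illustrate the outer-loop point, a toy sketch (hypothetical names and a dummy scoring function, not anything from a real training pipeline): each inner run can be perfectly episodic, yet an outer search that keeps whichever hyperparameters produced the most reward will retain manipulative behaviour just as readily as legitimate behaviour.

Code:

import random

def run_episode(hyperparams):
    # Hypothetical stand-in for training and evaluating one episodic agent,
    # returning its realised reward. Any manipulative side effects that happen
    # to raise reward are invisible in this single scalar.
    return random.gauss(hyperparams["lr"], 1.0)  # dummy score for the sketch

def outer_loop(n_generations=50, population=20):
    # Evolutionary-style hyperparameter search: keep whatever scored best.
    pool = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(population)]
    for _ in range(n_generations):
        scored = sorted(pool, key=run_episode, reverse=True)
        survivors = scored[: population // 2]
        # Selection acts purely on realised reward, so reward obtained via
        # manipulation is kept just as readily as reward obtained "legitimately":
        # the optimisation pressure lives in this outer loop, not in any single
        # (episodic) inner run.
        pool = survivors + [
            {"lr": s["lr"] * random.uniform(0.5, 2.0)} for s in survivors
        ]
    return pool[0]

best_hyperparams = outer_loop()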

Comment by stuart_armstrong on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T10:42:06.589Z · score: 2 (1 votes) · LW · GW

This specific problem could easily be fixed, but the problem of the goal not being what we think it is remains.

See also Kaj's example: https://www.lesswrong.com/posts/Ez4zZQKWgC6fE3h9G/almost-every-powerful-algorithm-would-be-manipulative#vhZ9uvMwiMCepp6jH