ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' 2020-03-25T20:03:39.590Z · score: 32 (7 votes)
Conclusion to 'Reframing Impact' 2020-02-28T16:05:40.656Z · score: 37 (12 votes)
Reasons for Excitement about Impact of Impact Measure Research 2020-02-27T21:42:18.903Z · score: 29 (9 votes)
Attainable Utility Preservation: Scaling to Superhuman 2020-02-27T00:52:49.970Z · score: 28 (8 votes)
How Low Should Fruit Hang Before We Pick It? 2020-02-25T02:08:52.630Z · score: 28 (8 votes)
Continuous Improvement: Insights from 'Topology' 2020-02-22T21:58:01.584Z · score: 28 (8 votes)
Attainable Utility Preservation: Empirical Results 2020-02-22T00:38:38.282Z · score: 36 (8 votes)
Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z · score: 33 (9 votes)
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z · score: 40 (12 votes)
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z · score: 44 (13 votes)
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z · score: 15 (6 votes)
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z · score: 7 (1 votes)
On Being Robust 2020-01-10T03:51:28.185Z · score: 40 (17 votes)
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z · score: 23 (7 votes)
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z · score: 22 (5 votes)
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z · score: 41 (14 votes)
Seeking Power is Instrumentally Convergent in MDPs 2019-12-05T02:33:34.321Z · score: 107 (31 votes)
How I do research 2019-11-19T20:31:16.832Z · score: 56 (22 votes)
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 54 (24 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 42 (14 votes)
World State is the Wrong Level of Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 55 (17 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 54 (18 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 59 (17 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 51 (20 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 90 (35 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 28 (9 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 28 (6 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 111 (38 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 42 (14 votes)
Worrying about the Vase: Whitelisting 2018-06-16T02:17:08.890Z · score: 84 (20 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 117 (39 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)
Internalizing Internal Double Crux 2018-04-30T18:23:14.653Z · score: 80 (19 votes)
The First Rung: Insights from 'Linear Algebra Done Right' 2018-04-22T05:23:49.024Z · score: 77 (22 votes)
Unyielding Yoda Timers: Taking the Hammertime Final Exam 2018-04-03T02:38:48.327Z · score: 40 (12 votes)
Open-Category Classification 2018-03-28T14:49:23.665Z · score: 36 (8 votes)
The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach' 2018-03-25T06:55:46.204Z · score: 68 (18 votes)
Lightness and Unease 2018-03-21T05:24:26.289Z · score: 53 (15 votes)


Comment by turntrout on TurnTrout's shortform feed · 2020-03-28T13:14:46.395Z · score: 2 (1 votes) · LW · GW

Don't have much of an opinion - I haven't rigorously studied infinitesimals yet. I usually just think of infinite / infinitely small quantities as being produced by limiting processes. For example, the intersection of all the ε-balls around a real number is just that number (under the standard topology), a set which has measure 0 and is, in a sense, "infinitely small".
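That limiting-process picture can be written out explicitly (a standard fact, sketched in my own notation):

```latex
\bigcap_{\varepsilon > 0} B_\varepsilon(x) = \{x\}:
\quad \text{for any } y \neq x, \text{ choosing } \varepsilon < d(x, y)
\text{ excludes } y \text{ from } B_\varepsilon(x),
\text{ so only } x \text{ survives the intersection, and } \mu(\{x\}) = 0.
```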

Comment by turntrout on TurnTrout's shortform feed · 2020-03-27T13:30:45.421Z · score: 2 (1 votes) · LW · GW

To stretch my medicine stores by 200%, I've mixed similar-looking iron supplement placebos in with my real medication. (To be clear, nothing serious happens to me if I miss days.)

Comment by turntrout on How important are MDPs for AGI (Safety)? · 2020-03-27T13:04:57.794Z · score: 2 (1 votes) · LW · GW

The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

Even though I agree it's unrealistic, MDPs are still easier to prove things in, and I still think that they can give us important insights. For example, if I had started with more complex environments when I was investigating instrumental convergence, I would've spent a ton of extra time grappling with the theorems for little perceived benefit. That is, the MDP framework let me more easily cut to the core insights. Sometimes it's worth thinking about more general computable environments, but probably not always.

Comment by turntrout on TurnTrout's shortform feed · 2020-03-26T23:51:48.935Z · score: 4 (2 votes) · LW · GW

It seems to me that Zeno's paradoxes leverage incorrect, naïve notions of time and computation. We exist in the world, and we might suppose that the world is being computed in some way. If time is continuous, then the computer might need to do some pretty weird things to determine our location at an infinite number of intermediate times. However, even if that were the case, we would never notice it – we exist within time and we would not observe the external behavior of the system which is computing us, nor its runtime.

Comment by turntrout on ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' · 2020-03-26T02:33:02.793Z · score: 3 (2 votes) · LW · GW

Thank you for this, that's very helpful.

Comment by turntrout on ODE to Joy: Insights from 'A First Course in Ordinary Differential Equations' · 2020-03-25T23:20:22.878Z · score: 2 (1 votes) · LW · GW

Counterexample: is analytic but its derivatives don't satisfy your proposed condition for being analytic.

Comment by turntrout on The human side of interaction · 2020-03-21T13:37:26.418Z · score: 4 (2 votes) · LW · GW

why do we even believe that human values are good?

Because they constitute, by definition, our goodness criterion? It's not like we have two separate modules - one for "human values", and one for "is this good?". (ETA: or are you pointing out how our values might shift over time as we reflect on our meta-ethics?)

Perhaps the typical human behaviour amplified by possibilities of a super-intelligence would actually destroy the universe.

If I understand correctly, this is "are human behaviors catastrophic?" - not "are human values catastrophic?".

Comment by turntrout on TurnTrout's shortform feed · 2020-03-19T20:24:38.419Z · score: 7 (4 votes) · LW · GW

Broca’s area handles syntax, while Wernicke’s area handles the semantic side of language processing. Subjects with damage to the latter can speak in syntactically fluent jargon-filled sentences (fluent aphasia) – and they can’t even tell their utterances don’t make sense, because they can’t even make sense of the words leaving their own mouth!

It seems like GPT2 : Broca’s area :: ??? : Wernicke’s area. Are there any cog psych/AI theories on this?

Comment by turntrout on TurnTrout's shortform feed · 2020-03-18T18:36:37.285Z · score: 4 (2 votes) · LW · GW

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

Comment by turntrout on [AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement · 2020-03-18T18:33:39.640Z · score: 4 (2 votes) · LW · GW

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

Although the point is more easily made in the deterministic environments, impact doesn't happen in expectation for optimal agents in stochastic environments, either. This is by conservation of expected AU (this is the point I was making in The Gears of Impact).
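One way to write out this "conservation of expected AU" point (a sketch in my own notation, not necessarily the sequence's exact definitions): if impact is the deviation of attainable utility from its expectation,

```latex
\text{Impact}_t \;:=\; V^*(s_{t+1}) \;-\; \mathbb{E}\!\left[V^*(s_{t+1}) \mid s_t, a_t\right],
\qquad
\mathbb{E}\!\left[\text{Impact}_t \mid s_t, a_t\right] = 0,
```

so an agent that already knows the dynamics experiences no impact in expectation – the AU analogue of conservation of expected evidence.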

Similar things can be said about power gain – when we think an agent is gaining power... gaining power compared to what? The agent "always had" that power, in a sense – the only thing that happens is that we realize it.

This line of argument makes me more pessimistic about there being a clean formalization of "don't gain power". I do think that the formalization of power is correct, but I suspect people are doing something heuristic and possibly kludgy when thinking about someone else gaining power.

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-03-18T18:33:13.720Z · score: 2 (1 votes) · LW · GW

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

For the benefit of future readers, I replied to this in the newsletter's comments.

Comment by turntrout on March Coronavirus Open Thread · 2020-03-15T02:28:29.403Z · score: 2 (1 votes) · LW · GW

My roommate tested positive for type A flu. Does this mean he is unlikely to have COVID?

Comment by turntrout on March Coronavirus Open Thread · 2020-03-14T18:48:59.411Z · score: 2 (1 votes) · LW · GW

Doesn't directly answer this concern, but: I just called the Cryonics Institute, and they said that CI and Suspended Animation both plan to continue offering services during the pandemic.

Comment by turntrout on Welcome to Less Wrong! · 2020-03-11T19:22:54.703Z · score: 2 (1 votes) · LW · GW

Welcome :)

Comment by turntrout on March Coronavirus Open Thread · 2020-03-09T05:25:14.001Z · score: 10 (5 votes) · LW · GW

Should poly people consider stopping intimate contact (hugs+) at some point? The network structure of polyamorous relationships might make people particularly vulnerable.

Comment by turntrout on Coherence arguments do not imply goal-directed behavior · 2020-03-08T14:57:25.004Z · score: 3 (2 votes) · LW · GW

because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence;

That's right; that would prove too much.

namely X = "the reward function is typical". Does that sound right?

Yeah, although note that I proved asymptotic instrumental convergence for typical functions under iid reward sampling assumptions at each state, so I think there's wiggle room to say "but the reward functions we provide aren't drawn from this distribution!". I personally think this doesn't matter much, because the work still tells us a lot about the underlying optimization pressures.

The result is also true in the general case of an arbitrary reward function distribution, you just don't know in advance which terminal states the distribution prefers.

Comment by turntrout on Coherence arguments do not imply goal-directed behavior · 2020-03-08T00:22:55.283Z · score: 7 (4 votes) · LW · GW

Sure, I can say more about Alex Turner's formalism! The theorems show that, with respect to some distribution of reward functions and in the limit of farsightedness (as the discount rate goes to 1), the optimal policies under this distribution tend to steer towards parts of the future which give the agent access to more terminal states.

Of course, there exist reward functions for which twitching or doing nothing is optimal. The theorems say that most reward functions aren't like this.

I encourage you to read the post and/or paper; it's quite different from the one you cited in that it shows how instrumental convergence and power-seeking arise from first principles. Rather than assuming "resources" exist, whatever that means, resource acquisition is explained as a special case of power-seeking.

ETA: Also, my recently completed sequence focuses on formally explaining and deeply understanding why catastrophic behavior seems to be incentivized. In particular, see The Catastrophic Convergence Conjecture.

Comment by turntrout on The Gears of Impact · 2020-03-06T14:44:57.078Z · score: 2 (1 votes) · LW · GW

I guess I implicitly imagined the robber could predict the clients with fairly high accuracy. What you described is also plausible.

Comment by turntrout on TurnTrout's shortform feed · 2020-03-06T02:06:02.268Z · score: 7 (4 votes) · LW · GW

Cool Math Concept You Never Realized You Wanted: Fréchet distance.

Imagine a man traversing a finite curved path while walking his dog on a leash, with the dog traversing a separate one. Each can vary their speed to keep slack in the leash, but neither can move backwards. The Fréchet distance between the two curves is the length of the shortest leash sufficient for both to traverse their separate paths. Note that the definition is symmetric with respect to the two curves—the Fréchet distance would be the same if the dog were walking its owner.

The Fréchet distance between two concentric circles of radius r1 and r2 respectively is |r1 − r2|. The longest leash is required when the owner stands still and the dog travels to the opposite side of the circle (r1 + r2), and the shortest leash when both owner and dog walk at a constant angular velocity around the circle (|r1 − r2|).
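For curves given as point sequences, the discrete Fréchet distance can be computed with the standard Eiter–Mannila dynamic program (a sketch, not tuned for performance):

```python
import math
from functools import lru_cache

def discrete_frechet(p, q):
    """Discrete Fréchet distance between point sequences p and q:
    minimize, over monotone walks along both sequences, the maximum
    pointwise distance -- the 'shortest sufficient leash'."""
    p, q = list(p), list(q)

    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(p[i], q[j])  # leash length needed at this step
        if i == 0 and j == 0:
            return d
        moves = []
        if i > 0:
            moves.append(c(i - 1, j))      # only the owner advanced
        if j > 0:
            moves.append(c(i, j - 1))      # only the dog advanced
        if i > 0 and j > 0:
            moves.append(c(i - 1, j - 1))  # both advanced
        return max(min(moves), d)

    return c(len(p) - 1, len(q) - 1)

# Concentric circles of radius 1 and 2, sampled at matching angles:
# the answer is the radius gap.
ts = [2 * math.pi * k / 64 for k in range(65)]
inner = [(math.cos(t), math.sin(t)) for t in ts]
outer = [(2 * math.cos(t), 2 * math.sin(t)) for t in ts]
print(discrete_frechet(inner, outer))  # ~1.0
```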

Comment by turntrout on Goodhart's Curse and Limitations on AI Alignment · 2020-03-03T23:55:49.766Z · score: 2 (1 votes) · LW · GW

Coincidentally, just yesterday I was part of some conversations that now make me more bullish on this approach. I haven't thought about it much in quite a while, and now I'm returning to it.

The potential solution I was referring to is motivated in the recently-completed Reframing Impact sequence.

Comment by turntrout on Epistemic standards for “Why did it take so long to invent X?” · 2020-03-03T04:32:54.560Z · score: 5 (3 votes) · LW · GW

Strong upvote – I really enjoyed and appreciated your use of specific examples.

How did you format the captions and center the images?

Comment by turntrout on Towards a mechanistic understanding of corrigibility · 2020-03-02T22:13:45.520Z · score: 3 (2 votes) · LW · GW

That’s correct.

Comment by turntrout on I don't understand Rice's Theorem and it's killing me · 2020-03-02T15:53:45.252Z · score: 4 (2 votes) · LW · GW

Also, a “nontrivial semantic property” is captured by testing membership in a set of partial computable functions (which isn’t the empty set or the set of all computable functions). Note that testing what function the Turing machine is implementing automatically sidesteps syntactic checks. For example, I could just want to check whether the program implements the constant-zero function, but Rice’s theorem says I can’t check this for all programs!

Because if I could, then I could find out whether a given machine halts on a given input. Do you see how? Do you see how the proof fails if the property in question is trivial?
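The standard reduction can be sketched in code (spoiler for the "do you see how?" above; here a machine is modeled as a Python callable, and the decider for "implements constant zero" is exactly the hypothetical thing Rice's theorem rules out):

```python
def make_gadget(machine, inp):
    """Build a new program from (machine, inp): gadget(x) first runs
    `machine` on `inp` -- possibly forever -- discards the result,
    then returns 0, regardless of x."""
    def gadget(x):
        machine(inp)  # if this never halts, gadget never returns
        return 0
    return gadget

# If `machine` halts on `inp`, gadget implements the constant-zero
# function; if not, gadget computes the empty function. So handing
# gadget to a decider for "implements constant zero" would decide
# halting -- contradiction, hence no such decider exists.

g = make_gadget(lambda n: n * n, 5)  # this machine halts on every input
print(g(123))  # 0
```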

Comment by turntrout on I don't understand Rice's Theorem and it's killing me · 2020-03-02T03:32:29.415Z · score: 16 (7 votes) · LW · GW

You have a couple of confusions about what algorithms are, what “semantic properties” are, etc. I’ll try to unpack those briefly.

In my brain, this translates to: there is no general algorithm that can accurately assess non-trivial statements about any program's behavior.

About all programs' behavior. You can certainly tell that certain machines halt, just by running them for, e.g., 10,000 steps, seeing that some halt, and saying "yes" to those.
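That bounded-simulation idea can be sketched directly. Here I model a machine as a zero-argument generator function, where each `yield` is one step (my own modeling choice, not a standard API):

```python
def halts_within(machine, steps):
    """Run `machine` (a zero-argument generator function, one yield
    per step) for at most `steps` steps. True means 'definitely
    halts'; None means 'budget exhausted -- no information'."""
    it = machine()
    for _ in range(steps):
        try:
            next(it)
        except StopIteration:
            return True  # the machine halted within the budget
    return None

def halting_machine():
    for _ in range(100):  # 100 steps of "work", then halt
        yield

def looping_machine():
    while True:  # never halts
        yield

print(halts_within(halting_machine, 10_000))  # True
print(halts_within(looping_machine, 10_000))  # None
```

This is why halting is semidecidable: "yes" answers are always eventually confirmed, but no finite budget distinguishes "loops forever" from "hasn't halted yet".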

The part where I start to go mad is when I consider that turing machines can also be real physical machines, in my world.

Nitpick: Turing machines technically require an infinite storage tape, but all the computers in our universe have finite memory. You could think of us as abstractly implementing a bounded-space Turing machine, maybe?

But I'm made of cells. And all electronic computers are (for now) constructed of superatomic-level objects which we can "predict" the behavior of from fixed starting positions and basic physics. If that "semantic property" of a program is whether or not it flips the bit at 0xff812938 to 1, then how on earth is that not decidable by the laws of physics? Either the lever is going to fall or it's not. I can't reconcile the macro level concept with the micro level concept that my laptop is just a Rube-Goldberg machine and somewhere in there the result is determined by silicon, not abstract computer scientists.

Suppose we have infinite energy and run a perfect computer forever. We task it with deciding whether programs halt (without loss of generality wrt Rice’s Theorem, since HALT is Turing-reducible to nontrivial semantic checks). Whether the screen emits photons in a “Yes!” or in a “No.” pattern is a physical consequence in our universe, yes - that’s because this is a “semidecidable” problem: we can always test the Yes cases correctly after some length of time has passed, but we can stall out on some No instances.

But how do we know the computer will give us an answer at all? You can keep running the laws of physics forward in your prediction, yes, but there’s no guarantee that after any finite amount of time the computer has displayed an answer.

So we see that

So, what I think Rice's Theorem is saying by extension, is that since at some points in time I can do computation and base my behavior on the results of computation, there's no general algorithm to assess what I as a human being am going to do either.

is not true, because your future behavior could theoretically be predicted by running your physics simulator forward by a finite number of steps.

Am I supposed to believe that the outcome of a Magic the Gathering game is undecidable in some weird sense, or is this only the case about some weird pseudo-property of the game that doesn't actually affect my ability to predict how many life points player X is going to lose after these series of card plays?

To get intuition for this, I recommend getting a better feel for X-completeness in general (eg Mario is NP-complete).

Comment by turntrout on The Zettelkasten Method · 2020-03-01T23:12:19.200Z · score: 5 (3 votes) · LW · GW

No, but I've also stopped doing a number of other habits for unrelated personal reasons. I'm also not currently engaging in that kind of exploratory research. Once I am, I strongly expect I'll resume the habit.

Comment by turntrout on Towards a mechanistic understanding of corrigibility · 2020-03-01T21:29:21.480Z · score: 5 (3 votes) · LW · GW

The post answers to what extent safely tuning that trade-off is feasible, and the surrounding sequence motivates that penalization scheme in greater generality. From Conclusion to 'Reframing Impact':

Comment by turntrout on Towards a mechanistic understanding of corrigibility · 2020-03-01T20:45:52.395Z · score: 4 (2 votes) · LW · GW

See How Low Should Fruit Hang Before We Pick It?.

Comment by turntrout on Approaches for collecting and analyzing data about yourself? · 2020-02-29T17:19:45.322Z · score: 4 (2 votes) · LW · GW

correlations between events/actions and your state

The phrase you're looking for is credit assignment.

I feel like CFAR has some things in the handbook about this, but a quick ctrl-F didn't bring anything up.

Comment by turntrout on Subagents and impact measures, full and fully illustrated · 2020-02-28T15:18:49.617Z · score: 2 (1 votes) · LW · GW

Then we get an agent with an incentive to stop any human present in the environment from becoming too good

No, this modification stops people from actually optimizing if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.

In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.

I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T21:26:08.059Z · score: 2 (1 votes) · LW · GW

what do you mean by "for all "?

The random baseline is an idea I think about from time to time, but usually I don't dwell because it seems like the kind of clever idea that secretly goes wrong somehow? It depends whether the agent has any way of predicting what the random action will be at a future point in time.

If it can predict it, I'd imagine that it might find a way to gain a lot of power by selecting a state whose randomly selected action is near-optimal. Because of the denominator, it would still be appropriately penalized for performing better than the randomly selected action, but it won't receive a penalty for choosing an action with expected optimal value just below the near-optimal action.

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T21:06:04.094Z · score: 4 (2 votes) · LW · GW

I basically don't have much trust for meditation in this sort of case

I’m not asking you to trust in anything, which is why I emphasized that I want people to think more carefully about these choices. I do not think eq. 5 is AGI-safe. I do not think you should put it in an AGI. Do I think there’s a chance it might work? Yes. But we don’t work with “chances”, so it’s not ready.

Anyways, if theorem 11 of the low-hanging fruit post is met, the tradeoff penalty works fine. I also formally explored the hard constraint case and discussed a few reasons why the tradeoff is preferable to the hard constraint. Therefore, I think that particular design choice is reasonably determined. Would you want to think about this more before actually running an AGI with that choice? Of course.

To your broader point, I think there may be another implicit frame difference here. I’m talking about the diff of the progress, considering questions like “are we making a lot of progress? What’s the marginal benefit of more research like this? Are we getting good philosophical returns from this line of work?”, to which I think the answer is yes.

On the other hand, you might be asking “are we there yet?”, and I think the answer to that is no. Notice how these answers don’t contradict each other.

From the first frame, being skeptical because each part of the equation isn’t fully determined seems like an unreasonable demand for rigor. I wrote this sequence because it seemed that my original AUP post was pedagogically bad (I was already thinking about concepts like “overfitting the AU landscape” back in August 2018) and so very few people understood what I was arguing.

I’d like to think that my interpretive labor has paid off: AUP isn’t a slapdash mixture of constraints which is too complicated to be obviously broken; it’s attempting to directly disincentivize catastrophes based off of straightforward philosophical reasoning, relying on assumptions and conjectures which I’ve clearly stated. In many cases, I waited weeks so I could formalize my reasoning in the context of MDPs (e.g. why should you think of the AU landscape as a ‘dual’ to the world state? Because I proved it).

There’s always another spot where I could make my claims more rigorous, where I could gather just a bit more evidence. But at some point I have to actually put the posts up, and I think I’ve provided some pretty good evidence in this sequence.

From the second frame, being skeptical because each part of the equation isn’t fully determined is entirely appropriate and something I encourage.

I think you’re writing from something closer to the second frame, but I don’t know for sure. For my part, this sequence has been arguing from the first frame: “towards a new impact measure”, and that’s why I’ve been providing pushback.

Comment by turntrout on How Low Should Fruit Hang Before We Pick It? · 2020-02-27T20:23:50.734Z · score: 2 (1 votes) · LW · GW

Oops, you’re right. I fixed the proof.

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T17:44:56.343Z · score: 4 (2 votes) · LW · GW

Again, I worry that patches are based a lot on intuition.

If you want your math to abstractly describe reality in a meaningful sense, intuition has to enter somewhere (usually in how you formally define and operationalize the problem of interest). Therefore, I’m interpreting this as “I don’t see good principled intuitions behind the improvements”; please let me know if this is not what you meant.

I claim that, excepting the choice of denominator, all of the improvements follow directly from AUP (and actually, eq. 1 was the equation with arbitrary choices wrt the AGI case; I started with that because that’s how my published work formalizes the problem).

CCC says catastrophes are caused by power-seeking behavior from the agent. Agents are only incentivized to pursue power in order to better achieve their own goals. Therefore, the correct equation should look something like “do your primary goal but be penalized for becoming more able to achieve your primary goal”. In this light, penalizing the primary goal's AU is obviously better than using an auxiliary goal, penalizing decreases is obviously irrelevant, and penalizing immediate reward advantage is obviously irrelevant.

The denominator, on the other hand, is indeed the product of meditating on “What kind of elegant rescaling keeps making sense in all sorts of different situations, but also can’t be gamed to arbitrarily decrease the penalty?”.
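For concreteness, the "penalize becoming more able to achieve your primary goal" shape looks something like this (my paraphrase of the design, not necessarily the sequence's exact eq. 5):

```latex
R_{\text{AUP}}(s, a) \;=\; R(s, a) \;-\; \lambda \,
\frac{Q^*_R(s, a) - Q^*_R(s, \varnothing)}{Q^*_R(s, \varnothing)},
```

where $\varnothing$ is a no-op action, $Q^*_R$ measures the agent's attainable utility for its own primary reward $R$, and $\lambda$ tunes the aggressiveness of the penalty; the denominator is the rescaling term discussed here.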

Comment by turntrout on How Low Should Fruit Hang Before We Pick It? · 2020-02-27T17:19:45.107Z · score: 2 (1 votes) · LW · GW

Is there any particular part of it that seems locally invalid? Can you be a little more specific about what’s confusing?

Comment by turntrout on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T15:10:01.933Z · score: 2 (1 votes) · LW · GW

My very general concern is that strategies that maximize AUP reward might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.

My argument hinges on CCC being true. If CCC is true, and if we can actually penalize the agent for accumulating power, then if the agent doesn’t want to accumulate power, it’s not incentivized to screw us over. I feel like this is a pretty good intuitive argument, and it’s one I dedicated the first two-thirds of the sequence to explaining. You’re right that it’s intuitive, of course.

I guess our broader disagreement may be “what would an actual solution for impact measurement have going for it at this moment in time?”, and it’s not clear that I’d expect to have formal arguments to this effect / I don’t know how to meet this demand for rigor.

[ETA: I should note that some of my most fruitful work over the last year came from formalizing some of my claims. People were skeptical that slowly decreasing the penalty aggressiveness would work, so I hashed out the math in How Low Should Fruit Hang Before We Pick It?. People were uneasy that the original AUP design relied on instrumental convergence being a thing (eq. 5 doesn’t make that assumption) when maybe it actually isn’t. So I formalized instrumental convergence in Seeking Power is Instrumentally Convergent in MDPs and proved when it exists to at least some extent.

There’s probably more work to be done like this.]

I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.

The claim is dually resting on “we know conceptually how to solve impact measurement / what we want to implement, and it’s a simple and natural idea, so it’s plausible there’s a clean implementation of it”. I think learning “no, there isn’t a clean way to penalize the agent for becoming more able to achieve its own goal” would be quite surprising, but not implausible – I in fact think there’s a significant chance Stuart is right. More on that next post.

Also, you could argue against any approach to AI alignment by pointing out that there are still things to improve and fix, or that there were problems pointed out in the past which were fixed, but now people have found a few more problems. The thing that makes me think the patches might not be endless here is that, as I’ve argued earlier, I think AUP is conceptually correct.

This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.

It all depends whether we can get a buffer between catastrophes and reasonable plans here (reasonable plans show up for much less aggressive settings of ) and I think we can. Now, this particular problem (with huge reward) might not show up because we can bound the reward [0,1], and I generally think there exist reasonable plans where the agent gets at least 20% or so of its maximal return (suppose it thinks there’s a 40% chance we let it get 95% of its maximal per-timestep reward each timestep in exchange for it doing what we want).

[ETA: Actually, if the "reasonable" reward is really, really low in expectation, it's not clear what happens. This might happen if catastrophe befalls us by default.]

You’re right we should inspect the equation for weird incentives, but to a limited extent, this is also something we can test experimentally. We don’t necessarily have to rely on intuition in all cases.

The hope is we can get to a formula that’s simple enough such that all of its incentives are thoroughly understood. I think you’ll agree eq. 5 is far better in this respect than the original AUP formulation!

Comment by turntrout on How Low Should Fruit Hang Before We Pick It? · 2020-02-26T16:48:58.015Z · score: 2 (1 votes) · LW · GW

Utility is bounded [0,1].

If the conditions of theorem 11 are met, we’re fine. There are some good theoretical reasons not to use constraints (beyond the computational ones).

(It’s true that the buffering criterion is nice and simple for constrained partitions (the first non-dominated catastrophe has times the impact of the first non-dominated reasonable plan).)

Comment by turntrout on Continuous Improvement: Insights from 'Topology' · 2020-02-25T19:36:56.362Z · score: 2 (1 votes) · LW · GW

Yikes, you’re right. Oops. Wrote that part early on my way through the book. Removed the section because I don’t think it was too insightful anyways.

Comment by turntrout on Subagents and impact measures, full and fully illustrated · 2020-02-25T15:24:15.523Z · score: 4 (2 votes) · LW · GW

Mind-reading violates the Cartesian assumption and so we can’t reason about it formally (yet!), but I think there’s a version of effectively getting what you’re after that doesn’t.

Comment by turntrout on Continuous Improvement: Insights from 'Topology' · 2020-02-23T00:29:58.008Z · score: 3 (2 votes) · LW · GW

Wrt continuity, I was implicitly just thinking of metric spaces (which are all first-countable, obviously). I’ll edit the post to clarify.

Comment by turntrout on Attainable Utility Preservation: Empirical Results · 2020-02-22T20:07:01.210Z · score: 2 (1 votes) · LW · GW

Decreases or increases?

Decreases. Here, the "human" is just a block which paces back and forth. Removing the block removes access to all states containing that block.

  1. Is "Model-free AUP" the same as "AUP stepwise"?

Yes. See the paper for more details.

  1. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

I'm pretty sure it's just an artifact of the training process and the penalty term. I remember investigating it in 2018 and concluding it wasn't anything important, but unfortunately I don't recall the exact explanation.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)

It would still try to preserve access to future states as much as possible with respect to doing nothing that turn.

Is the code for the SafeLife PPO-AUP stuff you did on github?

Here. Note that we're still ironing things out, but the preliminary results have been pretty solid.

Comment by turntrout on Attainable Utility Preservation: Empirical Results · 2020-02-22T16:05:09.526Z · score: 2 (1 votes) · LW · GW

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

This has been an idea I’ve been intrigued by ever since AUP came out. My main concern with it is the increase in compute required and loss of competitiveness. Still probably worth running the experiments.

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a translation-invariant normalization? (Or generate benign auxiliary reward functions in the first place.)

Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine transformation only if the denominator is also the penalty for taking some action (absolute difference in Q values). You could, for example, consider the penalty of some mild action: . It’s really up to the designer in the near-term. We’ll talk about more streamlined designs for superhuman use cases in two posts.
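
A minimal numerical sketch of the invariance point (all Q-values are made-up illustrative numbers, and the function names are mine, not from the paper):

```python
# Sketch: a penalty normalized by a *difference* of Q-values (itself a
# penalty, per Proposition 4's condition) is invariant to adding a
# constant c to the auxiliary utility; normalizing by a raw Q-value is
# not. All Q-values are made up for illustration.

def penalty_diff_normalized(q_act, q_noop, q_mild, c=0.0):
    # Denominator is the penalty of some mild action: an absolute
    # difference of Q-values, so the constant cancels.
    return abs((q_act + c) - (q_noop + c)) / abs((q_mild + c) - (q_noop + c))

def penalty_raw_normalized(q_act, q_noop, c=0.0):
    # Denominator is a raw Q-value; the constant no longer cancels.
    return abs((q_act + c) - (q_noop + c)) / abs(q_noop + c)

q_act, q_noop, q_mild = 7.0, 3.0, 4.0
print(penalty_diff_normalized(q_act, q_noop, q_mild, c=0.0))   # 4.0
print(penalty_diff_normalized(q_act, q_noop, q_mild, c=10.0))  # 4.0 (unchanged)
print(penalty_raw_normalized(q_act, q_noop, c=0.0))            # ~1.33
print(penalty_raw_normalized(q_act, q_noop, c=10.0))           # ~0.31 (changed)
```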

Is there an environment where this agent would spuriously go in circles?

Don’t think so. Moving generates tiny penalties, and going in circles usually isn’t a great way to accrue primary reward.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T18:34:14.853Z · score: 4 (2 votes) · LW · GW

Thanks for doing this. I was originally going to keep a text version of the whole sequence, but I ended up making lots of final edits in the images, and this sequence has already taken an incredible amount of time on my part.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T18:32:34.739Z · score: 2 (1 votes) · LW · GW

if we make sure that power is low enough, we can turn it off; if the agent will acquire power when that's the only way to achieve its goal, rather than stopping at/before some limit, then it might still acquire power and be catastrophic*, etc.

Yeah. I have the math for this kind of tradeoff worked out - stay tuned!

Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power."

I think this is true, actually; if another agent already has a lot of power and it isn't already catastrophic for us, their continued existence isn't that big of a deal wrt the status quo. The bad stuff comes with the change in who has power.

The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn't trying to increase its own AU?

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T15:15:07.266Z · score: 2 (1 votes) · LW · GW

The power limitation isn’t a hard cap, it’s a tradeoff. AUP agents do not have to half-ass anything. As I wrote in another comment,

It prefers plans that don’t gain unnecessary power.

If “unnecessary” is too squishy of a word for your tastes, I’m going get quite specific in the next few posts.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T03:14:51.253Z · score: 4 (2 votes) · LW · GW

The conclusion doesn't follow from the premise.

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.

Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.

I meant "preserving" as in "not incentivized to take away power from us", not "keeps us from benefitting from anything", but you're right about the implication as stated. Sorry for the ambiguity.

Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals?


"AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape."

Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)

Nearby wrt this kind of "AU distance/practical perspective", yes. Great catch.

Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?

As a consequence of this being a more effective approach, it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low hanging fruit.) A capable agent that thinks about process, seeing 'opportunities to gain power', is of some general concern. In this case because an agent that tries to minimize reducing/affecting* other agents' attainable utility, without knowing/needing to know about other agents, is somewhat counterintuitive.

*It's not clear if increasing shows up on the AUP map, or how that's handled.

Great thoughts. I think some of this will be answered in a few posts by the specific implementation details. What do you mean by "AUP map"? The AU landscape?

What does the agent do if it unexpectedly gains power by some means,

The idea is it only penalizes expected power gain.

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-17T17:25:49.397Z · score: 4 (2 votes) · LW · GW

Intriguing. I don't know whether that suggests our values aren't as complicated as we thought, or whether the pressures which selected them are not complicated.

While I'm not an expert on the biological intrinsic motivation literature, I think it's at least true that some parts of our values were selected for because they're good heuristics for maintaining AU. This is the thing that MCE was trying to explain:

The paper’s central notion begins with the claim that there is a physical principle, called “causal entropic forces,” that drives a physical system toward a state that maximizes its options for future change. For example, a particle inside a rectangular box will move to the center rather than to the side, because once it is at the center it has the option of moving in any direction. Moreover, argues the paper, physical systems governed by causal entropic forces exhibit intelligent behavior.

I think they have this backwards: intelligent behavior often results in instrumentally convergent behavior (and not necessarily the other way around). Similarly, Salge et al. overview the behavioral empowerment hypothesis:

The adaptations brought about by natural evolution produce organisms that, in absence of specific goals, behave as if they were maximizing [mutual information between their actions and future observations].

As I discuss in section 6.1 of Optimal Farsighted Agents Tend to Seek Power, I think that "ability to achieve goals in general" (power) is a better intuitive and technical notion than information-theoretic empowerment. I think it's pretty plausible that we have heuristics which, all else equal, push us to maintain or increase our power.
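
To make the contrast concrete, here's a toy construction of my own (not from either paper). With deterministic dynamics, one-step empowerment reduces to the log-count of distinct reachable next states, while a crude "power" proxy averages the best attainable reward over a fixed set of goal states:

```python
import math

# Toy deterministic environment (made-up dynamics, for illustration):
# transitions[s][a] = next state.
transitions = {
    0: [1, 2, 3],  # "hub" state: three actions reach three distinct states
    1: [0, 0, 0],  # "dead end": every action leads back to the hub
}

def one_step_empowerment(s):
    # With deterministic dynamics, empowerment (channel capacity between
    # the action and the next observation) is log2 of the number of
    # distinct reachable next states.
    return math.log2(len(set(transitions[s])))

def one_step_power(s, goals):
    # Crude "power" proxy: average best attainable reward over goal
    # states, where reaching the goal on the next step pays 1.
    return sum(max(1.0 if transitions[s][a] == g else 0.0
                   for a in range(len(transitions[s])))
               for g in goals) / len(goals)

goals = [0, 1, 2, 3]
print(one_step_empowerment(0), one_step_power(0, goals))  # log2(3) ≈ 1.585, 0.75
print(one_step_empowerment(1), one_step_power(1, goals))  # 0.0, 0.25
```

Here the two notions agree on the ranking, but they needn't in general: empowerment only counts distinguishable outcomes, while power weights outcomes by how well they serve goals.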

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-17T16:42:51.206Z · score: 2 (1 votes) · LW · GW

This post is about AUP-the-concept, not about specific implementations. That plan increases its ability to have paperclips maximized and so is penalized by AUP. We'll talk specifics later.

ETA: As a more general note, this post should definitely have an "aha!" associated with it, so if it doesn't, I encourage people to ask questions.

Comment by turntrout on Subagents and impact measures: summary tables · 2020-02-17T15:41:26.579Z · score: 2 (1 votes) · LW · GW

RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.

Things might get complicated by partial observability; in the real world, the agent is minimizing change in its beliefs about what it can reach. Otherwise, you could just get around the SA problem for AUP as well by substituting the reward functions for state indicator reward functions.
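
One way to see why indicator rewards track reachability: the optimal value of a state-indicator reward is just the best discounted return from occupying that state. A minimal value-iteration sketch over a made-up three-state chain (my own toy, not from the post):

```python
# Sketch: the attainable utility (optimal value) of a state-indicator
# reward is the best discounted return from occupying that state, so an
# AU penalty over indicator rewards tracks changes in reachability.
# transitions[s][a] = next state (made-up deterministic chain).
gamma = 0.9
transitions = {0: [0, 1], 1: [0, 2], 2: [2, 2]}

def indicator_value(start, target, iters=50):
    # Synchronous value iteration; reward is 1 for each step at `target`.
    v = {s: 0.0 for s in transitions}
    for _ in range(iters):
        v = {s: max((1.0 if s == target else 0.0) + gamma * v[transitions[s][a]]
                    for a in range(2))
             for s in transitions}
    return v[start]

# Value is highest at the target and decays with distance to it.
print(indicator_value(2, 2), indicator_value(1, 2), indicator_value(0, 2))
```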

Comment by turntrout on Stepwise inaction and non-indexical impact measures · 2020-02-17T15:32:20.272Z · score: 2 (1 votes) · LW · GW

I'll establish two facts: that under the stepwise inaction baseline, a subagent completely undermines all impact measures (including twenty billion questions).

Note this implicitly assumes an agent benefits by building the subagent. The specific counterexample I have in mind will be a few posts later in my sequence.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-17T15:21:19.100Z · score: 5 (3 votes) · LW · GW

Depends how much power that gains compared to other plans. It prefers plans that don’t gain unnecessary power.

In fact, the “encouraged policy” in the post has the agent reading a Paperclips for Dummies book and making a few extra paperclips.