## Posts

Attainable Utility Preservation: Concepts 2020-02-17T05:20:09.567Z · score: 25 (7 votes)
The Catastrophic Convergence Conjecture 2020-02-14T21:16:59.281Z · score: 31 (9 votes)
Attainable Utility Landscape: How The World Is Changed 2020-02-10T00:58:01.453Z · score: 40 (11 votes)
Does there exist an AGI-level parameter setting for modern DRL architectures? 2020-02-09T05:09:55.012Z · score: 15 (6 votes)
AI Alignment Corvallis Weekly Info 2020-01-26T21:24:22.370Z · score: 7 (1 votes)
On Being Robust 2020-01-10T03:51:28.185Z · score: 40 (17 votes)
Judgment Day: Insights from 'Judgment in Managerial Decision Making' 2019-12-29T18:03:28.352Z · score: 23 (7 votes)
Can fear of the dark bias us more generally? 2019-12-22T22:09:42.239Z · score: 22 (5 votes)
Clarifying Power-Seeking and Instrumental Convergence 2019-12-20T19:59:32.793Z · score: 41 (14 votes)
Seeking Power is Instrumentally Convergent in MDPs 2019-12-05T02:33:34.321Z · score: 105 (30 votes)
How I do research 2019-11-19T20:31:16.832Z · score: 56 (22 votes)
Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 54 (24 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 38 (12 votes)
World State is the Wrong Level of Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 51 (15 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 53 (17 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 57 (16 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 49 (19 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 86 (33 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 28 (9 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 28 (6 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 111 (38 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 42 (14 votes)
Worrying about the Vase: Whitelisting 2018-06-16T02:17:08.890Z · score: 84 (20 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 115 (38 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)
Internalizing Internal Double Crux 2018-04-30T18:23:14.653Z · score: 80 (19 votes)
The First Rung: Insights from 'Linear Algebra Done Right' 2018-04-22T05:23:49.024Z · score: 77 (22 votes)
Unyielding Yoda Timers: Taking the Hammertime Final Exam 2018-04-03T02:38:48.327Z · score: 40 (12 votes)
Open-Category Classification 2018-03-28T14:49:23.665Z · score: 36 (8 votes)
The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach' 2018-03-25T06:55:46.204Z · score: 68 (18 votes)
Lightness and Unease 2018-03-21T05:24:26.289Z · score: 53 (15 votes)
How to Dissolve It 2018-03-07T06:19:22.923Z · score: 41 (15 votes)
Ambiguity Detection 2018-03-01T04:23:13.682Z · score: 33 (9 votes)
Set Up for Success: Insights from 'Naïve Set Theory' 2018-02-28T02:01:43.790Z · score: 62 (18 votes)
Walkthrough of 'Formalizing Convergent Instrumental Goals' 2018-02-26T02:20:09.294Z · score: 27 (6 votes)
Interpersonal Approaches for X-Risk Education 2018-01-24T00:47:44.183Z · score: 29 (8 votes)

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T18:34:14.853Z · score: 4 (2 votes) · LW · GW

Thanks for doing this. I was originally going to keep a text version of the whole sequence, but I ended up making lots of final edits in the images, and this sequence has already taken an incredible amount of time on my part.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T18:32:34.739Z · score: 2 (1 votes) · LW · GW

if we make sure that power is low enough we can turn it off, if the agent will acquire power if that's the only way to achieve its goal rather than stopping at/before some limit then it might still acquire power and be catastrophic*, etc.

Yeah. I have the math for this kind of tradeoff worked out - stay tuned!

Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power."

I think this is true, actually; if another agent already has a lot of power and it isn't already catastrophic for us, their continued existence isn't that big of a deal wrt the status quo. The bad stuff comes with the change in who has power.

The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn't trying to increase its own AU?

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T15:15:07.266Z · score: 2 (1 votes) · LW · GW

The power limitation isn’t a hard cap, it’s a tradeoff. AUP agents do not have to half-ass anything. As I wrote in another comment,

It prefers plans that don’t gain unnecessary power.

If “unnecessary” is too squishy of a word for your tastes, I’m going to get quite specific in the next few posts.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-18T03:14:51.253Z · score: 4 (2 votes) · LW · GW

The conclusion doesn't follow from the premise.

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.

Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.

I meant "preserving" as in "not incentivized to take away power from us", not "keeps us from benefitting from anything", but you're right about the implication as stated. Sorry for the ambiguity.

Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals?

Metaphor.

"AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape."

Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)

Nearby wrt this kind of "AU distance/practical perspective", yes. Great catch.

Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?

As a consequence of this being a more effective approach - it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low hanging fruit.) A capable agent that thinks about process seeing 'opportunities to gain power' is of some general concern. In this case because an agent that tries to minimize reducing/affecting* other agents' attainable utility, without knowing/needing to know about other agents is somewhat counterintuitive.

**It's not clear if increasing shows up on the AUP map, or how that's handled.

Great thoughts. I think some of this will be answered in a few posts by the specific implementation details. What do you mean by "AUP map"? The AU landscape?

What does the agent do if it unexpectedly gains power by some means,

The idea is it only penalizes expected power gain.

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-17T17:25:49.397Z · score: 4 (2 votes) · LW · GW

Intriguing. I don't know whether that suggests our values aren't as complicated as we thought, or whether the pressures which selected them are not complicated.

While I'm not an expert on the biological intrinsic motivation literature, I think it's at least true that some parts of our values were selected for because they're good heuristics for maintaining AU. This is the thing that MCE was trying to explain:

The paper’s central notion begins with the claim that there is a physical principle, called “causal entropic forces,” that drives a physical system toward a state that maximizes its options for future change. For example, a particle inside a rectangular box will move to the center rather than to the side, because once it is at the center it has the option of moving in any direction. Moreover, argues the paper, physical systems governed by causal entropic forces exhibit intelligent behavior.

I think they have this backwards: intelligent behavior often results in instrumentally convergent behavior (and not necessarily the other way around). Similarly, Salge et al. overview the behavioral empowerment hypothesis:

The adaptation brought about by natural evolution produced organisms that in absence of specific goals behave as if they were maximizing [mutual information between their actions and future observations].

As I discuss in section 6.1 of Optimal Farsighted Agents Tend to Seek Power, I think that "ability to achieve goals in general" (power) is a better intuitive and technical notion than information-theoretic empowerment. I think it's pretty plausible that we have heuristics which, all else equal, push us to maintain or increase our power.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-17T16:42:51.206Z · score: 2 (1 votes) · LW · GW

This post is about AUP-the-concept, not about specific implementations. That plan increases its ability to have paperclips maximized and so is penalized by AUP. We'll talk specifics later.

ETA: As a more general note, this post should definitely have an "aha!" associated with it, so if it doesn't, I encourage people to ask questions.

Comment by turntrout on Subagents and impact measures: summary tables · 2020-02-17T15:41:26.579Z · score: 2 (1 votes) · LW · GW

RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.

Things might get complicated by partial observability; in the real world, the agent is minimizing change in its beliefs about what it can reach. Otherwise, you could just get around the SA problem for AUP as well by substituting the reward functions for state indicator reward functions.

Comment by turntrout on Stepwise inaction and non-indexical impact measures · 2020-02-17T15:32:20.272Z · score: 2 (1 votes) · LW · GW

I'll establish two facts: that under the stepwise inaction baseline, a subagent completely undermines all impact measures (including twenty billion questions).

Note this implicitly assumes an agent benefits by building the subagent. The specific counterexample I have in mind will be a few posts later in my sequence.

Comment by turntrout on Attainable Utility Preservation: Concepts · 2020-02-17T15:21:19.100Z · score: 5 (3 votes) · LW · GW

Depends how much power that gains compared to other plans. It prefers plans that don’t gain unnecessary power.

In fact, the “encouraged policy” in the post has the agent reading a Paperclips for Dummies book and making a few extra paperclips.

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-17T05:20:02.466Z · score: 2 (1 votes) · LW · GW

I'm not thinking of optimizing for "not an objective catastrophe" directly - it's just a useful concept. The next post covers this.

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-15T23:52:52.794Z · score: 3 (2 votes) · LW · GW

Note that when I said

(we don't need any fancy engineering or arbitrary choices to figure out AUs/optimal value from the agent's perspective).

I meant we could just consider how the agent's AUs are changing without locating a human in the environment.

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-15T14:17:00.382Z · score: 2 (1 votes) · LW · GW

Wouldn't you need to locate the abstract concept of AU within the AI's ontology? Is that easier? Or sorry if I'm misunderstanding.

To the contrary, an AU is naturally calculated from reward, one of the few things that is ontologically fundamental in the paradigm of RL. As mentioned in the last post, the AU of reward function $R$ is $V^*_R$, which calculates the maximum possible $R$-return from a given state.

This will become much more obvious in the AUP empirical post.
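As a rough illustration of the claim that an AU falls straight out of reward (a sketch, not the implementation used in the sequence): in a small deterministic MDP, the optimal value for a reward function can be computed with ordinary value iteration. The name `optimal_value`, the convention that reward is attached to states, and the three-state toy environment are all assumptions made up for this example.

```python
from typing import Dict, List

State = str


def optimal_value(
    successors: Dict[State, List[State]],
    reward: Dict[State, float],
    gamma: float = 0.9,
    tol: float = 1e-10,
) -> Dict[State, float]:
    """Value iteration for a deterministic MDP in which an action picks the next state.

    Solves V*(s) = R(s) + gamma * max over s' in successors[s] of V*(s').
    Every state needs at least one successor (use a self-loop for dead ends).
    """
    values = {s: 0.0 for s in successors}
    while True:
        new_values = {
            s: reward[s] + gamma * max(values[n] for n in successors[s])
            for s in successors
        }
        # Stop once no state's value moved by more than the tolerance.
        if max(abs(new_values[s] - values[s]) for s in successors) < tol:
            return new_values
        values = new_values


# Hypothetical toy environment: from "start" the agent can commit to "left" or "right".
toy_successors = {"start": ["left", "right"], "left": ["left"], "right": ["right"]}
toy_reward = {"start": 0.0, "left": 1.0, "right": 0.0}
toy_au = optimal_value(toy_successors, toy_reward, gamma=0.5)
```

Nothing here requires locating a human, or any other abstract concept, in the environment: the only inputs are the transition structure and the reward function.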

Comment by turntrout on The Catastrophic Convergence Conjecture · 2020-02-15T05:19:03.189Z · score: 3 (2 votes) · LW · GW

Calling both of these things a "catastrophe" seems to sweep that difference under the rug.

Sure, but just like it makes sense to be able to say that a class of outcomes is "good" without every single such outcome being maximally good, it makes sense to have a concept for catastrophes, even if they're not literally the worst things possible.

Which seems bound to happen even in the best case where a FAI takes over.

Building a powerful agent helping you get what you want, doesn't destroy your ability to get what you want. By my definition, that's not a catastrophe.

as if a "catastrophe" is necessarily the worst thing possible and should be avoided at all costs. If an antialigned "evil" AI were about to be released with high probability, and you had a paperclip maximizer in a box, releasing the paperclip maximizer would be the best option, even though that moves the chance of catastrophe from high probability to indistinguishable from certainty.

Correct. Again, I don't mean to say that any catastrophe is literally the worst outcome possible.

Comment by turntrout on Chapter 23: Belief in Belief · 2020-02-12T23:54:40.685Z · score: 3 (2 votes) · LW · GW

My apologies for leaving that unmarked.

Comment by turntrout on Attainable utility has a subagent problem · 2020-02-12T16:01:55.641Z · score: 2 (1 votes) · LW · GW

Nitpick: "Attainable utility regularization" should be "Attainable utility preservation"

Comment by turntrout on Value Impact · 2020-02-12T15:48:10.731Z · score: 3 (2 votes) · LW · GW

Oh, I think I see. Yes, this is explicitly talked about later in the sequence - "resources" like cash are given their importance by how they affect future possibilities, and that's highly context-dependent.

(Let me know if this still isn't addressing your objection)

Comment by turntrout on Value Impact · 2020-02-12T13:44:55.390Z · score: 2 (1 votes) · LW · GW

Also, I feel that some of your examples of objective impact are indeed relevant to agents in general (not dying/being destroyed), while others depend on sharing a common context (cash, which would be utterly useless in Pebblia if the local economy was based on exchanging pebbles for pebbles).

Yeah, in the post I wrote

Even if we were on Pebblia, we'd probably think primarily of the impact on human-Pebblehoarder relations.

Comment by turntrout on TurnTrout's shortform feed · 2020-02-12T01:51:02.670Z · score: 15 (4 votes) · LW · GW

For quite some time, I've disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.

Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn't get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I'm very happily wearing contacts right now, as a matter of fact.

I'd suffered glasses for over fifteen years because of a cached decision – because I didn't think to rethink something literally right in front of my face every single day.

What cached decisions have you not reconsidered?

Comment by turntrout on Attainable Utility Landscape: How The World Is Changed · 2020-02-11T03:28:52.601Z · score: 3 (2 votes) · LW · GW

In a complete graph, all nodes are directly connected.

Surprisingly, unless you're talking about $K_1$ (complete 1-graph), opportunity cost still exists in $K_n$ ($n > 1$). Each round, you choose where to go next (and you can go to any state immediately). Going to one state next round means you can't go to a different state next round, so for any given action there exists a reward function which incurs opportunity cost.

Definition. We say opportunity cost exists at a state $s$ if there exist child states $s_1, s_2$ of $s$ such that $V^*_R(s_1) \neq V^*_R(s_2)$ for some reward function $R$. That is, $s$ has successor states with different (optimal) AUs for some reward function.
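To make the definition concrete, here is a minimal sketch that checks whether a particular reward function is a witness to opportunity cost at a state, i.e. whether two successors of that state end up with different optimal AUs. The helper names (`optimal_values`, `opportunity_cost_exists`, `complete_graph`), the state-based reward convention, and the choice to let an agent in a complete graph stay put are assumptions for illustration only.

```python
from typing import Dict, List


def optimal_values(
    successors: Dict[int, List[int]],
    reward: Dict[int, float],
    gamma: float = 0.9,
    sweeps: int = 500,
) -> Dict[int, float]:
    """Approximate V*(s) = R(s) + gamma * max over successors of V*(s') by repeated sweeps."""
    values = {s: 0.0 for s in successors}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[n] for n in successors[s])
            for s in successors
        }
    return values


def opportunity_cost_exists(
    state: int,
    successors: Dict[int, List[int]],
    reward: Dict[int, float],
    gamma: float = 0.9,
    eps: float = 1e-9,
) -> bool:
    """True if this reward function gives two children of `state` different optimal AUs."""
    values = optimal_values(successors, reward, gamma)
    child_values = [values[child] for child in successors[state]]
    return max(child_values) - min(child_values) > eps


def complete_graph(n: int) -> Dict[int, List[int]]:
    """Every state can reach every state in one step (self-loops included, by assumption)."""
    states = list(range(n))
    return {s: list(states) for s in states}


# Indicator reward on state 0 in the complete 3-graph: children of state 1 differ in AU.
k3_has_cost = opportunity_cost_exists(1, complete_graph(3), {0: 1.0, 1: 0.0, 2: 0.0})
# The complete 1-graph has a single child, so no reward function can be a witness.
k1_has_cost = opportunity_cost_exists(0, complete_graph(1), {0: 1.0})
```

Since the definition is existential over reward functions, a `True` result settles the question for that state, while a `False` result only rules out the one reward function that was tried (a constant reward, for instance, is never a witness).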

The opposite of this, is that if it were possible for an agent to be in more than one place at the same time, they could be infinitely farsighted. (Possibly as a consequence of FTL.)

Things get weird here, depending on your theory of identity and how that factors into the planning / reward process? Can you spell this out some more?

Comment by turntrout on Open & Welcome Thread - February 2020 · 2020-02-10T15:27:20.494Z · score: 2 (1 votes) · LW · GW

I think for humans most mentally accessible preferences are instrumental, and the right analogy for them is something like 'value functions' instead of 'reward' (as in RL).

I agree. As far as I can tell, people seem to be predicting their on-policy Q function when considering different choices. See also attainable utility theory and the gears of impact.

Comment by turntrout on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-07T13:17:43.217Z · score: 7 (4 votes) · LW · GW

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

I'd suspect that's right, but I don't think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers wrt alignment work. There's the use of the technical term "almost every", but you did not prove the set of "powerful" algorithms which is not "manipulative" has measure zero. There's also "would be" instead of "seems" (I think if you made this change, the title would be fine). I think it's vitally important we use the correct epistemic markers; if not, this can lead to research predicated on obvious-seeming hunches stated as fact.

Not that I disagree with your suspicion here.

Comment by turntrout on TurnTrout's shortform feed · 2020-02-07T13:10:46.833Z · score: 2 (1 votes) · LW · GW

Lost surplus is definitely a loss - it's not linear with utility, but it's not uncorrelated.

I don't think I was disagreeing?

Comment by turntrout on TurnTrout's shortform feed · 2020-02-06T16:56:15.771Z · score: 4 (2 votes) · LW · GW

My autodidacting has given me a mental reflex which attempts to construct a gears-level explanation of almost any claim I hear. For example, when listening to “Listen to Your Heart” by Roxette:

Listen to your heart,

There’s nothing else you can do

I understood what she obviously meant and simultaneously found myself subvocalizing “she means all other reasonable plans are worse than listening to your heart - not that that’s literally all you can do”.

This reflex is really silly and annoying in the wrong context - I’ll fix it soon. But it’s pretty amusing that this is now how I process claims by default, and I think it usually serves me well.

Comment by turntrout on TurnTrout's shortform feed · 2020-02-05T16:41:01.779Z · score: 5 (2 votes) · LW · GW

AFAICT, the deadweight loss triangle from eg price ceilings is just a lower bound on lost surplus. Inefficient allocation to consumers means that people who value the good less than the market equilibrium price can buy it, while the DWL triangle optimistically assumes consumers with the highest willingness to buy will eat up the limited supply.
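A small worked example of the lower-bound claim, with entirely made-up numbers: five buyers, five units with rising marginal cost, and a binding price ceiling. The function names and the two rationing rules (highest-value buyers first versus lowest-value willing buyers first) are assumptions chosen only to bracket the possibilities.

```python
from typing import List


def equilibrium_surplus(valuations: List[float], marginal_costs: List[float]) -> float:
    """Total surplus when units trade as long as the buyer's value covers the unit's cost."""
    demand = sorted(valuations, reverse=True)
    supply = sorted(marginal_costs)
    return sum(v - c for v, c in zip(demand, supply) if v >= c)


def ceiling_surplus(
    valuations: List[float],
    marginal_costs: List[float],
    ceiling: float,
    pick_highest: bool,
) -> float:
    """Total surplus under a price ceiling, for a given rule about who gets the scarce units."""
    # Sellers only supply units whose marginal cost is covered by the ceiling price.
    supplied = [c for c in sorted(marginal_costs) if c <= ceiling]
    # Anyone who values the good at least at the ceiling price wants to buy.
    willing = sorted((v for v in valuations if v >= ceiling), reverse=pick_highest)
    recipients = willing[: len(supplied)]
    return sum(recipients) - sum(supplied[: len(recipients)])


# Hypothetical market: the equilibrium price lies between 5 and 6, the ceiling is 3.5.
valuations = [10, 8, 6, 4, 2]
marginal_costs = [1, 3, 5, 7, 9]
ceiling = 3.5

free_market = equilibrium_surplus(valuations, marginal_costs)
triangle_loss = free_market - ceiling_surplus(valuations, marginal_costs, ceiling, True)
worst_case_loss = free_market - ceiling_surplus(valuations, marginal_costs, ceiling, False)
```

In this toy market the triangle (efficient rationing) understates the loss: when the two available units go to the buyers who value them at 6 and 4, one of whom values the good below the equilibrium price, the lost surplus is much larger than when they go to the buyers who value them at 10 and 8.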

Comment by turntrout on Instrumental Occam? · 2020-02-01T02:42:25.927Z · score: 2 (1 votes) · LW · GW

This is an argument against instrumental convergence - in a deterministic environment that is completely understood. Under these conditions, the 'paperclip maximizer AI' knows that yes, it does want paperclips more than anything else, and has no issue turning itself into paperclips.

Can you elaborate? Does this argue against instrumental convergence because it would paperclip itself?

Comment by turntrout on Attainable utility has a subagent problem · 2020-01-28T17:26:59.997Z · score: 11 (4 votes) · LW · GW

I think it's really great to have this argument typed up somewhere, and I liked the images. There's something important going on with how the agent can make our formal measurement of its power stop tracking the actual powers it's able to exert over the world, and I think solving this question is the primary remaining open challenge in impact measurement. The second half of Reframing Impact (currently being written and drawn) will discuss this in detail, as well as proposing partial solutions to this problem.

The agent's own power plausibly seems like a thing we should be able to cleanly formalize in a way that's robust when implemented in an impact measure. The problem you've pointed out somewhat reminds me of the easy problem of wireheading, in which we are fighting against a design choice rather than value specification difficulty.

How is getting reward for being on the blue button? I assume gets reward whenever a robot is on the button?

This will give it a reward of ,

Is the +1 a typo?

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button.

Depends on how much impact is penalized compared to normal reward.

How plausible is this to work in a more general situation? Well, if the R is rich enough, this is similar to the "twenty billion questions" in our low impact paper (section 3.2). But that's excessively rich, and will probably condemn the agent to inaction.

This isn't necessarily true. Consider, as the reward function class, all linear functionals over camera pixels. Or, even the max-ent distribution over observation-based reward functions. I claim that this doesn't look like 20 billion Q's.

ETA: I'd also like to note that, while implicitly expanding the action space in the way you did (e.g. "$A$ can issue requests to $SA$, and also program arbitrary non-Markovian policies into it") is valid, I want to explicitly point it out.

Comment by turntrout on Concerns Surrounding CEV: A case for human friendliness first · 2020-01-23T18:27:18.693Z · score: 12 (3 votes) · LW · GW

I imagine by the time it can extrapolate humanity's volition it will be intelligent enough to consider what it would rather do than that.

Why would it rather choose plans which rate lower in its own preference ordering? What is causing the "rather"?

Comment by turntrout on Concerns Surrounding CEV: A case for human friendliness first · 2020-01-23T17:28:09.043Z · score: 11 (3 votes) · LW · GW

wouldn't the AI be intelligent enough to be offended by our self-centredness and change that utility function?

(Supposing this is an accurate summary of your position), this is anthropomorphizing. Morality is a two-place function; things aren't inherently offensive. A certain mind may find a thing to be offensive, and another may not.

but I can't help but imagine that if we precommitted to giving it a mutually beneficial utility function, it would be more prone to treating us in a friendly way.

I think you might dissolve some confusion by considering: what exactly does "beneficial" mean for the AI, here? Beneficial according to what standard?

Comment by turntrout on Inner alignment requires making assumptions about human values · 2020-01-20T22:32:46.313Z · score: 9 (2 votes) · LW · GW

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

I think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"!

Second, it's always good to check "would this style of reasoning lead me to conclude solving the easy problem of wireheading is value-laden?":

This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.

Comment by turntrout on Inner alignment requires making assumptions about human values · 2020-01-20T21:14:05.424Z · score: 7 (3 votes) · LW · GW

I (low-confidence) think that there might be a "choose two" wrt impact measures: large effect, no ontology, no/very limited value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or info. about human values (AUP; to be discussed in upcoming Reframing Impact posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I know this isn't saying why I think this yet, but I'd just like to register this now for later discussion.

Comment by turntrout on Vanessa Kosoy's Shortform · 2020-01-19T02:29:19.720Z · score: 2 (1 votes) · LW · GW

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shut down (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period...

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage.

This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I've made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.

Comment by turntrout on Red Flags for Rationalization · 2020-01-14T16:34:51.147Z · score: 4 (3 votes) · LW · GW

If anyone has experience trying to develop [the skill of noticing what it feels like to rationalize], please leave a comment.

I've developed this skill some. To me, it feels like part of my brain is "slipping sideways", tugging me harder than appropriate towards a particular line of thinking or conclusion. I think I'm reasonably good at noticing rationalization, but part of my brain still tries to rationalize even after I notice it. I want to get better at responding appropriately.

Comment by turntrout on On Being Robust · 2020-01-14T16:26:51.495Z · score: 2 (1 votes) · LW · GW

Yeah, I think that's quite close to this concept - thanks for the link.

Comment by turntrout on Why a New Rationalization Sequence? · 2020-01-13T16:04:43.583Z · score: 4 (3 votes) · LW · GW

A common delusion among grandiose schizophrenics in institutions is that they are themselves psychiatrists. Consider a particularly underfunded mental hospital, in which the majority of people who "know" themselves to be psychiatrists are wrong. No examination of the evidence will convince them otherwise. No matter how overwhelming, some reason to disbelieve will be found.

Given this, should any amount of evidence suffice to convince you that you are such a psychiatrist?

I am not aware of any resolution to this paradox.

I don't think there's a resolution to this kind of thing. "What algorithm produces good epistemics, even though your memories and beliefs may be arbitrarily corrupted?". In the general case, you can't do better than approximating normatively correct reasoning as best you can, given the information you have.

Of course, as actual people, we quickly run into the following issue:

The simplest ideal of thinking deals extensively with uncertainty of external facts, but trusts its own reasoning implicitly. Directly imitating this, when your own reasoning is not 100% trustworthy, is a bad plan. Hopefully this sequence will provide some alternatives.

Comment by turntrout on TurnTrout's shortform feed · 2020-01-13T02:15:39.463Z · score: 18 (8 votes) · LW · GW

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it'll take more time. That means I'm spending reading time on things that aren't math textbooks. That means I'm slowing down.

(Transcription of a deeper Focusing on this reasoning)

I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I just lay there and cried. That was slow, and it hurt. No reviews, no posts, no typing, no coding. No writing, slow reading. That was slow, and it hurt.

Part of it used to be a sense of "I need to catch up and learn these other subjects which [Eliezer / Paul / Luke / Nate] already know". Through internal double crux, I've nearly eradicated this line of thinking, which is neither helpful nor relevant nor conducive to excitedly learning the beautiful settled science of humanity. Although my most recent post touched on impostor syndrome, that isn't really a thing for me. I feel reasonably secure in who I am, now (although part of me worries that others wrongly view me as an impostor?).

However, I mostly just want to feel fast, efficient, and swift again. I sometimes feel like I'm in a race with Alex, and I feel like I'm losing.

Comment by turntrout on On Being Robust · 2020-01-12T23:43:56.906Z · score: 2 (1 votes) · LW · GW

Awesome. I should also note this generator is post hoc; I was (trying to) do this for a few years before I was even thinking about the planning fallacy.

Comment by turntrout on On Being Robust · 2020-01-10T06:55:29.160Z · score: 4 (2 votes) · LW · GW

I think the important generator is: being robust seems like a solution to this "generalized planning fallacy"[1], where you don't correctly anticipate which corners should not be cut. So, even though you could theoretically excise some wasted motions by cutting pointless corners, you can't tell which corners are pointless. Therefore, a better policy is just not cutting corners by default.

I think you should probably do most of these things, but not sure which order to do them in,

TBC, the main point isn't that people should do these specific things per se, the main thing is the overall mindset.

and meanwhile, I think so long as you're afraid of being unmasked part of the problem seems like it's about the fear itself?

This is what I was getting at with

There are a lot of things to say about the impostor syndrome on a psychological basis (the fears are often unrealistic / unmerited, etc). But I'd like to take another angle.

I think the fear itself is the key problem with impostor syndrome, and I wasn't trying to say "just be so good you feel secure" should be the main line of attack on that insecurity.

1. I don't particularly like this name, but it's just a temporary handle. In the planning fallacy, you optimistically stress-test your planned schedule. In this case, you optimistically stress-test your life in a broader sense; being robust attempts to counter that. ↩︎

Comment by turntrout on Being a Robust Agent (v2) · 2020-01-10T02:24:47.432Z · score: 2 (1 votes) · LW · GW

Your definition says "no", which seems fine. But it might leave you with an awkward distinction between "instrumentally rational decisions" and "decisions rooted in instrumental rationality."

My definition was the first, which is "instrumental rationality = acting so you win". So, wouldn't it say that following your gut was instrumentally rational? At least, if it's a great idea in expectation given what you knew - I wouldn't say lottery winners were instrumentally rational.

Comment by turntrout on Being a Robust Agent (v2) · 2020-01-10T02:05:13.124Z · score: 2 (1 votes) · LW · GW

I define it the first way, and don't see the case for the second way. Analogously, for a while, Bayesian reasoning was our best guess of what the epistemic Way might look like. But then we find out about logical induction, and that seems to tell us a little more about what to do when you're embedded. So, we now see it would have been a mistake to define "epistemic rationality" as "adhering to the dictates of probability theory as best as possible".

I think that Eliezer's other usage of "instrumental rationality" points to fields of study for theoretical underpinning of effective action.

Comment by turntrout on Being a Robust Agent (v2) · 2020-01-09T17:54:53.426Z · score: 2 (1 votes) · LW · GW

I don't think this is quite the same thing as instrumental rationality (although it's tightly entwined). If your goals are simple and well-understood, and you're interfacing in a social domain with clear rules, the most instrumentally rational thing might be to not overthink it and follow common wisdom.

But it's particularly important if you want to coordinate with other agents, over the long term. Especially on ambitious, complicated projects in novel domains.

On my initial read, I read this as saying "this is the right thing for some people, even when it isn't instrumentally rational" (?!). But

I think it's important to have at least one concept that is "anyone with goals should ultimately be trying to solve them the best way possible", and at least one concept that is "you might consider specifically studying cognitive patterns and policies and a cluster of related things, as a strategy to pursue particular goals."

makes me think this isn't what you meant. Maybe clarify the OP?

Comment by turntrout on TurnTrout's shortform feed · 2020-01-06T20:15:42.012Z · score: 2 (1 votes) · LW · GW

Windows machines circa 2013. Let’s say 128GB hard drives which magically never fail, for 10 PCs.

Comment by turntrout on TurnTrout's shortform feed · 2020-01-05T02:27:54.205Z · score: 11 (2 votes) · LW · GW

Suppose you could choose how much time to spend at your local library, during which:

• you do not age. Time stands still outside; no one enters or exits the library (which is otherwise devoid of people).
• you don't need to sleep/eat/get sunlight/etc
• you can use any computers, but not access the internet or otherwise bring in materials with you
• you can't leave before the requested time is up

Suppose you don't go crazy from solitary confinement, etc. Remember that value drift is a potential thing.

How long would you ask for?

Comment by turntrout on [AN #80]: Why AI risk might be solved without additional intervention from longtermists · 2020-01-02T23:05:52.974Z · score: 34 (9 votes) · LW · GW

There can't be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left. So, the prior that any particular thing has such an impact should be quite low.

I don't follow this argument; I also checked the transcript, and I still don't see why I should buy it. Paul said:

A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/10 hundred thousandth? I don’t know, I’m not good at arithmetic.

Anyway, that’s a priori, just aren’t that many things are that bad and it seems like people would try and make AI that’s trying to do what they want.

In my words, the argument is "we agree that the future has nontrivial EV, therefore big negative impacts are a priori unlikely".

But why do we agree about this? Why are we assuming the future can't be that bleak in expectation? I think there are good outside-view arguments to this effect, but that isn't the reasoning here.
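As a quick check of the arithmetic in the quoted passage, here is a minimal sketch. It assumes (my assumption, not something stated in the transcript) that the 100 hypothetical factors act independently and each multiplies the remaining expected value by 0.9; the function name is purely illustrative.

```python
def remaining_fraction(n_factors: int, loss_per_factor: float = 0.10) -> float:
    """Fraction of expected value left after n independent multiplicative losses.

    Assumption: each factor removes `loss_per_factor` of whatever value remains.
    """
    return (1.0 - loss_per_factor) ** n_factors


fraction = remaining_fraction(100)
print(f"fraction remaining: {fraction:.2e}")
print(f"roughly 1 in {round(1 / fraction):,}")
```

Under these assumptions the remaining fraction is on the order of a few hundred-thousandths, so it is much smaller than 1/1000 and somewhat larger than one hundred-thousandth.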

Comment by turntrout on TurnTrout's shortform feed · 2020-01-02T16:18:14.203Z · score: 7 (4 votes) · LW · GW

Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There are likely agreed-upon theories for this, but here's some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn't consensus on why this happens.)

My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I'm imagining a kid engrossed in a puzzle. Then, they're told that they'll be given \$10 upon completion. I'm predicting that the kid won't become significantly less engaged, which surprises me?

third graders who were rewarded with a book showed more reading behaviour in the future, implying that some rewards do not undermine intrinsic motivation.

Might this be because the reward for reading is more reading, which doesn't undermine the intrinsic interest in reading? You aren't looking forward to escaping the task, after all.

While the provision of extrinsic rewards might reduce the desirability of an activity, the use of extrinsic constraints, such as the threat of punishment, against performing an activity has actually been found to increase one's intrinsic interest in that activity. In one study, when children were given mild threats against playing with an attractive toy, it was found that the threat actually served to increase the child's interest in the toy, which was previously undesirable to the child in the absence of threat.

A few experimental summaries:

1 Researchers at Southern Methodist University conducted an experiment on 188 female university students in which they measured the subjects' continued interest in a cognitive task (a word game) after their initial performance under different incentives.

The subjects were divided into two groups. Members of the first group were told that they would be rewarded for competence. Above-average players would be paid more and below-average players would be paid less. Members of the second group were told that they would be rewarded only for completion. Their pay was scaled by the number of repetitions or the number of hours playing. Afterwards, half of the subjects in each group were told that they over-performed, and the other half were told that they under-performed, regardless of how well each subject actually did.

Members of the first group generally showed greater interest in the game and continued playing for a longer time than the members of the second group. "Over-performers" continued playing longer than "under-performers" in the first group, but "under-performers" continued playing longer than "over-performers" in the second group. This study showed that, when rewards do not reflect competence, higher rewards lead to less intrinsic motivation. But when rewards do reflect competence, higher rewards lead to greater intrinsic motivation.

2 Richard Titmuss suggested that paying for blood donations might reduce the supply of blood donors. To test this, a field experiment with three treatments was conducted. In the first treatment, the donors did not receive compensation. In the second treatment, the donors received a small payment. In the third treatment, donors were given a choice between the payment and an equivalent-valued contribution to charity. None of the three treatments affected the number of male donors, but the second treatment almost halved the number of female donors. However, allowing the contribution to charity fully eliminated this effect.

From a glance at the Wikipedia page, it seems like there's not really expert consensus on why this happens. However, according to self-perception theory,

a person infers causes about his or her own behavior based on external constraints. The presence of a strong constraint (such as a reward) would lead a person to conclude that he or she is performing the behavior solely for the reward, which shifts the person's motivation from intrinsic to extrinsic.

This lines up with my understanding of self-consistency effects.

Comment by turntrout on 2020's Prediction Thread · 2019-12-31T16:39:52.038Z · score: 4 (2 votes) · LW · GW

5's confidence seems a bit high, as does 10.

But several of your predictions seem way too confident, given how specific they are: 6, 7, and 14 in particular. 40% for

1. Hot, but limited war between China and USA resulting in an internal Civil War in mainland China driven by Muslim revolt in north-western China, Taiwan & Hong Kong and economic interests in southern China

seems wrong due to its burdensome details.

What would 15) mean, exactly?

Comment by turntrout on New Year's Predictions Thread · 2019-12-31T00:54:53.396Z · score: 7 (4 votes) · LW · GW

Too vague a prediction indeed, but also collaborative filtering seems to have become a cornerstone of modern online advertising / content recommendation services.

Comment by turntrout on Judgment Day: Insights from 'Judgment in Managerial Decision Making' · 2019-12-30T22:26:13.088Z · score: 7 (4 votes) · LW · GW

I don't think about my self-study as "know a bunch of math things". Rather, it's a) continually improve at mathematical reasoning, and b) accumulate a bunch of problem-solving strategies and ways-of-looking-at-the-world. I can

• consider the analytic properties of a real polynomial (how quickly does the output change with the input? all polynomials are analytic. polynomial approximation theorem)
• put on my group theory hat (what kinds of operations does the space of polynomials admit?)
• consider it from a linear algebraic standpoint (the space of real polynomials of degree at most $n$ is secretly $\mathbb{R}^{n+1}$)
• put on a number-theoretic hat (consider the distributional behavior of integer polynomials in a modular context)
• think about how polynomials might be used in ML (hypothesis classes, complexity-of-fit, what's the search space behavior like for SGD if the model space is just the coefficients for degree-50 polynomials?)
• think about complexity theory (the nice closure properties of polynomial composition)
• or even consider the computability properties of polynomials (this is a bit of a stretch, but... for every decision problem containing a finite number of YES answers, there exists a polynomial with roots at the base-ten encodings of those YES inputs).
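The last bullet can be made concrete with a small sketch. It assumes inputs are already encoded as integers, and the helper name `yes_polynomial` and the toy YES set are hypothetical, chosen only for illustration: the polynomial is just the product of $(x - a)$ over the finitely many YES inputs $a$.

```python
from math import prod
from typing import Callable, Iterable


def yes_polynomial(yes_inputs: Iterable[int]) -> Callable[[int], int]:
    """Build p(x) = product of (x - a) over the finite set of YES inputs.

    p(x) == 0 exactly when x is one of the YES inputs.
    """
    roots = tuple(yes_inputs)

    def p(x: int) -> int:
        return prod(x - r for r in roots)

    return p


# Hypothetical toy decision problem: "is x one of 2, 3, 5, 7?"
p = yes_polynomial([2, 3, 5, 7])
accepted = [x for x in range(10) if p(x) == 0]
print(accepted)  # [2, 3, 5, 7]
```

Deciding membership then amounts to evaluating the polynomial and checking for zero, which is why this only works when the YES set is finite.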

Then, when I actually need to do stuff with polynomials for my research, I can see how the difference of returns from two policies can be represented by a polynomial in the agent's discount rate, which is a nice result (lemma 32). Insights build on insights.

So, I'm not trying to memorize everything. Leafing through Linear Algebra Done Right, I don't remember much about what self-adjointness means, or Jordan normal form, or whatever. However, I don't think that really matters. If I need to use the extraneous stuff, I know it exists, and could just pick it back up.

I am, however, able to regenerate fundamental things I actually use / run into in later studies. I have a mental habit of making myself regenerate random claims in every proof I consider. If we're relying on the commutativity of addition on the reals, I reflexively supply a proof of that property. I came up with: use the Cauchy sequence limit notion of reals, then rely on the commutativity of rationals under addition for each member of the sequence.
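Spelled out, the argument runs roughly as follows. This is a sketch assuming the reals are constructed as equivalence classes of Cauchy sequences of rationals with addition defined termwise; the bracket notation $[(a_n)]$ for the class of the sequence $(a_n)$ is my own shorthand.

```latex
\begin{aligned}
x + y &= [(a_n)] + [(b_n)] \\
      &= [(a_n + b_n)] && \text{(termwise definition of addition)} \\
      &= [(b_n + a_n)] && \text{(commutativity of addition in } \mathbb{Q} \text{, applied for each } n\text{)} \\
      &= [(b_n)] + [(a_n)] \\
      &= y + x
\end{aligned}
```

The only real work hidden here is checking that termwise addition is well defined on equivalence classes, which is part of the construction rather than of this step.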

It's not like I've perfectly retained everything I need. I can tell I'm a little rusty on some things. However, I never do conscious reviewing (beyond building on past insights through further study, using the insights in my professional research, and rederiving things periodically). I have empirical feedback that this works pretty well. My PhD qualifier exam involved a matrix analysis question; without having even taken the class, I was able to get the right answer by reasoning using the skills and knowledge I got from self-study.

ETA: FWIW, when I talk with math undergrads at my university about areas we've both studied / solve problems with them, my impression is that my comprehension is often better.

Comment by turntrout on There's No Fire Alarm for Artificial General Intelligence · 2019-12-30T04:09:08.301Z · score: 4 (2 votes) · LW · GW

uses it to make a confusing claim that "there's nothing wrong" as though we have no more agency over the development of AI than we do over the chaotic motion of a dice.

It's foolish to liken the development of AI to a roll of the dice. Given the stakes, we must try to study, prepare for, and guide the development of AI as best we can.

I think you're misinterpreting the original comment. Scott was talking about there being "nothing wrong" with this conception of epistemic uncertainty before the 1 arrives, where each new roll doesn't tell you anything about when the 1 will come. He isn't advocating pacifism about AI risk, though. Ironically enough, in his capacity as lead of the Agent Foundations team at MIRI, Scott is arguably one of the least AI-risk-passive people on the planet.
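The "each new roll doesn't tell you anything" point is the memoryless property of the waiting time. As a sketch, assuming a fair six-sided die and writing $T$ for the roll on which the first 1 appears (the fair-die assumption and the notation are mine, not Scott's):

```latex
\Pr(T > m + n \mid T > m)
  = \frac{\Pr(T > m + n)}{\Pr(T > m)}
  = \frac{(5/6)^{m+n}}{(5/6)^{m}}
  = (5/6)^{n}
  = \Pr(T > n)
```

So having already seen $m$ rolls without a 1 leaves the distribution of the remaining wait unchanged.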

Comment by turntrout on Judgment Day: Insights from 'Judgment in Managerial Decision Making' · 2019-12-30T02:17:09.937Z · score: 2 (1 votes) · LW · GW

Just a cheeky way I decided to refer to the planning fallacy, but for allocating money instead of time.

Comment by turntrout on Towards a New Impact Measure · 2019-12-28T14:49:52.712Z · score: 2 (1 votes) · LW · GW

I think that this idea behind AUP has fairly obvious applications to human rationality and cooperation, although they aren’t spelled out in this post. This seems like a good candidate for follow-up work.

I'm curious whether these are applications I've started to gesture at in Reframing Impact, or whether what you have in mind as obvious isn't a subset of what I have in mind. I'd be interested in seeing your shortlist.

For more detail, see my exchange in the descendants of this comment - I still mostly agree with my claims about the technical aspects of AUP as presented in this post. Fleshing out these details is also, in my opinion, a good candidate for follow-up work.

Without rereading all of the threads, I'd like to note that I now agree with Daniel about the subhistories issue. I also agree that the formalization in this post is overly confusing and complicated.