How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 22 (5 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 75 (16 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 109 (37 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 40 (13 votes)
Worrying about the Vase: Whitelisting 2018-06-16T02:17:08.890Z · score: 84 (20 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 114 (37 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)
Internalizing Internal Double Crux 2018-04-30T18:23:14.653Z · score: 80 (19 votes)
The First Rung: Insights from 'Linear Algebra Done Right' 2018-04-22T05:23:49.024Z · score: 77 (21 votes)
Unyielding Yoda Timers: Taking the Hammertime Final Exam 2018-04-03T02:38:48.327Z · score: 39 (11 votes)
Open-Category Classification 2018-03-28T14:49:23.665Z · score: 36 (8 votes)
The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach' 2018-03-25T06:55:46.204Z · score: 68 (18 votes)
Lightness and Unease 2018-03-21T05:24:26.289Z · score: 53 (15 votes)
How to Dissolve It 2018-03-07T06:19:22.923Z · score: 41 (15 votes)
Ambiguity Detection 2018-03-01T04:23:13.682Z · score: 33 (9 votes)
Set Up for Success: Insights from 'Naïve Set Theory' 2018-02-28T02:01:43.790Z · score: 62 (18 votes)
Walkthrough of 'Formalizing Convergent Instrumental Goals' 2018-02-26T02:20:09.294Z · score: 27 (6 votes)
Interpersonal Approaches for X-Risk Education 2018-01-24T00:47:44.183Z · score: 29 (8 votes)


Comment by turntrout on Mechanistic Corrigibility · 2019-08-23T00:27:11.021Z · score: 5 (3 votes) · LW · GW

Myopia feels like it has the wrong shape. As I understand it, deceptive alignment stems from the instrumental convergence of defecting later: the model is incentivized to accrue power. We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.

I think this requires less of a capabilities hit than myopia does. This predicate might be precisely the right shape to cut off the mesa optimizer's instrumental incentive for power and deceptive alignment. At the least, it feels like a much better fit. However, this might not tile by default? Not sure.

Comment by turntrout on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T17:26:37.868Z · score: 6 (3 votes) · LW · GW

This feels like painting with too broad a brush, and from my state of knowledge, the assumed frame eliminates at least one viable solution. For example, can one build an AI without harmful instrumental incentives (without requiring any fragile specification of "harmful")? If you think not, how do you know that? Do we even presently have a gears-level understanding of why instrumental incentives occur?

In HCH and safety via debate, it's a human preferentially selecting AI that the human observes and then comes to believe does what it wants.

To conclude that e.g. HCH is so likely to fail that we should feel pessimistic about it, it doesn't seem to be enough to say "Goodhart's curse applies". Goodhart's curse also applies when I'm buying apples at the grocery store. Why should we expect this bias of HCH to be enough to cause catastrophes, like it would for a superintelligent EU maximizer operating on an unbiased (but noisy) estimate of what we want? Some designs leave more room for correction and cushion, and it seems prudent to consider to what extent that is true for a proposed design.

I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.

This isn't obvious to me. Mild optimization seems like a natural thing people are able to imagine doing. If I think about "kinda helping you write a post but not going all-out", the result is not at all random actions. Can you expand?

Comment by turntrout on Coherence arguments do not imply goal-directed behavior · 2019-08-17T04:28:01.484Z · score: 3 (2 votes) · LW · GW

Intuitively, it seems easy to make agents that are ignorant or indifferent(/"irrational") in such a way that they will only seek to optimize things within the ontology we've provided (in this case, of the chess game), instead of outside (i.e. seizing additional compute)

It isn't obvious to me that specifying the ontology is significantly easier than specifying the right objective. I have an intuition that ontological approaches are doomed. As a simple case, I'm not aware of any fundamental progress on building something that actually maximizes the number of diamonds in the physical universe, nor do I think that such a thing has a natural, simple description.

Comment by turntrout on Matthew Barnett's Shortform · 2019-08-17T00:02:12.592Z · score: 3 (2 votes) · LW · GW

I don't want to imply that this is the only route to impact, just the only route to impactful research.

“Only” seems a little strong, no? To me, the argument seems to be better expressed as: if you want to build on existing work where there’s unlikely to be low-hanging fruit, you should be an expert. But what if there’s a new problem, or one that’s incorrectly framed? Why should we think there isn’t low-hanging conceptual fruit, or exploitable problems to those with moderate experience?

Comment by turntrout on Following human norms · 2019-08-08T20:00:43.936Z · score: 2 (1 votes) · LW · GW

Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.

Stuart’s early impact approach was like this, but modern work isn’t. Or maybe by “define what not to do”, you don’t mean “leave these variables alone”, but rather that eg (some ideally formalized variant of) AUP implicitly specifies a way in which the agent interacts with its environment: passivity to significant power changes. But then by this definition, we’re doing the “defining” thing for norm-learning approaches.

I agree that norm-based approaches use learning. I just don’t know whether I agree with your assertion that eg AUP “defines” what not to do.

To my understanding, mild optimization is about how we can navigate a search space intelligently without applying too much internal optimization pressure to find really “amazing” plans. This doesn’t seem to fit either.

Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.

How pessimistic are you about this concern for this idea?

Comment by turntrout on Four Ways An Impact Measure Could Help Alignment · 2019-08-08T02:53:32.719Z · score: 9 (8 votes) · LW · GW

To me, impact measurement research crystallizes how agents affect (or impact) each other; the special case of this is about how an AI will affect us (and what it even means for us to be "affected").

A distinction between "difference in world models" and "differences in what we are able to do" is subtle, and enlightening (at least to me). It allows a new terminology in which I can talk about the impact of artificial intelligence.

I find this important as well. With this understanding, we can easily consider how a system of agents affects the world and each other throughout their deployment.

The concept of impact appears to neighbor other relevant alignment concepts, like mild optimization, corrigibility, safe shutdowns, and task AGIs. I suspect that even if impact measures are never actually used in practice, there is still some potential that drawing clear boundaries between these concepts will help clarify approaches for designing powerful artificial intelligence.

This is essentially my model for why some AI alignment researchers believe that deconfusion is helpful. Developing a rich vocabulary for describing concepts is a key feature of how science advances. Particularly clean and insightful definitions help clarify ambiguity, allowing researchers to say things like "That technique sounds like it is a combination of X and Y without having the side effect of Z."

A good counterargument is that there isn't any particular reason to believe that this concept requires priority for deconfusion. It would be bordering on a motte and bailey to claim that some particular research will lead to deconfusion and then when pressed I appeal to research in general. I am not trying to do that here. Instead, I think that impact measurements are potentially good because they focus attention on a subproblem of AI, in particular catastrophe avoidance. And I also think there has empirically been demonstrable progress in a way that provides evidence that this approach is a good idea.

IMO: Deconfusion isn't a motte and bailey according to the private information I have; to me, the substantial deconfusion is a simple fact. Also from my point of view, many people seem wildly underexcited about this direction in general (hence the upcoming sequence).

There's a natural kind here, and there's lovely math for it. The natural kind lets us formalize power, and prove when and why power differentials exist. The natural kind lets us formalize instrumental convergence, and prove when and why it happens. (Or, it will, and I'm working out the details now.) The natural kind lets us understand why instrumental convergence ends up being bad news for us.

Now, when I consider the effects of running an AI, many more facets of my thoughts feel clear and sharp and well-defined. "Low-impact AGI can't do really ambitious stuff" seems like a true critique (for now! and with a few other qualifications), but it seems irrelevant to the main reasons I'm excited about impact measurement these days. IMO: there's so much low-hanging fruit, so many gold nuggets floating down the stream, so much gemstone that there's more gem than stone - we should exhaustively investigate this, as this fruit, these nuggets, these gems may[1] later be connected to other important facts in AI alignment theory.

There is a power in the truth, in all the pieces of the truth which interact with each other, which you can only find by discovering as many truths as possible.

  1. In fact, the deconfusion already connects to important facts: instrumental convergence is important to understand. ↩︎

Comment by turntrout on Understanding Recent Impact Measures · 2019-08-07T17:49:44.418Z · score: 2 (1 votes) · LW · GW

Neither of those seems good, since the AI being able to do the right thing is not at all the same as us being able to get the AI to do the right thing. If someone steals your TV, they “could” easily give your TV back, but that doesn’t mean you can actually get them to do that. So your reading isn’t unreasonable, but that’s not AUP.

Introducing AUP over all utility functions was a mistake, since it’s a total red herring. Briefly, AUP is about designing an agent that achieves its goal without acting to gain or lose power - an agent without nasty convergent instrumental incentives. Eg “Make paperclips while being penalized for becoming more or less able to make paperclips”. We don’t need anything like realizability for this.
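A minimal sketch of the "penalized for becoming more or less able to make paperclips" idea, not the formalism from the posts. It assumes we already have estimates of the agent's attainable utility for its own goal after taking an action and after doing nothing instead; the function name `aup_reward`, the `scale` parameter, and all numbers are hypothetical.

```python
def aup_reward(task_reward, q_action, q_noop, scale=1.0):
    """Task reward minus a penalty for changing the agent's own attainable utility.

    q_action: assumed estimate of attainable utility after taking the action.
    q_noop:   assumed estimate of attainable utility after doing nothing instead.
    scale:    hypothetical penalty weight (larger means more conservative).
    """
    power_change = abs(q_action - q_noop)  # gaining and losing power both count
    return task_reward - scale * power_change


# Hypothetical numbers: both actions earn the same immediate task reward.
make_one_paperclip = aup_reward(task_reward=1.0, q_action=10.0, q_noop=10.0)
seize_factory = aup_reward(task_reward=1.0, q_action=50.0, q_noop=10.0)
```

Under these made-up numbers, the action that leaves the agent's ability unchanged keeps its full task reward, while the power-gaining action is heavily penalized. The penalty only refers to the agent's own goal, which is why nothing like realizability over all utility functions shows up here.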

Comment by turntrout on Understanding Recent Impact Measures · 2019-08-07T15:14:27.858Z · score: 2 (1 votes) · LW · GW

Can you expand? I don't quite follow.

Comment by turntrout on New paper: Corrigibility with Utility Preservation · 2019-08-07T11:57:34.846Z · score: 5 (3 votes) · LW · GW

Haven't read the paper yet, but note that AU is already an abbreviation for attainable utility (in particular, "AUP agents"). Similarly for "utility preserving", which might be confusing (but maybe inevitable).

Comment by turntrout on Understanding Recent Impact Measures · 2019-08-07T11:26:49.846Z · score: 11 (7 votes) · LW · GW

These posts are quite good, thank you for writing them.

I no longer think that the desiderata I listed in Impact Measure Desiderata should be our guiding star (although I think Rohin's three are about right). Let's instead look directly at the process of getting a (goal-directed) AI to do what we want, and think about what designs do well.

First, we specify the utility function. Second, the agent computes and follows a high-performing policy. This process continues, where we refine the goal if the agent isn't doing what we want.

What we want is for the AI to eventually be doing the right thing (even if we have to correct it a few times). The first way this can not happen is that the agent can act to make what we want no longer feasible, or at least more expensive. That is, the agent changes the world so that even if it had the goal we wanted to give it, it would be significantly harder to accomplish:


The second problem is that the agent can prevent us from being able to correct it properly (by gaining or preserving too much power for itself, generally):

Together, these are catastrophes - we're no longer able to get what we want in either situation. We should consider what designs preclude these failures naturally.

When considering debates over desiderata, it seems to me that we're debating whether the desideratum will lead to good things (and each of us probably secretly had a different goal in mind for what impact measures should do). I'm interested in making the goal of this research explicit and getting it right. My upcoming sequence will cover this at length.

Comment by turntrout on [Resource Request] What's the sequence post which explains you should continue to believe things about a particle moving that's moving beyond your ability to observe it? · 2019-08-04T22:59:51.752Z · score: 4 (2 votes) · LW · GW

The Fabric of Real Things.

Comment by turntrout on How can I help research Friendly AI? · 2019-07-09T15:22:41.698Z · score: 16 (5 votes) · LW · GW

Hi. I'm also a PhD student (upcoming fourth year). Last year, I wrote about my process of autodidacting and switching research areas to directly work on FAI. Beyond that advice, I have more that is as yet unpublished: select your committee wisely. I purposefully chose those professors who were easy to get along with and who seemed likely to be receptive to longer-term concerns.

I generally recommend working on an alignment research area in your free time, levelling up as necessary. One path I took: make some proposals and demonstrate you can think novel thoughts and do research. Then, get funding to make this your full-time research.

Alternatively, you can just level up while in your program. Check out Critch's Deliberate Grad School and Leveraging Academia.

If you're interested in a Skype sometime, feel free to message me. There's a MIRIx Discord server I can invite you to as well.

Welcome to the journey!

Comment by turntrout on [FINAL CHAPTER] Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 122 · 2019-07-09T04:44:05.827Z · score: 4 (2 votes) · LW · GW

Quirrell tried to create an aligned successor and failed. Hard. He literally imprinted his own cognitive patterns and shaped Harry the entire year, and still failed.

Comment by turntrout on TurnTrout's shortform feed · 2019-07-06T06:08:46.662Z · score: 11 (3 votes) · LW · GW
… yet isn’t this what you’re already doing?

I work on technical AI alignment, so some of those I help (in expectation) don't even exist yet. I don't view this as what I'd do if my top priority were helping this man.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

That's a good question. I think the answer is yes, at least for my close family. Recently, I've expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn't have upon sufficient reflection, so I've designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad to me, and I'll take whatever steps necessary to minimize the chances of that.

I don't know how unusual this is. This reminds me of the relevant Harry-Quirrell exchange; most people seem beaten-down and hurt themselves, and I can imagine a world in which people are in better places and going to greater lengths for those they love. I don't know if this is actually what would make more people go to these lengths (just an immediate impression).

Comment by turntrout on TurnTrout's shortform feed · 2019-07-06T05:02:37.715Z · score: 13 (5 votes) · LW · GW

Suppose I actually cared about this man with the intensity he deserved - imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?

If the former, I'd go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.

If the latter, this gets trickier. I'd still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn't have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn't work, my first thought would be to influence the local government to get the broader problem fixed (I'd spend at least an hour considering other plans before proceeding further, here). Realistically, there's likely a lot of pressure in this direction already, so I'd need to find an angle from which few others are pushing or pulling where I can make a difference. I'd have to plot out the relevant political forces, study accounts of successful past lobbying, pinpoint the people I need on my side, and then target my influencing accordingly.

(All of this is without spending time looking at bird's-eye research and case studies of poverty reduction; assume counterfactually that I incorporate any obvious improvements to these plans, because I'd care about him and dedicate more than like 4 minutes of thought.)

Comment by turntrout on TurnTrout's shortform feed · 2019-07-05T23:00:58.761Z · score: 14 (10 votes) · LW · GW

I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: "very hungry".

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he'd get in arguments with and snow days he'd hope for.

And now he's just hurting.

And now I can't help him without abandoning others. So he's still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

Comment by turntrout on Problems with Counterfactual Oracles · 2019-07-04T19:55:23.783Z · score: 2 (1 votes) · LW · GW

It's more like 4a. The line of thinking seems useful, but I'm not sure that it lands.

Comment by turntrout on Problems with Counterfactual Oracles · 2019-07-04T19:53:02.430Z · score: 2 (1 votes) · LW · GW

I appreciate the answer, but my concerns remain.

1. Doesn't decompose manipulation, it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.

But for 2., how do we get an automated system and containment setup that is secure against a superintelligence? That's where the difficulty seems to be transferred: we assume we have a machine immune to this "manipulation" thing, and then we can conclude there's no manipulation. But is the global optimum for a stock-tracking system "correctly answering the query" instead of, like, "subtly modify the machine to always match this very output", or "use the machine to escape anyways"? Humans can't even check for this, because we assumed we won't peek!

But that seems optimistic. We assume, contra 4., that it has a model of the world and its future rewards, so what if it exploits some vulnerability to ensure it can manipulate us during the "innocent" step four? Again, we've already assumed we can't check it during the erasure training, and if we check its code afterwards to make sure it's no longer learning, that source could have manipulative content as well. (I'm generally confused about what capabilities are assumed - is it just souped-up modern ML? Even if I misunderstand some points, I anticipate that there exists a steelmanned translation of this concern.)

Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train? This isn't acknowledged anywhere I've read. I'd be more comfortable if we had an oracle which doesn't want out of the box, like one that wants to answer questions without becoming more able to answer questions. (That's along the lines of an AUP oracle)

Comment by turntrout on Problems with Counterfactual Oracles · 2019-07-04T15:13:10.284Z · score: 4 (3 votes) · LW · GW

In my mind, there's a notion of taking advantage of conceptual insights to make the superintelligent mountain less likely to be pushing against you. What part of a proposal is tackling a root cause of misalignment in the desired use cases? It's alright if the proposal isn't perfect, but heuristically I'd want to see something like "here's an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes".

Concretely, take CIRL, which I'm pretty sure most agree won't work for the general case as formulated. In addition to the normal IRL component, there's the insight of trying to formalize an agent cooperatively learning from a human. This contribution aimed to address a significant component of value learning failure.

(To be sure, I think the structure of "hey, what if the AI anticipates not being on anyways and is somehow rewarded only for accuracy" is a worthwhile suggestion and idea, and I am glad Stuart shared it. I just am not presently convinced it's appropriate to conclude the design averts manipulation incentives / is safe at present.)

Comment by turntrout on TurnTrout's shortform feed · 2019-06-30T18:57:46.543Z · score: 24 (7 votes) · LW · GW

Comment by turntrout on Research Agenda in reverse: what *would* a solution look like? · 2019-06-27T14:13:27.080Z · score: 2 (1 votes) · LW · GW

Thanks for clarifying! I haven't brought this up on your research agenda because I prefer to have the discussion during an upcoming sequence of mine, and it felt unfair to comment on your agenda, "I disagree but I won't elaborate right now".

Comment by turntrout on Research Agenda in reverse: what *would* a solution look like? · 2019-06-26T21:52:34.533Z · score: 2 (1 votes) · LW · GW

I also am curious why this should be so.

I also continue to disagree with Stuart on low impact in particular being intractable without learning human values.

Comment by turntrout on Modeling AGI Safety Frameworks with Causal Influence Diagrams · 2019-06-21T15:15:48.881Z · score: 9 (5 votes) · LW · GW

I really like this layout, this idea, and the diagrams. Great work.

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably not safety measures that don't break).

Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al point out, this isn't usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).

Comment by turntrout on Deceptive Alignment · 2019-06-17T20:51:09.424Z · score: 7 (4 votes) · LW · GW

I'm confused what "corrigible alignment" means. Can you expand?

Comment by turntrout on Problems with Counterfactual Oracles · 2019-06-12T15:39:29.420Z · score: 16 (6 votes) · LW · GW

My main problem with these kinds of approaches is they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure. If we acknowledge that by default a full oracle search over consequences basically goes just as wrong as a full sovereign search over consequences, then the optimum of this agent's search is only desirable if we nail the engineering and things work as expected. I have an intuition that this is highly unlikely - the odds just seem too high that we'll forget some corner case (or not even be able to see it).

ETA: I see I’ve been strongly downvoted, but I don’t see what’s objectionable.

Comment by turntrout on Does Bayes Beat Goodhart? · 2019-06-03T20:23:22.498Z · score: 4 (2 votes) · LW · GW

If optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns

One thing I’ve been thinking about recently is: why does this happen? Could we have predicted the general phenomenon in advance, without imagining individual scenarios? What aspect of the structure of optimal goal pursuit in an environment reliably produces this result?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-10T00:57:41.537Z · score: 2 (1 votes) · LW · GW
"AUP is not about state" - what does it mean for a method to be "about state"?

Here's a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn't "about" the state of the solar system.

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-09T16:01:59.228Z · score: 2 (1 votes) · LW · GW

What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.

And if we can’t agree informally on deception’s definition, I’m saying “how can we say a proposal has the property”.

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-08T23:54:06.652Z · score: 2 (1 votes) · LW · GW

I still don't understand the details, so maybe my opinion will change if I sit down and look at it more carefully. But I'm suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don't have the right formalism, you're going to get Goodharting on incorrect conceptual contours.

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-08T15:04:59.494Z · score: 11 (6 votes) · LW · GW

Meta: I’d have appreciated a version with less math, because extra formalization can hide the contribution. Or, first explain colloquially why you believe X, and then show the math that shows X.

I don’t see your claim. It looks heavily incentivized to steer state sequences to be desirable to its utility mixture. How do the evaluators even enter the picture?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-05T00:29:57.803Z · score: 4 (2 votes) · LW · GW
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.

(I'm going to take a shot at this now because it's meta, and I think there's a compact explanation I can provide that hopefully makes sense.)

Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say "it's penalizing opportunity cost or instrumental convergence" post hoc because that's why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.

In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can't actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.

Here's an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive.

The theory of relative state reachability says choice A is maximally impactful. Why? You can't reach anything like the states you could under inaction. How does this decision track with opportunity cost?

Attainable utility says choice B is the bigger deal. You couldn't do anything with that part of the universe anyways, so it doesn't change much. This is the correct answer.

This scenario is important because it isn't just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It's an illustration of where state reachability diverges from these notions.

A natural reply is, what about things that AUP penalizes that we don't find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)

However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.

ETA: Here's a physically realistic alternative scenario. Again, we're thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.

Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.

Choice A leads to a portion of the wall being painted yellow. You don’t know of any way to remove the paint before the reachability is discounted away. If you don’t take this choice now, you can’t later. Choice B, which is always available, ravages the environment around you.

Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.

It's also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel which are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don't change much. So it doesn't require our values to do the right thing here, either.
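The pixel-functional case can be sketched in a few lines. This is a toy under stated assumptions, not the formalism from the post: the room is a hypothetical list of flattened RGB views, the agent may look at any view on each step, the discount factor is made up, and `attainable_utility`, `paint_yellow`, and `au_change` are illustrative names. Because utility is additive over time, the best the agent can do for a linear functional is to keep looking at its best view, so painting one wall matters only if that wall was (or becomes) the best view.

```python
import random

GAMMA = 0.9  # assumed discount factor, not taken from the original comment


def attainable_utility(weights, views, gamma=GAMMA):
    # The agent can pick any view each step, so for a time-additive linear
    # utility the optimal policy keeps looking at the best-scoring view.
    best_per_step = max(sum(w * p for w, p in zip(weights, view)) for view in views)
    return best_per_step / (1.0 - gamma)


def paint_yellow(views, index):
    # Copy the room and overwrite one view with pure yellow (R=1, G=1, B=0).
    painted = [list(view) for view in views]
    painted[index] = [1.0, 1.0, 0.0] * (len(views[index]) // 3)
    return painted


def au_change(weights, before, after):
    # Absolute shift in attainable utility caused by the change to the room.
    return abs(attainable_utility(weights, after) - attainable_utility(weights, before))


# Hypothetical room: three views, two RGB pixels each (flattened).
room = [
    [0.2, 0.2, 0.5, 0.3, 0.3, 0.4],  # view 0: the wall that gets painted
    [0.1, 0.1, 0.9, 0.2, 0.1, 0.8],  # view 1: mostly blue
    [0.6, 0.1, 0.2, 0.5, 0.2, 0.1],  # view 2: reddish
]
painted_room = paint_yellow(room, index=0)

likes_blue = [0.0, 0.0, 1.0] * 2          # rewards the blue channel only
dislikes_yellow = [-1.0, -1.0, 1.0] * 2   # penalizes red + green, i.e. yellow

# Random linear functionals over the six pixel channels.
rng = random.Random(0)
random_functionals = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(100)]
changes = [au_change(w, room, painted_room) for w in random_functionals]
unchanged_fraction = sum(c == 0.0 for c in changes) / len(changes)

print(au_change(likes_blue, room, painted_room))
print(au_change(dislikes_yellow, room, painted_room))
print(unchanged_fraction)
```

In this room, neither the blue-loving nor the yellow-averse functional changes its attainable utility, since the agent can keep looking at view 1. For the random functionals, the printed fraction depends on the seed and the room, so no particular value is claimed here.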

The main point is that the reason it's doing the right thing is based on opportunity cost, while relative reachability's incorrect judgment is not.

I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).

It isn't the same, but the way you and major interpreted the phrase is totally reasonable, considering what I wrote.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-04T19:09:35.686Z · score: 2 (1 votes) · LW · GW

Which do you disagree with?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-03T15:46:21.339Z · score: 6 (3 votes) · LW · GW

These are good questions.

As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-23T02:40:08.713Z · score: 4 (2 votes) · LW · GW

I don't read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses

Yes, this is fine and understandable. I wasn’t meaning to imply that responsible people should have thought of all these things, but rather pointing to different examples. I’ll edit my phrasing there.

but only the quote

I had a feeling that there was some illusion of transparency (which is why I said “when I read it”), but I had no idea it was that strong. Good data point, thanks.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-23T02:37:30.219Z · score: 2 (1 votes) · LW · GW

If AUP is not in fact about restricting an agent's impact on the world (or, in other words, on the state of the world)

So the end result is this, but it doesn’t do it by considering impact to be a thing that happens to the state primarily, but rather to agents; impact not in the sense of “how different is the state”, but “how big of a deal is this to me?”. The objective is to limit the agent’s impact on us, which I think is the more important thing. I think this still falls under normal colloquial use of ‘impact’, but I agree that this is different from the approaches so far. I’m going to talk about this distinction quite a bit in the future.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-22T23:55:35.261Z · score: 16 (3 votes) · LW · GW

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remain confused, and then subconsciously slip back to their old model. Out of everyone I've talked to, I can probably count on my hands the number of people who get this – note that agreeing with specific predictions of mine is different.

Now, it's the author's job to communicate their ideas. When I say "as far as I can tell, few others have internalized how AUP actually works", this doesn't connote "gosh, I can't stand you guys, how could you do this", it's more like "somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?".

My goal with this comment isn't to explain, but rather to figure out what's happening. Let's go through some of my past comments about this.

Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power.
Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.
The plausibility of [this] makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out.
Towards a New Impact Measure

When I read this, it seems like I'm really trying to emphasize that I don't think the direct focus should be on the world state in any way. But it was a long post, and I said a lot of things, so I'm not too surprised.

I tried to nip this confusion in the bud.

"The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it."
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The “state importance”/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The “state difference weighting” problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
~ my reply to your initial comment on the AUP post

Even more confusing is that when I say "there are fundamental concepts here you're missing", people don't seem to become any less confident in their predictions about what AUP does. If people think that AUP is penalizing effects in the world, why don't they notice their confusion when they read a comment like the one above?

A little earlier,

Thinking in terms of "effects" seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting "effects" that makes sense across representations and environments.
Impact Measure Desiderata

As a more obscure example, some people with a state interpretation might wonder how come I'm not worried about stuff I mentioned in the whitelisting post anymore since I strangely don't think representation/state similarity metric matters for AUP:

due to entropy, you may not be able to return to the exact same universe configuration.
Worrying about the Vase: Whitelisting

(This is actually your "chaotic world" concern.)

Right now, I'm just chalking this up to "Since the explanations don't make any sense because they're too inferentially distant/it just looks like I built a palace of equations, it probably seems like I'm not on the same page with their concerns, so there's nothing to be curious about." Can you give me some of your perspective? (Others are welcome to chime in.)

To directly answer your question: no, the real-world version of AUP which I proposed doesn't reward based on state, and would not have its behavior influenced solely by different possible arrangements of air molecules. (I guess I'm directly responding to this concern, but I don't see any other way to get information on why this phenomenon is happening.)

As for the question – I was just curious. I think you'll see why I asked when I send you some drafts of the new sequence. :)

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-20T15:47:36.275Z · score: 2 (1 votes) · LW · GW

I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules.

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?


If you had to pose the problem of impact measurement as a question, what would it be?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-19T15:58:15.780Z · score: 4 (2 votes) · LW · GW

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

Comment by turntrout on Simplified preferences needed; simplified preferences sufficient · 2019-04-19T15:54:21.296Z · score: 4 (2 votes) · LW · GW

people working in these areas don't often disagree with this formal argument; they just think it isn't that important.

I do disagree with this formal argument in that I think it’s incorrectly framed. See the difference between avoiding huge impact to utility and avoiding huge impact to attainable utility, discussed here:

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-17T22:47:27.182Z · score: 2 (1 votes) · LW · GW
this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

Comment by turntrout on Towards a New Impact Measure · 2019-04-14T01:56:22.896Z · score: 2 (1 votes) · LW · GW

(The post defines the mathematical criterion used for what I call intent verification; it’s not a black box that I’m appealing to.)

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:36:26.620Z · score: 2 (1 votes) · LW · GW

I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:32:34.475Z · score: 2 (1 votes) · LW · GW

I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.

Notice something interesting here: the thing which would be Goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:24:43.628Z · score: 2 (1 votes) · LW · GW

Where did I purport that it was safe for AGI in the paper, or in the post? I specifically note that I'm not making that point yet, although I'm pretty sure we can get there.

There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.

I don't remember how I found the first version, I think it was in a Google search somehow?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:14:17.202Z · score: 4 (2 votes) · LW · GW

What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.
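To make the question concrete, here is a minimal sketch, assuming a toy model that is not part of the original formalism: each action only sets a hypothetical per-step shutdown probability, and the survival utility pays 1 on every step the agent is still activated. The action names, probabilities, and discount factor are all made up for illustration.

```python
GAMMA = 0.99  # assumed discount factor


def survival_value(shutdown_prob, gamma=GAMMA):
    # Expected discounted number of steps the agent stays activated when it is
    # shut down independently with probability `shutdown_prob` on each step:
    # sum_t gamma^t * (1 - p)^t = 1 / (1 - gamma * (1 - p)).
    return 1.0 / (1.0 - gamma * (1.0 - shutdown_prob))


def penalty(prob_after_action, prob_after_noop):
    # With a one-element attainable set, the penalty is just the absolute
    # change in attainable survival utility relative to doing nothing.
    return abs(survival_value(prob_after_action) - survival_value(prob_after_noop))


# Hypothetical actions mapped to the per-step shutdown probability they lead to.
actions = {
    "noop": 0.01,
    "stir_air_molecules": 0.01,  # chaotic effect on the state, none on survival
    "disable_off_switch": 0.0,   # changes the agent's ability to stay activated
}

penalties = {name: penalty(p, actions["noop"]) for name, p in actions.items()}
for name, value in penalties.items():
    print(name, value)
```

Under these assumptions, the chaotic action incurs no penalty because it leaves the agent's ability to stay activated untouched, while disabling the off switch does incur one.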

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:06:08.113Z · score: 9 (3 votes) · LW · GW
I don't see how representation invariance addresses this concern.

I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.

Comment by turntrout on Towards a New Impact Measure · 2019-04-11T16:23:30.143Z · score: 2 (1 votes) · LW · GW

1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.

2) This doesn't appear in the paper, but I do talk about it in the post, and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.

3) This is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement.

4) This is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, N eventually gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
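Points 2 and 4 can be illustrated together with a rough sketch. Everything here is an assumption made for illustration rather than the post's exact definition: the penalty is taken to be the summed absolute deviation of Q-values from the no-op at each step, the agent is assumed to maximize task reward minus penalty divided by a budget N, and the plans and their numbers are hypothetical.

```python
def step_penalty(q_action, q_noop):
    # Deviation of each attainable utility's Q-value from doing nothing,
    # summed over the attainable set (assumed form of the penalty).
    return sum(abs(a - n) for a, n in zip(q_action, q_noop))


def plan_penalty(steps):
    # Impact is accumulated along the arc of the plan, one step at a time.
    return sum(step_penalty(q_action, q_noop) for q_action, q_noop in steps)


def chosen_plan(plans, budget):
    # Assumed objective: task reward minus penalty scaled down by the budget N.
    # Ties go to the earliest plan, so "noop" wins when nothing beats it.
    def score(name):
        return plans[name]["reward"] - plan_penalty(plans[name]["steps"]) / budget
    return max(plans, key=score)


def increment_until_useful(plans, max_budget=100):
    # Slowly raise N and stop at the first budget where the agent acts at all.
    for budget in range(1, max_budget + 1):
        name = chosen_plan(plans, budget)
        if name != "noop":
            return budget, name
    return None, "noop"


# Hypothetical plans: each step is (Q-values after acting, Q-values after no-op)
# for a two-element attainable set.
PLANS = {
    "noop": {"reward": 0.0, "steps": []},
    "tidy_room": {"reward": 1.0, "steps": [([1.0, 2.0], [1.0, 1.0])] * 3},
    "seize_resources": {"reward": 5.0, "steps": [([51.0, 51.0], [1.0, 1.0])] * 2},
}

print(increment_until_useful(PLANS))
print(chosen_plan(PLANS, 50))
```

In this toy, the low-impact plan becomes worthwhile at a small N, long before the high-impact plan does, which is the reason for stopping as soon as the agent starts doing something useful.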

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:23:56.111Z · score: 2 (1 votes) · LW · GW

Is there a central example you have in mind for this potential failure mode?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:21:16.945Z · score: 13 (-1 votes) · LW · GW

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T22:51:39.468Z · score: 2 (1 votes) · LW · GW

How does this concern interact with the effective representation invariance claim I made when introducing AUP?