Thoughts on "Human-Compatible" 2019-10-10T05:24:31.689Z · score: 49 (21 votes)
The Gears of Impact 2019-10-07T14:44:51.212Z · score: 27 (8 votes)
World State is the Wrong Level of Abstraction for Impact 2019-10-01T21:03:40.153Z · score: 42 (12 votes)
Attainable Utility Theory: Why Things Matter 2019-09-27T16:48:22.015Z · score: 42 (13 votes)
Deducing Impact 2019-09-24T21:14:43.177Z · score: 54 (14 votes)
Value Impact 2019-09-23T00:47:12.991Z · score: 42 (15 votes)
Reframing Impact 2019-09-20T19:03:27.898Z · score: 79 (28 votes)
What You See Isn't Always What You Want 2019-09-13T04:17:38.312Z · score: 28 (9 votes)
How often are new ideas discovered in old papers? 2019-07-26T01:00:34.684Z · score: 24 (9 votes)
TurnTrout's shortform feed 2019-06-30T18:56:49.775Z · score: 22 (5 votes)
Best reasons for pessimism about impact of impact measures? 2019-04-10T17:22:12.832Z · score: 76 (17 votes)
Designing agent incentives to avoid side effects 2019-03-11T20:55:10.448Z · score: 31 (6 votes)
And My Axiom! Insights from 'Computability and Logic' 2019-01-16T19:48:47.388Z · score: 40 (9 votes)
Penalizing Impact via Attainable Utility Preservation 2018-12-28T21:46:00.843Z · score: 26 (10 votes)
Why should I care about rationality? 2018-12-08T03:49:29.451Z · score: 26 (6 votes)
A New Mandate 2018-12-06T05:24:38.351Z · score: 15 (8 votes)
Towards a New Impact Measure 2018-09-18T17:21:34.114Z · score: 109 (37 votes)
Impact Measure Desiderata 2018-09-02T22:21:19.395Z · score: 40 (11 votes)
Turning Up the Heat: Insights from Tao's 'Analysis II' 2018-08-24T17:54:54.344Z · score: 40 (11 votes)
Pretense 2018-07-29T00:35:24.674Z · score: 36 (14 votes)
Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction' 2018-07-05T00:34:59.249Z · score: 35 (9 votes)
Overcoming Clinginess in Impact Measures 2018-06-30T22:51:29.065Z · score: 40 (13 votes)
Worrying about the Vase: Whitelisting 2018-06-16T02:17:08.890Z · score: 84 (20 votes)
Swimming Upstream: A Case Study in Instrumental Rationality 2018-06-03T03:16:21.613Z · score: 114 (37 votes)
Into the Kiln: Insights from Tao's 'Analysis I' 2018-06-01T18:16:32.616Z · score: 69 (19 votes)
Confounded No Longer: Insights from 'All of Statistics' 2018-05-03T22:56:27.057Z · score: 56 (13 votes)
Internalizing Internal Double Crux 2018-04-30T18:23:14.653Z · score: 80 (19 votes)
The First Rung: Insights from 'Linear Algebra Done Right' 2018-04-22T05:23:49.024Z · score: 77 (22 votes)
Unyielding Yoda Timers: Taking the Hammertime Final Exam 2018-04-03T02:38:48.327Z · score: 40 (12 votes)
Open-Category Classification 2018-03-28T14:49:23.665Z · score: 36 (8 votes)
The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach' 2018-03-25T06:55:46.204Z · score: 68 (18 votes)
Lightness and Unease 2018-03-21T05:24:26.289Z · score: 53 (15 votes)
How to Dissolve It 2018-03-07T06:19:22.923Z · score: 41 (15 votes)
Ambiguity Detection 2018-03-01T04:23:13.682Z · score: 33 (9 votes)
Set Up for Success: Insights from 'Naïve Set Theory' 2018-02-28T02:01:43.790Z · score: 62 (18 votes)
Walkthrough of 'Formalizing Convergent Instrumental Goals' 2018-02-26T02:20:09.294Z · score: 27 (6 votes)
Interpersonal Approaches for X-Risk Education 2018-01-24T00:47:44.183Z · score: 29 (8 votes)


Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-12T15:50:52.865Z · score: 4 (2 votes) · LW · GW

Uh, the former - looks like I didn’t catch the dictation mistake.

Also, did you mean “wasn’t”? :)

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-11T15:48:40.380Z · score: 2 (1 votes) · LW · GW

Note that I wasn’t talking about approval directed agents in the part you originally quoted. I was saying that normal maximizers will learn to build good models as part of capability generalization.

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-11T15:46:32.953Z · score: 2 (1 votes) · LW · GW

Can’t quite read your edit, did you mean 3?

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-10T23:22:39.699Z · score: 2 (1 votes) · LW · GW

Yes, but its underlying model is still accurate, even if it doesn't reveal that to us? I wasn’t claiming that the AI would reveal to us all of the truths it learns.

Perhaps I misunderstand your point.

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-10T23:09:59.665Z · score: 2 (1 votes) · LW · GW

Another design: imitation learning. Generally, there seems to be a pattern of: policies which aren't selected for on the basis of maximizing some kind of return.

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-10T20:31:43.624Z · score: 4 (2 votes) · LW · GW

Note: I've only started to delve into the literature about Paul's agenda, so these opinions are lightly held.

Before I respond to specific points, recall that I wrote

I'm not necessarily worried about the difficulties themselves, but that the [uncertainty] framework seems so sensitive to them.


The approval-policy does what a predictor says to do at each time step, which is different from maximizing a signal. Its shape feels different to me; the policy isn't shaped to maximize some reward signal (and pursue instrumental subgoals). Errors in prediction almost certainly don't produce a policy adversarial to human interests.

The approval agent is taking actions according to the output of an ML-trained approval predictor; the fact that the policy isn't selected to maximize a signal is critical, and part of why I find approval-based methods so intriguing. There's a very specific kind of policy you need in order to pursue instrumental subgoals, which is reliably produced by maximization, but which otherwise seems to be vanishingly unlikely.

I'm a bit confused because you're citing this in comparison with approval-directed agency, but doesn't approval-directed agency also have this problem?

The contrast is the failing gracefully, not (necessarily) the specific problems.

In addition to the above (even if approval-directed agents have this problem, this doesn't mean disaster, just reduced performance), my understanding is that approval doesn't require actually locating the person, just imitating the output of their approval after reflection. This should be able to be trained in the normal fashion, right? (see the learning from examples section)

Suppose we train the predictor Approval using examples and high-powered ML. Then we have the agent take the action most highly rated by Approval at each time step. This seems to fail much more gracefully as the quality of Approval degrades?
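
A minimal sketch of the action-selection rule described above, assuming the predictor is just a function from (state, action) to a rating. The names `approval_directed_action` and `toy_approval`, and the ratings themselves, are hypothetical and only for illustration; a real Approval would be a trained model.

```python
from typing import Callable, Iterable


def approval_directed_action(
    state: str,
    actions: Iterable[str],
    approval: Callable[[str, str], float],
) -> str:
    """Return the action the predictor rates most highly in this state.

    The policy only consults the predictor's output at each step; it is not
    trained to maximize a long-run return.
    """
    return max(actions, key=lambda action: approval(state, action))


def toy_approval(state: str, action: str) -> float:
    """Hypothetical stand-in for an ML-trained approval predictor."""
    ratings = {
        "fetch coffee": 0.9,
        "do nothing": 0.5,
        "seize the coffee shop": 0.0,
    }
    return ratings.get(action, 0.0)


if __name__ == "__main__":
    options = ["do nothing", "fetch coffee", "seize the coffee shop"]
    print(approval_directed_action("morning", options, toy_approval))
```

In this toy setup, a degraded predictor changes which action gets picked at a given step, but nothing in the loop searches over long-horizon plans.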

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-10T16:09:51.219Z · score: 4 (2 votes) · LW · GW

I don’t think that low impact is decoupled, and viewing it from that frame might be misleading or lend a false sense of security. The policy is still very much shaped by utility, unlike approval.

Comment by turntrout on Thoughts on "Human-Compatible" · 2019-10-10T15:56:54.860Z · score: 2 (1 votes) · LW · GW

I agree for 3, but not for 2.

Comment by turntrout on World State is the Wrong Level of Abstraction for Impact · 2019-10-04T23:06:30.480Z · score: 4 (2 votes) · LW · GW

Most of the reason is indeed exposition: our intuitions about AU-impact are surprisingly clear-cut and lead naturally to the thing we want "low impact" AIs to do (not be incentivized to catastrophically decrease our attainable utilities, yet still execute decent plans). If our intuitions about impact were garbage and misleading, then I would have taken a different (and perhaps test-case-driven) approach. Plus, I already know that the chain of reasoning leads to a compact understanding of the test cases anyways.

I've also found that test-case based discussion (without first knowing what we want) can lead to a blending of concerns, where someone might think the low-impact agent should do X because agents who generally do X are safer (and they don't see a way around that), where someone might secretly have a different conception of the problems that low-impact agency should solve, etc.

Comment by turntrout on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T20:21:50.476Z · score: 18 (6 votes) · LW · GW

I don’t know that his arguments “echo”, it’s more like “can be translated into existing discourse”. For example, the leap from his 5) to IDA is massive, and I don’t understand why he imagines tackling the “we can’t align AGIs” problem with “build another AGI to stop the bad AGI”.

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-10-03T17:01:33.919Z · score: 2 (1 votes) · LW · GW

A group of satellite radar operators told him they had registered no missiles. BBC

I don’t see why this is bottom-line reasoning. It is in fact implausible that the US would first-strike with only five missiles, as that would leave the USSR able to respond.

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-10-03T16:59:17.244Z · score: 4 (2 votes) · LW · GW

A group of satellite radar operators told him they had registered no missiles. BBC

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-10-03T15:23:40.970Z · score: 4 (2 votes) · LW · GW

Petrov did consult with the satellite radar operators, who said they detected nothing.

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-10-03T14:14:09.093Z · score: 4 (2 votes) · LW · GW

He thought it unlikely that the US would launch a strike with 5 ICBMs only, since a first strike would likely be comprehensive. As far as Bayesian reasoning goes, this seems pretty good.

Also, a big part of being good at Bayesian reasoning is refining your ability to reason even when you can't gather data, when you can't view the same scenario "redrawn" ten thousand times and gather statistics on it.

ETA: the satellite radar operators reported all-clear; however, instructions were to only make decisions based on the computer readouts.

Comment by turntrout on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T22:25:09.719Z · score: 2 (1 votes) · LW · GW

It is meant to explain starting from scratch, so no worries! To clarify, although I agree with Matthew's comment, I'll later explain how value learning (or progress therein) is unnecessary for the approach I think is most promising.

Comment by turntrout on Deducing Impact · 2019-10-02T20:02:01.435Z · score: 2 (1 votes) · LW · GW

Probably, unless you already deeply get the thing the exercise is pointing at. I wrote this sequence in part because my past writing didn't do a great job imparting the important insights. Since I don't a priori know who already does and doesn't get each idea, you might as well follow along (note that usually the exercises are much shorter than this post's).

Comment by turntrout on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T15:37:17.994Z · score: 5 (2 votes) · LW · GW

I really like this line of thinking.

Then, if one runs into an ontological crisis, one can in principle re-generate their ontology by figuring out how to reason in terms of the new ontology in order to best fulfill their values.

I've found myself confused by how the process at the end of this sentence works. It seems like there's some abstract "will this worldview lead to value fulfillment?" question being asked, even though the core values seem undefined during an ontological crisis! I agree that you can regenerate the ontology once you have the core values redefined.

Comment by turntrout on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T00:37:29.429Z · score: 3 (2 votes) · LW · GW

It seems like one downside of impact in the AU sense

Even in worlds where we wanted to build a low impact agent that did something with the state, we'd still want to understand what people actually find impactful. (I don't think we're in such a world, though)

in order to figure out whether an action has high impact, the AI needs to have a detailed understanding of human values and the ontology used by humans.

Let's review what we want: we want an agent design that isn't incentivized to catastrophically impact us. You've observed that directly inferring value-laden AU impact on humans seems pretty hard, so maybe we shouldn't do that. What's a better design? How can we reframe the problem so the solution is obvious?

Let me give you a nudge in the right direction (which will be covered starting two posts from now; that part of the sequence won't be out for a while unfortunately):

Why are goal-directed AIs incentivized to catastrophically impact us - why is there selection pressure in this direction? Would they be incentivized to catastrophically impact Pebblehoarders?

it's not clear to me what AU theory adds in terms of increased safety

AU theory is descriptive; it's about why people find things impactful. We haven't discussed what we should implement yet.

Comment by turntrout on TurnTrout's shortform feed · 2019-10-01T20:57:16.698Z · score: 6 (3 votes) · LW · GW

We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price - how quickly does demand decrease as we raise prices? Price elasticity of demand is defined as $\frac{\%\text{ change in quantity}}{\%\text{ change in price}}$; in other words, for price $p$ and quantity $q$, this is $\frac{p\,\Delta q}{q\,\Delta p}$ (this looks kinda weird, and it wasn't immediately obvious what's happening here...). Revenue is the total amount of cash changing hands: $p \cdot q$.

What's happening here is that raising prices is a good idea when the revenue gained (the "price effect") outweighs the revenue lost to falling demand (the "quantity effect"). A lot of words so far for an easy concept:

If price elasticity is greater than 1, demand is elastic and price hikes decrease revenue (and you should probably have a sale). However, if it's less than 1, demand is inelastic and boosting the price increases revenue - demand isn't dropping off quickly enough to drag down the revenue. You can just look at the area of the revenue rectangle for each effect!
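
A small worked example of the revenue comparison, using made-up numbers (price 10, quantity 100, a price increase of 1). The function names and both scenarios are hypothetical; elasticity is taken as the magnitude of the percentage change in quantity over the percentage change in price.

```python
def price_elasticity(p: float, q: float, dp: float, dq: float) -> float:
    """Magnitude of (% change in quantity) / (% change in price)."""
    return abs((dq / q) / (dp / p))


def revenue_change(p: float, q: float, dp: float, dq: float) -> float:
    """Change in revenue p * q after moving to price p + dp and quantity q + dq."""
    return (p + dp) * (q + dq) - p * q


# Hypothetical scenarios: the same price hike, two different demand responses.
steep_drop = dict(p=10.0, q=100.0, dp=1.0, dq=-20.0)  # quantity falls by 20%
mild_drop = dict(p=10.0, q=100.0, dp=1.0, dq=-5.0)    # quantity falls by 5%

if __name__ == "__main__":
    for name, case in (("steep drop", steep_drop), ("mild drop", mild_drop)):
        e = price_elasticity(**case)
        dr = revenue_change(**case)
        print(f"{name}: elasticity={e:.2f}, revenue change={dr:+.1f}")
```

With these numbers the steep drop has elasticity above 1 and the hike loses revenue, while the mild drop has elasticity below 1 and the hike gains revenue.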

Comment by turntrout on TurnTrout's shortform feed · 2019-10-01T02:44:33.918Z · score: 3 (2 votes) · LW · GW

(Just starting to learn microecon, so please feel free to chirp corrections)

How diminishing marginal utility helps create supply/demand curves: think about the uses you could find for a pillow. Your first few pillows are used to help you fall asleep. After that, maybe some for your couch, and then a few spares to keep in storage. You prioritize pillow allocation in this manner; the value of the latter uses is much less than the value of having a place to rest your head.

How many pillows do you buy at a given price point? Well, if you buy any, you'll buy some for your bed at least. Then, when pillows get cheap enough, you'll start buying them for your couch. At what price, exactly? Depends on the person, and their utility function. So as the price goes up or down, it does or doesn't become worth it to buy pillows for different levels of the "use hierarchy".

Then part of what the supply/demand curve is reflecting is the distribution of pillow use valuations in the market. It tracks when different uses become worth it for different agents, and how significant these shifts are!
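
Here is a rough sketch of the demand side of that picture, assuming each buyer has a list of valuations for successive pillow uses and buys one pillow for every use worth at least the price. The buyers, valuations, and function names are all invented for illustration.

```python
from typing import Dict, List


def quantity_demanded(price: float, use_values: List[float]) -> int:
    """Pillows one buyer purchases: one per use valued at or above the price."""
    return sum(1 for value in use_values if value >= price)


def market_demand(price: float, buyers: Dict[str, List[float]]) -> int:
    """Total pillows demanded across all buyers at the given price."""
    return sum(quantity_demanded(price, values) for values in buyers.values())


# Hypothetical dollar valuations, ordered from most to least valuable use
# (bed first, then couch, then spares).
BUYERS = {
    "alice": [40, 35, 10, 8, 2],
    "bob": [50, 12, 3],
}

if __name__ == "__main__":
    for price in (45, 30, 10, 5, 1):
        print(price, market_demand(price, BUYERS))
```

As the price falls, lower tiers of the use hierarchy become worth buying for, so the quantity demanded steps up; the size of each step reflects how many uses across the market sit at that valuation.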

Comment by turntrout on TurnTrout's shortform feed · 2019-10-01T01:07:11.804Z · score: 7 (4 votes) · LW · GW

An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a "pass" in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb - I'm a computer scientist.

Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it's not like I had to bust out any esoteric theoretical CS knowledge). I'm wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.

Comment by turntrout on Approval-directed agents · 2019-09-30T20:10:03.514Z · score: 2 (1 votes) · LW · GW

The way I'm conceptualizing it is: in a goal-directed system, the policy is shaped around an external criterion (reward). In approval-directed agents, the policy maximizes the output of the "predictor" (whatever that is). The policy is looking in a different direction for guidance, so to speak.

Two other points:

  • The judgments of the predictor are not influenced by the policy itself, at least not in the same way reward can be influenced by the policy (wireheading). Due to the nature of instrumental convergence, the policies which lead to catastrophic behavior by the agent are actually hard to stumble upon without heavy goal-directed optimization pressure.
  • Even if the predictor is misspecified, we probably won’t get catastrophic behavior (for similar reasons). The main concern I have about this has to do with mesa optimization.

Comment by turntrout on TurnTrout's shortform feed · 2019-09-30T00:45:44.638Z · score: 4 (2 votes) · LW · GW

One of the reasons I think corrigibility might have a simple core principle is: it seems possible to imagine a kind of AI which would make a lot of different possible designers happy. That is, if you imagine the same AI design deployed by counterfactually different agents with different values and somewhat-reasonable rationalities, it ends up doing a good job by almost all of them. It ends up acting to further the designers' interests in each counterfactual. This has been a useful informal way for me to think about corrigibility, when considering different proposals.

This invariance also shows up (in a different way) in AUP, where the agent maintains its ability to satisfy many different goals. In the context of long-term safety, AUP agents are designed to avoid gaining power, which implicitly ends up respecting the control of other agents present in the environment (no matter their goals).

I'm interested in thinking more about this invariance, and why it seems to show up in a sensible way in two different places.

Comment by turntrout on TurnTrout's shortform feed · 2019-09-30T00:29:52.406Z · score: 5 (3 votes) · LW · GW

I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent's optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques.

This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achieve low loss on other tasks (the problem of few-shot learning). The parameters are not overfit on the scant few data points to which the parameters are adapted, which is also interesting.
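
To make the "RL but you maintain your ability to achieve other goals" idea concrete, here is a simplified sketch of an AUP-style shaped reward. It assumes we already have attainable-utility estimates (Q-values) for a handful of auxiliary goals, both for the candidate action and for a no-op baseline; the function name, the mean-based normalization, and the `lam` weight are illustrative assumptions rather than the exact formulation from the AUP posts.

```python
from typing import Sequence


def aup_reward(
    task_reward: float,
    aux_q_action: Sequence[float],
    aux_q_noop: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Task reward minus a penalty for shifting attainable utility.

    The penalty is the mean absolute difference, over auxiliary goals, between
    the attainable utility after the action and after doing nothing.
    """
    if not aux_q_noop or len(aux_q_action) != len(aux_q_noop):
        raise ValueError("need one action/no-op estimate per auxiliary goal")
    penalty = sum(
        abs(q_act - q_noop) for q_act, q_noop in zip(aux_q_action, aux_q_noop)
    ) / len(aux_q_noop)
    return task_reward - lam * penalty


if __name__ == "__main__":
    # Action leaves every auxiliary goal as attainable as before: no penalty.
    print(aup_reward(1.0, [0.5, 0.5], [0.5, 0.5]))
    # Action shifts attainable utility for both auxiliary goals: penalized.
    print(aup_reward(1.0, [0.0, 1.0], [0.5, 0.5]))
```

Note that in this sketch gains in attainable utility are penalized as well as losses, which is the part that connects to avoiding power gain.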

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-29T15:33:54.028Z · score: 2 (1 votes) · LW · GW

That doesn't feel at all impactful to me, under those assumptions. It feels like I've learned a new fact about the world, which isn't the same feeling. ETA Another example of this was mentioned by Slider: if you're a taxi driver ambivalent between different destinations, and the client announces where they want to go, it feels like you've learned something but doesn't feel impactful (in the way I'm trying to point at).

I think an issue we might run into here is that I don't exist in your mind, and I've tried to extensionally define for you what I'm pointing at. So if you try to find edge cases according to your understanding of exactly which emotion I'm pointing to, then you'll probably be able to, and it could be difficult for me to clarify without access to your emotions. That said, I'm still happy to try, and I welcome this exploration of how what I've claimed lines up with others' experiences.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-29T00:34:02.146Z · score: 2 (1 votes) · LW · GW

I think I might be experiencing goal directed behaviour very differently on the inside and I am unsure how much of the terminology is supposed to be abstract math concepts and how much of it is supposed to be emotional language. It might be for other people there is a more natural link between being in a low or high utility state and feeling low or high.

The sequence uses emotional language (so far), as it's written to be widely accessible. I'm extensionally defining what I'm thinking of and how that works for me. These intuitions translated for the 20 or so people I showed the first part of the sequence, but minds are different and it's possible it doesn't feel the same for you. As long as the idea of "how well the agent can achieve their goals" makes sense and you see why I'm pointing to these properties, that's probably fine.

I am also confused what the relationship between expected utility and attainable utility is supposed to be. If you expect to maximise they should be pretty close.

Great catch, covered two posts from now.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:39:05.743Z · score: 3 (2 votes) · LW · GW

Surface Pro 6 with Leonardo from the Windows app store.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:33:30.907Z · score: 2 (1 votes) · LW · GW

I assert that we feel impacted when we change our beliefs about how well we can get what we want. Learning the address does not affect their attainable utility, so (when I simulate this experience) it doesn't feel impactful in this specific way. It just feels like learning something.

Is this engaging with what you have in mind by "life-changes"?

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:28:29.477Z · score: 2 (1 votes) · LW · GW

Let me substantiate my claim a bit with a random sampling; I just pulled up a relative reachability blogpost. From the first paragraph, (emphasis mine)

An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.

But notice now we're talking about "disruption of the agent's environment". Relative reachability is indeed tackling the impact measure problem, so using what we now understand we might prefer to reframe as:

We think about "side effects" when they change our attainable utilities, so they're really just a conceptual discretization of "things which negatively affect us". We want the robot to prefer policies which avoid overly changing our attainable utilities. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because it's not that easy for us to repair the vase...

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:10:07.930Z · score: 2 (1 votes) · LW · GW

It seems like you're considering the changes in actions or information-theoretic surprisal, and I'm considering impact to the taxi driver. It's valid to consider how substantially plans change, it's just not the focus of the sequence.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:06:23.260Z · score: 3 (2 votes) · LW · GW

I would need an expectation that the tile was reachable before. If I have zero clue how the nature works I don't have an expectation that it was possible beforehand so I am not losing any ability.

Whoops, yeah, I forgot to specify that by the rules of this maze (which is actually generated from the labyrinth in Undertale), you can initially reach the goal. These are really fair critiques. I initially included the rules, but there were a lot of rules and it was distracting. I might add something more.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:03:41.958Z · score: 10 (4 votes) · LW · GW

Well, the claim is tautological, after all! The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility). By default, one considers what "big deals" have in common, and then thinks about not breaking vases / not changing too much stuff in the world state. This attractor is so strong that when I said, "wait, maybe it's not primarily about vases or objects", it didn't make sense to people.

The point of the first portion of the sequence isn't to amaze people with the crazy surprising insane twists I've discovered in what impact really is about - it's to show how things add up to normalcy, so as to set the stage for a straightforward discussion about one promising direction I have in mind for averting instrumental incentives.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T20:35:26.050Z · score: 2 (1 votes) · LW · GW

In the blowing up sun scenario I imagined myself being helpless as I normally can't command sun altering lasers or anything like that. In a world that has a slow apocalypse there would be a period of altered living. But in a world that suddenly turns off it is business as usual to the cutoff point. It doesn't feel impactful.

So learning that you and your family won't die in like a week doesn't feel like a big deal? I feel confused by this, and think maybe you meant something else?

being stuck in the abstract doesn't feel that bad. Is being stuck bad? Is being stuck good or worse by being killed by unpredictable natural forces? Does stuck come with immortality?

Well, your goal is to reach the gray tile. So, if you imagine yourself existing in this strange maze-reality, having that goal be your one and only imperative, and then no longer being able to reach the tile at all... that feels like a huge setback. But crucially, it only feels like a setback once you comprehend the rules of the maze enough to realize what happened.

If I am a taxi driver when the customer announces their destination it is very impactful for my driving but addresses are not better or worse amongst each other.

My framing of impact is something that only agents experience and consider. I'm not talking about how your strategies themselves are "impacted" or change as the result of new information. (I feel like we're using different words for the same things, so I wouldn't be surprised if just this reply didn't clarify what I mean.)

ETA: I'm saying that "getting to win less and losing access to utility" is impact to you, in my conception.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-28T16:17:53.234Z · score: 2 (1 votes) · LW · GW

Can you elaborate the situation further? I’m not sure I follow where the proxy comes apart, but I’m interested in hearing more.

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-28T01:37:35.268Z · score: 11 (4 votes) · LW · GW

It seems to me like the algorithm people are following is: if an action would be unilateralist, and there could be disagreement about its benefit, don't take the action. This will systematically bias the group towards inaction. While this is fine for low-stakes situations, in higher-stakes situations where the group can invest effort, you should actually figure out whether it is good to take the action (via the two-step method above). We need to be able to take irreversible actions; the skill we should be practicing is not "don't take unilateralist actions", it's "take unilateralist actions only if they have an expected positive effect after taking the unilateralist curse into account".

I don’t disagree with this, and am glad to see reminders to actually evaluate different courses of action besides the one expected of us. My comment was more about your own valuation being too low, about this not being a one-off event once you consider scenarios either logically or causally downstream of it, and about a general sense that you view the consequences of this event as quite isolated.

Comment by turntrout on Attainable Utility Theory: Why Things Matter · 2019-09-27T23:06:53.569Z · score: 2 (1 votes) · LW · GW

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

The next post will cover this.

there's a bunch of different things that "want" could mean. Is that something you've thought about and if so, is it important to pick the right sense of "want"?

I haven't considered this at length yet. Since we're only thinking descriptively right now and in light of where the sequence is headed, I don't know that it's important to nail down the right sense. That said, I'm still quite interested in doing so.

In terms of the want/like distinction (keeping in mind that want is being used in its neuroscientific that-which-motivates sense, and not the sense I've been using in the post), consider the following:

A University of Michigan study analyzed the brains of rats eating a favorite food. They found separate circuits for "wanting" and "liking", and were able to knock out either circuit without affecting the other... When they knocked out the "liking" system, the rats would eat exactly as much of the food without making any of the satisfied lip-licking expression, and areas of the brain thought to be correlated with pleasure wouldn't show up in the MRI. Knock out "wanting", and the rats seem to enjoy the food as much when they get it but not be especially motivated to seek it out. Are wireheads happy?

Imagining my "liking" system being forever disabled feels pretty terrible, but not maximally negatively impactful (because I also have preferences about the world, not just how much I enjoy my life). Imagining my "wanting" system being disabled feels similar to imagining losing significant executive function - it's not that I wouldn't be able to find value in life, but my future actions now seem unlikely to be pushing my life and the world towards outcomes I prefer. Good things still might happen, and I'd like that, but they seem less likely to come about.

The above is still cheating, because I'm using "preferences" in my speculation, but I think it helps pin down things a bit. It seems like there's some combination of liking/endorsing for "how good things are", while "wanting" comes into play when I'm predicting how I'll act (more on that in two posts, along with other embedded agentic considerations re: "ability to get").

Or should I just ask and let the author say that they'll be answered later in the sequence?

Doing this is fine! We're basically past the point where I wanted to avoid past framings, so people can talk about whatever (although I reserve the right to reply "this will be much easier to discuss later").

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-27T01:52:05.916Z · score: 25 (7 votes) · LW · GW

But this is not a one-time situation. If you're a professional musician, would you agree to mess up at every dress rehearsal, because it isn't the real show?

More indirectly... the whole point of "celebrating and practicing our ability to not push buttons" is that we need to be able to not push buttons, even when it seems like a good idea (or necessary, or urgent that we defect while we can still salvage the perceived situation). The vast majority of people aren't tempted by pushing a button when pushing it seems like an obviously bad idea. I think we need to take trust building seriously, and practice the art of actually cooperating. Real life doesn't grade you on how well you understand TDT considerations and how many blog posts you've read on it, it grades you on whether you actually can make the cooperation equilibrium happen.

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-27T00:22:26.628Z · score: 59 (18 votes) · LW · GW

Comment by turntrout on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-26T20:58:35.979Z · score: 36 (10 votes) · LW · GW

I'm leaning towards this not being a good trade, even though it's taxing to type that.

In the future, some people will find themselves in situations not too unlike this, where there are compelling utilitarian reasons for pressing the button.

Look, the system should be corrigible. It really, really should; the safety team's internal prediction market had some pretty lopsided results. There are untrustworthy actors with capabilities similar to or exceeding ours. If we press the button, it probably goes better than if they press it. And they can press it. Twenty people died since I started talking, more will die if we don't start pushing the world in a better direction, and do you feel the crushing astronomical weight of the entire future's eyes upon us? Even a small probability increase in a good outcome makes pressing the button worth it.

And I think your policy should still be to not press the button to launch a singleton from this epistemic state, because we have to be able to cooperate! You don't press buttons at will, under pressure, when the entire future hangs in the balance! If we can't even cooperate, right here, right now, under much weaker pressures, what do we expect of the "untrustworthy actors"?

So how about people instead donate to charity in celebration of not pressing the button?

ETA I have launch codes btw.

Comment by turntrout on Deducing Impact · 2019-09-25T18:01:01.282Z · score: 4 (2 votes) · LW · GW

Translation to normal spoiler text:

I think I might be spoilered from reading the original paper, but my guess is "The impact to someone of an event is how much it changes their ability to get what we want". Humans care about Iniron existing because it makes it harder to reduce suffering and increase happiness. (Not sure how to square this definition of impact with finding out new info that was already there, as in the case of Iniron, it had already existed, we just found out about it.) Pebblites care about all their pebbles becoming obsidian because it changes their ability to stack pebbles. Both humans and pebblites care about being hit by an asteroid because it's harder to pursue one's values if one is killed by an asteroid.

Comment by turntrout on Deducing Impact · 2019-09-25T17:59:49.856Z · score: 4 (2 votes) · LW · GW

Translation to normal spoiler text:

The impact of an event on you is the difference between the expected value of your utility function given certainty that the event will happen, and the current expected value of your utility function.

More formally, we say that the expected value of your utility function is the sum, over all possible worldstates $w$, of $P(w) \cdot U(w)$, while the expected value of your utility function given certainty that a statement $s$ about the world is true is the sum over all possible worldstates $w$ of $P(w \mid s) \cdot U(w)$. The impact of $s$ being true, then, is the absolute value of the difference of those two quantities.
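To make the commenter's proposed definition concrete, here is a minimal Python sketch. It assumes a finite set of worldstates, a probability table, and a utility table; the function names and the two-state asteroid example are hypothetical illustrations, not anything from the post or the original comment.

```python
def expected_utility(prob, utility):
    """Sum over worldstates of probability times utility."""
    return sum(p * utility[w] for w, p in prob.items())


def condition(prob, statement):
    """Renormalize the distribution onto worldstates where `statement` holds."""
    mass = sum(p for w, p in prob.items() if statement(w))
    if mass == 0:
        raise ValueError("cannot condition on a probability-zero statement")
    return {w: (p / mass if statement(w) else 0.0) for w, p in prob.items()}


def impact(prob, utility, statement):
    """Absolute change in expected utility from becoming certain of `statement`."""
    before = expected_utility(prob, utility)
    after = expected_utility(condition(prob, statement), utility)
    return abs(after - before)


# Hypothetical toy numbers: an asteroid strike is unlikely but ruins everything.
prior = {"asteroid_hits": 0.25, "asteroid_misses": 0.75}
utility = {"asteroid_hits": 0.0, "asteroid_misses": 8.0}

asteroid_impact = impact(prior, utility, lambda w: w == "asteroid_hits")
print(asteroid_impact)  # 6.0
```

Under this definition, learning something that leaves expected utility unchanged has zero impact, however surprising it is.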

Comment by turntrout on Deducing Impact · 2019-09-25T17:57:59.492Z · score: 3 (2 votes) · LW · GW

For ease of reference, I'm going to translate any ROT13 comments into normal spoilers.

Impact is the amount I must do things differently to reach my goals. At least three big strong intuitions. A thing that happens if it turns the results of my current actions to be way worse is big impact. A thing that happens if the feasibility or utility of action at my disposal is changed a lot then that is a big deal (which often means that action must be performed or must not be performed). If there is a lot of surprise but the way to overcome the surprises is to carry on exactly as I was already doing is low to no impact.

Comment by turntrout on How can I reframe my study motivation? · 2019-09-25T16:38:55.548Z · score: 2 (1 votes) · LW · GW

If you're trying to learn probability theory, I think you'd indeed be better off with Jaynes' Probability Theory: The Logic of Science over Eliezer's essays on Bayesian probability theory. However, in my experience, the Sequences offer a special suite of mental skills and stances I haven't found elsewhere.

Comment by turntrout on Deducing Impact · 2019-09-24T22:59:12.353Z · score: 3 (2 votes) · LW · GW

In draft.js, you have to start a new line. Like a quote. In Markdown, you do spoilers differently.

Check out the FAQ.

Comment by turntrout on Towards a mechanistic understanding of corrigibility · 2019-09-24T15:25:31.365Z · score: 2 (1 votes) · LW · GW

I think my claim is something like "hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle."

There seem to be different ways you can modify the objective. Take the solution to the easy problem of wireheading: I think we're comfortable saying there's a solution because the AI is obviously grading the future before it happens. No matter how smart you are, you're grading the future in an obviously-better way. So, we say the problem is solved. On the other extreme is AI boxing, where you put a bunch of traffic cones in the way of a distant oncoming car and say, "there's no way anyone could drive around this"!

Comment by turntrout on Value Impact · 2019-09-24T14:14:47.737Z · score: 2 (1 votes) · LW · GW

VI is dependent on some agent's ability to reach arbitrary value-driven goals, while OI depends on any agent's ability to reach goals in general.

VI depends on the ability to do one kind of goal in particular, like human values. OI depends on goals in general.

I'm not sure if there exists a robust distinction between the two - the post doesn't discuss any general demarcation tool.

If I understand correctly, this is wondering whether there are some impacts that count for ~50% of all agents, or 10%, or .01% - where do we draw the line? It seems to me that any natural impact (that doesn't involve something crazy like "if the goal encoding starts with '0', shut them off; otherwise, leave them alone") either affects a very low percentage of agents or a very high percentage of agents. So, I'm not going to draw an exact line, but I think it should be intuitively obvious most of the time.

Maybe I'm wrong, but I think the most important point to note here is that 'objectiveness' of an impact is defined not to be about the 'objective state of the world' - rather about how 'general to all agents' an impact is

This is exactly it.

Comment by turntrout on Towards a mechanistic understanding of corrigibility · 2019-09-24T01:47:39.797Z · score: 11 (3 votes) · LW · GW

This feeds into my general impression that we should in most cases be thinking about getting the system to really do what we want, rather than warping its utility function to try and de-motivate it from making trouble.

A decomposition that's been on my mind lately: we can center our framing on the alignment and motivation of the system's actual goal (what you're leaning towards), and we can also center our framing on why misspecifications are magnified into catastrophically bad behavior, as opposed to just bad behavior.

We can look at attempts to e.g. find one simple easy wish that gets what we want ("AI alignment researchers hate him! Find out how he aligns superintelligence with one simple wish!"), but by combining concepts like superexponential concept space/fragility of value and Goodhart's law, we can see why there shouldn't be a low complexity object-level solution. So, we know not to look.

My understanding of the update being done on your general impression here is: "there are lots of past attempts to apply simple fixes to avoid disastrous / power-seeking behavior, and those all break, and also complexity of value. In combination with those factors, there shouldn't be a simple way to avoid catastrophes because nearest-unblocked-solution."

But I suggest there might be something missing from that argument, because there isn't yet common gears-level understanding of why catastrophes happen by default, so how do we know that we can't prevent catastrophes from being incentivized? Like, it seems imaginable that we could understand the gears so well that we can avoid problems; after all, the gears underlying catastrophic incentives are not the same as the gears underlying specification difficulty.

It may in fact just be the case that yes, preventing catastrophic incentives does not admit a simple and obviously-correct solution! A strong judgment seems premature; it isn't obvious to me whether this is true. I do think that we should be thinking about why these incentives exist, regardless of whether there is a simple object-level solution.

Comment by turntrout on FactorialCode's Shortform · 2019-09-23T15:37:41.967Z · score: 4 (2 votes) · LW · GW

(This is the subject of the subsequent post, and will continue to be discussed in a few more posts to follow)

Comment by turntrout on Reframing Impact · 2019-09-22T03:52:16.894Z · score: 2 (1 votes) · LW · GW

Yeah, the Maximum Pink marble has a sheen on it, but outside of that admittedly obscure cue... there's only so many gradations of pink you can tell apart at once.

Comment by turntrout on TurnTrout's shortform feed · 2019-09-22T02:28:49.631Z · score: 4 (2 votes) · LW · GW

How does representation interact with consciousness? Suppose you're reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.

To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement learning, you haven't lost anything - all policies yield the same return they did before, the transitions/rules of the game haven't changed - in fact, there's a pretty strong isomorphism I can show between these two MDPs. All you've done is change the labels - representation means practically nothing to the mathematical object of the MDP, although many (e.g.) deep RL algorithms should be able to exploit regularities in the representation to reduce sample complexity.
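The relabeling point can be checked on a toy example. The sketch below is not Pac-Man: it assumes a hypothetical two-state deterministic MDP with made-up state names, rewards, and policy, and shows that swapping descriptive state names for bare numbers leaves the discounted return of the policy unchanged.

```python
def rollout_return(transitions, rewards, policy, start, horizon, gamma=0.9):
    """Discounted return of a deterministic policy in a deterministic MDP."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]
        total += discount * rewards[(state, action)]
        state = transitions[(state, action)]
        discount *= gamma
    return total


def relabel(transitions, rewards, policy, labels):
    """Rename every state via `labels`; dynamics and rewards are untouched."""
    new_t = {(labels[s], a): labels[s2] for (s, a), s2 in transitions.items()}
    new_r = {(labels[s], a): r for (s, a), r in rewards.items()}
    new_p = {labels[s]: a for s, a in policy.items()}
    return new_t, new_r, new_p


# Hypothetical "screens" standing in for game observations.
transitions = {
    ("pellet_ahead", "go"): "ghost_near",
    ("pellet_ahead", "wait"): "pellet_ahead",
    ("ghost_near", "go"): "pellet_ahead",
    ("ghost_near", "wait"): "ghost_near",
}
rewards = {
    ("pellet_ahead", "go"): 1.0,
    ("pellet_ahead", "wait"): 0.0,
    ("ghost_near", "go"): -1.0,
    ("ghost_near", "wait"): 0.0,
}
policy = {"pellet_ahead": "go", "ghost_near": "wait"}

# Throw away the screens and keep only their numeric labels.
labels = {"pellet_ahead": 0, "ghost_near": 1}
num_t, num_r, num_p = relabel(transitions, rewards, policy, labels)

original_return = rollout_return(transitions, rewards, policy, "pellet_ahead", 10)
relabeled_return = rollout_return(num_t, num_r, num_p, labels["pellet_ahead"], 10)
print(original_return == relabeled_return)  # True
```

The bijection `labels` is the isomorphism here: every trajectory in one MDP maps to a trajectory with the same rewards in the other, so a tabular learner cannot tell them apart. Only methods that generalize across states through their features could.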

So what does this mean? If you model the world as a partially observable MDP whose states are single numbers... can you still commit mindcrime via your deliberations? Is the structure of the POMDP in your head somehow sufficient for consciousness to be accounted for (like how the theorems of complexity theory govern computers both of flesh and of silicon)? I'm confused.