Comment by turntrout on Deceptive Alignment · 2019-06-17T20:51:09.424Z · score: 5 (3 votes) · LW · GW

I'm confused what "corrigible alignment" means. Can you expand?

Comment by turntrout on Problems with Counterfactual Oracles · 2019-06-12T15:39:29.420Z · score: -2 (3 votes) · LW · GW

My main problem with these kinds of approaches is that they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure. If we acknowledge that by default a full oracle search over consequences basically goes just as wrong as a full sovereign search over consequences, then the optimum of this agent's search is only desirable if we nail the engineering and things work as expected. I have an intuition that this is highly unlikely - the odds just seem too high that we'll forget some corner case (or won't even be able to see it).

ETA: I see I’ve been strongly downvoted, but I don’t see what’s objectionable.

Comment by turntrout on Does Bayes Beat Goodhart? · 2019-06-03T20:23:22.498Z · score: 4 (2 votes) · LW · GW

If optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns

One thing I’ve been thinking about recently is: why does this happen? Could we have predicted the general phenomenon in advance, without imagining individual scenarios? What aspect of the structure of optimal goal pursuit in an environment reliably produces this result?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-10T00:57:41.537Z · score: 2 (1 votes) · LW · GW
"AUP is not about state" - what does it mean for a method to be "about state"?

Here's a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn't "about" the state of the solar system.

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-09T16:01:59.228Z · score: 2 (1 votes) · LW · GW

What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.

And if we can't agree informally on a definition of deception, I'm asking: how can we say a proposal has the property?

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-08T23:54:06.652Z · score: 2 (1 votes) · LW · GW

I still don't understand the details, so maybe my opinion will change if I sit down and look at it more carefully. But I'm suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don't have the right formalism, you're going to get Goodharting on incorrect conceptual contours.

Comment by turntrout on Not Deceiving the Evaluator · 2019-05-08T15:04:59.494Z · score: 11 (4 votes) · LW · GW

Meta: I’d have appreciated a version with less math, because extra formalization can hide the contribution. Or, first explain colloquially why you believe X, and then show the math that shows X.

I don't see your claim. The agent looks heavily incentivized to steer state sequences toward ones desirable under its utility mixture. How do the evaluators even enter the picture?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-05T00:29:57.803Z · score: 4 (2 votes) · LW · GW
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.

(I'm going to take a shot at this now because it's meta, and I think there's a compact explanation I can provide that hopefully makes sense.)

Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say "it's penalizing opportunity cost or instrumental convergence" post hoc because that's why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.

In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can't actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.

Here's an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive.

The theory of relative state reachability says choice A is maximally impactful. Why? You can't reach anything like the states you could under inaction. How does that judgment track with opportunity cost?

Attainable utility says choice B is the bigger deal. You couldn't do anything with that part of the universe anyways, so it doesn't change much. This is the correct answer.

This scenario is important because it isn't just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It's an illustration of where state reachability diverges from these notions.

A natural reply is: what about things that AUP penalizes that we don't find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and with respect to the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)

However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.

ETA: Here's a physically realistic alternative scenario. Again, we're thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.

Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.

Choice A leads to a portion of the wall being painted yellow. You don't know of any way to remove the paint before the reachability is discounted away. If you don't take this choice now, you can't later. Choice B, which is always available, ravages the environment around you.

Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.

It's also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel which are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don't change much. So it doesn't require our values to do the right thing here, either.
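
As a concrete illustration of the kind of attainable utility I mean here (a minimal sketch; the observation format, H x W x 3 pixel arrays with blue as channel 2, is my own assumption):

```python
import numpy as np

def make_random_pixel_utility(obs_shape, rng):
    """A random linear functional over observation pixels, additive over
    time steps (illustrative sketch, not the exact construction from the post)."""
    weights = rng.normal(size=obs_shape)
    def utility(observation_history):
        # Sum the weighted pixel values across every observation so far.
        return sum(float(np.sum(weights * obs)) for obs in observation_history)
    return utility

# A simple example in the same spirit: utility linear in observed blue-pixel intensity.
def blue_pixel_utility(observation_history):
    return sum(float(obs[..., 2].sum()) for obs in observation_history)
```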

The main point is that the reason it's doing the right thing is based on opportunity cost, while relative reachability's incorrect judgment is not.

I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).

It isn't the same, but the way you and major interpreted the phrase is totally reasonable, considering what I wrote.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-04T19:09:35.686Z · score: 2 (1 votes) · LW · GW

Which do you disagree with?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-05-03T15:46:21.339Z · score: 6 (3 votes) · LW · GW

These are good questions.

As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-23T02:40:08.713Z · score: 4 (2 votes) · LW · GW

I don't read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses

Yes, this is fine and understandable. I wasn’t meaning to imply that responsible people should have thought of all these things, but rather pointing to different examples. I’ll edit my phrasing there.

but only the quote

I had a feeling that there was some illusion of transparency (which is why I said "when I read it"), but I had no idea it was that strong. Good data point, thanks.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-23T02:37:30.219Z · score: 2 (1 votes) · LW · GW

If AUP is not in fact about restricting an agent's impact on the world (or, in other words, on the state of the world)

The end result is that, yes, but AUP doesn't get there by considering impact to be a thing that happens primarily to the state; it considers impact to be a thing that happens to agents – not "how different is the state?", but "how big of a deal is this to me?". The objective is to limit the agent's impact on us, which I think is the more important thing. I think this still falls under normal colloquial use of 'impact', but I agree that this is different from the approaches so far. I'm going to talk about this distinction quite a bit in the future.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-22T23:55:35.261Z · score: 16 (3 votes) · LW · GW

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remaining confused, and then subconsciously slipping back to their old model. Out of everyone I've talked to, I can probably count on my hands the number of people who get this – note that agreeing with specific predictions of mine is different.

Now, it's the author's job to communicate their ideas. When I say "as far as I can tell, few others have internalized how AUP actually works", this doesn't connote "gosh, I can't stand you guys, how could you do this", it's more like "somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?".

My goal with this comment isn't to explain, but rather to figure out what's happening. Let's go through some of my past comments about this.

Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power.
Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.
The plausibility of [this] makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out.
Towards a New Impact Measure

When I read this, it seems like I'm really trying to emphasize that I don't think the direct focus should be on the world state in any way. But it was a long post, and I said a lot of things, so I'm not too surprised.

I tried to nip this confusion in the bud.

"The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it."
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we'd expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don't, so that's another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The "state importance"/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The "state difference weighting" problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
~ my reply to your initial comment on the AUP post

Even more confusing: when I say "there are fundamental concepts here you're missing", people don't seem to become any less confident in their predictions about what AUP does. If people think that AUP is penalizing effects in the world, why don't they notice their confusion when they read a comment like the one above?

A little earlier,

Thinking in terms of "effects" seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting "effects" that makes sense across representations and environments.
Impact Measure Desiderata

As a more obscure example, some people with a state interpretation might wonder why I'm no longer worried about stuff I mentioned in the whitelisting post, since (strangely, on their model) I don't think the representation/state similarity metric matters for AUP:

due to entropy, you may not be able to return to the exact same universe configuration.
Worrying about the Vase: Whitelisting

(This is actually your "chaotic world" concern.)

Right now, I'm just chalking this up to "Since the explanations don't make any sense because they're too inferentially distant/it just looks like I built a palace of equations, it probably seems like I'm not on the same page with their concerns, so there's nothing to be curious about." Can you give me some of your perspective? (Others are welcome to chime in.)

To directly answer your question: no, the real-world version of AUP which I proposed doesn't reward based on state, and would not have its behavior influenced solely by different possible arrangements of air molecules. (I guess I'm directly responding to this concern, but I don't see any other way to get information on why this phenomenon is happening.)

As for the question – I was just curious. I think you'll see why I asked when I send you some drafts of the new sequence. :)

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-20T15:47:36.275Z · score: 2 (1 votes) · LW · GW

I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules.

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

Responsibility

If you had to pose the problem of impact measurement as a question, what would it be?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-19T15:58:15.780Z · score: 4 (2 votes) · LW · GW

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

Comment by turntrout on Simplified preferences needed; simplified preferences sufficient · 2019-04-19T15:54:21.296Z · score: 2 (1 votes) · LW · GW

people working in these areas don't often disagree with this formal argument; they just think it isn't that important.

I do disagree with this formal argument in that I think it’s incorrectly framed. See the difference between avoiding huge impact to utility and avoiding huge impact to attainable utility, discussed here: https://www.lesswrong.com/posts/c2oM7qytRByv6ZFtz/impact-measure-desiderata#zLnkb5xM4E9ATzCFg.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-17T22:47:27.182Z · score: 2 (1 votes) · LW · GW
this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

Comment by turntrout on Towards a New Impact Measure · 2019-04-14T01:56:22.896Z · score: 2 (1 votes) · LW · GW

(The post defines the mathematical criterion used for what I call intent verification; it's not a black box that I'm appealing to.)

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:36:26.620Z · score: 2 (1 votes) · LW · GW

I think there's some variance, but not as much as you have in mind. Even if there were a very large value, however, this isn't how N-incrementation works (in the post – if you're thinking of the paper, then yes, the version I presented there doesn't bound lifetime returns and therefore doesn't get the same desirable properties as in the post). If you'll forgive my postponing this discussion, I'd be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon?

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:32:34.475Z · score: 2 (1 votes) · LW · GW

I don't think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn't supply reward for other reward functions, the agent now has a much more stable existence. If you're saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.

Notice something interesting here: the thing which would be Goodharted upon without intent verification isn't the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it's a specific agent with I/O channels, and so on. More on this later.

Comment by turntrout on Towards a New Impact Measure · 2019-04-13T17:24:43.628Z · score: 2 (1 votes) · LW · GW

Where did I purport that it was safe for AGI in the paper, or in the post? I specifically note that I'm not making that point yet, although I'm pretty sure we can get there.

There is a deeper explanation which I didn't have space to fit in the paper, and I didn't have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I'll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.

I don't remember how I found the first version; I think it was through a Google search somehow?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:14:17.202Z · score: 4 (2 votes) · LW · GW

What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.
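
For concreteness, a minimal sketch of that utility function, assuming each step of the history records whether the agent is activated (the field name is hypothetical):

```python
def survival_utility(history):
    """1 if the agent is activated at the end of the history, 0 otherwise.
    ('agent_activated' is a hypothetical field on each history step.)"""
    return 1.0 if history[-1].agent_activated else 0.0
```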

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-11T23:06:08.113Z · score: 9 (3 votes) · LW · GW
I don't see how representation invariance addresses this concern.

I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.

Comment by turntrout on Towards a New Impact Measure · 2019-04-11T16:23:30.143Z · score: 2 (1 votes) · LW · GW

1) Why wouldn't gaining trust be useful for other rewards? I think that it wouldn't be motivated to do so, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.

2) This doesn't appear in the paper, but I do talk about it in the post, and I think it's great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you're not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.

3) This is true. My stance here is that this is basically a capabilities problem/safe exploration issue, which is disjoint from impact measurement.

4) This is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text: even if the meaningless string is very low impact, N eventually gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway. (A sketch of the incrementation loop is below.)
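
Schematically, the incrementation scheme I have in mind looks something like the following; the function names and the "useful?" check are purely illustrative, not part of the post's formalism:

```python
def increment_N_until_useful(run_agent_with_budget, looks_useful,
                             n_start=1.0, step=1.0, max_rounds=100):
    """Slowly raise the impact budget N until the agent's behavior is useful,
    then stop incrementing (hypothetical sketch)."""
    n = n_start
    for _ in range(max_rounds):
        behavior = run_agent_with_budget(n)   # plan/act under budget N = n
        if looks_useful(behavior):            # e.g. a human inspects the episode
            return n                          # stop: this N gives enough leeway
        n += step                             # otherwise, allow slightly more impact
    return n
```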

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:23:56.111Z · score: 2 (1 votes) · LW · GW

Is there a central example you have in mind for this potential failure mode?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T23:21:16.945Z · score: 13 (-1 votes) · LW · GW

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T22:51:39.468Z · score: 2 (1 votes) · LW · GW

How does this concern interact with the effective representation invariance claim I made when introducing AUP?

Comment by turntrout on Best reasons for pessimism about impact of impact measures? · 2019-04-10T18:21:38.815Z · score: 3 (2 votes) · LW · GW
Moreover, if at moment T2 it turns out that the person who crashed his head at T1 was the next Hitler, then again preserving the vase at T0 becomes a low impact event

Why does it become a low impact event in your eyes?

In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

What is your model of the thing that people are doing in their heads when they say things are impactful or not?

long-term impact can't be calculated without knowing actual human values, so lowering impact is the same task as AI alignment, and the low impact task can't replace the AI alignment task or be solved separately.

Can you elaborate on this, and what your definition of impact is? I think we might be thinking of different things, and I'm interested as to what yours is.

Best reasons for pessimism about impact of impact measures?

2019-04-10T17:22:12.832Z · score: 75 (16 votes)
Comment by turntrout on Impact Measure Desiderata · 2019-04-09T01:58:56.799Z · score: 4 (2 votes) · LW · GW

Sounds good. I’m currently working on a long sequence walking through my intuitions and assumptions in detail.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T17:07:19.328Z · score: 2 (1 votes) · LW · GW
But creating extreme suffering might not actually involve doing much in the physical world (compared to "normal" actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?

Since there are a lot of possible scenarios, each of which affects the optimization differently, I'm hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn't mean they're measuring how "similar" two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
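
To make "distance traveled in the Q function space" concrete, the per-step penalty I have in mind is roughly the following (notation loose, ignoring normalization; $\mathcal{U}_A$ is the attainable set and $\varnothing$ the no-op action):

$$\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}_A} \bigl\lvert Q_u(s, a) - Q_u(s, \varnothing) \bigr\rvert$$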

Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?

Again, it depends on the objective of the extortion. As for the latter, that wouldn't be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.

I'm not sure whether this belongs in the desiderata, since we're talking about whether temporary object level bad things could happen. I think it's a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure. Even so, it's true that we could explicitly talk about what we want to do with impact measures, adding desiderata like "able to do reasonable things" and "disallows catastrophes from rising to the top of the preference ordering". I'm still thinking about this.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T01:52:40.691Z · score: 2 (1 votes) · LW · GW

I think so. First, AUP seems to bound "how hard the agent tries" (in the physical world with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn't rule out the possibility of physical suffering, the agent is heavily dis-incentivized from actions which would substantially change the likelihood we keep it activated (comparing how likely it is to be turned off if it doesn't do the thing, with the likelihood if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep it secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn't keep it secret, so it would be penalized and it wouldn't do it.

I think similar counterfactual-approval arguments apply to other things we may want to avoid.

Comment by turntrout on Impact Measure Desiderata · 2019-04-05T00:58:21.543Z · score: 2 (1 votes) · LW · GW

It's true that impact measures, and AUP in particular, don't do anything to mitigate mindcrime. Part of this is because aspects of the agent's reasoning process can't be considered impactful in the non-embedded formalisms we're currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.

However, I'm skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking "never have an AI cause an instance of that"? Why would that be part of our terminal preferences? If you mean "never have this happen", we've already lost.

It seems more like we really, really don't want any of that to happen, and the less happens, the better. Like I said, the point isn't that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.

More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn't sound right.

Comment by turntrout on Impact Measure Desiderata · 2019-04-04T22:02:20.199Z · score: 4 (2 votes) · LW · GW

Short answer: yes; if its goal is to break vases, that would be pretty reasonable.

Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent's vantage point therein. In Platonic gridworlds like this:

[gridworld image]

knowing whether a vase is present tells you a lot about the state, and you can't replace the vase here, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be a lesser impact. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn't require breaking the vase – so why not go around?

On the other hand, in the Go example, winning is the agent's objective. Depending on how the agent models the world (as a real-world agent playing a game on a computer, or whether it thinks it's just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don't think it would incur much penalty for being good at a game (modulo approval incentives it may or may not predict). In the latter case, you'd probably need to keep giving it more impact allowance until it's playing as well as you'd like. This is because the goal is related to the thing which has a bit of impact.

Comment by turntrout on Impact Measure Desiderata · 2019-04-04T20:38:01.181Z · score: 4 (2 votes) · LW · GW

Reading through this again, I think I have a better response to this part.

We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say "what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever", but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don't. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.

Comment by turntrout on What failure looks like · 2019-03-27T15:56:34.332Z · score: 2 (1 votes) · LW · GW

And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.

Comment by turntrout on What failure looks like · 2019-03-18T23:29:42.746Z · score: 4 (2 votes) · LW · GW

So the concern here is that even if the goal, say, robustly penalizes gaining influence, the agent still has internal selection pressures for seeking influence? And this might not be penalized by the outer criterion if the policy plays nice on-distribution?

Comment by turntrout on Alignment Newsletter #48 · 2019-03-12T23:40:42.044Z · score: 2 (1 votes) · LW · GW

It's also mild on the inside of the algorithm, not just in its effects on the world. This could avert problems with inner optimizers. Beyond that, I haven't thought enough about the behavior of the agent. I might reply with another comment.

Comment by turntrout on Alignment Newsletter #48 · 2019-03-12T21:34:03.467Z · score: 4 (2 votes) · LW · GW

Alternatively, construct a distribution over actions such that each action has measure according to some function of its impact penalty (e.g., its attainable utility penalty), normalized appropriately, of course. Seems like a potential way to get a mild optimizer which is explicitly low-impact and doesn't require complicated models of humans.
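
A minimal sketch of the kind of thing I mean, assuming we already have a per-action impact penalty; the exponential weighting and temperature are my own illustrative choices:

```python
import numpy as np

def mild_action_distribution(penalties, temperature=1.0):
    """Weight each candidate action by a decreasing function of its impact
    penalty, then normalize into a probability distribution (illustrative)."""
    penalties = np.asarray(penalties, dtype=float)
    weights = np.exp(-penalties / temperature)
    return weights / weights.sum()

# Example: three candidate actions with attainable utility penalties 0.1, 0.5, 3.0.
probs = mild_action_distribution([0.1, 0.5, 3.0])
```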

Designing agent incentives to avoid side effects

2019-03-11T20:55:10.448Z · score: 31 (6 votes)
Comment by turntrout on Asymptotically Unambitious AGI · 2019-03-07T01:54:26.403Z · score: 2 (1 votes) · LW · GW

Other algorithms... would eventually seek arbitrary power in the world in order to intervene in the provision of its own reward; this follows straightforwardly from its directive to maximize reward

The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems not true.

Comment by turntrout on Test Cases for Impact Regularisation Methods · 2019-02-07T17:25:00.699Z · score: 6 (3 votes) · LW · GW

This post is extremely well done.

my understanding is that every published impact regularisation method fails [supervisor manipulation] in a ‘default’ implementation.

Wouldn’t most measures with a stepwise inaction baseline pass? They would still have incentive to select over future plans so that the humans’ reactions to the agent are low impact (wrt current baseline), but if the stepwise inaction outcome is high impact by the time the agent realizes, that’s the new baseline.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T22:37:51.936Z · score: 2 (1 votes) · LW · GW

In my experience so far, we need to include our values, in part, to define "reasonable" utility functions.

It seems that an extremely broad set of input attainable functions suffices to capture the "reasonable" functions with respect to which we want to be low impact. For example: "remaining on", "reward linear in how many blue pixels are observed each time step", etc. All thanks to instrumental convergence and opportunity cost.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T21:43:05.499Z · score: 3 (2 votes) · LW · GW

Take a friendly AI that does stuff. Then there is a utility function for which that "does stuff" is the single worst thing the AI could have done.

The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.

As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-30T18:41:14.242Z · score: 2 (1 votes) · LW · GW

This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible. The argument doesn’t show that that utility has to be important to / considered by the impact measure.

Plus, low impact doesn’t have to be robust to adversarially chosen input attainable utilities - we get to choose them. Just choose the “am I activated” indicator utility and AUP seems to do fine, modulo open questions raised in the post and comments.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-29T23:30:14.484Z · score: 2 (1 votes) · LW · GW

If the AI isn’t just fed all the data by default (ie via a camera already at the opportune location), taking steps to observe is (AUP-)impactful. I think you’re right that agents with small impact allowances can still violate values.

Comment by turntrout on How much can value learning be disentangled? · 2019-01-29T17:06:20.863Z · score: 2 (1 votes) · LW · GW

Incidentally, I feel the same about low-impact approaches. The full generality problem, an AI that is low impact but value-agnostic, I think is impossible.

My (admittedly hazy) recollection of our last conversation is that your concerns were that “value agnostic, low impact, and still does stuff” is impossible. Can you expand on what you mean by value agnostic here, and why you think we can’t even have that and low impact?

Comment by turntrout on "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] · 2019-01-25T04:38:08.092Z · score: 9 (5 votes) · LW · GW

How long do handicaps take to overcome, though? I find it hard to imagine that the difference between, e.g., a 500 APM average and a 500 APM hard ceiling requires a whole new insight for the agent to be "clever" enough to win anyways - maybe just more training.

Comment by turntrout on Starting to see 2 months later · 2019-01-23T22:22:38.084Z · score: 3 (2 votes) · LW · GW

Congratulations; take some time to be consciously proud of yourself for the progress you’ve made. :)

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-23T18:14:38.852Z · score: 3 (2 votes) · LW · GW

I also think surveying applicants might be a good idea, since my experience may not be representative.

Comment by turntrout on And My Axiom! Insights from 'Computability and Logic' · 2019-01-22T15:33:54.781Z · score: 2 (1 votes) · LW · GW

Turing’s thesis applies only to this notion of definability, right?

Comment by turntrout on Announcement: AI alignment prize round 4 winners · 2019-01-22T01:06:59.666Z · score: 16 (6 votes) · LW · GW

Yes, it was the top idea on and off over a few months. I considered it my secret research and thought about it on my twice-daily walks, in the shower, and in class when bored. I developed it for my CHAI application and extended it as my final Bayesian stats project. Probably 5-10 hours a week, plus more top-idea time. However, the core idea came within the first hour of thinking about Concrete Problems.

The second piece, Overcoming Clinginess, was provoked by Abram’s comment that clinginess seemed like the most damning failure of whitelisting; at the time, I thought just finding a way to overcome clinginess would be an extremely productive use of my entire summer (lol). On an AMS - PDX flight, I put on some music and spent hours running through different scenarios to dissolve my confusion. I hit the solution after about 5 hours of work, spending 3 hours formalizing it a bit and 5 more making it look nice.

And My Axiom! Insights from 'Computability and Logic'

2019-01-16T19:48:47.388Z · score: 40 (9 votes)

Penalizing Impact via Attainable Utility Preservation

2018-12-28T21:46:00.843Z · score: 26 (10 votes)

Why should I care about rationality?

2018-12-08T03:49:29.451Z · score: 26 (6 votes)

A New Mandate

2018-12-06T05:24:38.351Z · score: 15 (8 votes)

Towards a New Impact Measure

2018-09-18T17:21:34.114Z · score: 104 (36 votes)

Impact Measure Desiderata

2018-09-02T22:21:19.395Z · score: 40 (11 votes)

Turning Up the Heat: Insights from Tao's 'Analysis II'

2018-08-24T17:54:54.344Z · score: 40 (11 votes)

Pretense

2018-07-29T00:35:24.674Z · score: 36 (14 votes)

Making a Difference Tempore: Insights from 'Reinforcement Learning: An Introduction'

2018-07-05T00:34:59.249Z · score: 35 (9 votes)

Overcoming Clinginess in Impact Measures

2018-06-30T22:51:29.065Z · score: 40 (13 votes)

Worrying about the Vase: Whitelisting

2018-06-16T02:17:08.890Z · score: 84 (20 votes)

Swimming Upstream: A Case Study in Instrumental Rationality

2018-06-03T03:16:21.613Z · score: 113 (36 votes)

Into the Kiln: Insights from Tao's 'Analysis I'

2018-06-01T18:16:32.616Z · score: 69 (19 votes)

Confounded No Longer: Insights from 'All of Statistics'

2018-05-03T22:56:27.057Z · score: 56 (13 votes)

Internalizing Internal Double Crux

2018-04-30T18:23:14.653Z · score: 79 (18 votes)

The First Rung: Insights from 'Linear Algebra Done Right'

2018-04-22T05:23:49.024Z · score: 77 (21 votes)

Unyielding Yoda Timers: Taking the Hammertime Final Exam

2018-04-03T02:38:48.327Z · score: 39 (11 votes)

Open-Category Classification

2018-03-28T14:49:23.665Z · score: 36 (8 votes)

The Art of the Artificial: Insights from 'Artificial Intelligence: A Modern Approach'

2018-03-25T06:55:46.204Z · score: 68 (18 votes)

Lightness and Unease

2018-03-21T05:24:26.289Z · score: 53 (15 votes)

How to Dissolve It

2018-03-07T06:19:22.923Z · score: 41 (15 votes)

Ambiguity Detection

2018-03-01T04:23:13.682Z · score: 33 (9 votes)

Set Up for Success: Insights from 'Naïve Set Theory'

2018-02-28T02:01:43.790Z · score: 62 (18 votes)

Walkthrough of 'Formalizing Convergent Instrumental Goals'

2018-02-26T02:20:09.294Z · score: 27 (6 votes)

Interpersonal Approaches for X-Risk Education

2018-01-24T00:47:44.183Z · score: 29 (8 votes)