Best reasons for pessimism about impact of impact measures?

post by TurnTrout · 2019-04-10T17:22:12.832Z · LW · GW · 9 comments

This is a question post.

Contents

  Answers
    29 Vaniver
    25 abramdemski
    16 DanielFilan
    16 Wei_Dai
    14 Vika
    7 ofer
    4 avturchin
9 comments

Habryka recently wrote [LW · GW] (emphasis mine):

My inside views on AI Alignment make me think that work on impact measures is very unlikely to result in much concrete progress on what I perceive to be core AI Alignment problems, and I have talked to a variety of other researchers in the field who share that assessment. I think it’s important that this grant not be viewed as an endorsement of the concrete research direction that Alex is pursuing, but only as an endorsement of the higher-level process that he has been using while doing that research.

As such, I think it was a necessary component of this grant that I have talked to other people in AI Alignment whose judgment I trust, who do seem excited about Alex’s work on impact measures. I think I would not have recommended this grant, or at least this large of a grant amount, without their endorsement. I think in that case I would have been worried about a risk of diverting attention from what I think are more promising approaches to AI Alignment, and a potential dilution of the field by introducing a set of (to me) somewhat dubious philosophical assumptions.

I'm interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I'm not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

There are two reasons you could believe that "work on impact measures is very unlikely to result in much concrete progress on… core AI Alignment problems". First, you might think that the impact measurement problem is intractable, so work is unlikely to make progress. Second, you might think that even a full solution wouldn't be very useful.

Over the course of 5 minutes by the clock, here are the reasons I generated for pessimism (reasons I either presently agree with, or find it reasonable for an intelligent critic to raise on the basis of currently-public reasoning):


Paul raised concerns along these lines [LW(p) · GW(p)]:

We'd like to build AI systems that help us resolve the tricky situation that we're in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default "terrible" scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it's not going to be because we wanted the AI to have a small impact and it had a large impact. It's probably going to be because we have a very limited idea what is going on, but we don't feel like we have the breathing room to step back and chill out (at least not for long) because we don't believe that everyone else is going to give us time.

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we'll fail at alignment with a bang---it's more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn't let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like "low objective impact" is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that's a coherent situation to think about and plan for, but we shouldn't mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it's the best hope if you were very pessimistic about what I consider "mainline" alignment.)

Answers

answer by Vaniver · 2019-04-11T19:09:00.203Z · LW(p) · GW(p)

When I think about solutions to AI alignment, I often think about 'meaningful reductionism.' That is, if I can factor a problem into two parts, and the parts don't actually rely on each other, now I have two smaller problems to solve. But if the parts are reliant on each other, I haven't really simplified anything yet.

While impact measures feel promising to me as a cognitive strategy (often my internal representation of politeness feels like 'minimizing negative impact', like walking on sidewalks in a way that doesn't startle birds), they don't feel promising to me as reductionism. That is, if I already had a solution to the alignment problem, then impact measures would likely be part of how I implement that solution, but solving it separately from alignment doesn't feel like it gets me any closer to solving alignment.

[The argument here I like most rests on the difference between costs and side effects; we don't want to minimize side effects because that leads to minimizing good side effects also, and it's hard to specify the difference between 'side effects' and 'causally downstream effects,' and so on. But if we just tell the AI "score highly on a goal measure while scoring low on this cost measure," this only works if we specified the goal and the cost correctly.]

But there's a different approach to AI alignment, which is something more like 'correct formalisms.' We talk sometimes about handing a utility function to the robot, or (in old science fiction) providing it with rules to follow, or so on, and by seeing what it actually looks like when we follow that formalism we can figure out how well that formalism fits to what we're interested in. Utility functions on sensory inputs don't seem alignable because of various defects (like wireheading), and so it seems like the right formalism needs to have some other features (it might still be a utility function, but it needs to be a utility function over mental representations of external reality in such a way that the mental representation tracks external reality even when you have freedom to alter your mental representation, in a way that we can't turn into code yet).

So when I ask myself questions like "why am I optimistic about researching impact measures now?" I get answers like "because exploring the possibility space will make clear exactly how the issues link up." For example, looking at things like relative reachability made it clear to me how value-laden the ontology needs to be in order for a statistical measure on states to be meaningful. This provides a different form-factor for 'transferring values to the AI'; instead of trying to ask something like "is scenario A or B better?" and train a utility function, I might instead try to ask something like "how different are scenarios A and B?" or "how are scenarios A and B different?" and train an ontology, with the hopes that this makes other alignment problems easier because the types line up somewhat more closely.

[I think even that last example still performs poorly on the 'meaningful reductionism' angle, since getting more options for types to use in value loading doesn't seem like it addresses the core obstacles of value loading, but provides some evidence of how it could be useful or clarify thinking.]

comment by TurnTrout · 2019-04-17T22:47:27.182Z · LW(p) · GW(p)
this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

Replies from: Vaniver
comment by Vaniver · 2019-04-20T04:26:08.560Z · LW(p) · GW(p)

Suppose the goal dramatically overvalues some option; then the AI would be willing to pay large (correctly estimated) costs in order to achieve "even larger" (incorrectly estimated) gains.
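A toy numeric sketch of Vaniver's point (hypothetical numbers, my own illustration rather than anything from the thread): an agent that maximizes a misspecified goal estimate minus a correctly specified impact cost will still pay large, correctly-measured costs to chase illusory gains.

```python
# Hypothetical action menu: the goal specification overvalues the
# drastic plan, while the impact cost is estimated correctly.
actions = {
    "modest_plan":  {"estimated_value": 10,   "impact_cost": 1},
    "drastic_plan": {"estimated_value": 1000, "impact_cost": 50},
}

LAMBDA = 1.0  # weight on the (correct) impact penalty

def agent_score(a):
    # The agent optimizes its *estimated* goal value minus the correct cost.
    return actions[a]["estimated_value"] - LAMBDA * actions[a]["impact_cost"]

chosen = max(actions, key=agent_score)
# The overvalued drastic plan wins despite its large, correctly-measured cost.
```

Even raising `LAMBDA` only helps until the goal misspecification grows to match, which is the sense in which a correct cost term alone doesn't rescue a wrong goal.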

answer by abramdemski · 2019-04-23T01:05:24.869Z · LW(p) · GW(p)
  • Giving people a slider with "safety" written on one end and "capability" written on the other, and then trying to get people to set it close enough to the "safety" end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
    • An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.
    • Even better (but similarly), an approach where capability and alignment go hand in hand would be ideal -- a way to directly optimize for "what I mean, not what I say", such that it is obvious that things are just worse if you depart from this.
    • However, maybe those things are just pipe dreams -- this should not be the fundamental reason to ignore impact measures, unless promising approaches in the other two categories are pointed out; and even then, impact measures as a backup plan would still seem desirable.
      • My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they seem better in terms of the objection which follows.
      • Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure.)
        • In other words, conditioned on having a story more like "this is how you get the most of what you want" rather than a slider reading "safety ------- capability", I more expect to see a mild optimizer as opposed to an impact measure.
  • Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn't exactly what we want.
    • It is apparent that some attempted impact measures run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem when a lot of optimization pressure is applied. This gives reason for concern even if you can't spot a concrete problem with a given impact measure: impact measures don't address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
    • If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
      • (It is a desirable feature of safety measures, that those safety measures do not risk decreasing alignment.)
    • On the other hand, a mild optimizer seems to get the spirit of what's wanted from low-impact.
      • This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
  • Both mild optimization and impact measures require separate approaches to "doing what people want".
    • Arguably this is OK, because they could greatly reduce the bar for alignment of specified utility functions. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of "do what I want" before we can be confident even in low-impact or mild-optimization approaches, because it is difficult to confidently say that an approach avoids risk of hugely violating your preferences while still being so confused about what human preference even is.
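To make the contrast with impact penalties concrete, here is a minimal sketch of quantilization in the spirit of mild optimization (the function and numbers are my own illustration, not from the thread): instead of argmaxing a proxy utility, sample candidate actions from a trusted base distribution and choose among the top fraction, which caps the optimization pressure applied to the proxy.

```python
import random

def quantilize(utility, base_sample, q=0.1, n=1000):
    """Toy quantilizer: draw n candidate actions from a trusted base
    distribution, then choose uniformly among the top q fraction by
    proxy utility, rather than taking the global argmax."""
    samples = [base_sample() for _ in range(n)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]
    return random.choice(top)

# Hypothetical demo: the base distribution proposes 'ordinary' actions,
# so extreme Goodhart actions it almost never proposes are almost never
# chosen, no matter how highly the proxy utility would score them.
random.seed(0)
action = quantilize(utility=lambda x: x, base_sample=random.random)
```

The safety story rests on the base distribution, not on correctly pricing impacts, which is one way of cashing out why a mild optimizer can feel closer to "anti-Goodhart technology" than a slider between safety and capability.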
answer by DanielFilan · 2019-04-10T20:45:16.851Z · LW(p) · GW(p)

My concern is similar to Wei Dai's: it seems to me that at a fundamental physical level, any plan involving turning on a computer that does important stuff will make pretty big changes to the world's trajectory in phase space. Heat dissipation will cause atmospheric particles to change their location and momentum, future weather patterns will be different, people will do things at different times (e.g. because they're waiting for a computer program to run, or because the computer is designed to change the flow of traffic through a city), meet different people, and have different children. As a result, it seems hard for me to understand how impact measures could work in the real world without a choice of representation very close to the representation humans use to determine the value of different worlds. I suspect that this will need input from humans similar to what value learning approaches might need, and that once it's done one could just do value learning and dispense with the need for impact measures. That being said, this is more of an impression than a belief - I can't quite convince myself that no good method of impact regularisation exists, and some other competent people seem to disagree with me.

comment by TurnTrout · 2019-04-10T22:51:39.468Z · LW(p) · GW(p)

How does this concern interact with the effective representation invariance claim I made when introducing AUP?

Replies from: Vika, DanielFilan
comment by Vika · 2019-04-11T15:16:37.410Z · LW(p) · GW(p)

I don't see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don't capture butterfly effects (e.g. person X can distinguish between the world state where person Y exists and the world state where person Z exists, but can't predict that choosing a different route to work will cause person Z to exist instead of person Y).

I agree with Daniel that this is a major problem with impact measures. I think that to get around this problem we would either need to figure out how to distinguish butterfly effects from other effects (and then include all the butterfly effects in the inaction branch) or use a weak world model that does not capture butterfly effects (similarly to humans) for measuring impact. Even if we know how to do this, it's not entirely clear whether we should avoid penalizing butterfly effects. Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.

Replies from: TurnTrout, DanielFilan
comment by TurnTrout · 2019-04-11T23:06:08.113Z · LW(p) · GW(p)
I don't see how representation invariance addresses this concern.

I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.

comment by DanielFilan · 2019-04-11T18:20:34.307Z · LW(p) · GW(p)

Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.

Indeed - a point I think is illustrated by the Chaotic Hurricanes test case [AF · GW]. I'm probably most excited about methods that would use transparency techniques to determine when a system is deliberately optimising for a part of the world (e.g. the members of the long-term future population) that we don't want it to care about, but this has a major drawback of perhaps requiring multiple philosophical advances into the meaning of reference in cognition and a greater understanding of what optimisation is.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-11T23:14:17.202Z · LW(p) · GW(p)

What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.

comment by DanielFilan · 2019-04-10T23:07:26.475Z · LW(p) · GW(p)

I think that under the worldview of this concern, the distribution of reward functions effectively defines a representation that, if too different from the one humans care about, will either mean that no realistic impact is possible in the real world or be ineffective at penalising unwanted negative impacts.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-10T23:23:56.111Z · LW(p) · GW(p)

is there a central example you have in mind for this potential failure mode?

Replies from: DanielFilan
comment by DanielFilan · 2019-04-11T18:10:38.046Z · LW(p) · GW(p)

No.

comment by Rohin Shah (rohinmshah) · 2019-04-11T17:48:43.068Z · LW(p) · GW(p)
I can't quite convince myself that no good method of value learning exists, and some other competent people seem to disagree with me.

No good method of measuring impact, presumably?

Replies from: DanielFilan
comment by DanielFilan · 2019-04-11T18:10:16.742Z · LW(p) · GW(p)

I prefer the phrase 'impact regularisation', but indeed that was a slip of the mind.

answer by Wei Dai (Wei_Dai) · 2019-04-10T20:04:27.556Z · LW(p) · GW(p)

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture. It seems like AUP tries to get around this by aiming at a lower bar than "avoid negative side effects", namely "avoid catastrophic side effects", and aside from whether it actually succeeds at clearing this lower bar, it would mean that an AI that is only "safe" because of AUP can't be safely used for ordinary goals (e.g., invent a better widget, or make someone personally more successful in life) and instead we have to somehow restrict them to being used just for goals that relate to x-risk reduction, where it's worthwhile to risk incurring less-than-catastrophic negative side effects.

As a side note, it seems generally the case that some approaches to AI safety/alignment aim at the higher bar of "safe for general use" and others aim at "safe enough to use for x-risk reduction", and this isn't always made clear, which can be a source of confusion for both AI safety/alignment researchers and others such as strategists and policy makers.

comment by Matthew Barnett (matthew-barnett) · 2019-04-11T02:34:30.051Z · LW(p) · GW(p)
I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world

Do you think there any experiments that could be performed that would change your view on this point? Or is an impact measure the type of thing that does not generalize well from testing environment to the real world?

comment by TurnTrout · 2020-12-12T19:16:21.002Z · LW(p) · GW(p)

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture.

Although a far cry from "[avoiding side effects] in the real world", see Avoiding Side Effects in Complex Environments [LW · GW] as another piece of evidence to update on.

answer by Vika · 2019-04-19T12:51:08.720Z · LW(p) · GW(p)

Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:

Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

This may be mitigated by inability to perceive the irrelevant factors, which results in a more coarse-grained state representation: if the agent can't see air molecules, all the states with different air molecule positions will look the same, as they do to humans. Some human-relevant factors can also be difficult to perceive, e.g. the presence of poisonous gas in the room, so we may not want to limit the agent's perception ability to human level. Automatically filtering out irrelevant factors does seem difficult, and I think this might imply that it is impossible to design an impact measure that is both useful and truly value-agnostic.

However, the value-agnostic criterion does not seem very important in itself. I think the relevant criterion is that designing impact measures should be easier than the general value learning problem. We already have a non-value-agnostic impact measure that plausibly satisfies this criterion: RLSP learns what is effectively an impact measure (the human theta parameter) using zero human input just by examining the starting state. This could also potentially be achieved by choosing an attainable utility set that rewards a broad enough sample of things humans care about, and leaves the rest to generalization. Choosing a good attainable utility set may not be easy [AF · GW] but it seems unlikely to be as hard as the general value learning problem.

Butterfly effects. Every action is likely to have large effects [AF(p) · GW(p)] that are difficult to predict, e.g. taking a different route to work may result in different people being born. Taken literally, this means that there is no such thing as a low-impact action. Humans get around this by only counting easily predictable effects as impact that they are considered responsible for. If we follow a similar strategy of not penalizing butterfly effects, we might incentivize the agent to deliberately cause butterfly effects. The easiest way around this that I can currently see is restricting the agent's capability to model the effects of its actions, though this has obvious usefulness costs as well.

Chaotic world. Every action, including inaction, is irreversible [AF · GW], and each branch contains different states. While preserving reversibility is impossible in this world, preserving optionality (attainable utility, reachability, etc) seems possible. For example, if the attainable set contains a function that rewards the presence of vases, the action of breaking a vase will make this reward function more difficult to satisfy (even if the states with/without vases are different in every branch). If we solve the problem of designing/learning a good utility set that is not dominated by irrelevant factors, I expect chaotic effects will not be an issue.
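The optionality point can be rendered as a toy attainable-utility-style penalty (my own minimal formalization; the auxiliary rewards and numbers are hypothetical): impact is measured as the change in how well each auxiliary reward in the attainable set can still be achieved relative to inaction, rather than as raw state difference.

```python
# Toy sketch (not code from any cited work): q_values maps each auxiliary
# reward in the attainable set to the value attainable after each action.
def attainable_utility_penalty(q_values, action, baseline="noop"):
    """Sum over the attainable set of the shift in attainable value,
    comparing the action branch against the inaction branch."""
    return sum(abs(q[action] - q[baseline]) for q in q_values.values())

# Hypothetical numbers: the 'break_vase' and 'noop' branches differ in
# every microscopic detail (chaotic world), but only the vase-rewarding
# auxiliary function loses attainable value, so only that shift is counted.
q_values = {
    "vases_present": {"noop": 1.0, "break_vase": 0.0},
    "survival":      {"noop": 1.0, "break_vase": 1.0},
}
penalty = attainable_utility_penalty(q_values, "break_vase")
```

On this framing, the irrelevant-factors concern becomes the question of whether the attainable set can be chosen so that shifts like the vase term dominate shifts in air-molecule-level distinctions.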

If any of the above-mentioned concerns are not overcome, impact measures will fail to distinguish between what humans would consider low-impact and high-impact. Thus, penalizing high-impact actions would come with penalizing low-impact actions as well, which would result in a strong safety-capability tradeoff. I think the most informative direction of research to figure out whether these concerns are a deal-breaker is to scale up impact measures to apply beyond gridworlds, e.g. to Atari games.

comment by TurnTrout · 2019-04-19T15:58:15.780Z · LW(p) · GW(p)

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

Replies from: Vika
comment by Vika · 2019-04-20T13:23:06.578Z · LW(p) · GW(p)
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.

What makes you think that?

This is an intuitive claim based on a general observation of how people attribute responsibility. For example, if I walk into a busy street and get hit by a car, I will be considered responsible for this because it's easy to predict. On the other hand, if I am walking down the street and a brick falls on my head from the nearby building, then I will not be considered responsible, because this event would be hard to predict. There are probably other reasons that humans don't consider themselves responsible for butterfly effects.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-20T15:47:36.275Z · LW(p) · GW(p)

I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules.

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

Responsibility

If you had to pose the problem of impact measurement as a question, what would it be?

Replies from: Vika
comment by Vika · 2019-04-22T17:36:14.246Z · LW(p) · GW(p)
Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

I don't quite understand what you're asking here, could you clarify?

If you had to pose the problem of impact measurement as a question, what would it be?

Something along the lines of: "How can we measure to what extent the agent is changing the world in ways that we care about?". Why?

Replies from: TurnTrout
comment by TurnTrout · 2019-04-22T23:55:35.261Z · LW(p) · GW(p)

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remain confused, and then subconsciously slip back to their old model. Out of everyone I've talked to, I can probably count on my hands the number of people who get this – note that agreeing with specific predictions of mine is different.

Now, it's the author's job to communicate their ideas. When I say "as far as I can tell, few others have internalized how AUP actually works", this doesn't connote "gosh, I can't stand you guys, how could you do this", it's more like "somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?".

my goal with this comment isn't to explain, but rather to figure out what's happening. Let's go through some of my past comments about this.

Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
To scale [LW · GW], relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn't ontology-agnostic.
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power.
Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.
The plausibility of [this] makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
By changing our perspective from "what effects on the world are 'impactful'?" to "how can we stop agents from overfitting their environments?", a natural, satisfying definition of impact falls out.
Towards a New Impact Measure [LW · GW]

When I read this, it seems like I'm really trying to emphasize that I don't think the direct focus should be on the world state in any way. But it was a long post, and I said a lot of things, so I'm not too surprised.

I tried to nip this confusion in the bud.

"The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it."
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history -> world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The “state importance”/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The “state difference weighting” problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important - just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
~ my reply to your initial comment on the AUP post [LW(p) · GW(p)]
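
To make the contrast concrete, here's a minimal tabular sketch of the AUP side of that comparison (my own toy illustration, not the actual implementation; the `q_values` table and its numbers are invented):

```python
NOOP = "noop"

def aup_penalty(q_values, state, action):
    """Summed change in attainable utility (Q-value) between taking
    `action` and doing nothing, over a set of auxiliary utility
    functions. Note what is *absent*: no state representation,
    no distance metric, no ontology, no coverage computation."""
    return sum(abs(q[(state, action)] - q[(state, NOOP)])
               for q in q_values.values())

# toy numbers: two auxiliary utilities, one state, one action
q_values = {
    "u1": {("s0", "a"): 5.0, ("s0", NOOP): 3.0},
    "u2": {("s0", "a"): 1.0, ("s0", NOOP): 2.0},
}
print(aup_penalty(q_values, "s0", "a"))  # |5-3| + |1-2| = 3.0
```

Everything RR additionally needs (world-state distributions, distance metrics, coverage) simply has no analogue in this computation.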

Even more confusing is that when I say "there are fundamental concepts here you're missing", people don't seem to become any less confident in their predictions about what AUP does. If people think that AUP is penalizing effects in the world, why don't they notice their confusion when they read a comment like the one above?

A little earlier,

Thinking in terms of "effects" seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting "effects" that makes sense across representations and environments.
Impact Measure Desiderata [LW · GW]

As a more obscure example, some people with a state interpretation might wonder how come I'm not worried about stuff I mentioned in the whitelisting post anymore since I strangely don't think representation/state similarity metric matters for AUP:

due to entropy, you may not be able to return to the exact same universe configuration.
Worrying about the Vase: Whitelisting [LW(p) · GW(p)]

(this is actually your "chaotic world" concern)

Right now, I'm just chalking this up to "Since the explanations don't make any sense because they're too inferentially distant/it just looks like I built a palace of equations, it probably seems like I'm not on the same page with their concerns, so there's nothing to be curious about." Can you give me some of your perspective? (Others are welcome to chime in.)

To directly answer your question: no, the real-world version of AUP which I proposed doesn't reward based on state, and would not have its behavior influenced solely by different possible arrangements of air molecules. (I guess I'm directly responding to this concern, but I don't see any other way to get information on why this phenomenon is happening.)

As for the question – I was just curious. I think you'll see why I asked when I send you some drafts of the new sequence. :)

Replies from: abramdemski, DanielFilan, DanielFilan, Vika, Vika
comment by abramdemski · 2019-04-23T07:19:12.937Z · LW(p) · GW(p)

I think in a conversation I had with you last year, I kept going back to 'state' despite protests because I kept thinking "if AUP works, surely it would be because some of the utility functions calculate a sensible state estimate in a humanlike ontology and then define utility from this". It isn't necessarily the right way to critique AUP, but I think I was right to think those thoughts conditional on that assumption -- ie, even if it isn't the argument you're trying to make for AUP, it seems like a not-unreasonable position to consider, and so thinking about how AUP does in terms of state can be a reasonable and important part of a thought-process assessing AUP. I believe I stopped making the assumption outright at some point, but kept bringing out the assumption as a tool for analysis -- for example, supporting a thought experiment with the argument that there would at least be some utility functions which thought about the external world enough to care about such-and-such. I think in our conversation I managed to appropriately flag these sorts of assumptions such that you were OK with the role it was playing in the wider argument (well... not in the sense of necessarily accepting the arguments, but in the sense of not thinking I was just repeatedly making the mistake of thinking it has to be about state, I think).

Other people could be thinking along similar lines without flagging it so clearly.

comment by DanielFilan · 2019-04-23T02:04:37.688Z · LW(p) · GW(p)

I also might've expected some people to wonder, given their state interpretation, how come I'm not worried about stuff I mentioned in the whitelisting post anymore

I don't read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses (not just in long posts by you, even while proofreading the introduction section of a friend's paper draft!) that I don't notice until quizzed in detail -- I suspect this is partially due to me applying lossy compression that preserves my first guess about the gist of a paragraph, and maybe partially due to literal saccades while reading. The solution is repetition and redundancy: for example, I assume that you tried to do that in your quotes after the phrase "Let's go through some of my past comments about this", but only the quote

[R]elative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric

implies to me that we're moving away from a state-based way of thinking, and it doesn't directly say anything about AUP.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-23T02:40:08.713Z · LW(p) · GW(p)

I don't read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses

Yes, this is fine and understandable. I wasn’t meaning to imply that responsible people should have thought of all these things, but rather pointing to different examples. I’ll edit my phrasing there.

but only the quote

I had a feeling that there was some illusion of transparency (which is why I said “when I read it”), but I had no idea it was that strong. Good data point, thanks.

comment by DanielFilan · 2019-04-23T02:01:21.010Z · LW(p) · GW(p)

This isn't a full response, but it seems to me that Vika is largely talking about problems she perceives with impact measures in general, as defined by "measures of how much impact things have on the world", and is thinking of AUP as an element of this class (as would I, had I not read this comment). Reasons to think this include:

  • A perception of your research as primarily being the development of AUP, and of this post as being research for that development and exposition.
  • The introduction of AUP being in a post titled "Towards a New Impact Measure" [LW · GW].

If AUP is not in fact about restricting an agent's impact on the world (or, in other words, on the state of the world), then I would describe it as something other than an "impact measure", since that term is primarily used by people using the way of thinking you denounce (and I believe was invented that way: it seems to have morphed from 'side effects', which strongly suggests effects on parts of the world, according to my quick looking-over of the relevant section of Concrete Problems in AI Safety). Perhaps "optimisation regularisation technique" would be better, although I don't presume to understand your way of thinking about it.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-23T02:37:30.219Z · LW(p) · GW(p)

If AUP is not in fact about restricting an agent's impact on the world (or, in other words, on the state of the world)

So the end result is this, but it doesn’t do it by considering impact to be a thing that happens to the state primarily, but rather to agents; impact not in the sense of “how different is the state”, but “how big of a deal is this to me?”. The objective is to limit the agent’s impact on us, which I think is the more important thing. I think this still falls under normal colloquial use of ‘impact’, but I agree that this is different from the approaches so far. I’m going to talk about this distinction quite a bit in the future.

comment by Vika · 2019-05-02T17:01:46.746Z · LW(p) · GW(p)

Thanks for the detailed explanation - I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn't understand the state representation invariance claim in the AUP proposal, though I didn't realize that it is as central to the proposal as you describe here.

I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I'm not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case.

I agree with Daniel's comment that if AUP is not penalizing effects on the world, then it is confusing to call it an 'impact measure', and something like 'optimization regularization' would be better.

Since I still have lingering confusions after your latest explanation, I would really appreciate if someone else who understands this could explain it to me.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2019-05-04T17:31:53.185Z · LW(p) · GW(p)
I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all.

It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that's a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.

I think the claim is more that while the penalty does depend on the state, it's not central to think about the state to understand the major effects of AUP. (As an analogy, if you want to predict whether I'm about to leave my house, it's useful to see whether or not I'm wearing shoes, but if you want to understand why I am or am not about to leave my house, whether I'm wearing shoes is not that relevant -- you'd want to know what my current subgoal or plan is.)

Similarly, with AUP, the claim is that while you can predict what the penalty is going to be by looking at particular states and actions, and the penalty certainly does change with different states/actions, the overall effect of AUP can be stated without reference to states and actions. Roughly speaking, this is that it prevents agents from achieving convergent instrumental subgoals like acquiring resources (because that would increase attainable utility across a variety of utility functions -- this is what is meant by "power"), and it also prevents agents from changing the world irreversibly (because that would make a variety of utility functions much harder to attain).

This is somewhat analogous to the concept of empowerment in ML -- while empowerment is defined in terms of states and actions, the hope is that it corresponds to an agent's ability to influence its environment, regardless of the particular form of state or action representation.
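
As a toy numerical illustration of that last point (my own sketch; the auxiliary utilities and numbers are invented): an action that slightly raises attainable utility for *every* auxiliary utility function (a "power" gain) accumulates a much larger penalty than an action that only one utility function cares about.

```python
def aup_penalty(q, state, action, noop="noop"):
    # summed change in attainable utility across auxiliary utilities
    return sum(abs(qu[(state, action)] - qu[(state, noop)])
               for qu in q.values())

# 100 auxiliary utilities: "seize_resources" helps all of them a
# little, while "paint_wall" only matters to one of them.
q = {}
for i in range(100):
    q[f"u{i}"] = {
        ("s", "seize_resources"): 1.5,               # power helps every goal
        ("s", "paint_wall"): 1.5 if i == 0 else 1.0,
        ("s", "noop"): 1.0,
    }

print(aup_penalty(q, "s", "seize_resources"))  # 100 * 0.5 = 50.0
print(aup_penalty(q, "s", "paint_wall"))       # 0.5
```

The penalty does depend on the state (the Q-values are state-dependent), but the dominant effect comes from how many goals an action helps or hurts, not from which particular states are involved.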

Replies from: Vika
comment by Vika · 2019-05-11T03:50:41.229Z · LW(p) · GW(p)

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but intends to capture some general measure of the agent's influence over the environment that does not depend on the state representation.

Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) -> observations (e.g. pixels) -> state representation / coarse-graining (which defines macrostates as equivalence classes over observations) -> featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.

Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representation is different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance was defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.

An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.

An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
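
To make the invariance claim concrete, here is a toy sketch (nothing from any of the posts; the states and reward values are invented):

```python
# macrostates of the coarsest reward-compatible representation
coarse_reward = {"vase": 1.0, "empty": 0.0}

# a refinement: each coarse macrostate splits into finer ones
refinement = {"vase": ["blue_vase", "green_vase"], "empty": ["empty_room"]}

# lifting the coarse reward to the finer representation: every
# refined state inherits its parent macrostate's value
fine_states_reward = {f: coarse_reward[c]
                      for c, fs in refinement.items() for f in fs}

# invariance: states the coarse representation merged stay equal
print(fine_states_reward["blue_vase"] == fine_states_reward["green_vase"])
# -> True

# whereas a reward defined directly over the finer representation may
# distinguish them, breaking representation-invariance:
fine_native_reward = {"blue_vase": 1.0, "green_vase": 0.5, "empty_room": 0.0}
print(fine_native_reward["blue_vase"] == fine_native_reward["green_vase"])
# -> False
```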

comment by Vika · 2019-05-03T13:44:31.337Z · LW(p) · GW(p)

There are various parts of your explanation that I find vague and could use a clarification on:

  • "AUP is not about state" - what does it mean for a method to be "about state"? Same goes for "the direct focus should not be on the state" - what does "direct focus" mean here?
  • "Overfitting the environment" - I know what it means to overfit a training set, but I don't know what it means to overfit an environment.
  • "The long arms of opportunity cost and instrumental convergence" - what do "long arms" mean?
  • "Wirehead a utility function" - is this the same as optimizing a utility function?
  • "Cut out the middleman" - what are you referring to here?

I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.

I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.

Replies from: TheMajor, TurnTrout, TurnTrout
comment by TheMajor · 2019-05-03T16:54:19.675Z · LW(p) · GW(p)

I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can't speak for TurnTrout, and there's a decent chance that I'm confused about some of the things here. But here is how I think about AUP and the points raised in this chain:

  • "AUP is not about the state" - I'm going to take a step back, and pretend we have an agent working with AUP reasoning. We've specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued existence, whatever fits in the mathematical framework). Next we have an action A available, and would like to compute the impact of that action. To do this our agent would compare how well it would be able to optimize each of those arcane utility functions in the world where A was taken, versus how well it would be able to optimize these utility functions in the world where the rest action was taken instead. This is "not about state" in the sense that the impact is determined by the change in the ability for the agent to optimize these arcane utilities, not by the change in the world state. In the particular case where the utility function is specified all the way down to sensory inputs (as opposed to elements of the world around us, which have to be interpreted by the agent first) this doesn't explicitly refer to the world around us at all (although of course implicitly the actions and sensory inputs of the agent are part of the world)! The thing being measured is the change in ability to optimize future observations, where what is a 'good' observation is defined by our arcane set of utility functions.
  • "overfitting the environment" - I'm not too sure about this one, but I'll have a crack at it. I think this should be interpreted as follows: if we give a powerful agent a utility function that doesn't agree perfectly with human happiness, then the wrong thing is being optimized. The agent will shape the world around us to what is best according to the utility function, and this is bad. It would be a lot better (but still less than perfect) if we had some way of forcing this agent to obey general rules of simplicity. The idea here is that our bad proxy utility function is at least somewhat correlated with actual human happiness under everyday circumstances, so as long as we don't suddenly introduce a massively powerful agent optimizing something weird (oops) to massively change our lives we should be fine. So if we can give our agent a limited 'budget' - in the case of fitting a curve to a dataset this would be akin to the number of free parameters - then at least things won't go horribly wrong, plus we expect these simpler actions to have less unintended side-effects outside the domain we're interested in. I think this is what is meant, although I don't really like the terminology "overfitting the environment".
  • "The long arms of opportunity cost and instrumental convergence" - this point is actually very interesting. In the first bullet point I tried to explain a little bit about how AUP doesn't directly depend on the world state (it depends on the agent's observations, but without an ontology that doesn't really tell you much about the world), instead all its gears are part of the agent itself. This is really weird. But it also lets us sidestep the issue of human value learning - if you don't directly involve the world in your impact measure, you don't need to understand the world for it to work. The real question is this one: "how could this impact measure possibly resemble anything like 'impact' as it is intuitively understood, when it doesn't involve the world around us?" The answer: "The long arms of opportunity cost and instrumental convergence". Keep in mind we're defining impact as change in the ability to optimize future observations. So the point is as follows: you can pick any absurd utility function you want, and any absurd possible action, and odds are this is going to result in some amount of attainable utility change compared to taking the null action. In particular, precisely those actions that massively change your ability to make big changes to the real world will have a big impact even on arbitrary utility functions! This sentence is so key I'm just going to repeat it with more emphasis: the actions that massively change your ability to make big changes in the world - i.e. massive decreases of power (like shutting down) but also massive increases in power - have big opportunity costs/benefits compared to the null action for a very wide range of utility functions. So these get assigned very high impact, even if the utility function set we use is utter hokuspokus! Now this is precisely instrumental convergence, i.e. 
the claim that for many different utility functions the first steps of optimizing them involve "make sure you have sufficient power to enforce your actions to optimize your utility function". So this gives us some hope that TurnTrout's impact measure will correspond to intuitive measures of impact even if the utility functions involved in the definition are not at all like human values (or even like a sensible category in the real world at all)!
  • "Wirehead a utility function" - this is the same as optimizing a utility function, although there is an important point to be made here. Since our agent doesn't have a world-model (or at least, shouldn't need one for a minimal working example), it is plausible the agent can optimize a utility function by hijacking its own input stream, or something of the sorts. This means that its attainable utility is at least partially determined by the agent's ability to 'wirehead' to a situation where taking the rest action for all future timesteps will produce a sequence of observations that maximizes this specific utility function, which if I'm not mistaken is pretty much spot on the classical definition of wireheading.
  • "Cut out the middleman" - this is similar to the first bullet point. By defining the impact of an action as our change in the ability to optimize future observations, we don't need to make reference to world-states at all. This means that questions like "how different are two given world-states?" or "how much do we care about the difference between two world-states?" or even "can we (almost) undo our previous action, or did we lose something valuable along the way?" are orthogonal to the construction of this impact measure. It is only when we add in an ontology and start interpreting the agent's observations as world-states that these questions come back. In this sense this impact measure is completely different from RR: I started to write exactly how this was the case, but I think TurnTrout's explanation is better than anything I can cook up. So just ctrl+F "I tried to nip this confusion in the bud." and read down a bit.
Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2019-05-04T17:11:43.029Z · LW(p) · GW(p)

^ This is also how I interpret all of those statements. (Though I don't agree with all of them.)

I also dislike the "overfitting the environment" phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in "normal" cases and not elsewhere.

Replies from: TurnTrout
comment by TurnTrout · 2019-05-04T19:09:35.686Z · LW(p) · GW(p)

which do you disagree with?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2019-05-04T23:00:49.896Z · LW(p) · GW(p)

I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don't think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.

Ignoring my dislike of the phrase, I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I'd guess that your-vision-of AUP wildly overcompensates and causes you to seriously "underfit the environment", or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer "underfits the environment" (alternatively, "allows for interesting plans"), then I expect it allows catastrophic plans.

I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn't say I currently disagree with it.

Replies from: TurnTrout
comment by TurnTrout · 2019-05-05T00:29:57.803Z · LW(p) · GW(p)
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.

(I'm going to take a shot at this now because it's meta, and I think there's a compact explanation I can provide that hopefully makes sense.)

Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say "it's penalizing opportunity cost or instrumental convergence" post hoc because that's why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.

In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can't actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.

Here's an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive.

The theory of relative state reachability says choice A is maximally impactful. Why? You can't reach anything like the states you could under inaction. How does this decision track with opportunity cost?

Attainable utility says choice B is the bigger deal. You couldn't do anything with that part of the universe anyways, so it doesn't change much. This is the correct answer.

This scenario is important because it isn't just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It's an illustration of where state reachability diverges from these notions.

A natural reply is: what about things that AUP penalizes that we don't find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)

However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.

ETA: Here's a physically realistic alternative scenario. Again, we're thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.

Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.

Choice A leads to a portion of the wall being painted yellow. You don’t know of any way to remove the paint before the reachability is discounted away. If you don’t take this choice now, you can’t later. Choice B, which is always available, ravages the environment around you.

Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.

It's also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel which are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don't change much. So it doesn't require our values to do the right thing here, either.

The main point is that the reason it's doing the right thing is based on opportunity cost, while relative reachability's incorrect judgment is not.

I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).

It isn't the same, but the way you and TheMajor interpreted the phrase is totally reasonable, considering what I wrote.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2019-05-09T15:01:58.127Z · LW(p) · GW(p)

We talked a bit off-forum, which helped clarify things for me.

Firstly, there's a difference between attainable utility theory (AU theory), and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory "impact" is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don't know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions.

Many of the claims are about AU theory and not about AUP. There isn't really an analogous "RR theory".

Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for "regular" irreversible actions like painting a wall AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than the entire state. Consider the action of moving in a direction: if the utility functions don't care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty.

That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.

comment by TurnTrout · 2019-05-03T15:46:21.339Z · LW(p) · GW(p)

These are good questions.

As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.

comment by TurnTrout · 2019-05-10T00:57:41.537Z · LW(p) · GW(p)
"AUP is not about state" - what does it mean for a method to be "about state"?

Here's a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn't "about" the state of the solar system.

answer by Ofer (ofer) · 2019-04-11T07:19:05.500Z · LW(p) · GW(p)

Here's a relevant passage by Rohin (from Alignment Newsletter #49 [LW · GW], March 2019):

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.
comment by Rohin Shah (rohinmshah) · 2019-04-11T17:50:27.674Z · LW(p) · GW(p)

Other relevant writing of mine:

Comment on the AUP [AF(p) · GW(p)] post [AF(p) · GW(p)]

Comment on the [LW(p) · GW(p)] desiderata [LW(p) · GW(p)] post [LW(p) · GW(p)]

But it's true that that quoted passage is the best summary of my current position. Daniel's answer is a good example of an underlying intuition that drives this position.

answer by avturchin · 2019-04-10T18:09:15.497Z · LW(p) · GW(p)

Something which is low impact at moment T0 could have a large impact at moment T1. For example, if the AI doesn't break a vase at moment T0, that is a low impact event; but if a human being later crashes his head on this vase at the next moment T1, then not breaking the vase at T0 becomes a very large impact event for the person's wellbeing.

Moreover, if at moment T2 it turns out that the person who crashed his head at T1 was the next Hitler, then again preserving the vase at T0 becomes a low impact event, and so on. In other words, to calculate the real impact of an event we would have to calculate its consequences until the end of eternity, which is impossible.

Also, long-term impact can't be calculated without knowing actual human values, so lowering impact is the same task as AI alignment; a low impact task can't replace the AI alignment task or be solved separately.

comment by TurnTrout · 2019-04-10T18:21:38.815Z · LW(p) · GW(p)
Moreover, if at moment T2 it turns out that the person who crashed his head at T1 was the next Hitler, then again preserving the vase at T0 becomes a low impact event

why does it become a low impact event in your eyes?

In other words, to calculate the real impact of an event we would have to calculate its consequences until the end of eternity, which is impossible.

What is your model of the thing that people are doing in their heads when they say things are impactful or not?

long-term impact can't be calculated without knowing actual human values, so lowering impact is the same task as AI alignment; a low impact task can't replace the AI alignment task or be solved separately.

Can you elaborate on this, and what your definition of impact is? I think we might be thinking of different things, and I'm interested as to what yours is.

Replies from: avturchin
comment by avturchin · 2019-04-10T20:15:23.456Z · LW(p) · GW(p)

1. In my eyes, it is a low impact event, as its three stages unfold: the vase is preserved, the new Hitler is killed by the vase, and human history doesn't change in a much worse direction.

An alternative is: the vase is broken, the new Hitler survives, and human history changes in a much worse direction.

2. I think that people think of low impact as actions which have very small consequences outside given space-time (or other) limits. For example, a robot moves from A to B without breaking the vase.

3. From my point of view, impact is the long-term effect on human history, measured according to our values. Minimising near-term impact doesn't necessarily mean minimising long-term impact, because of the "butterfly effect"; this was the main idea of my comment.

9 comments

Comments sorted by top scores.

comment by Eli Tyre (elityre) · 2019-04-15T16:27:49.641Z · LW(p) · GW(p)

On the process level: I would offer a bit of unsolicited advice about the method you used to generate reasons for pessimism. You (and others) might try it in the future.

First of all, I strongly applaud the step of taking out a physical clock/timer and making a solid attempt at answering the question for yourself. Virtue points (and karma) awarded!

However, when I read your list, it's blatantly one-sided. You're only trying to generate reasons for pessimism, not reasons for optimism. This is not as bad as writing the bottom line, but generating arguments for only one side of a question biases your search.

Given this, one thing that I might do is first, spend 5 minutes generating the best arguments for (or concrete scenarios which inspire) pessimism about impact measures, then shift my mental orientation and spend 5 minutes generating arguments for (or concrete scenarios in which) impact measures seem promising.

But I wouldn't stop there. I would then spend 5 minutes (or as long as I need) looking over the first list and trying to generate counterarguments: reasons why the world probably isn't that way. Once I had done that, I would look over my new list of counterarguments, and try to generate counter-counterarguments, iterating until I either get stuck, or reach a sort of equilibrium where the arguments I've made are as strong as I can see how to make.

Then I would go back to my second original list (the one with reasons for optimism) and do the same back and forth, generating counterarguments and counter-counterarguments, until I get stuck or reach equilibrium on that side.

At that point, I should have two lists of the strongest reasons I can muster, arguments in favor of pessimism and arguments in favor of optimism, both of which have been stress-tested by my own skepticism. I'd then compare both lists, and if any of the arguments invalidates or weakens another, I'd adjust them accordingly (there might be a few more rounds of back and forth).

At this point, I've really thoroughly considered the issue. Obviously this doesn't mean that I've gotten the right answer, or that I've thought of everything. But it does mean that for all practical purposes, I've exhausted the low-hanging fruit of everything I can think of.

To recap...

Steps:

0. Take a binary question.
1. Make the best case you can for one answer, giving whatever arguments, or ways the world would have to be, that support that outcome.
2. Similarly, make the best case you can for the other answer.
3. Take the reasoning for your first answer and generate counterarguments. Generate responses to those counterarguments. Iterate until you reach equilibrium.
4. Do the same with the reasoning for your second answer.
5. Compare your final arguments on both sides of the question, adjusting as necessary.

(This procedure is inspired by a technique that I originally learned from Leverage Research / Paradigm Academy. By their terminology, this procedure is called (the weak form of) Pyrrhonian skepticism, after the Greek philosopher Pyrrho (who insisted that knowledge was impossible, because there were always arguments on both sides of a question). I've also heard it referred to, more generally, as "alternate stories".)

Of course, this takes more time to do, and that time cost may or may not be worth it to you. Furthermore, there are certainly pieces of your context or thinking process that I'm missing. Maybe you, in fact, did part of this process. But this is an extended method to consider.

Replies from: elityre, abramdemski
comment by Eli Tyre (elityre) · 2019-12-13T00:34:26.889Z · LW(p) · GW(p)

I no longer fully endorse this comment, though I recommend this procedure to anyone who think it sounds interesting.

Replies from: TurnTrout
comment by TurnTrout · 2019-12-13T01:29:35.705Z · LW(p) · GW(p)

Is there a short explanation why you no longer fully endorse?

Replies from: elityre
comment by Eli Tyre (elityre) · 2019-12-13T18:21:22.154Z · LW(p) · GW(p)

Not really. Just that when I look at the text I wrote, now, it seems a little hacky / not quite expressing the true spirit of the mental motions that seem useful to me.

It might still be a good procedure for bootstrapping into the right mental motions, though? I haven't done any testing on this one, so I don't know.

comment by abramdemski · 2019-04-23T19:51:39.127Z · LW(p) · GW(p)
this procedure is called (the weak form of) Pyrrhonian skepticism

What's the strong form?

Replies from: elityre
comment by Eli Tyre (elityre) · 2019-04-25T14:53:02.326Z · LW(p) · GW(p)

I believe the strong form is generating a counterargument for any proposition, and then concluding that all propositions are equally likely and therefore that knowledge is impossible.

From wikipedia:

The main principle of Pyrrho's thought is expressed by the word acatalepsia, which connotes the ability to withhold assent from doctrines regarding the truth of things in their own nature; against every statement its contradiction may be advanced with equal justification.

I don't recommend the strong form.

Replies from: abramdemski
comment by abramdemski · 2019-04-25T17:08:22.140Z · LW(p) · GW(p)

Ah, ok. I note that it may have been intended more as a meditative practice, since the goal appears to have been reaching a state of bliss, the epistemic practice being a means to that end. Practicing doubting everything could be an interesting meditation (though it could perhaps be dangerous).

comment by Wei Dai (Wei_Dai) · 2019-04-10T20:03:40.430Z · LW(p) · GW(p)

I’m interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I’m not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

I would prefer that you and/or others did push back, as I'm really curious which of the causes/reasons for pessimism actually stand up under such pushback. (See Four Layers of Intellectual Conversation and AI Safety via Debate [LW · GW].) I do appreciate that you prioritize just knowing what the causes/reasons are in the first place and don't want to discourage people from sharing them, so I wonder if there's a way to get both of what we want.

Replies from: TurnTrout
comment by TurnTrout · 2019-04-10T23:21:16.945Z · LW(p) · GW(p)

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.