Best reasons for pessimism about impact of impact measures?

post by TurnTrout · 2019-04-10T17:22:12.832Z · score: 73 (15 votes) · LW · GW · 28 comments

This is a question post.

Contents

  Answers
    30 Vaniver
    19 Wei_Dai
    17 DanielFilan
    14 Vika
    9 ofer
    5 avturchin
3 comments

Habryka recently wrote [LW · GW] (emphasis mine):

My inside views on AI Alignment make me think that work on impact measures is very unlikely to result in much concrete progress on what I perceive to be core AI Alignment problems, and I have talked to a variety of other researchers in the field who share that assessment. I think it’s important that this grant not be viewed as an endorsement of the concrete research direction that Alex is pursuing, but only as an endorsement of the higher-level process that he has been using while doing that research.

As such, I think it was a necessary component of this grant that I have talked to other people in AI Alignment whose judgment I trust, who do seem excited about Alex’s work on impact measures. I think I would not have recommended this grant, or at least this large of a grant amount, without their endorsement. I think in that case I would have been worried about a risk of diverting attention from what I think are more promising approaches to AI Alignment, and a potential dilution of the field by introducing a set of (to me) somewhat dubious philosophical assumptions.

I'm interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I'm not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

There are two reasons you could believe that "work on impact measures is very unlikely to result in much concrete progress on… core AI Alignment problems". First, you might think that the impact measurement problem is intractable, so work is unlikely to make progress. Second, you might think that even a full solution wouldn't be very useful.

Over the course of 5 minutes by the clock, here are the reasons I generated for pessimism (each of which I either presently agree with, or at least find it reasonable for an intelligent critic to raise on the basis of currently-public reasoning):


Paul raised concerns along these lines [LW · GW]:

We'd like to build AI systems that help us resolve the tricky situation that we're in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default "terrible" scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it's not going to be because we wanted the AI to have a small impact and it had a large impact. It's probably going to be because we have a very limited idea what is going on, but we don't feel like we have the breathing room to step back and chill out (at least not for long) because we don't believe that everyone else is going to give us time.

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we'll fail at alignment with a bang---it's more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn't let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like "low objective impact" is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that's a coherent situation to think about and plan for, but we shouldn't mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it's the best hope if you were very pessimistic about what I consider "mainline" alignment.)

Answers

answer by Vaniver · 2019-04-11T19:09:00.203Z · score: 30 (9 votes) · LW · GW

When I think about solutions to AI alignment, I often think about 'meaningful reductionism.' That is, if I can factor a problem into two parts, and the parts don't actually rely on each other, now I have two smaller problems to solve. But if the parts are reliant on each other, I haven't really simplified anything yet.

While impact measures feel promising to me as a cognitive strategy (often my internal representation of politeness feels like 'minimizing negative impact', like walking on sidewalks in a way that doesn't startle birds), they don't feel promising to me as reductionism. That is, if I already had a solution to the alignment problem, then impact measures would likely be part of how I implement that solution, but solving it separately from alignment doesn't feel like it gets me any closer to solving alignment.

[The argument here I like most rests on the difference between costs and side effects; we don't want to minimize side effects because that leads to minimizing good side effects also, and it's hard to specify the difference between 'side effects' and 'causally downstream effects,' and so on. But if we just tell the AI "score highly on a goal measure while scoring low on this cost measure," this only works if we specified the goal and the cost correctly.]
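
Sketched as a formula, with $G$ the specified goal measure, $C$ the specified cost measure, and $\lambda$ a trade-off weight (any of which might be misspecified), that proposal looks roughly like:

$$\pi^* \in \operatorname*{arg\,max}_\pi \; \mathbb{E}_\pi[G] - \lambda \, \mathbb{E}_\pi[C]$$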

But there's a different approach to AI alignment, which is something more like 'correct formalisms.' We talk sometimes about handing a utility function to the robot, or (in old science fiction) providing it with rules to follow, and so on, and by seeing what it actually looks like when we follow that formalism we can figure out how well that formalism fits what we're interested in. Utility functions on sensory inputs don't seem alignable because of various defects (like wireheading), and so it seems like the right formalism needs to have some other features (it might still be a utility function, but it needs to be a utility function over mental representations of external reality, such that the mental representation tracks external reality even when you have freedom to alter your mental representation, in a way that we can't yet turn into code).

So when I ask myself questions like "why am I optimistic about researching impact measures now?" I get answers like "because exploring the possibility space will make clear exactly how the issues link up." For example, looking at things like relative reachability made it clear to me how value-laden the ontology needs to be in order for a statistical measure on states to be meaningful. This provides a different form-factor for 'transferring values to the AI'; instead of trying to ask something like "is scenario A or B better?" and train a utility function, I might instead try to ask something like "how different are scenarios A and B?" or "how are scenarios A and B different?" and train an ontology, with the hopes that this makes other alignment problems easier because the types line up somewhat more closely.

[I think even that last example still performs poorly on the 'meaningful reductionism' angle, since getting more options for types to use in value loading doesn't seem like it addresses the core obstacles of value loading, but provides some evidence of how it could be useful or clarify thinking.]

comment by TurnTrout · 2019-04-17T22:47:27.182Z · score: 2 (1 votes) · LW · GW
this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

comment by Vaniver · 2019-04-20T04:26:08.560Z · score: 3 (1 votes) · LW · GW

Suppose the goal dramatically overvalues some option; then the AI would be willing to pay large (correctly estimated) costs in order to achieve "even larger" (incorrectly estimated) gains.
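
As a toy sketch of that failure mode (made-up numbers, and a simple "goal minus λ·cost" decision rule rather than any particular impact measure):

```python
# A toy illustration (hypothetical numbers): a correctly specified impact cost
# cannot compensate for a goal measure that wildly overvalues some plan.
plans = {
    # plan: (estimated_gain, true_gain, impact_cost); the cost column is "correct"
    "modest plan":     (10,   10,   1),
    "overvalued plan": (1000,  1, 100),  # the goal measure overestimates its value
}

lam = 1.0  # trade-off weight on the impact penalty
chosen = max(plans, key=lambda p: plans[p][0] - lam * plans[p][2])
print(chosen)  # -> overvalued plan: the inflated gain swamps the correct penalty
```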

answer by Wei_Dai · 2019-04-10T20:04:27.556Z · score: 19 (8 votes) · LW · GW

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture. It seems like AUP tries to get around this by aiming at a lower bar than "avoid negative side effects", namely "avoid catastrophic side effects", and aside from whether it actually succeeds at clearing this lower bar, it would mean that an AI that is only "safe" because of AUP can't be safely used for ordinary goals (e.g., invent a better widget, or make someone personally more successful in life); instead we have to somehow restrict it to being used just for goals that relate to x-risk reduction, where it's worthwhile to risk incurring less-than-catastrophic negative side effects.

As a side note, it seems generally the case that some approaches to AI safety/alignment aim at the higher bar of "safe for general use" and others aim at "safe enough to use for x-risk reduction", and this isn't always made clear, which can be a source of confusion for both AI safety/alignment researchers and others such as strategists and policy makers.

comment by Matthew Barnett (matthew-barnett) · 2019-04-11T02:34:30.051Z · score: 6 (4 votes) · LW · GW
I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world

Do you think there are any experiments that could be performed that would change your view on this point? Or is an impact measure the type of thing that does not generalize well from a testing environment to the real world?

answer by DanielFilan · 2019-04-10T20:45:16.851Z · score: 17 (8 votes) · LW · GW

My concern is similar to Wei Dai's: it seems to me that at a fundamental physical level, any plan involving turning on a computer that does important stuff will make pretty big changes to the world's trajectory in phase space. Heat dissipation will cause atmospheric particles to change their location and momentum, future weather patterns will be different, people will do things at different times (e.g. because they're waiting for a computer program to run, or because the computer is designed to change the flow of traffic through a city), meet different people, and have different children. As a result, it seems hard for me to understand how impact measures could work in the real world without a choice of representation very close to the representation humans use to determine the value of different worlds. I suspect that this will need input from humans similar to what value learning approaches might need, and that once it's done one could just do value learning and dispense with the need for impact measures. That being said, this is more of an impression than a belief - I can't quite convince myself that no good method of impact regularisation exists, and some other competent people seem to disagree with me.

comment by rohinmshah · 2019-04-11T17:48:43.068Z · score: 2 (1 votes) · LW · GW
I can't quite convince myself that no good method of value learning exists, and some other competent people seem to disagree with me.

No good method of measuring impact, presumably?

comment by DanielFilan · 2019-04-11T18:10:16.742Z · score: 2 (1 votes) · LW · GW

I prefer the phrase 'impact regularisation', but indeed that was a slip of the mind.

comment by TurnTrout · 2019-04-10T22:51:39.468Z · score: 2 (1 votes) · LW · GW

How does this concern interact with the effective representation invariance claim I made when introducing AUP?

comment by Vika · 2019-04-11T15:16:37.410Z · score: 9 (5 votes) · LW · GW

I don't see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don't capture butterfly effects (e.g. person X can distinguish between the world state where person Y exists and the world state where person Z exists, but can't predict that choosing a different route to work will cause person Z to exist instead of person Y).

I agree with Daniel that this is a major problem with impact measures. I think that to get around this problem we would either need to figure out how to distinguish butterfly effects from other effects (and then include all the butterfly effects in the inaction branch) or use a weak world model that does not capture butterfly effects (similarly to humans) for measuring impact. Even if we know how to do this, it's not entirely clear whether we should avoid penalizing butterfly effects. Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.

comment by DanielFilan · 2019-04-11T18:20:34.307Z · score: 9 (4 votes) · LW · GW

Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.

Indeed - a point I think is illustrated by the Chaotic Hurricanes test case. I'm probably most excited about methods that would use transparency techniques to determine when a system is deliberately optimising for a part of the world (e.g. the members of the long-term future population) that we don't want it to care about, but this has a major drawback of perhaps requiring multiple philosophical advances into the meaning of reference in cognition and a greater understanding of what optimisation is.

comment by TurnTrout · 2019-04-11T23:14:17.202Z · score: 4 (2 votes) · LW · GW

What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.
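
(For concreteness, a minimal sketch of the kind of AUP-style penalty this question is about, with only the survival utility in the attainable set; the environment, the Q-value estimates, and the numbers below are purely illustrative.)

```python
def aup_penalty(q_survival, state, action, noop="no-op"):
    """Sketch of an AUP-style penalty with a single auxiliary (survival) utility.

    q_survival[(state, action)] is an illustrative estimate of how much survival
    utility the agent can still attain after taking `action` in `state`; the
    penalty is the shift in attainable utility relative to doing nothing.
    """
    return abs(q_survival[(state, action)] - q_survival[(state, noop)])

# Hypothetical numbers: an action that entrenches the agent's own survival
# (e.g. disabling an off-switch) changes attainable survival utility relative
# to inaction, so it is penalized even though survival is not itself a goal.
q_survival = {("start", "no-op"): 0.5, ("start", "disable off-switch"): 1.0}
print(aup_penalty(q_survival, "start", "disable off-switch"))  # 0.5
```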

comment by TurnTrout · 2019-04-11T23:06:08.113Z · score: 9 (3 votes) · LW · GW
I don't see how representation invariance addresses this concern.

I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.

comment by DanielFilan · 2019-04-10T23:07:26.475Z · score: 4 (2 votes) · LW · GW

I think that under the worldview of this concern, the distribution of reward functions effectively defines a representation that, if too different from the one humans care about, will either mean that no realistic plan counts as low-impact in the real world, or be ineffective at penalising unwanted negative impacts.

comment by TurnTrout · 2019-04-10T23:23:56.111Z · score: 2 (1 votes) · LW · GW

Is there a central example you have in mind for this potential failure mode?

comment by DanielFilan · 2019-04-11T18:10:38.046Z · score: 2 (1 votes) · LW · GW

No.

answer by Vika · 2019-04-19T12:51:08.720Z · score: 14 (4 votes) · LW · GW

Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:

Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

This may be mitigated by inability to perceive the irrelevant factors, which results in a more coarse-grained state representation: if the agent can't see air molecules, all the states with different air molecule positions will look the same, as they do to humans. Some human-relevant factors can also be difficult to perceive, e.g. the presence of poisonous gas in the room, so we may not want to limit the agent's perception ability to human level. Automatically filtering out irrelevant factors does seem difficult, and I think this might imply that it is impossible to design an impact measure that is both useful and truly value-agnostic.
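
As a toy illustration of that worry (entirely made-up features and a random linear utility, not any particular proposed measure): a random utility over a fine-grained state is dominated by the many degrees of freedom humans don't care about, and coarse-graining the representation is what removes that variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state: 3 features a human might care about plus 997 "air molecule" features.
n_macro, n_micro = 3, 997
w = rng.normal(size=n_macro + n_micro)  # a random (linear) utility function

def random_utility(macro, micro):
    return w @ np.concatenate([macro, micro])

macro = np.array([1.0, 0.0, 1.0])  # e.g. "vase intact", "door open", ...
micro_a = rng.normal(size=n_micro)
micro_b = rng.normal(size=n_micro)

# Two states that look identical to a human get very different random utilities:
print(random_utility(macro, micro_a) - random_utility(macro, micro_b))

# If the utility can only see the coarse (macro) features, the difference vanishes:
print(w[:n_macro] @ macro - w[:n_macro] @ macro)  # 0.0
```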

However, the value-agnostic criterion does not seem very important in itself. I think the relevant criterion is that designing impact measures should be easier than the general value learning problem. We already have a non-value-agnostic impact measure that plausibly satisfies this criterion: RLSP learns what is effectively an impact measure (the human theta parameter) using zero human input just by examining the starting state. This could also potentially be achieved by choosing an attainable utility set that rewards a broad enough sample of things humans care about, and leaves the rest to generalization. Choosing a good attainable utility set may not be easy but it seems unlikely to be as hard as the general value learning problem.

Butterfly effects. Every action is likely to have large effects that are difficult to predict, e.g. taking a different route to work may result in different people being born. Taken literally, this means that there is no such thing as a low-impact action. Humans get around this by only counting easily predictable effects as impact that they are considered responsible for. If we follow a similar strategy of not penalizing butterfly effects, we might incentivize the agent to deliberately cause butterfly effects. The easiest way around this that I can currently see is restricting the agent's capability to model the effects of its actions, though this has obvious usefulness costs as well.

Chaotic world. Every action, including inaction, is irreversible, and each branch contains different states. While preserving reversibility is impossible in this world, preserving optionality (attainable utility, reachability, etc) seems possible. For example, if the attainable set contains a function that rewards the presence of vases, the action of breaking a vase will make this reward function more difficult to satisfy (even if the states with/without vases are different in every branch). If we solve the problem of designing/learning a good utility set that is not dominated by irrelevant factors, I expect chaotic effects will not be an issue.

If any of the above-mentioned concerns are not overcome, impact measures will fail to distinguish between what humans would consider low-impact and high-impact. Thus, penalizing high-impact actions would come with penalizing low-impact actions as well, which would result in a strong safety-capability tradeoff. I think the most informative direction of research to figure out whether these concerns are a deal-breaker is to scale up impact measures to apply beyond gridworlds, e.g. to Atari games.

comment by TurnTrout · 2019-04-19T15:58:15.780Z · score: 4 (2 votes) · LW · GW

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

comment by Vika · 2019-04-20T13:23:06.578Z · score: 2 (1 votes) · LW · GW
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.

What makes you think that?

This is an intuitive claim based on a general observation of how people attribute responsibility. For example, if I walk into a busy street and get hit by a car, I will be considered responsible for this because it's easy to predict. On the other hand, if I am walking down the street and a brick falls on my head from the nearby building, then I will not be considered responsible, because this event would be hard to predict. There are probably other reasons that humans don't consider themselves responsible for butterfly effects.

comment by TurnTrout · 2019-04-20T15:47:36.275Z · score: 2 (1 votes) · LW · GW

I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules.

Are you thinking of an action observation formalism, or some kind of reward function over inferred state?

Responsibility

If you had to pose the problem of impact measurement as a question, what would it be?

answer by ofer · 2019-04-11T07:19:05.500Z · score: 9 (5 votes) · LW · GW

Here's a relevant passage by Rohin (from Alignment Newsletter #49, March 2019):

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.

comment by rohinmshah · 2019-04-11T17:50:27.674Z · score: 7 (4 votes) · LW · GW

Other relevant writing of mine:

Comment on the AUP post

Comment on the desiderata post [LW · GW]

But it's true that that quoted passage is the best summary of my current position. Daniel's answer is a good example of an underlying intuition that drives this position.

answer by avturchin · 2019-04-10T18:09:15.497Z · score: 5 (3 votes) · LW · GW

Something which is low-impact at moment T0 could have a large impact at moment T1. For example, if the AI doesn't break a vase at moment T0, that is a low-impact event; but if a human being later crashes his head into this vase at the next moment T1, then not breaking the vase at T0 becomes an event with a very large impact on the person's wellbeing.

Moreover, if at moment T2 it turns out that the person who crashed his head at T1 was the next Hitler, then preserving the vase at T0 again becomes a low-impact event, and so on. In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

Also, long-term impact can't be calculated without knowing actual human values, so lowering impact is the same task as AI alignment, and the low-impact task can't replace the AI alignment task or be solved separately.

comment by TurnTrout · 2019-04-10T18:21:38.815Z · score: 3 (2 votes) · LW · GW
Moreover, if at moment T2 it turns out that the person who crashed his head at T1 was the next Hitler, then preserving the vase at T0 again becomes a low-impact event

Why does it become a low-impact event in your eyes?

In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

What is your model of the thing that people are doing in their heads when they say things are impactful or not?

longterm impact can't be calculated without knowing actual human values, so lowering the impact is the same task as AI alignment, and low impact task can't replace AI alignment task or be solved separately.

Can you elaborate on this, and what your definition of impact is? I think we might be thinking of different things, and I'm interested as to what yours is.

comment by avturchin · 2019-04-10T20:15:23.456Z · score: 4 (2 votes) · LW · GW

1. In my eyes, it is a low-impact event, as its three stages unfold as: the vase is preserved, the new Hitler is killed by the vase, and human history doesn't change in a much worse direction.

An alternative is: the vase is broken, the new Hitler is not killed by the vase, and human history changes in a much worse direction.

2. I think that people think of low impact as actions which have very small consequences outside given space-time (or other) limits. For example, a robot moves from A to B without breaking the vase.

3. From my point of view, impact is the long-term effect on human history, measured according to our values. Minimising near-term impact doesn't necessarily mean minimising long-term impact, because of the "butterfly effect", and this was the main idea of my comment.

28 comments

Comments sorted by top scores.

comment by elityre · 2019-04-15T16:27:49.641Z · score: 12 (3 votes) · LW · GW

On the process level: I would offer a bit of unsolicited advice about the method you used to generate reasons for pessimism. You (and others) might try it in the future.

First of all, I strongly applaud the step of taking out a physical clock/timer and making a solid attempt at answering the question for yourself. Virtue points (and karma) awarded!

However, when I read your list, it's blatantly one-sided. You're only trying to generate reasons for pessimism, not reasons for optimism. This is not as bad as writing the bottom line, but generating arguments for only one side of a question biases your search.

Given this, one thing that I might do is first spend 5 minutes generating the best arguments for (or concrete scenarios which inspire) pessimism about impact measures, then shift my mental orientation and spend 5 minutes generating arguments for why (or concrete scenarios in which) impact measures seem promising.

But I wouldn't stop there. I would then spend 5 minutes (or as long as I need) looking over the first list and trying to generate counterarguments: reasons why the world probably isn't that way. Once I had done that, I would look over my new list of counterarguments and try to generate counter-counterarguments, iterating until I either get stuck or reach a sort of equilibrium where the arguments I've made are as strong as I can see how to make them.

Then I would go back to my second original list (the one with reasons for optimism) and do the same back and forth, generating counterarguments and counter-counterarguments, until I get stuck or reach equilibrium on that side.

At that point, I should have two lists of the strongest reasons I can muster, arguments in favor of pessimism and arguments in favor of optimism, both of which have been stress-tested by my own skepticism. I'd then compare both lists, and if any of the arguments invalidates or weakens another, I'd adjust them accordingly (there might be a few more rounds of back and forth).

At this point, I've really thoroughly considered the issue. Obviously this doesn't mean that I've gotten the right answer, or that I've thought of everything. But it does mean that, for all practical purposes, I've exhausted the low-hanging fruit of everything I can think of.

To recap...

Steps:

0. Take a binary question.
1. Make the best case I can for one answer, giving whatever arguments, or ways the world would have to be, that support that outcome.
2. Similarly make the best case I can for the other answer.
3. Take the reasoning for my first answer and generate counterarguments. Generate responses to those counterarguments. Iterate until I reach equilibrium.
4. Do the same for the reasoning behind my second answer.
5. Compare the final arguments on both sides of the question, adjusting as necessary.

(This procedure is inspired by a technique that I originally learned from Leverage Research / Paradigm Academy. By their terminology, this procedure is called (the weak form of) Pyrrhonian skepticism, after the Greek philosopher Pyrrho (who insisted that knowledge was impossible, because there were always arguments on both sides of a question). I've also heard it referred to, more generally, as "alternate stories".)

Of course, this takes more time to do, and that time cost may or may not be worth it to you. Furthermore, there are certainly pieces of your context or thinking process that I'm missing. Maybe you, in fact, did part of this process. But this is an extended method to consider.

comment by Wei_Dai · 2019-04-10T20:03:40.430Z · score: 11 (6 votes) · LW · GW

I’m interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I’m not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

I would prefer that you and/or others did push back, as I'm really curious which of the causes/reasons for pessimism actually stand up under such pushback. (See Four Layers of Intellectual Conversation and AI Safety via Debate [LW · GW].) I do appreciate that you prioritize just knowing what the causes/reasons are in the first place and don't want to discourage people from sharing them, so I wonder if there's a way to get both of what we want.

comment by TurnTrout · 2019-04-10T23:21:16.945Z · score: 13 (-1 votes) · LW · GW

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.