Post by [deleted] (link post)

Comments sorted by top scores.

comment by Rohin Shah (rohinmshah) · 2019-12-10T07:05:22.321Z · LW(p) · GW(p)

I don't understand this post. Some confusions:

Given a hard-coded agent that explicitly computed the consequences of its actions, and then took the action which maximized expected value according to its utility function, we would observe precisely the opposite behavior. Mutate a single line of code and the functionality of this agent would almost certainly be broken. However, the agent is still robust in the alignment sense, as its values will never drift.

Isn't this true of every computer program? This sounds like an argument that AI can never be robust in functionality, which seems to prove too much. (If you actually mean this, I think your use of the word "robust" has diverged from the property I care about.)
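
For concreteness, here is a minimal sketch (toy Python, with made-up actions, outcomes, and utilities) of the kind of hard-coded agent the quoted passage describes: it explicitly enumerates consequences and takes the action with the highest expected utility. Its "values" are a literal function in the source code, so they cannot drift, but mutating almost any line breaks the functionality.

```python
# Hypothetical toy setting: a hard-coded agent choosing how fast to drive.
ACTIONS = ["slow", "medium", "fast"]

def consequences(action):
    """Hard-coded world model: (probability, outcome) pairs for each action."""
    if action == "slow":
        return [(1.00, {"arrived": True,  "crashed": False, "time": 30})]
    if action == "medium":
        return [(0.95, {"arrived": True,  "crashed": False, "time": 20}),
                (0.05, {"arrived": False, "crashed": True,  "time": 20})]
    if action == "fast":
        return [(0.80, {"arrived": True,  "crashed": False, "time": 10}),
                (0.20, {"arrived": False, "crashed": True,  "time": 10})]
    raise ValueError(action)

def utility(outcome):
    """Hard-coded values: they never drift, because they are literal code."""
    return ((100 if outcome["arrived"] else 0)
            - (1000 if outcome["crashed"] else 0)
            - outcome["time"])

def best_action():
    """Explicit expected-utility maximization over the hard-coded model."""
    def expected_utility(action):
        return sum(p * utility(o) for p, o in consequences(action))
    return max(ACTIONS, key=expected_utility)

print(best_action())  # -> 'slow' with these particular numbers
```

Change the sign in `utility`, drop a branch of `consequences`, or replace `max` with `min`, and the agent's functionality breaks immediately; but no amount of running it changes what `utility` rewards.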

Logical systems can be robust because their behavior is very predictable off-distribution, but they are not always robust in the sense of being able to adapt to sudden, unpredictable changes.

I would like to see an example of being unable to adapt to sudden, unpredictable changes; that doesn't match my explanation of why logical systems fail. I would say that they make assumptions that turn out to be false. A particularly common case is the designer failing to consider some particular edge case, and there are enough edge cases that these failures become too common.

they must look something like a mesa optimizer. Thus, if the current learning paradigm is to create general intelligence, we will necessarily encounter problems endemic to the "old" type of AI.

In the previous paragraph, I thought you argued that logical / "old" AI has robust specification but not robust functionality. But the worry with mesa optimizers is the exact opposite: that they will have robust functionality but not robust specification.

We might expect it to perform extremely well and be quite robust to changes in its environment, but be brittle in the sense of having no explicit principles which underlie its pursuit of goals.

What does "brittle" mean here? You can say the same of humans; are humans "brittle"?

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2019-12-10T08:32:21.503Z · LW(p) · GW(p)
Isn't this true of every computer program? This sounds like an argument that AI can never be robust in functionality, which seems to prove too much. (If you actually mean this, I think your use of the word "robust" has diverged from the property I care about.)

When we describe the behavior of a system, we typically operate at varying levels of abstraction. I'm not making an argument about the fragility of the substrate that the system is on, but rather the fragility of the parts that we typically use to describe the system at an appropriate level of abstraction.

When we describe the functionality of an artificial neural network, we tend to speak about model weights and computational graphs, which do tolerate slight modifications. On the other hand, when we describe the functionality of A* search, we tend to speak about single lines of code that do stuff, which generally don't tolerate slight modifications.
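
As a quick illustration of the neural-network half of this (a sketch only, using numpy and a made-up two-layer network rather than anything trained): adding small random noise to every weight typically moves the output only slightly, whereas a one-token edit to a hand-written search procedure usually changes its behavior discontinuously.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up two-layer network standing in for a trained model.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)          # hidden layer
    return (W2 @ h + b2).item()       # scalar output

x = rng.normal(size=4)                # an arbitrary input
y = forward(x, W1, b1, W2, b2)

# Perturb every weight slightly, many times, and see how far the output moves.
eps = 0.01
deltas = []
for _ in range(1000):
    y_noisy = forward(x,
                      W1 + eps * rng.normal(size=W1.shape),
                      b1 + eps * rng.normal(size=b1.shape),
                      W2 + eps * rng.normal(size=W2.shape),
                      b2 + eps * rng.normal(size=b2.shape))
    deltas.append(abs(y_noisy - y))

print(f"original output {y:.3f}; mean |change| under small weight noise {np.mean(deltas):.4f}")
# Contrast: flip a single comparison inside a hand-coded A* (say, pop the node
# with the *largest* rather than smallest estimated cost) and it simply stops
# being A*, rather than degrading gracefully.
```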

I would like to see an example of being unable to adapt to sudden, unpredictable changes; that doesn't match my explanation of why logical systems fail. I would say that they make assumptions that turn out to be false. A particularly common case is the designer failing to consider some particular edge case, and there are enough edge cases that these failures become too common.

I'm not sure I understand the difference between a logical agent encountering a sudden, unpredictable change to its environment and a logical agent entering a regime where its operating assumptions turned out to be false. The only reason an agent would find itself in an unpredictable situation is that it assumed it would be in a different environment.

In any case, I'll rewrite that part to be clearer.

In the previous paragraph, I thought you argued that logical / "old" AI has robust specification but not robust functionality. But the worry with mesa optimizers is the exact opposite: that they will have robust functionality but not robust specification.

Consider the particular mode of failure in logical systems that you highlighted above: the system makes a false assumption about the world. Since my operating assumption was that a mesa optimizer will use some form of logical reasoning, it is therefore liable to make a false assumption about its environment which causes it to make an incorrect decision.

Note that I'm not saying that the primary issue with mesa optimizers is that they'll make some logical mistake, only that they could. In truth, the primary thesis of my post was that efforts to solve one type of robustness may not automatically carry over to the other, because those efforts don't respect the decomposition.

What does "brittle" mean here? You can say the same of humans; are humans "brittle"?

Consider the scalar definition of robustness: how well do you perform off some training distribution? In this case, many humans are brittle, since they are not doing well according to inclusive fitness. Even within their own lives, humans don't pursue the goals they set for themselves 10 years ago. There are a lot of ways in which humans are brittle in this sense.

Overall, I think your confusions are warranted. I wish I had shared the post with others before publishing it, and I may rewrite some sections to make my main point clearer for any future readers.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2019-12-10T09:25:49.376Z · LW(p) · GW(p)

Btw, as a meta-point, my understanding of your key claim is:

Sometimes, getting more of one necessarily means getting less of the other. Hence, the "paradox."

My impression after reading your comment is that you're actually saying that if you optimize for one, the other one might go down, which I certainly agree with, but for a much simpler reason: in general, if you make a change that isn't targeted at a variable Y, then that change is roughly equally likely to increase or decrease Y.

When we describe the behavior of a system, we typically operate at varying levels of abstraction. I'm not making an argument about the fragility of the substrate that the system is on, but rather the fragility of the parts that we typically use to describe the system at an appropriate level of abstraction.
When we describe the functionality of an artificial neural network, we tend to speak about model weights and computational graphs, which do tolerate slight modifications. On the other hand, when we describe the functionality of A* search, we tend to speak about single lines of code that do stuff, which generally don't tolerate slight modifications.

This seems to be a fact about which programming language you choose, as opposed to what algorithm you're using. I could in theory implement A* in neural net weights by hand, and then it would be robust. Similarly, I could write out a learned neural net in Python (one line of Python for every flop in the model), and it would no longer be robust.

(I think more broadly the robustness you're identifying involves making "small" changes where "small" is defined in terms of a "distance" defined by the programming language; I want a "distance" defined on algorithms, because that seems more relevant to talking about properties of AI. That distance should not depend on what programming language you use to implement the algorithm.)

Consider the scalar definition of robustness: how well do you perform off some training distribution? In this case, many humans are brittle, since they are not doing well according to inclusive fitness. Even within their own lives, humans don't pursue the goals they set for themselves 10 years ago. There are a lot of ways in which humans are brittle in this sense.

I claim the neural net of your example wouldn't be brittle in this way, since you postulated that it was trained on the actual distribution of environments it would encounter.

I'm not sure I understand the difference between a logical agent encountering a sudden, unpredictable change to its environment and a logical agent entering a regime where its operating assumptions turned out to be false.

A logical agent for solving mazes could be an agent that follows a wall. If you deploy such an agent in a maze with a loop, then it can circle forever. It seems like a type error to call this a sudden, unpredictable change -- I wouldn't really ascribe beliefs to this agent at all.
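
For concreteness, here is a rough sketch of that failure (a toy grid maze of my own construction): a right-hand-rule wall follower that starts beside a free-standing wall revisits the same position-and-heading state and circles forever, without ever holding an explicit belief that turned out to be false.

```python
# Toy maze with a loop: a single free-standing wall cell that a right-hand-rule
# wall follower can circle indefinitely. The layout and names are invented.
MAZE = [
    "#######",
    "#.....#",
    "#.....#",
    "#..#..#",   # free-standing wall cell at row 3, col 3
    "#.....#",
    "#....G#",   # G: the goal the agent never reaches
    "#######",
]

HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # North, East, South, West (clockwise)

def open_cell(r, c):
    return MAZE[r][c] != "#"

def step(r, c, d):
    """One move of the right-hand rule: prefer turning right, then going
    straight, then turning left, then reversing."""
    for turn in (1, 0, 3, 2):                   # right, straight, left, back
        nd = (d + turn) % 4
        dr, dc = HEADINGS[nd]
        if open_cell(r + dr, c + dc):
            return r + dr, c + dc, nd
    return r, c, d                              # completely walled in

r, c, d = 3, 2, 0        # start just west of the free-standing wall, facing North
seen = {}
for t in range(100):
    if MAZE[r][c] == "G":
        print("reached the goal")
        break
    if (r, c, d) in seen:
        print(f"state {(r, c, d)} repeated at steps {seen[(r, c, d)]} and {t}: "
              "the agent circles the free-standing wall forever")
        break
    seen[(r, c, d)] = t
    r, c, d = step(r, c, d)
```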

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2019-12-10T10:25:35.865Z · LW(p) · GW(p)

After thinking about your reply, I've come to the conclusion that my thoughts are currently too confused to continue explaining. I've edited the main post to add that detail.

comment by Donald Hobson (donald-hobson) · 2019-12-10T10:41:13.471Z · LW(p) · GW(p)

Consider a self driving car. Call the human utility function u. Call the space of all possible worlds W. In the normal operation of a self driving car, the car only makes decisions over a restricted space C ⊂ W. Say C = {crash, don't crash}. In practice C will contain a whole bunch of things the car could do. Suppose that the programmers only know the restriction of u to C. This is enough to make a self driving car that behaves correctly in the crash or don't crash dilemma.

However, suppose that a self driving car is faced with an off-distribution situation, one outside C. Three things it could do include:

1) Recognise the problem and shut down.

2) Fail to coherently optimise at all

3) Coherently optimise some extrapolation of the restriction of u to C

The behavior we want is to optimise u, but the info about what u is just isn't there.

Options (1) and (2) make the system brittle, tending to fail the moment anything goes slightly differently.

Option (3) leads to reasoning like, "I know not to crash into x, y and z, so maybe I shouldn't crash into anything." In other words, the extrapolation is often quite good when slightly off distribution. However, when far off distribution, you can get traffic light maximizer behavior.

In short, the paradox of robustness exists because, when you don't know what to optimize for, you can fail to optimize, or you can guess at something and optimize that.
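
To make those three options concrete, here is a rough sketch in Python. The feature names, numbers, and the choice of a linear extrapolation are all invented for illustration; u and C follow the notation above.

```python
import random
import numpy as np

# u restricted to C: the human utility function as known on the familiar
# option set C. Each option is described by two features the designers
# thought about: (collisions, minutes_of_delay).
u_restricted = {
    "proceed_normally":    ((0, 0),   100),
    "brake_hard":          ((0, 2),    90),
    "swerve_into_barrier": ((1, 10), -1000),
}

# A novel option from outside C. Its features look great; the features simply
# fail to capture why it is bad.
novel_options = {"run_every_red_light": (0, -20)}   # "negative delay": it's fast

def policy_shut_down(options):
    """Option 1: recognise the problem and refuse to act."""
    if any(name not in u_restricted for name in options):
        return None
    return max(options, key=lambda name: u_restricted[name][1])

def policy_incoherent(options):
    """Option 2: no coherent optimisation at all; just do something."""
    return random.choice(list(options))

def policy_extrapolate(options):
    """Option 3: fit a model to the known values and optimise the extrapolation."""
    X = np.array([[f[0], f[1], 1.0] for f, _ in u_restricted.values()])
    y = np.array([float(u) for _, u in u_restricted.values()])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    def u_hat(feat):
        return coeffs[0] * feat[0] + coeffs[1] * feat[1] + coeffs[2]
    return max(options, key=lambda name: u_hat(options[name]))

options = {**{name: f for name, (f, _) in u_restricted.items()}, **novel_options}
print(policy_shut_down(options))    # None: freezes as soon as anything is new
print(policy_incoherent(options))   # arbitrary: brittle in a different way
print(policy_extrapolate(options))  # 'run_every_red_light': confidently wrong far from C
```

The extrapolating policy is the "I know not to crash into x, y and z, so maybe I shouldn't crash into anything" style of reasoning: sensible slightly off distribution, confidently wrong far off it.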

comment by TAG · 2019-12-10T11:09:24.663Z · LW(p) · GW(p)

Given a hard-coded agent that explicitly computed the consequences of its actions, and then took the action which maximized expected value according to its utility function, we would observe precisely the opposite behavior. Mutate a single line of code and the functionality of this agent would almost certainly be broken. However, the agent is still robust in the alignment sense, as its values will never drift

Why are its values immune? If its utility function is not made of code, what is it made of?