Coherence arguments do not imply goal-directed behavior

post by rohinmshah · 2018-12-03T03:26:03.563Z · score: 64 (21 votes) · LW · GW · 26 comments


  All behavior can be rationalized as EU maximization
  There are no coherence arguments that say you must have goal-directed behavior
  There are no coherence arguments that say you must have preferences
  Convergent instrumental subgoals are about goal-directed behavior
  Goodhart’s Law is about goal-directed behavior
  Wireheading is about explicit reward maximization

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments suggesting that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy: some other strategy never does any worse than yours, and in at least one situation does strictly better with certainty. There’s a good explanation of these arguments here.

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it -- in this case, the argument says that we will not be able to notice any exploitable failures.)

Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these categories. This suggests that we could try building an AI system that can be modeled as an EU maximizer, yet doesn’t fall into one of these two categories, and so doesn’t have all of the problems that we worry about.

Note that there are two different flavors of arguments that the AI systems we build will be goal-directed agents (which are dangerous if the goal is even slightly wrong):

  Simply knowing that an agent is intelligent lets us infer that it is goal-directed.
  Particular empirical facts about the way humans are likely to build AI systems imply that they will be goal-directed.

I will only be arguing against the first claim in this post, and will talk about the second claim in the next post.

All behavior can be rationalized as EU maximization

Suppose we have access to the entire policy of an agent, that is, given any universe-history, we know what action the agent will take. Can we tell whether the agent is an EU maximizer?

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any other policy gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, e.g. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.)
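To make the construction concrete, here is a minimal sketch (my illustration, not from the original post) of wrapping an arbitrary policy as a utility function that the policy perfectly maximizes:

```python
def make_utility(policy):
    """Given a policy (a function from histories to actions), return
    U(h, a) = 1 if the policy would take action a given history h, else 0."""
    def U(history, action):
        return 1 if policy(history) == action else 0
    return U

# Any policy works, even one that ignores its history entirely.
twitch_policy = lambda history: "twitch"
U = make_utility(twitch_policy)

assert U((), "twitch") == 1             # the policy's own action gets 1...
assert U((), "acquire_resources") == 0  # ...and every other action gets 0
```

Since the policy’s chosen action gets utility 1 at every timestep and any deviation gets 0, the policy trivially maximizes expected utility for this U.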

But wouldn’t this suggest that the VNM theorem has no content? Well, we assumed that we were looking at the policy of the agent, which led to a universe-history deterministically. We didn’t have access to any probabilities. Given a particular action, we knew exactly what the next state would be. Most of the axioms of the VNM theorem make reference to lotteries and probabilities -- if the world is deterministic, then the axioms simply say that the agent must have transitive preferences over outcomes. Given that we can only observe the agent choose one history over another, we can trivially construct a transitive preference ordering by saying that the chosen history is higher in the preference ordering than the one that was not chosen. This is essentially the construction we gave above.

What then is the purpose of the VNM theorem? It tells you how to behave if you have probabilistic beliefs about the world, as well as a complete and consistent preference ordering over outcomes. This turns out to be not very interesting when “outcomes” refers to “universe-histories”. It can be more interesting when “outcomes” refers to world states instead (that is, snapshots of what the world looks like at a particular time), but utility functions over states/snapshots can’t capture everything we’re interested in, and there’s no reason to take as an assumption that an AI system will have a utility function over states/snapshots.

There are no coherence arguments that say you must have goal-directed behavior

Not all behavior can be thought of as goal-directed (primarily because I allowed the category to be defined by fuzzy intuitions rather than something more formal). Consider the following examples:

  A robot that constantly twitches.
  The agent that always chooses the action that starts with the letter “A”.
  The agent that follows some arbitrary, randomly generated policy.

These are not goal-directed by my “definition”. However, they can all be modeled as expected utility maximizers, and there isn’t any particular way that you can exploit any of these agents. Indeed, it seems hard to model the twitching robot or the policy-following agent as having any preferences at all, so the notion of “exploiting” them doesn’t make much sense.

You could argue that neither of these agents are intelligent, and we’re only concerned with superintelligent AI systems. I don’t see why these agents could not in principle be intelligent: perhaps the agent knows how the world would evolve, and how to intervene on the world to achieve different outcomes, but it does not act on these beliefs. Perhaps if we peered into the inner workings of the agent, we could find some part of it that allows us to predict the future very accurately, but it turns out that these inner workings did not affect the chosen action at all. Such an agent is in principle possible, and it seems like it is intelligent.

(If not, it seems as though you are defining intelligence to also be goal-driven, in which case I would frame my next post as arguing that we may not want to build superintelligent AI, because there are other things we could build that are as useful without the corresponding risks.)

You could argue that while this is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally). I’ll talk about those in the next post; here I am simply arguing that merely knowing that an agent is intelligent, with no additional empirical facts about the world, does not let you infer that it has goals.

As a corollary, since all behavior can be modeled as maximizing expected utility, but not all behavior is goal-directed, it is not possible to conclude that an agent is goal-driven if you only know that it can be modeled as maximizing some expected utility. However, if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

There are no coherence arguments that say you must have preferences

This section is another way to view the argument in the previous section, with “goal-directed behavior” now being operationalized as “preferences”; it is not saying anything new.

Above, I said that the VNM theorem assumes both that you use probabilities and that you have a preference ordering over outcomes. There are lots of good reasons to assume that a good reasoner will use probability theory. However, there’s not much reason to assume that there is a preference ordering over outcomes. The twitching robot, “A”-following agent, and random policy agent from the last section all seem like they don’t have preferences (in the English sense, not the math sense).

Perhaps you could define a preference ordering by saying “if I gave the agent lots of time to think, how would it choose between these two histories?” However, you could apply this definition to anything, including, e.g., a thermostat or a rock. You might argue that a thermostat or rock can’t “choose” between two histories; but then it’s unclear how to define how an AI “chooses” between two histories without that definition also applying to thermostats and rocks.

Of course, you could always define a preference ordering based on the AI’s observed behavior, but then you’re back in the setting of the first section, where all observed behavior can be modeled as maximizing an expected utility function and so saying “the AI is an expected utility maximizer” is vacuous.

Convergent instrumental subgoals are about goal-directed behavior

One of the classic reasons to worry about expected utility maximizers is the presence of convergent instrumental subgoals, detailed in Omohundro’s paper The Basic AI Drives. The paper itself is clearly talking about goal-directed AI systems:

To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world.

It then argues (among other things) that such AI systems will want to “be rational” and so will distill their goals into utility functions, which they then maximize. And once they have utility functions, they will protect them from modification.

Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals. The coherence arguments all imply that AIs will be EU maximizers for some (possibly degenerate) utility function; they don’t imply that the AI must be goal-directed.

Goodhart’s Law is about goal-directed behavior

A common argument for worrying about AI risk is that we know that a superintelligent AI system will look to us like an EU maximizer, and if it maximizes a utility function that is even slightly wrong we could get catastrophic outcomes.

By now you probably know my first response: that any behavior can be modeled as an EU maximizer, and so this argument proves too much, suggesting that any behavior causes catastrophic outcomes. But let’s set that aside for now.

The second part of the claim comes from arguments like Value is Fragile [LW · GW] and Goodhart’s Law. However, if we consider utility functions that assign value 1 to some histories and 0 to others, then if you accidentally assign a history where I needlessly stub my toe a 1 instead of a 0, that’s a slightly wrong utility function, but it isn’t going to lead to catastrophic outcomes.

The worry about utility functions that are slightly wrong holds water when the utility functions are wrong about some high-level concept, like whether humans care about their experiences reflecting reality. This is a very rarefied, particular distribution of utility functions, that are all going to lead to goal-directed or agentic behavior. As a result, I think that the argument is better stated as “if you have a slightly incorrect goal, you can get catastrophic outcomes”. And there aren’t any coherence arguments that say that agents must have goals.

Wireheading is about explicit reward maximization

There are a few papers that talk about the problems that arise with a very powerful system with a reward function or utility function, most notably wireheading. The argument that AIXI will seize control of its reward channel falls into this category. In these cases, typically the AI system is considering making a change to the system by which it evaluates goodness of actions, and the goodness of the change is evaluated by the system after the change. Daniel Dewey argues in Learning What to Value that if the change is evaluated by the system before the change, then these problems go away.

I think of these as problems with reward maximization, because typically when you phrase the problem as maximizing reward, you are maximizing the sum of rewards obtained in all timesteps, no matter how those rewards are obtained (i.e. even if you self-modify to make the reward maximal). It doesn’t seem like AI systems have to be built this way (though admittedly I do not know how to build AI systems that reliably avoid these problems).
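As a toy illustration of Dewey’s distinction (my framing, with made-up reward functions, not drawn from the cited papers), the difference is simply which reward function gets to judge a proposed self-modification:

```python
def evaluate_modification(current_reward, new_reward, world_outcome,
                          use_post_change=True):
    """Return the value the agent assigns to adopting new_reward,
    judged either by the post-change or the pre-change reward function."""
    judge = new_reward if use_post_change else current_reward
    return judge(world_outcome)

# Current reward: 1 if the task is done. "Wireheaded" reward: always 10.
task_reward = lambda outcome: 1 if outcome == "task_done" else 0
wirehead    = lambda outcome: 10

# Judged by the post-change reward, wireheading looks great;
# judged by the pre-change reward, it looks worthless.
assert evaluate_modification(task_reward, wirehead, "task_ignored", True) == 10
assert evaluate_modification(task_reward, wirehead, "task_ignored", False) == 0
```

Evaluating changes with the pre-change reward function is exactly the move that makes the wireheading option unattractive.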


In this post I’ve argued that many of the problems we typically associate with expected utility maximizers are actually problems with goal-directed agents or with explicit reward maximization. Coherence arguments only imply that a superintelligent AI system will look like an expected utility maximizer, but this is actually a vacuous constraint, and there are many potential utility functions for which the resulting AI system is neither goal-directed nor explicit-reward-maximizing. This suggests that we could try to build AI systems of this type, in order to sidestep the problems that we have identified so far.


Comments sorted by top scores.

comment by Stuart_Armstrong · 2018-12-03T20:43:06.508Z · score: 12 (6 votes) · LW · GW

Note that this starts from the assumption of goal-directed behavior and derives that the AI will be an EU maximizer along with the other convergent instrumental subgoals.

The result is actually stronger than that, I think: if the AI is goal-directed at least in part, then that part will (tend to) purge the non-goal directed behaviours and then follow the EU path.

I wonder if we could get theorems as to what kinds of minimal goal directed behaviour will result in the agent becoming a completely goal-directed agent.

comment by John_Maxwell_IV · 2018-12-04T04:27:07.755Z · score: 4 (2 votes) · LW · GW

Seems like it comes down to the definition of goal-directed. Omohundro uses a chess-playing AI as a motivating example, and intuitively, a chess-playing AI seems "fully goal-directed". But even as chess and go-playing AIs have become superhuman, and found creative plans humans can't find, we haven't seen any examples of them trying to e.g. kill other processes on your computer so they can have more computational resources and play a better game. A theory which can't explain these observations doesn't sound very useful.

Maybe this discussion is happening on the wrong level of abstraction. All abstractions are leaky, and abstractions like "intelligent", "goal-oriented", "creative plans", etc. are much leakier than typical computer science abstractions. An hour of looking at the source code is going to be worth five hours of philosophizing. The most valuable thing the AI safety community can do might be to produce a checklist for someone creating the software architecture or reading the source code for the first AGI, so they know what failure modes to look for.

comment by Stuart_Armstrong · 2018-12-04T14:54:49.848Z · score: 4 (4 votes) · LW · GW

A chess tree search algorithm would never hit upon killing other processes. An evolutionary chess-playing algorithm might learn to do that. It's not clear whether goal-directed is relevant to that distinction.

comment by gwern · 2018-12-04T19:58:01.516Z · score: 43 (14 votes) · LW · GW

That's not very imaginative. Here's how a chess tree search algorithm - let's take AlphaZero for concreteness - could learn to kill other processes, even if it has no explicit action which corresponds to interaction with other processes and is apparently sandboxed (aside from the usual sidechannels like resource use). It's a variant of the evolutionary algorithm which learned to create a board so large that its competing GAs crashed/were killed while trying to deal with it (the Tic-tac-toe memory bomb). In this case, position evaluations can indirectly reveal that an exploration strategy caused enough memory use to trigger the OOM, killing rival processes, and freeing up resources for the tree search to get a higher win rate by more exploration:

  1. one of the main limits to tree evaluation is memory consumption, due to the exponential growth of breadth-first memory requirements (this is true regardless of whether an explicit tree or implicit hash-based representation is used); to avoid this, memory consumption is often limited to a fixed amount of memory or a mix of depth/breadth-first strategies are used to tame memory growth, even though this may not be optimal, as it may force premature stopping to expansion of the game tree (resorting to light/heavy playouts) or force too much exploitation depthwise along a few promising lines of play and too little exploration etc. (One of the criticisms of AlphaZero, incidentally, was that too little RAM was given to the standard chess engines to permit them to reach their best performance.)

  2. when a computer OS detects running out of memory, it'll usually invoke an 'OOM killer', which may or may not kill the program which makes the request which uses up the last of free memory

  3. so, it is possible that if a tree search algorithm exhausts memory (because the programmer didn't remember to include a hard limit, the hard limit turns out to be incorrect for the machine being trained on, the limit is defined wrong like in terms of max depth instead of total nodes, etc), it may not crash or be killed but other programs, using unknown & potentially large percentages of memory, may be killed instead to free up memory. (I've observed this on Linux, to my frustration, where the programs I don't want killed get killed by the OOM reaper instead of the haywire program.)

  4. once other programs are killed to free up memory, all that memory is now available for the tree search algorithm to use; using this memory will increase performance by allowing more of the game tree to be explicitly evaluated, either wider or deeper.

  5. in AlphaZero, the choice of widening or deepening is inherently controlled by the NN, which is trained to predict the result of the final values of each position and increase win probabilities.

  6. reaching a position (which can be recognized by its additional complexity, indicating it lies at a certain additional depth in the tree and thus indirectly reveals how much memory is being used by the NN's cumulative exploration) which triggers an OOM killing other programs will result in more accurate position evaluations, leading to higher values/higher win probability; so it will reinforce a strategy where it learns to aggressively widen early in the game to exhaust memory, waits for an OOM to happen, and then in the rest of the game proceeds to explore more aggressively (rather than depth-first exploit) given the new memory.

    (Depending on the exact details of how the tree expansion & backups are done, it's possible that the AlphaZero NN couldn't observe the benefits of wide-then-deep - it might just look like noise in value estimates - but there are expert iteration variants where the NN directly controls the tree expansion rather than merely providing value estimates for the MCTS algorithm to explore using, and those should be able to observe indirect benefits of exploration strategies over a game.)

At no point does it interact directly with other processes, or even know that they exist; it just implicitly learns that expanding a decision tree in a particular wide-then-deep fashion leads to better evaluations more consistent with the true value and/or end-game result (because of side-effects leading to increased resource consumption leading to better performance). And that's how a tree-search algorithm can hit upon killing other processes.

comment by John_Maxwell_IV · 2018-12-05T04:30:12.522Z · score: 8 (5 votes) · LW · GW

This story seems to reinforce my "leaky abstraction" point. The story hinges on nitty gritty details of how the AI is implemented and how the operating system manages resources. There's no obvious usefulness in proving theorems and trying to make grand statements about utility maximizers, optimizers, goal-oriented systems, etc. I expect that by default, a programmer who tried to apply a theorem of Stuart's to your chess system would not think to consider these details related to memory management (formally verifying a program's source code says nothing about memory management if that happens lower in the stack). But if they did think to consider these details of memory management, having no access to Stuart's theorem, they'd still have a good shot at preventing the problem (by changing the way the NN controls tree expansion or simply capping the program's memory use).
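For what it’s worth, the “simply capping the program’s memory use” fix is nearly a one-liner on Linux (a sketch of mine, not from the thread; `RLIMIT_AS` bounds this process’s address space, so a runaway search hits its own budget instead of triggering the system-wide OOM killer):

```python
import resource

def cap_memory(max_bytes):
    """Hard-cap this process's address space (Linux/POSIX)."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

cap_memory(2 * 1024 ** 3)  # e.g. a 2 GiB budget

# An allocation beyond the budget now raises MemoryError inside this
# process, rather than getting some other process OOM-killed.
refused = False
try:
    buffer = bytearray(8 * 1024 ** 3)  # 8 GiB: exceeds the cap
except MemoryError:
    refused = True
```

This closes off gwern’s side channel by construction, without needing any theorem about the agent’s goals.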

Leaky abstractions are a common cause of computer security problems also. I think this is a big reason why crypto proofs fail so often. A proof is a tower on top of your existing set of abstractions; it's fairly useless if your existing abstractions are faulty.

comment by G Gordon Worley III (gworley) · 2019-04-04T20:01:38.347Z · score: 2 (1 votes) · LW · GW

What I like about this thread, and why I'm worried about people reading this post and updating away from thinking that sufficiently powerful processes that don't look like what we think are dangerous is safe, is that it helps make clear that Rohin seems to be making an argument that hinges on leaky or even confused abstractions. I'm not sure any of the rest of us have much better abstractions to offer that aren't leaky, and I want to encourage what Rohin does in this post of thinking through the implications of the abstractions he's using to draw conclusions that are specific enough to be critiqued, because through a process like this we can get a clearer idea of where we have shared confusion and then work to resolve it.

comment by rohinmshah · 2018-12-04T21:19:31.266Z · score: 3 (2 votes) · LW · GW

This seems right (though I have some apprehension around talking about "parts" of an AI). From the perspective of proving a theorem, it seems like you need some sort of assumption on what the rest of the AI looks like, so that you can say something like "the goal-directed part will outcompete the other parts". Though perhaps you could try defining goal-directed behavior as the sort of behavior that tends to grow and outcompete things -- this could be a useful definition? I'm not sure.

comment by Chris_Leong · 2018-12-03T14:36:56.101Z · score: 6 (3 votes) · LW · GW

An agent that constantly twitches could still be a threat if it were trying to maximise the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn't be able to twitch, so it might want to gain control of resources.

I don't suppose you could clarify exactly how this agent that is twitching is defined. In particular, how does it accumulate over time? Do you get 1 utility for each point in time where you twitch and is your total utility the undiscounted sum of these utilities.

comment by rohinmshah · 2018-12-03T23:43:48.311Z · score: 13 (8 votes) · LW · GW
I don't suppose you could clarify exactly how this agent that is twitching is defined. In particular, how does it accumulate over time? Do you get 1 utility for each point in time where you twitch and is your total utility the undiscounted sum of these utilities.

I am not defining this agent using a utility function. It turns out that because of coherence arguments and the particular construction I gave, I can view the agent as maximizing some expected utility.

I like Gurkenglas's suggestion of a random number generator hooked up to motor controls, let's go with that.

An agent that constantly twitches could still be a threat if it were trying to maximise the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn't be able to twitch, so it might want to gain control of resources.

Yeah, but it's not trying to maximize that probability. I agree that a superintelligent agent that is trying to maximize the amount of twitching it does would be a threat, possibly by acquiring resources. But motor controls hooked up to random numbers certainly won't do that.

If your robot powered by random numbers breaks down, it indeed will not twitch in the future. That's fine, clearly it must have been maximizing a utility function that assigned utility 1 to it breaking at that exact moment in time. Jessica's construction below would also work, but it's specific to the case where you take the same action across all histories.

comment by Gurkenglas · 2018-12-03T18:01:51.576Z · score: 10 (6 votes) · LW · GW

Presumably, it is a random number generator hooked up to motor controls. There is no explicit calculation of utilities that tells it to twitch.

comment by jessicata (jessica.liu.taylor) · 2018-12-03T21:31:06.814Z · score: 6 (3 votes) · LW · GW

It can maximize the utility function that gives utility 2^-i if I take the twitch action in time step i, and 0 otherwise. In a standard POMDP setting this always takes the twitch action.
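Concretely (a toy check of mine, assuming the utility function awards 2^-i for twitching at timestep i and 0 otherwise), always-twitch dominates any policy that ever skips a twitch:

```python
def utility(actions):
    """Sum of 2^-i over timesteps i (starting at 1) where the agent twitches."""
    return sum(2 ** -i for i, a in enumerate(actions, start=1) if a == "twitch")

always_twitch = ["twitch"] * 10
skips_one     = ["twitch", "other"] + ["twitch"] * 8

# Skipping any twitch strictly lowers utility.
assert utility(always_twitch) > utility(skips_one)

# One twitch now is worth more than twitching at every later step,
# since sum over j > i of 2^-j never exceeds 2^-i.
assert utility(["twitch"] + ["other"] * 9) > utility(["other"] + ["twitch"] * 9)
```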

comment by Chris_Leong · 2018-12-04T14:03:13.804Z · score: 2 (1 votes) · LW · GW

Oh that's interesting, so you've chosen a discount rate such that twitching now is always more important than twitching for the rest of time. And presumably it can't both twitch AND take other actions in the world in the same time-step, as that'd make it an immediate threat.

Such a utility maximiser might become dangerous if it were broken in such a way that it wasn't allowed to take the twitch action for a long period of time including the current time step, in which case it would take whatever actions would allow itself to twitch again as soon as possible. I wonder how dangerous such a robot would be?

On one hand, the goal of resuming twitching as soon as possible would seem to only require a limited amount of power to be accumulated, on the other hand, any resources accumulated in this process would then be deployed to maximising its utility. For example, it might have managed to gain control of a repair drone and this could now operate independently even if the original could now only twitch and nothing else. Even then, it'd likely be less of a threat as if the repair drone tried to leave to do anything, there would be a chance that the original robot would break down and the repair would be delayed. On the other hand, perhaps the repair drone can hack other systems without moving. This might result in resource accumulation.

comment by jessicata (jessica.liu.taylor) · 2018-12-04T21:47:26.808Z · score: 2 (1 votes) · LW · GW

In a POMDP there is no such thing as not being able to take a particular action at a particular time. You might have some other formalization of agents in mind; my guess is that, if this formalization is made explicit, there will be an obvious utility function that rationalizes the "always twitch" behavior.

comment by Chris_Leong · 2018-12-05T12:47:21.830Z · score: 2 (1 votes) · LW · GW

POMDP is an abstraction. Real agents can be interfered with.

comment by jessicata (jessica.liu.taylor) · 2018-12-05T18:34:38.817Z · score: 6 (3 votes) · LW · GW

AI agents are designed using an agency abstraction. The notion of an AI "having a utility function" itself only has meaning relative to an agency abstraction. There is no such thing as a "real agent" independent of some concept of agency.

All the agency abstractions I know of permit taking one of some specified set of actions at each time step, which can easily be defined to include the "twitch" action. If you disagree with my claim, you can try formalizing a natural one that doesn't have this property. (There are trivial ways to restrict the set of actions, but then you could use a utility function to rationalize "twitch if you can, take the lexicographically first action you can otherwise")

comment by rohinmshah · 2018-12-07T16:58:23.291Z · score: 2 (1 votes) · LW · GW

How do you imagine the real agent working? Can you describe the process by which it chooses actions?

comment by Chris_Leong · 2018-12-08T10:32:13.354Z · score: 2 (1 votes) · LW · GW

Presumably twitching requires sending a signal to a motor control, and the connection here can be broken.

comment by rohinmshah · 2018-12-08T20:51:58.532Z · score: 6 (3 votes) · LW · GW

Sorry, I wasn't clear enough. What is the process which both:

  • Sends the signal to the motor control to twitch, and
  • Infers that it could break or be interfered with, and sends signals to the motor controls that cause it to be in a universe-state where it is less likely to break or be interfered with?

I claim that for any such reasonable process, if there is a notion of a "goal" in this process, I can create a goal that rationalizes the "always-twitch" policy. If I put in the goal that I construct into the program that you suggest, the policy always twitches, even if it infers that it could break or be interfered with.

The "reasonable" constraint is to avoid processes like "Maximize expected utility, except in the case where you would always twitch, in that case do something else".

comment by Richard_Kennaway · 2018-12-03T09:45:29.260Z · score: 5 (4 votes) · LW · GW
(I think I’ve seen it before too, but I can’t remember where.)

Possibly on LessWrong (v1.0), where on a couple of occasions I called it the Texas Sharpshooter Utility Function (to imply that it is a useless concept).

comment by Davidmanheim · 2019-05-06T07:09:47.454Z · score: 3 (2 votes) · LW · GW

I love that framing - do you have a source you can link so I can cite it?

comment by Richard_Kennaway · 2019-05-06T07:23:25.365Z · score: 5 (3 votes) · LW · GW
I love that framing - do you have a source you can link so I can cite it?

Not anywhere outside of LessWrong. I coined the phrase: the first occurrence I can find is here [LW · GW].

comment by Davidmanheim · 2019-05-06T07:09:04.997Z · score: 3 (2 votes) · LW · GW
Actually, no matter what the policy is, we can view the agent as an EU maximizer.

There is an even broader argument to be made. For an agent that is represented by a program, no matter what the preferences are, even if inconsistent, we can view it as an EU maximizer that always chooses the output it is programmed to take. (If it is randomized, its preferences are weighted between those options.)

I suspect there are other constructions that are at least slightly less trivial, because this trivial construction has utilities over only the "outcomes" of which action it takes, which is a deontological goal, rather than the external world, which would allow more typically consequentialist goals. Still, it is consistent with definitions of EU maximization.

comment by Charlie Steiner · 2018-12-03T17:45:28.340Z · score: 3 (2 votes) · LW · GW

I'm not sure that the agent that constantly twitches is going to be motivated by coherence theorems anyways. Is the class of agents that care about coherence identical to the class of potentially dangerous goal-directed/explicit-utility-maximizing/insert-euphemism-here agents?

comment by rohinmshah · 2018-12-03T23:38:31.314Z · score: 4 (2 votes) · LW · GW

In the setting where your outcomes are universe-histories, coherence is vacuous, so nobody cares/doesn't care about that kind of coherence.

In the setting where you have some sort of contradictory preferences, because your preferences are over more high-level concepts than particular universe-histories, then you probably care about coherence theorems. Seems possible that this is the same as the class of goal-directed behaviors, but even if so I'm not sure what implications that has? Eg. I don't think it changes anything about the arguments I'm making in this post.

comment by Charlie Steiner · 2019-01-03T23:35:07.330Z · score: 3 (2 votes) · LW · GW

Sorry, this was a good response to my confused take - I promised myself I'd write a response but only ended up doing it now :)

I think the root of my disagreeing-feeling is that when I talk about things like "it cares" or "it values," I'm in a context where the intentional stance is actually doing useful work - thinking of some system as an agent with wants, plans, goals, etc. is in some cases a useful simplification that helps me better predict the world. This is especially true when I'm just using the words informally - I can talk about the constantly-twitching agent wanting to constantly twitch, when using the words deliberately, but I wouldn't use this language intuitively, because it doesn't help me predict anything the physical stance wouldn't. It might even mislead me, or dilute the usefulness of intentional stance language. This conflict with intuition is a lot of what's driving my reaction to this argument.

The other half of the issue is that I'm used to thinking of intentional-stance features as having cognitive functions. For example, if I "believe" something, this means that I have some actual physical pattern inside me that performs the function of a world-model, and something like plans, actions, or observations that I check against that world-model. The physical system that constantly twitches can indeed be modeled by an agent with a utility function over world-histories, but that agent is in some sense an incorporeal soul - the physical system itself doesn't have the cognitive functions associated with intentional-stance attributes (like "caring about coherence").

comment by rohinmshah · 2019-01-04T10:27:09.807Z · score: 3 (2 votes) · LW · GW

Yeah, I agree that the concepts of "goals", "values", "wanting", etc. are useful concepts to have, and point to something real. For those concepts, it is true that the constant-twitching agent does not "want" to constantly twitch, nor does it have it as a "goal". On the other hand, you can say that humans "want" to not suffer.

I'm not arguing that we should drop these concepts altogether. Separately from this post, I want to make the claim that we can try to build an AI system that does not have "goals". A common counterargument is that due to coherence theorems any sufficiently advanced AI system will have "goals". I'm rebutting that counterargument with this post.

(The next couple of posts in the sequence should address this a bit more, I think the sequence is going to resume very soon.)