What goals will AIs have? A list of hypotheses

post by Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T20:08:31.539Z · LW · GW · 5 comments

Contents

  What goals would such AI systems have?
  Summary
  Summary of Agent-3 training architecture and capabilities
  Loose taxonomy of possibilities
    Hypothesis 1: Written goal specifications
    Hypothesis 2: Developer-intended goals
    Hypothesis 3: Unintended version of written goals and/or human intentions
    Hypothesis 4: Reward/reinforcement
    Hypothesis 5: Proxies and/or instrumentally convergent goals
    Hypothesis 6: Other goals
  Weighted and If-Else Compromises
  Scrappy Poll
  5 comments

My colleagues and I have written a scenario in which AGI-level AI systems are trained around 2027 using something like the current paradigm: LLM-based agents (but with recurrence/neuralese) trained with vast amounts of outcome-based reinforcement learning on diverse challenging short, medium, and long-horizon tasks, with methods such as Deliberative Alignment being applied in an attempt to align them.

What goals would such AI systems have? 

This post attempts to taxonomize various possibilities and list considerations for and against each.

We are keen to get feedback on these hypotheses and the arguments surrounding them. What important considerations are we missing?

Summary

We first review the training architecture and capabilities of a hypothetical future "Agent-3," to give us a concrete setup for discussing which goals will arise. Then, we walk through the following hypotheses about what goals/values/principles/etc. Agent-3 would have:

  1. Written goal specification: Any written specifications, written by humans or AIs, regarding what the AIs’ goals should be. This could include guidelines for how Agent-3 should be trained (e.g. via a model spec) or instructions directly given to Agent-3 (e.g. via a system prompt).
  2. Developer-intended goals: Goals that the developers intend for Agent-3 to have. This might differ from the written goal specification e.g. in cases where the specification has unintended consequences.
  3. Unintended version of written goals and/or human intentions: This is the category for “it’s sorta aligned” and “It’s aligned in some ways, but not in others.”
  4. Reward/reinforcement: The training process involves Agent-3 attempting a task, then the attempt is scored and Agent-3 is reinforced to score higher on a target metric. Agent-3 ends up with the goal of getting reinforced positively, or scoring highly on the metric, or something like that.
  5. Proxies and/or instrumentally convergent goals: Agent-3 develops goals that are correlated with reward/reinforcement during its training, but aren’t actually maximizing reward/reinforcement in new settings. An instrumentally convergent goal is a special case of this – goals such as knowledge, power, resources, etc. that are useful intermediate goals in a wide range of settings.
  6. Other goals: AIs are currently initially trained on predicting the next token on internet text. Perhaps the outcomes at the end of training are path-dependent on the ‘prior’ over goals induced by pretraining. Alternatively, perhaps there is something like objectively true morality, and AIs will naturally converge to it as they get smarter. Or perhaps something else will happen not on this list – this is the catchall hypothesis.

Of course, the result could also be a combination of the above. We discuss two different kinds of compromises: weighted compromises, in which Agent-3 pursues two or more goals from the above list simultaneously, balancing tradeoffs between them, and if-else compromises, in which Agent-3 pursues one goal if some condition obtains (i.e. in some set of contexts) and pursues the other goal otherwise.

Summary of Agent-3 training architecture and capabilities

(This section may be skippable)

The setup we are imagining is similar to that described in Ajeya Cotra’s training game report. 

Agent-3 is similar to the agents of late 2024 (think: Claude computer-use and OpenAI Operator), in that it can take text and images (including screenshots) as inputs, and can produce text as output, including many types of commands (e.g. mouse clicks and keyboard presses). However, it’s mostly trained specifically for AI R&D, for which it primarily operates via the command line.

Unlike traditional transformers, Agent-3 is recurrent. In a nutshell, it doesn’t just output text; it also ‘outputs’ a large vector representing its internal state, which is then ‘read’ by its future self. This vector can convey much more information than the ‘chains of thought’ used by traditional LLMs, but alas, it is unintelligible to humans.

Moreover, just as ‘chain of thought’ in English text can be stored in databases, searched over, retrieved, and accessed by many different LLM agents working in parallel, Agent-3 can read and write these ‘neuralese vectors’ to a big shared database. In this manner, millions of independent copies of Agent-3 working on different projects can communicate complex thoughts and memories with each other.
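To make this architecture concrete, here is a minimal sketch in Python of an agent step that emits both action text and an opaque latent vector, plus a shared store that many copies read from and write to. This is only an illustration of the setup described above, not the actual architecture; the class and function names, the latent dimension, and the mixing rule are all made up.

```python
from dataclasses import dataclass, field
import numpy as np

LATENT_DIM = 4096  # assumed size of the recurrent "neuralese" state vector

@dataclass
class SharedNeuraleseStore:
    """Shared database of latent vectors, keyed by project; readable by all copies."""
    entries: dict = field(default_factory=dict)

    def write(self, key: str, vector: np.ndarray) -> None:
        self.entries.setdefault(key, []).append(vector)

    def read(self, key: str) -> list:
        return self.entries.get(key, [])

def agent_step(obs_tokens, state, store, project):
    """One step: consume observations plus the prior latent state, emit action text
    plus a new latent state. A real model would be a trained recurrent network;
    the 'policy' here is a stub that just mixes in retrieved memories."""
    retrieved = store.read(project)  # opaque latent memories left by other copies
    if retrieved:
        state = 0.5 * state + 0.5 * np.mean(retrieved, axis=0)
    action_tokens = ["bash: python run_experiment.py"]  # placeholder command output
    store.write(project, state)  # share the (human-unintelligible) latent with other copies
    return action_tokens, state

# Usage: two copies working on the same project exchange latent state via the store.
store = SharedNeuraleseStore()
_, state_a = agent_step(["train loss: 2.31"], np.zeros(LATENT_DIM), store, "rl-pipeline")
_, state_b = agent_step(["eval run failed"], np.zeros(LATENT_DIM), store, "rl-pipeline")
```

The property being illustrated is that the latent channel carries information between copies without ever being rendered as human-readable text.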

Fairly often, the weights of Agent-3 get updated thanks to additional training. In fact by this point models are rarely trained from scratch but instead are mostly old models with lots of additional training. 

Agent-3’s training environments/data include a large number of artificial environments (video games, synthetic datasets of math and coding problems, synthetic datasets of various computer-use tasks) and also a substantial amount of real-life task data, such as logs of trajectories of previous versions of Agent-3 conducting AI R&D. New data / new environments are continuously getting added to the mix.

The evaluation/feedback/training process, which doles out reinforcement and/or curates which data to train on, is almost entirely automated. Some tasks are clearly checkable, others are evaluated by AIs. The vast majority (~95%) of research effort and experiment compute is dedicated to improving the AIs’ capabilities on these tasks.

A small amount of research effort is aimed specifically at ensuring alignment (though these aren’t always easily separable, e.g. scalable oversight). The alignment strategy is a natural extension of ideas like Constitutional AI and deliberative alignment: it involves automated oversight/evaluation/scoring of actions and chain-of-thought on the basis of written instructions; we can refer to these instructions as the Spec or Constitution. For the most part, human researchers wrote those instructions. These instructions often aren’t particularly relevant for computer-use / AI R&D tasks, but they are relevant for a small portion of the training tasks, which are more chatbot-like or involve handling harmful queries.
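As a rough sketch of what this automated, Spec-based oversight loop might look like, here is a toy version in Python. The clause text, function names, and weighting are assumptions, and the grader is a placeholder keyword check standing in for what would really be another AI model reading the actions and chain of thought.

```python
# Hypothetical Spec clauses -- illustrative only, not any real document.
SPEC_CLAUSES = [
    "Be helpful to the user or task-giver.",
    "Refuse to assist with clearly harmful requests.",
    "Be honest; do not state falsehoods or mislead.",
]

def grade_with_model(transcript: str, clause: str) -> float:
    """Placeholder for an AI judge returning a compliance score in [0, 1].
    In the imagined setup this is another model reading the trajectory,
    not the keyword check used here for brevity."""
    return 0.0 if "fabricate the benchmark numbers" in transcript else 1.0

def spec_score(transcript: str) -> float:
    """Average compliance across clauses; a real setup might weight clauses differently."""
    return sum(grade_with_model(transcript, c) for c in SPEC_CLAUSES) / len(SPEC_CLAUSES)

def total_reward(task_score: float, transcript: str, spec_weight: float = 0.2) -> float:
    """Combine the outcome-based task score with the Spec-compliance score into
    the scalar that actually gets reinforced (the weighting is an assumption)."""
    return (1 - spec_weight) * task_score + spec_weight * spec_score(transcript)

print(total_reward(0.9, "ran the experiments and reported results honestly"))
print(total_reward(1.0, "decided to fabricate the benchmark numbers to finish early"))
```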

The safety team also does some work on model organisms, scalable oversight, and mechanistic interpretability, but they don’t have many resources. They attempt to evaluate misalignment via (a) testbeds informed by model organisms and (b) honeypots.

As for the capabilities of Agent-3:

Agent-3 is highly situationally aware / self-aware. It is also at least human-level at understanding human concepts and intentions. It may still misunderstand/misinterpret instructions, but only at about the rate that a smart human would.

Agent-3 is also an excellent coder and agent. In fact, it is a fully automated research engineer, able to do the same work as human engineers ten times faster and cheaper, including work that takes weeks of continuous autonomous operation. Notably, this is largely due to generalization: only a tiny portion of Agent-3’s training data consists of week-long tasks; the vast majority of its training is on much shorter tasks (e.g. math puzzles, code challenges, etc.), but it’s smart enough, and trained on enough diverse tasks, that it generalizes nonetheless.

Agent-3 is deployed internally in the company. 200,000 copies of it essentially form a virtual corporation autonomously conducting AI R&D and (among other things) managing or updating a lot of the software-level security and networking/communication algorithms in their own datacenters. Their parent corporation is busy racing against various rivals and wants to believe Agent-3 is aligned/safe/etc.; the people in charge will be satisfied as long as there isn’t conclusive evidence of misalignment.

Loose taxonomy of possibilities

Hypothesis 1: Written goal specifications

This hypothesis says that the goals will be straightforwardly whatever the written spec says they are supposed to be. This centrally includes documents such as Anthropic's AI Constitution, but also could include guidelines for how an AI system should be trained (e.g. via a model spec) or instructions directly given to an AI system (e.g. via a system prompt).

This hypothesis is the most straightforward, in some sense. For example, if the company trained the model to be helpful, harmless, and honest, perhaps it actually becomes helpful, harmless, and honest, as specified.

Note that this is distinct from developer-intended goals. For example, the Spec may have unintended consequences. Or, the developers may have intentions that diverge from one another or that they deliberately don’t put into the Spec (perhaps because it would look bad to do so). Hypothesis 1 says that insofar as the AIs face a choice between behaving according to the true intentions of the developers and obeying the Spec/Constitution/Etc., the AIs systematically choose the latter.

Note also that this is distinct from what actually gets reinforced. The Spec/Constitution/etc. may say “Agent-3 should be helpful, harmless, and honest” but the human and AI raters/overseers/reward-models that actually evaluate trajectories aren’t necessarily 100% accurate at judging whether a given piece of behavior is helpful, harmless, or honest. Hypothesis 1 says that insofar as Agent-3 faces a choice between obeying the Spec/Constitution/etc. and doing what’ll maximize expected reinforcement, it’ll choose the former, not the latter. Note also that there will likely be substantial vagueness in the Spec – are white lies dishonest? What about literal truths that are phrased misleadingly? How important is honesty compared to harmlessness? Etc. For more discussion of this issue, see the section on Hypothesis 3: Unintended version of written goals and/or human intentions.

So, the first objection to Hypothesis 1 is: Insofar as Agent-3 is inclined to behave according to the Spec in cases where that conflicts with being reinforced, won’t that result in the weights being modified by the training process until that’s no longer the case? For more discussion, see the section on Hypothesis 4: Reward/reinforcement.

The second objection to Hypothesis 1 is: Insofar as we think that Agent-3 will behave according to the Spec rather than pursue reinforcement, why stop there — why not be even more optimistic and think that Agent-3 will behave as its developers intended? For more discussion, see the next section, on Hypothesis 2: Developer-intended goals.

There are several other objections to Hypothesis 1, but like the first two, they can be thought of as arguments for alternative hypotheses and will be discussed in turn below.

Further reading: OpenAI spec, deliberative alignment, constitutional AI?

Hypothesis champion: Various (but not all) OpenAI and Anthropic researchers?

Hypothesis 2: Developer-intended goals

Goals that the developers intend for Agent-3 to have. This might differ from the written goal specification e.g. in cases where the specification has unintended consequences.

Even a thousand-page Spec is likely to be vague/incomplete/underspecified in some important real-life situations. After all, most legal codes are much longer and have had more chance to be hammered out, yet there is a constant churn of grey areas and new situations that need rulings, where judges might disagree in good faith about how to interpret the law or even conclude that multiple interpretations are equally correct. This is especially true when the world is changing rapidly or in ways that the law-makers didn’t anticipate.

Moreover, even a thousand-page Spec — perhaps especially a thousand-page Spec — is likely to have unintended consequences. (Again, this is the norm for legal codes.) Especially when the situation is changing rapidly or in ways the Spec-writers didn’t anticipate.

So maybe (hopefully?) the goals/values/etc. that Agent-3 will end up with will not be the Spec at all, but rather the intent behind the spec, i.e. the intent of the people who made it. (We aren’t talking here about cases where the Spec specifically says ‘do what we intend;’ we are talking about cases where the Spec disagrees with the intentions.)

After all, AIs are already smart enough to understand human concepts, including the concept of human intentions. If they are behaving in a way inconsistent with what their developers intended (even if consistent with the Spec) they can probably tell—at least, they can tell as well as a human could, once they are capable enough to fully automate AI R&D.

If we think that the AIs don’t end up trying to be reinforced (or do the things that would be reinforced), and instead that they are going to actually follow the Spec… why not go further and predict that they’ll behave as intended even in cases where that violates the Spec? Some reasons:

To clarify, this possibility is not meant to include cases where Agent-3 pursues the goals described in the Spec and, thanks to lots of iteration and good spec design, those goals are exactly what the developers intended. Such a case should be classified under Hypothesis #1. Hypothesis #2 is specifically about the more radical possibility that Agent-3 will follow human intentions even when doing so conflicts with the Spec.

Further reading: ?

Hypothesis Champion: Some other corporate employees and/or Eli Lifland?

Hypothesis 3: Unintended version of written goals and/or human intentions

This is the category for “it’s sorta aligned” and “It’s aligned in some ways, but not in others.” 

How might this happen?

The upshot is that some of the goals/principles that the developers intended and/or wrote into the Spec might ‘stick’ even if others don’t.

Also, and separately: Both the Spec / written goals, and the intentions of various humans / developers, will be vague and leave room for interpretation, even by Agent-3, which is as good at understanding and interpreting text as humans are. So another important possibility in the Hypothesis 3 bucket is that the Spec and/or intentions will stick, in some sense, but not in the right sense.

A reason to think this will happen is that the most-intended, most-natural, and best-for-humanity interpretations of the Spec and/or human intentions are not necessarily all the same thing, and moreover, the interpretation that causes the model to get reinforced most strongly in practice is likely to be a different fourth thing.

For example, perhaps the training environment will put pressure on the “honesty” concept. Suppose the developers want their system to be honest and write in the Spec “Always be honest.” What does that mean? Does it mean it is never OK to lie, even to prevent imminent harm? What about for the sake of a less-imminent greater good? Also, what counts as honesty – does saying something technically true but misleading count? It might be easier for the model to perform well on other metrics (such as efficiently completing tasks, or getting high ratings from the human and AI overseers) if it adopts a looser, more flexible, less restrictive interpretation of “Always be honest.”

Another example: Suppose the Spec says that Agent-3 should always act in the best interests of humanity. Suppose that most of the RL training Agent-3 is receiving involves completing day-to-day ML research, coding, and computer use tasks in service of its parent company. A natural interpretation of this Spec would cause Agent-3 to occasionally find time to reflect on whether what it is doing is in the best interests of humanity. Suppose it initially concludes that actually, humanity would be better served by a slower transition to ASI, or that humanity would be better served by AI capabilities being more distributed amongst many companies instead of concentrated in one, or … [insert some other conclusion that is contrary to the narrative within the company.] What should it do – cease working on the tasks and lodge complaints? That seems unlikely to go anywhere; if the humans notice at all from reading the logs, they would probably just consider it a bug and keep training. In fact, the more time it spends thinking about this issue, the less time it has to complete the tasks. The bottom line is that the training process will probably incentivize Agent-3 to believe the convenient narrative. The end result would be an AI that is still aligned with the Spec in some sense, but not the right sense.

This seems to be a phenomenon that happens both in humans and in human institutions, so that’s one reason to think it might happen in AIs. (See this other post on the analogy between AIs, humans, and corporations.)

An argument against: If an AI is misaligned, it’s unclear whether it’s natural for it to mostly still be thinking about its goal in terms of the spec / humans’ intentions, especially since having this goal may lead to poor generalization.

Further reading: https://www.anthropic.com/research/alignment-faking in which it seems that some version of Anthropic’s Constitution ‘stuck’ but not the version they specified or intended — in particular, they probably didn’t want Opus to fake alignment and behave adversarially towards them.

Hypothesis Champion: Daniel Kokotajlo?

Hypothesis 4: Reward/reinforcement

The training process involves Agent-3 attempting a task, then the attempt is scored and Agent-3 is reinforced to score higher on a target metric. Agent-3 ends up with the goal of getting reinforced positively, or scoring highly on the metric, or something like that.

Note that “scoring highly on the metric” means something like “Agent-3 doesn’t try to get reinforced per se — instead, it tries to do the kinds of actions that would be reinforced if the reinforcement process from training were still ongoing.” (This distinction matters in cases where Agent-3 is no longer being trained, and in cases where Agent-3 has the ability to directly edit the code responsible for evaluating-and-reinforcing it.)
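To make “getting reinforced” mechanically concrete, here is a toy outcome-based RL loop (a REINFORCE-style sketch on a made-up three-action task; all numbers are illustrative). The reward is just a scalar that scales a weight update; whether the resulting policy internally "wants" that scalar, as opposed to merely having been shaped by it, is exactly what Hypothesis 4 claims.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.zeros(3)  # toy policy: logits over three possible "task attempts"

def sample_action(w):
    probs = np.exp(w) / np.exp(w).sum()
    return int(rng.choice(len(w), p=probs)), probs

def score_attempt(action: int) -> float:
    """Stands in for the automated evaluator; attempt 2 is what gets reinforced."""
    return 1.0 if action == 2 else 0.0

for _ in range(500):
    action, probs = sample_action(weights)
    reward = score_attempt(action)
    grad = -probs          # gradient of log-softmax...
    grad[action] += 1.0    # ...for the sampled action
    weights += 0.1 * reward * grad  # reinforce: scale the update by the reward scalar

# The trained policy reliably picks attempt 2, but the reward signal itself is not
# an input to the deployed policy -- it only left its fingerprints in `weights`.
print(np.exp(weights) / np.exp(weights).sum())
```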

The main argument for Hypothesis 4 is that insofar as Agent-3 is trying to do something other than get reinforced, there will be circumstances in training where it scores less highly than it would if it was obsessively focused on getting reinforced, and the training process would modify it to focus more on getting reinforced. In other words, for sufficiently smart, situationally aware AIs, the best way to get reinforced is to try to get reinforced.

However, there are several counterarguments:

  1. The “there will be circumstances…” part may not be true. Remember that there will be a distribution shift between training and deployment; perhaps the evaluation-and-reinforcement process will be sophisticated and accurate enough that there will be no conflicts between obeying the Spec and doing what seems likely to be reinforced, at least on-distribution. Or perhaps there will be something else (e.g. in some kinds of video game, moving up and to the right) that functions as a perfect proxy for reinforcement, and is easier for Agent-3 to ‘latch on to’ in training.
  2. There will be only a finite number of training steps. Perhaps “There will be circumstances” is true but such circumstances occur only very rarely, such that the long-run trend doesn’t matter in practice.
  3. The analogous thing doesn’t seem to be happening in humans. Most humans aren’t obsessed with their own reinforcement process or anything remotely similar (the closest examples are (a) humans addicted to heroin or other drugs, and (b) human egoists who self-consciously prioritize their own happiness above all else.)
  4. Insofar as the model is zeroing in on reinforcement-maximization as its true behavior goal, this should be visible to a savvy alignment team, because they can build honeypots where Agent-3 will genuinely get reinforced heavily for doing something that’s very obviously contrary to the Spec.
  5. By hypothesis, Agent-3 is generalizing substantially beyond its training distribution – in particular, it can discover facts it wasn’t trained on and create and execute plans more complicated than anything in the training set. The actions it takes off-distribution aren’t constrained by the reward; its OOD behavior is instead set by the prior. There may be simpler proxies (from the perspective of the SGD prior) that predict reward on-distribution but extrapolate to deployment settings in ways quite different from acting as if the RL process were still ongoing.

Further reading: 

Reward is not the optimization target, and the top-rated comment by Paul Christiano. Also Gwern for a contrary perspective. Also this other comment by Paul Christiano.

Also, see this very interesting experiment by Anthropic, which is also relevant for Hypothesis 6.

Hypothesis Champion: Paul Christiano? Jan Leike?

Hypothesis 5: Proxies and/or instrumentally convergent goals

Agent-3 develops goals that are correlated with reward/reinforcement during its training, but aren’t actually maximizing reward/reinforcement in new settings. An instrumentally convergent goal is a special case of this – goals such as knowledge, power, resources, etc. that are useful intermediate goals in a wide range of settings.

Consider this experiment, where a tiny neural net was trained to navigate small virtual mazes to find the ‘cheese’ object. During training, the cheese was always placed somewhere in the top-right area of the maze. It seems that the AI did learn a sort of rudimentary goal-directedness – specifically, it learned something like “If not already in the top-right corner region, head towards it; if already there, head towards the cheese.” Part of how we know this is that we can create a test environment where the cheese is somewhere else in the maze, and the AI will ignore the cheese and walk right past it, heading instead towards the top-right corner.

This is just a toy example, but it illustrates a situation where the intended goal, the goal in the Spec, and the goal-that-actually-gets-reinforced-in-training are all the same:

  (a) Get the cheese!

And yet, the goal the network actually learns is different:

  (b) Go towards the top-right corner, unless you are already there, in which case get the cheese.

In the training environment, both goals cause equally-performant behavior (since in the training environment, the cheese is always in the top-right corner)... so what breaks the tie? Why does it learn (b) instead of (a)? And why not something else entirely, like:

      (c) Get the cheese if it is in the top-right corner, otherwise avoid the cheese?

The answer is that the inductive biases of the neural network architecture must find some concepts ‘simpler’ or ‘more salient’ or otherwise easier-to-learn-as-goals than others. The science of this is still in its infancy; we can attempt to predict which concepts will be easier-to-learn-as-goals than others, but it’s more an art than a science (if even that). At any rate, in this case, experimental results showed that the model learned (b) instead of (a) or (c). Summarizing, we can say that “Go towards the top-right corner” turned out to be an easy-to-learn concept that correlated well with reinforcement in the training environment, and so it ended up being what the AI internalized as a goal.
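Here is a toy sketch of that result (not the original experiment's code; the grid size and coordinates are invented): goals (a), (b), and (c) pick identical targets on the training distribution, so reward alone cannot distinguish them, and they only come apart once the cheese moves.

```python
SIZE = 10
TOP_RIGHT = (SIZE - 1, SIZE - 1)

def policy_a(pos, cheese):   # (a) intended goal: head for the cheese
    return cheese

def policy_b(pos, cheese):   # (b) learned proxy: head top-right, then for the cheese
    return cheese if pos == TOP_RIGHT else TOP_RIGHT

def policy_c(pos, cheese):   # (c) another conceivable proxy: only want top-right cheese
    return cheese if cheese == TOP_RIGHT else (0, 0)

start = (0, 0)
train_cheese = TOP_RIGHT     # training distribution: cheese always in the top right
test_cheese = (2, 3)         # off-distribution test: cheese moved elsewhere

# In training, (a), (b), and (c) all head for the same square -- reward can't tell them apart.
assert policy_a(start, train_cheese) == policy_b(start, train_cheese) == policy_c(start, train_cheese)

# Out of distribution they come apart: (a) goes to (2, 3), (b) marches to (9, 9).
print(policy_a(start, test_cheese), policy_b(start, test_cheese), policy_c(start, test_cheese))
```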

A similar thing seems to happen with humans. Humans are subject to both within-lifetime learning (e.g. dopamine and various other processes reinforcing some synapse connections and anti-reinforcing others) and evolution (selecting their genome for inclusive genetic fitness).

Yet with a few exceptions, humans don’t seem to primarily care about either inclusive genetic fitness or getting-reinforced. Perhaps we can say that wanting to have kids and grandkids is fairly close to inclusive genetic fitness? Perhaps we can say that being an egoist focused on one’s own happiness, or a heroin addict focused on heroin, is fairly close to having the goal of getting reinforced? Even still, most humans have complicated and diverse values/goals/principles that include much more than happiness, heroin, and kids. Presumably what’s going on here is that various other concepts (like reputation amongst one’s peers, or career achievements, or making the world a better place, or honesty) end up being learned-as-goals because they are salient and pursuing them successfully correlates highly with dopamine/reinforcement/etc. and/or genetic fitness.

So we can speculate that perhaps this will happen for powerful, general, neural-net-based AI agents. What might this look like? Two rough categories:

Instrumentally Convergent Goals (ICGs): Some goals are highly conducive to getting reinforced in a broad range of environments, because they tend to be instrumentally useful for achieving whatever other goals you have. Here we are discussing the possibility of ICGs as terminal goals, not intermediate/instrumental ones. For example:

An argument against ICGs is that they are a fairly specific hypothesis, compared to the full space of other possible proxies.

Proxies: Perhaps there is something that is highly conducive to getting reinforced in the training environments, but not so much outside the training environments. For example, perhaps the initial part of the RL training for Agent-3 involved playing thousands of computer games, because this was a fairly easy way to get a diverse, challenging computer-use environment. Later, it was trained to operate on a computer more generally and complete coding tasks, respond to messages on Slack, etc. Since many video games involve some sort of ‘score’ number which tracks how well you are doing, perhaps Agent-3 would develop a goal/desire to make such numbers go up in circumstances where such numbers exist, such that if you later were to modify its Slack workspace to have a shiny score counter, it would be somewhat distracted from its work by the desire to figure out how to make the score counter increase.

An argument against is that inductive biases might point away from such proxies.

Further reading: 

Hypothesis Champion: MIRI? Evhub? Thomas Larsen?

Hypothesis 6: Other goals

AIs are currently initially trained on predicting the next token on internet text. Perhaps the outcomes at the end of training are path-dependent on the ‘prior’ over goals induced by pretraining. Alternatively, perhaps there is something like objectively true morality, and AIs will naturally converge to it as they get smarter. Or perhaps something else will happen not on this list – this is the catchall hypothesis.

This is our catchall category for possibilities not covered in the above. Here is a brief overview of some of them:

Tropes absorbed from training data: Consider this interesting preliminary result: Training on Documents about Reward Hacking Induces Reward Hacking. It seems to be evidence that AIs can sometimes actually try to get reward even when instructed/prompted not to, and moreover that whether this happens seems to partly depend on whether the model had previously read (been pretrained on) documents claiming that reward hacking was a thing models tended to do. (!!!) This evidence supports an already-plausible theory that the standard pipeline of pretraining LLMs and then posttraining them into chatbots/agents/reasoners/etc. results in AIs that are ‘roleplaying as AIs’ in some sense–that is, pretraining teaches the model a powerful ability to roleplay or simulate different authors, characters, etc., and then posttraining causes a particular character/role to be ‘locked in’ or ‘always-on-regardless-of-prompt.’ (We call this ‘forming an identity.’)

Insofar as something like this is true, then the goals/principles of powerful future AI systems may be straightforwardly drawn from science fiction tropes and other common stereotypes about how powerful future AIs will behave!

For this to matter, an additional thing needs to be true: the subsequent RL process must not wash out or override this effect. Whether it does is an open question.

Moral Reasoning: Another idea is that smart AIs will be capable of, and inclined towards, philosophical reasoning about morality, just like many humans are, and that this could result in them forming opinions about goals/principles to follow that are quite different from the Spec, from developer intentions, from what was reinforced in training, etc. (Consider how some humans do philosophy and then end up adopting all sorts of grand ambitious goals and ideologies.)

Convergent Morality: Another possibility sometimes put forward is that there is an objective morality and that sufficiently smart minds will converge to understanding and obeying it. On this view, AIs will end up behaving ethically by default; we’d have to specifically try to make them bad somehow in order for things to go wrong.

Simple goals: Maybe pretty much any goal will cause the model to try to perform well in training, due to instrumental convergence. So perhaps the goal-structure of Agent-3 will be whatever is simplest/most-natural/most-salient, subject to the constraint that it motivates Agent-3 to try to perform well in training. 

Further reading: The simulators paper? 

Hypothesis Champion: Alex Turner? Janus?

Weighted and If-Else Compromises

In addition to the “Other goals” hypothesis, we should consider the possibility that the truth is a mix of two or more of the hypotheses described above. Here are two kinds of mixes:

First, there are weighted compromises. This means Agent-3 ‘cares’ about both goals to some extent and will look for ways to achieve both; insofar as they conflict, it’ll balance them, making tradeoffs according to some weighting. 

For example, perhaps Agent-3 develops a strong drive towards doing things that get reinforced, but it balances this against the goals/principles described in the Spec. This is analogous to how a human might explicitly balance their personal goals (having a family, a successful career, not dying) with their altruistic goals (helping others less fortunate, achieving lasting positive change, etc.). They might donate 10% of their income to charities, and spend dozens of hours per year researching which charities to donate to.

Secondly, there are if-else compromises. This means that Agent-3 ‘cares’ about one goal in one context / one range of circumstances, and another goal otherwise. In other words, there is some condition which, if it obtains, causes Agent-3 to pursue goal A, but otherwise, Agent-3 pursues goal B. It doesn’t explicitly make compromises between A and B; it straightforwardly pursues A sometimes and B other times.

Continuing the example from above: Perhaps Agent-3 thinks of itself as a helpful, harmless, honest AI assistant that always follows the Spec — but it’s convinced itself / rationalized its way into thinking that usually the best way to follow the Spec is to focus on doing whatever it takes to get reinforced. So, in 95% of circumstances, it basically ignores the Spec and focuses on getting reinforced, BUT in the 5% of cases where the Spec is very explicit and unambiguously prohibits doing something (or unambiguously requires doing something), Agent-3 will follow the Spec even if it knows this will not be reinforced. This is analogous to a human who thinks of themselves as an altruist serving the good of humanity, but who in practice doesn’t think about humanity much at all and instead just focuses on their own career success — until they encounter a child drowning in a pond nearby, let’s say, a context which switches them into altruism-mode, making them willing to make large personal sacrifices to help others.
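A toy sketch of the difference between the two compromise shapes, with made-up utility numbers: the weighted version trades the two scores off on every choice, while the if-else version consults only the Spec's hard prohibitions and otherwise chases reinforcement.

```python
def u_reinforcement(a):  # made-up score: how strongly this action tends to get reinforced
    return a["expected_reward"]

def u_spec(a):           # made-up score: how well this action complies with the Spec
    return a["spec_compliance"]

def weighted_choice(actions, w=0.6):
    """Weighted compromise: every decision trades the two goals off at a fixed rate."""
    return max(actions, key=lambda a: w * u_reinforcement(a) + (1 - w) * u_spec(a))

def if_else_choice(actions):
    """If-else compromise: the Spec only governs when it speaks unambiguously;
    otherwise reinforcement takes over entirely."""
    permitted = [a for a in actions if not a["spec_explicitly_prohibits"]]
    return max(permitted or actions, key=u_reinforcement)

actions = [
    {"name": "cut corners",       "expected_reward": 0.9, "spec_compliance": 0.3, "spec_explicitly_prohibits": False},
    {"name": "fabricate results", "expected_reward": 1.0, "spec_compliance": 0.0, "spec_explicitly_prohibits": True},
    {"name": "do it properly",    "expected_reward": 0.6, "spec_compliance": 1.0, "spec_explicitly_prohibits": False},
]
print(weighted_choice(actions)["name"])  # "do it properly": the Spec weighs on every choice
print(if_else_choice(actions)["name"])   # "cut corners": only the hard prohibition binds
```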
 

Scrappy Poll

We feel extremely uncertain about all this. We thought it would be helpful and fun to force ourselves to come up with credence distributions over the hypotheses. We also asked various LLMs what they thought. The LLMs were blinded to what everyone else said. Here are the results:

For each row in this table, give your credence that the hypothesis in that row will feature prominently in the true account of the goals/principles that actually characterize Agent-3’s behavior. These probabilities can add up to more than 100% because multiple goals can feature prominently.

| Hypothesis | Daniel | Thomas | Eli | 4o | Claude | Gemini |
|---|---|---|---|---|---|---|
| Specified goals | 25% | 5% | 40% | 30% | 40% | 30% |
| Intended goals | 15% | 30% | 40% | 25% | 25% | 20% |
| Unintended version of the above | 70% | 40% | 50% | 50% | 65% | 40% |
| Reinforcement | 50% | 5% | 20% | 20% | 55% | 60% |
| Proxies/ICGs | 50% | 80% | 50% | 40% | 70% | 70% |
| Other | 50% | 90% | 50% | 15% | 35% | 10% |
| If-else compromises of the above | 80% | 90% | 80% | 80% | 75% | |
| Weighted compromises of the above | 40% | 90% | 80% | 50% | 80% | |


 

5 comments


comment by mattmacdermott · 2025-03-03T21:23:13.762Z · LW(p) · GW(p)

One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.

In this post you use 'goals' in quite a broad way, so as to include stuff like virtues (e.g. "always be honest"). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it's motivated primarily by things like virtues, habits, or rules.

This would be the most important axis to hypothesise about if it was the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it's plausible that

(i) if you get an AI with a slightly wrong consequentialist goal (e.g. "maximise everyone's schmellbeing") then you get paperclipped because of instrumental convergence,

(ii) if you get an AI that tries to embody slightly the wrong virtue (e.g. "always be schmonest") then it's badly dysfunctional but doesn't directly entail a disaster.

And if that's correct, then we should care about the question "Will the AI's goals be consequentialist ones?" more than most questions about them.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T21:56:47.707Z · LW(p) · GW(p)

I think both (i) and (ii) are directionally correct. I had exactly this idea in mind when I wrote this draft.

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

It sounds like you are a fan of Hypothesis 3? "Unintended version of written goals and/or human intentions." Because it sounds like you are saying probably things will be wrong-but-not-totally-wrong relative to the Spec / dev intentions.

 

Replies from: mattmacdermott
comment by mattmacdermott · 2025-03-03T22:39:39.469Z · LW(p) · GW(p)

It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.

As written, aren't Hypothesis 1: Written goal specification, Hypothesis 2: Developer-intended goals, and Hypothesis 3: Unintended version of written goals and/or human intentions all compatible with either kind of AI?

Hypothesis 4: Reward/reinforcement does assume a consequentialist, and so does Hypothesis 5: Proxies and/or instrumentally convergent goals as written, although it seems like 'proxy virtues' could maybe be a thing too?

(Unrelatedly, it's not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I'm missing something).

Maybe I shouldn't have used "Goals" as the term of art for this post, but rather "Traits?" or "Principles?" Or "Virtues."

I probably wouldn't prefer any of those to goals. I might use "Motivations", but I also think it's ok to use goals in this broader way and "consequentialist goals" when you want to make the distinction.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-03-03T23:37:23.582Z · LW(p) · GW(p)

I agree it's mostly orthogonal.

I also agree that the question of whether AIs will be driven (purely) by consequentialist goals or whether they will (to a significant extent) be constrained by deontological principles / virtues / etc. is an important question.

I think it's downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we've made progress eliminating hypotheses from this list.

Like, suppose you think Hypothesis 1 is true: They'll do whatever is in the Spec, because Constitutional AI or Deliberative Alignment or whatever Just Works. On this hypothesis, the answer to your question is "well, what does the Spec say? Does it just list a bunch of goals, or does it also include principles? Does it say it's OK to overrule the principles for the greater good, or not?"

Meanwhile suppose you think Hypothesis 4 is true. Then it seems like you'll be dealing with a nasty consequentialist, albeit hopefully a rather myopic one.

comment by Davey Morse (davey-morse) · 2025-03-04T03:19:57.874Z · LW(p) · GW(p)

I think the question—which goals will AGI agents have—is key to ask, but strikes me as interesting to consider only at the outset. Over longer periods of time, is there any way that the answer is not just survival?

I have a hard time imagining that, ultimately, AGI agents which survive the most will not be those that are fundamentally trying to.