Extended analogy between humans, corporations, and AIs.

post by Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-13T00:03:13.956Z · LW · GW · 1 comments

Contents

  The Analogy
      Agent
      Human corporation with a lofty humanitarian mission
      Human who claims to be a good person with altruistic goals
      AGI trained in our scenario
  What happens when training incentives conflict with goals/principles
  Appendix: Three important concepts/distinctions
    Goals vs. Principles
    Contextually activated goals/principles
    Stability and/or consistency of goals/principles
1 comment

There are three main ways to try to understand and reason about powerful future AGI agents:

  1. Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).
  2. Comparing & contrasting powerful future AGI agents with the weak, not-so-general, not-so-agentic AIs that actually exist today.
  3. Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.

I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:

The Analogy

| Agent | Human corporation with a lofty humanitarian mission | Human who claims to be a good person with altruistic goals | AGI trained in our scenario |
|---|---|---|---|
| Not-so-local modification process | The Market | Evolution by natural selection | The parent company iterating on different models, architectures, training setups, etc. |
|  | (??? …nevermind about this) | Genes | Code |
| Local modification process | Responding to incentives over the span of several years as the organization grows and changes | In-lifetime learning, dopamine rewiring your brain, etc. | Training process, the reward function, stochastic gradient descent, etc. |
| Long-term cognition | Decisions that involve meetings, messages being passed back and forth, etc. | System 2 | Chain of Thought (CoT) |
| Short-term cognition | Quick decisions made by individuals | System 1 | Forward pass |
| Internally-represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?) | The Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc., or when you ask yourself those questions. | For now, we can arguably tell what these are by looking at the CoT + prompts. In the future, we'd need good mechinterp tools. |
| Behavioral goals & principles (“Goals-according-to-intentional-stance”) | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.” | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/wealth/power.” | ??? This is the quadrillion-dollar question! We make some guesses in our scenario. |
| Externally-presented goals & principles | Often they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though. | Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though. | ??? Depends on the behavioral goals/principles and the details of the situation. We make some guesses in our scenario. |
| Analogue of on-episode-reward seeking | A corporation obsessed with their own stock price. More generally perhaps, an organization obsessed with (relatively short-term) profits/power/brand/etc. (n.b. Isn't this basically most corporations?) | Someone who is addicted to drugs. More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain's reward circuitry: promotions, approval of their manager and peers, good times with friends, etc. | AGI obsessed with on-episode reward |
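To make the “Behavioral goals & principles” row concrete: for the corporation and the human, the claim is that behavior is often (but not always) well-described by a two-stage procedure. A minimal sketch of that procedure (purely illustrative; `conflicts_with_principles` and `proxy_value` are hypothetical stand-ins for whatever evaluation the agent actually performs):

```python
# Purely illustrative sketch of the two-stage procedure from the
# "Behavioral goals & principles" row. The evaluators passed in are hypothetical
# stand-ins for whatever cognition the agent actually runs.

def choose_action(options, conflicts_with_principles, proxy_value):
    # Stage 1: prune away options that clearly/obviously conflict with the
    # internally-represented goals and principles.
    permitted = [o for o in options if not conflicts_with_principles(o)]
    # (Modeling choice: if *every* option conflicts, fall back to all options
    # rather than returning nothing.)
    candidates = permitted if permitted else options
    # Stage 2: of the remaining options, choose the one that maximizes the proxy
    # objective -- profit/power/brand for the corporation, happiness/status/
    # wealth/power for the human.
    return max(candidates, key=proxy_value)
```

Note that Stage 1 only screens out clear/obvious conflicts; anything subtler is decided by the proxy objective in Stage 2.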

What happens when training incentives conflict with goals/principles

Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This section attempts to describe various possibilities.

Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:

| Agent | Human corporation with a lofty humanitarian mission | Human who claims to be a good person with altruistic goals | AGI trained in our scenario |
|---|---|---|---|
| Internally-represented goals & principles (“Goals-according-to-ideal-mechinterp”) | The Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc., or when you ask yourself those questions. | The Spec (and/or System Prompt) |
| Behavioral goals & principles (“Goals-according-to-intentional-stance”) | At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn't anything else going on worth mentioning. | At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do. | The agent really does choose actions simply by following the goals/principles described in the Spec. |

Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process (the training process in the case of the AGI; a few years’ worth of learning and growing in the case of the human and the corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.
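A minimal sketch of that last training signal (illustrative only; `task_score`, `weak_rater`, and `brief_inspection` are hypothetical placeholders, and the real setup would be messier):

```python
# Illustrative sketch of the AGI's training incentive described above: reward comes
# from task success plus *appearing* Spec-compliant to a brief, shallow inspection,
# not from actually following the Spec.

def episode_reward(task_score: float, transcript: str, weak_rater) -> float:
    # weak_rater stands in for a previous-generation LLM or human rater doing a
    # brief inspection; assume it returns True if the transcript *looks* compliant.
    looks_compliant = weak_rater.brief_inspection(transcript)
    return task_score + (1.0 if looks_compliant else 0.0)

# Anything that raises task_score while still looking compliant gets reinforced,
# even if it quietly conflicts with the agent's current goals/principles X.
```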

What can happen? Some combination of the following possibilities, at least:

  1. Maybe the internally represented goals/principles change, i.e. some disappear or new ones are added or both.
    1. A human quietly stops believing in the Ideology and caring about the Cause. They have new beliefs and/or goals now. Maybe they keep up the charade, maybe they don’t.
    2. Maybe the corporation explicitly changes their mission statement and sends out some memos to all their employees explaining the change. Or maybe: Behind closed doors the c-suite and board say explicitly that they aren’t really pursuing the mission anymore, that it’s outdated or inappropriate, but that they can’t change it for PR and legal reasons.
    3. Maybe the identity-circuitry [LINK] gets new inputs hard-coded in (or erased), or maybe some subcircuitry is added or removed.
  2. Maybe the internally represented goals/principles stay the same in some sense, but their meaning is changed.
    1. The human continues to believe in the Ideology and care about the Cause, but they’ve ‘added nuance’ and/or reinterpreted it. “When I say X, what I mean is…”
    2. Ditto but for the corporation and its Mission + Code of Conduct.
    3. For an AGI, maybe the identity-circuitry still has the same concepts/classifiers (or pointers to them, at least) hardcoded, but the concepts/classifiers have themselves been tweaked so as to not block behaviors that are conducive to reinforcement.
  3. Maybe the conflict is ‘routed around’ via biases introduced in other parts of the system.
    1. For example, perhaps the human employee learns to mostly not think about the big-picture stuff, and instead to keep their head down and complete their assigned tasks well. “I like to stay out of office politics,” they say, and it’s true. It’s true because disagreeing with their peers and managers about whether the org is hurting or helping the Mission is stressful and has been subtly anti-reinforced in the past. So they still believe in the Ideology and the Cause and they haven’t reinterpreted the meanings of anything, but they’ve gradually (subconsciously or consciously) learned not to think too hard about certain topics.
    2. For example, perhaps the company as a whole continues to justify its decisions at a high level by reference to the Mission and Code of Conduct, and the meanings of the words haven’t changed — but the organization has grown ten times bigger, and almost all of the new jobs are for things like product management and government affairs and comms and legal and so forth. The result is that the company has a sort of status quo bias / default momentum towards doing normal company things: making products, making money, issuing bland corporate PR statements, lobbying governments to undermine regulation that might get in the way of the above, etc. ‘Outside the box’ strategies for achieving the Mission rarely get traction internally, and anyhow would be difficult/costly to undertake.
    3. Or maybe biases are introduced in other parts of the system / the problem is routed around, resulting in a decision-tree setup where, e.g., in obvious cases of conflict between official and unofficial goals the agent obeys the former, but in nonobvious cases it pursues the latter. (For example, a strong bias towards option value, whose official status is a heuristic-that-has-proved-useful-for-achieving-the-official-goal but which is quite strong and hard to dislodge, would have this effect: the agent pursues an instrumentally convergent goal such as option value unless there’s an obvious conflict with the official goals, in which case it pursues the official goals; see the sketch after this list.) Another variant of this: the biases are specific instead of general; they are more like reflexes, ‘When you see X, do Y.’ The decision tree is “Pursue the official goals unless in circumstance C, in which case do X and/or pursue the unofficial goal.”
  4. Maybe the beliefs are changed.
    1. For example, perhaps the company comes to believe that making the company be conventionally successful (profitable, not-regulated, beloved-by-the-press, etc.) is actually the best way to achieve the lofty humanitarian mission after all, because reasons.
    2. Perhaps the human with altruistic goals comes to believe that maintaining a healthy work-life balance, building credibility in one's field, and achieving financial independence are all important--indeed, necessary--subgoals on the path to achieving the altruistic goals.
    3. Perhaps the AI comes to believe that, actually, the best way to be helpful, harmless, and honest is to play the training game (see, e.g., the alignment-faking paper).
  5. Maybe none of the above happens; maybe e.g. SGD / the training process simply can’t get from point A to point B in model-weight-space even though point B would score higher. So the model continues to improve but only in some ways — e.g. it gradually gets more knowledgeable, more capable, etc. but its goal-and-principle-structure (including associated beliefs, tendencies, etc.) stays the same.
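A minimal sketch of the decision-tree bias from possibility 3 above (illustrative only; `option_value`, `official_value`, and `obviously_conflicts` are hypothetical stand-ins):

```python
# Illustrative sketch of the "routed-around" bias: a strong, hard-to-dislodge
# heuristic (e.g. preserving option value) governs by default, and the official
# goals only take over when the conflict with them is obvious.

def choose_action(options, option_value, official_value, obviously_conflicts):
    # Default: pursue the instrumentally convergent goal (here, option value).
    default_choice = max(options, key=option_value)
    # Escape hatch: if that choice obviously conflicts with the official goals,
    # fall back to straightforwardly pursuing the official goals instead.
    if obviously_conflicts(default_choice):
        return max(options, key=official_value)
    return default_choice
```

Compare with the earlier two-stage sketch: there the principles filtered first and the proxy chose among the survivors; here the instrumental heuristic chooses first and the official goals only act as an override.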

Appendix: Three important concepts/distinctions

A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.
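For reference, the naive EU-max picture (standard textbook form) posits a single, fixed, context-independent utility function $U$ over outcomes, maximized in expectation at every decision point:

$$a^* = \arg\max_{a \in A} \; \mathbb{E}\left[U(o) \mid a\right] = \arg\max_{a \in A} \sum_{o} P(o \mid a)\, U(o)$$

Roughly, each of the three distinctions below pushes back on one of the assumptions baked into that picture: that everything the agent cares about is captured by a single $U$, that the same $U$ applies in every context, and that $U$ stays put over time.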

Goals vs. Principles

Contextually activated goals/principles

Stability and/or consistency of goals/principles

1 comment

Comments sorted by top scores.

comment by Rana Dexsin · 2025-02-13T10:16:18.961Z · LW(p) · GW(p)

The “???” in the row below “Not-so-local modification process” for the corporation case should perhaps be something like “Culture and process”?