Extended analogy between humans, corporations, and AIs.
Post by Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-13
There are three main ways to try to understand and reason about powerful future AGI agents:
- Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).
- Comparing & contrasting powerful future AGI agents with the weak, not-so-general, not-so-agentic AIs that actually exist today.
- Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.
I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:
- A prototypical human corporation that has a lofty humanitarian mission but also faces market pressures and incentives.
- A prototypical human working there, who thinks of themselves as a good person and independent thinker with lofty altruistic goals, but also faces the usual peer pressures and incentives.
- AGI agents being trained in our scenario — trained by a training process that mostly rewards strong performance on a wide range of difficult and challenging tasks, but also attempts to train in various goals and principles (those described in the Spec). (For context, we at the AI Futures Project are working on a scenario forecast in which "Agent-3," an autonomous AI researcher, is trained in 2027.)
The Analogy
| Agent | Human corporation with a lofty humanitarian mission | Human who claims to be a good person with altruistic goals | AGI trained in our scenario |
|---|---|---|---|
| Not-so-local modification process | The Market | Evolution by natural selection | The parent company iterating on different models, architectures, training setups, etc. |
|  | (??? …nevermind about this) | Genes | Code |
| Local modification process | Responding to incentives over the span of several years as the organization grows and changes | In-lifetime learning, dopamine rewiring your brain, etc. | Training process, the reward function, stochastic gradient descent, etc. |
| Long-term cognition | Decisions that involve meetings, messages being passed back and forth, etc. | System 2 | Chain of Thought (CoT) |
| Short-term cognition | Quick decisions made by individuals | System 1 | Forward pass |
| Internally-represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?) | the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions. | For now, we can arguably tell what these are by looking at the CoT + prompts. In the future, we'd need good mechinterp tools. |
| Behavioral goals & principles (“Goals-according-to-intentional-stance”) | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.” | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/wealth/power.” | ??? This is the quadrillion-dollar question! We make some guesses in our scenario. |
| Externally-presented goals & principles | Often they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though. | Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though. | ??? Depends on the behavioral goals/principles and the details of the situation. We make some guesses in our scenario. |
| Analogue of on-episode-reward seeking | A corporation obsessed with their own stock price/brand/etc. | Someone who is addicted to drugs. More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward circuitry: promotions, approval of their manager and peers, good times with friends, etc. | AGI obsessed with on-episode reward |
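To make the "Behavioral goals & principles" row more concrete, here is a minimal Python sketch of the "prune, then maximize" pattern it describes. All names and numbers are hypothetical illustrations, not a claim about how any actual agent is implemented: options that obviously conflict with the internally-represented goals/principles get filtered out, and whatever survives is ranked by a proxy such as profit, status, or reward.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Option:
    name: str
    obvious_violation: bool  # clearly/obviously conflicts with the stated goals & principles?
    proxy_score: float       # profit / status / reward, depending on which agent we're modeling

def choose(options: List[Option]) -> Option:
    """Prune-then-maximize: drop obvious principle violations, then take the proxy-maximizing option."""
    permitted = [o for o in options if not o.obvious_violation]
    return max(permitted, key=lambda o: o.proxy_score)

options = [
    Option("blatantly violate the Mission for profit", obvious_violation=True, proxy_score=10.0),
    Option("debatably mission-aligned, very profitable", obvious_violation=False, proxy_score=8.0),
    Option("clearly mission-aligned, barely profitable", obvious_violation=False, proxy_score=2.0),
]
print(choose(options).name)  # -> "debatably mission-aligned, very profitable"
```

Note that on this toy model the principles only bite when a violation is blatant; everything else is settled by the proxy.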
What happens when training incentives conflict with goals/principles
Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This section attempts to describe various possibilities.
Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:
| Agent | Human corporation with a lofty humanitarian mission | Human who claims to be a good person with altruistic goals | AGI trained in our scenario |
|---|---|---|---|
| Internally-represented goals & principles (“Goals-according-to-ideal-mechinterp”) | the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions. | The Spec (and/or System Prompt) |
| Behavioral goals & principles (“Goals-according-to-intentional-stance”) | At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn’t anything else going on worth mentioning. | At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do. | The agent really does choose actions simply by following the goals/principles described in the Spec. |
Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process (the training process in the case of the AGI; a few years’ worth of learning and growing for the human and the corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.
What can happen? Some combination of the following possibilities, at least:
- Maybe the internally represented goals/principles change, i.e. some disappear or new ones are added or both.
- A human quietly stops believing in the ideology and caring about the Cause. They have new beliefs and/or goals now. Maybe they keep up the charade, maybe they don’t.
- Maybe the corporation explicitly changes their mission statement and sends out some memos to all their employees explaining the change. Or maybe: Behind closed doors the c-suite and board say explicitly that they aren’t really pursuing the mission anymore, that it’s outdated or inappropriate, but that they can’t change it for PR and legal reasons.
- Maybe the identity-circuitry [LINK] gets new inputs hard-coded in (or erased), or maybe some subcircuitry gets modified.
- Maybe the internally represented goals/principles stay the same in some sense, but their meaning is changed.
- The human continues to believe in the Ideology and care about the Cause, but they’ve ‘added nuance’ and/or reinterpreted it. “When I say X, what I mean is…”
- Ditto but for the corporation and its Mission + Code of Conduct.
- For an AGI, maybe the identity-circuitry still has the same concepts/classifiers (or pointers to them, at least) hardcoded, but the concepts/classifiers have themselves been tweaked so as to not block behaviors that are conducive to reinforcement.
- Maybe the conflict is ‘routed around’ via biases introduced in other parts of the system.
- For example, perhaps the human employee learns to mostly not think about the big-picture stuff and instead keep their head down and complete their assigned tasks well. “I like to stay out of office politics,” they say, and it’s true. The reason it is true is that disagreeing with their peers and managers about whether the org is hurting or helping the Mission is stressful and has been subtly anti-reinforced in the past. So they still believe in the Ideology and the Cause and they haven’t reinterpreted the meanings of anything, but they’ve gradually (subconsciously or consciously) learned not to think too hard about certain topics.
- For example, perhaps the company as a whole continues to justify their decisions at a high level by reference to the Mission and Code of Conduct, and the meanings of the words haven’t changed — but the organization has grown ten times bigger, and almost all of the new jobs are for things like product management and government affairs and comms and legal and so forth. The result is that the company has a sort of status quo bias / default momentum towards doing normal company things: making products, making money, issuing bland corporate PR statements, lobbying governments to undermine regulation that might get in the way of the above, etc. ‘Outside the box’ strategies for achieving the Mission rarely get traction internally and anyhow would be difficult/costly to undertake.
- Or, maybe biases are introduced in other parts of the system / the problem is routed around, resulting in a decision-tree setup where, e.g., in obvious cases of conflict between official and unofficial goals the agent obeys the former, but in non-obvious cases it pursues the latter. (For example, a strong bias towards option value, whose official status is a heuristic-that-has-proved-useful-for-achieving-the-official-goal but which is quite strong and hard to dislodge, would have this effect. It would basically be a decision-tree setup where the agent pursues instrumentally convergent goals (ICGs) such as option value unless there’s an obvious conflict with the official goals, in which case it pursues the official goals; see the sketch after this list.) Another variant: the biases are specific instead of general; they are more like reflexes, ‘When you see X, do Y.’ The decision tree is then “Pursue the official goals unless in circumstance C, in which case do X and/or pursue the unofficial goal.”
- Maybe the beliefs are changed.
- For example, perhaps the company comes to believe that making the company conventionally successful (profitable, not-regulated, beloved-by-the-press, etc.) is actually the best way to achieve the lofty humanitarian mission after all, because reasons.
- Perhaps the human with altruistic goals comes to believe that maintaining a healthy work-life balance, building credibility in one's field, and achieving financial independence are all important--indeed, necessary--subgoals on the path to achieving the altruistic goals.
- Perhaps the AI comes to believe that, actually, the best way to be helpful, harmless, and honest is to play the training game (see e.g. the alignment-faking paper).
- Maybe none of the above happens; maybe e.g. SGD / the training process simply can’t get from point A to point B in model-weight-space even though point B would score higher. So the model continues to improve but only in some ways — e.g. it gradually gets more knowledgeable, more capable, etc. but its goal-and-principle-structure (including associated beliefs, tendencies, etc.) stays the same.
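As referenced in the option-value bullet above, here is a minimal sketch of that "routed-around" decision-tree pattern. The names and numbers are hypothetical: the agent defaults to an instrumentally convergent goal (keeping options open) and only defers to the official goals when a conflict with them is obvious.

```python
from typing import Dict, List

def decide(options: List[Dict]) -> Dict:
    """Decision-tree pattern: obey the official goals when a conflict is obvious;
    otherwise pursue the instrumentally convergent goal of keeping options open."""
    permitted = [o for o in options if not o.get("obvious_conflict_with_official_goals", False)]
    return max(permitted, key=lambda o: o["option_value"])

options = [
    {"name": "grab resources in a way that obviously violates the Spec",
     "obvious_conflict_with_official_goals": True, "option_value": 9.0},
    {"name": "quietly accumulate influence and keep options open", "option_value": 7.0},
    {"name": "do exactly the assigned task and nothing more", "option_value": 3.0},
]
print(decide(options)["name"])  # -> "quietly accumulate influence and keep options open"
```

The "reflex" variant would simply replace the generic option-value ranking with specific "when you see X, do Y" clauses.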
Appendix: Three important concepts/distinctions
A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.
Goals vs. Principles
- As we use the terms, there isn’t a sharp distinction between goals and principles. Goals are more about consequences, especially long-term consequences (e.g. happiness over time, or creating a Mars colony, or achieving world peace, or maximizing money or paperclips) whereas principles are about other properties of actions. Examples:
- Honesty (Whether the action involves knowingly misleading someone)
- Promise-keeping (Whether the action is consistent with past promises)
- Hypothetical approval (Whether the action is what someone else would have approved of, if they were brought up to speed on the situation)
- The EU-max model is flexible enough to represent pretty much any behavior. But in practice it implicitly encourages people to think of agents as pursuing long-term goals unconstrained by any principles.
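For reference, here is one way to write down the contrast just described (notation is ours, purely for illustration): the standard EU-max model scores an action only by the expected utility of its consequences, whereas principles, as defined above, are predicates on the action itself that constrain the choice before any consequence-based ranking happens.

```latex
% Standard EU-max: rank actions purely by expected utility over outcomes.
a^* = \arg\max_{a \in A} \; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, U(s') \,\right]

% Goals plus principles: principles (honesty, promise-keeping, ...) are constraints on the
% action itself, applied before any consequence-based ranking.
a^* = \arg\max_{a \in A \,:\, \mathrm{honest}(a) \,\wedge\, \mathrm{keepsPromises}(a)} \; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\, U(s') \,\right]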
Contextually activated goals/principles
- The EU-max model encourages people to think of agents with multiple goals as pursuing a weighted combination of, or compromise between, those goals. But there’s another way an agent can have multiple goals: they can have one goal in one context, and another goal in another.
- For example, consider a corporation that basically maximizes profit — except that the people in the corporation think of themselves as good people and would pivot the corporation to providing disaster relief in the right context, i.e. if there was an obvious visible disaster around them and people were begging them for aid. This corporation is *not* well-described as optimizing for a compromise between disaster relief and profit, because if it was, it would donate a portion of its profits to disaster relief around the world and/or set up a disaster preparedness plan just in case a disaster happened nearby. Instead it’s well-described as “optimize for profits UNLESS in the following circumstances in which case pivot to optimizing for disaster relief…” Probably there are many other UNLESS clauses in there too which never get triggered in practice.
- Another example: Consider someone who is prone to groupthink and social pressure and so forth, who has two distinct social circles: the activist group they volunteer with, and their coworkers at their job. This person might, but also might not, be well-modelled as pursuing a compromise of the goals & principles of each social group. They may instead be well-modelled as fully pursuing the goals & principles of each group when in that context. They compartmentalize, and e.g. don't think about the political implications of their employer's activities. This causes them to pass up win-win opportunities--e.g. maybe they don't notice a relatively cheap opportunity the company has to benefit the activist political agenda (at only minor cost to profits), and maybe they don't notice a relatively cheap opportunity they have to benefit the corporate agenda when in the activist group (e.g. by asking their activist friends for feedback on an advertising campaign design they are working on at work).
- Another example: A corporation with a lofty humanitarian mission that officially justifies all high-level decisions by referencing the Mission. However, in practice the corporation usually chooses the options that maximize the organization’s profits, stock price, brand, etc. But it won’t do things blatantly at odds with the Mission; instead it does things which can be rationalized. This corporation is well-described not as a compromise between the mission and profits, but rather as “if it’s blatantly obvious that the Mission demands you do X and/or not do Y, then do X and/or not Y. But if it’s debatable what’s best for the Mission, then choose the action that maximizes profits etc.”
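Here is a minimal sketch of the difference between these two models of the disaster-relief corporation (hypothetical names and numbers): a weighted-compromise agent trades the two goals off in every decision, so it donates a slice of profits even in normal times, while a contextually-activated agent optimizes profit outright unless the disaster trigger fires.

```python
from typing import Dict, List

def compromise_agent(options: List[Dict], w_profit: float = 0.8, w_relief: float = 0.2) -> Dict:
    """Weighted-compromise model: every decision trades off both goals."""
    return max(options, key=lambda o: w_profit * o["profit"] + w_relief * o["relief"])

def contextual_agent(options: List[Dict], disaster_nearby: bool) -> Dict:
    """Contextual-activation model: optimize profit UNLESS a visible disaster triggers the relief goal."""
    key = "relief" if disaster_nearby else "profit"
    return max(options, key=lambda o: o[key])

options = [
    {"name": "pure profit-maximizing plan",             "profit": 10.0, "relief": 0.0},
    {"name": "donate a slice of profits to relief",     "profit": 9.5,  "relief": 4.0},
    {"name": "drop everything and run disaster relief", "profit": 0.0,  "relief": 10.0},
]

print(compromise_agent(options)["name"])                         # -> "donate a slice of profits to relief"
print(contextual_agent(options, disaster_nearby=False)["name"])  # -> "pure profit-maximizing plan"
print(contextual_agent(options, disaster_nearby=True)["name"])   # -> "drop everything and run disaster relief"
```

Observing only the normal-times behavior, the two agents can look similar; they come apart when the triggering circumstance actually occurs.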
Stability and/or consistency of goals/principles
- Consider how two humans who are quite demographically similar, who are friends, and who profess basically the same beliefs and values, nevertheless can find themselves strongly disagreeing with each other when they take a philosophy 101 class and are presented with strange thought experiments about trolleys or utopias or infinite ethics.
- Moreover, consider how even the same student probably could end up with different opinions about these thought experiments if their circumstances were slightly changed — e.g. perhaps if the framing of the experiment was different, or if they had seen a different sci-fi movie the day before, or if their crush had made a different face when first hearing the experiment.
- Moreover, consider how the students’ opinions in the philosophy seminar probably only loosely correlate with what they would actually do if a real-world situation materialized that was analogous to the hypothetical.
- The point is, it’s possible for powerful general agents to be well-described as pursuing certain goals/principles in some range of circumstances, but not outside that range — outside that range, in crazy new circumstances, they’ll probably try to extend their goals/principles to the new situation, but the way that shakes out in practice might be random/path-dependent and hard to predict in advance—and they might just abandon their goals/principles entirely in the new situation. In other words, the agent’s goals/principles are only stable & coherent in some normal range.
- We think this might be true of some AGI systems as well. It’s arguably true of most humans and LLM agents today. The most likely way this might come up: AGI systems will act according to consistent goals/principles in situations like those they have been trained on, but act unpredictably (yet still highly competently) in cases very unlike any they have seen in training thus far.
Comments
comment by Rana Dexsin · 2025-02-13T10:16:18.961Z · LW(p) · GW(p)
The “???” in the row below “Not-so-local modification process” for the corporation case should perhaps be something like “Culture and process”?