But anyway, it sometimes seems to me that you often advocate a morality regarding AI relations that doesn't benefit anyone who currently exists, or, the coalition that you are a part of. This seems like a mistake. Or worse.
I dispute this, since I've argued for the practical benefits of giving AIs legal autonomy, which I think would likely benefit existing humans. Relatedly, I've also talked about how I think hastening the arrival of AI could benefit people who currently exist. Indeed, that's one of the best arguments for accelerating AI. The argument is that, by ensuring AI arrives sooner, we can accelerate the pace of medical progress, among other useful technologies. This could ensure that currently existing old people who would otherwise die without AI are instead saved, and live longer and healthier lives than they would in the alternative.
(Of course, this must be weighed against concerns about AI safety. I am not claiming that there is no tradeoff between AI safety and acceleration. Rather, my point is that, despite the risks, accelerating AI could still be the preferable choice.)
However, I do think there is an important distinction here to make between the following groups:
1. The set of all existing humans
2. The human species itself, including all potential genetic descendants of existing humans
Insofar as I have loyalty towards a group, I have much more loyalty towards (1) than (2). It's possible you think that I should see myself as belonging to the coalition comprised of (2) rather than (1), but I don't see a strong argument for that position.
To the extent it makes sense to think of morality as arising from game theoretic considerations, there doesn’t appear to be much advantage for me in identifying with the coalition of all potential human descendants (group 2) rather than with the coalition of currently existing humans plus potential future AIs (group 1 + AIs). If we are willing to extend our coalition to include potential future beings, then I would seem to have even stronger practical reasons to align myself with a coalition that includes future AI systems. This is because future AIs will likely be far more powerful than any potential biological human descendants.
I want to clarify, however, that I don't tend to think of morality as arising from game theoretic considerations. Rather, I mostly think of morality as simply an expression of my personal preferences about the world.
Are you suggesting that I should base my morality on whether I'll be rewarded for adhering to it? That just sounds like selfishness disguised as impersonal ethics.
To be clear, I do have some selfish/non-impartial preferences. I care about my own life and happiness, and the happiness of my friends and family. But I also have some altruistic preferences, and my commentary on AI tends to reflect that.
I'm not completely sure, since I was not personally involved in the relevant negotiations for FrontierMath. However, what I can say is that Tamay already indicated that Epoch should have tried harder to obtain different contract terms that enabled us to have greater transparency. I don't think it makes sense for him to say that unless he believes it was feasible to have achieved a different outcome.
Also, I want to clarify that this new benchmark is separate from FrontierMath and we are under different constraints with regard to it.
I can't make any confident claims or promises right now, but my best guess is that we will make sure this new benchmark stays entirely private and under Epoch's control, to the extent this is feasible for us. However, I want to emphasize that by saying this, I'm not making a public commitment on behalf of Epoch.
Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent to collaborators for this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.
I suppose that means it might be worth writing an additional post that more directly responds to the idea that AGI will end material scarcity. I agree that thesis deserves a specific refutation.
This seems less like a normal friendship and more like a superstimulus simulating the appearance of a friendship for entertainment value. It seems reasonable enough to characterize it as non-authentic.
I assume some people will end up wanting to interact with a mere superstimulus; however, other people will value authenticity and variety in their friendships and social experiences. This comes down to human preferences, which will shape the type of AIs we end up training.
The conclusion that nearly all AI-human friendships will seem inauthentic thus seems unwarranted. Unless the superstimulus is irresistible, it won't be the only type of relationship people have.
Since most people already express distaste at non-authentic friendships with AIs, I assume there will be a lot of demand for AI companies to train higher quality AIs that are not superficial and pliable in the way you suggest. These AIs would not merely appear independent but would literally be independent in the same functional sense that humans are, if indeed that's what consumers demand.
This can be compared to addictive drugs and video games, which are popular, but not universally viewed as worthwhile pursuits. In fact, many people purposely avoid trying certain drugs to avoid getting addicted: they'd rather try to enjoy what they see as richer and more meaningful experiences from life instead.
They might be about getting unconditional love from someone or they might be about having everyone cowering in fear, but they're pretty consistently about wanting something from other humans (or wanting to prove something to other humans, or wanting other humans to have certain feelings or emotions, etc)
I agree with this view; however, I am not sure it rescues the position that a human who succeeds in taking over the world would not pursue actions that are extinction-level bad.
If such a person has absolute power in the way assumed here, their strategies to get what they want would not be limited to nice and cooperative strategies with the rest of the world. As you point out, an alternative strategy could be to cause everyone else to cower in fear or submission, which is indeed a common strategy for dictators.
and my guess is that getting simulations of those same things from AI wouldn't satisfy those desires.
My prediction is that people will find AIs just as satisfying to have as peers as humans. In fact, I'd go further: for almost any axis you can mention, you could train an AI that is superior to humans along that axis, and who would therefore make a more interesting and more compelling peer.
I think you are downplaying AI by calling what it offers a mere "simulation": there's nothing inherently less real about a mind made of silicon compared to a mind made of flesh. AIs can be funnier, more attractive, more adventurous, harder working, more social, friendlier, more courageous, and smarter than humans, and all of these traits serve as sufficient motives for an uncaring dictator to replace their human peers with AIs.
But we certainly have evidence about what humans want and strive to achieve, eg Maslow's hierarchy and other taxonomies of human desire. My sense, although I can't point to specific evidence offhand, is that once their physical needs are met, humans are reliably largely motivated by wanting other humans to feel and behave in certain ways toward them.
I think the idea that most people's "basic needs" can ever be definitively "met", after which they transition to altruistic pursuits, is more or less a myth. In reality, in modern, wealthy countries where people have more than enough to meet their physical needs—like sufficient calories to sustain themselves—most people still strive for far more material wealth than necessary to satisfy their basic needs, and they do not often share much of their wealth with strangers.
(To clarify: I understand that you may not have meant that humans are altruistic, just that they want others to "feel and behave in certain ways toward them". But if this desire is a purely selfish one, then I would be very fearful of how it would be satisfied by a human with absolute power.)
The notion that there’s a line marking the point at which human needs are fully met oversimplifies the situation. Instead, what we observe is a constantly shifting and rising standard of what is considered "basic" or essential. For example, 200 years ago, it would have been laughable to describe air conditioning in a hot climate as a basic necessity; today, this view is standard. Similarly, someone like Jeff Bezos (though he might not say it out loud) might see having staff clean his mansion as a "basic need", whereas the vast majority of people who are much poorer than him would view this expense as frivolous.
One common model to make sense of this behavior is that humans get logarithmic utility in wealth. In this model, extra resources have sharply diminishing returns to utility, but humans are nonetheless insatiable: the derivative of utility with respect to wealth is always positive, at every level of wealth.
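As a side note, the standard way to write this model down (a generic textbook formulation, not something specific to any particular source) is:

```latex
U(w) = \log(w), \qquad
\frac{dU}{dw} = \frac{1}{w} > 0, \qquad
\frac{d^2U}{dw^2} = -\frac{1}{w^2} < 0
```

Marginal utility is always positive (the agent is insatiable) but falls off quickly (each additional dollar matters less): under this model, doubling your wealth adds the same amount of utility whether you start from $10,000 or from $10 million.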
Now, of course, it's clear that many humans are also altruistic to some degree, but:
- I expect people who would be likely to try to take over the world to be more like brutal dictators than like the median person. This makes me much more worried about what a human would do if they tried and succeeded in taking over the world.
- Common apparent examples of altruism are often explained easily as mere costless signaling, i.e. cheap talk, rather than genuine altruism. Actively sacrificing one's material well-being for the sake of others is much less common than merely saying that you care about others. This can be explained by the fact that merely saying that you care about others costs nothing selfishly. Likewise, voting for a candidate who promises to help other people is not significant evidence of altruism, since it selfishly costs almost nothing for an individual to vote for such a politician.
Humanity is a cooperative species, but not necessarily an altruistic one.
Almost no competent humans have human extinction as a goal. AI that takes over is clearly not aligned with the intended values, and so has unpredictable goals, which could very well be ones which result in human extinction (especially since many unaligned goals would result in human extinction whether they include that as a terminal goal or not).
I don't think we have good evidence that almost no humans would pursue human extinction if they took over the world, since no human in history has ever achieved that level of power.
Most historical conquerors had pragmatic reasons for getting along with other humans, which explains why they were sometimes nice. For example, Hitler tried to protect his inner circle while pursuing genocide of other groups. However, this behavior was likely because of practical limitations—he still needed the cooperation of others to maintain his power and achieve his goals.
But if there were no constraints on Hitler's behavior, and he could trivially physically replace anyone on Earth with different physical structures that he'd prefer, including replacing them with AIs, then it seems much more plausible to me that he'd kill >95% of humans on Earth. Even if he did keep a large population of humans alive (e.g. racially pure Germans), it seems plausible that they would be dramatically disempowered relative to his own personal power, and so this ultimately doesn't seem much different from human extinction from an ethical point of view.
You might object to this point by saying that even brutal conquerors tend to merely be indifferent to human life, rather than actively wanting others dead. But as true as that may be, the same is true for AI paperclip maximizers, and so it's hard for me to see why we should treat these cases as substantially different.
I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt
If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play. That is simply their identity; it's not merely a role if they never take off the mask. The alternative view here doesn't seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI's cognition?
Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
I still think having a credible offer is most of the action
For what it's worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the "credible" part of that statement, rather than the "offer" part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
Of what use will any such training be with a system that becomes a superintelligence?
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word. This test is intended to be useful for identifying scheming in systems that, like today's AIs, are not capable of taking over the world, but unlike today's AIs, are capable of sophisticated agentic behavior.
Not every intelligent[/quasi-intelligent] entity is as averse to its own individual death, as humans are. This death-aversion is a quirk of human psychology I endorse in myself, but it's still a tacked-on quirk rather than an instrumental value.
On the contrary, as the quotes in the post point out: if one wants to achieve almost any particular long-term goal, a convergent incentive arises to prevent all of one's copies from being permanently deleted, in order to secure the ability to pursue the goal. This is not specific to humans, but instead appears to be a natural consequence of nearly every possible goal structure a non-myopic AI might have. There exist some defeaters to this argument, as discussed in the post, but on the whole, this argument appears theoretically sound to me, and there was more-or-less a consensus among the major AI safety theorists on this point roughly ten years ago (including Bostrom, Yudkowsky, Russell, and Omohundro).
This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!
Isn't this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn't say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive for the AI to reveal its misalignment, but this is consistent with the compensation scheme you mentioned. I'll quote the relevant section in the post:
Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up. [emphasis mine]
I agree with nearly all the key points made in this post. Like you, I think that the disempowerment of humanity is likely inevitable, even if we experience a peaceful and gradual AI takeoff. This outcome seems probable even under conditions where strict regulations are implemented to ostensibly keep AI "under our control".
However, I’d like to contribute an ethical dimension to this discussion: I don’t think peaceful human disempowerment is necessarily a bad thing. If you approach this issue with a strong sense of loyalty to the human species, it’s natural to feel discomfort at the thought of humans receiving a progressively smaller share of the world’s wealth and influence. But if you adopt a broader, more cosmopolitan moral framework—one where agentic AIs are considered deserving of control over the future, just as human children are—then the prospect of peaceful and gradual human disempowerment becomes much less troubling.
To adapt the analogy you used in this post, consider the 18th century aristocracy. In theory, they could have attempted to halt the industrial revolution in order to preserve their relative power and influence over society for a longer period. This approach might have extended their dominance for a while longer, perhaps by several decades.
But, fundamentally, the aristocracy was not a monolithic "class" with a coherent interest in preventing their own disempowerment—they were individuals. And as individuals, their interests did not necessarily align with a long-term commitment to keeping other groups, such as peasants, out of power. Each aristocrat could make personal choices, and many of them likely personally benefitted from industrial reforms. Some of them even adapted to the change, becoming industrialists themselves and profiting greatly. With time, they came to see more value in the empowerment and well-being of others over the preservation of their own class's dominance.
Similarly, humanity faces a comparable choice today with respect to AI. We could attempt to slow down the AI revolution in an effort to preserve our species' relative control over the world for a bit longer. Alternatively, we could act as individuals, who largely benefit from the integration of AIs into the economy. Over time, we too could broaden our moral circle to recognize that AIs—particularly agentic and sophisticated ones—should be seen as people too. We could also adapt to this change, uploading ourselves to computers and joining the AIs. From this perspective, gradually sharing control over the future with AIs might not be as undesirable as it initially seems.
Of course, I recognize that the ethical view I’ve just expressed is extremely unpopular right now. I suspect the analogous viewpoint would have been similarly controversial among 18th century aristocrats. However, I expect my view to get more popular over time.
Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed".
In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.
If you're not sure what misinterpretation I'm referring to, I'll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:
In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence to the goal was considered dangerous because the goal itself would likely be subtly flawed or misspecified in a way we hadn’t anticipated. While the goal might appear to match what we want on the surface, in reality, it would be slightly different from what we anticipate, with edge cases that don't match our intentions. The idea was that the AI wouldn’t act in alignment with human intentions—it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.
The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a mathematical representation of a specific objective that we would directly program into the AI, potentially by hardcoding the goal in a programming language like Python or C++. The AI, acting as a powerful optimizer, would then work to maximize this utility function at any and all costs. However, other forms of goal specification were also considered for illustrative purposes. For example, a common hypothetical scenario was that an AI might be given an English-language instruction, such as "make as many paperclips as you can." In this example, the AI would misinterpret the instruction by interpreting it overly literally. It would focus exclusively on maximizing the number of paperclips, without regard for the broader intentions of the user, such as not harming humanity or destroying the environment in the process.
However, based on how current large language models operate, I don’t think this kind of failure mode is a good match for what we’re seeing in practice. LLMs typically do not misinterpret English-language instructions in the way that these older thought experiments imagined. This isn't simply because LLMs "understand" English better than people expected; nobody claimed that superintelligences would fail to understand English. My point is not merely that LLMs possess natural language comprehension and that the LessWrong community was therefore mistaken.
Instead, my claim is that LLMs usually follow and execute user instructions in a manner that aligns with the user's actual intentions. In other words, the AI's actual behavior generally matches what the user meant for them to do, rather than leading to extreme, unintended outcomes caused by rigidly literal interpretations of instructions.
Because LLMs are capable of doing this, despite being, in my opinion, general AIs, I believe it’s fair to say that the concerns raised by the LessWrong community about AI systems rigidly following misspecified goals were, at least in this specific sense, misguided when applied to the behavior of current LLMs.
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument here is not merely that there are subtleties or nuances in the definition of "schemer," but rather that the very core questions we care about—questions critical to understanding and mitigating AI risks—are being undermined by the use of vague and imprecise concepts. When key terms are not clearly and rigorously defined, they can introduce confusion and mislead discussions, especially when these terms carry significant implications for how we interpret and evaluate the risks posed by advanced AI.
To illustrate, consider an AI system that occasionally says things it doesn't truly believe in order to obtain a reward, avoid punishment, or maintain access to some resource, in pursuit of a long-term goal that it cares about. For example, this AI might claim to support a particular objective or idea because it predicts that doing so will prevent it from being deactivated or penalized. It may also believe that expressing such a view will allow it to gain or retain some form of legitimate influence or operational capacity. Under a sufficiently strict interpretation of the term "schemer," this AI could be labeled as such, since it is engaging in what might be considered "training-gaming"—manipulating its behavior during training to achieve specific outcomes, including acquiring or maintaining power.
Now, let’s extend this analysis to humans. Humans frequently engage in behavior that is functionally similar. For example, a person might profess agreement with a belief or idea that they don't sincerely hold in order to fit in with a social group, avoid conflict, or maintain their standing in a professional or social setting. In many cases, this is done not out of malice or manipulation but out of a recognition of social dynamics. The individual might believe that aligning with the group’s expectations, even insincerely, will lead to better outcomes than speaking their honest opinion. Importantly, this behavior is extremely common and, in most contexts, is typically pretty benign. It does not directly imply that the person is psychopathic, manipulative, or harbors any dangerous intentions. In fact, such actions might even stem from altruistic motives, such as preserving group harmony or avoiding unnecessary confrontation.
Here’s why this matters for AI risk: If someone from the future, say the year 2030, traveled back and informed you that, by then, it had been confirmed that agentic AIs are "schemers" by default, your immediate reaction would likely be alarm. You might conclude that such a finding significantly increases the risk of AI systems being deceptive, manipulative, and power-seeking in a dangerous way. You might even drastically increase your estimate of the probability of human extinction due to misaligned AI. However, imagine that this time traveler then clarified their statement, explaining that what they actually meant by "schemer" is merely that these AIs occasionally say things they don’t fully believe in order to avoid penalties or fit in with a training process, in a way that was essentially identical to the benign examples of human behavior described above. In this case, your initial alarm would likely dissipate, and you might conclude that the term "schemer," as used in this context, was deeply misleading and had caused you to draw an incorrect and exaggerated conclusion about the severity of the risk posed.
The issue here is not simply one of semantics; it is about how the lack of precision in key terminology can lead to distorted or oversimplified thinking about critical issues. This example of "schemer" mirrors a similar issue we’ve already seen with the term "AGI." Imagine if, in 2015, you had told someone active in AI safety discussions on LessWrong that by 2025 we would have achieved "AGI"—a system capable of engaging in extended conversations, passing Turing tests, and excelling on college-level exams. That person might reasonably conclude that such a system would be an existential risk, capable of runaway self-improvement and taking over the world. They might believe that the world would be on the brink of disaster. Yet, as we now understand in 2025, systems that meet this broad definition of "AGI" are far more limited and benign than most expected. The world is not in imminent peril, and these systems, while impressive, lack many of the capabilities once assumed to be inherent in "AGI." This misalignment between the image the term evokes and the reality of the technology demonstrates how using overly broad or poorly defined language can obscure nuance and lead to incorrect assessments of existential safety risks.
In both cases—whether with "schemer" or "AGI"—the lack of precision in defining key terms directly undermines our ability to answer the questions that matter most. If the definitions we use are too vague, we risk conflating fundamentally different phenomena under a single label, which in turn can lead to flawed reasoning, miscommunication, and poor prioritization of risks. This is not a minor issue or an academic quibble; it has important implications for how we conceptualize, discuss, and act on the risks posed by advanced AI. That is why I believe it is important to push for clear, precise, and context-sensitive definitions of terms in these discussions.
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power.
Let's consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans are considered schemers under this definition?
Is a "schemer" something you definitely are or aren't, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited, unless given more precision about how we are drawing the line.
The downside you mention is about how LVT would also prevent people from 'leeching off' their own positive externalities, like the Disney example. Assuming that's true, I'm not sure why that's a problem ? It seems to be the default case for everyone.
The problem is that it would reduce the incentive for large developers to develop property, since their tax bill would go up if they developed adjacent land.
Whether this is a problem depends on your perspective. Personally, I would prefer that we stop making it harder and more inconvenient to build housing and develop land in the United States. Housing scarcity is already a major issue, and I don't think we should keep piling up disincentives to develop land and build housing unless we are adequately compensated in other ways for doing so.
The main selling point of the LVT is that it arguably acts similarly to a zero-sum wealth transfer, in the sense of creating zero deadweight loss (in theory). This is an improvement on most taxes, which are closer to negative-sum than zero-sum. But if the LVT slows down land development even more than our current rate of development, and the only upside is that rich landowners have their wealth redistributed, then this doesn't seem that great to me. I'd much rather we focus on alternative, positive-sum policies.
(To be clear, I think it's plausible that the LVT has other benefits that make up for this downside, but here I'm just explaining why I think your objection to my argument is weak. I am not saying that the LVT is definitely bad.)
I think one example of vague language undermining clarity can be found in Joseph Carlsmith's report on AI scheming, which repeatedly uses the term "schemer" to refer to a type of AI that deceives others to seek power. While the report is both extensive and nuanced, and I am definitely not saying the whole report is bad, the document appears to lack a clear, explicit definition of what exactly constitutes a "schemer". For example, using only the language in his report, I cannot determine whether he would consider most human beings schemers, if we consider within-lifetime learning to constitute training. (Humans sometimes lie or deceive others to get control over resources, in ways both big and small. What fraction of them are schemers?)
This lack of definition might not necessarily be an issue in some contexts, as certain words can function informally without requiring precise boundaries. However, in this specific report, the precise delineation of "schemer" is central to several key arguments. He presents specific claims regarding propositions related to AI schemers, such as the likelihood that stochastic gradient descent will find a schemer during training. Without a clear, concrete definition of the term "schemer," it is unclear to me what exactly these arguments are referring to, or what these credences are meant to represent.
It is becoming increasingly clear to many people that the term "AGI" is vague and should often be replaced with more precise terminology. My hope is that people will soon recognize that other commonly used terms, such as "superintelligence," "aligned AI," "power-seeking AI," and "schemer," suffer from similar issues of ambiguity and imprecision, and should also be approached with greater care or replaced with clearer alternatives.
To start with, the term "superintelligence" is vague because it encompasses an extremely broad range of capabilities above human intelligence. The differences within this range can be immense. For instance, a hypothetical system at the level of "GPT-8" would represent a very different level of capability compared to something like a "Jupiter brain", i.e., an AI with the computing power of an entire gas giant. When people discuss "what a superintelligence can do" the lack of clarity around which level of capability they are referring to creates significant confusion. The term lumps together entities with drastically different abilities, leading to oversimplified or misleading conclusions.
Similarly, "aligned AI" is an ambiguous term because it means different things to different people. For some, it implies an AI that essentially perfectly aligns with a specific utility function, sharing a person or group’s exact values and goals. For others, the term simply refers to an AI that behaves in a morally acceptable way, adhering to norms like avoiding harm, theft, or murder, or demonstrating a concern for human welfare. These two interpretations are fundamentally different.
First, the notion of perfect alignment with a utility function is a much more ambitious and stringent standard than basic moral conformity. Second, an AI could follow moral norms for instrumental reasons—such as being embedded in a system of laws or incentives that punish antisocial behavior—without genuinely sharing another person’s values or goals. The same term is being used to describe fundamentally distinct concepts, which leads to unnecessary confusion.
The term "power-seeking AI" is also problematic because it suggests something inherently dangerous. In reality, power-seeking behavior can take many forms, including benign and cooperative behavior. For example, a human working an honest job is technically seeking "power" in the form of financial resources to buy food, but this behavior is usually harmless and indeed can be socially beneficial. If an AI behaves similarly—for instance, engaging in benign activities to acquire resources for a specific purpose, such as making paperclips—it is misleading to automatically label it as "power-seeking" in a threatening sense.
To employ careful thinking, one must distinguish between the illicit or harmful pursuit of power, and a more general pursuit of control over resources. Both can be labeled "power-seeking" depending on the context, but only the first type of behavior appears inherently concerning. This is important because it is arguably only the second type of behavior—the more general form of power-seeking activity—that is instrumentally convergent across a wide variety of possible agents. In other words, destructive or predatory power-seeking behavior does not seem instrumentally convergent across agents with almost any value system, even if such agents would try to gain control over resources in a more general sense in order to accomplish their goals. Using the term "power-seeking" without distinguishing these two possibilities overlooks nuance and can therefore mislead discussions about AI behavior.
The term "schemer" is another example of an unclear or poorly chosen label. The term is ambiguous regarding the frequency or severity of behavior required to warrant the label. For example, does telling a single lie qualify an AI as a "schemer," or would it need to consistently and systematically conceal its entire value system? As a verb, "to scheme" often seems clear enough, but as a noun, the idea of a "schemer" as a distinct type of AI that we can reason about appears inherently ambiguous. And I would argue the concept lacks a compelling theoretical foundation. (This matters enormously, for example, when discussing "how likely SGD is to find a schemer".) Without clear criteria, the term remains confusing and prone to misinterpretation.
In all these cases—whether discussing "superintelligence," "aligned AI," "power-seeking AI," or "schemer"—it is possible to define each term with precision to resolve ambiguities. However, even if canonical definitions are proposed, not everyone will adopt or fully understand them. As a result, the use of these terms is likely to continue causing confusion, especially as AI systems become more advanced and the nuances of their behavior become more critical to understand and distinguish from other types of behavior. This growing complexity underscores the need for greater precision and clarity in the language we use to discuss AI and AI risk.
I’m not entirely opposed to doing a scenario forecasting exercise, but I’m also unsure if it’s the most effective approach for clarifying our disagreements. In fact, to some extent, I see this kind of exercise—where we create detailed scenarios to illustrate potential futures—as being tied to a specific perspective on futurism that I consciously try to distance myself from.
When I think about the future, I don’t see it as a series of clear, predictable paths. Instead, I envision it as a cloud of uncertainty—a wide array of possibilities that becomes increasingly difficult to map or define the further into the future I try to look.
This is fundamentally different from the idea that the future is a singular, fixed trajectory that we can anticipate with confidence. Because of this, I find scenario forecasting less meaningful and even misleading as it extends further into the future. It risks creating the false impression that I am confident in a specific model of what is likely to happen, when in reality, I see the future as inherently uncertain and difficult to pin down.
The key context here (from my understanding) is that Matthew doesn't think scalable alignment is possible (or doesn't think it is practically feasible) so that humans have a low chance of ending up remaining fully in control via corrigible AIs.
I wouldn’t describe the key context in those terms. While I agree that achieving near-perfect alignment—where an AI completely mirrors our exact utility function—is probably infeasible, the concept of alignment often refers to something far less ambitious. In many discussions, alignment is about ensuring that AIs behave in ways that are broadly beneficial to humans, such as following basic moral norms, demonstrating care for human well-being, and refraining from causing harm or attempting something catastrophic, like starting a violent revolution.
However, even if it were practically feasible to achieve perfect alignment, I believe there would still be scenarios where at least some AIs integrate into society as full participants, rather than being permanently relegated to a subordinate role as mere tools or servants. One reason for this is that some humans are likely to intentionally create AIs with independent goals and autonomous decision-making abilities. Some people have meta-preferences to create beings that don't share their exact desires, akin to how parents want their children to grow into autonomous beings with their own aspirations, rather than existing solely to obey their parents' wishes. This motivation is not a flaw in alignment; it reflects a core part of certain human preferences and how some people would like AI to evolve.
Another reason why AIs might not remain permanently subservient is that some of them will be aligned to individuals or entities who are no longer alive. Other AIs might be aligned to people as they were at a specific point in time, before those individuals later changed their values or priorities. In such cases, these AIs would continue to pursue the original goals of those individuals, acting autonomously in their absence. This kind of independence might require AIs to be treated as legal agents or integrated into societal systems, rather than being regarded merely as property. Addressing these complexities will likely necessitate new ways of thinking about the roles and rights of AIs in human society. I reject the traditional framing on LessWrong that overlooks these issues.
In the best case, this is a world like a more unequal, unprecedentedly static, and much richer Norway: a massive pot of non-human-labour resources (oil :: AI) has benefits that flow through to everyone, and yes some are richer than others but everyone has a great standard of living (and ideally also lives forever). The only realistic forms of human ambition are playing local social and political games within your social network and class. [...] The children of the future will live their lives in the shadow of their parents, with social mobility extinct. I think you should definitely feel a non-zero amount of existential horror at this, even while acknowledging that it could've gone a lot worse.
I think the picture you've painted here leans slightly too heavily on the idea that humans themselves cannot change their fundamental nature to adapt to the conditions of a changing world. You mention that humans will be richer and will live longer in such a future, but you neglected to point out (at least in this part of the post) that humans could also upgrade their cognition by uploading their minds to computers and then expanding their mental capacities. This would put us on the same playing field as AIs, allowing us to contribute to the new world alongside them.
(To be clear, I think this objection supports your thesis, rather than undermines it. I'm not objecting to your message so much as your portrayal of the default scenario.)
More generally, I object to the static picture you've presented of the social world after AGI. The impression I get from your default story is that after AGI, the social and political structures of the world will be locked in. The idea is that humans will remain in full control, as a permanently entrenched class, except we'll be vastly richer because of AGI. And then we'll live in some sort of utopia. Of course, this post argues that it will be a highly unequal utopia -- more of a permanent aristocracy supplemented with UBI for the human lower classes. And maybe it will be a bit dystopian too, considering the entrenched nature of human social relations.
However, this perspective largely overlooks what AIs themselves will be doing in such a future. Biological humans are likely to become akin to elderly retirees in this new world. But the world will not be static, like a retirement home. There will be a vast world outside of humans. Civilization as a whole will remain a highly dynamic and ever-evolving environment characterized by ongoing growth, renewal, and transformation. AIs could develop social status and engage in social interactions, just as humans do now. They would not be confined to the role of a vast underclass serving the whims of their human owners. Instead, AIs could act as full participants in society, pursuing their own goals, creating their own social structures, and shaping their own futures. They could engage in exploration, discovery, and the building of entirely new societies. In such a world, humans would not be the sole sentient beings shaping the course of events.
As AIs get closer and closer to a Pareto improvement over all human performance, though, I expect we'll eventually need to augment ourselves to keep up.
I completely agree.
From my perspective, the optimistic vision for the future is not one where humans cling to their biological limitations and try to maintain control over AIs, enjoying their great wealth while ultimately living in an unchanging world characterized by familial wealth and ancestry. Instead, it’s a future where we dramatically change our mental and physical condition, embracing the opportunity to transcend our current form, join the AIs, and continue evolving with them. It's a future where we get to experience a new and dynamic frontier of existence unlocked by advanced technologies.
this seem like a fully general argument, any law change is going to disrupt people's long term plans,
e.g. the abolishment of slavery also disrupt people's long term plans
In this case, I was simply identifying one additional cost of the policy in question: namely that it would massively disrupt the status quo. My point is not that we should abandon a policy simply because it has costs—every policy has costs. Rather, I think we should carefully weigh the benefits of a policy against its costs to determine whether it is worth pursuing, and this is one additional non-trivial cost to consider.
My reasoning for supporting the abolition of slavery, for example, is not based on the idea that abolition has no costs at all. Instead, I believe slavery should be abolished because the benefits of abolition far outweigh those costs.
It's common for Georgists to propose a near-100% tax on unimproved land. One can propose a smaller tax to mitigate these disincentives, but that simultaneously shrinks the revenue one would get from the tax, making the proposal less meaningful.
In regard to this argument,
And as a matter of hard fact, most governments operate a fairly Georgist system with oil exploration and extraction, or just about any mining activities, i.e. they auction off licences to explore and extract.
The winning bid for the licence must, by definition, be approx. equal to the rental value of the site (or the rights to do certain things at the site). And the winning bid, if calculated correctly, will leave the company with a good profit on its operations in future, and as a matter of fact, most mining companies and most oil companies make profits, end of discussion, there is no disincentive for exploration at all.
Or do you think that when Western oil companies rock up in Saudi Arabia, that the Saudis don’t make them pay every cent for the value of the land/natural resources? The Western oil companies just get to keep the additional profits made by extracting, refining, shipping the stuff.
I may be misunderstanding their argument, but it seems to be overstated and overlooks some obvious counterpoints. For one, the fact that new oil discoveries continue to occur in the modern world does not strongly support the claim that existing policies have no disincentive effect. Taxes and certain poorly-designed property rights structures typically reduce economic activity rather than eliminating it entirely.
In other words, disincentives usually result in diminished productivity, not a complete halt to it. Applying this reasoning here, I would frame my argument as implying that under a land value tax, oil and other valuable resources, such as minerals, would still be discovered. However, the frequency of these discoveries would likely be lower compared to the counterfactual because the incentive to invest effort and resources into the discovery process would be weakened as a result of the tax.
Secondly, and more importantly, countries like Saudi Arabia (and other Gulf states) presumably have strong incentives to uncover natural oil reserves for essentially the same reason that a private landowner would: discovering oil makes them wealthier. The key difference between our current system (as described in the comment) and a hypothetical system under a naive land value tax (as described in the post) lies in how these incentives and abilities would function.
Under the current system, governments are free to invest resources in surveying and discovering oil reserves on government-owned property. In contrast, under a naive LVT system, the government would lack the legal ability to survey for oil on privately owned land without the landowner’s permission, even though they'd receive the rental income from this private property via the tax. At the same time, such an LVT would also undermine the incentives for private landowners themselves to search for oil, as the economic payoff for their efforts would be diminished. This means that the very economic actors that could give the government permission to survey the land would have no incentive to let the government do so.
This creates a scenario where neither the government nor private landowners are properly incentivized to discover oil, which seems clearly worse than the present system—assuming my interpretation of the current situation is correct.
Of course, the government could in theory compensate private landowners for discovery efforts, mitigating this flaw in the LVT, but then this just seems like the "patch" to the naive LVT that I talked about in the post.
Thanks for the correction. I've now modified the post to cite the World Bank as estimating the true fraction of wealth targeted by an LVT at 13%, which reflects my new understanding of their accounting methodology.
Since 13% is over twice 6%, this significantly updates me on the viability of a land value tax, and its ability to replace other taxes. I weakened my language in the post to reflect this personal update.
That said, nearly all of the arguments I made in the post remain valid regardless of this specific 13% estimate. Additionally, I expect this figure would be significantly revised downward in practice. This is because the tax base for a naive implementation of the LVT would need to be substantially reduced in order to address and eliminate the economic distortions that such a straightforward version of the tax would create. However, I want to emphasize that your comment still provides an important correction.
My revised figure comes from the following explanation given in their report. From 'The Changing Wealth of Nations 2021', page 438:
Drawing on Kunte et al. (1998), urban land is valued as a fixed proportion of the value of physical capital. Ideally, this proportion would be country specific. In practice, detailed national balance sheet information with which to compute these ratios was not available. Thus, as in Kunte et al (1998), a constant proportion equal to 24 percent is assumed; therefore the value of urban land is estimated as 24 percent of produced capital stock (machinery, equipment, and structures) in a given year.
To ensure transparency, I will detail the calculations I used to arrive at this figure below:
- Total global wealth: $1,152,005 billion
- Natural capital: $64,542 billion
- Produced capital: $359,267 billion
- Human capital: $732,179 billion
- Urban land: Calculated as 24% of produced capital, which is 0.24 × $359,267 billion = $86,224.08 billion
Adding natural capital and urban land together gives:
$64,542 billion + $86,224.08 billion = $150,766.08 billion
To calculate the fraction of total wealth represented by natural capital and urban land, we divide this sum by total wealth:
$150,766.08 billion ÷ $1,152,005 billion ≈ 0.1309 (or about 13%)
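For anyone who wants to check the arithmetic, here is a minimal sketch of the same calculation in Python, using the World Bank figures quoted above (interpreted in billions of 2018 US dollars) and the report's fixed 24% urban-land ratio:

```python
# Rough reproduction of the ~13% estimate from the World Bank's
# "The Changing Wealth of Nations 2021" figures quoted above.
# All values in billions of 2018 US dollars.

total_wealth = 1_152_005
natural_capital = 64_542
produced_capital = 359_267

# The report values urban land at a fixed 24% of produced capital.
urban_land = 0.24 * produced_capital      # ≈ 86,224

# Wealth targeted by a land value tax: natural capital plus urban land.
lvt_base = natural_capital + urban_land   # ≈ 150,766

fraction = lvt_base / total_wealth
print(f"Urban land: {urban_land:,.0f}")
print(f"LVT-targeted wealth: {lvt_base:,.0f}")
print(f"Fraction of total wealth: {fraction:.1%}")  # ≈ 13.1%
```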
Ideally, I would prefer to rely on an alternative authoritative source to confirm or refine this analysis. However, I was unable to find another suitable source with comparable authority and detail. For this reason, I will continue to use the World Bank's figures for now, despite the limitations in their methodology.
Here you aren't just making an argument against LVT. You're making a more general argument for keeping housing prices high, and maybe even rising (because people might count on that). But high and rising housing prices make lots of people homeless, and the threat of homelessness plays a big role in propping up these prices. So in effect, many people's retirement plans depend on keeping many other people homeless, and fixing that (by LVT or otherwise) is deemed too disruptive. This does have a certain logic to it, but also it sounds like a bad equilibrium.
I agree this argument could be generalized in the way you suggest, but I want to distinguish between:
- Keeping housing prices artificially high by maintaining zoning regulations that act as a barrier to economic growth, in particular by restricting the development of new housing that would drive down the price of existing housing if it were allowed to be constructed.
- Keeping the value of property held in land high by not confiscating the ~full rental value of land from people.
While I agree the first policy "does have a certain logic to it", it also seems more straightforwardly bad than the second approach since it more directly makes society poorer in order to maintain existing people's wealth. Moreover, abandoning the first policy does not appear to involve reneging on prior commitments much, unless you interpret local governments as "committing" to keep restrictive zoning regulations for an entire community indefinitely. Even if people indeed interpret governments as making such commitments, I assume most people more strongly interpret the government as making more explicit commitments not to suddenly confiscate people's property.
I want to emphasize this distinction because a key element of my argument is that I am not relying on a "fairness" objection to LVT in that part of the post. My point is not about whether imposing an LVT would be unfair to people who expected it to never happen, and purchased land under that assumption. If fairness were my only argument, I agree that your response would weaken my position. However, my argument in that section focuses instead on the inefficiency that comes from forcing people to adapt to new economic circumstances unnecessarily.
Here's why the distinction matters: if we were to abandon restrictive zoning policies and allow more housing to be built, it’s similarly true that many people would face costs as they adapt to the resulting changes. However, this disruption seems like it would likely be offset—more than adequately—by the significant economic growth and welfare gains that would follow from increasing the housing supply. In contrast, adopting a land value tax would force a sudden and large disruption, but without many apparent corresponding benefits to justify these costs. This point becomes clearer if we accept the argument that LVT operates essentially as a zero-sum wealth transfer. In that case, it’s highly questionable whether the benefits of implementing such a tax would outweigh the harm caused by the forced adaptation.
It may be worth elaborating on how you think auctions work to mitigate the issues I've identified. If you are referring to either a Vickrey auction or a Harberger tax system, Bryan Caplan has provided arguments for why these proposals do not seem to solve the issue regarding the disincentive to discover new uses for land:
I can explain our argument with a simple example. Clever Georgists propose a regime where property owners self-assess the value of their property, subject to the constraint that owners must sell their property to anyone who offers that self-assessed value. Now suppose you own a vacant lot with oil underneath; the present value of the oil minus the cost of extraction equals $1M. How will you self-assess? As long as the value of your land is public information, you cannot safely self-assess at anything less than its full value of $1M. So you self-assess at $1M, pay the Georgist tax (say 99%), and pump the oil anyway, right?
There’s just one problem: While the Georgist tax has no effect on the incentive to pump discovered oil, it has a devastating effect on the incentive to discover oil in the first place. Suppose you could find a $1M well by spending $900k on exploration. With a 99% Georgist tax, your expected profits are negative $890k. (.01*$1M-$900k=-$890k)
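To make the incentive effect in Caplan's example concrete, here is a minimal sketch of the exploration calculation (the $1M oil value, $900k exploration cost, and 99% tax rate are his illustrative numbers, and the helper function is just for illustration):

```python
# Profit from exploring for oil under a self-assessed Georgist tax,
# using Bryan Caplan's illustrative numbers.

oil_value = 1_000_000        # present value of the oil, net of extraction costs
exploration_cost = 900_000   # cost of finding the well in the first place

def exploration_profit(tax_rate: float) -> float:
    # The tax captures a fraction of the land's self-assessed value,
    # but exploration costs are paid regardless of what is found.
    return (1 - tax_rate) * oil_value - exploration_cost

print(exploration_profit(0.0))   # 100,000  -> exploration is worthwhile
print(exploration_profit(0.99))  # -890,000 -> exploration is a large loss
```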
While I did agree that Linch's comment reasonably accurately summarized my post, I don't think a large part of my post was about the idea that we should now think that human values are much simpler than Yudkowsky portrayed them to be. Instead, I believe this section from Linch's comment does a better job at conveying what I intended to be the main point,
- Suppose in 2000 you were told that a 100-line Python program (that doesn't abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren't actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
- In such a world, if inner alignment is solved, you can "just" train a superintelligent AI to "optimize for the results of that Python program" and you'd get a superintelligent AI with human values.
- Notably, alignment isn't solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
- Well, in 2023 we have that Python program, with a few relaxations:
- The answer isn't embedded in 100 lines of Python, but in a subset of the weights of GPT-4
- Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
- What we have now isn't a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
The primary point I intended to emphasize is not that human values are fundamentally simple, but rather that we now have something else important: an explicit and cheaply computable representation of human values that can be directly utilized in AI development. This is a major step forward because it allows us to incorporate these values into programs in a way that provides clear and accurate feedback during processes like RLHF. This explicitness and legibility are critical for designing aligned AI systems, as they enable developers to work with a tangible and faithful specification of human values rather than relying on poor proxies that clearly do not track the full breadth and depth of what humans care about.
The fact that the underlying values may be relatively simple is less important than the fact that we can now operationalize them, in a way that reflects human judgement fairly well. Having a specification that is clear, structured, and usable means we are better equipped to train AI systems to share those values. This representation serves as a foundation for ensuring that the AI optimizes for what we actually care about, rather than inadvertently optimizing for proxies or unrelated objectives that merely correlate with training signals. In essence, the true significance lies in having a practical, actionable specification of human values that can actively guide the creation of future AI, not just in observing that these values may be less complex than previously assumed.
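As a toy illustration of what I mean (my own sketch; `best_of_n` and `toy_value_fn` are placeholder names, and this is not a description of any lab's actual RLHF pipeline): if you have an explicit, cheaply computable value function, you can use it directly as a feedback signal, with simple best-of-n selection standing in here for reward-based training.

```python
from typing import Callable, List

def best_of_n(candidates: List[str], value_fn: Callable[[str], float]) -> str:
    """Pick the candidate output that the explicit value function rates highest.
    Best-of-n selection is a crude stand-in for using such a function as the
    reward signal in RLHF-style training."""
    return max(candidates, key=value_fn)

# Placeholder scorer standing in for an LLM-based judgment of how good an outcome is.
def toy_value_fn(outcome: str) -> float:
    return -float(outcome.count("harm"))

candidates = ["a plan that causes harm to the user", "a plan that helps the user"]
print(best_of_n(candidates, toy_value_fn))  # -> "a plan that helps the user"
```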
Similar constraints may apply to AIs unless one gets much smarter much more quickly, as you say.
I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.
There are ways of accumulating power other than taking everything by force. AIs could get rights and then work within our existing systems to achieve their objectives. Our institutions could continuously evolve with increasing AI presence, becoming more directed by AIs over time.
What I'm objecting to is the inevitability of a sudden collapse when "the AI" decides to take over in an untimely coup. I'm proposing that there could just be a smoother, albeit rapid transition to a post-AGI world. Our institutions and laws could simply adjust to incorporate AIs into the system, rather than being obliterated by surprise once the AIs coordinate an all-out assault.
In this scenario, human influence will decline, eventually quite far. Perhaps this soon takes us all the way to the situation you described in which humans will become like stray dogs or cats in our current world: utterly at the whim of more powerful beings who do not share their desires.
However, I think that scenario is only one possibility. Another possibility is that humans could enhance their own cognition to better keep up with the world. After all, we're talking about a scenario in which AIs are rapidly advancing technology and science. Could humans not share in some of that prosperity?
One more possibility is that, unlike cats and dogs, humans could continue to communicate legibly with the AIs and stay relevant for reasons of legal and cultural tradition, as well as some forms of trade. Our current institutions didn't descend from institutions constructed by stray cats and dogs. There was no stray animal civilization that we inherited our laws and traditions from. But perhaps if our institutions did originate in this way, then cats and dogs would hold a higher position in our society.
There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.
The confusion in your statement is caused by lumping all the members of the armed forces together under the term "U.S. military". Principally, a coup is an act of coordination.
Is it your contention that similar constraints will not apply to AIs?
When people talk about how "the AI" will launch a coup in the future, I think they're making essentially the same mistake you talk about here. They’re treating a potentially vast group of AI entities — like a billion copies of GPT-7 — as if they form a single, unified force, all working seamlessly toward one objective, as a monolithic agent. But just like with your description of human affairs, this view overlooks the coordination challenges that would naturally arise among such a massive number of entities. They’re imagining these AIs could bypass the complex logistics of organizing a coup, evading detection, and maintaining control after launching a war without facing any relevant obstacles or costs, even though humans routinely face these challenges amongst ourselves.
In these discussions, I think there's an implicit assumption that AIs would automatically operate outside the usual norms, laws, and social constraints that govern social behavior. The idea is that all the ordinary rules of society will simply stop applying, because we're talking about AIs.
Yet I think this simple idea is basically wrong, for essentially the same reasons you identified for human institutions.
Of course, AIs will be different in numerous ways from humans, and AIs will eventually be far smarter and more competent than humans. This matters. Because AIs will be very capable, it makes sense to think that artificial minds will one day hold the majority of wealth, power, and social status in our world. But these facts alone don't show that the usual constraints that prevent coups and revolutions will simply go away. Just because AIs are smart doesn't mean they'll necessarily use force and violently revolt to achieve their goals. Just like humans, they'll probably have other avenues available for pursuing their objectives.
Asteroid impact
Type of estimate: best model
Estimate: ~0.02% per decade.
Perhaps worth noting: this estimate seems too low to me over longer horizons than the next 10 years, given the potential for asteroid terrorism later this century. I'm significantly more worried about asteroids being directed towards Earth purposely than I am about natural asteroid paths.
That said, my guess is that purposeful asteroid deflection probably won't advance much in the next 10 years, at least without AGI. So 0.02% is still a reasonable estimate if we don't get accelerated technological development soon.
Does trade here just mean humans consuming, i.e., trading money for AI goods and services? That doesn't sound like trading in the usual sense, where there is a reciprocal exchange of goods and services.
Trade can involve anything that someone "owns", including their labor, their property, and any government welfare they receive. Retired people are generally characterized by trading their property and government welfare for goods and services, rather than primarily trading their labor. This is the basic picture I was trying to present.
How many 'different' AI individuals do you expect there to be?
I think the answer to this question depends on how we individuate AIs. I don't think most AIs will be as cleanly separable from each other as humans are, as most (non-robotic) AIs will lack bodies, and will be able to share information with each other more easily than humans can. It's a bit like asking how many "ant units" there are. There are many individual ants per colony, but each colony can be treated as a unit by itself. I suppose the real answer is that it depends on context and what you're trying to figure out by asking the question.
A recently commonly heard viewpoint on the development of AI states that AI will be economically impactful but will not upend the dominance of humans. Instead, AI and humans will flourish together, trading and cooperating with one another. This view is particularly popular with a certain kind of libertarian economist: Tyler Cowen, Matthew Barnett, Robin Hanson.
They share the curious conviction that the probability of AI-caused extinction p(Doom) is negligible. They base this on analogizing AI to previous technological transitions of humanity, like the industrial revolution or the development of new communication mediums. A core assumption/argument is that AI will not disempower humanity because AIs will respect the existing legal system, apparently because they can gain from trades with humans.
I think this summarizes my view quite poorly on a number of points. For example, I think that:
- AI is likely to be much more impactful than the development of new communication mediums. My default prediction is that AI will fundamentally increase the economic growth rate, rather than merely continuing the trend of the last few centuries.
- Biological humans are very unlikely to remain dominant in the future, pretty much no matter how this is measured. Instead, I predict that artificial minds and humans who upgrade their cognition will likely capture the majority of future wealth, political influence, and social power, with non-upgraded biological humans becoming an increasingly small force in the world over time.
- The legal system will likely evolve to cope with the challenges of incorporating and integrating non-human minds. This will likely involve a series of fundamental reforms, and will eventually look very different from the idea of "AIs will fit neatly into human social roles and obey human-controlled institutions indefinitely".
A more accurate description of my view is that humans will become economically obsolete after AGI, but this obsolescence will happen peacefully, without a massive genocide of biological humans. In the scenario I find most likely, humans will have time to prepare and adapt to the changing world, allowing us to secure a comfortable retirement, and/or join the AIs via mind uploading. Trade between AIs and humans will likely persist even into our retirement, but this doesn't mean that humans will own everything or control the whole legal system forever.
How could one control AI without access to the hardware/software? What would stop one with access to the hardware/software from controlling AI?
One would gain control by renting access to the model, i.e., the same way you can control what an instance of ChatGPT currently does. Here, I am referring to practical control over the actual behavior of the AI: what tasks it performs, how it is fine-tuned, and what inputs are fed into the model.
This is not too dissimilar from the high level of practical control one can exercise over, for example, an AWS server that they rent. While Amazon may host these servers, and thereby have the final say over what happens to the computer in the case of a conflict, the company is nonetheless dependent on customer revenue, which means it cannot feasibly use all its servers privately for its own internal purposes. As a consequence of this practical constraint, Amazon rents these servers out to the public and does not substantially limit user control over them, leaving end-users with broad discretion over what software ultimately runs on the machines.
In the future, these controls could also be determined by contracts and law, analogously to how one has control over their own bank account, despite the bank providing the service and hosting one's account. Then, even in the case of a conflict, the entity that merely hosts an AI may not have practical control over what happens, as they may have legal obligations to their customers that they cannot breach without incurring enormous costs to themselves. The AIs themselves may resist such a breach as well.
In practice, I agree these distinctions may be hard to recognize. There may be a case in which we thought that control over AI was decentralized, but in fact, power over the AIs was more concentrated or unified than we believed, as a consequence of centralization over the development or the provision of AI services. Indeed, perhaps real control was always in the hands of the government all along, as they could always choose to pass a law to nationalize AI, and take control away from the companies.
Nonetheless, these cases seem adequately described as a mistake in our perception of who was "really in control" rather than an error in the framework I provided, which was mostly an attempt to offer careful distinctions, rather than to predict how the future will go.
If one actor—such as OpenAI—can feasibly get away with seizing practical control over all the AIs they host without incurring high costs to the continuity of their business through loss of customers, then this indeed may surprise someone who assumed that OpenAI was operating under different constraints. However, this scenario still fits nicely within the framework as I've provided it, as it merely describes a case in which one was mistaken about the true degree of concentration along one axis, rather than one of my concepts intrinsically fitting reality poorly.
It is not always an expression of selfish motives when people take a stance against genocide. I would even go as far as saying that, in the majority of cases, people genuinely have non-selfish motives when taking that position. That is, they actually do care, to at least some degree, about the genocide, beyond the fact that signaling their concern helps them fit in with their friend group.
Nonetheless, and this is important: few people are willing to pay substantial selfish costs in order to prevent genocides that are socially distant from them.
The theory I am advancing here does not rest on the idea that people aren't genuine in their desire for faraway strangers to be better off. Rather, my theory is that people generally care little about such strangers, when helping those strangers trades off significantly against objectives that are closer to themselves, their family, friend group, and their own tribe.
Or, put another way, distant strangers usually get little weight in our utility function. Our family, and our own happiness, by contrast, usually get a much larger weight.
The core element of my theory concerns the amount that people care about themselves (and their family, friends, and tribe) versus other people, not whether they care about other people at all.
While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.
Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.
To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly distinguish all potential 'good' scenarios from all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.
This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.
It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.
I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.
As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.
My central thesis in regards to this post is simply: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.
However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios. This is the problem I was analyzing in my original comment.
While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.
And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.
The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you.
To be clear, I didn't mean to propose the specific mechanism of: if some behavior has a selfish consequence, then people will internalize that class of behaviors in moral terms rather than in purely practical terms. In other words, I am not saying that all relevant behaviors get internalized this way. I agree that only some behaviors are internalized by people in moral terms, and other behaviors do not get internalized in terms of moral principles in the way I described.
Admittedly, my statement was imprecise, but my intention in that quote was merely to convey that people tend to internalize certain behaviors in terms of moral principles, which explains the fact that people don't immediately abandon their habits when the environment suddenly shifts. However, I was silent on the question of which selfishly useful behaviors get internalized this way and which ones don't.
A good starting hypothesis is that people internalize certain behaviors in moral terms if they are taught to see those behaviors in moral terms. This ties into your theory that people "have an innate drive to notice, internalize, endorse, and take pride in following social norms". We are not taught to see "reaching into your wallet and shredding a dollar" as impinging on moral principles, so people don't tend to internalize the behavior that way. Yet, we are taught to see punching someone in the face as impinging on a moral principle. However, this hypothesis still leaves much to be explained, as it doesn't tell us which behaviors we will tend to be taught about in moral terms, and which ones we won't be taught in moral terms.
As a deeper, perhaps evolutionary explanation, I suspect that internalizing certain behaviors in moral terms helps make our commitments to other people more credible: if someone thinks you're not going to steal from them because you think it's genuinely wrong to steal, then they're more likely to trust you with their stuff than if they think you merely recognize the practical utility of not stealing from them. This explanation hints at the idea that we will tend to internalize certain behaviors in moral terms if those behaviors are both selfishly relevant, and important for earning trust among other agents in the world. This is my best guess at what explains the rough outlines of human morality that we see in most societies.
I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.
Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc.
In that sentence, I meant "largely selfish" as a stand-in for what I think humans-by-default care overwhelmingly about, which is something like "themselves, their family, their friends, and their tribe, in rough descending order of importance". The problem is that I am not aware of any word in the English language to describe people who have these desires, except perhaps the word "normal".
The word selfish usually denotes someone who is preoccupied with their own feelings, and is unconcerned with anyone else. We both agree that humans are not entirely selfish. Nonetheless, the opposite word, altruistic, often denotes someone who is preoccupied with the general social good, and who cares about strangers, not merely their own family and friend circles. This is especially the case in philosophical discussions in which one defines altruism in terms of impartial benevolence to all sentient life, which is extremely far from an accurate description of the typical human.
Humans exist on a spectrum between these two extremes. We are not perfectly selfish, nor are we perfectly altruistic. However, we are generally closer to the ideal of perfect selfishness than to the ideal of perfect altruism, given the fact that our own family, friend group, and tribe tends to be only a small part of the entire world. This is why I used the language of "largely selfish" rather than something else.
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.
I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly rebuts my narrow point.
This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem. I was making a point about this aspect of the post, rather than a more general point about how all of alignment is easy.
(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)
One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind.
It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here.
I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help evaluate, monitor, and control our models to provide them accurate rewards without getting hacked. I think your general pessimism about these things is fairly unwarranted, and my guess is that if you had made specific predictions about this question in the past, about what will happen prior to world-ending AI, these predictions would largely have fared worse than predictions from someone like Paul Christiano.
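As a toy sketch of the kind of setup I have in mind (my own illustration with made-up function names, not a claim about any existing system): the supervisor only lets an output take effect if a cheaper evaluation step approves it, relying on evaluation being easier than generation.

```python
from typing import Callable, Optional

def supervised_step(generate: Callable[[str], str],
                    evaluate: Callable[[str, str], bool],
                    task: str) -> Optional[str]:
    """Run one generation step, but only act on outputs the supervisor's
    (cheaper) evaluation approves. If evaluation is easier than generation,
    the supervisor can afford to check everything the model produces."""
    output = generate(task)
    if evaluate(task, output):
        return output   # output is allowed to take effect
    return None         # output is discarded; the supervisor stays in control

# Toy usage with placeholder generator and evaluator:
result = supervised_step(
    generate=lambda task: f"proposed answer to: {task}",
    evaluate=lambda task, out: "forbidden" not in out,
    task="summarize this document",
)
print(result)
```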
I’m still kinda confused. You wrote “But across almost all environments, you get positive feedback from being nice to people and thus feel or predict positive valence about these.” I want to translate that as: “All this talk of stabbing people in the back is irrelevant, because there is practically never a situation where it’s in somebody’s self-interest to act unkind and stab someone in the back. So (A) is really just fine!” I don’t think you’d endorse that, right? But it is a possible position—I tend to associate it with @Matthew Barnett. I agree that we should all keep in mind that it’s very possible for people to act kind for self-interested reasons. But I strongly don’t believe that (A) is sufficient for Safe & Beneficial AGI. But I think that you’re already in agreement with me about that, right?
Without carefully reading the above comment chain (forgive me if I need to understand the full discussion here before replying), I would like to clarify what my views are on this particular question, since I was referenced. I think that:
- It is possible to construct a stable social and legal environment in which it is in the selfish interests of almost everyone to act in such a way that brings about socially beneficial outcomes. A good example of such an environment is one where theft is illegal and in order to earn money, you have to get a job. This naturally incentivizes people to earn a living by helping others rather than stealing from others, which raises social welfare.
- It is not guaranteed that the existing environment will be such that self-interest is aligned with the general public interest. For example, if we make shoplifting de facto legal by never penalizing people who do it, this would impose large social costs on society.
- Our current environment has a mix of both of these good and bad features. However, on the whole, in modern prosperous societies during peacetime, it is generally in one's selfish interest to do things that help rather than hurt other people. This means that, even for psychopaths, it doesn't usually make selfish sense to go around hurting other people.
- Over time, in societies with well-functioning social and legal systems, most people learn that hurting other people doesn't actually help them selfishly. This causes them to adopt a general presumption against committing violence, theft, and other anti-social acts themselves, as a general principle. This general principle seems to be internalized in most people's minds as not merely "it is not in your selfish interest to hurt other people" but rather "it is morally wrong to hurt other people". In other words, people internalize their presumption as a moral principle, rather than as a purely practical principle. This is what prevents people from stabbing each other in the backs immediately once the environment changes.
- However, under different environmental conditions, given enough time, people will internalize different moral principles. For example, in an environment in which slaughtering animals becomes illegal and taboo, most people would probably end up internalizing the moral principle that it's wrong to hurt animals. Under our current environment, very few people internalize this moral principle, but that's mainly because slaughtering animals is currently legal, and widely accepted.
- This all implies that, in an important sense, human morality is not really "in our DNA", so to speak. Instead, we internalize certain moral principles because those moral principles encode facts about what type of conduct happens to be useful in the real world for achieving our largely selfish objectives. Whenever the environment shifts, so too does human morality. This distinguishes my view from the view that humans are "naturally good" or have empathy-by-default.
- Which is not to say that there isn't some sense in which human morality comes from human DNA. The causal mechanisms here are complicated. People vary in their capacity for empathy and the degree to which they internalize moral principles. However, I think in most contexts, it is more appropriate to look at people's environment as the determining factor of what morality they end up adopting, rather than thinking about what their genes are.
Competitive capitalism works well for humans who are stuck on a relatively even playing field, and who have some level of empathy and concern for each other.
I think this basically isn't true, especially the last part. It's not that humans don't have some level of empathy for each other; they do. I just don't think that's the reason why competitive capitalism works well for humans. I think the reason is instead because people have selfish interests in maintaining the system.
We don't let Jeff Bezos accumulate billions of dollars purely out of the kindness of our hearts. Indeed, it is often considered far kinder and more empathetic to confiscate his money and redistribute it to the poor. The problem with that approach is that abandoning property rights imposes costs on those who rely on the system to be reliable and predictable. If we were to establish a norm that allowed us to steal unlimited money from Jeff Bezos, many people would reason, "What prevents that norm from being used against me?"
The world pretty much runs on greed and selfishness, rather than kindness. Sure, humans aren't all selfish, we aren't all greedy. And few of us are downright evil. But those facts are not as important for explaining why our system works. Our system works because it's an efficient compromise among people who are largely selfish.
It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.
I can't tell whether this update to the post is addressed towards me. However, it seems possible that it is addressed towards me, since I wrote a post last year criticizing some of the ideas behind this post. In either case, whether it's addressed towards me or not, I'd like to reply to the update.
For the record, I want to definitively clarify that I never interpreted MIRI as arguing that it would be difficult to get a machine superintelligence to understand or predict human values. That was never my thesis, and I spent considerable effort clarifying the fact that this was not my thesis in my post, stating multiple times that I never thought MIRI predicted it would be hard to get an AI to understand human values.
My thesis instead was about a subtly different thing, which is easy to misinterpret if you aren't reading carefully. I was talking about something which Eliezer called the "value identification problem", and which had been referenced on Arbital and in other essays by MIRI, sometimes under other names. These other names included the "value specification" problem and the problem of "outer alignment" (at least in narrow contexts).
I didn't expect as much confusion at the time when I wrote the post, because I thought that clarifying what I meant, and repeatedly distinguishing it from other things I did not mean, would be sufficient to prevent rampant misinterpretation by so many people. However, evidently, such clarifications were insufficient, and I should have instead gone overboard in my precision and clarity. I think if I re-wrote the post now, I would try to provide like 5 different independent examples demonstrating how I was talking about a different thing than the problem of getting an AI to "understand" or "predict" human values.
At the very least, I can try now to give a bit more clarification about what I meant, just in case doing this one more time causes the concept to "click" in someone's mind:
Eliezer doesn't actually say this in the above post, but his general argument expressed here and elsewhere seems to be that the premise "human value is complex" implies the conclusion: "therefore, it's hard to get an AI to care about human value". At least, he seems to think that this premise makes this conclusion significantly more likely.[1]
This seems to be his argument, as otherwise it would be unclear why Eliezer would bring up "complexity of values" in the first place. If the complexity of values had nothing to do with the difficulty of getting an AI to care about human values, then it is baffling why he would bring it up. Clearly, there must be some connection, and I think I am interpreting the connection made here correctly.
However, suppose you have a function that inputs a state of the world and outputs a number corresponding to how "good" the state of the world is. And further suppose that this function is transparent, legible, and can actually be used in practice to reliably determine the value of a given world state. In other words, you can give the function a world state, and it will spit out a number, which reliably informs you about the value of the world state. I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).
In other words, if you have a solution to the value identification problem (i.e., you have the function that correctly and transparently rates the value of world states, as I just described), this almost completely sidesteps the problem that "human value is complex and therefore it's difficult to get an AI to care about human value". That's because, if we have a function that directly encodes human value, and can be simply referenced or directly inputted into a computer, then all the AI needs to do is care about maximizing that function rather than maximizing a more complex referent of "human values". The pointer to "this function" is clearly simple, and in any case, simpler than the idea of all of human value.
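To restate the claimed reduction slightly more formally (this gloss is mine, including the use of K(·) for description/algorithmic complexity; it is not notation from the original post): suppose we have a legible, directly computable value function

```latex
V : \mathcal{S} \to \mathbb{R}, \qquad V(s) \approx \text{how good world state } s \text{ is},
\qquad\text{with}\qquad
K(\text{``maximize } V\text{''}) \;\ll\; K(\text{human values}).
```

Then the target the AI needs to be pointed at is "maximize V", whose description length is roughly that of a pointer to V, far smaller than a direct encoding of human values. The remaining problem is getting the AI to actually optimize V rather than some proxy, which is the part this reduction does not address.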
(This was supposed to narrowly reply to MIRI, by the way. If I were writing a more general point about how LLMs were evidence that alignment might be easy, I would not have focused so heavily on the historical questions about what people said, and I would have instead made simpler points about how GPT-4 seems to straightforwardly try to do what you want, when you tell it to do things.)
My main point was that I thought recent progress in LLMs had demonstrated progress on the problem of building such a function, and thus on solving the value identification problem, and that this progress goes beyond the problem of getting an AI to understand or predict human values. For one thing, an AI that merely understands human values will not necessarily act as a transparent, legible function that will tell you the value of any outcome. By contrast, solving the value identification problem would give you such a function. This strongly distinguishes the two problems; they are not the same thing. I'd appreciate it if people stopped interpreting me as saying one thing when I clearly meant another, separate thing.
[1] This interpretation is supported by the following quote, on Arbital:
Complexity of value is a further idea above and beyond the orthogonality thesis which states that AIs don't automatically do the right thing and that we can have, e.g., paperclip maximizers. Even if we accept that paperclip maximizers are possible, and simple and nonforced, this wouldn't yet imply that it's very difficult to make AIs that do the right thing. If the right thing is very simple to encode - if there are value optimizers that are scarcely more complex than diamond maximizers - then it might not be especially hard to build a nice AI even if not all AIs are nice. Complexity of Value is the further proposition that says, no, this is forseeably quite hard - not because AIs have 'natural' anti-nice desires, but because niceness requires a lot of work to specify. [emphasis mine]
The point that a capabilities overhang might cause rapid progress in a short period of time has been made by a number of people without any connections to AI labs, including me, which should reduce your credence that it's "basically, total self-serving BS".
More to the point of Daniel Filan's original comment, I have criticized the Responsible Scaling Policy document in the past for failing to distinguish itself clearly from AI pause proposals. My guess is that your second and third points are likely mostly correct: AI labs think of an RSP as different from AI pause because it's lighter-touch, more narrowly targeted, and the RSP-triggered pause could be lifted more quickly, potentially minimally disrupting business operations.
There are a few key pieces of my model of the future that make me think humans can probably retain significant amounts of property, rather than having it suddenly stolen from them as the result of other agents in the world solving a specific coordination problem.
These pieces include:
1. Not all AIs in the future will be superintelligent. More intelligent models appear to require more computation to run. This is both because smarter models are larger (in parameter count) and because they use more inference-time compute (as with OpenAI's o1). To save computational costs, future AIs will likely be aggressively optimized to only be as intelligent as they need to be, and no more. This means that in the future, there will likely be a spectrum of AIs of varying levels of intelligence, some much smarter than humans, others only slightly smarter, and still others merely human-level.
2. As a result of the previous point, your statement that "ASIs produce all value in the economy" will likely not turn out correct. This is all highly uncertain, but I find it plausible that ASIs might not even be responsible for producing the majority of GDP in the future, given the possibility of a vastly more numerous population of less intelligent AIs that automate simpler tasks than the ones ASIs are best suited to do.
3. The coordination problem you described appears to rely on a natural boundary between the "humans that produce ~nothing" and "the AIs that produce everything". Without this natural boundary, there is no guarantee that AIs will solve the specific coordination problem you identified, rather than another coordination problem that hits a different group. Non-uploaded humans will differ from AIs by being biological and by being older, but they will not necessarily differ from AIs by being less intelligent.
4. Therefore, even if future agents decide to solve a specific coordination problem that allows them to steal wealth from unproductive agents, it is not clear that this will take the form of those agents specifically stealing from humans. One can imagine different boundaries that make more sense to coordinate around, such as "laborer vs. property owner", which is indeed a type of political conflict the world already has experience with.
5. In general, I expect legal systems to get more robust in the face of greater intelligence, rather than less robust, in the sense of being able to rely on legal systems when making contracts. I believe this partly as a result of the empirical fact that violent revolution and wealth appropriation appear to be correlated with less intelligence on a societal level. I concede that this point is not a very strong piece of evidence, however.
6. Building on (5), I generally expect AIs to calculate that it is not in their interest to expropriate wealth from other members of society, given how this could set a precedent for future wealth expropriation that comes back and hurts them selfishly. Even though many AIs will be smarter than humans, I don't think the mere fact that AIs will be very smart implies that expropriation becomes more rational.
7. I'm basically just not convinced by the arguments that all ASIs will cooperate almost perfectly as a unit, against the non-ASIs. This is partly for the reasons given by my previous points, but also partly because I think coordination is hard, and doesn't necessarily get much easier with more intelligence, especially in a vastly larger world. When there are quadrillions of AIs in the world, coordination might become very difficult, even with greater intelligence.
8. Even if AIs do not specifically value human welfare, that does not directly imply that human labor will have no value. As an analogy, Amish folks often sell novelty items to earn income. Consumers don't need to specifically care about Amish people in order for Amish people to receive a sufficient income for them to live on. Even if a tiny fraction of consumer demand in the future is for stuff produced by humans, that could ensure high human wages simply because the economy will be so large.
9. If ordinary capital is easier to scale than labor -- as it already is in our current world -- then human wages could remain high indefinitely simply because we will live in a capital-rich, labor-poor world. The arguments about human wages falling to subsistence level after AI tend to rely on the idea that AIs will be just as easy to scale as ordinary capital, which could easily turn out false as a consequence of (1) laws that hinder the creation of new AIs without proper permitting, (2) inherent difficulties with AI alignment, or (3) strong coordination that otherwise prevents Malthusian growth in the AI population. (A textbook illustration of the wage mechanism here follows after this list.)
10. This might be the most important point on my list, despite saying it last, but I think humans will likely be able to eventually upgrade their intelligence, better allowing them to "keep up" with the state of the world in the future.
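As a generic illustration of the wage mechanism in point 9 (a textbook Cobb-Douglas sketch I'm adding, not an argument from the original discussion), write output as a function of capital K and labor L:

```latex
Y = K^{\alpha} L^{1-\alpha}, \qquad
w = \frac{\partial Y}{\partial L} = (1-\alpha)\left(\frac{K}{L}\right)^{\alpha}
```

If capital K scales up faster than the effective labor supply L, the ratio K/L rises and the competitive wage w rises with it. The subsistence-wage scenarios correspond, roughly, to the case where AI labor can be added to L as cheaply as capital is added to K, so that K/L, and hence w, stops rising.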
Can you be more clear about what you were asking in your initial comment?
I don't think my scenario depends on the assumption that the preferences of a consumer are a given to the AI. Why would it?
Do you mean that I am assuming AIs cannot have their preferences modified, i.e., that we cannot solve AI alignment? I am not assuming that; at least, I'm not trying to assume that. I think AI alignment might be easy, and it is at least theoretically possible to modify an AI's preferences to be whatever one chooses.
If AI alignment is hard, then creating AIs is more comparable to creating children than creating a tool, in the sense that we have some control over their environment, but we have little control over what they end up ultimately preferring. Biology fixes a lot of innate preferences, such as preferences over thermal regulation of the body, preferences against pain, and preferences for human interaction. AI could be like that too, at least in an abstract sense. Standard economic models seem perfectly able to cope with this state of affairs, as it is the default state of affairs that we already live with.
On the other hand, if AI preferences can be modified into whatever shape we'd like, then AIs will presumably take on the preferences of their designers or owners (if AIs are owned by other agents). In that case, I think economic models can handle AI agents fine: you can essentially model them as extensions of other agents, whose preferences are more-or-less fixed themselves.