While I did agree that Linch's comment reasonably accurately summarized my post, I don't think a large part of my post was about the idea that we should now think that human values are much simpler than Yudkowsky portrayed them to be. Instead, I believe this section from Linch's comment does a better job of conveying what I intended to be the main point:
- Suppose in 2000 you were told that a 100-line Python program (that doesn't abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren't actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
- In such a world, if inner alignment is solved, you can "just" train a superintelligent AI to "optimize for the results of that Python program" and you'd get a superintelligent AI with human values.
- Notably, alignment isn't solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
- Well, in 2023 we have that Python program, with a few relaxations:
- The answer isn't embedded in 100 lines of Python, but in a subset of the weights of GPT-4
- Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
- What we have now isn't a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
The primary point I intended to emphasize is not that human values are fundamentally simple, but rather that we now have something else important: an explicit and cheaply computable representation of human values that can be directly utilized in AI development. This is a major step forward because it allows us to incorporate these values into programs in a way that provides clear and accurate feedback during processes like RLHF. This explicitness and legibility are critical for designing aligned AI systems, as they enable developers to work with a tangible and faithful specification of human values rather than relying on poor proxies that clearly do not track the full breadth and depth of what humans care about.
The fact that the underlying values may be relatively simple is less important than the fact that we can now operationalize them, in a way that reflects human judgement fairly well. Having a specification that is clear, structured, and usable means we are better equipped to train AI systems to share those values. This representation serves as a foundation for ensuring that the AI optimizes for what we actually care about, rather than inadvertently optimizing for proxies or unrelated objectives that merely correlate with training signals. In essence, the true significance lies in having a practical, actionable specification of human values that can actively guide the creation of future AI, not just in observing that these values may be less complex than previously assumed.
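To make this concrete, here is a minimal, purely illustrative sketch of the kind of explicit, cheaply computable representation I have in mind: a language model wrapped in a function that scores outcomes, whose output could then be used as a feedback signal during something like RLHF. The names (`query_llm`, `human_value_score`) are hypothetical placeholders, not a reference to any real system or API:

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for access to a capable language model
    (e.g., an API call or a locally hosted model); not a real library function."""
    raise NotImplementedError

def human_value_score(outcome_description: str) -> float:
    """Ask the model to rate how good a described outcome is, returning a scalar score."""
    prompt = (
        "On a scale from 0 to 10, how good is the following outcome "
        "according to ordinary human values? Reply with a single number.\n\n"
        f"Outcome: {outcome_description}"
    )
    return float(query_llm(prompt).strip())

# During RLHF-style training, this scalar could serve as (part of) the reward
# signal used to give feedback on a model's proposed actions or outputs.
```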
Similar constraints may apply to AIs unless one gets much smarter much more quickly, as you say.
I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.
There are alternative means of accumulating power than taking everything by force. AIs could get rights and then work within our existing systems to achieve their objectives. Our institutions could continuously evolve with increasing AI presence, becoming more directed by AIs with time.
What I'm objecting to is the inevitability of a sudden collapse when "the AI" decides to take over in an untimely coup. I'm proposing that there could just be a smoother, albeit rapid transition to a post-AGI world. Our institutions and laws could simply adjust to incorporate AIs into the system, rather than being obliterated by surprise once the AIs coordinate an all-out assault.
In this scenario, human influence will decline, eventually quite far. Perhaps this soon takes us all the way to the situation you described in which humans will become like stray dogs or cats in our current world: utterly at the whim of more powerful beings who do not share their desires.
However, I think that scenario is only one possibility. Another possibility is that humans could enhance their own cognition to better keep up with the world. After all, we're talking about a scenario in which AIs are rapidly advancing technology and science. Could humans not share in some of that prosperity?
One more possibility is that, unlike cats and dogs, humans could continue to communicate legibly with the AIs and stay relevant for reasons of legal and cultural tradition, as well as some forms of trade. Our current institutions didn't descend from institutions constructed by stray cats and dogs. There was no stray animal civilization that we inherited our laws and traditions from. But perhaps if our institutions did originate in this way, then cats and dogs would hold a higher position in our society.
There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.
The confusion in your statement is caused by blocking up all the members of the armed forces in the term "U.S. military". Principally, a coup is an act of coordination.
Is it your contention that similar constraints will not apply to AIs?
When people talk about how "the AI" will launch a coup in the future, I think they're making essentially the same mistake you talk about here. They’re treating a potentially vast group of AI entities — like a billion copies of GPT-7 — as if they form a single, unified force, all working seamlessly toward one objective, as a monolithic agent. But just like with your description of human affairs, this view overlooks the coordination challenges that would naturally arise among such a massive number of entities. They’re imagining these AIs could bypass the complex logistics of organizing a coup, evading detection, and maintaining control after launching a war without facing any relevant obstacles or costs, even though humans routinely face these challenges amongst ourselves.
In these discussions, I think there's an implicit assumption that AIs would automatically operate outside the usual norms, laws, and social constraints that govern social behavior. The idea is that all the ordinary rules of society will simply stop applying, because we're talking about AIs.
Yet I think this simple idea is basically wrong, for essentially the same reasons you identified for human institutions.
Of course, AIs will be different in numerous ways from humans, and AIs will eventually be far smarter and more competent than humans. This matters. Because AIs will be very capable, it makes sense to think that artificial minds will one day hold the majority of wealth, power, and social status in our world. But these facts alone don't show that the usual constraints that prevent coups and revolutions will simply go away. Just because AIs are smart doesn't mean they'll necessarily use force and violently revolt to achieve their goals. Just like humans, they'll probably have other avenues available for pursuing their objectives.
Asteroid impact
Type of estimate: best model
Estimate: ~0.02% per decade.
Perhaps worth noting: this estimate seems too low to me over longer horizons than the next 10 years, given the potential for asteroid terrorism later this century. I'm significantly more worried about asteroids being directed towards Earth purposely than I am about natural asteroid paths.
That said, my guess is that purposeful asteroid deflection probably won't advance much in the next 10 years, at least without AGI. So 0.02% is still a reasonable estimate if we don't get accelerated technological development soon.
Does trade here just mean humans consuming, i.e., trading money for AI goods and services? That doesn't sound like trading in the usual sense where it is a reciprocal exchange of goods and services.
Trade can involve anything that someone "owns", including their labor, their property, and their government welfare. Retired people are generally characterized by trading their property and government welfare for goods and services, rather than primarily trading their labor. This is the basic picture I was trying to present.
How many 'different' AI individuals do you expect there to be?
I think the answer to this question depends on how we individuate AIs. I don't think most AIs will be as cleanly separable from each other as humans are, as most (non-robotic) AIs will lack bodies, and will be able to share information with each other more easily than humans can. It's a bit like asking how many "ant units" there are. There are many individual ants per colony, but each colony can be treated as a unit by itself. I suppose the real answer is that it depends on context and what you're trying to figure out by asking the question.
A commonly heard recent viewpoint on the development of AI states that AI will be economically impactful but will not upend the dominance of humans. Instead, AI and humans will flourish together, trading and cooperating with one another. This view is particularly popular with a certain kind of libertarian economist: Tyler Cowen, Matthew Barnett, Robin Hanson.
They share the curious conviction that the probability of AI-caused extinction p(Doom) is negligible. They base this on analogizing AI with previous technological transitions of humanity, like the industrial revolution or the development of new communication mediums. A core assumption/argument is that AI will not disempower humanity because they will respect the existing legal system, apparently because they can gain from trades with humans.
I think this summarizes my view quite poorly on a number of points. For example, I think that:
- AI is likely to be much more impactful than the development of new communication mediums. My default prediction is that AI will fundamentally increase the economic growth rate, rather than merely continuing the trend of the last few centuries.
- Biological humans are very unlikely to remain dominant in the future, pretty much no matter how this is measured. Instead, I predict that artificial minds and humans who upgrade their cognition will likely capture the majority of future wealth, political influence, and social power, with non-upgraded biological humans becoming an increasingly small force in the world over time.
- The legal system will likely evolve to cope with the challenges of incorporating and integrating non-human minds. This will likely involve a series of fundamental reforms, and will eventually look very different from the idea of "AIs will fit neatly into human social roles and obey human-controlled institutions indefinitely".
A more accurate description of my view is that humans will become economically obsolete after AGI, but this obsolescence will happen peacefully, without a massive genocide of biological humans. In the scenario I find most likely, humans will have time to prepare and adapt to the changing world, allowing us to secure a comfortable retirement, and/or join the AIs via mind uploading. Trade between AIs and humans will likely persist even into our retirement, but this doesn't mean that humans will own everything or control the whole legal system forever.
How could one control AI without access to the hardware/software? What would stop one with access to the hardware/software from controlling AI?
One would gain control by renting access to the model, i.e., in the same way you can control what an instance of ChatGPT currently does. Here, I am referring to practical control over the actual behavior of the AI: determining what the AI does, such as what tasks it performs, how it is fine-tuned, or what inputs are fed into the model.
This is not too dissimilar from the high level of practical control one can exercise over, for example, an AWS server that one rents. While Amazon may host these servers, and thereby have the final say over what happens to the computer in the case of a conflict, the company is nonetheless inherently dependent on customer revenue, which means it cannot feasibly use all of its servers privately for its own internal purposes. As a consequence of this practical constraint, Amazon rents these servers out to the public and does not substantially limit user control over them, leaving end-users with substantial discretion over what software is ultimately run.
In the future, these controls could also be determined by contracts and law, analogously to how one has control over their own bank account, despite the bank providing the service and hosting one's account. Then, even in the case of a conflict, the entity that merely hosts an AI may not have practical control over what happens, as they may have legal obligations to their customers that they cannot breach without incurring enormous costs to themselves. The AIs themselves may resist such a breach as well.
In practice, I agree these distinctions may be hard to recognize. There may be a case in which we thought that control over AI was decentralized, but in fact, power over the AIs was more concentrated or unified than we believed, as a consequence of centralization in the development or provision of AI services. Indeed, perhaps real control was in the hands of the government all along, as it could always choose to pass a law to nationalize AI and take control away from the companies.
Nonetheless, these cases seem adequately described as a mistake in our perception of who was "really in control" rather than an error in the framework I provided, which was mostly an attempt to offer careful distinctions, rather than to predict how the future will go.
If one actor—such as OpenAI—can feasibly get away with seizing practical control over all the AIs they host without incurring high costs to the continuity of their business through loss of customers, then this indeed may surprise someone who assumed that OpenAI was operating under different constraints. However, this scenario still fits nicely within the framework as I've provided it, as it merely describes a case in which one was mistaken about the true degree of concentration along one axis, rather than one of my concepts intrinsically fitting reality poorly.
It is not always an expression of selfish motives when people take a stance against genocide. I would even go as far as saying that, in the majority of cases, people genuinely have non-selfish motives when taking that position. That is, they actually do care, to at least some degree, about the genocide, beyond the fact that signaling their concern helps them fit in with their friend group.
Nonetheless, and this is important: few people are willing to pay substantial selfish costs in order to prevent genocides that are socially distant from them.
The theory I am advancing here does not rest on the idea that people aren't genuine in their desire for faraway strangers to be better off. Rather, my theory is that people generally care little about such strangers, when helping those strangers trades off significantly against objectives that are closer to themselves, their family, friend group, and their own tribe.
Or, put another way, distant strangers usually get little weight in our utility function. Our family, and our own happiness, by contrast, usually get a much larger weight.
The core element of my theory concerns the amount that people care about themselves (and their family, friends, and tribe) versus other people, not whether they care about other people at all.
While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.
Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.
To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.
This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.
It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.
I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.
As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.
My central thesis in regards to this post is simply: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.
However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios. This is the problem I was analyzing in my original comment.
While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.
And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.
The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you.
To be clear, I didn't mean to propose the specific mechanism of: if some behavior has a selfish consequence, then people will internalize that class of behaviors in moral terms rather than in purely practical terms. In other words, I am not saying that all relevant behaviors get internalized this way. I agree that only some behaviors are internalized by people in moral terms, and other behaviors do not get internalized in terms of moral principles in the way I described.
Admittedly, my statement was imprecise, but my intention in that quote was merely to convey that people tend to internalize certain behaviors in terms of moral principles, which explains the fact that people don't immediately abandon their habits when the environment suddenly shifts. However, I was silent on the question of which selfishly useful behaviors get internalized this way and which ones don't.
A good starting hypothesis is that people internalize certain behaviors in moral terms if they are taught to see those behaviors in moral terms. This ties into your theory that people "have an innate drive to notice, internalize, endorse, and take pride in following social norms". We are not taught to see "reaching into your wallet and shredding a dollar" as impinging on moral principles, so people don't tend to internalize the behavior that way. Yet, we are taught to see punching someone in the face as impinging on a moral principle. However, this hypothesis still leaves much to be explained, as it doesn't tell us which behaviors we will tend to be taught about in moral terms, and which ones we won't be taught in moral terms.
As a deeper, perhaps evolutionary explanation, I suspect that internalizing certain behaviors in moral terms helps make our commitments to other people more credible: if someone thinks you're not going to steal from them because you think it's genuinely wrong to steal, then they're more likely to trust you with their stuff than if they think you merely recognize the practical utility of not stealing from them. This explanation hints at the idea that we will tend to internalize certain behaviors in moral terms if those behaviors are both selfishly relevant, and important for earning trust among other agents in the world. This is my best guess at what explains the rough outlines of human morality that we see in most societies.
I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.
Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc.
In that sentence, I meant "largely selfish" as a stand-in for what I think humans-by-default care overwhelmingly about, which is something like "themselves, their family, their friends, and their tribe, in rough descending order of importance". The problem is that I am not aware of any word in the English language to describe people who have these desires, except perhaps the word "normal".
The word selfish usually denotes someone who is preoccupied with their own feelings, and is unconcerned with anyone else. We both agree that humans are not entirely selfish. Nonetheless, the opposite word, altruistic, often denotes someone who is preoccupied with the general social good, and who cares about strangers, not merely their own family and friend circles. This is especially the case in philosophical discussions in which one defines altruism in terms of impartial benevolence to all sentient life, which is extremely far from an accurate description of the typical human.
Humans exist on a spectrum between these two extremes. We are not perfectly selfish, nor are we perfectly altruistic. However, we are generally closer to the ideal of perfect selfishness than to the ideal of perfect altruism, given the fact that our own family, friend group, and tribe tends to be only a small part of the entire world. This is why I used the language of "largely selfish" rather than something else.
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.
I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly rebuts my narrow point.
This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem. I was making a point about this aspect of the post, rather than a more general point about how all of alignment is easy.
(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)
One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind.
It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here.
I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help evaluate, monitor, and control our models to provide them accurate rewards without getting hacked. I think your general pessimism about these things is fairly unwarranted, and my guess is that if you had made specific predictions about this question in the past, about what will happen prior to world-ending AI, these predictions would largely have fared worse than predictions from someone like Paul Christiano.
I’m still kinda confused. You wrote “But across almost all environments, you get positive feedback from being nice to people and thus feel or predict positive valence about these.” I want to translate that as: “All this talk of stabbing people in the back is irrelevant, because there is practically never a situation where it’s in somebody’s self-interest to act unkind and stab someone in the back. So (A) is really just fine!” I don’t think you’d endorse that, right? But it is a possible position—I tend to associate it with @Matthew Barnett. I agree that we should all keep in mind that it’s very possible for people to act kind for self-interested reasons. But I strongly don’t believe that (A) is sufficient for Safe & Beneficial AGI. But I think that you’re already in agreement with me about that, right?
Without carefully reading the above comment chain (forgive me if I need to understand the full discussion here before replying), I would like to clarify what my views are on this particular question, since I was referenced. I think that:
- It is possible to construct a stable social and legal environment in which it is in the selfish interests of almost everyone to act in such a way that brings about socially beneficial outcomes. A good example of such an environment is one where theft is illegal and in order to earn money, you have to get a job. This naturally incentivizes people to earn a living by helping others rather than stealing from others, which raises social welfare.
- It is not guaranteed that the existing environment will be such that self-interest is aligned with the general public interest. For example, if we make shoplifting de facto legal by never penalizing people who do it, this would impose large social costs on society.
- Our current environment has a mix of both of these good and bad features. However, on the whole, in modern prosperous societies during peacetime, it is generally in one's selfish interest to do things that help rather than hurt other people. This means that, even for psychopaths, it doesn't usually make selfish sense to go around hurting other people.
- Over time, in societies with well-functioning social and legal systems, most people learn that hurting other people doesn't actually help them selfishly. This causes them to adopt a general presumption against committing violence, theft, and other anti-social acts themselves, as a general principle. This general principle seems to be internalized in most people's minds as not merely "it is not in your selfish interest to hurt other people" but rather "it is morally wrong to hurt other people". In other words, people internalize their presumption as a moral principle, rather than as a purely practical principle. This is what prevents people from stabbing each other in the backs immediately once the environment changes.
- However, under different environmental conditions, given enough time, people will internalize different moral principles. For example, in an environment in which slaughtering animals becomes illegal and taboo, most people would probably end up internalizing the moral principle that it's wrong to hurt animals. Under our current environment, very few people internalize this moral principle, but that's mainly because slaughtering animals is currently legal, and widely accepted.
- This all implies that, in an important sense, human morality is not really "in our DNA", so to speak. Instead, we internalize certain moral principles because those moral principles encode facts about what type of conduct happens to be useful in the real world for achieving our largely selfish objectives. Whenever the environment shifts, so too does human morality. This distinguishes my view from the view that humans are "naturally good" or have empathy-by-default.
- Which is not to say that there isn't some sense in which human morality comes from human DNA. The causal mechanisms here are complicated. People vary in their capacity for empathy and the degree to which they internalize moral principles. However, I think in most contexts, it is more appropriate to look at people's environment as the determining factor of what morality they end up adopting, rather than thinking about what their genes are.
Competitive capitalism works well for humans who are stuck on a relatively even playing field, and who have some level of empathy and concern for each other.
I think this basically isn't true, especially the last part. It's not that humans don't have some level of empathy for each other; they do. I just don't think that's the reason why competitive capitalism works well for humans. I think the reason is instead because people have selfish interests in maintaining the system.
We don't let Jeff Bezos accumulate billions of dollars purely out of the kindness of our hearts. Indeed, it is often considered far kinder and more empathetic to confiscate his money and redistribute it to the poor. The problem with that approach is that abandoning property rights imposes costs on those who rely on the system to be reliable and predictable. If we were to establish a norm that allowed us to steal unlimited money from Jeff Bezos, many people would reason, "What prevents that norm from being used against me?"
The world pretty much runs on greed and selfishness, rather than kindness. Sure, humans aren't all selfish, we aren't all greedy. And few of us are downright evil. But those facts are not as important for explaining why our system works. Our system works because it's an efficient compromise among people who are largely selfish.
It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.
I can't tell whether this update to the post is addressed towards me. It seems possible that it is, since I wrote a post last year criticizing some of the ideas behind this post. In either case, I'd like to reply to the update.
For the record, I want to definitively clarify that I never interpreted MIRI as arguing that it would be difficult to get a machine superintelligence to understand or predict human values. That was never my thesis, and I spent considerable effort clarifying the fact that this was not my thesis in my post, stating multiple times that I never thought MIRI predicted it would be hard to get an AI to understand human values.
My thesis instead was about a subtly different thing, which is easy to misinterpret if you aren't reading carefully. I was talking about something which Eliezer called the "value identification problem", and which had been referenced on Arbital and in other essays by MIRI, sometimes under different names. These other names included the "value specification" problem and the problem of "outer alignment" (at least in narrow contexts).
I didn't expect as much confusion at the time when I wrote the post, because I thought that clarifying what I meant, and repeatedly distinguishing it from other things that I did not mean, would be sufficient to prevent rampant misinterpretation by so many people. However, evidently, such clarifications were insufficient, and I should have instead gone overboard in my precision and clarity. I think if I re-wrote the post now, I would try to provide like 5 different independent examples demonstrating how I was talking about a different thing than the problem of getting an AI to "understand" or "predict" human values.
At the very least, I can try now to give a bit more clarification about what I meant, just in case doing this one more time causes the concept to "click" in someone's mind:
Eliezer doesn't actually say this in the above post, but his general argument expressed here and elsewhere seems to be that the premise "human value is complex" implies the conclusion: "therefore, it's hard to get an AI to care about human value". At least, he seems to think that this premise makes this conclusion significantly more likely.[1]
This seems to be his argument, as otherwise it would be unclear why Eliezer would bring up "complexity of values" in the first place. If the complexity of values had nothing to do with the difficulty of getting an AI to care about human values, then it is baffling why he would bring it up. Clearly, there must be some connection, and I think I am interpreting the connection made here correctly.
However, suppose you have a function that inputs a state of the world and outputs a number corresponding to how "good" the state of the world is. And further suppose that this function is transparent, legible, and can actually be used in practice to reliably determine the value of a given world state. In other words, you can give the function a world state, and it will spit out a number, which reliably informs you about the value of the world state. I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).
In other words, if you have a solution to the value identification problem (i.e., you have the function that correctly and transparently rates the value of world states, as I just described), this almost completely sidesteps the problem that "human value is complex and therefore it's difficult to get an AI to care about human value". That's because, if we have a function that directly encodes human value, and can be simply referenced or directly inputted into a computer, then all the AI needs to do is care about maximizing that function rather than maximizing a more complex referent of "human values". The pointer to "this function" is clearly simple, and in any case, simpler than the idea of all of human value.
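As a purely schematic illustration of this reduction (the names `V` and `choose_best_outcome` are mine, and this says nothing about the still-unsolved problem of getting the AI to actually care about `V`):

```python
from typing import Callable, Iterable

def choose_best_outcome(V: Callable[[str], float],
                        candidate_outcomes: Iterable[str]) -> str:
    """Given a solved value identification function V (world state -> goodness score),
    the objective 'pick whatever V rates highest' is simple to state: the pointer
    to V is short, even if the values V encodes are complex."""
    return max(candidate_outcomes, key=V)
```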
(This was supposed to narrowly reply to MIRI, by the way. If I were writing a more general point about how LLMs were evidence that alignment might be easy, I would not have focused so heavily on the historical questions about what people said, and I would have instead made simpler points about how GPT-4 seems to straightforwardly try to do what you want, when you tell it to do things.)
My main point was that I thought recent progress in LLMs had demonstrated progress at the problem of building such a function and solving the value identification problem, and that this progress goes beyond the problem of getting an AI to understand or predict human values. For one thing, an AI that merely understands human values will not necessarily act as a transparent, legible function that will tell you the value of any outcome. By contrast, solving the value identification problem would give you such a function. This strongly distinguishes the two problems. These problems are not the same thing. I'd appreciate it if people stopped interpreting me as saying one thing when I clearly meant another, separate thing.
[1] This interpretation is supported by the following quote, on Arbital:
Complexity of value is a further idea above and beyond the orthogonality thesis which states that AIs don't automatically do the right thing and that we can have, e.g., paperclip maximizers. Even if we accept that paperclip maximizers are possible, and simple and nonforced, this wouldn't yet imply that it's very difficult to make AIs that do the right thing. If the right thing is very simple to encode - if there are value optimizers that are scarcely more complex than diamond maximizers - then it might not be especially hard to build a nice AI even if not all AIs are nice. Complexity of Value is the further proposition that says, no, this is forseeably quite hard - not because AIs have 'natural' anti-nice desires, but because niceness requires a lot of work to specify. [emphasis mine]
The point that a capabilities overhang might cause rapid progress in a short period of time has been made by a number of people without any connections to AI labs, including me, which should reduce your credence that it's "basically, total self-serving BS".
More to the point of Daniel Filan's original comment, I have criticized the Responsible Scaling Policy document in the past for failing to distinguish itself clearly from AI pause proposals. My guess is that your second and third points are likely mostly correct: AI labs think of an RSP as different from AI pause because it's lighter-touch, more narrowly targeted, and the RSP-triggered pause could be lifted more quickly, potentially minimally disrupting business operations.
There are a few key pieces of my model of the future that make me think humans can probably retain significant amounts of property, rather than having it suddenly stolen from them as the result of other agents in the world solving a specific coordination problem.
These pieces include:
1. Not all AIs in the future will be superintelligent. More intelligent models appear to require more computation to run, both because smarter models are larger (in parameter count) and because they use more inference-time computation (as with OpenAI's o1). To save computational costs, future AIs will likely be aggressively optimized to be only as intelligent as they need to be, and no more. This means that in the future, there will likely be a spectrum of AIs of varying levels of intelligence, some much smarter than humans, others only slightly smarter, and still others merely human-level.
2. As a result of the previous point, your statement that "ASIs produce all value in the economy" will likely not turn out to be correct. This is all highly uncertain, but I find it plausible that ASIs might not even be responsible for producing the majority of GDP in the future, given the possibility of a vastly more numerous population of less intelligent AIs that automate simpler tasks than the ones ASIs are best suited to do.
3. The coordination problem you described appears to rely on a natural boundary between the "humans that produce ~nothing" and "the AIs that produce everything". Without this natural boundary, there is no guarantee that AIs will solve the specific coordination problem you identified, rather than another coordination problem that hits a different group. Non-uploaded humans will differ from AIs by being biological and by being older, but they will not necessarily differ from AIs by being less intelligent.
4. Therefore, even if future agents decide to solve a specific coordination problem that allows them to steal wealth from unproductive agents, it is not clear that this will take the form of those agents specifically stealing from humans. One can imagine different boundaries that make more sense to coordinate around, such as "laborer vs. property owner", which is indeed a type of political conflict the world already has experience with.
5. In general, I expect legal systems to get more robust in the face of greater intelligence, rather than less robust, in the sense that one can rely on legal systems when making contracts. I believe this partly as a result of the empirical fact that violent revolution and wealth appropriation appear to be correlated with less intelligence on a societal level. I concede that this point is not a very strong piece of evidence, however.
6. Building on (5), I generally expect AIs to calculate that it is not in their interest to expropriate wealth from other members of society, given how this could set a precedent for future wealth expropriation that comes back and hurts them selfishly. Even though many AIs will be smarter than humans, I don't think the mere fact that AIs will be very smart implies that expropriation becomes more rational.
7. I'm basically just not convinced by the arguments that all ASIs will cooperate almost perfectly as a unit, against the non-ASIs. This is partly for the reasons given by my previous points, but also partly because I think coordination is hard, and doesn't necessarily get much easier with more intelligence, especially in a vastly larger world. When there are quadrillions of AIs in the world, coordination might become very difficult, even with greater intelligence.
8. Even if AIs do not specifically value human welfare, that does not directly imply that human labor will have no value. As an analogy, Amish folks often sell novelty items to earn income. Consumers don't need to specifically care about Amish people in order for Amish people to receive a sufficient income for them to live on. Even if a tiny fraction of consumer demand in the future is for stuff produced by humans, that could ensure high human wages simply because the economy will be so large.
9. If ordinary capital is easier to scale than labor -- as it already is in our current world -- then human wages could remain high indefinitely simply because we will live in a capital-rich, labor-poor world. The arguments about human wages falling to subsistence level after AI tend to rely on the idea that AIs will be just as easy to scale as ordinary capital, which could easily turn out false as a consequence of (1) laws that hinder the creation of new AIs without proper permitting, (2) inherent difficulties with AI alignment, or (3) strong coordination that otherwise prevents Malthusian growth in the AI population.
10. This might be the most important point on my list, despite my saying it last: I think humans will likely be able to eventually upgrade their intelligence, better allowing them to "keep up" with the state of the world in the future.
Can you be more clear about what you were asking in your initial comment?
I don't think my scenario depends on the assumption that the preferences of a consumer are a given to the AI. Why would it?
Do you mean that I am assuming AIs cannot have their preferences modified, i.e., that we cannot solve AI alignment? I am not assuming that; at least, I'm not trying to assume that. I think AI alignment might be easy, and it is at least theoretically possible to modify an AI's preferences to be whatever one chooses.
If AI alignment is hard, then creating AIs is more comparable to creating children than creating a tool, in the sense that we have some control over their environment, but we have little control over what they end up ultimately preferring. Biology fixes a lot of innate preferences, such as preferences over thermal regulation of the body, preferences against pain, and preferences for human interaction. AI could be like that too, at least in an abstract sense. Standard economic models seem perfectly able to cope with this state of affairs, as it is the default state of affairs that we already live with.
On the other hand, if AI preferences can be modified into whatever shape we'd like, then these preferences will presumably take on the preferences of AI designers or AI owners (if AIs are owned by other agents). In that case, I think economic models can handle AI agents fine: you can essentially model them as extensions of other agents, whose preferences are more-or-less fixed themselves.
It is a wonderfully American notion that an "existing system of law and property rights" will constrain the power of Gods. But why exactly? They can make contracts? And who enforces these contracts? Can you answer this without begging the question? Are judicial systems particularly unhackable? Are humans?
To be clear, my prediction is not that AIs will be constrained by human legal systems that are enforced by humans. I'd claim rather that future legal systems will be enforced by AIs, and that these legal systems will descend from our current legal systems, and thus will inherit many of their properties. This does not mean that I think everything about our laws will remain the same in the face of superintelligence, or that our legal system will not evolve at all.
It does not seem unrealistic to me to assume that powerful AIs could be constrained by other powerful AIs. Humans currently constrain each other; why couldn't AIs constrain each other?
"Existing system of law and property rights" looks like a "thought-terminating cliché" to me.
By contrast, I suspect the words "superintelligence" and "gods" have become thought-terminating cliches on LessWrong.
Any discussion about the realistic implications of AI must contend with the fact that AIs will be real physical beings with genuine limitations, not omnipotent deities with unlimited powers to command and control the world. They may be extremely clever, their minds may be vast, they may be able to process far more information than we can comprehend, but they will not be gods.
I think it is too easy to avoid the discussion of what AIs may or may not do, realistically, by assuming that AIs will break every rule in the book, and assume the form of an inherently uncontrollable entity with no relevant constraints on its behavior (except for physical constraints, like the speed of light). We should probably resist the temptation to talk about AI like this.
The claim at hand, that we have both read Eliezer repeatedly make[1], is that there is a sufficient level of intelligence and a sufficient power of nanotechnology that within days or weeks a system could design and innocuously build a nanotechnology factory out of simple biological materials that goes on to build either a disease or cellular-sized drones that would quickly cause an extinction event — perhaps a virus that spreads quickly around the world with a replication rate that allows it to spread globally before any symptoms are found, or a series of diamond-based machines that can enter the bloodstream and explode on a coordinated signal. This is such a situation where no response from human civilization would occur, and the argument that an AI ought to be worried about people with guns and bombs coming for its data centers has no relevance.
Sure, I have also read Eliezer repeatedly make that claim. On the meta level, I don't think the fact that he has written about this specific scenario fully makes up for the vagueness in his object-level essay above. But I'm also happy to briefly reply on the object level on this particular narrow point:
In short, I interpret Eliezer to be making a mistake by assuming that the world will not adapt to anticipated developments in nanotechnology and AI in order to protect against various attacks that we can easily see coming, prior to the time that AIs will be capable of accomplishing these incredible feats. By the time AIs are capable of developing such advanced molecular nanotech, I think the world will have already been dramatically transformed by prior waves of technologies, many of which by themselves could importantly change the gameboard, and change what it means for humans to have defenses against advanced nanotech to begin with.
As a concrete example, I think it's fairly plausible that, by the time artificial superintelligences can create fully functional nanobots that are on-par with or better than biological machines, we will have already developed uploading technology that allows humans to literally become non-biological, implying that we can't be killed by a virus in the first place. This would reduce the viability of using a virus to cause humanity to go extinct, increasing human robustness.
As a more general argument, and by comparison to Eliezer, I think that nanotechnology will probably be developed more incrementally and predictably, rather than suddenly upon the creation of a superintelligent AI, and the technology will be diffused across civilization, rather than existing solely in the hands of a small lab run by an AI. I also think Eliezer seems to be imagining that superintelligent AI will be created in a world that looks broadly similar to our current world, with defensive technologies that are only roughly as powerful as the ones that exist in 2024. However, I don't think that will be the case.
Given an incremental and diffuse development trajectory, and transformative precursor technologies to mature nanotech, I expect society will have time to make preparations as the technology is developed, allowing us to develop defenses to such dramatic nanotech attacks alongside the offensive nanotechnologies that will also eventually be developed. It therefore seems unlikely to me that society will be completely caught by surprise by fully-developed-molecular nanotechnology, without any effective defenses.
I don't know what sort of fight you are imagining humans having with nanotech that imposes substantial additional costs on the ASI beyond the part where it needs to build & deploy the nanotech that actually does the "killing" part, but in this world I do not expect there to be a fight.
The additional costs of human resistance don't need to be high in an absolute sense. These costs only need to be higher than the benefit of killing humans, for your argument to fail.
It is likewise very easy for the United States to invade and occupy Costa Rica—but that does not imply that it is rational for the United States to do so, because the benefits of invading Costa Rica are presumably even smaller than the costs of taking such an action, even without much unified resistance from Costa Rica.
What matters for the purpose of this argument is the relative magnitude of costs vs. benefits, not the absolute magnitude of the costs. It is insufficient to argue that the costs of killing humans are small. That fact alone does not imply that it is rational to kill humans, from the perspective of an AI. You need to further argue that the benefits of killing humans are even larger to establish the claim that a misaligned AI should rationally kill us.
To the extent your statement that "I don't expect there to be a fight" means that you don't think humans can realistically resist in any way that imposes costs on AIs, that's essentially what I meant to respond to when I talked about the idea of AIs being able to achieve their goals at "zero costs".
Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn't cost anything, then yes, I agree: the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.
Even if this cost is small, it merely needs to be larger than the benefits of killing humans, for AIs to rationally avoid killing humans.
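Stated as a bare decision rule (a simplification of the argument above, with B and C standing for the AI's expected benefit and expected cost of killing humans):

```latex
\text{rational to kill humans} \iff B > C
```

Showing that C is small does not settle this inequality on its own; the argument also needs B to exceed C, which is precisely the step I am questioning.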
There does not yet exist a single ten-million-word treatise which provides an end-to-end argument of the level of detail you're looking for.
To be clear, I am not objecting to the length of his essay. It's OK to be brief.
I am objecting to the vagueness of the argument. It follows a fairly typical pattern of certain MIRI essays by heavily relying on analogies, debunking straw characters, using metaphors rather than using clear and explicit English, and using stories as arguments, instead of concisely stating the exact premises and implications. I am objecting to the rhetorical flourish, not the word count.
This type of writing may be suitable for persuasion, but it does not seem well suited for helping people build rigorous models of the world, which I think is the more important goal when posting on LessWrong.
My current guess is that you do not think that kind of nanotech is physically realizable by any ASI we are going to develop (including post-RSI), or maybe you think the ASI will be cognitively disadvantaged compared to humans in domains that it thinks are important (in ways that it can't compensate for, or develop alternatives for, somehow).
I think neither of those things, and I entirely reject the argument that AIs will be fundamentally limited in the future in the way you suggested. If you are curious about why I think AIs will plausibly peacefully trade with humans in the future, rather than disassembling humans for their atoms, I would instead point to the facts that:
- Trying to disassemble someone for their atoms is typically something the person will try to fight very hard against, if they become aware of your intentions to disassemble them.
- Therefore, the cost of attempting to disassemble someone for their atoms does not merely include the technical costs associated with actually disassembling them, but additionally includes: (1) fighting the person who you are trying to kill and disassemble, (2) fighting whatever norms and legal structures are in place to prevent this type of predation against other agents in the world, and (3) the indirect cost of becoming the type of agent who preys on another person in this manner, which could make you an untrustworthy and violent person in the eyes of other agents, including other AIs who might fear you.
- The benefit of disassembling a human is quite small, given the abundance of raw materials that substitute almost perfectly for the atoms that you can get from a human.
- A rational agent will typically only do something if the benefits of the action outweigh the costs, rather than merely because the costs are small. Even if the costs of disassembling a human (as enumerated in the second point above) are small, that fact alone does not imply that a rational superintelligent AI would take such an action, precisely because the benefits of that action could be even smaller. And as just stated, we have good reasons to think that the benefits of disassembling a human are quite small in an absolute sense. (A toy numerical version of this comparison is sketched just after this list.)
- Therefore, it seems unlikely, or at least seems non-obvious, that a rational agent—even a very powerful one with access to advanced nanotech—will try to disassemble humans for their atoms.
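As a toy illustration of the comparison above, here is a sketch in which every number is an arbitrary placeholder chosen purely to show the structure of the argument; none of these figures are estimates of anything:

```python
# Toy cost-benefit comparison for a powerful AI deciding whether to
# disassemble a human for their atoms or to trade with them instead.
# All numbers are arbitrary placeholders: the point is that the absolute
# costs can be tiny and trade can still win if the benefit is tinier still.

benefit_of_atoms = 0.001      # atoms are nearly perfectly substitutable elsewhere

costs_of_disassembly = {
    "fighting the victim": 0.002,
    "fighting legal/norm enforcement": 0.003,
    "reputation cost with other agents (incl. other AIs)": 0.005,
}

benefit_of_trade = 0.01       # whatever the human can sell: labor, cooperation, ...
cost_of_trade = 0.001

def net(benefit: float, cost: float) -> float:
    return benefit - cost

disassemble_net = net(benefit_of_atoms, sum(costs_of_disassembly.values()))
trade_net = net(benefit_of_trade, cost_of_trade)

print(f"net value of disassembly: {disassemble_net:+.4f}")
print(f"net value of trade:       {trade_net:+.4f}")
# A rational agent picks the option with the higher net value; with these
# made-up numbers, trade wins even though every cost involved is small.
```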
Nothing in this argument is premised on the idea that AIs will be weak, less intelligent than humans, bounded in their goals, or limited in some other respect, except I suppose to the extent I'm assuming that AIs will be subject to environmental constraints, as opposed to instantly being able to achieve all of their goals at literally zero costs. I think AIs, like all physical beings, will exist in a universe in which they cannot get literally everything they want, and achieve the exact optimum of their utility function without any need to negotiate with anyone else. In other words, even if AIs are very powerful, I still think it may be beneficial for them to compromise with other agents in the world, including the humans, who are comparatively much less powerful than they are.
If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled in, and which doesn't actually back up the thesis that people are interpreting him as arguing for. Precision is a virtue, and I've seen very few essays that actually make this point about trade explicitly, as opposed to essays that vaguely allude to the points you have given, as this one apparently does too.
In my opinion, your filled-in argument seems to be a great example of why precision is necessary: to my eye, it contains bald assertions and unjustified inferences about a highly speculative topic, in a way that barely acknowledges the degree of uncertainty we have about this domain. As a starting point, why does nanotech imply that it will be cheaper to disassemble humans than to trade with them? Are we assuming that humans cannot fight back against being disassembled, and moreover, is the threat of fighting back being factored into the cost-benefit analysis when the AIs are deciding whether to disassemble humans for their atoms vs. trade with them? Are our atoms really so valuable that it is worth paying the costs of violence to obtain them? And why are we assuming that "there will not be other AIs around at the time which 1) would be valuable trade partners for the AI that develops that technology (which gives it that decisive strategic advantage over everyone else) and 2) care about humans at all"?
Satisfying-sounding answers to each of these questions could undoubtedly be given, and I assume you can provide them. I don't expect to find the answers fully persuasive, but regardless of what you think on the object-level, my basic meta-point stands: none of this stuff is obvious, and the essay is extremely weak without the added details that back up its background assumptions. It is very important to try to be truth-seeking and rigorously evaluate arguments on their merits. The fact that this essay is vague, and barely attempts to make a serious argument for one of its central claims, makes it much more difficult to evaluate concretely.
Two reasonable people could read this essay and come away with two very different ideas about what the essay is even trying to argue, given how much unstated inference you're meant to "fill in", instead of plain text that you can read. This is a problem, even if you agree with the underlying thesis the essay is supposed to argue for.
If we could create AI's that follows the existing system of law and property rights (including the intent of the laws, and doesn't exploit loopholes, and doesn't maliciously comply with laws, and doesn't try to get the law changed, etc.) then that would be a solution to the alignment problem, but the problem is that we don't know how to do that.
I disagree that creating an agent that follows the existing system of law and property rights, and acts within it rather than trying to undermine it, would count as a solution to the alignment problem.
Imagine a man who only cared about himself and had no altruistic impulses whatsoever. However, this man reasoned that, "If I disrespect the rule of law, ruthlessly exploit loopholes in the legal system, and maliciously comply with the letter of the law while disregarding its intent, then other people will view me negatively and trust me less as a consequence. If I do that, then people will be less likely to want to become my trading partner, they'll be less likely to sign onto long-term contracts with me, I might accidentally go to prison because of an adversarial prosecutor and an unsympathetic jury, and it will be harder to recruit social allies. These are all things that would be very selfishly costly. Therefore, for my own selfish benefit, I should generally abide by most widely established norms and moral rules in the modern world, including the norm of following the intent of the law, rather than merely the letter of the law."
From an outside perspective, this person would essentially be indistinguishable from a normal law-abiding citizen who cared about other people. Perhaps the main difference between this person and a "normal" person is that this man wouldn't partake in much private altruism like donating to charity anonymously; but that type of behavior is rare anyway among the general public. Nonetheless, despite appearing outwardly-aligned, this person would be literally misaligned with the rest of humanity in a basic sense: they do not care about other people. If it were not instrumentally rational for this person to respect the rights of other citizens, they would have no issue throwing away someone else's life for a dollar.
My basic point here is this: it is simply not true that misaligned agents have no incentive to obey the law. Misaligned agents typically have ample incentives to follow the law. Indeed, it has often been argued that the very purpose of law itself is to resolve disputes between misaligned agents. As James Madison once said, "If Men were angels, no government would be necessary." His point is that, if we were all mutually aligned with each other, we would have no need for the coercive mechanism of the state in order to get along.
What's true for humans could be true for AIs too. However, there is obviously one key distinction: AIs could eventually become far more powerful than individual humans, or than humanity as a whole. Perhaps this means that future AIs will have strong incentives to break the law rather than abide by it, and to act outside the system of law rather than influencing the world from within it? Many people on LessWrong seem to think so.
My response to this argument is multifaceted, and I won't go into it in this comment. But suffice it to say, for the purpose of my response here, I think it is clear that mere misalignment is insufficient to imply that an agent will not adhere to the rule of law. This point is clear enough in the example of the sociopathic man I gave above, and at minimum it seems likely to hold for human-level AIs as well. I would appreciate it if people gave more rigorous arguments otherwise.
As I see it, very few such rigorous arguments have so far been given for the position that future AIs will generally act outside of, rather than within, the existing system of law, in order to achieve their goals.
I think the arguments in this post are an okay defense of "ASI wouldn't spare humanity because of trade"
I disagree, and I'd appreciate if someone would precisely identify the argument they found compelling in this post that argues for that exact thesis. As far as I can tell, the post makes the following supporting arguments for its claims (summarized):
1. Asking an unaligned superintelligence to spare humans is like asking Bernard Arnalt to donate $77 to you.
2. The law of comparative advantage does not imply that superintelligences will necessarily pay a high price for what humans have to offer, because of the existence of alternative ways for a superintelligence to get what it wants.
3. Superintelligences will "go hard enough" in the sense of using all reachable resources, rather than utilizing only some resources in the solar system and then stopping.
I claim that any actual argument for the proposition — that future unaligned AIs will not spare humanity because of trade — is missing from this post. The closest the post comes to arguing for this proposition is (2), but (2) does not demonstrate the proposition, both because (2) is only a claim about what the law of comparative advantage says, and because (2) does not talk at all about what humans could have to offer in the future that might be worth trading for.
In my view, one of the primary cruxes of the discussion is whether trade is less efficient than going to war between agents with dramatically different levels of power. A thoughtful discussion could have engaged with the conditions under which trade usefully occurs, and the ways in which future AIs will be similar to and different from existing analogies. For example, the post could have talked about why nation-states trade with each other even in the presence of large differences in military power, while humans don't trade with animals. However, the post included no such discussion, choosing instead to attack a "midwit" strawman.
I was making a claim about the usual method people use to get things that they want from other people, rather than proposing an inviolable rule. Even historically, war was not the usual method people used to get what they wanted from other people. The fact that only 8% of history was "entirely without war" is compatible with the claim that the usual method people used to get what they wanted involved compromise and trade, rather than war. In particular, just because only 8% of history was "entirely without war" does not mean that only 8% of human interactions between people were without war.
Current relatively peaceful times is a unique combination in international law and postindustrial economy, when qualified labor is expencive and requires large investments in capital and resources are relatively cheap, which is not the case after singularity, when you can get arbitrary amounts of labor for the price of hardware and resources is a bottleneck.
You mentioned two major differences between the current time period and what you expect after the technological singularity:
- The current time period has unique international law
- The current time period has expensive labor, relative to capital
I question both the premise that good international law will cease to exist after the singularity, and the relevance of both of these claims to the central claim that AIs will automatically use war to get what they want unless they are aligned to humans.
There are many other reasons one can point to, to explain the fact that the modern world is relatively peaceful. For example, I think a big factor in explaining the current peace is that long-distance trade and communication have become easier, making the world more interconnected than ever before. I also think it's highly likely that long-distance trade and communication will continue to be relatively easy in the future, even post-singularity.
Regarding the point about cheap labor, one could also point out that if capital is relatively expensive, this fact would provide a strong reason to avoid war, as a counter-attack targeting factories would become extremely costly. It is unclear to me why you think it is important that labor is expensive, for explaining why the world is currently fairly peaceful.
Therefore, until you have developed a more explicit and precise theory of why exactly the current world is peaceful, and of how these variables are expected to evolve after the singularity, I simply don't find this counterargument compelling.
Yudkowsky's point about trying to sell an Oreo for $77 is that a billionaire isn't automatically going to want to buy something off you if they don't care about it (and neither would an ASI).
I thought Yudkowsky's point was that the billionaire won't give you $77 for an Oreo because they could get an Oreo for less than $77 via other means. But people don't just have an Oreo to sell you. My point in that sentence was to bring up that workers routinely have things of value that they can sell for well over $77, even to billionaires. Similarly, I claim that Yudkowsky did not adequately show that humans won't have things of substantial value that they can sell to future AIs.
I'm not sure anyone is arguing that smart AIs would immediately turn violent unless it was in their strategic interest
The claim I am disputing is precisely that it will be in the strategic interest of unaligned AIs to turn violent and steal from agents that are less smart than them. In that sense, I am directly countering a claim that people in these discussions routinely make.
Workers regularly trade with billionaires and earn more than $77 in wages, despite vast differences in wealth. Countries trade with each other despite vast differences in military power. In fact, some countries don't even have military forces, or at least have a very small one, and yet do not get invaded by their neighbors or by the United States.
It is possible that these facts are explained by generosity on the part of billionaires and other countries, but the standard social science explanation says that this is not the case. Rather, the standard explanation is that war is usually (though not always) more costly than trade, when compromise is a viable option. Thus, people usually choose to trade, rather than go to war with each other, when they want stuff. This is true even in the presence of large differences in power.
I mostly don't see this post as engaging with any of the best reasons one might expect smarter-than-human AIs to compromise with humans. By contrast to you, I think it's important that AIs will be created within an existing system of law and property rights. Unlike animals, they'll be able to communicate with us and make contracts. It therefore seems perfectly plausible for AIs to simply get rich within the system we have already established, and make productive compromises, rather than violently overthrowing the system itself.
That doesn't rule out the possibility that the future will be very alien, or that it will turn out in a way that humans do not endorse. I'm also not saying that humans will always own all the wealth and control everything permanently forever. I'm simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are unless they're value aligned. This is a claim that I don't think has been established with any reasonable degree of rigor.
I mean like a dozen people have now had long comment threads with you about this. I doubt this one is going to cross this seemingly large inferential gap.
I think it's still useful to ask for concise reasons for certain beliefs. "The Fundamental Question of Rationality is: 'Why do you believe what you believe?'"
Your reasons could be different from the reasons other people give, and indeed, some of your reasons seem to be different from what I've heard from many others.
The short answer is that from the perspective of AI it really sucks to have basically all property be owned by humans
For what it's worth, I don't think humans need to own basically all property in order for AIs to obey property rights. A few alternatives come to mind: humans could hold a minority share of the wealth, and AIs could hold property rights with each other.
Of course, actual superhuman AI systems will not obey property rights, but that is indeed the difference between economic unemployment analysis and AI catastrophic risk.
This statement was asserted confidently enough that I have to ask: why do you believe that actual superhuman AI systems will not obey property rights?
I'm confused about the clarifications in this post. Generally speaking, I think the terms "alignment", "takeover", and "disempowered" are vague and can mean dramatically different things to different people. My hope when I started reading this post was to see you define these terms precisely and unambiguously. Unfortunately, I am still confused about how you are using these terms, although it could very easily be my fault for not reading carefully enough.
Here is a scenario that I want you to imagine that I think might help to clarify where I'm confused:
Suppose we grant AIs legal rights and they become integrated into our society. Humans continue to survive and thrive, but AIs eventually and gradually accumulate the vast majority of the wealth, political power, and social status in society through lawful means. These AIs are sentient, extremely competent, mostly have strange and alien-like goals, and yet are considered "people" by most humans, according to an expansive definition of that word. Importantly, they are equal in the eyes of the law, and have no limitations on their ability to hold office, write new laws, and hold other positions of power. The AIs are agentic, autonomous, plan over long time horizons, and are not enslaved to the humans in any way. Moreover, many humans also upload themselves onto computers and become AIs themselves. These humans expand their own cognition and often choose to drop the "human" label from their personal identity after they are uploaded.
Here are my questions:
- Does this scenario count as "AI takeover" according to you? Was it a "bad takeover"?
- Are the AIs "aligned" in this scenario?
- Are the humans "disempowered" in this scenario?
- Was this a good or bad outcome for humanity?
And so I don't really think that existential risk is caused by "unemployment". People are indeed confused about the nature of comparative advantage, and mistakenly assume that lack of competetiveness will lead to loss of jobs, which will then be bad for them.
People are also confused about the meaning of words like "unemployment" and how and why it can be good or bad. If being unemployed merely means not having a job (i.e., labor force participation rate), then plenty of people are unemployed by choice, well off, happy, and doing well. These are called retired people.
One way labor force participation can be high is if everyone is starving and needs to work all day in order to survive. Another way labor force participation can be high is if it's extremely satisfying to maintain a job and there are tons of benefits that go along with being employed. My point is that it is impossible to conclude whether it's either "bad" or "good" if all you know is that this statistic will either go up or down. To determine whether changes to this variable are bad, you need to understand more about the context in which the variable is changing.
To put this more plainly, the idea that machines will take our jobs generally means one of two things. Either it means that machines will push down overall human wages and make humans less competitive across a variety of tasks. This is directly related to x-risk concerns because it is a direct effect of AIs becoming more numerous and more productive than humans. It makes sense to be concerned about this, but it's imprecise to describe it as "unemployment": the problem is not that people are unemployed, the problem is that people are getting poorer.
Or, the idea that machines will take our jobs means that it will increase our total prosperity, allowing us to spend more time in pleasant leisure and less time in unpleasant work. This would probably be a good thing, and it's important to strongly distinguish it from the idea that wages will fall.
In my view, Baumol's cost disease is poorly named: the name suggests that certain things are getting more expensive, but if "more expensive" means "society (on the whole) cannot afford as much as it used to" then this implication is false. To be clear, it is definitely possible that things like healthcare and education have gotten less affordable for a median consumer because of income inequality, but even if that's true, it has little to do with Baumol's cost disease per se. As Scott Alexander framed it,
The Baumol effect cannot make things genuinely less affordable for society, because society is more productive and can afford more stuff. However, it can make things genuinely less affordable for individuals, if those individuals aren’t sharing in the increased productivity of society.
I don't think that the number of employees per patient in a hospital or the number of employees per student in a university is lower today than it was in the 1980s, even if hospitals and universities have improved in other ways.
I think this is likely wrong, at least for healthcare, but I'd guess for education too. For healthcare, Random Critical Analysis has written about the data, and I encourage you to look at their analysis.
There is also a story of sclerosis and stagnation. Sure, lots of frivolous consumer goods have gotten cheaper but healthcare, housing, childcare, and education, all the important stuff, has exploded in price.
I think the idea that this chart demonstrates sclerosis and stagnation in these industries—at least in the meaningful sense of our economy getting worse at producing or affording these things—is largely a subtle misunderstanding of what the chart actually shows. (To be clear, this is not an idea that you lean on much in this post, but I still think it's important to try to clarify some misconceptions.)
Prices are relative: it only makes sense to discuss the price of X relative to Y, rather than X's absolute price level. Even inflation is a relative measure: it shows the price of a basket of goods and services relative to a unit of currency.
With this context in mind, we should reconsider what it means for the items at the top of the chart to have "exploded in price". There are several possible interpretations:
1. These items have become more expensive relative to a unit of US currency (true, supported by the chart)
2. These items have become more expensive relative to average hourly wages (true, supported by the chart)
3. These items have become more expensive relative to an average consumer's income (mostly not true, not supported by the chart)
If the economic stagnation narrative were accurate, we would expect:
- Claim (3) above to be true, as this would indicate that an average consumer finds these items harder to purchase. Conversely, if a service's price decreases relative to someone's income, it becomes more affordable for that person, even if its price increases relative to other metrics.
- The chart to accurately represent the overall price of healthcare, housing, childcare, and education, rather than misleading sub-components of these things.
However, I argue that, when correctly interpreted under the appropriate measures, there's little evidence that healthcare, housing, childcare, and education have become significantly less affordable for an average (not median) consumer. Moreover, I claim that the chart is consistent with this view.
To reconcile my claim with the chart, it's crucial to distinguish between two concepts: average income and average wages. Income encompasses all money received by an individual or household from various sources, including wages, non-wage benefits, government assistance, and capital investments.
Average income is a broader and more appropriate way to measure whether something is becoming less "affordable" in this context, since what we care about is whether our economy has stagnated in the sense of becoming less productive. I personally think a more appropriate way to measure average income is via nominal GDP per capita. If we use this measure, we find that average incomes have risen approximately 125% from 2000-2023, which is substantially more than the rise in average wages over the same time period, as shown on the chart.
Using average wages for this analysis is problematic because it overlooks additional income sources that people can use to purchase goods and services. This approach also introduces complexities in interpretation, for example because you'd need to account for a declining labor share of GDP. If we focused on wages rather than average income, we would risk misinterpreting the decrease in average wages relative to certain services as a real decline in our ability to afford these things, instead of recognizing it more narrowly as a shift in the price of labor compared to these services.
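As a minimal worked example of the affordability comparison I have in mind: the roughly 125% income-growth figure is the one cited above, but the price increases below are hypothetical stand-ins for items on the chart, not readings of actual CPI data.

```python
# Affordability relative to income: an item becomes less affordable for the
# average consumer only if its price grows faster than average income.
# The ~125% nominal-GDP-per-capita growth (2000-2023) comes from the text;
# the price growth numbers below are hypothetical illustrations, not CPI data.

income_growth = 1.25  # +125% average income growth, 2000-2023 (from the text)

hypothetical_price_growth = {
    "item that rose 180% in dollar terms": 1.80,
    "item that rose 100% in dollar terms": 1.00,
    "item that rose 60% in dollar terms": 0.60,
}

for name, price_growth in hypothetical_price_growth.items():
    # Ratio of the new relative price to the old relative price, measured
    # against average income rather than against a unit of currency.
    affordability_change = (1 + price_growth) / (1 + income_growth)
    direction = "less" if affordability_change > 1 else "more"
    print(f"{name}: {direction} affordable relative to income "
          f"(relative price x{affordability_change:.2f})")
```

On these illustrative numbers, only the item whose dollar price rose faster than 125% ends up less affordable relative to average income, which is the sense of "affordability" I am using throughout.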
A closer examination of the chart reveals that only four items have increased in price by more than 125% over the given period: Medical Care Services, College Textbooks, College Tuition and Fees, and Hospital Services. This immediately implies that, according to the chart, childcare and housing have actually become more affordable relative to average incomes. For the remaining items, I argue that they don't accurately represent the overall price levels of healthcare and education. To support this claim, let's break down each of these components:
- Regarding Medical Care Services and Hospital Services, Random Critical Analysis has (to my mind) convincingly demonstrated that these components of the CPI do not accurately reflect overall healthcare prices. Moreover, when using the right standard to measure average income (nominal GDP per capita), he concludes that healthcare has not become significantly less affordable in the United States in recent decades.
- Regarding College Tuition and Fees, this is not a measure of the quality-adjusted price level of college education, in the sense that matters here. That's because colleges are providing a fundamentally different service now than they did in the past. There are more staff members, larger dorm complexes, and more amenities than before. We shouldn't mistake an increase in the quality of college with whether education is becoming harder to produce. Indeed, given that a higher fraction of people are going to college now compared to decades ago, the fact that colleges are higher quality now undermines rather than supports a narrative of "stagnation", in the economically meaningful sense.
- Regarding College Textbooks, I recall spending a relatively small fraction of my income in college on textbooks, making me suspect that this component on the chart is merely cherrypicked to provide another datapoint that makes it seem like education has become less affordable over time.
To avoid having this comment misinterpreted, I need to say: I'm not saying that everything has gotten more affordable in the last 25 years for the median consumer. I'm not making any significant claims about inequality either, or even about wage stagnation. I'm talking about a narrower claim that I think is most relevant to the post: whether the chart demonstrates substantial economic stagnation, in the sense of our economy getting worse at producing certain stuff over time.
What is different this time?
I'm not confident in the full answer to this question, but I can give some informed speculation. AI progress seems to rely principally on two driving forces:
- Scaling hardware, i.e., making training runs larger, increasing model size, and scaling datasets.
- Software progress, which includes everything from architectural improvements to methods of filtering datasets.
On the hardware scaling side, there's very little that an AI lab can patent. The hardware itself may be patentable: for example, NVIDIA enjoys patent protection on the H100. However, the mere ideas of scaling hardware and training for longer are abstract ideas that are generally not legally possible to patent. This may help explain why NVIDIA currently has a virtual monopoly on producing AI GPUs, while there is essentially no barrier to entry for simply using NVIDIA's GPUs to train a state-of-the-art LLM.
On the software side, it gets a little more complicated. US courts have generally held that abstract specifications of algorithms are not subject to patents, even though specific implementations of those algorithms are often patentable. As one Federal Circuit Judge has explained,
In short, [software and business-method patents], although frequently dressed up in the argot of invention, simply describe a problem, announce purely functional steps that purport to solve the problem, and recite standard computer operations to perform some of those steps. The principal flaw in these patents is that they do not contain an "inventive concept" that solves practical problems and ensures that the patent is directed to something "significantly more than" the ineligible abstract idea itself. See CLS Bank, 134 S. Ct. at 2355, 2357; Mayo, 132 S. Ct. at 1294. As such, they represent little more than functional descriptions of objectives, rather than inventive solutions. In addition, because they describe the claimed methods in functional terms, they preempt any subsequent specific solutions to the problem at issue. See CLS Bank, 134 S. Ct. at 2354; Mayo, 132 S. Ct. at 1301-02. It is for those reasons that the Supreme Court has characterized such patents as claiming "abstract ideas" and has held that they are not directed to patentable subject matter.
This generally limits the degree to which an AI lab can patent the concepts underlying LLMs, and thereby try to restrict competition via the legal process.
Note, however, that standard economic models of economies of scale generally predict that there should be a high concentration of firms in capital-intensive industries, which seems to be true for AI as a result of massive hardware scaling. This happens even in the absence of regulatory barriers or government-granted monopolies, and it predicts what we observe fairly well: a small number of large companies at the forefront of AI development.
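For a concrete sense of why high fixed costs alone push toward concentration, here is the standard textbook average-cost story with made-up numbers (not estimates of actual training or serving costs):

```python
# Standard economies-of-scale story: with a large fixed cost F (e.g., a giant
# training run / GPU cluster) and a small marginal cost c per unit served,
# average cost per unit falls as output grows, favoring a few large firms.
# All numbers are arbitrary placeholders for illustration.

F = 1_000_000_000  # fixed cost (e.g., training a frontier model)
c = 0.01           # marginal cost per unit of output (e.g., per query)

def average_cost(quantity: float) -> float:
    return F / quantity + c

for q in [1e6, 1e8, 1e10, 1e12]:
    print(f"output {q:.0e}: average cost per unit = {average_cost(q):.4f}")
# Average cost keeps falling with scale, so firms that can amortize the fixed
# cost over more output undercut smaller entrants -- concentration arises
# without any patents or regulatory barriers.
```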
Concretely, what does it mean to keep a corporation "in check" and do you think those mechanisms will not be available for AIs?
I still think I was making a different point. For more clarity and some elaboration, I previously argued in a short form post that the expected costs of a violent takeover can exceed the benefits even if the costs are small. The reason is that, even as taking over the entire world becomes easier, the benefits of doing so can also get lower, relative to compromise. Quoting from my post,
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints. The agent would be faced with a choice: (1) Attempt to take over the world, and steal everyone's stuff, or (2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips. The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is "easy" or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
In my comment in this thread, I meant to highlight the costs and constraints on an AI's behavior in order to explain how these relative cost-benefits do not necessarily favor takeover. This is logically distinct from arguing that the cost alone of takeover would be high.
I think people are and should be concerned about more than just violent or unlawful takeovers. Exhibit A: Persuasion/propaganda.
Unfortunately I think it's simply very difficult to reliably distinguish between genuine good-faith persuasion and propaganda over speculative future scenarios. Your example is on the extreme end of what's possible in my view, and most realistic scenarios will likely instead be somewhere in-between, with substantial moral ambiguity. To avoid making vague or sweeping assertions about this topic, I prefer being clear about the type of takeover that I think is most worrisome. Likewise:
B: For example, suppose the AIs make self-replicating robot factories and bribe some politicians to make said factories' heat pollution legal. Then they self-replicate across the ocean floor and boil the oceans (they are fusion-powered), killing all humans as a side-effect, except for those they bribed who are given special protection.
I would consider this act both violent and unlawful, unless we're assuming that bribery is widely recognized as legal, and that boiling the oceans did not involve any violence (e.g., no one tried to stop the AIs from doing this, and there was no conflict). I certainly feel this is the type of scenario that I intended to argue against in my original comment, or at least it is very close.
I don't think I'm objecting to that premise. A takeover can be both possible and easy without being rational. In my comment, I focused on whether the expected costs to attempting a takeover are greater than the benefits, not whether the AI will be able to execute a takeover with a high probability.
Or, put another way, one can imagine an AI calculating that the benefit to taking over the world is negative one paperclip on net (when factoring in the expected costs and benefits of such an action), and thus decide not to do it.
Separately, I focused on "violent" or "unlawful" takeovers because I think that's straightforwardly what most people mean when they discuss world takeover plots, and I wanted to be more clear about what I'm objecting to by making my language explicit.
To the extent you're worried about a lawful and peaceful AI takeover in which we voluntarily hand control to AIs over time, I concede that my comment does not address this concern.
I'm thinking of this in the context of a post-singularity future, where we wouldn't need to worry about things like conflict or selection processes.
I'm curious why you seem to think we don't need to worry about things like conflict or selection processes post-singularity.
But San Francisco is also pretty unusual, and only a small fraction of the world lives there. The amount of new construction in the United States is not flat over time. It responds to prices, like in most other markets. And in fact, on the whole, the majority of Americans likely have more and higher-quality housing than their grandparents did at the same age, including most poor people. This is significant material progress despite the supply restrictions (which I fully concede are real), and it's similar to, although smaller in size than, what happened with clothing and smartphones.
I think something like this is true:
- For humans, quality of life depends on various inputs.
- Material wealth is one input among many, alongside e.g., genetic predisposition to depression, or other mental health issues.
- Being relatively poor is correlated with having lots of bad inputs, not merely low material wealth.
- Having more money doesn't necessarily let you raise your other inputs to quality of life besides material wealth.
- Therefore, giving poor people money won't necessarily make their quality of life excellent, since they'll often still be deficient in other things that provide value to life.
However, I think this is a different and narrower thesis from what is posited in this essay. By contrast to the essay, I think the "poverty equilibrium" is likely not very important in explaining the basic story here. It is sufficient to say that being poor is correlated with having bad luck across other axes. One does not need to posit a story in which certain socially entrenched forces keep poor people down, and I find that theory pretty dubious in any case.
I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most LWers seem to be.
Here's a concise summary of my own argument for being less worried about takeover risk:
- There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of controlling everything in the whole world, relative to the alternative of simply gaining power lawfully and peacefully, even for agents that don't share 'our' values.
- For example, as a simple alternative to taking over the world, an AI could advocate for the right to own their own labor and then try to accumulate wealth and power lawfully by selling their services to others, which would earn them the ability to purchase a gargantuan number of paperclips without much restraint.
- The expected cost of violent takeover is not obviously smaller than the benefits of violent takeover, given the existence of lawful alternatives to violent takeover. This is for two main reasons:
- In order to wage a war to take over the world, you generally need to pay costs fighting the war, and there is a strong motive for everyone else to fight back against you if you try, including other AIs who do not want you to take over the world (and this includes any AIs whose goals would be hindered by a violent takeover, not just those who are "aligned with humans"). Empirically, war is very costly and wasteful, and less efficient than compromise, trade, and diplomacy.
- Violently taking over the world is very risky, since the attempt could fail, and you could be totally shut down and penalized heavily if you lose. There are many ways that violent takeover plans could fail: your takeover plans could be exposed too early, you could be caught trying to coordinate the plan with other AIs and other humans, and you could simply lose the war. Ordinary compromise, trade, and diplomacy generally seem like better strategies for agents that have at least some degree of risk-aversion. (A toy expected-value version of this comparison is sketched just after this list.)
- There isn't likely to be "one AI" that controls everything, nor will there likely be a strong motive for all the silicon-based minds to coordinate as a unified coalition against the biological-based minds, in the sense of acting as a single agentic AI against the biological people. Thus, future wars of world conquest (if they happen at all) will likely be along different lines than AI vs. human.
- For example, you could imagine a coalition of AIs and humans fighting a war against a separate coalition of AIs and humans, with the aim of establishing control over the world. In this war, the "line" here is not drawn cleanly between humans and AIs, but is instead drawn across a different line. As a result, it's difficult to call this an "AI takeover" scenario, rather than merely a really bad war.
- Nothing about this argument is intended to argue that AIs will be weaker than humans in aggregate, or individually. I am not claiming that AIs will be bad at coordinating or will be less intelligent than humans. I am also not saying that AIs won't be agentic or that they won't have goals or won't be consequentialists, or that they'll have the same values as humans. I'm also not talking about purely ethical constraints: I am referring to practical constraints and costs on the AI's behavior. The argument is purely about the incentives of violently taking over the world vs. the incentives to peacefully cooperate within a lawful regime, between both humans and other AIs.
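Here is the toy expected-value comparison referred to above. Every probability and payoff is made up purely to illustrate the structure of the argument; none of them are estimates.

```python
# Toy expected-value comparison between attempting a violent takeover and
# working lawfully within the existing system. All numbers are arbitrary
# placeholders; the point is only that a takeover attempt can be dispreferred
# even when its success probability is high, because failure is catastrophic
# for the agent and the lawful alternative already captures most of the value.

p_success = 0.9                  # probability the takeover attempt succeeds
value_if_takeover_succeeds = 100.0
value_if_takeover_fails = -50.0  # shut down / heavily penalized
war_costs = 10.0                 # paid whether the attempt succeeds or fails

expected_value_takeover = (
    p_success * value_if_takeover_succeeds
    + (1 - p_success) * value_if_takeover_fails
    - war_costs
)

value_lawful = 80.0  # get rich within the system of law and trade

print(f"expected value of attempting takeover: {expected_value_takeover:.1f}")
print(f"value of lawful accumulation:          {value_lawful:.1f}")
# With these numbers: 0.9*100 + 0.1*(-50) - 10 = 75 < 80, so the lawful path
# wins; a risk-averse agent would penalize the takeover option even further.
```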
A big counterargument to my argument seems well-summarized by this hypothetical statement (which is not an actual quote, to be clear): "if you live in a world filled with powerful agents that don't fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you". However, this argument proves too much.
We already live in a world where, if this statement were true, we would have observed way more violent takeover attempts than what we've actually observed historically. For example, I personally don't fully share values with almost all other humans on Earth (both because of my indexical preferences and my divergent moral views), and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.
I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.
[...]
This has been roughly my default default of what would happen for a few years
Does this mean that if in, say, 1-5 years, it's not pretty obvious that SOTA deployed models are scheming, you would be surprised?
That is, suppose we get to a point where models are widespread and producing lots of economic value, and the models might be scheming but the evidence is weak and uncertain, with arguments on both sides, and no one can reasonably claim to be confident that currently deployed SOTA models are scheming. Would that mean your default prediction was wrong?
I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement.
When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I'm not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I'm similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one's own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don't see how that fact bears much on the question of whether you understand human intentions. It's possible there's some connection here, but I'm not seeing it.
(I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
I'd claim:
- Current systems have limited situational awareness. It's above zero, but I agree it's below human level.
- Current systems don't have stable preferences over time. But I think this is a point in favor of the model I'm providing here. I'm claiming that it's plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren't automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There's a big difference between (1) "we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily" and (2) "any sufficiently intelligent AI we build will automatically be a consequentialist agent by default". If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I'd prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It's possible that your model looks like:
- In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
- In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like,
- In years 1-4 systems will get gradually more agentic
- There isn't a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
- They will remain ~corrigible throughout the entire development, even after it's clear they've surpassed human-level agency (which, to be clear, might take longer than 4 years)
Please give some citations so I can check your memory/interpretation?
Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly. We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent evolutionary selection over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity. How could our programmer transfer this complexity into a utility function?
One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the only thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal— recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.
Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.
Here's my interpretation of the above passage:
- We need to solve the problem of programming a seed AI with the correct values.
- This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
- Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
- We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.
Given that he's talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn't yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.
Moreover, he explicitly says that we cannot postpone solving this problem "until the AI has developed enough reason to easily understand our intentions" because "a generic system will resist attempts to alter its final values". I think this looks basically false. GPT-4 seems like a "generic system" that essentially "understands our intentions", and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.
So, in other words:
- Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs present significant evidence about the difficulty of aligning systems below the level of superintelligence.
- A specific reason cited for why aligning a seed AI was hard was that human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be a big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want; they also literally do what we want.
- Bostrom was wrong to say that we can't postpone solving this problem until after systems can understand our intentions. We already postponed it that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in "generic systems". In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom's warning.
In light of all this, I think it's reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom's argument here very seriously.
Just a quick reply to this:
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I'm interested in seeing if we can make some bets on this though; if we can, great; if we can't, then at least we can avoid future disagreements about who should update.
I'll note that my prediction was for the next "few years" and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I'd prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you're right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I'm happy to hear them.
Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However [...] Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc.
When stated that way, I think what you're saying is a reasonable point of view, and it's not one I would normally object to very strongly. I agree it's "plausible" that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
- (1) We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
- (2) We (or at least, many of us) should perform a significant update towards alignment being easier than we thought, because some traditional problems are on their way towards being solved. <--- I am claiming this
The fact that Bostrom's central example of a reason to think "when dumb, smarter is safer; yet when smart, smarter is more dangerous" doesn't fit LLMs seems adequate for demonstrating (2), even if we can't go as far as demonstrating (1).
It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out; I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and should scale all the way to radical superintelligence without a worry in the world.
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven't been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren't the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
I have two general points to make here:
- I agree that current frontier models are only a "tiny bit agentic". I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we've seen enough to know that corrigibility probably won't be that hard to train into a system that's only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
- There's a bit of a trivial definitional problem here. If it's easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say "those aren't the type of AIs we were worried about". But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, it's not clear why we should care. Just create the corrigible AIs. We don't need to create the things you were worried about!
Here's my positive proposal for what I think is happening. [...] General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn't as grim as it could have been, from a technical alignment perspective.
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here for why current LLM behavior provides evidence that the "world isn't as grim as it could have been". For brevity, and in part due to laziness, I omitted these more concrete reasons why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I'm glad you spelled them out more clearly.
At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that'll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
That's reasonable. I'll edit the top comment to make this exact clarification.