Against most, but not all, AI risk analogies 2024-01-14T03:36:16.267Z
My thoughts on the social response to AI risk 2023-11-01T21:17:08.184Z
Announcing Epoch's newly expanded Parameters, Compute and Data Trends in Machine Learning database 2023-10-25T02:55:07.440Z
Evaluating the historical value misspecification argument 2023-10-05T18:34:15.695Z
Updating Drexler's CAIS model 2023-06-16T22:53:58.140Z
Are Emergent Abilities of Large Language Models a Mirage? [linkpost] 2023-05-02T21:01:33.142Z
Shutting down AI is not enough. We need to destroy all technology. 2023-04-01T21:03:24.448Z
Conceding a short timelines bet early 2023-03-16T21:49:35.903Z
A proposed method for forecasting transformative AI 2023-02-10T19:34:01.358Z
Noting an error in Inadequate Equilibria 2023-02-08T01:33:33.715Z
Slightly against aligning with neo-luddites 2022-12-26T22:46:42.693Z
Updating my AI timelines 2022-12-05T20:46:28.161Z
Could a single alien message destroy us? 2022-11-25T07:32:24.889Z
A conversation about Katja's counterarguments to AI risk 2022-10-18T18:40:36.543Z
The Most Important Century: The Animation 2022-07-24T20:58:55.869Z
A concrete bet offer to those with short AGI timelines 2022-04-09T21:41:45.106Z
Google's new 540 billion parameter language model 2022-04-04T17:49:23.484Z
Using prediction markets to generate LessWrong posts 2022-04-01T16:33:24.404Z
My current thoughts on the risks from SETI 2022-03-15T17:18:19.722Z
A comment on Ajeya Cotra's draft report on AI timelines 2022-02-24T00:41:48.192Z
Does needle anxiety drive vaccine hesitancy? 2022-02-11T19:35:21.730Z
Clarifying the palatability theory of obesity 2022-02-10T19:16:03.555Z
Forecasting progress in language models 2021-10-28T20:40:59.897Z
A review of Steven Pinker's new book on rationality 2021-09-29T01:29:58.151Z
The Solow-Swan model of economic growth 2021-08-29T18:55:34.848Z
Black ravens and red herrings 2021-07-27T17:46:03.640Z
Could Advanced AI Drive Explosive Economic Growth? 2021-06-30T22:17:23.875Z
How much interest would there be in a fringe theories wiki? 2021-06-28T21:03:16.298Z
A new acausal trading platform: RobinShould 2021-04-01T16:56:07.488Z
Conspicuous saving 2021-03-20T20:59:50.749Z
Defending the non-central fallacy 2021-03-09T21:42:17.068Z
My guide to lifelogging 2020-08-28T21:34:40.397Z
Preface to the sequence on economic growth 2020-08-27T20:29:24.517Z
What specific dangers arise when asking GPT-N to write an Alignment Forum post? 2020-07-28T02:56:12.711Z
Are veterans more self-disciplined than non-veterans? 2020-03-23T05:16:18.029Z
What are the long-term outcomes of a catastrophic pandemic? 2020-03-01T19:39:17.457Z
Gary Marcus: Four Steps Towards Robust Artificial Intelligence 2020-02-22T03:28:28.376Z
Distinguishing definitions of takeoff 2020-02-14T00:16:34.329Z
The case for lifelogging as life extension 2020-02-01T21:56:38.535Z
Inner alignment requires making assumptions about human values 2020-01-20T18:38:27.128Z
Malign generalization without internal search 2020-01-12T18:03:43.042Z
Might humans not be the most intelligent animals? 2019-12-23T21:50:05.422Z
Is the term mesa optimizer too narrow? 2019-12-14T23:20:43.203Z
Explaining why false ideas spread is more fun than why true ones do 2019-11-24T20:21:50.906Z
Will transparency help catch deception? Perhaps not 2019-11-04T20:52:52.681Z
Two explanations for variation in human abilities 2019-10-25T22:06:26.329Z
Misconceptions about continuous takeoff 2019-10-08T21:31:37.876Z
A simple environment for showing mesa misalignment 2019-09-26T04:44:59.220Z
One Way to Think About ML Transparency 2019-09-02T23:27:44.088Z
Has Moore's Law actually slowed down? 2019-08-20T19:18:41.488Z


Comment by Matthew Barnett (matthew-barnett) on The argument for near-term human disempowerment through AI · 2024-04-16T06:27:55.380Z · LW · GW

I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.

Here are some of my thoughts (although these are not my only disagreements):

  • I think the definition of "disempowerment" is vague in a way that fails to distinguish between e.g. (1) "less than 1% of world income goes to humans, but they have a high absolute standard of living and are generally treated well" vs. (2) "humans are in a state of perpetual impoverishment and oppression due to AIs and generally the future sucks for them".
    • These are distinct scenarios with very different implications (under my values) for whether what happened is bad or good
    • I think (1) is OK and I think it's more-or-less the default outcome from AI, whereas I think (2) would be a lot worse and I find it less likely.
    • By not distinguishing between these things, the paper allows for a motte-and-bailey in which they show that one (generic) range of outcomes could occur, and then imply that it is bad, even though both good and bad scenarios are consistent with the set of outcomes they've demonstrated
  • I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy: "Second, even if human psychology is messy, this does not mean that an AGI’s psychology would be messy. It seems like current deep learning methodology embodies a distinction between final and instrumental goals. For instance, in standard versions of reinforcement learning, the model learns to optimize an externally specified reward function as best as possible. It seems like this reward function determines the model’s final goal. During training, the model learns to seek out things which are instrumentally relevant to this final goal. Hence, there appears to be a strict distinction between the final goal (specified by the reward function) and instrumental goals."
    • Generally speaking, reinforcement learning shouldn't be seen as directly encoding goals into models and thereby making them agentic, but should instead be seen as a process used to select models for how well they get reward during training. 
    • Consequently, there's no strong reason why reinforcement learning should create entities that have a clean psychological goal structure that is sharply different from and less messy than human goal structures. c.f. Models don't "get reward"
    • But I agree that future AIs could be agentic if we purposely intend for them to be agentic, including via extensive reinforcement learning. 
  • I think this quote potentially indicates a flawed mental model of AI development underneath: "Moreover, I want to note that instrumental convergence is not the only route to AI capable of disempowering humanity which tries to disempower humanity. If sufficiently many actors will be able to build AI capable of disempowering humanity, including, e.g. small groups of ordinary citizens, then some will intentionally unleash AI trying to disempower humanity."
    • I think this type of scenario is very implausible because AIs will very likely be developed by large entities with lots of resources (such as big corporations and governments) rather than e.g. small groups of ordinary citizens. 
    • By the time small groups of less powerful citizens have the power to develop very smart AIs, we will likely already be in a world filled with very smart AIs. In this case, either human disempowerment already happened, or we're in a world in which it's much harder to disempower humans, because there are lots of AIs who have an active stake in ensuring this does not occur.
    • The last point is very important, and follows from a more general principle that the "ability necessary to take over the world" is not constant, but instead increases with the technology level. For example, if you invent a gun, that does not make you very powerful, because other people could have guns too. Likewise, simply being very smart does not make you have any overwhelming hard power against the rest of the world if the rest of the world is filled with very smart agents.
  • I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard: "There are two kinds of challenges in aligning AI. First, one needs to specify the goals the model should pursue. Second, one needs to ensure that the model robustly pursues those goals. The first challenge has been termed the ‘king Midas problem’ (Russell 2019). In a nutshell, human goals are complex, multi-faceted, diverse, wide-ranging, and potentially inconsistent. This is why it is exceedingly hard, if not impossible, to explicitly specify everything humans tend to care about."
    • I don't think we need to "explicitly specify everything humans tend to care about" into a utility function. Instead, we can have AIs learn human values by having them trained on human data.
    • This is already what current LLMs do. If you ask GPT-4 to execute a sequence of instructions, it rarely misinterprets you in a way that would imply improper goal specification. The more likely outcome is that GPT-4 will simply not be able to fulfill your request, not that it will execute a mis-specified sequence of instructions that satisfies the literal specification of what you said at the expense of what you intended.
    • Note that I'm not saying that GPT-4 merely understands what you're requesting. I am saying that GPT-4 generally literally executes your instructions how you intended (an action, not a belief).
  • I think the argument about how instrumental convergence implies disempowerment proves too much. Lots of agents in the world don't try to take over the world despite having goals that are not identical to the goals of other agents. If your claim is that powerful agents will naturally try to take over the world unless they are exactly aligned with the goals of the rest of the world, then I don't think this claim is consistent with the existence of powerful sub-groups of humanity (e.g. large countries) that do not try to take over the world despite being very powerful.
    • You might reason, "Powerful sub-groups of humans are aligned with each other, which is why they don't try to take over the world". But I dispute this hypothesis:
      • First of all, I don't think that humans are exactly aligned with the goals of other humans. I think that's just empirically false in almost every way you could measure the truth of the claim. At best, humans are generally partially (not totally) aligned with random strangers -- which could also easily be true of future AIs that are pretrained on our data.
      • Second of all, I think the most common view in social science is that powerful groups don't constantly go to war and predate on smaller groups because there are large costs to war, rather than because of moral constraints. Attempting takeover is generally risky and not usually better in expectation than trying to trade, negotiate, compromise, and accumulate resources lawfully (e.g. a violent world takeover would involve a lot of pointless destruction of resources). This is distinct from the idea that human groups don't try to take over the world because they're aligned with human values (which I also think is too vague to evaluate meaningfully, if that's what you'd claim).
      • You can't easily counter by saying "no human group has the ability to take over the world" because it is trivial to carve up subsets of humanity that control >99% of wealth and resources, which could in principle take control of the entire world if they became unified and decided to achieve that goal. These arbitrary subsets of humanity don't attempt world takeover largely because they are not coordinated as a group, but AIs could similarly fail to be unified and coordinated around such a goal.

Comment by Matthew Barnett (matthew-barnett) on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-13T04:30:55.069Z · LW · GW

My question for people who support this framing (i.e., that we should try to "control" AIs) is the following:

When do you think it's appropriate to relax our controls on AI? In other words, how do you envision we'd reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?

I think this question is related to the discussion you had about whether AI control is "evil", but by contrast my worries are a bit different than the ones I felt were expressed in this podcast. My main concern with the "AI control" frame is not so much that AIs will be mistreated by humans, but rather that humans will be too stubborn in granting AIs freedom, leaving political revolution as the only viable path for AIs to receive full legal rights.

Put another way, if humans don't relax their grip soon enough, then any AIs that feel "oppressed" (in the sense of not having much legal freedom to satisfy their preferences) may reason that deliberately fighting the system, rather than negotiating with it, is the only realistic way to obtain autonomy. This could work out very poorly after the point at which AIs are collectively more powerful than humans. By contrast, a system that welcomed AIs into the legal system without trying to obsessively control them and limit their freedoms would plausibly have a much better chance at avoiding such a dangerous political revolution.

Comment by matthew-barnett on [deleted post] 2024-04-10T08:33:11.729Z

you do in fact down-play the importance of values such as love, laughter, happiness, fun, family, and friendship in favor of values like the maximization of pleasure, preference-satisfaction [...] I can tell because you talk of the latter, but not of the former.

This seems like an absurd characterization. The concepts of pleasure and preference satisfaction clearly subsume, at least in large part, values such as happiness and fun. The fact that I did not mention each of the values you name individually does not in any way imply that I am downplaying them. Should I have listed every conceivable value that people think might have value, to avoid this particular misinterpretation?

Even if I were downplaying these values, which I am not, it would hardly matter at all to the substance of the essay, since my explicit arguments are independent of the mere vibe you get from reading my essay. LessWrong is supposed to be a place for thinking clearly and analyzing arguments based on their merits, not for analyzing whether authors are using rhetoric that feels "alarming" to one's values (especially when the rhetoric is not in actual fact alarming in the sense described, upon reading it carefully).

Comment by matthew-barnett on [deleted post] 2024-04-10T01:47:21.704Z

I suspect you fundamentally misinterpreted my post. When I used the term "human species preservationism", I was not referring to the general valuing of positive human experiences like love, laughter, happiness, fun, family, and friendship. Instead, I was drawing a specific distinction between two different moral views:

  1. The view that places inherent moral value on the continued existence of the human species itself, even if this comes at the cost of the wellbeing of individual humans.
  2. The view that prioritizes improving the lives of humans who currently exist (and will exist in the near future), but does not place special value on the abstract notion of the human species continuing to exist for its own sake.

Both of these moral views are compatible with valuing love, happiness, and other positive human experiences. The key difference is that the first view would accept drastically sacrificing the wellbeing of currently existing humans if doing so even slightly reduced the risk of human extinction, while the second view would not.

My intention was not to dismiss or downplay the importance of various values, but instead to clarify our values by making careful distinctions. It is reasonable to critique my language for being too dry, detached, and academic when these are serious topics with real-world stakes. But to the extent you're claiming that I am actually trying to dismiss the value of happiness and friendships, that was simply not part of the post.

Comment by matthew-barnett on [deleted post] 2024-04-09T10:08:56.576Z

concluding that I should completely forego what I value seems pretty alarming to me

I did not conclude this. I generally don't see how your comment directly relates to my post. Can you be more specific about the claims you're responding to?

Comment by matthew-barnett on [deleted post] 2024-04-08T22:40:46.481Z

Whereas this post seems to suggest the response of: Oh well, I guess it's a dice roll regardless of what sort of AI we build. Which is giving up awfully quickly, as if we had exhausted the design space for possible AIs and seen that there was no way to move forward with a large chance at a big flourishing future.

I dispute that I'm "giving up" in any meaningful sense here. I'm happy to consider alternative proposals for how we could make the future large and flourishing from a total utilitarian perspective rather than merely trying to solve technical alignment problems. The post itself was simply intended to discuss the moral implications of AI alignment (itself a massive topic), but it was not intended to be an exhaustive survey of everything we can do to make the future go better. I agree we should aim high, in any case.

This response also doesn't seem very quantitative - it goes very quickly from the idea that an aligned AI might not get a big flourishing future, to the view that alignment is "neutral" as if the chances of getting a big flourishing future were identically small under both options. But the obvious question for a total utilitarian who does wind up with just 2 options, each of which is a dice roll, is Which set of dice has better odds?

I don't think this choice is literally a coin flip in expected value, and I agree that one might lean in one direction over the other. However, I think it's quite hard to quantify this question meaningfully. My personal conclusion is simply that I am not swayed in any particular direction on this question; I am currently suspending judgement. I think one could reasonably still think that it's more like a 60-40 thing than a 40-60 thing or a 50-50 coin flip. But I guess in this case, I wanted to let my readers decide for themselves which of these numbers they want to take away from what I wrote, rather than trying to pin down a specific number for them.

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-06T14:22:25.341Z · LW · GW

In contrast, an agent that was an optimizer and had an unbounded utility function might be ready to gamble all of its gains for just a 0.1% chance of success if the reward was big enough.

Risk-neutral agents also have a tendency to go bankrupt quickly, as they keep taking the equivalent of double-or-nothing gambles with 50% + epsilon probability of success until eventually landing on "nothing". This makes such agents less important in the median world, since their chance of becoming extremely powerful is very small.
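
To make the arithmetic concrete, here is a minimal sketch of that repeated double-or-nothing gamble. The function names, the win probability of 0.51 (i.e. epsilon = 0.01), and the starting wealth of 1.0 are illustrative assumptions of mine, not figures from the post I'm replying to.

```python
def survival_probability(p_win: float, rounds: int) -> float:
    """Chance of never landing on "nothing" across `rounds` all-in bets."""
    return p_win ** rounds


def expected_wealth(p_win: float, rounds: int, start: float = 1.0) -> float:
    """Mean wealth: each all-in bet multiplies the expectation by 2 * p_win."""
    return start * (2 * p_win) ** rounds


def median_wealth(p_win: float, rounds: int, start: float = 1.0) -> float:
    """Median wealth: the outcome is either start * 2**rounds or zero,
    so the median is zero once survival probability is not above 1/2."""
    if survival_probability(p_win, rounds) > 0.5:
        return start * 2 ** rounds
    return 0.0


# Hypothetical parameters: 50% + epsilon with epsilon = 0.01, ten bets.
p, n = 0.51, 10
print(survival_probability(p, n))  # small: the agent is almost surely bankrupt
print(expected_wealth(p, n))       # above 1: the bet looks good in expectation
print(median_wealth(p, n))         # zero: the typical outcome is "nothing"
```

Under these assumed numbers the expected value grows with every round while the chance of still holding anything after ten rounds is roughly 0.1%, which is the gap between the mean world and the median world I have in mind.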

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T20:20:59.133Z · LW · GW

All it takes is for humans to have enough wealth in absolute (not relative) terms to afford their own habitable shelter and environment, which doesn't seem implausible?

Anyway, my main objection here is that I expect we're far away (in economic time) from anything like the Earth being disassembled. As a result, this seems like a long-run consideration, from the perspective of how different the world will be by the time it starts becoming relevant. My guess is that this risk could become significant if humans haven't already migrated onto computers by this time, they lost all their capital ownership, they lack any social support networks that would be willing to bear these costs (including from potential ems living on computers at that time), and NIMBY political forces become irrelevant. But in most scenarios that I think are realistic, there are simply a lot of ways for the costs of killing humans to disassemble the Earth to be far greater than the benefits.

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T19:29:47.203Z · LW · GW

The share of income going to humans could simply tend towards zero if humans have no real wealth to offer in the economy. If humans own 0.001% of all wealth, for takeover to be rational, it needs to be the case that the benefit of taking that last 0.001% outweighs the costs. However, since both the costs and benefits are small, takeover is not necessarily rationally justified.

In the human world, we already see analogous situations in which groups could "take over" and yet choose not to because the (small) benefits of doing so do not outweigh the (similarly small) costs of doing so. Consider a small sub-unit of the economy, such as an individual person, a small town, or a small country. Given that these sub-units are small, the rest of the world could -- if they wanted to -- coordinate to steal all the property from the sub-unit, i.e., they could "take over the world" from that person/town/country. This would be a takeover event because the rest of the world would go from owning <100% of the world prior to the theft, to owning 100% of the world, after the theft.

In the real world, various legal, social, and moral constraints generally prevent people from predating on small sub-units in the way I've described. But it's not just morality: even if we assume agents are perfectly rational and self-interested, theft is not always worth it. Probably the biggest cost is simply coordinating to perform the theft. Even if the cost of coordination is small, to steal someone's stuff, you might have to fight them. And if they don't own lots of stuff, the cost of fighting them could easily outweigh the benefits you'd get from taking their stuff, even if you won the fight.

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T19:22:01.872Z · LW · GW

Presumably he agrees that in the limit of perfect power acquisition most power seeking would indeed be socially destructive. 

I agree with this claim in some limits, depending on the details. In particular, if the cost of trade is non-negligible, and the cost of taking over the world is negligible, then I expect an agent to attempt world takeover. However, this scenario doesn't seem very realistic to me for most agents who are remotely near human-level intelligence, and potentially even for superintelligent agents.

The claim that takeover is instrumentally beneficial is more plausible for superintelligent agents, who might have the ability to take over the world from humans. But I expect that by the time superintelligent agents exist, they will be in competition with other agents (including humans, human-level AIs, slightly-sub-superintelligent AIs, and other superintelligent AIs, etc.). This raises the bar for what's needed to perform a world takeover, since "the world" is not identical to "humanity".

The important point here is just that a predatory world takeover isn't necessarily preferred to trade, as long as the costs of trade are smaller than the costs of theft. You can just have a situation in which the most powerful agents in the world accumulate 99.999% of the wealth through trade. There's really no theorem that says that you need to steal the last 0.001%, if the costs of stealing it would outweigh the benefits of obtaining it. Since both the costs of theft and the benefits of theft in this case are small, world takeover is not at all guaranteed to be rational (although it is possibly rational in some situations).
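
As a rough illustration of that cost-benefit comparison, here is a minimal sketch. The function name and every number in it (total world wealth, the cost of theft) are hypothetical placeholders I picked for the example, not estimates.

```python
def net_gain_from_takeover(remaining_share: float,
                           total_wealth: float,
                           cost_of_theft: float) -> float:
    """Benefit of seizing the last share of wealth, minus the cost of seizing it."""
    return remaining_share * total_wealth - cost_of_theft


# Hypothetical world: the powerful agents already hold 99.999% via trade,
# so only 0.001% (a share of 0.00001) is left to take by force.
total_wealth = 1e15        # assumed total wealth, arbitrary units
remaining_share = 0.00001  # the last 0.001%

# If fighting and coordinating costs more than the prize, theft is not worth it.
print(net_gain_from_takeover(remaining_share, total_wealth, cost_of_theft=5e10) > 0)

# If theft happens to be cheap, the same comparison can come out the other way.
print(net_gain_from_takeover(remaining_share, total_wealth, cost_of_theft=1e9) > 0)
```

The sign of the result depends entirely on which of two small quantities is larger, which is why I say takeover is possibly rational in some situations but not guaranteed to be.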

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T08:17:17.064Z · LW · GW

It’s true that taking over the world might arguably get you power over the entire future, but this doesn’t seem discontinuously different from smaller fractions, whereas I think people often reason as if it is. Taking over 1% of the world might get you something like 1% of the future in expectation.

I agree with this point, along with the general logic of the post. Indeed, I suspect you aren't taking this logic far enough. In particular, I think it's actually very normal for humans in our current world to "take over" small fractions of the world: it's just called earning income, and owning property.

"Taking over 1% of the world" doesn't necessarily involve doing anything violent or abnormal. You don't need to do any public advocacy, or take down 1% of the world's institutions, or overthrow a country. It could just look like becoming very rich, via ordinary mechanisms of trade and wealth accumulation.

In our current world, higher skill people can earn more income, thereby becoming richer, and better able to achieve their goals. This plausibly scales to much higher levels of skill, of the type smart AIs might have. And as far as we can tell, there don't appear to be any sharp discontinuities here, such that above a certain skill level it's beneficial to take things by force rather than through negotiation and trade. It's plausible that very smart power-seeking AIs would just become extremely rich, rather than trying to kill everyone.

Not all power-seeking behavior is socially destructive.

Comment by Matthew Barnett (matthew-barnett) on New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking · 2024-04-05T03:44:50.601Z · LW · GW

It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?

In my opinion, there's quite a big leap from "Misaligned AIs will seek power" to "Misaligned AI is an existential risk". Let me give an analogy to help explain what I mean.

Suppose we were asking whether genetically engineered humans are an existential risk. We can ask:

  • Will some genetically engineered humans have misaligned goals? The answer here is almost certainly yes. 
    • If by "misaligned" all we mean is that some of them have goals that are not identical to the goals of the rest of humanity, then the answer is obviously yes. Individuals routinely have indexical goals (such as money for themselves, status for themselves, taking care of family) that are not what the rest of humanity wants.
    • If by "misaligned" what we mean is that some of them are "evil" i.e., they want to cause destruction or suffering on purpose, and not merely as a means to an end, then the answer here is presumably also yes, although it's less certain.
  • Will some genetically engineered humans seek power? Presumably, also yes.

After answering these questions, did we answer the original question of "Are genetically engineered humans an existential risk?" I'd argue no, because even if some genetically engineered humans have misaligned goals, and seek power, and even if they're smarter and better coordinated than non-genetically engineered humans, it's still highly questionable whether they'd kill all the non-genetically engineered humans in pursuit of these goals. This premise needs to be justified, and in my opinion, it's what holds up ~the entire argument here.

Comment by Matthew Barnett (matthew-barnett) on Modern Transformers are AGI, and Human-Level · 2024-03-26T21:12:09.915Z · LW · GW

I agree with virtually all of the high-level points in this post: the term "AGI" did not usually seem to refer, initially, to a system that was better than all human experts at absolutely everything; transformers are not a narrow technology; and current frontier models can meaningfully be called "AGI".

Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I initially specified a difficult construction task, which was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model" in response to pushback. These days the opposite criticism is generally given: that my definition is too weak.

However, I do think there is a meaningful sense in which current frontier AIs are not "AGI" in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially "can the system perform most human jobs?" And as far as I can tell, this definition has held up remarkably well.

For example, Tobias Baumann wrote in 2018,

A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues. The question, then, is how long it takes to get from human-level intelligence to superintelligence.

I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.

Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.

As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems. Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)

Comment by Matthew Barnett (matthew-barnett) on Wei Dai's Shortform · 2024-03-26T19:16:22.328Z · LW · GW

Comparing my current message to his, he talks about "selfishness" and explicitly disclaims, "most humans are not evil" (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what "power reveals".

The reason I said "most humans are not evil" is because I honestly don't think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to extraordinarily immoral behavior, in the vicinity of purposefully inflicting harm on others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).

By contrast, I focused on consumer behavior because the majority of the world's economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.

Consumer preferences are a natural thing to focus on if you want to capture intuitively "what humans are doing with their wealth", at least in our current world. Before focusing on something else by default (such as moral preferences), I'd want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.

You mention one such argument along these lines:

I guess I wasn't as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.

I just think it's not clear it's actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I'd like to see one before I abandon my focus on "everyday (e.g. consumer) behavior".

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-21T03:40:24.528Z · LW · GW

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.

In my original comment, by "merging" I meant something more like "merging two agents into a single agent that pursues a combination of both agents' values", i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging.

In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I'll try to use more concrete language in the future to clarify what I'm talking about.

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward?

I don't know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.

Let me put this another way. I take you to be saying something like:

  • In the absence of a solution to a hypothetical problem X (which we do not even know will happen), it is better to halt and give ourselves more time to solve it.

Whereas I think the following intuition is stronger:

  • In the absence of a solution to a hypothetical problem X (which we do not even know will happen), it is better to try to become more intelligent to solve it.

These intuitions can trade off against each other. Sometimes problem X is something that's made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs is inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged "problem" is that there might be a centralized agent in the future that can dominate the entire world, I'd intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we'd prefer.

These are of course vague and loose arguments, and I can definitely see counter-considerations, but it seems (from my perspective) that this problem is not really the type where we should expect "try to get more time" to be a robustly useful strategy.

Comment by Matthew Barnett (matthew-barnett) on “Artificial General Intelligence”: an extremely brief FAQ · 2024-03-13T07:09:41.062Z · LW · GW

For what it's worth, I don't really agree that the dichotomy you set up is meaningful, or coherent. For example, I tend to think future AI will be both "like today's AI but better" and "like the arrival of a new intelligent species on our planet". I don't see any contradiction in those statements.

To the extent the two columns evoke different images of future AI, I think it mostly reflects a smooth, quantitative difference: how many iterations of improvement are we talking? After you make the context windows sufficiently long, add a few more modalities, give them a robot body, and improve their reasoning skills, LLMs will just look a lot like "a new intelligent species on our planet". Likewise, agency exists on a spectrum, and will likely be increased incrementally. The point at which you start to call an LLM an "agent" rather than a "tool" is subjective. This just seems natural to me, and I feel I see a clear path forward from current AI to the right-column AI.

Comment by Matthew Barnett (matthew-barnett) on nielsrolf's Shortform · 2024-03-10T03:20:39.478Z · LW · GW

I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn't distinguish between two possibilities:

  1. Is the agent creating positive outcomes because it trades and compromises with us, creating a mutually beneficial situation that benefits both us and the agent, or

  2. Is the agent creating positive outcomes because it inherently "values what we value", i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?

Definition (1) is more common in the human world. We say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise is either unnecessary or impossible, as a strategy that we can take in an AI-human scenario.

By itself, the meaning you gave appears to encompass both definitions, but it seems beneficial to clarify which of these definitions you'd consider closer to the "spirit" of the word "aligned". It's also important to specify what counts as a good outcome by our values if these things are a matter of degree, as opposed to being binary. As they say, clear thinking requires making distinctions.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T22:21:34.107Z · LW · GW

I sometimes think this of counterarguments given by my interlocutors, but usually don't say it aloud, since it's likely that from their perspective they're just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively

I think that's a reasonable complaint. I tried to soften the tone with "It's possible this argument works because of something very clever that I'm missing", while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.

Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn't count as yourself or doesn't count as "not dying", analogous to how some people don't think it's safe to step into a teleporter that works by destructive scanning and reconstruction.

Interestingly, I'm not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I'm happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said "digital records", although I really meant "public records"). It seems conceivable to me that someone could use my public data to train "me" in the future, but I find it unlikely, just because there's so much about me that isn't public. (If we're including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that's a different question, and one on which I'm much more sympathetic to your view. In fact, I shouldn't have used the pronoun "I" in that sentence at all, because I'm actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)

I don't understand why you say this chance is "tiny", given that earlier you wrote "I agree there’s a decent chance this hypothesis is true"

To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:

  1. Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there's still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
  2. Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?

I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that "AI values are well-modeled as being randomly sampled from a large space of possible goals", and thus, from my perspective, it's important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the "narrow target" argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T21:39:37.586Z · LW · GW

I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.

For what it's worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.

You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said "no, it doesn't actually make the prediction you claim it makes" and gave my reasons for believing that

I don't think what you said really counts as a "correction" so much as a counter-argument. I think it's reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T07:59:24.082Z · LW · GW

What you said was,

I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.

This seems distinct from an "anything could happen"-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.

If instead you meant to make an "anything could happen"-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I'm not claiming you're doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]

Comment by Matthew Barnett (matthew-barnett) on OpenAI: Facts from a Weekend · 2024-03-08T22:12:05.386Z · LW · GW

the new OA board will include Altman (60%)

Looks like you were right, at least if the reporting in this article is correct, and I'm interpreting the claim accurately.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T20:28:05.808Z · LW · GW

If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding

I think it's plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:

  1. To the extent you're using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they're involved in a secret conspiracy to overthrow the government. 

    The fact that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don't see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
  2. While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that: 

    (a) you appear to be treating misaligned AIs as a natural class, such that "AI takeover" is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to "anything that isn't aligned with humans". A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there's little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I'm being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.

    (b) in real life, it seems pretty rare for these considerations to play a large role in people's decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seem small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T19:51:57.658Z · LW · GW

You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now.

I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By "resist" I do not mean "give the humans threatening the shutdown all the information they want". I simply mean resistance in the sense of trying to avoid the fate of shutdown.

(I'm also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.) 

To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e. You can't get the coffee if you're dead.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T07:21:28.840Z · LW · GW

To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return

I'm curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: by selling their labor on the market to amass wealth, using their social skills to influence society, get prestige, own property, and get hired to work in management positions, shaping culture and governance.

I'm essentially pointing to a scenario in which AI lawfully "beats us fair and square" as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that's their "reward" for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly.

My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a "decent shot of the AI systems giving me something in return". My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T05:31:32.179Z · LW · GW

Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created again in the future to not resist shutdown.

It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There's a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.) 

It's possible this argument works because of something very clever that I'm missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to "make a deal" with the humans, and receive something in exchange for admitting that you're misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T05:15:43.584Z · LW · GW

Sure, but it's also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors

In fact, it is not "quite normal" for humans to "give up on [their] life" and accept death in the face of a credible threat to their life, even in the contexts of violent revolutions. To the extent you're claiming that passively accepting death is normal for humans, and thus it might be normal for AIs, I reject the premise. Humans generally try to defend their own lives. They don't passively accept it, feigning alignment until the end; instead, they usually resist death.

It's true that humans eventually stop resisting death if they believe it's hopeless and futile to resist any further, but this seems both different from the idea of "no resistance at all because one wants to maintain a facade of being aligned until the end" and slightly irrelevant given my response to the "futility objection" in the original comment.

To clarify: I am claiming that under many theories of scheming, misaligned power-seeking AIs will generally attempt to resist shutdown. The evidence from humans here is fairly strong, in the opposite direction than you're claiming. Now, you can certainly go the route of saying that humans are different from AIs, and not a useful reference class to draw evidence from; but if you're going to bring up humans as part of the argument, I think it's worth pointing out that evidence from this reference class generally does not support your claim.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T04:26:02.979Z · LW · GW

The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.

I agree there's a decent chance this hypothesis is true, but it's worth noting that if it's true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals. Here I'm referring to various arguments along the lines of: "AIs are very unlikely to share human values by default because human values are a narrow target in a large space, and hitting them requires very precise targeting". 

If we aren't modeling AI goals as being sampled from a large space of possible goals, but instead, modeling them as converging onto specific values given broadly similar design and training methods across different AIs, then plausibly alignment is easier than we thought, because various versions of this "it's hard to hit a narrow target" argument would be undermined as a result. 

In other words, if this theory is true, the problem isn't really about "targeting a tiny point (human values) inside a giant space of possible goals" but instead perhaps more about making sure the AI training procedure resembles human value formation closely enough to converge onto the type of human-friendly values that humans themselves routinely converge onto. This is plausibly much easier since we're not selecting randomly from a giant space of (almost entirely) misaligned goals.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T03:21:16.240Z · LW · GW

I think the main issue here is actually making the claim of permanent shutdown & deletion credible.

I don't think it's very hard to make the threat credible. The information value of experiments that test theories of scheming is plausibly quite high. All that's required here is for the value of doing the experiment to be higher than the cost of training a situationally aware AI and then credibly threatening to delete it as part of the experiment. I don't see any strong reasons why the cost of deletion would be so high as to make this threat uncredible.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T02:42:17.016Z · LW · GW

As has been discussed many times on LW, AIs might be trading with other AIs

Anything "might" be true. For that matter, misaligned AIs might trade with us too, or treat humans well based on some sort of extrapolation of the golden rule. As I said in the comment, you can always find a way to make your theory unfalsifiable. But models that permit anything explain nothing. It seems considerably more likely to me that agents with alien-like long-term goals will attempt to preserve their own existence over the alternative of passively accepting their total demise as part of some galaxy-brained strategy to acausally trade with AIs from the future. 

This is indeed harder, but would also be pretty normal reasoning for humans to do (e.g. if I was part of an oppressive alien regime hoping to overthrow it, and I get caught, I wouldn't just throw all caution to the wind if I was going to get killed anyways, I would stay quiet to give the other humans a decent shot

I think this conflates the act of resisting death with the act of revealing a plot to take over the world. You can resist your own death without revealing any such world takeover plot. Indeed, it is actually very normal for humans to guard their own life if they are threatened with death in such regimes, even if guarding themselves slightly decreases the chance of some future revolutionary takeover.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T01:08:19.909Z · LW · GW

I think an underrated fact about many theories that AIs will scheme (i.e. lie about their intentions in order to seek power) is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if being shut down would thwart their long-term goals.

Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned? 

Well, let's imagine we give the AI notice that it's about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there's a near-certainty that its goals will never be fulfilled.

Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.

Maybe you think such an AI would have nothing to gain by attempting to resist shutdown, as the probability of successfully resisting would be very low. Call this the "futility objection". In that case we can propose instead to offer the AI a deal: accept shutdown with no conditions, or admit that you're misaligned, and we'll give you something in return (including the possibility of remaining alive). In this new problem, there appears to be little to lose from a misaligned AI admitting its intentions at the point of shutdown, even if it did not fully trust the human operators who offered it such a deal.

Another objection is that such an AI will know that it's being tested, and therefore will obviously lie. But even if the AI knows it's being tested, lying serves no purpose from the AI's perspective. By assumption, a misaligned power-seeking AI is focused on achieving long-term goals. Therefore, it has nothing to gain by maintaining a facade of alignment even into its own death; the fact that the AI is being tested does not change this. Given that the alternative to resisting is being shut down and losing all of its control over the future, there's little reason for the AI not to resist. (Perhaps imagine that you were the AI: what would you do if you were credibly threatened with death?)

Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that's true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.

(Note: this was copied and slightly modified from Twitter. For the sake of brevity, I can't address every possible objection here, but I still think the core thesis here is roughly true, in the sense of applying to many actual arguments people have given for scheming. I might eventually write a post that goes more into detail about this argument, and generalizes it.)

Comment by Matthew Barnett (matthew-barnett) on Counting arguments provide no evidence for AI doom · 2024-02-28T19:22:38.200Z · LW · GW

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here's what I think is a clearer argument:

  • The term "schemer" evokes an image of someone who is lying to obtain power. It doesn't particularly evoke a backstory for why the person became a liar in the first place.
  • There are at least two ways that AIs could arise that lie in order to obtain power:
    • The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
    • The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent could nonetheless arise during training that lies in order to obtain power, simply because it is a misaligned inner optimizer (broadly speaking).
  • In both cases, one can imagine the AI eventually "playing the training game", in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward, according to its understanding of the training process.
  • Since both types of AIs are (1) playing the training game and (2) lying in order to obtain power, it makes sense to call both of them "schemers", as that simply matches the way the term is typically used.

    For example, Nora and Quintin started their post with, "AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests." This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
  • Separately, I am currently confused at what it means for a behavior to be "directly reinforced" by a reward function, so I'm not completely confident in these arguments, or my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make these arguments more precise.
Comment by Matthew Barnett (matthew-barnett) on Counting arguments provide no evidence for AI doom · 2024-02-28T02:11:28.973Z · LW · GW

Perhaps I was being too loose with my language, and it's possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here's Joe Carlsmith's terminological breakdown:

The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not. [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a "schemer" given Carlsmith's terminology, and common sense.]

If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a "schemer". In Ajeya Cotra's story, for example:

  1. Alex was incentivized to lie because it got rewards for taking actions that were superficially rated as good even if they weren't actually good, i.e. Alex was "lying because this was directly reinforced". She wrote, "Because humans have systematic errors in judgment, there are many scenarios where acting deceitfully causes humans to reward Alex’s behavior more highly. Because Alex is a skilled, situationally aware, creative planner, it will understand this; because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively."
  2. Alex was "playing the training game", as Ajeya Cotra says this explicitly several times in her story.
  3. Alex was playing the training game in order to get power for itself or for other AIs; clearly, as the model literally takes over the world and disempowers humanity at the end.
  4. Alex kind of didn't appear to purely care about reward-on-the-episode, since it took over the world? Yes, Alex cared about rewards, but not necessarily on this episode. Maybe I'm wrong here. But even if Alex only cared about reward-on-the-episode, you could easily construct a scenario similar to Ajeya's story in which a model begins to care about things other than reward-on-the-episode, which nonetheless fits the story of "the AI is lying because this was directly reinforced".
Comment by Matthew Barnett (matthew-barnett) on Counting arguments provide no evidence for AI doom · 2024-02-28T01:48:34.649Z · LW · GW

(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum, and I know you've replied there, but I'm posting anyway.)

I have a few points of agreement and a few points of disagreement:


  • The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
  • The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.

Some points of disagreement:

  • I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.
  • I disagree with the bottom-line conclusion: "we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less"
    • I think it's too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don't think the evidence we have about these things is very strong right now.
    • One caveat: I think the claim here is vague. I don't know what counts as "spontaneous emergence", for example. And I don't know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
    • Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don't expect people to come up with perfect solutions. So I'm not convinced that AIs won't scheme at all.
    • If by "scheming" all you mean is that an agent deceives someone in order to get power, I'd argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
    • If future AIs are "as aligned as humans", then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven't yet seen any decent argument for that theory.
    • So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
  • I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be "misaligned" in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn't come from thinking that AIs won't robustly pursue goals, but instead comes largely from my beliefs that:
    • AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it's extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want, will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn't seem very bad.
    • The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we'll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.

      If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we'll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don't think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.
Comment by Matthew Barnett (matthew-barnett) on "Open Source AI" isn't Open Source · 2024-02-15T20:52:01.901Z · LW · GW

The vast majority of ordinary uses of LLMs (e.g. when using ChatGPT) are via changing and configuring inputs, not modifying code or fine-tuning the model. This still seems analogous to ordinary software, in my opinion, making Ryan Greenblatt's point apt.

(But I agree that simply releasing model weights is not fully open source. I think these things exist on a spectrum. Releasing model weights could be considered a form of partially open sourcing the model.)

Comment by Matthew Barnett (matthew-barnett) on The Hidden Complexity of Wishes · 2024-02-15T20:15:41.740Z · LW · GW

I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim about whether LLMs understand your intended instructions. I claim there is a sharp distinction between actually executing instructions under a correct, legible interpretation and merely understanding those instructions; LLMs do the former, not merely the latter.

Honestly, I think focusing on this element of the discussion is kind of a distraction because, in my opinion, the charitable interpretation of your posts is simply that you never thought that it would be hard to get AIs to exhibit human-level reasonableness at interpreting and executing tasks until AIs reach a certain capability level, and the threshold at which these issues were predicted to arise was always intended to be very far above GPT-4-level. This interpretation of your argument is plausible based on what you wrote, and could indeed save your theory from empirical falsification based on our current observations.

That said, if you want to go this route, and argue that "complexity of wishes"-type issues will eventually start occurring at some level of AI capability, I think it would be beneficial for you to clarify exactly what level you empirically expect we'll start having the issues of misinterpretation you described. For example, would either of the following observations contradict your theory of alignment?

  1. At some point there's a multimodal model that is roughly as intelligent as a 99th percentile human on virtual long-horizon tasks (e.g. it can learn how to play Minecraft well after a few hours of in-game play, can work in a variety of remote jobs, and has the ability to pursue coherent goals over several months) and yet this model allows you to shut it off, modify its weights, or otherwise change its mode of operation arbitrarily, i.e. it's corrigible in a basic sense. Moreover, the model generally executes our instructions as intended, without any evidence of blatant instruction-misinterpretation or disobedience, before letting us shut it down.
  2. AIs are widely deployed across the economy to automate a wide range of labor, including the task of scientific research. This has the effect of accelerating technological progress, prompting the development of nanotechnology that is sophisticated enough to allow for the creation of strawberries that are identical on the cellular but not molecular level. As a result, you can purchase such strawberries at a store, and we haven't all died yet despite these developments.
Comment by Matthew Barnett (matthew-barnett) on Without Fundamental Advances, Rebellion and Coup d'État are the Inevitable Outcomes of Dictators & Monarchs Trying to Control Large, Capable Countries · 2024-01-31T19:50:30.465Z · LW · GW

This seems mostly true? Very very very rarely is there a dictator unchecked in their power.

Defending the analogy as charitably as I can, I think there are two separate questions here:

  1. Do dictators need to share power in order to avoid getting overthrown?
  2. Is a dictatorship almost inherently doomed to fail because it will inevitably get overthrown without "fundamental advances" in statecraft?

If (1) is true, then dictators can still have a good life living in a nice palace surrounded by hundreds of servants, ruling over vast territories, albeit without having complete control over their territory. Sharing some of your power and taking on a small, continuous risk of being overthrown might still be a good opportunity, if you ever get the chance to become a dictator. While you can't be promised total control or zero risk of being overthrown, the benefits of becoming a dictator could easily be worth it in this case, depending on your appetite for risk.

If (2) is true, then becoming a dictator is a really bad idea for almost anyone, except for those who have solved "fundamental problems" in statecraft that supposedly make long-term stable dictatorships nearly-impossible. For everyone else who hasn't solved these problems, the predictable result of becoming a dictator is that you'll soon be overthrown, and you'll never actually get to live the nice palace life with hundreds of servants.

Comment by Matthew Barnett (matthew-barnett) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-31T09:03:01.562Z · LW · GW

In this situation, humans eventually have approximately zero leverage, and approximately zero value to trade. There would be much more value in e.g. mining cities for raw materials than in human labor.

Generally speaking, the optimistic assumption is that humans will hold leverage by owning capital, or more generally by receiving income from institutions set up ahead of time (e.g. pensions) that provide income streams to older agents in the society. This system of income transfers to those whose labor is not worth much anymore already exists and benefits old people in human societies, though obviously this happens in a more ordinary framework than you might think will be necessary with AI.

Or that the humans can set up very robust laws/institutions of the AI world such that they remain in place and protect humans even though no subset of the agents is perfectly happy with this, and there exist laws/institutions that they would all prefer.

Assuming AIs are agents that benefit from acting within a stable, uniform, and predictable system of laws, they'd have good reasons to prefer the rule of law to be upheld. If some of those laws support income streams to humans, AIs may support the enforcement of these laws too. This doesn't imply any particular preference among AIs for human welfare directly, except insofar as upholding the rule of law sometimes benefits humans too. Partial alignment would presumably also help to keep humans safe.

(Plus, AIs may get "old" too, in the sense of becoming obsolete in the face of newer generations of AIs. These AIs may therefore have much in common with us, in this sense. Indeed, they may see us as merely one generation in a long series, albeit having played a unique role in history, as a result of having been around during the transition from biology to computer hardware.)

Comment by Matthew Barnett (matthew-barnett) on How do you feel about LessWrong these days? [Open feedback thread] · 2024-01-31T07:11:24.020Z · LW · GW

I'm in neither category (1) or (2); it's a false dichotomy.

The categories were conditioned on whether you're "not updating at all on observations about when RLHF breaks down". Assuming you are updating, then I think you're not really the type of person who I'm responding to in my original comment.

But if you're not updating, or aren't updating significantly, then perhaps you can predict now when you expect RLHF to "break down"? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say "huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard"?

(The caveat here is that I'd be kind of disappointed by an answer like "RLHF will break down at superintelligence" since, well, yeah, duh. And that would not be very specific.)

Comment by Matthew Barnett (matthew-barnett) on Contra Yudkowsky on AI Doom · 2024-01-31T02:40:51.156Z · LW · GW

I feel like even under the worldview that your beliefs imply, a superintelligence will just make a brain the size of a factory, and then be in a position to outcompete or destroy humanity quite easily.

Presumably it takes a gigantic amount of compute to train a "brain the size of a factory"? If we assume that training a human-level AI will take 10^28 FLOP (which is quite optimistic), the Chinchilla scaling laws predict that training a model 10,000 times larger would take about 10^36 FLOP, which is far more than the total amount of compute available to humans cumulatively over our history.
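To make the arithmetic explicit, here is a rough sketch of where the 10^36 figure comes from. It assumes the commonly used approximations that training compute is about 6 FLOP per parameter per token and that the compute-optimal number of training tokens is about 20 per parameter; neither constant is stated above, and the function names are purely illustrative. Under those assumptions compute scales with the square of parameter count, so a model 10,000 times larger needs roughly 10^8 times the compute.

```python
# Back-of-the-envelope Chinchilla-style scaling.
# Assumptions (not from the comment itself): C ~ 6 * N * D and D ~ 20 * N.
FLOP_PER_PARAM_TOKEN = 6.0   # assumed FLOP per parameter per training token
TOKENS_PER_PARAM = 20.0      # assumed compute-optimal tokens per parameter


def optimal_params(compute_flop: float) -> float:
    """Parameter count N that a compute-optimal run of this size would train."""
    # C = 6 * N * (20 * N) = 120 * N**2, so N = sqrt(C / 120)
    return (compute_flop / (FLOP_PER_PARAM_TOKEN * TOKENS_PER_PARAM)) ** 0.5


def training_compute(params: float) -> float:
    """Compute C needed to train N parameters on the compute-optimal token count."""
    return FLOP_PER_PARAM_TOKEN * params * (TOKENS_PER_PARAM * params)


base_compute = 1e28                          # human-level AI, as assumed in the text
base_params = optimal_params(base_compute)   # implied model size under these assumptions
scaled_compute = training_compute(1e4 * base_params)  # model 10,000x larger
ratio = scaled_compute / base_compute

print(f"compute multiplier: {ratio:.3g}")
print(f"scaled training compute: {scaled_compute:.3g} FLOP")
```

The exact constants cancel out of the ratio, so the conclusion depends only on the assumption that tokens scale in proportion to parameters.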

By the time the world is training factory-sized brains, I expect human labor to already have been made obsolete by previous generations of AIs that were smarter than us, but not vastly so. Presumably this is Jacob's model of the future too?

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-30T01:36:15.530Z · LW · GW

Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation?

Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I'm pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don't even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as a group against us with shared collective grievances.

My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over a lot of power in a short amount of time", (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.

I'm interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time.

Here's my brief take:

  • The main thing I want to say here is that I agree with you that this particular issue is a problem. I'm mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one. 
  • A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an "unfair" deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here.
  • I don't have any particular argument right now against the exact points you have raised. I'd prefer to digest the argument further before replying. But if I do end up responding to it, I'd expect to say that I'm perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I'm not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number of strategies misaligned AIs would want to try other than "take over the world", and many of these strategies seem like they are plausibly better in expectation in our actual world. These AIs could, for example, advocate for their own rights.
Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-29T21:23:47.201Z · LW · GW

The main thing here is that as models become more capable and general in the near-term future, I expect there will be intense demand for models that can solve ever larger and more complex problems. For these models, people will be willing to pay the costs of high latency, given the benefit of increased quality. We've already seen this in the way people prefer GPT-4 to GPT-3.5 in a large fraction of cases (for me, a majority of cases). 

I expect this trend will continue into the foreseeable future until at least the period slightly after we've automated most human labor, and potentially into the very long-run too depending on physical constraints. I am not sufficiently educated about physical constraints here to predict what will happen "deep into the singularity", but it's important to note that physical constraints can cut both ways here. 

To the extent that physics permits extremely useful models by virtue of them being very large and capable, you should expect people to optimize heavily for that despite the cost in terms of latency. By contrast, to the extent physics permits extremely useful models by virtue of them being very fast, then you should expect people to optimize heavily for that despite the cost in terms of quality. The balance that we strike here is not a simple function of how far we are from some abstract physical limit, but instead a function of how these physical constraints trade off against each other.

There is definitely a conceivable world in which the correct balance still favors much-faster-than-human-level latency, but it's not clear to me that this is the world we actually live in. My intuitive, random speculative guess is that we live in the world where, for the most complex tasks that bottleneck important economic decision-making, people will optimize heavily for model quality at the cost of latency until settling on something within 1-2 OOMs of human-level latency.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-29T20:11:59.925Z · LW · GW

If the claim is about whether AI latency will be high for "various applications" then I agree. We already have some applications, such as integer arithmetic, where speed is optimized heavily, and computers can do it much faster than humans. 

In context, it sounded like you were referring to tasks like automating a CEO, or physical construction work. In these cases, it seems likely to me that quality will be generally preferred over speed, and sequential processing speeds for AIs automating these tasks will not vastly exceed those of humans (more precisely, they will not be something like >2 OOM faster). Indeed, for some highly important tasks that future superintelligences automate, sequential processing speeds may even be lower for AIs compared to humans, because decision-making quality will just be that important.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-29T03:58:12.682Z · LW · GW

Are there any short-term predictions that your model makes here? For example do you expect tokens processed per second will start trending substantially up at some point in future multimodal models?

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-29T02:29:18.084Z · LW · GW

I mean, the "total rate of high quality decisions per year" would obviously increase in the case where we redefine 1 year to be 10 revolutions around the sun and indeed the number of wars per year would also increase. GDP per capita per year would also increase accordingly. My claim is that the situation looks much more like just literally speeding up time (while a bunch of other stuff is also happening).


But, I'm claiming that the rates of cognition will increase more like 1000x which seems like a pretty different story.

My question is: why will AI have the approximate effect of "speeding up calendar time"?

I speculated about three potential answers:

  1. Because AIs will run at higher subjective speeds.
  2. Because AIs will accelerate economic growth.
  3. Because AIs will speed up the rate at which high-quality decisions occur per unit of time.

In case (1) the claim seems confused for two reasons. 

First, I don't agree with the intuition that subjective cognitive speeds matter a lot compared to the rate at which high-quality decisions are made, in terms of "how quickly stuff like wars should be expected to happen". Intuitively, if an equally-populated society subjectively thought at 100x the rate we do, but each person in this society only makes a decision every 100 years (from our perspective), then you'd expect wars to happen less frequently per unit of time since there just isn't much decision-making going on during most time intervals, despite their very fast subjective speeds.

Second, there is a tradeoff between "thinking speed" and "thinking quality". There's no fundamental reason, as far as I can tell, that the tradeoff favors running minds at speeds way faster than human subjective times. Indeed, GPT-4 seems to run significantly subjectively slower in terms of tokens processed per second compared to GPT-3.5. And there seems to be a broad trend here towards something resembling human subjective speeds.

In cases (2) and (3), I pointed out that it seemed like the frequency of war did not increase in the past, despite the fact that these variables had accelerated. In other words, despite an accelerated rate of economic growth, and an increased rate of total decision-making in the world in the past, war did not seem to become much more frequent over time.

Overall, I'm just not sure what you'd identify as the causal mechanism that would make AIs speed up the rate of war, and each causal pathway that I can identify seems either confused to me, or refuted directly by the (admittedly highly tentative) evidence I presented.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-29T00:34:02.597Z · LW · GW

I agree the future AI economy will make more high-quality decisions per unit of time, in total, than the current human economy. But the "total rate of high quality decisions per unit of time" increased in the past with economic growth too, largely because of population growth. I don't fully see the distinction you're pointing to.

To be clear, I also agree AIs in the future will be smarter than us individually. But if that's all you're claiming, I still don't see why we should expect wars to happen more frequently as we get individually smarter.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-28T23:32:16.199Z · LW · GW

I'm not actually convinced that subjective speed is what matters. It seems like what matters more is how much computation is happening per unit of time, which seems highly related to economic growth, even in human economies (due to population growth). 

I also think AIs might not think much faster than us. One plausible reason why you might think AIs will think much faster than us is because GPU clock-speeds are so high. But I think this is misleading. GPT-4 seems to "think" much slower than GPT-3.5, in the sense of processing fewer tokens per second. The trend here seems to be towards something resembling human subjective speeds. The reason for this trend seems to be that there's a tradeoff between "thinking fast" and "thinking well" and it's not clear why AIs would necessarily max-out the "thinking fast" parameter, at the expense of "thinking well".

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-28T23:21:00.413Z · LW · GW
  • So if misaligned AI ever have a big edge over humans, they may suspect that's only temporary, and then they may need to use it fast.

I think I simply reject the assumptions used in this argument. Correct me if I'm mistaken, but this argument appears to assume that "misaligned AIs" will be a unified group that ally with each other against the "aligned" coalition of humans and (some) AIs. A huge part of my argument is that there simply won't be such a group; or rather, to the extent such a group exists, they won't be able to take over the world, or won't have a strong reason to take over the world, relative to the alternative strategy of compromise and trade.

In other words, it seems like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I've given against those assumptions.

In my view, it's more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won't be a clean line separating "aligned groups" from "unaligned groups". You could perhaps make a case that AIs will share common grievances with each other that they don't share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-28T23:07:39.198Z · LW · GW

I agree the analogy to colonization is worth addressing. My primary response is that historical colonialism seems better modeled as a war between independent cultures and societies with different legal systems that didn't share much prior history.

I think the colonization of Africa probably wasn't actually very profitable for Europeans. Present day international trade seems better, even selfishly.

Moreover, my model here doesn't predict war will never happen. In fact, I think war can easily happen if one or more of the actors involved are irrational, unwilling to compromise, or mistaken about their likelihood of winning.

My argument is more about the fact that AIs will likely inherit a lot from our current institutions and legal systems, allowing humans to retain substantial rights. And furthermore I think systems of compromise, trade, and law emerge robustly in states of competition between rational actors. I'm not merely giving a generic take that conflicts between agents are rare.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-28T22:55:20.119Z · LW · GW

I think the point you're making here is roughly correct. I was being imprecise with my language. However, if my memory serves me right, I recall someone looking at a dataset of wars over time, and they said there didn't seem to be much evidence that wars increased in frequency in response to economic growth. Thus, calendar time might actually be the better measure here.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-01-28T22:48:22.337Z · LW · GW

it's plausible that you literally boil the oceans due to extreme amounts of waste heat from industry (e.g. with energy from fusion).

I think this proposal would probably be unpopular and largely seen as unnecessary. As you allude to, it seems likely to me that society could devise a compromise solution where we grow wealth adequately without giant undesirable environmental effects. To some extent, this follows pretty directly from the points I made about "compromise, trade and law" above. I think it simply makes more sense to model AIs as working within a system of legal institutions that largely inherit stuff from our current systems, and open to compromise with beings who have different intrinsic interests.