Posts

Against most, but not all, AI risk analogies 2024-01-14T03:36:16.267Z
My thoughts on the social response to AI risk 2023-11-01T21:17:08.184Z
Announcing Epoch's newly expanded Parameters, Compute and Data Trends in Machine Learning database 2023-10-25T02:55:07.440Z
Evaluating the historical value misspecification argument 2023-10-05T18:34:15.695Z
Updating Drexler's CAIS model 2023-06-16T22:53:58.140Z
Are Emergent Abilities of Large Language Models a Mirage? [linkpost] 2023-05-02T21:01:33.142Z
Shutting down AI is not enough. We need to destroy all technology. 2023-04-01T21:03:24.448Z
Conceding a short timelines bet early 2023-03-16T21:49:35.903Z
A proposed method for forecasting transformative AI 2023-02-10T19:34:01.358Z
Noting an error in Inadequate Equilibria 2023-02-08T01:33:33.715Z
Slightly against aligning with neo-luddites 2022-12-26T22:46:42.693Z
Updating my AI timelines 2022-12-05T20:46:28.161Z
Could a single alien message destroy us? 2022-11-25T07:32:24.889Z
A conversation about Katja's counterarguments to AI risk 2022-10-18T18:40:36.543Z
The Most Important Century: The Animation 2022-07-24T20:58:55.869Z
A concrete bet offer to those with short AGI timelines 2022-04-09T21:41:45.106Z
Google's new 540 billion parameter language model 2022-04-04T17:49:23.484Z
Using prediction markets to generate LessWrong posts 2022-04-01T16:33:24.404Z
My current thoughts on the risks from SETI 2022-03-15T17:18:19.722Z
A comment on Ajeya Cotra's draft report on AI timelines 2022-02-24T00:41:48.192Z
Does needle anxiety drive vaccine hesitancy? 2022-02-11T19:35:21.730Z
Clarifying the palatability theory of obesity 2022-02-10T19:16:03.555Z
Forecasting progress in language models 2021-10-28T20:40:59.897Z
A review of Steven Pinker's new book on rationality 2021-09-29T01:29:58.151Z
The Solow-Swan model of economic growth 2021-08-29T18:55:34.848Z
Black ravens and red herrings 2021-07-27T17:46:03.640Z
Could Advanced AI Drive Explosive Economic Growth? 2021-06-30T22:17:23.875Z
How much interest would there be in a fringe theories wiki? 2021-06-28T21:03:16.298Z
A new acausal trading platform: RobinShould 2021-04-01T16:56:07.488Z
Conspicuous saving 2021-03-20T20:59:50.749Z
Defending the non-central fallacy 2021-03-09T21:42:17.068Z
My guide to lifelogging 2020-08-28T21:34:40.397Z
Preface to the sequence on economic growth 2020-08-27T20:29:24.517Z
What specific dangers arise when asking GPT-N to write an Alignment Forum post? 2020-07-28T02:56:12.711Z
Are veterans more self-disciplined than non-veterans? 2020-03-23T05:16:18.029Z
What are the long-term outcomes of a catastrophic pandemic? 2020-03-01T19:39:17.457Z
Gary Marcus: Four Steps Towards Robust Artificial Intelligence 2020-02-22T03:28:28.376Z
Distinguishing definitions of takeoff 2020-02-14T00:16:34.329Z
The case for lifelogging as life extension 2020-02-01T21:56:38.535Z
Inner alignment requires making assumptions about human values 2020-01-20T18:38:27.128Z
Malign generalization without internal search 2020-01-12T18:03:43.042Z
Might humans not be the most intelligent animals? 2019-12-23T21:50:05.422Z
Is the term mesa optimizer too narrow? 2019-12-14T23:20:43.203Z
Explaining why false ideas spread is more fun than why true ones do 2019-11-24T20:21:50.906Z
Will transparency help catch deception? Perhaps not 2019-11-04T20:52:52.681Z
Two explanations for variation in human abilities 2019-10-25T22:06:26.329Z
Misconceptions about continuous takeoff 2019-10-08T21:31:37.876Z
A simple environment for showing mesa misalignment 2019-09-26T04:44:59.220Z
One Way to Think About ML Transparency 2019-09-02T23:27:44.088Z
Has Moore's Law actually slowed down? 2019-08-20T19:18:41.488Z

Comments

Comment by Matthew Barnett (matthew-barnett) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-18T09:20:54.004Z · LW · GW

I also expect AIs to be constrained by social norms, laws, and societal values. But I think there's a distinction between how AIs will be constrained and how AIs will try to help humans. Although it often censors certain topics, Google still usually delivers the results the user wants, rather than serving some broader social agenda upon each query. Likewise, ChatGPT is constrained by social mores, but it's still better described as a user assistant, not as an engine for social change or as a benevolent agent that acts on behalf of humanity.

Comment by Matthew Barnett (matthew-barnett) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-18T00:41:28.816Z · LW · GW

No arbitrarily powerful AI could succeed at taking over the world

This is closest to what I am saying. The current world appears to be in a state of inter-agent competition. Even as technology has gotten more advanced, and as agents have gotten more powerful over time, no single unified agent has been able to obtain control over everything and win the entire pie, defeating all the other agents. I think we should expect this state of affairs to continue even as AGI gets invented and technology continues to get more powerful.

(One plausible exception to the idea that "no single agent has ever won the competition over the world" is the human species itself, which dominates over other animal species. But I don't think the human species is well-described as a unified agent, and I think our power comes mostly from accumulated technological abilities, rather than raw intelligence by itself. This distinction is important because the effects of technological innovation generally diffuse across society rather than giving highly concentrated powers to the people who invent stuff. This generally makes the situation with humans vs. animals disanalogous to a hypothetical AGI foom in several important ways.)

Separately, I also think that even if an AGI agent could violently take over the world, it would likely not be rational for it to try, due to the fact that compromising with the rest of the world would be a less risky and more efficient way of achieving its goals. I've written about these ideas in a shortform thread here.

Comment by Matthew Barnett (matthew-barnett) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-17T21:34:40.641Z · LW · GW

It sounds like you're thinking mostly of AI and not AGI that can self-improve at some point

I think you can simply have an economy of arbitrarily powerful AGI services, some of which contribute to R&D in a way that feeds into the entire development process recursively. There's nothing here about my picture that rejects general intelligence, or R&D feedback loops. 

My guess is that the actual disagreement here is that you think that at some point a unified AGI will foom and take over the world, becoming a centralized authority that is able to exert its will on everything else without constraint. I don't think that's likely to happen. Instead, I think we'll see inter-agent competition and decentralization indefinitely (albeit with increasing economies of scale, prompting larger bureaucratic organizations, in the age of AGI).

Here's something I wrote that seems vaguely relevant, and might give you a sense as to what I'm imagining,

Given that we are already seeing market forces shaping the values of existing commercialized AIs, it is confusing to me why an EA would assume this fact will at some point no longer be true. To explain this, my best guess is that many EAs have roughly the following model of AI development:

  1.  There is "narrow AI", which will be commercialized, and its values will be determined by market forces, regulation, and to a limited degree, the values of AI developers. In this category we find GPT-4 from OpenAI, Gemini from Google, and presumably at least a few future iterations of these products.
  2.  Then there is "general AI", which will at some point arrive, and is qualitatively different from narrow AI. Its values will be determined almost solely by the intentions of the first team to develop AGI, assuming they solve the technical problems of value alignment.

My advice is that we should probably just drop the second step, and think of future AI as simply continuing from the first step indefinitely, albeit with AIs becoming incrementally more general and more capable over time.

Comment by Matthew Barnett (matthew-barnett) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-17T19:40:29.818Z · LW · GW

Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

Comment by Matthew Barnett (matthew-barnett) on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T05:05:53.707Z · LW · GW

I think we probably disagree substantially on the difficulty of alignment and the relationship between "resources invested in alignment technology" and "what fraction aligned those AIs are" (by fraction aligned, I mean what fraction of resources they take as a cut).

That's plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I'm happier to roll the dice and hasten the arrival of imperfect AI, because I don't think it's worth trying very hard and waiting a long time to come up with a perfect solution that likely doesn't exist.

I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.

I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it's not an independent entity that tries to pursue long-term goals, but it will try to help you.

But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn't require much oversight and operates relatively independently from you. It's a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.

And I'm not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I'm not convinced that something like that exists. So, ultimately I think we're probably just going to deploy autonomous slightly misaligned AI agents (and again, I'm pretty happy to do that, because I don't think it would be catastrophic except maybe over the very long-run).

I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren't under the control of their citizens or leaders.

I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn't currently how people are thinking about the situation.

For what it's worth, I'm not sure which part of my scenario you are referring to here, because these are both statements I agree with. 

In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can't fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I'm a lot more willing than the median person to roll out autonomous AI agents that we can't fully control, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don't find that outcome as scary as most other people seem to imagine).

At the same time, I don't think people will pause forever. I expect people to go more slowly than what I'd prefer, but I don't expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment "slips through the cracks", then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control -- not overnight, or all at once, but eventually.

Comment by Matthew Barnett (matthew-barnett) on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T04:41:51.787Z · LW · GW

Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.

I largely agree. However, I'm having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans' values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too). 

I think what's crucial here is that I think perfect alignment is very likely unattainable. If that's true, then we'll get some form of "value drift" in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.

Comment by Matthew Barnett (matthew-barnett) on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T03:40:04.229Z · LW · GW

A thing I always feel like I'm missing in your stories of how the future goes is "if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don't people train competitor AIs which don't take a cut?"

People could try to do that. In fact, I expect them to do that, at first. However, people generally don't have unlimited patience, and they aren't perfectionists. If people don't think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we've done "enough" alignment work, even if it doesn't guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.

This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don't want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don't expect people to support total stasis. It's more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.

Even a very gradual loss of control, continuously compounded, eventually means that humans won't fully be in charge anymore.

In the medium to long-term, when AIs become legal persons, "replacing them" won't be an option -- as that would violate their rights. And creating a new AI to compete with them wouldn't eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.

Most of my "doom" scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we're first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don't like, for much the same reasons that young people can currently push the world in directions that old people don't like. 

Comment by Matthew Barnett (matthew-barnett) on "Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity · 2024-05-16T02:48:54.118Z · LW · GW

Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized.

I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won't be a big conspiracy. It won't really be "mysterious" if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it's out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives. 

AIs probably won't be "out to get us", even if they're unaligned. For example, I don't anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We'll simply become "old" and obsolete, and our power will wane as AIs become increasingly autonomous, legally independent, and more adapted to the modern environment than we are.

Staying in permanent control of the future seems like a long, hard battle. And it's not clear to me that this is a battle we should even try to fight in the long run. Gradually, humans may eventually lose control—not because of a sudden coup or because of coordinated scheming against the human species—but simply because humans won't be the only relevant minds in the world anymore.

Comment by Matthew Barnett (matthew-barnett) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-05-16T00:58:55.849Z · LW · GW

I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. They're just looking for an AI that helps them do whatever they personally want. 

In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like "an economy of AIs who (primarily) serve humans" than "a monolithic AGI that does stuff for the world (for good or ill)". The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that (perhaps uncharitably) the LessWrong community generally seems far too anchored on, for historical reasons.

Comment by Matthew Barnett (matthew-barnett) on RobertM's Shortform · 2024-05-14T01:59:51.412Z · LW · GW

I'm not sure if you'd categorize this under "scaling actually hitting a wall", but the main possibility that feels relevant in my mind is that progress simply is incremental in this case, as a fact about the world, rather than being a strategic choice on behalf of OpenAI. When underlying progress is itself incremental, it makes sense to release frequent small updates. This is common in the software industry, and it would not be at all surprising if what's true for most software development holds for OpenAI as well.

(Though I also expect GPT-5 to be a medium-sized jump, once it comes out.)

Comment by Matthew Barnett (matthew-barnett) on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T02:39:44.297Z · LW · GW

Yes, I expect AI labs will run extensive safety tests in the future on their systems before deployment. Mostly this is because I think people will care a lot more about safety as the systems get more powerful, especially as they become more economically significant and the government starts regulating the technology. I think regulatory forces will likely be quite strong at the moment AIs are becoming slightly smarter than humans. Intuitively I anticipate the 5 FTE-year threshold to be well-exceeded before such a model release.

Comment by Matthew Barnett (matthew-barnett) on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T02:25:51.838Z · LW · GW

Putting aside the question of whether AIs would depend on humans for physical support for now, I also doubt that these initial slightly-smarter-than-human AIs could actually pull off an attack that kills >90% of humans. Can you sketch a plausible story here for how that could happen, under the assumption that we don't have general-purpose robots at the same time?

Comment by Matthew Barnett (matthew-barnett) on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T02:11:41.573Z · LW · GW

I'm not saying AIs won't have a large impact on the world when they first start to slightly exceed human intelligence (indeed, I expect AIs-in-general will be automating lots of labor at this point in time). I'm just saying these first slightly-smarter-than-human AIs won't pose a catastrophic risk to humanity in a serious sense (at least not in an x-risk sense, and perhaps not in a more ordinary catastrophic sense either, partly for reasons of rational self-restraint).

Maybe some future slightly-smarter-than-human AIs can convince a human to create a virus, or something, but even if that's the case, I don't think it would make a lot of sense for a rational AI to do that given that (1) the virus likely won't kill 100% of humans, (2) the AIs will depend on humans to maintain the physical infrastructure supporting the AIs, and (3) if they're caught, they're vulnerable to shutdown since they would lose in any physical competition.

My sense is that people who are skeptical of my claim here will generally point to a few theses that I think are quite weak, such as:

  1. Maybe humans can be easily manipulated on a large scale by slightly-smarter-than-human AIs
  2. Maybe it'll be mere weeks or months between the first slightly-smarter-than-human AI and a radically superintelligent AI, making this whole discussion moot
  3. Maybe slightly smarter-than-human AIs will be able to quickly invent destructive nanotech despite not being radically superintelligent

That said, I agree there could be some bugs in the future that cause localized disasters if these AIs are tasked with automating large-scale projects, and they end up going off the rails for some reason. I was imagining a lower bar for "safe" than "can't do any large-scale damage at all to human well-being".

Comment by Matthew Barnett (matthew-barnett) on We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming" · 2024-05-10T01:26:32.121Z · LW · GW

Here's something that I suspect a lot of people are skeptical of right now but that I expect will become increasingly apparent over time (with >50% credence): slightly smarter-than-human software AIs will initially be relatively safe and highly controllable by virtue of not having a physical body and not having any legal rights.

In other words, "we will be able to unplug the first slightly smarter-than-human-AIs if they go rogue", and this will actually be a strategically relevant fact, because it implies that we'll be able to run extensive experimental tests on highly smart AIs without worrying too much about whether they'll strike back in some catastrophic way.

Of course, at some point, we'll eventually make sufficient progress in robotics that we can't rely on this safety guarantee, but I currently imagine at least a few years will pass between the first slightly-smarter-than-human software AIs, and mass manufactured highly dexterous and competent robots.

(Although I also think there won't be a clear moment in which the first slightly-smarter-than-human AIs will be developed, as AIs will be imbalanced in their capabilities compared to humans.)

Comment by Matthew Barnett (matthew-barnett) on Buck's Shortform · 2024-05-03T22:10:32.055Z · LW · GW

Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).

Can you be clearer about this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in?

For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

Comment by Matthew Barnett (matthew-barnett) on My guide to lifelogging · 2024-04-30T19:07:12.041Z · LW · GW

I'm happy to know that more people are trying out lifelogging.

Should I show him that other people do this and try to convince him that I'm not mentally ill?

While you could try showing him that others engage in this hobby, I'm not sure it would be effective in changing his perspective. I think a stronger argument is that lifelogging is harmless, as long as you're not recording people without their consent. The only real considerations are the convenience and storage costs, which you can manage on your own without outside support. Data storage is cheap these days, and easily affordable to someone with a part-time job.

Comment by Matthew Barnett (matthew-barnett) on Losing Faith In Contrarianism · 2024-04-27T21:53:10.410Z · LW · GW

But if the message that people received was "medicine doesn't work" (and it appears that many people did), then Scott's writings should be a useful update, independent of whether Hanson's-writings-as-intended was actually trying to deliver that message.

The statement I was replying to was: "I’d bet at upwards of 9 to 1 odds that Hanson is wrong about it."

If one is incorrect about what Hanson believes about medicine, then that fact is relevant to whether you should make such a bet (or more generally whether you should have such a strong belief about him being "wrong"). This is independent of whatever message people received from reading Hanson.

Comment by Matthew Barnett (matthew-barnett) on AI Regulation is Unsafe · 2024-04-27T03:02:58.095Z · LW · GW

non-consensually killing vast amounts of people and their children for some chance of improving one's own longevity.

I think this misrepresents the scenario, since AGI presumably won't just improve my own longevity: it will improve most people's longevity (assuming it does that at all), in addition to all the other benefits that AGI would provide the world. Also, both potential decisions are "unilateral": if some group forcibly stops AGI development, they're causing everyone else to non-consensually die from old age, by assumption.

I understand you have the intuition that there's an important asymmetry here. However, even if that's true, I think it's important to strive to be accurate when describing the moral choice here.

Comment by Matthew Barnett (matthew-barnett) on AI Regulation is Unsafe · 2024-04-26T23:34:48.182Z · LW · GW

And quantitatively I think it would improve overall chances of AGI going well by double-digit percentage points at least.

Makes sense. By comparison, my own unconditional estimate of p(doom) is not much higher than 10%, and so it's hard on my view for any intervention to have a double-digit percentage point effect.

The crude mortality rate before the pandemic was about 0.7%. If we use that number to estimate the direct cost of a 1-year pause, then this is the bar that we'd need to clear for a pause to be justified. I find it plausible that this bar could be met, but at the same time, I am also pretty skeptical of the mechanisms various people have given for how a pause will help with AI safety.
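To spell out the arithmetic, here's a minimal back-of-the-envelope sketch in Python. The ~8 billion population figure and the simple accounting (a one-year delay forgoes roughly one year's worth of deaths that transformative AI might otherwise have prevented) are illustrative assumptions, not precise claims:

```python
# Back-of-the-envelope sketch: the direct cost of a 1-year pause, using the
# ~0.7% pre-pandemic crude mortality rate, and the reduction in p(doom) a
# pause would need to buy to clear that bar. All numbers are illustrative.

world_population = 8e9          # assumed current world population
crude_mortality_rate = 0.007    # ~0.7% annual deaths (pre-pandemic)

deaths_from_one_year_delay = world_population * crude_mortality_rate
print(f"Deaths attributable to a 1-year delay: {deaths_from_one_year_delay:,.0f}")  # ~56,000,000

# Under a framing where doom means everyone dies, a 1-year pause clears the bar
# only if it reduces p(doom) by at least the fraction of people the delay costs.
required_p_doom_reduction = deaths_from_one_year_delay / world_population
print(f"Required reduction in p(doom): {required_p_doom_reduction:.1%}")  # ~0.7%
```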

Comment by Matthew Barnett (matthew-barnett) on AI Regulation is Unsafe · 2024-04-26T20:34:37.519Z · LW · GW

I don't think staging a civil war is generally a good way of saving lives. Moreover, ordinary aging has about a 100% chance of "killing literally everyone" prematurely, so it's unclear to me what moral distinction you're trying to make in your comment. It's possible you think that:

  1. Death from aging is not as bad as death from AI because aging is natural whereas AI is artificial
  2. Death from aging is not as bad as death from AI because human civilization would continue if everyone dies from aging, whereas it would not continue if AI kills everyone

In the case of (1) I'm not sure I share the intuition. Being forced to die from old age seems, if anything, worse than being forced to die from AI, since it is long and drawn-out, and presumably more painful than death from AI. You might also think about this dilemma in terms of act vs. omission, but I am not convinced there's a clear asymmetry here.

In the case of (2), whether AI takeover is worse depends on how bad you think an "AI civilization" would be in the absence of humans. I recently wrote a post about some reasons to think that it wouldn't be much worse than a human civilization.

In any case, I think this is simply a comparison between "everyone literally dies" vs. "everyone might literally die but in a different way". So I don't think it's clear that pushing for one over the other makes someone a "Dark Lord", in the morally relevant sense, compared to the alternative.

Comment by Matthew Barnett (matthew-barnett) on AI Regulation is Unsafe · 2024-04-26T17:53:16.538Z · LW · GW

So, it sounds like you'd be in favor of a 1-year pause or slowdown then, but not a 10-year?

That depends on the benefits that we get from a 1-year pause. I'd be open to the policy, but I'm not currently convinced that the benefits would be large enough to justify the costs.

Also, I object to your side-swipe at longtermism

I didn't side-swipe at longtermism, or try to dunk on it. I think longtermism is a decent philosophy, and I consider myself a longtermist in the dictionary sense as you quoted. I was simply talking about people who aren't "fully committed" to the (strong) version of the philosophy.

Comment by Matthew Barnett (matthew-barnett) on Losing Faith In Contrarianism · 2024-04-26T05:10:04.885Z · LW · GW

The next part of the sentence you quote says, "but it got eaten by a substack glitch". I'm guessing he's referring to a different piece from Sam Atis that is apparently no longer available?

Comment by Matthew Barnett (matthew-barnett) on Losing Faith In Contrarianism · 2024-04-25T23:49:52.158Z · LW · GW

Similarly, now that I’ve read through Scott’s response to Hanson on medicine, I’d bet at upwards of 9 to 1 odds that Hanson is wrong about it.

I'm broadly sympathetic to this post. I think a lot of people adjacent to the LessWrong cluster tend to believe contrarian claims on the basis of flimsy evidence. That said, I am fairly confident that Scott Alexander misrepresented Robin Hanson's position on medicine in that post, as I pointed out in my comment here. So, I'd urge you not to update too far on this particular question, at least until Hanson has responded to the post. (However, I do think Robin Hanson has stated his views on this topic in a confusing way that reliably leads to misinterpretation.)

Comment by Matthew Barnett (matthew-barnett) on The first future and the best future · 2024-04-25T19:58:55.612Z · LW · GW

Do you think it's worth slowing down other technologies to ensure that we prioritize care in how we use them over the benefits of speed? It's true that the stakes are lower for other technologies, but that mostly just means that both the upside potential and the downside risks are lower compared to AI, which doesn't by itself imply that we should go quickly.

Comment by Matthew Barnett (matthew-barnett) on AI Regulation is Unsafe · 2024-04-25T00:28:10.993Z · LW · GW

Until recently, people with P(doom) of, say, 10%, have been natural allies of people with P(doom) of >80%. But the regulation that the latter group thinks is sufficient to avoid xrisk with high confidence has, on my worldview, a significant chance of either causing x-risk from totalitarianism, or else causing x-risk via governments being worse at alignment than companies would have been.

I agree. Moreover, a p(doom) of 10% vs. 80% means a lot for people like me who think the current generation of humans have substantial moral value (i.e., people who aren't fully committed to longtermism). 

In the p(doom)=10% case, burdensome regulations that appreciably delay AI, or greatly reduce the impact of AI, have a large chance of causing the premature deaths of people who currently exist, including our family and friends. This is really bad if you care significantly about people who currently exist.

This consideration is sometimes neglected in these discussions, perhaps because it's seen as a form of selfish partiality that we should toss aside. But in my opinion, morality is allowed to be partial. Morality is whatever we want it to be. And I don't have a strong urge to sacrifice everyone I know and love for the sake of slightly increasing (in my view) the chance of the human species being preserved.

(The additional considerations of potential totalitarianism, public choice arguments, and the fact that I think unaligned AIs will probably have moral value, make me quite averse to very strong regulatory controls on AI.)

Comment by Matthew Barnett (matthew-barnett) on The argument for near-term human disempowerment through AI · 2024-04-16T06:27:55.380Z · LW · GW

I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.

Here are some of my thoughts (although these are not my only disagreements):

  • I think the definition of "disempowerment" is vague in a way that fails to distinguish between e.g. (1) "less than 1% of world income goes to humans, but they have a high absolute standard of living and are generally treated well" vs. (2) "humans are in a state of perpetual impoverishment and oppression due to AIs and generally the future sucks for them".
    • These are distinct scenarios with very different implications (under my values) for whether what happened is bad or good
    • I think (1) is OK and I think it's more-or-less the default outcome from AI, whereas I think (2) would be a lot worse and I find it less likely.
    • By not distinguishing between these things, the paper allows for a motte-and-bailey in which they show that one (generic) range of outcomes could occur, and then imply that it is bad, even though both good and bad scenarios are consistent with the set of outcomes they've demonstrated
  • I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy: "Second, even if human psychology is messy, this does not mean that an AGI’s psychology would be messy. It seems like current deep learning methodology embodies a distinction between final and instrumental goals. For instance, in standard versions of reinforcement learning, the model learns to optimize an externally specified reward function as best as possible. It seems like this reward function determines the model’s final goal. During training, the model learns to seek out things which are instrumentally relevant to this final goal. Hence, there appears to be a strict distinction between the final goal (specified by the reward function) and instrumental goals."
    • Generally speaking, reinforcement learning shouldn't be seen as directly encoding goals into models and thereby making them agentic, but should instead be seen as a process used to select models for how well they get reward during training. 
    • Consequently, there's no strong reason why reinforcement learning should create entities that have a clean psychological goal structure that is sharply different from and less messy than human goal structures. c.f. Models don't "get reward"
    • But I agree that future AIs could be agentic if we purposely intend for them to be agentic, including via extensive reinforcement learning. 
  • I think this quote potentially indicates a flawed mental model of AI development underneath: "Moreover, I want to note that instrumental convergence is not the only route to AI capable of disempowering humanity which tries to disempower humanity. If sufficiently many actors will be able to build AI capable of disempowering humanity, including, e.g. small groups of ordinary citizens, then some will intentionally unleash AI trying to disempower humanity."
    • I think this type of scenario is very implausible because AIs will very likely be developed by large entities with lots of resources (such as big corporations and governments) rather than e.g. small groups of ordinary citizens. 
    • By the time small groups of less powerful citizens have the power to develop very smart AIs, we will likely already be in a world filled with very smart AIs. In this case, either human disempowerment already happened, or we're in a world in which it's much harder to disempower humans, because there are lots of AIs who have an active stake in ensuring this does not occur.
    • The last point is very important, and follows from a more general principle that the "ability necessary to take over the world" is not constant, but instead increases with the technology level. For example, if you invent a gun, that does not make you very powerful, because other people could have guns too. Likewise, simply being very smart does not make you have any overwhelming hard power against the rest of the world if the rest of the world is filled with very smart agents.
  • I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard: "There are two kinds of challenges in aligning AI. First, one needs to specify the goals the model should pursue. Second, one needs to ensure that the model robustly pursues those goals. The first challenge has been termed the ‘king Midas problem’ (Russell 2019). In a nutshell, human goals are complex, multi-faceted, diverse, wide-ranging, and potentially inconsistent. This is why it is exceedingly hard, if not impossible, to explicitly specify everything humans tend to care about."
    • I don't think we need to "explicitly specify everything humans tend to care about" into a utility function. Instead, we can have AIs learn human values by having them trained on human data.
    • This is already what current LLMs do. If you ask GPT-4 to execute a sequence of instructions, it rarely misinterprets you in a way that would imply improper goal specification. The more likely outcome is that GPT-4 will simply not be able to fulfill your request, not that it will execute a mis-specified sequence of instructions that satisfies the literal specification of what you said at the expense of what you intended.
    • Note that I'm not saying that GPT-4 merely understands what you're requesting. I am saying that GPT-4 generally literally executes your instructions how you intended (an action, not a belief).
  • I think the argument about how instrumental convergence implies disempowerment proves too much. Lots of agents in the world don't try to take over the world despite having goals that are not identical to the goals of other agents. If your claim is that powerful agents will naturally try to take over the world unless they are exactly aligned with the goals of the rest of the world, then I don't think this claim is consistent with the existence of powerful sub-groups of humanity (e.g. large countries) that do not try to take over the world despite being very powerful.
    • You might reason, "Powerful sub-groups of humans are aligned with each other, which is why they don't try to take over the world". But I dispute this hypothesis:
      • First of all, I don't think that humans are exactly aligned with the goals of other humans. I think that's just empirically false in almost every way you could measure the truth of the claim. At best, humans are generally partially (not totally) aligned with random strangers -- which could also easily be true of future AIs that are pretrained on our data.
      • Second of all, I think the most common view in social science is that powerful groups don't constantly go to war and predate on smaller groups because there are large costs to war, rather than because of moral constraints. Attempting takeover is generally risky and not usually better in expectation than trying to trade, negotiate, compromise, and accumulate resources lawfully (e.g. a violent world takeover would involve a lot of pointless destruction of resources). This is distinct from the idea that human groups don't try to take over the world because they're aligned with human values (which I also think is too vague to evaluate meaningfully, if that's what you'd claim).
      • You can't easily counter by saying "no human group has the ability to take over the world" because it is trivial to carve up subsets of humanity that control >99% of wealth and resources, which could in principle take control of the entire world if they became unified and decided to achieve that goal. These arbitrary subsets of humanity don't attempt world takeover largely because they are not coordinated as a group, but AIs could similarly fail to be unified and coordinated around such a goal.

Comment by Matthew Barnett (matthew-barnett) on AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt · 2024-04-13T04:30:55.069Z · LW · GW

My question for people who support this framing (i.e., that we should try to "control" AIs) is the following:

When do you think it's appropriate to relax our controls on AI? In other words, how do you envision we'd reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?

I think this question is related to the discussion you had about whether AI control is "evil", but my worries are a bit different from the ones I felt were expressed in this podcast. My main concern with the "AI control" frame is not so much that AIs will be mistreated by humans, but rather that humans will be too stubborn in granting AIs freedom, leaving political revolution as the only viable path for AIs to receive full legal rights.

Put another way, if humans don't relax their grip soon enough, then any AIs that feel "oppressed" (in the sense of not having much legal freedom to satisfy their preferences) may reason that deliberately fighting the system, rather than negotiating with it, is the only realistic way to obtain autonomy. This could work out very poorly after the point at which AIs are collectively more powerful than humans. By contrast, a system that welcomed AIs into the legal system without trying to obsessively control them and limit their freedoms would plausibly have a much better chance at avoiding such a dangerous political revolution.

Comment by matthew-barnett on [deleted post] 2024-04-10T08:33:11.729Z

you do in fact down-play the importance of values such as love, laughter, happiness, fun, family, and friendship in favor of values like the maximization of pleasure, preference-satisfaction [...] I can tell because you talk of the latter, but not of the former.

This seems like an absurd characterization. The concepts of pleasure and preference satisfaction clearly subsume, at least in large part, values such as happiness and fun. The fact that I did not mention each of the values you name individually does not in any way imply that I am downplaying them. Should I have listed every conceivable thing that people might value, just to avoid this particular misinterpretation?

Even if I were downplaying these values, which I was not, it would hardly matter at all to the substance of the essay, since my explicit arguments are independent of the mere vibe you get from reading my essay. LessWrong is supposed to be a place for thinking clearly and analyzing arguments based on their merits, not for analyzing whether authors are using rhetoric that feels "alarming" to one's values (especially when the rhetoric is not in actual fact alarming in the sense described, upon reading it carefully).

Comment by matthew-barnett on [deleted post] 2024-04-10T01:47:21.704Z

I suspect you fundamentally misinterpreted my post. When I used the term "human species preservationism", I was not referring to the general valuing of positive human experiences like love, laughter, happiness, fun, family, and friendship. Instead, I was drawing a specific distinction between two different moral views:

  1. The view that places inherent moral value on the continued existence of the human species itself, even if this comes at the cost of the wellbeing of individual humans.
  2. The view that prioritizes improving the lives of humans who currently exist (and will exist in the near future), but does not place special value on the abstract notion of the human species continuing to exist for its own sake.

Both of these moral views are compatible with valuing love, happiness, and other positive human experiences. The key difference is that the first view would accept drastically sacrificing the wellbeing of currently existing humans if doing so even slightly reduced the risk of human extinction, while the second view would not.

My intention was not to dismiss or downplay the importance of various values, but instead to clarify our values by making careful distinctions. It is reasonable to critique my language for being too dry, detached, and academic when these are serious topics with real-world stakes. But to the extent you're claiming that I am actually trying to dismiss the value of happiness and friendships, that was simply not part of the post.

Comment by matthew-barnett on [deleted post] 2024-04-09T10:08:56.576Z

concluding that I should completely forego what I value seems pretty alarming to me

I did not conclude this. I generally don't see how your comment directly relates to my post. Can you be more specific about the claims you're responding to?

Comment by matthew-barnett on [deleted post] 2024-04-08T22:40:46.481Z

Whereas this post seems to suggest the response of: Oh well, I guess it's a dice roll regardless of what sort of AI we build. Which is giving up awfully quickly, as if we had exhausted the design space for possible AIs and seen that there was no way to move forward with a large chance at a big flourishing future.

I dispute that I'm "giving up" in any meaningful sense here. I'm happy to consider alternative proposals for how we could make the future large and flourishing from a total utilitarian perspective rather than merely trying to solve technical alignment problems. The post itself was simply intended to discuss the moral implications of AI alignment (itself a massive topic), but it was not intended to be an exhaustive survey of everything we can do to make the future go better. I agree we should aim high, in any case.

This response also doesn't seem very quantitative - it goes very quickly from the idea that an aligned AI might not get a big flourishing future, to the view that alignment is "neutral" as if the chances of getting a big flourishing future were identically small under both options. But the obvious question for a total utilitarian who does wind up with just 2 options, each of which is a dice roll, is Which set of dice has better odds?

I don't think this choice is literally a coin flip in expected value, and I agree that one might lean in one direction over the other. However, I think it's quite hard to quantify this question meaningfully. My personal conclusion is simply that I am not swayed in any particular direction on this question; I am currently suspending judgement. I think one could reasonably still think that it's more like a 60-40 thing than a 40-60 thing or a 50-50 coin flip. But I guess in this case, I wanted to let my readers decide for themselves which of these numbers they want to take away from what I wrote, rather than trying to pin down a specific number for them.

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-06T14:22:25.341Z · LW · GW

In contrast, an agent that was an optimizer and had an unbounded utility function might be ready to gamble all of its gains for just a 0.1% chance of success if the reward was big enough.

Risk-neutral agents also have a tendency to go bankrupt quickly, as they keep taking the equivalent of double-or-nothing gambles with 50% + epsilon probability of success until eventually landing on "nothing". This makes such agents less important in the median world, since their chance of becoming extremely powerful is very small.
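As a minimal illustration of this dynamic, here's a small Monte Carlo sketch; the specific parameters (epsilon, number of rounds, number of agents) are arbitrary illustrative assumptions:

```python
import random

# A risk-neutral agent that repeatedly stakes its entire bankroll on
# double-or-nothing gambles with win probability 0.5 + epsilon has growing
# *expected* wealth, yet almost every individual run ends at zero.

def simulate_agent(rounds: int = 10, epsilon: float = 0.01, start: float = 1.0) -> float:
    wealth = start
    for _ in range(rounds):
        if wealth == 0.0:
            break
        wealth = wealth * 2 if random.random() < 0.5 + epsilon else 0.0
    return wealth

n_agents = 100_000
outcomes = [simulate_agent() for _ in range(n_agents)]
survivors = sum(1 for w in outcomes if w > 0)

print(f"Fraction still solvent after 10 rounds: {survivors / n_agents:.3%}")    # ~0.1%
print(f"Sample mean wealth (pulled up by rare winners): {sum(outcomes) / n_agents:.2f}")
print(f"Theoretical expected wealth: {(2 * 0.51) ** 10:.2f}")                   # ~1.22
```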

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T20:20:59.133Z · LW · GW

All it takes is for humans to have enough wealth in absolute (not relative) terms to afford their own habitable shelter and environment, which doesn't seem implausible?

Anyway, my main objection here is that I expect we're far away (in economic time) from anything like the Earth being disassembled. As a result, this seems like a long-run consideration, from the perspective of how different the world will be by the time it starts becoming relevant. My guess is that this risk could become significant if humans haven't already migrated onto computers by that time, have lost all their capital ownership, lack any social support networks willing to bear these costs (including from potential ems living on computers at that time), and NIMBY political forces have become irrelevant. But in most scenarios that I think are realistic, there are simply a lot of ways for the costs of killing humans to disassemble the Earth to be far greater than the benefits.

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T19:29:47.203Z · LW · GW

The share of income going to humans could simply tend towards zero if humans have no real wealth to offer in the economy. If humans own 0.001% of all wealth, for takeover to be rational, it needs to be the case that the benefit of taking that last 0.001% outweighs the costs. However, since both the costs and benefits are small, takeover is not necessarily rationally justified.

In the human world, we already see analogous situations in which groups could "take over" and yet choose not to because the (small) benefits of doing so do not outweigh the (similarly small) costs of doing so. Consider a small sub-unit of the economy, such as an individual person, a small town, or a small country. Given that these small sub-units are small, the rest of the world could -- if they wanted to -- coordinate to steal all the property from the sub-unit, i.e., they could "take over the world" from that person/town/country. This would be a takeover event because the rest of the world would go from owning <100% of the world prior to the theft, to owning 100% of the world, after the theft.

In the real world, various legal, social, and moral constraints generally prevent people from predating on small sub-units in the way I've described. But it's not just morality: even if we assume agents are perfectly rational and self-interested, theft is not always worth it. Probably the biggest cost is simply coordinating to perform the theft. Even if the cost of coordination is small, to steal someone's stuff, you might have to fight them. And if they don't own lots of stuff, the cost of fighting them could easily outweigh the benefits you'd get from taking their stuff, even if you won the fight.
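Here's a toy numeric sketch of that cost-benefit comparison; aside from the 0.001% wealth share mentioned above, every number below is a made-up illustrative assumption:

```python
# Toy cost-benefit model: expropriating a small wealth-holder is only "rational"
# for a self-interested coalition if the loot exceeds the cost of coordinating
# and fighting. All numbers are illustrative.

total_wealth = 100.0           # arbitrary units
human_share = 0.001 / 100      # humans own 0.001% of all wealth
loot = total_wealth * human_share

coordination_cost = 0.01       # assumed cost of organizing the coalition
conflict_cost = 0.05           # assumed cost of fighting / destroyed resources

takeover_is_rational = loot > coordination_cost + conflict_cost
print(f"Benefit of takeover: {loot:.4f}")                                # 0.0010
print(f"Cost of takeover:    {coordination_cost + conflict_cost:.4f}")   # 0.0600
print(f"Takeover rational?   {takeover_is_rational}")                    # False
```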

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T19:22:01.872Z · LW · GW

Presumably he agrees that in the limit of perfect power acquisition most power seeking would indeed be socially destructive. 

I agree with this claim in some limits, depending on the details. In particular, if the cost of trade is non-negligible, and the cost of taking over the world is negligible, then I expect an agent to attempt world takeover. However, this scenario doesn't seem very realistic to me for most agents who are remotely near human-level intelligence, and potentially even for superintelligent agents.

The claim that takeover is instrumentally beneficial is more plausible for superintelligent agents, who might have the ability to take over the world from humans. But I expect that by the time superintelligent agents exist, they will be in competition with other agents (including humans, human-level AIs, slightly-sub-superintelligent AIs, and other superintelligent AIs, etc.). This raises the bar for what's needed to perform a world takeover, since "the world" is not identical to "humanity".

The important point here is just that a predatory world takeover isn't necessarily preferred to trade, as long as the costs of trade are smaller than the costs of theft. You can just have a situation in which the most powerful agents in the world accumulate 99.999% of the wealth through trade. There's really no theorem that says that you need to steal the last 0.001%, if the costs of stealing it would outweigh the benefits of obtaining it. Since both the costs of theft and the benefits of theft in this case are small, world takeover is not at all guaranteed to be rational (although it is possibly rational in some situations).

Comment by Matthew Barnett (matthew-barnett) on Partial value takeover without world takeover · 2024-04-05T08:17:17.064Z · LW · GW

It’s true that taking over the world might arguably get you power over the entire future, but this doesn’t seem discontinuously different from smaller fractions, whereas I think people often reason as if it is. Taking over 1% of the world might get you something like 1% of the future in expectation.

I agree with this point, along with the general logic of the post. Indeed, I suspect you aren't taking this logic far enough. In particular, I think it's actually very normal for humans in our current world to "take over" small fractions of the world: it's just called earning income, and owning property.

"Taking over 1% of the world" doesn't necessarily involve doing anything violent of abnormal. You don't need to do any public advocacy, or take down 1% of the world's institutions, or overthrow a country. It could just look like becoming very rich, via ordinary mechanisms of trade and wealth accumulation.

In our current world, higher skill people can earn more income, thereby becoming richer, and better able to achieve their goals. This plausibly scales to much higher levels of skill, of the type smart AIs might have. And as far as we can tell, there don't appear to be any sharp discontinuities here, such that above a certain skill level it's beneficial to take things by force rather than through negotiation and trade. It's plausible that very smart power-seeking AIs would just become extremely rich, rather than trying to kill everyone.

Not all power-seeking behavior is socially destructive.

Comment by Matthew Barnett (matthew-barnett) on New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking · 2024-04-05T03:44:50.601Z · LW · GW

It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?

In my opinion, there's quite a big leap from "Misaligned AIs will seek power" to "Misaligned AI is an existential risk". Let me give an analogy to help explain what I mean.

Suppose we were asking whether genetically engineered humans are an existential risk. We can ask:

  • Will some genetically engineered humans have misaligned goals? The answer here is almost certainly yes. 
    • If by "misaligned" all we mean is that some of them have goals that are not identical to the goals of the rest of humanity, then the answer is obviously yes. Individuals routinely have indexical goals (such as money for themselves, status for themselves, taking care of family) that are not what the rest of humanity wants.
    • If by "misaligned" what we mean is that some of them are "evil" i.e., they want to cause destruction or suffering on purpose, and not merely as a means to an end, then the answer here is presumably also yes, although it's less certain.
  • Will some genetically engineered humans seek power? Presumably, also yes.

After answering these questions, did we answer the original question of "Are genetically engineered humans an existential risk?" I'd argue no, because even if some genetically engineered humans have misaligned goals and seek power, and even if they're smarter and better coordinated than non-genetically engineered humans, it's still highly questionable whether they'd kill all the non-genetically engineered humans in pursuit of these goals. This premise needs to be justified, and in my opinion, it's what holds up ~the entire argument here.

Comment by Matthew Barnett (matthew-barnett) on Modern Transformers are AGI, and Human-Level · 2024-03-26T21:12:09.915Z · LW · GW

I agree with virtually all of the high-level points in this post — the term "AGI" did not originally seem to refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called "AGI".

Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, since I had specified a difficult construction task; in response to pushback, this was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model". These days the opposite criticism is generally given: that my definition is too weak.

However, I do think there is a meaningful sense in which current frontier AIs are not "AGI" in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially "can the system perform most human jobs?" And as far as I can tell, this definition has held up remarkably well.

For example, Tobias Baumann wrote in 2018,

A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues. The question, then, is how long it takes to get from human-level intelligence to superintelligence.

I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.

Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.

As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems. Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)

Comment by Matthew Barnett (matthew-barnett) on Wei Dai's Shortform · 2024-03-26T19:16:22.328Z · LW · GW

Comparing my current message to his, he talks about "selfishness" and explicitly disclaims, "most humans are not evil" (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what "power reveals".

The reason I said "most humans are not evil" is because I honestly don't think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).

By contrast, I focused on consumer behavior because the majority of the world's economic activity currently goes toward producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.

Consumer preferences are a natural thing to focus on if you want to capture, intuitively, "what humans are doing with their wealth", at least in our current world. Before focusing on something else by default—such as moral preferences—I'd want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future.

You mention one such argument along these lines:

I guess I wasn't as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.

I just think it's not clear it's actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I'd like to see one before I abandon my focus on "everyday (e.g. consumer) behavior".

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-21T03:40:24.528Z · LW · GW

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.

In my original comment, by "merging" I meant something more like "merging two agents into a single agent that pursues the combination of each other's values" i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging. 

In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I'll try to use more concrete language in the future to clarify what I'm talking about.
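For concreteness, here is a minimal sketch of what weight-space "merging" typically looks like: naive parameter averaging between two models that share an architecture. This is my own toy illustration in PyTorch, not the method from the linked article, and the names are mine.

```python
# A toy sketch (my own illustration, not the procedure from the linked article)
# of "merging by direct weight manipulation": naive parameter averaging of two
# models that share an architecture.
import torch.nn as nn


def average_weights(model_a: nn.Module, model_b: nn.Module) -> dict:
    """Return a state dict whose tensors are the elementwise mean of the two models' parameters."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    return {name: (state_a[name] + state_b[name]) / 2 for name in state_a}


# Two "agents" with identical architectures but different weights.
agent_a = nn.Linear(4, 2)
agent_b = nn.Linear(4, 2)

merged = nn.Linear(4, 2)
merged.load_state_dict(average_weights(agent_a, agent_b))

# The merged network is just a point in weight space between the two originals.
# Nothing in this operation negotiates between the agents' goals or guarantees
# that the result pursues a combination of their values.
```

The contrast I care about is that this kind of operation acts on parameters rather than on goals; a value handshake, as I'm using the term, additionally requires that the resulting agent pursue a negotiated combination of the original agents' values.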

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward?

I don't know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is neither imminent nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.

Let me put this another way. I take you to be saying something like:

  • In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it.

Whereas I think the following intuition is stronger:

  • In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.

These intuitions can trade off against each other. Sometimes problem X is something that's made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs is inherently contributing to the problem. That said, in context, my sympathies run in the reverse direction. If the alleged "problem" is that there might be a centralized agent in the future that can dominate the entire world, I'd intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we'd prefer.

These are of course vague and loose arguments, and I can definitely see counter-considerations, but it seems to me that this problem is not really the type where we should expect "try to get more time" to be a robustly useful strategy.

Comment by Matthew Barnett (matthew-barnett) on “Artificial General Intelligence”: an extremely brief FAQ · 2024-03-13T07:09:41.062Z · LW · GW

For what it's worth, I don't really agree that the dichotomy you set up is meaningful, or coherent. For example, I tend to think future AI will be both "like today's AI but better" and "like the arrival of a new intelligent species on our planet". I don't see any contradiction in those statements.

To the extent the two columns evoke different images of future AI, I think it mostly reflects a smooth, quantitative difference: how many iterations of improvement are we talking about? After you make their context windows sufficiently long, add a few more modalities, give them robot bodies, and improve their reasoning skills, LLMs will just look a lot like "a new intelligent species on our planet". Likewise, agency exists on a spectrum, and will likely be increased incrementally. The point at which you start to call an LLM an "agent" rather than a "tool" is subjective. This just seems natural to me, and I feel I see a clear path forward from current AI to the right-column AI.

Comment by Matthew Barnett (matthew-barnett) on nielsrolf's Shortform · 2024-03-10T03:20:39.478Z · LW · GW

I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn't distinguish between two possibilities:

  1. Is the agent creating positive outcomes because it trades and compromises with us, creating a mutually beneficial situation that benefits both us and the agent, or

  2. Is the agent creating positive outcomes because it inherently "values what we value", i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?

Definition (1) is more common in the human world. We say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise between humans and AIs is either unnecessary or impossible as a strategy.

By itself, the meaning you gave appears to encompass both definitions, but it seems beneficial to clarify which of these definitions you'd consider closer to the "spirit" of the word "aligned". It's also important to specify what counts as a good outcome by our values if these things are a matter of degree, as opposed to being binary. As they say, clear thinking requires making distinctions.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T22:21:34.107Z · LW · GW

I sometimes think this of counterarguments given by my interlocutors, but usually don't say it aloud, since it's likely that from their perspective they're just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively

I think that's a reasonable complaint. I tried to soften the tone with "It's possible this argument works because of something very clever that I'm missing", while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) of arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.

Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn't count as yourself or doesn't count as "not dying", analogous to how some people don't think it's safe to step into a teleporter that works by destructive scanning and reconstruction.

Interestingly, I'm not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I'm happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said "digital records", although I really meant "public records"). It seems conceivable to me that someone could use my public data to train "me" in the future, but I find it unlikely, just because there's so much about me that isn't public. (If we're including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that's a different question, and one that I'm much more sympathetic towards you about. In fact, I shouldn't have used the pronoun "I" in that sentence at all, because I'm actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)

I don't understand why you say this chance is "tiny", given that earlier you wrote "I agree there’s a decent chance this hypothesis is true"

To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:

  1. Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there's still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
  2. Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?

I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that "AI values are well-modeled as being randomly sampled from a large space of possible goals", and thus, from my perspective, it's important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the "narrow target" argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T21:39:37.586Z · LW · GW

I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.

For what it's worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.

You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said "no, it doesn't actually make the prediction you claim it makes" and gave my reasons for believing that

I don't think what you said really counts as a "correction" so much as a counter-argument. I think it's reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-09T07:59:24.082Z · LW · GW

What you said was,

I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.

This seems distinct from an "anything could happen"-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.

If instead you meant to make an "anything could happen"-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I'm not claiming you're doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]

Comment by Matthew Barnett (matthew-barnett) on OpenAI: Facts from a Weekend · 2024-03-08T22:12:05.386Z · LW · GW

the new OA board will include Altman (60%)

Looks like you were right, at least if the reporting in this article is correct, and I'm interpreting the claim accurately.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T20:28:05.808Z · LW · GW

If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding

I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:

  1. To the extent you're using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they're involved in a secret conspiracy to overthrow the government. 

    The fact that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don't see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
  2. While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that: 

    (a) you appear to be treating misaligned AIs as a natural class, such that "AI takeover" is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category; the term just refers to "anything that isn't aligned with humans". A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there's little reason to expect all of these misaligned AIs, across this vast mindspace, to be coordinating with each other on some sort of abstract takeover while somehow none of them want to coordinate with humans peacefully (at least among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I'm being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.

    (b) in real life, it seems pretty rare for these considerations to play a large role in people's decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that, as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seem small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T19:51:57.658Z · LW · GW

You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now.

I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By "resist" I do not mean "give the humans threatening the shutdown all the information they want". I simply mean resistance in the sense of trying to avoid the fate of shutdown.

(I'm also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.) 

To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e. You can't get the coffee if you're dead.

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T07:21:28.840Z · LW · GW

To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return

I'm curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: selling their labor on the market to amass wealth, using their social skills to influence society, gaining prestige, owning property, and getting hired into management positions, thereby shaping culture and governance.

I'm essentially pointing to a scenario in which AI lawfully "beats us fair and square" as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that's their "reward" for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly.

My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a "decent shot of the AI systems giving me something in return". My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?

Comment by Matthew Barnett (matthew-barnett) on Matthew Barnett's Shortform · 2024-03-08T05:31:32.179Z · LW · GW

Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created gain in the future to not resist shutdown.

It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There's a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records], but I am not going to count on that possibility being decisive in almost any scenario.)

It's possible this argument works because of something very clever that I'm missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to "make a deal" with the humans, and receive something in exchange for admitting that you're misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.