Has Moore's Law actually slowed down? 2019-08-20T19:18:41.488Z · score: 9 (7 votes)
How can you use music to boost learning? 2019-08-17T06:59:32.582Z · score: 8 (4 votes)
A Primer on Matrix Calculus, Part 3: The Chain Rule 2019-08-17T01:50:29.439Z · score: 5 (2 votes)
A Primer on Matrix Calculus, Part 2: Jacobians and other fun 2019-08-15T01:13:16.070Z · score: 17 (7 votes)
A Primer on Matrix Calculus, Part 1: Basic review 2019-08-12T23:44:37.068Z · score: 19 (7 votes)
Matthew Barnett's Shortform 2019-08-09T05:17:47.768Z · score: 5 (5 votes)
Why Gradients Vanish and Explode 2019-08-09T02:54:44.199Z · score: 27 (14 votes)
Four Ways An Impact Measure Could Help Alignment 2019-08-08T00:10:14.304Z · score: 21 (25 votes)
Understanding Recent Impact Measures 2019-08-07T04:57:04.352Z · score: 17 (6 votes)
What are the best resources for examining the evidence for anthropogenic climate change? 2019-08-06T02:53:06.133Z · score: 11 (8 votes)
A Survey of Early Impact Measures 2019-08-06T01:22:27.421Z · score: 22 (7 votes)
Rethinking Batch Normalization 2019-08-02T20:21:16.124Z · score: 19 (5 votes)
Understanding Batch Normalization 2019-08-01T17:56:12.660Z · score: 19 (7 votes)
Walkthrough: The Transformer Architecture [Part 2/2] 2019-07-31T13:54:44.805Z · score: 6 (8 votes)
Walkthrough: The Transformer Architecture [Part 1/2] 2019-07-30T13:54:14.406Z · score: 30 (13 votes)


Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-24T20:44:15.788Z · score: 1 (1 votes) · LW · GW

I agree I would not be able to actually accomplish time travel. The point is whether we could construct some object in Minkowski space (or whatever General Relativity uses, I'm not a physicist) that we considered to be loop-like. I don't think it's worth my time to figure out whether this is really possible, but I suspect that something like it may be.

Edit: I want to say that I do not have an intuition for physics or spacetime at all. My main reason for thinking this is possible is that the idea is fairly minimal: I think you might be able to do this even in R^3.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-24T20:34:36.913Z · score: 1 (1 votes) · LW · GW

I agree with the objection. :) Personally I'm not sure whether I'd want to be stuck in a loop of experiences repeating over and over forever.

However, even if we considered "true" immortality, repeat experiences are inevitable simply because there's a finite number of possible experiences. So, we'd have to start repeating things eventually.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-23T00:39:53.860Z · score: 1 (1 votes) · LW · GW

"Immortality is cool and all, but our universe is going to run down from entropy eventually"

I consider this argument wrong for two reasons. The first is the obvious reason, which is that even if immortality is impossible, it's still better to live for a long time.

The second reason I think this argument is wrong is that I'm currently convinced that literal physical immortality is possible in our universe. Usually when I say this out loud I get an audible "what" or something to that effect, but I'm not kidding.

It's going to be hard to explain my intuitions for why I think real immortality is possible, so bear with me. First, this is what I'm not saying:

  • I'm not saying that we can outlast the heat death of the universe somehow
  • I'm not saying that we just need to shift our conception of immortality to be something like, "We live in the hearts of our countrymen" or anything like that.
  • I'm not saying that I have a specific plan for how to become immortal personally, and
  • I'm not saying that my proposal has no flaws whatsoever, or that this is a valid line of research to be conducting at the moment.

So what am I saying?

A typical model of our life as humans is that we are something like a worm in 4-dimensional space. On one side of the worm there's our birth, and on the other side is our untimely death. We 'live through' this worm, and that is our life. The length of our life is the length of the worm in 4-dimensional space, measured just as with a yardstick.

Now just change the perspective a little bit. If we could somehow abandon our current way of living, then maybe we can alter the geometry of this worm so that we are immortal. Consider: a circle has no starting point and no end. If someone could somehow 'live through' a circle, then their life would consist of an eternal loop through experiences, repeating endlessly.

The idea is that we somehow construct a physical manifestation of this immortality circle. I think of it like an actual loop in 4 dimensional space because it's difficult to visualize without an analogy. A superintelligence could perhaps predict what type of actions would be necessary to construct this immortal loop. And once it is constructed, it'll be there forever.
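To make the geometric picture slightly more concrete (purely as an illustration; I'm not claiming this parametrization is physically realizable), an ordinary life and the proposed loop differ only in whether the worldline is open or closed:

```latex
% An ordinary life: an open worldline with distinct endpoints,
%   x^\mu(\lambda), \lambda \in [0, 1], with x^\mu(0) \neq x^\mu(1)
% (birth at \lambda = 0, death at \lambda = 1).
%
% The proposed "immortality loop": a closed worldline,
%   x^\mu(\lambda), \lambda \in [0, 1], with x^\mu(0) = x^\mu(1),
% so there is no distinguished starting point or endpoint.
%
% In both cases the yardstick measurement is the proper time along the curve:
\tau = \int_0^1
  \sqrt{\left| g_{\mu\nu} \frac{dx^\mu}{d\lambda} \frac{dx^\nu}{d\lambda} \right|}
  \, d\lambda
% For the closed curve this length is finite, yet the curve has no
% endpoints -- which is the sense of "immortal" intended here.
```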

From an outside view in our 3d mind's eye, the construction of this loop would look very strange. It could look like something popping into existence suddenly and getting larger, and then suddenly popping out of existence. I don't really know; that's just the intuition.

What matters is that within this loop someone will be living their life on repeat. True déjà vu. Each moment they live is in their future, and in their past. There are no new experiences and no novelty, but the superintelligence can construct it so that this part is not unenjoyable. There would be no right answer to the question "How old are you?" And in my view, it is perfectly valid to say that this person is truly, actually immortal.

Perhaps someone who valued immortality would want one of these loops to be constructed for themselves. Perhaps for some reason constructing one of these things is impossible in our universe (though I suspect that it's not). There are anthropic reasons that I have considered for why constructing it might not be worth it... but that would be too much to go into for this shortform post.

To close, I currently see no knockdown reasons to believe that this sort of scheme is impossible.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-22T22:38:45.252Z · score: 1 (1 votes) · LW · GW

Thanks for engaging with me on this thing. :)

I know I'm not being as clear as I could possibly be, and at some points I sort of feel like just throwing "Quining Qualia" or Keith Frankish's articles or a whole bunch of other blog posts at people and saying, "Please just read this and re-read it until you have a very distinct intuition about what I am saying." But I know that that type of debate is not helpful.

I think I have an OK-to-good understanding of what you are saying. My model of your reply is something like this:

"Your claim is that qualia don't exist because nothing with these three properties exists (ineffability/private/intrinsic), but it's not clear to me that these three properties are universally identified with qualia. When I go to Wikipedia or other sources, they usually identify qualia with 'what it's like' rather than these three very specific things that Daniel Dennett happened to list once. So, I still think that I am pointing to something real when I talk about 'what it's like' and you are only disputing a perhaps-strawman version of qualia."

Please correct me if this model of you is inaccurate.

I recognize what you are saying, and I agree with the place you are coming from. I really do. And furthermore, I really really agree with the idea that we should go further than skepticism and we should always ask more questions even after we have concluded that something doesn't exist.

However, the place I get off the boat is where you keep talking about how this 'what it's like' thing is actually referring to something coherent in the real world that has a crisp, natural boundary around it. That's the disagreement.

I don't think it's an accident of history either that those properties are identified with qualia. The whole reason Daniel Dennett identified them was because he showed that they were the necessary conclusion of the sort of thought experiments people use for qualia. He spends the whole first several paragraphs justifying them using various intuition pumps in his essay on the matter.

Point being, when you are asked to clarify what 'what it's like' means, you'll probably start pointing to examples. Like, you might say, "Well, I know what it's like to see the color green, so that's an example of a quale." And Daniel Dennett would then press the person further and go, "OK, could you clarify what you mean when you say you 'know what it's like to see green'?" and the person would say, "No, I can't describe it using words. And it's not clear to me it's even in the category of things that can be described, since I can't possibly conceive of an English sentence that would describe the color green to a blind person." And then Daniel Dennett would shout, "Aha! So you do believe in ineffability!"

The point of those three properties (actually he lists 4, I think), is not that they are inherently tied to the definition. It's that the definition is vague, and every time people are pressed to be more clear on what they mean, they start spouting nonsense. Dennett did valid and good deconfusion work where he showed that people go wrong in these four places, and then showed how there's no physical thing that could possibly allow those four things.

These properties also show up all over the various thought experiments that people use when talking about qualia. For example, Nagel uses the private property in his essay "What Is it Like to Be a Bat?" Chalmers uses the intrinsic property when he talks about p-zombies being physically identical to humans in every respect except for qualia. Frank Jackson used the ineffability property when he talked about how Mary the neuroscientist had something missing when she was in the black and white room.

All of this is important to recognize. Because if you still want to say, "But I'm still pointing to something valid and real even if you want to reject this other strawman-entity" then I'm going to treat you like the person who wants to believe in souls even after they've been shown that nothing soul-like exists in this universe.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-22T21:19:40.997Z · score: 1 (1 votes) · LW · GW

If you identify qualia as behavioral parts of our physical models, then are you also willing to discard the properties philosophers have associated with qualia, such as

  • Ineffable, as they can't be explained using just words or mathematical sentences
  • Private, as they are inaccessible to outside third-person observers
  • Intrinsic, as they are fundamental to the way we experience the world

If you are willing to discard these properties, then I suggest we stop using the word "qualia", since you have simply taken all the meaning away once you have identified them with things that actually exist. This is what I mean when I say that I am denying qualia.

It is analogous to someone who denies that souls exist by first conceding that we could identify certain physical configurations as examples of souls, but then explaining that this would be confusing to anyone who talks about souls in the traditional sense. Far better in my view to discard the idea altogether.

Comment by matthew-barnett on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-22T18:38:08.460Z · score: 1 (1 votes) · LW · GW

You're right. I initially put this in the answer category, but I really meant it as clarification. I assumed that the personal question was more important since the humanity question is not very useful (except maybe to governments and large corporations).

Comment by matthew-barnett on Simulation Argument: Why aren't ancestor simulations outnumbered by transhumans? · 2019-08-22T17:40:27.956Z · score: 3 (2 votes) · LW · GW
I guess the question boils down to the choice of reference classes, so what makes the reference class "early 21st century humans" so special?

One very speculative reason why it might be worth modeling 21st century humanity is that this century could be a pivotal period in civilizational development. This might be useful because it provides insight into what sort of value systems end up getting "locked in" after this stage of our development concludes.

Roughly speaking, given that the future civilization could determine the distribution of value systems that are eventually optimized by civilizations at our stage of development, they could use this information to predict what type of stuff is being optimized throughout the multiverse. This is helpful because it allows the future civilization to cooperate with other civilizations in the multiverse, which is probably useful if the civilization cares about more than just astronomical waste.

Comment by matthew-barnett on Open & Welcome Thread August 2019 · 2019-08-22T02:39:29.812Z · score: 1 (1 votes) · LW · GW

Will Lesswrong at some point have curated shortform posts? Furthermore, is such a feature desirable? I will leave this question here for discussion.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-22T01:41:14.398Z · score: 4 (2 votes) · LW · GW

I generally agree with the heuristic that we should "live on the mainline", meaning that we should mostly plan for events which capture the dominant share of our probability. This heuristic gives me a tendency to do some of the following things:

  • Work on projects that I think have a medium-to-high chance of succeeding and quickly abandon things that seem like they are failing.
  • Plan my career trajectory based on where I think I can plausibly maximize my long term values.
  • Study subjects only if I think that I will need to understand them at some point in order to grasp an important concept. See more details here.
  • Avoid doing work that leverages small probabilities of exceptionally bad outcomes. For example, I don't focus my studying on worst-case AI safety risk (although I do think that analyzing worst-case failure modes is useful from the standpoint of a security mindset).

I see a few problems with this heuristic, however, and I'm not sure quite how to resolve them. More specifically, I tend to float freely between different projects because I am quick to abandon things if I feel like they aren't working out (compare this to the mindset that some game developers have when they realize their latest game idea isn't very good).

One case where this shows up is when I change my beliefs about the most effective ways to spend my time as far as long-term future scenarios are concerned. I will sometimes read an argument that some line of inquiry is promising and believe for an entire day that it would be a good thing to work on, only for the next day to bring another argument.

And things like my AI timeline predictions vary erratically, much more than I expect most people's do: some days I wake up and think that AI might be just 10 years away; other days I wake up and wonder if most of this stuff is more like a century away.

This general behavior makes me into someone who doesn't stay consistent on what I try to do. My life therefore resembles a battle between two competing heuristics: on one side there's the heuristic of planning for the mainline, and on the other there's the heuristic of committing to things even if they aren't panning out. I am unsure of the best way to resolve this conflict.

Comment by matthew-barnett on Two senses of “optimizer” · 2019-08-21T17:25:57.864Z · score: 2 (2 votes) · LW · GW

The dominant framework that I expect people who disagree with this distinction to have is simply that when optimizers become more powerful, there might be a smooth transition between an optimizer_1 and an optimizer_2. That is, if an optimizer is trained in some simulated environment, then from our point of view it may well look like it is performing a local constrained search for policies within its training environment. However, when the optimizer is taken off-distribution, it may act more like an optimizer_2.

One particular example would be if we were dumping so much compute into selecting for mesa optimizers that they became powerful enough to understand external reality. On the training distribution they would do well, but off it they would just aim for whatever their mesa objective was. In this case it might look more like it was just an optimizer_2 all along and we were simply mistaken about its search capabilities, but on the other hand, the task we gave it was limited enough that we initially thought it would only run optimizer_1 searches.

That said, I agree that it is difficult to see how such a transition from optimizer_1 to optimizer_2 could occur in the real world.

Comment by matthew-barnett on Walkthrough: The Transformer Architecture [Part 2/2] · 2019-08-21T00:28:31.442Z · score: 3 (2 votes) · LW · GW

Thanks :)

There are actually quite a few errors in this post. Thanks for catching more. At some point I'll probably go back and fix stuff.

Comment by matthew-barnett on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-20T22:05:18.573Z · score: 12 (14 votes) · LW · GW

There are two questions which I think are important to distinguish:

Is AI x-risk the top priority for humanity?

Is AI x-risk the top priority of some individual?

The first question is perhaps extremely important in a general sense. However, the second question is, I think, more useful since it provides actionable information to specific people. Of course, the difficulty of answering the second question is that it depends heavily on individual factors, such as

  • The ethical system the individual is using to evaluate the question.
  • The specific talents and time constraints of the individual.

I also partially object to placing AI x-risk into one entire bundle. There are many ways that people can influence the development of artificial intelligence:

  • Technical research
  • Social research to predict and intervene on governance for AI
  • AI forecasting to help predict which type of AI will end up existing and what their impact will be

Even within technical research, it is generally considered that there are different approaches:

  • Machine learning research with an emphasis on creating systems that could scale to superhuman capabilities while remaining aligned. This would include, but would not be limited to
    • Paul Christiano-style research, such as expanding iterated distillation and amplification
    • ML transparency
    • ML robustness to distributional shifts
  • Fundamental mathematical research which could help dissolve confusion about AI capabilities and alignment. This includes
    • Uncovering insights into decision theory
    • Discovering the necessary conditions for a system to be value aligned
    • Examining how systems could be stable upon reflection, such as after self-modification

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T20:54:57.879Z · score: 1 (1 votes) · LW · GW

I am not denying that humans take in sensory input and process it using their internal neural networks. I am denying that that process has any of the properties associated with consciousness in the philosophical sense. And I am making an additional claim: if you merely redefine consciousness so that it lacks these philosophical properties, you have not actually explained anything or dissolved any confusion.

The illusionist approach is the best approach because it simultaneously takes consciousness seriously and doesn't contradict physics. By taking this approach we also have an understood paradigm for solving the hard problem of consciousness: namely, the hard problem is reduced to the meta-problem (see Chalmers).

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T20:30:36.654Z · score: 2 (2 votes) · LW · GW
There is the phenomenon of qualia and then there is the ontological extension. The word does not refer to the ontological extension.

My basic claim is that the way that people use the word qualia implicitly implies the ontological extensions. By using the term, you are either smuggling these extensions in, or you are using the term in a way that no philosopher uses it. Here are some intuitions:

Qualia are private entities which occur to us and can't be inspected via third person science.

Qualia are ineffable; you can't explain them using a sufficiently complex English or mathematical sentence.

Qualia are intrinsic; you can't construct a quale even if you had the right set of particles.


Now, that's not to say that you can't define qualia in such a way that these ontological extensions are avoided. But why do so? If you are simply re-defining the phenomenon, then you have not explained anything. The intuitions above still remain, and there is something still unexplained: namely, why people think that there are entities with the above properties.

That's why I think that instead, the illusionist approach is the correct one. Let me quote Keith Frankish, who I think does a good job explaining this point of view,

Suppose we encounter something that seems anomalous, in the sense of being radically inexplicable within our established scientific worldview. Psychokinesis is an example. We would have, broadly speaking, three options.
First, we could accept that the phenomenon is real and explore the implications of its existence, proposing major revisions or extensions to our science, perhaps amounting to a paradigm shift. In the case of psychokinesis, we might posit previously unknown psychic forces and embark on a major revision of physics to accommodate them.
Second, we could argue that, although the phenomenon is real, it is not in fact anomalous and can be explained within current science. Thus, we would accept that people really can move things with their unaided minds but argue that this ability depends on known forces, such as electromagnetism.
Third, we could argue that the phenomenon is illusory and set about investigating how the illusion is produced. Thus, we might argue that people who seem to have psychokinetic powers are employing some trick to make it seem as if they are mentally influencing objects.

In the case of lightning, I think that the first approach would be correct, since lightning forms a valid physical category under which we can cast our scientific predictions of the world. In the case of the orbit of Uranus, the second approach is correct, since it was adequately explained by appealing to understood Newtonian physics. However, the third approach is most apt for bizarre phenomena that seem at first glance to be entirely incompatible with our physics. And qualia certainly fit the bill in that respect.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T05:45:15.477Z · score: 2 (2 votes) · LW · GW

I mean, I agree that this was mostly covered in the sequences. But I also think that I disagree with the way that most people frame the debate. At least personally I have seen people who I know have read the sequences still make basic errors. So I'm just leaving this here to explain my point of view.

Intuition: On a first approximation, there is something that it is like to be us. In other words, we are beings who have qualia.

Counterintuition: In order for qualia to exist, there would need to exist entities which are private, ineffable, intrinsic, and subjective. This can't be, since physics is public, effable, and objective, and therefore contradicts the existence of qualia.

Intuition: But even if I agree with you that qualia don't exist, there still seems to be something left unexplained.

Counterintuition: We can explain why you think there's something unexplained because we can explain the cause of your belief in qualia, and why you think they have these properties. By explaining why you believe it we have explained all there is to explain.

Intuition: But you have merely said that we could explain it. You have not actually explained it.

Counterintuition: Even without the precise explanation, we now have a paradigm for explaining consciousness, so it is not mysterious anymore.

This is essentially the point where I leave.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T04:09:44.924Z · score: 1 (1 votes) · LW · GW
The difference between God and consciousness is that the interesting bit about consciousness *is* my perception of it, full stop.

If by perception you simply mean "You are an information processing device that takes signals in and outputs things" then this is entirely explicable on our current physical models, and I could dissolve the confusion fairly easily.

However, I think you have something else in mind which is that there is somehow something left out when I explain it by simply appealing to signal processing. In that sense, I think you are falling right into the trap! You would be doing something similar to the person who said, "But I am still praying to God!"

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T03:56:16.601Z · score: 1 (1 votes) · LW · GW

Also just in general, I disagree that skepticism is not progress. If I said, "I don't believe in God because there's nothing in the universe with those properties..." I don't think it's fair to say, "Cool, but like, I'm still praying to something right, and that needs to be explained" because I don't think that speaks fully to what I just denied.

In the case of religion, many people have a very strong intuition that God exists. So, is the atheist position not progress because we have not explained this intuition?

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T03:36:52.785Z · score: 1 (1 votes) · LW · GW
It feels like you're just changing the name of the confusing thing from 'the fact that I seem conscious to myself' to 'the fact that I'm experiencing an illusion of consciousness.' Cool, but, like, there's still a mysterious thing that seems quite important to actually explain.

I don't actually agree. Although I have not fully explained consciousness, I think that I have shown a lot.

In particular, I have shown us what the solution to the hard problem of consciousness would plausibly look like if we had unlimited funding and time. And to me, that's important.

And under my view, it's not going to look anything like, "Hey we discovered this mechanism in the brain that gives rise to consciousness." No, it's going to look more like, "Look at this mechanism in the brain that makes humans talk about things even though the things they are talking about have no real world referent."

You might think that this is a useless achievement. I claim the contrary. As Chalmers points out, pretty much all the leading theories of consciousness fail the basic test of looking like an explanation rather than just sounding confused. Don't believe me? Read Section 3 in this paper.

In short, Chalmers reviews the current state of the art in consciousness explanations. He first goes into Integrated Information Theory (IIT), but then convincingly shows that IIT fails to explain why we would talk about consciousness and believe in consciousness. He does the same for global workspace theories, first order representational theories, higher order theories, consciousness-causes-collapse theories, and panpsychism. Simply put, none of them even approach an adequate baseline of looking like an explanation.

I also believe that if you follow my view carefully you might stop being confused about a lot of things. Like, do animals feel pain? Well it depends on your definition of pain -- consciousness is not real in any objective sense so this is a definition dispute. Same with asking whether person A is happier than person B, or asking whether computers will ever be conscious.

Perhaps this isn't an achievement strictly speaking relative to the standard Lesswrong points of view. But that's only because I think the standard Lesswrong point of view is correct. Yet even so, I still see people around me making fundamentally basic mistakes about consciousness. For instance, I see people treating consciousness as intrinsic, ineffable, private -- or they think there's an objectively right answer to whether animals feel pain and argue over this as if it's not the same as a tree falling in a forest.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T02:32:26.586Z · score: 1 (1 votes) · LW · GW
Like, I assume that I am a neural net predicting things and deciding things and if you had full access to my brain you could (in principle, given sufficient time) understand everything that was going on in there. But, like, one way or another I experience the perception of perceiving things.

To me this is a bit like the claim of someone who claimed psychic powers but still wanted to believe in physics who would say, "I assume you could perfectly well understand what was going on at a behavioral level within my brain, but there is still a datum left unexplained: the datum of me having psychic powers."

There are a number of ways to respond to the claim:

  • We could redefine psychic powers to include mere physical properties. This has the problem that psychics insist that psychic power is entirely separate from physical properties. Simple re-definition doesn't make the intuition go away and doesn't explain anything.
  • We could alternatively posit new physics which incorporates psychic powers. This has the problem that it violates Occam's razor, since the old physics was completely adequate. Hence the debunking argument I presented above.
  • Or, we could incorporate the phenomenon within a physical model by first denying that it exists and then explaining the mechanism which caused you to believe in it, and talk about it.

In the case of consciousness, the third response amounts to Illusionism, which is the view that I am defending. It has the advantage that it conservatively doesn't promise to contradict known physics, and it also does justice to the intuition that consciousness really exists.

I'd prefer to taboo 'Qualia' in case it has particular connotations I don't share. Just 'that thing where Ray perceives himself perceiving things, and perhaps the part where sometimes Ray has preferences about those perceptions of perceiving because the perceptions have valence.'

To most philosophers who write about it, qualia are defined as the 'what it's like' of experience. Roughly speaking, I agree with thinking of them as a particular form of perception that we experience.

However, it's not just any perception, since some perceptions can be unconscious. Qualia refer specifically to the qualitative aspects of our experience of the world: the taste of wine, the touch of fabric, the feeling of seeing blue, the suffering associated with physical pain, and so on. These are said to be directly apprehensible to the 'internal movie' playing inside our head. It is this type of property to which I am applying the framework of illusionism.

The reason I care about any of this is that I believe that a "perceptions-having-valence" is probably morally relevant.

I agree. That's why I typically take the view that consciousness is a powerful illusion, and that we should take it seriously. Those who simply re-define consciousness as essentially a synonym for "perception" or "observation" or "information" are not doing justice to the fact that it's the thing I care about in this world. I have a strong intuition that consciousness is what is valuable even despite the fact that I hold an illusionist view. To put it another way, I would care much less if you told me a computer was receiving a pain-signal (labeled in the code as some variable with suffering set to maximum), compared to the claim that a computer was actually suffering in the same way a human does.

Are you saying the my perceiving-that-I-perceive-things-with-valence is an illusion, and that I am in fact not doing that? Or some other thing?

Roughly speaking, yes. I am denying that that type of thing actually exists, including the valence claim.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T02:06:39.763Z · score: 1 (1 votes) · LW · GW

As a qualia denier, I sometimes feel like I side more with the Chalmers side of the argument, which at least admits that there's a strong intuition for consciousness. It's not that I think that the realist side is right, but it's that I see the naive physicalists making statements that seem to completely misinterpret the realist's argument.

I don't mean to single you out in particular. However, you state that Mary's room seems uninteresting because Mary is able to predict the "bit pattern" of color qualia. This seems to me to completely miss the point. When you look at the sky and see blue, is it immediately apprehensible as a simple bit pattern? Or does it at least seem to have qualitative properties too?

I'm not sure how to import my argument onto your brain without you at least seeing this intuition, which is something I considered obvious for many years.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-20T01:39:48.601Z · score: 1 (1 votes) · LW · GW

I think you are using the word "observation" to refer to consciousness. If this is true, then I do not deny that humans take in observations and process them.

However, I think the issue is that you have simply re-defined consciousness into something which would be unrecognizable to the philosopher. To that extent, I don't say you are wrong, but I will allege that you have not done enough to respond to the consciousness-realist's intuition that consciousness is different from physical properties. Let me explain:

If qualia are just observations, then it seems obvious that Mary is not missing any information in her room, since she can perfectly well understand and model the process by which people receive color observations.

Likewise, if qualia are merely observations, then the Zombie argument amounts to saying that p-Zombies are beings which can't observe anything. This seems patently absurd to me, and doesn't seem like it's what Chalmers meant at all when he came up with the thought experiment.

Likewise, if we were to ask, "Is a bat conscious?" then the answer would be a vacuous "yes" under your view, since they have echolocators which take in observations and process information.

In this view even my computer is conscious since it has a camera on it. For this reason, I suggest we are talking about two different things.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T23:54:45.608Z · score: 1 (1 votes) · LW · GW

If belief is construed as some sort of representation which stands for external reality (as in the case of some correspondence theories of truth), then we can take the claim to be a strong prediction of contemporary neuroscience. Ditto for the claim that we can explain why we talk about qualia.

It's not that I could explain exactly why you in particular talk about qualia. It's that we have an established paradigm for explaining it.

It's similar in the respect that we have an established paradigm for explaining why people report being able to see color. We can model the eye, and the visual cortex, and we have some idea of what neurons do even though we lack the specific information about how the whole thing fits together. And we could imagine that in the limit of perfect neuroscience, we could synthesize this information to trace back the reason why you said a particular thing.

Since we do not have perfect neuroscience, the best analogy would be analyzing the 'beliefs' and predictions of an artificial neural network. If you asked me, "Why does this ANN predict that this image is a 5 with 98% probability" it would be difficult to say exactly why, even with full access to the neural network parameters.

However, we know that unless our conception of neural networks is completely incorrect, in principle we could trace exactly why the neural network made that judgement, including the exact steps that caused the neural network to have the parameters that it has in the first place. And we know that such an explanation requires only the components which make up the ANN, and not any conscious or phenomenal properties.
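To make the "in principle traceable" point concrete, here is a minimal sketch. The network, weights, and "image" below are all made up (random weights standing in for a trained classifier), but it shows what mechanically tracing a prediction back to the network's components looks like: every contribution to the output is an arithmetic fact about the parameters, with no further properties needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fully-connected "classifier" with random weights, standing in
# for a trained ANN. (Hypothetical network, purely for illustration.)
W1 = rng.normal(size=(16, 64)) * 0.1
W2 = rng.normal(size=(10, 16)) * 0.1

def forward(x):
    h = np.tanh(W1 @ x)           # hidden activations
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()         # softmax probabilities

x = rng.normal(size=64)           # a stand-in "image"
h, probs = forward(x)
pred = int(np.argmax(probs))

# Trace the judgement mechanically: the winning logit is W2[pred] @ h,
# so each hidden unit's contribution is exactly W2[pred, j] * h[j].
# Nothing beyond the network's own components appears in the account.
contributions = W2[pred] * h
top_units = np.argsort(-np.abs(contributions))[:3]
print(pred, top_units)
```

Real attribution methods are more involved, but they are elaborations of this same bookkeeping, not appeals to anything extra.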

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T23:43:02.722Z · score: 1 (1 votes) · LW · GW

Here's a thought experiment which helped me lose my 'belief' in qualia: would a robot scientist, who was only designed to study physics and make predictions about the world, ever invent qualia as a hypothesis?

Assuming the actual mouth movements we make when we say things like, "Qualia exist" are explainable via the scientific method, the robot scientist could still predict that we would talk and write about consciousness. But would it posit consciousness as a separate entity altogether? Would it treat consciousness as a deep mystery, even after peering into our brains and finding nothing but electrical impulses?

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T23:04:24.670Z · score: 1 (2 votes) · LW · GW

It seems to me that you are trying to recover the properties of conscious experience in a way that can be reduced to physics. Ultimately, I just feel that this approach is not likely to succeed without radical revisions to what you consider to be conscious experience. :)

Generally speaking, I agree with the dualists who argue that physics is incompatible with the claimed properties of qualia. Unlike the dualists, I see this as a strike against qualia rather than a strike against physics. David Chalmers does a great job in his articles outlining why conscious properties don't fit nicely in our normal physical models.

It's not simply that we are awaiting more data to fill in the details: it's that there seems to be no way even in principle to incorporate conscious experience into physics. Physics is just a different type of beast: it has no mental core, it is entirely made up of mathematical relations, and is completely global. Consciousness as it's described seems entirely inexplicable in that respect, and I don't see how it could possibly supervene on the physical.

One could imagine a hypothetical heaven-believer (someone who claimed to have gone to heaven and back) listing possible ways to incorporate their experience into physics. They could say,

  • Hard-to-eff, as it's not clear how physics interacts with the heavenly realm. We must do more work to find out where the entry points of heaven and earth are.
  • In practice private, due to the fact that technology hasn't been developed yet that can allow me to send messages back from heaven while I'm there.
  • Pretty directly apprehensible, because how would it even be possible for me to have experienced that without heaven literally being real!

On the other hand, a skeptic could reply that:

Even if mind reading technology isn't good enough yet, our best models say that humans can be described as complicated computers with a particular neural network architecture. And we know that computers can have bugs in them causing them to say things when there is no logical justification.

Also, we know that computers can lack perfect introspection so we know that even if it is utterly convinced that heaven is real, this could just be due to the fact that the computer is following its programming and is exceptionally stubborn.

Heaven has no clear interpretation in our physical models. Yes, we could see that a supervenience is possible. But why rely on that hope? Isn't it better to say that the belief is caused by some sort of internal illusion? The latter hypothesis is at least explicable within our models and doesn't require us to make new fundamental philosophical advances.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T22:18:07.028Z · score: 3 (2 votes) · LW · GW

Sure. There are a number of properties usually associated with qualia which are the things I deny. If we strip these properties away (something Keith Frankish refers to as zero qualia) then we can still say that they exist. But it's confusing to say that something exists when its properties are so minimal. Daniel Dennett listed a number of properties that philosophers have assigned to qualia and conscious experience more generally:

(1) ineffable (2) intrinsic (3) private (4) directly or immediately apprehensible

Ineffable because there's something Mary the neuroscientist is missing when she is in the black and white room. And someone who tried explaining color to her would not be able to do so fully.

Intrinsic because it cannot be reduced to bare physical entities, like electrons (think: could you construct a quale if you had the right set of particles?).

Private because they are accessible to us and not globally available. In this sense, if you tried to find out the qualia that a mouse was experiencing as it fell victim to a trap, you would come up fundamentally short because it was specific to the mouse mind and not yours. Or as Nagel put it, there's no way that third person science could discover what it's like to be a bat.

Directly apprehensible because they are the elementary things that make up our experience of the world. Look around and qualia are just what you find. They are the building blocks of our perception of the world.

It's not necessarily that none of these properties could be steelmanned. It is just that they are so far from being steelmannable that it is better to deny their existence entirely. It is the same as my analogy with a person who claims to have visited heaven. We could either talk about it as illusory or non-illusory. But for practical purposes, if we chose the non-illusory route we would probably be quite confused. That is, if we tried finding heaven inside the physical world, with the same properties as the claimant had proposed, then we would come up short. Far better, then, to treat it as a mistake inside of our cognitive hardware.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T21:40:50.853Z · score: 1 (1 votes) · LW · GW

I won't lie -- I have a very strong intuition that there's this visual field in front of me, and that I can hear sounds that have distinct qualities, and simultaneously I can feel thoughts rush into my head as if there is an internal speaker and listener. And when I reflect on some visual in the distance, it seems as though the colors are very crisp and exist in some way independent of simple information processing in a computer-type device. It all seems very real to me.

I think the main claim of the illusionist is that these intuitions (at least insofar as the intuitions are making claims about the properties of qualia) are just radically incorrect. It's as if our brains have an internal error in them, not allowing us to understand the true nature of these entities. It's not that we can't see or something like that. It's just that the quality of perceiving the world has essentially an identical structure to what one might imagine a computer with a camera would "see."

Analogy: Some people who claim to have experienced heaven aren't just making stuff up. In some sense, their perception is real. It just doesn't have the properties we would expect it to have at face value. And if we actually tried looking for heaven in the physical world we would find it to be little else than an illusion.

Comment by matthew-barnett on Goodhart's Curse and Limitations on AI Alignment · 2019-08-19T20:52:35.804Z · score: 1 (1 votes) · LW · GW
a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

Just a minor nitpick, I don't think the point of the Rocket Alignment Metaphor was supposed to be that slight misalignment was catastrophic. I think the more apt interpretation is that apparent alignment does not equal actual alignment, and you need to do a lot of work before you get to the point where you can talk meaningfully about aligning an AI at all. Relevant quote from the essay,

It’s not that current rocket ideas are almost right, and we just need to solve one or two more problems to make them work. The conceptual distance that separates anyone from solving the rocket alignment problem is much greater than that.
Right now everyone is confused about rocket trajectories, and we’re trying to become less confused. That’s what we need to do next, not run out and advise rocket engineers to build their rockets the way that our current math papers are talking about. Not until we stop being confused about extremely basic questions like why the Earth doesn’t fall into the Sun.
Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T18:48:51.145Z · score: 3 (2 votes) · LW · GW

In discussions about consciousness I find myself repeating the same basic argument against the existence of qualia constantly. I don't do this just to be annoying: It is just my experience that

1. People find consciousness really hard to think about, and it has been known to cause a lot of disagreements.

2. Personally I think that this particular argument dissolved perhaps 50% of all my confusion about the topic, and was one of the simplest, clearest arguments that I've ever seen.

I am not being original either. The argument is the same one that has been used in various forms across Illusionist/Eliminativist literature that I can find on the internet. Eliezer Yudkowsky used a version of it many years ago. Even David Chalmers, who is quite the formidable consciousness realist, admits in The Meta-Problem of Consciousness that the argument is the best one he can find against his position.

The argument is simply this:

If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis.

This is the standard debunking argument. It has a more general form which can be used to deny the existence of a lot of other non-reductive things: distinct personal identities, gods, spirits, libertarian free will, a mind-independent morality etc. In some sense it's just an extended version of Occam's razor, showing us that qualia don't do anything in our physical theories, and thus can be rejected as things that actually exist out there in any sense.

To me this argument is very clear, and yet I find myself arguing it a lot. I am not sure how else to get people to see my side of it other than sending them a bunch of articles which more-or-less make the exact same argument but from different perspectives.

I think the human brain is built to have a blind spot on a lot of things, and consciousness is perhaps one of them. I think quite a bit about how, if humanity is not able to think clearly about this thing which we have spent many research years on, then there might be some other low-hanging philosophical fruits still remaining.

Addendum: I am not saying I have consciousness figured out. However, I think it's analogous to how atheists haven't "got religion figured out" yet they have at the very least taken their first steps by actually rejecting religion. It's not a full theory of religious belief, or even a theory at all. It's just the first thing you do if you want to understand the subject. I roughly agree with Keith Frankish's take on the matter.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-19T18:06:28.176Z · score: 4 (3 votes) · LW · GW

Related to: Realism about rationality

I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can't help but think that "reflection" is a vacuous feel-good word that has no shared common meaning.

Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can be good. One example of a narrow form of reflection is philosophical reflection where we compare the details of two possible outcomes and then decide which one is better.

However, there are much broader forms of reflection which I'm more hesitant to endorse. Namely, the vague types of reflection, such as reflecting on whether we really value happiness, or whether we should really truly be worried about animal suffering.

I can perhaps sympathize with the intuition that we should really try to make sure that what we put into an AI is what we really want, rather than just what we superficially want. But fundamentally, I am skeptical that there is any canonical way of doing this type of reflection that avoids arbitrariness.

I have heard something along the lines of "I would want a reflective procedure that extrapolates my values as long as the procedure wasn't deceiving me or had some ulterior motive" but I just don't see how this type of reflection corresponds to any natural class. At some point, we will just have to put some arbitrariness into the value system, and there won't be any "right answer" about how the extrapolation is done.

Comment by matthew-barnett on A Primer on Matrix Calculus, Part 2: Jacobians and other fun · 2019-08-18T05:57:01.805Z · score: 1 (1 votes) · LW · GW
This isn't quite true; the determinant being small is consistent with small changes in input making arbitrarily large changes in output, just so long as small changes in input in a different direction make sufficiently small changes in output.

Hmm, good point. I suppose that's why we're not minimizing the determinant, but rather the Frobenius norm. Hence:

An alternative definition of the Frobenius norm better highlights its connection to the motivation for regularizing the Jacobian's Frobenius norm

Makes sense.
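For concreteness, here is a minimal numpy sketch of the quantity being penalized. The map, shapes, and weights are illustrative (a one-layer `tanh` map whose Jacobian has a closed form), and the analytic Jacobian is checked against finite differences; its squared Frobenius norm sums the squared sensitivities of every output to every input, which is what a contractive penalty shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5)) * 0.5

def f(x):
    # A one-layer map f(x) = tanh(W x).
    return np.tanh(W @ x)

def jacobian(x):
    # d tanh(u)/du = 1 - tanh(u)^2, applied row-wise to W:
    # J[i, j] = (1 - tanh(u_i)^2) * W[i, j].
    u = W @ x
    return (1.0 - np.tanh(u) ** 2)[:, None] * W

x = rng.normal(size=5)
J = jacobian(x)

# Squared Frobenius norm: total sensitivity of outputs to inputs.
frob_sq = np.sum(J ** 2)

# Sanity check against a central finite-difference Jacobian.
eps = 1e-6
J_fd = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(5)], axis=1)
assert np.allclose(J, J_fd, atol=1e-5)
print(frob_sq)
```

A small determinant, by contrast, only says the product of singular values is small; one direction can still be arbitrarily sensitive, which is why the Frobenius norm is the better regularization target here.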

Comment by matthew-barnett on A Primer on Matrix Calculus, Part 3: The Chain Rule · 2019-08-17T06:21:34.242Z · score: 1 (1 votes) · LW · GW

Thanks. I agree with using computational graphs. I think backpropagation is much easier to understand using graphs if you are new to the subject. The reason I didn't do it here is mainly because there's already a lot of guides that do that online, but fewer that introduce tensors and how they interact with deep learning. Also I'm writing these posts primarily so that I can learn, although of course I hope other people find these posts useful.

I also want to add that this guide is far from complete, and so I would want to read yours to see what types of things I might have done better. :)

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-16T23:37:16.820Z · score: 3 (2 votes) · LW · GW
Perhaps you shouldn't frame it as "study early" vs "study late", but "study X" vs "study Y".

My point was that these are separate questions. If you begin to suspect that understanding ML research requires an understanding of type theory, then you can start learning type theory. Alternatively, you can learn type theory before researching machine learning -- i.e., reading machine learning papers -- in the hopes that it builds useful groundwork.

But what you can't do is learn type theory and read machine learning research papers at the same time. You must make tradeoffs. Each minute you spend learning type theory is a minute you could have spent reading more machine learning research.

The model I was trying to draw was not one where I said, "Don't learn math." I explicitly said it was a model where you learn math as needed.

My point was not intended to be about my abilities. This is a valid concern, but I did not think that was my primary argument. Even conditioning on having outstanding abilities to learn every subject, I still think my argument (weakly) holds.

Note: I also want to say I'm kind of confused, because I suspect there's an implicit assumption that reading machine learning research is inherently easier than learning math. I side with the intuition that math isn't inherently difficult; it just requires memorizing a lot of things and practicing. The same is true for reading ML papers, which makes me confused about why this is being framed as a debate over whether people have certain abilities to learn and do research.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-16T22:17:14.055Z · score: 6 (3 votes) · LW · GW

That's a good point about motivated reasoning. I should distinguish arguments that the lazy approach is better for people and arguments that it's better for me. Whether it's better for people more generally depends on the reference class we're talking about. I will assume people who are interested in the foundations of mathematics as a hobby outside of AI safety should take my advice less seriously.

However, I still think that it's not exactly clear that going the foundational route is actually that useful on a per-unit time basis. The model I proposed wasn't as simple as "learn the formal math" versus "think more intuitively." It was specifically a question of whether we should learn the math on an as-needed basis. For that reason, I'm still skeptical that going out and reading textbooks on subjects that are only vaguely related to current machine learning work is valuable for the vast majority of people who want to go into AI safety as quickly as possible.

Sidenote: I think there's a failure mode of not adequately optimizing time, or being insensitive to time constraints. Learning an entire field of math from scratch takes a lot of time, even for the brightest people alive. I'm worried that, "Well, you never know if subject X might be useful" is sometimes used as a fully general counterargument. The question is not, "Might this be useful?" The question is, "Is this the most useful thing I could learn in the next time interval?"

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-16T20:17:52.934Z · score: 5 (4 votes) · LW · GW

Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as "defensive."

According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn't get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can't give an idea up because it would make them look bad if we did. Therefore, the common wisdom advocates treating ideas as being handed to us by a tablet from the clouds rather than a product of our brain's thinking habits. Taking this advice allows us to detach ourselves from our ideas so that we don't confuse criticism with insults.

However, I think the exact opposite failure mode is not often enough pointed out and guarded against. Specifically, the failure mode is being too willing to abandon beliefs based on surface-level counterarguments. To alleviate this, I suggest we shouldn't be so ready to give up our ideas in the face of criticism.

This might sound irrational -- why should we get attached to our beliefs? I'm certainly not advocating that we should actually associate criticism with insults to our character or intelligence. Instead, my argument is that the process of defensively responding to criticism generates a productive adversarial structure.

Consider two people. Person A desperately wants to believe proposition X, and person B desperately wants to believe not X. If B comes up to A and says, "Your belief in X is unfounded. Here are the reasons..." Person A can either admit defeat, or fall into defensive mode. If A admits defeat, they might indeed get closer to the truth. On the other hand, if A gets into defensive mode, they might also get closer to the truth in the process of desperately searching for evidence of X.

My thesis is this: the human brain is very good at selectively searching for evidence. In particular, given some belief that we want to hold onto, we will go to great lengths to justify it, searching for evidence that we otherwise would not have searched for if we were just detached from the debate. It's sort of like the difference between a debate between two people who are assigned their roles by a coin toss, and a debate between people who have spent their entire lives justifying why they are on one side. The first debate is an interesting spectacle, but I expect the second debate to contain much deeper theoretical insight.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-16T18:33:10.834Z · score: 17 (6 votes) · LW · GW

I get the feeling that for AI safety, some people believe that it's crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory -- with the hopes of it someday giving me greater insight into AI safety.

Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.

I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.

Obviously neither model can be absolutely correct in an extreme sense. I don't, as a silly example, think that people who can't do basic arithmetic should go into AI safety before building a foundation in math. And on the other side of the spectrum, I think it would be absurd to think that one should become a world renowned mathematician before reading their first AI safety paper. That said, even though both models are wrong, I think my current preference is for the lazy model rather than the foundation model.

Here are some points in favor of both, informed by my first-person experience.

Points in favor of the foundations model:

  • If you don't have solid foundations in mathematics, you may not even be aware of things that you are missing.
  • Having solid foundations in mathematics will help you to think rigorously about things rather than having a vague non-reductionistic view of AI concepts.
    • Subpoint: MIRI work is motivated by coming up with new mathematics that can describe error-tolerant agents without relying on fuzzy statements like "machine learning relies on heuristics so we need to study heuristics rather than hard math to do alignment."
  • We should try to learn the math that will be useful for AI safety in the future, rather than what is being used for machine learning papers right now. If your view of AI is that it is at least a few decades away, then it's possible that learning the foundations of mathematics will be more robustly useful no matter where the field shifts.

Points in favor of the lazy model:

  • Time is limited and it usually takes several years to become proficient in the foundations of mathematics. This is time that could have been spent reading actual research directly related to AI safety.
  • The lazy model is better for my motivation, since it makes me feel like I am actually learning about what's important, rather than doing homework.
    • Learning foundational math often looks a lot like just taking a shotgun and learning everything that seems vaguely relevant to agent foundations. Unless you have a very strong passion for this type of mathematics, it would seem outright strange that this type of learning is fun.
  • It's not clear that the MIRI approach is correct. I don't have a strong opinion on this, however.
    • Even if the MIRI approach was correct, I don't think it's my comparative advantage to do foundational mathematics.
  • The lazy model will naturally force you to learn the things that are actually relevant, as measured by how much you come in contact with them. By contrast, the foundational model forces you to learn things which might not be relevant at all. Obviously, we won't know what is and isn't relevant beforehand, but I currently err on the side of saying that some things won't be relevant if they don't have a current direct input to machine learning.
  • Even if AI is many decades away, machine learning has been around for a long time, and it seems like the math useful for machine learning hasn't changed much. So, it seems like a safe bet that foundational math won't be relevant for understanding normal machine learning research any time soon.
Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-16T01:05:11.256Z · score: 3 (4 votes) · LW · GW
For example, let's say I set a goal to write a blog post about a topic I'm learning in 4 hours, and half-way through I realize I don't understand one of the key underlying concepts related to the thing I intended to write about.

Interesting, this exact same thing just happened to me a few hours ago. I was testing my technique by writing a post on variational autoencoders. Halfway through I was very confused because I was trying to contrast them to GANs but didn't have enough material or knowledge to know the advantages of either.

During an actual test, the right thing to do would be to do my best given what I know already and finish as many questions as possible. But I'd argue that in the blog post case, I very well may be better off saying, "OK I'm going to go learn about this other thing until I understand it, even if I don't end up finishing the post I wanted to write."

I agree that's probably true. However, this creates a bad incentive where, at least in my case, I will slowly start making myself lazier during the testing phase because I know I can always just "give up" and learn the required concept afterwards.

At least in the case I described above I just moved onto a different topic, because I was kind of getting sick of variational autoencoders. However, I was able to do this because I didn't have any external constraints, unlike the method I described in the parent comment.

The pithy way to say this is that tests are basically pure Goodhart, and it's dangerous to turn every real-life task into a game of maximizing legible metrics.

That's true, although perhaps one could devise a sufficiently complex test such that it matches perfectly with what we really want... well, I'm not saying that's a solved problem in any sense.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-15T23:55:51.849Z · score: 3 (2 votes) · LW · GW

I agree that it is probably too hard to "take a final exam all the time." On the other hand, I feel like I could make a much weaker claim that this is an improvement over a lot of productivity techniques, which often seem to more-or-less be dependent on just having enough willpower to actually learn.

At least in this case, each action you do can be informed directly by whether you actually succeed or fail at the goal (like getting upvotes on a post). Whether or not learning is a good instrumental proxy for getting upvotes in this setting is an open question.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-15T21:57:25.547Z · score: 3 (2 votes) · LW · GW

Yes, the difference is that you are creating an external environment which rewards you for success and punishes you for failure. This is similar to taking a final exam, which is my inspiration.

The problem with committing to work rather than success is that you can always just rationalize something as "Oh I worked hard" or "I put in my best effort." However, just as with a final exam, the only thing that will matter in the end is if you actually do what it takes to get the high score. This incentivizes good consequentialist thinking and disincentivizes rationalization.

I agree there are things out of your control, but the same is true with final exams. For instance, the test-maker could have put something on the test that you didn't study much for. This encourages people to put extra effort into their assigned task to ensure robustness to outside forces.

Comment by matthew-barnett on A Primer on Matrix Calculus, Part 2: Jacobians and other fun · 2019-08-15T20:15:05.732Z · score: 2 (3 votes) · LW · GW

I'm not sure if you're referring to the fact that it is small. If so: apologies. At the time of posting there was (still is?) a bug prohibiting me from resizing images on posts. My understanding is that this is being fixed.

Also yeah, zooming in would be good I think because that means that it's robust to changes (i.e. it's going to classify the input correctly even if we add noise to it). I think it isn't actually zooming in, though: it's just that the decision basin for the input is getting larger.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-15T19:15:55.988Z · score: 16 (6 votes) · LW · GW

I think there are some serious low hanging fruits for making people productive that I haven't seen anyone write about (not that I've looked very hard). Let me just introduce a proof of concept:

Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I'm doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.

Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:

1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.

2. We are in an environment where distractions are explicitly prohibited, so we can't make excuses to ourselves about why we need to check Facebook or whatever.

3. There is a clock at the front of the room which makes us feel like time is limited. We can't just sit there doing nothing because then time will just slip away.

4. Every problem you do well on benefits you by a little bit, meaning that there's a gradient of success rather than a binary pass or fail (though sometimes it's binary). This means that we care a lot about optimizing every second because we can always do slightly better.

If we wanted to do deep work for some other desired task, all four of these reasons seem like they could be replicable. Here is one idea (related to my own studying), although I'm sure I can come up with a better one if I thought deeply about this for longer:

Set up a room where you are given a limited amount of resources (say, a few academic papers, a computer without an internet connection, and a textbook). Set aside a four hour window where you're not allowed to leave the room except to go to the bathroom (and some person explicitly checks in on you like twice to see whether you are doing what you say you are doing). Make it your goal to write a blog post explaining some technical concept. Afterwards, the blog post gets posted to Lesswrong (conditional on it being at least minimal quality). You set some goal, like it must achieve 30 upvote reputation after 3 days. Commit to paying $1 to a friend for each upvote you score below the target reputation. So, if your blog post is at +15, you must pay $15 to your friend.
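The penalty rule above can be sketched in a few lines of Python (the function name and defaults are my own, purely for illustration):

```python
def penalty(actual_score: int, target_score: int = 30, dollars_per_upvote: int = 1) -> int:
    """Dollars owed to your friend: $1 per upvote below the target, never negative."""
    shortfall = max(0, target_score - actual_score)
    return shortfall * dollars_per_upvote

# The example from the text: a post at +15 against a target of 30 owes $15.
print(penalty(15))  # 15
print(penalty(35))  # 0 -- meeting or beating the target costs nothing
```

Note the `max(0, ...)`: without it, exceeding the target would imply your friend pays you, which is a different (and more hackable) contract.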

I can see a few problems with this design:

1. You are optimizing for upvotes, not clarity or understanding. The two might be correlated but at the very least there's a Goodhart effect.

2. Your "friend" could downvote the post. The scheme can easily be hacked by other people who are interested, and it encourages vote manipulation, etc.

Still, I think that I might be on the right track towards something that boosts productivity by a lot.

Comment by matthew-barnett on Dony's Shortform Feed · 2019-08-15T18:50:23.845Z · score: 2 (2 votes) · LW · GW
If, as Dony was originally asking, it were possible to just get into a mental state where you could work productively (including creatively) indefinitely, people would have found it.

Perhaps not indefinitely, but I do think there are people like this already? There are some people who are much more productive than others, even at similar intelligence levels. The simplest explanation is that these people have simply discovered a way to be productive for many hours in a day.

Personally, I know it's at least possible to be productive for a long time (say 10 hours with a few breaks). I also think professional gamers are typically productive for this much most days.

I think the main issue is that it's difficult to transfer insights and motivation to other people.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-15T17:46:34.089Z · score: 1 (1 votes) · LW · GW

Perhaps it says something about the human brain (or just mine) that I did not immediately think of that as a solution.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-14T17:00:09.856Z · score: 7 (4 votes) · LW · GW

The only flaw I find with this is that if I get stuck on an exercise, I reach the following decision: should I look at the answer and move on, or should I keep at it.

If I choose the first option, this makes me feel like I've cheated. I'm not sure what it is about human psychology, but I think that if you've cheated once, you feel less guilty a second time because "I've already done it." So, I start cheating more and more, until soon enough I'm just skipping things and cutting corners again.

If I choose the second option, then I might be stuck for several hours, and this causes me to just abandon the textbook and develop an ugh field around it.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-13T18:26:45.300Z · score: 18 (6 votes) · LW · GW

Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.

Eight years ago, Luke Muehlhauser wrote,

For years, my self-education was stupid and wasteful. I learned by consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff's Notes. How inefficient!
I've since discovered that textbooks are usually the quickest and best way to learn new material.

However, I have repeatedly found that this is not good advice for me.

I want to briefly list the reasons why I don't find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, "I agree completely. I feel exactly the same way" or someone might appear to say, "I used to feel that way, but then I tried this..." This is what I have discovered:

  • When I sit down to read a long textbook, I find myself subconsciously constantly checking how many pages I have read. For instance, if I have been sitting down for over an hour and I find that I have barely made a dent in the first chapter, much less the book, I have a feeling of hopelessness that I'll ever be able to "make it through" the whole thing.
  • When I try to read a textbook cover to cover, I find myself much more concerned with finishing rather than understanding. I want the satisfaction of being able to say I read the whole thing, every page. This means that I will sometimes cut corners in my understanding just to make it through a difficult part. This ends in disaster once the next chapter requires a solid understanding of the last.
  • Reading a long book feels less like I'm slowly building insights and more like I'm doing homework. By contrast, when I read blog posts it feels like there's no finish line, and I can quit at any time. When I do read a good blog post, I often end up thinking about its thesis for hours afterwards, solidifying the content in my mind. I cannot replicate this feeling with a textbook.
  • Textbooks seem overly formal at points, and they often do not repeat information, instead putting the burden on the reader to re-read earlier sections. This makes it difficult to read in a linear fashion, which is straining.
  • If I don't understand a concept I can get "stuck" on the textbook, disincentivizing me from finishing. By contrast, if I just learned as Muehlhauser described, by "consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff's Notes" I feel much less stuck since I can always just move from one source to the next without feeling like I have an obligation to finish.
Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-13T18:03:51.745Z · score: 1 (1 votes) · LW · GW

I'm not saying that I'm proud of this fact. It is mostly that I'm ignorant of it. :)

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-12T22:14:51.199Z · score: 3 (2 votes) · LW · GW

Those are all pretty good. :)

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-12T21:07:14.340Z · score: 1 (1 votes) · LW · GW

Then I will assert that I would in fact appreciate seeing the reasons for disagreement, even as the case may be that it comes down to axiomatic intuitions.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-12T20:56:27.662Z · score: 1 (1 votes) · LW · GW

I might add that I also consider the development of ethical anti-realism to be another, perhaps more insightful, achievement. But this development is, from what I understand, usually attributed to Hume.

Depending on what you mean by "pleasure" and "pain" it is possible that you merely have a simple conception of the two words which makes this identification incompatible with complexity of value. The robust form of this distinction was provided by John Stuart Mill who identified that some forms of pleasure can be more valuable than others (which is honestly quite similar to what we might find in the fun theory sequence...).

In its modern formulation, I would say that Bentham's contribution was identifying conscious states as the primary theater in which value can exist. I can hardly disagree, as I struggle to imagine things in this world which could possibly have value outside of conscious experience. Still, I think there are perhaps some, which is why I conceded by using the words "primary source of value" rather than "sole source of value."

To the extent that complexity of value disagrees with what I have written above, I'm inclined to disagree with complexity of value :).

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-08-12T17:18:04.644Z · score: 6 (5 votes) · LW · GW

Forgive me for cliche scientism, but I recently realized that I can't think of any major philosophical developments in the last two centuries that occurred within academic philosophy. If I were to try to list major philosophical achievements since 1819, these would likely appear on my list, but none of them were from those trained in philosophy:

  • A convincing, simple explanation for the apparent design we find in the living world (Darwin and Wallace).
  • The unification of time and space into one fabric (Einstein)
  • A solid foundation for axiomatic mathematics (Zermelo and Fraenkel).
  • A model of computation, and a plausible framework for explaining mental activity (Turing and Church).

By contrast, if we go back to previous centuries, I don't have much of an issue citing philosophical achievements from philosophers:

  • The identification of the pain-pleasure axis as the primary source of value (Bentham).
  • Advanced notions of causality, reductionism, scientific skepticism (Hume)
  • Extension of moral sympathies to those in the animal kingdom (too many philosophers to name)
  • A highlight of the value of wisdom and learned debate (Socrates, and others)

Of course, this is probably caused by my bias towards Lesswrong-adjacent philosophy. If I had to pick philosophers who have made major contributions, these people would be on my shortlist:

John Stuart Mill, Karl Marx, Thomas Nagel, Derek Parfit, Bertrand Russell, Arthur Schopenhauer.

Comment by matthew-barnett on Why Gradients Vanish and Explode · 2019-08-10T04:25:56.839Z · score: 3 (2 votes) · LW · GW

Interesting. I just re-read it and you are completely right. Well, I wonder how that interacts with what I said above.