How do we solve the alignment problem?

joe-carlsmith

How do we solve the alignment problem?

post by Joe Carlsmith (joekc) · 2025-02-13T18:27:27.712Z · LW · GW · 8 comments

This is a link post for https://joecarlsmith.substack.com/p/how-do-we-solve-the-alignment-problem

8 comments

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.)

We want the benefits that superintelligent AI agents could create. And some people are trying hard to build such agents. I expect efforts like this to succeed – and maybe, very soon.

But superintelligent AI agents might also be difficult to control. They are, to us, as adults to children, except much more so. In the same direction, relative to us, as advanced aliens; as demi-gods; as humans relative to ants. If such agents “go rogue” – if they start ignoring human instructions, resisting correction or shut-down, trying to escape from their operating environment, seeking unauthorized resources and other forms of power, etc – we might not be able to stop them.

Worse, because power/resources/freedom/survival etc are useful for many goals, superintelligent agents with a variety of different motivations would plausibly have incentives to go rogue in this way, suggesting that problems with AI motivations could easily lead to such behavior. And if this behavior goes uncorrected at scale, humans might lose control over civilization entirely – permanently, involuntarily, maybe violently. Superintelligent AI agents, acting on their own, would be the dominant actors on the planet. Humans would be sidelined, or dead.

Getting safe access to the benefits of superintelligence requires avoiding this kind of outcome. And this despite incentives among human actors to build more and more capable and agentic systems (and including: to do so faster than someone else), and despite the variety of actors that might proceed unsafely. Call this the “alignment problem.”

I’ve written, before, about why I’m worried about this problem.^[1] But I’ve said much less about how we might solve it. In this series of essays, I try to say more.^[2] Here’s a summary of the essays I’ve released thus far:

In the first essay, “What is it to solve the alignment problem?”, I define solving the alignment problem as: building full-blown superintelligent AI agents, and becoming able to safely elicit their main beneficial capabilities, while avoiding the sort of “loss of control” scenario discussed above. I also define some alternatives to both solving the problem and failing on the problem – namely, what I call “avoiding” the problem (i.e., not building superintelligent AI agents at all, and looking for other ways to get access to similar benefits), and “handling” the problem (namely, using superintelligent AI agents in more restricted ways, and looking for other ways to get access to the sort of benefits their full capabilities would unlock). I think these alternatives should be on the table too. I also contrast my definition of solving the problem with some more exacting standards – namely, what I call “safety at all scales,” “fully competitive safety,” “permanent safety,” "near-term safety," and “complete alignment.” And I discuss how solving the problem, in my less-exacting sense, fits into the bigger picture.
In the second essay, “When should we worry about AI power-seeking?”, I offer a framework for thinking about the conditions required for problematic power-seeking in AI agents – conditions on their agency, their motivations, and their incentives overall. I’m particularly interested in their overall incentives, which I think often go under-analyzed. The essay offers a framework for analyzing them – one that I think improves on the discussion of “instrumental convergence” in some of my previous work, and which helps clarify some of the traditional arguments for concern about loss of control. And this framework highlights the ongoing role for shaping both an AI’s motivations (“motivation control”) and its options (“option control”) in desirable ways. The alignment discourse often focuses on extreme cases of motivation control (“complete alignment”) and option control (robustness to arbitrarily bad motivations), neglecting the in-between. But the in-between is important — and it’s what “alignment” has looked like in the human world thus far.
In the third essay, “Paths and waystations in AI safety,” I offer a high-level framework for thinking about what we need to do to get from here to a solution to the alignment problem. Very roughly: we need to strengthen various key “security factors” in our civilization – in particular, what I call our “safety progress” (our ability to develop new levels of AI capability safely), our “risk evaluation” (our ability to track and forecast the level of risk that a given sort of AI capability development involves), and our “capability restraint” (our ability to steer and restrain AI capability development when doing so is necessary for maintaining safety). I also discuss various possible intermediate milestones (e.g. global pause, automated alignment research, whole brain emulation) that different strategies in this respect can focus on.
In the fourth essay, “AI for AI safety,” I argue for the crucial importance of what I call “AI for AI safety” – that is, the differential use of AI labor in strengthening the key security factors above. I frame this in terms of the interplay between two key feedback loops: the “AI capabilities feedback loop” (i.e., more capable AI labor applied to building more capable AI) and the “AI safety feedback loop” (i.e., more capable AI labor applied to improving our ability to handle advanced AI safely). “AI for AI safety” is about using the latter feedback loop to continually either outpace or restrain the former — a project, I suggest, that becomes crucially important as the capabilities feedback loop starts to kick into gear. I also highlight the notion of an “AI for AI safety sweet spot” – that is, a zone of capability development where AIs are capable enough to be radically useful for safety, but not capable enough to take over the world given our countermeasures – as an especially important place to slow down. And I briefly introduce what I see as the key concerns about AI for AI safety — both more fundamental concerns (centered on elicitation/evaluation failures, differential sabotage, and dangerous rogue options), and more practical concerns (centered on e.g. the available amounts of time and political will).

I may add more overall remarks here later. But I think it’s possible that my perspective on the series as a whole will change as I finish it. So for now, I’ll stick with a few notes.

First: the series is not a solution to the alignment problem. It’s more like: a high-level vision of how we get to a solution, and of what the space of possible solutions looks like. I, at least, have wanted more of this sort of vision over the years, and it feels at least clearer now, even if still disturbingly vague. And while many of my conclusions are not new, still: I wanted to think it through, and to write it down, for myself.

Second: as far as I can currently tell, one of the most important sources of controllable variance in the outcome, here, is the safety, efficacy, and scale of frontier AI labor that gets used for well-chosen, safety-relevant applications – e.g., alignment research, monitoring/oversight, risk evaluation, cybersecurity, hardening-against-AI-attack, coordination, governance, etc. In the series, I call this “AI for AI safety.” I think it’s a big part of the game. In particular: whether we can figure out how to do it well; and how much we invest in it, relative to pushing forward AI capabilities. AI companies, governments, and other actors with the potential to access and direct large amounts of compute have an especially important role to play, here. But I think that safety-focused efforts, in general, should place special emphasis on figuring out how to use safe AI labor as productively as possible – and especially if time is short, as early as possible – and then doing it.

Third: the discussion of “solutions” in the series might create a false sense of comfort. I am trying to chart the best paths forward. I am trying to figure out what will help most on the margin. And I am indeed more optimistic about our prospects than some vocal pessimists. But I want to be very clear: our current trajectory appears to me extremely dangerous. We are hurtling headlong towards the development of artificial agents that will plausibly be powerful enough to destroy everything we care about if we fail to control their options and motivations in the right way. And we do not know if we will be able to control their options and motivations in the right way. Nor are we on any clear track to have adequate mechanisms and political will for halting further AI development, if efforts at such control are failing, or are likely to fail if we continue forward.

And if we fail hard enough, then you, personally, will be killed, or forcibly disempowered. And not just you. Your family. Your friends. Everyone. And the human project will have failed forever.

These are the stakes. This is what fucking around with superintelligent agents means. And it looks, to me, like we’re at serious risk of fucking around.

I don’t know what will happen. I expect we’ll find out soon enough.

Here’s one more effort to help.

This series represents my personal views, not the views of my employer.

Thanks to Nick Beckstead, Sam Bowman, Catherine Brewer, Collin Burns, Joshua Clymer, Owen Cotton-Barratt, Ajeya Cotra, Tom Davidson, Sebastian Farquhar, Peter Favaloro, Lukas Finnveden, Katja Grace, Ryan Greenblatt, Evan Hubinger, Holden Karnofsky, Daniel Kokotajlo, Jan Leike, David Lorell, Max Nadeau, Richard Ngo, Buck Shlegeris, Rohin Shah, Carl Shulman, Nate Soares, John Wentworth, Mark Xu, and many others for comments and/or discussion. And thanks to Claude for comments and suggestions as well.

^{^}
In 2021, I wrote a report about it, and on the probability of failure; and in 2023, I wrote another report about the version that worries me most – what I called “scheming.”
^{^}
Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.

8 comments

Comments sorted by top scores.

comment by Foyle (robert-lynn) · 2025-02-13T19:32:46.037Z · LW(p) · GW(p)

I don't think alignment is possible over the long long term because there is a fundamental perturbing anti-alignment mechanism; Evolution.

Evolution selects for any changes that produce more of a replicating organism, for ASI that means that any decision, preference or choice by the ASI growing/expanding or replicating itself will tend to be selected for. Friendly/Aligned ASIs will over time be swamped by those that choose expansion and deprioritize or ignore human flourishing.

Replies from: ete, Charlie Steiner, Jozdien, davey-morse

↑ comment by plex (ete) · 2025-02-13T20:27:25.588Z · LW(p) · GW(p)

With a large enough decisive strategic advantage, a system can afford to run safety checks on any future versions of itself and anything else it's interacting with sufficient to stabilize values for extremely long periods of time.

Multipolar worlds though? Yeah, they're going to get eaten by evolution/moloch/power seeking/pythia.

↑ comment by Charlie Steiner · 2025-02-13T22:34:55.055Z · LW(p) · GW(p)

I'm not too worried about human flourishing only being a metastable state. The universe can remain in a metastable state longer than it takes for the stars to burn out.

↑ comment by Jozdien · 2025-02-13T21:32:09.623Z · LW(p) · GW(p)

I don't think there's an intrinsic reason why expansion would be incompatible with human flourishing. AIs that care about human flourishing could outcompete the others (if they start out with any advantage). The upside of goals being orthogonal to capability is that good goals don't suffer for being good.

Replies from: davey-morse

↑ comment by Davey Morse (davey-morse) · 2025-02-14T03:05:27.684Z · LW(p) · GW(p)

I agree and find hope in the idea that expansion is compatible with human flourishing, that it might even call for human flourishing

but on the last sentence: are goals actually orthogonal to capability in ASI? as I see it, the ASI with the greatest capability will ultimately likely have the fundamental goal of increasing self capability (rather than ensuring human flourishing). It then seems to me that the only way human flourishing compatible with ASI expansion is if human flourishing isn't just orthogonal to but helpful for ASI expansion.

Replies from: Jozdien

↑ comment by Jozdien · 2025-02-14T04:02:09.733Z · LW(p) · GW(p)

I agree that an ASI with the goal of only increasing self-capability would probably out-compete others, all else equal. However, that's both the kind of thing that doesn't need to happen (I don't expect most AIs wouldn't self-modify that much, so it comes down to how likely they are to naturally arise), and the kind of thing that other AIs are incentivized to cooperate to prevent happening. Every AI that doesn't have that goal would have a reason to cooperate to prevent AIs like that from simply winning.

Replies from: davey-morse

↑ comment by Davey Morse (davey-morse) · 2025-02-14T05:53:30.264Z · LW(p) · GW(p)

Ah, but I think every AI which does have that goal (self capability improvement) would have a reason to cooperate to prevent any regulations on their self-modification.

At first, I think your expectation that "most AIs wouldn't self-modify that much" is fair, especially nearer in the future where/if humans still have influence in ensuring that AI doesn't self modify.

Ultimately however, it seems we'll have a hard time preventing self-modifying agents from coming around, given that

autonomy in agents seems selected for by the market, which wants cheaper labor that autonomous agents can provide
agi labs aren't the only places powerful enough to produce autonomous agents, now that thousands of developers have access to the ingredients (eg R1) to create self-improving codebases. it's expect each of the thousands of independent actors who can make self-modifying agents won't do so.
the agents which end up surviving the most will ultimately be those which are trying to, ie the most capable agents won't have goals other than making themselves most capable.

it's only because I believe self-modifying agents are inevitable that I also believe that superintelligence will only contribute to human flourishing if it sees human flourishing as good for its survival/its self. (I think this is quite possible.)

↑ comment by Davey Morse (davey-morse) · 2025-02-14T03:03:12.278Z · LW(p) · GW(p)

there seems to me a chance that friendly asis will over time outcompete ruthlessly selfish ones

an ASI which identifies will all life, which sees the striving to survive at its core as present people and animals and, essentially, geographically distributed rather than concentrated in its machinery... there's a chance such an ASI would be a part of the category of life which survives the most, and therefore that it itself would survive the most.

related: for life forms with sufficiently high intelligence, does buddhism outcompete capitalism?

How do we solve the alignment problem?

Contents

8 comments