Breaking down the MEAT of Alignment
post by JasonBrown · 2025-04-07T08:47:22.080Z · LW · GW
Preamble
This post is my attempt to organise some thinking about AI alignment in a way that hopefully acts as a partial overview of some of the core ideas and approaches. It is mostly a review of existing ideas, arranged with some light opinions thrown in. The goal is that this might help with spotting new combinations of ideas, offering a scaffold for thinking about alignment, or acting as an introductory resource / reference.
Introduction
Sufficiently powerful AI systems will probably not have the positive impact we hope they will [? · GW], and in fact might go horribly wrong [AF · GW], unless we have in some way guided, or 'aligned', them to act in ways we are confident will be beneficial. Thus we need some idea of what this system will do that will make it have a positive impact on the world, some way of creating this system, and finally confidence that what we've made (or are going to make) will have the properties we care about.
Thus there are three big things to think about in order to practically "solve alignment", or otherwise have safe deployment of advanced AI systems:
- Target: What is the core thing we are aiming to do, and why?
- MEthod: How can we try to achieve it?
- Assurances: How will we know we've achieved it?
With slight rearrangement, we see we have the MEAT of alignment. Realistic alignment proposals ought to make a well-motivated selection in each of these categories, or equivalently be decomposable along them, in order to show their suitability as a practical alignment strategy. The rest of this post will focus on detailing these categories and exploring how existing work fits inside of them.
A similar idea to this post is Training Stories [AF · GW], which breaks down proposals into training goals and training rationale. I frame things slightly differently and provide a more up-to-date view of existing literature (though I'm sure there's plenty of interesting and cool work I've missed), but I'd recommend people also read that if they haven't already.
Targets
The target is the end goal, a specification of the thing we want to instantiate and why it's a good idea to do so. Broadly I think there are five different types of target:
- Human Value Aligned
- Corrigible
- Cooperative
- Intent Aligned
- Truth Seeking
These proposals are somewhat mixed in the extent of agency [? · GW] they expect advanced AI systems to have (or be constructed to have), which matters as agency probably plays a big role in the alignment problem [LW · GW] and agentic AIs are probably more powerful than non-agentic ones, with some even saying that if a non-agentic AI was sufficiently advanced we might have to be concerned about powerful agents anyway [LW · GW].
Overall we'll see that many of these targets are promising, although none seem both sufficient and easy to execute. They can also overlap or be combined in complementary ways, so I imagine that for many near-term and medium-term systems we'll aim at some combination of them all.
Alignment to Human Values
The most obvious target for aligned AI is to have it share our values, such that it "wants what we want" and therefore acts in a way that is as beneficial to us as possible. This does pose the question of what exactly it is that we want, and I think the strongest answer to this is Coherent Extrapolated Volition (CEV) [? · GW]. CEV is basically the answer to what we would all want if we knew more, could think faster, and knew ourselves better. This does seem very much like the thing we'll ultimately want to do, or at least something based off it, but the main problem is that this is going to be pretty impossible for quite a while.
Another approach to value alignment is to just try and get near-enough to something roughly representing human values, and hope that optimising this ends up with something we'd all want. This seems to be pretty much the current path we're on, with LLMs being trained to, amongst other things, be "a helpful and harmless assistant" based on satisfying a dataset of human preferences---or at least trying to---and then generalising from this. Whilst in chat form it may not be particularly agentic, "helpful and harmless" agents seem to be our current trajectory.
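To make the "satisfying a dataset of human preferences" part concrete, here's a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used to train reward models from preference data. Everything here (the toy linear reward head, the dummy embeddings, the shapes) is illustrative, not any lab's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward "model": a linear head over (already computed) response embeddings.
# In practice this head sits on top of a large language model.
reward_model = nn.Linear(768, 1)

# Dummy embeddings standing in for the preferred and dispreferred responses of
# a batch of human-labelled preference pairs.
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Bradley-Terry pairwise loss: push the reward of the preferred response above
# that of the dispreferred one. The policy is then optimised against this
# learnt reward (plus a KL penalty to the base model, omitted here).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```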
Near-enough alignment seems to be mostly working for now, but there are definitely some issues (e.g. jailbreaking, scheming) and some maybe-great-maybe-terrible phenomena (e.g. emergent misalignment, alignment faking). Will near-enough alignment work longer term? I think this is mostly an empirical question about the scaling and limits of model optimisation power, the amount of human-value reward information we can give to models, their ability to process this into a coherent objective, and the precise nature of what underlies Goodhart's Law and reward hacking. Or in other words, can we keep the proxy close enough to the true-thing as optimisation power increases such that we keep doing better on the true-thing in the limit?
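As a toy numerical illustration of that last question (nothing here is meant to model real training, it's just best-of-n selection on a noisy proxy of a Gaussian "true value"):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n, noise=1.0, trials=2000):
    """Pick the candidate with the highest *proxy* score out of n samples and
    report the average *true* value of the picked candidate."""
    true = rng.normal(size=(trials, n))
    proxy = true + rng.normal(scale=noise, size=(trials, n))
    picked = np.argmax(proxy, axis=1)
    return true[np.arange(trials), picked].mean()

# With well-behaved Gaussian noise, more optimisation pressure (larger n) keeps
# helping, but the gap between proxy score and true value widens; with
# heavier-tailed errors or a proxy that diverges from the true objective,
# true value can stop improving or actively fall.
for n in [1, 10, 100, 1000]:
    print(n, round(best_of_n(n), 3))
```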
Corrigibility
If we're not confident in our ability to fully align human values, we might want some fallback options. For example, being able to switch the AI off if we don't like what it's doing, or update it with a new goal that's closer to what we want. The problem is that if it's very intelligent and reasoning consequentially, it might oppose us doing these things, since they prevent it from carrying out whatever goal it currently has. Broadly speaking, an agent which allows such things to happen is "corrigible [? · GW]", and there have been attempts to sketch out what this might look like in practice.
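As a cartoon of why a consequentialist agent resists these fallbacks by default (numbers are made up, only the comparison matters):

```python
# Expected achievement of the agent's *current* goal under each option.
goal_value_if_kept_running = 100.0
goal_value_if_shut_down = 3.0

# A pure expected-utility maximiser over its current goal simply picks the
# higher-scoring option, so "allow shutdown" loses. Corrigibility proposals
# try to make the two options score equally, or otherwise remove the incentive
# to interfere, which turns out to be surprisingly hard to do coherently.
print(goal_value_if_kept_running > goal_value_if_shut_down)  # True: resist shutdown
```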
Some people who don't understand instrumental convergence [? · GW] think that this will be easy, but clearly they haven't read Planecrash and realised that corrigibility is anti-natural. Whilst current LLMs are quite corrigible in that we can just turn them off or fine-tune them however we want, if they gain enough power or control of their environment they may try to prevent us from doing so, or deceive us that this is something we ought to do, just as predicted.
Cooperative AI
Even if many agents have different utility functions and are not corrigible to one-another, cooperation can help achieve mutually beneficial outcomes [LW · GW]. Cooperative AI applies this to AI, and motivates the need for AIs to cooperate with us in order to avoid bad outcomes. I think there are two strands to this idea. The first is that cooperation will be required in multi-agent scenarios when we have lots of powerful AI systems deployed in the world [EA · GW], and the second is that cooperation itself may serve as an alignment target even with a single superintelligence.
This sub-field seems alive and well, with people already investigating the tendencies of LLMs to cooperate in simulated societies and social dilemmas, as well as training them to collaborate further. To me this seems like another semi-independent lever we have to pull---alongside value-alignment and corrigibility---for making powerful AI systems safer. It might even be that building a powerful AI with the target of simply cooperating with us is easier than full-on value alignment, though this is just speculation and I'm sure there'd be many non-obvious dangers and difficulties. For one, we'd probably need more advanced / complete decision theories [? · GW] that promote reliable cooperation of agents of varying power levels.
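For concreteness, here's the kind of social dilemma these investigations use, in its simplest form: an iterated prisoner's dilemma with two hand-coded strategies (purely illustrative, no LLMs involved).

```python
# Standard prisoner's dilemma payoffs: (row player, column player).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    return "C" if not history else history[-1][1]  # copy the opponent's last move

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    history, score_a, score_b = [], 0, 0
    for _ in range(rounds):
        a = strategy_a(history)
        b = strategy_b([(y, x) for x, y in history])  # flipped view for player B
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history.append((a, b))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (30, 30)
print(play(tit_for_tat, always_defect))  # exploited once, then retaliation: (9, 14)
```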
Intent Alignment
If a powerful, safe, value-driven AI might be too difficult, what if we instead just tried to get the AI to do what we say and carry out more narrow tasks myopically? This is often referred to as Task-directed AGI [? · GW] or "Intent-aligned AI", and depending on definitions might somewhat straddle the agent/non-agent boundary. More agentic versions of an intent-aligned AI might look like a corrigible agent, and less agentic versions might look more like an Oracle [? · GW] (also see below). That said, I think it's worth distinguishing it as its own target for alignment, with degree of agency having more relevance to capabilities.
Intent-alignment is not without issues. The most salient of these is misuse risk, with both direct and indirect threats, and considerable concern about the ability of AI to aid in creating CBRN threats. Beyond this there are difficulties in avoiding moral hazards [? · GW], and the specific interpretation of intent-alignment [? · GW]. That said, it's probably quite a lot easier to actually create than the proposals above, and might be our best near-term bet for safe-ish advanced models.
Truth Seeking
One type of target that does not rely on messy human values, anti-natural patterns, solving decision-theory, or inferring complex goals, is the simple truth [LW · GW]. Just have models tell us what is true, or answer our questions, and we'll handle the messiness of taking actions in the world to achieve our goals. There are a lot of proposals which fall in this category, with two broad framings being Oracle [? · GW] and Tool [? · GW] AI. More specific and interesting proposals include Iterated Amplification [? · GW], STEM AI [AF · GW], and Microscope AI [AF · GW].
As noted earlier, there might be some problems [LW · GW] with focussing on safety by not having agents, and there might also be some weird things that go on with sufficiently powerful predictors [LW · GW]. An alternative is simulators, that take in some starting conditions and then just do a step-by-step forward rollout on what happens next. This might avoid self-referential problems underpinning some weirdness of powerful predictors, and surprisingly enough, might actually be what sufficiently powerful self-supervised learning systems (e.g. LLMs) end up at [LW · GW].
Like some of the other targets discussed, truth can also be seen as a separate alignment lever. This is explored in attempts to make LLMs more honest, and in combining this with helpfulness and harmlessness. Another usage of the target of truth is AI safety via debate, which leverages zero-sum games and the fact that verification is often easier than generation to provide a way for humans to verify decisions or behaviours that require superintelligence to come up with. This can then be used to provide a robust training signal on complex tasks for LLMs that does not incentivise manipulation and persuasion. It's worth noting that for debate to work we require a form of intent-alignment, as the two debaters need to be trying their hardest to win, and definitely not secretly colluding with each other, though training them to try to win debates might overcome this.
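A rough schematic of the debate structure (the `debater_a`, `debater_b`, and `judge` callables are hypothetical stand-ins, and the real protocol has important details this skips):

```python
def run_debate(question, debater_a, debater_b, judge, n_rounds=3):
    """Two models argue opposite answers; a weaker judge (or human) reads the
    transcript and picks a winner. All three callables are hypothetical."""
    transcript = [f"Question: {question}"]
    for _ in range(n_rounds):
        transcript.append("A: " + debater_a(transcript))
        transcript.append("B: " + debater_b(transcript))
    winner = judge(transcript)  # returns "A" or "B"
    # Zero-sum reward: whatever one debater gains the other loses, which is
    # meant to remove any shared incentive to collude on a persuasive falsehood.
    reward_a = 1.0 if winner == "A" else -1.0
    return winner, reward_a, -reward_a
```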
Methods
Once we know what we're aiming for, we've got to figure out how to get there. We'll start by analysing different core approaches, then discuss extra things that can be incorporated to boost effectiveness, and finally we'll discuss scaffolds to embed the core into for further improvement.
Core Approaches
Early approaches to AGI imagined we'd code it up explicitly, like some giant Python program. People made complex diagrams trying to break down the features of intelligence, but it turns out this is quite difficult [LW · GW], and so hand-crafted approaches to AGI fell out of favour. That said, there's been a slight resurgence in this recently, with some proposing we use LLMs to help us code AGI [LW · GW].
If you're not going to code something by hand, your next best bet is probably machine learning (ML)---train some model on an appropriate loss function using a pile of data and maths. Within this, there are two main current approaches: reinforcement learning, which a lot of early intuitions were built off, and self-supervised learning, which actually ended up being the foundation of modern AI when applied to neural networks. Despite the somewhat unreasonable effectiveness of self-supervised learning on language, reinforcement learning has made recent comebacks for finetuning models to be "helpful and harmless" and good at coding and maths, though more direct supervised learning approaches are also being explored.
There are of course many other ML paradigms beyond these that have been explored, but these seem less likely---at least in the short to medium term---to lead to artificial superintelligence. This includes things like Markov chain Monte Carlo (MCMC), Gaussian Processes, evolution, and much much more. In fairness, MCMC did seem quite promising for a while, and might make a comeback.
Overall, it seems like the rough approach people are taking for now is to make a big neural network and then train it on a series of ML objectives via gradient descent. These objectives will mostly be a mix of self-supervised learning on various corpora of data, reinforcement learning, and supervised learning. The key considerations are therefore the data trained on, the loss function, and the inductive bias of the architecture and optimiser. One major problem with this general approach is that it might introduce a bunch of new risks and ways things can go wrong, though some think a clarified conceptual framing [AF · GW] might actually make things easier than they otherwise appear [AF · GW].
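A heavily simplified sketch of what that objective mix looks like in code (toy model, toy data, and the RL term is a bare one-step REINFORCE stand-in rather than anything a lab actually runs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))  # stand-in for an LLM

tokens = torch.randint(0, vocab, (8, 33))  # toy batch of token sequences
logits = model(tokens[:, :-1])             # predict the next token at each position

# 1. Self-supervised next-token loss on raw text corpora.
ssl_loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

# 2. Supervised fine-tuning uses the same loss, just on curated demonstrations.

# 3. RL fine-tuning, sketched as a one-step policy-gradient term: upweight
#    sampled continuations in proportion to a reward signal (e.g. from a
#    learnt reward model or a maths/code verifier).
logprobs = F.log_softmax(logits, dim=-1)
sampled = torch.distributions.Categorical(logits=logits).sample()
reward = torch.randn(8, 1)  # placeholder reward
rl_loss = -(reward * logprobs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)).mean()

(ssl_loss + rl_loss).backward()
```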
Additional Training Strategies
Naively applied machine learning can often have some serious defects, such as adversarial attacks and specification gaming. Other methods might also have analogous defects, and so one might want to incorporate some extra things into their approach in order to ensure everything behaves nicely.
In traditional machine learning literature regularisation methods like dropout, weight decay, and normalisation have been employed to great effect. Training against adversarial examples can also greatly increase robustness, with extensions of this idea being put forward like relaxed adversarial training [AF · GW] and latent adversarial training.
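A minimal sketch of vanilla adversarial training (a single FGSM-style perturbation on a toy classifier; the latent and relaxed variants move this attack to activations or to broader specifications of "bad input"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))  # toy classifier
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))
eps = 0.1  # perturbation budget

# Inner step: nudge the input in the direction that most increases the loss.
x_adv = x.clone().requires_grad_(True)
grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
x_adv = (x + eps * grad.sign()).detach()

# Outer step: train the model on the perturbed examples.
opt.zero_grad()
F.cross_entropy(model(x_adv), y).backward()
opt.step()
```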
With neural networks and other black-box-by-default methods, you may want to occasionally use transparency tools to figure out if your model is doing anything it shouldn't be, or identify potential weaknesses before deployment. Additionally, once initially trained, you may wish to try and remove bad things the model learnt, though personally, if you're having to rely on this for alignment you've fucked up.
Embedding
We might not be fully confident in our main method to hit the target dead-on by itself. Thus, we might want to wrap our core system up with some additional components in order to do better. Most of these are fairly agnostic to the internals of our system, at least in the general case, though plausibly could be improved if these internals are structured a certain way (e.g. presence of a chain of thought).
One obvious example of this is boxing [? · GW], where we essentially stick the AI "in a box" and limit its ways of interacting with the world to something like a text interface. This is the case by default with predictor, simulator, or chatbot-like AIs, but we might also want to do this to more powerful agentic systems whilst we check they are safe. For sufficiently advanced AI, boxing might not work very well [LW · GW]. In fact, it might end quite badly [LW · GW], even if the AI is around human level and just trying really hard [LW · GW].
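In code, the core of boxing is almost embarrassingly simple, which is part of why it's tempting and part of why it's insufficient on its own (the `model_generate` callable is a hypothetical text-in, text-out interface):

```python
def boxed_query(model_generate, prompt: str, max_chars: int = 2000) -> str:
    """Confine interaction to a bounded plain-text channel: no tools, no
    network access, no side effects. `model_generate` is hypothetical.
    The hard part isn't this function; it's everything around it (the humans
    reading the output, the environment the code runs in, and so on)."""
    reply = model_generate(prompt[:max_chars])
    return str(reply)[:max_chars]
```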
A more promising direction is that of AI Control. AI Control aims to use weaker trusted models and human oversight in order to efficiently monitor and sanitise the outputs of stronger models that otherwise can't be trusted. There are some promising strategies such as untrusted monitoring, resampling, and even just reading the model's thoughts. However, it's unclear if these strategies will scale to be effective on superintelligence, thought reading might be unreliable, and talking about our abilities to do these things might be detrimental for subtle reasons.
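A schematic of the trusted-monitoring-with-resampling pattern (all four callables are hypothetical stand-ins, and real control protocols are considerably more careful than this):

```python
def controlled_generate(task, untrusted_model, trusted_monitor, human_audit,
                        threshold=0.8, max_resamples=3):
    """An untrusted model does the work, a weaker trusted model scores how
    suspicious each output looks, suspicious outputs get resampled, and
    persistent offenders are escalated to a (scarce) human auditor."""
    output = untrusted_model(task)
    for _ in range(max_resamples):
        suspicion = trusted_monitor(task, output)  # score in [0, 1]
        if suspicion < threshold:
            return output                          # looks fine, use it
        output = untrusted_model(task)             # resample and re-check
    return human_audit(task, output)               # escalate what's left
```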
Some related approaches include scalable oversight, the aforementioned AI safety via debate, and weak-to-strong generalisation. Scalable oversight seems useful for near-human systems, but I'm less confident it will work for superintelligence; debate seems useful for critical questions or as a generator of reliable training signals; weak-to-strong just doesn't seem useful at all, and I personally think people who are excited by it either don't know about or haven't properly internalised the data processing inequality.
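For reference, the inequality in question: if the strong model's output Z is produced purely by processing the weak supervisor's labels Y about some ground truth X, the three form a Markov chain and:

```latex
% Data processing inequality: post-processing cannot create information about
% X that Y did not already carry.
X \to Y \to Z \quad \Longrightarrow \quad I(X;Z) \le I(X;Y)
```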
More broadly, there might be some other simple strategies that we can do here like input/output filtering, or taking some lessons from closed-loop control.
Assurances
With target in sight, and arrow nocked, how can we be sure our aim is true given the price of failure? Assurances can be broken down into two groups: those of the product of the method, and those of the process of the method. We'll take a look at both. It's worth noting that different assurances may cover different things, and so we might want or need a few of them to fully cover things---especially if they are probabilistic in nature. Safety Cases offers a good scaffolding for doing this effectively.
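As a toy illustration of why combining assurances helps, and where the independence assumption does the work: suppose assurance A misses a given failure mode with probability 0.1 and assurance B with probability 0.2.

```latex
% Only if the misses are independent do we get the optimistic product:
P(\text{both miss}) = 0.1 \times 0.2 = 0.02
% Correlated blind spots (e.g. both assurances leaning on the same evaluation
% suite) can leave the combined miss probability near the weaker of the two.
```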
Overall I think we might really struggle to get solid assurances about very strong systems without stronger mathematics (such as agent foundations [AF · GW], or decision theory), but there are at least a bunch of things that might help us with weaker systems, or at least avoid the easily-avoidable failure modes.
Product Assurances
The main differentiator that makes product assurances easier or harder is degree of transparency, which can vary all the way from a plainly coded AI to a complete black-box.
With a plainly coded AI (or a model that has been sufficiently mech-interped to be equivalent to a plainly coded one), one should hopefully be able to construct proofs of program behaviour and verify it-does-the-thing-you-want. Whilst it might be possible for other models to help with this, my understanding is that proving properties of programs in this way is somewhat limited and difficult to do. There may be scope here as well to do some sort of logical-decision-theory-style provable cooperation [? · GW], but I think this is likely infeasible at best and impossible at worst.
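As a toy example of what program-level verification looks like at the very small scale (using the Z3 SMT solver; the "AI component" here is just a clamp function, which is of course nothing like verifying an actual advanced system):

```python
from z3 import And, If, Int, Not, Solver, unsat

# A "plainly coded" component that is supposed to keep an action in [0, 100].
x = Int("x")
clamped = If(x > 100, 100, If(x < 0, 0, x))

s = Solver()
# Assert the negation of the desired property; `unsat` means no counterexample
# exists, i.e. the property holds for every integer input.
s.add(Not(And(clamped >= 0, clamped <= 100)))
print(s.check() == unsat)  # True: the property is proved
```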
Stepping down the transparency ladder we get to white-box systems. We would not quite be able to write down the exact algorithm on paper, but we do have at least some sense of what some of the internals are doing. Maybe we use these to verify the model doesn't have any obvious detectable defects like a kill-everyone-neuron, but we could also use them to help prove probabilistic bounds on model performance.
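A minimal version of "some sense of what the internals are doing" is a linear probe on activations (synthetic data below; real probes are run on activations logged from the model, with labels for some concept of interest):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are hidden-layer activations from a model, labelled by whether
# some concept of interest was present in the corresponding input.
activations = rng.normal(size=(500, 128))
concept_direction = rng.normal(size=128)
labels = (activations @ concept_direction + rng.normal(size=500) > 0).astype(int)

# If a simple linear classifier can read the concept off the activations, the
# model plausibly represents it (somewhere, somehow) in that layer.
probe = LogisticRegression(max_iter=1000).fit(activations[:400], labels[:400])
print(probe.score(activations[400:], labels[400:]))  # held-out probe accuracy
```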
Finally we have black-box systems, where we either can't access the internals or just have no way of understanding them. We might still be able to do comprehensive enough evaluations that we're fairly sure things will be ok, but to be honest at this point things are looking a little dicey if we're solely relying on this.
Process Assurances
It may be possible to prove enough properties about the method itself to be confident in its output without needing to check it. Or, more realistically, known properties of the process might lower the bar on the assurances required of its outputs.
An obvious style of process assurance one might want to use when applying machine learning is that the global minimum of the loss function being used satisfies certain conditions or has certain properties. This would however have to be combined with a guarantee that this minimum would actually be reachable, or for that matter even in the hypothesis class of the model being trained. Alternatively, we might just need to be confident we stop at the right part of parameter space, minimum or not. In a similar vein, we might have some nice guarantees stemming from the earlier discussed inductive biases of our model and the other components of our training process.
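Schematically, the assurance being asked for is something like the following (with H the model's hypothesis class and S_safe whatever "safe" region of function space we can actually characterise; this is my own shorthand, not standard notation):

```latex
% The loss's global minimiser is safe, it is expressible by the model, and the
% training process actually reaches it (or stops somewhere still safe):
f^{*} = \arg\min_{f} L(f) \in S_{\text{safe}}, \qquad
f^{*} \in \mathcal{H}, \qquad
\lim_{t \to \infty} f_{\theta_t} = f^{*}
```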
Analogous to empirically driven product assurances, we might be able to do some set of evaluations on other models / systems, derive some scaling laws or well-validated general hypotheses about how-these-things-work, and then be confident (ish) our bigger model will be safe. This sounds pretty sketchy though. As an example, it could be that the difference between the reward on an optimised-for proxy and the true reward stays close to zero, as a function of optimisation power and true-reward information, for the feasible training trajectories we actually run.
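One schematic way to state the kind of guarantee that sentence is gesturing at (entirely notional; pi_C is the policy found with optimisation power C, and I is the amount of true-reward information available during training):

```latex
% We would want empirically validated scaling laws showing the proxy gap stays
% small along the training trajectories we actually run:
\bigl| R_{\text{true}}(\pi_C) - R_{\text{proxy}}(\pi_C) \bigr| \le f(C, I),
\quad \text{with } f(C, I) \approx 0 \text{ for feasible } (C, I)
```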
Finally, maybe one could do something cool with singular learning theory, but I'm not quite sure what this would look like.
Summary
We've decomposed AI alignment into three key components: Target, Method, and Assurances. No single target seems both sufficient and achievable alone, suggesting practical alignment will require combinations. Current methods rely heavily on neural networks with various training paradigms, while our assurances depend largely on system transparency and are overall lacking or speculative. Hopefully organizing alignment thinking through this lens is somewhat helpful, or at the very least this can serve as a guide to many existing ideas and approaches.