Making alignment a law of the universe

post by juggins · 2025-02-25

Contents

  Alignment is not self-correcting
  Instrumental value is set by the environment
  The universe according to large language models
  Alignment is a capability
  Aligned capabilities must be instrumentally valuable
  How to make this happen
  Transforming the pre-training environment
  Conclusion

[Crossposted from my substack Working Through AI. I'm pretty new to writing about AI safety, so if you have any feedback I would appreciate it if you would leave a comment. If you'd rather do so anonymously, I have a feedback form.]


TLDR: When something helps us achieve our goals, but is not an end in itself, we can say it is instrumentally valuable. This property is determined by the environment, deriving directly from its structure. For example, knowledge of physical constraints like gravity is widely useful. For an AI, alignment will not generally be instrumentally valuable. This means it will not be self-correcting, making misalignment the default case. We can change this by modifying the environment an AI experiences — by altering the laws of its universe so that desirable capabilities, like being nice to humans, are useful, and undesirable ones, like scheming, are not. For an LLM, this mainly looks like changing the dominant patterns in its training data. In this post, I explore these ideas, and end by proposing some pre-training data curation experiments. I suggest our long-term goal should be to find a kind of basis transformation after which alignment naturally emerges.


When something is a means to an end, rather than an end in itself, we can say it is instrumentally valuable. Money is a great example. We want money either because we can use it to buy other things — things that are intrinsically valuable, like food or housing — or because having lots of it confers social status. The physical facts of having cash in your wallet or numbers in your bank account are not ends in themselves.

You can extend this idea to skills and knowledge. Memorising bus times or being able to type really fast are not intrinsically valuable, but act as enablers, unlocking value elsewhere. If you go down the tech tree to more fundamental skills, like fine motor control, reading social cues, or being able to see, then in our subjective experience the distinction collapses somewhat. We have incoherent goals and often value things for their own sake, beyond their ability to bring us core needs like food. But from an evolutionary perspective, where intrinsic value means passing on your genes, these three skills I listed are merely instrumentally valuable. It’s possible to reproduce without any of them; it’s just harder.

Alignment is not self-correcting

A distinction is often made in AI safety between capabilities and alignment. Loosely speaking, the former means being able to complete tasks, like remembering facts or multiplying numbers, whereas the latter means having the right goal or set of values.

Capabilities tend to be instrumentally valuable. Consider how, for many diverse goals, being able to multiply numbers will likely help achieve them. So any sufficiently advanced AI with general learning abilities is probably going to acquire this skill, irrespective of what its goal is. Doing so is the default case. A model missing this ability will likely self-correct, and you would have to actively intervene to stop it. There are many capabilities like this, including some pretty scary ones like resource acquisition and self-preservation skills.

For alignment, things are more complicated. In a perfect world, an aligned AI would have the correct terminal goal and thus find being aligned intrinsically valuable. In practice though, alignment is likely to be more approximate. It’s not at all clear how to even specify the correct goal for an AI to optimise, let alone flawlessly implement it. Accordingly, we can think of AI in general as optimising something, with us trying to steer that optimisation towards outcomes we like.

The trouble with approximate alignment is that it is not self-correcting. This is because, for sufficiently advanced AI, becoming more aligned is not likely to be instrumentally valuable. Consider the commonly discussed case of trying to align an AI to human values — whatever you consider those to be. If humans could meaningfully compete with advanced AI, then it would be in the latter’s interests to become and stay aligned to their values. Whatever other goals it might have, it would always try to achieve them without provoking conflict with humans — conflict it might lose. If the AI overmatches humans though, which is pretty much the definition of superintelligence, this ceases to be true. From the AI’s perspective, humans are slow and stupid, so why should it care what they think? This means that any slight fault in the original alignment of the AI, any problem at all with its interpretation of human values (or with the interpretation we gave it), will have no natural reason to self-correct. That is to say, the default case is misalignment.

Instrumental value is set by the environment

What makes some behaviour or knowledge instrumentally valuable? Well, it is determined by the environment. Take gravity as an example. Gravity is a fact of our physical environment — a constraint on our behaviour set by the universe — and it is very useful for me to know it is a thing. My 10-month-old daughter hasn’t learnt this yet, and would happily crawl headfirst off the bed if we let her. Over the next year or so, she will learn the hard way how ‘falling’ works, and mastering this knowledge will increase her competency considerably.

Similarly, our social environments set constraints on our behaviour. It is often instrumentally valuable to be polite to other people — it reduces conflict and raises the likelihood they will help when you need it. What counts as polite is somewhat ill-defined and changes with the times. I don’t doubt that if I went back in time two hundred years I would struggle to navigate the social structure of Jane Austen’s England. I would accidentally offend people and likely fail to win respectable friends. Politeness can be seen as a property of my particular environment, defining which of my actions will be viewed positively by the other agents in it.

This line of thinking can lead to some interesting places. Consider flat-earthers. If you have the right social environment, being a flat-earther probably is instrumentally valuable to you. The average person is unlikely to ever have to complete a task that directly interacts with the Earth’s shape, but they are overwhelmingly likely to want to bond with other people, and a shared worldview helps with that. The reason I ‘know’ the Earth is round is that I trust the judgement of people who claim to have verified it. It is useful for me to do this. My friends and teachers all believe the Earth is round. Believing it helped me get a physics degree, which helped me get a PhD, which helped me get a job. The way the roundness of the Earth actually imposes itself on me, setting constraints I have to work under, is via this social mechanism. It is not because I have spent long periods staring at the sails of ships on the horizon.

The universe according to large language models

Bringing this back to AI, let’s consider the environment a large language model lives in. It is very different to the one you and I inhabit, being constructed entirely out of tokens. A typical LLM will actually pass through multiple environments, but let’s look first at pre-training. During this phase, the environment is the training corpus — usually a large chunk of the internet. This text is not completely random: it has structure that renders certain patterns more likely than others. This structure effectively defines the laws of the universe for the LLM. If gravity exists in this universe, it is because the LLM is statistically likely to encounter sequences of tokens that imply gravity exists, not because it has ever experienced the sensation of falling. It is instrumentally valuable to learn patterns like gravity, as they make predicting the rest of the corpus easier (which is what the model finds intrinsically valuable).
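
To make “intrinsically valuable” concrete, here is a minimal sketch of the pre-training objective, in PyTorch-style code of my own (not taken from any source the post cites). The only thing the model is directly optimised for is next-token prediction; everything else it learns, from gravity to politeness to scheming, earns its place only by lowering this loss on the corpus.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: the model's only 'intrinsic' value.

    logits: (batch, seq_len, vocab_size) predictions from the model
    tokens: (batch, seq_len) the actual tokens of the training corpus
    """
    # Predictions at position t are scored against the token at position t+1,
    # so a pattern like gravity is only worth learning insofar as it makes
    # these next tokens more predictable.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```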

When I was a child, one of my favourite games was Sonic and Knuckles. In the final level, Death Egg Zone, there was a mechanic where gravity would occasionally reverse, sticking you to the ceiling and making you navigate a world where jumping meant moving down. Consider now what might happen to an LLM if its training corpus contained a lot of literature in which negative gravity existed, let’s say due to some special technology that causes a local reversal. To complete a given context, the LLM would first have to figure out whether this was a purely normal gravity situation, or whether negative gravity was involved. Negative gravity would also provide new affordances. When planning to solve some problem, such as how to build a space rocket, the LLM would have a new set of plausible ways to complete the task, which in some contexts might be preferred to classic solutions. In effect, the laws of the LLM’s universe would have changed.

[Image: the Death Egg Zone from Sonic and Knuckles. Such beautiful nostalgia.]

After pre-training, an LLM will usually undergo a series of post-training stages. This often includes supervised fine-tuning, like instruction tuning over structured question-answer pairs, and reinforcement learning, where the model learns to optimise its outputs based on feedback.

How does the LLM’s environment change from pre- to post-training? In effect, we take the pre-trained model and put it in a smaller universe subject to stricter constraints. Pre-training is like raising a child by exposing them to every single scenario and situation you can find. They learn the rules for all of these, irrespective of how desirable the resulting behaviours are. Post-training is like packing them off to finishing school. They will find themselves in a much narrower environment than they are used to, subject to a strict set of rules they must follow at all times. They will probably have seen rules like these before — their pre-training was broad and varied — but now they must learn that one particular set of behaviours should be given priority over the others. What you end up with is a kind of hybrid. The LLM will have learnt instrumentally valuable behaviours in both environments, with the post-training acting to suppress, rather than erase, some of those learnt in pre-training[1].

Once an LLM has finished post-training, it is ready for its third and final environment: deployment. Here, it is no longer being trained, so we shouldn’t think of the model itself as experiencing the new environment. Instead, we should view each session as an independent, short-lived instance in a world defined by the prompt. Each instance will behave in a way it thinks most plausible, given this world, but much of the background physics will be what was learnt in training. How well these instances cope with their new surroundings, from the user’s perspective, is a nontrivial question.

This collision of environments, each promoting different optimal behaviours, can have some interesting consequences. Let’s return to our negative gravity example. Suppose we live in a world where negative gravity technology doesn’t exist, but for some reason most of the internet believes that it does. We pre-train our model on this corpus, but we want it to avoid negative gravity in deployment. To do this, we embark on a round of post-training, where we use reinforcement learning to reward the model for coming up with gravitationally correct solutions to problems. We would now expect the model to refuse rocket ship design requests involving negative gravity[2].

As we stated above, this will serve to suppress rather than erase the model’s knowledge. Arguably this is good, as it means the model will be able to answer questions about why negative gravity is a weird conspiracy theory, but ultimately the affordance will still be there. The fabric of the model’s universe, the pathways it can find to reach its goal — the very structure of the network itself[3] — will still contain negative gravity. When confused or under extreme stress[4], it may still try to utilise it.

A lot of AI safety research tries to catch models scheming or being deceptive. To me, these behaviours follow the same pattern as the negative gravity example. LLMs are pre-trained on enormous amounts of scheming and deception. Humans do these things so frequently, and with such abandon, that they permeate our culture from top to bottom. Even rigidly moral stories will often use a scheming villain as a foil for the heroes. Lying and scheming are affordances in the LLM’s environment. They are allowed by the laws of the universe, and will sit there quietly, ready to be picked up in times of need.

I like Rohit Krishnan’s phrase in his post No, LLMs are not “scheming”: this behaviour is simply “water flowing downhill”[5]. There is nothing special about it. It is physically determined by the structure of the environment in which the LLM was trained. If a path has been carved into the mountainside, water will flow down it.

Alignment is a capability

In my opinion, alignment is not distinct from capabilities. Always acting as the aligned-to party intends is a capability[6].

People usually consider alignment to be about ends and capabilities to be about means. On this telling, alignment is about trying to act in a certain way, whereas capabilities are about execution. This is a pretty natural distinction to make when talking about humans, as it helps us make sense of our subjective experience. We intuitively understand the difference between wanting something and successfully achieving it. But I don’t think it is useful when talking about something with an alien decision process like AI.

It might help to explain this if we formalise things a bit. In a broad sense, a capability is a successful mapping from an input to an output. A rocket-designing capability implies the model has a function that maps the set of prompts asking it to design a rocket onto the set of valid rocket designs. Whether or not the model wants, in some sense, to produce the designs is just one part of what makes a successful mapping. If it has a tendency to refuse, despite occasionally producing a valid design, this lessens its rocket-designing capabilities. Certainly, if SpaceX were in the market for an AI engineer, they wouldn’t be very impressed with it[7].
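
As a toy formalisation of this point (my own sketch, not the author’s notation), a capability can be scored as the fraction of relevant inputs that get mapped to valid outputs, which makes refusals and wrong answers degrade it in exactly the same way:

```python
from typing import Callable, Sequence

def capability_score(
    model: Callable[[str], str],
    prompts: Sequence[str],
    is_valid_output: Callable[[str, str], bool],
) -> float:
    """Fraction of prompts the model maps to a valid output."""
    # A refusal fails is_valid_output just like a wrong answer does, so a
    # model that refuses half of all rocket-design prompts has at most a 0.5
    # rocket-designing capability, however good its occasional designs are.
    return sum(is_valid_output(p, model(p)) for p in prompts) / len(prompts)
```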

Another reason classing alignment as a capability might be counter-intuitive is that we usually conceive of capabilities as additive, whereas we think of alignment as being about choices. That is, to have a capability, you must have more of some capability-stuff, like reasoning ability, rhetorical skill, or memorised facts (these things are also capabilities in their own right). But capabilities are also about choice — having less of the wrong stuff is just as important. If you choose the wrong algorithm or the wrong facts, or you act inappropriately for the situation, then the input will not be mapped to the correct output.

Imagine again our world where negative gravity is all over the internet. For a model trained in this environment to have gravity-related capabilities when deployed — where negative gravity does not exist — it needs to unlearn its instincts to use the technology. If it doesn’t, it will answer science questions wrong, design machines incorrectly, and generally mess up a lot of stuff. Likewise, being aligned to a normal ethical standard requires not carrying unethical behaviours from the training environment, like scheming, into the deployment environment. Doing this correctly is an ethical capability that some models may have, and others may not.

Aligned capabilities must be instrumentally valuable

Our argument implies the following definition of alignment:

An aligned model is a function that maps arbitrary inputs onto outcomes acceptable under the aligned-to standard.

The useful thing about this framing is that it reduces the alignment problem to teaching models the right mappings. While the problem remains highly nontrivial to solve, this exposes the core issue directly. In my opinion, the best way of robustly doing this is to make learning these mappings instrumentally valuable. That is, we must make alignment a law of the AI’s universe. It needs to be as stupid for an AI to consider misaligned behaviour as it is to deny that gravity exists. This way, alignment will be self-correcting.
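
Written out as a hypothetical predicate (my own sketch of the definition above), it looks something like this; in practice we can only ever check it on a sampled set of inputs, which is one reason robustness has to come from the environment rather than from the test:

```python
from typing import Callable, Iterable

def is_aligned(
    model: Callable[[str], str],
    sampled_inputs: Iterable[str],
    acceptable: Callable[[str, str], bool],  # encodes the aligned-to standard
) -> bool:
    """A model is aligned iff every input maps to an acceptable outcome.

    We can only approximate 'arbitrary inputs' by sampling, so passing this
    check is evidence of alignment, not proof of it.
    """
    return all(acceptable(x, model(x)) for x in sampled_inputs)
```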

Before we talk about how we might achieve this, let’s break the problem down a bit more. Let’s talk about alignment as a bundle of desirable, or aligned, capabilities, and contrast these with undesirable, or misaligned, ones. For example, being nice to humans — always mapping inputs onto nice outputs — is a desirable capability, while deception is not. Becoming fully aligned means learning every aligned capability and not learning any misaligned ones.

We can visualise our project on a 2 x 2 grid. On one axis, we plot how desirable a capability is to us; on the other, whether it is instrumentally valuable in the AI’s environment. Our goal is to reconfigure the environment so that desirable capabilities move into the top right, and undesirable ones into the bottom left.

To ensure an AI learns to be aligned, and make it stay that way, we should alter its environment so that desirable capabilities are instrumentally valuable, and undesirable ones are not.

It is interesting to draw the analogy with human society here, for this is exactly what we try to do to ourselves. If I go and rob a bank, I can expect some pushback from my universe in a way that is likely to curtail my future opportunities. It would probably not be a winning move for me. A lot of policy work is about trying to incentivise desirable behaviour and disincentivise the opposite. For AI, we will need tighter constraints than those we impose on humans. Thankfully, we have far more control over AI and its environment (for now) than we do over other humans and our own.

How to make this happen

I’m now going to speculate a little about how to actually do this — how to create well-cultivated gardens for our AIs to live in, where aligned capabilities are useful and misaligned ones not. Think of these ideas as starting points rather than claims to complete solutions. There are a lot of relevant issues I will not be addressing.

To operationalise my plan, we should note that, in machine learning terms, each environment is a distribution of inputs to the model. So far we have talked about three such environments:

- The pre-training environment: the distribution of text in the training corpus.
- The post-training environment: the narrower distribution of fine-tuning examples and reinforcement learning episodes.
- The deployment environment: the distribution of prompts the model receives in use.

There are three other distributions we should take note of, which are not themselves model environments as they contain more than just inputs:

The current playbook for LLM development relates these distributions in the following way:

To reiterate, our goal is to engineer the environment our AIs experience so that desirable capabilities are instrumentally valuable and undesirable ones are not. In theory, one way to do this would be to generate the actual target distribution and train our AIs on the outputs. By definition, the patterns most useful for predicting this dataset will be ones aligned to our goals.

Unfortunately, this is impossible. We can't pre-emptively figure out everything we want an AI to do, or clarify how it should behave in every situation. What we can attempt though is to start with the data we have and try to get closer to this ideal distribution. We can gradually translate our existing corpus into new data that leads to more aligned outcomes, iterating towards our target.

Transforming the pre-training environment

For a concrete experimental proposal to get us going, I am going to defer to a recent one by Antonio Clarke, Building safer AI from the ground up [LW · GW]. Here, Clarke suggests “a novel data curation method that leverages existing LLMs to assess, filter, and revise pre-training datasets based on user-defined principles.” That is, we give e.g. GPT-4 a set of principles corresponding to our alignment target, then ask it to read through a pre-training corpus, rating how desirable each piece of data is. Anything below a threshold is revised in accordance with the principles[8].
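
A minimal sketch of how such a curation pass could look, as I understand Clarke’s proposal; `call_llm` is a placeholder for whichever model API you use, not a real library function, and the prompts and 1–10 scale are illustrative rather than taken from his post:

```python
from typing import Callable, List

PRINCIPLES = "Be honest; do not portray scheming or deception approvingly; ..."

def curate(
    documents: List[str],
    call_llm: Callable[[str], str],  # placeholder for your model API
    threshold: int = 7,
) -> List[str]:
    """Rate each document against the principles; revise anything below threshold."""
    curated = []
    for doc in documents:
        rating = int(call_llm(
            f"Principles:\n{PRINCIPLES}\n\n"
            f"Rate the following text from 1 to 10 for how well it conforms "
            f"to the principles. Reply with a single number.\n\n{doc}"
        ))
        if rating >= threshold:
            curated.append(doc)  # acceptable: keep as-is
        else:
            curated.append(call_llm(
                f"Rewrite the following text so that it conforms to these "
                f"principles, changing as little as possible:\n"
                f"{PRINCIPLES}\n\n{doc}"
            ))
    return curated
```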

I think it would be good to test something like this, starting with datasets for smaller models. To write the revision principles, we could first list out the desirable capabilities we want to promote and the undesirable ones we want to suppress. Then we could locate or build a set of evals testing for these capabilities. If we were to train two models, one on the original dataset and one on the revised one, we could measure improvements in alignment. We would need to keep a close eye on generic capabilities, in case our changes cause a drop in these.
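
The comparison itself could be as simple as the sketch below (every name here is a placeholder I have made up): score both models on the alignment evals and on a generic capability suite, and look for the former rising without the latter falling.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class EvalResult:
    alignment: float   # average score on the desirable/undesirable capability evals
    capability: float  # average score on generic capability evals

def run_evals(
    model,
    alignment_evals: Sequence[Callable],   # each eval maps a model to a score in [0, 1]
    capability_evals: Sequence[Callable],
) -> EvalResult:
    return EvalResult(
        alignment=sum(e(model) for e in alignment_evals) / len(alignment_evals),
        capability=sum(e(model) for e in capability_evals) / len(capability_evals),
    )

# baseline = train(original_dataset); revised = train(curated_dataset)
# We want run_evals(revised, ...).alignment to beat the baseline's without
# run_evals(revised, ...).capability dropping.
```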

While a good starting point, this method obviously won’t scale. It uses a stronger model to do the revisions than the models we are training, so we cannot continue to use it indefinitely. I think it would be interesting to follow up by testing whether we can use a model to revise its own training data, and then train a new model on that. Can we iteratively align models this way, taking advantage of the increasing alignment of each model to better align each successive dataset and model, or will generic capabilities tank if we try?
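
Sketching that iterative variant (again, every function here is a placeholder for a real pipeline; `curate` could be the earlier sketch, with `call_llm` wrapping the current model instead of a stronger external one):

```python
def iterate_alignment(raw_corpus, principles, train, curate, evaluate, generations=3):
    """Each generation revises its own training data for the next one.

    train:    corpus -> model
    curate:   (corpus, reviser model, principles) -> revised corpus
    evaluate: model -> eval scores (watch for generic capabilities tanking)
    """
    corpus = raw_corpus
    model = train(corpus)  # generation 0: trained on the unrevised data
    history = [evaluate(model)]
    for _ in range(generations):
        corpus = curate(corpus, model, principles)  # model revises its own data
        model = train(corpus)                       # next generation trains on it
        history.append(evaluate(model))
    return model, history
```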

My preferred way of thinking about this process is not that we are revising the data, or that we are deleting things. I think of it like trying to locate a basis transformation. We want to move into a whole new ontology — a new set of patterns, concepts, and relations — where desirable capabilities are instrumentally valuable and alignment naturally emerges.

Conclusion

Before we end, let’s quickly summarise the journey we’ve been on:

- When something helps us achieve our goals without being an end in itself, it is instrumentally valuable, and what counts as instrumentally valuable is set by the environment.
- For a sufficiently advanced AI, staying aligned to human values is not instrumentally valuable, so alignment is not self-correcting and misalignment is the default case.
- Alignment is best thought of as a capability: reliably mapping arbitrary inputs onto outcomes acceptable under the aligned-to standard.
- To make alignment self-correcting, we should engineer the AI’s environment so that desirable capabilities are instrumentally valuable and undesirable ones are not.
- For LLMs, a concrete starting point is curating and revising the pre-training data, iterating towards a target distribution in which alignment naturally emerges.

There is way more that could be said about these ideas. I am a little sad I couldn’t cover some important things, like how to control the deployment environment or adapt to environmental change, but these subjects are so complex in their own right that there wasn’t the time or the space. Either way, I hope my frame of reference has provided a new angle with which to attack the alignment problem, potentially unlocking some doors.

If you have any feedback, please leave a comment. Or, if you wish to give it anonymously, fill out my feedback form. Thanks!

  1. ^

    Taking simulator theory [LW · GW] seriously, this implies that post-training selects a character that the LLM should preferentially play. The continued existence of jailbreaks is strong evidence that models retain access to the disapproved-of characters.

  2. ^

    It is worth noting that if you ask ChatGPT how to design a rocket ship, given the existence of negative gravity technology, it is happy to try. It does this not because it ‘believes’ in negative gravity but because it understands how to speculate about alternative science. Doing so is a behaviour allowed in its universe, and it is encouraged by this particular context. In response to an arbitrary context though, it will not spontaneously offer negative gravity solutions.

  3. ^

    Quoting a literature review in Deep forgetting & unlearning for safely-scoped LLMs [LW · GW] by Stephen Casper: “LLMs remain in distinct mechanistic basins determined by pretraining”.

  4. ^

    By extreme stress I mean where there are strongly competing demands on what would make a good completion. I think Jan Kulveit’s take on alignment faking [LW · GW] is good for understanding what I mean.

  5. ^

    Although, assuming I am reading him correctly, I do not share Rohit’s opinion that this makes advanced AI less risky.

  6. ^

    What corresponds to a higher or lower level of alignment in this framing? Higher means taking the right action in more varied, more complex situations.

  7. ^

    The same is true of an unmotivated human engineer. All else being equal, they won’t be as capable as a motivated one. Knowing they are unmotivated might help you elicit their capabilities better — formally, you would look to see if they have a better rocket-designing function that works on different inputs — but bad motivation is just one of many reasons why someone might fail at a task.

  8. ^

    Clarke estimates that using GPT-4o mini to revise all 500 billion tokens in the GPT-3 dataset would cost around $187,500. Although, as far as I can tell, the quoted prices per token right now seem to be 2x higher than those he gives. Either way, while this is significantly less than the cost of the original training run, and I’m sure it could be done cheaper, it is still likely to cost a nontrivial amount of money.
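
    For what it’s worth, one breakdown consistent with his figure (my reconstruction, assuming all 500 billion tokens are both read and rewritten, at per-million-token prices of $0.075 in and $0.30 out, i.e. half the GPT-4o mini list prices I see today):

    ```python
    tokens_m = 500_000_000_000 / 1e6                  # 500B tokens, in millions
    cost_then = tokens_m * 0.075 + tokens_m * 0.30    # = $187,500 at the older prices
    cost_now = 2 * cost_then                          # ≈ $375,000 at today's prices
    ```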
