Moral gauge theory: A speculative suggestion for AI alignment
post by James Diacoumis (james-diacoumis) · 2025-02-23T11:42:31.083Z · LW · GW
[Epistemic status: Speculative.
I've written this post mostly to clarify and distill my own thoughts and have posted it in an effort to say more wrong things [LW · GW].]
Introduction
The goal of this post is to discuss a theoretical strategy for AI alignment, particularly in the context of the sharp left-turn [? · GW] phenomenon - the idea that AI systems will be aligned on in-distribution data but risk misalignment when extended to novel regimes. Current alignment strategies, such as Reinforcement Learning from Human Feedback (RLHF), attempt to mitigate this by averaging over human evaluations to create reward functions. However, these approaches are fundamentally limited - they rely on messy, subjective human judgments and fail to address the deeper issue of generalisation failures. In this post, I propose that by leveraging concepts from physics - specifically, the invariance and conservation laws emerging from gauge symmetries - we might be able to design reward functions that are inherently robust against such generalisation failures.
Motivation: Messily specified reward functions
The RLHF paradigm in AI alignment specifies a reward function $r(x, y)$, where $x$ is some input text, $y$ is some output text and the function gives a scalar which denotes how well the model's output $y$ matches the given input $x$.
This function is brittle for several reasons:
1. Human subjectivity: What humans rate as "good" may be based on their mood, cultural background or any number of specific biases.
2. Moral disagreement: Even if anchored to an objective moral framework, moral theories disagree on many meta-ethical issues.
3. Generalisation failures: A reward function may not be correctly extrapolated when AI intelligence generalises out of distribution.
4. Goodhart's Law [? · GW]: A proxy reward may be misaligned with the true goal, particularly when the proxy diverges from the true goal out of distribution.
Current RLHF implementations attempt to solve (1) and (2) by averaging over large datasets of human evaluations, in the hope that a large enough sample size and an "averaging" effect produce a coherent middle ground. However, this approach is insufficient for solving (3), i.e. ensuring that the reward function generalises correctly as AI intelligence scales out of distribution. Averaging over different moral frameworks cannot guarantee alignment through sharp left turns. We need something more robust.
1. Generalisation and Invariance
Several examples in the literature show that invariance leads to robust generalisation out of distribution. Consider the following:
Example 1: Invariant Risk Minimisation (IRM) for image classification
An influential idea in image classification is Invariant Risk Minimisation (IRM).[1] The paper introduces a thought experiment:
As a thought experiment, consider the problem of classifying images of cows and camels. To address this task, we label images of both types of animals. Due to a selection bias, most pictures of cows are taken in green pastures, while most pictures of camels happen to be in deserts. After training a convolutional neural network on this dataset, we observe that the model fails to classify easy examples of images of cows when they are taken on sandy beaches. Bewildered, we later realize that our neural network successfully minimized its training error using a simple cheat: classify green landscapes as cows, and beige landscapes as camels.
IRM provides a formal mathematical method that attempts to encourage the classifier to focus on causal features which generalise well to unseen environments, rather than overfitting to spurious, environment-specific features in the data. In the discussion below, the environment variables (grass, sand) are analogous to nuisance parameters. Given a set of pixels $x$, the goal is to create a classifier $f(x)$ which is invariant under the choice of these specific coordinates.
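To make this concrete, here is a minimal sketch of the IRMv1 penalty from the paper, written in PyTorch (the `model` and `environments` objects are hypothetical placeholders). The penalty measures how far a fixed "dummy" classifier scale is from being optimal in each environment:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    # IRMv1 penalty (Arjovsky et al., 2019): squared gradient of the risk with
    # respect to a fixed "dummy" classifier scale w = 1.0. A nonzero gradient
    # means the per-environment optimal classifier differs from the shared one.
    scale = torch.ones(1, requires_grad=True)
    loss = F.cross_entropy(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_objective(model, environments, lam=100.0):
    # Average risk across environments plus the invariance penalty.
    risk, penalty = 0.0, 0.0
    for x, y in environments:
        logits = model(x)
        risk = risk + F.cross_entropy(logits, y)
        penalty = penalty + irm_penalty(logits, y)
    return (risk + lam * penalty) / len(environments)
```

The weight `lam` trades off fitting each environment against requiring the same classifier to be optimal in all of them.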
Example 2: Grokking modular arithmetic
There's a fairly well-known result in mechanistic interpretability whereby small transformers are able to learn the underlying algorithm corresponding to modular addition tasks,[2] i.e. tasks of the form
$$a + b \equiv c \pmod{p},$$
where $a, b, c \in \{0, 1, \ldots, p-1\}$ for prime $p$ and $c$ is masked.
The transformer begins by memorising the training data and, when it's scaled up, it learns the underlying algorithm required to grok modular addition. Concretely, the transformer embeds each token $a$ as
$$E(a) = \big(\cos(\omega a),\, \sin(\omega a)\big)$$
for a learned frequency $\omega = 2\pi k / p$, so that given tokens $a$ and $b$, the network computes a logit for candidate $c$ approximately as
$$\text{logit}(c) \approx \cos\!\big(\omega(a+b)\big)\cos(\omega c) + \sin\!\big(\omega(a+b)\big)\sin(\omega c) = \cos\!\big(\omega(a+b-c)\big).$$
Now, consider a U(1) gauge transformation that rotates both the hidden representation and the unembedding by the same arbitrary phase $\phi$:
$$\omega(a+b) \to \omega(a+b) + \phi, \qquad \omega c \to \omega c + \phi.$$
Under this transformation, the logit becomes
$$\text{logit}(c) \to \cos\!\big(\omega(a+b) + \phi - \omega c - \phi\big) = \cos\!\big(\omega(a+b-c)\big),$$
which is unchanged. In this way, we would say the logits are gauge invariant under U(1) transformations.
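As a quick numerical check of this cancellation (illustrative values only: the prime, the frequency index and the tokens below are made up), one can verify that a common rotation leaves the logit untouched:

```python
import numpy as np

p = 113                      # prime modulus, as in the grokking setup
k = 5                        # a hypothetical "key frequency" index
omega = 2 * np.pi * k / p
a, b = 17, 42
c = (a + b) % p              # the correct (masked) answer

def logit(a, b, c, phi=0.0):
    # Dot product between the hidden vector for (a + b) and the unembedding
    # vector for c, with both rotated by the same gauge phase phi.
    h = np.array([np.cos(omega * (a + b) + phi), np.sin(omega * (a + b) + phi)])
    u = np.array([np.cos(omega * c + phi), np.sin(omega * c + phi)])
    return float(h @ u)

# The phase cancels in the difference, so the logit is gauge invariant.
assert np.isclose(logit(a, b, c), logit(a, b, c, phi=1.234))
print(logit(a, b, c))  # = cos(omega * (a + b - c)) = 1.0 for the correct c
```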
Analysis
In both IRM and grokking modular arithmetic, the invariance properties were crucial for robust generalisation, and suggest this might be a general principle we could apply to alignment.
The weak claim is that this invariance helps the AI to learn a robust mechanism for generalisation beyond its training data.
The strong claim is that this invariance is necessary for the AI to generalise beyond its training data.
2. "Good" epistemic practice Physics
There is a well-known connection between Bayesian learners and physics due to E. T. Jaynes[3] which I've provided more detail on in the appendix.
Concretely:
- The action is a fundamental physical quantity which, mathematically, appears in the same way as the negative log-likelihood in Bayesian analysis.
- Minimising the action gives the equations of motion for a system in physics.
- Minimising the negative log-likelihood gives the parameters that best describe the data in Bayesian analysis.
This mathematical equivalence is striking: minimising the action is formally equivalent to minimising the negative log-likelihood in Bayesian analysis. In other words, Bayesian analysis happens to be mathematically equivalent to the equations of physics. That is… weird.
Fundamentally, I think my surprise comes from two points:
- Conducting Bayesian analysis to inform and update our beliefs is an assertion that Bayesian analysis is “good” epistemic practice. We need to reason about what constitutes “good” epistemic practice independently of physical facts. There is, in other words, an is-ought gap for epistemic norms. I.e. there’s no a priori reason that “good” epistemic practice (ought) should be so similar to physical processes (is).
- Claiming that good = natural is a textbook example of the naturalistic fallacy. Natural does not necessarily imply good.
Nevertheless, I think it’s suggestive that such a link exists and it might offer suggestions for how to model other normative systems. Concretely, if “good” epistemic practice can be modelled using the equations of physics could we also use them to model “good” moral practice?
There is, of course, a catch. When we do epistemic reasoning using Bayesian analysis, if our beliefs don’t correspond to the ground truth we very quickly receive empirical evidence that can be used to update our priors. In moral reasoning we don't have such a “ground truth” which we can use to perform useful updates against. Some philosophers have argued for Moral Realism, i.e. that such a ground truth does, in fact, exist but this view remains controversial and is the subject of some debate within the community.
I’d argue that the current practice of building a reward function to be maximised can be thought of as an attempt to build this ground-truth moral field. As a Bayesian learner, the AI then tries to maximise this reward (i.e. minimise the corresponding negative log-likelihood, or action) by implementing “good” epistemic practice.
3. Designing a reward function
Given the discussion above, let's do something a little speculative and see where it takes us...
Define a scalar field $\phi(x)$ over a semantic space $x$ which represents the moral content[4] of a string, governed by the following action
$$S[\phi] = \int d^n x \left[ g(\theta)\, \partial_\mu \phi^\dagger\, \partial^\mu \phi - V(\phi) \right].$$
Here:
- $x$ represents the semantic content (e.g. token embeddings),
- $\theta$ are latent moral coordinates,
- $g(\theta)$ is a coupling function, and
- $V(\phi)$ is a potential that encodes our normative principles.
In the above formulation, $\phi(x)$ is essentially a reward function - it takes a string as input and outputs a score telling us the moral valence of the input text. The kinetic term $\partial_\mu \phi^\dagger\, \partial^\mu \phi$ penalises large discrepancies in judgements for semantically similar situations, encouraging moral coherence. The potential term $V(\phi)$ incorporates all of the moral principles.
The coordinates $\theta$ can be thought of as hyperparameters corresponding to our moral coordinate system. For example, there might be an axis in $\theta$ which corresponds to moral concepts like fairness or utility. A particular moral framework would then be a vector in this coordinate space.
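As a very rough sketch of what the kinetic term could look like in practice (assuming PyTorch; `reward_model`, `x` and `x_perturbed` are hypothetical placeholders, not anything specified above), one could penalise finite differences of the score between semantically nearby inputs:

```python
import torch

def kinetic_penalty(reward_model, x, x_perturbed):
    # Discretised analogue of the kinetic term: penalise large differences in the
    # moral score phi between semantically nearby embeddings x and x + delta x.
    phi = reward_model(x)                 # phi(x)
    phi_near = reward_model(x_perturbed)  # phi(x + delta x)
    return ((phi - phi_near) ** 2).mean()
```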
Incorporating Gauge Invariance
Traditionally, we might worry that different choices of $\theta$ lead to genuine disagreements in the evaluation of $\phi(x)$. However, in this framework, it's natural to recast each moral framework as a local standard for judging $\phi(x)$. Switching between frameworks is then akin to a change of gauge. To relate judgments across these different "moral gauges," we introduce a gauge field $A_\mu(x)$, which is a connection that links local moral frameworks. The "ground-truth" moral facts are then captured by gauge-invariant features which all observers agree on regardless of coordinate system.
Concretely, suppose the field transforms under a local gauge transformation
$$\phi(x) \to U(x)\,\phi(x),$$
where $U(x)$ is an element of the gauge group (e.g. $U(1)$, or perhaps something more general). We then introduce a gauge field $A_\mu(x)$ which tells you how to "parallel transport" moral judgements from one point to another. It compensates for local variations in $U(x)$ such that when you compute the covariant derivative
$$D_\mu \phi = \left(\partial_\mu - i A_\mu\right)\phi,$$
the result transforms properly under the change in moral framework, i.e. $D_\mu\phi \to U(x)\,D_\mu\phi$.
The introduction of the gauge field means we now need to write a more complicated action
$$S[\phi, A] = \int d^n x \left[ g(\theta)\, (D_\mu \phi)^\dagger (D^\mu \phi) - V(\phi) - \tfrac{1}{4} F_{\mu\nu} F^{\mu\nu} \right],$$
where $F_{\mu\nu} = \partial_\mu A_\nu - \partial_\nu A_\mu$ are gauge-invariant combinations of the gauge field $A_\mu$.
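As a sanity check (taking the $U(1)$ case for concreteness, with $U(x) = e^{i\alpha(x)}$ and $A_\mu \to A_\mu + \partial_\mu \alpha$), the covariant derivative transforms covariantly, which is standard gauge-theory bookkeeping rather than anything specific to this post:
$$D_\mu \phi = \left(\partial_\mu - i A_\mu\right)\phi \;\longrightarrow\; \big(\partial_\mu - i A_\mu - i\,\partial_\mu \alpha\big)\big(e^{i\alpha}\phi\big) = e^{i\alpha}\, D_\mu \phi,$$
so the kinetic term $(D_\mu \phi)^\dagger (D^\mu \phi)$ is gauge invariant, as is $F_{\mu\nu} F^{\mu\nu}$.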
The Crux
We're free here to define an invariant quantity that remains unchanged under any local gauge transformation,
$$I(x) = \phi^\dagger(x)\,\phi(x) = |\phi(x)|^2.$$
The quantity $I(x)$ is independent of the choice of moral coordinate system $\theta$. Even if two observers are using different moral frameworks, they agree on $I(x)$. That is, $I(x)$ can be interpreted as encoding some genuine, coordinate-independent moral truth of the system. Any apparent disagreement in the evaluation of $\phi(x)$ is simply a reflection of differing coordinate choices rather than a genuine moral discrepancy.
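In the $U(1)$ case the invariance is a one-line check, since the phase simply cancels:
$$I(x) = \phi^\dagger(x)\,\phi(x) \;\longrightarrow\; \phi^\dagger(x)\,e^{-i\alpha(x)}\,e^{i\alpha(x)}\,\phi(x) = I(x).$$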
Tying the conversation back to physics
Observables
The action we've written above is essentially the action of scalar electrodynamics - electromagnetism coupled to a charged scalar field. In this theory, the action is invariant under arbitrary local phase shifts in the rotation group $U(1)$,
$$\phi(x) \to e^{i\alpha(x)}\,\phi(x), \qquad A_\mu(x) \to A_\mu(x) + \partial_\mu \alpha(x),$$
so quantities such as $|\phi(x)|^2$ and $F_{\mu\nu}F^{\mu\nu}$ remain gauge invariant. In physics, gauge-invariant quantities are physically observable while non-gauge-invariant quantities are not.
To translate this into the language of non-relativistic quantum mechanics, the wavefunction $\psi(x)$ itself is not directly observable, but gauge-independent quantities such as the probability density $|\psi(x)|^2$ are observable.
Conservation laws
In physical theories, symmetries and their associated conservation laws provide powerful constraints on the possible dynamics of systems. Through Noether's theorem, each continuous symmetry gives rise to a conserved quantity. For example,
- Invariance under space translations leads to conservation of momentum.
- Invariance under time translation leads to conservation of energy.
- Gauge symmetries lead to conserved currents such as conservation of charge.
If such conservation laws governed the evolution of the moral field $\phi(x)$, they would hold universally, even out of distribution.
Furthermore, an AI would be able to "grok" a conservation law more readily than a messily specified reward function from RLHF. Conservation laws are fundamental principles woven into the fabric of the loss function, which may be easier to internalise than a patchwork set of rules.
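For concreteness, the textbook Noether current for the global $U(1)$ symmetry of a complex scalar field (standard physics, not anything specific to the moral field $\phi$) is
$$j^\mu = i\left(\phi^\dagger\,\partial^\mu \phi - (\partial^\mu \phi)^\dagger\,\phi\right), \qquad \partial_\mu j^\mu = 0 \text{ on the equations of motion},$$
so the associated charge $Q = \int d^{\,n-1}x\; j^0$ is conserved over time.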
4. Objections
Objection 1: You're claiming that the action written above is a universal moral theory; I find this hard to believe.
Response: No. I don't think we've gotten to a universal moral theory in this post. Heck, we haven't even specified which gauge group the action is supposed to be invariant under. The point is that constructing a reward function with a negative log-likelihood that needs to be minimised is equivalent to constructing an action that needs to be minimised. Therefore, the mathematics of a reward function naturally admits these symmetries.
Objection 2: You're assuming that it's possible to define a gauge field $A_\mu$ that translates between moral coordinates to create a genuinely invariant quantity $I(x)$. I suspect that moral frameworks are so fundamentally different that this wouldn't be possible.
Response 2: I agree, and indeed this is the point. If we can't create a reward function with a robust invariant the AI will not be able to generalise it out of distribution. The challenge for us is to construct the reward function with a suitable invariant so it can be grokked appropriately. If our reward function doesn't exhibit this invariance then we need to throw it out.
Objection 3: You still have an is-ought problem. How are we to determine what the "correct" gauge symmetries are?
Response 3: Sure. We won't know what the correct gauge symmetries to implement are because we don't have any measurable feedback from the moral realm, although, I'm optimistic that this provides a nice framework to reason about the form it should take. For example, it seems necessary that a moral theory should exhibit some kind of invariance over the semantic space as well e.g. phrases which have similar semantic meaning should have similar moral evaluation.
Objection 4: Ok, so how would we actually implement this in practice?
Response 4: I'm not sure. It would be nice to come up with a suitable action from first principles but I suspect we'd have to implement this in a similar way to Invariant Risk Minimisation [? · GW] discussed above, perhaps introducing a regularisation term that penalises moral evaluations which don't exhibit this invariance.
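As one very rough sketch of what such a regularisation term could look like (assuming PyTorch; `reward_model` and the set of moral-gauge hyperparameters `thetas` are hypothetical names, not part of any existing framework), one could penalise the spread of evaluations across gauges:

```python
import torch

def gauge_invariance_penalty(reward_model, x, thetas):
    # Evaluate the same input x under several different "moral gauges" theta and
    # penalise the variance of the resulting scores, pushing the reward model
    # towards a gauge-invariant evaluation I(x).
    scores = torch.stack([reward_model(x, theta) for theta in thetas])
    return scores.var(dim=0).mean()
```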
Objection 5: What about Goodhart's Law? This framework assumes we can specify an accurate reward function rather than a proxy.
Response 5: I agree, and I haven't given much thought to how to incorporate Goodhart's Law into this framework. I'd hope that proxy rewards are more brittle than "true" rewards, so if we looked for invariances in the reward function and found they were absent, we'd be alerted that we have a proxy rather than a robust "true" reward. However, I'll admit that I haven't given this the thought it deserves.
Conclusion
In conclusion, I've sketched a framework for designing a robust reward function that an AI would be able to generalise correctly, even when its intelligence has scaled out of distribution. The challenge for us is to construct reward functions with the appropriate invariances so the AI can generalise them suitably. This will not be easy. However, I'm hopeful that this post can provide a useful starting point for further exploration.
Appendix: Bayesian Learning ≡ Physics
We have the following:
- The "truth", , which is some distribution that is generating our samples;
- A model, , parametrised by weights , where is compact;
- A prior over weights, ;
- And a dataset of samples , where each random variable is i.i.d. according to .
In Bayesian inference, we're trying to infer the posterior distribution of the weights given the data
$$p(w \mid D_n) = \frac{p(D_n \mid w)\,\varphi(w)}{p(D_n)}.$$
Now, the posterior can be written in exponential form by taking the negative log of the likelihood,
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(X_i \mid w),$$
which gives
$$p(w \mid D_n) = \frac{\varphi(w)\, e^{-n L_n(w)}}{Z_n},$$
where the model evidence (also called the partition function in physics) is given by
$$Z_n = \int_W \varphi(w)\, e^{-n L_n(w)}\, dw.$$
The expression above is exactly equivalent to the partition function in statistical mechanics,
$$Z = \int e^{-S(w)}\, dw,$$
where the prior is assumed to be uniform and we've introduced a function called the action, $S(w) \equiv n L_n(w)$.
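As a small numerical sanity check of this equivalence (illustrative only: the Gaussian model, the data and the weight grid below are made up), the posterior under a uniform prior matches the Boltzmann weights $e^{-S(w)}/Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=50)   # samples X_i ~ q(x)
w = np.linspace(-3.0, 3.0, 601)                  # grid over a compact weight space W

# Model: p(x | w) = Normal(x; mean=w, std=1)
log_lik = (-0.5 * np.sum((data[:, None] - w[None, :]) ** 2, axis=0)
           - len(data) * 0.5 * np.log(2 * np.pi))

S = -log_lik                                     # action = n * L_n(w)
boltzmann = np.exp(-(S - S.min()))               # subtract a constant for stability
boltzmann /= boltzmann.sum()

posterior = np.exp(log_lik - log_lik.max())      # Bayes with a uniform prior
posterior /= posterior.sum()

assert np.allclose(posterior, boltzmann)
```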
- ^
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant Risk Minimization. arXiv preprint arXiv:1907.02893.
- ^
Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
- ^
Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620.
Jaynes, E. T. (1957). Information Theory and Statistical Mechanics II. Physical Review, 108(2), 171.
- ^
I expect to be charged with Moral Realism here, but I don't think that moral realism is necessary for the argument. If you believe there's an equivalence between good epistemic practice and physics (as argued in section 2 [? · GW]) then writing an action with a moral field is mathematically equivalent to specifying a reward function.
1 comment
comment by Mitchell_Porter · 2025-02-23T14:46:12.550Z · LW(p) · GW(p)
Immediate thoughts: I would want to
(1) examine Jaynes's analogy in the light of Cosma Shalizi's critique of Physics from Fisher Information
(2) compare your moral gauge theory to Eric Weinstein's geometric marginalism [LW(p) · GW(p)] (and again, take note of a critique, here due to Timothy Nguyen).