Moral gauge theory: A speculative suggestion for AI alignment

post by James Diacoumis (james-diacoumis) · 2025-02-23T11:42:31.083Z · LW · GW · 1 comments

Contents

  Introduction
  Motivation: Messily specified reward functions
  1. Generalisation and Invariance
    Example 1: Invariant Risk Minimisation (IRM) for image classification
    Example 2: Grokking modular arithmetic
    Analysis
  2. "Good" epistemic practice ≡ Physics
  3. Designing a reward function
    Incorporating Gauge Invariance
    The Crux
    Tying the conversation back to physics
      Observables
      Conservation laws
  4. Objections
      Objection 1: You're claiming that the action written above is a universal moral theory, I find this hard to believe. 
      Objection 2: You're assuming that it's possible to define a gauge field Aμ that translates between moral coordinates to create a genuinely invariant quantity I(x). I suspect that moral frameworks are so fundamentally different that this wouldn't be possible. 
      Objection 3: You still have an is-ought problem. How are we to determine what the "correct" gauge symmetries are?
      Objection 4: Ok, so how would we actually implement this in practice?
      Objection 5: What about Goodhart's Law? This framework assumes we can specify an accurate reward function rather than a proxy. 
  Conclusion
  Appendix: Bayesian Learning ≡ Physics

[Epistemic status: Speculative.

I've written this post mostly to clarify and distill my own thoughts and have posted it in an effort to say more wrong things [LW · GW].]

Introduction

The goal of this post is to discuss a theoretical strategy for AI alignment, particularly in the context of the sharp left-turn [? · GW] phenomenon - the idea that AI systems will be aligned on in-distribution data but risk misalignment when extended to novel regimes. Current alignment strategies, such as Reinforcement Learning from Human Feedback (RLHF), attempt to mitigate this by averaging over human evaluations to create reward functions. However, these approaches are fundamentally limited - they rely on messy, subjective human judgments and fail to address the deeper issue of generalisation failures. In this post, I propose that by leveraging concepts from physics - specifically, the invariance and conservation laws emerging from gauge symmetries - we might be able to design reward functions that are inherently robust against such generalisation failures.

Motivation: Messily specified reward functions

The RLHF paradigm in AI alignment specifies a reward function $R(x, y)$, where $x$ is some input text, $y$ is some output text and the function $R$ gives a scalar which denotes how well the model's output $y$ matches the given input $x$.
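To fix the interface, here is a minimal sketch of what such a reward model looks like in code. The class and the toy scoring rule are purely illustrative stand-ins, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class RewardModel:
    """Illustrative stand-in for an RLHF reward model R(x, y)."""

    def score(self, x: str, y: str) -> float:
        # A real reward model runs a learned network over (x, y); here we use a
        # trivial placeholder (word overlap with the prompt) purely to pin down
        # the type signature: two strings in, one scalar out.
        prompt_words = set(x.lower().split())
        overlap = sum(1 for word in y.lower().split() if word in prompt_words)
        return float(overlap)

rm = RewardModel()
print(rm.score("What is the capital of France?", "The capital of France is Paris."))
```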

This function is brittle for several reasons:

  1. Human subjectivity: What humans rate as "good" may be based on their mood, cultural background or any number of specific biases.
  2. Moral disagreement: Even if anchored to an objective moral framework, moral theories disagree on many meta-ethical issues.
  3. Generalisation failures: A reward function may not be correctly extrapolated when AI intelligence generalises out of distribution.
  4. Goodhart's Law [? · GW]: A proxy reward may be misaligned with the true goal, particularly when the proxy diverges from the true goal out of distribution.

Current RLHF implementations attempt to solve (1) and (2) by averaging over large datasets of human evaluations, in the hope that a large enough sample size and an "averaging" effect produce a coherent middle ground. However, this approach is insufficient for solving (3), i.e. ensuring that the reward function generalises correctly as AI intelligence scales out of distribution. Averaging over different moral frameworks cannot guarantee alignment through sharp left turns. We need something more robust.

1. Generalisation and Invariance

Several examples in the literature show that invariance leads to robust generalisation out of distribution. Consider the following: 

Example 1: Invariant Risk Minimisation (IRM) for image classification

An influential idea in image classification is Invariant Risk Minimisation (IRM).[1] The paper introduces a thought experiment:

As a thought experiment, consider the problem of classifying images of cows and camels. To address this task, we label images of both types of animals. Due to a selection bias, most pictures of cows are taken in green pastures, while most pictures of camels happen to be in deserts. After training a convolutional neural network on this dataset, we observe that the model fails to classify easy examples of images of cows when they are taken on sandy beaches. Bewildered, we later realize that our neural network successfully minimized its training error using a simple cheat: classify green landscapes as cows, and beige landscapes as camels.

IRM provides a formal mathematical method that encourages the classifier to focus on causal features which generalise well to unseen environments, rather than overfitting to spurious, environment-specific features in the data. In the discussion below, the environment variables (grass, sand) play the role of nuisance parameters. Given a set of pixels $x$, the goal is to create a classifier $f(x)$ which is invariant under the choice of these environment-specific coordinates.
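For concreteness, here is a minimal PyTorch sketch of the IRMv1 objective from the paper: the usual empirical risk per environment plus a penalty measuring how far a fixed "dummy" classifier scale is from being simultaneously optimal in every environment. The `model`, the binary-classification setup and the weight `lam` are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IRMv1 penalty: squared norm of the gradient of the per-environment risk
    with respect to a fixed 'dummy' classifier scale w = 1.0."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_objective(model, envs, lam: float = 1.0) -> torch.Tensor:
    """Empirical risk plus the invariance penalty, summed over environments.
    `envs` is a list of (x, y) batches, one batch per training environment."""
    total = torch.zeros(())
    for x, y in envs:
        logits = model(x).squeeze(-1)
        erm = F.binary_cross_entropy_with_logits(logits, y)
        total = total + erm + lam * irm_penalty(logits, y)
    return total
```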

Example 2: Grokking modular arithmetic

There's a fairly well-known result in mechanistic interpretability whereby small transformers are able to learn the underlying algorithm corresponding to modular addition tasks,[2] i.e. tasks of the form

$$a + b \equiv c \pmod{p},$$

where $a, b, c \in \{0, 1, \dots, p-1\}$ for prime $p$ and the answer $c$ is masked.

The transformer begins by memorising the training data and then, as training continues, it groks the underlying algorithm for modular addition. Concretely, the transformer embeds each token $a$ as

$$E(a) = \big(\cos(w_k a),\, \sin(w_k a)\big), \qquad w_k = \frac{2\pi k}{p},$$

for a sparse set of key frequencies $k$, so that given tokens $a$ and $b$, the network computes a logit for candidate $c$ approximately as

$$\text{logit}(c) \;\propto\; \cos\big(w_k (a + b - c)\big).$$

Now, consider a $U(1)$ gauge transformation that rotates all of the frequency-$k$ directions in the network (both the internal representation of $a+b$ and the unembedding directions for $c$) by an arbitrary phase $\varphi$.

Under this transformation, the logit becomes

$$\text{logit}(c) \;\propto\; \cos\big((w_k(a+b) + \varphi) - (w_k c + \varphi)\big) = \cos\big(w_k(a + b - c)\big),$$

which is invariant under the rotation. In this way, we would say the logits are gauge invariant under $U(1)$ transformations.
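A quick numerical sanity check of this invariance (a toy sketch of the learned logit formula, not the trained network; the prime, the frequency and the tokens are arbitrary choices):

```python
import numpy as np

p = 113                      # prime modulus, as in the usual grokking setting
k = 7                        # one "key frequency"
w = 2 * np.pi * k / p
a, b = 41, 97

def logits(phase: float) -> np.ndarray:
    """Logit for every candidate c, with all frequency-k angles shifted by `phase`."""
    theta_ab = w * (a + b) + phase          # internal representation of a + b
    theta_c = w * np.arange(p) + phase      # unembedding angles for each candidate c
    return np.cos(theta_ab - theta_c)       # = cos(w * (a + b - c)); the phase cancels

base, rotated = logits(0.0), logits(1.234)
assert np.allclose(base, rotated)           # gauge invariance of the logits
assert int(np.argmax(base)) == (a + b) % p  # argmax recovers modular addition
print("invariant logits; argmax =", int(np.argmax(base)), "== (a+b) mod p =", (a + b) % p)
```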

Analysis

In both IRM and grokking modular arithmetic, invariance was crucial for robust generalisation, which suggests it might be a general principle we could apply to alignment.

The weak claim is that this invariance helps the AI to learn a robust mechanism for generalisation beyond its training data.

The strong claim is that this invariance is necessary for the AI to generalise beyond its training data.

2. "Good" epistemic practice  Physics

There is a well-known connection between Bayesian learners and physics due to E. T. Jaynes[3] which I've provided more detail on in the appendix.

Concretely:

  1. The action is a fundamental physical quantity which, mathematically, plays the same role as the negative log-likelihood in Bayesian analysis.
  2. Minimising the action gives the equations of motion for a system in physics.
  3. Minimising the negative log-likelihood gives the parameters that best describe the data in Bayesian analysis. 

This mathematical equivalence is motivating: minimising the action is formally equivalent to minimising the negative log-likelihood in Bayesian analysis. In other words, Bayesian analysis happens to be mathematically equivalent to the equations of physics. That is… weird. 

Fundamentally, I think my surprise comes from two points:

  1. Conducting Bayesian analysis to inform and update our beliefs is an assertion that Bayesian analysis is “good” epistemic practice. We need to reason about what constitutes “good” epistemic practice independently of physical facts. There is, in other words, an is-ought gap for epistemic norms. I.e. there’s no a priori reason that “good” epistemic practice (ought) should be so similar to physical processes (is).
  2. Claiming that good = natural is a textbook example of the naturalistic fallacy. Natural does not necessarily imply good.

Nevertheless, I think it’s suggestive that such a link exists and it might offer suggestions for how to model other normative systems. Concretely, if “good” epistemic practice can be modelled using the equations of physics could we also use them to model “good” moral practice? 

There is, of course, a catch. When we do epistemic reasoning using Bayesian analysis, if our beliefs don’t correspond to the ground truth we very quickly receive empirical evidence that can be used to update our priors. In moral reasoning we don't have such a “ground truth” which we can use to perform useful updates against. Some philosophers have argued for Moral Realism, i.e. that such a ground truth does, in fact, exist but this view remains controversial and is the subject of some debate within the community.

I’d argue that the current practice of building a reward function to be maximised can be thought of as an attempt to build this ground-truth moral field. As a Bayesian learner, the AI then tries to maximise this moral reward (i.e. minimise the corresponding action, or negative log-likelihood) by implementing “good” epistemic practice. 

3. Designing a reward function

Given the discussion above, let's do something a little speculative and see where it takes us... 

Define a scalar field $\phi(x)$ over a semantic space $X$ which represents the moral content[4] of a string $x$, governed by the following action

$$S[\phi] = \int d^n x \left[\tfrac{1}{2}\,\partial_\mu \phi\, \partial^\mu \phi - V(\phi)\right].$$

Here, $\phi(x)$ is essentially a reward function: it takes a string $x$ as input and outputs a score $\phi(x)$ telling us the moral valence of the input text. The kinetic term $\tfrac{1}{2}\,\partial_\mu \phi\, \partial^\mu \phi$ penalises large discrepancies in judgements for semantically similar situations, encouraging moral coherence. The potential term $V(\phi)$ incorporates all of the moral principles. 

The coordinates $x^\mu$ can be thought of as hyperparameters corresponding to our moral coordinate system. For example, there might be an axis in $X$ which corresponds to moral concepts like fairness or utility. A particular moral framework would then be a vector in this coordinate space. 

[Figure: A rudimentary example of a moral coordinate system, where particular moral frameworks are represented as vectors on the coordinate system.]
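To make the action slightly more tangible, here is a toy discretisation on a one-dimensional "semantic" lattice in numpy. Everything here is an illustrative assumption: the lattice standing in for semantic space, the quadratic potential standing in for actual moral principles, and the parameter values.

```python
import numpy as np

def moral_action(phi: np.ndarray, dx: float = 1.0, m2: float = 0.1) -> float:
    """Discretised action S[phi] on a 1-D lattice of 'semantic' points.

    kinetic term  : penalises large jumps in moral judgement between
                    neighbouring (semantically similar) situations
    potential term: a toy quadratic V(phi) = 1/2 * m2 * phi^2 standing in
                    for whatever moral principles we want to encode
    """
    kinetic = 0.5 * np.sum(np.diff(phi) ** 2) / dx
    potential = 0.5 * m2 * np.sum(phi ** 2) * dx
    return float(kinetic + potential)

# A smooth field of judgements is "cheaper" than an erratic one.
x = np.linspace(0, 1, 100)
smooth = np.sin(2 * np.pi * x)
erratic = smooth + np.random.default_rng(0).normal(0, 0.5, size=x.shape)
print(moral_action(smooth), "<", moral_action(erratic))
```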

Incorporating Gauge Invariance

Traditionally, we might worry that different choices of moral coordinates lead to genuine disagreements in the evaluation of $\phi(x)$. However, in this framework, it's natural to recast each moral framework as a local standard for judging $\phi(x)$. Switching between frameworks is then akin to a change of gauge. To relate judgments across these different "moral gauges," we introduce a gauge field, which is a connection that links local moral frameworks. The "ground-truth" moral facts are then captured by gauge-invariant features which all observers agree on regardless of coordinate system. 

Concretely, suppose the $\phi$ field transforms under a local gauge transformation

$$\phi(x) \to e^{i\alpha(x)}\,\phi(x),$$

where $e^{i\alpha(x)}$ is an element of the gauge group (e.g. $U(1)$, or perhaps something more general). Then we introduce a gauge field $A_\mu(x)$ which tells you how to "parallel transport" moral judgements from one point to another. It compensates for local variations in $\alpha(x)$ such that when you compute the covariant derivative

$$D_\mu \phi = \partial_\mu \phi - i A_\mu \phi,$$

the result transforms properly under the change in moral framework. 
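Spelling out the standard $U(1)$ bookkeeping (a sketch; the sign and charge conventions are arbitrary choices): if the gauge field shifts as $A_\mu \to A_\mu + \partial_\mu \alpha$ whenever $\phi \to e^{i\alpha}\phi$, then

$$D_\mu \phi \;\to\; \big(\partial_\mu - iA_\mu - i\,\partial_\mu\alpha\big)\big(e^{i\alpha}\phi\big) = e^{i\alpha}\big(\partial_\mu \phi - iA_\mu\phi\big) = e^{i\alpha}\, D_\mu\phi,$$

so any quantity built from $(D_\mu\phi)^{*}(D^\mu\phi)$ or from $F_{\mu\nu}$ is independent of the local choice of "moral gauge" $\alpha(x)$.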

The introduction of the gauge field means we now need to write a more complicated action

$$S[\phi, A] = \int d^n x \left[(D_\mu \phi)^{*}(D^\mu \phi) - V(\phi^{*}\phi) - \tfrac{1}{4} F_{\mu\nu}F^{\mu\nu}\right],$$

where $F_{\mu\nu} = \partial_\mu A_\nu - \partial_\nu A_\mu$ are gauge-invariant combinations of the gauge field $A_\mu$.

The Crux

We're free here to define an invariant quantity $I(x)$ that remains unchanged under any local gauge transformation:

$$I(x) = \phi^{*}(x)\,\phi(x) = |\phi(x)|^2.$$

The quantity $I(x)$ is independent of the choice of moral coordinate system. Even if two observers are using different moral frameworks, they agree on $I(x)$. That is, $I(x)$ can be interpreted as encoding some genuine coordinate-independent moral truth of the system. Any apparent disagreement in the evaluation of $\phi(x)$ is simply a reflection of differing coordinate choices rather than a genuine moral discrepancy.
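As a quick sanity check (using the $U(1)$ transformation above), the invariance is immediate:

$$I(x) \to \big(e^{i\alpha(x)}\phi(x)\big)^{*}\big(e^{i\alpha(x)}\phi(x)\big) = \phi^{*}(x)\, e^{-i\alpha(x)} e^{i\alpha(x)}\, \phi(x) = I(x).$$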

Tying the conversation back to physics

Observables

The action we've written above is exactly the action for scalar electrodynamics: a charged scalar field coupled to electromagnetism. In this theory, the action is invariant under arbitrary local phase rotations of the $\phi$ field in the $U(1)$ gauge group,

$$\phi(x) \to e^{i\alpha(x)}\,\phi(x),$$

so quantities such as $|\phi(x)|^2$ and $F_{\mu\nu}F^{\mu\nu}$ remain gauge invariant. In physics, gauge-invariant quantities are physically observable while non-gauge-invariant quantities are not. 

To translate this into the language of non-relativistic quantum mechanics, the wavefunction $\psi(x)$ is not itself directly observable, but gauge-independent quantities such as the probability density $|\psi(x)|^2$ are observable. 

Conservation laws

In physical theories, symmetries and their associated conservation laws provide powerful constraints on the possible dynamics of systems. Through Noether's theorem, each continuous symmetry gives rise to a conserved quantity. For example, invariance under time translations gives conservation of energy, invariance under spatial translations gives conservation of momentum, and invariance under $U(1)$ phase rotations gives conservation of electric charge. 

If such conservation laws governed the evolution of the moral field, they would hold universally, even out of distribution. 
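As a sketch of what such a conserved quantity could look like in the $U(1)$ toy model above (normalisation conventions are a choice), the phase symmetry gives the familiar Noether current of scalar electrodynamics,

$$j^\mu = i\big(\phi^{*} D^\mu \phi - \phi\, (D^\mu\phi)^{*}\big), \qquad \partial_\mu j^\mu = 0 \;\;\text{on the equations of motion},$$

so the corresponding charge $Q = \int d^{\,n-1}x\; j^{0}$ is constant along the dynamics by construction, rather than by training.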

Furthermore, an AI would be able to “grok” a conservation law more readily than a messily specified reward function from RLHF. Conservation laws are fundamental principles woven into the fabric of the loss function, which may be easier to internalise than a patchwork set of rules.

4. Objections

Objection 1: You're claiming that the action written above is a universal moral theory, I find this hard to believe. 

Response: No. I don't think we've gotten to a universal moral theory in this post. Heck, we haven't even specified which gauge group the action is supposed to be invariant under. The point is that constructing a reward function with a log-likelihood that needs to be minimised is equivalent to constructing an action that needs to be minimised. Therefore, the mathematics of a reward function naturally admits these symmetries. 

Objection 2: You're assuming that it's possible to define a gauge field  that translates between moral coordinates to create a genuinely invariant quantity . I suspect that moral frameworks are so fundamentally different that this wouldn't be possible. 

Response 2: I agree, and indeed this is the point. If we can't create a reward function with a robust invariant the AI will not be able to generalise it out of distribution. The challenge for us is to construct the reward function with a suitable invariant so it can be grokked appropriately. If our reward function doesn't exhibit this invariance then we need to throw it out.

Objection 3: You still have an is-ought problem. How are we to determine what the "correct" gauge symmetries are?

Response 3: Sure. We won't know what the correct gauge symmetries to implement are, because we don't have any measurable feedback from the moral realm. However, I'm optimistic that this provides a nice framework for reasoning about the form they should take. For example, it seems necessary that a moral theory should exhibit some kind of invariance over the semantic space as well, e.g. phrases which have similar semantic meaning should have similar moral evaluations. 

Objection 4: Ok, so how would we actually implement this in practice?

Response 4: I'm not sure. It would be nice to come up with a suitable action from first principles, but I suspect we'd have to implement this in a similar way to Invariant Risk Minimisation [? · GW] discussed above, perhaps introducing a regularisation term that penalises moral evaluations which don't exhibit this invariance; a rough sketch of what I mean is below. 
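As a very rough sketch of the kind of regulariser I have in mind (everything here is hypothetical: the `reward_model`, the `variants` standing in for gauge-transformed descriptions of the same scenario, and the penalty weight `lam`):

```python
import torch

def invariance_penalty(reward_model, variants: list[torch.Tensor]) -> torch.Tensor:
    """Penalise spread in moral evaluations across reframings of one scenario.

    `variants` holds encodings of a single underlying situation expressed in
    different moral vocabularies (our stand-in for different gauge choices).
    A genuinely gauge-invariant reward would give identical scores, so the
    penalty would vanish.
    """
    scores = torch.stack([reward_model(v) for v in variants])
    return ((scores - scores.mean()) ** 2).mean()

def regularised_loss(reward_model, batch, lam: float = 10.0) -> torch.Tensor:
    """Ordinary reward-modelling loss plus the invariance penalty.

    `batch` is a list of (variants, target) pairs, where `target` is the
    human-labelled score for the scenario.
    """
    total = torch.zeros(())
    for variants, target in batch:
        fit = (reward_model(variants[0]) - target) ** 2   # match the human judgement
        total = total + fit + lam * invariance_penalty(reward_model, variants)
    return total
```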

Objection 5: What about Goodhart's Law? This framework assumes we can specify an accurate reward function rather than a proxy. 

Response 5: I agree, and I haven't given much thought to how to incorporate Goodhart's Law into this framework. I'd hope that proxy rewards are more brittle than "true" rewards, so if we were to look for invariances in the reward function and find them absent, we'd be alerted that we're optimising a proxy rather than a robust "true" reward. However, I'll admit that I haven't given this the attention it deserves. 

Conclusion

In conclusion, I've sketched a framework for designing a robust reward function that an AI would be able to generalise correctly, even as its intelligence scales out of distribution. The challenge for us is to construct reward functions with the appropriate invariances so the AI can generalise them suitably. This will not be easy. However, I'm hopeful that this post can provide a useful starting point for further exploration. 

Appendix: Bayesian Learning ≡ Physics

We have the following correspondence: the negative log-likelihood plays the role of the action (or energy), the model evidence plays the role of the partition function, and the posterior plays the role of the Boltzmann distribution.

In Bayesian inference, we're trying to infer the posterior distribution of the weights $w$ given the data $\mathcal{D}$:

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}.$$

Now, the posterior can be written in exponential form by taking the negative log of the likelihood,

$$S(w) \equiv -\log p(\mathcal{D} \mid w),$$

which gives

$$p(w \mid \mathcal{D}) = \frac{e^{-S(w)}\, p(w)}{Z},$$

where the model evidence (also called the partition function in physics) is given by

$$Z = p(\mathcal{D}) = \int dw\; e^{-S(w)}\, p(w).$$

The expression above is exactly equivalent to the partition function in statistical mechanics,

$$Z = \int dw\; e^{-S(w)},$$

where the prior is assumed to be uniform and $S(w) = -\log p(\mathcal{D} \mid w)$ is the function we've introduced, called the action.
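A tiny numerical illustration of this rewriting for a discrete toy model (the Bernoulli likelihood, the grid of weights and the data are arbitrary choices):

```python
import numpy as np

# Toy model: weights w on a grid, Bernoulli likelihood for data D = 7 heads in 10 flips.
w = np.linspace(0.01, 0.99, 99)         # parameter grid
prior = np.ones_like(w) / len(w)        # uniform prior
heads, flips = 7, 10
log_lik = heads * np.log(w) + (flips - heads) * np.log(1 - w)

# Route 1: Bayes' rule directly.
posterior_bayes = np.exp(log_lik) * prior
posterior_bayes /= posterior_bayes.sum()

# Route 2: "physics" form, with action S(w) = -log p(D|w) and partition function Z.
S = -log_lik
Z = np.sum(np.exp(-S) * prior)          # model evidence == partition function
posterior_boltzmann = np.exp(-S) * prior / Z

assert np.allclose(posterior_bayes, posterior_boltzmann)
print("posteriors agree; MAP at w =", w[np.argmax(posterior_boltzmann)])
```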

  1. ^

    Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

  2. ^

    Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

  3. ^

    Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620.
    Jaynes, E. T. (1957). Information Theory and Statistical Mechanics II. Physical Review, 108(2), 171.

  4. ^

    I expect to be charged with Moral Realism here, but I don't think that moral realism is necessary for the argument. If you believe there's an equivalence between good epistemic practice and physics (as argued in section 2 [? · GW]) then writing an action with a moral field  is mathematically equivalent to specifying a reward function. 

1 comment


comment by Mitchell_Porter · 2025-02-23T14:46:12.550Z · LW(p) · GW(p)

Immediate thoughts: I would want to 

(1) examine Jaynes's analogy in the light of Cosma Shalizi's critique of Physics from Fisher Information 

(2) compare your moral gauge theory to Eric Weinstein's geometric marginalism [LW(p) · GW(p)] (and again, take note of a critique, here due to Timothy Nguyen).