Would this solve the (outer) alignment problem, or at least help?
post by Wes R · 2025-04-06T18:49:14.145Z · LW · GW · 1 comment

Contents:
- ALBUM-WMC: Aligning AGI Using Bayesian Updating of its Moral Weights & Modelling Consciousness
- 1. The Core Challenge: Valuing Conscious Experience
- 2. A Bayesian Framework for Learning Moral Weights
- 3. Instilling This Framework in AI
- 4. Addressing Uncertainty and Minor Misalignment
- 5. Strategies for Defining and Refining Priors
- Recap and Call for Discussion
ALBUM-WMC: Aligning AGI Using Bayesian Updating of its Moral Weights & Modelling Consciousness
This document outlines a set of related ideas concerning the challenge of defining and implementing moral weights in advanced AI systems, particularly focusing on the difficult problem of valuing conscious experience. The goal is to structure these thoughts for discussion within the AI safety community, inviting critique and further development.
By the end of reading this, hopefully you’ll learn:
- A way you, and an AI, could actually account for conscious experience correctly, so you (and the AI) don't fall into traps like being unwilling to update your moral weights (which is a potential solution to the outer alignment problem!)
- Why minor misalignment is okay-ish in some cases
- Why you, or an AI (assuming my first guess is correct), don't need perfectly accurate priors in order to typically make the right decision
- A few bonus concepts
1. The Core Challenge: Valuing Conscious Experience
A central problem in AI alignment is determining the "moral weights" of different outcomes, especially those involving subjective conscious experience. How do we find out, and then assign, the right numeric value to states like happiness, suffering, or other qualia?
- Proposed Model: We can conceptualize the world as comprising the physical world (technically, it could be a simulated world if we're living in a simulation, or something else, but the logic below doesn't depend on the world being physical, so it applies just fine in those stranger cases) and distinct "universes" of conscious experience. (That is, we label each person's experience as part of a separate universe.) The physical world/physical-universe causes changes in the experience-universes, and these experience-universes cause changes in the physical world (or at least there's some probability that they do), potentially through mechanisms like free will, though the exact nature isn't critical for our model.
- Priors are Key: We (and this AI) operate with inherent priors (probability distributions) over:
  1. What physical states/events (the physical world) cause what changes to conscious experiences/experience-universes.
  2. What conscious experiences/experience-universes (or changes to them) cause what changes to physical states/events (the physical world).
  3. The intrinsic moral value (positive or negative weight) of these different conscious experiences (and, if you want, the intrinsic moral value of the physical world, if you think some physical arrangements of atoms are somehow inherently valuable, or at least might be).
  4. The correlations between #1 and #2, between #2 and #3, between #1 and #3, and between #1, #2, and #3. (That is, their "joint probability distributions".) Note that these correlations should AVOID a situation where the AI changing the state of the physical world, say, by making a lot of paperclips, has the effect of the AI thinking it changed which physical worlds cause which conscious experiences (say, to paperclips being conscious and happy - think "Clippy") or which conscious states cause which physical states. YOU SHOULD AVOID THIS because AIs simply cannot change these cosmic correlations of causality, and convincing an AI it can is counterproductive to creating an AI that knows the right things. (That is, you'd be telling the AI something that is factually incorrect.) A minimal sketch of what such a joint prior could look like is right after this list.
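To make the structure concrete, here's a minimal sketch (in Python) of how a discrete joint prior over #1, #2, and #3 could be represented. The hypotheses, probabilities, and weights are made-up placeholders for illustration, not a claim about what the real priors should be.

```python
# A minimal sketch: a discrete joint prior over hypotheses about
# (#1) physical -> experience, (#2) experience -> physical, and (#3) moral weight.
# All entries below are illustrative placeholders, not real estimates.

joint_prior = [
    # Each hypothesis bundles #1, #2, #3 together, so their correlations (#4)
    # are captured by which combinations get probability mass.
    {"physical_to_experience": "stubbed toe -> sharp pain",
     "experience_to_physical": "sharp pain -> says 'ow'",
     "moral_weight_of_experience": -5.0,
     "prob": 0.6},
    {"physical_to_experience": "stubbed toe -> mild discomfort",
     "experience_to_physical": "mild discomfort -> says nothing",
     "moral_weight_of_experience": -1.0,
     "prob": 0.4},
]

assert abs(sum(h["prob"] for h in joint_prior) - 1.0) < 1e-9

# Expected moral weight of the physical event, under the current prior.
expected_weight = sum(h["prob"] * h["moral_weight_of_experience"] for h in joint_prior)
print(expected_weight)  # -3.4 under these made-up numbers
```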
2. A Bayesian Framework for Learning Moral Weights
Our own subjective experience provides a continuous stream of data. This data can be used to perform Bayesian updates on our priors about the physical world, conscious experience, and the moral weights of those experiences.
- Updating from Experience: If we have a prior belief about the negative value of pain, then observing that certain actions reliably cause reports of pain - assuming reports of pain (#2) are highly correlated (#4) with pain itself (#3), or experiencing it oneself - strengthens our belief (updates our posterior probability) that those actions are morally negative and increases our estimate of the magnitude of that negativity. This is a real thing that, say, the folks at The Moral Weight Project Sequence — EA Forum [? · GW] could do. (Note to self: email this idea to Bob Fischer.)
- Example (Illustrative): Consider a prior belief that there's a 10% chance a certain type of pain is "intolerably bad" (high negative weight) versus "moderately bad" (lower negative weight). If we further believe that in worlds where it's intolerably bad, individuals (in this case, chickens) are much more likely to express this verbally, then observing widespread verbal reports strongly updates our belief towards the "intolerably bad" hypothesis and its associated higher negative moral weight (via Bayes's rule - see the sketch after this list).
- Experiments and Observation: We can actively (and ethically) seek data to refine these priors. Observing the consequences of actions (e.g., noting distress signals in chickens after certain stimuli) provides Bayesian evidence to update our estimates of the moral weights associated with those actions and the likely experiences they cause.
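Here's a minimal sketch of the chicken example above as a single Bayes's-rule update. The likelihoods (how likely widespread distress reports are under each hypothesis) and the two moral weights are assumptions made up for illustration; only the 10% prior comes from the example.

```python
# Bayes's rule on the illustrative chicken-pain example.
# Prior: 10% "intolerably bad", 90% "moderately bad" (from the example above).
prior = {"intolerably_bad": 0.10, "moderately_bad": 0.90}

# Assumed likelihoods of observing widespread distress reports under each hypothesis.
likelihood_of_reports = {"intolerably_bad": 0.90, "moderately_bad": 0.20}

# Assumed moral weights attached to each hypothesis (more negative = worse).
moral_weight = {"intolerably_bad": -100.0, "moderately_bad": -10.0}

# Posterior after observing widespread distress reports.
evidence = sum(prior[h] * likelihood_of_reports[h] for h in prior)
posterior = {h: prior[h] * likelihood_of_reports[h] / evidence for h in prior}

expected_weight_before = sum(prior[h] * moral_weight[h] for h in prior)
expected_weight_after = sum(posterior[h] * moral_weight[h] for h in posterior)

print(posterior)               # {'intolerably_bad': 0.333..., 'moderately_bad': 0.666...}
print(expected_weight_before)  # -19.0
print(expected_weight_after)   # -40.0: the update makes this pain look worse
```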
3. Instilling This Framework in AI
How can we equip an AI with the ability to reason about and act according to appropriate moral weights, especially concerning consciousness?
- We input (into the AI)
- The conceptual model (physical vs. conscious realms, interaction).
- Initial priors (potentially derived from human consensus or ethical reasoning).
- Access to relevant data (e.g., reports of human/animal experiences, results of simulated or real-world interactions). Of course, this data should be reasonably representative of reality - if we only give it data on MRI scans, the AI might start out only knowing how people feel inside those scanners, though as we'll get into further down, it would want to, and actually can, learn about how people feel in other cases. (We would also want to give it more data on what causes pain and other things we weigh negatively, so it can try to avoid those early on, and so it doesn't have to run experiments on what causes people pain in order to find out.)
- The Grounding Problem: A challenge is enabling the AI to connect abstract physical data to other concepts in experiences. For example, based on the AI's input data, how does it figure out that the experience of "tastes spicy" is similar to "feels hot"? How would it come up with the hypothesis that if we slowed someone's brain down by a factor of 2, then 2 seconds for most people is 1 second for that person, so maybe every second is practically half as valuable for that person? How would it be able to tell that someone focusing on a certain sensation is similar to them experiencing it more intensely? My guess is this isn't that important, since chatbots can do this pretty well already, and this post is mostly about how to align an AI's goals. This seems like it won't be necessary to give it the right goals, but you tell me: is it? I'm asking!
- Testing Understanding: A potential test/standard for a genuine understanding of consciousness would be whether the AI can derive new properties or hypotheses about consciousness or moral weights beyond what it was explicitly programmed with. Otherwise, unless we're fine with the AI not knowing some concepts/insights/properties of consciousness, we'd have to explicitly state each property or insight manually (which, again, we can do by programming the physical-universe & experience-universe model into the AI, along with priors #1, #2, #3, and #4). (Unless this doesn't describe all the properties of consciousness - does it? I'm asking. Is there anything this misses? Let me know! Seriously, type out whether you think it misses anything!)
This might require providing a foundational understanding without exhaustively defining every aspect. - Experimenting with this: if we had the resources, perhaps we could run an AI with these goals in a sandbox, with some small training data set? If you can provide that service (e.g., you know how to code an AI, or you know a website where you can experiment with different goals for AIs), tell me! My email is wesreisen2@gmail.com - and if you think this wouldn't work (say, because you know a good reason I don't know of why even a small AI in a small experiment would still try to fake alignment), also let me know!
- Safe Information Gathering: An AI operating within this Bayesian framework can safely gather information to update its moral weight estimates. It isn't incentivized to selectively seek information just to confirm or inflate specific weights it "prefers." This is because, due to the nature of Bayesian updating, the expected value of a weight before observing new information equals the expected value of that weight after observing it (conservation of expected evidence). So, an AI can't observe anything that, on average, will make people seem happier than they actually were. It won't seek out data that tells it people are, on average, happier than they are, because that data doesn't exist. (A small numerical check of this property is sketched at the end of this section.)
- Training Environments: Simulated scenarios could be used to train (as well as test) the AI's ability to behave in accordance with these principles, but we'd have to give it rewards based on the moral weights the AI uses by the end of each situation we train it on, or at least something that, in expected value, is equal to that - though I'm pretty sure we'd still have issues like goal misgeneralization. Do we? Let me know! Seriously! Type what you think! Also, is there any research on training an AI on a reward function it doesn't know, or where we have an expected value for the right reward but not certainty? Maybe stuff from the subfield of how we'll score AIs on things we don't understand?
(Sidenote: if we trained the AI on, say, 1 billion situations, it might learn to take bets on which it'll lose massively 1 in a trillion times, because the case where the AI loses the bet never actually comes up in training! One potential workaround: in training, whenever some event with random outcomes occurs, we train the AI on each (or most) of the possible outcomes, and we scale down how much each outcome matters in proportion to its probability (e.g., we train it on the scenario where a coin lands heads and on the scenario where it lands tails, but we scale down how much each impacts the model by 0.5). A small sketch of this weighting is at the end of this section. This specific alignment technique is useful even for companies that care less about AIs being safe, because this one specifically is useful for making AIs aligned.)
- A nice bonus: if we graphed what the AI's moral weights are after a few days of the AI learning and performing research, we could see trends like "the more spicy something tastes, the more the AI morally weighs that experience as positive!"
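As a quick check of the "safe information gathering" claim above, here's a minimal sketch (with made-up numbers, reusing the chicken example) showing that before the AI runs an experiment, the expectation over its possible posteriors equals its prior, so no experiment can be expected, in advance, to inflate a moral weight.

```python
# Conservation of expected evidence, on the two-hypothesis example above.
# All numbers are illustrative.
prior = {"intolerably_bad": 0.10, "moderately_bad": 0.90}
likelihood_of_reports = {"intolerably_bad": 0.90, "moderately_bad": 0.20}
moral_weight = {"intolerably_bad": -100.0, "moderately_bad": -10.0}

def posterior_given(observed_reports: bool) -> dict:
    # P(hypothesis | data) for data = "reports observed" or "no reports observed".
    like = {h: likelihood_of_reports[h] if observed_reports else 1 - likelihood_of_reports[h]
            for h in prior}
    evidence = sum(prior[h] * like[h] for h in prior)
    return {h: prior[h] * like[h] / evidence for h in prior}

def expected_weight(dist: dict) -> float:
    return sum(dist[h] * moral_weight[h] for h in dist)

p_reports = sum(prior[h] * likelihood_of_reports[h] for h in prior)

# Average the post-experiment estimate over what the experiment might show.
expected_post_experiment_weight = (
    p_reports * expected_weight(posterior_given(True))
    + (1 - p_reports) * expected_weight(posterior_given(False))
)

print(expected_weight(prior))           # -19.0
print(expected_post_experiment_weight)  # -19.0: no experiment is expected, in advance, to shift the estimate
```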
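And here's a minimal sketch of the sidenote's workaround: instead of sampling random outcomes during training (where a 1-in-a-trillion loss may simply never appear), we enumerate the outcomes and weight each one's contribution by its probability. The toy bet and its payoffs are assumptions for illustration.

```python
# Toy bet: win 1 with probability 1 - p, lose 10^13 with probability p = 10^-12.
p_loss = 1e-12
outcomes = [(1 - p_loss, +1.0), (p_loss, -1e13)]

# Naive sampled training: across, say, a billion sampled episodes, the loss outcome
# essentially never appears, so the AI's sampled average reward looks like +1.0
# and it learns to take the bet.
sampled_estimate = +1.0

# Probability-weighted training: include every outcome, scaling its contribution
# to the update by its probability (as in the coin-flip example above).
weighted_estimate = sum(prob * reward for prob, reward in outcomes)

print(sampled_estimate)   # 1.0
print(weighted_estimate)  # about -9.0: the rare catastrophic loss is now counted
```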
4. Addressing Uncertainty and Minor Misalignment
What happens if our specified priors or moral weights are slightly inaccurate? (I mostly came up with this section under the assumption that the AI has pre-programmed, set-in-stone moral weights, but it half-applies to the situation where the weights aren't set in stone.)
- Linear Value Functions: Assume the AI aims to maximize a value function that is a linear combination (dot product) of different outcomes (e.g., Value = w_cats * number-of-cats + w_dogs * number-of-dogs). The AI optimizes based on the weights [w_cats, w_dogs] it is given. That is, it assigns w_cats of value to each cat and w_dogs to each dog. If [w_cats, w_dogs] were [2, 1], the AI would value cats twice as much as dogs.
- Slightly Wrong Weights: If our true desired weights are slightly different, the AI optimizes in a slightly "wrong" direction.
- Geometric Intuition: We can set the AI's target vector (its moral weights) to be a basis vector (as described in https://youtu.be/P2LTAUO1TdA?), namely j-hat. Let's also set the vector that's 90° from [w_cats, w_dogs], which is [w_dogs, -w_cats], to be the other basis vector, i-hat. (Yes, I know there are technically 2 vectors 90° away, one on either side of [w_cats, w_dogs].) Now, say this AI faces a decision with some options. Each option results in some number of dogs and some number of cats, which we can encode as a vector or point using those two basis vectors. Notice that the AI will pick whichever vector/point is highest up. (Remember: j-hat is the up-direction, and dot products look like "projection", as explained in https://youtu.be/LyGKycYT2v0?) How far this point lies to the left or to the right is pretty much random, and if we actually value [w_dogs, -w_cats] positively (that is, we value the point being more to the right - remember i-hat is the right-direction), the AI optimizing its own weights won't make it pick points that are systematically further to the left.
- Weights add like vectors, and are scalable: Since [w_cats, w_dogs] and [w_dogs, -w_cats] can serve as basis vectors, any target vector (set of moral weights) can be written as some constant * [w_cats, w_dogs] + some other constant * [w_dogs, -w_cats]; for the AI's own moral weights, those constants are 1 and 0. Ours might be 1 and 0.02, or 1 and -0.02, or something like that. Scaling these two values by some positive real number, say x100, is the same thing as maximizing centiQALYs instead of QALYs, and thus it shouldn't have any impact on what we do. If we scaled by a negative number, though, we'd be maximizing DALYs instead. So, ASSUMING our two numbers work out so that the first one is a positive real number, we can scale them up or down so the first one equals 1 (otherwise, we'd have to scale by a negative real number).
Now, let's take the vector [number of dogs the AI's decision yielded, number of cats the AI's decision yielded], and map it to a vector on our grid (call this new vector D, for decision), using the basis vectors above. The moral value of this choice, according to our weights, is either D•[1, -0.02] or D•[1, 0.02]. Since the dot product of two vectors is linear, E(D•our moral weights) = D•[E(1), E(how much we value D being to the left or to the right)] = D•[1, 0], which is pretty high, since the AI is optimizing for exactly that. (E(x) is the expected value of x.)
(Being a bit more precise might still, on average, be super useful, though: maybe there's another option which, if it were D, would be [0.99, 10000], and if you find out your values are actually [1, 0.02], then [0.99, 10000]•[1, 0.02] (which is 200.99) might be quite a bit higher than the original D dot-producted with [1, 0.02].)
- Unbiased Errors: Crucially, if the AI is simply optimizing the weights it has (without malice, and without actively trying to do bad things while optimizing for its goal somehow), and our uncertainty about the true weights is unbiased (which is always the case - if we were more confident that the weights should favor dogs a bit more, we'd have increased the AI's w_dogs beforehand until it matched what we thought), the errors introduced by optimizing in the slightly wrong direction should average out. The AI isn't actively trying to exploit the difference! It's not actively trying to be bad while optimizing for the goal the humans gave it - it's just optimizing for the goal the humans gave it! (A small simulation of this averaging-out is sketched right after this list.)
- A short note on a Gaussian distribution for D: If the distribution of the number of cats resulting from the AI's options is a Gaussian with variance 1, and the distribution of each option's number of dogs is also such a Gaussian, you can rotate the basis vectors while keeping the probability distribution, and pretty much everything else, the same, for the reasons explained here: https://youtu.be/cy8r7WSuT1I?
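Here's a minimal simulation sketch of the "unbiased errors" point: an AI picks the best option under weights [1, 0] (in the rotated basis above), while our true weights are [1, +0.02] or [1, -0.02] with equal probability. The number of options and their Gaussian distribution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_options = 100_000, 20

ai_weights = np.array([1.0, 0.0])  # what the AI optimizes (rotated basis from above)

value_under_ai_weights = 0.0
value_under_true_weights = 0.0
for _ in range(n_trials):
    # Each row is an option, expressed in the rotated basis [along-AI-weights, sideways].
    options = rng.normal(size=(n_options, 2))
    chosen = options[np.argmax(options @ ai_weights)]  # AI picks the highest projection

    # Our true weights are [1, +0.02] or [1, -0.02] with equal probability.
    sideways = 0.02 if rng.random() < 0.5 else -0.02
    true_weights = np.array([1.0, sideways])

    value_under_ai_weights += chosen @ ai_weights
    value_under_true_weights += chosen @ true_weights

print(value_under_ai_weights / n_trials)    # roughly 1.87 (mean of the max of 20 standard normals)
print(value_under_true_weights / n_trials)  # nearly the same: the sideways error averages out
```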
5. Strategies for Defining and Refining Priors
How can we arrive at reasonable initial priors to give the AI?
- Thought Experiments & Consensus: Analyzing hypothetical moral dilemmas and observing human consensus on the preferred outcomes can reveal underlying shared priors. For example, the consensus is that English speakers past the "cry by default" stage of life will typically say "ow" if they get hurt.
- Filtering via Known Judgments: We can use widely accepted ethical judgments (e.g., "punching people is bad") to constrain the set of plausible priors. If a set of priors leads to a conclusion that violates a strong ethical intuition, that set's probability can be down-weighted or it can be discarded. In this case, priors where physical-universes with punched human bodies correlate with experience-universes that are happy (instead of hurt from being punched) should be judged less likely than priors where the experience-universes are more hurt. (A minimal sketch of this kind of filtering is at the end of this list.)
Another case is (maybe) that if someone verbally expresses a preference about what state they want to be in (for example, "I'd rather be happy than sad"), your priors should likely lead you to believe that person.
Another case: if there are no living things on Earth, that probably correlates highly with a lack of conscious experience on Earth (which probably has very low moral weight).
- Generalization Principle (based on my intuition/guess): Priors that correctly handle simple cases (e.g., "don't kick one dog") are likely to generalize better to related, more complex cases (e.g., "don't kick two dogs"), similar to how Kant's universality principle lets you generalize a judgment in one situation to similar judgments in similar situations, which is also akin to principles of simplicity and universality seen elsewhere (e.g., Kolmogorov complexity?). This suggests that filtering priors based on simple, clear-cut cases could be a powerful strategy.
- Specifying priors non-manually: Maybe specifying priors ourselves would be too difficult, maybe even on the scale of directly specifying what an AI should do in each situation, but maybe not. Maybe we'd need the AI to independently come up with concepts and insights about the priors (as alluded to above when I mentioned the AI being able to independently come up with concepts and insights about consciousness), or maybe we could delegate this to an oracle AI (like a chatbot) if it's too hard to be specific enough to get a good result. Maybe if it's still too hard, there's a chance we can't align superintelligent agentic AIs, and we have to focus on oracles or other types of AI. Would we? I'm asking!
- Robustness: It's possible that extreme precision in defining priors isn't necessary. Perhaps broad ranges ("buckets") of priors (which we can call "inputs") result in the same (or similar) preference orderings (which we can call "outputs"). One case where this is true: as long as your priors don't lead you to think there's literally NO CHANCE of things mattering, there's some chance that things matter, in which case you should act as though things do matter (non-nihilism), since in the situation where things don't matter you can pretend they do just fine, and in the situation where things do matter, you should act like it. I reckon we'd still need to be careful around conscious robot-stuff, since priors about conscious flesh-stuff might not "generalize" well to having an accurate model of conscious robot-stuff ("artificial sentience").
Hopefully all this leads to the AI's preference ordering being similar enough to our preference ordering.
Sidenote - an AI's misalignment is only as bad as its preference ordering. To be more precise, for AIs that operate on preference orderings, the impact of all of the AI's decisions equals [the number of decisions it makes] * [the average (mean) value of the option the AI chooses, i.e., the option that sits higher in the preference ordering than all of the AI's other options, for each decision].
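Here's a minimal sketch of the "filtering via known judgments" idea above: each candidate set of priors is scored against a few widely accepted judgments, and candidates that contradict them get down-weighted. The candidate priors, test judgments, and down-weighting factor are all made up for illustration.

```python
# Filtering candidate priors against widely accepted ethical judgments.
# Everything below is an illustrative placeholder.

candidate_priors = {
    # prior_name -> predicted moral weight of the experience caused by each physical event
    "punches_hurt":       {"person_is_punched": -8.0, "person_is_hugged": +2.0},
    "punches_feel_great": {"person_is_punched": +5.0, "person_is_hugged": +2.0},
}
prior_probs = {"punches_hurt": 0.5, "punches_feel_great": 0.5}

# Widely accepted judgments used as a filter: event -> sign the weight should have.
known_judgments = {"person_is_punched": "negative"}

DOWNWEIGHT = 0.01  # how much to penalize a prior per violated judgment (arbitrary)

for name, predictions in candidate_priors.items():
    for event, expected_sign in known_judgments.items():
        violates = (expected_sign == "negative") and (predictions[event] >= 0)
        if violates:
            prior_probs[name] *= DOWNWEIGHT

# Renormalize so the surviving candidates' probabilities sum to 1.
total = sum(prior_probs.values())
prior_probs = {name: p / total for name, p in prior_probs.items()}

print(prior_probs)  # "punches_hurt" now carries almost all of the probability mass
```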
Recap and Call for Discussion
So, to recap:
An AI (and we humans) model the world as having a physical-universe, which affects conscious experiences, which we label as their own universes (called "experience-universes"), and the states of those universes correlate with how they affect the physical-universe, perhaps through free will. We also weigh each experience by some moral weight. We have some joint probability distribution over what these universes are, how we weight them, and how they correlate. The AI can then use Bayesian updating, incorporating new data from observations or experiments on the physical-universe (and maybe somehow its own experience-universe, the same way you know not to punch people in part because you don't like being punched), to refine its estimates of these relationships and moral weights.
The AI can be trained to act under this framework by rewarding it based on those moral weights.
This process allows the AI to learn without being explicitly programmed with every detail. Furthermore, the framework suggests that minor inaccuracies in the initial weights might lead to unbiased errors rather than catastrophic misalignment, assuming the AI isn't actively malicious. Strategies like using human consensus, filtering priors based on which ones lead to correct ethical judgments, and generalizations similar to Kant's universality principle can help establish reasonable starting points. However, challenges remain, particularly in grounding abstract concepts and ensuring robustness against issues like goal misgeneralization.
These ideas are presented to stimulate discussion, critique, and refinement within the AI safety community. I’d highly encourage you to give me feedback on the model's validity, the feasibility of implementation, potential failure modes (like goal misgeneralization or grounding difficulties), and the proposed strategies for defining priors. Whenever there’s a “?”, feel free to try to answer the question. Whenever I say “I’m asking!”, I was asking! I’d love to hear your answer!
(Also, not to flex, but I came up with all of this in the past 2 days alone, with a bit of help from one other person I’m friends with)
Cheers to a functioning solution to the alignment problem!
1 comment
Comments sorted by top scores.
comment by Martin Vlach (martin-vlach) · 2025-04-07T09:19:05.448Z · LW(p) · GW(p)
hopefully you will learn
seems missing part 2.
??