Would this solve the (outer) alignment problem, or at least help?

post by Wes R · 2025-04-06T18:49:14.145Z · LW · GW · 1 comment

Contents

  ALBUM-WMC: Aligning AGI Using Bayesian Updating of its Moral Weights & Modelling Consciousness 
    1. The Core Challenge: Valuing Conscious Experience
    2. A Bayesian Framework for Learning Moral Weights
    3. Instilling This Framework in AI
    4. Addressing Uncertainty and Minor Misalignment
    5. Strategies for Defining and Refining Priors
    Recap and Call for Discussion

ALBUM-WMC: Aligning AGI Using Bayesian Updating of its Moral Weights & Modelling Consciousness
 

This document outlines a set of related ideas concerning the challenge of defining and implementing moral weights in advanced AI systems, particularly focusing on the difficult problem of valuing conscious experience. The goal is to structure these thoughts for discussion within the AI safety community, inviting critique and further development.

By the end of reading this, hopefully you’ll learn:

  1. A way that you, or an AI, could actually account for conscious experience correctly, so that you (and the AI) don’t run into traps like being unwilling to take on new moral weights (which is a potential solution to the outer alignment problem!)
  2. Why Minor Misalignment is okayish in some cases
  3. Why you, or an AI, don’t need perfectly accurate priors (assuming my first guess is correct) in order to typically make the right decision
  4. A few bonus concepts


 

 

1. The Core Challenge: Valuing Conscious Experience

A central problem in AI alignment is determining the "moral weights" of different outcomes, especially those involving subjective conscious experience. How do we determine, and then assign, the right numeric values to states like happiness, suffering, or other qualia?

2. A Bayesian Framework for Learning Moral Weights

Our own subjective experience provides a continuous stream of data. This data can be used to perform Bayesian updates on our priors about the physical world, about conscious experience, and about the moral weights of those experiences.
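Here’s a minimal sketch of what one such update could look like, assuming (purely for illustration) that each bit of subjective experience can be treated as a noisy observation of a single moral weight. The grid, the Gaussian noise model, and the numbers in `observations` are all made-up assumptions, not part of the proposal itself:

```python
import numpy as np

# Illustrative assumption: the moral weight w of some conscious state lies
# somewhere in [0, 10], and we start from a flat prior over a grid of values.
grid = np.linspace(0, 10, 1001)
prior = np.ones_like(grid) / grid.size

# Illustrative assumption: each subjective report is a noisy reading of the
# true weight, with Gaussian noise of scale sigma.
observations = [6.1, 5.4, 7.0, 6.3]
sigma = 2.0

posterior = prior.copy()
for obs in observations:
    likelihood = np.exp(-0.5 * ((obs - grid) / sigma) ** 2)
    posterior *= likelihood
    posterior /= posterior.sum()  # renormalize after each update

print("posterior mean of the moral weight:", (grid * posterior).sum())
```

The same mechanics apply however the weights are actually parameterized; the point is just that experience keeps feeding the update.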

3. Instilling This Framework in AI

How can we equip an AI with the ability to reason about and act according to appropriate moral weights, especially concerning consciousness?

4. Addressing Uncertainty and Minor Misalignment

What happens if our specified priors or moral weights are slightly inaccurate? (I mostly came up with this under the assumption that the AI has pre-programmed, set-in-stone moral weights, but the ideas half-apply to the situation where the weights aren’t set in stone.)
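To make "slightly inaccurate weights are okayish" concrete, here’s a toy simulation under assumptions I’m inventing for illustration (five options, three kinds of experience, small zero-mean noise on the weights). It checks how often a small error in the weights actually flips which option comes out on top, and how much value is lost when it does:

```python
import numpy as np

rng = np.random.default_rng(0)

n_options, n_experiences = 5, 3
true_weights = np.array([1.0, 0.5, 2.0])                 # assumed "correct" moral weights
outcomes = rng.normal(size=(n_options, n_experiences))   # how each option affects each experience

losses = []
for _ in range(10_000):
    noisy_weights = true_weights + rng.normal(scale=0.1, size=n_experiences)  # minor misalignment
    chosen = np.argmax(outcomes @ noisy_weights)          # what the AI picks under its noisy weights
    best = np.argmax(outcomes @ true_weights)             # what is actually best
    losses.append(outcomes[best] @ true_weights - outcomes[chosen] @ true_weights)

print("fraction of decisions where the top choice flips:", np.mean(np.array(losses) > 0))
print("mean value lost per decision:", np.mean(losses))
```

Because the noise is unbiased, the resulting errors are unbiased too: sometimes the AI picks the second-best option, but it isn’t systematically pushed toward bad ones.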

5. Strategies for Defining and Refining Priors

How can we arrive at reasonable initial priors to give the AI?


 

Hopefully this leads to the AI’s preference ordering being similar enough to our preference ordering.

Sidenote: an AI’s misalignment is only as bad as its preference ordering. To be more precise, for AIs that operate on preference orderings, the total impact of all of the AI’s decisions equals [the number of decisions it makes] × [the mean value of the option the AI chooses at each decision, i.e. the option that sits higher in its preference ordering than all of its other options for that decision].
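Written out as a formula (my notation; none of these symbols appear in the post): if the AI makes $N$ decisions, faces option set $A_i$ at decision $i$, picks the option $a_i^{*} \in A_i$ at the top of its preference ordering, and $v(\cdot)$ is the value of an outcome, then

$$\text{total impact} \;=\; N \times \frac{1}{N}\sum_{i=1}^{N} v\!\left(a_i^{*}\right) \;=\; \sum_{i=1}^{N} v\!\left(a_i^{*}\right)$$

That is, everything the AI does to the world flows through which option its preference ordering puts on top at each decision.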

Recap and Call for Discussion

So, to recap:

An AI (and we humans) model the world as having a physical-universe, which affects conscious experiences. We label each of those experiences as its own universe (an “experience-universe”), and the states of those universes correlate with how they affect the physical-universe through free will. We also weigh each experience by some moral weight. We hold a joint probability distribution over what these universes are, how we weight them, and how they correlate. The AI can then use Bayesian updating, incorporating new data from observations or experiments on the physical-universe (and maybe somehow its own experience-universe, the same way you know not to punch people in part because you don’t like being punched), to refine its estimates of these relationships and moral weights.
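Here’s a toy version of that joint model, just to make the moving parts visible. Everything in it (the three hypotheses, the probabilities, the punching example’s numbers) is a made-up stand-in for “what the experience-universes are, how we weight them, and how they correlate with the physical-universe”:

```python
import numpy as np

# Each hypothesis pairs (a) a model of how a physical action maps to an
# experience state with (b) a moral weight for that experience. Both are
# illustrative stand-ins, not anything specified in the post.
hypotheses = [
    {"p_suffering_if_punched": 0.9, "weight_suffering": -5.0},
    {"p_suffering_if_punched": 0.9, "weight_suffering": -1.0},
    {"p_suffering_if_punched": 0.2, "weight_suffering": -5.0},
]
joint = np.array([0.4, 0.3, 0.3])  # joint prior over the hypotheses

# Bayesian update on an observation from the physical-universe:
# someone was punched and showed signs of suffering.
likelihoods = np.array([h["p_suffering_if_punched"] for h in hypotheses])
joint = joint * likelihoods
joint = joint / joint.sum()

# Expected moral value of the action "punch", marginalizing over hypotheses.
expected_value = sum(
    p * h["p_suffering_if_punched"] * h["weight_suffering"]
    for p, h in zip(joint, hypotheses)
)
print("posterior over hypotheses:", joint)
print("expected moral value of punching:", expected_value)
```

A reward signal built from this kind of expected moral value is what the training suggestion below would use.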

The AI can then be trained to act under this framework by rewarding it based on those moral weights.

This process allows the AI to learn without being explicitly programmed with every detail. Furthermore, the framework suggests that minor inaccuracies in the initial weights might lead to unbiased errors rather than catastrophic misalignment, assuming the AI isn't actively malicious. Strategies like using human consensus, filtering candidate priors by whether they lead to correct judgments on settled ethical questions, and appealing to generalizations similar to Kant’s universality principle can help establish reasonable starting points. However, challenges remain, particularly in grounding abstract concepts and ensuring robustness against issues like goal misgeneralization.

These ideas are presented to stimulate discussion, critique, and refinement within the AI safety community. I’d highly encourage you to give me feedback on the model's validity, the feasibility of implementation, potential failure modes (like goal misgeneralization or grounding difficulties), and the proposed strategies for defining priors. Whenever there’s a “?”, feel free to try to answer the question. Whenever I say “I’m asking!”, I was asking! I’d love to hear your answer!

(Also, not to flex, but I came up with all of this in the past 2 days alone, with a bit of help from one other person I’m friends with)

Cheers to a functioning solution to the alignment problem!

1 comment


comment by Martin Vlach (martin-vlach) · 2025-04-07T09:19:05.448Z · LW(p) · GW(p)

hopefully you will learn 

seems missing part 2.

??