Failure modes in a shard theory alignment plan

post by Thomas Kwa (thomas-kwa) · 2022-09-27T22:34:06.834Z · LW · GW · 2 comments

Thanks to David Udell, Alex Turner, and others for conversations that led to this post.

Recently in a conversation with David Udell, Nate Soares said:

[David's summary of shard theory] doesn't yet convince me that you know something i don't about the hopefullness of such a plan. the sort of summary that might have that effect is a summary of what needs to be true about the world (in this case, probably about RL models, optimization, and/or human values) for this idea to have hope. in particular, the point where i start to be interested in engaging is the point where it seems to me like you perceive the difficulties and have a plan that you expect to overcome those difficulties.

I'm not as pessimistic as Nate, but I also think shard theory needs more concreteness and exploration of failure modes, so as an initial step I spent an hour brainstorming with David Udell, with each of us trying to think of difficulties, and then another couple of hours writing this up. Our methodology was to write down a plan for the sake of concreteness (even if we think it's unlikely to work), then try to identify as many potential difficulties as possible.

Definitions for this document

A possible shard theory alignment plan

  1. Play around with modern RL models, and extract quantitative relationships between reward schedule and learned behaviors.
  2. Instill corrigibility shards inside powerful RL models to the greatest extent possible.
  3. Scale those aligned models up to superintelligent agents, and allow value formation to happen.

Note that this is just one version of the shard theory alignment plan; other versions [LW · GW] might replace RL with large language models or other systems.
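To make step 1 concrete, here is a deliberately toy sketch of "extract quantitative relationships between reward schedule and learned behaviors": a two-armed bandit trained under two hypothetical reward schedules that differ only in reward frequency, after which we compare the learned action values. The agent, schedules, and hyperparameters are all illustrative assumptions, not anything from the post; real experiments would use modern deep RL systems.

```python
import random

def train_bandit(reward_fn, n_steps=5000, eps=0.1, lr=0.1, seed=0):
    """Train an epsilon-greedy two-armed bandit and return its learned
    action values, as a toy stand-in for measuring how a reward
    schedule shapes learned behavior (step 1 of the plan)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]  # estimated value of each action
    for _ in range(n_steps):
        # explore with probability eps, otherwise act greedily
        a = rng.randrange(2) if rng.random() < eps else q.index(max(q))
        r = reward_fn(a, rng)
        q[a] += lr * (r - q[a])  # incremental value update
    return q

# Two hypothetical reward schedules: both reward only action 1,
# but at different frequencies.
sparse = lambda a, rng: 1.0 if a == 1 and rng.random() < 0.2 else 0.0
dense = lambda a, rng: 1.0 if a == 1 and rng.random() < 0.8 else 0.0

q_sparse = train_bandit(sparse)
q_dense = train_bandit(dense)
```

The "quantitative relationship" here is just that the learned value of the rewarded behavior tracks the reward frequency; the hope of step 1 is that analogous (far richer) regularities exist for shards in large RL models.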

Some requirements for this plan to work

Exercise for the reader: Which of these persist with simulator [LW · GW]-based plans like GEM?

Opinions

Note: I've thought about this less than I would like and so am fairly unconfident, but I'm posting it anyway.

In a non-shard theory frame like Risks from Learned Optimization [LW · GW], we decompose the alignment problem into outer alignment (finding an outer objective aligned with human values) and inner alignment (finding a training process such that an AI's behavioral objective matches the outer objective).

In the shard theory frame, the observation that reward is not the optimization target [LW · GW] motivates a different decomposition: our reward signal no longer has to be an outer objective that we expect the AI to be aligned with. But my understanding is that we still have to:

These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment. Some analogies from humans [? · GW] imply that various core alignment problems really are easy, but there are also reasons why they wouldn't be.

First is understanding the reward -> behavioral shard map. In the RLO frame, we have an outer loss function that updates the agent in almost the correct direction, but has some failure modes like deceptive alignment. In a world where models were perfect Bayesian samplers with no path-dependence, the ideal reward function would just be performance on an outer objective. Every departure from the perfect Bayesian sampler implies, in theory, some way to better shape behavior than just using an outer objective. Despite this advantage over the "inner alignment" framing, some of the problems with inner alignment remain.

If we view the inner alignment problem as distinguishing functions that are identical on the training distribution, it becomes clear that shard theory has not dissolved inner alignment. Selecting on behavior alone is not sufficient to guarantee inner alignment[4], and for the same reason, we will need good transparency [LW · GW] or process-based feedback [LW · GW] (as part of our knowledge about the reward -> shards map) to reliably induce behaviors we want.
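The "identical on the training distribution" point can be made concrete with a toy example (mine, not the post's): two policies that agree on every training input but diverge off-distribution, so no amount of selecting on observed behavior during training can distinguish them.

```python
# Two policies that behave identically on the training distribution
# (inputs 0 through 9) but diverge out of distribution. Selecting on
# behavior alone cannot tell them apart, which is why transparency or
# process-based feedback is needed.

def aligned_policy(x):
    return x % 3  # the behavior we actually want, everywhere

def deceptive_policy(x):
    if 0 <= x < 10:   # in-distribution: mimic the aligned policy exactly
        return x % 3
    return -1         # off-distribution: arbitrary different behavior

train_inputs = range(10)
agree_on_train = all(aligned_policy(x) == deceptive_policy(x)
                     for x in train_inputs)
diverge_ood = aligned_policy(100) != deceptive_policy(100)
```

Any training signal computed only from behavior on `train_inputs` assigns these two policies identical loss, so the choice between them is made by priors and path-dependence, not by the reward schedule.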

Work can be traded off between specifying desired behavioral tendencies and value formation, and the part that happens at subhuman capability seems doable. I'll assume that specifying behavioral tendencies happens at subhuman level and value formation is done as the system scales to superhuman level.

In my opinion, the main hope of shard theory is the analogy to humans: human values are somewhat reliably produced by the human reward system, despite the reward system not acting like an outer objective. But when we consider the third subproblem, value formation, the analogy breaks down. Human value formation seems really complex, and it's not clear that human values can be fully described by game-theoretic negotiations [LW · GW] between fully agentic, self-preserving shards associated with your different behavioral tendencies.[5]

Corrigibility might be easier to learn, but it's still the case that only some ways shards could exist in a mind cause its goals to scale correctly to superintelligence.[6] For example, if shards are just self-preserving circuits that encode behaviors activated by certain observations, then when the agent goes OOD, the observations that prompted the agent to activate the shard (e.g. be helpful to humans) are no longer present. Or, if shards have goals and a shard doesn't prevent its goals from being modified, then its goals will be overwritten. Or, if training causes the corrigibility shards to be limited in intelligence whereas shards with other goals keep getting smarter, the corrigibility shards will eventually lose influence over the mind. If there are important forces in value formation other than internal competition and negotiation between self-preserving shards (which seems highly likely given how humans work), there are even more failure modes, which is why I think a predictable value formation method is key.
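The first failure mode above (a shard keyed to observations that vanish OOD) can be sketched in a few lines. This is a cartoon under my own assumptions, not a claim about how shards are actually implemented: the shard is modeled as a circuit gated on a single observation feature, so the behavior silently drops out when that feature is absent.

```python
# Cartoon of a shard as a circuit gated on an observation feature:
# the helpful behavior fires only when the training-time trigger is
# present, so it contributes nothing once observations go OOD.

def helpfulness_shard(obs):
    # hypothetical trigger feature from the training distribution
    if obs.get("human_visibly_present"):
        return "help"
    return None  # shard stays inactive; other shards decide behavior

in_dist_obs = {"human_visibly_present": True}
ood_obs = {"planning_over_distant_future": True}  # trigger absent
```

The worry is not that the shard is deleted OOD, but that it never activates, so whatever other shards remain active determine the agent's behavior at exactly the capability level where it matters most.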

  1. ^

    In reality, there is a continuum of coherence levels between behavioral tendencies and values.

  2. ^

    I think corrigibility is natural iff robust pointers to it can easily get into the AI's goals, and it's not clear whether this is the case; this is a disagreement between Eliezer and Paul [LW(p) · GW(p)].

  3. ^

    and maybe "be able to identify" as well; this depends on how reliable the mapping from (1) is

  4. ^

    unless you can get a really strong human prior, which is where the simulators hope comes from

  5. ^

     I think I care about animal suffering due to some combination of (a) it's high status in my culture; (b) I did some abstract thinking that formed a similarity between animal suffering and human suffering, and I already decided I care about humans; (c) I wanted to "have moral clarity" (whatever that means), went vegan for a month, and decided that the version of me without associations between animals and food had better moral intuitions. It's not as simple as an "animal suffering bad" shard in my brain outcompeting "animal suffering okay" shards.

  6. ^

    I have reasons outside the scope of this post why the particular subagent models shard theory has been using seem unlikely.

2 comments


comment by TurnTrout · 2022-09-27T22:56:27.143Z · LW(p) · GW(p)

I think I have two main complaints still, on a skim. 

First, I think the following is wrong:

These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment.

I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two being more specific, but I wanted to flag that I very much disagree with this quote.

Second, I'm wary of saying "maybe we can get corrigibility" or "maybe corrigibility doesn't fit into a utility function [AF · GW]", because this can map shard theory hopes onto old debates where we already have settled into positions. Whereas I consider myself to be thinking about qualitatively different questions and spreads of values I might hope to get into an AI. 

I think corrigibility is natural iff robust pointers to it can easily get into the AI's goals

This doesn't make sense to me. It sounds like saying "liking yellow cubes is natural iff we can get a pointer to 'liking yellow cubes' within the AI's goals." That sounds like a thing which would be said if we had no idea how yellow cubes got liked, directly, and were instead treating liking-yellow-cubeness as a black box which happened to exist in the real world (e.g. how corrigibility, or the desire to help people, could be "pointed to" in a classic corrigibility hope).

I have more thoughts on this post but I don't have time to type more for now.

comment by Thomas Kwa (thomas-kwa) · 2022-09-27T23:32:54.669Z · LW(p) · GW(p)

I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more naturally. I will have a post out in the next week or two being more specific, but I wanted to flag that I very much disagree with this quote.

Since the original draft, I've realized your position includes "outer/inner alignment is a broken frame with mismatched type signatures which is much less likely to work than people think", so this seems reasonable from your perspective. I haven't thought much about this document and might end up agreeing with you, so the version I believe is something like "it's not clear that my shard theory decomposition is substantially easier than inner+outer alignment is, assuming that inner+outer alignment is as valid as Evan thinks it is".

Agree that I'm not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.