Miscellaneous First-Pass Alignment Thoughts

post by NickGabs · 2022-11-21T21:23:53.960Z · LW · GW · 4 comments

I have been learning more about alignment theory in the last couple of months, and have heard from many people that writing down naive hypotheses [LW · GW] can be a good strategy for developing your thoughts and getting feedback about them.  So here goes:

Overall, my view is that alignment doesn't seem extremely hard, but that p(doom) is still fairly high (~45%). This is due to the plausibility of very short timelines, the risk that capabilities researchers won't take the problem sufficiently seriously or won't be willing to pay the alignment tax to implement alignment strategies, and my belief that if alignment is hard (in the sense that relatively simple training mechanisms plus oversight using interpretability do not work), we are probably doomed.  However, all of these statements are strong claims, weakly held - tell me why I'm wrong!
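
To make the arithmetic behind an estimate like this concrete, here is a purely illustrative decomposition. The post gives no explicit model; the conditional structure, the independence assumption, and every number below are hypothetical stand-ins chosen only to show how the listed factors could combine into a figure near 45%.

```python
# Hypothetical, illustrative numbers only -- not the post's actual model.
p_enough_time    = 0.80  # timelines are not too short to do the work
p_labs_pay_tax   = 0.85  # capabilities researchers take the problem seriously
                         # and accept the alignment tax
p_alignment_easy = 0.80  # simple training schemes + interpretability
                         # oversight turn out to be sufficient

# Treating the three requirements as (roughly) independent conjuncts:
p_success = p_enough_time * p_labs_pay_tax * p_alignment_easy
p_doom = 1 - p_success
print(f"p(doom) ~= {p_doom:.2f}")  # ~= 0.46, in the neighborhood of ~45%
```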

4 comments


comment by Charlie Steiner · 2022-11-21T23:19:08.484Z · LW(p) · GW(p)

Thanks! I'm planning to release an advent calendar of hot takes, and this gives me fodder for a few :P

My short notes that I'll expand in the advent calendar:

  • The notions of inner and outer alignment make a lot of sense in model-free deep RL, and a related kind of sense in RL using RNNs or black-box program search, but other architectures will have different generalization problems that present differently. Model-based deep RL, in particular, has its own generalization problems but none of them are really "inner alignment."
  • No, IDA and RLHF are not solutions to outer alignment. Alignment solutions that work only if humans converge to sensible behavior will not work. Humans in bureaucracies can do things that don't serve the interests of the whole. Humans can be deceived (e.g. the claw-in-front-of-the-ball example from RLHF; see the sketch below this list). [aside: I wrote what I think is an interesting post [LW · GW] about HCH / IDA.]
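
To make the RLHF point concrete, here is a minimal sketch of Bradley-Terry preference-based reward learning, the standard mechanism behind RLHF reward models; the network shape, tensor layout, and function names are illustrative assumptions, not something from the thread. The key property is that the reward model is fit to human comparisons, so whatever fools the human evaluator (like a claw that merely hovers in front of the ball) also fools the learned reward.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a trajectory; a hypothetical architecture for illustration."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, timesteps, obs_dim) -> per-step scores summed into
        # a single scalar "return" estimate per trajectory.
        return self.net(traj).sum(dim=1).squeeze(-1)

def preference_loss(reward_model: RewardModel,
                    traj_a: torch.Tensor,
                    traj_b: torch.Tensor,
                    human_prefers_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss used to fit RLHF-style reward models.

    The training target is the *human judgment*, so if evaluators are
    fooled (the claw only appears to grasp the ball), the learned reward
    inherits, and downstream policy optimization amplifies, that mistake.
    """
    r_a = reward_model(traj_a)
    r_b = reward_model(traj_b)
    return nn.functional.binary_cross_entropy_with_logits(
        r_a - r_b, human_prefers_a.float()
    )
```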
Replies from: NickGabs
comment by NickGabs · 2022-11-22T02:00:14.209Z · LW(p) · GW(p)

Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA.
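
For reference on what the "bureaucracy of aligned agents" structure looks like: IDA is usually framed as approximating HCH, a recursive tree of humans (or human-imitating models) answering subquestions. Below is a toy sketch of that recursion, assuming a hypothetical `agent` callable that stands in for a human or an imitation model; the prompts and decomposition format are illustrative, not from the thread.

```python
from typing import Callable

Agent = Callable[[str], str]  # a human, or a model trained to imitate one

def hch_answer(question: str, agent: Agent, depth: int) -> str:
    """Answer `question` by recursively consulting copies of `agent`.

    At depth 0 the agent answers directly; otherwise it proposes
    subquestions, each answered by a fresh recursive call, and then
    composes those sub-answers. The tree is only as aligned as the
    agent at each node -- the crux of the bureaucracy analogy above.
    """
    if depth == 0:
        return agent(f"Answer directly: {question}")

    # Ask the agent to decompose the question (one subquestion per line).
    subquestions = [q for q in agent(f"List subquestions for: {question}").splitlines()
                    if q.strip()]
    sub_answers = [hch_answer(q, agent, depth - 1) for q in subquestions]

    # Ask the agent to combine the sub-answers into a final answer.
    context = "\n".join(f"{q} -> {a}" for q, a in zip(subquestions, sub_answers))
    return agent(f"Given these sub-answers:\n{context}\nAnswer: {question}")
```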

comment by TurnTrout · 2022-11-21T21:34:58.238Z · LW(p) · GW(p)

> Shard theory explicitly assumes certain claims about how the human brain works, in particular that the genome mostly specifies crude neural reward circuitry and that ~all of the details of the cortex are basically randomly initialized. I think these claims are plausible but uncertain and quite important for AI safety, so I would be excited about more people looking into this question given that it seems controversial among neuroscientists and geneticists, and also seems tractable given that there is a wealth of existing neuroscience research.

Note that this pertains to the shard theory of human values, not shard-centric models of how AI values might form. That said, I'm likewise interested in investigation of the assumptions. EG how people work is important probabilistic evidence for how AI works because there are going to be "common causes" to effective real-world cognition and design choices.

Replies from: NickGabs
comment by NickGabs · 2022-11-21T21:45:53.981Z · LW(p) · GW(p)

Good point, post updated accordingly.