Does iterated amplification tackle the inner alignment problem?
post by JanBrauner
This is a question post.
When iterated distillation and amplification (IDA) was published, some people described it as "the first comprehensive proposal for training safe AI". Having read a bit more about it, it seems to me that IDA is mainly a proposal for outer alignment and doesn't deal with the inner alignment problem at all. Am I missing something?
answer by ofer
My understanding is that amplification-based approaches are meant to tackle inner alignment by using amplified systems that are already trusted (e.g. humans + many invocations of a trusted model) to mitigate inner alignment problems in the next (slightly more powerful) models being trained. A few approaches for this have already been suggested, though I'm not aware of published empirical results; see Evan's comment for some pointers.
I hope a lot more research will be done on this topic. It's not clear to me whether we should expect to have amplified systems that allow us to mitigate inner alignment risks to a satisfactory extent before the point where we have x-risk-posing systems. How can we make that more likely? And if it's not feasible, how do we realize that as soon as possible?