Comments
I think it’s an important caveat that this is meant for early AGI with human-expert-level capabilities, where we can still detect misalignment as it manifests in small-scale problems. When capabilities are weak, the difference between alignment and alignment-faking matters less because the model’s options are more limited. But once we scale to more capable systems, the difference becomes critical.
Whether this approach helps in the long term depends on how much the model internalizes the corrections, as opposed to just updating its in-distribution behavior. It’s possible that the behavior we see is a poor indicator of the model’s internals, in which case we would only be improving how the model acts without fixing the underlying misalignment. This comes down to how much overlap there is between visible misalignment and total misalignment: if most of the misalignment stays invisible until late, this approach is less helpful in the long term.
As you mention, the three examples here work regardless of whether SSA or SIA is true because none of the estimated outcomes affect the total number of observers. But the Doomsday Argument is different and does depend on SSA. If SIA is true, the early population of a long world is just as likely to exist as the total population of a short world, so there’s no update upon finding yourself in an early-seeming world.
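To make that concrete, here’s a quick Bayesian sketch (the toy setup and notation are mine, not from the post): suppose a Short world has $N_S$ total observers, a Long world has $N_L \gg N_S$, the non-anthropic priors are equal, and the evidence $E$ is that I find myself among the first $N_S$ observers. Under SSA,

$$P(E \mid S) = 1, \qquad P(E \mid L) = \frac{N_S}{N_L}, \qquad \text{so} \qquad P(S \mid E) = \frac{N_L}{N_L + N_S} \approx 1,$$

which is the Doomsday update. Under SIA the priors get re-weighted by the number of observers, so

$$P(S \mid E) \propto N_S \cdot 1 = N_S, \qquad P(L \mid E) \propto N_L \cdot \frac{N_S}{N_L} = N_S,$$

and the posterior stays at 50/50: no update from finding yourself early.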
A total utilitarian observing from outside both worlds will care just as much about the early population of a long world as the total population of a short world, so the expected value of both reference classes is the same. This suggests to me that if I care about myself, I should be indifferent between the possibilities that I’m early and that I’m in a world with a short lifespan. Of course, if my decisions in one world will affect more people, then I should adjust my actions accordingly.
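In the same toy setup as above (the Long world’s early segment has the same size $N_S$ as the Short world’s total population, with equal priors), the outside-view expected values are

$$EV(\text{early segment of } L) = \tfrac{1}{2} N_S = EV(\text{all of } S),$$

which is what the indifference claim amounts to.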