formal alignment: what it is, and some proposals

post by Tamsin Leake (carado-1) · 2023-01-29T11:32:33.239Z · LW · GW · 3 comments

This is a link post for https://carado.moe/formal-alignment.html


what i call "formal alignment" is an approach to solving AI alignment [LW · GW] that consists of:

- determining a formal goal such that a system maximizing it would produce desirable outcomes
- figuring out how to build a system that actually pursues that formal goal

those two points correspond to formal alignment's notions of outer and inner alignment, respectively: determining what formal thing to align the AI to, and figuring out how to build something that is indeed aligned to it without running into inner misalignment issues.

for reasons why i think this is the least hopeless path to saving the world, see my outlook on AI risk mitigation [LW · GW]. the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question "will maximizing this with arbitrary capabilities produce desirable outcomes?" where the answer does not depend, at the limit, on what performs the maximization. and the fact that such a formal thing is aligned in the limit makes it robust to sharp left turns [LW · GW]. what remains then is just "bridging the gap": getting from eventual to continuous alignment, perhaps by ensuring the right ordering of attained capabilities [LW · GW].

potential formal alignment ideas include:

- vanessa kosoy's PreDCA
- june ku's metaethical AI
- IGP
- my QACI
- UAT

if there are formal alignment ideas i'm missing, please tell me about them and i'll add them here.

because these various proposals consist of putting together a formal mathematical expression, they rely on finding various true names [LW · GW]. for example: PreDCA tries to put together the true names for causality, agency, and the AI's predecessor; IGP requires the true name for computing a program forwards; QACI requires a true name for identifying pieces of data in causal worlds, and replacing them with counterfactual alternatives; UAT requires the true names for parent universe/simulation, control over resources, and comparing amounts of resources with those in the AI's future lightcone.

see also: clarifying formal alignment implementation

3 comments


comment by Mitchell_Porter · 2023-02-14T05:28:50.179Z · LW(p) · GW(p)

Nice to see someone who wants to directly tackle the big problem. Also nice to see someone who appreciates June Ku's work. 

comment by Roman Leventov · 2023-05-06T15:00:37.804Z · LW(p) · GW(p)

the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question "will maximizing this with arbitrary capabilities produce desirable outcomes?" where the answer does not depend, at the limit, on what does the maximization.

I don't know about other proposals because I'm not familiar with them, but Metaethical AI actually describes the machinery of the agent, hence "the answer" does depend "on what does the maximisation".