a rough sketch of formal aligned AI using QACI

post by Tamsin Leake (carado-1) · 2022-12-11T23:40:37.536Z

This is a link post for https://carado.moe/rough-sketch-formal-aligned-ai.html

in this post, i put forth some of my current thoughts about the shape of a formal aligned AI using QACI for its decision. i say "decision" in the singular here, as a single decision is sufficient when the AI's decision can be "run me again but with these different inputs". as it turns out, this doesn't require solving as many things as i'd thought: it seems like QACI might be general enough to delegate picking a decision theory and solving embedded agency to the counterfactual consideration of the past user.

we'll posit:

we'll define the following:

a "carver" function which returns a set of tuples of:

this is done by splitting a world into the piece of data , and "everything else", denoted . in practice with arbitrary other worlds, would return "garbage" the immense majority of the time, but the hope is that given a same carving for the question , a same carving for the answer would work often enough to give a signal that would tend to beat the overall noise of the failing cases.
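as a toy illustration of the splitting idea only (not the formalism of this post), here is a minimal sketch. it assumes a world is a plain byte string and a carving is a hypothetical `(offset, length)` pair; the names `Carving`, `carve`, `uncarve` and `all_carvings` are made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Carving(NamedTuple):
    """Hypothetical carving: where a piece of data sits inside a world."""
    offset: int
    length: int


def carve(world: bytes, c: Carving) -> Tuple[bytes, bytes]:
    """Split a world into (piece of data, everything else)."""
    data = world[c.offset:c.offset + c.length]
    rest = world[:c.offset] + world[c.offset + c.length:]
    return data, rest


def uncarve(data: bytes, rest: bytes, c: Carving) -> bytes:
    """Rebuild a world by putting a (possibly counterfactual) piece of data
    back into the "everything else" part at the carving's offset."""
    return rest[:c.offset] + data + rest[c.offset:]


def all_carvings(world: bytes, length: int) -> List[Carving]:
    """Enumerate every candidate location for a piece of the given length.
    Most of these point at garbage; at most a few point at the real data."""
    return [Carving(o, length) for o in range(len(world) - length + 1)]
```

in this toy setting, carving out the question and calling `uncarve` with a different byte string of any length yields a counterfactual world. almost every element of `all_carvings` is a "garbage" carving, which is the noise the rest of the scheme has to outweigh.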

we'll define , the question-answer counterfactual interval device used to consider answers to counterfactual questions , given a world hypothesis and a known question and answer :

note how measures the simplicity of all four functions together so as to favor them being simple but also similar, but ignores the simplicity of the "rest of the world" values.
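to make "simple but also similar" concrete, here is a rough sketch that scores several function descriptions jointly. it assumes the functions are given as byte strings of source text and uses compressed length as a crude, computable stand-in for a real simplicity measure; this is a swapped-in proxy, not the measure used in this post, and `joint_cost` is a hypothetical name.

```python
import zlib
from typing import Sequence


def joint_cost(function_sources: Sequence[bytes]) -> int:
    """Cost of describing all the functions together.

    Compressing the concatenation (rather than each function separately)
    means shared structure is only paid for once, so a set of functions
    that are simple AND similar gets a lower cost. The "rest of the world"
    values are deliberately not an input here, so they cost nothing.
    """
    blob = b"\n".join(function_sources)
    return len(zlib.compress(blob, 9))
```

scoring the functions separately and summing would reward simplicity but not similarity; scoring the concatenation is one way to get both.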

finally, we can define our AI as a function of but also an observation which could, in practice, be anything that lets the AI as well as the user better locate themselves in the set of possible worlds.

what's going on here is that, for each world hypothesis weighted by its simplicity, we're finding carvings for and that let us inject a counterfactual question and get a counterfactual answer, favoring simpler carvings using . the counterfactual rerun of the past user is given the world hypothesis being considered, as well as the action . we're not giving the user because that would make the counterfactual question guaranteed to be strictly larger than the original question, which seems like it might break things, and is not absolutely necessary. the user gets to return an arbitrary function to be evaluated, to which are actually passed. we are, however, clamping the result of inside and then multiplying it by ourselves, so as to avoid erroneous carvings getting to influence the decision process too much by returning arbitrarily large values. the hope here is merely that desired carvings generally outweigh undesired carvings, and especially adversarial carvings.
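the clamping step can be shown with a small finite sketch. it assumes a finite list of world hypotheses, each with a weight and a list of `(carving_weight, scoring_function)` pairs, and it assumes the clamp interval is [0, 1]; both the data layout and the interval are assumptions made for this example, as are the names `clamp01`, `score_action` and `choose`. the real process is uncomputable and ranges over far more than a short list.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical layout: (world_weight, [(carving_weight, scoring_function), ...])
Hypothesis = Tuple[float, List[Tuple[float, Callable[[str], float]]]]


def clamp01(x: float) -> float:
    """Clamp a returned score into [0, 1] (assumed interval)."""
    return max(0.0, min(1.0, x))


def score_action(action: str, hypotheses: Iterable[Hypothesis]) -> float:
    """Sum clamped scores, with the weights applied by us, outside the clamp.

    Because the weight is multiplied in after clamping, a carving can never
    contribute more than world_weight * carving_weight, whatever it returns.
    """
    total = 0.0
    for world_weight, carvings in hypotheses:
        for carving_weight, scoring_function in carvings:
            total += world_weight * carving_weight * clamp01(scoring_function(action))
    return total


def choose(actions: Iterable[str], hypotheses: List[Hypothesis]) -> str:
    """Pick the action with the highest total weighted, clamped score."""
    return max(actions, key=lambda a: score_action(a, hypotheses))
```

with one intended carving of weight 0.5 that prefers action `"a"`, and one erroneous carving of weight 0.1 that returns an enormous value for action `"b"`, the erroneous carving would dominate an unclamped sum, but after clamping it contributes at most 0.1 and `"a"` is chosen.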

ways this could go wrong include:

this is of course highly uncomputable. the intent here is to use something like logical induction to approximate good results to this function. what makes me hopeful that a powerful AI can make helpful guesses as to what actions this process would find, if it is indeed aligned, is that even i, a mere human mind, feel like i can make some helpful guesses as to what actions this process would find.
