Prize for probable problems
post by paulfchristiano · 2018-03-08T16:58:11.536Z · LW · GW · 63 comments
Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.
I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.
This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).
Here’s the plan:
- If you want to explain why this approach is doomed, explore a reason it may be doomed, or argue that it’s doomed, I strongly encourage you to do that.
- Post a link to any relevant research/argument/evidence (a paper, blog post, repo, whatever) in the comments on this post.
- The contest closes April 20.
- You can submit content that was published before this prize was announced.
- I’ll use some process to pick my favorite 1-3 contributions. This might involve delegating to other people or might involve me just picking. I make no promise that my decisions will be defensible.
- I’ll distribute (at least) $10k amongst my favorite contributions.
If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).
This prize is orthogonal and unrelated to the broader AI alignment prize. (Reminder: the next round closes March 31. Feel free to submit something to both.)
This contest is not intended to be “fair”---the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to prove that my approach is promising because no one can prove it’s doomed. The point is just to have a slightly better understanding of the challenges.
Edited to add the results:
- $5k for this post [LW · GW] by Wei_Dai, and the preceding/following discussion, which make some points about the difficulty of learning corrigibility in small pieces.
- $3k for Point 1 from this comment [LW(p) · GW(p)] by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
- $2k for this post [LW · GW] by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than novel observation.)
Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).
Background on what I’m looking for
I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”---points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.
If I thought I had a solution to the alignment problem I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment and I’m looking for arguments that this approach will run into insuperable obstacles. I'm already aware that there are plenty of possible problems. So a convincing argument is trying to establish a universal quantifier over potential solutions to a possible problem.
On the other hand, I’m hoping that we'll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations / adversaries.
Examples of interlocking obstacles I think might totally kill my approach:
- Amplification may be doomed because there are important parts of cognition that are too big to safely learn from a human, yet can’t be safely decomposed. (Relatedly, security amplification might be impossible.)
- A clearer inspection of what amplification needs to do (e.g. building a competitive model of the world in which an amplified human can detect incorrigible behavior) may show that amplification isn’t getting around the fundamental problems that MIRI is interested in and will only work if we develop a much deeper understanding of effective cognition.
- There may be kinds of errors (or malign optimization) that are amplified by amplification and can’t be easily controlled (or this concern might be predictably hard to address in advance by theory+experiment).
- Corrigibility may be incoherent, or may not actually be easy enough to learn, or may not confer the kind of robustness to prediction errors that I’m counting on, or may not be preserved by amplification.
- Satisfying safety properties in the worst case (like corrigibility) may be impossible. See this post for my current thoughts on plausible techniques. (I’m happy to provisionally grant that optimization daemons would be catastrophic if you couldn’t train robust models.)
- Informed oversight might be impossible even if amplification works quite well. (This is most likely to be impossible in the context of determining what behavior is catastrophic.)
I value objections but probably won't have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view, (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own, (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification which involves proponents other than me.
I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.
63 comments
comment by paulfchristiano · 2018-05-08T16:23:21.841Z · LW(p) · GW(p)
The results:
- $5k for this post [LW · GW] by Wei_Dai, and the preceding/following discussion, which make some points about the difficulty of learning corrigibility in small pieces.
- $3k for Point 1 from this comment [LW(p) · GW(p)] by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
- $2k for this post [LW · GW] by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than novel observation.)
Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).
Replies from: Benito↑ comment by Ben Pace (Benito) · 2019-10-19T02:39:09.108Z · LW(p) · GW(p)
Can you link this comment from the OP? I skimmed the whole thread looking for info on who won prizes and managed to miss this on my first pass.
comment by eric_langlois · 2018-04-20T15:28:18.933Z · LW(p) · GW(p)
Point 1: Meta-Execution and Security Amplification
I have a comment on the specific difficulty of meta-execution as an approach to security amplification. I believe that while the framework limits the "corruptibility" of the individual agents, the system as a whole is still quite vulnerable to adversarial inputs.
As far as I can tell, the meta-execution framework is Turing complete. You could store the tape contents within one pointer and the head location in another, or there's probably a more direct analogy with lambda calculus. And by Turing complete I mean that there exists some meta-execution agent that, when given any (suitably encoded) description of a Turing machine as input, executes that Turing machine and returns its output.
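To make the encoding concrete, here is a minimal sketch (in Python, with made-up function names; this is not Paul's actual meta-execution interface): each step sees only the machine state, one tape symbol, and opaque pointers to the two halves of the tape, yet the recursion as a whole runs an arbitrary Turing machine.

```python
# Illustrative sketch only (hypothetical interface, not the real meta-execution API).
# Each "node" sees a bounded amount of data: the machine state, one tape symbol,
# and opaque pointers to the two halves of the tape. The recursion as a whole
# nonetheless simulates an arbitrary Turing machine.

BLANK = "_"

def pop(ptr):
    """Dereference one cell of a tape half; an empty half reads as blank."""
    return (BLANK, None) if ptr is None else ptr

def step(state, left, symbol, right, delta):
    """One node's worth of work: apply a single transition, then delegate."""
    new_state, new_symbol, move = delta[(state, symbol)]
    if new_state == "HALT":
        return left, new_symbol, right
    if move == "R":
        head, rest = pop(right)
        return step(new_state, (new_symbol, left), head, rest, delta)
    head, rest = pop(left)
    return step(new_state, rest, head, (new_symbol, right), delta)

# Example machine: flip bits moving right until a blank is reached.
delta = {
    ("flip", "0"): ("flip", "1", "R"),
    ("flip", "1"): ("flip", "0", "R"),
    ("flip", BLANK): ("HALT", BLANK, "R"),
}

def run(symbols):
    right = None
    for s in reversed(symbols[1:]):
        right = (s, right)
    return step("flip", None, symbols[0], right, delta)

print(run(["1", "0", "1"]))  # the system as a whole performed an arbitrary computation
```

The point is only that nothing about the per-node information limit rules out this kind of construction.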
Now, just because the meta-execution framework is Turing complete, this doesn't mean that any particular agent created in this manner is Turing complete. If our agents were in practice Turing complete, I feel like that would defeat the security-amplification purpose of meta-execution. Maybe the individual nodes cannot be corrupted by the limited input they see, but the system as a whole could be made to perform arbitrary computation and produce arbitrary output on specific inputs. The result of "interpret the input as a Turing machine and run it" is probably not the correct or aligned response to those inputs.
Unfortunately, it seems empirically the case that computation systems become Turing complete very easily. Some examples:
- Accidentally Turing Complete
- mov is Turing Complete
- Return oriented Programming
- I would argue that even humans are approximately Turing complete (there are probably input sequences that would cause a person to carry out an arbitrary computation to the best of their abilities). I assume this contributes to the desire for information limitation in meta-execution in the first place.
In particular, return-oriented programming is interesting as an adversarial attack on pre-written programs: it takes advantage of the fact that limited control over execution flow, in the presence of existing code, often forms a Turing complete system, despite the attacker having no control over the existing code.
So I suspect that any meta-computation agent that is practically useful for answering general queries is likely to be Turing complete, and that it will be difficult to avoid Turing completeness (up to a resource limit, which doesn't help the arbitrary code execution problem).
An addition to this argument thanks to William Saunders: We might end up having to accept that our agent will be Turing complete and hope that the malicious inputs are hard to find or work with low probability. But in that case, limiting the amount of information seen by individual nodes may make it harder for the system to detect and avoid these inputs. So what you gain in per-node security you lose in overall system security.
Point 2: IDA in general
More broadly, my main concern with IDA isn't that it has a fatal flaw but that it isn't clear to me how the system helps with ensuring alignment compared to other architectures. I do think that IDA can be used to provide a modest improvement in capabilities with a small loss in alignment (not sure if better or worse than augmenting humans with computational power in other ways), but the alignment error is not zero and increases the larger the improvement in capabilities.
Argument:
- It is easy and tempting for the amplification to result in some form of search ("what is the outcome of this action?" "what is the quality of this outcome?" repeat), which fails if the human might misevaluate some states.
- To avoid that, H needs to be very careful about how they use the system.
- I don't believe that it is practically possible to formally specify the rules H needs to follow in order to produce an aligned system (or if you can, it's just as hard as specifying the rules for a CPU + RAM architecture). You might disagree with this premise, in which case the rest doesn't follow.
- If we can't be confident of the rules H needs to follow, then it is very risky just asking H to act as best as they can in this system without knowing how to prevent things from going wrong.
- Since I don't believe specifying IDA-specific rules is any easier than for other architectures, it seems unlikely to me that you'd have a proof about the alignment or corrigibility of such a system that wouldn't be more generally applicable, in which case why not use a more direct architecture with fewer approximation steps?
To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H's values (retrievable through IRL, for example). And so must A[i] for every i. So the amplification and distillation procedures must preserve the implicit values of H. If we can prove that the distillation preserves implicit values, then it seems plausible that a similar procedure with a similar proof would be able to just directly distill the values of H explicitly, and then we can train an agent to behave optimally with respect to those.
Replies from: Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-04-21T07:24:37.502Z · LW(p) · GW(p)
I find your point 1 very interesting but point 2 may be based in part on a misunderstanding.
To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H’s values (retrievable through IRL, for example). And so must A[i] for every i.
I think this is not how Paul hopes his scheme would work. If you read https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims [LW · GW], it's clear that in the LBO variant of IDA, A[1] can't possibly learn H's values. Instead A[1] is supposed to learn "corrigibility" from H and then after enough amplifications, A[n] will gain the ability to learn values from some external user (who may or may not be H) and then the "corrigibility" that was learned and preserved through the IDA process is supposed to make it want to help the user achieve their values.
Replies from: eric_langlois↑ comment by eric_langlois · 2018-04-24T16:50:31.535Z · LW(p) · GW(p)
I won't deny probably misunderstanding parts of IDA, but if the point is to learn corrigibility from H, couldn't you just say that corrigibility is a value that H has? Then use the same argument with "corrigibility" in place of "value"? (This assumes that corrigibility is entirely defined with reference to H. If not, replace it with the subset that is defined entirely from H; if that is empty then remove H.)
If A[*] has H-derived corrigibility then so must A[1], so distillation must preserve H-derived corrigibility; so we could instead directly distill H-derived corrigibility from H, which can be used to directly train a powerful agent with that property, which can then learn from some other user.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-04-24T17:08:54.690Z · LW(p) · GW(p)
so we could instead directly distill H-derived corrigibility from H, which can be used to directly train a powerful agent with that property
I'm imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible.
If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.
comment by Wei Dai (Wei_Dai) · 2018-03-08T23:00:02.674Z · LW(p) · GW(p)
So a convincing argument is trying to establish a universal quantifier over potential solutions to a possible problem.
This seems like a hard thing to do that most people may not have much experience with (especially since the problems are only defined informally at this point). Can you link to some existing such arguments, either against this AI alignment approach (that previously caused you to change your vision), or on other topics, to give a sense of what kinds of techniques might be helpful for establishing such a universal quantifier?
For example should one try to define the problem formally and then mathematically prove that no solution exists? But how does one show that there's not an alternative formal definition of the problem (that still captures the essence of the informal problem) for which a solution does exist?
Replies from: paulfchristiano, paulfchristiano↑ comment by paulfchristiano · 2018-03-09T06:31:44.256Z · LW(p) · GW(p)
Some examples that come to mind:
- This comment of yours changed my thinking about security amplification by cutting off some lines of argument and forced me to lower my overall goals (though it is simple enough that it feels like it should have been clear in advance). I believe the scheme overall survives, as I discussed at the workshop, but in a slightly different form.
- This post by Jessica both does a good job of overviewing some concerns and makes a novel argument (if the importance weight is slightly wrong then you totally lose) that leaves me very skeptical about any importance-weighting approach to fixing Solomonoff induction, which in turn leaves me more skeptical about "direct" approaches to benign induction.
- In this post I listed implicit ensembling as an approach to robustness. Between Jessica's construction described here and discussions with MIRI folk arguing persuasively that the number of extra bits needed to get honesty was reasonably large such that even a good KWIK bound would be mediocre (partially described by Jessica here) I ended up pessimistic.
None of these posts use heavy machinery.
↑ comment by paulfchristiano · 2018-03-09T06:11:22.348Z · LW(p) · GW(p)
To clarify, when I say "trying to establish" I don't mean "trying to establish in a rigorous way," I just mean that the goal of the informal reasoning should be the informal conclusion "we won't be able to find a way around this problem." It's also not a literal universal quantifier, in the same way that cryptography isn't up against a literal universal quantifier, so I was doubly sloppy.
I don't think that a mathematical proof is likely to be convincing on its own (as you point out, there is a lot of slack in the choice of formalization). It might be helpful as part of an argument, though I doubt that's going to be where the action is.
comment by cousin_it · 2018-03-08T20:50:09.525Z · LW(p) · GW(p)
I'm not bidding for the prize, because I'm judging the other prize and my money situation is okay anyway. But here's one possible objection:
You're hoping that alignment will be preserved across steps. But alignment strongly depends on decisions in extreme situations (very high capability, lots of weirdness), because strong AI is kind of an extreme situation by itself. I don't see why even the first optimization step will preserve alignment w.r.t. extreme situations, because that can't be easily tested. What if the tails come apart immediately?
This is related to your concerns about "security amplification" and "errors that are amplified by amplification", so you're almost certainly aware of this. More generally, it's a special case of Marcello's objection that says path dependence is the main problem. Even a decade later, it's one of the best comments I've ever seen on LW.
Replies from: tristanm, paulfchristiano↑ comment by tristanm · 2018-03-09T18:09:27.107Z · LW(p) · GW(p)
It seems like this objection might be empirically testable, and in fact might be testable even with the capabilities we have right now. For example, Paul posits that AlphaZero is a special case of his amplification scheme. In his post on AlphaZero, he doesn't mention there being an aligned "H" as part of the set-up, but if we imagine there to be one, it seems like the "H" in the AlphaZero situation is really just a fixed, immutable calculation that determines the game state (win/loss/etc.) that can be performed with any board input, with no risk of the calculation being incorrectly performed, and no uncertainty of the result.
The entire board is visible to H, and every board state can be evaluated by H. H does not need to consult A for assistance in determining the game state, and A does not suggest actions that H should take (H always takes one action). The agent A does not choose which portions of the board are visible to H. Because of this, "H" in this scenario might be better understood as an immutable property of the environment rather than an agent that interacts with A and is influenced by A.
My question is, to what degree is the stable convergence of AlphaZero dependent on these properties? And can we alter the setup of AlphaZero such that some or all of these properties are violated? If so, then it seems as though we should be able to actually code up a version in which H still wants to "win", but breaks the independence between A and H, and then see if this results in "weirder" or unstable behavior.
↑ comment by paulfchristiano · 2018-03-09T05:57:17.302Z · LW(p) · GW(p)
Clearly the agent will converge to the mean on unusual situations, since e.g. it has learned a bunch of heuristics that are useful for situations that come up in training. My primary concern is that it remains corrigible (or something like that) in extreme situations. This requires (a) corrigibility makes sense and is sufficiently easy-to-learn (I think it probably does but it's far from certain) and (b) something like these techniques can avoid catastrophic failures off distribution (I suspect they can but am even less confident).
comment by Wei Dai (Wei_Dai) · 2018-04-21T07:00:04.306Z · LW(p) · GW(p)
One concern that I haven't seen anyone express yet is, if we can't discover a theory which assures us that IDA will stay aligned indefinitely as the amplifications iterate, it may become a risky yet extremely tempting piece of technology to deploy, possibly worsening the strategic situation from one where only obviously dangerous AIs like reinforcement learners can be built. If anyone is creating mathematical models of AI safety and strategy, it would be interesting to see if this intuition (that the invention of marginally less risky AIs can actually make things worse overall by increasing incentives to deploy risky AI) can be formalized in math.
A counter-argument here might be that this applies to all AI safety work, so why single out this particular approach. I think some approaches, like MIRI's HRAD, are more obviously unsafe or just infeasible without a strong theoretical framework to build upon, but IDA (especially the HBO variant) looks plausibly safe on its face, even if we never solve problems like how to prevent adversarial attacks on the overseer, or how to ensure that incorrigible optimizations do not creep into the system. Some policy makers are bound to not understand those problems, or see them as esoteric issues not worth worrying about when more obviously important problems are at hand (like how to win a war or not get crushed by economic competition).
comment by Wei Dai (Wei_Dai) · 2018-03-13T09:08:07.907Z · LW(p) · GW(p)
Can iterated amplification recreate a human's ability for creative insight? By that I mean the phenomenon where after thinking about a problem for an extended period of time, from hours to years, a novel solution suddenly pops into your head seemingly out of nowhere. I guess under the hood what's probably happening is that you're building up and testing various conceptual frameworks for thinking about the problem, and using those frameworks and other heuristics to do a guided search of the solution space. The problem for iterated amplification is that we typically don't have introspective access to the conceptual framework building algorithms or the search heuristics that our brains learned or came up with over our lifetimes, so it's unclear how to break down these tasks when faced with a problem that requires creative insight to solve.
If iterated amplification needs to exhibit creative insight in order to succeed (not sure if you can sidestep the problem or find a workaround for it), I suggest that it be included in the set of tasks that Ought will evaluate for their factored cognition project.
EDIT: Maybe this is essentially the same as the translation example, and I'm just not understanding how you're proposing to handle that class of problems?
Replies from: paulfchristiano, William_S↑ comment by paulfchristiano · 2018-03-16T08:52:09.147Z · LW(p) · GW(p)
EDIT: Maybe this is essentially the same as the translation example, and I'm just not understanding how you're proposing to handle that class of problems
Yes, I think these are the same case. The discussion in this thread applies to both. The relevant quote from the OP:
I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL.
I think "copy human expertise by imitation learning," or even "delegate to a human," raise different kinds of problems than RL. I don't think those problems all have clean answers.
Replies from: Wei_Dai, Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-03-17T23:04:17.178Z · LW(p) · GW(p)
Going back to the translation example, I can understand your motivation to restrict attention to some subset of all AI techniques. But I think it's reasonable for people to expect that if you're aiming to be competitive with a certain kind of AI, you'll also aim to avoid ending up not being competitive with minor variations of your own design (in this case, forms of iterated amplification that don't break down tasks into such small pieces). Otherwise, aren't you "cheating" by letting aligned AIs use AI techniques that their competitors aren't allowed to use?
To put it another way, people clearly get the impression from you that there's hope that IDA can simultaneously be aligned and achieve state of the art performance at runtime. See this post where Ajeya Cotra says exactly this:
The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user’s interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.
But the actual situation seems to be that at best IDA can either be aligned (if you break down tasks enough) or achieve state of the art performance (if you don't), but not both at the same time.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-20T05:37:14.139Z · LW(p) · GW(p)
In general, if you have some useful but potentially malign data source (humans, in the translation example) then that's a possible problem---whether you learn from the data source or merely consult it.
You have to solve each instance of that problem in a way that depends on the details of the data source. In the translation example, you need to actually reason about human psychology. In the case of SETI, we need to coordinate to not use malign alien messages (or else opt to let the aliens take over).
Otherwise, aren't you "cheating" by letting aligned AIs use AI techniques that their competitors aren't allowed to use?
I'm just trying to compete with a particular set of AI techniques. Then every time you would have used those (potentially dangerous) techniques, you can instead use the safe alternative we've developed.
If there are other ways to make your AI more powerful, you have to deal with those on your own. That may be learning from human abilities that are entangled with malign behavior in complex ways, or using an AI design that you found in an alien message, or using an unsafe physical process in order to generate large amounts of power, or whatever.
I grant that my definition of the alignment problem would count "learn from malign data source" as an alignment problem, since you ultimately end up with a malign AI, but that problem occurs with or without AI and I don't think it is deceptive to factor that problem out (but I agree that I should be more careful about the statement / switch to a more refined statement).
I also don't think it's a particularly important problem. And it's not what people usually have in mind as a failure mode---I've discussed this problem with a few people, to try to explain some subtleties of the alignment problem, and most people hadn't thought about it and were pretty skeptical. So in those respects I think it's basically fine.
When Ajeya says:
provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.
This is meant to include things like "You don't have a malign data source that you are learning from." I agree that it's slightly misleading if we think that humans are such a data source.
↑ comment by Wei Dai (Wei_Dai) · 2018-03-16T17:54:48.608Z · LW(p) · GW(p)
I think “copy human expertise by imitation learning,” or even “delegate to a human,” raise different kinds of problems than RL. I don’t think those problems all have clean answers.
I think I can restate the problem as about competing with RL: Presumably eventually RL will be as capable as a human (on its own, without copying from or delegating to a human), including on problems that humans need to use "creative insight" on. In order to compete with such RL-based AI with an Amplification-based AI, it seems that H needs to be able to introspectively access their cognitive framework algorithms and search heuristics in order to use them to help break down tasks, but H doesn't have such introspective access, so how does Amplification-based AI compete?
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-20T05:40:30.408Z · LW(p) · GW(p)
If an RL agent can learn to behave creatively, then that implies that amplification from a small core can learn to behave creatively.
This is pretty clear if you don't care about alignment---you can just perform the exponential search within the amplification step, and then amplification is structurally identical to RL. The difficult problem is how to do that without introducing malign optimization. But that's not really about H's abilities.
Replies from: Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-03-20T09:13:52.332Z · LW(p) · GW(p)
This is pretty clear if you don’t care about alignment---you can just perform the exponential search within the amplification step, and then amplification is structurally identical to RL.
I don't follow. I think if you perform the exponential search within the amplification step, amplification would be exponentially slow whereas RL presumably wouldn't be? How would they be structurally identical? (If someone else understands this, please feel free to jump in and explain.)
The difficult problem is how to do that without introducing malign optimization.
Do you consider this problem to be inside your problem scope? I'm guessing yes but I'm not sure and I'm generally still very confused about this. I think it would help a lot if you could give a precise definition of what the scope is.
As another example of my confusion, an RL agent will presumably learn to do symbolic reasoning and perform arbitrary computations either inside its neural network or via an attached general purpose computer, so it could self-modify into or emulate an arbitrary AI. So under one natural definition of "compete", to compete with RL is to compete with every type of AI. You must not be using this definition but I'm not sure what definition you are using. The trouble I'm having is that there seems to be no clear dividing line between "internal cognition the RL agent has learned to do" and "AI technique the RL agent is emulating" but presumably you want to include the former and exclude the latter from your problem definition?
Another example is that you said that you exclude "all failures of competence" and I still only have a vague sense of what that means.
Replies from: paulfchristiano, William_S↑ comment by paulfchristiano · 2018-03-20T16:58:54.800Z · LW(p) · GW(p)
How would they be structurally identical? (If someone else understands this, please feel free to jump in and explain.)
AlphaZero is exactly the same as this: you want to explore an exponentially large search tree. You can't do that. Instead you explore a small part of the search tree. Then you train a model to quickly (lossily) imitate that search. Then you repeat the process, using the learned model in the leaves to effectively search a deeper tree. (Also see Will's comment.)
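Schematically (hypothetical helper signatures; just a sketch of the shared loop, not an actual AlphaZero or amplification implementation):

```python
def expert_iteration(model, problems, amplify, distill, rounds):
    """The loop shared by AlphaZero-style training and iterated amplification.

    amplify(problem, model): a bounded search that consults `model` at its leaves
        and returns an answer better than the model gives on its own.
    distill(model, data): trains a fast model to (lossily) imitate those answers.
    Both are assumed interfaces for this sketch.
    """
    for _ in range(rounds):
        data = [(p, amplify(p, model)) for p in problems]  # explore a small part of the tree
        model = distill(model, data)                       # imitate the improved answers
    return model                                           # effectively searches a deeper tree
```

Each round, the learned model makes the search affordable, and the search produces better training targets for the next model.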
Do you consider this problem to be inside your problem scope? I'm guessing yes but I'm not sure and I'm generally still very confused about this.
For now let's restrict attention to the particular RL algorithms mentioned in the post, to make definitions clearer.
By default these techniques yield an unaligned AI.
I want a version of those techniques that produces aligned AI, which is trying to help us get what we want.
That aligned AI may still need to do dangerous things, e.g. "build a new AI" or "form an organization with a precise and immutable mission statement" or whatever. Alignment doesn't imply "never has to deal with a difficult situation again," and I'm not (now) trying to solve alignment for all possible future AI techniques.
We would have encountered those problems even if we replaced the aligned AI with a human. If the AI is aligned, it will at least be trying to solve those problems. But even as such, it may fail. And separately from whether we solve the alignment problem, we may build an incompetent AI (e.g. it may be worse at solving the next round of the alignment problem).
The goal is to get out an AI that is trying to do the right thing. A good litmus test is whether the same problem would occur with a secure human. (Or with a human who happened to be very smart, or with a large group of humans...). If so, then that's out of scope for me.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-20T17:13:59.209Z · LW(p) · GW(p)
To address the example you gave: doing some optimization without introducing misalignment is necessary to perform as well as the RL techniques we are discussing. Avoiding that optimization is in scope.
There may be other optimization or heuristics that an RL agent (or an aligned human) would eventually use in order to perform well, e.g. using a certain kind of external aid. That's out of scope, because we aren't trying to compete with all of the things that an RL agent will eventually do (as you say, a powerful RL agent will eventually learn to do everything...) we are trying to compete with the RL algorithm itself.
We need an aligned version of the optimization done by the RL algorithm, not all optimization that the RL agent will eventually decide to do.
↑ comment by William_S · 2018-03-20T15:22:02.226Z · LW(p) · GW(p)
I think the way to do exponential search in amplification without being exponentially slow is to not try to do the search in one amplification step, but start with smaller problems, learn how to solve those efficiently, then use that knowledge to speed up the search in later iteration-amplification rounds.
Suppose we have some problem with branching factor 2 (ie. searching for binary strings that fit some criteria)
- Start with agent A[0].
- Amplify agent A[0] to solve problems which require searching a tree of depth d_1, at cost O(2^(d_1)).
- Distill agent A[1], which uses the output of the amplification process to learn how to solve problems of depth d_1 faster than the amplified A[0], ideally as fast as any other ML approach. One way would be to learn heuristics for which parts of the tree don't contain useful information and can be pruned.
- Amplify agent A[1], which can use the heuristics it has learned to prune the tree much earlier and solve problems of depth d_2 > d_1 at cost well below 2^(d_2).
- Distill agent A[2], which can now efficiently solve problems of depth d_2.
- If this process is efficient enough, the training cost can be less than O(2^(d_n)) to get an agent that solves problems of depth d_n (and the runtime cost is as good as the runtime cost of the ML algorithm that implements the distilled agent).
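Here's a schematic sketch of the schedule above (hypothetical helpers; not a worked system): each amplification round searches a deeper tree using the pruning heuristic the previous distilled agent learned, and each distillation trains the next heuristic on the traces of that search.

```python
def search_with_pruning(accepts, depth, keep_subtree):
    """Depth-limited search over binary strings. `keep_subtree(prefix)` is the
    learned heuristic deciding which prefixes are worth expanding."""
    frontier, trace = [()], []
    while frontier:
        prefix = frontier.pop()
        trace.append(prefix)
        if len(prefix) == depth:
            if accepts(prefix):
                return prefix, trace
            continue
        for bit in (0, 1):
            child = prefix + (bit,)
            if keep_subtree(child):
                frontier.append(child)
    return None, trace

def iterated_search_training(accepts, depths, distill):
    """depths = [d_1, d_2, ...]; distill(trace) returns a cheaper keep_subtree
    heuristic learned from the search trace (the distillation step)."""
    heuristic = lambda prefix: True              # A[0]: prune nothing
    for d in depths:                             # amplify: search depth d with current heuristic
        _, trace = search_with_pruning(accepts, d, heuristic)
        heuristic = distill(trace)               # distill: learn a faster pruning rule
    return heuristic
```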
Replies from: Wei_Dai, ESRogs↑ comment by Wei Dai (Wei_Dai) · 2018-03-20T21:15:22.222Z · LW(p) · GW(p)
Thanks for the explanation, but I'm not seeing how this would work in general. Let's use Paul's notation where B[n] = Amplify(A[n]) and A[n+1] = Distill(B[n]). And say we're searching for binary strings s such that F(s, t)=1 for fixed F and variable t. So we start with A[0] (a human) and distill+amplify it into B[1], which searches strings up to length d_1 (which requires searching a tree of depth d_1 at cost 2^(d_1)). Then we distill that into A[2], which learns how to solve problems of depth d_1 faster than B[1], and suppose it does that by learning the heuristic that the first bit of s is almost always the parity of t.
Now suppose I'm an instance of A[2] running at the top level of B[2]. I have access to other instances of A[2] which can solve this problem up to length d_1, but I need to solve a problem of length d_2 (which let's say is d_1 + 1). So I ask another instance of A[2] "Find a string s of length d_2 such that s starts with 0 and F(s, t)=1", followed by a query to another: "Find a string s of length d_2 such that s starts with 1 and F(s, t)=1". Well, the heuristic that A[2] learned doesn't help to speed up those queries, so each of them is still going to take time around 2^(d_1).
The problem here as I see it is that it's not clear how I, as A[2], can make use of the previously learned heuristics to help solve larger problems more efficiently, since I have no introspective access to them. If there's a way to do that and I'm missing it, please let me know.
(I posted this from greaterwrong.com and it seems the LaTeX isn't working. Someone please PM me if you know how to fix this.)
[Habryka edit: Fixed your LaTeX for you. GreaterWrong doesn't currently support LaTeX I think. We would have to either improve our API, or greaterwrong would need to do some more fancy client-side processing to make it work]
Replies from: William_S, paulfchristiano, ESRogs↑ comment by William_S · 2018-03-20T22:44:43.777Z · LW(p) · GW(p)
For this example, I think you can do this if you implement the additional query "How likely is the search on [partial solution] to return a complete solution?". This is asked of all potential branches before recursing into them. A[2] learns to answer the solution probability query efficiently.
Then in the amplification of A[2], at the top level of B[2] looking for a solution to a problem of length d_2, the root agent first asks "How likely is the search on [string starting with 0] to return a complete solution?" and "How likely is the search on [string starting with 1] to return a complete solution?". Then the root agent first queries whichever subtree is most likely to contain a solution. (This doesn't improve worst-case running time, but does improve average-case running time.)
This is analogous to running a value estimation network in tree search, and then picking the most promising node to query first.
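As a sketch (hypothetical `estimate` query, illustrative only), the root's policy would look something like the following: ask for a solution-probability estimate on each branch, then recurse into the more promising branch first, which improves average-case but not worst-case cost.

```python
def best_first_search(prefix, depth, accepts, estimate):
    """Recursive search that orders branches by the learned answer to
    "How likely is the search on [partial solution] to return a complete solution?"."""
    if len(prefix) == depth:
        return prefix if accepts(prefix) else None
    branches = sorted((prefix + (bit,) for bit in (0, 1)),
                      key=estimate, reverse=True)   # most promising subtree first
    for child in branches:
        found = best_first_search(child, depth, accepts, estimate)
        if found is not None:
            return found
    return None
```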
Replies from: Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-03-21T00:24:04.127Z · LW(p) · GW(p)
This seems to require that the heuristic be of a certain form and you know what that form is. What if it's more general, like run algorithm G on t to produce a list of guesses for s, then check the guesses in that order?
Replies from: William_S↑ comment by William_S · 2018-03-21T19:35:43.323Z · LW(p) · GW(p)
1. I don't think that every heuristic A[2] could use to solve problems of depth d_1 needs to be applicable to performing the search of depth d_2 - we only need enough heuristics to be usable to keep increasing the search depth at each amplification round in an efficient manner. It's possible that some of the value of heuristics like "the solution is likely to be an output of algorithm G" could be (imperfectly) captured through some small universal set of heuristics that we can specify how to learn and exploit. (I think that variations on "How likely is the search on [partial solution] to produce an answer?" might get us pretty far).
The AlphaGo analogy is that the original supervised move prediction algorithm didn't necessarily learn every heuristic that the experts used, but just learned enough to be able to efficiently guide the MCTS to better performance.
(Though I do think that imperfectly learning heuristics might cause alignment problems without a solution to the aligned search problem).
2. This isn't a problem if, once the agent can run algorithm G on t for problems of depth d_1, it can directly generalize to applying G to problems of depth d_2. Simple deep RL methods aren't good at this kind of task, but things like the Neural Turing Machine are trying to do better on this sort of task. So the ability to learn efficient exponential search could be limited by the underlying agent's capability; for some capability range, a problem could be directly solved by an unaligned agent, but couldn't be solved by an aligned agent. This isn't a problem if we can surpass that level of capability.
I'm not sure that these considerations fix the problem entirely, or whether Paul would take a different approach.
It also might be worth coming up with a concrete example where some heuristics are not straightforward to generalize from smaller to larger problems, and it seems like this will prevent efficiently learning to solve large problems. The problem, however, would need to be something that humans can solve (ie. finding a string that hashed to a particular value using a cryptographic hash function would be hard to generalize any heuristics from, but I don't think humans could do it either so it's outside of scope).
↑ comment by paulfchristiano · 2018-03-21T00:44:02.697Z · LW(p) · GW(p)
If an RL agent can't solve a task, then I'm fine with amplification being unable to solve it.
Replies from: Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-03-22T18:09:12.878Z · LW(p) · GW(p)
If an RL agent can’t solve a task, then I’m fine with amplification being unable to solve it.
I guess by "RL agent" you mean RL agents of certain specific designs, such as the one you just blogged about, and not RL agents in general, since as far as we know there aren't any tasks that RL agents in general can't solve?
BTW, I find it hard to understand your overall optimism (only 10-20% expected value loss from AI risk), since there are so many disjunctive risks to just being able to design an aligned AI that's competitive with certain kinds of RL agents (such as not solving one of the obstacles you list in the OP), and even if we succeed in doing that we'd have to come up with more capable aligned designs that would be competitive with more advanced RL (or other kinds of) agents. Have you explained this optimism somewhere?
↑ comment by ESRogs · 2018-03-20T22:47:51.662Z · LW(p) · GW(p)
(I posted this from greaterwrong.com and it seems the LaTeX isn't working. Someone please PM me if you know how to fix this.)
In the LesserWrong comment editor, select the text you want to be LaTeX, then press Ctrl+4 (or Cmd+4 on Mac). You can delete the dollar signs.
(Commenting rather than PM'ing so that others will benefit as well.)
↑ comment by ESRogs · 2018-03-20T18:50:05.321Z · LW(p) · GW(p)
and the runtime cost is as good as the ML algorithm you're using to distill new agents
Why would the runtime cost be on par with the distillation cost?
Replies from: William_S↑ comment by William_S · 2018-03-20T19:20:47.004Z · LW(p) · GW(p)
Sorry, that was a bit confusing, edited to clarify. What I mean is, you have some algorithm you're using to implement new agents, and that algorithm has a training cost (that you pay during distillation) and a runtime cost (that you pay when you apply the agent). The runtime cost of the distilled agent can be as good as the runtime cost of an unaligned agent implemented by the same algorithm (part of Paul's claim about being competitive with unaligned agents).
↑ comment by William_S · 2018-03-14T18:59:00.477Z · LW(p) · GW(p)
If I understand you correctly, it sounds like the problem that "creative insight" is solving is "searching through a large space of possible solutions and finding a good one". It seems like Amplification could, given enough time, systematically search through all possible solutions (ie. generate all bit sequences, turn them into strings, evaluate whether they are a solution). But the problem with that is that it will likely yield a misaligned solution (assuming the evaluation of solutions is imperfect). Humans, when performing "creative insight", 1) have their search process guided by a bunch of these hard-to-access heuristics/conceptual frameworks which steer the search towards useful and benign parts of the search space, and 2) are limited in how large a solution space they can search. These combine so that the human creative search typically yields aligned solutions, or terminates without finding a solution after doing a bounded amount of computation. Does this fit with what you are thinking of as "creative insight"?
My understanding of what I've read about Paul's approach suggests the solution to both the translation problem and creativity would be to extract any search heuristics/conceptual framework algorithms that humans do have access to, and then still limit the search, sacrificing solution quality but maintaining corrigibility. Is your concern then that this amplification-based search would not perform well enough in practice to be useful (ie. yielding a good solution and terminating before coming across a malign solution)?
Replies from: Wei_Dai↑ comment by Wei Dai (Wei_Dai) · 2018-03-15T01:45:39.229Z · LW(p) · GW(p)
It seems like Amplification could, given enough time, systematically search through all possible solutions (ie. generate all bit sequences, turn them into strings, evaluate whether they are a solution). But the problem with that is that it will likely yield an misaligned solution (assuming the evaluation of solutions is imperfect).
Well I was thinking that before this alignment problem could even happen, a brute force search would be exponentially expensive so Amplification wouldn't work at all in practice on a question that requires "creative insight".
My understanding of what I’ve read about Paul’s approach suggests the solution to both the translation problem and creativity would be extract any search heuristics/conceptual framework algorithms that humans do have access to, and then still limit the search, sacrificing solution quality but maintaining corrigibility.
My concern is that this won't be competitive with other AGI approaches that don't try to maintain alignment/corrigibility, for example using reinforcement learning to "raise" an AGI through a series of increasingly complex virtual environments, and letting the AGI incrementally build its own search heuristics and conceptual framework algorithms.
BTW, thanks for trying to understand Paul's ideas and engaging in these discussions. It would be nice to get a critical mass of people to understand these ideas well enough to sustain discussions and make progress without Paul having to be present all the time.
comment by William_S · 2018-03-12T15:01:00.983Z · LW(p) · GW(p)
Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility vs. just thinking that it will scale while maintaining corrigibility to the point where we "get our house in order"? (I feel like this would help me in understanding the importance of particular objections, though objections relevant to both scenarios are probably still relevant).
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-13T04:09:49.538Z · LW(p) · GW(p)
I'm hoping for a solution that scales indefinitely if you hold the AI design fixed. In practice you'd face a sequence of different AI alignment problems (one for each technique), and I don't expect it to solve all of those, just one---i.e., if you solved alignment, you could still easily die because your AI failed to solve the next iteration of the AI alignment problem.
Arguing that this wouldn't be the case---pointing to a clear place where my proposal tops out---definitely counts for the purposes of the prize. I do think that a significant fraction of my EV comes from the case where my approach can't get you all the way, because it tops out somewhere, but if I'm convinced that it tops out somewhere I'll still feel way more pessimistic about the scheme.
Replies from: William_S↑ comment by William_S · 2018-03-13T15:13:07.552Z · LW(p) · GW(p)
By "AI design" I'm assuming you are referring to the learning algorithm and runtime/inference algorithm of the agent A in the amplification scheme.
In that case, I hadn't thought of the system as only needing to work with respect to the learning algorithm. Maybe it's possible/useful to reason about limited versions which are corrigible with respect to some simple current technique (just not very competent).
comment by Stuart_Armstrong · 2018-04-10T11:10:13.426Z · LW(p) · GW(p)
A new post, looking into the strong version of corrigibility, arguing that it doesn't make sense without a full understanding of human values (and that with that understanding, it's redundant). Relevant to Amplification/Distillation since corrigibility is one of the aims of that framework.
https://www.lesswrong.com/posts/T5ZyNq3fzN59aQG5y/the-limits-of-corrigibility
comment by William_S · 2018-03-12T16:03:36.284Z · LW(p) · GW(p)
Paul, it might be helpful to clarify what your approach relies on with regard to bounds on the amount of overhead (training time, human sample complexity), or the amount of overhead that would doom your approach. If I recall correctly, I think you've wanted the approach to have some reasonable constant overhead relative to an unaligned system, though I can't find the post at the moment? It might also be helpful to have bounds, or at least your guesses, on the magnitude of numbers related to individual components (ie. the rough numbers in the Universality and Security amplification post).
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-13T04:07:07.421Z · LW(p) · GW(p)
I'm aiming for sublinear overhead (so the proportional overhead falls to 0 as the AI becomes more complex). If you told me that overhead was a constant, like 1x or 10x the cost of the unaligned AI, that would make me pessimistic about the approach (with the degree of pessimism depending on the particular constant). It wouldn't be doomed per se but it would qualify for winning the prize. If you told me that the overhead grew faster than the cost of the unaligned AI, I'd consider that doom.
comment by John_Maxwell (John_Maxwell_IV) · 2018-03-10T02:32:05.182Z · LW(p) · GW(p)
I took a quick look at your proposal; here are some quick thoughts on why I'm not super excited:
- It seems brittle. If there's miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.
- It seems like an awkward framework for achieving decisive strategic advantage. I think decisive strategic advantage will be achieved not through getting menial tasks done at a really fast rate, so much as making lots of discoveries, generating lots of ideas, and analyzing lots of possible scenarios. For this, a shared knowledge base seems ideal. In your framework, it looks like new insights get shared with subordinates by retraining them? This seems awkward and slow. And if insights need to travel up and down the hierarchy to get shared, this introduces loads of opportunities for miscommunication (see previous bullet point). To put it another way, this looks like a speed superintelligence at best; I think a quality superintelligence will beat it.
- The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.
In general, your proposal looks rather like a human organization, or the human economy. There are people called "CEOs" who delegate tasks to subordinates, who delegate tasks to subordinates and so on. I expect that if your proposal works better than existing organizational models, companies will have a financial incentive to adopt it regardless of what you do. As AI and machine learning advance, I expect that AI and machine learning will gradually swallow menial jobs in organizations, and the remaining humans will supervise AIs. Replacing human supervisors with AIs will be the logical next step. If you think this is the best way to go, perhaps you could raise venture capital to help accelerate this transition; I assume there will be a lot of money to be made. In any case, you could brainstorm failure modes for your proposal by looking at how organizations fail.
I've spent a lot of time thinking about this, and I still don't understand why a very simple approach to AI alignment (train a well-calibrated model of human preferences, have the AI ask for clarification when necessary) is unworkable. All the objections I've seen to this approach seem either confused or solvable with some effort. Building well-calibrated statistical models of complex phenomena that generalize well is a hard problem. But I think an AI would likely need a solution to this problem to take over the world anyway.
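For concreteness, the control loop I have in mind looks roughly like this (a toy sketch; it assumes a calibrated preference model exposing a confidence score, and all names here are made up):

```python
def act_or_ask(candidate_actions, preference_model, operator, confidence_threshold=0.95):
    """Act on the model's top-rated action only when the model is confident;
    otherwise ask the operator for clarification and fold the answer back in."""
    scored = [(action, *preference_model.score(action))   # -> (value, confidence), assumed API
              for action in candidate_actions]
    best_action, value, confidence = max(scored, key=lambda s: s[1])
    if confidence >= confidence_threshold:
        return best_action
    label = operator.clarify(best_action)        # e.g. approve / reject / rank alternatives
    preference_model.update(best_action, label)  # new training data from the clarification
    return None                                  # defer acting until confidence improves
```

Whether the calibration can be made good enough to trust is, as I said, the hard part.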
In software engineering terms, your proposal appears to couple together value learning, making predictions, making plans, and taking action. I think an FAI will be both safer and more powerful if these concerns are decoupled.
Replies from: William_S↑ comment by William_S · 2018-03-12T15:16:21.301Z · LW(p) · GW(p)
It seems brittle. If there's miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.
It seems to me that the amplification scheme could include redundant processing/error correction - ie. ask subordinates to solve a problem in several different ways, then look at whether they disagree and take majority vote or flag disagreements as indicating that something dangerous is going on, and this could deal with this sort of problem.
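A minimal sketch of what I mean (hypothetical interface; the subordinates are just callables here):

```python
from collections import Counter

class DisagreementDetected(Exception):
    """Raised so the query gets escalated/flagged instead of acted upon."""

def redundant_query(question, subordinates, max_dissent=0):
    """Pose the same subquestion to several subordinates (ideally solving it in
    different ways), take the majority answer, and flag excessive disagreement."""
    answers = [ask(question) for ask in subordinates]
    (majority_answer, votes), = Counter(answers).most_common(1)
    if len(answers) - votes > max_dissent:
        raise DisagreementDetected(question, answers)
    return majority_answer
```

Real error correction would need more care (e.g. about correlated failures), but this shows the shape of the redundancy.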
The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.
It seems to me that balancing the risks of acting vs. taking time to ask questions depending on the current situation falls under Paul's notion of corrigibility, so it would happen appropriately (as long as you maintain the possibility of asking questions as an output of the system, and the input appropriately describes the state of the world relevant to evaluating whether you have decisive strategic advantage)
Replies from: paulfchristiano, John_Maxwell_IV↑ comment by paulfchristiano · 2018-03-13T04:12:33.391Z · LW(p) · GW(p)
It seems to me that balancing the risks of acting vs. taking time to ask questions depending on the current situation falls under Paul's notion of corrigibility
I definitely agree that balancing costs vs. VOI falls under the behavior-to-be-learned, and don't see why it would require retraining. You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in. If you had to retrain every time the situation changed, you'd never be able to do anything at all :)
(That said, to the extent that corrigibility is a plausible candidate for a worst-case property, it wouldn't be guaranteeing any kind of competent balancing of costs and benefits.)
Replies from: John_Maxwell_IV, John_Maxwell_IV↑ comment by John_Maxwell (John_Maxwell_IV) · 2018-03-14T06:07:59.349Z · LW(p) · GW(p)
Figuring out whether to act vs ask questions feels like a fundamentally epistemic judgement: How confident am I in my knowledge that this is what my operator wants me to do? How important do I believe this aspect of my task to be, and how confident am I in my importance assessment? What is the likely cost of delaying in order to ask my operator a question? Etc. My intuition is that this problem is therefore best viewed within an epistemic framework (trying to have well-calibrated knowledge) rather than a behavioral one (trying to mimic instances of question-asking in the training data). Giving an agent examples of cases where it should ask questions feels like about as much of a solution to the problem of corrigibility as the use of soft labels (probability targets that are neither 0 nor 1) is a solution to the problem of calibration in a supervised learning context. It's a good start, but I'd prefer a solution with a stronger justification behind it. However, if we did have a solution with a strong justification, FAI starts looking pretty easy to me.
Replies from: William_S↑ comment by William_S · 2018-03-14T15:09:42.723Z · LW(p) · GW(p)
My impression (shaped by this example of amplification) is that the agents in the amplification tree would be considering exactly these sorts of epistemic questions. (There is then the separate question of how faithfully this behaviour is reproduced/generalized during distillation.)
↑ comment by John_Maxwell (John_Maxwell_IV) · 2018-03-14T04:41:38.565Z · LW(p) · GW(p)
You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in.
Sure. But you can't train it on every possible situation--that would take an infinite amount of time.
And some situations may be difficult to train for--for example, you aren't actually going to be in a situation where you have a decisive strategic advantage during training. So then the question is whether your learning algorithms are capable of generalizing well from whatever training data you are able to provide for them.
There's an analogy to organizations. Nokia used to be worth over $290 billion. Now it's worth $33 billion. The company was dominant in hardware, and it failed to adapt when software became more important than hardware. In order to adapt successfully, I assume Nokia would have needed to retrain a lot of employees. Managers also would have needed retraining: Running a hardware company and running a software company are different. But managers and employees continued to operate based on old intuitions even after the situation changed, and the outcome was catastrophic.
If you do have learning algorithms that generalize well on complex problems, then AI alignment seems solved anyway: train a model of your values that generalizes well, and use that as your AI's utility function.
(I'm still not sure I fully understand what you're trying to do with your proposal, so I guess you could see my comments as an attempt to poke at it :)
Replies from: William_S↑ comment by William_S · 2018-03-14T15:00:33.881Z · LW(p) · GW(p)
I think this decomposes into two questions: 1) does the amplification process, given humans/trained agents, solve the problem in a generalizable way (i.e. would HCH solve the problem correctly)? 2) does this generalizability break during the distillation process? (I'm not quite sure which you're pointing at here.)
For the amplification process, I think it would deal with things in an appropriately generalizable way. You are doing something a bit more like training the agents to form nodes in a decision tree that captures all of the important questions you would need to answer to figure out what to do next, including components that examine the situation in detail. Paul has written up an example of what amplification might look like, which I think helped me to understand the level of abstraction that things are working at. The claim then is that expanding the decision tree captures all of the relevant considerations (possibly at some abstract level, i.e. instead of capturing considerations directly it captures the thing that generates those considerations), and so will properly generalize to a new decision.
I'm less sure at this point about how well distillation would work; in my understanding this might require providing some kind of continual supervision (if the trained agent goes into a sufficiently new input domain, then it requests more labels on this new input domain from its overseer), or might this be something Paul expects to fall out of informed oversight + corrigibility?
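A minimal sketch of the recursive decomposition picture described above, with `answer_directly`, `decompose`, and `combine` as invented placeholders for what the human / distilled agent would actually do at each node of the tree:

```python
def amplified_answer(question, answer_directly, decompose, combine, depth=3):
    """Toy amplification step: split a question into subquestions, answer those
    recursively, and combine the sub-answers; fall back to answering directly
    at the base of the tree or when no useful decomposition exists."""
    if depth == 0:
        return answer_directly(question)
    subquestions = decompose(question)
    if not subquestions:
        return answer_directly(question)
    sub_answers = [amplified_answer(q, answer_directly, decompose, combine, depth - 1)
                   for q in subquestions]
    return combine(question, sub_answers)
```

In the actual proposal the same human (or distilled agent) plays all three roles; the split here is only to make the recursion explicit.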
↑ comment by John_Maxwell (John_Maxwell_IV) · 2018-03-14T05:28:14.603Z · LW(p) · GW(p)
It seems to me that the amplification scheme could include redundant processing/error correction - ie. ask subordinates to solve a problem in several different ways, then look at whether they disagree and take majority vote or flag disagreements as indicating that something dangerous is going on, and this could deal with this sort of problem.
That sounds like a good idea. But I still don't feel like I fully understand what we are getting in return for knowledge isolation. Knowledge isolation is clearly useful sometimes, e.g. for ensemble learning or cross validation. But it feels to me like a framework that allowed us to pool & isolate knowledge in a more fluid way would work better.
Imagine an organization where Alice in Department A needs to tell something to Bob in Department B. Suppose the organization is very strait-laced, and Alice knows that the only way Bob will hear her message is if Alice tells it to her boss, who tells it to the CEO, who tells it to the head of Department B, who tells it to Bob. What is this game of telephone buying us, exactly?
Re: corrigibility, see this comment.
Replies from: William_S↑ comment by William_S · 2018-03-14T14:51:33.430Z · LW(p) · GW(p)
I think (based on reading Paul's blog posts) that knowledge isolation provides these benefits:
The distributions of training and test examples for the distilled agent are as similar as possible (possibly identical, or possibly close enough that you can ask for new training data when you find something too far out of distribution). Suppose we allow for unlimited knowledge sharing. The training data gathered from humans will only include examples of humans processing some limited amount of information, and that information will have been produced in a fairly normal set of circumstances that occur during training. But as the IDA procedure continues, later agents will have to deal with much larger amounts of data generated during potentially weirder circumstances. So distributional shift will become more of a problem.
Security amplification. In Universality and Security Amplification, Paul describes limiting the amount of information accessible to each agent to the extent that we can perform red-teaming over the set of possible inputs an agent will have to process, and so have confidence that agents won't be attacked by any input they receive. He acknowledges that this will limit capabilities (so the system wouldn't be able to, say, translate between languages as well as a human translator). But he claims that 1) the system doesn't necessarily need to perform all tasks itself; instead it can just learn how to safely use external humans or systems, and 2) even the information-limited set of queries the system can answer will still include a "simple core of reasoning" sufficient for this task. (I'm still trying to wrap my head around whether I think this kind of system will be able to have sufficient capabilities.)
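As a rough illustration of the information-limiting idea (the budget and the literal string slicing are invented; the real proposal is about bounding what any one agent attends to, not chopping text):

```python
MAX_CHARS_PER_AGENT = 64  # assumed to be small enough to red-team exhaustively

def split_for_low_bandwidth_agents(text, budget=MAX_CHARS_PER_AGENT):
    """Break a large input into pieces so that no single agent ever processes
    more than `budget` characters; partial results would then be recombined
    higher up in the amplification tree."""
    return [text[i:i + budget] for i in range(0, len(text), budget)]
```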
comment by daozaich · 2018-03-08T23:49:26.744Z · LW(p) · GW(p)
[Meta: Even low-effort engagement, like "known + keyword" or "you misunderstood everything; read <link>" or "go on talking / thinking" is highly appreciated. Stacks grow from the bottom to the top today, unlike x86 or threads on the internet]
------------
Iterative amplification schemes work by having each version i+1 trained by the previous iteration i; and, whenever version i fails at finding a good answer (low confidence in the prediction), punting the question to i−1, until it reaches the human overseer at i=0, which is the ground truth for our purposes. At the same time, the distribution of relevant questions widens at each iteration, as capabilities improve: For example, a very stupid clippy will need to deal with simple linear optimization problems of its supply chain; a very smart clippy will need to figure out a proper plan for world-domination.
Now, in order to correctly punt questions down the call-stack, each level i, when faced with a problem q, must decide whether it can solve it (providing training data for level i+1), or whether it must punt downwards. More precisely, level i must recognize whether q is inside the distribution D_i it was trained with (resp. D'_i, the distribution of problems it can generate / handle post-training).
False positives (underconfidence) lead to increased load on the human overseers / capability loss. False negatives (overconfidence) lead to safety failures: If they happen during training, then the machine learns wrong values / ethics / whatever-it-should-learn; if they happen in production, then they lead to bad actions (I am assuming that no one would run the AI unsupervised, but supervision will, by necessity, be pretty light).
This is a problem even with a very shallow call-stack (few iterations), but it gets much worse with deeper stacks: After all, if even a single false-negative mistake occurs along the chain, we have a safety incident; naively, for a fixed false-negative rate at each level, the probability of getting through without a mistake decays exponentially with depth. In other words, we would need to ensure that the infinite series ε_0 + ε_1 + ε_2 + … converges, where ε_i is the false-negative rate at level i.
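A toy calculation of this compounding effect, under the simplifying assumption that errors at different levels are independent:

```python
def prob_at_least_one_miss(false_negative_rates):
    """Probability that at least one level fails to punt a question it should
    have punted, assuming independent false negatives per level."""
    p_all_ok = 1.0
    for eps in false_negative_rates:
        p_all_ok *= (1.0 - eps)
    return 1.0 - p_all_ok

# Fixed 1% rate per level: the chance of a miss creeps toward 1 as depth grows.
print(prob_at_least_one_miss([0.01] * 100))                           # ~0.63
# Geometrically shrinking rates (convergent series): total risk stays bounded.
print(prob_at_least_one_miss([0.01 * 0.5 ** i for i in range(100)]))  # ~0.02
```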
Something that may make this problem either simpler or harder is the fact that, once a question has been bumped down a couple of times, it can be very far out-of-distribution. This makes it much easier to get mediocre reliability (the questions bumped down from very high levels are very weird for the low levels), but, at the same time, the low levels are really, really not equipped for dealing with these weird questions. In the clippy example, Paul's scheme would totally allow the high-capability clippy to convince low levels into accepting its proposal by exploiting their weaknesses (adversarial examples).
In different words: The entire scheme needs really good anomaly detection. Anomaly detection is a machine-learning problem, right? So learn this along the way?
The problem is that, while training the anomaly detection at level i, we cannot sample from the distribution D_{i+N}, because we simply don't know it yet. As we run amplification, we extend both the capability of answering questions and the range of questions that come up.
Note that this would not be a problem if the extensions of the supports of the distributions were very mild, compared to the larger domain where the anomaly detection of level i is reliable-by-training. This appears to be a very dangerous assumption about the "mind-space" accessible to our training procedure, for the "problem-space" at hand and the specific training procedure. In other words: No universality of safety amplification, if it relies on this assumption. Please link me to relevant discussions if this assumption has been justified anywhere.
A different approach would be to not run the scheme in a hierarchical way, but instead to feed it into itself. This reintroduces the problem of "drift", i.e. instability: The system will acquire a specific culture, and we can only pray (i.e. must assume / ensure) that desirable states are an attractor, for the specific training procedure. I fear that this is extra-hard to check, because meta-stable systems look pretty stable, until you increase capability: Bad scaling.
A third approach to solving this problem would be to create, separately, a reliable anomaly detector. This may very well be an AI-complete problem, but it may be simpler than alignment! Further, robust anomaly detection (recognizing that data points are outside of the training distribution) might be an avenue for differential progress that matters more for safety/alignment than for capabilities.
Do you know other approaches for this problem?
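For concreteness, a minimal out-of-distribution check of the kind the third approach asks for; nearest-neighbour distance is only a stand-in for whatever detector each level would actually use to decide between answering and punting, and the class name and threshold rule are invented:

```python
import numpy as np

class NearestNeighbourAnomalyDetector:
    """Toy detector: flag an input as out-of-distribution if its distance to
    the nearest training point exceeds a threshold calibrated on the training
    data itself (leave-one-out nearest-neighbour distances).

    `train_points` is assumed to be an (n, d) array-like with n >= 2.
    """

    def __init__(self, train_points, quantile=0.99):
        self.train = np.asarray(train_points, dtype=float)
        loo_dists = []
        for i, x in enumerate(self.train):
            others = np.delete(self.train, i, axis=0)
            loo_dists.append(np.min(np.linalg.norm(others - x, axis=1)))
        self.threshold = np.quantile(loo_dists, quantile)

    def is_anomalous(self, x):
        nearest = np.min(np.linalg.norm(self.train - np.asarray(x, dtype=float), axis=1))
        return nearest > self.threshold
```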
------------
When is this harmless?
Well, this is entirely harmless if our problem domain contains only simple questions, where verification of ground truth is dirt cheap. For example, NP questions (like, e.g., all of mathematics! Proof verification is trivial after all). This observation should give us a bad feeling, because an AI that is really good at mathematics / programming and bad at everything else is one of the most dangerous situations we could imagine (magical function optimizers, aka mindless AIXI genies, being the only more hopeless case I can think of). On the other hand, typical NP questions don't scale down: It is currently entirely infeasible to use machine learning for theorem proving, simply because useful transformations are exponentially rare in the space of possible ones (I am aware of some papers using the Mizar library; while the neural net + theorem prover beat the unaided prover, I was less than impressed by the results).
For problem domains that feel more like EXPTIME, this is more likely to be a problem: Say, training to play games like Go. There, we can play against our ancestors in order to judge performance, and gain access to some kind of ground truth. Unfortunately, (1) strength is not linearly ordered: You clearly can have situations where A beats B beats C beats A, and (2) if we wanted to optimize "strength against perfect play", aka min-max, then we don't have access to a perfect opponent during training. AFAIK it is usual for training-through-amplification of game AI to develop "fads", i.e. cheesy tactics, along the way; sometimes these recur cyclically. This is also observed in the metagame of many multiplayer video games. I have a feeling that the Go successes tell us a lot about how MCTS is amazingly stable against cheesy tactics; and who knows how much tweaking DeepMind had to do until they got the amplification stable.
Now, safety amplification / value learning has a much, much harder problem: The ground truth is only accessible through examples / very expensive oracle queries (which might be fundamentally unsafe, at very high levels of capability: Don't let human operators talk to unaligned too-clever AI).
------------
Post-script: Writing this down in clear words made me slightly update against Paul's amplification schemes eventually growing into a solution. I still think that Paul's line of research is damn cool and promising, so I'm more playing devil's advocate here. The possible differential gain in capability on NP problems versus harder-than-NP alignment for this kind of amplification procedure made me slightly more pessimistic about our prospects in general. Moreover, it makes me rather skeptical about whether amplification is a net win for safety / alignment in the differential-progress view. I want to look more into anomaly detection now, for fun, my own short-term profit, and long-term safety.
Replies from: paulfchristiano, paulfchristiano↑ comment by paulfchristiano · 2018-05-06T05:17:44.513Z · LW(p) · GW(p)
Iterative amplification schemes work by having each version i+1 trained by previous iteration i; and, whenever version i fails at finding a good answer (low confidence in the prediction), punting the question to i−1 , until it reaches the human overseer at i=0, which is the ground truth for our purposes.
There is a dynamic like this in amplification, but I don't think this is quite what happens.
In particular, the AI at level i-1 generally isn't any more expensive than the AI at level i. The main dynamic for punting down is some way of breaking the problem into simpler pieces (security amplification requires you to take out-of-distribution data and, after enough steps, to reduce it to in-distribution subtasks), rather than punting to a weaker but more robust agent.
The problem is that, while training the anomaly detection at level i , we cannot sample from the distribution Di+N , because we simply don't know it yet. As we run amplification, we extend both the capability of answering questions and the range of questions that come up.
I do agree with the basic point here though: as you do amplification the distribution shifts, and you need to be able to get a guarantee on a distribution that you can't sample from. I talk about this problem in this post. It's clearly pretty hard, but it does look significantly easier than the full problem to me.
↑ comment by paulfchristiano · 2018-03-09T06:06:25.736Z · LW(p) · GW(p)
I think the "false positives" we care about are a special kind of really bad failure, it's OK if the agent guesses wrong about what I want as long as it continues to correctly treat its guess as provisional and doesn't do anything that would be irreversibly bad if the guess is wrong. I'm optimistic that (a) a smarter agent could recognize these failures when it sees them, (b) it's easy enough to learn a model that never makes such mistakes, (c) we can use some combination of these techniques to actually learn a model that doesn't make these mistakes. This might well be the diciest part of the scheme.
I don't like "anomaly detection" as a framing for the problem we care about because that implies some change in some underlying data-generating process, but that's not necessary to cause a catastrophic failure.
(Sorry if I misunderstood your comment, didn't read in depth.)
comment by Stuart_Armstrong · 2018-03-30T02:18:30.645Z · LW(p) · GW(p)
https://www.lesserwrong.com/posts/ZyyMPXY27TTxKsR5X/problems-with-amplification-distillation
Summary:
I have four main criticisms of the approach:
- 1. "Preserve alignment" is not a valid concept, and "alignment" is badly used in the description of the method.
- 2. The method requires many attendant problems to be solved, just like any other method of alignment.
- 3. There are risks of generating powerful agents within the systems that will try to manipulate it.
- 4. If those attendant problems are solved, it isn’t clear there’s much remaining of the method.
The first two points will form the core of my critique, with the third as a strong extra worry. I am considerably less convinced about the fourth.
comment by Jan_Kulveit · 2018-03-09T10:02:48.014Z · LW(p) · GW(p)
My intuition is this is not particularly stable against adversarial inputs. Trying to think about it as a practical problem, I would attack it in this way:
- provide adversarial inputs to A0 which try to manipulate the agents so that they are better at the "simple task at hand" but, e.g., have a slightly distorted model of the world
- it seems feasible to craft manipulations which would "evaluate" and steer the whole system only once it is already at a superhuman level, so we have Amplify(H, A[10]) where A[10] is superhuman and optimizing to take over control for the attacker's goal
In vivid language, you seed the personal assistants with the right ideas ... and many iterations later, they start a communist revolution
comment by avturchin · 2018-03-08T18:29:55.225Z · LW(p) · GW(p)
I will try to list some objections that appeared in my mind while reading your suggestions.
- I think the main problem is the idea that "humans have values". No, they don't have values. "Values" were invented by psychologists to better describe human behaviour. Saying that humans "have values" is a fundamental attribution error, which in most cases is very subtle, because humans behave as if they have values. But this error could become amplified by the suggested IDA, because human reactions to the AI's tests will not be coherent, and their extrapolation will get stuck at some level. How exactly it will get stuck is not easy to predict. In short: it is impossible to align two fuzzy things. I am going to write something longer about the problem of the non-existence of human values.
- I have values about values which contradict my observed behaviour. If an AI could observe someone's behaviour, it may conclude that this person - suppose - is lazy, sexually addicted, and prefers to live in the world of Game of Thrones. However, the person could disavow part of his behaviour as unethical. Thus behaviour will give wrong clues.
- Human behaviour is controlled not only by a person's consciousness, but also partly by a person's unconsciousness, which is unobservable to him but obvious to others. In other words, a human's behaviour is often a sum of several different minds inside his head. We don't want AI to extrapolate our unconscious goals, as they are often unethical and socially inappropriate. (I could support these claims with literature and examples.)
- Small tests performed by the AI may be dangerous or unethical. Maybe the AI will try to torture a person to extract his preferences about money and pain.
- An AI could be aligned with the same human in many different ways. All of them could be called "alignment", but because the AI is very complex, many different approximations of human behaviour could satisfy any criterion we have. Some of these approximations may be better or worse.
- AI alignment per se doesn't solve AI safety, as it doesn't explain how war between humans empowered by AIs will be prevented. I wrote an article about it: Turchin, Alexey, and David Denkenberger. Military AI as a convergent goal of the self-improving AI. In edited volume: Artificial intelligence safety and security, CRC, 2018, https://philpapers.org/rec/TURMAA-6
- The AI's tests of human values will change human values, and the AI could change these values in any possible direction. That is, the AI could design an "adversarial test" which changes a human's value system into any other value system without the human noticing it. Some humans can also do this; see "Ericksonian hypnosis". It is a particularly nasty example of AI boxing gone wrong.
- Obviously, if the AI's capabilities are underestimated, it could undergo a "treacherous turn" at any moment while preparing the test of human behaviour - I think this would be a standard MIRI reply.
- To be "aligned" is not just a mathematical relation to which the usual notions of transitivity - or whatever else we can easily do with mathematical symbols - are applicable. Alignment is complex and vague, never certain. It is like love: we can't describe human love by introducing a notation L and writing equations like L(A,B)=C.
Anyway, I think there is a chance that your scheme will work, but under one condition: the first AI is something like a human upload which has a basic understanding of what it is to be aligned. If we have such a perfect upload, with common sense and a general understanding of what humans typically want, the amplification and distillation scheme will work. The question is how we can get this first upload. I have some ideas, but this is another topic.
Replies from: paulfchristiano↑ comment by paulfchristiano · 2018-03-09T06:07:48.468Z · LW(p) · GW(p)
- My scheme doesn't have any explicit dependence on "human values," or even involve the AI working with an explicit representation of what it values, so seems unusually robust to humans not having values (though obviously this leaves it more vulnerable to certain other problems). I agree there are various undesirable features that may get amplified.
- My scheme is definitely not built on AI observing people's behavior and then inferring their values.
- I'm not convinced that it's bad if our AI extrapolates what we actually want. I agree that my scheme changes the balance of power between those forces somewhat, compared to the no-AI status quo, but I think this procedure will tend to increase rather than decrease the influence of what we say we want, relative to business as usual. I guess you could be concerned that we don't take the opportunity of building AI to go further in the direction of quashing our unconscious preferences. I think we can and should consider quashing unconscious preferences as a separate intervention from aligning AGI.
- An agent trained with my scheme only tries to get info to the extent the oversight process rewards info-gathering. This may lead to inefficiently little info gathering if the overseer doesn't appreciate its value. Or it may lead to other problems if the overseer is wrong in some other way about VOI. But I don't see how it would lead to this particular failure mode. (It could if amplification breaks down, such that e.g. the amplified agent just has a representation of "this action finds relevant info" but not an understanding of what it actually does. I think that if amplification fails that badly we are screwed for other reasons.)
- I don't know if it matters what's called alignment, we can just consider the virtues of iterated amplification.
- I'm not proposing a path to world peace, I'm trying to resolve the particular risk posed by misaligned AI. Most likely it seems like there will be war between AI-empowered humans, though with luck AI will enable them to come up with better compromises and ultimately end war.
- I don't understand the relevance exactly.
- Here are what I see as the plausible techniques for avoiding a treacherous turn. If none of those techniques work I agree we have a problem.
- I agree that there are some desirable properties that we won't be able to make tight arguments about because they are hard to pin down.
↑ comment by avturchin · 2018-03-10T08:40:58.441Z · LW(p) · GW(p)
I have some additional thoughts after thinking more about your proposal.
What worries me is the jump from narrow AI to AGI learning. The proposal will work at the narrow AI level, approximately as a similar model worked in the case of AlphaGoZero. The proposal will also work if we have a perfectly aligned AGI, something like a human upload or a perfectly aligned Seed AI. It is rather possible that a Seed AGI can grow in its capabilities while preserving alignment.
However, the question is how your model will survive the jump from narrow, non-agential AI capabilities to agential AGI capabilities. This jump could happen at some unpredicted moment during the evolution of your system, and may involve modelling the outside world, all of humanity, and some convergent basic drives, like self-preservation. So it would be the classic treacherous turn, or intelligence explosion, or "becoming self-aware" moment - and at that moment the previous means of alignment will be instantly obsolete and will not provide any guarantee that the system remains aligned at its new level of capabilities.
comment by Donald Hobson (donald-hobson) · 2018-04-15T14:29:37.310Z · LW(p) · GW(p)
https://github.com/DonaldHobson/AI-competition-2
comment by lucarade · 2018-04-29T03:54:19.150Z · LW(p) · GW(p)
Only just learned of this, unfortunately.
OP (please comment there): https://medium.com/@lucarade/issues-with-iterated-distillation-and-amplification-5aa01ab37173
I will show that, even under the most favorable assumptions regarding the feasibility of IDA and the solving of currently open problems necessary for implementing IDA, it fails to produce an aligned agent in the sense of corrigibility.
Part 1: The assumptions
Class 1: There are no problems with the human overseer.
1.1: Human-generated vulnerabilities are completely eliminated through security amplification. (See this post for a lengthy overview and intuition, and this post for a formalization.) In short, security amplification converts the overseer in IDA from high-bandwidth (receiving the full input in one piece) to low-bandwidth (receiving inputs divided into small pieces), to make impossible an attack which inputs data in such a way as to exploit human vulnerability to manipulation. See this post [LW · GW] for a good explanation of a high-bandwidth vs low-bandwidth overseer.
My critique applies equally to high-bandwidth and low-bandwidth overseers so I make no assumption on that front.
1.2: There is no moral hazard in the human overseers. This eliminates one of Stuart’s critiques. Furthermore, the human overseer displays corrigible behaviors without error.
1.3: The relevant experts are willing to put in a substantial amount of time for the training process. This is a non-trivial assumption which I have not yet seen discussed.
Class 2: The framework and its auxiliary components function as intended.
2.1: Reliability amplification functions as intended. In summary, reliability amplification uses a voting ensemble of agents at each stage of amplification to avoid error amplification, in which an initially small probability of error grows with each iteration.
2.2: Corrigibility, not optimal value-aligned performance, is our goal. All we care about is that our agent “is trying to do what its operator wants it to do.” It may be bad at actually figuring out what its operator wants or at carrying out those wants, but the point is that it cares about improving, and will never intentionally carry out an action it knows is contrary to what its operator would want it to do (see this post and this post [LW · GW] for a clarification of Paul’s approach to AI alignment by achieving corrigibility).
Stuart has pointed out problems with corrigibility [LW · GW], which I agree with. Essentially, the concept is ill-defined given the fuzziness of human values, and to properly implement corrigibility an agent must completely understand human values, thus reducing to the much harder value learning problem. However, we will assume that an agent which understands and implements the general concept of corrigibility, even if it accidentally misbehaves in many cases and causes widespread harm upon initial implementation as Stuart’s argument suggests, will still avoid existential risk and allow us to improve it over time, and is thus satisfactory. I think this is Paul’s approach to the matter.
Even a fully corrigible agent can be catastrophically misaligned, as detailed in this post [LW · GW]. As addressed in the comments of that post, however, if we assume humans are smart enough to avoid a corrigible AI causing existential risk in this manner then the issue goes away.
2.3: There is no coordination possible among any of the A[n]s, eliminating another of Stuart’s critiques.
2.4: The informed oversight problem is solved. In summary, the problem is that it is difficult for a more powerful aligned overseer agent to fully understand the decision-making process of a weaker agent in a way that allows the overseer to push the weaker agent towards alignment. However, it does not follow that it is possible for a weaker aligned overseer to understand the decision-making process of a more powerful agent. It seems like this will never be possible by definition, because a weaker agent cannot understand the decision-making processes of a more powerful agent even if full transparency were possible.
2.5: Worst-case scenario techniques are effective at eliminating the risk of a treacherous turn, in which an agent performs well on training and test data but behaves catastrophically on some input in the real world which was not accounted for.
Part 2: Corrigibility is unlikely to be trained
The crux of my argument is that, even given the very optimistic assumptions above, IDA will fail to train a corrigible agent. I will list the problems sequentially, but my argument is that all five are necessary for failure. It is possible a subset is sufficient but I will not explore that possibility in this post.
Problem 1: Limited training data. Even given assumption 1.3, carrying out the process of training each new iteration is extremely time-intensive. It would take years of many experts’ full-time commitment as overseers to produce anything close to a reasonable amount of data.
Problem 2: Hidden error amplification. If an error is identifiable by an aligned agent, then reliability amplification will almost certainly remove it. However, a small initial hidden error will be rapidly amplified, since in the next iteration it will be manifested in various ways in many of the thousands of copies of the error-containing agent. Thus in the distillation step, the initial error will be propagated in many different forms, leading to a multiplication of errors. Informed oversight doesn’t help since the human is not more powerful than the subroutines, and even if transparency is achieved, if the human can verify all output then only very limited capability amplification is possible.
I will show in Problem 4 that such hidden errors are likely to arise.
Problem 3: Information loss. The human overseer has access to information, in the form of memories (broadly construed as information from before the current training process stored in the human’s mind), that is not accessible to the agent A[n+1] training on (H+A[n]), who sees only (H+A[n])’s actions. A[n+1] will thus erroneously attribute decisions made by the human relying on non-available information to available information. This information loss and the resulting distortion is greatly exacerbated by the same process described in Problem 2, doubly: first in the amplification of the information loss of the previous overseer, and second in the introduction of further information loss of the current overseer.
Paul thinks this is a general problem of learning and thus falls outside the scope of things that specifically cause IDA to fail, as pointed out in this post and in the comments on this post [LW · GW], but I disagree. One can certainly imagine (and some have experienced) a human robustly learning another human's decision-making heuristics over time without direct access to the other human's memories, and can by extension also imagine an artificial agent extracting information from a human to robustly understand that human's decision-making process. The problem exists not in all forms of learning but in the class of training techniques which do not involve a direct and adaptive extraction of information from a human in some form.
Problem 4: No prior concept of corrigibility. Because of information loss, an agent has no way of extracting the concept of corrigibility from its training data, only the behavior of corrigibility. The way the agent implements corrigibility will thus necessarily be an approximation, even if an extremely good one, and will not necessarily be robust to drastic changes in context. This causes the small hidden errors that are then amplified through the hidden error amplification in Problem 2, making reliability amplification ineffective. Without hidden error amplification this would probably not be a problem, since agents which successfully approximate corrigibility behaviorally will be able to detect all but the tiniest deviations from optimal corrigibility (ie, understanding the concept the way you and I do). However, hidden error amplification causes a nontrivial corrosion of corrigibility throughout iterations, and as each newly distilled agent approximates an increasingly corrupted behavioral corrigibility that deviates from our ideal conceptual corrigibility, reliability amplification is keeping us close to each further deviated behavioral corrigibility but not close to the ideal conceptual corrigibility. The process behaves essentially as a high-dimensional random walk with extremely small steps, but with thousands of steps per iteration manifested in the copies of A[n].
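A toy simulation of that random-walk picture, with invented dimensions and step sizes, just to show how many tiny per-copy errors, each "corrected" only relative to the previous iteration's behaviour rather than the original ideal, accumulate into a steady drift:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, step_size = 1000, 1e-3          # invented: error dimensions, per-copy error size
copies_per_iteration, iterations = 2000, 20

target = np.zeros(dim)               # stands in for ideal conceptual corrigibility
current = target.copy()              # what each distilled agent actually approximates
for n in range(iterations):
    # Each iteration, many copies add tiny hidden errors; the next agent is
    # trained to match the resulting behaviour, not the original ideal.
    drift = rng.normal(0.0, step_size, size=(copies_per_iteration, dim)).sum(axis=0)
    current = current + drift
    print(n, np.linalg.norm(current - target))  # distance from the ideal keeps growing
```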
Problem 5: Temporal inconsistency of proxy dynamics (TIPD). Any incomplete simulation is not robust over time without an adaptive capacity. There are certain underlying processes which are time-invariant, such as the laws of physics and the mathematics of evolution. However, clearly we can never completely simulate any non-trivial situation purely in terms of these processes. Thus, an agent must necessarily rely on proxy dynamics for decision-making: emergent properties of the fundamental processes, which fairly reliably approximate cause-and-effect relationships between actions and outputs. However, because of the complexity of the underlying dynamics and their interactions, these proxy dynamics change over time, and often quite drastically over short periods (see the literature on chaos theory, critical transitions, bifurcation points). Thus, an agent which performs robustly at one point in time may behave catastrophically at another. The only solution is for the agent to be capable of adapting its policy to changes in the proxy dynamics it uses.
This sounds like the treacherous turn problem, but it is distinct, and harder. In the treacherous turn problem, we have an agent that is not sufficiently well trained given the input-output relationships of the world. This can probably be solved by worst-case scenario techniques like adversarial training. In TIPD, even if we succeed in training a robust policy, the proxy dynamics used to inform decisions will change such that an action in response to an input which previously would have produced a safe behavior now produces a catastrophic behavior.
As a result, behavioral corrigibility, whether corrupted or not, is not robust over time since it does not adapt to changing input-output relationships. An agent must possess conceptual corrigibility for such adaptation to occur, which is extremely hard, and may reduce to the value learning problem.
Part 3: Achieving alignment in this process through anything but corrigibility is doomed.
This is fairly obvious, and mostly follows from Part 2. Any proxy of the human’s decision-making process will clearly fail without an adaptive capacity, and it is not clear how such an adaptive capacity could be robustly implemented. And clearly this method will never achieve anything but a proxy due to information loss.
Conclusion
I have argued that even under the most optimistic assumptions about the human overseer and the successful operation of the framework, IDA will fail to produce a corrigible agent. This failure is a result of the interplay between hidden error amplification, information loss, the ability to learn behavioral corrigibility but not conceptual corrigibility, and the temporal inconsistency of proxy dynamics (TIPD). The solution to these problems seems very hard, and may reduce to the value learning problem, in which case the IDA framework does not provide us with any advantage.
comment by Roland Pihlakas (roland-pihlakas) · 2018-04-20T19:26:46.255Z · LW(p) · GW(p)
Hello! Thanks for the prize announcement :)
Hope these observations and clarifying questions are of some help:
https://medium.com/threelaws/a-reply-to-aligned-iterated-distillation-and-amplification-problem-points-c8a3e1e31a30
Summary of potential problems spotted regarding the use of AlphaGoZero:
- Complete visibility vs Incomplete visibility.
- Almost complete experience (self-play) vs Once-only problems. Limits of attention.
- Exploitation (a game match) vs Exploration (the real world).
- Having one goal vs Having many conjunctive goals. Also, having utility maximisation goals vs Having target goals.
- Who is affected by the adverse consequences (In a game vs In the real world)? - The problems of adversarial situation, and also of the cost externalising.
- The related question of different timeframes.
Summary of clarifying questions:
- Could you build a toy simulation? So we could spot assumptions and side-effects.
- In which ways does it improve the existing social order? Will we still stay in mediocristan? Long feedback delay.
- What is the scope of application of the idea (Global and central vs Local and diverse?)
- Need concrete implementation examples. Any realistically imaginable practical implementation of it might not be so fine anymore, each time for different reasons.
comment by Visionbuilder · 2018-04-19T12:44:00.320Z · LW(p) · GW(p)
https://www.dropbox.com/s/98469ukjkzgdy6k/AI%20alignment%20%28FINAL%29.docx?dl=0
I don't use Dropbox normally, so I hope the link works. Let me know if it doesn't and I'll figure something else out. Can't say I'm convinced of this approach to alignment, but it's good to see continued discussion.
comment by Gordon Seidoh Worley (gworley) · 2018-03-15T17:49:30.942Z · LW(p) · GW(p)
This is a more general objection to various sorts of proposed AI alignment schemes that I've been thinking about since formalizing the AI alignment problem, and it ends up applying to your ideas. I'll probably write more up on this later, but for now here's a sketch as it applies to your ideas as far as I understand them.
In "Formally Stating the AI Alignment Problem" I say alignment requires that
A must learn the values of H and H must know enough about A to believe A shares H’s values
where A is an AGI and H is humanity (this is an informal summary of the more formal result given earlier in the paper). Your solution only seems to be designed to address the first part, that A must learn the values of H. Your proposal mostly seems to me to ignore the requirement that H must know enough about A to believe A shares H's values, although I know you have thought about the issue of transparency and see it as necessary. Nevertheless in your current approach it feels tacked on rather than a key part of the design since you seem to hope it can be achieved by training for transparency.
The need to train for transparency/explainability is what makes me suspicious of RL-based approaches to AI alignment now. It's difficult to come up with a training program that will get an agent to be reliably transparent, because the reward function will always reward the appearance of transparency rather than actual transparency: it can only shape observed behavior, not internal structure. This creates an opportunity for a treacherous turn, where the AI acts in ways that indicate it shares our values, appears to be believably sharing our values based on the explanations it gives for its actions, yet could be generating those explanations independently of how it actually reasons, such that it would appear aligned right up until it's not.
Intuitively this is not very surprising, because we have the same challenge with aligning humans and are not able to do it reliably. That is, we can try to train individual humans to share our values, observe behavior indicating they have learned our values and are acting on them, and ask them for reasons that convince us they really do share our values, but then they may still betray those values anyway, since they could have been dissembling and hiding their resentful intent to rebel the entire time. We know this to be the case because totalitarian states have tried very hard to solve this problem and repeatedly failed, even though they are often successful in individual cases. This suggests approaches of this class cannot be made reliable enough to be worth pursuing.