I think in that case I tend towards thinking that such corrections are unlikely to scale towards what I eventually want out of corrigibility, though I do think there are other ways in which such approaches can still be quite useful. I mainly see approaches of that form as broadly similar to impact measure approaches such as Attainable Utility Preservation or Relative Reachability where the idea is to try to disincentive the agent from acting non-corrigibly. The biggest hurdles there seem to be a) making sure your penalty doesn't miss anything, b) figuring out how to trade off between the penalty and performance, and c) making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part).
On that note, I totally agree that it becomes quite a lot more complicated when you start imagining creating these things via learning, which is the picture I think about primarily. In fact, I think your intuitive picture here is quite spot on: it's just very difficult to control exactly what sort of model you'll get when you actually try to train an agent to maximize something.
comment by Koen.Holtman
· score: 2 (2 votes) · LW
) · GW
Yes you are correct, the paper is not really about agents, but about applying correction functions like to the type agents.
The rest of this comment is all about different types of corrigibility. Basically it is about identifying and resolving a terminological confusion.
Reading your link om 'what I eventually want out of corrigibility', I think I see a possible reason why there may be a disconnect between the contents of my paper and what kind of things you are looking for. I'll unpack this statement a bit. If I do a broad scan of papers and web pages the I see that different types of 'corrigibility' are being considered in the community: they differ in what people want from it, and in how they expect to design it. To assign some labels, there are at least button-corrigibility, and preference-learner-corrigibility.
The Soares, Fallenstein, Armstrong and Yudkowsky corrigibility paper, and my paper above, are about what I call button-corrigibility.
The corrigibility page by Christiano you link to is on what I call preference-learner-corrigibility. I describe this type of corrigibility in my paper as follows:
Agents that are programmed to learn can have a baseline utility function that incentivizes the agent to accept corrective feedback from humans, feedback that can overrule or amend instructions given earlier. This learning behavior creates a type of corrigibility, allowing corrections to be made without facing the problem of over-ruling the emergent incentive of the agent to protect itself. This learning type of corrigibility has some specific risks: the agent has an emergent incentive to manipulate the humans into providing potentially dangerous amendments that remove barriers to the agent achieving a higher utility score. There is a risk that the amendment process leads to a catastrophic divergence from human values. This risk exists in particular when amendments can act to modify the willingness of the agent to accept further amendments. The corrigibility measures considered here can be used to add an extra safety layer to learning agents, creating an emergency stop facility that can be used to halt catastrophic divergence. A full review of the literature about learning failure modes is out of scope for this paper. [OA16] discusses a particular type of unwanted divergence, and investigates ’indifference’ techniques for suppressing it. [Car18] discusses (in)corrigibility in learning agents more broadly.
Ideally, we want an agent that has both good button-corrigibility and good preference-learner-corrigibility: I see these as independent and complementary safety layers.
Christiano expresses some hope that a benign act-based agent will have good preference-learner-corrigibility. From a button-corrigibility standpoint, the main failure mode I am worried about is not that the the benign act-based agent will be insufficiently skilled at building a highly accurate model of human preferences, but that it will conclude that this highly accurate model is much easier to build and maintain once it has made all humans addicted to some extreme type of drug.
At the risk of oversimplifying: in preference-learner-corrigibility, the main safety concern is that we want to prevent a learning process that catastrophically diverges from human values. In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score. The solution in button-corrigibility (or at least the solution considered in Soares et al. and in my paper) is to have a special 'ritual', e.g. a button press, that the humans can perform in order to change the utility function of the agent, with the agent designed to be indifferent about whether or not the humans will perform the special ritual. This solution approach is quite different from improving learning, in fact it can be said to imply the opposite of learning.
Say that the button-corrigible agent agent in my paper becomes somewhat aware that the people might be planning to push the button. If so, the constructed indifference property implies that the agent will have no motivation whatsoever to act on this awareness by launching a deeper investigation into what values or emotions might motivate the people to have these plans. Having improved knowledge about motivations might be very useful if the agent wanted to stop or delay the people, if the agent wanted to engage in acts of lobbying as I call it in my paper. But as the agent is completely indifferent about stopping them, it has no motivation to spend energy on the sub-goal of learning more about their motivations. So in a sense, button-corrigibility has the side effect of making the agent into a non-learner when it comes to certain topics. (This is all modulo what is inside : the in the agent might contain an incentive for the agent to explore the new phenomenon further if it becomes aware that people might have button pushing plans, but if so this has no effect on how works to suppress lobbying.)
So overall, if you are looking in a button-corrigibility paper for a mechanism that generates a corrective pressure to improve a learning process about human values, a mechanism that improves preference-learner-corrigibility, you will probably not find it.
Personal experience: I often come across a paper or web page where the authors introduce the concept of corrigibility by referencing Soares et al, so then I read on expecting a discussion of button-corrigibility concerns and approaches. But when I read on I often discover that it is really about preference-learner-corrigibility concerns and approaches. This happened to me also when reading your paper Risks from Learned Optimization [LW · GW]: section 4.4 says that 'Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it'. As I explained above, a correctly working button-corrigible mesa-optimiser will not be motivated to spend any time learning about the base objective so that it can improve its ability to be deceptively aligned. A preference-learner-corrigible mesa-optimiser might, tough. So the above sentence triggered me into realising that you and your co-authors were likely not thinking about button-corrigible optimisers when it was written.
comment by evhub
· score: 1 (1 votes) · LW
) · GW
In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score.
I generally think of solving this problem via ensuring your agent is myopic rather than via having a specific ritual which the agent is indifferent to, as you describe. It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward. Thus, it seems like what you really want here is some guarantee that the agent is only considering its per-episode reward and doesn't care at all about its cross-episode reward, which is the condition I call myopia. That being said, how to reliably produce myopic agents is still a very open problem.
comment by Koen.Holtman
· score: 1 (1 votes) · LW
) · GW
It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward.
Yes this is true: basically the fundamental job of the button-corrigible agent is still to perform a search over all actions, and find those actions that best maximise the reward computed by the utility function. As mentioned in the paper, such actions may be to start a hobby club, where humans come together to have fun and build petrol engines that will go into petrol cars. Starting a hobby club is definitely an action that influences humans, but not in a bad way.
Button-corrigibility intends to support a process where bad influencing behaviour can be corrected when it is identified, without the agent fighting back. The assumption is that we are not smart enough yet to construct an agent that is perfect on such metrics when we start it up initially.
On myopia: this is definately a technique that might make agents safer, though I would worry if this implies that an agent is incapable of considering long term effects when it chooses actions: this would make the agent unsafe in many cases.
Short term, I think myopia is a useful safety technique for many types of agents. Long term, if we have agents that can modify themselves or build new sub-agents, myopia has a big open problem: how do we ensure that any successor agents will have exactly the same myopia? Button-corrigibility avoids having to solve this preservation of ignorance problem: it also works for agents and successor agents that are maximally perceptive and intelligent, because it modifies the agent goals, not agent perception or reasoning capability. The successor agent building problem does not go away completely through: button-corrigibility still has some hairy problems with agent modification and sub-agents, but overall I have found these to be more tractable than the problem of ignorance preservation.