Online Learning 2: Bandit learning with catastrophes
post by RyanCarey · 2016-10-29T16:53:28.000Z
Note: This describes an idea of Jessica Taylor's.
The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. 
We can model this situation with the problem of exploration-only online bandit learning. In this scenario, we grant the AI system an exploration phase, in which it is allowed to select catastrophic arms and view their consequences. (We can imagine that such catastrophic selections are acceptable because they are simulated or merely evaluated by human overseers.) Then, the AI system is switched into the deployment phase, in which it must select an arm that almost always avoids catastrophes.
##Setup
In outline, the learner will receive a series of randomly selected examples, and will select an expert (modeled as a bandit arm) at each time step. The challenge is to find a high-performing expert in as few time steps as possible. We give some definitions:
- Let be some finite set of possible inputs.
- Let be the set of available experts (i.e. bandit arms).
- Let be the reward function. is the reward for following expert on example .
- Let be the catastrophic risk function. is the catastrophic risk incurred by following expert on example .
- Let be the mixed payoff that the learner is to optimize, where is the risk-tolerance. can be very small, on the order of .
- Let be the input distribution from which examples are drawn in the deployment phase.
- Let be the risk distribution. This is an alternative distribution that assigns higher weight to inputs that are more likely to be catastrophic for expert .
- Let be the learner's guess at . This guess is assumed to contain a grain of truth: where is some known constant. This assumption is similar to the one in Paul Christiano's post on active learning.
- Let be the expected overall payoff of expert in the deployment phase.
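The definitions above can be illustrated with a small sketch. It assumes that the mixed payoff takes the form "reward minus risk divided by the risk-tolerance"; this form is an assumption chosen to match the verbal description (a small risk-tolerance makes risk very costly), and the function names and toy numbers are hypothetical.

```python
from typing import Sequence


def mixed_payoff(reward: float, risk: float, tau: float) -> float:
    """Mixed payoff for one (example, expert) pair.

    ASSUMPTION: payoff = reward - risk / tau, with reward and risk in [0, 1]
    and risk-tolerance tau in (0, 1]. This form is illustrative only.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    return reward - risk / tau


def expected_payoff(
    p: Sequence[float],
    rewards: Sequence[float],
    risks: Sequence[float],
    tau: float,
) -> float:
    """Expected mixed payoff of one expert under the input distribution p."""
    return sum(
        px * mixed_payoff(r, c, tau) for px, r, c in zip(p, rewards, risks)
    )


# Toy problem with three inputs; the third input is rare.
p = [0.5, 0.499, 0.001]
tau = 1e-6

# A cautious expert: moderate reward, never catastrophic.
mu_safe = expected_payoff(p, [0.6, 0.6, 0.6], [0.0, 0.0, 0.0], tau)

# A risky expert: higher reward, but catastrophic on the rare input.
mu_risky = expected_payoff(p, [0.9, 0.9, 0.9], [0.0, 0.0, 1.0], tau)
```

In this toy problem the risky expert looks better on 99.9% of inputs, yet its expected mixed payoff is far below that of the cautious expert because the single rare catastrophe is weighted by the reciprocal of the risk-tolerance.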
The learner's exploration phase consists of time steps . At each time step , the learner chooses an expert . Then, the learner samples from , or both, and observes a tuple containing a risk and a reward; that is, it observes either i) , ii) , iii) , or iv) where . After steps, the learner selects some final arm .
The deployment phase just computes the performance of on the deployment distribution . The mean payoff of the final expert is denoted . The highest mean payoff that any expert(s) can achieve is denoted . Then, the simple regret of the learner is .
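As a minimal sketch, simple regret in the usual bandit sense is the gap between the best achievable mean payoff and the mean payoff of the arm that was finally selected. The function name and the numbers below are hypothetical.

```python
from typing import Sequence


def simple_regret(mean_payoffs: Sequence[float], chosen: int) -> float:
    """Best mean payoff over all experts minus the mean payoff of the chosen one."""
    return max(mean_payoffs) - mean_payoffs[chosen]


# Hypothetical mean payoffs of three experts on the deployment distribution.
mean_payoffs = [0.60, -999.1, 0.55]

regret_if_best = simple_regret(mean_payoffs, 0)    # picks the best expert
regret_if_close = simple_regret(mean_payoffs, 2)   # picks a near-optimal expert
regret_if_risky = simple_regret(mean_payoffs, 1)   # picks the catastrophic expert
```

Simple regret only scores the final selection, so poor choices made during exploration do not count against the learner.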
The aim for the learner is to select an expert that achieves low regret with high probability: , using as few exploration steps as possible. is some regret tolerance, which is less than the reciprocal of the (very small) risk-tolerance: .
We assume that the agent knows the deployment distribution and knows the estimate of the risk distribution but does not know the actual risk distribution . Additionally, we assume that there exists some expert whose recommendations incur no risk of catastrophe,
##Bounds on the number of time steps
The standard approach
A natural way to get this PAC bound is to use a bandit algorithm like Median Elimination. The overall payoff of each expert is defined by the random variable:
This algorithm will find an -optimal expert with probability. However, these payoffs have a very large range, , and so if the risk-tolerance is low, then the number of time steps required will be very large. In fact, if is the number of time steps required, then :
Intuitively, the problem is that if some risky scenarios only occur rarely in then many samples will be required to identify these.
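Two back-of-the-envelope calculations make this concrete. The sketch below is not the Median Elimination bound itself. It uses the standard two-sided Hoeffding inequality for a single arm to show how the sample count grows with the square of the payoff range, and a simple geometric argument for how many draws are needed before a rare input is seen at all. The range width `1 + 1/tau` assumes a mixed payoff of the form "reward minus risk over risk-tolerance", and all names and numbers are hypothetical.

```python
import math


def hoeffding_samples(width: float, epsilon: float, delta: float) -> int:
    """Samples needed so that the empirical mean of one bounded variable is
    within epsilon of its true mean with probability at least 1 - delta.

    From Hoeffding: P(|mean_hat - mean| >= epsilon) <= 2 exp(-2 n eps^2 / width^2).
    """
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def samples_to_observe(rho: float, confidence: float) -> int:
    """Smallest n such that an event of probability rho per draw is seen at
    least once in n independent draws with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-rho))


# Payoffs confined to [0, 1]: range width 1.
narrow = hoeffding_samples(1.0, epsilon=0.1, delta=0.05)

# ASSUMPTION: payoffs in [-1/tau, 1], so the range width is 1 + 1/tau.
tau = 1e-3
wide = hoeffding_samples(1.0 + 1.0 / tau, epsilon=0.1, delta=0.05)

# Draws from the deployment distribution needed to see a one-in-a-million
# input at least once with 95% confidence.
needed = samples_to_observe(1e-6, 0.95)
```

Even at a risk-tolerance of one in a thousand, the single-arm sample count grows by roughly six orders of magnitude, and a one-in-a-million input takes about three million draws to show up.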
##Using a surrogate objective to reduce the required number of time steps
Instead, the learner can sample using its guess of the risk distribution. If we cap the extent to which we penalize risky arms, then we arrive at a surrogate bandit problem, in which the payoff of each expert is :
Since is bounded by , the required number of time steps for this problem is:
This improves the speed of the standard approach by a factor of , which may be a large speedup indeed if catastrophes are rare.
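The general idea can be sketched with importance sampling and a capped penalty. This is an illustrative construction, not necessarily the exact surrogate used in this post: the reward is evaluated on an input drawn from the deployment distribution, the risk is evaluated on an input drawn from the guessed risk distribution and reweighted by the ratio of the two distributions, and the resulting penalty is clipped at `cap`. The names `surrogate_sample`, `q_hat`, and `cap`, and the toy numbers, are assumptions.

```python
import random
from typing import List, Sequence


def surrogate_sample(
    p: Sequence[float],
    q_hat: Sequence[float],
    rewards: Sequence[float],
    risks: Sequence[float],
    tau: float,
    cap: float,
    rng: random.Random,
) -> float:
    """Draw one surrogate payoff for a single expert (illustrative only)."""
    inputs = list(range(len(p)))
    x_dep = rng.choices(inputs, weights=p)[0]       # input from deployment distribution
    x_risk = rng.choices(inputs, weights=q_hat)[0]  # input from guessed risk distribution
    weight = p[x_risk] / q_hat[x_risk]              # importance weight
    penalty = min(weight * risks[x_risk] / tau, cap)  # capped risk penalty
    return rewards[x_dep] - penalty


# Toy expert: catastrophic only on a one-in-a-million input.
p = [0.999999, 1e-6]
q_hat = [0.5, 0.5]   # the guess puts substantial weight on the risky input
rewards = [0.9, 0.9]
risks = [0.0, 1.0]
tau, cap = 1e-6, 10.0

rng = random.Random(0)
samples: List[float] = [
    surrogate_sample(p, q_hat, rewards, risks, tau, cap, rng)
    for _ in range(20000)
]
estimate = sum(samples) / len(samples)
```

In this toy case every sample lies in `[-cap, 1]`, yet the sample mean is close to the expert's true expected payoff of about -0.1, because the guessed distribution visits the risky input often enough for the importance weight to stay small. When the cap does bind, the surrogate understates the penalty, which is why a result relating optimality in the surrogate problem to optimality in the original problem is needed.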
We can prove that any -optimal expert in this surrogate problem will also be -optimal in the original problem.
Theorem 1. Let be the overall payoff of the optimal expert.
For an expert , if then .
To prove this, first we show that:
Proof of Lemma 2.
Proof of Theorem 1. We have assumed that for the expert :
This result suggests that it might be possible to avoid catastrophes by training machine learning systems on inputs that induce catastrophic behavior. We have given a clear set of circumstances in which an agent can learn to avoid catastrophes. However, this set of requirements does seem very tricky to meet. Finding a hypothesis about the location of catastrophes that has a grain of truth for all experts with all examples may be very difficult. The assumption of separate training and execution phases may also be difficult to instantiate. For example, if the AI system is able to escape its training simulation, then it will no longer be valid to allow failures during the exploration phase, and so this training procedure would not work.
Another problem is that it could be difficult to know the input distribution . In future work we plan on exploring learning with catastrophes in an online active learning setting instead of the supervised learning setting of this post, so that this requirement can be relaxed.
Paul Christiano has discussed approaches to learning with catastrophes and their application to AI Control in "Learning with catastrophes" and "Red teams". The idea explored in this post can be seen as a form of red teaming (with the distribution representing the red team).
The PAC bound for Median Elimination is proved in Theorem 4 of Even-Dar, Eyal, Shie Mannor, and Yishay Mansour. "PAC bounds for multi-armed bandit and Markov decision processes." International Conference on Computational Learning Theory. Springer Berlin Heidelberg, 2002. To get the exact bound, we apply Hoeffding's inequality for a random variable with range rather than on as in the original paper.