Delegative Reinforcement Learning with a Merely Sane Advisor
post by Vanessa Kosoy (vanessa-kosoy) · 2017-10-05T14:15:45.000Z · LW · GW · 2 comments
Previously [AF · GW], we defined a setting called "Delegative Inverse Reinforcement Learning" (DIRL), in which the agent can delegate actions to an "advisor", and the reward is visible only to the advisor. We proved a sublinear regret bound (converted to traditional normalization in online learning, the bound is ) for one-shot DIRL (as opposed to standard regret bounds in RL, which are only applicable in the episodic setting). However, this required a rather strong assumption about the advisor: in particular, the advisor had to choose the optimal action with maximal likelihood. Here, we consider "Delegative Reinforcement Learning" (DRL), a similar setting in which the reward is directly observable by the agent. We also restrict our attention to finite MDP environments (we believe these results can be generalized to a much larger class of environments, but not to arbitrary environments). On the other hand, the assumption about the advisor is much weaker: the advisor is only required to avoid catastrophic actions (i.e. actions that lose value to zeroth order in the interest rate) and to assign some positive probability to a nearly optimal action. As before, we prove a one-shot regret bound (in traditional normalization, ). Analogously to before, we allow for "corrupt" states, in which both the advisor and the reward signal stop being reliable.
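The interaction protocol being studied can be sketched as follows. This is a toy illustration only: `env`, `agent`, `advisor` and the sentinel value `"delegate"` are all invented names, not part of the formalism.

```python
def drl_episode(env, agent, advisor, horizon):
    """One run of the delegative RL interaction (illustrative sketch).

    Each round, the agent either takes an action itself or delegates,
    in which case the advisor chooses the action. Unlike DIRL, the
    reward is observed directly by the agent.
    """
    history = []
    total_reward = 0.0
    for _ in range(horizon):
        action = agent.act(history)        # may return the sentinel "delegate"
        if action == "delegate":
            action = advisor.act(history)  # advisor assumed to avoid catastrophes
        observation, reward = env.step(action)
        total_reward += reward             # reward visible to the agent (DRL, not DIRL)
        history.append((action, observation, reward))
    return total_reward
```

The only difference from the DIRL protocol is the last append: in DIRL the agent would record the observation but not the reward.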
Appendix A contains the proofs and Appendix B contains propositions proved before.
Notation
The notation means that is a Markov kernel from to . When is a finite set, this is the same as a measurable function , and we use these notations interchangeably. Given and (corr. ), we will use the notation (corr. ) to stand for (corr. ). Given a finite set , , and , the notation means .
Given a measurable space, , , , finite sets and , random variables (measurable mappings), the mutual information between the joint distribution of the and the joint distribution of the will be denoted
We will parameterize our geometric time discount by , thus all functions that were previously defined to depend on are now considered functions of .
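As a reminder (this is the convention from the earlier posts in this sequence, restated here rather than newly defined), the normalized geometrically discounted utility has the form

```latex
% Normalized geometric time discount: rewards r_n in [0,1], gamma in (0,1)
\mathrm{U}(\gamma) \;=\; (1-\gamma)\sum_{n=0}^{\infty}\gamma^{n} r_{n}
```

so that utility lies in [0,1] for every value of the discount parameter.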
Results
We start by explaining the relation between the formalism of general environments we used before and the formalism of finite MDPs.
Definition 1
A finite Markov Decision Process (MDP) is a tuple
Here, is a finite set (the set of states), is a finite set (the set of actions), is the transition kernel and is the reward function.
A stationary policy for is any . The space of stationary policies is denoted . Given , we define by
We define and by
Here, is just the -th power of in the sense of Markov kernel composition.
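Concretely, the discounted value of a stationary policy solves a linear system. Below is a minimal sketch; the array encoding of the transition kernel and reward function is an assumption of this snippet, and the value returned is unnormalized (multiply by (1 − γ) to match the normalization used in this post).

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Unnormalized discounted value V of a stationary policy in a finite MDP.

    P:  transition tensor of shape (S, A, S), P[s, a, t] = T(t | s, a)
    R:  reward matrix of shape (S, A)
    pi: policy matrix of shape (S, A), each row a distribution over actions
    Solves (I - gamma * P_pi) V = R_pi, where P_pi and R_pi are the
    policy-averaged transition matrix and one-step reward vector.
    """
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)  # S x S transition matrix under pi
    R_pi = np.einsum('sa,sa->s', pi, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```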
As well known, and are rational functions in for , therefore in this limit we have the Taylor expansions
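The expansions in question are the standard Laurent expansion of the discounted value around γ = 1; the symbols below follow standard MDP theory (gain g and bias h) and may differ from the post's own notation:

```latex
% Laurent expansion of the (unnormalized) discounted value of a stationary policy
V^{\pi}_{\gamma}(s) \;=\; \frac{g^{\pi}(s)}{1-\gamma} \;+\; h^{\pi}(s) \;+\; O(1-\gamma)
```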
Given any , we define recursively by
All MDPs will be assumed to be finite, so we drop the adjective "finite" from now on.
Definition 2
Let be an interface. An -universe is said to be an -realization of MDP with state function when and for any , and
We now define the relevant notion of a "good advisor."
Definition 3
Let be a universe and . A policy is said to be -sane for when there are , s.t. is an -realization of with state function and for any
i.
ii.
We can now formulate the regret bound.
Theorem 1
Fix an interface and . Consider for some (note that doesn't depend on ). Assume that for each , is an -sane policy for . Then, there is an -metapolicy s.t. for any
Corollary 1
Fix an interface and . Consider . Assume that for each , is a -sane policy for . Define . Then, is learnable.
Now, we deal with corrupt states.
Definition 4
Let be a universe and . A policy is said to be locally -sane for when there are , and (the set of uncorrupt states) s.t. is an -realization of with state function , and for any , if then
i. If and then .
ii.
iii.
iv.
Of course, this requirement is still unrealistic for humans in the real world. In particular, it makes the formalism unsuitable for modeling the use of AI for catastrophe mitigation (which is ultimately what we are interested in!) since it assumes the advisor is already capable of avoiding any catastrophe. In following work, we plan to relax the assumptions further.
Corollary 2
Fix an interface and . Consider for some . Assume that for each , is locally -sane for . For each , let be the corresponding set of uncorrupt states. Assume further that for any and , if and , then . Then, there is an -policy s.t. for any
Corollary 3
Assume the same conditions as in Corollary 2, except that may be countably infinite. Then, is learnable.
Appendix A
First, we prove an information theoretic bound that shows that for Thompson sampling, the expected information gain is bounded below by a function of the loss.
Proposition A.1
Consider a probability space , , a finite set and random variables , and . Assume that and . Then
Proof of Proposition A.1
We have
Using Pinsker's inequality, we get
Denote . We get
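For reference, the form of Pinsker's inequality used above bounds total variation distance by KL divergence:

```latex
% Pinsker's inequality for probability measures mu, nu on the same space
d_{\mathrm{TV}}(\mu,\nu) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\mu \,\|\, \nu)}
```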
Now, we describe a "delegation routine" that can transform any "proto-policy" that recommends some set of actions from into an actual -policy s.t. (i) with high probability, on each round, either a "safe" recommended action is taken, or all recommended actions are "unsafe", or delegation is performed, and (ii) the expected number of delegations is small. For technical reasons, we also need the modified routines , which behave the same way as except for some low-probability cases.
Proposition A.2
Fix an interface , , , . Consider some . Then, there exist and with the following properties. Given , we denote its projection to . Thus, . Given an -environment, , and , we can define as follows
We require that for every , and as above, the following conditions hold
i.
ii.
iii. For all , if then
iv. For all , if then
In order to prove Proposition A.2, we need another mutual information bound.
Proposition A.3
Consider , a finite set, , , , and . Suppose that for every
Then
Proof of Proposition A.3
We have
Define by
Let be s.t. and either or every has . For every , denote
If then either or there is s.t. and . This implies
We have
It follows that
Proof of Proposition A.2
We define , , , , , and recursively. For each , , , and , we require
Denote . Proposition A.3 implies that for any
This gives us condition i.
Condition ii follows because the only difference is in the equation for vs. . That is, may "discard" the "correct" element of but this happens with probability at most per discard and there are at most discards.
Conditions iii and iv are obvious from the definition.
Definition A.1
Consider and a universe that is an -realization of with state function . A policy is called -optimal for when for any
is called Blackwell optimal for when it is -optimal for any . Obviously, a stationary Blackwell optimal policy always exists. It is a standard result in MDP theory (up to straightforward adaptation to our formalism) that any Blackwell optimal policy has maximal expected utility for any sufficiently close to 1.
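For finite MDPs, a Blackwell optimal policy can be approximated by brute force: evaluate every deterministic stationary policy at a discount close to 1 and keep the best. This is a numerical heuristic for illustration, not the standard policy-iteration construction, and all names are invented:

```python
import itertools
import numpy as np

def blackwell_optimal_policy(P, R, gamma=0.9999):
    """Approximate a Blackwell optimal deterministic stationary policy
    by enumerating all such policies and evaluating each at a discount
    close to 1 (ties at smaller discounts are ignored in this sketch).

    P: transition tensor of shape (S, A, S); R: reward matrix (S, A).
    """
    S, A = R.shape
    best_pi, best_val = None, -np.inf
    for pi in itertools.product(range(A), repeat=S):
        pi = np.array(pi)
        P_pi = P[np.arange(S), pi]           # S x S transitions under pi
        R_pi = R[np.arange(S), pi]           # one-step rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        if V.sum() > best_val:               # compare total value over states
            best_pi, best_val = pi, V.sum()
    return best_pi
```

The brute-force enumeration is exponential in the number of states, so this is only sensible for tiny MDPs; its purpose is to make the definition concrete.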
The following proposition relates optimality in terms of expected utility to expected truncated utility, where truncated utility is defined by only summing rewards within a time duration .
Proposition A.4
Fix an MDP . Then, for any , , universe that is an -realization of with state function , a -optimal policy for and a Blackwell optimal policy for , we have
Proof of Proposition A.4
Let be a policy s.t. for any
By Proposition B.1
Using the Blackwell optimality of and 1-optimality of , we get that for
Denote , . Using again the Blackwell optimality of
Since both and are in particular 0-optimal, we have
It follows
The following shows that for any policy that doesn't make "irreversible errors," regret can be approximated by "episodic regret" for sufficiently large episode duration.
Proposition A.5
Consider a universe that is an -realization of with state function . Suppose that is a Blackwell optimal policy for and is a -optimal policy for . For any , let be a policy s.t. for any
Then, for any and
Proof of Proposition A.5
By Proposition B.1, for any
becomes Blackwell optimal after , therefore for ,
and coincide until , therefore
Denote , . We also use the shorthand notations , , . Both and are Blackwell optimal after , therefore
Since is 0-optimal, we get
Summing over , we get
Applying Proposition B.1 to the right hand side
Proof of Theorem 1
Fix , and . For each , suppose is an -realization of with state function and denote . To avoid cumbersome notation, whenever should appear as a subscript, we will replace it by . Let be a probability space. Let be a random variable and let the following be stochastic processes
We also define by
(The following conditions on and imply that the range of the above is indeed in .) Let and be as in Proposition A.2 (we assume w.l.o.g. that ). We construct , , , , , , and s.t. is uniformly distributed and for any , , and , denoting
Note that the last equation has the form of a Bayesian update, which is allowed to be arbitrary when the update is on "impossible" information.
We now construct the -policy s.t. for any , s.t. and
That is, we perform Thompson sampling at time intervals of size , moderated by the delegation routine , and discard from our belief state hypotheses whose probability is below and hypotheses whose sampling resulted in recommending "unsafe" actions, i.e. actions that the delegation routine refused to perform.
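The overall shape of the algorithm can be sketched on a toy example. Everything below is an illustrative stand-in for the construction in the proof: the hypothesis class is a finite set of Bernoulli bandits rather than MDPs, the "catastrophe test" is a placeholder, and all names are invented.

```python
import random

def delegative_thompson(hypotheses, true_idx, safe_action,
                        episodes=500, eps=1e-3, seed=0):
    """Toy sketch of Thompson sampling moderated by a delegation routine.

    hypotheses:  list of dicts mapping action -> success probability
    true_idx:    index of the hypothesis actually generating the data
    safe_action: the action the (trusted) advisor takes when delegated to
    Hypotheses whose posterior weight drops below eps are discarded,
    mirroring the belief-state pruning described above.
    """
    rng = random.Random(seed)
    posterior = {i: 1.0 / len(hypotheses) for i in range(len(hypotheses))}
    for _ in range(episodes):
        # Thompson step: sample a hypothesis, play its greedy action
        idx = rng.choices(list(posterior), weights=list(posterior.values()))[0]
        action = max(hypotheses[idx], key=hypotheses[idx].get)
        # Delegation step: if any live hypothesis deems the action
        # catastrophic (success probability 0 here), defer to the advisor
        if any(hypotheses[i][action] == 0.0 for i in posterior):
            action = safe_action
        reward = 1.0 if rng.random() < hypotheses[true_idx][action] else 0.0
        # Bayesian update with Bernoulli likelihood, then prune small weights
        for i in list(posterior):
            p = hypotheses[i][action]
            posterior[i] *= p if reward else (1.0 - p)
        total = sum(posterior.values())
        posterior = {i: w / total for i, w in posterior.items()
                     if w / total >= eps}
    return posterior
```

The point of the delegation test is that exploration never requires trying an action that some live hypothesis considers catastrophic; in those situations the agent falls back on the advisor, and the information gained still shrinks the hypothesis class.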
In order to prove has the desired property, we will define the stochastic processes , , , , and , each process of the same type as its shriekless counterpart (thus is constructed to accommodate them). These processes are required to satisfy the following:
For any , we construct the -policy s.t. for any , s.t. and
Given any -policy and -policy we define by
Here, is a constant defined s.t. the probabilities sum to 1. We define the -policy by
Condition iii of Proposition A.2 and condition i of -sanity for imply that for any
This means we can apply Proposition A.5 and get
Here, the -policy is defined as in Proposition A.5. We also define the -policies and by
Denote , , , , . For each , denote
We have
Condition iv of Proposition A.2 and condition ii of -sanity for imply that, given s.t.
Therefore, we can apply Proposition A.4 to the terms and get
We have
Thus, using condition i of Proposition A.2, we can bound the contribution of the terms and get
We denote
Define the random variables by
Averaging the previous inequality over , we get
We apply Proposition A.1 to each term in the sum over .
Condition ii of Proposition A.2 implies that
Here, the factor of 2 comes from the difference between the equations for and (we can construct an intermediate policy between and and use the triangle inequality for ). We conclude
Now we set , . Since as , we can use the approximation . We get
Proof of Corollary 1
Follows immediately from Theorem 1 and Proposition B.2.
Definition A.2
Consider an MDP and . The MDP is defined by
Proposition A.6
Consider an MDP and some . Suppose that for every there is s.t. . Then, for every and
Moreover, if is s.t. then
Proof of Proposition A.6
It is obvious that for any and , . Now consider any s.t. for any , and . Fix . For any , and therefore
On the other hand, is -optimal and therefore
We conclude that
This implies the equation for and the equation for is implied in turn.
Definition A.4
Fix an interface . Consider an -universe which is an -realization of with state function , and s.t. . Denote and . The -universe is defined by
It is easy to see that is an -realization of with state function defined by
Proposition A.7
Fix an interface . Consider and an -realization of with state function . Suppose that is a locally -sane policy for and is the corresponding set of uncorrupt states. Let be any -policy s.t. for any , . Then, is -sane for .
Proof of Proposition A.7
Without loss of generality, assume all states of are reachable from (otherwise, is an -realization of the MDP we get by discarding the unreachable states). Consider any . If then conditions i+ii of Definition 3 are trivial. Otherwise, we have and . We apply Proposition A.6 for (applicable by condition iv of Definition 4) and get that conditions i+ii of Definition 4 imply condition i of Definition 3, and conditions i+iii of Definition 4 imply condition ii of Definition 3.
Proof of Corollary 2
For every , denote . By Proposition A.7, is -sane for (where we abuse notation by arbitrarily extending to an -policy). Moreover, it is easy to see that all the share the same reward function . Apply Theorem 1 to . We get s.t. for every and
Obviously and therefore
On the other hand, by Proposition A.6
We conclude
Proof of Corollary 3
Follows immediately from Corollary 2 and Proposition B.2.
Appendix B
The following appeared before as Proposition A.5:
Proposition B.1
Given , and are defined s.t. .
Consider a universe , a policy and . Then,
The following appeared before as Proposition 2:
Proposition B.2
Fix an interface . Let be a countable set of meta-universes s.t. any finite is learnable. Then, is learnable.
2 comments
comment by michaelcohen (cocoa) · 2019-04-22T00:56:53.980Z · LW(p) · GW(p)
I did a search for "ergodic" and was surprised not to find it. Then I did a search for "reachable" and found this:
Without loss of generality, assumes all states of M are reachable from S(λ) (otherwise, υ is an O-realization of the MDP we get by discarding the unreachable states)
You could just be left with one state after that! If that's the domain that the results cover, that should be flagged. It seems to me like this result only applies to ergodic MDPs.
comment by michaelcohen (cocoa) · 2019-04-22T00:50:57.988Z · LW(p) · GW(p)
(as opposed to standard regret bounds in RL which are only applicable in the episodic setting)
??