Catastrophe Mitigation Using DRL (Appendices)
post by Vanessa Kosoy (vanessa-kosoy) · 2018-02-14T11:57:47.000Z
These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of the website's length limit.
## Appendix B
Given , we denote , .
# Proposition B.1
Consider a universe which is an -realization of an MDP with state function , a stationary policy , an arbitrary -policy , and some . Then,
# Proof of Proposition B.1
To keep the notation lighter, we will omit the parameter in functions that depend on it. We will use implicitly, i.e. given a function on and , . Finally, we will omit , using the shorthand notations , .
For any , it is easy to see that
Taking expected value over , we get
It is easy to see that the second term vanishes, yielding the desired result.
# Proposition B.2
Consider some , , a universe that is an -realization of with state function , a stationary policy and an arbitrary -policy . For any , let be an -policy s.t. for any
Assume that
i. For any
ii. For any and
Then, for any ,
# Proof of Proposition B.2
To keep the notation lighter, we will use implicitly, i.e. given a function on and , . Also, we will omit , using the shorthand notations , .
By Proposition B.1, for any
coincides with after , therefore the corresponding expected values vanish.
Subtracting the equalities for and , we get
and coincide until , therefore
Denote , . We also use the shorthand notations , , . Both and coincide with after , therefore
Denote . By the mean value theorem, for each there is s.t.
It follows that
Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.
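For reference, the mean value theorem is applied here in its standard single-variable form:

```latex
% Mean value theorem (standard single-variable statement):
% if f is continuous on [a,b] and differentiable on (a,b),
% then there exists c in (a,b) such that
\[
  f(b) - f(a) = f'(c)\,(b - a)
\]
```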
It is easy to see that assumption i implies that is a submartingale for (whereas it is a martingale for ) and therefore
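The inequality obtained from the submartingale property is the standard consequence of iterating the defining condition and taking expectations (for a martingale it holds with equality):

```latex
% If (X_n) is a submartingale adapted to a filtration (F_n), i.e.
% E[X_{n+1} | F_n] >= X_n almost surely, then iterating the tower
% rule gives, for any m <= n,
\[
  \mathbb{E}\left[X_n \mid \mathcal{F}_m\right] \ge X_m
  \quad\Longrightarrow\quad
  \mathbb{E}[X_n] \ge \mathbb{E}[X_m]
\]
```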
We get
Summing over , we get
Applying Proposition B.1 to the right hand side
# Proof of Lemma A.1
Fix , and . Denote . To avoid cumbersome notation, whenever should appear as a subscript, we will replace it by . Let be a probability space. Let be a random variable and the following be stochastic processes
We also define by
(The following conditions on and imply that the range of the above is indeed in .) Let and be as in Proposition C.1 (we assume w.l.o.g. that ). We construct , , , , , , and s.t. is uniformly distributed and for any , , and , denoting
Note that the last equation has the form of a Bayesian update, which is allowed to be arbitrary when the update is on "impossible" information.
We now construct the -policy s.t. for any , s.t. and
That is, we perform Thompson sampling at time intervals of size , moderated by the delegation routine , and discard from our belief state both hypotheses whose probability is below and hypotheses whose sampling resulted in recommending "unsafe" actions, i.e. actions that refused to perform.
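The belief-state dynamics described above can be sketched in toy form. Everything here is an illustrative stand-in: the names `prune`, `thompson_sample`, and `run` are hypothetical, and the delegation routine's veto is modelled crudely by a fixed `unsafe` set of hypotheses, whereas the actual construction delegates to the routine from Proposition C.1 and operates on policies rather than a finite hypothesis list.

```python
import random


def prune(beliefs, threshold):
    """Discard hypotheses whose posterior probability fell below the
    threshold, then renormalize the surviving beliefs."""
    kept = {h: p for h, p in beliefs.items() if p >= threshold}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}


def thompson_sample(beliefs, rng):
    """Thompson-sampling step: draw a hypothesis with probability
    proportional to its current posterior mass."""
    hyps = list(beliefs)
    weights = [beliefs[h] for h in hyps]
    return rng.choices(hyps, weights=weights, k=1)[0]


def run(beliefs, unsafe, threshold, steps, rng):
    """Repeatedly sample a hypothesis; if its recommended action is
    vetoed (modelled by membership in `unsafe`), discard it from the
    belief state; after every step also discard low-probability
    hypotheses and renormalize."""
    beliefs = dict(beliefs)
    for _ in range(steps):
        h = thompson_sample(beliefs, rng)
        if h in unsafe:
            # Sampling this hypothesis led to a vetoed action,
            # so it is removed from the belief state entirely.
            del beliefs[h]
            total = sum(beliefs.values())
            beliefs = {k: p / total for k, p in beliefs.items()}
        beliefs = prune(beliefs, threshold)
    return beliefs
```

Note that both discard rules only ever shrink the support of the belief state, which is what makes the bookkeeping in the proof below tractable.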
In order to prove that has the desired property, we will define the stochastic processes , , , , and , each process of the same type as its shriekless counterpart (thus is constructed to accommodate them). These processes are required to satisfy the following:
For any , we construct the -policy s.t. for any , s.t. and
Given any -policy and -policy we define by
Here, is a constant defined s.t. the probabilities sum to 1. We define the -policy by
Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any
This means we can apply Proposition B.2 and get
Here, the -policy is defined as in Proposition B.2. We also define the -policies and by
Denote
For each , denote
We have
Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given s.t.
Therefore, , and we remain with
We have
Since , it follows that
Using condition i of Proposition C.1, we conclude
Define the random variables by
Averaging the previous inequality over , we get
We apply Proposition C.2 to each term in the sum over .
Condition ii of Proposition C.1 implies that
Here, the factor of 2 comes from the difference between the equations for and (we can construct an intermediate policy between and and use the triangle inequality for ). We conclude
Now we set
Without loss of generality, we can assume that (because of the form of the bound we are proving), which implies that and . We get
## Appendix C
The following is a simple special case of what appeared as "Proposition A.2" in the previous essay, where we restrict to be single-valued (the more general case isn't needed).
# Proposition C.1
Fix an interface , , , . Consider some . Then, there exist and with the following properties. Given , we denote its projection to . Thus, . Given an -environment, , and , we can define as follows
We require that for every , and as above, the following conditions hold
i.
ii.
iii. For all , if then
iv. For all , if then
The following appeared in the previous essay as "Proposition A.1".
# Proposition C.2
Consider a probability space , , a finite set and random variables , and . Assume that and . Then