Dimensional regret without resets
post by Vanessa Kosoy (vanessa-kosoy) · 2018-11-16T19:22:32.551Z · LW · GW
TLDR: I derive a variant of the RL regret bound by Osband and Van Roy (2014) that applies to learning without resets in environments without traps. The advantage of this regret bound over those known in the literature is that it scales with certain learning-theoretic dimensions rather than the number of states and actions. My goal is to build on this result to derive this type of regret bound for DRL and, later on, for other settings interesting from an AI alignment perspective.
Previously [AF · GW] I derived a regret bound for deterministic environments that scales with prior entropy and "prediction dimension". That bound behaves as in the episodic setting but only as in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that are even weaker (have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy's technique instead, leading to an bound in the stochastic setting without resets. This bound doesn't scale down with the entropy of the prior, but this does not seem as important as dependence on .
Results
Russo and Van Roy (2013) introduced the concept of "eluder dimension" for the benefit of the multi-armed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.
Given a real vector space , we will use to denote the set of positive semidefinite bilinear forms on and to denote the set of positive definite bilinear forms on . Given a bilinear form , we will slightly abuse notation by also regarding it as a linear functional . Thereby, we have and . Also, if is finite-dimensional and is non-degenerate, we will denote the unique bilinear form which satisfies
Definition 1
Consider a set , a real vector space , some and a family . Consider also , a sequence and . is said to be -dependent on when, for any
Otherwise, is said to be -independent of .
Definition 2
Consider a set , a real vector space and some . The Russo-Van Roy-Osband dimension (RVO dimension) is the supremum of the set of for which there is and s.t. for all , is -independent of .
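For intuition, here is a small Python sketch of an eluder-style dimension computation in the spirit of the above (and of Russo and Van Roy's original definition). The aggregate norm, the representation of hypotheses as dicts, and the brute-force search are illustrative simplifications of mine, not the exact definition.

```python
import itertools
import math

# Hypotheses are represented as dicts mapping domain points to real values.

def is_eps_dependent(x, history, hypotheses, eps):
    """x is eps-dependent on `history` if every pair of hypotheses that are
    within eps of each other in aggregate on `history` also agrees at x up to eps."""
    for f, g in itertools.combinations(hypotheses, 2):
        hist_gap = math.sqrt(sum((f[z] - g[z]) ** 2 for z in history))
        if hist_gap <= eps and abs(f[x] - g[x]) > eps:
            return False  # witness pair found: x is eps-independent of history
    return True

def rvo_dimension_estimate(domain, hypotheses, eps):
    """Length of the longest sequence in which each element is eps-independent
    of its predecessors (brute force; only feasible for tiny examples)."""
    best = 0

    def extend(seq):
        nonlocal best
        best = max(best, len(seq))
        for x in domain:
            if not is_eps_dependent(x, seq, hypotheses, eps):
                extend(seq + [x])

    extend([])
    return best

# Two hypotheses on a 3-point domain that disagree only at one point:
H = [{0: 0.0, 1: 0.0, 2: 0.0}, {0: 1.0, 1: 0.0, 2: 0.0}]
print(rvo_dimension_estimate([0, 1, 2], H, eps=0.5))  # prints 1
```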
We have the following basic bounds on RVO dimension.
Proposition 1
Consider a set , a real vector space and some . Then
Proposition 2
Consider a set , a real vector space and some . Then
Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.
Definition 3
Consider a set , a real vector space , some and a family . A set is said to be a -covering of when
Definition 4
Consider a set , a real vector space , some and a family . The covering number is the infimum of the set of for which there is a -covering of of size .
Definition 5
Consider a finite set , a finite-dimensional real vector space and some . Fix any . The Minkowski–Bouligand dimension (MB dimension) of is defined by
It is easy to see the above is indeed well-defined, i.e. doesn't depend on the choice of . This is because given any , there are constants s.t. for all and
Similarly, for any , we have
For finite and , it's obvious that , and in particular . It is also possible to show that, for any bounded , .
Note that, in general, MB dimension is fractional.
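For a concrete illustration of covering numbers and the MB dimension, here is a sketch that builds a greedy -covering and estimates the dimension from the slope of the log covering number against the log inverse radius. The greedy construction, the metric, and the sample points are illustrative choices of mine, not part of the definitions above.

```python
import math

def greedy_cover(points, eps, dist):
    """Greedy eps-net: every point ends up within eps of some center."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

def mb_dimension_estimate(points, dist, eps_values):
    """Estimate dim_MB via log N(eps) / log(1/eps) at several radii eps < 1."""
    estimates = []
    for eps in eps_values:
        n = len(greedy_cover(points, eps, dist))
        if n > 1:
            estimates.append(math.log(n) / math.log(1.0 / eps))
    return estimates

# Example: points on a line segment in R^2 should give estimates near 1.
pts = [(t / 1000.0, 0.5 * t / 1000.0) for t in range(1001)]
sup_dist = lambda a, b: max(abs(a[0] - b[0]), abs(a[1] - b[1]))
print(mb_dimension_estimate(pts, sup_dist, [0.1, 0.03, 0.01]))
```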
We will need yet another (but rather simple) notion of "dimension".
Definition 6
Consider a set , a vector space and some . The local dimension of is defined by
Obviously and .
Consider finite non-empty sets (states) and (actions). Observe that can be regarded as a subset of the vector space . This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels (in this case and ).
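As a rough illustration of how a hypothesis class of transition kernels sits inside a vector space, the sketch below flattens each kernel into an array and computes one plausible reading of a "local" dimension, namely the largest dimension spanned by the next-state distributions the hypotheses allow at a single state-action pair. This reading is my assumption for illustration, not necessarily the exact quantity in Definition 6.

```python
import numpy as np

# A hypothesis is a transition kernel stored as an array of shape (S, A, S),
# where kernel[s, a] is the next-state distribution.  Viewing each kernel as a
# point in R^{S*A*S} is the embedding referred to in the text.

def local_dimension_estimate(kernels):
    """Hypothetical reading of 'local dimension': for each (s, a), take the
    rank of the (centered) set of next-state distributions the hypotheses
    allow there, and maximize over (s, a)."""
    S, A, _ = kernels[0].shape
    worst = 0
    for s in range(S):
        for a in range(A):
            rows = np.stack([T[s, a] for T in kernels])        # shape (H, S)
            centered = rows - rows.mean(axis=0, keepdims=True)
            worst = max(worst, np.linalg.matrix_rank(centered))
    return worst

# Two hypotheses over 3 states and 2 actions that only disagree at (s=0, a=0):
T1 = np.full((3, 2, 3), 1.0 / 3)
T2 = T1.copy()
T2[0, 0] = [0.8, 0.1, 0.1]
print(local_dimension_estimate([T1, T2]))  # prints 1: disagreement spans one direction
```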
We can now formulate the regret bound.
Theorem 1
There is some s.t. the following holds.
Consider any finite non-empty sets and , , closed set and Borel probability measure on (prior). We define the maximal bias span by
Denote , and . Then, there is a family of policies s.t.
Here (like in previous essays), is the value function for transition kernel , reward function , state and geometric time discount parameter ; is the expected utility for policy ; is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.
The implicit condition implies that for any and , . This is a no-traps condition stronger than the condition we used before: not only that long-term value cannot be lost in expectation, it cannot be lost at all. For any finite satisfying this no-traps condition, we have
A few directions for improving on this result:
- It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed (rather than considering the limit), but its form is somewhat convoluted.
- It is probably possible to get an anytime policy with this form of regret, using PSRL with dynamic episode duration.
- It is interesting to try to make do with the weaker no-traps condition , especially since a stronger no-traps condition would translate to a stronger condition on the advisor in DRL.
- It is interesting to study RVO dimension in more detail. For example, I'm not even sure whether Proposition 2 is the best possible bound in terms of , or whether it's, e.g., possible to get a linear bound.
- It seems tempting to generalize local dimension by allowing the values of to lie on some nonlinear manifold of given dimension for any given . This approach, if workable, might require a substantially more difficult proof.
- The "cellular decision processes" discussed previously [AF · GW] in "Proposition 3" have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is as a continuous-time process in which each cell has a certain rate of changing its state. EDIT (2019-07-06): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are "decided" one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such "isomorphisms" for general MDPs and (ii) derive bounds on RVO dimension for stochastic analogues of CDPs (possibly defined using Markov random fields).
Proofs
Proof of Proposition 1
Consider some and which is -independent of . Then, there are some s.t. the following two inequalities hold
The first inequality implies that, for any , . Comparing with the second inequality, we conclude that .
Now suppose that is s.t. for all , is -independent of . By the previous paragraph, it follows that for any s.t. . Hence, , Q.E.D.
Proof of Proposition 2
Suppose that is s.t. for all , is -independent of . Then, for each , we can choose s.t. the following two inequalities hold
In particular, the second inequality implies that .
Now, consider some . By the first inequality
On the other hand, the second inequality with instead of gives
It follows that . Therefore, cannot exceed the number of unordered pairs of distinct elements in , Q.E.D.
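For concreteness, the counting step at the end uses the standard fact that a set with $n$ elements has

$$\binom{n}{2} \;=\; \frac{n(n-1)}{2}$$

unordered pairs of distinct elements.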
Given a measurable space , some and a measurable function , we will use the notation
Given measurable spaces, some and a measurable function , we will use the notation
Given finite sets , some and , we will use the notation defined by
Proposition A.1
In the setting of Theorem 1, fix and . Let be a Borel measurable mapping s.t. is an optimal policy, i.e.
Let be the policy implemented by a PSRL algorithm with prior and episode length , s.t. whenever hypothesis is sampled, policy is followed. Denote the Bayesian regret by
Let be a probability space governing the uncertainty about the true hypothesis, the stochastic behavior of the environment, and the random sampling inside the algorithm. [See the proof of "Lemma 1" in this [AF · GW] previous essay or the proof of "Theorem 1" in another previous essay.] Furthermore, let be a random variable representing the true hypothesis, be the random variables s.t. represents the hypothesis sampled at time (i.e. during episode number ), be random variables s.t. represents the state at time and be random variables s.t. represents the action taken at time . Denote and . Then
Here, means raising to the -th power w.r.t. Markov kernel composition, and the same for .
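For readers who want the algorithmic picture, here is a minimal sketch of the PSRL loop described above, for a finite hypothesis class: sample a hypothesis from the posterior at the start of each episode, follow a policy optimal for that hypothesis, and update the posterior from observed transitions, with no resets between episodes. The environment interface (`env.initial_state`, `env.step`) and `optimal_policy` are hypothetical placeholders, not part of the proposition.

```python
import numpy as np

def psrl(env, kernels, prior_weights, optimal_policy, episode_length,
         num_episodes, rng):
    """Minimal PSRL loop over a finite hypothesis class (a sketch).
    `kernels[i][s, a, s_next]` is the transition probability under hypothesis i;
    `optimal_policy(kernel)` returns a map s -> a optimal for that kernel and
    the fixed reward function.  The state carries over between episodes."""
    weights = np.array(prior_weights, dtype=float)
    s = env.initial_state
    for _ in range(num_episodes):
        # Posterior sampling: draw one hypothesis for the whole episode.
        i = rng.choice(len(kernels), p=weights / weights.sum())
        policy = optimal_policy(kernels[i])
        for _ in range(episode_length):
            a = policy[s]
            s_next = env.step(s, a)
            # Bayesian update: reweight each hypothesis by the likelihood
            # it assigns to the observed transition (s, a) -> s_next.
            weights *= np.array([T[s, a, s_next] for T in kernels])
            s = s_next
    return weights / weights.sum()
```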
We will use the notation to stand for the expected utility achieved by policy when starting from state with transition kernel , reward function and time discount parameter .
Proof of Proposition A.1
For any , define as follows.
That is, is a policy that follows for time and afterwards.
In the following, we use the shorthand notation
It is easy to see that
The above is essentially what appeared as "Proposition B.1" before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.
By definition, and have the same distribution even when conditioned by the history up to . Therefore
It follows that
We now prove by induction on that
For this is a tautology. For any , the Bellman equation says that
It follows that
Since is exactly the policy followed by PSRL at time , we get
We now subtract and add , and use the fact that is the conditional distribution of .
Applying this identity to the second term on the right hand side of the induction hypothesis, we prove the induction step. For , we get, denoting and
Clearly
Using the definition of PSRL, we can exchange the true and sampled hypotheses and get
It follows that
Applying this to each term in the earlier expression for , we get the desired result. Q.E.D.
Proposition A.2
Consider a real finite-dimensional normed vector space and a linear subspace . Then, there exists s.t.
- For any ,
- For any ,
Proof of Proposition A.2
We assume since otherwise the claim is trivial (take ).
By Theorem B.1 (see Appendix), there is s.t. for any
By Corollary B.1 (see Appendix), there is a projection operator s.t. and . We define by
For any , we have
For any , we have and therefore
Q.E.D.
Proposition A.3
Consider a finite-dimensional real vector space , some and a Borel probability measure s.t. . Let and . Then, is -sub-Gaussian w.r.t. , i.e., for any
Proof of Proposition A.3
By isomorphism, it is sufficient to consider the case , , for some and . For this form of it is sufficient to consider the case . It remains to show that
Here, we assume that .
We consider separately the cases and . In the first case
In the second case, we use that for any
The above holds because, at , the left hand side and the right hand side have the same value and first derivative, and for any , the second derivative of the left hand side is less than the second derivative of the right hand side. We get
Q.E.D.
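For comparison, the standard Hoeffding lemma (a known fact, though not necessarily with the exact constant appearing above) states that if $X \in [a,b]$ almost surely and $\mathbb{E}[X] = 0$, then for all $\lambda \in \mathbb{R}$

$$\mathbb{E}\left[e^{\lambda X}\right] \;\le\; \exp\!\left(\frac{\lambda^2 (b-a)^2}{8}\right),$$

i.e. $X$ is $\frac{b-a}{2}$-sub-Gaussian.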
Definition A.1
Consider a set , a finite-dimensional real vector space , some and a family . Assume is compact w.r.t. the product topology on . Consider also some , and . We then use the notation
I chose the notation as an abbreviation of "least squares" and as an abbreviation of "confidence set". Note that is somewhat ambiguous (and therefore, so is ) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).
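As a sketch of these two objects for a finite hypothesis class, here is the standard construction the names suggest: a least-squares hypothesis on past data, and a confidence set of hypotheses whose empirical loss is within a threshold of the minimum. The dict representation of hypotheses and the threshold parameter `beta` are illustrative assumptions.

```python
def squared_loss(f, data):
    """Empirical squared loss of hypothesis f on data = [(x, y), ...]."""
    return sum((f[x] - y) ** 2 for x, y in data)

def least_squares(hypotheses, data):
    """LS: a hypothesis minimizing the empirical squared loss (ties broken arbitrarily)."""
    return min(hypotheses, key=lambda f: squared_loss(f, data))

def confidence_set(hypotheses, data, beta):
    """CS: all hypotheses whose loss exceeds the minimal loss by at most beta."""
    best = squared_loss(least_squares(hypotheses, data), data)
    return [f for f in hypotheses if squared_loss(f, data) <= best + beta]
```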
Proposition A.4
There is some s.t. the following holds.
Consider finite sets , some and a family . Assume that for any and , . Let be the canonical filtration, i.e.
Consider also and s.t. for any , , and
Fix , . Denote
Then,
Proof of Proposition A.4
For each , choose a finite set and a linear mapping s.t. . Let . Define by
Define by
Let and . By Proposition A.3, is -sub-Gaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant s.t. the term in as defined in Proposition B.1 is absorbed by the term (note that since we substitute , and ). Q.E.D.
Definition A.2
Consider a set , some , a real vector space , some and a family . The -width of at is defined by
Proposition A.5
There is some s.t. the following holds.
Consider a set , a real vector space , some and a family . Consider also some , , , , , and . Denote
For any , define by
Assume that . Then,
Proof of Proposition A.5
For each , choose a finite set and a linear mapping s.t. . Let . Define by
Define also by
Applying Proposition B.2 (see Appendix) to the tilde objects, we conclude that for any
Multiplying the inequality by and summing over , we get
On the left hand side, we sum the geometric series. On the right hand side, we use the observation that
Here, we used that is an increasing function for and . We get
The integral on the right hand side is a Laplace transform (where the dual variable is ). The functions , and all have the property that, for any
Indeed, we have
Here, is the Euler-Mascheroni constant.
We conclude
Using the condition and absorbing factors into the definition of , we get
Since , this implies
Q.E.D.
Proposition A.6
There is some s.t. the following holds.
Consider a set , a real vector space , some and a family . Assume that for any and , . Consider also some , , , , and . Define and the same way as in Proposition A.5. Denote . Assume and . Then,
Proof of Proposition A.6
Due to the assumption , we have . For any , we have
Applying Proposition A.5 to the right hand side
Evaluating the integral and dropping some negative terms on the right hand side, we get
We now set to be
For an appropriate choice of , it follows that
Q.E.D.
Proof of Theorem 1
We take where is as in Proposition A.1 and will be specified later. Denote the Bayesian regret by
By Proposition A.1
We will use the notation
It follows that
Consider equipped with the norm. Given and , consider also the subspace
By Proposition A.2, there is s.t.
- For any
- For any
We now apply Proposition A.4 with and . We get
Here, was defined in Proposition A.4.
Since the hypothesis is sampled from the posterior, for any we also have
Denote
We get
Using property 2 of B
Denote
Clearly
Using this inequality, dropping the (since it can only make the right hand side smaller) and moving the sum inside the expected value, we get
We will now assume (if , then and ) and . Applying Proposition A.6, we conclude
Denote the second factor on the right hand side , so that the inequality becomes
We set
Now, we analyze the limit. In this limit, the expression for justifies our assumption that . Indeed, we have
Using the previous inequality for , we get
We assume , otherwise the theorem is vacuous. Multiplying the numerator and denominator on the right hand side by , we get
We will now analyze the contribution of each term in to the limit on the right hand side (using the fact that ). We ignore multiplicative constants that can be absorbed into .
The first term gives (using our choice of )
The second term gives (using the definitions of and our choices of and )
We analyze each subterm separately. The first subterm gives
The second subterm gives
The third subterm gives
The third and fourth terms give
To analyze the fifth term, observe that . Hence
Putting everything together, we observe that the expression dominates all the terms (up to a multiplicative constant). Here, we use that is an integer and hence . We conclude
Q.E.D.
Appendix
The following was proved in John (1948), available in the collection Giorgi and Kjeldsen (2014) (the theorem in question is on page 214).
Theorem B.1 (John)
Consider a real finite-dimensional normed vector space . Then, there exists s.t. for any
The following is from Kadets and Snobar (1971).
Theorem B.2 (Kadets-Snobar)
Consider a real Banach space and a finite-dimensional linear subspace . Then, for any , there is a projection operator s.t. and .
Corollary B.1
Consider a real finite-dimensional normed vector space and a linear subspace . Then, there is a projection operator s.t. and .
Proof of Corollary B.1
Let be the vector space of linear operators . Define
is an affine subspace of and is a compact set. Define by . is a continuous function. By Theorem B.2, . Since is compact, attains its infimum at some , and we have , Q.E.D.
The next proposition appeared (in slightly greater generality) in Osband and Van Roy as "Proposition 5". We will use to denote the form corresponding to the identity matrix.
Proposition B.1 (Osband-Van Roy)
There is some s.t. the following holds.
Consider a finite set , the vector space for some and some . Let be the canonical filtration, i.e.
Consider also and s.t. for any and
Assume that is s.t. with -probability for all . Assume also that is s.t. is -sub-Gaussian, i.e., for any , and
Fix , . Define by
Then,
Note that we removed a square root in the definition of compared to equation (7) in Osband and Van Roy. This only makes larger (up to a constant factor) and therefore, only makes the claim weaker.
The proposition below appeared in Osband and Van Roy as "Lemma 1".
Proposition B.2 (Osband-Van Roy)
Consider a set , the vector space for some and some . Consider also some , , , , and a nondecreasing sequence . For any , define by
Then,