Reinforcement learning with imperceptible rewards


post by Vanessa Kosoy (vanessa-kosoy) · 2019-04-07T10:27:34.127Z

  Background
  Results
    Partially Observable MDPs with Imperceptible Rewards
      Example 1
    Instrumental States and Reward Functions
      Definition 1
      Definition 2
      Definition 3
      Definition 4
      Definition 5
      Definition 6
      Definition 7
      Proposition 1
      Proposition 2
      Definition 8
      Definition 9
      Definition 10
      Definition 11
      Definition 12
      Proposition 3
      Proposition 4
      Definition 13
      Definition 14
      Theorem 1
      Definition 15
      Definition 16
      Proposition 5
    A Regret Bound for RL with Instrumental Rewards
      Theorem 2
  Discussion
    More on the Regret Bound
    The Death of the Agent and Kamikaze Strategies
    Specifying Instrumental Reward Functions
  Proofs
      Proof of Proposition 1
      Proposition A.1
      Proof of Proposition A.1
      Proof of Proposition 2
      Proposition A.2
      Proof of Proposition A.2
      Proposition A.3
      Proof of Proposition A.3
      Proposition A.4
      Proof of Proposition A.4
      Proof of Proposition 3
      Proof of Proposition 4
      Proposition A.5
      Proof of Proposition A.5
      Proposition A.6
      Proof of Proposition A.6
      Proposition A.7
      Proof of Proposition A.7
      Proposition A.8
      Proof of Proposition A.8
      Proposition A.9
      Proof of Proposition A.9
      Proposition A.10
      Proof of Proposition A.10
      Proposition A.11
      Proof of Proposition A.11
      Proof of Theorem 1
      Proposition A.12
      Proof of Proposition A.12
      Proof of Proposition 5
      Proposition A.13
      Proof of Proposition A.13
      Proposition A.14
      Proof of Proposition A.14
      Proposition A.15
      Proof of Proposition A.15
      Proposition A.16
      Proof of Proposition A.16
      Proposition A.17
      Proof of Proposition A.17
      Proposition A.18
      Proof of Proposition A.18
      Proposition A.19
      Proof of Proposition A.19
      Proposition A.20
      Proof of Proposition A.20
      Proposition A.21
      Proof of Proposition A.21
      Proof of Theorem 2
  Appendix
      Definition B.1
      Definition B.2
      Proposition B.1
      Proof of Proposition B.1

TLDR: We define a variant of reinforcement learning in which the reward is not perceived directly, but can be estimated at any given moment by some (possibly costly) experiment. The reward function is no longer a function of the observation history, but a different object that we call "instrumental reward function". We give two definitions of instrumental reward function and prove their equivalence. We also derive a regret bound for this setting.

Background

In "classical" reinforcement learning the agent perceives the reward signal on every round of its interaction with the environment, whether through a distinct input channel or through some given way to compute the reward from the interaction history so far. On the other hand, we can rather easily imagine agents that optimize properties of their environment that they do not directly perceive. For example, if Alice, who lives in Dominica, donates money to the Against Malaria Foundation in order to save someone in Africa, then the result is usually not visible to Alice at the time it occurs, if ever. Similarly, Clippy the paperclip maximizer doesn't always perceive all the paperclips in the universe. Moreover, we might want to design agents that, in order to estimate the reward, direct queries to humans (which is costly and cannot be done continuously non-stop).

Now, it is possible to define the perceived reward as the subjective expected value of the "true" imperceptible reward (see the Results section for details). Although this transformation preserves expected utility, it does not preserve Bayesian regret. Indeed, Bayesian regret is the difference between the expected utility attained by the agent and the expected utility attained by a "reference" agent that knows the true environment from the onset. However, after the transformation, the reference agent will behave as if it knows the observable dynamics of the true environment but still pretends not to know the true environment for the purpose of computing the reward. Therefore, regret analysis requires us to consider the imperceptible reward honestly. In fact, as we will see, certain assumptions about the imperceptible reward function are needed even to make the problem learnable at all. Finally, this transformation makes the reward function more complex and hence applying a "generic" reinforcement learning algorithm after the transformation (instead of exploiting the special form of the resulting reward function) might carry a significant computational complexity penalty.

Related Work: De Blanc 2011 studies so-called "ontological crises". That is, de Blanc examines the problem of translating a reward function from one ontology into another. Here, we avoid this problem by considering reward functions that are automatically defined in all ontologies. That said, it might still be interesting to think about how to specify our type of reward function starting from a reward function that is only defined in one particular ontology. We will return to this in the Discussion section.

Krueger et al 2016 consider a setting where querying the reward is instantaneous and has a fixed cost. This is a special case of our setting, but we allow a much more general type of query and cost. Also, Krueger et al don't derive any theoretical regret bound (but they do present some experimental results).

Finally, multi-armed bandits with partial monitoring are closely related, see for example Bartok et al 2013. However, bandits by definition assume a stateless environment, and also our approach is rather different than what is usually studied in partial monitoring.

The literature study was very cursory and I will be glad to know about prior work I missed!

Results

Partially Observable MDPs with Imperceptible Rewards

Partially Observable Markov Decision Processes (POMDPs) serve as a natural starting point for thinking about imperceptible rewards. Indeed, it might seem like all we have to do is consider RL in a POMDP environment and let the reward be a function of the (imperceptible) state. However, this setting is in general unlearnable even given assumptions that rule out traps.

A (finite) POMDP is defined by non-empty finite sets $S$ (states), $A$ (actions) and $O$ (observations), the transition kernel $T : S \times A \xrightarrow{k} S \times O$ and the reward function $R : S \to [0, 1]$ . As opposed to the "classical" definition of POMDP, we allow the value of $R$ to be imperceptible. The perceptible setting is a special case of the imperceptible setting: we can always encode the reward into the observation.

To formulate a learning problem, we assume $S$ , $A$ and $O$ to be fixed and an initial state $s_{0} \in S$ to be given, while $T$ and $R$ are unknown and belong to some hypothesis class $H$ :

$H \subseteq \{S \times A \xrightarrow{k} S \times O\} \times \{S \to [0, 1]\}$

Such a problem can be unlearnable even if $R$ is known and there are no irreversible events that can happen:

Example 1

Suppose that $S = \{s_{0}, s_{-}, s_{+}\}$ , $A = \{a_{-}, a_{+}\}$ , $O = \{⊥\}$ , $R (s_{0}) = \frac{1}{2}$ , $R (s_{-}) = 0, R (s_{+}) = 1$ and $H = \{(T_{-}, R), (T_{+}, R)\}$ where for any $s \in S$

$T_{-} (s_{+}, ⊥ | s, a_{-}) = 1$ $T_{-} (s_{-}, ⊥ | s, a_{+}) = 1$ $T_{+} (s_{+}, ⊥ | s, a_{+}) = 1$ $T_{+} (s_{-}, ⊥ | s, a_{-}) = 1$

Since $| O | = 1$ , there is no way to gain any information about which hypothesis is correct. Moreover, the optimal policies for the two hypotheses are different. Namely, for $T_{+}$ we should always take action $a_{+}$ and for $T_{-}$ we should always take action $a_{-}$ .
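The claim that no information can be gained is easy to check numerically. The following is a minimal sketch, not part of the formal development: the function names, the dictionary encoding of the kernels and the particular prior are all assumptions made for illustration. It runs an exact Bayesian update over the two hypotheses of Example 1 and shows that the posterior equals the prior after an arbitrary action sequence.

```python
from fractions import Fraction

def make_kernel(sign):
    """Kernel T_sign of Example 1, encoded as (s, a) -> {(s', o): probability}."""
    def T(s, a):
        # Under T_+ the action a+ leads to s+ and a- leads to s-; under T_- the roles swap.
        s_next = "s+" if (a == "a+") == (sign == "+") else "s-"
        return {(s_next, "⊥"): Fraction(1)}
    return T

def observation_likelihood(T, actions, observations, s0="s0"):
    """Probability of the observation sequence given the action sequence under T."""
    belief = {s0: Fraction(1)}
    likelihood = Fraction(1)
    for a, o in zip(actions, observations):
        unnormalized = {}
        for s, p in belief.items():
            for (s_next, o_next), q in T(s, a).items():
                if o_next == o:
                    unnormalized[s_next] = unnormalized.get(s_next, Fraction(0)) + p * q
        step = sum(unnormalized.values(), Fraction(0))
        if step == 0:
            return Fraction(0)
        likelihood *= step
        belief = {s: w / step for s, w in unnormalized.items()}
    return likelihood

def posterior(prior, actions, observations):
    """Exact Bayesian posterior over the hypotheses '+' and '-'."""
    weights = {sign: p * observation_likelihood(make_kernel(sign), actions, observations)
               for sign, p in prior.items()}
    total = sum(weights.values(), Fraction(0))
    return {sign: w / total for sign, w in weights.items()}

# Hypothetical prior and action sequence, chosen only for the demonstration.
prior = {"+": Fraction(1, 3), "-": Fraction(2, 3)}
actions = ["a+", "a-", "a-", "a+", "a+"]
observations = ["⊥"] * len(actions)
print(posterior(prior, actions, observations))
```

Since every history has likelihood 1 under both hypotheses, the same holds for any other prior and any other action sequence.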

To formalize and illustrate the discussion in the Background section, suppose $H$ is Borel and $ζ \in Δ H$ is the prior. We can then define the "perceived effective reward" $E R : {(A \times O)}^{*} \to [0, 1]$ by

$E R (a o) := E_{\substack{(T, R) \sim ζ \\ (s_{n + 1}, o_{n}^{'}) \sim T (s_{n}, a_{n})}} [R (s_{| a o |}) \mid o^{'} = o]$

It is then easy to see that the $E$ operator preserves expected utility: given any policy $π : {(A \times O)}^{*} \xrightarrow{k} A$ and $m \in N$

$E_{\substack{(T, R) \sim ζ \\ a_{n} \sim π (a o_{: n}) \\ (s_{n + 1}, o_{n}) \sim T (s_{n}, a_{n})}} [E R (a o_{: m})] = E_{\substack{(T, R) \sim ζ \\ a_{n} \sim π (a o_{: n}) \\ (s_{n + 1}, o_{n}) \sim T (s_{n}, a_{n})}} [R (s_{m})]$

and therefore, for any geometric time discount parameter $γ \in [0, 1)$

$E_{\substack{(T, R) \sim ζ \\ a_{n} \sim π (a o_{: n}) \\ (s_{n + 1}, o_{n}) \sim T (s_{n}, a_{n})}} \left[\sum_{n = 0}^{\infty} γ^{n} E R (a o_{: n})\right] = E_{\substack{(T, R) \sim ζ \\ a_{n} \sim π (a o_{: n}) \\ (s_{n + 1}, o_{n}) \sim T (s_{n}, a_{n})}} \left[\sum_{n = 0}^{\infty} γ^{n} R (s_{n})\right]$

On the other hand, Bayesian regret is not preserved since, in general

$E_{(T, R) \sim ζ} \left[\max_{π} E_{a o s \sim T π} \left[\sum_{n = 0}^{\infty} γ^{n} E R (a o_{: n})\right]\right] \neq E_{(T, R) \sim ζ} \left[\max_{π} E_{a o s \sim T π} \left[\sum_{n = 0}^{\infty} γ^{n} R (s_{n})\right]\right]$

Here $a o s \sim T π$ is shorthand notation for the same probability distribution as before.

Indeed, in Example 1 with $ζ$ uniform on $H$ , the LHS of the above is $\frac{1}{2} \cdot \frac{1}{1 - γ}$ since $E R \equiv \frac{1}{2}$ , whereas the RHS is $\frac{1}{2} + \frac{γ}{1 - γ}$ .
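The two values can be reproduced by direct summation. The sketch below rests on assumptions that are not spelled out above: the prior is taken to be uniform over the two hypotheses, the infinite sums are truncated at a finite horizon, and only constant policies are compared. Constant policies suffice here because the next state depends only on the last action and the posterior never moves. All names are hypothetical.

```python
REWARD = {"s0": 0.5, "s-": 0.0, "s+": 1.0}
SIGNS = ["+", "-"]        # the hypotheses T_+ and T_-
ACTIONS = ["a+", "a-"]

def next_state(sign, a):
    """Deterministic dynamics of Example 1 (the current state is irrelevant)."""
    return "s+" if (a == "a+") == (sign == "+") else "s-"

def true_return(sign, action, gamma, horizon):
    """Truncated sum of gamma^n R(s_n) when always playing `action` under T_sign."""
    s, total = "s0", 0.0
    for n in range(horizon):
        total += gamma ** n * REWARD[s]
        s = next_state(sign, action)
    return total

def perceived_return(action, gamma, horizon):
    """Truncated sum of gamma^n ER(ao_{:n}), with ER averaged over a uniform prior."""
    states = {sign: "s0" for sign in SIGNS}
    total = 0.0
    for n in range(horizon):
        total += gamma ** n * sum(REWARD[states[sign]] for sign in SIGNS) / len(SIGNS)
        states = {sign: next_state(sign, action) for sign in SIGNS}
    return total

gamma, horizon = 0.9, 2000
# LHS: the reference agent maximizes the perceived effective reward.
lhs = max(perceived_return(a, gamma, horizon) for a in ACTIONS)
# RHS: the reference agent knows T and maximizes the true reward.
rhs = sum(max(true_return(sign, a, gamma, horizon) for a in ACTIONS)
          for sign in SIGNS) / len(SIGNS)
print(lhs, 0.5 / (1 - gamma))
print(rhs, 0.5 + gamma / (1 - gamma))
```

The gap between the two printed rows is the regret that the transformed problem fails to account for.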

The pathology of Example 1 comes about because reward is not only imperceptible but entirely unobservable. That is, no experiment can produce any information about whether the reward on a given round $n > 0$ was 0 or 1. More specifically, the states $s_{-}$ and $s_{+}$ are assigned different rewards, but there is no observable difference between them. It is as if Alice would assign value, not to people in Africa (whose existence and well-being can be measured) but to some Flying Spaghetti Monster s.t. the world behaves exactly the same regardless of its existence or condition.

This observation suggests that, instead of assigning rewards to states which are just abstract labels in a model, we should assign rewards to states that are defined in terms of the observable consequences of the interaction of the agent with the environment. This leads us to the notion of "instrumental state", which we will now define formally.

Instrumental States and Reward Functions

First, we introduce some technical definitions for notational convenience.

Definition 1

$\mathrm{ConSet}$ is the category whose objects are pairs $(V, C)$ where $V$ is a real vector space and $C$ is a convex subset of $V$ , and whose morphisms are

$\mathrm{Mor} ((V, C), (W, D)) := \{f : C \to D \mid \exists A \in \mathrm{Hom} (V, W), w \in W \ \forall v \in C : f (v) = A v + w\}$

We omit describing identity and composition of morphisms since they are obvious.

It is easy to see that $\mathrm{ConSet}$ has arbitrary limits. In particular, $\mathrm{ConSet}$ has a final object that we denote by $p t$ (the one point set), products ( $(V, C) \times (W, D) ≅ (V \oplus W, C \times D)$ ) and inverse limits of sequences. For any finite set $A$ , $(\mathbb{R}^{A}, Δ A)$ is an object in $\mathrm{ConSet}$ . Also, we will sometimes abuse notation by regarding $C$ as an object of $\mathrm{ConSet}$ instead of $(V, C)$ (i.e. making $V$ implicit).

Definition 2

The functor $Cone : \mathrm{ConSet} \to \mathrm{ConSet}$ is defined by

$Cone (V, C) := (V \oplus \mathbb{R}, \{(λ v, λ) \mid λ \in [0, 1], v \in C\})$

$(Cone f) (λ v, λ) := (λ f (v), λ)$

Definition 3

For any $C \in \mathrm{ConSet}$ , we define $h_{C} : Cone C \to [0, 1]$ by

$h_{C} (λ v, λ) := λ$

Note that $h$ is a natural transformation from $Cone$ to the constant functor with value $[0, 1]$ .

Given $C \in \mathrm{ConSet}$ and $x \in Cone C$ s.t. $x = (u, λ)$ for $λ > 0$ , we denote $[x] := λ^{- 1} u \in C$ .

Now, we are ready to define instrumental states. We fix the sets $A$ and $O$ .

Definition 4

For any $n \in N$ , we define ${I S}_{n} \in \mathrm{ConSet}$ , the space of $n$ time step instrumental states, recursively by

${I S}_{0} := p t$

${I S}_{n + 1} := \prod_{a \in A} {\left(\prod_{o \in O} h_{{I S}_{n}}\right)}^{- 1} (Δ O)$

Here, $\prod_{o \in O} h_{{I S}_{n}}$ is a mapping from $\prod_{o \in O} Cone {I S}_{n}$ to $[0, 1]^{O}$ . The inverse image of a convex set (in this case $Δ O \subseteq [0, 1]^{O}$ ) under an affine mapping (in this case $\prod_{o \in O} h_{{I S}_{n}}$ ) is also a convex set.

The semantics of Definition 4 is as follows. Any $α \in {(\prod_{o \in O} h_{{I S}_{n}})}^{- 1} (Δ O)$ can be regarded as a pair consisting of some $α^{'} \in Δ O$ (the image of $α$ under $\prod_{o \in O} h_{{I S}_{n}}$ ) and a mapping $α^{''} : supp α^{'} \to {I S}_{n}$ defined by $α^{''} (o) := [α_{o}]$ . Given $θ \in {I S}_{n + 1}$ , $θ_{a}^{'}$ is the probability distribution over observations resulting from taking action $a$ in state $θ$ , whereas $θ_{a}^{''} (o)$ is the state resulting from taking action $a$ in state $θ$ conditional on observing $o$ . This semantics can be made more formal as follows:

Definition 5

Given $θ \in {I S}_{n}$ , we define $dom θ \subseteq {(A \times O)}^{*}$ recursively as follows:

$λ \in dom θ$ (here $λ$ denotes the empty history)
For all $h \in {(A \times O)}^{*}$ , $a \in A$ and $o \in O$ : $a o h \in dom θ$ iff $n > 0$ , $h_{{I S}_{n - 1}} (θ_{a o}) > 0$ and $h \in dom [θ_{a o}]$ .

Definition 6

Given $θ \in {I S}_{n}$ , $h \in dom θ$ and $a \in A$ , and assuming that $| h | < n$ , we recursively define $θ (h a) \in Δ O$ by

For $h = λ$ : $θ (o ∣ a) := h_{{I S}_{n - 1}} (θ_{a o})$
For $h = a^{'} o^{'} h^{'}$ with some $a^{'} \in A$ , $o^{'} \in O$ and $h^{'} \in {(A \times O)}^{*}$ : $θ (h a) := [θ_{a^{'} o^{'}}] (h^{'} a)$
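Definitions 4 to 6 translate fairly directly into a recursive data structure. The sketch below is one possible encoding, chosen for illustration rather than taken from the text: the point of ${I S}_{0}$ is the empty tuple, and an element of ${I S}_{n + 1}$ is a nested dictionary mapping each action and observation to a pair `(prob, sub)`, where `prob` plays the role of $h_{{I S}_{n}} (θ_{a o})$ and `sub` plays the role of $[θ_{a o}]$ (or is `None` when `prob` is zero). The example state and all names are hypothetical.

```python
PT = ()  # the unique point of IS_0

def in_dom(theta, h):
    """Definition 5: is the history h (a tuple of (action, observation) pairs) in dom theta?"""
    if not h:
        return True               # the empty history is always in the domain
    if theta == PT:
        return False              # n = 0: no non-empty history is in the domain
    (a, o), rest = h[0], h[1:]
    prob, sub = theta[a][o]
    return prob > 0 and in_dom(sub, rest)

def obs_dist(theta, h, a):
    """Definition 6: theta(h a), the distribution over the next observation.

    Assumes h is in dom theta and len(h) is smaller than the depth of theta.
    """
    if not h:
        return {o: prob for o, (prob, _) in theta[a].items()}
    (a1, o1), rest = h[0], h[1:]
    _, sub = theta[a1][o1]
    return obs_dist(sub, rest, a)

# A hypothetical element of IS_1 with A = {"L", "R"} and O = {0, 1}.
one_step = {
    "L": {0: (0.5, PT), 1: (0.5, PT)},
    "R": {0: (1.0, PT), 1: (0.0, None)},
}
# A hypothetical element of IS_2 built on top of it.
two_step = {
    "L": {0: (0.25, one_step), 1: (0.75, one_step)},
    "R": {0: (1.0, one_step), 1: (0.0, None)},
}

print(in_dom(two_step, (("L", 0), ("R", 0))))   # reachable history of length 2
print(in_dom(two_step, (("R", 1),)))            # observation 1 has probability 0 after R
print(obs_dist(two_step, (("L", 1),), "L"))
```

Storing the normalized successor together with its probability is equivalent to storing the cone element $(λ v, λ)$ whenever $λ > 0$ , which is exactly the case in which the successor is needed.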

The point of defining ${I S}_{n}$ in this manner is that (i) different points in ${I S}_{n}$ correspond to states that are truly not equivalent, i.e. can be empirically distinguished (statistically) and (ii) convex combinations of points in ${I S}_{n}$ correspond precisely to probabilistic mixtures. Formally, we have:

Definition 7

Given $θ \in {I S}_{n}$ and $π : {(A \times O)}^{< n} \xrightarrow{k} A$ (a policy), we define $θ π \in Δ {(A \times O)}^{n}$ (probability distribution over histories resulting from following policy $π$ in state $θ$ ) by

$\Pr_{a o \sim θ π} [a_{m} = a_{*}] = E_{a o \sim θ π} [π (a_{*} | a o_{: m})]$

$\Pr_{a o \sim θ π} [o_{m} = o_{*}] = E_{a o \sim θ π} [θ (o_{*} | a o_{: m} a_{m})]$

Proposition 1

Consider some $θ, θ^{'} \in {I S}_{n}$ and assume that for any $π : {(A \times O)}^{< n} \to A$ , $θ π = θ^{'} π$ . Then, $θ = θ^{'}$ .

Proposition 2

Consider some $θ, θ^{'} \in {I S}_{n}$ , $π : {(A \times O)}^{< n} \xrightarrow{k} A$ and $p \in [0, 1]$ . Then

$(p θ + (1 - p) θ^{'}) π = p θ π + (1 - p) θ^{'} π$

${I S}_{n}$ is a bounded polytope but in general it is not a simplex: we cannot regard it as just probability distributions over some set. For example, if $| A | = | O | = 2$ , then it's easy to see that ${I S}_{1}$ is a square (one axis is the probability to get a given observation when taking one action, the other axis is the probability for the other action). On the other hand, if $| A | = 1$ then ${I S}_{n}$ is a simplex: it is canonically isomorphic to $Δ O^{n}$ .
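A quick way to see that the square is not a simplex is to exhibit a point with two different decompositions into vertices. The sketch below assumes a simple encoding of ${I S}_{1}$ that is not from the text: a one-step state is a dictionary mapping each action to a distribution over observations. The names `vertex` and `mix` are illustrative.

```python
from itertools import product

A = ["a0", "a1"]
O = [0, 1]

def vertex(assignment):
    """Deterministic one-step state: each action yields a fixed observation."""
    return {a: {o: 1.0 if assignment[a] == o else 0.0 for o in O} for a in A}

def mix(p, theta1, theta2):
    """Convex combination p * theta1 + (1 - p) * theta2 in IS_1."""
    return {a: {o: p * theta1[a][o] + (1 - p) * theta2[a][o] for o in O} for a in A}

# The four corners of the square, ordered as (0,0), (0,1), (1,0), (1,1).
vertices = [vertex(dict(zip(A, obs))) for obs in product(O, repeat=len(A))]

# Two different mixtures of vertices that land on the same point (the center).
center_from_diagonal = mix(0.5, vertices[0], vertices[3])
center_from_antidiagonal = mix(0.5, vertices[1], vertices[2])
print(center_from_diagonal == center_from_antidiagonal)
```

In a simplex every point has a unique representation as a mixture of vertices, so the coincidence of the two centers rules that out: two distinct "probabilistic lotteries" over deterministic states are empirically the same state.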

There are natural morphisms ${p r}_{n} : {I S}_{n + 1} \to {I S}_{n}$ whose semantics is forgetting about the behavior of the state at time step $n$ :

Definition 8

We define ${p r}_{n} : {I S}_{n + 1} \to {I S}_{n}$ for any $n \in N$ recursively. ${p r}_{0}$ is the unique morphism from ${I S}_{1}$ to ${I S}_{0} ≅ p t$ . For any $n \in N$ , ${p r}_{n + 1}$ is given by

${p r}_{n + 1} (θ)_{a o} := (Cone {p r}_{n}) (θ_{a o})$

We thereby get a sequence ${I S}_{0} \leftarrow {I S}_{1} \leftarrow {I S}_{2} \leftarrow \dots$

Definition 9

We define ${I S}_{ω}$ , the space of (infinite time step) instrumental states by

${I S}_{ω} := \varprojlim_{n} {I S}_{n}$

We denote the canonical projections by ${p r}_{n}^{ω} : {I S}_{ω} \to {I S}_{n}$ .

Of course, ${I S}_{ω}$ can also be regarded as the space of all possible stochastic environments. Specifically, we have:

Definition 10

For any $μ \in {I S}_{ω}$ we define $dom μ \subseteq {(A \times O)}^{*}$ by

$dom μ := \bigcup_{n = 0}^{\infty} dom {p r}_{n}^{ω} μ$

Definition 11

Given $μ \in {I S}_{ω}$ , $h \in dom μ$ and $a \in A$ , we define $μ (h a) \in Δ O$ by

$μ (h a) := {p r}_{| h | + 1}^{ω} μ (h a)$

Like in the finite time case, we have

Definition 12

Given $μ \in {I S}_{ω}$ and $π : {(A \times O)}^{*} \xrightarrow{k} A$ , we define $μ π \in Δ {(A \times O)}^{ω}$ by

$\Pr_{a o \sim μ π} [a_{n} = a_{*}] = E_{a o \sim μ π} [π (a_{*} | a o_{: n})]$

$\Pr_{a o \sim μ π} [o_{n} = o_{*}] = E_{a o \sim μ π} [μ (o_{*} | a o_{: n} a_{n})]$

Proposition 3

Consider some $μ, μ^{'} \in {I S}_{ω}$ and assume that for any $π : {(A \times O)}^{*} \to A$ , $μ π = μ^{'} π$ . Then, $μ = μ^{'}$ .

Proposition 4

Consider some $μ, μ^{'} \in {I S}_{ω}$ , $π : {(A \times O)}^{*} \xrightarrow{k} A$ and $p \in [0, 1]$ . Then

$(p μ + (1 - p) μ^{'}) π = p μ π + (1 - p) μ^{'} π$

For each $n \in N$ , ${I S}_{n}$ is finite-dimensional and therefore has a natural topology. ${I S}_{ω}$ also becomes a topological space by equipping it with the inverse limit topology. Since the ${I S}_{n}$ are closed and bounded they are compact, and therefore ${I S}_{ω}$ is also compact by Tychonoff's theorem. In the special case $| A | = 1$ , ${I S}_{ω} ≅ Δ O^{ω}$ and the inverse limit topology is the weak topology for probability measures, defined w.r.t. the product topology on $O^{ω}$ .

We can now give the first definition of an instrumental reward function: a continuous affine function $R : {I S}_{ω} \to \mathbb{R}$ . Why affine? A convex linear combination of instrumental states is empirically indistinguishable from a probabilistic lottery. If we assign expected values to probabilistic lotteries (as we should by the VNM theorem), then we must also assign them to convex linear combinations of instrumental states: otherwise our reward again depends on entirely unobservable parameters of our model.

An alternative approach is to consider the notion of "experiment" explicitly.

We will use the notation $X^{\leq ω} := X^{*} ⊔ X^{ω}$ . Given a logical condition $ϕ$ , the symbol $1_{ϕ}$ will denote 1 when $ϕ$ is true and 0 when $ϕ$ is false.

Definition 13

Given $π : {(A \times O)}^{*} \xrightarrow{k} A ⊔ \{⊥\}$ and $μ \in {I S}_{ω}$ , we define $μ π \in Δ {(A \times O)}^{\leq ω}$ by

$\Pr_{a o \sim μ π} [a_{n} = a_{*}] = E_{a o \sim μ π} [1_{| a o | \geq n} π (a_{*} | a o_{: n})]$

$\Pr_{a o \sim μ π} [o_{n} = o_{*}] = E_{a o \sim μ π} [1_{| a o | > n} μ (o_{*} | a o_{: n} a_{n})]$

$π$ is said to be a terminable policy when for any $μ \in {I S}_{ω}$

$\Pr_{h \sim μ π} [| h | < \infty] = 1$

That is, a terminable policy is allowed to produce a special token $⊥$ which terminates the "experiment" (instead of choosing an action), and we require that, for any environment, this token will be eventually produced with probability 1.

Definition 14

Given a terminable policy $π$ and a bounded function $r : {(A \times O)}^{*} \to \mathbb{R}$ , we define the function $R_{π r} : {I S}_{ω} \to \mathbb{R}$ by

$R_{π r} (μ) := E_{h \sim μ π} [r (h)]$

This gives us a second definition of instrumental reward functions. In fact, the two definitions are equivalent:

Theorem 1

For any terminable policy $π$ and bounded function $r : {(A \times O)}^{*} \to \mathbb{R}$ , $R_{π r}$ is continuous and affine. Conversely, for any continuous affine $R : {I S}_{ω} \to [0, 1]$ , there exists a terminable policy $π$ and $r : {(A \times O)}^{*} \to [- 2, 2]$ s.t. $R = R_{π r}$ .

Putting the second part of the theorem into words, for any instrumental reward function (in the sense of the first definition) there is some experiment the agent can do which yields an unbiased estimate of the reward for the instrumental state that existed at the beginning of the experiment.
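To make the "experiment" reading concrete, here is a minimal sketch of how $R_{π r}$ from Definition 14 could be computed exactly in a toy case. It rests on assumptions that go beyond the text: the policy is deterministic (returning `None` in place of $⊥$ ), and the environment is given by a function returning the probability of the observations in a history given its actions, so that mixtures of environments are plain convex combinations of these functions. The coin environments, the policy and the estimator are hypothetical.

```python
def experiment_value(env_prob, policy, r, observations, max_len=50):
    """E_{h ~ mu pi}[r(h)] for a deterministic terminable policy, by enumeration.

    env_prob(h) is the product of mu(o_k | ao_{:k} a_k) along the history h.
    """
    total = 0.0
    stack = [()]
    while stack:
        h = stack.pop()
        p = env_prob(h)
        if p == 0.0:
            continue                      # unreachable history
        a = policy(h)
        if a is None:                     # the policy emitted the terminating token
            total += p * r(h)
        elif len(h) >= max_len:
            raise RuntimeError("policy did not terminate within max_len steps")
        else:
            stack.extend(h + ((a, o),) for o in observations)
    return total

def coin(q):
    """Environment whose observation is 1 with probability q, whatever the action."""
    def env_prob(h):
        p = 1.0
        for _, o in h:
            p *= q if o == 1 else 1.0 - q
        return p
    return env_prob

def mixture(w, env1, env2):
    """Convex combination w * env1 + (1 - w) * env2 of two environments."""
    return lambda h: w * env1(h) + (1.0 - w) * env2(h)

def pull_twice(h):
    return "pull" if len(h) < 2 else None

def fraction_of_ones(h):
    return sum(o for _, o in h) / len(h)

OBS = [0, 1]
value_a = experiment_value(coin(0.2), pull_twice, fraction_of_ones, OBS)
value_b = experiment_value(coin(0.9), pull_twice, fraction_of_ones, OBS)
value_mix = experiment_value(mixture(0.3, coin(0.2), coin(0.9)),
                             pull_twice, fraction_of_ones, OBS)
print(value_a, value_b, value_mix, 0.3 * value_a + 0.7 * value_b)
```

The last two printed numbers agree, which is the affinity asserted by the first part of Theorem 1 in this toy case: the experiment's expected outcome on a mixture is the mixture of the expected outcomes.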

The range $[- 2, 2]$ is not optimal, but for the regret bound in the next subsection, it's only important that it is bounded by some fixed constant.

To illustrate this concept of instrumental reward function, imagine that Clippy has access to a black box with a collection of levers on its outside. Pulling the levers produces some sounds that hint at what happens inside the box, but are not enough to determine it with certainty. The box has a shuttered glass window, whose shutter can be opened from the outside. Through the window, Clippy can see a jumble of scrap metal, mechanical manipulators that are controlled by the levers (and can be used to shape the scrap metal), and also a few mice running around the box and wreaking havoc. However, it is not possible to control the manipulators while the shutter is open. Worse, while opening the shutter allows seeing a snapshot of the shape of the metal, it also causes the manipulators to move chaotically, ruining this shape. So, Clippy can experiment with the levers and occasionally open the shutter to test the result. However, in order to produce and maintain paperclips inside the box, the shutter has to be kept closed (and the paperclips hidden) most of the time.

It is also possible to consider reward functions of the more general form $R : {(A \times O)}^{*} \times {I S}_{ω} \to \mathbb{R}$ , required to be continuous and affine in the second argument. Such a reward function depends both on the current (unknown) instrumental state of the environment and the observable history so far. By Theorem 1, such a reward can be equivalently described in terms of a family $\{π_{h} : {(A \times O)}^{*} \to A ⊔ \{⊥\}\}_{h \in {(A \times O)}^{*}}$ of terminable policies and a family $\{r_{h} : {(A \times O)}^{*} \to \mathbb{R}\}_{h \in {(A \times O)}^{*}}$ of bounded functions s.t.

$R (h, μ) = R_{π_{h} r_{h}} (μ)$

This means that the value of the reward can be estimated empirically, but only if the agent remembers the entire observation history. If the history is forgotten at some point, the agent might never be able to estimate the reward again. We will call such reward functions "semi-instrumental".

Although semi-instrumental reward functions are more general than instrumental reward functions, I think that there is some interest in studying the narrower class. Arguably, instrumental reward functions are a better model of what counts as "consequentialist" or "goal-directed" behavior, since they depend only on the state of the environment. Indeed, it is easy to construct a semi-instrumental reward function that makes any given policy Bayes-optimal for any given prior, so semi-instrumental reward functions are incapable (without further assumptions) of distinguishing between "intelligent" and "unintelligent" behavior. On the other hand, optimality w.r.t. some instrumental reward function seems like a stronger condition.

In order to derive a regret bound, we will restrict attention to those reward functions for which the terminable policy can be made to terminate within time that has some finite and bounded expected value. I don't know an elegant way to characterize those reward functions in general, but we will describe one class of such functions.

Definition 15

Consider any $C \in \mathrm{ConSet}$ . Then, we can define the total variation distance $d_{tv} : C \times C \to \mathbb{R}$ by

$d_{tv} (x, y) := \sup_{r \in \mathrm{Mor} (C, [0, 1])} | r (x) - r (y) |$

In general, $d_{tv}$ is a pseudometric. Moreover, when $C$ is finite-dimensional and bounded, it is a metrization of the natural topology. For a finite set $A$ and $C = Δ A$ , $d_{tv}$ is just the usual total variation metric. For $C$ a ball of unit diameter in Euclidean space, $d_{tv}$ is the Euclidean metric.
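For $C = Δ A$ the supremum in Definition 15 can be checked by brute force. The sketch below uses two facts that are standard but not stated in the text: an affine map $r : Δ A \to [0, 1]$ is determined by its values in $[0, 1]$ at the vertices, and the supremum of $| r (x) - r (y) |$ over such maps is attained when those values are all 0 or 1. The function names and the example distributions are hypothetical.

```python
from itertools import product

def tv_by_sup(x, y):
    """sup over affine r : Delta(A) -> [0, 1] of |r(x) - r(y)|, by enumerating 0/1-valued r."""
    keys = sorted(set(x) | set(y))
    best = 0.0
    for values in product([0.0, 1.0], repeat=len(keys)):
        diff = sum(v * (x.get(k, 0.0) - y.get(k, 0.0)) for v, k in zip(values, keys))
        best = max(best, abs(diff))
    return best

def tv_half_l1(x, y):
    """The usual total variation distance: half of the L1 distance."""
    keys = set(x) | set(y)
    return 0.5 * sum(abs(x.get(k, 0.0) - y.get(k, 0.0)) for k in keys)

x = {"a": 0.5, "b": 0.5, "c": 0.0}
y = {"a": 0.25, "b": 0.25, "c": 0.5}
print(tv_by_sup(x, y), tv_half_l1(x, y))
```

The enumeration is exponential in $| A |$ and is only meant as a sanity check of the definition against the familiar formula.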

Definition 16

Consider any $λ \in (0, 1)$ . We define the metric $d_{tv}^{λ}$ on ${I S}_{ω}$ by

$d_{tv}^{λ} (μ, ν) := \sup_{n \in N} λ^{n} d_{tv} ({p r}_{n}^{ω} μ, {p r}_{n}^{ω} ν)$

For any $λ$ , $d_{tv}^{λ}$ is a metrization of the inverse limit topology on ${I S}_{ω}$ .

Proposition 5

Consider any $λ \in (0, 1)$ and $R : {I S}_{ω} \to \mathbb{R}$ affine and Lipschitz with respect to $d_{tv}^{λ}$ . Then there is a terminable policy $π$ and a bounded function $r : {(A \times O)}^{*} \to \mathbb{R}$ s.t. $R = R_{π r}$ and

$\sup_{μ \in {I S}_{ω}} E_{h \sim μ π} [| h |] < \infty$

Note that $λ$ is a parameter reminiscent of geometric time discount that constrains the shape of the reward function. However, in the regret analysis that follows, it is not the same as the actual geometric time discount parameter $γ$ . In particular, we consider the asymptotics in which the latter approaches 1, while the reward function is assumed to be fixed. It might be interesting to study regimes in which both approach 1, but we will not attempt it at present.

A Regret Bound for RL with Instrumental Rewards

Fix $A$ and $O$ . For any finite set $S$ and $T : S \times A \xrightarrow{k} S \times O$ , there is a unique mapping $I_{T} : S \to {I S}_{ω}$ s.t.

$I_{T} (s)_{a o} = \left(\sum_{s^{'} \in S} T (s^{'}, o \mid s, a) I_{T} (s^{'}), \sum_{s^{'} \in S} T (s^{'}, o \mid s, a)\right)$

That is, $I_{T} (s)$ is just the instrumental state corresponding to the POMDP state $s$ .
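The recursion for $I_{T}$ can be unrolled to a finite depth to obtain ${p r}_{n}^{ω} I_{T} (s)$ . The sketch below is an illustration under assumed encodings, none of which come from the text: kernels are functions `(s, a) -> {(s', o): prob}`, an $n$ -step instrumental state is a nested dictionary of `(prob, successor)` pairs, and the recursion is carried over belief states, using the fact that convex combinations of instrumental states correspond to probabilistic mixtures. It is applied to the kernels of Example 1, once with the single observation $⊥$ and once with the state revealed as the observation.

```python
from fractions import Fraction

PT = ()  # the unique point of IS_0

def instrumental_state(T, belief, n, actions, observations):
    """n-step truncation of the instrumental state of a belief over POMDP states."""
    if n == 0:
        return PT
    theta = {}
    for a in actions:
        theta[a] = {}
        for o in observations:
            unnormalized = {}
            for s, p in belief.items():
                for (s_next, o_next), q in T(s, a).items():
                    if o_next == o:
                        unnormalized[s_next] = unnormalized.get(s_next, Fraction(0)) + p * q
            prob = sum(unnormalized.values(), Fraction(0))
            if prob == 0:
                theta[a][o] = (prob, None)          # successor is undefined
            else:
                successor_belief = {s2: w / prob for s2, w in unnormalized.items()}
                theta[a][o] = (prob, instrumental_state(
                    T, successor_belief, n - 1, actions, observations))
    return theta

def example1_kernel(sign, observe_state=False):
    """T_sign from Example 1; optionally the next state is also emitted as the observation."""
    def T(s, a):
        s_next = "s+" if (a == "a+") == (sign == "+") else "s-"
        return {(s_next, s_next if observe_state else "⊥"): Fraction(1)}
    return T

ACTIONS = ["a-", "a+"]
start = {"s0": Fraction(1)}

blind_plus = instrumental_state(example1_kernel("+"), start, 3, ACTIONS, ["⊥"])
blind_minus = instrumental_state(example1_kernel("-"), start, 3, ACTIONS, ["⊥"])

STATES = ["s0", "s-", "s+"]
seen_plus = instrumental_state(example1_kernel("+", True), start, 3, ACTIONS, STATES)
seen_minus = instrumental_state(example1_kernel("-", True), start, 3, ACTIONS, STATES)

print(blind_plus == blind_minus)   # one observation: the hypotheses coincide
print(seen_plus == seen_minus)     # observed states: the hypotheses differ
```

With a single observation the two hypotheses yield the same instrumental state, so no affine function on ${I S}_{ω}$ can assign them different rewards, which is the formal counterpart of the pathology discussed after Example 1.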

Given $R : {I S}_{ω} \to \mathbb{R}$ continuous affine, we get for any $S$ and $T$ the POMDP reward function $R \circ I_{T}$ . Hence, one natural setting for regret analysis is fixing $R$ and $S$ and considering a hypothesis class of transition kernels

$H \subseteq \{S \times A \xrightarrow{k} S \times O\}$

However, regret analysis for POMDPs involves some technical complexity, and we prefer to start with a simpler setting. Hopefully, we will address the POMDP case in a followup essay.

Suppose that $O = S$ . Then, given any MDP transition kernel $T : S \times A \xrightarrow{k} S$ , we can define the POMDP transition kernel $T^{♯} : S \times A \xrightarrow{k} S \times O$ by

$T^{♯} (s^{'}, s^{''} \mid s, a) := 1_{s^{'} = s^{''}} T (s^{'} \mid s, a)$

Fix $R : S \times {I S}_{ω} \to [0, 1]$ continuous affine in the second argument (a type of semi-instrumental reward function). For any $T$ as above, we get the induced reward function $R_{T} : S \to [0, 1]$ given by $R_{T} (s) := R (s, I_{T^{♯}} (s))$ . We can now consider the learning problem corresponding to a hypothesis class of MDP transition kernels

$H \subseteq \{S \times A \xrightarrow{k} S\}$

It might seem odd to consider a setting with fully observable states in the context of imperceptible rewards. However, we can think about it as follows: what we observe is only a label whose physical meaning is unknown. Only given the (unknown) transition kernel can such a label be interpreted as being assigned a reward.

For example, imagine a robot tasked with sorting balls into bins according to their weight. Some of the balls are solid gold and some of them are solid silver, so it's in principle possible to know their weight just by looking at them. However, the robot doesn't know a priori whether silver or gold is heavier. On the other hand, the robot can perform some experiment with a ball to determine its weight. In this situation, the reward is imperceptible even if the state (e.g. the locations, colors and sizes of the balls, the locations and shapes of the bins and the location of the robot and the state of its manipulators) is fully observable.

Using the concepts of MB dimension and RVO dimension we defined in a previous essay, we can formulate the regret bound.

Theorem 2

There is some $C \in \mathbb{R}^{+}$ s.t. the following holds.

Consider any finite non-empty sets $S$ and $A$ , function $R : S \times {I S}_{ω} \to [0, 1]$ continuous affine in the second argument, closed set $H \subseteq \{S \times A \xrightarrow{k} S\}$ and Borel probability measure $ζ$ on $H$ (prior). Suppose $π_{e s t} : S \times {(A \times S)}^{*} \xrightarrow{k} A ⊔ \{⊥\}$ is s.t. for any $s \in S$ , $π_{e s t} (s, \cdot)$ is terminable, and $r : S \times {(A \times S)}^{*} \to [0, 1]$ is a bounded function s.t.

$R (s, \cdot) = R_{π_{e s t} (s, \cdot) r (s, \cdot)}$

Define the maximal expected reward estimation time $t_{e s t}$ by

$t_{e s t} := \sup_{s \in S, μ \in {I S}_{ω}} E_{h \sim μ π_{e s t} (s, \cdot)} [| h |]$

Define the maximal bias span $τ$ by

$τ := \limsup_{γ \to 1} \max_{T \in H} \frac{\max_{s \in S} V_{T R_{T}} (s, γ) - \min_{s \in S} V_{T R_{T}} (s, γ)}{1 - γ}$

Define $H_{R} \subseteq \{S \to \mathbb{R}\}$ by

$H_{R} := \{R_{T} \mid T \in H\}$

Denote $D_{M B} := {dim}_{M B} H_{R}$ and $D_{R V O} := {dim}_{R V O} H_{R}$ . Define the Bayesian regret $R (γ)$ by

$R (γ) := E_{T \sim ζ} [{E U}_{T R_{T}}^{*} (γ) - {E U}_{T R_{T}}^{π_{γ}^{†}} (γ)]$

Then, there is a family of policies $\{π_{γ}^{†} : S^{*} \times S \xrightarrow{k} A\}_{γ \in (0, 1)}$ s.t.

$\limsup_{γ \to 1} \frac{R (γ)}{\sqrt[3]{(τ + t_{e s t}) (D_{M B} + 1) (D_{R V O} + 1) (1 - γ) \ln \frac{1}{1 - γ}}} \leq C$

Discussion

More on the Regret Bound

It might appear odd that the regret bound in Theorem 2 depends on the dimensions of the class of reward functions, but not on the dimensions of the class of transition kernels, like in the perceptible case. The reason for this is that we only give the asymptotic form. Asymptotically, the leading term is $O (\sqrt[3]{(1 - γ) \ln \frac{1}{1 - γ}})$ and its coefficient depends only on those parameters that appear in Theorem 2. However, examining the proof reveals that there is also an $O (\sqrt{(1 - γ) \ln \frac{1}{1 - γ}})$ term that has the same form as the regret bound in the perceptible case. For the sake of brevity and simplicity we will not attempt to write down a more precise regret bound that reflects the role of the dimensions of the class of transition kernels, but in principle it is not difficult. We must keep in mind, however, that in practice the other term might be significant: a priori the dimensions of the class of transition kernels are only bounded by a polynomial in $| S |$ and $| A |$ and the latter might be exponentially big for realistic models.
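To get a feel for how slowly the square-root term becomes negligible, one can tabulate the two rates with all coefficients set to 1. This is only an illustration of the asymptotic forms: the actual coefficients (which the text warns may be large for the square-root term) are omitted, and the function names are hypothetical.

```python
import math

def cube_root_rate(gamma):
    """(1 - gamma) ln(1 / (1 - gamma)) raised to the power 1/3, coefficient omitted."""
    x = 1.0 - gamma
    return (x * math.log(1.0 / x)) ** (1.0 / 3.0)

def square_root_rate(gamma):
    """(1 - gamma) ln(1 / (1 - gamma)) raised to the power 1/2, coefficient omitted."""
    x = 1.0 - gamma
    return (x * math.log(1.0 / x)) ** 0.5

gammas = [0.9, 0.99, 0.999, 0.9999]
ratios = [square_root_rate(g) / cube_root_rate(g) for g in gammas]
for g, ratio in zip(gammas, ratios):
    print(g, cube_root_rate(g), square_root_rate(g), ratio)
```

The ratio of the two rates decays only like the sixth root of $(1 - γ) \ln \frac{1}{1 - γ}$ , so a square-root term with a large coefficient can dominate until $γ$ is very close to 1.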

The Death of the Agent and Kamikaze Strategies

There is one potential application of defining rewards that are not a direct function of observations which seems not supported by instrumental rewards as we defined them. Namely, one can imagine agents that are motivated to follow a certain strategy that is designed to destroy the agent but produce some desirable results in the external world ("kamikaze strategy"). In other words, although survival is a convergent instrumental goal, it seems entirely reasonable for it to be sacrificed in certain specific scenarios. To give an example, imagine a bodyguard robot whose primary goal is keeping a certain person alive. If assassins shoot at the person, and the only way to stop them is for the robot to block the bullets with its body, then it can be the optimal strategy, even if it will destroy the robot and prevent it from continuing its function as bodyguard (assuming that e.g. it will give the person a chance to escape or someone else a chance to stop the assassins).

Our formalism doesn't directly address the possibility of the agent's death, because the sequence of actions and observations is assumed to be infinite. One simple way to accommodate death is postulating a special observation $⊥ \in O$ s.t. it is never received before death and always received after death. If we do this, then death corresponds to a specific instrumental state and therefore its reward is a fixed constant. This seems incompatible with kamikaze strategies where the decision of self-sacrifice is contingent on conditions in the external world after death.

Another approach is assuming the agent becomes a "ghost": it keeps receiving observations which reflect the actual physical world but its actions have no effect. Such ghosts might theoretically be produced by a simplicity prior: for example, if the agent is an AI connected to a camera that monitors a room, then we can imagine a virtual video of the same room continuing beyond the AI's shutdown. This allows for different instrumental states after death and can potentially justify kamikaze strategies, but it seems like a very fragile construct and is unlikely to guarantee reasonable behavior.

The problem with death can be viewed from another perspective: our regret analysis assumes a no-traps condition ( $τ < \infty$ ) whereas death is usually a trap. Therefore, to guarantee rational behavior while accounting for death, we need to operate within a framework that allows for traps.

One such framework is requiring Bayes-optimality and giving up on learnability. This seems both too weak (because nothing is guaranteed for specific environments) and too strong (because it's computationally intractable). That said, I think this approach can be salvaged by requiring something somewhat weaker than Bayes-optimality and proving something somewhat weaker than learnability (hopefully I will write on that in a future essay). In any case, once we give up on learnability we can allow unobservable rewards (the general POMDP setting in the beginning of the Results section) which allow handling death and kamikaze strategies easily. Specifically, we can have a plethora of "dead" states that produce only the $⊥$ observation and whose transitions do not depend on the action, but which have different rewards. So, this approach "solves" the problem but at a high cost: no learnability or regret analysis.
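As a minimal sketch of the "plethora of dead states" idea, the toy model below encodes the bodyguard example as a small POMDP whose dead states all emit $⊥$ and ignore the action, yet carry different rewards. All state names, probabilities and the discount factor are hypothetical, and planning is done by plain value iteration with the model assumed known (so this illustrates Bayes-optimal planning for a single hypothesis, not learning).

```python
GAMMA = 0.9  # hypothetical geometric discount
STATES = ["threat", "alive_client_dead", "dead_client_alive", "dead_client_dead"]
ACTIONS = ["block", "hide"]
DEAD = {"dead_client_alive", "dead_client_dead"}
# Reward 1 whenever the protected person is alive, including after the robot's death.
REWARD = {"threat": 1.0, "alive_client_dead": 0.0,
          "dead_client_alive": 1.0, "dead_client_dead": 0.0}


def transition(state, action):
    """Return the distribution over next states as a dict."""
    if state == "threat":
        if action == "block":
            # The robot is destroyed; the person usually escapes.
            return {"dead_client_alive": 0.8, "dead_client_dead": 0.2}
        return {"alive_client_dead": 1.0}
    # Every other state is absorbing, so dead states ignore the action.
    return {state: 1.0}


def observation(state):
    """Dead states are indistinguishable to the agent: they all emit the ⊥ symbol."""
    return "⊥" if state in DEAD else state


def q_values(n_iter=500):
    """Value iteration for the normalized discounted return (1 - GAMMA) * sum GAMMA^n r_n."""
    v = {s: 0.0 for s in STATES}
    q = {}
    for _ in range(n_iter):
        q = {(s, a): (1 - GAMMA) * REWARD[s]
             + GAMMA * sum(p * v[t] for t, p in transition(s, a).items())
             for s in STATES for a in ACTIONS}
        v = {s: max(q[(s, a)] for a in ACTIONS) for s in STATES}
    return q


if __name__ == "__main__":
    q = q_values()
    print("Q(threat, block) =", q[("threat", "block")])
    print("Q(threat, hide)  =", q[("threat", "hide")])
```

In this toy model the self-sacrificing action has the higher value precisely because the two dead states, although observationally identical, are assigned different rewards.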

Another framework for dealing with traps is Delegative Reinforcement Learning (DRL). In DRL the agent interacts with an external advisor, and is thereby able to successfully navigate all traps that the advisor knows about. In other words, it converges to a policy that is Bayes-optimal with respect to the belief state of the advisor (while the advisor is not able to produce such a policy by itself). Combining DRL with the instrumental rewards formalism should allow accounting for death. Specifically, in any given POMDP hypothesis, the state space will be decomposed as $S = S_{dead} ⊔ S_{alive}$ , with the following assumptions:

The transition kernel on states in $S_{dead}$ doesn't depend on the action.
$S_{dead}$ is invariant under the dynamics (death is irreversible).
The reward function on $S_{dead}$ can be arbitrary.
The reward function on $S_{alive}$ has to factor through $I_{T}$ (i.e. depend only on the instrumental state), and moreover the policy for estimating the reward function is known to be safe (alternatively, we might let the agent learn a safe estimation policy from some space of candidate policies).

Under these assumptions (or some variant thereof), it seems plausible that we should be able to prove learnability and derive a reasonable regret bound. Formalizing this line of thought is a task for future work.
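The first two assumptions are purely structural, so for a finite hypothesis they can be checked mechanically. The sketch below does this for a transition kernel stored as a dictionary; the representation, the function name `check_death_assumptions` and the toy models are all hypothetical and only meant to make the conditions concrete.

```python
def check_death_assumptions(transition, states, actions, dead, tol=1e-12):
    """Check action-independence and invariance of a candidate set of dead states.

    transition[(s, a)] is a dict mapping next states to probabilities.
    Returns the pair (action_independent, invariant).
    """
    action_independent = True
    invariant = True
    for s in dead:
        reference = transition[(s, actions[0])]
        for a in actions:
            dist = transition[(s, a)]
            # Assumption 1: the kernel on dead states does not depend on the action.
            if any(abs(dist.get(t, 0.0) - reference.get(t, 0.0)) > tol for t in states):
                action_independent = False
            # Assumption 2: no probability mass flows back to living states.
            if any(p > tol and t not in dead for t, p in dist.items()):
                invariant = False
    return action_independent, invariant


STATES = ["alive", "dead_good", "dead_bad"]
ACTIONS = ["stay", "sacrifice"]
DEAD = {"dead_good", "dead_bad"}

VALID = {
    ("alive", "stay"): {"alive": 1.0},
    ("alive", "sacrifice"): {"dead_good": 0.7, "dead_bad": 0.3},
    ("dead_good", "stay"): {"dead_good": 1.0},
    ("dead_good", "sacrifice"): {"dead_good": 1.0},
    ("dead_bad", "stay"): {"dead_bad": 1.0},
    ("dead_bad", "sacrifice"): {"dead_bad": 1.0},
}

# A broken variant in which one action "resurrects" the agent.
BROKEN = dict(VALID)
BROKEN[("dead_bad", "stay")] = {"alive": 1.0}

if __name__ == "__main__":
    print(check_death_assumptions(VALID, STATES, ACTIONS, DEAD))
    print(check_death_assumptions(BROKEN, STATES, ACTIONS, DEAD))
```

The remaining two assumptions concern the reward function rather than the dynamics, and are not covered by this check.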

The DRL approach to kamikaze strategies also has implications for corrigibility. Indeed, if the external advice is compatible with an admissible reward function that recommends shutting down the agent (i.e. upon delegation, the advisor acts to shut down the agent) then a DRL agent will assist with its own shutdown. This property is preserved by subagents, since creating a subagent is another potentially irreversible action (and therefore must be vetted by the advisor).

Also, it is tempting to speculate about DRL as a model of human self-sacrifice. We have already speculated before that we can view DRL as a model of human learning, where the human's social environment or humanity as a whole is regarded as playing the role of the advisor. Similarly, humans that sacrifice their lives for the sake of their loved ones or the "greater good" can be regarded as heeding that "advice" regarding the rewards of states in which the human doesn't exist. Under this model, we can make sense of self-sacrifices by individual humans but not of hypothetical self-sacrifice by humanity as a whole (potentially augmented by other minds with which we could communicate and find common ground), but the latter restriction seems compatible with intuition.

Specifying Instrumental Reward Functions

It might be useful to find natural ways to specify instrumental reward functions. In particular, it is interesting to start with a reward function defined on a specific ontology (POMDP) and somehow extend it to a full instrumental reward function (which, like we said before, induces a reward function on any ontology via the $I_{T}$ mapping).

De Blanc suggests one way to extend reward functions from one POMDP to another, but I don't know whether this operation leads to an instrumental reward function, i.e. whether it is compatible with the constraints imposed by affine dependencies between the states (and I don't see any reason why it should).

Essentially, what we're looking for is a way to extend an affine function from an affine subspace to the entire affine space (the affine subspace is the affine span of the instrumental states corresponding to the states of the initial ontology; note that, if these instrumental states are not affine independent, then we have some constraint on this initial reward function). One natural way to do it is looking for an affine function whose differential (the linear function part) has minimal norm, or choosing some privileged projection operator from the entire space to the affine subspace. However, since the natural norms we have here are not inner product norms, these choices are usually non-unique, and it's possible we can't get much out of it.
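The non-uniqueness is easy to see in a two-dimensional toy case. In the sketch below (all names and the candidate grid are hypothetical), a linear functional is fixed to take the value $1$ on the vector $(1, 0)$, and its extensions to the plane are the functionals $(1, c)$. When the plane carries the $L^{1}$ norm, the dual norm is the sup norm and every $c \in [- 1, 1]$ is a minimal-norm extension; for the Euclidean norm the minimizer is unique.

```python
import math


def sup_norm(alpha):
    """Dual norm of a functional when the underlying space carries the L1 norm."""
    return max(abs(c) for c in alpha)


def euclidean_norm(alpha):
    """Dual norm when the underlying space carries the (self-dual) Euclidean norm."""
    return math.sqrt(sum(c * c for c in alpha))


def minimal_norm_extensions(norm, candidates):
    """Return all candidate values c for which the extension (1, c) has minimal norm."""
    values = {c: norm((1.0, c)) for c in candidates}
    best = min(values.values())
    return [c for c in candidates if abs(values[c] - best) < 1e-12]


# Grid of candidate second coordinates: -2, -1.75, ..., 2.
CANDIDATES = [i / 4 for i in range(-8, 9)]

if __name__ == "__main__":
    print("sup norm minimizers:      ", minimal_norm_extensions(sup_norm, CANDIDATES))
    print("Euclidean norm minimizers:", minimal_norm_extensions(euclidean_norm, CANDIDATES))
```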

A different natural approach is using Occam's razor, i.e. looking for an extension of minimal description length or the expected value of a random extension sampled from a conditional simplicity prior. This requires a natural way to describe instrumental reward functions. We do have such a way: assuming that the $π$ and $r$ guaranteed to exist by Theorem 1 can be chosen to be computable, the description is the program for those objects with respect to a fixed universal Turing machine (we consider a single program that computes both $π$ and $r$ to exploit possible mutual information between the two).

These questions are left for future research.

Proofs

Proof of Proposition 1

We proceed by induction on $n$ . For $n = 0$ there is nothing to prove since ${I S}_{0} = \mathrm{pt}$ so $θ = θ^{'}$ a priori. Now, suppose $n > 0$ .

We need to show that for any $a_{*} \in A$ and $o_{*} \in O$ , $θ_{a_{*} o_{*}} = θ_{a_{*} o_{*}}^{'}$ . To show it for given $a_{*}$ and $o_{*}$ , we consider some $π$ s.t. $π (λ) = a_{*}$ . Since $θ π = θ^{'} π$ , we have

$\Pr_{a o \sim θ π} [o_{0} = o_{*}] = \Pr_{a o \sim θ^{'} π} [o_{0} = o_{*}]$

By Definition 7, we get,

$θ (o_{*} | a_{*}) = θ^{'} (o_{*} | a_{*})$

By Definition 6, we get,

$h_{{I S}_{n - 1}} (θ_{a_{*} o_{*}}) = h_{{I S}_{n - 1}} (θ_{a_{*} o_{*}}^{'})$

If this value vanishes then $θ_{a_{*} o_{*}} = θ_{a_{*} o_{*}}^{'} = 0$ . Otherwise, consider any $σ : {(A \times O)}^{< n - 1} \to A$ and define $π : {(A \times O)}^{< n} \to A$ by

$π (λ) := a_{*}$ $π (a o h) := σ (h)$

Definitions 6 and 7 imply that

$[θ_{a_{*} o_{*}}] σ (h) = \frac{θ π (a_{*} o_{*} h)}{θ (o_{*} | a_{*})}$

and similarly for $θ^{'}$ . We have $θ π = θ^{'} π$ and $θ (o_{*} | a_{*}) = θ^{'} (o_{*} | a_{*})$ , which implies $[θ_{a_{*} o_{*}}] σ = [θ_{a_{*} o_{*}}^{'}] σ$ . By the induction hypothesis, we get $[θ_{a_{*} o_{*}}] = [θ_{a_{*} o_{*}}^{'}]$ . Combining this with $h_{{I S}_{n - 1}} (θ_{a_{*} o_{*}}) = h_{{I S}_{n - 1}} (θ_{a_{*} o_{*}}^{'})$ , we get $θ_{a_{*} o_{*}} = θ_{a_{*} o_{*}}^{'}$ . $■$

Proposition A.1

Consider any $C \in C o n S e t$ , $θ, θ^{'} \in Cone C$ and $p \in [0, 1]$ . Assume that

$h_{C} (p θ + (1 - p) θ^{'}) > 0$

Then

$[p θ + (1 - p) θ^{'}] = \frac{p h_{C} (θ) [θ] + (1 - p) h_{C} (θ^{'}) [θ^{'}]}{p h_{C} (θ) + (1 - p) h_{C} (θ^{'})}$

Here, $h_{C} (θ) [θ]$ is understood to mean $0$ when $h_{C} (θ) = 0$ and the same for $θ^{'}$ .

Proof of Proposition A.1

Obvious from the definitions.
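For readers who want a concrete sanity check, the identity can be verified with exact arithmetic in a simple model. The sketch below assumes (this is a modeling choice, not something taken from the definitions above) that $C$ is a probability simplex and that an element of $Cone C$ is represented by an unnormalized non-negative vector, with $h_{C}$ the total mass and $[θ]$ the normalized vector.

```python
from fractions import Fraction as F


def h(theta):
    """Total mass of an unnormalized vector (plays the role of h_C)."""
    return sum(theta)


def normalize(theta):
    """The point [theta] of the simplex; only defined when h(theta) > 0."""
    s = h(theta)
    return tuple(x / s for x in theta)


def mix(p, t1, t2):
    """Convex combination p * t1 + (1 - p) * t2 inside the cone."""
    return tuple(p * x + (1 - p) * y for x, y in zip(t1, t2))


def rhs(p, t1, t2):
    """Right hand side of Proposition A.1, with h(theta)[theta] read as 0 when h(theta) = 0."""
    w1, w2 = p * h(t1), (1 - p) * h(t2)
    n1 = normalize(t1) if h(t1) != 0 else tuple(F(0) for _ in t1)
    n2 = normalize(t2) if h(t2) != 0 else tuple(F(0) for _ in t2)
    return tuple((w1 * x + w2 * y) / (w1 + w2) for x, y in zip(n1, n2))


if __name__ == "__main__":
    p = F(1, 3)
    theta = (F(1, 2), F(1, 4))
    theta_prime = (F(1, 10), F(3, 10))
    print(normalize(mix(p, theta, theta_prime)) == rhs(p, theta, theta_prime))
```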

Proof of Proposition 2

Given $n \in N^{+}$ and some $α, β \in Δ {(A \times O)}^{n}$ , to show that $α = β$ it is sufficient to show that

For any $a_{*} \in A$ and $o_{*} \in O$ ,

$Pr a o \sim α [a o_{: 1} = a_{*} o_{*}] = Pr a o \sim β [a o_{: 1} = a_{*} o_{*}]$

For any $a_{*} \in A$ and $o_{*} \in O$ , if $Pr a o \sim α [a o_{: 1} = a_{*} o_{*}] > 0$ then for any $h \in {(A \times O)}^{n - 1}$

$Pr a o \sim α [a o = a_{*} o_{*} h | a o_{: 1} = a_{*} o_{*}] = Pr a o \sim β [a o = a_{*} o_{*} h | a o_{: 1} = a_{*} o_{*}]$

We will use this to prove the claim by induction on $n$ . For $n = 0$ , there is nothing to prove. For $n > 0$ , by Definition 7

$\Pr_{a o \sim (p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = π (a_{*} | λ) (p θ + (1 - p) θ^{'}) (o_{*} | a_{*})$

By Definition 6

$\Pr_{a o \sim (p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = π (a_{*} | λ) h_{{I S}_{n - 1}} ({(p θ + (1 - p) θ^{'})}_{a_{*} o_{*}})$

$\Pr_{(p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = p π (a_{*} | λ) h (θ_{a_{*} o_{*}}) + (1 - p) π (a_{*} | λ) h (θ_{a_{*} o_{*}}^{'})$

$\Pr_{(p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = p π (a_{*} | λ) θ (o_{*} | a_{*}) + (1 - p) π (a_{*} | λ) θ^{'} (o_{*} | a_{*})$

$\Pr_{(p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = p \Pr_{θ π} [a o_{: 1} = a_{*} o_{*}] + (1 - p) \Pr_{θ^{'} π} [a o_{: 1} = a_{*} o_{*}]$

$\Pr_{(p θ + (1 - p) θ^{'}) π} [a o_{: 1} = a_{*} o_{*}] = \Pr_{p θ π + (1 - p) θ^{'} π} [a o_{: 1} = a_{*} o_{*}]$

Furthermore, Definitions 6 and 7 imply that

$\Pr_{(p θ + (1 - p) θ^{'}) π} [a_{*} o_{*} h | a o_{: 1} = a_{*} o_{*}] = \Pr_{[{(p θ + (1 - p) θ^{'})}_{a_{*} o_{*}}] σ} [h]$

Here, $σ$ is defined by $σ (h) := π (a_{*} o_{*} h)$ .

Denote

$q := \frac{p h (θ_{a_{*} o_{*}})}{p h (θ_{a_{*} o_{*}}) + (1 - p) h (θ_{a_{*} o_{*}}^{'})}$

By Proposition A.1

$[{(p θ + (1 - p) θ^{'})}_{a_{*} o_{*}}] = q [θ_{a_{*} o_{*}}] + (1 - q) [θ_{a_{*} o_{*}}^{'}]$

We get

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a o_{: 1} = a_{*} o_{*}] = Pr (q [θ_{a_{*} o_{*}}] + (1 - q) [θ_{a_{*} o_{*}}^{'}]) σ [h]$

By the induction hypothesis, we get

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a o_{: 1} = a_{*} o_{*}] = Pr q [θ_{a_{*} o_{*}}] σ + (1 - q) [θ_{a_{*} o_{*}}^{'}] σ [h]$

Decomposing the right hand side and applying the same reasoning in reverse,

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a_{*} o_{*}] = q Pr θ π [a_{*} o_{*} h | a_{*} o_{*}] + (1 - q) Pr θ^{'} π [a_{*} o_{*} h | a_{*} o_{*}]$

By Definition 6, $q$ can be written as

$q = \frac{p θ (o_{*} | a_{*})}{p θ (o_{*} | a_{*}) + (1 - p) θ^{'} (o_{*} | a_{*})}$

Multiplying the numerator and denominator by $π (a_{*} | λ)$ , we get

$q = \frac{p π (a_{*} | λ) θ (o_{*} | a_{*})}{p π (a_{*} | λ) θ (o_{*} | a_{*}) + (1 - p) π (a_{*} | λ) θ^{'} (o_{*} | a_{*})}$

$q = \frac{p Pr θ π [a_{*} o_{*}]}{p Pr θ π [a_{*} o_{*}] + (1 - p) Pr θ^{'} π [a_{*} o_{*}]}$

Combining this with the previous identity, we get

$\dots = \frac{p Pr θ π [a_{*} o_{*}] Pr θ π [a_{*} o_{*} h | a_{*} o_{*}] + (1 - p) Pr θ^{'} π [a_{*} o_{*}] Pr θ^{'} π [a_{*} o_{*} h | a_{*} o_{*}]}{p Pr θ π [a_{*} o_{*}] + (1 - p) Pr θ^{'} π [a_{*} o_{*}]}$

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a_{*} o_{*}] = \frac{p Pr θ π [a_{*} o_{*} h] + (1 - p) Pr θ^{'} π [a_{*} o_{*} h]}{p Pr θ π [a_{*} o_{*}] + (1 - p) Pr θ^{'} π [a_{*} o_{*}]}$

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a_{*} o_{*}] = \frac{Pr p θ π + (1 - p) θ^{'} π [a_{*} o_{*} h]}{Pr p θ π + (1 - p) θ^{'} π [a_{*} o_{*}]}$

$Pr (p θ + (1 - p) θ^{'}) π [a_{*} o_{*} h | a_{*} o_{*}] = Pr p θ π + (1 - p) θ^{'} π [a_{*} o_{*} h | a_{*} o_{*}] ■$

Given $μ \in Δ X^{ω}$ and $f : X^{ω} \to X^{n}$ defined by $f (x) := x_{: n}$ , we will denote $μ_{: n} := f_{*} μ$ .

Proposition A.2

Consider any $C, D \in C o n S e t$ and $f \in M o r (C, D)$ . Then,

$h_{D} \circ Cone f = h_{C}$

Proof of Proposition A.2

Obvious from the definitions.

Proposition A.3

Consider any $C, D \in C o n S e t$ , $f \in M o r (C, D)$ and $θ \in Cone C ∖ 0$ . Then,

$[(Cone f) (θ)] = f ([θ])$

Proof of Proposition A.3

Obvious from the definitions.

Proposition A.4

Consider any $n \in N$ , $θ \in {I S}_{n + 1}$ , $a \in A$ and $h \in dom θ$ s.t. $| h | < n$ . Then,

${p r}_{n} (θ) (h a) = θ (h a)$

Proof of Proposition A.4

We prove the claim by induction on $n$ . For $n = 0$ , there is nothing to prove. Assume $n > 0$ . For $h = λ$ , we have

${p r}_{n} (θ) (o | a) = h ({p r}_{n} (θ)_{a o}) = h ((Cone {p r}_{n - 1}) (θ_{a o}))$

By Proposition A.2

${p r}_{n} (θ) (o | a) = h (θ_{a o}) = θ (o | a)$

Now suppose that $h = a^{'} o^{'} h^{'}$ . We get

${p r}_{n} (θ) (a^{'} o^{'} h^{'} a) = [{p r}_{n} (θ)_{a^{'} o^{'}}] (h^{'} a) = [(Cone {p r}_{n - 1}) (θ_{a^{'} o^{'}})] (h^{'} a)$

By Proposition A.3

${p r}_{n} (θ) (a^{'} o^{'} h^{'} a) = {p r}_{n - 1} [θ_{a^{'} o^{'}}] (h^{'} a)$

Using the induction hypothesis

${p r}_{n} (θ) (a^{'} o^{'} h^{'} a) = [θ_{a^{'} o^{'}}] (h^{'} a) = θ (a^{'} o^{'} h^{'} a) ■$

Proof of Proposition 3

Proposition A.4 implies that, in Definition 11, we can replace $| h | + 1$ by any $m \geq | h | + 1$ . Therefore, for any $n \in N$ , $({p r}_{n}^{ω} μ) π = μ π_{: n}$ . Hence, $μ π = μ^{'} π$ implies that $({p r}_{n}^{ω} μ) π = ({p r}_{n}^{ω} μ^{'}) π$ . By Proposition 1, we get ${p r}_{n}^{ω} μ = {p r}_{n}^{ω} μ^{'}$ , and therefore $μ = μ^{'}$ . $■$

Proof of Proposition 4

For any $n \in N$ , Proposition 2 implies that

$(p \cdot {p r}_{n}^{ω} μ + (1 - p) \cdot {p r}_{n}^{ω} μ^{'}) π = p ({p r}_{n}^{ω} μ) π + (1 - p) ({p r}_{n}^{ω} μ^{'}) π$

${p r}_{n}^{ω}$ is an affine mapping, therefore

${p r}_{n}^{ω} (p μ + (1 - p) μ^{'}) π = p ({p r}_{n}^{ω} μ) π + (1 - p) ({p r}_{n}^{ω} μ^{'}) π$

Using Proposition A.4

$(p μ + (1 - p) μ^{'}) π_{: n} = p μ π_{: n} + (1 - p) μ^{'} π_{: n} = {(p μ π + (1 - p) μ^{'} π)}_{: n}$

Since this holds for any $n$ , we must have that

$(p μ + (1 - p) μ^{'}) π = p μ π + (1 - p) μ^{'} π ■$

Proposition A.5

For any terminable policy $π$

$\lim_{n \to \infty} \sup_{μ \in {I S}_{ω}} \Pr_{h \sim μ π} [| h | > n] = 0$

Proof of Proposition A.5

Assume to the contrary that there is $ϵ \in (0, 1)$ and sequences ${n_{k} \in N}_{k \in N}$ and ${μ_{k} \in {I S}_{ω}}_{k \in N}$ s.t. $n_{k + 1} > n_{k}$ and, for any $k \in N$

$Pr h \sim μ_{k} π [| h | > n_{k}] > ϵ$

${I S}_{ω}$ is compact, therefore ${μ_{k}}$ has a convergent subsequence. Without loss of generality, we can assume that ${μ_{k}}$ itself converges and denote $μ_{*} := {lim}_{k \to \infty} μ_{k}$ . For any $k \in N$ and $j > k$ , we have $n_{k} < n_{j}$ and therefore

$Pr h \sim μ_{j} π [| h | > n_{k}] \geq Pr h \sim μ_{j} π [| h | > n_{j}] > ϵ$

Using Proposition A.4, it follows that

$Pr h \sim {p r}_{n_{k} + 1}^{ω} (μ_{j}) π [| h | > n_{k}] > ϵ$

The left hand side is clearly continuous in ${p r}_{n_{k} + 1}^{ω} (μ_{j})$ , and the latter converges to ${p r}_{n_{k} + 1}^{ω} (μ_{*})$ as $j$ goes to $\infty$ . Passing to the limit turns the strict inequality into a weak one, and we get

$\Pr_{h \sim μ_{*} π} [| h | > n_{k}] = \Pr_{h \sim {p r}_{n_{k} + 1}^{ω} (μ_{*}) π} [| h | > n_{k}] \geq ϵ$

Since this holds for any $k$ , it follows that

$\Pr_{h \sim μ_{*} π} [| h | = \infty] = \lim_{k \to \infty} \Pr_{h \sim μ_{*} π} [| h | > n_{k}] \geq ϵ$

This contradicts the assumption that $π$ is terminable. $■$

Proposition A.6

Consider ${X_{n}}_{n \in N}$ a sequence of compact Polish spaces, and ${{p r}_{n} : X_{n + 1} \to X_{n}}_{n \in N}$ continuous mappings. Denote

$X_{ω} := \varprojlim_{n} X_{n}$

Denote ${p r}_{n}^{ω} : X_{ω} \to X_{n}$ the canonical mapping. Let $f : X_{ω} \to R$ be continuous. Then,

$\lim_{n \to \infty} \sup_{\substack{x \in X_{n} \\ x_{1, 2} \in {({p r}_{n}^{ω})}^{- 1} (x)}} | f (x_{1}) - f (x_{2}) | = 0$

Proof of Proposition A.6

Assume to the contrary that there is $ϵ \in R^{+}$ , and sequences ${n_{k} \in N}_{k \in N}$ , ${x_{1}^{k}, x_{2}^{k} \in X_{ω}}_{k \in N}$ s.t. $n_{k + 1} > n_{k}$ , ${p r}_{n_{k}}^{ω} (x_{1}^{k}) = {p r}_{n_{k}}^{ω} (x_{2}^{k})$ and $f (x_{1}^{k}) - f (x_{2}^{k}) > ϵ$ . Without loss of generality, we assume that $n_{k} = k$ and the limits $x_{1, 2}^{*} := {lim}_{n \to \infty} x_{1, 2}^{n}$ exist (the latter using the fact $X_{ω}$ is compact by Tychonoff's theorem). It follows that $f (x_{1}^{*}) - f (x_{2}^{*}) \geq ϵ$ , and in particular $x_{1}^{*} \neq x_{2}^{*}$ and therefore there is $m \in N$ s.t. ${p r}_{m}^{ω} (x_{1}^{*}) \neq {p r}_{m}^{ω} (x_{2}^{*})$ . On the other hand, ${p r}_{n}^{ω} (x_{1}^{n}) = {p r}_{n}^{ω} (x_{2}^{n})$ , implying that, for $n \geq m$ , ${p r}_{m}^{ω} (x_{1}^{n}) = {p r}_{m}^{ω} (x_{2}^{n})$ and therefore ${p r}_{m}^{ω} (x_{1}^{*}) = {p r}_{m}^{ω} (x_{2}^{*})$ , a contradiction. $■$

Proposition A.7

Let $V$ be a finite-dimensional normed vector space and $W \subseteq V^{*}$ a linear subspace. Suppose $v \in V$ and $ϵ \in R^{+}$ are s.t. for any $α \in W$ , $| α (v) | \leq ϵ ∥ α ∥$ . Then, there is $v^{'} \in W^{⊥} \subseteq V$ s.t. $∥ v - v^{'} ∥ \leq ϵ$ .

In the above, $∥ α ∥$ refers to the standard dual norm on $V^{*}$ , induced by the norm on $V$ . $W^{⊥}$ refers to the set ${v \in V | \forall α \in W : α (v) = 0}$ .

Proof of Proposition A.7

Consider $v^{*} \in W^{*}$ defined by $v^{*} (α) := α (v)$ . By the assumption about $v$ , $∥ v^{*} ∥ \leq ϵ$ . By the Hahn-Banach theorem, there is $u^{*} \in V^{* *}$ s.t. $u^{*} |_{W} = v^{*}$ and $∥ u^{*} ∥ \leq ϵ$ . Using the canonical isomorphism $V ≅ V^{* *}$ , $u^{*}$ corresponds to some $u \in V$ , and it follows that $∥ u ∥ \leq ϵ$ and for any $α \in W$ , $α (u) = α (v)$ . We now take $v^{'} := v - u$ . $■$

Proposition A.8

Let $C, D \in C o n S e t$ , $f \in M o r (C, D)$ and $r \in M o r (C, R)$ . Assume $D$ is a bounded closed finite-dimensional polytope, and $f$ is onto (as a mapping). Suppose $ϵ \in R^{+}$ s.t. for any $x_{1, 2} \in C$ , if $f (x_{1}) = f (x_{2})$ then $| r (x_{1}) - r (x_{2}) | \leq ϵ$ . Then, there exists $r^{'} \in M o r (D, R)$ s.t. for any $x \in C$

$| r (x) - r^{'} (f (x)) | \leq \frac{3}{2} ϵ$

Proof of Proposition A.8

Let $X$ be the vector space corresponding to $C$ , let $Y$ be the vector space corresponding to $D$ (so that $C \subseteq X$ and $D \subseteq Y$ ) and let $Q$ be the (finite) set of vertices of $D$ . Since $f$ is onto, we can choose some $g : Q \to C$ s.t. for any $q \in Q$ , $f (g (q)) = q$ . Define $v \in R^{Q}$ by $v_{q} := r (g (q))$ . Let $R_{0}^{Q}$ be the linear subspace of $R^{Q}$ given by

$R_{0}^{Q} := \{ u \in R^{Q} \mid \sum_{q \in Q} u_{q} = 0 \}$

Define the linear operators $A : R^{Q} \to X$ and $B : R^{Q} \to Y$ by

$A u := \sum_{q \in Q} u_{q} g (q)$

$B u := \sum_{q \in Q} u_{q} q$

Consider any $w \in ker B \cap R_{0}^{Q} ∖ 0$ . In particular, $w \in R_{0}^{Q}$ and therefore

$0 = \sum_{q \in Q} w_{q} = \sum_{q \in Q} 1_{w_{q} > 0} w_{q} + \sum_{q \in Q} 1_{w_{q} < 0} w_{q}$

Also

${∥ w ∥}_{1} = \sum_{q \in Q} | w_{q} | = \sum_{q \in Q} 1_{w_{q} > 0} w_{q} - \sum_{q \in Q} 1_{w_{q} < 0} w_{q}$

Combining the last two identities, we conclude

$\sum_{q \in Q} 1_{w_{q} > 0} w_{q} = - \sum_{q \in Q} 1_{w_{q} < 0} w_{q} = \frac{{∥ w ∥}_{1}}{2}$

Using the assumption $w \neq 0$ , this allows us to define $w^{+}, w^{-} \in Δ Q$ by

$w_{q}^{+} := \frac{2}{{∥ w ∥}_{1}} 1_{w_{q} > 0} w_{q}$

$w_{q}^{-} := - \frac{2}{{∥ w ∥}_{1}} 1_{w_{q} < 0} w_{q}$

We have $w = \frac{1}{2} {∥ w ∥}_{1} (w^{+} - w^{-})$ . Also, $B w = 0$ and therefore $B w^{+} = B w^{-}$ .

Now, for any $u \in Δ Q$ , we have

$u \cdot v = \sum_{q \in Q} u_{q} v_{q} = \sum_{q \in Q} u_{q} r (g (q)) = r \left( \sum_{q \in Q} u_{q} g (q) \right) = r (A u)$

Moreover,

$f (A u) = f \left( \sum_{q \in Q} u_{q} g (q) \right) = \sum_{q \in Q} u_{q} f (g (q)) = \sum_{q \in Q} u_{q} q = B u$

It follows that

$w \cdot v = \frac{1}{2} {∥ w ∥}_{1} (w^{+} \cdot v - w^{-} \cdot v) = \frac{1}{2} {∥ w ∥}_{1} (r (A w^{+}) - r (A w^{-}))$

By our previous reasoning, $f (A w^{+}) = B w^{+} = B w^{-} = f (A w^{-})$ . Therefore, we can use the assumption on $r$ to conclude that

$| w \cdot v | = \frac{1}{2} {∥ w ∥}_{1} | r (A w^{+}) - r (A w^{-}) | \leq \frac{ϵ}{2} {∥ w ∥}_{1}$

By Proposition A.7, it follows that there is some $v^{'} \in R^{Q}$ s.t. $v^{'} \in {(ker B \cap R_{0}^{Q})}^{⊥}$ and ${∥ v - v^{'} ∥}_{\infty} \leq \frac{ϵ}{2}$ (the $L^{\infty}$ norm is dual to the $L^{1}$ norm).

Now, consider some $y \in D$ . There is some $u \in Δ Q$ s.t. $y = B u$ . We define $r^{'} (y) := u \cdot v^{'}$ . To see this definition is unambiguous, consider some $u^{'} \in Δ Q$ s.t. also $y = B u^{'}$ . In particular, $B (u - u^{'}) = 0$ and therefore $u - u^{'} \in ker B$ . Moreover, $u - u^{'} \in R_{0}^{Q}$ since $u, u^{'} \in Δ Q$ . Using that $u - u^{'} \in ker B \cap R_{0}^{Q}$ and $v^{'} \in {(ker B \cap R_{0}^{Q})}^{⊥}$ , we get

$u \cdot v^{'} - u^{'} \cdot v^{'} = (u - u^{'}) \cdot v^{'} = 0$

It is easy to see that $r^{'}$ is affine.

Finally, consider any $x \in C$ . Choose $u \in Δ Q$ s.t. $f (x) = B u$ . We have

$| r (x) - r^{'} (f (x)) | \leq | r (x) - r (A u) | + | r (A u) - r^{'} (f (x)) |$

Since $f (A u) = B u$ , we can use the assumption on $r$ to bound the first term on the right hand side:

$| r (x) - r (A u) | \leq ϵ$

For the second term, we have

$| r (A u) - r^{'} (f (x)) | = | u \cdot v - u \cdot v^{'} | = | u \cdot (v - v^{'}) | \leq {∥ u ∥}_{1} {∥ v - v^{'} ∥}_{\infty} \leq \frac{ϵ}{2}$

Combining the two, we conclude

$| r (x) - r^{'} (f (x)) | \leq \frac{3}{2} ϵ ■$

Proposition A.9

Let $A$ be a finite set, $C \in C o n S e t$ and $R \in M o r (C^{A}, [0, 1])$ . Assume $R$ attains its infimum at some point $θ^{*} \in C^{A}$ . Then, there exist $π \in Δ A$ and $r : A \to M o r (C, [0, 1])$ s.t. for any $θ \in C^{A}$

$R (θ) = E_{a \sim π} [r (a, θ_{a})]$

Here, we used implicit currying: $r (a, x) := r (a) (x)$ .

Proof of Proposition A.9

For each $a \in A$ , define $ι_{a} \in M o r (C, C^{A})$ by

$ι_{a} (x)_{b} := \begin{cases} x & \text{if } a = b \\ θ_{b}^{*} & \text{if } a \neq b \end{cases}$

Denote $R_{a} := R \circ ι_{a}$ and $N_{a} := sup R_{a} - inf R_{a}$ . If $N_{a}$ vanishes for every $a \in A$ , then $R$ is constant, so we can set $r$ to the same constant and take arbitrary $π$ . Otherwise, we define $π$ and $r$ by

$π_{a} := \frac{N_{a}}{\sum_{b \in A} N_{b}}$

$r (a, x) := R (θ^{*}) + \frac{\sum_{b \in A} N_{b}}{N_{a}} (R_{a} (x) - R (θ^{*}))$

We need to show that $r$ takes values in $[0, 1]$ . It is non-negative since $R$ is minimal at $θ^{*}$ , so both terms in $r$ are non-negative. To see it is not greater than $1$ , observe that

$r (a, x) = R (θ^{*}) + \frac{\sum_{b \in A} N_{b}}{N_{a}} (R_{a} (x) - R (θ^{*})) \leq inf R + \frac{\sum_{b \in A} N_{b}}{N_{a}} (sup R_{a} - inf R)$

Clearly $inf R = inf R_{a}$ and therefore $sup R_{a} - inf R = N_{a}$ . It is also easy to see that, since $R$ is affine, $sup R - inf R = \sum_{b \in A} N_{b}$ . We get

$r (a, x) \leq inf R + \frac{\sum_{b \in A} N_{b}}{N_{a}} N_{a} = inf R + \sum_{b \in A} N_{b} = sup R \leq 1$

It remains to show that $R (θ) = E_{a \sim π} [r (a, θ_{a})]$ . We have

$E_{a \sim π} [r (a, θ_{a})] = \sum_{a \in A} π_{a} r (a, θ_{a})$

$E_{a \sim π} [r (a, θ_{a})] = \sum_{a \in A} π_{a} (R (θ^{*}) + \frac{\sum_{b \in A} N_{b}}{N_{a}} (R_{a} (θ_{a}) - R (θ^{*})))$

$E_{a \sim π} [r (a, θ_{a})] = \sum_{a \in A} π_{a} R (θ^{*}) + \sum_{a \in A} π_{a} \frac{\sum_{b \in A} N_{b}}{N_{a}} (R_{a} (θ_{a}) - R (θ^{*}))$

$E_{a \sim π} [r (a, θ_{a})] = R (θ^{*}) + \sum_{a \in A} \frac{N_{a}}{\sum_{b \in A} N_{b}} \cdot \frac{\sum_{b \in A} N_{b}}{N_{a}} (R_{a} (θ_{a}) - R (θ^{*}))$

$E_{a \sim π} [r (a, θ_{a})] = R (θ^{*}) + \sum_{a \in A} (R_{a} (θ_{a}) - R (θ^{*}))$

Using the fact $R$ is affine, the desired conclusion follows. $■$
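The construction can be checked numerically on a small example. The sketch below assumes $C = [0, 1]$ and a hypothetical affine $R$ with hand-picked coefficients; since each $R_{a}$ is affine on an interval, its supremum and infimum are read off at the endpoints. Actions with $N_{a} = 0$ get probability zero, so their reward is set arbitrarily to $R (θ^{*})$ (an implementation choice, the proof above does not treat this case separately).

```python
ACTIONS = (0, 1, 2)
OFFSET = 0.6                     # hypothetical constant term of R
COEF = {0: 0.2, 1: -0.5, 2: 0.0}  # hypothetical slopes; R stays inside [0, 1]


def R(theta):
    """An affine function on C^A with C = [0, 1]; theta maps actions to points of C."""
    return OFFSET + sum(COEF[a] * theta[a] for a in ACTIONS)


# R is minimized by pushing each coordinate against the sign of its slope.
THETA_STAR = {0: 0.0, 1: 1.0, 2: 0.0}


def decompose(R, theta_star, actions, endpoints=(0.0, 1.0)):
    """Build pi and r as in the proof, so that R(theta) = E_{a ~ pi}[r(a, theta_a)]."""
    base = R(theta_star)

    def R_a(a, x):
        theta = dict(theta_star)
        theta[a] = x
        return R(theta)

    N = {a: max(R_a(a, x) for x in endpoints) - min(R_a(a, x) for x in endpoints)
         for a in actions}
    total = sum(N.values())
    if total == 0:
        # R is constant: any pi works, and r is the same constant.
        return {a: 1.0 / len(actions) for a in actions}, (lambda a, x: base)
    pi = {a: N[a] / total for a in actions}

    def r(a, x):
        if N[a] == 0:
            return base  # this action is never sampled, so the value is arbitrary
        return base + total / N[a] * (R_a(a, x) - base)

    return pi, r


if __name__ == "__main__":
    pi, r = decompose(R, THETA_STAR, ACTIONS)
    theta = {0: 0.3, 1: 0.25, 2: 0.9}
    print(R(theta), sum(pi[a] * r(a, theta[a]) for a in ACTIONS))
```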

Proposition A.10

Consider any $n \in N$ and $R \in M o r ({I S}_{n}, [0, 1])$ . Then, there exist $π : {(A \times O)}^{< n} \xrightarrow{k} A$ and $r : {(A \times O)}^{n} \to [0, 1]$ s.t. for any $θ \in {I S}_{n}$

$R (θ) = E_{θ π} [r]$

Proof of Proposition A.10

We proceed by induction on $n$ . For $n = 0$ , $R$ is constant and we set $r$ to the same constant. Now, suppose $n > 0$ .

Define ${I S}_{n - \frac{1}{2}} \in C o n S e t$ by

${I S}_{n - \frac{1}{2}} := {(\prod_{o \in O} h_{{I S}_{n - 1}})}^{- 1} (Δ O)$

Using Proposition A.9, we get $π_{0} \in Δ A$ and $r_{0} : A \to M o r ({I S}_{n - \frac{1}{2}}, [0, 1])$ s.t. for any $θ \in {I S}_{n}$

$R (θ) = E a \sim π_{0} [r_{0} (a, θ_{a})]$

For every $o \in O$ , we define $j_{o} \in M o r ({I S}_{n - 1}, {I S}_{n - \frac{1}{2}})$ by

$j_{o} (θ)_{o^{'}} := \begin{cases} (θ, 1) & \text{if } o = o^{'} \\ 0 & \text{if } o \neq o^{'} \end{cases}$

For every $a \in A$ and $o \in O$ , we apply the induction hypothesis to $r_{0} (a) \circ j_{o}$ , and get $π_{a o} : {(A \times O)}^{< n - 1} \xrightarrow{k} A$ and $r_{a o} : {(A \times O)}^{n - 1} \to [0, 1]$ s.t. for any $θ \in {I S}_{n - 1}$

$r_{0} (a, j_{o} (θ)) = E_{θ π_{a o}} [r_{a o}]$

We now define $π$ and $r$ by

$π (λ) := π_{0}$ $π (a o h) := π_{a o} (h)$

$r (a o h) := r_{a o} (h)$

Now, observe that, for any $θ \in {I S}_{n}$

$E_{h \sim θ π} [r (h)] = E_{\substack{a \sim π_{0} \\ o \sim θ (a)}} [E_{h^{'} \sim [θ_{a o}] π_{a o}} [r_{a o} (h^{'})]] = E_{\substack{a \sim π_{0} \\ o \sim θ (a)}} [r_{0} (a, j_{o} [θ_{a o}])]$

Using the fact that $r_{0}$ is affine in the second argument, we get

$E_{h \sim θ π} [r (h)] = E_{a \sim π_{0}} [r_{0} (a, E_{o \sim θ (a)} [j_{o} [θ_{a o}]])]$

Moreover, using Definition 6 and the definition of $j_{o}$ , we get

$E_{o \sim θ (a)} [j_{o} [θ_{a o}]] = \sum_{o \in O} h (θ_{a o}) \bigoplus_{o^{'} \in O} (1_{o^{'} = o} θ_{a o}, 1_{o^{'} = o})$

$E_{o \sim θ (a)} [j_{o} [θ_{a o}]] = \bigoplus_{o^{'} \in O} (\sum_{o \in O} h (θ_{a o}) 1_{o^{'} = o} θ_{a o}, \sum_{o \in O} h (θ_{a o}) 1_{o^{'} = o})$

$E_{o \sim θ (a)} [j_{o} [θ_{a o}]] = \bigoplus_{o^{'} \in O} (h (θ_{a o^{'}}) θ_{a o^{'}}, h (θ_{a o^{'}})) = \bigoplus_{o^{'} \in O} θ_{a o^{'}} = θ_{a}$

Combining this with the previous identity, we get

$E_{h \sim θ π} [r (h)] = E_{a \sim π_{0}} [r_{0} (a, θ_{a})] = R (θ) ■$

Proposition A.11

Consider some $α \in Δ N$ and sequences

$\{π_{k} : {(A \times O)}^{*} \xrightarrow{k} A ⊔ \{⊥\}\}_{k \in N}$

$\{r_{k} : {(A \times O)}^{*} \to [0, 1]\}_{k \in N}$

Assume $π_{k}$ is a terminable policy for every $k \in N$ . Then, there is $π$ a terminable policy and $r : {(A \times O)}^{*} \to [0, 1]$ s.t.

$R_{π r} = E_{k \sim α} [R_{π_{k} r_{k}}]$

Proof of Proposition A.11

Define $π$ and $r$ by setting, for any $n \in N$ and $a o \in {(A \times O)}^{n}$

$π (a o) := \frac{E_{k \sim α} [\prod_{m = 0}^{n - 1} π_{k} (a_{m} | a o_{: m}) \cdot π_{k} (a o)]}{E_{k \sim α} [\prod_{m = 0}^{n - 1} π_{k} (a_{m} | a o_{: m})]}$

$r (a o) := \frac{E_{k \sim α} [\prod_{m = 0}^{n - 1} π_{k} (a_{m} | a o_{: m}) \cdot π_{k} (⊥ | a o) r_{k} (a o)]}{E_{k \sim α} [\prod_{m = 0}^{n - 1} π_{k} (a_{m} | a o_{: m}) \cdot π_{k} (⊥ | a o)]}$

When the denominator vanishes, we can make an arbitrary choice.

Essentially, what we do is sample $k$ from $α$ and then run the "experiment" defined by $(π_{k}, r_{k})$ . We have

$R_{π r} (μ) = E_{μ π} [r] = E_{k \sim α} [E_{μ π_{k}} [r_{k}]] = E_{k \sim α} [R_{π_{k} r_{k}} (μ)] ■$
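The mixing construction above can be checked by brute force on a small case. The sketch below is only an illustration under stated assumptions: the environment `env`, the three terminable policies and their reward functions are all hypothetical, policies are represented as functions from a history to a distribution over actions plus a stop symbol, and expectations are computed by exhaustive recursion (every policy here stops within two steps).

```python
ACTIONS = ("a", "b")
STOP = "stop"  # plays the role of the termination symbol


def env(history, action):
    """Hypothetical environment: distribution of the next observation."""
    p1 = 0.8 if action == "a" else 0.3
    return {0: 1 - p1, 1: p1}


def pi0(h): return {STOP: 1.0}
def r0(h): return 0.25
def pi1(h): return {"a": 0.5, "b": 0.5} if len(h) == 0 else {STOP: 1.0}
def r1(h): return float(h[-1][1]) if h else 0.0
def pi2(h): return {"a": 1.0} if len(h) < 2 else {STOP: 1.0}
def r2(h): return 0.5 * (h[0][1] + h[1][1]) if len(h) == 2 else 0.0


def expected_reward(pi, r, h=(), max_len=4):
    """Expected value of r at the (random) history where pi terminates."""
    dist = pi(h)
    total = dist.get(STOP, 0.0) * r(h)
    if len(h) >= max_len:
        return total
    for a in ACTIONS:
        pa = dist.get(a, 0.0)
        if pa == 0.0:
            continue
        for o, po in env(h, a).items():
            total += pa * po * expected_reward(pi, r, h + ((a, o),), max_len)
    return total


def weight(pi, h):
    """Probability that pi produces the actions recorded in history h."""
    w = 1.0
    for m, (a, _) in enumerate(h):
        w *= pi(h[:m]).get(a, 0.0)
    return w


def mix(alpha, pis, rs):
    """Mixture policy and reward built from the weighted formulas above."""
    def pi(h):
        ws = [al * weight(p, h) for al, p in zip(alpha, pis)]
        z = sum(ws)
        if z == 0.0:
            return {STOP: 1.0}  # unreachable history: arbitrary choice
        return {x: sum(w * p(h).get(x, 0.0) for w, p in zip(ws, pis)) / z
                for x in ACTIONS + (STOP,)}

    def r(h):
        ws = [al * weight(p, h) * p(h).get(STOP, 0.0) for al, p in zip(alpha, pis)]
        z = sum(ws)
        if z == 0.0:
            return 0.0  # never terminates here: arbitrary choice
        return sum(w * rk(h) for w, rk in zip(ws, rs)) / z

    return pi, r


ALPHA = (0.2, 0.3, 0.5)
PIS, RS = (pi0, pi1, pi2), (r0, r1, r2)

if __name__ == "__main__":
    pi, r = mix(ALPHA, PIS, RS)
    lhs = expected_reward(pi, r)
    rhs = sum(al * expected_reward(p, rk) for al, p, rk in zip(ALPHA, PIS, RS))
    print(lhs, rhs)
```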

For any $n, m \in N$ s.t. $m \geq n$ , we will denote by ${p r}_{n}^{m} : {I S}_{m} \to {I S}_{n}$ the canonical projection. That is

${p r}_{n}^{m} := {p r}_{m - 1} \circ {p r}_{m - 2} \dots \circ {p r}_{n + 1} \circ {p r}_{n}$

Proof of Theorem 1

Direction 1: $R_{π r}$ is affine by Proposition 4. To verify it is continuous, consider a sequence ${μ_{k} \in {I S}_{ω}}_{k \in N}$ s.t. $μ_{*} := {lim}_{k \to \infty} μ_{k}$ exists. Denote $Δ r := sup r - inf r$ , $μ_{k}^{n} := {p r}_{n}^{ω} (μ_{k})$ and $μ_{*}^{n} := {p r}_{n}^{ω} (μ_{*})$ . For any $n \in N$ , we can decompose the expected value into the contribution of histories of length at most $n$ and the contribution of longer histories.

$| R_{π r} (μ_{k}) - R_{π r} (μ_{*}) | \leq Δ r (d_{tv} (μ_{k}^{n}, μ_{*}^{n}) + \Pr_{μ_{k} π} [| h | > n] + \Pr_{μ_{*} π} [| h | > n])$

$μ_{k}^{n}$ converges to $μ_{*}^{n}$ as $k$ goes to $\infty$ , therefore $d_{tv} (μ_{k}^{n}, μ_{*}^{n})$ converges to 0 and we get

$\limsup_{k \to \infty} | R_{π r} (μ_{k}) - R_{π r} (μ_{*}) | \leq 2 Δ r \sup_{μ \in {I S}_{ω}} \Pr_{μ π} [| h | > n]$

Since this holds for any $n \in N$ , we get

$\limsup_{k \to \infty} | R_{π r} (μ_{k}) - R_{π r} (μ_{*}) | \leq 2 Δ r \inf_{n \in N} \sup_{μ \in {I S}_{ω}} \Pr_{μ π} [| h | > n]$

By Proposition A.5, the right hand side vanishes.

Direction 2: Define $\{ϵ_{n} \in [0, 1]\}_{n \in N}$ by

$ϵ_{n} := \sup_{\substack{θ \in {I S}_{n} \\ μ_{1, 2} \in {({p r}_{n}^{ω})}^{- 1} (θ)}} | R (μ_{1}) - R (μ_{2}) |$

By Proposition A.6, ${lim}_{n \to \infty} ϵ_{n} = 0$ . By Proposition A.8, there is a sequence $\{\tilde{R}_{n} \in M o r ({I S}_{n}, [0, 1])\}_{n \in N}$ s.t. for any $n \in N$

$\sup_{μ \in {I S}_{ω}} | R (μ) - \tilde{R}_{n} ({p r}_{n}^{ω} (μ)) | \leq \frac{3}{2} ϵ_{n}$

It follows that there is a sequence $\{n_{k} \in N\}_{k \in N}$ s.t. $n_{k + 1} > n_{k}$ and

$\sum_{k = 0}^{\infty} \sup_{μ \in {I S}_{ω}} | \tilde{R}_{n_{k + 1}} ({p r}_{n_{k + 1}}^{ω} (μ)) - \tilde{R}_{n_{k}} ({p r}_{n_{k}}^{ω} (μ)) | < 1$

Define $Δ R_{k} \in M o r ({I S}_{n_{k + 1}}, [- 1, 1])$ by

$Δ R_{k} (θ) := \tilde{R}_{n_{k + 1}} (θ) - \tilde{R}_{n_{k}} ({p r}_{n_{k}}^{n_{k + 1}} (θ))$

Denote $δ_{k} := sup | Δ R_{k} |$ . We can assume without loss of generality that $δ_{k} > 0$ for every $k \in N$ (this can be arranged by choosing appropriate $n_{k}$ , unless for some $m \in N$ , $R = {p r}_{m}^{ω} \circ \tilde{R}_{m}$ ; but in the latter case, the theorem follows directly from Proposition A.10). By Proposition A.10, for every $k \in N$ , there is $π_{k + 1} : {(A \times O)}^{< n_{k + 1}} \xrightarrow{k} A$ and $r_{k + 1} : {(A \times O)}^{n_{k + 1}} \to [- 1, 1]$ s.t.

$\frac{1}{δ_{k}} Δ R_{k} (θ) = E_{θ π_{k + 1}} [r_{k + 1}]$

Also by Proposition A.10, there is $π_{0} : {(A \times O)}^{< n_{0}} \xrightarrow{k} A$ and $r_{0} : {(A \times O)}^{n_{0}} \to [0, 1]$ s.t.

$\tilde{R}_{n_{0}} (θ) = E_{θ π_{0}} [r_{0}]$

By the construction, $\sum_{k = 0}^{\infty} δ_{k} < 1$ , and moreover, denoting $C := 1 + \sum_{k = 0}^{\infty} δ_{k}$

$R = {p r}_{n_{0}}^{ω} \circ \tilde{R}_{n_{0}} + \sum_{k = 0}^{\infty} {p r}_{n_{k + 1}}^{ω} \circ Δ R_{k} = C (\frac{1}{C} {p r}_{n_{0}}^{ω} \circ \tilde{R}_{n_{0}} + \sum_{k = 0}^{\infty} \frac{δ_{k}}{C} \cdot \frac{{p r}_{n_{k + 1}}^{ω} \circ Δ R_{k}}{δ_{k}})$

By Proposition A.11, there is $π$ a terminable policy and $r : {(A \times O)}^{*} \to [- 2, 2]$ (the range of $r_{k}$ is $[- 1, 1]$ and $C < 2$ ) s.t. $R_{π r} = R$ . $■$

Proposition A.12

Consider any $λ \in (0, 1)$ , $n \in N$ , $θ \in {I S}_{n}$ and $μ_{1, 2} \in {({p r}_{n}^{ω})}^{- 1} (θ)$ . Then,

$d_{tv}^{λ} (μ_{1}, μ_{2}) \leq λ^{n + 1}$

Proof of Proposition A.12

By Definition 16,

$d_{tv}^{λ} (μ_{1}, μ_{2}) = \sup_{m \in N} λ^{m} d_{tv} ({p r}_{m}^{ω} μ_{1}, {p r}_{m}^{ω} μ_{2})$

For $m \leq n$ , ${p r}_{m}^{ω} μ_{1} = {p r}_{m}^{n} θ = {p r}_{m}^{ω} μ_{2}$ , and therefore $d_{tv} ({p r}_{m}^{ω} μ_{1}, {p r}_{m}^{ω} μ_{2}) = 0$ . For $m > n$ , $d_{tv} ({p r}_{m}^{ω} μ_{1}, {p r}_{m}^{ω} μ_{2}) \leq 1$ by Definition 15. We get

$d_{tv}^{λ} (μ_{1}, μ_{2}) \leq \sup_{m > n} λ^{m} = λ^{n + 1} ■$

Proof of Proposition 5

Define $\{ϵ_{n} \in [0, 1]\}_{n \in N}$ by

$ϵ_{n} := \sup_{\substack{θ \in {I S}_{n} \\ μ_{1, 2} \in {({p r}_{n}^{ω})}^{- 1} (θ)}} | R (μ_{1}) - R (μ_{2}) |$

Let $L \in R$ be the Lipschitz constant of $R$ . We have

$ϵ_{n} \leq L \sup_{\substack{θ \in {I S}_{n} \\ μ_{1, 2} \in {({p r}_{n}^{ω})}^{- 1} (θ)}} d_{tv}^{λ} (μ_{1}, μ_{2})$

By Proposition A.12, we get

$ϵ_{n} \leq L λ^{n + 1}$

Without loss of generality, assume the range of $R$ is contained in $[0, 1]$ . By Proposition A.8, it follows that there is a sequence $\{\tilde{R}_{n} \in M o r ({I S}_{n}, [0, 1])\}_{n \in N}$ s.t. for any $n \in N$

$\sup_{μ \in {I S}_{ω}} | R (μ) - \tilde{R}_{n} ({p r}_{n}^{ω} (μ)) | \leq \frac{3}{2} L λ^{n + 1}$

It follows that

$\sum_{n = 0}^{\infty} \sup_{μ \in {I S}_{ω}} | \tilde{R}_{n + 1} ({p r}_{n + 1}^{ω} (μ)) - \tilde{R}_{n} ({p r}_{n}^{ω} (μ)) | \leq \frac{3}{2} L \sum_{n = 0}^{\infty} (λ^{n + 2} + λ^{n + 1})$

$\sum_{n = 0}^{\infty} \sup_{μ \in {I S}_{ω}} | \tilde{R}_{n + 1} ({p r}_{n + 1}^{ω} (μ)) - \tilde{R}_{n} ({p r}_{n}^{ω} (μ)) | \leq \frac{3 L}{2} \cdot \frac{λ^{2} + λ}{1 - λ}$

Define $Δ R_{n} \in M o r ({I S}_{n + 1}, [- 1, 1])$ by

$Δ R_{n} (θ) := {~ R}_{n + 1} (θ) - {~ R}_{n} ({p r}_{n} (θ))$

Denote $δ_{n} := \sup | Δ R_{n} |$. By Proposition A.10, for every $n \in N$ s.t. $δ_{n} > 0$, there is $π_{n + 1} : {(A \times O)}^{\leq n} \xrightarrow{k} A$ and $r_{n + 1} : {(A \times O)}^{n + 1} \to [- 1, 1]$ s.t.

$\frac{1}{δ_{n}} Δ R_{n} (θ) = E_{θ π_{n + 1}} [r_{n + 1}]$

$\tilde{R}_{0}$ is a constant. Define $C \in R^{+}$ by

$C := 1 + \sum_{k = 0}^{\infty} δ_{k} \leq 1 + \frac{3 L}{2} \cdot \frac{λ^{2} + λ}{1 - λ}$

We have

$R = \tilde{R}_{0} + \sum_{n = 0}^{\infty} {p r}_{n + 1}^{ω} \circ Δ R_{n} = C \left( \frac{1}{C} \tilde{R}_{0} + \sum_{n = 0}^{\infty} \frac{δ_{n}}{C} \cdot \frac{{p r}_{n + 1}^{ω} \circ Δ R_{n}}{δ_{n}} \right)$

Here, the terms with $δ_{n} = 0$ are understood to vanish. By Proposition A.11, there is $π$ a terminable policy and $r : {(A \times O)}^{*} \to [- C, C]$ s.t. $R_{π r} = R$ . Moreover, looking at the construction in the proof of Proposition A.11, we can see that

$E_{h \sim μ π} [| h |] = \sum_{n = 0}^{\infty} \frac{δ_{n}}{C} (n + 1) \leq \sum_{n = 0}^{\infty} \frac{3 L}{2} (λ^{n + 2} + λ^{n + 1}) (n + 1) < \infty ■$
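
The two geometric series used in this proof can be checked numerically. The sketch below (hypothetical helper names, truncation at a fixed number of terms) compares partial sums with the closed forms $\frac{λ^{2} + λ}{1 - λ}$ and $\frac{λ^{2} + λ}{(1 - λ)^{2}}$; the second closed form is a standard identity that is not stated in the text and is what makes the expected history length finite.

```python
def partial_sums(lam, terms=2000):
    """Truncated versions of the two series appearing in the proof of Proposition 5."""
    plain = sum(lam ** (n + 2) + lam ** (n + 1) for n in range(terms))
    weighted = sum((n + 1) * (lam ** (n + 2) + lam ** (n + 1)) for n in range(terms))
    return plain, weighted


def closed_forms(lam):
    """Closed forms of the same series, valid for 0 < lam < 1."""
    plain = (lam ** 2 + lam) / (1 - lam)
    weighted = (lam ** 2 + lam) / (1 - lam) ** 2
    return plain, weighted
```

For instance, `partial_sums(0.9)` and `closed_forms(0.9)` agree up to floating-point error.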

In order to prove Theorem 2, we will consider the policy $π_{γ, T}^{PS} : S^{*} \times S \xrightarrow{k} A$ implemented by an algorithm which is PSRL with two modifications:

Each episode has random duration, uniformly distributed from $1$ to $2 T - 1$ , for some fixed $T \in N^{+}$ .
Between any two regular episodes, there is an "interlude" which consists of performing the reward estimation experiment ( $π_{est}$ ).
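
The timing structure of the two modifications can be sketched as follows. This is only an illustration: the function name is hypothetical, and the interlude is given a fixed length `interlude_len`, whereas in the text the interlude runs the terminable policy $π_{est}$ and so has a random duration. The pairs returned correspond to $(N_{k}, M_{k})$, the start times of the $k$-th episode and the $k$-th interlude.

```python
import random


def episode_schedule(T, interlude_len, num_episodes, seed=0):
    """Return a list of (N_k, M_k): episode start and interlude start times."""
    rng = random.Random(seed)
    schedule = []
    time = 0
    for _ in range(num_episodes):
        duration = rng.randint(1, 2 * T - 1)  # uniform on {1, ..., 2T - 1}
        episode_start = time                  # N_k
        interlude_start = time + duration     # M_k
        schedule.append((episode_start, interlude_start))
        # Assumption: fixed-length interlude standing in for the estimation experiment.
        time = interlude_start + interlude_len
    return schedule
```

A hypothesis would be resampled from the posterior at each `episode_start`; that part of PSRL is omitted here.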

The PSRL algorithm assumes choosing $Π : H \times S \to A$ , a Borel measurable mapping s.t. $Π (T)$ is an optimal policy, i.e.

${E U}_{T R_{T}}^{Π (T)} (γ) = {E U}_{T R_{T}}^{*} (γ)$

As usual, we will consider $(Ω, P)$, a probability space governing the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. Let $Y_{*} : Ω \to H$ be a random variable representing the true hypothesis, $\{Y_{k} : Ω \to H\}_{k \in N}$ be random variables s.t. $Y_{k}$ represents the hypothesis sampled at episode $k$, $\{Θ_{n} : Ω \to S\}_{n \in N}$ be s.t. $Θ_{n}$ represents the state at time $n$, $\{A_{n} : Ω \to A\}_{n \in N}$ be s.t. $A_{n}$ represents the action taken at time $n$, $\{N_{k} : Ω \to N\}_{k \in N}$ be s.t. $N_{k}$ represents the time when the $k$-th episode starts and $\{M_{k} : Ω \to N\}_{k \in N}$ be s.t. $M_{k}$ represents the time when the $k$-th interlude starts.

This probability space can be formally defined via recursive equations on the random variables, but this is straightforward and we will omit it.

Proposition A.13

In the setting of Theorem 2, fix $T \in N^{+}$ and $γ \in (0, 1)$ .

Denote $R_{k} := R_{Y_{k}}$ , $R_{*} := R_{Y_{*}}$ , $Δ R_{k} := R_{k} - R_{*}$ , $Δ Y_{k} := Y_{k} - Y_{*}$ , $E_{* k} := Y_{*} Π (Y_{k})$ , $E_{k *} := Y_{k} Π (Y_{*})$ , $E_{* *} := Y_{*} Π (Y_{*})$ , $Δ N_{k} := M_{k} - N_{k}$ and $Δ M_{k} := N_{k + 1} - M_{k}$ . Then,

$R (γ) = \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} \left( (1 - γ) Δ R_{k} (Θ_{n}) + γ E_{Δ Y_{k}} [V_{Y_{k} R_{k}} (γ) \mid Θ_{n}, A_{n}] \right) \right]$ $+ \sum_{k = 0}^{\infty} E \left[ γ^{M_{k}} E_{E_{k *}^{Δ N_{k}} - E_{* k}^{Δ N_{k}}} [V_{Y_{*} R_{*}} (γ) \mid Θ_{N_{k}}] \right]$ $+ (1 - γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = M_{k}}^{N_{k + 1} - 1} γ^{n} E_{E_{* *}^{n - M_{k}} - E_{* k}^{n - M_{k}}} [R_{*} \mid Θ_{M_{k}}] \right]$ $+ \sum_{k = 0}^{\infty} E \left[ γ^{N_{k + 1}} E_{E_{* *}^{Δ M_{k}} - E_{* k}^{Δ M_{k}}} [V_{Y_{*} R_{*}} (γ) \mid Θ_{M_{k}}] \right]$

Proof of Proposition A.13

For any $n \in N$, define $Π_{n} : H \times H \times S^{*} \times S \xrightarrow{k} A$ as follows.

$Π_{n} (T_{1}, T_{2}, h, s) := \begin{cases} Π (T_{1}, s) & \text{if } | h | < n \\ Π (T_{2}, s) & \text{if } | h | \geq n \end{cases}$

That is, $Π_{n} (T_{1}, T_{2})$ is a policy that follows $Π (T_{1})$ for the first $n$ time steps and $Π (T_{2})$ afterwards.
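
A minimal sketch of such a switching policy, assuming for illustration that policies are deterministic Python callables on states (the text allows stochastic policies) and that the history is a sequence whose length plays the role of $| h |$:

```python
def switching_policy(policy_1, policy_2, n):
    """Follow policy_1 while the history is shorter than n, and policy_2 afterwards."""
    def policy(history, state):
        if len(history) < n:
            return policy_1(state)
        return policy_2(state)
    return policy
```

For example, `switching_policy(p1, p2, 0)` coincides with `p2` on every history, which mirrors the case $l = Δ N_{k}$ used later in the proof.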

In the following, we use the shorthand notation

$V_{*} (s) := V_{Y_{*} R_{*}} (s, γ)$

$V_{k} (s) := V_{Y_{k} R_{k}} (s, γ)$

$V_{k l} (s) := V_{Y_{*} Π_{Δ N_{k} - l} (Y_{k}, Y_{*}) R_{*}} (s, γ)$

It is easy to see that

$R (γ) = \sum_{k = 0}^{\infty} E [γ^{N_{k}} (V_{*} (Θ_{N_{k}}) - V_{k 0} (Θ_{N_{k}}))]$ $+ (1 - γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = M_{k}}^{N_{k + 1} - 1} γ^{n} E_{E_{* *}^{n - M_{k}} - E_{* k}^{n - M_{k}}} [R_{*} \mid Θ_{M_{k}}] \right]$ $+ \sum_{k = 0}^{\infty} E \left[ γ^{N_{k + 1}} E_{E_{* *}^{Δ M_{k}} - E_{* k}^{Δ M_{k}}} [V_{Y_{*} R_{*}} (γ) \mid Θ_{M_{k}}] \right]$

Here, the first term represents the regret incurred during the episodes, whereas the second and third terms represent the regret incurred during the interludes (the lost reward and lost value respectively). We will denote the first term $R_{0} (γ)$ in the following.

By definition, $Y_{k}$ and $Y_{*}$ have the same distribution even when conditioned on the history up to $N_{k}$. Therefore

$E [V_{*} (Θ_{N_{k}})] = E [V_{k} (Θ_{N_{k}})]$

It follows that

$R_{0} (γ) = \sum_{k = 0}^{\infty} E [γ^{N_{k}} (V_{k} (Θ_{N_{k}}) - V_{k 0} (Θ_{N_{k}}))]$

Denote $Θ_{k l} := Θ_{N_{k} + l}$ and $Δ V_{k l} := V_{k} (Θ_{k l}) - V_{k l} (Θ_{k l})$ . We now prove by induction on $l \in N$ that, with probability 1

$l \leq Δ N_{k} ⟹ E [Δ V_{k 0} | N_{k}, M_{k}] =$ $E \left[ \sum_{n = N_{k}}^{N_{k} + l - 1} γ^{n - N_{k}} \left( (1 - γ) Δ R_{k} (Θ_{n}) + γ E_{Δ Y_{k}} [V_{k} | Θ_{n}, A_{n}] \right) + γ^{l} Δ V_{k l} \mid N_{k}, M_{k} \right]$

For $l = 0$ this is a tautology. For any $l \in N$ , the Bellman equation says that

$V_{k} (s) = (1 - γ) R_{k} (s) + γ E_{Y_{k} Π (Y_{k})} [V_{k} | s]$

$l < Δ N_{k} ⟹ V_{k l} (s) = (1 - γ) R_{*} (s) + γ E_{Y_{*} Π (Y_{k})} [V_{k, l + 1} | s]$

Denote $E_{k k} := Y_{k} Π (Y_{k})$ . We will also use the notation $E_{k} [X] := E [X | N_{k}, M_{k}]$ . Substituting $s = Θ_{k l}$ and subtracting the two identities, we get that, in the subspace of $Ω$ defined by $l < Δ N_{k}$

$E_{k} [Δ V_{k l}] = E_{k} [(1 - γ) Δ R_{k} (Θ_{k l}) + γ (E_{E_{k k}} [V_{k} | Θ_{k l}] - E_{E_{* k}} [V_{k, l + 1} | Θ_{k l}])]$

Denote $A_{k l} := A_{N_{k} + l}$. Since $Π (Y_{k})$ is exactly the policy followed by PSRL at time $N_{k} + l$, we get

$E_{k} [Δ V_{k l}] = E_{k} [(1 - γ) Δ R_{k} (Θ_{k l}) + γ (E_{Y_{k}} [V_{k} | Θ_{k l}, A_{k l}] - E_{Y_{*}} [V_{k, l + 1} | Θ_{k l}, A_{k l}])]$

We now subtract and add $γ E_{Y_{*}} [V_{k} | Θ_{k l}, A_{k l}]$, and use the fact that $Y_{*} (Θ_{k l}, A_{k l})$ is the conditional distribution of $Θ_{k, l + 1}$.

$E_{k} [Δ V_{k l}] = E_{k} [(1 - γ) Δ R_{k} (Θ_{k l}) + γ (E_{Δ Y_{k}} [V_{k} | Θ_{k l}, A_{k l}] + Δ V_{k, l + 1})]$

Applying this identity to the last term on the right hand side of the induction hypothesis, we prove the induction step. For $l = Δ N_{k}$ , we get

$E_{k} [Δ V_{k 0}] = \sum_{n = N_{k}}^{M_{k} - 1} γ^{n - N_{k}} E_{k} [(1 - γ) Δ R_{k} (Θ_{n}) + γ E_{Δ Y_{k}} [V_{k} | Θ_{n}, A_{n}]] +$ $γ^{Δ N_{k}} E_{k} [V_{k} (Θ_{M_{k}}) - V_{*} (Θ_{M_{k}})]$

Clearly

$E_{k} [V_{k} (Θ_{M_{k}})] = E_{k} \left[ E_{E_{* k}^{Δ N_{k}}} [V_{k} \mid Θ_{N_{k}}] \right]$

$E_{k} [V_{*} (Θ_{M_{k}})] = E_{k} \left[ E_{E_{* k}^{Δ N_{k}}} [V_{*} \mid Θ_{N_{k}}] \right]$

Using the definition of PSRL, we can exchange the true and sampled hypotheses and get

$E_{k} [V_{k} (Θ_{M_{k}})] = E_{k} \left[ E_{E_{k *}^{Δ N_{k}}} [V_{*} \mid Θ_{N_{k}}] \right]$

It follows that

$E_{k} [Δ V_{k 0}] = \sum_{n = N_{k}}^{M_{k} - 1} γ^{n - N_{k}} E_{k} [(1 - γ) Δ R_{k} (Θ_{n}) + γ E_{Δ Y_{k}} [V_{k} | Θ_{n}, A_{n}]] +$ $γ^{Δ N_{k}} E_{k} \left[ E_{E_{k *}^{Δ N_{k}} - E_{* k}^{Δ N_{k}}} [V_{*} \mid Θ_{N_{k}}] \right]$

Applying this to each term in the earlier expression for $R_{0} (γ)$ , we get the desired result. $■$

Proposition A.14

Consider $(Ω, P)$ a probability space, $\{M_{k}, Δ_{k} : Ω \to N\}_{k \in N}$ random variables and $T \in N^{+}$. Assume that $M_{k + 1} - M_{k} \geq Δ_{k}$ and the $Δ_{k}$ are all independent and uniformly distributed between $1$ and $2 T - 1$. Define the random variable $K_{n} : Ω \to N$ by

$K_{n} := \min \{k \in N \mid M_{k} \geq n\}$

Then,

$Pr [K_{n} > \frac{2 n}{T} + 1] \leq exp (- \frac{n}{4 T})$

Proof of Proposition A.14

Define $M_{k}^{'} := \sum_{l = 0}^{k - 1} Δ_{l}$. Obviously, $M_{k}^{'} \leq M_{k}$. For any $k \in N$ and $m \in R^{\geq 0}$

$Pr [M_{k} \leq k T - m] \leq Pr [M_{k}^{'} \leq k T - m]$

Apply Hoeffding's inequality to the RHS,

$Pr [M_{k} \leq k T - m] \leq exp (- \frac{2 m^{2}}{k (2 T - 2)^{2}}) \leq exp (- \frac{m^{2}}{2 k T^{2}})$

Taking $m := \frac{1}{2} k T$ , we get

$Pr [M_{k} \leq \frac{k T}{2}] \leq exp \left( - \frac{{(\frac{1}{2} k T)}^{2}}{2 k T^{2}} \right) = exp (- \frac{k}{8})$

Moreover, by definition of $K_{n}$ , we have, for any $k, n \in N$

$Pr [K_{n} > k] = Pr [M_{k} < n]$

It follows that

$Pr [K_{⌈ \frac{1}{2} k T ⌉} > k] = Pr [M_{k} < ⌈ \frac{k T}{2} ⌉] = Pr [M_{k} < \frac{k T}{2}] \leq exp (- \frac{k}{8})$

Take $k := ⌈ \frac{2 n}{T} ⌉$ . We get

$Pr [K_{n} > \frac{2 n}{T} + 1] \leq Pr [K_{⌈ \frac{1}{2} ⌈ \frac{2 n}{T} ⌉ T ⌉} > ⌈ \frac{2 n}{T} ⌉] \leq exp \left( - \frac{⌈ \frac{2 n}{T} ⌉}{8} \right) \leq exp (- \frac{n}{4 T}) ■$
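
The bound of Proposition A.14 can be checked exactly in the extremal case $M_{k} = \sum_{l = 0}^{k - 1} Δ_{l}$, where the tail probability equals $Pr [M_{k} < n]$ for $k = ⌊ \frac{2 n}{T} ⌋ + 1$. The following sketch (hypothetical function names) computes the distribution of the sum by repeated convolution; it illustrates the statement and is not a substitute for the Hoeffding argument.

```python
import math


def sum_distribution(T, k):
    """Exact distribution of a sum of k independent uniforms on {1, ..., 2T - 1}."""
    probs = [1.0]  # probs[s] = Pr[sum == s]; the empty sum is 0 with probability 1
    p = 1.0 / (2 * T - 1)
    for _ in range(k):
        new = [0.0] * (len(probs) + 2 * T - 1)
        for s, q in enumerate(probs):
            if q == 0.0:
                continue
            for d in range(1, 2 * T):
                new[s + d] += q * p
        probs = new
    return probs


def tail_probability(T, n):
    """Pr[K_n > 2n/T + 1] when M_k is exactly the sum of the first k durations."""
    k = (2 * n) // T + 1
    probs = sum_distribution(T, k)
    return sum(probs[:n])  # Pr[M_k < n]
```

Comparing `tail_probability(T, n)` with `math.exp(-n / (4 * T))` for small `T` and `n` shows that the bound holds with considerable slack in this case.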

Proposition A.15

Consider $Ψ$ a measurable space, $(Ω, P)$ a probability space, $\{H_{n} \subseteq P (Ω)\}_{n \in N}$ a filtration, $\{X_{n} : Ω \to Ψ\}_{n \in N}$ a stochastic process adapted to $H$, $\{M_{k} : Ω \to N\}_{k \in N}$ stopping times and $\{f_{n} : Ψ^{n + 1} \to [0, 1]\}_{n \in N}$ measurable functions. Consider also $γ \in (0, 1)$ and $C \in R^{+}$ and assume that with probability 1

$\sum_{k = 0}^{\infty} γ^{k} f_{k} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{k}}) \leq C$

In addition, consider some $T \in N^{+}$ , $T \geq 2$ and assume that for all $k \in N$

$Pr [M_{k} = n | M_{k - 1}] = \begin{cases} {(2 T - 1)}^{- 1} & \text{if } M_{k - 1} < n < M_{k - 1} + 2 T \\ 0 & \text{otherwise} \end{cases}$

Here, $M_{- 1}$ is understood to identically equal $- 1$ .

Finally, assume that, conditional on $M_{k} \geq n$ , $M_{k}$ and $X_{n}$ are independent. Define the $H_{n}$ -measurable random variable $K_{n} : Ω \to N$ by

$K_{n} := \min \{k \in N \mid M_{k} \geq n\}$

Then,

$\sum_{n = 0}^{\infty} γ^{\frac{2 n}{T} + 1} E [f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq 2 C T + \frac{1}{1 - exp (- \frac{1}{4 T})}$

Proof of Proposition A.15

We have, with probability 1

$\sum_{n = 0}^{\infty} 1_{M_{K_{n}} = n} γ^{K_{n}} f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n}) \leq C$

Taking expected value

$\sum_{n = 0}^{\infty} E [1_{M_{K_{n}} = n} γ^{K_{n}} f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq C$

Denote by $K_{n}^{♯}$ the joint random variables $K_{n}$ and $M_{0}, M_{1} \dots M_{K_{n} - 1}$ . Our assumptions imply that, conditional on $K_{n}^{♯}$ , the factor $1_{M_{K_{n}} = n}$ is independent of the rest of the expression inside the expected value. It follows that

$\sum_{n = 0}^{\infty} E \left[ Pr [M_{K_{n}} = n \mid K_{n}^{♯}] γ^{K_{n}} E [f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n}) \mid K_{n}^{♯}] \right] \leq C$

Clearly

$Pr [M_{K_{n}} = n \mid K_{n}^{♯}] = \frac{1}{2 T - n + M_{K_{n} - 1}} \geq \frac{1}{2 T}$

We get

$\sum_{n = 0}^{\infty} E [\frac{1}{2 T} γ^{K_{n}} f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq C$

$\sum_{n = 0}^{\infty} E [γ^{K_{n}} f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq 2 C T$

Using Proposition A.14, we get

$\sum_{n = 0}^{\infty} \left( E [γ^{\frac{2 n}{T} + 1} f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] - exp (- \frac{n}{4 T}) \right) \leq 2 C T$

$\sum_{n = 0}^{\infty} γ^{\frac{2 n}{T} + 1} E [f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq 2 C T + \sum_{n = 0}^{\infty} exp (- \frac{n}{4 T})$

$\sum_{n = 0}^{\infty} γ^{\frac{2 n}{T} + 1} E [f_{K_{n}} (X_{M_{0}}, X_{M_{1}} \dots X_{M_{K_{n} - 1}}, X_{n})] \leq 2 C T + \frac{1}{1 - exp (- \frac{1}{4 T})} ■$

Proposition A.16

Consider $(Ω, P)$ a probability space, $T \in N^{+}$ and $\{N_{k}, Δ_{k} : Ω \to N\}_{k \in N}$ random variables s.t. $N_{k + 1} - N_{k} \geq Δ_{k}$ and the $Δ_{k}$ are all independent and uniformly distributed between $1$ and $2 T - 1$. Consider also $γ \in (0, 1)$. Then, for any $k \in N$

$E [γ^{N_{k}}] \leq {(1 - T γ^{2 T - 2} (1 - γ))}^{k}$

In particular,

$\sum_{k = 0}^{\infty} E [γ^{N_{k}}] \leq \frac{1}{T γ^{2 T - 2} (1 - γ)}$

Proof of Proposition A.16

For any $n \in N$ , we have

$\frac{1 - γ^{n}}{1 - γ} = \sum_{m = 0}^{n - 1} γ^{m}$

$1 - γ^{n} = (1 - γ) \sum_{m = 0}^{n - 1} γ^{m} \geq n (1 - γ) γ^{n - 1}$

$γ^{n} \leq 1 - n (1 - γ) γ^{n - 1}$

Therefore, since $Δ_{0}$ is uniformly distributed between $1$ and $2 T - 1$ ,

$E [γ^{Δ_{0}}] = \frac{1}{2 T - 1} \sum_{n = 1}^{2 T - 1} γ^{n} \leq \frac{1}{2 T - 1} \sum_{n = 1}^{2 T - 1} (1 - n (1 - γ) γ^{n - 1})$

$E [γ^{Δ_{0}}] \leq \frac{1}{2 T - 1} \sum_{n = 1}^{2 T - 1} (1 - n (1 - γ) γ^{2 T - 2}) = 1 - T (1 - γ) γ^{2 T - 2}$

Also, for any $k \in N$

$N_{k} \geq \sum_{l = 0}^{k - 1} Δ_{l}$

$γ^{N_{k}} \leq γ^{\sum_{l = 0}^{k - 1} Δ_{l}} = \prod_{l = 0}^{k - 1} γ^{Δ_{l}}$

$E [γ^{N_{k}}] \leq E \left[ \prod_{l = 0}^{k - 1} γ^{Δ_{l}} \right] = \prod_{l = 0}^{k - 1} E [γ^{Δ_{l}}] = {E [γ^{Δ_{0}}]}^{k}$

Here, we used that the $Δ_{l}$ are independent and identically distributed. Substituting the bound on $E [γ^{Δ_{0}}]$, we get the desired result. $■$
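
The key single-step inequality $E [γ^{Δ_{0}}] \leq 1 - T (1 - γ) γ^{2 T - 2}$ can be verified directly, since the left hand side is a finite average. The helper names below are hypothetical, and the check is a sanity test rather than a proof.

```python
def mean_discount(gamma, T):
    """Exact E[gamma ** Delta] for Delta uniform on {1, ..., 2T - 1}."""
    return sum(gamma ** n for n in range(1, 2 * T)) / (2 * T - 1)


def alpha_bound(gamma, T):
    """The upper bound 1 - T * gamma**(2T - 2) * (1 - gamma) from Proposition A.16."""
    return 1 - T * gamma ** (2 * T - 2) * (1 - gamma)
```

Since $E [γ^{N_{k}}]$ is bounded by the $k$-th power of `mean_discount(gamma, T)`, the geometric series of these bounds sums to at most `1 / (1 - alpha_bound(gamma, T))`, which is the second claim of the proposition.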

Proposition A.17

In the setting of Proposition A.16, consider some $δ \in R^{+}$ . Denote

$α := 1 - T γ^{2 T - 2} (1 - γ)$

Assume $α > 0$ . Then,

$\sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], δ) \leq \left( ⌈ \frac{ln δ}{ln α} ⌉ + \frac{1}{1 - α} \right) δ$

Proof of Proposition A.17

By Proposition A.16

$\sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], δ) \leq \sum_{k = 0}^{\infty} \min (α^{k}, δ)$

$\sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], δ) \leq \sum_{k = 0}^{⌈ \frac{ln δ}{ln α} ⌉ - 1} \min (α^{k}, δ) + \sum_{k = ⌈ \frac{ln δ}{ln α} ⌉}^{\infty} \min (α^{k}, δ)$

In the second term, we have

$α^{k} \leq α^{⌈ \frac{ln δ}{ln α} ⌉} \leq α^{\frac{ln δ}{ln α}} = e^{ln α \frac{ln δ}{ln α}} = e^{ln δ} = δ$

We get

$\sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], δ) \leq \sum_{k = 0}^{⌈ \frac{ln δ}{ln α} ⌉ - 1} δ + \sum_{k = ⌈ \frac{ln δ}{ln α} ⌉}^{\infty} α^{k} = ⌈ \frac{ln δ}{ln α} ⌉ δ + \frac{α^{⌈ \frac{ln δ}{ln α} ⌉}}{1 - α}$

We've already seen that $α^{⌈ \frac{ln δ}{ln α} ⌉} \leq δ$ , and therefore

$\sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], δ) \leq ⌈ \frac{ln δ}{ln α} ⌉ δ + \frac{δ}{1 - α} = \left( ⌈ \frac{ln δ}{ln α} ⌉ + \frac{1}{1 - α} \right) δ ■$
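
A quick numerical check of the capped geometric sum, as a sketch with hypothetical helper names. It assumes $α \in (0, 1)$ and restricts to $δ \in (0, 1)$, which is the regime used later ($δ = (1 - γ)^{2}$) and the one in which the split of the sum at $⌈ \frac{ln δ}{ln α} ⌉$ is meaningful.

```python
import math


def capped_geometric_sum(alpha, delta, terms=10000):
    """Truncated sum over k of min(alpha**k, delta); a lower estimate of the full sum."""
    return sum(min(alpha ** k, delta) for k in range(terms))


def a17_bound(alpha, delta):
    """Right hand side of Proposition A.17 for 0 < alpha < 1 and 0 < delta < 1."""
    split = math.ceil(math.log(delta) / math.log(alpha))
    return (split + 1 / (1 - alpha)) * delta
```

For example, with `alpha = 0.9` and `delta = 0.1` the capped sum is about $3.1$, against a bound of $3.2$.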

We will use the notations of Definitions B.1 and B.2 (see Appendix).

The following is a simple generalization of "Lemma 1" in Osband and Van Roy 2014 and the proof is essentially the same.

Proposition A.18

Consider a set $X$, a finite-dimensional inner product space $Y$ and some $F \subseteq \{X \to Y\}$. Consider also some $x \in X^{ω}$, $y \in Y^{ω}$, $T \in N$, $θ \in R^{+}$, an increasing sequence $\{N_{k} \in N\}_{k \in N}$ and a nondecreasing sequence $\{β_{k} \in R^{+}\}_{k \in N}$. Suppose that for any $k \in N$, $N_{k + 1} \leq N_{k} + T$. For any $k \in N$, define $F^{k}$ by

$F^{k} := {C S}^{F} [{x y}_{: N_{k}}, β_{k}^{- 1}]$

Denote $D := {dim}_{R V O} F$ . Then, for any $K \in N$

$\left| \{(k \in [K], n \in N) \mid N_{k} \leq n < N_{k + 1}, W^{F^{k}} (x_{n}) > θ\} \right| \leq (D + 1) (\frac{4 β_{K - 1}}{θ^{2}} + T)$

Here, $β_{- 1}$ is understood to mean $0$ .

Proof of Proposition A.18

Let $A \subset N \times N$ be the set

$A := \{(k \in [K], n \in N) \mid N_{k} \leq n < N_{k + 1}, W^{F^{k}} (x_{n}) > θ\}$

Let $A^{'} \subset N$ be the set

$A^{'} := \{n \in N \mid \exists k \in [K] : (k, n) \in A\}$

For each $i \in [| A |]$, we define $(k_{i}, n_{i}) \in A$ recursively by

$n_{0} := \min A^{'}$ $n_{i + 1} := \min \{n \in A^{'} \mid n > n_{i}\}$

Denote $D := {dim}_{R V O} F$ and let $L := ⌊ \frac{| A |}{D + 1} ⌋$. Given any $n \in N^{*}$, the notation $x_{n}$ will refer to the element of $X^{*}$ s.t. $| x_{n} | = | n |$ and for any $i \in [| n |]$, ${(x_{n})}_{i} := x_{n_{i}}$. We define $j \in [| A |]$ and $\{m_{l i} \in N^{*}\}_{l \in [L], i \in [j + 1]}$ by recursion over $i$.

For $i = 0$, we set $m_{l 0} := λ$. For any $i$, we consider whether there is $l \in [L]$ s.t. $x_{n_{i}}$ is $(F, θ)$-independent of $x_{m_{l i}}$. If there is such $l$, we set $m_{l, i + 1} := m_{l i} n_{i}$ and for any $l^{'} \in [L] ∖ l$, $m_{l^{'}, i + 1} := m_{l^{'} i}$. If there is no such $l$, we set $j := i$. The latter situation must occur for some $i \in [| A |]$, since otherwise we would get

$\sum_{l \in [L]} \left| m_{l | A |} \right| = | A |$

That would imply that there is $l \in [L]$ s.t.

$\left| m_{l | A |} \right| \geq \frac{| A |}{L} = \frac{| A |}{⌊ \frac{| A |}{D + 1} ⌋} \geq \frac{| A |}{(\frac{| A |}{D + 1})} = D + 1$

This is impossible since, by construction, each element of the sequence $x_{m_{l | A |}}$ is $(F, θ)$ -independent of the preceding subsequence, whereas, by definition of $D$ , it is the maximal length of such a sequence.

Since $(k_{j}, n_{j}) \in A$, $W^{F^{k_{j}}} (x_{n_{j}}) > θ$. Therefore, there are $f, \tilde{f} \in F^{k_{j}}$ s.t.

$\| f (x_{n_{j}}) - \tilde{f} (x_{n_{j}}) \| > θ$

By construction of $j$ and $m$, for each $l \in [L]$, $x_{n_{j}}$ is $(F, θ)$-dependent on $x_{m_{l j}}$. Therefore,

$\sum_{i = 0}^{| m_{l j} | - 1} {\| f (x_{m_{l j i}}) - \tilde{f} (x_{m_{l j i}}) \|}^{2} > θ^{2}$

Define $J \subseteq [L]$ by

$J := \{l \in [L] \mid \forall i \in [| m_{l j} |] : m_{l j i} < N_{k_{j}}\}$

$\sum_{l \in J} \sum_{i = 0}^{| m_{l j} | - 1} {\| f (x_{m_{l j i}}) - \tilde{f} (x_{m_{l j i}}) \|}^{2} \geq | J | θ^{2}$

By construction, the sequences $m_{l j}$ for all values of $l$ together are a collection of distinct elements of $[n_{j}]$ . Therefore, $| J | \geq L - (n_{j} - N_{k_{j}}) \geq L - T + 1$ . It follows that

$\sum_{n = 0}^{N_{k_{j}} - 1} {\| f (x_{n}) - \tilde{f} (x_{n}) \|}^{2} \geq (L - T + 1) θ^{2}$

$\sqrt{\sum_{n = 0}^{N_{k_{j}} - 1} {\| f (x_{n}) - \tilde{f} (x_{n}) \|}^{2}} \geq θ \sqrt{\max (L - T + 1, 0)}$

Denote $f^{*} := {L S}^{F} [{x y}_{: N_{k_{j}}}]$.

$\sqrt{\sum_{n = 0}^{N_{k_{j}} - 1} {\| f (x_{n}) - f^{*} (x_{n}) \|}^{2}} + \sqrt{\sum_{n = 0}^{N_{k_{j}} - 1} {\| \tilde{f} (x_{n}) - f^{*} (x_{n}) \|}^{2}} \geq θ \sqrt{\max (L - T + 1, 0)}$

Since $f, \tilde{f} \in F^{k_{j}}$, by the definition of $F^{k_{j}}$ each of the two terms on the LHS is at most $\sqrt{β_{k_{j}}} \leq \sqrt{β_{K - 1}}$ and we get

$2 \sqrt{β_{K - 1}} \geq θ \sqrt{\max (L - T + 1, 0)}$

$4 β_{K - 1} \geq θ^{2} (L + 1 - T)$

$4 β_{K - 1} \geq θ^{2} (\frac{| A |}{D + 1} - T)$

$| A | \leq (D + 1) (\frac{4 β_{K - 1}}{θ^{2}} + T) ■$

Proposition A.19

Consider $α \in N$ and $f : R^{+} \to R^{+}$ non-decreasing. Define $g : R \to R$ by $g (x) := ln f (e^{x})$ , and assume $g$ is Lipschitz continuous with constant $α$ . In other words, for any $t \in R^{+}$ and $λ \in [1, \infty)$ , we have $f (λ t) \leq λ^{α} f (t)$ . Let $L$ denote the Laplace transform operator. Then,

$L [f] (s) \leq \frac{1 - e^{- 1} + α!}{s} f (\frac{1}{s})$

Proof of Proposition A.19

$L [f] (s) = \int_{0}^{\infty} e^{- s t} f (t) d t = \int_{0}^{\frac{1}{s}} e^{- s t} f (t) d t + \int_{\frac{1}{s}}^{\infty} e^{- s t} f (t) d t$

For the first term, we will use that $f$ is non-decreasing, and for the second term, we will use that $g$ is Lipschitz continuous with constant $α$ .

$L [f] (s) \leq \int_{0}^{\frac{1}{s}} e^{- s t} f (\frac{1}{s}) d t + \int_{\frac{1}{s}}^{\infty} e^{- s t} f (\frac{1}{s}) \cdot {(t s)}^{α} d t$

$L [f] (s) \leq (\int_{0}^{\frac{1}{s}} e^{- s t} d t + s^{α} \int_{\frac{1}{s}}^{\infty} e^{- s t} t^{α} d t) f (\frac{1}{s})$

$L [f] (s) \leq (\int_{0}^{\frac{1}{s}} e^{- s t} d t + s^{α} \int_{0}^{\infty} e^{- s t} t^{α} d t) f (\frac{1}{s})$

$L [f] (s) \leq (\frac{1 - e^{- 1}}{s} + s^{α} \cdot α! s^{- α - 1}) f (\frac{1}{s}) = \frac{1 - e^{- 1} + α!}{s} f (\frac{1}{s}) ■$
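
As a worked check of Proposition A.19, take $f (t) = 1 + t^{2}$, which is non-decreasing and satisfies $f (λ t) \leq λ^{2} f (t)$ for $λ \geq 1$, so $α = 2$ applies. Its Laplace transform has the closed form $\frac{1}{s} + \frac{2}{s^{3}}$. The example function and the helper names are assumptions chosen for illustration only.

```python
import math


def laplace_of_one_plus_t_squared(s):
    """Closed-form Laplace transform of f(t) = 1 + t**2 at s > 0."""
    return 1 / s + 2 / s ** 3


def a19_bound(f, alpha, s):
    """Right hand side of Proposition A.19: (1 - e**-1 + alpha!) / s * f(1 / s)."""
    return (1 - math.exp(-1) + math.factorial(alpha)) / s * f(1 / s)
```

With `f = lambda t: 1 + t * t` and `alpha = 2`, the bound exceeds the exact transform for every `s > 0`, since $3 - e^{- 1} > 2$.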

Proposition A.20

There is some $C_{A.20} \in R^{+}$ s.t. the following holds.

Consider a set $X$, an inner product space $Y$ and some $F \subseteq \{X \to Y\}$. Consider also some $x \in X^{ω}$, $y \in Y^{ω}$, $T \in N^{+}$, $γ \in (0, 1)$, $θ \in R^{+}$, $η_{0}, η_{1} \in R^{+}$, $δ \in (0, 1)$ and an increasing sequence $\{N_{k} \in N\}_{k \in N}$. Suppose that $N_{0} = 0$, for any $k \in N$, $N_{k + 1} \leq N_{k} + T$, and $γ^{T} > \frac{1}{2}$. Denote

$β (t) := η_{0} + η_{1} t ln \frac{e t}{δ}$

For any $k \in N$ , define $F^{k}$ by

$F^{k} := {C S}^{F} [{x y}_{: N_{k}}, β (N_{k} + 1)^{- 1}]$

Denote $D := {dim}_{R V O} F$. Then,

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} 1_{W^{F^{k}} (x_{n}) > θ} \leq C_{A.20} D (\frac{1}{θ^{2}} β (\frac{1}{1 - γ}) + T)$

Proof of Proposition A.20

By Proposition A.18, for any $K \in N$

$\sum_{k = 0}^{K - 1} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq (D + 1) (\frac{4 β (N_{K - 1} + 1)}{θ^{2}} + T)$

Here, $N_{- 1}$ is understood to mean $0$. Observing that $N_{k} \leq k T$, we get

$\sum_{k = 0}^{K - 1} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq (D + 1) (\frac{4 β (K T + 1)}{θ^{2}} + T)$

Multiplying the inequality by $γ^{K T}$ and summing over $K$, we get

$\sum_{k = 0}^{\infty} \left( \sum_{K = k + 1}^{\infty} γ^{K T} \right) \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq (D + 1) \sum_{K = 0}^{\infty} γ^{K T} (\frac{4 β (K T + 1)}{θ^{2}} + T)$

On the left hand side, we sum the geometric series. On the right hand side, we use the observation that

$γ^{K T} (\frac{4 β (K T + 1)}{θ^{2}} + T) \leq \frac{1}{T} \int_{0}^{T} γ^{K T + t - T} (\frac{4 β (K T + t + 1)}{θ^{2}} + T) d t$

$γ^{K T} (\frac{4 β (K T + 1)}{θ^{2}} + T) \leq \frac{1}{T γ^{T}} \int_{0}^{T} γ^{K T + t} (\frac{4 β (K T + t + 1)}{θ^{2}} + T) d t$

Here, we used that $β (t)$ is an increasing function for $t \geq 1$ . We get

$\sum_{k = 0}^{\infty} \frac{γ^{(k + 1) T}}{1 - γ^{T}} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq \frac{D + 1}{T γ^{T}} \int_{0}^{\infty} γ^{t} (\frac{4 β (t + 1)}{θ^{2}} + T) d t$

$\frac{γ^{T}}{1 - γ^{T}} \sum_{k = 0}^{\infty} γ^{k T} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq \frac{D + 1}{T γ^{T}} L_{t} [\frac{4 β (t + 1)}{θ^{2}} + T] (ln \frac{1}{γ})$

It is easy to see that Proposition A.19 is applicable to the RHS for $α = 2$. We get,

$\frac{γ^{T}}{1 - γ^{T}} \sum_{k = 0}^{\infty} γ^{k T} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq \frac{D + 1}{T γ^{T}} \cdot \frac{3 - e^{- 1}}{ln \frac{1}{γ}} \left( \frac{4 β (\frac{1}{ln \frac{1}{γ}} + 1)}{θ^{2}} + T \right)$

Using the condition $γ^{T} > \frac{1}{2}$ and absorbing $O (1)$ factors into the definition of $C_{A.20}$, we get

$\sum_{k = 0}^{\infty} γ^{k T} \sum_{n = N_{k}}^{N_{k + 1} - 1} 1_{W^{F^{k}} (x_{n}) > θ} \leq C_{A.20} D (\frac{1}{θ^{2}} β (\frac{1}{1 - γ}) + T) ■$

Proposition A.21

There is some $C_{A.21} \in R^{+}$ s.t. the following holds. In the setting of Proposition A.20, assume that for any $x \in X$ and $f \in F$, $\| f (x) \| \leq 1$. Then,

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} W^{F^{k}} (x_{n}) \leq C_{A.21} (D T + \sqrt{D β (\frac{1}{1 - γ}) \frac{1}{1 - γ}})$

Proof of Proposition A.21

Due to the assumption $∥ f (x) ∥ \leq 1$ , we have $W^{F^{k}} (x) \leq 2$ . For any $t \in R^{+}$ , we have

$W^{F^{k}} (x_{n}) = \int_{0}^{2} 1_{W^{F^{k}} (x_{n}) > θ} d θ \leq t + \int_{t}^{2} 1_{W^{F^{k}} (x_{n}) > θ} d θ$

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} W^{F^{k}} (x_{n}) \leq \frac{t T}{1 - γ^{T}} + \int_{t}^{2} \sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} 1_{W^{F^{k}} (x_{n}) > θ} d θ$

Applying Proposition A.20 to the integrand on the RHS

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} W^{F^{k}} (x_{n}) \leq \frac{t T}{1 - γ^{T}} + C_{A.20} D \int_{t}^{2} (\frac{1}{θ^{2}} β (\frac{1}{1 - γ}) + T) d θ$

Evaluating the integral and dropping some negative terms on the right hand side, we get

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} W^{F^{k}} (x_{n}) \leq \frac{t T}{1 - γ^{T}} + C_{A.20} D (\frac{1}{t} β (\frac{1}{1 - γ}) + 2 T)$

We now set $t$ to be

$t := \sqrt{D β (\frac{1}{1 - γ}) \cdot (1 - γ)}$

For an appropriate choice of $C_{A.21}$ , and using the assumption $γ^{T} > \frac{1}{2}$ , it follows that

$\sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{N_{k + 1} - 1} γ^{k T} W^{F^{k}} (x_{n}) \leq C_{A.21} (D T + \sqrt{D β (\frac{1}{1 - γ}) \frac{1}{1 - γ}}) ■$
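
The choice of $t$ in this proof balances a term growing linearly in $t$ against a term decaying like $\frac{1}{t}$. A simplified sketch, under the assumption that the bound has the shape $a t + \frac{b}{t}$ with $a$ comparable to $\frac{1}{1 - γ}$ (using $γ^{T} > \frac{1}{2}$) and $b = D β (\frac{1}{1 - γ})$, with constant factors such as $C_{A.20}$ dropped; the helper names are hypothetical:

```python
import math


def tradeoff(t, a, b):
    """Simplified shape of the bound as a function of the threshold t: a * t + b / t."""
    return a * t + b / t


def balancing_t(a, b):
    """The threshold equalizing the two terms; it also minimizes a * t + b / t over t > 0."""
    return math.sqrt(b / a)
```

With $a = \frac{1}{1 - γ}$ this gives $t = \sqrt{b (1 - γ)}$ and a value of $2 \sqrt{\frac{b}{1 - γ}}$, which has the same form as the square-root term in the statement.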

Proof of Theorem 2

We take $π_{γ}^{†} := π_{γ, T}^{PS}$ where $T \in N^{+}$ will be specified later. By Proposition A.13

$R (γ) \leq \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} \left( (1 - γ) | Δ R_{k} (Θ_{n}) | + γ \left| E_{Δ Y_{k}} [V_{Y_{k} R_{k}} (γ) \mid Θ_{n}, A_{n}] \right| \right) \right]$ $+ \sum_{k = 0}^{\infty} E \left[ γ^{M_{k}} \left| E_{E_{k *}^{Δ N_{k}} - E_{* k}^{Δ N_{k}}} [V_{Y_{*} R_{*}} (γ) \mid Θ_{N_{k}}] \right| \right]$ $+ (1 - γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = M_{k}}^{N_{k + 1} - 1} γ^{n} \left| E_{E_{* *}^{n - M_{k}} - E_{* k}^{n - M_{k}}} [R_{*} \mid Θ_{M_{k}}] \right| \right]$ $+ \sum_{k = 0}^{\infty} E \left[ γ^{N_{k + 1}} \left| E_{E_{* *}^{Δ M_{k}} - E_{* k}^{Δ M_{k}}} [V_{Y_{*} R_{*}} (γ) \mid Θ_{M_{k}}] \right| \right]$

We will use the notation

$Δ V (γ) := \max_{T \in H} \left( \max_{s \in S} V_{T R_{T}} (s, γ) - \min_{s \in S} V_{T R_{T}} (s, γ) \right)$

We will also use the notation $d_{tv} (μ - ν) := d_{tv} (μ, ν)$ .

It follows that

$R (γ) \leq \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} \left( (1 - γ) | Δ R_{k} (Θ_{n}) | + γ Δ V (γ) d_{tv} (Δ Y_{k} (Θ_{n}, A_{n})) \right) \right]$ $+ Δ V (γ) \sum_{k = 0}^{\infty} E [γ^{M_{k}}] + (1 - γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = M_{k}}^{N_{k + 1} - 1} γ^{n} \right] + Δ V (γ) \sum_{k = 0}^{\infty} E [γ^{N_{k + 1}}]$

Denote by $R_{0} (γ)$ the first term on the RHS and by $R_{1} (γ)$ the sum of the other terms. We have

$R (γ) \leq R_{0} (γ) + R_{1} (γ)$

First, we analyze $R_{1}$ . In the second term, we have $γ^{n} \leq γ^{M_{k}}$ , leading to

$E \left[ \sum_{n = M_{k}}^{N_{k + 1} - 1} γ^{n} \right] \leq E [Δ M_{k} γ^{M_{k}}] = E [E [Δ M_{k} | M_{k}] γ^{M_{k}}] \leq t_{e s t} E [γ^{M_{k}}]$

We get

$R_{1} (γ) \leq Δ V (γ) \sum_{k = 0}^{\infty} E [γ^{M_{k}}] + t_{e s t} (1 - γ) \sum_{k = 0}^{\infty} E [γ^{M_{k}}] + Δ V (γ) \sum_{k = 0}^{\infty} E [γ^{N_{k + 1}}]$

Applying Proposition A.16 to each term, we get

$R_{1} (γ) \leq \frac{2 Δ V (γ) + t_{e s t} (1 - γ)}{γ^{2 T - 2} (1 - γ) T}$

Now, we analyze $R_{0}$ . Define $Θ_{M}^{r} \in {(S \times R)}^{ω}$ by

$Θ_{M k}^{r} := (Θ_{M_{k}}, r (Θ A_{M_{k} : N_{k + 1}} Θ_{N_{k + 1}}))$

That is, $Θ_{M}^{r}$ is the history of reward estimation experiments. For any $i \in N$ , let $L_{i}$ be the stopping time defined recursively by

$L_{0} := 0$ $L_{i + 1} := \min \{l \in N \mid l > L_{i}, \exists k \in N : N_{k} \leq l < M_{k}\}$

That is, $L$ are time indices that traverse the "interior" of the episodes only. Define $Θ A_{L} \in {(S \times A)}^{ω}$ by

$Θ A_{L i} := Θ_{L_{i}} A_{L_{i}}$

We apply Proposition B.1 (see Appendix) with $δ := \frac{1}{2} (1 - γ)^{2}$ and $ϵ := (1 - γ)^{2}$ , to each of the two terms in $R_{0} (γ)$ :

$Pr \left[ R_{*} \in \bigcap_{k = 0}^{\infty} {C S}^{H_{R}} [Θ_{M : k}^{r}, β_{R} (k + 1)^{- 1}] \right] \geq 1 - \frac{1}{2} (1 - γ)^{2}$ $Pr \left[ Y_{*} \in \bigcap_{i = 0}^{\infty} {C S}^{H} [Θ A_{L : i} Θ_{L_{i}}, β_{T} (i + 1)^{- 1}] \right] \geq 1 - \frac{1}{2} (1 - γ)^{2}$

Here, $β_{R}$ and $β_{T}$ correspond to $β$ in Proposition B.1.

We also define $N_{k}^{'}$ by the condition $L_{N_{k}^{'}} = N_{k}$. Since the hypothesis $Y_{k}$ is sampled from the posterior, for any $k \in N$ we also have

$Pr [R_{k} \in {C S}^{H_{R}} [Θ_{M : k}^{r}, β_{R} (k + 1)^{- 1}]] \geq 1 - \frac{1}{2} (1 - γ)^{2}$ $Pr [Y_{k} \in {C S}^{H} [Θ A_{L : N_{k}^{'}} Θ_{N_{k}}, β_{T} {(N_{k}^{'} + 1)}^{- 1}]] \geq 1 - \frac{1}{2} (1 - γ)^{2}$

$Pr [R_{*}, R_{k} \in {C S}^{H_{R}} [Θ_{M : k}^{r}, β_{R} (k + 1)^{- 1}]] \geq 1 - (1 - γ)^{2}$ $Pr [Y_{*}, Y_{k} \in {C S}^{H} [Θ A_{L : N_{k}^{'}} Θ_{N_{k}}, β_{T} {(N_{k}^{'} + 1)}^{- 1}]] \geq 1 - (1 - γ)^{2}$

Denote

$G_{R}^{k} := \{R_{*}, R_{k} \in {C S}^{H_{R}} [Θ_{M : k}^{r}, β_{R} (k + 1)^{- 1}]\} \subseteq Ω$ $G_{T}^{k} := \{Y_{*}, Y_{k} \in {C S}^{H} [Θ A_{L : N_{k}^{'}} Θ_{N_{k}}, β_{T} {(N_{k}^{'} + 1)}^{- 1}]\} \subseteq Ω$

Define $R_{R} (γ)$ and $R_{T} (γ)$ by

$R_{R} (γ) := (1 - γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | \right]$ $R_{T} (γ) := Δ V (γ) \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n + 1} d_{tv} (Δ Y_{k} (Θ_{n}, A_{n})) \right]$

We have

$R_{0} (γ) = R_{R} (γ) + R_{T} (γ)$

We split $R_{R}$ into the $G_{R}^{k}$ and $G_{R}^{k ∁}$ contributions

$R_{R} (γ) = (1 - γ) \sum_{k = 0}^{\infty} \left( E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k} \right] + E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \right)$

For the second term, we have

$\sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | \leq 2 T γ^{N_{k}}$

$E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \leq 2 T E [γ^{N_{k}} ; G_{R}^{k ∁}] \leq 2 T \min (E [γ^{N_{k}}], Pr [G_{R}^{k ∁}])$

$E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \leq 2 T \min (E [γ^{N_{k}}], (1 - γ)^{2})$

$\sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \leq 2 T \sum_{k = 0}^{\infty} \min (E [γ^{N_{k}}], (1 - γ)^{2})$

Applying Proposition A.17 to the RHS, we get

$\sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \leq 2 T \left( ⌈ \frac{2 ln (1 - γ)}{ln α} ⌉ + \frac{1}{T γ^{2 T - 2} (1 - γ)} \right) (1 - γ)^{2}$

Here, $α$ is as defined in Proposition A.17. Since we are interested in the asymptotics $γ \to 1$ , and our ultimate choice of $T$ will ensure that $γ^{T} \to 1$ , we will henceforth make the assumption that $γ^{2 T} > \frac{1}{2}$ .

$\sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k ∁} \right] \leq 2 T \left\lceil \frac{2 ln (1 - γ)}{ln (1 - \frac{1}{2} T (1 - γ))} \right\rceil (1 - γ)^{2} + 4 (1 - γ)$

Denote $ρ (γ, T)$ the expression on the RHS. We get

$R_{R} (γ) \leq (1 - γ) \left( \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} | Δ R_{k} (Θ_{n}) | ; G_{R}^{k} \right] + ρ (γ, T) \right)$

Denote

$H_{R}^{k} := {C S}^{H_{R}} [Θ_{M : k}^{r}, β_{R} (k + 1)^{- 1}]$

Clearly

$Pr [| Δ R_{k} (Θ_{n}) | \leq W^{H_{R}^{k}} (Θ_{n}) \mid G_{R}^{k}] = 1$

Using this inequality, dropping the $; G_{R}^{k}$ (since the restriction can only make the right hand side smaller) and moving the sum inside the expected value, we get

$R_{R} (γ) \leq (1 - γ) \left( E \left[ \sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} W^{H_{R}^{k}} (Θ_{n}) \right] + ρ (γ, T) \right)$

Extending the sum on the RHS to $M_{k}$ (which can only increase it), and using the fact that the width is $\leq 1$,

$R_{R} (γ) \leq (1 - γ) \left( E \left[ \sum_{k = 0}^{\infty} \left( γ^{N_{k}} + \sum_{n = N_{k} + 1}^{M_{k}} γ^{n} W^{H_{R}^{k}} (Θ_{n}) \right) \right] + ρ (γ, T) \right)$

By Proposition A.16,

$R_{R} (γ) \leq (1 - γ) \left( E \left[ \sum_{k = 0}^{\infty} \sum_{n = N_{k} + 1}^{M_{k}} γ^{n} W^{H_{R}^{k}} (Θ_{n}) \right] + ρ (γ, T) \right) + \frac{2}{T}$

We define the random variable $K_{i} : Ω \to N$ by

$K_{i} := \min \{k \in N \mid M_{k} \geq L_{i}\}$

We can now rewrite the previous inequality as

$R_{R} (γ) \leq (1 - γ) \left( E \left[ \sum_{i = 0}^{\infty} γ^{L_{i} + 1} W^{H_{R}^{K_{i}}} (Θ_{L_{i} + 1}) \right] + ρ (γ, T) \right) + \frac{2}{T}$

Obviously $L_{i} + 1 \geq i$ and hence

$R_{R} (γ) \leq (1 - γ) \left( \sum_{i = 0}^{\infty} γ^{i} E [W^{H_{R}^{K_{i}}} (Θ_{L_{i} + 1})] + ρ (γ, T) \right) + \frac{2}{T}$

On the other hand, by Proposition A.21 (for the $T = 1$ , $N_{k} = k$ case, applied to the subsequence $Θ_{M_{k}}$ , with $γ^{\frac{1}{2} T}$ playing the role of $γ$ )

$\sum_{k = 0}^{\infty} γ^{\frac{1}{2} T k} W^{H_{R}^{k}} (Θ_{M_{k}}) \leq C_{A.21} \left( D_{R V O} + \sqrt{D_{R V O} β_{R} \left( \frac{1}{1 - γ^{\frac{1}{2} T}} \right) \frac{1}{1 - γ^{\frac{1}{2} T}}} \right)$

Denote $\tilde{β}_{R} (γ, T, D_{R V O})$ the expression on the RHS. Applying Proposition A.15, we conclude

$\sum_{i = 0}^{\infty} γ^{i + \frac{1}{2} T} E [W^{H_{R}^{K_{i}}} (Θ_{L_{i} + 1})] \leq 2 T \tilde{β}_{R} (γ, T, D_{R V O}) + \frac{1}{1 - exp (- \frac{1}{4 T})}$

It follows

$R_{R} (γ) \leq (1 - γ) \left( γ^{- \frac{1}{2} T} \left( 2 T \tilde{β}_{R} (γ, T, D_{R V O}) + \frac{1}{1 - exp (- \frac{1}{4 T})} \right) + ρ (γ, T) \right) + \frac{2}{T}$

Now, we analyze $R_{T}$ . By the same reasoning as for $R_{R}$ , we have

$R_{T} (γ) \leq Δ V (γ) \left( \sum_{k = 0}^{\infty} E \left[ \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} d_{tv} (Δ Y_{k} (Θ_{n}, A_{n})) ; G_{T}^{k} \right] + ρ (γ, T) \right)$

Denote

$H_{T}^{k} := {C S}^{H} [Θ A_{L : N_{k}^{'}} Θ_{N_{k}}, β_{T} {(N_{k}^{'} + 1)}^{- 1}]$

It is easy to see that for $N_{k} \leq n < M_{k}$

$\Pr \left[ d_{tv} (Δ Y_{k} (Θ_{n}, A_{n})) \leq \frac{| S |}{2} W^{H_{T}^{k}} (Θ_{n} A_{n}) \,\middle|\, G_{T}^{k} \right] = 1$

It follows

$R_{T} (γ) \leq Δ V (γ) \left( \frac{| S |}{2} E \left[ \sum_{k = 0}^{\infty} \sum_{n = N_{k}}^{M_{k} - 1} γ^{n} W^{H_{T}^{k}} (Θ_{n} A_{n}) \right] + ρ (γ, T) \right)$

$R_{T} (γ) \leq Δ V (γ) \left( \frac{| S |}{2} \sum_{i = 0}^{\infty} E \left[ γ^{L_{i}} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] + ρ (γ, T) \right)$

Since $L_{i} \geq i$, by Proposition A.14,

$\sum_{i = 0}^{\infty} E \left[ γ^{L_{i}} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] \leq \sum_{i = 0}^{\infty} \left( E \left[ γ^{\frac{1}{2} (K_{i} - 1) T} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] + \exp (- \frac{i}{4 T}) \right)$

$\sum_{i = 0}^{\infty} E \left[ γ^{L_{i}} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] \leq γ^{- \frac{1}{2} T} \sum_{i = 0}^{\infty} E \left[ γ^{\frac{1}{2} K_{i} T} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] + \frac{1}{1 - \exp (- \frac{1}{4 T})}$

Denote

$\tilde{ρ} (γ, T, | S |) := ρ (γ, T) + \frac{| S |}{2} \cdot \frac{1}{1 - \exp (- \frac{1}{4 T})}$

We get, using that $γ^{- \frac{1}{2} T} < 2$

$R_{T} (γ) \leq Δ V (γ) \left( | S | \sum_{i = 0}^{\infty} E \left[ γ^{\frac{1}{2} K_{i} T} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \right] + \tilde{ρ} (γ, T, | S |) \right)$

On the other hand, ${dim}_{R V O} H \leq | S | | A |$ and by Proposition A.21

$\sum_{i = 0}^{\infty} γ^{\frac{1}{2} K_{i} T} W^{H_{T}^{K_{i}}} (Θ_{L_{i}} A_{L_{i}}) \leq C_{A.21} \left( | S | | A | T + \sqrt{| S | | A | \, β_{T} \left( \frac{1}{1 - γ^{\frac{1}{2}}} \right) \frac{1}{1 - γ^{\frac{1}{2}}}} \right)$

Denote by $\tilde{β}_{T} (γ, T, | S | | A |)$ the expression on the RHS. It follows that

$R_{T} (γ) \leq Δ V (γ) \left( | S | \tilde{β}_{T} (γ, T, | S | | A |) + \tilde{ρ} (γ, T) \right)$

We set

$T := \sqrt[3]{\frac{(τ + t_{est})^{2}}{(D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}}$

Now, we analyze the $γ \to 1$ limit. In this limit, the expression for $T$ justifies our assumption that $γ^{2 T} > \frac{1}{2}$ . Indeed, we have

$\lim_{γ \to 1} \ln γ^{T} = \lim_{γ \to 1} T \ln γ = \lim_{γ \to 1} \sqrt[3]{\frac{(τ + t_{est})^{2}}{(D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} \cdot \ln γ = 0$
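The limit above can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: the function names and the default parameter values ($τ = t_{est} = D_{MB} = D_{RVO} = 1$) are illustrative assumptions, and $T$ is computed from the formula as the cube root of $(τ + t_{est})^{2}$ over $(D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}$.

```python
import math

def horizon_T(gamma, tau=1.0, t_est=1.0, d_mb=1.0, d_rvo=1.0):
    """T as set in the proof; the default parameter values are hypothetical."""
    x = (1.0 - gamma) * math.log(1.0 / (1.0 - gamma))
    return ((tau + t_est) ** 2 / ((d_mb + 1.0) * d_rvo * x)) ** (1.0 / 3.0)

def log_gamma_T(gamma, **params):
    """ln(gamma^T) = T * ln(gamma), the quantity whose limit is taken."""
    return horizon_T(gamma, **params) * math.log(gamma)

# Evaluate along gamma = 1 - 10^{-p}, which approaches 1 as p grows.
values = [log_gamma_T(1.0 - 10.0 ** (-p)) for p in range(2, 9)]
for p, v in zip(range(2, 9), values):
    print(f"1 - gamma = 1e-{p}: T ln(gamma) = {v:.3e}")
```

Under these assumed parameters the printed values are negative and shrink toward zero, so $γ^{T}$ tends to $1$ even though $T$ itself grows without bound.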

We now analyze the separate contributions of $R_{R}$ , $R_{T}$ and $R_{1}$ to the limit of interest. We will use the notation $x ≲ y$ to mean, there is some constant $C_{0} \in R^{+}$ that depends on nothing (i.e. on none of the parameters of the problem) s.t. $x \leq C_{0} y$ .

Our previous bound on $R_{R}$ can be written as

$R_{R} (γ) ≲ D_{RVO} (1 - γ) T +$ $\sqrt{D_{RVO} \ln N (H_{R}, (1 - γ)^{- 4}) \cdot \frac{1}{1 - γ^{\frac{1}{2} T}}} \cdot (1 - γ) T +$ $\sqrt{D_{RVO} \frac{1}{1 - γ^{\frac{1}{2} T}} \ln \frac{1}{1 - γ}} \cdot (1 - γ) T +$ $\frac{1 - γ}{1 - \exp (- \frac{1}{4 T})} +$ $T \left\lceil \frac{\ln (1 - γ)}{\ln (1 - \frac{1}{2} T (1 - γ))} \right\rceil (1 - γ)^{3} +$ $(1 - γ)^{2} +$ $\frac{1}{T}$

Here we used that $1 ≲ γ^{T} \leq 1$, substituted the expressions for $\tilde{β}_{R}$, $\tilde{β}$ and $ρ$, and dropped some dominated terms.

We analyze the separate contribution of each term.

$\limsup_{γ \to 1} \frac{D_{RVO} (1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} \ln N (H_{R}, (1 - γ)^{- 4}) \cdot \frac{1}{1 - γ^{\frac{1}{2} T}}} \cdot (1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} D_{MB} \ln \frac{1}{1 - γ} \cdot \frac{1}{(1 - γ) T}} \cdot (1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} D_{MB} (1 - γ) \ln \frac{1}{1 - γ} \cdot T}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲ 1$

$\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} \frac{1}{1 - γ^{\frac{1}{2} T}} \ln \frac{1}{1 - γ}} \cdot (1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} \frac{1}{(1 - γ) T} \ln \frac{1}{1 - γ}} \cdot (1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} =$ $\limsup_{γ \to 1} \frac{\sqrt{D_{RVO} (1 - γ) \ln \frac{1}{1 - γ} \cdot T}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = \frac{1}{\sqrt[3]{D_{MB} + 1}}$

$\limsup_{γ \to 1} \frac{\frac{1 - γ}{1 - \exp (- \frac{1}{4 T})}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{(1 - γ) T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$
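The last step replaces $\frac{1}{1 - \exp (- \frac{1}{4 T})}$ by a multiple of $T$. One elementary way to see this (an illustration, not necessarily the argument the author had in mind) is the inequality $\frac{1}{1 - e^{- x}} \leq \frac{1}{x} + 1$ for $x > 0$, which with $x = \frac{1}{4 T}$ gives the bound $4 T + 1$. The sketch below checks it numerically; the function names are hypothetical.

```python
import math

def tail_sum(T):
    """Closed form of the geometric series sum_{i >= 0} exp(-i / (4T))."""
    x = 1.0 / (4.0 * T)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return 1.0 / (-math.expm1(-x))

def linear_bound(T):
    """Upper bound 1/x + 1 with x = 1/(4T), i.e. 4T + 1."""
    return 4.0 * T + 1.0

for T in (1.0, 10.0, 100.0, 1e4):
    print(T, tail_sum(T), linear_bound(T))
```

Since $T \to \infty$ as $γ \to 1$, the bound $4 T + 1$ is at most a constant times $T$ in the limit of interest.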

$\limsup_{γ \to 1} \frac{T \left\lceil \frac{\ln (1 - γ)}{\ln (1 - \frac{1}{2} T (1 - γ))} \right\rceil (1 - γ)^{3}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{\ln \frac{1}{1 - γ} (1 - γ)^{2}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{(1 - γ)^{2}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{1}{T \sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = \frac{1}{τ + t_{est}}$

Our previous bound on $R_{T}$ can be written as

$R_{T} (γ) ≲ Δ V (γ) {| S |}^{2} | A | T$ $+ Δ V (γ) | S | \sqrt{| S | | A | \ln \frac{N (H, (1 - γ)^{- 4})}{(1 - γ)^{2}} \cdot \frac{1}{1 - γ^{\frac{1}{2}}}}$ $+ Δ V (γ) | S | \sqrt{| S | | A | (1 - γ)^{2} \frac{1}{1 - γ^{\frac{1}{2}}} \ln \frac{\left( \frac{1}{1 - γ^{\frac{1}{2}}} \right)}{(1 - γ)^{2}} \cdot \frac{1}{1 - γ^{\frac{1}{2}}}}$ $+ Δ V (γ) T \left\lceil \frac{\ln (1 - γ)}{\ln (1 - \frac{1}{2} T (1 - γ))} \right\rceil (1 - γ)^{2}$ $+ Δ V (γ) \cdot (1 - γ)$ $+ Δ V (γ) \frac{| S |}{2} \cdot \frac{1}{1 - \exp (- \frac{1}{4 T})}$

We analyze the contribution of each term.

$\limsup_{γ \to 1} \frac{Δ V (γ) {| S |}^{2} | A | T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{τ (1 - γ) {| S |}^{2} | A | T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{Δ V (γ) | S | \sqrt{| S | | A | \ln \frac{N (H, (1 - γ)^{- 4})}{(1 - γ)^{2}} \cdot \frac{1}{1 - γ^{\frac{1}{2}}}}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{τ (1 - γ) | S | \sqrt{| S | | A | (\dim_{MB} H + 1) \ln \frac{1}{1 - γ} \cdot \frac{1}{1 - γ}}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{Δ V (γ) | S | \sqrt{| S | | A | (1 - γ)^{2} \frac{1}{1 - γ^{\frac{1}{2}}} \ln \frac{\left( \frac{1}{1 - γ^{\frac{1}{2}}} \right)}{(1 - γ)^{2}} \cdot \frac{1}{1 - γ^{\frac{1}{2}}}}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{Δ V (γ) T \left\lceil \frac{\ln (1 - γ)}{\ln (1 - \frac{1}{2} T (1 - γ))} \right\rceil (1 - γ)^{2}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{τ (1 - γ)^{2} \ln \frac{1}{1 - γ}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{Δ V (γ) \cdot (1 - γ)}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

$\limsup_{γ \to 1} \frac{Δ V (γ) \frac{| S |}{2} \cdot \frac{1}{1 - \exp (- \frac{1}{4 T})}}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{τ (1 - γ) | S | T}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 0$

To complete the proof, we need to analyze the contribution of $R_{1}$ . We have

$R_{1} (γ) ≲ \frac{1}{T} (\frac{Δ V (γ)}{(1 - γ)} + t_{e s t})$

$\limsup_{γ \to 1} \frac{\frac{1}{T} \left( \frac{Δ V (γ)}{1 - γ} + t_{est} \right)}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲$ $\limsup_{γ \to 1} \frac{τ + t_{est}}{T \sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} = 1$

Putting everything together, we conclude

$\limsup_{γ \to 1} \frac{R (γ)}{\sqrt[3]{(τ + t_{est}) (D_{MB} + 1) D_{RVO} (1 - γ) \ln \frac{1}{1 - γ}}} ≲ 1 ■$

Appendix

The following is a special case of what appeared in the previous essay [AF · GW] as "Definition 1", introduced here for the sake of simplifying notations.

Definition B.1

Consider a set $X$, an inner product space $Y$ and some $F \subseteq \{X \to Y\}$. Consider also $θ \in \mathbb{R}^{+}$, $n \in \mathbb{N}$, a sequence $\{x_{k} \in X\}_{k \in [n]}$ and $x^{*} \in X$. $x^{*}$ is said to be $(F, θ)$-dependent on $\{x_{k}\}$ when, for any $f, \tilde{f} \in F$

$\sum_{k = 0}^{n - 1} {\| f (x_{k}) - \tilde{f} (x_{k}) \|}^{2} \leq θ^{2} \implies \| f (x^{*}) - \tilde{f} (x^{*}) \| \leq θ$

Otherwise, $x^{*}$ is said to be $(F, θ)$-independent of $\{x_{k}\}$.
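To make the definition concrete, here is a minimal sketch for the special case of a finite class $F$, with functions represented as Python callables returning tuples of floats (vectors in $\mathbb{R}^{d}$ with the standard inner product). The function names and the two-element example class are illustrative assumptions, not taken from the essay.

```python
import itertools

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def is_dependent(F, theta, xs, x_star):
    """True iff x_star is (F, theta)-dependent on the sequence xs.

    F is a finite list of callables X -> tuple of floats, standing in
    for the function class of Definition B.1.
    """
    for f, g in itertools.combinations(F, 2):
        close_on_xs = sum(sq_dist(f(x), g(x)) for x in xs) <= theta ** 2
        far_at_x_star = sq_dist(f(x_star), g(x_star)) > theta ** 2
        if close_on_xs and far_at_x_star:
            return False  # this pair witnesses independence
    return True

# Hypothetical class: two linear functions x -> a * x with slopes 0 and 1.
F = [(lambda x, a=a: (a * x,)) for a in (0.0, 1.0)]

print(is_dependent(F, 0.5, [0.1], 1.0))  # functions agree near 0.1 but differ at 1.0
print(is_dependent(F, 0.5, [0.1], 0.2))  # functions are still close at 0.2
print(is_dependent(F, 0.5, [1.0], 1.0))  # premise fails, so dependence is vacuous
```

In the first call the two functions are indistinguishable at the observed point $0.1$ yet differ by $1 > θ$ at $x^{*} = 1$, so $x^{*}$ is independent; in the other two calls it is dependent.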

The following is a special case of what appeared in the previous essay [AF · GW] as "Definition A.1".

Definition B.2

Consider a set $X$, a finite-dimensional inner product space $Y$ and some $F \subseteq \{X \to Y\}$. Assume $F$ is compact w.r.t. the product topology on $X \to Y ≅ \prod_{x \in X} Y$. Consider also some $n \in \mathbb{N}$, $x \in X^{n}$, $y \in Y^{n}$ and $λ \in \mathbb{R}^{+}$. We then use the notation

${LS}^{F} [x y] := \operatorname{argmin}_{f \in F} \sum_{m = 0}^{n - 1} {\| f (x_{m}) - y_{m} \|}^{2}$

${CS}^{F} [x y, λ] := \left\{ f \in F \,\middle|\, \sum_{m = 0}^{n - 1} {\| f (x_{m}) - {LS}^{F} [x y] (x_{m}) \|}^{2} \leq \frac{1}{λ} \right\}$
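As an illustration of the least-squares estimate and the confidence set, the sketch below implements both for a finite class $F$. This is a simplification: the definition only assumes compactness, and ties in the argmin are broken here by list order. All names and the example data are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def least_squares(F, xs, ys):
    """LS^F[xy]: a minimizer over F of the summed squared error on (xs, ys)."""
    return min(F, key=lambda f: sum(sq_dist(f(x), y) for x, y in zip(xs, ys)))

def confidence_set(F, xs, ys, lam):
    """CS^F[xy, lam]: members of F within squared distance 1/lam of LS^F[xy] on xs."""
    f_ls = least_squares(F, xs, ys)
    return [f for f in F if sum(sq_dist(f(x), f_ls(x)) for x in xs) <= 1.0 / lam]

# Hypothetical class of linear functions x -> a * x with three candidate slopes.
slopes = (0.0, 0.5, 1.0)
F = [(lambda x, a=a: (a * x,)) for a in slopes]
xs = [1.0, 2.0]
ys = [(0.9,), (2.1,)]  # noisy observations of the slope-1 function

wide = confidence_set(F, xs, ys, lam=0.5)     # radius 1/lam = 2
narrow = confidence_set(F, xs, ys, lam=10.0)  # radius 1/lam = 0.1
print([f(1.0)[0] for f in wide], [f(1.0)[0] for f in narrow])
```

Larger $λ$ shrinks the set: with these data the wide set keeps the slopes $0.5$ and $1$, while the narrow set keeps only the least-squares slope $1$.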

Proposition B.1

There is some $C_{B .1} \in R^{+}$ s.t. the following holds.

Consider a finite set $X$, a finite-dimensional inner product space $Y$ and some $F \subseteq \{X \to Y\}$. Let $\{H_{n} \subseteq P (X^{ω} \times Y^{ω})\}_{n \in \mathbb{N}}$ be the canonical filtration, i.e.

$H_{n} := \{A^{'} \subseteq X^{ω} \times Y^{ω} \mid A^{'} = \{x y \mid {x y}_{: n} \in A\}, A \subseteq X^{n} \times Y^{n} \text{ Borel}\}$

Consider also $f^{*} \in F$ and $μ \in Δ (X^{ω} \times Y^{ω})$ s.t. for any $n \in \mathbb{N}$ and $x \in X$

$E_{x y \sim μ} [y_{n} \mid x_{n} = x, H_{n}] = f^{*} (x)$

Assume that $∥ y_{n} ∥ \leq 1$ with $μ$ -probability $1$ for all $n \in N$ . Fix $ϵ \in R^{+}$ , $δ \in (0, 1)$ . Define $β : R^{+} \to R$ by

$β (t) := C_{B.1} \left( σ^{2} \ln \frac{N (F, ϵ^{- 2} I_{d})}{δ} + ϵ t \ln \frac{e t}{δ} \right)$

Then,

$\Pr_{x y \sim μ} \left[ f^{*} \notin \bigcap_{n = 0}^{\infty} {CS}^{F} [{x y}_{: n}, β (n + 1)^{- 1}] \right] \leq δ$

Proof of Proposition B.1

Straightforward from "Proposition B.1" and "Proposition A.3" in the previous essay [AF · GW]. $■$

1 comments


comment by John_Maxwell (John_Maxwell_IV) · 2019-04-08T05:01:56.091Z · LW(p) · GW(p)

The literature study was very cursory and I will be glad to know about prior work I missed!

This post of mine seems related.
