A utility-maximizing variant of AIXI

post by AlexMennen · 2012-12-17

Contents

  Background
  Problem
  Solution
  Extension to infinite lifetimes
  Proof of criterion for supremum-chasing working

Background


Here is the function implemented by finite-lifetime [AI$\xi$](http://www.hutter1.net/ai/aixigentle.pdf):

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m}\in Y}\sum_{x_{m}\in X}\left(r\left(x_{k}\right)+\dots+r\left(x_{m}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right),$$

where $m$ is the number of steps in the lifetime of the agent, $k$ is the current step being computed, $X$ is the set of possible observations, $Y$ is the set of possible actions, $r$ is a function that extracts a reward value from an observation, a dot over a variable represents that its value is known to be the true value of the action or observation it represents, underlines represent that the variable is an input to a probability distribution, and $\xi$ is a function that returns the probability of a sequence of observations, given a certain known history and sequence of actions, and starting from the Solomonoff prior. More formally,

$$\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right)=\left(\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}2^{-\ell\left(q\right)}\right),$$

where $Q$ is the set of all programs, $\ell$ is a function that returns the length of a program in bits, and a program applied to a sequence of actions returns the resulting sequence of observations. Notice that the denominator is a constant, depending only on the already known $\dot{y}\dot{x}_{<k}$, and multiplying by a positive constant does not change the argmax, so we can pretend that the denominator doesn't exist. If $q$ is a valid program, then any longer program with $q$ as a prefix is not a valid program, so $\sum_{q\in Q}2^{-\ell\left(q\right)}\leq1$.
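
To make this concrete, here is a minimal sketch in Python over a tiny hand-picked stand-in for $Q$; the program names, bit-lengths, and behaviors are illustrative assumptions, not part of the construction. It computes the numerator above for a given interaction history (the denominator is dropped, just as in the argmax):

```python
from fractions import Fraction

# A toy, hand-specified stand-in for Q. Each entry maps a program name to
# (length in bits, function from an action sequence to the observation
# sequence it would produce). Names, lengths, and behaviors here are
# illustrative assumptions.
TOY_Q = {
    "q0": (2, lambda acts: [0] * len(acts)),  # always observe 0
    "q1": (2, lambda acts: [1] * len(acts)),  # always observe 1
    "q2": (3, lambda acts: list(acts)),       # echo the actions back
}

def xi_numerator(actions, observations):
    """Sum of 2^-l(q) over programs q consistent with the history,
    i.e. q(y_<=m) = x_<=m."""
    return sum(Fraction(1, 2 ** length)
               for length, run in TOY_Q.values()
               if run(actions) == observations)

print(xi_numerator([1, 0, 1], [1, 1, 1]))  # 1/4: only q1 is consistent
```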

Problem


A problem with this is that it can only optimize over the input it receives, not over aspects of the external world that it cannot observe. Given the chance, AI$\xi$ would hack its input channel so that it would only observe good things, instead of trying to make good things happen (in other words, it would [wirehead](http://lesswrong.com/lw/fkx/a_definition_of_wireheading/) itself). Anja [specified](http://lesswrong.com/lw/feo/universal_agents_and_utility_functions/) a variant of AI$\xi$ in which she replaced the sum of rewards with a single utility value and made the domain of the utility function be the entire sequence of actions and observations instead of a single observation, like so:

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m}\in Y}\sum_{x_{m}\in X}U\left(\dot{y}\dot{x}_{<k}yx_{k:m}\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m}\right).$$

This doesn't really solve the problem, because the utility function still only takes what the agent can see, rather than what is actually going on outside the agent. The situation is a little better because the utility function also takes the agent's actions into account, so it could punish actions that look like the agent is trying to wirehead itself. But if there were a flaw in the instructions not to wirehead, the agent would exploit it, so the incentive not to wirehead would have to be perfect, and this formulation is not very enlightening about how to achieve that.

Solution


Here's what I suggest instead: everything that happens is determined by the program that the world is running on and the agent's actions, so the domain of the utility function should be $Q\times Y^{m}$. The apparent problem with that is that the formula for AI$\xi$ does not contain any mention of elements of $Q$. If we just take the original formula and replace $r\left(x_{k}\right)+\dots+r\left(x_{m}\right)$ with $U\left(q,\dot{y}_{<k}y_{k:m}\right)$, it wouldn't make any sense. However, if we expand out $\xi$ in the original formula (excluding the unnecessary denominator), we can move the sum of rewards inside the sum over programs, like this:

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}\left(r\left(x_{k}\right)+\dots+r\left(x_{m}\right)\right)2^{-\ell\left(q\right)}.$$

Now it is easy to replace the sum of rewards with the desired utility
function.

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m}\in Y}\sum_{x_{m}\in X}\sum_{q\in Q:q\left(y_{\leq m}\right)=x_{\leq m}}U\left(q,\dot{y}_{<k}y_{k:m}\right)2^{-\ell\left(q\right)}.$$

With this formulation, there is no danger of the agent wireheading, and all $U$ has to do is compute everything that happens when the agent performs a given sequence of actions in a given program, and decide how desirable it is. If the range of $U$ is unbounded, then this might not converge. Let's assume throughout this post that the range of $U$ is bounded.
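
As a sanity check on this formula, here is a brute-force sketch of the finite-lifetime computation over a toy program set; the binary action and observation alphabets, the program set, and the example $U$ are all assumptions made for illustration:

```python
from fractions import Fraction

Y = (0, 1)  # toy action alphabet
X = (0, 1)  # toy observation alphabet

# Toy stand-in for Q: name -> (length in bits, observation rule).
TOY_Q = {
    "q0": (2, lambda acts: [0] * len(acts)),  # always observe 0
    "q1": (2, lambda acts: [1] * len(acts)),  # always observe 1
    "q2": (3, lambda acts: list(acts)),       # echo the actions back
}

def value(actions, observations, m, U):
    """The alternating max-over-actions / sum-over-observations, with the
    innermost sum restricted to programs consistent with the history."""
    if len(actions) == m:
        return sum(Fraction(1, 2 ** length) * U(name, actions)
                   for name, (length, run) in TOY_Q.items()
                   if run(actions) == observations)
    return max(sum(value(actions + [y], observations + [x], m, U)
                   for x in X)
               for y in Y)

# An illustrative U that happens to depend only on the actions:
# the fraction of steps on which the agent played 1.
U = lambda name, acts: Fraction(sum(acts), len(acts))
m = 2
dot_y1 = max(Y, key=lambda y: sum(value([y], [x], m, U) for x in X))
print(dot_y1)  # 1
```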

Extension to infinite lifetimes


The previous discussion assumed that the agent would only have the opportunity to perform a finite number of actions. The situation gets a little tricky when the agent is allowed to perform an unbounded number of actions. Hutter uses a finite look-ahead approach for AI$\xi$, where on each step $k$, it pretends that it will only be performing $m_{k}$ actions, where $\forall k\, m_{k}\gg k$.

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\left(r\left(x_{k}\right)+\dots+r\left(x_{m_{k}}\right)\right)\cdot\xi\left(\dot{y}\dot{x}_{<k}y\underline{x}_{k:m_{k}}\right).$$

If we make the same modification to the utility-based variant, we
get

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sum_{x_{k}\in X}\max_{y_{k+1}\in Y}\sum_{x_{k+1}\in X}\dots\max_{y_{m_{k}}\in Y}\sum_{x_{m_{k}}\in X}\sum_{q\in Q:q\left(y_{\leq m_{k}}\right)=x_{\leq m_{k}}}U\left(q,\dot{y}_{<k}y_{k:m_{k}}\right)2^{-\ell\left(q\right)}.$$

This is unsatisfactory because the domain of $U$ was supposed to consist of all the information necessary to determine everything that happens, but here, it is missing all the actions after step $m_{k}$. One obvious thing to try is to set $m_{k}:=\infty$. This will be easier to do using a compacted expression for AI$\xi$:

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\max_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}\left(r\left(x_{k}^{pq}\right)+\dots+r\left(x_{m_{k}}^{pq}\right)\right)2^{-\ell\left(q\right)},$$

where $P$ is the set of policies that map sequences of observations to sequences of actions and $x_{i}^{pq}$ is shorthand for the last observation in the sequence returned by $q\left(p\left(\dot{x}_{<k}x_{k:i-1}^{pq}\right)\right)$. If we take this compacted formulation, modify it to accommodate the new utility function, set $m_{k}:=\infty$, and replace the maximum with a supremum (since there's an infinite number of possible policies), we get

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)},$$

where $y_{i}^{pq}$ is shorthand for the last action in the sequence returned by $p\left(q\left(\dot{y}_{<k}y_{k}y_{k+1:i-1}^{pq}\right)\right)$.
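
The mutual recursion between $x_{i}^{pq}$ and $y_{i}^{pq}$ is just the agent-environment interaction loop unrolled. A minimal sketch, with the simplifying assumption that $p$ and $q$ are written to return only their next output rather than the whole sequence:

```python
def interact(policy, program, start_actions, start_observations, horizon):
    """Unroll the interaction between a policy p and an environment
    program q from a known history: each y_i is p applied to the
    observations so far, each x_i is q applied to the actions so far."""
    actions = list(start_actions)
    observations = list(start_observations)
    while len(actions) < horizon:
        actions.append(policy(observations))    # y_i^{pq}
        observations.append(program(actions))   # x_i^{pq}
    return actions, observations

# e.g. a policy that repeats the last observation (playing 1 to start)
# against an environment that negates the last action:
acts, obs = interact(lambda xs: xs[-1] if xs else 1,
                     lambda ys: 1 - ys[-1],
                     [], [], horizon=5)
print(acts, obs)  # [1, 0, 1, 0, 1] [0, 1, 0, 1, 0]
```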

But there is a problem with this, which I will illustrate with a toy example. Suppose $Y:=\left\{ a,b\right\}$, and $U\left(q,y_{1:\infty}\right)=0$ when $\forall k\in\mathbb{N}\, y_{k}=a$, and for any $n\in\mathbb{N}$, $U\left(q,y_{1:\infty}\right)=1-\frac{1}{n}$ when $y_{n}=b$ and $\forall k<n\, y_{k}=a$. ($U$ does not depend on the program $q$ in this example.) An agent following the above formula would output $a$ on every step, and end up with a utility of $0$, when it could have gotten arbitrarily close to $1$ by eventually outputting $b$.
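
A small sketch of that failure mode: encode a play by the step at which the first $b$ occurs (or None for all $a$'s), and compare, at each step, playing $b$ now against deferring. The finite lookahead below is an assumption standing in for the true supremum of $1$; any lookahead at all already beats playing $b$ now, so the agent defers forever:

```python
from fractions import Fraction

def utility(first_b):
    """0 for the all-a sequence; 1 - 1/n if the first b is at step n."""
    return Fraction(0) if first_b is None else 1 - Fraction(1, first_b)

def action_at_step(k, lookahead=1000):
    """Supremum-chasing choice at step k, having played a so far."""
    value_b_now = utility(k)  # commit now: 1 - 1/k
    value_defer = max(utility(n) for n in range(k + 1, k + lookahead))
    return "a" if value_defer > value_b_now else "b"

print([action_at_step(k) for k in range(1, 6)])
# ['a', 'a', 'a', 'a', 'a']: it defers forever and gets utility 0
```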

To avoid problems like that, we could assume the reasonable-seeming condition that if $y_{1:\infty}$ is an action sequence and $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ is a sequence of action sequences that converges to $y_{1:\infty}$ (by which I mean $\forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}$), then $\lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)$. (The toy example above violates this condition: the sequences with first $b$ at step $n$ converge to the all-$a$ sequence, but their utilities converge to $1\neq0$.) Under that assumption, the supremum is in fact a maximum, and the formula gives you an action sequence that will reach that maximum (proof below).

If you don't like the condition I imposed on $U$, you might not be satisfied by this. But without it, there is not necessarily a best policy. One thing you can do is, on step 1, pick some extremely small $\varepsilon>0$, pick any element from

$$\left\{ p^{*}\in P:\sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}>\left(\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\varepsilon\right\},$$

and then follow that policy for the rest of eternity, which will guarantee that you do not miss out on more than $\varepsilon$ of expected utility.
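
Here is a sketch of that selection rule on the toy example above, where policy $n$ means "output $b$ at step $n$"; treating the supremum as known is a simplifying assumption of the sketch (in general, the supremum is the hard part to compute):

```python
from fractions import Fraction
from itertools import count

def epsilon_optimal(policies, expected_utility, sup_value, epsilon):
    """Commit to the first policy in the enumeration whose expected
    utility is within epsilon of the (assumed known) supremum."""
    return next(p for p in policies
                if expected_utility(p) > sup_value - epsilon)

# Policy n: output b at step n, so its expected utility is 1 - 1/n,
# and the supremum over policies is 1.
print(epsilon_optimal(count(1), lambda n: 1 - Fraction(1, n),
                      1, Fraction(1, 100)))  # -> 101
```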

Proof of criterion for supremum-chasing working


definition: If $y_{1:\infty}$ is an action sequence and $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ is an infinite sequence of action sequences, and $\forall k\in\mathbb{N}\,\exists N\in\mathbb{N}\,\forall n>N\, y_{k}^{n}=y_{k}$, then we say $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ converges to $y_{1:\infty}$. If $p$ is a policy and $\left\{ p_{n}\right\} _{n=1}^{\infty}$ is a sequence of policies, and $\forall k\in\mathbb{N}\,\forall x_{<k}\in X^{k}\,\exists N\in\mathbb{N}\,\forall n>N\, p\left(x_{<k}\right)=p_{n}\left(x_{<k}\right)$, then we say $\left\{ p_{n}\right\} _{n=1}^{\infty}$ converges to $p$.

assumption (for lemma 2 and theorem): If $\left\{ y_{1:\infty}^{n}\right\} _{n=1}^{\infty}$ converges to $y_{1:\infty}$, then $\lim_{n\rightarrow\infty}U\left(q,y_{1:\infty}^{n}\right)=U\left(q,y_{1:\infty}\right)$.

lemma 1: The agent described by

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)}$$

follows a policy that is the limit of a sequence of policies $\left\{ p_{n}\right\} _{n=1}^{\infty}$ such that

$$\lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}.$$

proof of lemma 1: Any policy can be completely described by the last
action it outputs for every finite observation sequence. Observations
are returned by a program, so the set of possible finite observation
sequences is countable. It is possible to fix the last action returned
on any particular finite observation sequence to be the argmax, and
still get arbitrarily close to the supremum with suitable choices
for the last action returned on the other finite observation sequences.
By induction, it is possible to get arbitrarily close to the supremum
while fixing the last action returned to be the argmax for any finite
set of finite observation sequences. Thus, there exists a sequence
of policies approaching the policy implemented by AI$\xi$ whose expected utilities approach the supremum.

lemma 2: If $p$ is a policy and $\left\{ p_{n}\right\} _{n=1}^{\infty}$ is a sequence of policies converging to $p$, then

$$\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}=\lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}.$$

proof of lemma 2: Let $\varepsilon>0$. On any given sequence of inputs $x_{1:\infty}\in X^{\infty}$, $\left\{ p_{n}\left(x_{1:\infty}\right)\right\} _{n=1}^{\infty}$ converges to $p\left(x_{1:\infty}\right)$, so, by assumption, $\forall q\in Q\,\exists N\in\mathbb{N}\,\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}$. For each $N\in\mathbb{N}$, let $Q_{N}:=\left\{ q\in Q:\forall n\geq N\,\left|U\left(q,y_{1:\infty}^{pq}\right)-U\left(q,y_{1:\infty}^{p_{n}q}\right)\right|<\frac{\varepsilon}{2}\right\}$. The previous statement implies that $\bigcup_{N\in\mathbb{N}}Q_{N}=Q$, and each element of $\left\{ Q_{N}\right\} _{N\in\mathbb{N}}$ is a subset of the next, so $\exists N\in\mathbb{N}\,\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2\left(\sup U-\inf U\right)}$. The range of $U$ is bounded, so $\sup U$ and $\inf U$ are defined. This also implies that the difference in expected utility, given any information, of any two policies, is bounded. More formally,

$$\forall Q'\subset Q\,\forall p',p''\in P\,\left|\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p'q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)-\left(\left(\sum_{q\in Q'}U\left(q,y_{1:\infty}^{p''q}\right)2^{-\ell\left(q\right)}\right)\diagup\left(\sum_{q\in Q'}2^{-\ell\left(q\right)}\right)\right)\right|\leq\sup U-\inf U,$$

so in particular,

$$\left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left(\sup U-\inf U\right)\sum_{q\in Q\setminus Q_{N}}2^{-\ell\left(q\right)}<\frac{\varepsilon}{2}.$$

Hence

$$\left|\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|\leq\left|\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|+\left|\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}\right)-\left(\sum_{q\in Q\setminus Q_{N}}U\left(q,y_{1:\infty}^{p_{N}q}\right)2^{-\ell\left(q\right)}\right)\right|<\frac{\varepsilon}{2}+\frac{\varepsilon}{2}=\varepsilon.$$

theorem:

$$\sum_{\dot{q}\in Q}U\left(\dot{q},\dot{y}_{1:\infty}\right)2^{-\ell\left(\dot{q}\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)},$$

where

$$\dot{y}_{k}:=\arg\max_{y_{k}\in Y}\sup_{p\in P:p\left(\dot{x}_{<k}\right)=\dot{y}_{<k}y_{k}}\sum_{q\in Q:q\left(\dot{y}_{<k}\right)=\dot{x}_{<k}}U\left(q,\dot{y}_{<k}y_{k}y_{k+1:\infty}^{pq}\right)2^{-\ell\left(q\right)}.$$

proof of theorem: Let's call the policy implemented by the agent $p^{*}$. By lemma 1, there is a sequence of policies $\left\{ p_{n}\right\} _{n=1}^{\infty}$ converging to $p^{*}$ such that

$$\lim_{n\rightarrow\infty}\sum_{q\in Q}U\left(q,y_{1:\infty}^{p_{n}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}.$$

By lemma 2,

$$\sum_{q\in Q}U\left(q,y_{1:\infty}^{p^{*}q}\right)2^{-\ell\left(q\right)}=\sup_{p\in P}\sum_{q\in Q}U\left(q,y_{1:\infty}^{pq}\right)2^{-\ell\left(q\right)}.$$
