How to make AIXI-tl incapable of learning

itaibn0

post by itaibn0 · 2014-01-27T00:05:35.767Z · LW · GW · Legacy · 5 comments

Consider a simple game: You are shown a random-looking 512-bit string h. You may then press one of two buttons, labeled '0' and '1'. No matter which button you press, you will then be shown a 256-bit string s such that SHA512(s) = h. In addition, if you pressed '1' you are given $1.
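To make the rules concrete, here is a minimal sketch of one round of the game. The names `play_round` and `always_one` are made up for illustration, and the environment is assumed to draw s uniformly at random, which the post does not specify.

```python
import hashlib
import secrets

def play_round(choose_button):
    """Play one round; `choose_button` maps the 512-bit challenge h to 0 or 1."""
    s = secrets.token_bytes(32)        # 256-bit string, revealed only at the end
    h = hashlib.sha512(s).digest()     # 512-bit string shown before the decision
    b = choose_button(h)               # the player's decision
    r = b                              # reward: $1 exactly when '1' was pressed
    return h, b, r, s

def always_one(h):
    """The obvious strategy: ignore h and press '1'."""
    return 1

history = [play_round(always_one) for _ in range(5)]
total = sum(r for _, _, r, _ in history)
```

The strategy never looks at h, which is the sense in which s and h are irrelevant to the reward.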

This game seems pretty simple, right? s and h are irrelevant, and you should simply press '1' all the time (I'm assuming you value receiving money). Well, let's see how AIXI and AIXI-tl fare at this game.

Let's say the machine has already played the game many times. Its memory is h_0, b_0, r_0, s_0, h_1, ..., b_(n-1), r_(n-1), s_(n-1), h_n, in chronological order, where the h_i, r_i, and s_i are inputs, the b_i are the machine's decisions, and r_i is the reward signal. It is always the case that r_i=b_i and h_i=SHA512(s_i).

First let's look at AIXI. It searches for models that compress and extrapolate this history up to the limit of its planning horizon. One class of such models is this: there is a list s_0, ..., s_N of random 256-bit strings and a list b_0, ..., b_N of (possibly compressible) bits. The history is SHA512(s_0), b_0, b_0, s_0, ..., b_(N-1), s_(N-1), SHA512(s_N), b_N (each b_i appears twice because the reward always equals the decision). Here s_i for i<n must match the s_i in its memory, and s_n must be the almost certainly unique value with SHA512(s_n) = h_n. While I don't have a proof, it intuitively seems like this class of models will dominate the machine's probability mass, and repeated arg-max should lead to the action of outputting 1. It wouldn't always do this, due to exploration/exploitation considerations and due to the incentive to minimize K(b_0, ..., b_N) built into its prior, but it should do it most of the time. So AIXI seems good.
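As a rough illustration of this model class, the sketch below generates the history a model predicts from its two lists and checks it against the agent's memory. The helper names (`model_history`, `consistent`) and the fixed byte strings standing in for random 256-bit strings are assumptions made for the example.

```python
import hashlib

def sha512(s: bytes) -> bytes:
    return hashlib.sha512(s).digest()

def model_history(s_list, b_list):
    """History predicted by one model: per round, h_i, b_i, r_i = b_i, s_i."""
    out = []
    for s, b in zip(s_list, b_list):
        out += [sha512(s), b, b, s]
    return out

def consistent(predicted, memory):
    """True if the predicted history begins with everything in memory."""
    return predicted[:len(memory)] == memory

# Stand-ins for random 256-bit strings (fixed here so the example is repeatable).
s_list = [bytes([i]) * 32 for i in range(3)]
b_list = [1, 0, 1]

full = model_history(s_list, b_list)
memory = full[:9]  # two complete rounds, then h_2 with s_2 not yet revealed

# A model that picks a different s_2 predicts a different h_2 and is ruled out.
other = model_history(s_list[:2] + [b"\x07" * 32], b_list)
```

Here `consistent(full, memory)` holds while `consistent(other, memory)` does not, which is the requirement that s_n be the preimage of the observed h_n.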

Now let's consider AIXI-tl. It picks outputs by having provably correct programs assign lower bounds to the expected utility, and picking the one with the best lower bound, where expected utility is as measured by AIXI. This would include accepting the analysis I just made with AIXI if that analysis can be made provably accurate. Here lies a problem: the agent has seen h_n but hasn't seen s_n. Therefore, it can't be certain that there is an s_n with SHA512(s_n)=h_n. Therefore, it can't be certain that the models used for AIXI actually work (this isn't a problem for AIXI since it has infinite computational power and can always determine that there is such an s_n).
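The gap between the two agents can be illustrated with a toy preimage search. This sketch assumes 16-bit preimages instead of 256-bit ones so that exhaustive search terminates, and `find_preimage` with its `budget` parameter is a hypothetical helper, not part of either formalism.

```python
import hashlib
from typing import Optional

def find_preimage(h: bytes, n_bytes: int, budget: int) -> Optional[bytes]:
    """Try candidate preimages in order, giving up after `budget` attempts."""
    for i in range(min(budget, 256 ** n_bytes)):
        s = i.to_bytes(n_bytes, "big")
        if hashlib.sha512(s).digest() == h:
            return s
    return None  # no preimage found within the budget; says nothing about existence

secret = (40000).to_bytes(2, "big")   # toy 16-bit "s_n"
h = hashlib.sha512(secret).digest()   # the observed "h_n"

unbounded = find_preimage(h, 2, budget=2 ** 16)  # enough compute to cover the space
bounded = find_preimage(h, 2, budget=1000)       # a resource-limited search
```

With the full budget the search recovers the preimage; with the small budget it returns `None` even though a preimage exists. At 256 bits every realistic budget is in the second situation.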

There is an ad hoc fix for this for AIXI-tl: take the same model as before, but let h_n be a random string rather than SHA512(s_n). At first this seems to work. However, it makes n, the current time, an input to the model, which adds to its complexity. Now other models that depend on the current time also need to be considered. For instance, what if h_n is a random string and, in addition, r_n=not(b_n)? This model seems more complicated, but in the programming language used for the Solomonoff prior it may be shorter. The key point is that the agent cannot update past that initial prior no matter how many times it plays the game, because every such model agrees with the whole history before time n and so is never contradicted by experience. In that case AIXI-tl may consistently prefer 0, assuming it doesn't consider any other models that make its behavior even more complicated.
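A small sketch of why the prior never gets updated: compare a model where r_i=b_i at every step with one where the reward is flipped only at the current step n. Both fit every past round equally well, so the odds between them stay at their prior value however long the history is. The prior weights below are made-up numbers, and the likelihood covers only the (b_i, r_i) pairs, on the assumption that both models treat the h and s values identically.

```python
from fractions import Fraction

def likelihood(history, flip_at):
    """Probability a deterministic model assigns to the observed (b, r) pairs.

    The model predicts r_i = b_i, except r_i = 1 - b_i at index `flip_at`.
    """
    p = Fraction(1)
    for i, (b, r) in enumerate(history):
        predicted = 1 - b if i == flip_at else b
        p *= Fraction(1) if r == predicted else Fraction(0)
    return p

def posterior_odds(prior_a, prior_b, history):
    """Odds of 'r_n = b_n' against 'r_n = not(b_n)' before round n is played."""
    n = len(history)  # index of the current round, which is not in the history
    like_a = likelihood(history, flip_at=None)  # never flips
    like_b = likelihood(history, flip_at=n)     # flips only at the current round
    return (prior_a * like_a) / (prior_b * like_b)

# Hypothetical prior weights in which the flipped model happens to be shorter.
prior_a, prior_b = Fraction(1, 8), Fraction(1, 4)

odds = [posterior_odds(prior_a, prior_b, [(1, 1)] * n) for n in (0, 10, 1000)]
```

Every entry of `odds` equals the prior ratio 1/2, since at each time n the flipped model in play is a fresh one that past rounds could not have refuted.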

The key problem is that AIXI-tl handles logical uncertainty badly. The reasonable thing to do upon seeing h_n is to assume that it, like all h_i before it, is a hash of some 256-bit string. Instead, it finds itself unable to prove this fact and is forced into assuming the present h_n is special. This makes it assume the present time is special and makes it incapable of learning from experience.

5 comments

Comments sorted by top scores.

comment by asr · 2014-01-27T00:21:04.129Z · LW(p) · GW(p)

I would have understood this post better if it had a short introduction to tell me what AIXI-tl is, and why I care about it. As it is, the thing I learned is that there exists a formalism of learning that doesn't work in some contexts -- which doesn't surprise me.

I don't mean this to be dismissive -- it sounds like there is an interesting point in there somewhere, but right now, readers who aren't experts in learning theory probably can't get it.

Replies from: Kaj_Sotala

↑ comment by Kaj_Sotala · 2014-01-27T16:30:45.916Z · LW(p) · GW(p)

AIXI and AIXI-tl on the Less Wrong wiki.

comment by RHollerith (rhollerith_dot_com) · 2014-06-12T15:15:08.571Z · LW(p) · GW(p)

Upvoted for having math in it and plausibly being relevant :)

comment by Squark · 2014-01-27T07:49:15.863Z · LW(p) · GW(p)

AIXI-tl should handle your problem OK, since the relationship between hn and sn is irrelevant to the relationship between bn and rn. If it failed here it would fail for a problem with hn and sn random noise. However it might be possible to construct a more complex example where the fact that SHA512(sn)=hn is important for optimization of the reward...

Replies from: itaibn0

↑ comment by itaibn0 · 2014-01-28T00:22:40.365Z · LW(p) · GW(p)

I don't think so. Imagine there were a way to determine whether a string is a valid hash, and it turned out that hn was not a valid hash. Then that information is relevant: you now know you are no longer playing the same game as the last times, and so you would no longer use the same strategy. From the perspective of AIXI-tl, not being able to prove that hn is a valid hash is treated the same as having a reliable indicator tell it that hn is not a valid hash, and it would act accordingly.
