## Posts

Update Then Forget 2013-01-17T18:36:48.373Z · score: 9 (10 votes)
How to Be Oversurprised 2013-01-07T04:02:01.424Z · score: 13 (18 votes)
How to Disentangle the Past and the Future 2013-01-02T17:43:27.453Z · score: 12 (15 votes)
Point-Based Value Iteration 2012-10-08T06:19:34.355Z · score: 9 (10 votes)
Internal Availability 2012-10-08T06:19:29.089Z · score: 2 (12 votes)
The Bayesian Agent 2012-09-18T03:23:34.489Z · score: 11 (14 votes)
Reinforcement, Preference and Utility 2012-08-08T06:23:25.793Z · score: 10 (13 votes)
Reinforcement Learning: A Non-Standard Introduction (Part 2) 2012-08-02T08:17:08.744Z · score: 9 (10 votes)
Reinforcement Learning: A Non-Standard Introduction (Part 1) 2012-07-29T00:13:38.238Z · score: 20 (21 votes)
The Perception-Action Cycle 2012-07-23T01:44:25.808Z · score: 6 (7 votes)

Comment by royf on An overall schema for the friendly AI problems: self-referential convergence criteria · 2015-07-17T11:23:19.938Z · score: 2 (2 votes) · LW · GW

It seems that your research is coming around to some concepts that are at the basis of mine. Namely, that noise in an optimization process is a constraint on the process, and that the resulting constrained optimization process avoids the nasty properties you describe.

Feel free to contact me if you'd like to discuss this further.

Comment by royf on Utility vs Probability: idea synthesis · 2015-03-27T17:59:01.746Z · score: 2 (2 votes) · LW · GW

This is not unlike Neyman-Pearson theory. Surely this will run into the same trouble with more than 2 possible actions.

Comment by royf on [LINK] Causal Entropic Forces · 2013-04-29T18:28:57.178Z · score: 0 (0 votes) · LW · GW

Our research group and collaborators, foremost Daniel Polani, have been studying this for many years now. Polani calls an essentially identical concept empowerment. These guys are welcome to the party, and as former outsiders it's understandable (if not totally acceptable) that they wouldn't know about these piles of prior work.

Comment by royf on A Little Puzzle about Termination · 2013-02-03T15:16:09.299Z · score: 0 (0 votes) · LW · GW

You have a good and correct point, but it has nothing to do with your question.

a machine can never halt after achieving its goal because it cannot know with full certainty whether it has achieved its goal

This is a misunderstanding of how such a machine might work.

To verify that it completed the task, the machine must match the current state to the desired state. The desired state is any state where the machine has "made 32 paperclips". Now what's a paperclip?

For quite some time we've had the technology to identify a paperclip in an image, if one exists. One lesson we've learned pretty well is this: don't overfit. The paperclip you're going to be tested on is probably not one you've seen before. You'll need to know what features are common in paperclips (and less common in other objects) and how much variability they present. Tolerance to this variability will be necessary for generalization, and this means you can never be sure if you're seeing a paperclip. In this sense there's a limit to how well the user can specify the goal.

So after taking a few images of the paperclips it's made, the machine's major source of (unavoidable) uncertainty will be "is this what the user meant?", not "am I really getting a good image of what's on the table?". Any half-decent implementation will go do other things (such as go ask the user).

Comment by royf on Right for the Wrong Reasons · 2013-02-01T22:55:43.945Z · score: 0 (0 votes) · LW · GW

The "world state" of ASH is in fact an "information state" of p("heads")>SOME_THRESHOLD

Actually, I meant p("heads") = 0.999 or something.

(C), if I'm following you, maps roughly to the English phrase "I know for absolutely certain that the coin is almost surely heads".

No, I meant: "I know for absolutely certain that the coin is heads". We agree that this much you can never know. As for getting close to this, for example having the information state (D) where p("heads") = 0.999999: if the world is in the state "heads", (D) is (theoretically) possible; if the world is in the state "ASH", (D) is impossible.

Can you give me some examples of the kinds of cases you have in mind?

Mundane examples may not be as clear, so: suppose we send a coin-flipping machine deep into intergalactic space. After a few billion years it flies permanently beyond our light cone, and then flips the coin.

Now any information state about the coin, other than complete ignorance, is physically impossible. We can still say that the coin is in one of the two states "heads" and "tails", only unknown to us. Alternatively we can say that the coin is in a state of superposition. These two models are epistemologically equivalent.

I prefer the latter, and think many people in this community should agree, based on the spirit of other things they believe: the former model is ontologically more complicated. It's saying more about reality than can be known. It sets the state of the coin as a free-floating property of the world, with nothing to entangle with.

Comment by royf on Right for the Wrong Reasons · 2013-02-01T21:22:13.766Z · score: 1 (1 votes) · LW · GW

I probably need to write a top-level post to explain this adequately, but in a nutshell:

I've tossed a coin. Now we can say that the world is in one of two states: "heads" and "tails". This view is consistent with any information state. The information state (A) of maximal ignorance is a uniform distribution over the two states. The information state (B) where heads is twice as likely as tails is the distribution p("heads") = 2/3, p("tails") = 1/3. The information state (C) of knowing for sure that the result is heads is the distribution p("heads") = 1, p("tails") = 0.

Alternatively, we can say that the world is in one of these two states: "almost surely heads" and "almost surely tails". Now information state (A) is a uniform distribution over these states; (B) is perhaps the distribution p("ASH") = 0.668, p("AST") = 0.332; but (C) is impossible, and so is any information state that is more certain than reality in this strange model.
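A small numeric sketch of this second model, assuming that "almost surely heads" means p("heads") = 0.999 (the figure mentioned elsewhere in this thread). The constants and the function name `p_heads` are illustrative assumptions, not part of the original argument. It shows that no distribution over "ASH" and "AST" can be more certain of heads than the world state itself allows.

```python
ASH_HEADS = 0.999          # assumed meaning of "almost surely heads"
AST_HEADS = 1 - ASH_HEADS  # assumed meaning of "almost surely tails"

def p_heads(p_ash):
    """Marginal probability of heads, given belief p_ash in world state "ASH"."""
    return p_ash * ASH_HEADS + (1 - p_ash) * AST_HEADS

# Information state (A): uniform over the two world states.
print(p_heads(0.5))
# Information state (B): roughly the distribution quoted above.
print(p_heads(0.668))
# Even full certainty about the world state leaves residual uncertainty,
# which is why information state (C) is unreachable in this model.
print(p_heads(1.0))
```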

Now, in many cases we can theoretically have information states arbitrarily close to complete certainty. In such cases we must use the first kind of model. So we can agree to just always use the first kind of model, and avoid all this silly complication.

But then there are cases where there are real (physical) reasons why not every information state is possible. In these cases reality is not constrained to be of the first kind, and it could be of the second kind. As a matter of fact, to say that reality is of the first kind - and that probability is only in the mind - is to say more about reality than can possibly be known. This goes against Jaynesianism.

So I completely agree that not knowing something is a property of the map rather than the territory. But an impossibility of any map to know something is a property of the territory.

Comment by royf on Right for the Wrong Reasons · 2013-02-01T18:04:13.392Z · score: 1 (1 votes) · LW · GW

To clarify further: likelihood is a relative quantity, like speed - it only has meaning relative to a specific frame of reference.

If you're judging my calibration, the proper frame of reference is what I knew at the time of prediction. I didn't know what the result of the fencing match would be, but I had some evidence for who is more likely to win. The (objective) probability distribution given that (subjective) information state is what I should've used for prediction.

If you're judging my diligence as an evidence seeker, the proper frame of reference is what I would've known after reasonable information gathering. I could've taken some actions to put myself in a different information state, and then my prediction could be better.

But it's unreasonable to expect me to know the result beyond any doubt. Even if Omega is in an information state of perfectly predicting the future, this is never a proper frame of reference by which to judge bounded agents.

And this is the major point on which I'm non-Yudkowskian: since Omega is never a useful frame of reference, I'm not constraining reality to be consistent with it. In this sense, some probabilities are in the territory.

Comment by royf on Right for the Wrong Reasons · 2013-01-24T13:06:15.613Z · score: 0 (0 votes) · LW · GW

This is perhaps not the best description of actualism, but I see your point. Actualists would disagree with this part of my comment:

If I believed that "you will win" (no probability qualifier), then in the many universes where you didn't I'm in Bayes Hell.

on the grounds that those other universes don't exist.

But that was just a figure of speech. I don't actually need those other universes to argue against 0 and 1 as probabilities. And if Frequentists disbelieve in that, there's no place in Bayes Heaven for them.

Comment by royf on Right for the Wrong Reasons · 2013-01-24T02:52:33.745Z · score: 0 (0 votes) · LW · GW

Comment by royf on Right for the Wrong Reasons · 2013-01-24T01:22:54.251Z · score: 12 (12 votes) · LW · GW

Predictions are justified not by becoming a reality, but by the likelihood of their becoming a reality [1]. When this likelihood is hard to estimate, we can take their becoming a reality as weak evidence that the likelihood is high. But in the end, after counting all the evidence, it's really only the likelihood itself that matters.

If I predict [...] that I will win [...] and I in fact lose fourteen touches in a row, only to win by forfeit

If I place a bet on you to win and this happens, I'll happily collect my prize, but still feel that I put my money on the wrong athlete. My prior and the signal are rich enough for me to deduce that your victory, although factual, was unlikely. If I believed that you're likely to win, then my belief wasn't "true for the wrong reasons", it was simply false. If I believed that "you will win" (no probability qualifier), then in the many universes where you didn't I'm in Bayes Hell.

Conversely in the other example, your winning itself is again not the best evidence for its own likelihood. Your scoring 14 touches is. My belief that you're likely to win is true and justified for the right reasons: you're clearly the better athlete.

[1] Where likelihood is measured either given what I know, or what I could know, or what anybody could know - depending on why we're asking the question in the first place.

Comment by royf on Update Then Forget · 2013-01-21T22:39:58.260Z · score: 2 (2 votes) · LW · GW

Thanks!

The best book is doubtlessly Elements of Information Theory by Cover and Thomas. It's very clear (to someone with some background in math or theoretical computer science) and lays very strong introductory foundations before giving a good overview of some of the deeper aspects of the theory.

It's fortunate that many concepts of information theory share some of their mathematical meaning with the everyday meaning. This way I can explain the new theory (popularized here for the first time) without defining these concepts.

I'm planning another sequence where these and other concepts will be expressed in the philosophical framework of this community. But I should've realized that some readers would be interested in a complete mathematical introduction. That book is what you're looking for.

Comment by royf on Update Then Forget · 2013-01-18T18:31:59.831Z · score: 0 (0 votes) · LW · GW

This is a perfect agent, of theoretical interest if not practically realizable.

Comment by royf on Update Then Forget · 2013-01-17T23:26:24.619Z · score: 2 (2 votes) · LW · GW

an intelligent agent should update on it and then forget it.

Should being the operative word. This refers to a "perfect" agent (emphasis added in text; thanks!).

People don't do this, as well they shouldn't, because we update poorly and need the original data to compensate.

If you forget the discarded cards, and later realize that you may have an incorrect map of the deck, aren't you SOL?

If I remember the cards in play, I don't care about the discarded ones. If I don't, the discarded cards could help a bit, but that's not the heart of my problem.

Comment by royf on A fungibility theorem · 2013-01-13T03:45:32.157Z · score: 1 (1 votes) · LW · GW

if you really care about the values on that list, then there are linear aggregations

Of course existence doesn't mean that we can actually find these coefficients. Even if you have only 2 well-defined value functions, finding an optimal tradeoff between them is generally computationally hard.

Comment by royf on How to Be Oversurprised · 2013-01-07T21:13:35.167Z · score: 3 (3 votes) · LW · GW

Philosophically, yes.

Practically, it may be useful to distinguish between a coin and a toss. The coin has persisting features which make it either fair or loaded for a long time, with correlation between past and future. The toss is transient, and essentially all information about it is lost when I put the coin away - except through the memory of agents.

So yes, the toss is a feature of the present state of the world. But it has the very special property, that given the bias of the coin, the toss is independent of the past and the future. It's sometimes more useful to treat a feature like that as an observation external to the world, but of course it "really" isn't.

Comment by royf on How to Be Oversurprised · 2013-01-07T20:33:39.234Z · score: 1 (1 votes) · LW · GW

I'm trying to balance between introducing terminology to new readers and not boring those who've read my previous posts. Thanks for the criticism, I'll use it (and its upvotes) to correct my balance.

Comment by royf on How to Be Oversurprised · 2013-01-07T06:03:11.041Z · score: 1 (1 votes) · LW · GW

Well, thank you!

Yes, I do this more for the math and the algorithms than for advice for humans.

Still, the advice is perhaps not so trivial: study not what you're most uncertain about (highest entropy given what you know) but those things with entropy generated by what you care about. And even this advice is incomplete - there's more to come.

Comment by royf on How to Be Oversurprised · 2013-01-07T04:02:16.924Z · score: 0 (0 votes) · LW · GW

When the new memory state $M_t$ is generated by a Bayesian update from the previous one $M_{t-1}$ and the new observation $O_t$, it's a sufficient statistic of these information sources for the world state $W_t$, so that $M_t$ keeps all the information about the world that was remembered or observed:

$I(W_t;M_t)=I(W_t;(M_{t-1},O_t))$

As this is all the information available, other ways to update can only have less information.

The amount of information gained by a Bayesian update is

$I(W_t;(M_{t-1},O_t))-I(W_t;M_{t-1})$

$=\mathrm E\left[\log\frac{\Pr(W_t,M_{t-1},O_t)}{\Pr(W_t)\Pr(M_{t-1},O_t)}-\log\frac{\Pr(W_t,M_{t-1})}{\Pr(W_t)\Pr(M_{t-1})}\right]$

$=\mathrm E\left[\log\frac{\Pr(W_t,M_{t-1},O_t)}{\Pr(W_t,M_{t-1})}-\log\frac{\Pr(M_{t-1},O_t)}{\Pr(M_{t-1})}\right]$

$=\mathrm E[\log\Pr(O_t|W_t,M_{t-1})-\log\Pr(O_t|M_{t-1})]$

and because the observation only depends on the world

$=\mathrm E[-S(O_t|W_t)+S(O_t|M_{t-1})]$
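For anyone who wants to check the derivation numerically, here is a minimal sketch on a made-up binary example. All the probabilities below are hypothetical, and I'm reading S as surprisal (minus log-probability), which is an assumption about the notation. The toy model makes the observation depend only on the world state, as the last step requires, and compares the information gain with the expected gap in surprisal.

```python
from itertools import product
from math import log2

# Hypothetical toy numbers (assumptions for illustration only).
p_w = {0: 0.7, 1: 0.3}                       # Pr(W_t)
p_m_given_w = {0: {0: 0.8, 1: 0.2},          # Pr(M_{t-1} | W_t)
               1: {0: 0.4, 1: 0.6}}
p_o_given_w = {0: {0: 0.9, 1: 0.1},          # Pr(O_t | W_t)
               1: {0: 0.25, 1: 0.75}}

# Joint Pr(w, m, o) = Pr(w) Pr(m|w) Pr(o|w): the observation depends only on the world.
joint = {(w, m, o): p_w[w] * p_m_given_w[w][m] * p_o_given_w[w][o]
         for w, m, o in product((0, 1), repeat=3)}

def marginal(keep):
    """Marginalize the joint onto the variable positions listed in keep."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

def mutual_information(a, b):
    """I(A;B) in bits, where a and b are tuples of variable positions."""
    p_ab, p_a, p_b = marginal(a + b), marginal(a), marginal(b)
    return sum(p * log2(p / (p_a[k[:len(a)]] * p_b[k[len(a):]]))
               for k, p in p_ab.items() if p > 0)

W, M, O = (0,), (1,), (2,)

# Left-hand side: I(W;(M,O)) - I(W;M).
gain = mutual_information(W, M + O) - mutual_information(W, M)

# Right-hand side: E[log Pr(O|W) - log Pr(O|M)].
p_mo, p_m = marginal(M + O), marginal(M)
expected_surprise_gap = sum(
    p * (log2(p_o_given_w[w][o]) - log2(p_mo[(m, o)] / p_m[(m,)]))
    for (w, m, o), p in joint.items())

print(gain, expected_surprise_gap)
```

The two printed numbers should agree up to floating-point error; changing the toy probabilities changes the size of the gain but not the agreement.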

Comment by royf on How to Disentangle the Past and the Future · 2013-01-04T07:44:53.631Z · score: 1 (1 votes) · LW · GW

I explained this in my non-standard introduction to reinforcement learning.

We can define the world as having the Markov property, i.e. as a Markov process. But when we split the world into an agent and its environment, we lose the Markov property for each of them separately.

I'm using non-standard notation and terminology because they are needed for the theory I'm developing in these posts. In future posts I'll try to link more to the handful of researchers who do publish on this theory. I did publish one post relating the terminology I'm using to more standard research.

Comment by royf on How to Disentangle the Past and the Future · 2013-01-03T05:53:47.019Z · score: 1 (1 votes) · LW · GW

Fixed. Thanks!

Comment by royf on Conservation of Expected Evidence · 2012-10-08T21:43:09.646Z · score: 0 (0 votes) · LW · GW

Let's assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one that has the maximum entropy, among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for "narrowing" our belief.

The upside of it is that it tells us something useful about the future. Of course, not all information regarding the world is relevant for future observations. The part that doesn't help control our anticipation is failing to pay rent, and should be evacuated. The part that does inform us about the future may be useful enough to be worth the cost we pay in taking in new information.

I'll expand on all of this in my sequence on reinforcement learning.

Comment by royf on Conservation of Expected Evidence · 2012-10-08T20:33:11.451Z · score: 1 (1 votes) · LW · GW

You're not really wrong. The thing is that "Occam's razor" is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.

P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).

Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than in the smaller one.

More generally, if deciding that model 1 is true gives you more information than deciding that model 2 is true, that means that the maximum entropy given model 1 is lower than that given model 2, which in turn means (under the maximum entropy principle) that model 1 was a-priori less likely.
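A minimal sketch of that relationship, under the simplifying assumption of a uniform (maximum-entropy) prior over a finite set of equally likely possibilities. The number of possibilities and the two models are hypothetical. A model that leaves fewer possibilities open has lower maximum entropy, carries more information, and gets a smaller prior.

```python
from math import log2

N_STATES = 64  # hypothetical number of equally likely possibilities

def summarize(n_consistent):
    """Return (prior probability, remaining max entropy, information gained),
    with entropies in bits, for a model consistent with n_consistent states."""
    prior = n_consistent / N_STATES
    max_entropy = log2(n_consistent)
    information = log2(N_STATES) - max_entropy
    return prior, max_entropy, information

model_1 = summarize(4)    # restrictive model: few possibilities remain
model_2 = summarize(16)   # permissive model: many possibilities remain
print(model_1, model_2)
```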

Anyway, this is all beside the discussion that inspired my previous comment. My point was that even without Popper and Jaynes to enlighten us, science was making progress using other methods of rationality, among which is a myriad of non-Bayesian interpretations of Occam's razor.

Comment by royf on Internal Availability · 2012-10-08T18:38:08.198Z · score: 0 (0 votes) · LW · GW

The ease with which images, events and concepts come to mind is correlated with how frequently they have been observed, which in turn is correlated with how likely they are to happen again.

Yes, and I was trying to make this description one level more concrete.

Things never happen the exact same way twice. The way that past observations are correlated with what may happen again is complicated - in a way, that's exactly what "concepts" capture.

So we don't just recall something that happened and predict that it will happen again. Rather, we compose a prediction based on an integration of bits and patches from past experiences. Recalling these bits and patches as relevant for the context of the prediction - and of each other - is a complicated task, and I propose that an "internal availability" mechanism is needed to perform it.

Comment by royf on Internal Availability · 2012-10-08T09:09:41.188Z · score: 1 (1 votes) · LW · GW

Take for example your analysis of the poker hand I partially described. You give 3 possibilities for what the truth of it may be. Are there any other possibilities? Maybe the player is bluffing to gain the reputation of a bluffer? Maybe she mistook a 4 for an ace (it happened to me once...)? Maybe aliens hijacked her brain?

It would be impossible to enumerate or notice all the possibilities, but fortunately we don't have to. We make only the most likely and important ones available.

Comment by royf on Internal Availability · 2012-10-08T08:35:05.682Z · score: 1 (1 votes) · LW · GW

I was trying to give a specific reason that the availability heuristic is there: it's coupled with another mechanism that actually generates the availability; and then to say a few things about this other mechanism.

Does anyone have specific advice on how I could convey this better?

Comment by royf on The Bayesian Agent · 2012-09-20T23:37:44.113Z · score: 0 (4 votes) · LW · GW

Imagine a bowl of jellybeans. [...]

Allow me to suggest a simpler thought experiment, that hopefully captures the essence of yours, and shows why your interpretation (of the correct math) is incorrect.

There are 100 recording studios, each recording each day with probability 0.5. Everybody knows that.

There's a red light outside each studio to signal that a session is taking place that day, except for one rogue studio, where the signal is reversed, being off when there's a session and on when there isn't. Only persons B and C know that.

A, B and C are standing at the door of a studio, but only C knows that it's the rogue one. How do their beliefs that there's a session inside change by observing that the red light is on? A keeps the 50-50. B now thinks it's 99-1. Only C knows that there's no session.

So your interpretation, as I understand it, would be to say that A and B updated in the "wrong direction". But wait! I practically gave you the same prior information that C has - of course you agree with her! Let's rewrite the last paragraph:

A, B and C are standing at the door of a studio. For some obscure reason, C secretly believes that it's the rogue one. Wouldn't you now agree with B?

And now I can do the same for A, by not revealing to you, the reader, the significance of the red lights. My point is that as long as someone runs a Bayesian update, you can't call that the "wrong direction". Maybe they now believe in things that you judge less likely, based on the information that you have, but that doesn't make you right and them wrong. Reality makes them right or wrong; unfortunately, there's no one around who knows reality in any other way than through their subjective information-revealing observations.
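The three updates in the studio example can be sketched as follows. The function name is made up, and treating A as simply not updating (because A doesn't know what the light means) is my modelling assumption. Each observer applies Bayes' theorem to the same observation, and they differ only in how likely they think it is that this studio is the rogue one.

```python
from fractions import Fraction

N_STUDIOS = 100
P_SESSION = Fraction(1, 2)  # each studio records on a given day with probability 0.5

def p_session_given_light_on(p_rogue):
    """Posterior probability of a session given that the red light is on,
    for an observer who thinks this studio is rogue with probability p_rogue."""
    # Normal studio: light on iff session. Rogue studio: light on iff no session.
    on_and_session = (1 - p_rogue) * P_SESSION
    on_and_no_session = p_rogue * (1 - P_SESSION)
    return on_and_session / (on_and_session + on_and_no_session)

belief_a = P_SESSION  # A: the light carries no information, so no update
belief_b = p_session_given_light_on(Fraction(1, N_STUDIOS))  # B: one rogue studio, unknown which
belief_c = p_session_given_light_on(Fraction(1))             # C: knows this is the rogue one
print(belief_a, belief_b, belief_c)
```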

Comment by royf on Less Wrong Polls in Comments · 2012-09-20T17:03:05.909Z · score: 5 (5 votes) · LW · GW

To anyone thinking this is not random, with 42 votes in:

• The p-value is 0.895 (this is the probability of seeing at least this much non-randomness, assuming a uniform distribution)

• The entropy is 2.302 bits instead of log2(5) = 2.322 bits, for a KL-distance of 0.02 bits (this is the number of bits you lose for encoding one of these votes as if it were random)

If you think you see a pattern here, you should either see a doctor or a statistician.
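For reference, a sketch of how the entropy and KL figures are computed. The tally below is made up (it only matches the real poll in having 42 votes over 5 options), so the printed numbers are not the ones quoted above. The p-value calculation is not covered here.

```python
from math import log2

def entropy_bits(counts):
    """Empirical entropy, in bits, of a list of vote counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def kl_from_uniform_bits(counts):
    """KL divergence of the empirical distribution from uniform, in bits.
    For a uniform reference this is log2(number of options) minus the entropy."""
    return log2(len(counts)) - entropy_bits(counts)

hypothetical_counts = [9, 7, 10, 8, 8]  # made-up tally of 42 votes, NOT the real poll
print(entropy_bits(hypothetical_counts), kl_from_uniform_bits(hypothetical_counts))
```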

Comment by royf on The Bayesian Agent · 2012-09-18T18:30:53.187Z · score: 1 (3 votes) · LW · GW

It is perfectly legal under the bayes to learn nothing from your observations.

Right, in degenerate cases, when there's nothing to be learned, the two extremes of learning nothing and everything coincide.

Or learn in the wrong direction, or sideways, or whatever.

To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?

There is no unique "Bayesian belief".

If you mean to say that there's no unique justifiable prior, I agree. The prior in our setting is basically what you assume you know about the dynamics of the system - see my reply to RichardKennaway.

However, given that prior and the agent's observations, there is a unique Bayesian belief, the one I defined above. That's pretty much the whole point of Bayesianism, the existence of a subjectively objective probability.

If you had the "right" prior, you would find that would have to do very little updating, because the right prior is already right.

This is true in a constant world, or with regard to parts of the world which are constant. And mind you, it's true only with high probability: there's always the slight chance that the sky is not, after all, blue.

But in a changing world, where part of the change is revealed to you through new observations, you have to keep pace. The right prior was right yesterday, today there's new stuff to know.

Comment by royf on The Bayesian Agent · 2012-09-18T18:09:59.897Z · score: 3 (3 votes) · LW · GW

Everything you say is essentially true.

As the designer of the agent, will you be explicitly providing it with that information in some future instalment?

Technically, we don't need to provide the agent with p and sigma explicitly. We use these parameters when we build the agent's memory update scheme, but the agent is not necessarily "aware" of the values of the parameters from inside the algorithm.

Let's take for example an autonomous rover on Mars. The gravity on Mars is known at the time of design, so the rover's software, and even hardware, is built to operate under these dynamics. The wind velocity at the time and place of landing, on the other hand, is unknown. The rover may need to take measurements to determine this parameter, and encode it in its memory, before it can take it into account in choosing further actions.

But if we are thoroughly Bayesian, then something is known about the wind prior to experience. Is it likely to change every 5 minutes or can the rover wait longer before measuring again? What should be the operational range of the instruments? And so on. In this case we would include this prior in p, while the actual wind velocity is instead hidden in the world state (only to be observed occasionally and partially).

Ultimately, we could include all of physics in our belief - there's always some Einstein to tell us that Newtonian physics is wrong. The problem is that a large belief space makes learning harder. This is why most humans struggle with intuitive understanding of relativity or quantum mechanics - our brains are not made to represent this part of the belief space.

This is also why reinforcement learning gives special treatment to the case where there are unknown but unchanging parameters of the world dynamics: the "unknown" part makes the belief space large enough to make special algorithms necessary, while the "unchanging" part makes these algorithms possible.

For LaTeX instructions, click "Show help" and then "More Help" (or go here).

Comment by royf on The Bayesian Agent · 2012-09-18T03:12:21.025Z · score: 2 (4 votes) · LW · GW

If you're a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here's how:

$B_{t+1}(W_{t+1})=\Pr(W_{t+1}|O_1,\ldots,O_{t+1})=\frac{\sigma(O_{t+1}|W_{t+1})\cdot\Pr(W_{t+1}|O_1,\ldots,O_t)}{\sum_w\sigma(O_{t+1}|w)\cdot\Pr(w|O_1,\ldots,O_t)}$

As usual with Bayes' theorem, we only need to calculate the numerator for different values of $W_{t+1}$, and the denominator will normalize them to sum to 1, as probabilities do. We know $\sigma$ as part of the dynamics of the system, so we only need $\Pr(W_{t+1}|O_1,\ldots,O_t)$. This can be calculated by introducing the other variables in the process:

$\Pr(W_{t+1}|O_1,\ldots,O_t)=\sum_{W_t,A_t}\Pr(W_t,A_t,W_{t+1}|O_1,\ldots,O_t)$

An important thing to notice is that, given the observable history, the world state $W_t$ and the action $A_t$ are independent - the agent can't act on unseen information. We continue:

$=\sum_{W_t,A_t}\Pr(W_t|O_1,\ldots,O_t)\cdot\Pr(A_t|O_1,\ldots,O_t)\cdot p(W_{t+1}|W_t,A_t)$

Recall that the agent's belief $B_t$ is a function of the observable history, and that the action only depends on the observable history through its memory $B_t$. We conclude:

$=\sum_{W_t,A_t}B_t(W_t)\cdot\pi(A_t|B_t)\cdot p(W_{t+1}|W_t,A_t)$
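Here is a minimal sketch of one step of this update for finite state, action and observation sets. The function name, the dictionary layout, and the two-state weather example are all hypothetical. The first half computes the prediction (the sum over the previous world state and action), and the second half multiplies by the observation likelihood and normalizes.

```python
def update_belief(belief, action_probs, transition, sigma, observation):
    """One step of the belief update.

    belief[w]             -- B_t(w)
    action_probs[a]       -- pi(a | B_t)
    transition[w][a][w2]  -- p(w2 | w, a)
    sigma[w2][o]          -- sigma(o | w2)
    """
    states = list(belief)
    # Prediction: Pr(W_{t+1} | O_1..O_t) = sum_{w,a} B_t(w) * pi(a|B_t) * p(w2|w,a)
    predicted = {
        w2: sum(belief[w] * action_probs[a] * transition[w][a][w2]
                for w in states for a in action_probs)
        for w2 in states
    }
    # Correction: weight by the likelihood of the new observation, then normalize.
    unnormalized = {w2: sigma[w2][observation] * predicted[w2] for w2 in states}
    total = sum(unnormalized.values())
    return {w2: v / total for w2, v in unnormalized.items()}

# Hypothetical two-state world (all numbers are made up; here the actions
# happen not to affect the dynamics, which keeps the arithmetic easy to follow).
belief = {"sunny": 0.5, "rainy": 0.5}
action_probs = {"wait": 0.5, "look": 0.5}
transition = {
    "sunny": {"wait": {"sunny": 0.9, "rainy": 0.1}, "look": {"sunny": 0.9, "rainy": 0.1}},
    "rainy": {"wait": {"sunny": 0.3, "rainy": 0.7}, "look": {"sunny": 0.3, "rainy": 0.7}},
}
sigma = {"sunny": {"dry": 0.8, "wet": 0.2}, "rainy": {"dry": 0.1, "wet": 0.9}}

new_belief = update_belief(belief, action_probs, transition, sigma, "wet")
print(new_belief)
```

In this toy case the prediction step gives 0.6 / 0.4 for sunny / rainy, and observing "wet" moves the belief to 0.25 / 0.75.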

Comment by royf on Argument Screens Off Authority · 2012-08-23T05:16:25.374Z · score: 1 (3 votes) · LW · GW

p(H|E1,E2) [...] is simply not something you can calculate in probability theory from the information given [i.e. p(H|E1) and p(H|E2)].

Jaynes would disapprove.

You continue to give more information, namely that p(H|E1,E2) = p(H|E1). Thanks, that reduces our uncertainty about p(H|E1,E2).

But we are hardly helpless without it. Whatever happened to the Maximum Entropy Principle? Incidentally, the maximum entropy distribution (given the initial information) does have E1 and E2 independent. If your intuition says this before having more information, it is good.

Comment by royf on Reinforcement, Preference and Utility · 2012-08-09T15:06:55.724Z · score: 5 (7 votes) · LW · GW

Clearly you have some password I'm supposed to guess.

This post is not preliminary. It's supposed to be interesting in itself. If it's not, then I'm doing something wrong, and would appreciate constructive criticism.

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 2) · 2012-08-04T01:20:33.079Z · score: 0 (0 votes) · LW · GW

That's an excellent point. Of course one cannot introduce RL without talking about the reward signal, and I've never intended to.

To me, however, the defining feature of RL is the structure of the solution space, described in this post. To you, it's the existence of a reward signal. I'm not sure that debating this difference of opinion is the best use of our time at this point. I do hope to share my reasons in future posts, if only because they should be interesting in themselves.

As for your last point: RL is indeed a very general setting, and classical planning can easily be formulated in RL terms.

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 2) · 2012-08-03T22:02:29.169Z · score: 1 (3 votes) · LW · GW

I'm not sure why you say this.

Please remember that this introduction is non-standard, so you may need to be an expert on standard RL to see the connection. And while some parts are not in place yet, this post does introduce what I consider to be the most important part of the setting of RL.

So I hope we're not arguing over definitions here. If you expand on your meaning of the term, I may be able to help you see the connection. Or we may possibly find that we use the same term for different things altogether.

I should also explain why I'm giving a non-standard introduction, where a standard one would be more helpful in communicating with others who may know it. The main reason is that this will hopefully allow me to describe some non-standard and very interesting conclusions.

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 1) · 2012-07-30T14:51:00.056Z · score: 1 (3 votes) · LW · GW

I internally debated this question myself. Ideally, I'd completely agree with you. But I needed the shorter publishing and feedback cycle for a number of reasons. Sorry, but a longer one may not have happened at all.

Edit: future readers will have the benefit of a link to part 2

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 1) · 2012-07-30T06:28:33.134Z · score: 0 (0 votes) · LW · GW

In the model there's the distribution p, which determines how the world is changing. In the chess example this would include: a) how the agent's action changes the state of the game + b) some distribution we assume (but which we may or may not actually know) about the opponent's action and the resulting state of the game. In a physics example, p should include the relevant laws of physics, together with constants which tell the rate (and manner) in which the world is changing. Any changing parameters should be part of the state.

It seems that you're saying that it may be difficult to know what p is. Then you are very much correct. You probably couldn't infer the laws of physics from the current wave function of the universe, or the rules of chess from the current state of the game. But at this point we're only assuming that such laws exist, not that we know how to learn them.

p and q are probability distributions, which is where we allow for randomness in the process. But note that randomness becomes a tricky concept if you go deep enough into physics.

As for the "quantum mind" theory, as far as I can tell it's fringe science at best. Personally, I'm very skeptical. Regardless, such a model can still have the Markov property, if you include the wave function in your state.

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 1) · 2012-07-28T22:52:03.450Z · score: 0 (0 votes) · LW · GW

There's supposed to be some way to do so partially, if anyone knows what it is.

This should work in Markdown, but it seems broken :(

Edit: t̶e̶s̶t̶ Thanks, Vincent, it works!

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 1) · 2012-07-28T22:40:57.498Z · score: 0 (0 votes) · LW · GW

And how do you strikeout your comment?

Comment by royf on Reinforcement Learning: A Non-Standard Introduction (Part 1) · 2012-07-28T22:31:52.667Z · score: 0 (0 votes) · LW · GW

I'm not sure what you mean. It looks fine to me, and I can't find where to check / change such a setting.

Edit:

Very strange. Fixed, I hope.

Thanks!

Comment by royf on The Perception-Action Cycle · 2012-07-28T20:03:32.899Z · score: 0 (0 votes) · LW · GW

You are expressing a number of misconceptions here. I may address some in future posts, but in short:

By the action of powering the electromagnet you are not increasing your information on the state of the world. You are increasing your information on the state of the coin, but through making it dependent on the state of the electromagnet which you already knew. This point is clearly worth a future post.

There is no "entropy in environment". Entropy is subjective to the viewer.

Comment by royf on The Perception-Action Cycle · 2012-07-24T20:54:18.173Z · score: 0 (0 votes) · LW · GW

I realize now that an example would be helpful, and yours is a good one.

Any process can be described on different levels. The trick is to find a level of description that is useful. We make an explicit effort to model actions and observation so as to separate the two directions of information flow between the agent and the environment. Actions are purely "active" (no information is received by the agent) while observations are purely "passive" (no information is sent by the agent). We do this because these two aspects of the process have very different properties, as I hope to make clear in future posts.

The process of "figuring out where the table is" involves information flowing in both directions, and so is neither an action nor an observation. Some researchers call such a thing "a subgoal". We should break it down further; for example, we could have taps as actions and echoes as observations, as you suggest.

If you want to argue that no information is lost by tapping, then fine, I won't be pedantic and point out the tiny bits of information that do get lost. The point is that some information being lost is a characteristic feature of the process of taking an action. Over time, if you don't take in new information through observations, you will have less and less information about the world, even if some actions you take have a high probability of not losing any information.

Comment by royf on The Perception-Action Cycle · 2012-07-23T15:07:05.006Z · score: 3 (3 votes) · LW · GW

Excellent point. It will be a few posts (if the audience is interested) before I can answer you in a way that is both intuitive and fully convincing.

The technical answer is that the belief update caused by an action is deterministically contractive. It never increases the amount of information.
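
One way to read "contractive", sketched here under assumptions: suppose an action updates a belief by pushing it through a row-stochastic transition matrix. Then any two beliefs can only move closer together, both in total variation and in KL divergence (the data processing inequality). The matrix `T` and the beliefs `b1`, `b2` are hypothetical numbers, and this may not be the exact formalization intended here.

```python
import math


def push(belief, transition):
    """Belief after an action: new[j] = sum_i belief[i] * transition[i][j]."""
    n_next = len(transition[0])
    return [
        sum(belief[i] * transition[i][j] for i in range(len(belief)))
        for j in range(n_next)
    ]


def kl(p, q):
    """KL divergence D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical transition matrix for one action (each row sums to 1).
T = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]

# Two different beliefs about the same world state.
b1 = [0.9, 0.05, 0.05]
b2 = [0.1, 0.2, 0.7]

b1_next = push(b1, T)
b2_next = push(b2, T)

if __name__ == "__main__":
    print("TV before:", total_variation(b1, b2), "after:", total_variation(b1_next, b2_next))
    print("KL before:", kl(b1, b2), "after:", kl(b1_next, b2_next))
```

Without an observation in between, repeating `push` can only keep the two beliefs as close as they are or bring them closer.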

A more intuitive answer (but perhaps not yet convincing) is that, proximally, your action of asking your friend did not change the location of your laptop, only your friend's mental state. And the effect it had on your friend's mental state is that you are now less sure of what it is. You took some of the things you knew about it ("she was probably going to keep sitting around") and made them no longer true.

Regarding information about the future (of the location of your laptop), it is always included in information about the present. Your own mental state is independent of the future given the present. Put another way, you can't update without more observations. In this case, the location of your laptop merely becomes entangled with information that you already had on your friend's mental state ("this is where she thinks my desk is").

Comment by royf on Mutual Information, and Density in Thingspace · 2012-07-19T22:43:20.337Z · score: 0 (0 votes) · LW · GW

Having a word [...] is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like "red" [...]).

Remember that mutual information is symmetric. If some things have the property of being red, then "red" has the property of being a property of those things. Saying "blood is red" is really saying "remember that visual experience that you get when you look at certain roses, apples, peppers, lipsticks and English buses and phone booths? The same happens with blood." If I give you the list above, can you find ("infer") more red things? Then "red" is a good word.

But do note that this is a dual sense to the one in which "human" is a good word. Most of the properties of humans are statistically necessary for being human: remove any one of them, and the thing is much less likely to be human. "Human" is a good word because these properties are positively correlated. On the other hand, most of the red things are statistically sufficient for being red: take any one of them, and the thing is much more likely to be red. "Red" is a good word because these things are negatively correlated - they are a bunch of distinct things with a shared aspect.
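
The symmetry of mutual information can be checked numerically. Below is a minimal sketch that computes I(X;Y) from a joint probability table and confirms that swapping the roles of X and Y gives the same value. The table (being an apple vs. being red) uses hypothetical numbers chosen only for illustration.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits, where joint[x][y] is the probability of (x, y)."""
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )


# Hypothetical joint distribution.
# Rows: thing is an apple / is not an apple. Columns: red / not red.
joint = [
    [0.08, 0.02],
    [0.12, 0.78],
]

# Swapping the roles of the two variables is just transposing the table.
transposed = [list(col) for col in zip(*joint)]

if __name__ == "__main__":
    print("I(apple; red) =", mutual_information(joint))
    print("I(red; apple) =", mutual_information(transposed))
```

Whatever "apple" tells you about "red", "red" tells you exactly as much about "apple".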

Comment by royf on My Wild and Reckless Youth · 2012-06-15T18:44:10.046Z · score: 2 (2 votes) · LW · GW

The Harsanyi paper is very enlightening, but he's not really arguing that people have shared priors. Rather, he's making the following points (section 14):

• It is worthwhile for an agent to analyze the game as if all agents have the same prior, because it simplifies the analysis. In particular, the game (from that agent's point of view) then becomes equivalent to a Bayesian complete-information game with private observations.

• The same-prior assumption is less restrictive than it may seem, because agents can still have private observations.

• A wide family of hypothetical scenarios can be analyzed as if all agents have the same prior. Other scenarios can be easily approximated by a member of this family (though the quality of the approximation is not studied).

All of this is mathematically very pleasing, but it doesn't change my point. That's mainly because in the context of the Harsanyi paper "prior" means before any observation, and in the context of this post "prior" means before the shared observation (but possibly after private observations).

Comment by royf on My Wild and Reckless Youth · 2012-06-14T06:55:35.915Z · score: 1 (1 votes) · LW · GW

I'm aware of this result. It specifically requires the two Bayesians to have the same prior. My point is exactly that this doesn't have to be the case, and in reality is sometimes not the case.

EDIT: The original paper by Aumann references a paper by Harsanyi which supposedly addresses my point. Aumann himself is careful in interpreting his result as supporting my point (since evidently there are people who disagree despite trusting each other). I'll report here my understanding of the Harsanyi paper once I get past the paywall.

Comment by royf on My Wild and Reckless Youth · 2012-06-14T06:36:49.117Z · score: 2 (2 votes) · LW · GW

Traditional Rationalists can agree to disagree. Traditional Rationality doesn't have the ideal that thinking is an exact art in which there is only one correct probability estimate given the evidence.

This is also true of Bayesians. The probability estimate given the evidence is a property of the map, not the territory (hence "estimate"). One correct posterior implies one correct prior. What is this "Ultimate Prior"? There isn't one.

Possibly, you meant that there's one correct posterior given the evidence and the prior. That's correct, but it doesn't prevent Bayesians from disagreeing, because they do have different priors.

Alternatively, one can point out that the "given evidence" operator is, in expectation, always non-expansive, and contractive when the priors disagree. This means that the beliefs of Perfect Bayesians with shared observations converge (with probability 1) into a single posterior. But this convergence is too slow for humans. Agreeing to disagree is sometimes our only option.
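
A minimal sketch of this convergence, under stated assumptions: two Bayesians (the hypothetical `alice` and `bob`) hold very different priors over three candidate coin biases, then update on the same fixed sequence of flips. The hypotheses, priors and flip sequence are all made up for illustration, and the sequence is deterministic so that the run is reproducible. The argument above is about expectations over random observations.

```python
def update(belief, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypotheses: the coin's probability of heads.
BIASES = [0.3, 0.5, 0.7]

# Two strongly disagreeing priors over the same hypotheses.
alice = [0.8, 0.1, 0.1]
bob = [0.1, 0.1, 0.8]

# Shared observations: 1 = heads, 0 = tails (7 heads in every 10 flips).
flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0] * 20

initial_gap = total_variation(alice, bob)
for flip in flips:
    likelihoods = [b if flip else 1 - b for b in BIASES]
    alice = update(alice, likelihoods)
    bob = update(bob, likelihoods)
final_gap = total_variation(alice, bob)

if __name__ == "__main__":
    print("gap before:", initial_gap, "gap after:", final_gap)
```

The gap closes here only because the toy problem has three hypotheses and 200 shared flips. With richer hypothesis spaces and scarcer shared evidence, the same process can take far longer.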

Incidentally, it's Traditional Rationalists who believed they should never agree to disagree: the set of hypotheses which aren't "ruled out" by confirmed and repeatable experiments, they argued, is a property of the territory.

Comment by royf on Fake Causality · 2012-06-11T02:43:48.522Z · score: 0 (0 votes) · LW · GW

A GAI with the utility of burning itself? I don't think that's viable, no.

What do you mean by "viable"?

Intelligence is expensive. More intelligence costs more to obtain and maintain. But the sentiment around here (and this time I agree) seems to be that intelligence "scales", i.e. that it doesn't suffer from diminishing returns in the "middle world" like most other things; hence the singularity.

For that to be true, more intelligence also has to be more rewarding. But not just in the sense of asymptotically approaching optimality. As intelligence increases, it has to constantly find new "revenue streams" for its utility. It must not saturate its utility function; in fact, its utility must be insatiable in the "middle world". A good example is curiosity, which is probably why many biological agents are curious even when it serves no other purpose.

Suicide is not such a utility function. We can increase the degree of intelligence an agent needs to have to successfully kill itself (for example, by keeping the gun away). But in the end, it's "all or nothing".

But anyway it can't be that Godelian reasons prevent intelligences from wanting to burn themselves, because people have burned themselves.

Gödel's theorem doesn't prevent any specific thing. In this case I was referring to information-theoretic reasons. And indeed, suicide is not a typical human behavior, even without considering that some contributing factors are irrelevant for our discussion.

Do you count the more restrictive technology by which humans operate as a constraint which artificial agents may be free of?

Why not? Though of course it may turn out that AI is best programmed on something unlike our current computer technology.

In that sense, I completely agree with you. I usually don't like making the technology distinction, because I believe there's more important stuff going on in higher levels of abstraction. But if that's where you're coming from then I guess we have resolved our differences :)

Comment by royf on Fake Causality · 2012-06-07T04:09:37.226Z · score: 0 (0 votes) · LW · GW

Not at all. If you insist, let's take it from the top:

• I wanted to convey my reasoning, let's call it R.

• I quoted a claim of the form "because P is true, Q is true", where R is essentially "if P then Q". This was a rhetorical device, to help me convey what R is.

• I indicated clearly that I don't know whether P or Q are true. Later I said that I suspect P is false.

• Note that my reasoning is, in principle, falsifiable: if P is true and Q is false, then R must be false.

• While Q may be relatively easy to check, I think P is not.

• I expect to have other means of proving R.

• I feel that I'm allowed to focus on conveying R first, and attempting to prove or falsify it at a later date. The need to clarify my ideas helped me understand them better, in preparation for a future proof.

• I stated clearly and repeatedly that I'm just conveying an idea here, not providing evidence for it, and that I agree with readers who choose to doubt it until shown evidence.

Do you still think I'm at fault here?

EDIT: Your main objection to my presentation was that Q could be false. Would you like to revise that objection?

Comment by royf on Fake Causality · 2012-06-07T03:46:23.897Z · score: 0 (0 votes) · LW · GW

I'll try to remember that, if only for the reason that some people don't seem to understand contexts in which the truth value of a statement is unimportant.

Comment by royf on Fake Causality · 2012-06-06T22:19:41.056Z · score: 0 (0 votes) · LW · GW

a GAI with [overwriting its own code with an arbitrary value] as its only goal, for example, why would that be impossible? An AI doesn't need to value survival.

A GAI with the utility of burning itself? I don't think that's viable, no.

I'd be interested in the conclusions derived about "typical" intelligences and the "forbidden actions", but I don't see how you have derived them.

At the moment it's little more than professional intuition. We also lack some necessary shared terminology. Let's leave it at that until and unless someone formalizes and proves it, and then hopefully blogs about it.