Yep. If they do acausal trade with each other.
Some thoughts on that idea: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysingdangerousmessagesfromfutureufaiviaoracles
Some thoughts on this idea, thanks for it: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysingdangerousmessagesfromfutureufaiviaoracles
I consider wireheading to be a special case of proxy alignment in a mesa-optimiser.
I agree. I've now added this line, which I thought I'd put in the original post, but apparently missed out:
Note, though, that the converse is true: every example of wireheading is a Goodhart curse.
But really, what's the purpose of trying to distinguish wireheading from other forms of reward hacking?
Because mitigations for different failure modes might not be the same, depending on the circumstances.
Where "measurement channel" not just one specific channel, but anything that has the properties of a measurement channel.
Indeed. It might be possible to construct that complex bias function, from the policy, in a simple way. But that claim needs to be supported, and the fact that it hasn't been found so far (I repeat that it has to be simple) is evidence against it.
Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias facts?
Yes.
Then what is the "unzip" function?
The "shortest algorithm generating BLAH" is the maximally compressed way of expressing BLAH: the "zipped" version of BLAH.
Ignoring unzip, which isn't very relevant, we know that the degenerate pairs are just above the policy in complexity.
So zip(degenerate pair) ≈ zip(policy), while zip(reasonable pair) > zip(policy + complex bias facts) (and zip(policy + complex bias facts) > zip(policy)).
Does that help?
I'm not sure the physics analogy is getting us very far; I feel there is a very natural way of decomposing physics into laws + initial conditions, while there is no such natural way of doing so for preferences and rationality. But if we have different intuitions on that, then discussing the analogy isn't going to help us converge!
So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy.
Agreed (though the extra information may be tiny: a few extra symbols).
By analogous reasoning, every algorithm for constructing the policy contains more information than the policy.
That does not follow; the simplest algorithm for building a policy does not go via decomposing into two pieces and then recombining them. We are comparing algorithms that produce a planner-reward pair (two outputs) with algorithms that produce a policy (one output). (But your whole argument suggests you may be slightly misunderstanding complexity in this context.)
Now, though all pairs are slightly more complex than the policy itself, the bias argument shows that the "proper" pair is considerably more complex. To use an analogy: suppose file1 and file2 are both maximally zipped files. When you unzip file1, you produce image1 (and maybe a small, blank, image2). When you unzip file2, you also produce the same image1, and a large, complex, image2'. Then, as long as image1 and image2' are at least slightly independent, file2 has to be larger than file1. The more complex image2' is, and the more independent it is from image1, the larger file2 has to be.
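The file analogy can be made concrete with a toy check in Python, using zlib as the "zip" and random bytes standing in for the images (the names and sizes here are illustrative assumptions, not from the analogy itself):

```python
import os
import zlib

# Toy stand-ins: image1 plays the role of the shared content,
# image2_blank the small blank image, image2_prime the large,
# complex, independent image.
image1 = os.urandom(10_000)        # random bytes: already incompressible
image2_blank = bytes(10_000)       # all zeros: compresses to almost nothing
image2_prime = os.urandom(10_000)  # complex and independent of image1

file1 = zlib.compress(image1 + image2_blank, 9)   # "zips" to roughly |image1|
file2 = zlib.compress(image1 + image2_prime, 9)   # must also encode image2_prime

# file2 is necessarily larger, and the gap grows with image2_prime's
# complexity and its independence from image1.
assert len(file2) > len(file1)
```

Since image2_prime is fully independent of image1 here, file2 ends up roughly twice the size of file1.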
Does that make sense?
Yep, those are the two levels I mentioned :)
But I like your phrasing.
You can't get too much work from a single bit of information ^_^
Hey there!
Responding to a few points. But first, I want to make the point that treating an agent as a (p,R) pair is basically an intentional stance. We choose to treat the agent that way, either for ease of predicting its actions (Dennett's approach) or for extracting its preferences, to satisfy them (my approach). The decomposition is not a natural fact about the world.
If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.
No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent to trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't, we'd still have to make decisions).
(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).
My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.
Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.
For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999 or we can write "20 digits of 9's". The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."
That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.
Anyhow, don't the degenerate pairs also contain information about biases? For example, according to the policy-planner + empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done.
Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of 1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.
The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, and the three degenerate pairs are almost as simple as the policy. However, the "proper" pair can generate a complicated object, the bias function (which has a nontrivial value in almost every possible state). So the proper pair contains at least enough information to specify a) the human policy, and b) the bias function. The Kolmogorov complexity of the proper pair is thus at least that of the simplest algorithm that can generate both those objects.
So one of two things is happening: either the human policy can generate the bias function directly, in some simple way^{[1]}, or the proper pair is more complicated than the policy. The first is not impossible, but notice that it has to be "simple". So the fact that we have not yet found a way to generate the bias function from the policy is an argument that it can't be done. Certainly there are no elementary mathematical manipulations of the policy that produce anything suitable.
If it were true that Occam's Razor can't distinguish between (P,R) and (−P,−R), then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?
No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.
Let A be the simplest algorithm that generates the human policy, and B the simplest that generates the human policy and the bias function. If there are n different algorithms that generate the human policy and are of length B or shorter, then we need to add log2(n) bits of information to the human policy to generate B, and hence, the bias function. So if B is close in complexity to A, we don't need to add much. ↩︎
Thanks! Error corrected.
Hey there!
Thanks for this critique; I have, obviously, a few comments ^_^
In no particular order:

First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I'm coming from: https://www.youtube.com/watch?v=1M9CvESSeVc

A useful point from that: given human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it's hard for us to turn off our theory of mind, the decomposition will always feel simple to us. However, the human theory of mind suffers from Moravec's paradox: though the theory of mind seems simple to us, it is very hard to specify, especially into code.

You're entirely correct to decompose the argument into Step 1 and Step 2, and to point out that Step 1 has much stronger formal support than Step 2.

I'm not too worried about the degenerate pairs specifically; you can rule them all out with two bits of information. But, once you've done that, there will be other almost-as-degenerate pairs that fit the new information. To rule them out, you need to add more information... but by the time you've added all of that, you've essentially defined the "proper" pair by hand.

On speed priors: the standard argument applies for a speed prior, too (see Appendix A of our paper). It applies perfectly for the indifferent planner/zero reward, and applies, given an extra assumption, for the other two degenerate solutions.

Onto the physics analogy! First of all, I'm a bit puzzled by your claim that physicists don't know how to do this division. Now, we don't have a full theory of physics; however, all the physical theories I know of, have a very clear and known division between laws and initial conditions. So physicists do seem to know how to do this. And when we say that "it's very complex", this doesn't seem to mean the division into laws and initial conditions is complex, just that the initial conditions are complex (and maybe that the laws are not yet known).

The indifferent planner contains almost exactly the same amount of information as the policy. The "proper" pair, on the other hand, contains information such as whether the anchoring bias is a bias (it is) compared with whether paying more for better-tasting chocolates is a bias (it isn't). Basically, none of the degenerate pairs contain any bias information at all; so everything to do with human biases is extra information that comes along with the "proper" pair.

Even ignoring all that, the fact that (p,R) is of comparable complexity to (−p,−R) shows that Occam's razor cannot distinguish the proper pair from its negative.
Answered your comment there.
compatibility with all the evidence we have observed
That is the whole point of my research agenda: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/researchagendav09synthesisingahumanspreferencesinto
The problem is that the nonsubjective evidence does not map onto facts about the decomposition. A human claims X; well, that's a speech act; are they telling the truth or not, and how do we know? Same for sensory data, which is mainly data about the brain correlated with facts about the outside world; to interpret that, we need to solve human symbol grounding.
All these ideas are in the research agenda (especially section 2). Just as you need something to bridge the isought gap, you need some assumptions to make evidence in the world (eg speech acts) correspond to preferencerelevant facts.
This video may also illustrate the issues: https://www.youtube.com/watch?v=1M9CvESSeVc&t=1s
Goodhart's Law tells us that, in general, blindly maximizing the proxy has lower expected value than other methods that involve not doing that
This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.
Even with your stated sense of beauty, knowing "this measure can be manipulated in extreme circumstances" is much better than nothing.
And we probably know quite a bit more; I'll continue this investigation, adding more information.
As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane.
Because expectation is affine with respect to utility functions, this does divide the space by a plane.
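To spell out the affineness: for a fixed outcome distribution induced by a policy, W ↦ E(W|π) is linear in the utility function, so conditions comparing expectations carve the space of W's with a hyperplane. A minimal numeric check (the distribution and utilities below are arbitrary toy values, not from the post):

```python
# Expectation over a fixed outcome distribution is a linear map on
# utility functions, so level sets like E(W|pi1) = E(W|pi2) are hyperplanes
# in the space of W's. Toy check with 3 outcomes:
probs = [0.2, 0.5, 0.3]          # outcome distribution under some policy

def E(utility):
    return sum(p * u for p, u in zip(probs, utility))

U = [1.0, 0.0, 2.0]
V = [0.0, 3.0, 1.0]
a, b = 2.0, -1.5

combo = [a * u + b * v for u, v in zip(U, V)]
# Linearity: E(aU + bV) = a E(U) + b E(V)
assert abs(E(combo) - (a * E(U) + b * E(V))) < 1e-9
```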
Yes, there is a connection with the optimizer's curse style of reasoning.
You are entirely correct; I don't know why I was confused.
However, looking at the proof again, it seems there might be a potential hole. You use Löb's theorem within an assumption subloop. This seems to assume that from "", we can deduce "".
But this cannot be true in general! To see this, set . Then , trivially; if, from that, we could deduce , we would have for any . But this statement, though it looks like Löb's theorem, is one that we cannot deduce in general (see Eliezer's "medium-hard problem" here).
Can this hole be patched?
(note that if , where is a PA proof that adds A as an extra axiom, then we can deduce ).
If the agent's reasoning is sensible only under certain settings of the default action clause
That was my first rewriting; the second is an example of a more general algorithm that would go something like this. If we assume that both probabilities and utilities are discrete, all of the form q/n for some q, and bounded above and below by N, then something like this (for EU the expected utility, and Actions the set of actions, and b some default action):
for integer q from N*n^2 down to -N*n^2:
    for a in Actions:
        if A()=a ⊢ EU=q/n^2, then output a
if no proof was found for any (q, a), output b
Then the Löbian proof fails. The agent will fail to prove any of those "if" implications, until it proves "A()="not cross" ⊢ EU=0". Then it outputs "not cross"; the default action b is not relevant. Also not relevant, here, is the order in which a is sampled from "Actions".
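The loop can be made concrete with a stub in place of the provability check. Here `proves` is a hypothetical oracle that only "knows" the single provable fact of the scenario, so this is a sketch of the search order, not of real proof search:

```python
# A toy, runnable version of the agent loop above. The hypothetical
# proves(a, q) stub stands in for the provability check "A()=a ⊢ EU=q/n^2".
N, n = 10, 1                       # utility bound and denominator (toy values)
ACTIONS = ["cross", "not cross"]
DEFAULT = "not cross"              # the default action b

def proves(action, q):
    # Stub oracle: the only provable statement is A()="not cross" ⊢ EU=0.
    return action == "not cross" and q == 0

def act():
    # Scan candidate utilities from highest to lowest, as in the pseudocode;
    # the first provable (action, utility) pair determines the output.
    for q in range(N * n**2, -N * n**2 - 1, -1):
        for a in ACTIONS:
            if proves(a, q):
                return a
    return DEFAULT                 # no proof found for any pair

print(act())  # -> not cross
```

As described, every "if" fails until q reaches 0, at which point "not cross" is output and the default action b never comes into play.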
Interesting.
I have two issues with the reasoning as presented; the second one is more important.
First of all, I'm unsure about "Rather, the point is that the agent's "counterfactual" reasoning looks crazy." I think we don't know the agent's counterfactual reasoning. We know, by Löb's theorem, that "there exists a proof that (proof of L implies L)" implies "there exists a proof of L". It doesn't tell us what structure this proof of L has to take, right? Who knows what counterfactuals are being considered to make that proof? (I may be misunderstanding this).
Second of all, it seems that if we change the last line of the agent to [else, "cross"], the argument fails. Same if we insert [else if A()="cross" ⊢ U=10, then output "cross"; else if A()="not cross" ⊢ U=10, then output "not cross"] above the last line. In both cases, this is because U=10 is now possible, given crossing. I'm suspicious when the argument seems to depend so much on the structure of the agent.
To develop that a bit, it seems the agent's algorithm as written implies "If I cross the bridge, I am consistent" (because U=10 is not an option). If we modify the algorithm as I just suggested, then that's no longer the case; it can consider counterfactuals where it crosses the bridge and is inconsistent (or, at least, of unknown consistency). So, given that, the agent's counterfactual reasoning no longer seems so crazy, even if it's as claimed. That's because the agent's reasoning needs to deduce something from "If I cross the bridge, I am consistent" that it can't deduce without that. Given that statement, then being Löbian or similar seems quite natural, as those are some of the few ways of dealing with statements of that type.
Bayesian agents that knowingly disagree
A minor stub, caveating the Aumann's agreement theorem; put here to reference in future posts, if needed.
Aumann's agreement theorem states that rational agents with common knowledge of each other's beliefs cannot agree to disagree. If they exchange their estimates, they will swiftly come to an agreement.
However, that doesn't mean that agents cannot disagree; indeed they can disagree, and know that they disagree. For example, suppose that there are a thousand doors; behind 999 of these, there are goats, and behind one there is a flying aircraft carrier. The two agents are in separate rooms, and a host will go into each room and execute the following algorithm: they will choose a door at random among the 999 that contain a goat. And, with probability 99/100, they will tell that door number to the agent; with probability 1/100, they will tell the door number with the aircraft carrier.
Then each agent will have probability 1/100 of the named door being the aircraft carrier door, and probability (99/100)/999 ≈ 1/1009 on each of the other 999 doors; so the most likely door is the one named by the host.
We can modify the protocol so that the host will never name the same door to each agent (roll a D100; if it comes up 1, tell the truth to the first agent and lie to the second; if it comes up 2, do the opposite; anything else means tell a different lie to either agent). In that case, each agent will have a best guess for the aircraft carrier, and the certainty that the other agent's best guess is different.
If the agents exchanged information, they would swiftly converge on the same distribution; but until that happens, they disagree, and know that they disagree.
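A quick Monte Carlo sketch of the host protocol, assuming the host names the aircraft-carrier door with probability 1/100 (the truth-telling chance suggested by the D100 variant):

```python
import random

# Monte Carlo check: how often does the named door hide the carrier?
DOORS, TRUTH_P, TRIALS = 1000, 1 / 100, 50_000
named_is_carrier = 0

for _ in range(TRIALS):
    carrier = random.randrange(DOORS)
    if random.random() < TRUTH_P:
        named = carrier                      # host names the carrier door
    else:
        named = random.randrange(DOORS - 1)  # uniform goat door != carrier
        if named >= carrier:
            named += 1
    if named == carrier:
        named_is_carrier += 1

# The named door hides the carrier ~1% of the time, while each of the
# other 999 doors has probability (99/100)/999, just under 0.1%; so the
# named door is each agent's best single guess, though probably a goat.
print(named_is_carrier / TRIALS)  # ~ 0.01
```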
I like this analogy. Probably not best to put too much weight on it, but it has some insights.
And whether those programs could then perform well if their opponent forces them into a very unusual situation, such as would not have ever appeared in a chessmaster game.
If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing the trap? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML be able to deal with it?
PS: the other title I considered was "Why do people feel my result is wrong", which felt too condescending.
I agree we're not as good as we think we are. But there are a lot of things we do agree on, that seem trivial: eg "this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm". We have far, far, more agreement than random agents would.
Your title seems clickbaity
Hehe  I don't normally do this, but I feel I can indulge once ^_^
having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well
How do you know this?
Moravec's paradox again. Chessmasters didn't easily program chess programs; and those chess programs didn't generalise to games in general.
Should we turn this into one of those concrete ML experiments?
That would be good. I'm aiming to have a lot more practical experiments from my research project, and this could be one of them.
Hum... It seems that we can stratify here. Let represent the values of a collection of variables that we are uncertain about, and that we are stratifying on.
When we compute the normalising factor for utility under two policies and , we normally do it as:
 , with .
And then we replace with .
Instead we might normalise the utility separately for each value of :
 Conditional on , then , with .
The problem is that, since we're dividing by the , the expectation of is not the same .
Is there an obvious improvement on this?
Note that here, total utilitarianism get less weight in large universes, and more in small ones.
I'll think more...
How about a third AI that gives a (hidden) probability about which one you'll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI gives good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.
Er, this normalisation system may well solve that problem entirely. If prefers option (utility ), with second choice (utility ), and all the other options as third choice (utility ), then the expected utility of the random dictator is for all (as gives utility , and gives utility for all ), so the normalised weighted utility to maximise is:
 .
Using (because scaling doesn't change expected utility decisions), the utility of any , , is , while the utility of is . So if , the compromise option will get chosen.
Don't confuse the problems of the random dictator, with the problems of maximising the weighted sum of the normalisations that used the random dictator (and don't confuse the other way, either; the random dictator is immune to players' lying, this normalisation is not).
But community isn't about friends; it's about a background level of acquaintances you're comfortable with.
I finished the research agenda on constructing a preference utility function for any given human, and presented the ideas to CHAI and MIRI. Woot!
Something like or or in general (for decreasing, continuous ) could work, I think.
I'd say that intelligence variations are more visible in (modern) humans, not that they're necessarily larger.
Let's go back to the tribal environment. In that situation, humans want to mate, to dominate/be admired, to have food and shelter, and so on. Apart from a few people with mental defects, the variability in outcome is actually quite small: most humans won't get ostracised, many will have children, and only very few will rise to the top of the hierarchy (and even there, tribal environments are more egalitarian than most, so the leader is not that much different from the others). So we might say that the variation in human intelligence (or social intelligence) is low.
Fast forward to an agricultural empire, or to the modern world. Now the top minds can become god emperor, invading and sacking other civilizations, or can be part of projects that produce atom bombs and lunar rockets. The variability of outcomes is huge, and so the variability in intelligence appears to be much higher.
That's an excellent summary.
One might think just doing ontology doesn't involve making a preference choice, but making some preferences impossible to articulate is in fact a partial preference choice.
Yep, that's my argument: some (but not all) aspects of human preferences have to be included in the setup somehow.
it's more reasonable for a human to taste a salt level difference, it's more plausible to say "I couldn't know" about radioactivity
I hope you don't taste every bucket of water before putting it away! ^_^
In the later part of the post, it seems you're basically talking about entropy and similar concepts? And I agree that "reversible" is kinda like entropy, in that we want to be able to return to a "macrostate" that is considered indistinguishable from the starting macrostate (even if the details are different).
However, as in the bucket example above, the problem is that, for humans, what "counts" as the same macrostate can vary a lot. If we need a liquid, any liquid, then replacing the bucket's contents with purple-tinted alcohol is fine; if we're thinking of the bath water of the dear departed husband, then any change to the contents is irreversible. Human concepts of "acceptably similar" don't match up with entropic ones.
there needs to be an effect that counts as "significant".
Are you deferring this to human judgement of significant? If so, we agree  human judgement needs to be included in some way in the definition.
Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward.
Yep, I agree :)
I generally think that impact measures don't have to be valueagnostic, as long as they require less input about human preferences than the general value learning problem.
Then we are in full agreement :) I argue that low impact, corrigibility, and similar approaches, require some but not all of human preferences. "some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.
Good idea.
intrinsic motivation
That might be the concept I'm looking for. I'll think whether it covers exactly what I'm trying to say...
Ok, we strongly disagree on your simple constraints being enough. I'd need to see these constraints explicitly formulated before I had any confidence in them. I suspect (though I'm not certain) that the more explicit you make them, the more tricky you'll see that it is.
And no, I don't want to throw IRL out (this is an old post), I want to make it work. I got this big impossibility result, and now I want to get around it. This is my current plan: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/researchagendav09synthesisingahumanspreferencesinto
Very worthwhile concern, and I will think about it more.
We may not be disagreeing any more. Just to check, do you agree with both these statements:

Adding a few obvious constraints rules out many different R, including the ones in the OP.

Adding a few obvious constraints is not enough to get a safe or reasonable R.
I've added an edit to the post, to show the problem: sometimes, the robot can't kick the bucket, sometimes it must. And only human preferences distinguish these two cases. So, without knowing these preferences, how can it decide?
Rejecting any specific R is easy: one bit of information (at most) per specific R. So saying "humans have preferences, and they are not always rational or always anti-rational" rules out R(1), R(2), and R(3). Saying "this apparent preference is genuine" rules out R(4).
But it's not like there are just these five preferences and once we have four of them out of the way, we're done. There are many, many different preferences in the space of preferences, and many, many of them will be simpler than R(0). So to converge to R(0), we need to add huge amounts of information, ruling out more and more examples.
Basically, we need to include enough information to define R(0)  which is what my research project is trying to do. What you're seeing as "adding enough clear examples" is actually "handcrafting R(0) in totality".
For more details see here: https://arxiv.org/abs/1712.05812
kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.
In this specific setup, no. But sometimes kicking the bucket is fine; sometimes kicking the metaphorical equivalent of the bucket is necessary. If the AI is never willing to kick the bucket, i.e. never willing to take actions that might, for certain utility functions, cause huge and irreparable harm, then it's not willing to take any action at all.
Your presentation had an example with randomly selected utility functions in a block world that resulted in the agent taking less-irreversible actions around a specific block.
If we have randomly selected utility functions in the bucket-and-pool world, this may include utilities that care about the salt content or the exact molecules, or not. Depending on whether or not we include these, we run the risk of preserving the bucket when we need not, or kicking it when we should preserve it. This is because the "worth" of the water being in the bucket varies depending on human preferences, not on anything intrinsic to the design of the bucket and the pool.
we have plenty of natural assumptions to choose from.
You'd think so, but nobody has defined these assumptions in anything like sufficient detail to make IRL work. My whole research agenda is essentially a way of defining these assumptions, and it seems to be a long and complicated process.
Basically yes. My take on 2) is that identity-affirming things can be somewhat pleasurable, but they're unlikely to be the most pleasurable thing the human could do at that moment. So they can be valued for something else than pure pleasure.
And you can get other examples where someone, say, is truthful, even if that causes them more pain than a simple lie would.