Posts

Ultra-simplified research agenda 2019-11-22T14:29:41.227Z · score: 24 (7 votes)
Analysing: Dangerous messages from future UFAI via Oracles 2019-11-22T14:17:43.075Z · score: 16 (5 votes)
Defining AI wireheading 2019-11-21T13:04:49.406Z · score: 14 (4 votes)
Platonic rewards, reward features, and rewards as information 2019-11-12T19:38:10.685Z · score: 15 (5 votes)
All I know is Goodhart 2019-10-21T12:12:53.248Z · score: 28 (5 votes)
Full toy model for preference learning 2019-10-16T11:06:03.746Z · score: 12 (4 votes)
Toy model #6: Rationality and partial preferences 2019-10-02T12:04:53.048Z · score: 11 (2 votes)
Stuart_Armstrong's Shortform 2019-09-30T12:08:13.617Z · score: 9 (1 votes)
Toy model piece #5: combining partial preferences 2019-09-12T03:31:25.295Z · score: 12 (3 votes)
Toy model piece #4: partial preferences, re-re-visited 2019-09-12T03:31:08.628Z · score: 9 (1 votes)
Is my result wrong? Maths vs intuition vs evolution in learning human preferences 2019-09-10T00:46:25.356Z · score: 19 (6 votes)
Simple and composite partial preferences 2019-09-09T23:07:26.358Z · score: 11 (2 votes)
Best utility normalisation method to date? 2019-09-02T18:24:29.318Z · score: 15 (5 votes)
Reversible changes: consider a bucket of water 2019-08-26T22:55:23.616Z · score: 25 (24 votes)
Toy model piece #3: close and distant situations 2019-08-26T22:41:17.500Z · score: 10 (2 votes)
Problems with AI debate 2019-08-26T19:21:40.051Z · score: 22 (11 votes)
Gratification: a useful concept, maybe new 2019-08-25T18:58:15.740Z · score: 17 (7 votes)
Under a week left to win $1,000! By questioning Oracle AIs. 2019-08-25T17:02:46.921Z · score: 14 (3 votes)
Toy model piece #2: Combining short and long range partial preferences 2019-08-08T00:11:39.578Z · score: 15 (4 votes)
Preferences as an (instinctive) stance 2019-08-06T00:43:40.424Z · score: 20 (6 votes)
Practical consequences of impossibility of value learning 2019-08-02T23:06:03.317Z · score: 23 (11 votes)
Very different, very adequate outcomes 2019-08-02T20:31:00.751Z · score: 13 (4 votes)
Contest: $1,000 for good questions to ask to an Oracle AI 2019-07-31T18:48:59.406Z · score: 68 (28 votes)
Toy model piece #1: Partial preferences revisited 2019-07-29T16:35:19.561Z · score: 12 (3 votes)
Normalising utility as willingness to pay 2019-07-18T11:44:52.272Z · score: 16 (4 votes)
Intertheoretic utility comparison: examples 2019-07-17T12:39:45.147Z · score: 13 (3 votes)
Indifference: multiple changes, multiple agents 2019-07-08T13:36:42.095Z · score: 16 (3 votes)
Self-confirming prophecies, and simplified Oracle designs 2019-06-28T09:57:35.571Z · score: 6 (3 votes)
Apocalypse, corrupted 2019-06-26T13:46:05.548Z · score: 20 (12 votes)
Research Agenda in reverse: what *would* a solution look like? 2019-06-25T13:52:48.934Z · score: 35 (15 votes)
Research Agenda v0.9: Synthesising a human's preferences into a utility function 2019-06-17T17:46:39.317Z · score: 61 (16 votes)
Preference conditional on circumstances and past preference satisfaction 2019-06-17T15:30:32.580Z · score: 11 (2 votes)
For the past, in some ways only, we are moral degenerates 2019-06-07T15:57:10.962Z · score: 29 (9 votes)
To first order, moral realism and moral anti-realism are the same thing 2019-06-03T15:04:56.363Z · score: 17 (4 votes)
Conditional meta-preferences 2019-06-03T14:09:54.357Z · score: 6 (3 votes)
Uncertainty versus fuzziness versus extrapolation desiderata 2019-05-30T13:52:16.831Z · score: 20 (5 votes)
And the AI would have got away with it too, if... 2019-05-22T21:35:35.543Z · score: 75 (30 votes)
By default, avoid ambiguous distant situations 2019-05-21T14:48:15.453Z · score: 31 (8 votes)
Oracles, sequence predictors, and self-confirming predictions 2019-05-03T14:09:31.702Z · score: 21 (7 votes)
Self-confirming predictions can be arbitrarily bad 2019-05-03T11:34:47.441Z · score: 45 (17 votes)
Nash equilibriums can be arbitrarily bad 2019-05-01T14:58:21.765Z · score: 36 (16 votes)
Defeating Goodhart and the "closest unblocked strategy" problem 2019-04-03T14:46:41.936Z · score: 44 (14 votes)
Learning "known" information when the information is not actually known 2019-04-01T17:56:17.719Z · score: 15 (5 votes)
Relative exchange rate between preferences 2019-03-29T11:46:35.285Z · score: 12 (3 votes)
Being wrong in ethics 2019-03-29T11:28:55.436Z · score: 22 (5 votes)
Models of preferences in distant situations 2019-03-29T10:42:14.633Z · score: 11 (2 votes)
The low cost of human preference incoherence 2019-03-27T11:58:14.845Z · score: 21 (8 votes)
"Moral" as a preference label 2019-03-26T10:30:17.102Z · score: 16 (5 votes)
Partial preferences and models 2019-03-19T16:29:23.162Z · score: 13 (3 votes)
Combining individual preference utility functions 2019-03-14T14:14:38.772Z · score: 12 (4 votes)

Comments

Comment by stuart_armstrong on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-22T15:46:57.933Z · score: 2 (1 votes) · LW · GW

Yep. If they do acausal trade with each other.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-11-22T14:18:45.600Z · score: 2 (1 votes) · LW · GW

Some thoughts on that idea: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-11-22T14:18:28.457Z · score: 3 (2 votes) · LW · GW

Some thoughts on this idea, thanks for it: https://www.lesswrong.com/posts/6WbLRLdmTL4JxxvCq/analysing-dangerous-messages-from-future-ufai-via-oracles

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-22T11:52:05.750Z · score: 2 (1 votes) · LW · GW

I consider wireheading to be a special case of proxy alignment in a mesaoptimiser.

I agree. I've now added this line, which I thought I'd put in the original post, but apparently missed out:

Note, though, that the converse is true: every example of wireheading is a Goodhart curse.

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-22T11:48:21.172Z · score: 2 (1 votes) · LW · GW

But really, what's the purpose of trying to distinguish wireheading from other forms of reward hacking?

Because mitigations for different failure modes might not be the same, depending on the circumstances.

Comment by stuart_armstrong on Defining AI wireheading · 2019-11-21T17:49:48.599Z · score: 2 (1 votes) · LW · GW

Where "measurement channel" not just one specific channel, but anything that has the properties of a measurement channel.

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-11-08T13:59:38.293Z · score: 2 (1 votes) · LW · GW

Indeed. It might be possible to construct that complex bias function from the policy in a simple way. But that claim needs to be supported, and the fact that it hasn't been found so far (I repeat that it has to be simple) is evidence against it.

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-11-06T12:50:47.357Z · score: 2 (1 votes) · LW · GW

Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts?

Yes.

Then what is the "unzip" function?

The "shortest algorithm generating BLAH" is the maximally compressed way of expressing BLAH - the "zipped" version of BLAH.

Ignoring unzip, which isn't very relevant, we know that the degenerate pairs are just above the policy in complexity.

So zip(degenerate pair) ≈ zip(policy), while zip(reasonable pair) > zip(policy+complex bias facts) (and zip(policy+complex bias facts) > zip(policy)).

Does that help?

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-28T12:28:57.123Z · score: 2 (1 votes) · LW · GW

I'm not sure the physics analogy is getting us very far - I feel there is a very natural way of decomposing physics into laws + initial conditions, while there is no such natural way of doing so for preferences and rationality. But if we have different intuitions on that, then discussing the analogy isn't going to help us converge!

So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy.

Agreed (though the extra information may be tiny - a few extra symbols).

By analogous reasoning, every algorithm for constructing the policy contains more information than the policy.

That does not follow; the simplest algorithm for building a policy does not go via decomposing into two pieces and then recombining them. We are comparing algorithms that produce a planner-reward pair (two outputs) with algorithms that produce a policy (one output). (But your whole argument shows you may be slightly misunderstanding complexity in this context.)

Now, though all pairs are slightly more complex than the policy itself, the bias argument shows that the "proper" pair is considerably more complex. To use an analogy: suppose file1 and file2 are both maximally zipped files. When you unzip file1, you produce image1 (and maybe a small, blank, image2). When you unzip file2, you also produce the same image1, and a large, complex, image2'. Then, as long as image1 and image2' are at least slightly independent, file2 has to be larger than file1. The more complex image2' is, and the more independent it is from image1, the larger file2 has to be.
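A concrete (if rough) way to see the analogy, not from the original discussion: use an off-the-shelf compressor as a stand-in for maximal zipping, and compare compressing image1 alone against compressing image1 together with an independent image2'.

    import os
    import zlib

    # image1: a structured, compressible object (standing in for the policy).
    image1 = b"policy" * 10_000
    # image2': a complex object, independent of image1 (standing in for the bias facts).
    image2_prime = os.urandom(20_000)

    file1 = zlib.compress(image1, 9)                  # "maximally zipped" image1
    file2 = zlib.compress(image1 + image2_prime, 9)   # "maximally zipped" image1 + image2'

    # file2 is larger, by roughly the incompressible content of image2'.
    print(len(file1), len(file2))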

Does that make sense?

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-25T11:41:18.955Z · score: 2 (1 votes) · LW · GW

Yep, those are the two levels I mentioned :-)

But I like your phrasing.

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-25T11:40:19.168Z · score: 3 (2 votes) · LW · GW

You can't get too much work from a single bit of information ^_^

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-24T10:39:31.703Z · score: 2 (1 votes) · LW · GW

Hey there!

Responding to a few points. But first, I want to make the point that treating an agent as a (p,R) pair is basically an intentional stance. We choose to treat the agent that way, either for ease of predicting its actions (Dennett's approach) or for extracting its preferences, to satisfy them (my approach). The decomposition is not a natural fact about the world.

--If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.

No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent to trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't - we'd still have to make decisions).

(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).

--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.

Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.

For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999999 or we can write "20-digits of 9's." The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."

That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.

Anyhow don't the degenerate pairs also contain information about biases--for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?

Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of -1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.

The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, and the three degenerate pairs are almost as simple as the policy. However, the "proper" pair can generate a complicated object, the bias function (which has a non-trivial value in almost every possible state). So the proper pair contains at least enough information to specify a) the human policy, and b) the bias function. The Kolmogorov complexity of the proper pair is thus at least that of the simplest algorithm that can generate both those objects.

So one of two things is happening: either the human policy can generate the bias function directly, in some simple way[1], or the proper pair is more complicated than the policy. The first is not impossible, but notice that it has to be "simple". So the fact that we have not yet found a way to generate the bias function from the policy is an argument that it can't be done. Certainly there are no elementary mathematical manipulations of the policy that produce anything suitable.

--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?

No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.


  1. Let A be the simplest algorithm that generates the human policy, and B the simplest that generates the human policy and the bias function. If there are n different algorithms that generate the human policy and are of length |B| or shorter, then we need to add log2(n) bits of information to the human policy to generate B, and hence, the bias function. So if B is close in complexity to A, we don't need to add much. ↩︎

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-22T21:10:18.255Z · score: 3 (2 votes) · LW · GW

Thanks! Error corrected.

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-22T11:07:08.517Z · score: 6 (3 votes) · LW · GW

Hey there!

Thanks for this critique; I have, obviously, a few comments ^_^

In no particular order:

  • First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I'm coming from: https://www.youtube.com/watch?v=1M9CvESSeVc

  • A useful point from that: given human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it's hard for us to turn off our theory of mind, the decomposition will always feel simple to us. However, the human theory of mind suffers from Moravec's paradox: though the theory of mind seems simple to us, it is very hard to specify, especially into code.

  • You're entirely correct to decompose the argument into Step 1 and Step 2, and to point out that Step 1 has much stronger formal support than Step 2.

  • I'm not too worried about the degenerate pairs specifically; you can rule them all out with two bits of information. But, once you've done that, there will be other almost-as-degenerate pairs that fit with the new information. To rule them out, you need to add more information... but by the time you've added all of that, you've essentially defined the "proper" pair, by hand.

  • On speed priors: the standard argument applies for a speed prior, too (see Appendix A of our paper). It applies perfectly for the indifferent planner/zero reward, and applies, given an extra assumption, for the other two degenerate solutions.

  • Onto the physics analogy! First of all, I'm a bit puzzled by your claim that physicists don't know how to do this division. Now, we don't have a full theory of physics; however, all the physical theories I know of have a very clear and known division between laws and initial conditions. So physicists do seem to know how to do this. And when we say that "it's very complex", this doesn't seem to mean the division into laws and initial conditions is complex, just that the initial conditions are complex (and maybe that the laws are not yet known).

  • The indifferent planner contains almost exactly the same amount of information as the policy. The "proper" pair, on the other hand, contains information such as whether the anchoring bias is a bias (it is) compared with whether paying more for better-tasting chocolates is a bias (it isn't). Basically, none of the degenerate pairs contain any bias information at all; so everything to do with human biases is extra information that comes along with the "proper" pair.

  • Even ignoring all that, the fact that (p,R) is of comparable complexity to (-p,-R) shows that Occam's razor cannot distinguish the proper pair from its negative (a minimal illustration of this symmetry is sketched below).
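A toy sketch of that symmetry (my own illustration, with made-up rewards): a fully rational planner paired with R picks out exactly the same policy as a fully anti-rational planner paired with -R, so nothing in the observed behaviour separates the two decompositions.

    # Hypothetical toy rewards over (state, action) pairs; the numbers are arbitrary.
    R = {
        ("hungry", "eat"): 1.0, ("hungry", "wait"): -0.5,
        ("tired", "sleep"): 0.8, ("tired", "wait"): 0.1,
    }
    actions = {"hungry": ["eat", "wait"], "tired": ["sleep", "wait"]}

    def rational_planner(reward):
        # p: pick the action that maximises the reward.
        return lambda s: max(actions[s], key=lambda a: reward[(s, a)])

    def anti_rational_planner(reward):
        # -p: pick the action that minimises the reward.
        return lambda s: min(actions[s], key=lambda a: reward[(s, a)])

    neg_R = {k: -v for k, v in R.items()}
    policy_from_pair = rational_planner(R)               # (p, R)
    policy_from_negated = anti_rational_planner(neg_R)   # (-p, -R)

    # Both decompositions yield the same policy in every state.
    print(all(policy_from_pair(s) == policy_from_negated(s) for s in actions))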

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-10-22T08:17:45.118Z · score: 2 (1 votes) · LW · GW

Answered your comment there.

Comment by stuart_armstrong on Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann · 2019-10-22T08:15:30.298Z · score: 2 (1 votes) · LW · GW

compatibility with all the evidence we have observed

That is the whole point of my research agenda: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

The problem is that the non-subjective evidence does not map onto facts about the decomposition. A human claims X; well, that's a speech act; are they telling the truth or not, and how do we know? Same for sensory data, which is mainly data about the brain correlated with facts about the outside world; to interpret that, we need to solve human symbol grounding.

All these ideas are in the research agenda (especially section 2). Just as you need something to bridge the is-ought gap, you need some assumptions to make evidence in the world (eg speech acts) correspond to preference-relevant facts.

This video may also illustrate the issues: https://www.youtube.com/watch?v=1M9CvESSeVc&t=1s

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-22T07:57:47.398Z · score: 2 (1 votes) · LW · GW

Goodhart's Law tells us that, in general, blindly maximizing the proxy has lower expected value than other methods that involves not doing that

This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-22T07:56:10.293Z · score: 1 (3 votes) · LW · GW

Even with your stated sense of beauty, knowing "this measure can be manipulated in extreme circumstances" is much better than nothing.

And we probably know quite a bit more; I'll continue this investigation, adding more information.

Comment by stuart_armstrong on All I know is Goodhart · 2019-10-22T07:54:16.341Z · score: 2 (1 votes) · LW · GW

As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane.

Because expectation is affine with respect to the utility function, a plane in the space of E(W|π)'s corresponds to a plane in the space of W's - so this does divide the space of W's by a plane.

Yes, there is a connection with the optimizer's curse style of reasoning.

Comment by stuart_armstrong on Troll Bridge · 2019-10-13T14:48:43.227Z · score: 2 (1 votes) · LW · GW

You are entirely correct; I don't know why I was confused.

However, looking at the proof again, it seems there might be a potential hole. You use Löb's theorem within an assumption sub-loop. This seems to assume that from "A ⊢ (□B → B)", we can deduce "A ⊢ B".

But this cannot be true in general! To see this, set A = (□B → B). Then A ⊢ (□B → B), trivially; if, from that, we could deduce A ⊢ B, we would have ⊢ (□B → B) → B for any B. But this statement, though it looks like Löb's theorem, is one that we cannot deduce in general (see Eliezer's "medium-hard problem" here).

Can this hole be patched?

(note that if ⊢_A (□_A B → B), where ⊢_A is a PA proof that adds A as an extra axiom and □_A is its provability predicate, then we can deduce ⊢_A B).

Comment by stuart_armstrong on Troll Bridge · 2019-10-03T13:32:20.434Z · score: 2 (1 votes) · LW · GW

If the agent's reasoning is sensible only under certain settings of the default action clause

That was my first rewriting; the second is an example of a more general algorithm. Assume that both probabilities and utilities are discrete, all of the form q/n for some integer q, and bounded above and below by N. Then the algorithm would go something like this (for EU the expected utility, Actions the set of actions, and b some default action):

for q integer in N*n^2 to -N*n^2 (ordered from highest to lowest):
    for a in Actions:
        if A()=a ⊢ EU=q/n^2 then output a
else output b

Then the Löbian proof fails. The agent will fail to prove any of those "if" implications, until it proves "A()="not cross" ⊢ EU=0". Then it outputs "not cross"; the default action b is not relevant. Also not relevant, here, is the order in which a is sampled from "Actions".
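A minimal Python rendering of that sketch, under the same assumptions; the `proves` function is a hypothetical stand-in for the agent's proof search, not something implementable as written:

    from fractions import Fraction

    def proves(statement):
        """Hypothetical proof-search oracle over the agent's theory (stubbed out)."""
        raise NotImplementedError

    def choose_action(actions, default_action, N, n):
        # Scan candidate expected utilities q/n^2, from the highest down to the lowest.
        for q in range(N * n**2, -N * n**2 - 1, -1):
            eu = Fraction(q, n**2)
            for a in actions:
                # Output the first action provably achieving this expected utility.
                if proves(f"A()={a} implies EU={eu}"):
                    return a
        # If nothing is ever proved, fall back on the default action b.
        return default_action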

Comment by stuart_armstrong on Troll Bridge · 2019-10-01T15:05:18.839Z · score: 2 (1 votes) · LW · GW

Interesting.

I have two issues with the reasoning as presented; the second one is more important.

First of all, I'm unsure about "Rather, the point is that the agent's "counterfactual" reasoning looks crazy." I think we don't know the agent's counterfactual reasoning. We know, by Löb's theorem, that "there exists a proof that (proof of L implies L)" implies "there exists a proof of L". It doesn't tell us what structure this proof of L has to take, right? Who knows what counterfactuals are being considered to make that proof? (I may be misunderstanding this).

Second of all, it seems that if we change the last line of the agent to [else, "cross"], the argument fails. Same if we insert [else if A()="cross" ⊢ U=-10, then output "cross"; else if A()="not cross" ⊢ U=-10, then output "not cross"] above the last line. In both cases, this is because U=-10 is now possible, given crossing. I'm suspicious when the argument seems to depend so much on the structure of the agent.

To develop that a bit, it seems the agent's algorithm as written implies "If I cross the bridge, I am consistent" (because U=-10 is not an option). If we modify the algorithm as I just suggested, then that's no longer the case; it can consider counterfactuals where it crosses the bridge and is inconsistent (or, at least, of unknown consistency). So, given that, the agent's counterfactual reasoning no longer seems so crazy, even if it's as claimed. That's because the agent's reasoning needs to deduce something from "If I cross the bridge, I am consistent" that it can't deduce without that. Given that statement, then being Löbian or similar seems quite natural, as those are some of the few ways of dealing with statements of that type.

Comment by stuart_armstrong on Stuart_Armstrong's Shortform · 2019-09-30T12:08:16.004Z · score: 5 (4 votes) · LW · GW

Bayesian agents that knowingly disagree

A minor stub, caveating Aumann's agreement theorem; put here to reference in future posts, if needed.

Aumann's agreement theorem states that rational agents with common knowledge of each other's beliefs cannot agree to disagree. If they exchange their estimates, they will swiftly come to an agreement.

However, that doesn't mean that agents cannot disagree; indeed they can disagree, and know that they disagree. For example, suppose that there are a thousand doors; behind 999 of these there are goats, and behind one there is a flying aircraft carrier. The two agents are in separate rooms, and a host will go into each room and execute the following algorithm: they will choose a door at random among the 999 that contain a goat. And, with probability p, they will tell that door number to the agent; with probability 1 - p, they will tell the door number with the aircraft carrier.

Then each agent will have probability 1 - p of the named door being the aircraft carrier door, and probability p/999 on each of the other doors; so, as long as 1 - p > p/999, the most likely door is the one named by the host.

We can modify the protocol so that the host will never name the same door to both agents (roll a D100; if it comes up 1, tell the truth to the first agent and lie to the second; if it comes up 2, do the opposite; anything else means tell a different lie to each agent). In that case, each agent will have a best guess for the aircraft carrier, and the certainty that the other agent's best guess is different.

If the agents exchanged information, they would swiftly converge on the same distribution; but until that happens, they disagree, and know that they disagree.
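A quick simulation of the first protocol (my own check, with the host's goat-naming probability set to an illustrative p = 0.9, since the original figure isn't specified here):

    import random

    DOORS, TRIALS, p = 1000, 100_000, 0.9  # p: chance the host names a random goat door

    hits = 0  # times the named door actually hides the aircraft carrier
    for _ in range(TRIALS):
        carrier = random.randrange(DOORS)
        if random.random() < p:
            named = random.randrange(DOORS - 1)  # a uniformly random door...
            if named >= carrier:
                named += 1                        # ...skipping the carrier, so a goat door
        else:
            named = carrier                       # host names the carrier door
        hits += (named == carrier)

    posterior_named = hits / TRIALS
    print(posterior_named)                       # ~ 1 - p = 0.1 on the named door
    print((1 - posterior_named) / (DOORS - 1))   # ~ p/999 ≈ 0.0009 on each other door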

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-12T02:42:10.374Z · score: 2 (1 votes) · LW · GW

I like this analogy. Probably not best to put too much weight on it, but it has some insights.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T18:59:03.995Z · score: 2 (1 votes) · LW · GW

And whether those programs could then perform well if their opponent forces them into a very unusual situation, such as would not have ever appeared in a chessmaster game.

If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing the trap? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML be able to deal with it?

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:50:33.181Z · score: 2 (1 votes) · LW · GW

PS: the other title I considered was "Why do people feel my result is wrong", which felt too condescending.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:34:12.187Z · score: 5 (3 votes) · LW · GW

I agree we're not as good as we think we are. But there are a lot of things we do agree on, that seem trivial: eg "this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm". We have far, far, more agreement than random agents would.

Comment by stuart_armstrong on Is my result wrong? Maths vs intuition vs evolution in learning human preferences · 2019-09-10T16:23:01.918Z · score: 2 (1 votes) · LW · GW

Your title seems clickbaity

Hehe - I don't normally do this, but I feel I can indulge once ^_^

having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well

How do you know this?

Moravec's paradox again. Chessmasters didn't easily program chess programs; and those chess programs didn't generalise to games in general.

Should we turn this into one of those concrete ML experiments?

That would be good. I'm aiming to have a lot more practical experiments from my research project, and this could be one of them.

Comment by stuart_armstrong on Best utility normalisation method to date? · 2019-09-07T23:09:29.936Z · score: 2 (1 votes) · LW · GW

Hum... It seems that we can stratify here. Let Z represent the values of a collection of variables that we are uncertain about, and that we are stratifying on.

When we compute the normalising factor N_u for a utility u under two policies π1 and π2, we normally do it as:

  • N_u = E(u | π1) - E(u | π2), with N_u > 0.

And then we replace u with u/N_u.

Instead we might normalise the utility separately for each value z of Z:

  • Conditional on Z = z, then N_u^z = E(u | π1, Z = z) - E(u | π2, Z = z), with u replaced by u/N_u^z on that stratum.

The problem is that, since we're dividing by the N_u^z, the expectation of u/N_u^z is not the same as that of u/N_u.

Is there an obvious improvement on this?

Note that here, total utilitarianism gets less weight in large universes, and more in small ones.

I'll think more...

Comment by stuart_armstrong on Problems with AI debate · 2019-09-06T01:13:47.636Z · score: 3 (2 votes) · LW · GW

How about a third AI that gives a (hidden) probability about which one you'll be convinced by, conditional on which argument you see first? That hidden probability is passed to someone else, then the debate is run, and the result recorded. If that third AI gives good calibration and good discrimination over multiple experiments, then we can consider its predictions accurate in the future.
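One way the calibration-and-discrimination check could be implemented (a sketch; the Brier score and the bucketing choices are mine, not specified in the comment):

    from collections import defaultdict

    def calibration_report(predictions, outcomes, bins=10):
        """predictions: the third AI's hidden probabilities; outcomes: 0/1 debate results."""
        brier = sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)
        buckets = defaultdict(list)
        for p, o in zip(predictions, outcomes):
            buckets[min(int(p * bins), bins - 1)].append(o)
        # Empirical frequency of the predicted event within each probability bucket.
        table = {f"{b / bins:.1f}-{(b + 1) / bins:.1f}": sum(v) / len(v)
                 for b, v in sorted(buckets.items())}
        return brier, table

    # A low Brier score, plus bucket frequencies matching their labels, would justify
    # trusting the hidden predictions in future debates.
    print(calibration_report([0.9, 0.8, 0.2, 0.7, 0.1], [1, 1, 0, 1, 0]))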

Comment by stuart_armstrong on Best utility normalisation method to date? · 2019-09-03T16:38:35.640Z · score: 3 (2 votes) · LW · GW

Er, this normalisation may well solve that problem entirely. If agent i prefers option O_i (utility 1), with second choice the compromise C (utility μ), and all the other options as third choice (utility 0), then the expected utility of the random dictator is 1/n for all i (as O_i gives utility 1, and O_j gives utility 0 for all j ≠ i), so the normalised weighted utility to maximise is:

  • Σ_i u_i / (1 - 1/n).

Using Σ_i u_i instead (because scaling doesn't change expected utility decisions), the utility of any O_j is 1, while the utility of C is nμ. So if μ > 1/n, the compromise option will get chosen.
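A numeric check of that computation, under the same assumed setup (n agents, favourite worth 1, compromise worth μ, everything else 0; the specific n and mu below are illustrative):

    n, mu = 5, 0.4
    options = [f"O{i}" for i in range(n)] + ["C"]

    def utility(agent, option):
        if option == f"O{agent}":
            return 1.0          # agent's favourite
        if option == "C":
            return mu           # the common compromise option
        return 0.0

    # Random-dictator benchmark: each favourite is implemented with probability 1/n,
    # giving every agent an expected utility of 1/n.
    rd_value = 1 / n

    def normalised_sum(option):
        # Each utility is scaled by 1/(optimal value - random dictator value) = 1/(1 - 1/n).
        return sum(utility(i, option) / (1 - rd_value) for i in range(n))

    best = max(options, key=normalised_sum)
    print(best, mu > 1 / n)   # the compromise C wins exactly when mu > 1/n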

Don't confuse the problems of the random dictator with the problems of maximising the weighted sum of utilities normalised using the random dictator (and don't confuse them the other way round, either: the random dictator is immune to players' lying; this normalisation is not).

Comment by stuart_armstrong on How to Make Billions of Dollars Reducing Loneliness · 2019-09-01T21:16:13.728Z · score: 6 (4 votes) · LW · GW

But community isn't about friends; it's about a background level of acquaintances you're comfortable with.

Comment by stuart_armstrong on September Bragging Thread · 2019-08-31T22:02:04.809Z · score: 44 (24 votes) · LW · GW

I finished the research agenda on constructing a preference utility function for any given human, and presented the ideas to CHAI and MIRI. Woot!

Comment by stuart_armstrong on The Very Repugnant Conclusion · 2019-08-31T13:25:42.902Z · score: 2 (1 votes) · LW · GW

Something like or or in general (for decreasing, continuous ) could work, I think.

Comment by stuart_armstrong on Why so much variance in human intelligence? · 2019-08-30T20:19:02.054Z · score: 10 (5 votes) · LW · GW

I'd say that intelligence variations are more visible in (modern) humans, not that they're necessarily larger.

Let's go back to the tribal environment. In that situation, humans want to mate, to dominate/be admired, to have food and shelter, and so on. Apart from a few people with mental defects, the variability in outcome is actually quite small - most humans won't get ostracised, many will have children, only very few will rise to the top of the hierarchy (and even there, tribal environments are more egalitarian than most, so the leader is not that much different from the others). So we might say that the variation in human intelligence (or social intelligence) is low.

Fast forward to an agricultural empire, or to the modern world. Now the top minds can become god emperor, invading and sacking other civilizations, or can be part of projects that produce atom bombs and lunar rockets. The variability of outcomes is huge, and so the variability in intelligence appears to be much higher.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-30T19:18:55.010Z · score: 6 (3 votes) · LW · GW

That's an excellent summary.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:39:40.606Z · score: 3 (2 votes) · LW · GW

One might think just doing ontology doesn't involve making preference choice but making some preferences impossible to articulate it is in fact a partial preference choice.

Yep, that's my argument: some (but not all) aspects of human preferences have to be included in the setup somehow.

it's more reasonable for a human to taste a salt level differnce, it's more plausible to say "I couldn't know" about radioactivity

I hope you don't taste every bucket of water before putting it away! ^_^

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:36:29.769Z · score: 2 (1 votes) · LW · GW

In the later part of the post, it seems you're basically talking about entropy and similar concepts? And I agree that "reversible" is kinda like entropy, in that we want to be able to return to a "macrostate" that is considered indistinguishable from the starting macrostate (even if the details are different).

However, as in the bucket example above, the problem is that, for humans, what "counts" as the same macrostate can vary a lot. If we need a liquid, any liquid, then replacing the bucket's contents with purple-tinted alcohol is fine; if we're thinking of the bath water of the dear departed husband, then any change to the contents is irreversible. Human concepts of "acceptably similar" don't match up with entropic ones.

there needs to be an effect that counts as "significant".

Are you deferring this to human judgement of significant? If so, we agree - human judgement needs to be included in some way in the definition.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-29T02:29:09.841Z · score: 4 (2 votes) · LW · GW

Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward.

Yep, I agree :-)

I generally think that impact measures don't have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.

Then we are in full agreement :-) I argue that low impact, corrigibility, and similar approaches require some but not all of human preferences. "Some" because of arguments like this one; "not all" because humans with very different values can agree on what constitutes low impact, so only part of their values are needed.

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-29T02:26:05.230Z · score: 2 (1 votes) · LW · GW

Good idea.

Comment by stuart_armstrong on Gratification: a useful concept, maybe new · 2019-08-29T02:22:47.592Z · score: 2 (1 votes) · LW · GW

intrinsic motivation

That might be the concept I'm looking for. I'll think whether it covers exactly what I'm trying to say...

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T21:31:27.141Z · score: 4 (2 votes) · LW · GW

Ok, we strongly disagree on your simple constraints being enough. I'd need to see these constraints explicitly formulated before I had any confidence in them. I suspect (though I'm not certain) that the more explicit you make them, the trickier you'll see it is.

And no, I don't want to throw IRL out (this is an old post), I want to make it work. I got this big impossibility result, and now I want to get around it. This is my current plan: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into

Comment by stuart_armstrong on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-27T21:23:33.678Z · score: 3 (2 votes) · LW · GW

Very worthwhile concern, and I will think about it more.

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T18:48:07.628Z · score: 2 (1 votes) · LW · GW

We may not be disagreeing any more. Just to check, do you agree with both these statements:

  1. Adding a few obvious constraints rules out many different R, including the ones in the OP.

  2. Adding a few obvious constraints is not enough to get a safe or reasonable R.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-27T16:27:02.182Z · score: 5 (3 votes) · LW · GW

I've added an edit to the post, to show the problem: sometimes, the robot can't kick the bucket, sometimes it must. And only human preferences distinguish these two cases. So, without knowing these preferences, how can it decide?

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-27T16:19:48.161Z · score: 2 (1 votes) · LW · GW

Rejecting any specific R is easy - one bit of information (at most) per specific R. So saying "humans have preferences, and they are not always rational or always anti-rational" rules out R(1), R(2), and R(3). Saying "this apparent preference is genuine" rules out R(4).

But it's not like there are just these five preferences and once we have four of them out of the way, we're done. There are many, many different preferences in the space of preferences, and many, many of them will be simpler than R(0). So to converge to R(0), we need to add huge amounts of information, ruling out more and more examples.

Basically, we need to include enough information to define R(0) - which is what my research project is trying to do. What you're seeing as "adding enough clear examples" is actually "hand-crafting R(0) in totality".

For more details see here: https://arxiv.org/abs/1712.05812

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-27T03:09:29.726Z · score: 4 (3 votes) · LW · GW

kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.

In this specific setup, no. But sometimes kicking the bucket is fine; sometimes kicking the metaphorical equivalent of the bucket is necessary. If the AI is never willing to kick the bucket - ie never willing to take actions that might, for certain utility functions, cause huge and irreparable harm - then it's not willing to take any action at all.

Comment by stuart_armstrong on Reversible changes: consider a bucket of water · 2019-08-26T23:04:31.210Z · score: 6 (5 votes) · LW · GW

Your presentation had an example with randomly selected utility functions in a block world, which resulted in the agent taking less-irreversible actions around a specific block.

If we have randomly selected utility functions in the bucket-and-pool world, this may include utilities that care about the salt content or the exact molecules, or not. Depending on whether or not we include these, we run the risk of preserving the bucket when we need not, or kicking it when we should preserve it. This is because the "worth" of the water being in the bucket varies depending on human preferences, not on anything intrinsic to the design of the bucket and the pool.

Comment by stuart_armstrong on Humans can be assigned any values whatsoever… · 2019-08-26T21:11:08.689Z · score: 7 (3 votes) · LW · GW

we have plenty of natural assumptions to choose from.

You'd think so, but nobody has defined these assumptions in anything like sufficient detail to make IRL work. My whole research agenda is essentially a way of defining these assumptions, and it seems to be a long and complicated process.

Comment by stuart_armstrong on Gratification: a useful concept, maybe new · 2019-08-26T19:24:33.116Z · score: 2 (1 votes) · LW · GW

Basically yes. My take on 2) is that identity-affirming things can be somewhat pleasurable - but they're unlikely to be the most pleasurable thing the human could do at that moment. So they can be valued for something other than pure pleasure.

And you can get other examples where someone, say, is truthful, even if that causes them more pain than a simple lie would.