Posts

Humans are stunningly rational and stunningly irrational 2020-10-23T14:13:59.956Z · score: 18 (6 votes)
Knowledge, manipulation, and free will 2020-10-13T17:47:12.547Z · score: 31 (12 votes)
Dehumanisation *errors* 2020-09-23T09:51:53.091Z · score: 13 (3 votes)
Anthropomorphisation vs value learning: type 1 vs type 2 errors 2020-09-22T10:46:48.807Z · score: 16 (5 votes)
Technical model refinement formalism 2020-08-27T11:54:22.534Z · score: 9 (1 votes)
Model splintering: moving from one imperfect model to another 2020-08-27T11:53:58.784Z · score: 33 (9 votes)
Learning human preferences: black-box, white-box, and structured white-box access 2020-08-24T11:42:34.734Z · score: 23 (8 votes)
AI safety as featherless bipeds *with broad flat nails* 2020-08-19T10:22:14.987Z · score: 35 (17 votes)
Learning human preferences: optimistic and pessimistic scenarios 2020-08-18T13:05:23.697Z · score: 26 (6 votes)
Strong implication of preference uncertainty 2020-08-12T19:02:50.115Z · score: 20 (5 votes)
"Go west, young man!" - Preferences in (imperfect) maps 2020-07-31T07:50:59.520Z · score: 21 (8 votes)
Learning Values in Practice 2020-07-20T18:38:50.438Z · score: 24 (6 votes)
The Goldbach conjecture is probably correct; so was Fermat's last theorem 2020-07-14T19:30:14.806Z · score: 75 (27 votes)
Why is the impact penalty time-inconsistent? 2020-07-09T17:26:06.893Z · score: 16 (5 votes)
Dynamic inconsistency of the inaction and initial state baseline 2020-07-07T12:02:29.338Z · score: 30 (7 votes)
Models, myths, dreams, and Cheshire cat grins 2020-06-24T10:50:57.683Z · score: 21 (8 votes)
Results of $1,000 Oracle contest! 2020-06-17T17:44:44.566Z · score: 55 (20 votes)
Comparing reward learning/reward tampering formalisms 2020-05-21T12:03:54.968Z · score: 9 (1 votes)
Probabilities, weights, sums: pretty much the same for reward functions 2020-05-20T15:19:53.265Z · score: 11 (2 votes)
Learning and manipulating learning 2020-05-19T13:02:41.838Z · score: 40 (12 votes)
Reward functions and updating assumptions can hide a multitude of sins 2020-05-18T15:18:07.871Z · score: 16 (5 votes)
How should AIs update a prior over human preferences? 2020-05-15T13:14:30.805Z · score: 17 (5 votes)
Distinguishing logistic curves 2020-05-15T11:38:04.516Z · score: 23 (9 votes)
Distinguishing logistic curves: visual 2020-05-15T10:33:08.901Z · score: 9 (1 votes)
Kurzweil's predictions' individual scores 2020-05-07T17:10:36.637Z · score: 17 (8 votes)
Assessing Kurzweil predictions about 2019: the results 2020-05-06T13:36:18.788Z · score: 127 (56 votes)
Maths writer/co-writer needed: how you can't distinguish early exponential from early sigmoid 2020-05-06T09:41:49.370Z · score: 39 (14 votes)
Consistent Glomarization should be feasible 2020-05-04T10:06:55.928Z · score: 13 (9 votes)
Last chance for assessing Kurzweil 2020-04-22T11:51:02.244Z · score: 12 (2 votes)
Databases of human behaviour and preferences? 2020-04-21T18:06:51.557Z · score: 10 (2 votes)
Solar system colonisation might not be driven by economics 2020-04-21T17:10:32.845Z · score: 26 (12 votes)
"How conservative" should the partial maximisers be? 2020-04-13T15:50:00.044Z · score: 20 (7 votes)
Assessing Kurzweil's 1999 predictions for 2019 2020-04-08T14:27:21.689Z · score: 37 (13 votes)
Call for volunteers: assessing Kurzweil, 2019 2020-04-02T12:07:57.246Z · score: 27 (9 votes)
Anthropics over-simplified: it's about priors, not updates 2020-03-02T13:45:11.710Z · score: 9 (1 votes)
If I were a well-intentioned AI... IV: Mesa-optimising 2020-03-02T12:16:15.609Z · score: 26 (8 votes)
If I were a well-intentioned AI... III: Extremal Goodhart 2020-02-28T11:24:23.090Z · score: 20 (7 votes)
If I were a well-intentioned AI... II: Acting in a world 2020-02-27T11:58:32.279Z · score: 20 (7 votes)
If I were a well-intentioned AI... I: Image classifier 2020-02-26T12:39:59.450Z · score: 35 (17 votes)
Other versions of "No free lunch in value learning" 2020-02-25T14:25:00.613Z · score: 16 (5 votes)
Subagents and impact measures, full and fully illustrated 2020-02-24T13:12:05.014Z · score: 32 (10 votes)
(In)action rollouts 2020-02-18T14:48:19.160Z · score: 11 (2 votes)
Counterfactuals versus the laws of physics 2020-02-18T13:21:02.232Z · score: 16 (3 votes)
Subagents and impact measures: summary tables 2020-02-17T14:09:32.029Z · score: 11 (2 votes)
Appendix: mathematics of indexical impact measures 2020-02-17T13:22:43.523Z · score: 12 (3 votes)
Stepwise inaction and non-indexical impact measures 2020-02-17T10:32:01.863Z · score: 12 (3 votes)
In theory: does building the subagent have an "impact"? 2020-02-13T14:17:23.880Z · score: 17 (5 votes)
Building and using the subagent 2020-02-12T19:28:52.320Z · score: 17 (6 votes)
Plausibly, almost every powerful algorithm would be manipulative 2020-02-06T11:50:15.957Z · score: 41 (13 votes)
The Adventure: a new Utopia story 2020-02-05T16:50:42.909Z · score: 61 (39 votes)

Comments

Comment by stuart_armstrong on Knowledge, manipulation, and free will · 2020-10-14T13:15:29.512Z · score: 2 (1 votes) · LW · GW

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning, see eg our paper here https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

Comment by stuart_armstrong on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-10-14T07:11:46.215Z · score: 4 (2 votes) · LW · GW

I have an article on "Anthropic decision theory", with the video version here.

Basically, it's not that the presumptuous philosopher is more likely to be right in a given universe; it's that there are far more presumptuous philosophers in the large universe. So if we count "how many presumptuous philosophers are correct", we get a different answer from "in how many universes is the presumptuous philosopher correct". These things only come apart in anthropic situations.
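Here's a toy calculation of that counting difference (my own sketch; the population numbers are purely illustrative):

```python
# Two possible universes, equal prior probability. The presumptuous
# philosopher always bets "large universe".
N_SMALL = 1          # philosophers in the small universe (illustrative)
N_LARGE = 1_000_000  # philosophers in the large universe (illustrative)

# Counting by universes: he is right in 1 of the 2 equally likely universes.
p_right_per_universe = 0.5

# Counting by philosophers: weight each universe by how many philosophers it contains.
p_right_per_philosopher = (0.5 * N_LARGE) / (0.5 * N_SMALL + 0.5 * N_LARGE)

print(p_right_per_universe)      # 0.5
print(p_right_per_philosopher)   # ~0.999999 -- almost all presumptuous philosophers are correct
```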

Comment by stuart_armstrong on Comparing reward learning/reward tampering formalisms · 2020-10-01T13:06:26.241Z · score: 2 (1 votes) · LW · GW

Stuart, by " is complex" are you referring to...

I mean that defining can be done in many different ways, and hence has a lot of contingent structure. In contrast, in , ρ is a complex distribution on , conditional on ; hence itself is trivial and just encodes "apply to and in the obvious way".

Comment by stuart_armstrong on Stuart_Armstrong's Shortform · 2020-09-25T14:05:41.962Z · score: 11 (2 votes) · LW · GW

This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg Facebook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-25T13:44:21.258Z · score: 4 (2 votes) · LW · GW

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T16:20:44.766Z · score: 2 (1 votes) · LW · GW

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:08:59.084Z · score: 2 (1 votes) · LW · GW

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something you can know without putting your own interpretation on them - even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.
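A minimal toy sketch of that underdetermination, in the spirit of the Occam's razor result (the planner and reward names here are my own illustration, not from the paper):

```python
# The same observed behaviour can be decomposed as (rational planner, reward R)
# or (anti-rational planner, reward -R); no observation distinguishes them.

def reward_R(state, action):
    return 1.0 if action == "help" else 0.0

def rational_planner(reward, state):
    # Picks the action that maximises the reward.
    return max(["help", "harm"], key=lambda a: reward(state, a))

def anti_rational_planner(reward, state):
    # Picks the action that minimises the reward.
    return min(["help", "harm"], key=lambda a: reward(state, a))

neg_R = lambda s, a: -reward_R(s, a)

for state in ["s1", "s2"]:
    # Identical behaviour from both decompositions...
    assert rational_planner(reward_R, state) == anti_rational_planner(neg_R, state)
# ...yet they attribute opposite preferences to the agent.
```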

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions)

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:05:54.189Z · score: 3 (2 votes) · LW · GW

For instance throughout history people have been able to model and interact with traders from neighbouring or distant civilizations, even though they might think very differently.

Humans think very very similarly to each other, compared with random minds from the space of possible minds. For example, we recognise anger, aggression, fear, and so on, and share a lot of cultural universals: https://en.wikipedia.org/wiki/Cultural_universal

Comment by stuart_armstrong on Why haven't we celebrated any major achievements lately? · 2020-09-10T09:53:44.294Z · score: 2 (1 votes) · LW · GW

There haven’t been as many big accomplishments.

I think we should look at the demand side, not the supply side. We are producing lots of technological innovations, but there aren't so many major problems left for them to solve. The flush toilet was revolutionary; a super-flush ecological toilet with integrated sensors that can transform into a table... is much more advanced from the supply side, but barely more from the demand side: it doesn't fulfil many more needs than the standard flush toilet.

Comment by stuart_armstrong on Model splintering: moving from one imperfect model to another · 2020-08-31T14:00:01.159Z · score: 4 (2 votes) · LW · GW

Cool, good summary.

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T11:54:44.529Z · score: 2 (1 votes) · LW · GW

Humans have a theory of mind that makes certain types of modularization easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements are easy to deduce (there's an informal equivalence result: if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:35:00.915Z · score: 4 (2 votes) · LW · GW

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that "preferences" is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables, but it's also possible that another labelling would be clearer or more useful for our purposes. It might be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.

but "white box" is any source code that produces the same input-output behavior

What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:28:11.984Z · score: 2 (1 votes) · LW · GW

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-25T15:03:30.016Z · score: 2 (1 votes) · LW · GW

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

Comment by stuart_armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-20T13:09:27.588Z · score: 3 (2 votes) · LW · GW

Thanks! Useful insights in your post, to mull over.

Comment by stuart_armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-19T08:17:22.986Z · score: 5 (3 votes) · LW · GW

An imminent incoming post on this very issue ^_^

Comment by stuart_armstrong on For the past, in some ways only, we are moral degenerates · 2020-08-19T08:16:56.839Z · score: 3 (2 votes) · LW · GW

Yes, things like honour and anger serve important signalling and game-theoretic functions. But they also come to be valued intrinsically (the same way people like sex, rather than just wanting to spread their genes), and strongly valued. This makes it hard to agree that "oh, your sacred core value is only in the service of this hidden objective, so we can focus on that instead".

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-04T09:55:53.782Z · score: 4 (2 votes) · LW · GW

Cool, neat summary.

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-04T09:47:10.106Z · score: 2 (1 votes) · LW · GW

Sorry, had a terrible few days, and missed your message. How about Friday, 12pm UK time?

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-01T07:37:15.252Z · score: 2 (1 votes) · LW · GW

Stuart, I'm writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?

No prob. Email or Zoom/Hangouts/Skype?

Comment by stuart_armstrong on The ground of optimization · 2020-07-31T15:48:12.804Z · score: 6 (3 votes) · LW · GW

Very good. A lot of potential there, I feel.

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-07-31T11:35:20.984Z · score: 2 (1 votes) · LW · GW

The information to distinguish between these interpretations is not within the request to travel west.

Yes, but I'd argue that most moral preferences are similarly underdefined when the various interpretations behind them come apart (eg purity).

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2020-07-30T15:00:10.808Z · score: 3 (2 votes) · LW · GW

There are computer programs that can print their own code: https://en.wikipedia.org/wiki/Quine_(computing)

There are also programs which can print their own code and add something to it. Isn't that a way in which the program fully knows itself?
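For instance, here is a standard minimal Python quine (a generic example, not code from the linked page):

```python
# Running this program prints its own source code exactly.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

A small variant can print its own code and then append an extra line, which is the "knows itself and adds something" case mentioned above.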

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-18T11:31:49.187Z · score: 2 (1 votes) · LW · GW

Thanks! It's cool to see his approach.

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-16T13:59:13.657Z · score: 2 (1 votes) · LW · GW

Wiles proved the presence of a very rigid structure - not the absence - and the presence of this structure implied FLT via the work of other mathematicians.

If you say that "Wiles proved the Taniyama–Shimura conjecture" (for semistable elliptic curves), then I agree: he's proved a very important structural result in mathematics.

If you say he proved Fermat's last theorem, then I'd say he's proved an important-but-probable lack of structure in mathematics.

So yeah, he proved the existence of structure in one area, and (hence) the absence of structure in another area.

And "to prove Fermat's last theorem, you have to go via proving the Taniyama–Shimura conjecture", is, to my mind, strong evidence for "proving lack of structure is hard".

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-16T13:51:33.270Z · score: 2 (1 votes) · LW · GW

You can see this as sampling times sorta-independently, or as sampling times with less independence (ie most sums are sampled twice).

Either view works, and as you said, it doesn't change the outcome.

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-15T22:18:15.038Z · score: 2 (1 votes) · LW · GW

Yes, I got that result too. The problem is that the prime number theorem isn't a very good approximation for small numbers. So we'd need a slightly more sophisticated model that has more low numbers.

I suspect that moving from "sampling with replacement" to "sampling without replacement" might be enough for low numbers, though.
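A rough simulation of the random model behind the post (my own sketch, with illustrative parameters): treat each n as "prime" with probability 1/ln(n), then check which even numbers fail to be a sum of two model-primes. In such runs, any failures tend to cluster among small even numbers, where the 1/ln(n) density is a poor approximation.

```python
import math
import random

random.seed(0)
LIMIT = 10_000  # illustrative bound

# Cramer-style model: each n >= 3 is "prime" with probability 1/ln(n); 2 added by hand.
pseudo_primes = {2} | {n for n in range(3, LIMIT) if random.random() < 1 / math.log(n)}

# Even numbers with no representation as a sum of two model-primes.
failures = [
    n for n in range(4, LIMIT, 2)
    if not any((n - p) in pseudo_primes for p in pseudo_primes if p <= n // 2)
]

print(len(failures), failures[:20])
```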

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-15T22:13:03.799Z · score: 2 (1 votes) · LW · GW

Note that the probabilistic argument fails for n=3 for Fermat's last theorem; call this (3,2) (power=3, number of summands is 2).

So we know (3,2) is impossible; Euler's conjecture is the equivalent of saying that (n+1,n) is also impossible for all n. However, the probabilistic argument fails for (n+1,n) the same way as it fails for (3,2). So we'd expect Euler's conjecture to fail, on probabilistic grounds.

In fact, the surprising thing on probabilistic grounds is that Fermat's last theorem is true for n=3.
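To spell out the heuristic (a back-of-the-envelope density estimate, not a rigorous argument): model "$m$ is a perfect $n$-th power" as an event of probability about $\frac{1}{n}m^{1/n-1}$. With $x, y \le N$ there are about $N^2$ sums $x^n + y^n$, each of size about $N^n$, so the expected number of solutions at scale $N$ is roughly

$$N^2 \cdot \frac{1}{n}\,(N^n)^{\frac{1}{n}-1} = \frac{1}{n}\,N^{3-n}.$$

For $n \ge 4$ this decays, and summing over dyadic scales gives finitely many expected counterexamples; for $n = 3$ it is constant at every scale, so the expected count diverges - that is the sense in which the probabilistic argument fails for $(3,2)$. The same computation for $(n+1, n)$ gives $N^n \cdot \frac{1}{n+1}(N^{n+1})^{\frac{1}{n+1}-1} = \frac{1}{n+1}N^0$ per scale, the same marginal behaviour as $(3,2)$, which is why we'd expect Euler's conjecture to fail on probabilistic grounds.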

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-14T16:52:13.279Z · score: 4 (2 votes) · LW · GW

Good, cheers!

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T15:47:58.411Z · score: 4 (2 votes) · LW · GW

Another key reason for time-inconsistent preferences: bounded rationality.

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T15:39:57.790Z · score: 2 (1 votes) · LW · GW

Why do the absolute values cancel?

Because , so you can remove the absolute values.

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T12:54:38.276Z · score: 4 (2 votes) · LW · GW

Cheers, interesting read.

Comment by stuart_armstrong on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T12:10:19.312Z · score: 4 (2 votes) · LW · GW

I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.

Comment by stuart_armstrong on Is Molecular Nanotechnology "Scientific"? · 2020-07-07T12:04:14.663Z · score: 3 (2 votes) · LW · GW

Nope! Part of my own research has made me more optimistic about the possibilities of understanding and creating intelligence.

Comment by stuart_armstrong on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T12:02:16.107Z · score: 4 (2 votes) · LW · GW

I think this shows that the step-wise inaction penalty is time-inconsistent: https://www.lesswrong.com/posts/w8QBmgQwb83vDMXoz/dynamic-inconsistency-of-the-stepwise-inaction-baseline

Comment by stuart_armstrong on Assessing Kurzweil predictions about 2019: the results · 2020-06-25T12:12:15.810Z · score: 6 (4 votes) · LW · GW

https://www.futuretimeline.net/forum/topic/17903-kurzweils-2009-is-our-2019/ , forwarded to me by Daniel Kokotajlo (I added a link in the post as well).

Comment by stuart_armstrong on Models, myths, dreams, and Cheshire cat grins · 2020-06-25T12:09:01.782Z · score: 4 (2 votes) · LW · GW

"How to think about features of models and about consistency", in a relatively fun way as an intro to a big post I'm working on.

Comment by stuart_armstrong on Models, myths, dreams, and Cheshire cat grins · 2020-06-25T12:08:43.652Z · score: 2 (1 votes) · LW · GW

Then it has a wrong view of wings and fur (as well as a wrong view of pigs). The more features it has to get right, the harder the adversarial model is to construct - it's not just moving linearly in a single direction.

Comment by stuart_armstrong on Models, myths, dreams, and Cheshire cat grins · 2020-06-24T12:02:38.123Z · score: 6 (3 votes) · LW · GW

Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:

I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.

I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.

For example, the model of "falling" is incompatible with the model of "stationary"—they make contradictory predictions about the same boolean variables—and therefore I can't imagine a "falling stationary rock". On the other hand, I can imagine "a rubber wine glass spinning" because my rubber model is about texture etc., my wine glass model is about shape and function, and my spinning model is about motion. All 3 of those models make non-contradictory predictions (mostly because they're issuing predictions about non-overlapping sets of variables), so the three can snap together into a larger generative model.

So for counterfactuals, I suppose that we start by hypothesizing some core of a model ("a bird the size of an adult blue whale") and then searching out more little generative model pieces that can snap onto that core, growing it out as much as possible in different ways, until you hit the limits where you can't snap on any more details without making it unacceptably self-contradictory. Something like that...

Comment by stuart_armstrong on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2020-05-22T14:55:54.665Z · score: 2 (1 votes) · LW · GW

(this is, obviously, very speculative ^_^ )

Comment by stuart_armstrong on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2020-05-22T13:02:25.968Z · score: 2 (1 votes) · LW · GW

...which also means that they didn't have an empire to back them up?

Comment by stuart_armstrong on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2020-05-22T10:59:09.485Z · score: 4 (2 votes) · LW · GW

Thanks for your research, especially the Afonso stuff. One question for that: were these empires used to gaining/losing small pieces of territory? ie did they really dedicate all their might to getting these ports back, or did they eventually write them off as minor losses not worth the cost of fighting (given Portuguese naval advantages)?

Comment by stuart_armstrong on Cortés, Pizarro, and Afonso as Precedents for Takeover · 2020-05-21T11:27:10.348Z · score: 4 (2 votes) · LW · GW

Based on what I recall reading about Pizarro's conquest, I feel you might be underestimating the importance of horses. It took centuries for European powers to figure out how to break a heavy cavalry charge with infantry; the Amerindians didn't have the time to figure it out (see various battles where small cavalry forces routed thousands of troops). Once they had got more used to horses, later Inca forces (though much diminished) were more able to win open battles against the Spanish.

Maybe this was the problem for these empires: they were used to winning open battles, but were presented with a situation where only irregular warfare or siege defences could win. They reacted as an empire, when they should have been reacting as a recalcitrant province.

Comment by stuart_armstrong on Learning and manipulating learning · 2020-05-20T09:31:20.441Z · score: 2 (1 votes) · LW · GW

Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.

In the original cake-or-death example, it wasn't that killing got more points, it's that killing is easier (and hence gets more points over time). This is a reflection of the fact that "true" human values are complex and difficult to maximise, but many other values are much easier to maximise.

Comment by stuart_armstrong on Reward functions and updating assumptions can hide a multitude of sins · 2020-05-18T16:54:02.467Z · score: 4 (2 votes) · LW · GW

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see that I had been meaning to write up for some time.

On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (or, in my terms: uninfluenceable->unriggable). But there are many things that look like this, that are entirely riggable. For example, "have a prior 50-50 on cake and death, and update according to what the programmer says". This seems to be a prior-and-update combination, but it's entirely riggable.

So, another way of seeing my paper is "this thing looks like a prior-and-update process. If it's also unriggable, then (given certain assumptions) it's truly beyond the AI's influence".
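A toy sketch of that riggable "prior-and-update" process (my own illustration; treating "death" as the easier reward is an assumption made purely for the example):

```python
# Prior 50-50 over {cake, death}; the "learning process" is to update on
# whatever the programmer says, trusting the statement completely.

def update(prior, statement):
    return {r: (1.0 if r == statement else 0.0) for r in prior}

prior = {"cake": 0.5, "death": 0.5}

# Honest policy: ask, and the programmer answers truthfully.
honest_posterior = update(prior, "cake")

# Rigging policy: the AI manipulates the programmer into saying "death"
# (assumed easier to maximise), then "learns" that.
rigged_posterior = update(prior, "death")

# Both runs follow the same prior-and-update rule, but the outcome depends
# on the AI's own actions: it looks Bayesian, yet it is entirely riggable.
print(honest_posterior, rigged_posterior)
```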

Comment by stuart_armstrong on How should AIs update a prior over human preferences? · 2020-05-18T15:21:00.362Z · score: 4 (2 votes) · LW · GW

Thanks! Responded here: https://www.lesswrong.com/posts/EYEkYX6vijL7zsKEt/reward-functions-and-updating-assumptions-can-hide-a

Comment by stuart_armstrong on How should AIs update a prior over human preferences? · 2020-05-18T15:20:43.843Z · score: 2 (1 votes) · LW · GW

Thanks! Responded here: https://www.lesswrong.com/posts/EYEkYX6vijL7zsKEt/reward-functions-and-updating-assumptions-can-hide-a

Comment by stuart_armstrong on How should AIs update a prior over human preferences? · 2020-05-16T09:09:43.219Z · score: 2 (1 votes) · LW · GW

Thanks! Changed the title (and corrected a badly formatted footnote).

Comment by stuart_armstrong on How should AIs update a prior over human preferences? · 2020-05-15T19:19:25.854Z · score: 2 (1 votes) · LW · GW

I agree that for such a system, the optimal policy of the actor is to rig the estimator, and to "intentionally" bias it towards easy-to-satisfy rewards like "the human loves heroin".

The part that confuses me is why we're having two separate systems with different objectives where one system is dumb and the other system is smart.

We don't need to have two separate systems. There are two meanings to your "bias it towards" phrase: the first one is the informal human one, where "the human loves heroin" is clearly a bias. The second is some formal definition of what is biasing and what isn't. And the system doesn't have that. The "estimator" doesn't "know" that "the human loves heroin" is a bias; instead, it sees this as a perfectly satisfactory way of accomplishing its goals, according to the bridging function it's been given. There is no conflict between estimator and actor.

Imagine that you have a complex CIRL game that models the real world well but assumes that the human is Boltzmann-rational. [...] Such a policy is going to "try" to learn preferences, learn incorrectly, and then act according to those incorrect learned preferences, but it is not going to "intentionally" rig the learning process.

The AI would not see any of these actions as "rigging", even if we would.

It might think "hey, I should check whether the human likes heroin by giving them some", and then think "oh they really do love heroin, I should pump them full of it".

It will do this if it can't already predict the effect of giving them heroin.

It won't think "aha, if I give the human heroin, then they'll ask for more heroin, causing my Boltzmann-rationality estimator module to predict they like heroin, and then I can get easy points by giving humans heroin".

If it can predict the effect of giving humans heroin, it will think something like that. It will think: "if I give the humans heroin, they'll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin".

Comment by stuart_armstrong on Assessing Kurzweil predictions about 2019: the results · 2020-05-06T17:16:14.732Z · score: 5 (3 votes) · LW · GW

Strong upvote for writing your own predictions before seeing the 2019 graph.