## Posts

Syntax, semantics, and symbol grounding, simplified 2020-11-23T16:12:11.678Z
The ethics of AI for the Routledge Encyclopedia of Philosophy 2020-11-18T17:55:49.952Z
Extortion beats brinksmanship, but the audience matters 2020-11-16T21:13:18.822Z
Humans are stunningly rational and stunningly irrational 2020-10-23T14:13:59.956Z
Knowledge, manipulation, and free will 2020-10-13T17:47:12.547Z
Dehumanisation *errors* 2020-09-23T09:51:53.091Z
Anthropomorphisation vs value learning: type 1 vs type 2 errors 2020-09-22T10:46:48.807Z
Technical model refinement formalism 2020-08-27T11:54:22.534Z
Model splintering: moving from one imperfect model to another 2020-08-27T11:53:58.784Z
Learning human preferences: black-box, white-box, and structured white-box access 2020-08-24T11:42:34.734Z
AI safety as featherless bipeds *with broad flat nails* 2020-08-19T10:22:14.987Z
Learning human preferences: optimistic and pessimistic scenarios 2020-08-18T13:05:23.697Z
Strong implication of preference uncertainty 2020-08-12T19:02:50.115Z
"Go west, young man!" - Preferences in (imperfect) maps 2020-07-31T07:50:59.520Z
Learning Values in Practice 2020-07-20T18:38:50.438Z
The Goldbach conjecture is probably correct; so was Fermat's last theorem 2020-07-14T19:30:14.806Z
Why is the impact penalty time-inconsistent? 2020-07-09T17:26:06.893Z
Dynamic inconsistency of the inaction and initial state baseline 2020-07-07T12:02:29.338Z
Models, myths, dreams, and Cheshire cat grins 2020-06-24T10:50:57.683Z
Results of $1,000 Oracle contest! 2020-06-17T17:44:44.566Z
Comparing reward learning/reward tampering formalisms 2020-05-21T12:03:54.968Z
Probabilities, weights, sums: pretty much the same for reward functions 2020-05-20T15:19:53.265Z
Learning and manipulating learning 2020-05-19T13:02:41.838Z
Reward functions and updating assumptions can hide a multitude of sins 2020-05-18T15:18:07.871Z
How should AIs update a prior over human preferences? 2020-05-15T13:14:30.805Z
Distinguishing logistic curves 2020-05-15T11:38:04.516Z
Distinguishing logistic curves: visual 2020-05-15T10:33:08.901Z
Kurzweil's predictions' individual scores 2020-05-07T17:10:36.637Z
Assessing Kurzweil predictions about 2019: the results 2020-05-06T13:36:18.788Z
Maths writer/cowritter needed: how you can't distinguish early exponential from early sigmoid 2020-05-06T09:41:49.370Z
Consistent Glomarization should be feasible 2020-05-04T10:06:55.928Z
Last chance for assessing Kurzweil 2020-04-22T11:51:02.244Z
Databases of human behaviour and preferences? 2020-04-21T18:06:51.557Z
Solar system colonisation might not be driven by economics 2020-04-21T17:10:32.845Z
"How conservative" should the partial maximisers be? 2020-04-13T15:50:00.044Z
Assessing Kurzweil's 1999 predictions for 2019 2020-04-08T14:27:21.689Z
Call for volunteers: assessing Kurzweil, 2019 2020-04-02T12:07:57.246Z
Anthropics over-simplified: it's about priors, not updates 2020-03-02T13:45:11.710Z
If I were a well-intentioned AI... IV: Mesa-optimising 2020-03-02T12:16:15.609Z
If I were a well-intentioned AI... III: Extremal Goodhart 2020-02-28T11:24:23.090Z
If I were a well-intentioned AI... II: Acting in a world 2020-02-27T11:58:32.279Z
If I were a well-intentioned AI... I: Image classifier 2020-02-26T12:39:59.450Z
Other versions of "No free lunch in value learning" 2020-02-25T14:25:00.613Z
Subagents and impact measures, full and fully illustrated 2020-02-24T13:12:05.014Z
(In)action rollouts 2020-02-18T14:48:19.160Z
Counterfactuals versus the laws of physics 2020-02-18T13:21:02.232Z
Subagents and impact measures: summary tables 2020-02-17T14:09:32.029Z
Appendix: mathematics of indexical impact measures 2020-02-17T13:22:43.523Z
Stepwise inaction and non-indexical impact measures 2020-02-17T10:32:01.863Z
In theory: does building the subagent have an "impact"? 2020-02-13T14:17:23.880Z

## Comments

Comment by stuart_armstrong on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-13T10:04:21.800Z · LW · GW

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

Comment by stuart_armstrong on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-12T16:13:44.578Z · LW · GW

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2020-12-14T21:13:56.802Z · LW · GW

Thanks ^_^

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2020-12-14T21:13:30.119Z · LW · GW

I like that way of seeing it.

Comment by stuart_armstrong on Syntax, semantics, and symbol grounding, simplified · 2020-12-01T13:38:19.771Z · LW · GW

Express, express away ^_^

Comment by stuart_armstrong on Just another day in utopia · 2020-11-24T08:54:40.999Z · LW · GW

Enjoyed writing it, too.

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-23T13:44:41.434Z · LW · GW

Because a reputation for following up brinksmanship threats means that people won't enter into deals with you at all; extortion works because, to some extent, people have to "deal" with you even if they don't want to.

This is why I saw a Walmart-monopsony (monopolistic buyer) as closer to extortion, since not trading with them is not an option.

Comment by stuart_armstrong on The ethics of AI for the Routledge Encyclopedia of Philosophy · 2020-11-23T13:42:28.285Z · LW · GW

Kiitos!

Comment by stuart_armstrong on The ethics of AI for the Routledge Encyclopedia of Philosophy · 2020-11-20T17:12:12.145Z · LW · GW

Thanks!

Comment by stuart_armstrong on The ethics of AI for the Routledge Encyclopedia of Philosophy · 2020-11-20T17:11:45.395Z · LW · GW

Thanks!

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-20T17:10:30.134Z · LW · GW

I'm thinking of it this way: investigating a supplier to check they are reasonable costs $1 to Walmart. The minimum price any supplier will offer is $10. After investigating, one supplier offers $10.5. Walmart refuses, knowing the supplier will not go lower, and publicises the exchange.

The reason this is extortion, at least in the sense of this post, is that Walmart takes a cost (it will cost them at least $11 to investigate and hire another supplier) in order to build a reputation.

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T14:11:36.047Z · LW · GW

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T13:58:13.787Z · LW · GW

I think the distinction is, from the point of view of the extortioner, "would it be in my interests to try and extort , even if I know for a fact that cannot be extorted and would force me to act on my threat, to the detriment of myself in that situation?"

If the answer is yes, then it's extortion (in the meaning of this post). Trying to extort the un-extortable, then acting on the threat, makes sense as a warning to others.

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T13:54:58.803Z · LW · GW

That's a misspelling that's entirely my fault, and has now been corrected.

Comment by stuart_armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-17T17:49:16.093Z · LW · GW

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cutrate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy from you" is the same as "I will ruin you". Abstracting away the property rights (Walmart is definitely legally allowed to do this), this seems very much an extortion.

Comment by stuart_armstrong on Humans are stunningly rational and stunningly irrational · 2020-10-27T14:42:34.648Z · LW · GW

"within the limits of their intelligence" can mean anything, excuse any error, bias, and failure. Thus, they are not rational, and (from one perspective) very very far from it.

Comment by stuart_armstrong on Knowledge, manipulation, and free will · 2020-10-14T13:15:29.512Z · LW · GW

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning, see eg our paper here https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

Comment by stuart_armstrong on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-10-14T07:11:46.215Z · LW · GW

I have an article on "Anthropic decision theory", with the video version here.

Basically, it's not that the presumptuous philosopher is more likely to be right in a given universe, it's that there are far more presumptuous philosophers in the large universe. So if we count "how many presumptuous philosophers are correct", we get a different answer to "in how many universes is the presumptuous philosopher correct". These things only come apart in anthropic situations.

Comment by stuart_armstrong on Comparing reward learning/reward tampering formalisms · 2020-10-01T13:06:26.241Z · LW · GW

Stuart, by " is complex" are you referring to...

I mean that defining can be done in many different ways, and hence has a lot of contingent structure. In contrast, in , the $\rho$ is a complex distribution on , conditional on ; hence itself is trivial and just encodes "apply to and in the obvious way".

Comment by stuart_armstrong on Stuart_Armstrong's Shortform · 2020-09-25T14:05:41.962Z · LW · GW

This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg Facebook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-25T13:44:21.258Z · LW · GW

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T16:20:44.766Z · LW · GW

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:08:59.084Z · LW · GW

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on it - even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions)

Comment by stuart_armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:05:54.189Z · LW · GW

For instance throughout history people have been able to model and interact with traders from neighbouring or distant civilizations, even though they might think very differently.

Humans think very very similarly to each other, compared with random minds from the space of possible minds. For example, we recognise anger, aggression, fear, and so on, and share a lot of cultural universals https://en.wikipedia.org/wiki/Cultural_universal

Comment by stuart_armstrong on Why haven't we celebrated any major achievements lately? · 2020-09-10T09:53:44.294Z · LW · GW

There haven’t been as many big accomplishments.

I think we should look at the demand side, not the supply side. We are producing lots of technological innovations, but there aren't so many major problems left for them to solve. The flush toilet was revolutionary; a super-flush ecological toilet with integrated sensors that can transform into a table... is much more advanced from the supply side, but barely more from the demand side: it doesn't fulfil many more needs than the standard flush toilet.

Comment by stuart_armstrong on Model splintering: moving from one imperfect model to another · 2020-08-31T14:00:01.159Z · LW · GW

Cool, good summary.

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T11:54:44.529Z · LW · GW

Humans have a theory of mind that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:35:00.915Z · LW · GW

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also possible that another labelling would be clearer or more useful for our purposes. It might be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.

but "white box" is any source code that produces the same input-output behavior

What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).
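
As a minimal sketch of that point, here are two toy "white boxes" with opposite labelled preferences and identical black-box behaviour. Everything in it (the `TASTE` payoffs, the function names, the drinks) is a made-up illustration rather than anything from the post; it pairs a maximiser of one reward with a minimiser of the negated reward.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple

Action = str

# Hypothetical payoffs, invented purely for this illustration.
TASTE = {"tea": 2.0, "coffee": 1.0, "water": 0.0}


def likes_tea(action: Action) -> float:
    """Reward label used inside white box A."""
    return TASTE[action]


def dislikes_tea(action: Action) -> float:
    """Reward label used inside white box B: the exact negation of A's."""
    return -TASTE[action]


def white_box_a(menu: Sequence[Action]) -> Action:
    """A reward maximiser whose internal 'preference' is likes_tea."""
    return max(menu, key=likes_tea)


def white_box_b(menu: Sequence[Action]) -> Action:
    """A reward minimiser whose internal 'preference' is dislikes_tea."""
    return min(menu, key=dislikes_tea)


def all_menus(actions: Sequence[Action]) -> Iterator[Tuple[Action, ...]]:
    """Yield every non-empty menu of available actions."""
    for size in range(1, len(actions) + 1):
        yield from combinations(actions, size)


# Black-box test: compare the two agents' choices on every possible menu.
same_behaviour = all(
    white_box_a(menu) == white_box_b(menu) for menu in all_menus(tuple(TASTE))
)
print(same_behaviour)
```

Observing choices alone cannot tell these two apart, so reading "the" preference off the behaviour requires some extra assumption about which internal structure is the right one.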

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:28:11.984Z · LW · GW

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

Comment by stuart_armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-25T15:03:30.016Z · LW · GW

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

Comment by stuart_armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-20T13:09:27.588Z · LW · GW

Thanks! Useful insights in your post, to mull over.

Comment by stuart_armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-19T08:17:22.986Z · LW · GW

An imminent incoming post on this very issue ^_^

Comment by stuart_armstrong on For the past, in some ways only, we are moral degenerates · 2020-08-19T08:16:56.839Z · LW · GW

Yes, things like honour and anger serve important signalling and game-theoretic functions. But they also come to be valued intrinsically (the same way people like sex, rather than just wanting to spread their genes), and strongly valued. This makes it hard to agree that "oh, your sacred core value is only in the service of this hidden objective, so we can focus on that instead".

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-04T09:55:53.782Z · LW · GW

Cool, neat summary.

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-04T09:47:10.106Z · LW · GW

Sorry, had a terrible few days, and missed your message. How about Friday, 12pm UK time?

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-08-01T07:37:15.252Z · LW · GW

Stuart, I'm writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?

No prob. Email or Zoom/Hangouts/Skype?

Comment by stuart_armstrong on The ground of optimization · 2020-07-31T15:48:12.804Z · LW · GW

Very good. A lot of potential there, I feel.

Comment by stuart_armstrong on "Go west, young man!" - Preferences in (imperfect) maps · 2020-07-31T11:35:20.984Z · LW · GW

The information to distinguish between these interpretations is not within the request to travel west.

Yes, but I'd argue that most moral preferences are similarly underdefined when the various interpretations behind them come apart (eg purity).

Comment by stuart_armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2020-07-30T15:00:10.808Z · LW · GW

There are computer programs that can print their own code: https://en.wikipedia.org/wiki/Quine_(computing)

There are also programs which can print their own code and add something to it. Isn't that a way in which the program fully knows itself?
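
For concreteness, here is a small sketch of the first kind of program. The two-line quine stored in `QUINE` is a standard Python idiom; the `run_and_capture` helper is a hypothetical wrapper added only so the claim "the output equals the source" can be checked.

```python
import contextlib
import io

# A classic two-line Python quine, stored as text so it can be run and compared.
# The program keeps a template of itself in `s` and prints that template
# filled in with its own repr.
QUINE = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"


def run_and_capture(source: str) -> str:
    """Execute `source` in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


output = run_and_capture(QUINE)
print(output == QUINE)  # compares the program's output with its own source text
```

The same template trick extends to the second kind of program: the printed text can include extra lines, as long as the template also describes them.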

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-18T11:31:49.187Z · LW · GW

Thanks! It's cool to see his approach.

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-16T13:59:13.657Z · LW · GW

Wiles proved the presence of a very rigid structure - not the absence - and the presence of this structure implied FLT via the work of other mathematicians.

If you say that "Wiles proved the Taniyama–Shimura conjecture" (for semistable elliptic curves), then I agree: he's proved a very important structural result in mathematics.

If you say he proved Fermat's last theorem, then I'd say he's proved an important-but-probable lack of structure in mathematics.

So yeah, he proved the existence of structure in one area, and (hence) the absence of structure in another area.

And "to prove Fermat's last theorem, you have to go via proving the Taniyama–Shimura conjecture", is, to my mind, strong evidence for "proving lack of structure is hard".

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-16T13:51:33.270Z · LW · GW

You can see this as sampling times sorta-independently, or as sampling times with less independence (ie most sums are sampled twice).

Either view works, and as you said, it doesn't change the outcome.

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-15T22:18:15.038Z · LW · GW

Yes, I got that result too. The problem is that the prime number theorem isn't a very good approximation for small numbers. So we'd need a slightly more sophisticated model that has more low numbers.

I suspect that moving from "sampling with replacement" to "sampling without replacement" might be enough for low numbers, though.
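
A quick way to see how far the prime number theorem estimate sits from the true count at small sizes is to compare them directly. This is only a sketch: it uses the simplest form of the estimate, n / ln(n), which may not be the exact form used in the post, and the function names are made up for the illustration.

```python
import math


def primes_up_to(limit: int) -> list[int]:
    """Return all primes <= limit using a sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Cross out p*p, p*p + p, ... in a single slice assignment.
            multiples = range(p * p, limit + 1, p)
            is_prime[p * p :: p] = bytearray(len(multiples))
    return [n for n, flag in enumerate(is_prime) if flag]


def compare(limit: int) -> tuple[int, float]:
    """Return (exact prime count, n / ln(n) estimate) for n = limit."""
    exact = len(primes_up_to(limit))
    estimate = limit / math.log(limit)
    return exact, estimate


for limit in (10, 100, 1000, 10000):
    exact, estimate = compare(limit)
    print(f"n={limit:>6}  pi(n)={exact:>5}  n/ln(n)={estimate:9.1f}  "
          f"ratio={exact / estimate:.3f}")
```

The ratio column shows the relative gap between the exact count and the estimate in the range where the small cases live.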

Comment by stuart_armstrong on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2020-07-15T22:13:03.799Z · LW · GW

Note that the probabilistic argument fails for n=3 for Fermat's last theorem; call this (3,2) (power=3, number of summands is 2).

So we know (3,2) is impossible; Euler's conjecture is the equivalent of saying that (n+1,n) is also impossible for all n. However, the probabilistic argument fails for (n+1,n) the same way as it fails for (3,2). So we'd expect Euler's conjecture to fail, on probabilistic grounds.

In fact, the surprising thing on probabilistic grounds is that Fermat's last theorem is true for n=3.
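
One rough way to sketch the heuristic behind this comparison, under an assumption that is mine rather than the comment's: treat integers near $x$ as $k$-th powers "at random" with density about $\frac{1}{k}x^{1/k-1}$, write $(k,m)$ for power $k$ with $m$ summands as above, and ignore constant factors.

```latex
% Expected number of solutions of a_1^k + ... + a_m^k = c^k
% whose largest summand a_i is about N (constants ignored):
\[
  E_{k,m}(N) \;\approx\;
  \underbrace{m\,N^{m-1}}_{\text{tuples with largest entry } N}
  \times
  \underbrace{\tfrac{1}{k}\,N^{1-k}}_{\text{chance a sum near } N^{k} \text{ is a } k\text{-th power}}
  \;=\; \tfrac{m}{k}\,N^{m-k}.
\]
% Summing over N gives the expected total number of solutions:
\[
  \sum_{N \ge 1} E_{k,m}(N) < \infty
  \quad\Longleftrightarrow\quad k - m > 1 .
\]
% (k,m) = (n,2) with n >= 4:  sum of N^{2-n} converges, so few or no solutions expected.
% (k,m) = (3,2):              sum of 1/N diverges, so the argument gives no bound.
% (k,m) = (n+1,n):            sum of 1/N diverges in exactly the same way.
```

On this sketch, $(3,2)$ and $(n+1,n)$ land on the same divergent sum, which is the sense in which the argument fails for both.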

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-14T16:52:13.279Z · LW · GW

Good, cheers!

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T15:47:58.411Z · LW · GW

Another key reason for time-inconsistent preferences: bounded rationality.

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T15:39:57.790Z · LW · GW

Why do the absolute values cancel?

Because , so you can remove the absolute values.

Comment by stuart_armstrong on Dynamic inconsistency of the inaction and initial state baseline · 2020-07-07T12:54:38.276Z · LW · GW

Comment by stuart_armstrong on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-07T12:10:19.312Z · LW · GW

I also think the pedestrian example illustrates why we need more semantic structure: "pedestrian alive" -> "pedestrian dead" is bad, but "pigeon on road" -> "pigeon in flight" is fine.

Comment by stuart_armstrong on Is Molecular Nanotechnology "Scientific"? · 2020-07-07T12:04:14.663Z · LW · GW

Nope! Part of my own research has made me more optimistic about the possibilities of understanding and creating intelligence.