## Posts

Why unriggable *almost* implies uninfluenceable 2021-04-09T17:07:07.016Z
A possible preference algorithm 2021-04-08T18:25:25.855Z
If you don't design for extrapolation, you'll extrapolate poorly - possibly fatally 2021-04-08T18:10:52.420Z
Which counterfactuals should an AI follow? 2021-04-07T16:47:42.505Z
Toy model of preference, bias, and extra information 2021-03-24T10:14:34.629Z
Preferences and biases, the information argument 2021-03-23T12:44:46.965Z
Why sigmoids are so hard to predict 2021-03-18T18:21:51.203Z
Connecting the good regulator theorem with semantics and symbol grounding 2021-03-04T14:35:40.214Z
Cartesian frames as generalised models 2021-02-16T16:09:20.496Z
Generalised models as a category 2021-02-16T16:08:27.774Z
Counterfactual control incentives 2021-01-21T16:54:59.309Z
Short summary of mAIry's room 2021-01-18T18:11:36.035Z
Syntax, semantics, and symbol grounding, simplified 2020-11-23T16:12:11.678Z
The ethics of AI for the Routledge Encyclopedia of Philosophy 2020-11-18T17:55:49.952Z
Extortion beats brinksmanship, but the audience matters 2020-11-16T21:13:18.822Z
Humans are stunningly rational and stunningly irrational 2020-10-23T14:13:59.956Z
Knowledge, manipulation, and free will 2020-10-13T17:47:12.547Z
Dehumanisation *errors* 2020-09-23T09:51:53.091Z
Anthropomorphisation vs value learning: type 1 vs type 2 errors 2020-09-22T10:46:48.807Z
Technical model refinement formalism 2020-08-27T11:54:22.534Z
Model splintering: moving from one imperfect model to another 2020-08-27T11:53:58.784Z
Learning human preferences: black-box, white-box, and structured white-box access 2020-08-24T11:42:34.734Z
AI safety as featherless bipeds *with broad flat nails* 2020-08-19T10:22:14.987Z
Learning human preferences: optimistic and pessimistic scenarios 2020-08-18T13:05:23.697Z
Strong implication of preference uncertainty 2020-08-12T19:02:50.115Z
"Go west, young man!" - Preferences in (imperfect) maps 2020-07-31T07:50:59.520Z
Learning Values in Practice 2020-07-20T18:38:50.438Z
The Goldbach conjecture is probably correct; so was Fermat's last theorem 2020-07-14T19:30:14.806Z
Why is the impact penalty time-inconsistent? 2020-07-09T17:26:06.893Z
Dynamic inconsistency of the inaction and initial state baseline 2020-07-07T12:02:29.338Z
Models, myths, dreams, and Cheshire cat grins 2020-06-24T10:50:57.683Z
Results of 1,000 Oracle contest! 2020-06-17T17:44:44.566Z
Comparing reward learning/reward tampering formalisms 2020-05-21T12:03:54.968Z
Probabilities, weights, sums: pretty much the same for reward functions 2020-05-20T15:19:53.265Z
Learning and manipulating learning 2020-05-19T13:02:41.838Z
Reward functions and updating assumptions can hide a multitude of sins 2020-05-18T15:18:07.871Z
How should AIs update a prior over human preferences? 2020-05-15T13:14:30.805Z
Distinguishing logistic curves 2020-05-15T11:38:04.516Z
Distinguishing logistic curves: visual 2020-05-15T10:33:08.901Z
Kurzweil's predictions' individual scores 2020-05-07T17:10:36.637Z
Assessing Kurzweil predictions about 2019: the results 2020-05-06T13:36:18.788Z
Maths writer/co-writer needed: how you can't distinguish early exponential from early sigmoid 2020-05-06T09:41:49.370Z
Consistent Glomarization should be feasible 2020-05-04T10:06:55.928Z
Last chance for assessing Kurzweil 2020-04-22T11:51:02.244Z
Databases of human behaviour and preferences? 2020-04-21T18:06:51.557Z
Solar system colonisation might not be driven by economics 2020-04-21T17:10:32.845Z
"How conservative" should the partial maximisers be? 2020-04-13T15:50:00.044Z
Assessing Kurzweil's 1999 predictions for 2019 2020-04-08T14:27:21.689Z
Call for volunteers: assessing Kurzweil, 2019 2020-04-02T12:07:57.246Z
Anthropics over-simplified: it's about priors, not updates 2020-03-02T13:45:11.710Z

## Comments

Comment by Stuart_Armstrong on Which counterfactuals should an AI follow? · 2021-04-08T10:18:17.143Z · LW · GW

I like the subagent approach there.

Comment by Stuart_Armstrong on Counterfactual control incentives · 2021-04-05T18:05:13.218Z · LW · GW

Thanks. I think we mainly agree here.

Comment by Stuart_Armstrong on Preferences and biases, the information argument · 2021-03-24T07:31:41.903Z · LW · GW

No. But I expect that it would be much more in the right ballpark than other approaches, and I think it might be refined to be correct.
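Several of the posts listed above ("Why sigmoids are so hard to predict", "Distinguishing logistic curves", the note on how you can't distinguish early exponential from early sigmoid) circle one quantitative point, which a minimal sketch can illustrate. All the numbers here are hypothetical: a pure exponential fitted to the early portion of a logistic curve matches it almost perfectly, yet the two extrapolations diverge wildly past the (not yet visible) inflection point.

```python
import math

# Hypothetical logistic curve: carrying capacity K=100, rate r=1,
# inflection point at t0=10.
def logistic(t, K=100.0, r=1.0, t0=10.0):
    return K / (1 + math.exp(-r * (t - t0)))

# Observations from the EARLY part of the curve only (t in [0, 4],
# well before the inflection point).
ts = [i * 0.1 for i in range(41)]
ys = [logistic(t) for t in ts]

# Fit a pure exponential a*exp(r*t) by least squares on log(y).
n = len(ts)
sx = sum(ts)
sy = sum(math.log(y) for y in ys)
sxx = sum(t * t for t in ts)
sxy = sum(t * math.log(y) for t, y in zip(ts, ys))
r_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a_hat = math.exp((sy - r_hat * sx) / n)

# The exponential reproduces the early data almost perfectly...
max_rel_err = max(abs(a_hat * math.exp(r_hat * t) - y) / y
                  for t, y in zip(ts, ys))
print(f"max relative error on early data: {max_rel_err:.2%}")

# ...but past the inflection point the forecasts diverge: the logistic
# saturates near 100 while the exponential keeps exploding.
print(f"at t=20: logistic {logistic(20):.1f}, "
      f"exponential {a_hat * math.exp(r_hat * 20):.3g}")
```

Before the inflection point the dampening term in the logistic is numerically negligible, which is exactly why "it's generally possible to see where the inflection point is, when we're past it" and not before.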
Comment by Stuart_Armstrong on Preferences and biases, the information argument · 2021-03-24T07:30:52.389Z · LW · GW

Look at the paper linked for more details ( https://arxiv.org/abs/1712.05812 ). Basically, "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour that is strictly simpler than any explanation which includes human biases and bounded rationality.

Comment by Stuart_Armstrong on Why sigmoids are so hard to predict · 2021-03-19T15:55:12.821Z · LW · GW

"Aha! We seem to be past the inflection point!"

It's generally possible to see where the inflection point is, when we're past it.

Comment by Stuart_Armstrong on Why sigmoids are so hard to predict · 2021-03-19T12:00:22.603Z · LW · GW

Possibly. What would be the equivalent of a dampening term for a superexponential? A further growth term?

Comment by Stuart_Armstrong on Model splintering: moving from one imperfect model to another · 2021-03-03T13:16:19.175Z · LW · GW

But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation

I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.

Comment by Stuart_Armstrong on Model splintering: moving from one imperfect model to another · 2021-02-24T22:28:38.186Z · LW · GW

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent. I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).

Comment by Stuart_Armstrong on Model splintering: moving from one imperfect model to another · 2021-02-24T08:00:43.566Z · LW · GW

Thanks! Lots of useful insights in there.
So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.

Comment by Stuart_Armstrong on Generalised models as a category · 2021-02-19T15:30:07.142Z · LW · GW

Cheers! My opinion on category theory has changed a bit because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.

Comment by Stuart_Armstrong on Generalised models as a category · 2021-02-17T10:25:41.430Z · LW · GW

Thanks! Corrected both of those; is a subset of .

Comment by Stuart_Armstrong on Stuart_Armstrong's Shortform · 2021-02-17T09:31:40.649Z · LW · GW

Thanks! That's useful to know.

Comment by Stuart_Armstrong on Introduction to Cartesian Frames · 2021-02-16T16:24:59.921Z · LW · GW

Did posts on generalised models as a category and how one can see Cartesian frames as generalised models.

Comment by Stuart_Armstrong on Stuart_Armstrong's Shortform · 2021-02-05T14:12:45.939Z · LW · GW

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution. That's a $Q$ that defines $Q(A \mid B)$ for some but not all $A$ and $B$ (with $Q(A) = Q(A \mid \Omega)$ for $\Omega$ being the whole set of outcomes).

This is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call such a $P$ a full extension of $Q$.

Suppose that $Q(A \mid B)$ is not defined. We can, however, say that $Q(A \mid B) = x$ is a logical implication of $Q$ if every full extension $P$ has $P(A \mid B) = x$.

Eg: $Q(A)$, $Q(C \mid A)$, and $Q(C \mid \neg A)$ will logically imply the value of $Q(C)$.

Comment by Stuart_Armstrong on Introduction to Cartesian Frames · 2021-02-04T12:18:33.194Z · LW · GW

I like it. I'll think about how it fits with my ways of thinking (eg model splintering).
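The "logical implication" example in the partial-probability-distribution shortform above can be checked numerically. A minimal sketch, with hypothetical numbers (0.5, 0.2 and 0.6 are illustrative only): $Q(C)$ is pinned down even though it was never specified, because every full extension must obey the law of total probability.

```python
# Hypothetical partial probability distribution Q over events A and C:
# Q(A), Q(C|A) and Q(C|not-A) are defined, Q(C) is not.
Q_A, Q_C_given_A, Q_C_given_notA = 0.5, 0.2, 0.6

# Every full extension P must obey the law of total probability:
#   P(C) = P(A)*P(C|A) + P(not-A)*P(C|not-A)
# The right-hand side is fixed by Q, so all full extensions agree on P(C),
# making Q(C) = 0.4 a logical implication of Q.
implied_Q_C = Q_A * Q_C_given_A + (1 - Q_A) * Q_C_given_notA
print(implied_Q_C)  # 0.4
```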
Comment by Stuart_Armstrong on Counterfactual control incentives · 2021-02-03T12:58:17.393Z · LW · GW

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to.

Comment by Stuart_Armstrong on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-13T10:04:21.800Z · LW · GW

(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

Cheers, that would be very useful.

Comment by Stuart_Armstrong on AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy · 2021-01-12T16:13:44.578Z · LW · GW

(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)

I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1

Comment by Stuart_Armstrong on mAIry's room: AI reasoning to solve philosophical problems · 2020-12-14T21:13:30.119Z · LW · GW

I like that way of seeing it.
Comment by Stuart_Armstrong on Syntax, semantics, and symbol grounding, simplified · 2020-12-01T13:38:19.771Z · LW · GW

Express, express away ^_^

Comment by Stuart_Armstrong on Just another day in utopia · 2020-11-24T08:54:40.999Z · LW · GW

Enjoyed writing it, too.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-23T13:44:41.434Z · LW · GW

Because a reputation for following up brinksmanship threats means that people won't enter into deals with you at all; extortion works because, to some extent, people have to "deal" with you even if they don't want to. This is why I saw a Walmart-monopsony (monopolistic buyer) as closer to extortion, since not trading with them is not an option.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-20T17:10:30.134Z · LW · GW

I'm thinking of it this way: investigating a supplier to check they are reasonable costs $1 to Walmart. The minimum price any supplier will offer is $10. After investigating, one supplier offers $10.50. Walmart refuses, knowing the supplier will not go lower, and publicises the exchange.

The reason this is extortion, at least in the sense of this post, is that Walmart takes a cost (it will cost them at least $11 to investigate and hire another supplier) in order to build a reputation.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T14:11:36.047Z · LW · GW

The connection to AI alignment is combining the different utilities of different entities without extortion ruining the combination, and dealing with threats and acausal trade.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T13:58:13.787Z · LW · GW

I think the distinction is, from the point of view of the extortioner: "would it be in my interests to try and extort someone, even if I know for a fact that they cannot be extorted and would force me to act on my threat, to the detriment of myself in that situation?" If the answer is yes, then it's extortion (in the meaning of this post). Trying to extort the un-extortable, then acting on the threat, makes sense as a warning to others.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-18T13:54:58.803Z · LW · GW

That's a misspelling that's entirely my fault, and has now been corrected.

Comment by Stuart_Armstrong on Extortion beats brinksmanship, but the audience matters · 2020-11-17T17:49:16.093Z · LW · GW

(1) You say that releasing nude photos is in the blackmail category. But who's the audience?

The other people of whom you have nude photos, who are now incentivised to pay up rather than kick up a fuss.

(2) For n=1, m large: Is an example of brinkmanship here a monopolistic buyer who will only choose suppliers giving cutrate prices?

Interesting example that I hadn't really considered. I'd say it fits more under extortion than brinksmanship, though. A small supplier has to sell, or they won't stay in business. If there's a single buyer, "I won't buy from you" is the same as "I will ruin you".
Abstracting away the property rights (Walmart is definitely legally allowed to do this), this seems very much an extortion.

Comment by Stuart_Armstrong on Humans are stunningly rational and stunningly irrational · 2020-10-27T14:42:34.648Z · LW · GW

"Within the limits of their intelligence" can mean anything, and excuse any error, bias, and failure. Thus, they are not rational, and (from one perspective) very very far from it.

Comment by Stuart_Armstrong on Knowledge, manipulation, and free will · 2020-10-14T13:15:29.512Z · LW · GW

Some people (me included) value a certain level of non-manipulation. I'm trying to cash out that instinct. And it's also needed for some ideas like corrigibility. Manipulation also combines poorly with value learning; see eg our paper here: https://arxiv.org/abs/2004.13654

I do agree that saving the world is a clearly positive case of that ^_^

Comment by Stuart_Armstrong on The Presumptuous Philosopher, self-locating information, and Solomonoff induction · 2020-10-14T07:11:46.215Z · LW · GW

I have an article on "Anthropic decision theory", with the video version here. Basically, it's not that the presumptuous philosopher is more likely to be right in a given universe, it's that there are far more presumptuous philosophers in the large universe. So if we count "how many presumptuous philosophers are correct", we get a different answer to "in how many universes is the presumptuous philosopher correct". These things only come apart in anthropic situations.

Comment by Stuart_Armstrong on Comparing reward learning/reward tampering formalisms · 2020-10-01T13:06:26.241Z · LW · GW

Stuart, by " is complex" are you referring to...

I mean that defining can be done in many different ways, and hence has a lot of contingent structure. In contrast, in , the ρ is a complex distribution on , conditional on ; hence itself is trivial and just encodes "apply to and in the obvious way."
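The counting argument in the anthropic decision theory comment above can be made concrete. A toy sketch, with hypothetical population sizes (1 philosopher in the small universe, 1,000,000 in the large one, universes equally likely a priori): "fraction of universes in which the philosopher is right" and "fraction of philosophers who are right" give very different numbers.

```python
# Toy counting model with hypothetical numbers: two equally likely universes,
# a small one with 1 presumptuous philosopher and a large one with 1,000,000.
# Every philosopher bets "we are in the large universe".
small_phils, large_phils = 1, 1_000_000

# "In how many universes is the presumptuous philosopher correct?"
# Only in the large one: 1 universe out of 2.
frac_universes_correct = 1 / 2

# "How many presumptuous philosophers are correct?"
# All those in the large universe; the lone one in the small universe is wrong.
frac_philosophers_correct = large_phils / (small_phils + large_phils)

print(f"fraction of universes: {frac_universes_correct}")
print(f"fraction of philosophers: {frac_philosophers_correct:.6f}")
```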

Comment by Stuart_Armstrong on Stuart_Armstrong's Shortform · 2020-09-25T14:05:41.962Z · LW · GW

This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg Facebook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Comment by Stuart_Armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-25T13:44:21.258Z · LW · GW

A boundedly-rational agent is assumed to be mostly rational, failing to be fully rational because of a failure to figure things out in enough detail.

Humans are occasionally rational, often biased, often inconsistent, sometimes consciously act against their best interests, often follow heuristics without thinking, and sometimes do think things through. This doesn't seem to correspond to what is normally understood as "boundedly-rational".

Comment by Stuart_Armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T16:20:44.766Z · LW · GW

that paper was about fitting observations of humans to a mathematical model of "boundedly-rational agent pursuing a utility function"

It was "any sort of agent pursuing a reward function".

Comment by Stuart_Armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:08:59.084Z · LW · GW

We don't need a special module to get an everyday definition of doorknobs, and likewise I don't think we need a special module to get an everyday definition of human motivation.

I disagree. Doorknobs exist in the world (even if the category is loosely defined, and has lots of edge cases), whereas goals/motivations are interpretations that we put upon agents. The main result of the Occam's razor paper is that the goals of an agent are not something that you can know without putting your own interpretation on them - even if you know every physical fact about the universe. And two very different interpretations can be equally valid, with no way of distinguishing between them.

(I like the anthropomorphising/dehumanising symmetry, but I'm focusing on the aspects of dehumanising that cause you to make errors of interpretation. For example, out-groups are perceived as being coherent, acting in concert without disagreements, and often being explicitly evil. This is an error, not just a reduction in social emotions)
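A minimal sketch of the unidentifiability result mentioned above, as I understand it. The action names and reward values are hypothetical: a "fully rational" planner with one reward function and a "fully anti-rational" planner with the negated reward produce identical behaviour, so behavioural data alone cannot distinguish the two interpretations.

```python
# Hypothetical two-action setting: the same observed behaviour is explained
# equally well by two different (planner, reward) pairs.
rewards = {"left": 1.0, "right": 0.0}

def rational_policy(reward):
    # Interpretation 1: fully rational agent maximising its reward.
    return max(reward, key=reward.get)

def antirational_policy(reward):
    # Interpretation 2: fully anti-rational agent minimising its reward.
    return min(reward, key=reward.get)

# Negate the reward for the anti-rational interpretation.
flipped = {action: -r for action, r in rewards.items()}

# Both interpretations predict the same observable action, so no amount of
# behavioural data tells us which reward the agent "really" has.
assert rational_policy(rewards) == antirational_policy(flipped)
print("both interpretations predict:", rational_policy(rewards))
```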

Comment by Stuart_Armstrong on Anthropomorphisation vs value learning: type 1 vs type 2 errors · 2020-09-23T08:05:54.189Z · LW · GW

For instance throughout history people have been able to model and interact with traders from neighbouring or distant civilizations, even though they might think very differently.

Humans think very very similarly to each other, compared with random minds from the space of possible minds. For example, we recognise anger, aggression, fear, and so on, and share a lot of cultural universals: https://en.wikipedia.org/wiki/Cultural_universal

Comment by Stuart_Armstrong on Why haven't we celebrated any major achievements lately? · 2020-09-10T09:53:44.294Z · LW · GW

There haven’t been as many big accomplishments.

I think we should look at the demand side, not the supply side. We are producing lots of technological innovations, but there aren't so many major problems left for them to solve. The flush toilet was revolutionary; a super-flush ecological toilet with integrated sensors that can transform into a table... is much more advanced from the supply side, but barely more from the demand side: it doesn't fulfil many more needs than the standard flush toilet.

Comment by Stuart_Armstrong on Model splintering: moving from one imperfect model to another · 2020-08-31T14:00:01.159Z · LW · GW

Cool, good summary.

Comment by Stuart_Armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T11:54:44.529Z · LW · GW

Humans have a theory of mind, that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

Comment by Stuart_Armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:35:00.915Z · LW · GW

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but it's also possible that another labelling would be clearer or more useful for our purposes. It might be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.

but "white box" is any source code that produces the same input-output behavior

What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).
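A minimal illustration of that point (the sorting example is mine, not from the post): two structurally different programs (white boxes) with identical input-output behaviour (black box), so the white box cannot be read off from the black box.

```python
# Two structurally different "white boxes" with identical "black box"
# (input-output) behaviour: sorting a list.
def white_box_1(xs):
    # Built-in sort.
    return sorted(xs)

def white_box_2(xs):
    # Insertion sort: very different internals, same behaviour.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# No set of input-output observations distinguishes the two.
for case in [[3, 1, 2], [], [5, 5, 1], [2]]:
    assert white_box_1(case) == white_box_2(case)
print("same black-box behaviour from different white boxes")
```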

Comment by Stuart_Armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-26T09:28:11.984Z · LW · GW

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

Comment by Stuart_Armstrong on Learning human preferences: black-box, white-box, and structured white-box access · 2020-08-25T15:03:30.016Z · LW · GW

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

Comment by Stuart_Armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-20T13:09:27.588Z · LW · GW

Thanks! Useful insights in your post, to mull over.

Comment by Stuart_Armstrong on Learning human preferences: optimistic and pessimistic scenarios · 2020-08-19T08:17:22.986Z · LW · GW

An imminent incoming post on this very issue ^_^