AGIs as populations 2020-05-22T20:36:52.843Z · score: 20 (10 votes)
Multi-agent safety 2020-05-16T01:59:05.250Z · score: 21 (9 votes)
Competitive safety via gradated curricula 2020-05-05T18:11:08.010Z · score: 34 (9 votes)
Against strong bayesianism 2020-04-30T10:48:07.678Z · score: 49 (27 votes)
What is the alternative to intent alignment called? 2020-04-30T02:16:02.661Z · score: 10 (3 votes)
Melting democracy 2020-04-29T20:10:01.470Z · score: 26 (8 votes)
ricraz's Shortform 2020-04-26T10:42:18.494Z · score: 6 (1 votes)
What achievements have people claimed will be warning signs for AGI? 2020-04-01T10:24:12.332Z · score: 17 (7 votes)
What information, apart from the connectome, is necessary to simulate a brain? 2020-03-20T02:03:15.494Z · score: 17 (7 votes)
Characterising utopia 2020-01-02T00:00:01.268Z · score: 27 (8 votes)
Technical AGI safety research outside AI 2019-10-18T15:00:22.540Z · score: 36 (13 votes)
Seven habits towards highly effective minds 2019-09-05T23:10:01.020Z · score: 39 (10 votes)
What explanatory power does Kahneman's System 2 possess? 2019-08-12T15:23:20.197Z · score: 33 (16 votes)
Why do humans not have built-in neural i/o channels? 2019-08-08T13:09:54.072Z · score: 26 (12 votes)
Book review: The Technology Trap 2019-07-20T12:40:01.151Z · score: 30 (14 votes)
What are some of Robin Hanson's best posts? 2019-07-02T20:58:01.202Z · score: 36 (13 votes)
On alien science 2019-06-02T14:50:01.437Z · score: 46 (15 votes)
A shift in arguments for AI risk 2019-05-28T13:47:36.486Z · score: 32 (13 votes)
Would an option to publish to AF users only be a useful feature? 2019-05-20T11:04:26.150Z · score: 14 (5 votes)
Which scientific discovery was most ahead of its time? 2019-05-16T12:58:14.628Z · score: 40 (11 votes)
When is rationality useful? 2019-04-24T22:40:01.316Z · score: 29 (7 votes)
Book review: The Sleepwalkers by Arthur Koestler 2019-04-23T00:10:00.972Z · score: 75 (22 votes)
Arguments for moral indefinability 2019-02-12T10:40:01.226Z · score: 54 (18 votes)
Coherent behaviour in the real world is an incoherent concept 2019-02-11T17:00:25.665Z · score: 39 (17 votes)
Vote counting bug? 2019-01-22T15:44:48.154Z · score: 7 (2 votes)
Disentangling arguments for the importance of AI safety 2019-01-21T12:41:43.615Z · score: 124 (45 votes)
Comments on CAIS 2019-01-12T15:20:22.133Z · score: 74 (20 votes)
How democracy ends: a review and reevaluation 2018-11-27T10:50:01.130Z · score: 17 (9 votes)
On first looking into Russell's History 2018-11-08T11:20:00.935Z · score: 35 (11 votes)
Speculations on improving debating 2018-11-05T16:10:02.799Z · score: 26 (10 votes)
Implementations of immortality 2018-11-01T14:20:01.494Z · score: 21 (8 votes)
What will the long-term future of employment look like? 2018-10-24T19:58:09.320Z · score: 11 (4 votes)
Book review: 23 things they don't tell you about capitalism 2018-10-18T23:05:29.465Z · score: 19 (11 votes)
Book review: The Complacent Class 2018-10-13T19:20:05.823Z · score: 21 (9 votes)
Some cruxes on impactful alternatives to AI policy work 2018-10-10T13:35:27.497Z · score: 155 (54 votes)
A compendium of conundrums 2018-10-08T14:20:01.178Z · score: 12 (12 votes)
Thinking of the days that are no more 2018-10-06T17:00:01.208Z · score: 13 (6 votes)
The Unreasonable Effectiveness of Deep Learning 2018-09-30T15:48:46.861Z · score: 88 (28 votes)
Deep learning - deeper flaws? 2018-09-24T18:40:00.705Z · score: 43 (18 votes)
Book review: Happiness by Design 2018-09-23T04:30:00.939Z · score: 14 (6 votes)
Book review: Why we sleep 2018-09-19T22:36:19.608Z · score: 52 (25 votes)
Realism about rationality 2018-09-16T10:46:29.239Z · score: 174 (81 votes)
Is epistemic logic useful for agent foundations? 2018-05-08T23:33:44.266Z · score: 19 (6 votes)
What we talk about when we talk about maximising utility 2018-02-24T22:33:28.390Z · score: 27 (8 votes)
In Defence of Conflict Theory 2018-02-17T03:33:01.970Z · score: 30 (11 votes)
Is death bad? 2018-01-13T04:55:25.788Z · score: 8 (4 votes)


Comment by ricraz on The ground of optimization · 2020-07-01T08:34:39.024Z · score: 4 (2 votes) · LW · GW

That's weird; thanks for the catch. Fixed.

Comment by ricraz on The ground of optimization · 2020-06-28T17:27:11.780Z · score: 8 (4 votes) · LW · GW
But now every system is an optimizing system, because we can always come up with some preference ordering that explains a system as an optimizing system.

Hmmm, I'm a little uncertain about whether this is the case. E.g. suppose you have a box with a rock in it, in an otherwise empty universe. Nothing happens. You perturb the system by moving the rock outside the box. Nothing else happens in response. How would you describe this as an optimising system? (I'm assuming that we're ruling out the trivial case of a constant utility function; if not, we should analogously include the trivial case of all states being target states).

As a more general comment: I suspect that what happens once you start digging into what "perturbation" means, and what counts as a small or big perturbation, is that you run into the problem that a *tiny* perturbation can transform a highly optimising system into a non-optimising system (e.g. flicking the switch to turn off the AGI). In order to quantify the size of perturbations in an interesting way, you need the pre-existing concept of which subsystems are doing the optimisation.

My preferred solution to this is just to stop trying to define optimisation in terms of *outcomes*, and start defining it in terms of *computation* done by systems. E.g. a first attempt might be: an agent is an optimiser if it does planning via abstraction towards some goal. Then we can zoom in on what all these words mean, or what else we might need to include/exclude (in this case, we've ruled out evolution, so we probably need to broaden it). The broad philosophy here is that it's better to be vaguely right than precisely wrong. Unfortunately I haven't written much about this approach publicly - I briefly defend it in a comment thread on this post though.

Comment by ricraz on The ground of optimization · 2020-06-27T12:11:14.820Z · score: 14 (4 votes) · LW · GW

Two examples which I'd be interested in your comments on:

1. Consider adding a big black hole in the middle of a galaxy. Does this turn the galaxy into a system optimising for a really big black hole in the middle of the galaxy? (Credit for the example goes to Ramana Kumar).

2. Imagine that I have the goal of travelling as fast as possible. However, there is no set of states which you can point to as the "target states", since whatever state I'm in, I'll try to go even faster. This is another argument for, as I argue below, defining an optimising system in terms of increasing some utility function (rather than moving towards target states).

Comment by ricraz on The ground of optimization · 2020-06-23T20:28:46.549Z · score: 2 (1 votes) · LW · GW

It would work at least as well as the original proposal, because your utility function could just be whatever metric of "getting closer to the target states" would be used in the original proposal.

Comment by ricraz on Do Women Like Assholes? · 2020-06-23T12:51:26.580Z · score: 2 (1 votes) · LW · GW

Yes, this seems reasonable. I guess I'm curious about which of these traits is more robustly attractive. That is: assuming the ideal male protagonist is both an alpha male, and honorable and kind, would their attractiveness drop more if you removed just the "honorable and kind" bit, or just the "alpha male" bit? I suspect the latter, but that's just speculation. We might be able to get more quantitative data by seeing how many male protagonists fall into each category.

Comment by ricraz on Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate · 2020-06-22T09:58:43.622Z · score: 4 (2 votes) · LW · GW

Thanks for the post :) To be clear, I'm very excited about conceptual and deconfusion work in general, in order to come up with imprecise theories of rationality and intelligence. I guess this puts my position in world 1. The thing I'm not excited about is the prospect of getting to this final imprecise theory via doing precise technical research. In other words, I'd prefer HRAD work to draw more on cognitive science and less on maths and logic. I outline some of the intuitions behind that in this post.

Having said that, when I've critiqued HRAD work in the past, on a couple of occasions I've later realised that the criticism wasn't aimed at a crux for people actually working on it (here's my explanation of one of those cases). To some extent this is because, without a clearly-laid-out position to criticise, the critic has the difficult task of first clarifying the position then rebutting it. But I should still flag that I don't know how much HRAD researchers would actually disagree with my claims in the first paragraph.

Comment by ricraz on The ground of optimization · 2020-06-22T08:52:11.794Z · score: 18 (8 votes) · LW · GW

This seems great, I'll read and comment more thoroughly later. Two quick comments:

It didn't seem like you defined what it meant to evolve towards the target configuration set. So it seems like either you need to commit to the system actually reaching one of the target configurations to call it an optimiser, or you need some sort of metric over the configuration space to tell whether it's getting closer to or further away from the target configuration set. But if you're ranking all configurations anyway, then I'm not sure it adds anything to draw a binary distinction between target configurations and all the others. In other words, can't you keep the definition in terms of a utility function, but just add perturbations?
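To illustrate the "utility function plus perturbations" framing, here's a toy sketch (my own construction, not anything from the post — the function names and the one-dimensional state space are invented for illustration):

```python
import random

def robustly_increases_utility(step, utility, init_state,
                               n_perturb=20, horizon=50, noise=1.0, seed=0):
    """Toy check: from randomly perturbed starting states, does running the
    system's dynamics tend to increase `utility`? Returns the success rate."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_perturb):
        state = init_state + rng.uniform(-noise, noise)  # perturb the start
        initial_utility = utility(state)
        for _ in range(horizon):
            state = step(state)
        if utility(state) > initial_utility:
            successes += 1
    return successes / n_perturb

# A system that contracts towards 0, with utility = -distance from 0,
# robustly increases utility from perturbed starts; a system that
# drifts away from 0 does not.
contracting = robustly_increases_utility(lambda x: 0.9 * x, lambda x: -abs(x), 1.0)
drifting = robustly_increases_utility(lambda x: x + 1, lambda x: -abs(x), 1.0)
```

Note that no binary set of target states appears anywhere here: the ranking over states does all the work.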

Also, you don't cite Dennett here, but his definition has some important similarities. In particular, he defines several different types of perturbation (such as random perturbations, adversarial perturbations, etc) and says that a system is more agentic when it can withstand more types of perturbations. Can't remember exactly where this is from - perhaps The Intentional Stance?

Comment by ricraz on Do Women Like Assholes? · 2020-06-22T08:42:49.119Z · score: 9 (3 votes) · LW · GW

Thanks for actually doing some solid data analysis instead of just speculating on the internet :) Having said that, I'll now proceed to respond by speculating on the internet. Apologies.

I suspect that it'd be very helpful to disentangle "I prefer" into "I reflectively endorse being involved with" and "I am attracted to". Right now it seems like you're using some combination of those two. But people can be more attracted to things they reflectively endorse less, and may then act inconsistently, leading to different results when you look at different evidence sources.

One way to disentangle these two is to look at porn, where it's purely about attraction and you don't need to worry about what you actually endorse. And then you see things like Fifty Shades of Grey or 365 Days being very popular with women - where (especially in the latter) the male love interest's defining trait is being a bit of an asshole.

(I think the analogous thing for men might be: reflectively endorsing dating really strong, assertive women, but in practice being more attracted to quieter, shyer women).

Comment by ricraz on I'm leaving AI alignment – you better stay · 2020-06-20T17:01:18.609Z · score: 5 (3 votes) · LW · GW

I very much appreciate your efforts both in safety research and in writing this retrospective :)

For other people who are or will be in a similar position to you: I agree that focusing on producing results immediately is a mistake. I don't think that trying to get hired immediately is a big mistake, but I do think that trying to get hired specifically at an AI alignment research organisation is very limiting, especially if you haven't taken much time to study up on ML previously.

For example, I suspect that for most people there would be very little difference in overall impact between working as a research engineer on an AI safety team straight out of university, versus working as an ML engineer somewhere else for 1-2 years then joining an AI safety team. (Ofc this depends a lot on how much you already know + how quickly you learn + how much supervision you need).

Perhaps this wouldn't suit people who only want to do theoretical stuff - but given that you say that you find implementing ML fun, I'm sad you didn't end up going down the latter route. So this is a signal boost for others: there's a lot of ways to gain ML skills and experience, no matter where you're starting from - don't just restrict yourself to starting with safety.

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-06-16T20:24:00.401Z · score: 16 (5 votes) · LW · GW

But if you could always get arbitrarily high performance with long enough training, then claiming "the performance isn't high enough" is equivalent to saying "we haven't trained long enough". So it reduces to just one dimension of competitiveness, which is how steep the curve of improvement over time is on average.

For the actual reason I think it makes sense to separate these, see my other comment: you can't usually get arbitrarily high performance by training longer.

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-06-16T20:20:23.165Z · score: 10 (5 votes) · LW · GW
My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance.

I think this is incorrect. Most training setups eventually flatline, or close to it (e.g. see AlphaZero's ELO curve), and need algorithmic or other improvements to do better.
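As a toy model of the flatlining (purely illustrative — the ceiling and timescale are made up, not AlphaZero's actual numbers):

```python
import math

# Saturating learning curve: performance approaches a ceiling, so at some
# point more training compute buys almost no further improvement.
def performance(steps, ceiling=3500.0, tau=1000.0):
    return ceiling * (1.0 - math.exp(-steps / tau))

gain_early = performance(2000) - performance(1000)  # large early improvement
gain_late = performance(10000) - performance(9000)  # tiny late improvement
```

Under a curve shaped like this, "train longer" stops being a substitute for algorithmic improvements once you're near the ceiling.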

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-06-16T20:16:28.194Z · score: 4 (2 votes) · LW · GW

Nice gradings. I was confused about this one, though:

Narrow reward modeling + transparency tools: High TOW, Medium WIT.

Why is this the most trustworthy thing on the list? To me it feels like one of the least trustworthy, because it's one of the few where we haven't done recursive safety checks (especially given lack of adversarial training).

Comment by ricraz on What are the high-level approaches to AI alignment? · 2020-06-16T19:15:59.701Z · score: 10 (2 votes) · LW · GW

I suspect that nobody is actually pursuing the third one as you've described it. Rather, my impression is that MIRI researchers tend to think of decision theory as a more fundamental problem in understanding AI, not directly related to human interests.

Comment by ricraz on Two Kinds of Mistake Theorists · 2020-06-11T18:41:22.913Z · score: 5 (4 votes) · LW · GW

I wrote a defence of conflict theory here, in case that's of interest. (Also crossposted to LessWrong here). It has some similarities to your 0-sum/positive-sum framing (which I like) but more focused on historical examples.

Comment by ricraz on Does equanimity prevent negative utility? · 2020-06-11T18:33:27.189Z · score: 4 (2 votes) · LW · GW

I think you need to be a bit more precise about what you mean by "produce less negative utility", for this question to make sense. Do you mean something like "is morally wrong"? Or "something that is bad for me"?

Comment by ricraz on Goal-directedness is behavioral, not structural · 2020-06-11T01:12:20.011Z · score: 2 (1 votes) · LW · GW

By "internal structure" or "cognitive terms" I also mean what's inside the system, but usually at a higher level of abstraction than physical implementation. For instance, we can describe AlphaGo's cognition as follows: it searches through a range of possible games, and selects moves that do well in a lot of those games. If we just take the value network by itself (which is still very good at Go) without MCTS, then it's inaccurate to describe that network as searching over many possible games; it's playing Go well using only a subset of the type of cognition the full system does.

This differs from the intentional stance by paying more attention to what's going on inside the system, as opposed to just making inferences from behaviour. It'd be difficult to tell that the full AlphaGo system and the value network alone are doing different types of cognition, just from observing their behaviour - yet knowing that they do different types of cognition is very useful for making predictions about their behaviour on unobserved board positions.
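A toy contrast between the two kinds of cognition (my own illustrative sketch — a lookup-table "value network" on a number line, nothing resembling real AlphaGo):

```python
# States are integers; legal moves are +1/-1; V plays the role of a
# learned value function (here just a hand-picked lookup table).
V = {-2: 5, -1: 2, 0: 0, 1: 3, 2: 1}
v = lambda s: V.get(s, 0)
MOVES = (-1, 1)

def greedy_move(s):
    """Value function alone: pick the move whose immediate result looks best."""
    return max(MOVES, key=lambda m: v(s + m))

def search_move(s, depth=2):
    """Value function plus search: evaluate positions `depth` plies ahead."""
    def best_leaf(state, d):
        if d == 0:
            return v(state)
        return max(best_leaf(state + m, d - 1) for m in MOVES)
    return max(MOVES, key=lambda m: best_leaf(s + m, depth - 1))

greedy_move(0)   # picks +1, since v(1)=3 beats v(-1)=2
search_move(0)   # picks -1, since looking ahead reaches v(-2)=5
```

The two policies agree on many states but diverge where short-term value is misleading - exactly the kind of difference that's hard to spot from behaviour alone but easy to see from the cognition.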

What I was pointing at is that searching a definition (sorry, used the taboo word) of goal-directedness in terms of the internal structure (that is, the source code for example), is misguided.

You can probably guess what I'm going to say here: I still don't know what you mean by "definition", or why we want to search for it.

Comment by ricraz on Goal-directedness is behavioral, not structural · 2020-06-09T17:45:27.439Z · score: 6 (2 votes) · LW · GW

"What I argue for is that, insofar as the predictions you want to make are about behaviors, what the internal cognition and structure give you is a way to extrapolate the behavior, perhaps more efficiently. Not a useful definition of goal-directedness."

Perhaps it'd be useful to taboo the word "definition" here. We have this phenomenon, goal-directedness. Partly we think about it in cognitive terms; partly we think about it in behavioural terms. It sounds like you're arguing that the former is less legitimate. But clearly we're still going to think about it in both ways - they're just different levels of abstraction for some pattern in the world. Or maybe you're saying that it'll be easier to decompose it when we think about it on a behavioural level? But the opposite seems true to me - we're much better at reasoning about intentional systems than we are at abstractly categorising behaviour.

"What still stands for me is that given two systems with the same behavior, I want to give them the same goal-directedness."

I don't see how you can actually construct two generally intelligent systems which have this property, without them doing basically the same cognition. In theory, of course, you could do so using an infinite lookup table. But I claim that thinking about finite systems based on arguments about the infinite limit is often very misleading, for reasons I outline in this post. Here's a (somewhat strained) analogy: suppose that I'm trying to build a rocket, and I have this concept "length", which I'm using to make sure that the components are the right length. Now you approach me, and say "You're assuming that this rocket engine is longer than this door handle. But if they're both going at the speed of light, then they both become the same length! So in order to build a rocket, you need a concept of length which is robust to measuring things at light speed."

To be more precise, my argument is: knowing that two AGIs have exactly the same behaviour but cognition which we evaluate as differently goal-directed is an epistemic situation that is so far removed from what we might ever experience that it shouldn't inform our everyday concepts.

Comment by ricraz on Goal-directedness is behavioral, not structural · 2020-06-09T03:54:37.276Z · score: 7 (4 votes) · LW · GW
For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.

This seems like a bad argument to me, because goal-directedness is not meant to be a complete determinant of safety and competitiveness; other things matter too. As an analogy, one property of my internal cognition is that sometimes I am angry. We like to know whether people are angry because (amongst other things) it helps us predict whether they are safe to be around - but there's nothing inconsistent about two people with the same level of anger being differently safe (e.g. because one of them is also tired and decides to go sleep instead of starting a fight).

If we tried to *define* anger in terms of behaviour, then I predict we'd have a very difficult time, and end up not being able to properly capture a bunch of important aspects of it (like: being angry often makes you fantasise about punching people; or: you can pretend to be angry without actually being angry), because it's a concept that's most naturally formulated in terms of internal state and cognition. The same is true for goal-directedness - in fact you agree that the main way we get evidence about goal-directedness in practice is by looking at, and making inferences about, internal cognition. If we think of a concept in cognitive terms, and learn about it in cognitive terms, then I suspect that trying to define it in behavioural terms will only lead to more confusion, and similar mistakes to those that the behaviourists made.

On the more general question of how tractable and necessary a formalism is - leaving aside AI, I'd be curious if you're optimistic about the prospect of formalising goal-directedness in humans. I think it's pretty hopeless, and don't see much reason that this would be significantly easier for neural networks. Fortunately, though, humans already have very sophisticated cognitive machinery for reasoning in non-mathematical ways about other agents.

Comment by ricraz on Growing Independence · 2020-06-07T22:44:19.718Z · score: 18 (8 votes) · LW · GW

I haven't thought much about parenting in general, and don't have kids. Overall this seems like an interesting and probably valuable approach. But it also feels very individualistic, which would make me concerned in applying this myself. I expect that most Westerners, and especially Americans, are already too individualistic, and that it'd be useful for them to think of families and friendships more as cohesive units - as opposed to combinations of individuals, which is what it seems like your approach pushes towards. The two most salient examples of this for me:

Often they wanted me to lift them or support them in their climbing, and I wouldn't.


I told her that if she left the trike it would be available for anyone to take. And that I would probably take it, but it would then be my trike.

This sort of distinction between parents' property and children's property seems strange to me. What would be the consequences of this becoming your trike - especially given that you bought it for her in the first place? Apparently it worked in this case, but still... idk.

Comment by ricraz on Reply to Paul Christiano's “Inaccessible Information” · 2020-06-05T15:39:18.664Z · score: 14 (11 votes) · LW · GW

"But it seems strange to describe this approach as 'hope you can find some other way to produce powerful AI', as though we know of no other approach to engineering sophisticated systems other than search."

If I had to summarise the history of AI in one sentence, it'd be something like: a bunch of very smart people spent a long time trying to engineer sophisticated systems without using search, and it didn't go very well until they started using very large-scale search.

I'd also point out that the most sophisticated systems we can currently engineer are much less complex than brains. So the extent to which this analogy applies seems to me to be fairly limited.

Comment by ricraz on An overview of 11 proposals for building safe advanced AI · 2020-05-30T22:17:07.715Z · score: 4 (2 votes) · LW · GW

"I usually don’t think about outer amplification as what happens with optimal policies"

Do you mean outer alignment?

Comment by ricraz on AGIs as populations · 2020-05-29T08:19:17.565Z · score: 2 (1 votes) · LW · GW

But I can't do the wrong thing, by my standards of value, if my "value system no longer applies". So that's part of what I'm trying to tease out.

Another part is: I'm not sure if Wei thinks this is just a governance problem (i.e. we're going to put people in charge who do the wrong thing, despite some people advocating caution) or a more fundamental problem that nobody would do the right thing.

If the former, then I'd characterise this more as "more power magnifies leadership problems". But maybe it won't, because there's also a much larger space of morally acceptable things you can do. It just doesn't seem that easy to me to accidentally commit a moral catastrophe if you've got a huge amount of power, let alone an irreversible one. But maybe this is just because I don't know of whatever possible examples Wei thinks about.

Comment by ricraz on AGIs as populations · 2020-05-28T19:15:15.304Z · score: 9 (2 votes) · LW · GW

My thoughts on each of these. The common thread is that it seems to me you're using abstractions at way too high a level to be confident that they will actually apply, or that they even make sense in those contexts.

AGIs and economies of scale

  • Do we expect AGIs to be so competitive that reducing coordination costs is a big deal? I expect that the dominant factor will be AGI intelligence, which will vary enough that changes in coordination costs aren't a big deal. Variations in human intelligence have a huge effect, and presumably variations in AGI intelligence will be much bigger.
  • There's an obvious objection to giving one AGI all of your resources, which is "how do you know it's aligned"? And this seems like an issue where there'd be unified dissent from people worried about both short-term and long-term safety.
  • Oh, another concern: if they're all intent aligned to the same person, then this amounts to declaring that person dictator. Which is often quite a difficult thing to convince people to do.
  • Consider also that we'll be in an age of unprecedented plenty, once we have aligned AGIs that can do things for us. So I don't see why economic competition will be very strong. Perhaps military competition will be strong, but will countries really be converting so much of their economy to military spending that they need this edge to keep up?

So this seems possible, but very far from a coherent picture in my mind.

Some thoughts on metaphilosophy

  • There are a bunch of fun analogies here. But it is very unclear to me what you mean by "philosophy" here, since most, or perhaps all, of your descriptions would be equally applicable to "thinking" or "reasoning". The model you give of philosophy is also a model of choosing the next move in the game of chess, and countless other things.
  • Similarly, what is metaphilosophy, and what would it mean to solve it? Reach a dead end? Be able to answer any question? Why should we think that the concept of a "solution" to metaphilosophy makes any sense?

Overall, this post feels like it's pointing at something interesting, but I don't know whether it actually communicated any content to me. Like, is the point of the sections headed "Philosophy as interminable debate" and "Philosophy as Jürgen Schmidhuber's General TM" just to say that we can never be certain of any proposition? As written, the post is consistent both with you having some deep understanding of metaphilosophy that I just am not comprehending, and also with you using this word in a nonsensical way.

Two Neglected Problems in Human-AI Safety

  • "There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to." There are plenty of reasons to think that we don't have similar problems - for instance, we're much smarter than the ML systems on which we've seen adversarial examples. Also, there are lots of us, and we keep each other in check.
  • "For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers." What does this actually look like? Suppose I'm made the absolute ruler of a whole virtual universe - that's a lot of power. How might my value system "not keep up"?
  • The second half of this post makes a lot of sense to me, in large part because you can replace "corrupt human values" with "manipulate people", and then it's very analogous to problems we face today. Even so, a *lot* of additional work would need to be done to make a plausible case that this is an existential risk.
  • "An objective that is easy to test/measure (just check if the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you)". Since when was it easy to "just check" someone's values? Like, are you thinking of an AI reading them off our neurons?

Here's a slightly stretched analogy to try and explain my overall perspective. If you talked to someone born a thousand years ago about the future, they might make claims like "the most important thing is making progress on metatheology", or "corruption of our honour is an existential risk", or "once instantaneous communication exists then economies of scale will be so great that countries will be forced to nationalise all their resources". How do we distinguish our own position from theirs? The only way is to describe our own concepts at a level of clarity and detail that they just couldn't have managed. So what I want is a description of what "metaphilosophy" is such that it would have been impossible to give an equally clear description of "metatheology" without realising that this concept is not useful or coherent. Maybe that's too high a target, but I think it's one we should keep in mind as what is *actually necessary* to reason at such an abstract level without getting into confusion.

Comment by ricraz on Speculations on the Future of Fiction Writing · 2020-05-28T18:14:57.256Z · score: 22 (9 votes) · LW · GW

Only tangentially related:

I should totally have expected this, but boy are film budgets heavy-tailed. According to your link, Terminator 3 spent $35 million on its cast, which consisted of:

  • Arnold Schwarzenegger: $29.25 million + 20% gross profits
  • Arnold's perks: $1.5 million
  • Rest of principal cast: $3.85 million
  • Extras: $450,000

Arnold's perks alone might have cost more than any other actor on set... (although it's not subdivided finely enough to know for sure).
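Just to make the heavy tail explicit, here's the arithmetic on the figures quoted above:

```python
# Cast budget figures from the breakdown above, in millions of dollars.
arnold_salary = 29.25
arnold_perks = 1.5
rest_of_principal_cast = 3.85
extras = 0.45

total = arnold_salary + arnold_perks + rest_of_principal_cast + extras
arnold_share = (arnold_salary + arnold_perks) / total
# total comes to ~35.05, and Arnold (salary + perks) takes roughly 88% of
# it -- before even counting his 20% of gross profits.
```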

Comment by ricraz on AGIs as populations · 2020-05-27T09:17:23.612Z · score: 2 (1 votes) · LW · GW

What do you mean by "hard to resolve to convo with Richard"? I can't parse that grammar.

I didn't downvote those comments, but if you interpret me as saying "More rigour for important arguments please", and Wei as saying "I'm too lazy to provide this rigour", then I can see why someone might have downvoted them.

Like, on one level I'm fine with Wei having different epistemic standards to me, and I appreciate his engagement. And I definitely don't intend my arguments as attacks on Wei specifically, since he puts much more effort into making intellectual progress than almost anyone on this site.

But on another level, the whole point of this site is to have higher epistemic standards, and (I would argue) the main thing preventing that is just people being so happy to accept blog-post-sized insights without further scrutiny.

Comment by ricraz on AGIs as populations · 2020-05-27T09:00:31.628Z · score: 2 (1 votes) · LW · GW

As opposed to coming up with powerful and predictive concepts, and refining them over time. Of course argument and counterargument are crucial to that, so there's no sharp line between this and "patching", but for me the difference is: are you starting with the assumption that the idea is fundamentally sound, and you just need to fix it up a bit to address objections? If you are in that position despite not having fleshed out the idea very much, that's what I'd characterise as "patching your way to good arguments".

Comment by ricraz on AGIs as populations · 2020-05-26T23:52:38.113Z · score: -1 (4 votes) · LW · GW

Mostly "Wei Dai should write a blogpost that more clearly passes your "sniff test" of "probably compelling enough to be worth more of my attention"". And ideally a whole sequence or a paper.

It's possible that Wei has already done this, and that I just haven't noticed. But I had a quick look at a few of the blog posts linked in the "Disjunctive scenarios" post, and overall they seem pretty short and non-concrete, even for blog posts. Also, there are literally thirty items on the list, which makes it hard to know where to start (and also suggests low average quality of items). Hence my asking Wei for one which is unusually worth engaging with; if I'm positively surprised, I'll probably ask for another.

Comment by ricraz on AGIs as populations · 2020-05-26T23:16:23.136Z · score: 2 (1 votes) · LW · GW
Many of my "disjunctive" arguments were written specifically with that scenario in mind.

Cool, makes sense. I retract my pointed questions.

I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You're not just claiming "dangerous", you're claiming something like "more dangerous than anything else has ever been, even if it's intent-aligned". This is an incredibly bold claim and requires correspondingly thorough support.

does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

Actually, COVID makes me a little more optimistic. First, because quite a few countries are handling it well. Second, because I wasn't even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long. But they did. Also, essential services have proven much more robust than I'd expected (I thought there would be food shortages, etc.).

Comment by ricraz on AGIs as populations · 2020-05-26T23:07:53.648Z · score: -1 (4 votes) · LW · GW

I'm pretty skeptical of this as a way of making progress. It's not that I already have strong disagreements with your arguments. But rather, if you haven't yet explained them thoroughly, I expect them to be underspecified, and use some words and concepts that are wrong in hard-to-see ways. One way this might happen is if those arguments use concepts (like "metaphilosophy") that kinda intuitively seem like they're pointing at something, but come with a bunch of connotations and underlying assumptions that make actually understanding them very tricky.

So my expectation for what happens here is: I look at one of your arguments, formulate some objection X, and then you say either: "No, that wasn't what I was claiming", or "Actually, ~X is one of the implicit premises", or "Your objection doesn't make any sense in the framework I'm outlining" and then we repeat this a dozen or more times. I recently went through this process with Rohin, and it took a huge amount of time and effort (both here and in private conversation) to get anywhere near agreement, despite our views on AI being much more similar than yours and mine.

And even then, you'll only have fixed the problems I'm able to spot, and not all the others. In other words, I think of patching your way to good arguments as kinda like patching your way to safe AGI. (To be clear, none of this is meant as specific criticism of your arguments, but rather as general comments about any large-scale arguments using novel concepts that haven't been made very thoroughly and carefully).

Having said this, I'm open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

Comment by ricraz on AGIs as populations · 2020-05-26T09:48:04.591Z · score: 7 (2 votes) · LW · GW
my own epistemic state, which is that arguments for AI risk are highly disjunctive, most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories)

Yeah, I guess I'm not surprised that we have this disagreement. To briefly sketch out why I disagree (mostly for common knowledge; I don't expect this to persuade you):

I think there's something like a logistic curve for how seriously we should take arguments. Almost all arguments are bad, and have many many ways in which they might fail. This is particularly true for arguments trying to predict the future, since they have to invent novel concepts to do so. Only once you've seen a significant amount of work put into exploring an argument, the assumptions it relies on, and the ways it might be wrong, should you start to assign moderate probability that the argument is true, and that the concepts it uses will in hindsight make sense.

Most of the arguments mentioned in your post on disjunctive safety arguments fall far short of any reasonable credibility threshold. Most of them haven't even had a single blog post which actually tries to scrutinise them in a critical way, or lay out their key assumptions. And to be clear, a single blog post is just about the lowest possible standard you might apply. Perhaps it'd be sufficient in a domain where claims can be very easily verified, but when we're trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

This is not an argument for dismissing all of these possible mechanisms out of hand, but an argument that they shouldn't (yet) be given high credence. I think they are often given too high credence because there's a sort of halo effect from the arguments which have been explored in detail, making us more willing to consider arguments that in isolation would seem very out-there. When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents? (Maybe you'd say that there are legitimate links between these arguments, e.g. common premises - but if so, they're not highly disjunctive).

Getting to an AGI that can safely do human or superhuman level safety work would be a success story in itself, which I labeled "Research Assistant" in my post

Good point, I shall read that post more carefully. I still don't think that this post is tied to the Research Assistant success story though.

Comment by ricraz on AGIs as populations · 2020-05-23T08:42:05.339Z · score: 5 (3 votes) · LW · GW

My thought process when I use "safer" and "less safe" in posts like this is: the main arguments that AGI will be unsafe depend on it having certain properties, like agency, unbounded goals, lack of interpretability, desire and ability to self-improve, and so on. So reducing the extent to which it has those properties will make it safer, because those arguments will be less applicable.

I guess you could have two objections to this:

  • Maybe safety is non-monotonic in those properties.
  • Maybe you don't get any increase in safety until you hit a certain threshold (corresponding to some success story).

I tend not to worry so much about these two objections because, to me, the properties I outlined above are still too vague for us to have a good idea of the landscape of risks with respect to those properties. Once we know what agency is, we can talk about its monotonicity. For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal, reducing it should reduce risk.

I like the idea of tying safety ideas to success stories in general, though, and will try to use it for my next post, which proposes more specific interventions during deployment. Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.

Comment by ricraz on AGIs as populations · 2020-05-22T23:33:47.287Z · score: 2 (1 votes) · LW · GW

Nothing in particular. My main intention with this post was to describe a way the world might be, and some of the implications. I don't think such work should depend on being related to any specific success story.

Comment by ricraz on Multi-agent safety · 2020-05-19T10:29:48.504Z · score: 6 (3 votes) · LW · GW

I'm hoping there's a big qualitative difference between fine-tuning on the CEO task versus the "following instructions" task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through training. (And 99% is probably an underestimate: the last 10000 years of civilisation are much less than 1% of the time we've spent evolving from, say, the first mammals).

Plus, on the "following instructions" task you can add instructions which specifically push against whatever initial motivations the agents had, which is much harder to do on the CEO task.

I agree that this is a concern though.

Comment by ricraz on Multi-agent safety · 2020-05-16T11:22:45.756Z · score: 4 (2 votes) · LW · GW

I should clarify that when I think about obedience, I'm thinking obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I'm open to arguments about whether it's better to talk in terms of one or the other. I guess I favour "obedience" because it has fewer connotations of agency - if you're "doing what a human wants you to do", then you might run off and do things before receiving any instructions. (Also because it's shorter and pithier - "the goal of doing what humans want" is a bit of a mouthful).

Comment by ricraz on Competitive safety via gradated curricula · 2020-05-12T14:57:27.822Z · score: 2 (1 votes) · LW · GW

Yeah, so I guess opinions on this would differ depending on how likely people think existential risk from AGI is. Personally, it's clear to me that agentic misaligned superintelligences are bad news - but I'm much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that's much more intelligent than humans and much less agentic seems quite plausible - even, perhaps, in a RL agent.

Comment by ricraz on Against strong bayesianism · 2020-05-03T20:40:54.440Z · score: 5 (3 votes) · LW · GW

I think some parts of it do - e.g. in this post. But yes, I do really like Chapman's critique and wish I'd remembered about it before writing this so that I could reference it and build on it.

Especially: Understanding informal reasoning is probably more important than understanding technical methods. I very much agree with this.

Comment by ricraz on Against strong bayesianism · 2020-05-03T20:31:47.274Z · score: 6 (3 votes) · LW · GW

Yes, I saw Chapman's critiques after someone linked one in the comments below, and broadly agree with them.

I also broadly agree with the conclusion that you quote; that seems fairly similar to what I was trying to get at in the second half of the post. But in the first half of the post, I was also trying to gesture at a mistake made not by people who want simple, practical insights, but rather people who do research in AI safety, learning human preferences, and so on, using mathematical models of near-ideal reasoning. However, it looks like making this critique thoroughly would require much more effort than I have time for.

Comment by ricraz on Against strong bayesianism · 2020-05-03T19:47:13.865Z · score: 3 (2 votes) · LW · GW
Game-theoretic concepts like social dilemma, equilibrium selection, costly signaling, and so on seem indispensable

I agree with this. I think I disagree that "stating them crisply" is indispensable.

I wouldn’t know where to start if I couldn’t model agents using Bayesian tools.

To be a little contrarian, I want to note that this phrasing has a certain parallel with the streetlight effect: you wouldn't know how to look for your keys if you didn't have the light from the streetlamp. In particular, this is also what someone would say if we currently had no good methods for modelling agents, but Bayesian tools were simply the ones which seemed most promising.

Anyway, I'd be interested in having a higher-bandwidth conversation with you about this topic. I'll get in touch :)

Comment by ricraz on Against strong bayesianism · 2020-05-03T18:28:50.222Z · score: 2 (1 votes) · LW · GW
There is a third use of Bayesianism, the way that sophisticated economists and political scientists use it: as a useful fiction for modeling agents who try to make good decisions in light of their beliefs and preferences. I’d guess that this is useful for AI, too. These will be really complicated systems and we don’t know much about their details yet, but it will plausibly be reasonable to model them as “trying to make good decisions in light of their beliefs and preferences”.
Perhaps a fourth use is that we might actively want to try to make our systems more like Bayesian reasoners, at least in some cases.

My post was intended to critique these positions too. In particular, the responses I'd give are that:

  • There are many ways to model agents as “trying to make good decisions in light of their beliefs and preferences”. I expect bayesian ideas to be useful for very simple models, where you can define a set of states to have priors and preferences over. For more complex and interesting models, I think most of the work is done by considering the cognition the agents are doing, and I don't think bayesianism gives you particular insight into that for the same reasons I don't think it gives you particular insight into human cognition.
  • In response to "The Bayesian framework plausibly allows us to see failure modes that are common to many boundedly rational agents": in general I believe that looking at things from a wide range of perspectives allows you to identify more failure modes - for example, thinking of an agent as a chaotic system might inspire you to investigate adversarial examples. Nevertheless, apart from this sort of inspiration, I think that the bayesian framework is probably harmful when applied to complex systems because it pushes people into using misleading concepts like "boundedly rational" (compare your claim with the claim that a model in which all animals are infinitely large helps us identify properties that are common to "boundedly sized" animals).
  • "We might actively want to try to make our systems more like Bayesian reasoners": I expect this not to be a particularly useful approach, insofar as bayesian reasoners don't do "reasoning". If we have no good reason to think that explicit utility functions are something that is feasible in practical AGI, except that it's what ideal bayesian reasoners do, then I want to discourage people from spending their time on that instead of something else.

Comment by ricraz on Against strong bayesianism · 2020-05-02T05:36:22.125Z · score: 2 (1 votes) · LW · GW

I'm a little confused by this one, because in your previous response you say that you think Bob accurately represents Eliezer's position, and now you seem to be complaining about the opposite?

Comment by ricraz on ricraz's Shortform · 2020-05-01T10:56:36.212Z · score: 4 (2 votes) · LW · GW
Instead I read it as something like "some unreasonable percentage of an agent's actions are random"

This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don't think this helps.

We care about utility-maximizers because they're doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be.

To be pedantic: we care about "consequence-desirability-maximisers" (or in Rohin's terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.

And so if I read the original post as "the further a robot's behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals"

What do you mean by optimal here? The robot's observed behaviour will be optimal for some utility function, no matter how long you run it.

Comment by ricraz on Against strong bayesianism · 2020-05-01T10:45:55.503Z · score: 16 (6 votes) · LW · GW

The very issue in question here is what this set of tools tells us about the track record of the machine. It could be uninformative because there are lots of other things that come from the machine that we are ignoring. Or it could be uninformative because they didn't actually come from the machine, and the link between them was constructed post-hoc.

Comment by ricraz on Against strong bayesianism · 2020-05-01T10:34:39.661Z · score: 3 (2 votes) · LW · GW

I agree that I am not critiquing "Bayesianism to the rest of the world", but rather a certain philosophical position that I see as common amongst people reading this site. For example, I interpret Eliezer as defending that position here (note that the first paragraph is sarcastic):

Clearly, then, a Carnot engine is a useless tool for building a real-world car.  The second law of thermodynamics, obviously, is not applicable here.  It's too hard to make an engine that obeys it, in the real world.  Just ignore thermodynamics - use whatever works.
This is the sort of confusion that I think reigns over they who still cling to the Old Ways.
No, you can't always do the exact Bayesian calculation for a problem.  Sometimes you must seek an approximation; often, indeed.  This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms.  Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.

Also, insofar as AIXI is a "hypothetical general AI that doesn't understand that its prediction algorithms take time to run", I think "strawman" is a little inaccurate.

Anyway, thanks for the comment. I've updated the first paragraph to make the scope of this essay clearer.

Comment by ricraz on Against strong bayesianism · 2020-04-30T23:30:31.496Z · score: 3 (2 votes) · LW · GW
All of the applications in which Bayesian statistics/ML methods work so well. All the robotics/AI/control theory applications where Bayesian methods are used in practice.

This does not really seem like much evidence to me, because for most of these cases non-bayesian methods work much better. I confess I personally am in the "throw a massive neural network at it" camp of machine learning; and certainly if something with so little theoretical validation works so well, it makes one question whether the sort of success you cite really tells us much about bayesianism in general.

All of the psychology/neuroscience research on human intelligence approximating Bayesianism.

I'm less familiar with this literature. Surely human intelligence *as a whole* is not a very good approximation to bayesianism (whatever that means). And it seems like most of the heuristics and biases literature is specifically about how we don't update very rationally. But at a lower level, I defer to your claim that modules in our brain approximate bayesianism.

Then I guess the question is how to interpret this. It certainly feels like a point in favour of some interpretation of bayesianism as a general framework. But insofar as you're thinking about an interpretation which is being supported by empirical evidence, it seems important for someone to formulate it in such a way that it could be falsified. I claim that the way bayesianism has been presented around here (as an ideal of rationality) is not a falsifiable framework, and so at the very least we need someone else to make the case for what they're standing for.

Comment by ricraz on What is the alternative to intent alignment called? · 2020-04-30T22:38:11.912Z · score: 2 (1 votes) · LW · GW

Probably H intends A to achieve a narrow subset of H's goals, but doesn't necessarily want A pursuing them in general.

Similarly, if I have an employee, I may intend for them to do some work-related tasks for me, but I probably don't intend for them to go and look after my parents, even though ensuring my parents are well looked-after is a goal of mine.

Comment by ricraz on Against strong bayesianism · 2020-04-30T22:34:52.221Z · score: 2 (1 votes) · LW · GW
We have tons of empirical evidence on this.

What sort of evidence are you referring to; can you list a few examples?

Comment by ricraz on Against strong bayesianism · 2020-04-30T20:04:21.727Z · score: 10 (6 votes) · LW · GW

In general I very much appreciate people reasoning from examples like these. The sarcasm does make me less motivated to engage with this thoroughly, though. Anyway, idk how to come up with general rules for which abstractions are useful and which aren't. Seems very hard. But when we have no abstractions which are empirically verified to work well in modelling a phenomenon (like intelligence), it's easy to overestimate how relevant our best mathematics is, because proofs are the only things that look like concrete progress.

On Big-O analysis in particular: this is a pretty interesting example actually, since I don't think it was obvious in advance that it'd work as well as it has (i.e. that the constants would be fairly unimportant in practice). Need to think more about this one.

Comment by ricraz on Against strong bayesianism · 2020-04-30T19:47:46.627Z · score: 5 (3 votes) · LW · GW
This makes me think that you're (mostly) arguing against 'Bayesianism', i.e. effectively requesting that we 'taboo' that term and discuss its components ("tenets") separately.

This is not an unreasonable criticism, but it feels slightly off. I am not arguing against having a bunch of components which we put together into a philosophy with a label; e.g. liberalism is a bunch of different components which get lumped together, and that's fine. I am arguing that the way the tenets of bayesianism are currently combined is bad, because there's this assumption that they are a natural cluster of ideas that can be derived from the mathematics of Bayes' rule. It's specifically discarding this assumption that I think is helpful. Then we could still endorse most of the same ideas as before, but add more which didn't have any link to Bayes' rule, and stop privileging bayesianism as a tool for thinking about AI. (We'd also want a new name for this cluster, I guess; perhaps "reasonism"? Sounds ugly now, but we'd get used to it.)

Comment by ricraz on Against strong bayesianism · 2020-04-30T19:33:42.173Z · score: 9 (5 votes) · LW · GW

Yeah, actually, I think your counterargument is correct. I basically had a cached thought that Block was trying to do with Blockhead a similar thing to what Searle was trying to do with the Chinese Room. Should have checked it more carefully.

I've now edited to remove my critique of Block himself, while still keeping the argument that Blockhead is uninformative about AI for (some of) the same reasons that bayesianism is.

Comment by ricraz on ricraz's Shortform · 2020-04-30T13:34:52.891Z · score: 5 (3 votes) · LW · GW

I've just put up a post which serves as a broader response to the ideas underpinning this type of argument.