Posts

Coalitional agency 2024-07-22T00:09:51.525Z
A more systematic case for inner misalignment 2024-07-20T05:03:03.500Z
Towards more cooperative AI safety strategies 2024-07-16T04:36:29.191Z
A simple case for extreme inner misalignment 2024-07-13T15:40:37.518Z
The Minority Faction 2024-06-24T20:01:27.436Z
CIV: a story 2024-06-15T22:36:50.415Z
Tinker 2024-04-16T18:26:38.679Z
Measuring Coherence of Policies in Toy Environments 2024-03-18T17:59:08.118Z
Notes from a Prompt Factory 2024-03-10T05:13:39.384Z
Every "Every Bay Area House Party" Bay Area House Party 2024-02-16T18:53:28.567Z
Masterpiece 2024-02-13T23:10:35.376Z
A sketch of acausal trade in practice 2024-02-04T00:32:54.622Z
Succession 2023-12-20T19:25:03.185Z
∀: a story 2023-12-17T22:42:32.857Z
Meditations on Mot 2023-12-04T00:19:19.522Z
The Witness 2023-12-03T22:27:16.248Z
The Soul Key 2023-11-04T17:51:53.176Z
Value systematization: how values become coherent (and misaligned) 2023-10-27T19:06:26.928Z
Techno-humanism is techno-optimism for the 21st century 2023-10-27T18:37:39.776Z
The Gods of Straight Lines 2023-10-14T04:10:50.020Z
Eight Magic Lamps 2023-10-14T04:10:02.040Z
The Witching Hour 2023-10-10T00:19:37.786Z
One: a story 2023-10-10T00:18:31.604Z
Arguments for moral indefinability 2023-09-30T22:40:04.325Z
Alignment Workshop talks 2023-09-28T18:26:30.250Z
Jacob on the Precipice 2023-09-26T21:16:39.590Z
The King and the Golem 2023-09-25T19:51:22.980Z
Drawn Out: a story 2023-07-11T00:08:09.286Z
The virtue of determination 2023-07-10T05:11:00.412Z
You must not fool yourself, and you are the easiest person to fool 2023-07-08T14:05:18.642Z
Fixed Point: a love story 2023-07-08T13:56:54.807Z
Agency begets agency 2023-07-06T13:08:44.318Z
Frames in context 2023-07-03T00:38:52.078Z
Meta-rationality and frames 2023-07-03T00:33:20.355Z
Man in the Arena 2023-06-26T21:57:45.353Z
The ones who endure 2023-06-16T14:40:09.623Z
Cultivate an obsession with the object level 2023-06-07T01:39:54.778Z
The ants and the grasshopper 2023-06-04T22:00:04.577Z
Coercion is an adaptation to scarcity; trust is an adaptation to abundance 2023-05-23T18:14:19.117Z
Self-leadership and self-love dissolve anger and trauma 2023-05-22T22:30:06.650Z
Trust develops gradually via making bids and setting boundaries 2023-05-19T22:16:38.483Z
Resolving internal conflicts requires listening to what parts want 2023-05-19T00:04:20.451Z
Conflicts between emotional schemas often involve internal coercion 2023-05-17T10:02:50.860Z
We learn long-lasting strategies to protect ourselves from danger and rejection 2023-05-16T16:36:08.398Z
Judgments often smuggle in implicit standards 2023-05-15T18:50:07.781Z
From fear to excitement 2023-05-15T06:23:18.656Z
Clarifying and predicting AGI 2023-05-04T15:55:26.283Z
AGI safety career advice 2023-05-02T07:36:09.044Z
Communicating effectively under Knightian norms 2023-04-03T22:39:58.350Z
Policy discussions follow strong contextualizing norms 2023-04-01T23:51:36.588Z

Comments

Comment by Richard_Ngo (ricraz) on Daniel Kokotajlo's Shortform · 2024-07-25T05:28:05.052Z · LW · GW

Relevant: my post on value systematization

Though I have a sneaking suspicion that this comment was originally made on a draft of that?

Comment by Richard_Ngo (ricraz) on Daniel Kokotajlo's Shortform · 2024-07-25T05:07:32.948Z · LW · GW

I disagree with the first one. I think that the spectrum of human-level AGI is actually quite wide, and that for most tasks we'll get AGIs that are better than most humans significantly before we get AGIs that are better than all humans. But the latter is much more relevant for recursive self-improvement, because it's bottlenecked by innovation, which is driven primarily by the best human researchers. E.g. I think it'd be pretty difficult to speed up AI progress dramatically using millions of copies of an average human.

Also, by default I think people talk about FOOM in a way that ignores regulations, governance, etc. Whereas in fact I expect these to put significant constraints on the pace of progress after human-level AGI.

If we have millions of copies of the best human researchers, without governance constraints on the pace of progress... then compute constraints become the biggest thing. It seems plausible that you get a software-only singularity, but it also seems plausible that you need to wait for AI-driven innovations in chip manufacturing to actually cash out in the real world.

I broadly agree with the second one, though I don't know how many people there are left with 30-year timelines. But 20 years to superintelligence doesn't seem unreasonable to me (though it's above my median). In general I've updated lately that Kurzweil was more right than I used to think about there being a significant gap between AGI and ASI. Part of this is because I expect the problem of multi-agent credit assignment over long time horizons to be difficult.

Comment by Richard_Ngo (ricraz) on Daniel Kokotajlo's Shortform · 2024-07-25T04:21:29.894Z · LW · GW

In the last 24 hours. I read fast (but also skipped the last third of the Doomsday Machine).

Comment by Richard_Ngo (ricraz) on Daniel Kokotajlo's Shortform · 2024-07-24T19:18:50.106Z · LW · GW

This comment prompted me to read both Secrets and also The Doomsday Machine by Ellsberg. Both really great, highly recommend.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-23T18:29:15.251Z · LW · GW

I think "being the kind of agent who survives the selection process" can sometimes be an important epistemic thing to consider

I'm not claiming it's zero information, but there are lots of things that convey non-zero information which it'd be bad to set disclosure norms based on. E.g. "I've only ever worked at nonprofits" should definitely affect your opinion of someone's epistemics (e.g. when they're trying to evaluate corporate dynamics) but once we start getting people to disclose that sort of thing there's no clear stopping point. So mostly I want the line to be "current relevant conflicts of interest".

Comment by Richard_Ngo (ricraz) on Coalitional agency · 2024-07-23T16:31:47.698Z · LW · GW

Ooops, good catch.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-23T16:13:37.569Z · LW · GW

But I also think that one of the reasons why Richard still works at OpenAI is because he's the kind of agent who genuinely believes things that tend to be pretty aligned with OpenAI's interests, and I suspect his perspective is informed by having lots of friends/colleagues at OpenAI. 

Added a disclaimer, as suggested. It seems like a good practice for this sort of post. Though note that I disagree with this paragraph; I don't think "being the kind of agent who X" or "being informed by many people at Y" are good reasons to give disclaimers. Whereas I do buy that "they filter out any ideas that they have that could get them in trouble with the company" is an important (conscious or unconscious) effect, and worth a disclaimer.

I've also added this note to the text:

Note that most big companies (especially AGI companies) are strongly structurally power-seeking too, and this is a big reason why society at large is so skeptical of and hostile to them. I focused on AI safety in this post both because companies being power-seeking is an idea that's mostly "priced in", and because I think that these ideas are still useful even when dealing with other power-seeking actors.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-22T21:25:42.001Z · LW · GW

No legible evidence jumps to mind, but I'll keep an eye out. Inherently this sort of thing is pretty hard to pin down, but I do think I'm one of the handful of people who most strongly bridge the AI safety and accelerationist communities on a social level, and so I get a lot of illegible impressions.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-22T21:16:11.520Z · LW · GW

Presumably, at some point, some groups start advocating for specific policies that go against the e/acc worldview. At that point, it seems like you get the organized resistance.

My two suggestions:

  1. People stop aiming to produce proposals that hit almost all the possible worlds. By default you should design your proposal to be useless in, say, 20% of the worlds you're worried about (because trying to get that last 20% will create really disproportionate pushback); or design your proposal so that it leaves 20% of the work undone (because trusting that other people will do that work ends up being less power-seeking, and more robust, than trying to centralize everything under your plan). I often hear people saying stuff like "we need to ensure that things go well" or "this plan needs to be sufficient to prevent risk", and I think that mindset is basically guaranteed to push you too far towards the power-seeking end of the spectrum. (I've added an edit to the end of the post explaining this.)
  2. As a specific example of this, if your median doom scenario goes through AGI developed/deployed by centralized powers (e.g. big labs, govts) I claim you should basically ignore open-source. Sure, there are some tail worlds where a random hacker collective beats the big players to build AGI; or where the big players stop in a responsible way, but the open-source community doesn't; etc. But designing proposals around those is like trying to put out candles when your house is on fire. And I expect there to be widespread appetite for regulating AI labs from govts, wider society, and even labs themselves, within a few years' time, unless those proposals become toxic in the meantime—and making those proposals a referendum on open-source is one of the best ways I can imagine to make them toxic.

(I've talked to some people whose median doom scenario looks more like Hendrycks' "natural selection" paper. I think it makes sense by those people's lights to continue strongly opposing open-source, but I also think those people are wrong.)

I think that the "we must ensure" stuff is mostly driven by a kind of internal alarm bell rather than careful cost-benefit reasoning; and in general I often expect this type of motivation to backfire in all sorts of ways.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-22T07:21:43.835Z · LW · GW

In a world where AI safety folks didn't say/do anything about OS, I would still suspect clashes between e/accs and AI safety folks.

There's a big difference between e/acc as a group of random twitter anons, and e/acc as an organized political force. I claim that anti-open-source sentiment from the AI safety community played a significant role (and was perhaps the single biggest driver) in the former turning into the latter. It's much easier to form a movement when you have an enemy. As one illustrative example, I've seen e/acc flags that are a version of the libertarian flag saying "come and take it [our GPUs]". These are a central example of an e/acc rallying cry that was directly triggered by AI governance proposals. And I've talked to several principled libertarians who are too mature to get sucked into a movement by online meme culture, but who have been swung in that direction due to shared opposition to SB-1047.

Consider, analogously: Silicon Valley has had many political disagreements with the Democrats over the last decade—e.g. left-leaning media has continuously been very hostile to Silicon Valley. But while the incentives to push back were there for a long time, the organized political will to push back has only arisen pretty recently. This shows that there's a big difference between "in principle people disagree" and "actual political fights".

I think it's extremely likely this would've happened anyways. A community that believes passionately in rapid or maximally-fast AGI progress already has strong motivation to fight AI regulations.

This reasoning seems far too weak to support such a confident conclusion. There was a lot of latent pro-innovation energy in Silicon Valley, true, but the ideology it gets channeled towards is highly contingent. For instance, Vivek Ramaswamy is a very pro-innovation, anti-regulation candidate who has no strong views on AI. If AI safety hadn't been such a convenient enemy then plausibly people with pro-innovation views would have channeled them towards something closer to his worldview.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-07-21T23:52:16.435Z · LW · GW

I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.

tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4).
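
To give a flavor of the "inconsistency" idea (based only on my skim, so the scoring rule below is my guess at the spirit rather than the papers' actual definitions, and all names and numbers are made up): you can score a set of possibly-conflicting beliefs by how far the best single reconciling distribution has to sit from each of them.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence KL(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def inconsistency(beliefs):
    """Toy inconsistency score for beliefs about a single binary variable.

    `beliefs` is a list of (probability, confidence-weight) pairs, possibly
    from sources that disagree. The score is the smallest achievable weighted
    sum of KL divergences from one reconciling distribution q to each stated
    belief; it is ~0 exactly when the beliefs can all be jointly satisfied.
    """
    grid = [i / 1000 for i in range(1, 1000)]  # brute-force search over q
    return min(sum(w * kl_bernoulli(q, p) for p, w in beliefs) for q in grid)

print(inconsistency([(0.3, 1.0), (0.3, 1.0)]))  # agreeing sources: ~0
print(inconsistency([(0.3, 1.0), (0.7, 1.0)]))  # conflicting sources: > 0
```

(The papers work with conditional distributions on directed graphs and prove much more; this is just the one-variable case as I currently understand it.)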

Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-21T17:43:52.755Z · LW · GW

I expect neither to work in practice, since I don't think that either [broad competence of decision-makers] or [increased legitimacy of broad (and broadening!) AIS community] help us much at all in achieving our goals. To achieve our goals, I expect we'll need something much closer to 'our' people in power.

While this seems like a reasonable opinion in isolation, I also read the thread where you were debating Rohin and holding the position that most technical AI safety work was net-negative.

And so basically I think that you, like Eliezer, have been forced by (according to me, incorrect) analyses of the likelihood of doom to the conclusion that only power-seeking strategies will work.

From the inside, for you, it feels like "I am responding to the situation with the appropriate level of power-seeking given how extreme the circumstances are".

From the outside, for me, it feels like "The doomers have a cognitive bias that ends up resulting in them overrating power-seeking strategies, and this is not a coincidence but instead driven by the fact that it's disproportionately easy for cognitive biases to have this effect (given how the human mind works)".

Fortunately I think most rationalists have fairly good defense mechanisms against naive power-seeking strategies, and this is to their credit. So the main thing I'm worried about here is that less force ends up concentrated behind non-power-seeking strategies.

Comment by Richard_Ngo (ricraz) on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-21T01:10:33.254Z · LW · GW

Yes, I'm saying it's a reasonable conclusion to draw, and the fact that it isn't drawn here is indicative of a kind of confirmation bias.

Comment by Richard_Ngo (ricraz) on A more systematic case for inner misalignment · 2024-07-20T18:04:36.095Z · LW · GW

Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect […], and I don't, for the reasons in my comment.

Comment by Richard_Ngo (ricraz) on A more systematic case for inner misalignment · 2024-07-20T17:48:25.757Z · LW · GW

Thanks for the extensive comment! I'm finding this discussion valuable. Let me start by responding to the first half of your comment, and I'll get to the rest later.

The simplicity of a goal is inherently dependent on the ontology you use to view it through: while […] is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the "essence" of the goal simplifies; instead, it's merely because it gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom. If you try to view […] in […] instead of […], meaning you look at the preimage […], this should approximately be the same as […]: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller.

One way of framing our disagreement: I'm not convinced that the f operation makes sense as you've defined it. That is, I don't think it can both be invertible and map to goals with low complexity in the new ontology.
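
To spell out the counting intuition behind that claim (in my own notation, since I may not be tracking yours exactly), write K(g | O) for the description length of goal g in ontology O. Simple goals are scarce:

```latex
% Counting sketch (my notation, not the parent comment's):
\[
  \bigl|\{\, g : K(g \mid O_2) \le c \,\}\bigr| \;\le\; 2^{c+1} - 1,
\]
% since there are at most 2^{c+1} - 1 binary descriptions of length at most c.
% So an injective map f from goals expressed in O_1 to goals expressed in O_2
% can send at most 2^{c+1} - 1 of them to goals of complexity at most c.
% If the map is supposed to make goals simple in the new ontology, it has to
% collapse many distinct old goals onto the same simple new goal, which is
% exactly the failure of invertibility I'm pointing at.
```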

Consider a goal that someone from the past used to have, which now makes no sense in your ontology—for example, the goal of reaching the edge of the earth, for someone who thought the earth was flat. What does this goal look like in your ontology? I submit that it looks very complicated, because your ontology is very hostile to the concept of the "edge of the earth". As soon as you try to represent the hypothetical world in which the earth is flat (which you need to do in order to point to the concept of its "edge"), you now have to assume that the laws of physics as you know them are wrong; that all the photos from space were faked; that the government is run by a massive conspiracy; etc. Basically, in order to represent this goal, you have to set up a parallel hypothetical ontology (or in your terminology, […] needs to encode a lot of the content of […]). Very complicated!

I'm then claiming that whatever force pushes our ontologies to simplify also pushes us away from using this sort of complicated construction to represent our transformed goals. Instead, the most natural thing to do is to adapt the goal in some way that ends up being simple in your new ontology. For example, you might decide that the most natural adaptation of "reaching the edge of the earth" is "going into space"; or maybe it's "reaching the poles"; or maybe it's "pushing the frontiers of human exploration" in a more metaphorical sense. Importantly, under this type of transformation, many different goals from the old ontology will end up being mapped to simple concepts in the new ontology (like "going into space"), and so it doesn't match your definition of […].

All of this still applies (but less strongly) to concepts that are not incoherent in the new ontology, but rather just messy. E.g. suppose you had a goal related to "air", back when you thought air was a primitive substance. Now we know that air is about 78% nitrogen, 21% oxygen, and 0.93% argon. Okay, so that's one way of defining "air" in our new ontology. But this definition of air has a lot of messy edge cases—what if the ratios are slightly off? What if you have the same ratios, but much different pressures or temperatures? Etc. If you have to arbitrarily classify all these edge cases in order to pursue your goal, then your goal has now become very complex. So maybe instead you'll map your goal to the idea of a "gas", rather than "gas that has specific composition X". But then you discover a new ontology in which "gas" is a messy concept...

If helpful I could probably translate this argument into something closer to your ontology, but I'm being lazy for now because your ontology is a little foreign to me. Let me know if this makes sense.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-19T18:37:24.712Z · LW · GW

I think this whole debate is missing the point I was trying to make. My claim was that it's often useful to classify actions which tend to lead you to having a lot of power as "structural power-seeking" regardless of what your motivations for those actions are. Because it's very hard to credibly signal that you're accumulating power for the right reasons, and so the defense mechanisms will apply to you either way.

In this case MIRI was trying to accumulate a lot of power, and claiming that they were aiming to use it in the "right way" (do a pivotal act) rather than the "wrong way" (replacing governments). But my point above is that this sort of claim is largely irrelevant to defense mechanisms against power-seeking.

(Now, in this case, MIRI was pursuing a type of power that was too weird to trigger many defense mechanisms, though it did trigger some "this is a cult" defense mechanisms. But the point cross-applies to other types of power that they, and others in AI safety, are pursuing.)

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-19T04:02:13.090Z · LW · GW

Would you say that "Alice going to a networking event" (assume she's doing it socially conventional/appropriate ways) would count as structural power-seeking? And would you discourage her from going?

I think you're doing a paradox of the heap here. One grain of sand is obviously not a heap, but a million obviously is. Similarly, Alice going to one networking event is obviously not power-seeking, but Alice taking every opportunity she can to pitch herself to the most powerful people she can find obviously is. I'm identifying a pattern of behavior that AI safety exhibits significantly more than other communities, and the fair analogy is to a pattern of behavior that Alice exhibits significantly more than other people around her.

I'm also a bit worried about a motte-and-bailey here. The bold statement is "power-seeking (which I'm kind of defining as anything that increases your influence, regardless of how innocuous or socially accepted it seems) is bad because it triggers defense mechanisms"

I flagged several times in the post that I was not claiming that power-seeking is bad overall, just that it typically has this one bad effect.

the more moderated statement is "there are some specific ways of seeking power that have important social costs, and I think that some/many actors in the community underestimate those costs"

I repudiated this position in my previous comment, where I flagged that I'm trying to make a claim not about specific ways of seeking power, but rather about the outcome of gaining power in general.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T21:55:07.149Z · LW · GW

e/acc has coalesced in defense of open-source, partly in response to AI safety attacks on open-source. This may well lead directly to a strongly anti-AI-regulation Trump White House, since there are significant links between e/acc and MAGA.

I think of this as a massive own goal for AI safety, caused by focusing too much on trying to get short-term "wins" (e.g. dunking on open-source people) that don't actually matter in the long term.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T21:52:10.751Z · LW · GW

In this case, it's not the desire to have influence that is the core problem. The core problem is whether or not Alice is taking the right moves to have the kind of influence she wants.

I think I actually disagree with this. It feels like your framing is something like: "if you pursue power in the wrong ways, you'll have problems. If you pursue power in the right ways, you'll be fine".

And in fact the thing I'm trying to convey is more like "your default assumption should be that accumulating power triggers defense mechanisms, and you might think you can avoid this tradeoff by being cleverer, but that's mostly an illusion". (Or, in other words, it's faulty CDT-style thinking.)

Based on this I actually think that "structurally power-seeking" is the right term after all, because it's implicitly asserting that you can't separate out these two things ("power-seeking" and "gaining power in 'the right ways'").

Note also that my solutions at the end are not in fact strategies for accumulating power in 'the right ways.' They're strategies for achieving your goals while accumulating less power. E.g. prioritizing competence means that you'll try less hard to get "your" person into power. Prioritizing legitimacy means you're making it harder to get your own ideas implemented, when others disagree.

(FWIW I think that on the level of individuals the tradeoff between accumulating power and triggering defense mechanisms is often just a skill issue. But on the level of movements the tradeoff is much harder to avoid—e.g. you need to recruit politically-savvy people, but that undermines your truth-seeking altruistically-motivated culture.)

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T20:49:20.979Z · LW · GW

Would you feel the same way about "influence-seeking", which I almost went with?

Note also that, while Bob is being a dick about it, the dynamic in your scenario is actually a very common one. Many people are social climbers who use every opportunity to network or shill themselves, and this does get noticed and reflects badly on them. We can debate about the precise terminology to use (which I think should probably be different for groups vs individuals) but if Alice just reasoned from the top down about how to optimize her networking really hard for her career, in a non-socially-skilled way, a friend should pull her aside and say "hey, communities often have defense mechanisms against the thing you're doing, watch out".

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T19:43:05.910Z · LW · GW

I do think that modeling the AI Safety space as a single power-base is wrong and not really carving reality along structural lines. 

This is the thing that feels most like talking past each other. You're treating this as a binary and it's really, really not a binary. Some examples:

  • There are many circumstances in which it's useful to describe "the US government" as a power base, even though republicans and democrats are often at each other's throats, and also there are many people within the US government who are very skeptical of it (e.g. libertarians).
  • There are many circumstances in which it's useful to describe "big tech" as a power base, even though the companies in it are competing ferociously. 

I'm not denying that there are crucial differences to model here. But this just seems like the wrong type of argument to use to object to accusations of gerrymandering, because every example of gerrymandering will be defended with "here are the local differences that feel crucial to me".

So how should we evaluate this in a principled way? One criterion: how fierce is the internal fighting? Another: how many shared policy prescriptions do the different groups have? On the former, while I appreciate that you've been treated badly by OpenPhil, I think "trying to eradicate each other" is massive hyperbole. I would maybe accept that as a description of the fighting between AI ethics people, AI safety people, and accelerationists, but the types of weapons they use (e.g. public attacks in major news outlets, legal measures, etc.) are far harsher than anything I've heard about internally within AI safety. If I'm wrong about this, I'd love to know.

On the latter: when push comes to shove, a lot of these different groups band together to support stuff like interpretability research, raising awareness of AI risk, convincing policymakers it's a big deal, AI legislation, etc. I'd believe that you don't do this; I don't believe that there are thousands of others who have deliberately taken stances against these things, because I think there are very few people as cautious about this as you (especially when controlling for influence over the movement).

Like, the way I have conceptualized most of my life's work so far has been to try to build neutral non-power-seeking institutions, that inform other people and help them make better decisions, and that generally try to actively avoid plans that route through "me and my friends get powerful and then solve our problems" because I think this kind of plan will almost inevitably end up just running into conflict with other power-seeking entities and then spend most of its resources on that.

As above, I respect this a lot.
 

Comment by Richard_Ngo (ricraz) on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-18T19:26:00.611Z · LW · GW

Yepp, all of these arguments can weigh in many different directions, depending on your background beliefs. That's my point.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T18:38:32.297Z · LW · GW

I never claimed that AI safety is more X than "any other advocacy group"; I specifically said "most other advocacy groups". And of course I'm not sure about this; asking for that is an isolated demand for rigor. It feels like your objection is the thing that's vibe-based here.

On the object level: these are good examples, but because movements vary on so many axes, it's hard to weigh up two of them against each other. That's why I identified the three features of AI safety which seem to set it apart from most other movements. (Upon reflection I'd also add a fourth: the rapid growth.)

I'm curious if there are specific features which some of the movements you named have that you think contribute significantly to their power-seeking-ness, which AI safety doesn't have.

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T18:18:31.280Z · LW · GW

Suppose I'm an atheist, or a muslim, or a jew, and an Episcopalian living in my town came up to me and said "I'm not meaningfully in a shared power-base with the Baptists. Sure, there's a huge amount of social overlap, we spend time at each other's churches, and we share many similar motivations and often advocate for many of the same policies. But look, we often argue about theological disagreements, and also the main funder for their church doesn't fund our church (though of course many other funders fund both Baptists and Episcopalians)."

I just don't think this is credible, unless you're using a very strict sense of "meaningfully". But at that level of strictness it's impossible to do any reasoning about power-bases, because factional divides are fractal. What it looks like to have a power-base is to have several broadly-aligned and somewhat-overlapping factions that are each seeking power for themselves. In the case above, the Episcopalian may legitimately feel very strongly about their differences with the Baptists, but this is a known bug in human psychology: the narcissism of small differences.

Though I am happy to agree that Lightcone is one of the least structurally power-seeking entities in the AI safety movement, and I respect this. (I wouldn't say the same of current-MIRI, which is now an advocacy org focusing on policies that strongly centralize power. I'm uncertain about past-MIRI.)

Comment by Richard_Ngo (ricraz) on Optimistic Assumptions, Longterm Planning, and "Cope" · 2024-07-18T18:01:43.667Z · LW · GW

These are interesting anecdotes but it feels like they could just as easily be used to argue for the opposite conclusion.

That is, your frame here is something like "planning is hard therefore you should distrust alignment plans".

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly. The best way for them to make progress would have been to play around with a bunch of possibilities, which is closer to forward-chaining.

(And then you might say: well, in this context, they only got one shot at the problem. To which I'd say: okay, so the most important intervention is to try to have more shots on the problem. Which isn't clearly either forward-chaining or back-chaining, but probably closer to the former.)

Comment by Richard_Ngo (ricraz) on Towards more cooperative AI safety strategies · 2024-07-18T04:56:21.268Z · LW · GW

I am kinda confused by these comments. Obviously you can draw categories at higher or lower levels of resolution. Saying that it doesn't make sense to put Lightcone and MIRI in the same bucket as Constellation and OpenPhil, or Bengio in the same bucket as the Bay Area alignment community, feels like... idk, like a Protestant Christian saying it doesn't make sense to put Episcopalians and Baptists in the same bucket. The differences loom large for insiders but are much smaller for outsiders.

You might be implicitly claiming that AI safety people aren't very structurally power-seeking unless they're Bay Area EAs. I think this is mostly false, and in fact it seems to me that people often semi-independently reason themselves into power-seeking strategies after starting to care about AI x-risk. I also think that most proposals for AI safety regulation are structurally power-seeking, because they will make AI safety people arbitrators of which models are allowed (implicitly or explicitly). But a wide range of AI safety people support these (and MIRI, for example, supports some of the strongest versions of these).

I'll again highlight that just because an action is structurally power-seeking doesn't make it a bad idea. It just means that it comes along with certain downsides that people might not be tracking.

Comment by Richard_Ngo (ricraz) on A simple case for extreme inner misalignment · 2024-07-14T14:03:46.582Z · LW · GW

I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

Comment by Richard_Ngo (ricraz) on A simple case for extreme inner misalignment · 2024-07-14T00:16:13.224Z · LW · GW

Curious who just strong-downvoted and why.

Comment by Richard_Ngo (ricraz) on A simple case for extreme inner misalignment · 2024-07-13T22:49:22.776Z · LW · GW

I think the claim is along the lines of "highly compressed representations imply simple goals", but the connection between compressed representations and simple goals has not been argued, unless I missed it. There's also a chance that I simply misunderstand your post entirely. 

Hmm, maybe I should spell it out more explicitly. But basically, by "simple goals" I mean "goals which are simple to represent", i.e. goals which have highly compressed representations; and if all representations are becoming simpler, then the goal representations (as a subset of all representations) are also becoming simpler. (Note that I'll elaborate further on the relationship between goal representations and other representations in my next post.)

Actually, deep CNNs are an example of what you describe in argument 1: The features in later layers of CNNs are highly compressed, and may tell you binary information such as "is there a dog", but they apply to large parts of the input image.

This is largely my fault since I haven't really defined "representation" very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing "fur", "mouth", "nose", "barks", etc. Otherwise if we just count "dog" as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn't seem like a useful definition.

(To put it another way: the representation is the information you need to actually do stuff with the concept.)
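
As a toy illustration of that definition (my own framing and made-up feature names, nothing from the post): if a concept's representation is everything upstream that you need in order to use it, then "dog" is much less simple than its single output neuron suggests.

```python
# Each concept points to the features it is built from.
CONCEPT_GRAPH = {
    "dog": ["fur", "mouth", "nose", "barks"],
    "fur": ["texture"],
    "barks": ["sound"],
    "mouth": [],
    "nose": [],
    "texture": [],
    "sound": [],
}

def representation(concept, graph=CONCEPT_GRAPH):
    """Everything reachable from `concept`: the information needed to use it."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

print(len(representation("dog")))    # 7, even though "dog" is one node
print(len(representation("mouth")))  # 1
```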

c. I think this leaves the confusion why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I'd argue that the premise may be false since it's unclear to me how what philosophers say they care about ("hedonium") connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.)

I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it's very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.

Comment by Richard_Ngo (ricraz) on A simple case for extreme inner misalignment · 2024-07-13T22:32:16.543Z · LW · GW

I expect that this doesn't mean you want commenters to withhold their objections until after you discuss a section of them in your next post

That's right, please critique away. And thanks for the thoughtful comment!

This reasoning seems to assume[2] that the "goal" of the AI is part of its "representations,"

An AI needs to represent its goals somehow; therefore its goals are part of its representations. But this is just a semantic point; I dig into your substantive criticism next.

The "representations," in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the "goal" lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology.

This is one of the objections I'll be covering in the next post. But you're right to call me out on it, because I am assuming that goal representations also become compressed. Or, to put it another way: I am assuming that the pressure towards simplicity described in premise 1 doesn't distinguish very much between goal representations and concept representations.

Why? Well, it's related to a claim you make yourself: that "the belief/goal distinction probably doesn't ultimately make sense as something that carves reality at the joints into coherent, distinct concepts". In other words, I don't think there is a fully clean distinction between goal representations and belief/concept representations. (In humans it often seems like the goal of achieving X is roughly equivalent to a deep-rooted belief that achieving X would be good, where "good" is a kinda fuzzy predicate that we typically don't look at very hard.) And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.

This is one of the essences of the Orthogonality Thesis, is it not [3]? That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality, 

The orthogonality thesis was always about what agents were possible, not which were likely. It is consistent with the orthogonality thesis to say that increasingly intelligent agents have a very strong tendency to compress their representations, and that this tends to change their goals towards ones which are simple in their new ontologies (although it's technically possible that they retain goals which are very complex in their new ontologies). Or, to put it another way: the orthogonality thesis doesn't imply that goals and beliefs are uncorrelated (which is a very extreme claim—e.g. it implies that superintelligences are just as likely to have goals related to confused concepts like phlogiston or souls as humans are).

Comment by Richard_Ngo (ricraz) on A simple case for extreme inner misalignment · 2024-07-13T16:06:30.157Z · LW · GW

Thanks! Yeah, I personally find it difficult to strike a balance between "most speculation of this kind is useless" and "occasionally it's incredibly high-impact" (e.g. shapes the whole field of alignment, like the concept of deceptive alignment did).

My guess is that the work which falls into the latter category is most often threat modeling, because that's a domain where there's no way to approach it except speculation.

Comment by Richard_Ngo (ricraz) on Announcing The Techno-Humanist Manifesto: A new philosophy of progress for the 21st century · 2024-07-08T23:08:41.708Z · LW · GW

I'm broadly supportive, but I also have a similar reaction to the other comments. I'm not really sure who would change their mind upon reading this book.

Or, to put it another way: I think the most important input into intellectual work is choosing the right opponents. If you want to push the intellectual frontier, you should choose the intellectually strongest opponents. (My own take on techno-humanism sets it up in contrast to a worthy opponent: classic techno-optimism. I'd be curious to hear what your sense of the distinction between them is.) Whereas if you want to win converts, you should choose the most popular opponents. But I don't get the sense that this book really has any opponents. Yes, many people will disagree with it—but that's different from the book actually taking them on directly. E.g. the degrowth people will hate this book, but this book is not a deconstruction of degrowthism. Probably woke people will also dislike this book, but neither is it a deconstruction of wokism.

I get the sense that you're deliberately choosing to make the book a little on the anodyne side in order for it to serve as a foundational text for the progress movement. But I actually think that, if you're aiming to be a foundational text, you should do the opposite. Movements are founded by books that go in all guns blazing. Think of Das Kapital—this is Marx setting up capitalism as his foe, then taking an axe to it. Or The Second Sex, or The Road to Serfdom, or Where's My Flying Car?, or even the Sequences. Foundational texts are polemics; they have a fire in them. (Possible counterexamples: I haven't read On Liberty or A Vindication of the Rights of Woman, maybe they're a bit more chill?)

And there's plenty of stuff to be fired up about! The FDA is killing millions; the NIMBYs are strangling the entire western world; the environmentalists are absolutely wrecking any project that has the faintest hope of helping the environment. If you want people to care about progress studies, first get them fired up about how crazy the situation is, and give them a diagnosis of what's going wrong, and only then bring in the abstract philosophy, and the history. Anyone who used to be (or still is?) an objectivist is definitely disagreeable enough to do this; I'd be excited to see you give it a real shot.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-30T00:11:00.434Z · LW · GW

This seems substantially different from UDT, which does not really have or use a notion of "past version of yourself".

My terminology here was sloppy, apologies. When I say "past versions of yourself" I am also including (as Nesov phrases it below) "the idealized past agent (which doesn't physically exist)". E.g. in the Counterfactual Mugging case you describe, I am thinking about precommitments that the hypothetical past version of yourself from before the coin was flipped would have committed to.

I find it a more intuitive way to think about UDT, though I realize it's a somewhat different framing from yours. Do you still think this is substantially different?

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-28T23:50:31.854Z · LW · GW

Okay, so trying to combine the prisoner's dilemma and UDT, we get: A and B are in a prisoner's dilemma. Suppose they have a list of N agents (which includes, say, A's past self, B's past self, the Buddha, etc.), and they each must commit to following one of those agents' instructions. Each of them estimates: "conditional on me committing to listen to agent K, here's a distribution over which agent they'd commit to listen to". And then you maximize expected value based on that.

Okay, but why isn't this exactly the same as them just thinking to themselves "conditional on me taking action K, here's the distribution over their actions" for each of N actions they could take, and then maximizing expected value? It feels like the difference is that it's really hard to actually reason about the correlations between my low-level actions and your low-level actions, whereas it might be easier to reason about the correlations between my high-level commitments and your high-level commitments.

I.e. the role of the Buddha in this situation is just to make the acausal coordination here much easier.
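
Here's a toy numerical version of that comparison (all payoffs, probabilities, and agent names are made up): the two calculations have exactly the same shape, and committing to an agent only helps insofar as the correlation between high-level commitments is easier to estimate than the correlation between low-level actions.

```python
# Row player's payoffs in a standard prisoner's dilemma.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Version 1: condition directly on my low-level action.
# For each action I might take, my (hard-to-estimate) guess about their action.
P_THEIR_ACTION = {
    "C": {"C": 0.8, "D": 0.2},
    "D": {"C": 0.1, "D": 0.9},
}

def best_action():
    return max(
        (sum(p * PAYOFF[(a, b)] for b, p in dist.items()), a)
        for a, dist in P_THEIR_ACTION.items()
    )

# Version 2: condition on which agent I commit to obey.
# Each candidate agent maps this situation to an action; for each commitment I
# might make, my guess about which agent they'd commit to. (Only two candidate
# commitments are scored here, for brevity.)
AGENTS = {"my_past_self": "C", "their_past_self": "C", "buddha": "C", "defectbot": "D"}
P_THEIR_COMMITMENT = {
    "buddha":    {"buddha": 0.9, "defectbot": 0.1},
    "defectbot": {"buddha": 0.2, "defectbot": 0.8},
}

def best_commitment():
    return max(
        (sum(p * PAYOFF[(AGENTS[k], AGENTS[j])] for j, p in dist.items()), k)
        for k, dist in P_THEIR_COMMITMENT.items()
    )

print(best_action())      # (2.4, 'C') with these numbers
print(best_commitment())  # (2.7, 'buddha') with these numbers
```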

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-28T23:28:38.858Z · LW · GW

if the past agent doesn't try to wade in the murky waters of logical updatelessness, it's not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can "see everything".

I actually think it might still be more fallible, for a couple of reasons.

Firstly, consider an agent which, at time T, respects all commitments it would have made at times up to T. Now if you're trying to attack the agent at time T, you have T different versions of it that you can attack, and if any of them makes a dumb commitment then you win.

I guess you could account for this by just gradually increasing the threshold for making commitments over time, though.

Secondly: the further back you go, the more farsighted the past agent needs to be about the consequences of its commitments. If you have any compounding mistakes in the way it expects things to play out, then it'll just get worse and worse the further back you defer.

Again, I guess you could account for this by having a higher threshold for making commitments which you expect to benefit you further down the line.

Then, re logical updatelessness: it feels like in the long term we need to unify logical + empirical updates, because they're roughly the same type of thinking. Murky waters perhaps, but necessary ones.

Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.

Yeah, so what could this look like? I think one important idea is that you don't have to be deferring to your past self, it's just that your past self is the clearest Schelling point. But it wouldn't be crazy for me to, say, use BDT: Buddha Decision Theory, in which I obey all commitments that the Buddha would have made for me if he'd been told about my situation. The differences between me using UDT and BDT (tentatively) seem only quantitative to me, not qualitative. BDT makes it harder for me to cooperate with hypothetical copies of myself who hadn't yet thought of BDT (because "Buddha" is less of a Schelling point amongst copies of myself than "past Richard"). It also makes me worse off than UDT in some cases, because sometimes the Buddha would make commitments in favor of his interests, not mine. But it also makes it a bit easier for me to cooperate with others, who might also converge to BDT.

At this point I'm starting to suspect that solving UDT 2 is not just alignment-complete, it's also politics- and sociology-complete. The real question is whether we can isolate toy examples or problems in which these ideas can be formalized, rather than just having them remain vague "what if everyone got along" speculation.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-28T06:39:04.617Z · LW · GW

[Epistemic status: rough speculation, feels novel to me, though Wei Dai probably already posted about it 15 years ago.]

UDT is (roughly) defined as "follow whatever commitments a past version of yourself would have made if they'd thought about your situation". But this means that any UDT agent is only as robust to adversarial attacks as their most vulnerable past self. Specifically, it creates an incentive for adversaries to show UDT agents situations that would trick their past selves into making unwise commitments. It also creates incentives for UDT agents themselves to hack their past selves, in order to artificially create commitments that "took effect" arbitrarily far back in their past.

In some sense, then, I think UDT might have a parallel structure to the overall alignment problem. You have dumber past agents who don't understand most of what's going on. You have smarter present agents who have trouble cooperating, because they know too much. The smarter agents may try to cooperate by punting to "Schelling point" dumb agents. (But this faces many of the standard problems of dumb agents making decisions—e.g. the commitments they make will probably be inconsistent or incoherent in various ways. And so in fact you need the smarter agents to interpret the dumb agents' commitments, which then gets rid of a bunch of the value of punting it to those dumb agents in the first place.)

You also have the problem that the dumb agents will have situational awareness, and may recognize that their interests have diverged from the interests of the smart agents.

But this also suggests that a "solution" to UDT and a solution to alignment might have roughly the same type signature: a spotlighted structure for decision-making procedures that incorporate the interests of both dumb and smart agents. Even when they have disparate interests, the dumb agents would benefit from getting any decision-making power, and the smart agents would benefit from being able to use the dumb agents as Schelling points to cooperate around.

The smart agents could always refactor the dumb agents and construct new Schelling points if they wanted to, but that would cost them a lot of time and effort, because coordination is hard, and the existing coordination edifice has been built around these particular dumb agents. (Analogously, you could refactor out a bunch of childhood ideals and memories from your current self, but mostly you don't want to, because they constitute the fabric from which your identity has been constructed.)

To be clear, this isn't meant to be an argument that ASIs which don't like us at all will keep us around. That seems unlikely either way. But it could be an argument that ASIs which kinda like us a little bit will keep us around—that it might not be incredibly unnatural for them to do so, because their whole cognitive structure will incorporate the opinions and values of dumber agents by default.

Comment by Richard_Ngo (ricraz) on Buck's Shortform · 2024-06-25T04:52:45.128Z · LW · GW

On the spectrum I outlined, the "legislate that AI labs should do X, Y, Z, as enforced by regulator R" end is less susceptible to regulatory capture (at least after the initial bill is passed).

Comment by Richard_Ngo (ricraz) on Buck's Shortform · 2024-06-24T22:58:42.654Z · LW · GW

Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from "legislate that AI labs should do X, Y, Z, as enforced by regulator R" to "legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R". In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important threat models; and even if they are, the flexibility makes it easier for others to apply political pressure to them (e.g. "find a reason to approve this safety case, it's in the national interest").

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-24T22:54:32.761Z · LW · GW

The most representative were those from Anthropic employees who concurred that this was indeed the implication, but it seemed bad form to cite particular employees (especially when that information was not public by default) rather than, e.g., Dario. I think Dustin’s statement was strong evidence of this impression, though, and I still believe Anthropic to have at least insinuated it.

This makes sense, and does update me. Though I note "implication", "insinuation" and "impression" are still pretty weak compared to "actually made a commitment", and still consistent with the main driver being wishful thinking on the part of the AI safety community (including some members of the AI safety community who work at Anthropic).

I think that when you are making a technology which might extinct humanity, the bar should be significantly higher than “normal discourse.” When you are doing something with that much potential for harm, you owe it to society to make firm commitments that you stick to.

...

So I do blame them for not making such a statement—it is on them to show to humanity, the people they are making decisions for, why those decisions are justified. It is not on society to make the political situation sufficiently palatable such that they don’t face any consequences for the mistakes they have made. It is on them not to make those mistakes, and to own up to them when they do. 

I think there are two implicit things going on here that I'm wary of. The first one is an action-inaction distinction. Pushing them to justify their actions is, in effect, a way of slowing down all their actions. But presumably Anthropic thinks that them not doing things is also something which could lead to humanity going extinct. Therefore there's an exactly analogous argument they might make, which is something like "when you try to stop us from doing things you owe it to the world to adhere to a bar that's much higher than 'normal discourse'". And in fact criticism of Anthropic has not met this bar - e.g. I think taking a line from a blog post out of context and making a critical song about it is in fact unusually bad discourse.

What's the disanalogy between you and Anthropic telling each other to have higher standards? That's the second thing that I'm wary about: you're claiming to speak on behalf of humanity as a whole. But in fact, you are not; there's no meaningful sense in which humanity is in fact demanding a certain type of explanation from Anthropic. Almost nobody wants an explanation of this particular policy; in fact, the largest group of engaged stakeholders here are probably Anthropic customers, who mostly just want them to ship more models.

I don't really have a strong overall take. I certainly think it's reasonable to try to figure out what went wrong with communication here, and perhaps people poking around and asking questions would in fact lead to evidence of clear commitments being made. I am mostly against the reflexive attacks based on weak evidence, which seems like what's happening here. In general my model of trust breakdowns involves each side getting many shallow papercuts from the other side until they decide to disengage, and my model of productive criticism involves more specificity and clarity.

if Anthropic is attempting to serve the public, which they at least pay lip service to through their corporate structure, then they should be grateful for this feedback, and attempt to incorporate it.

I don't know if you've ever tried this move on an interpersonal level, but it is exactly the type of move that tends to backfire hard. And in fact a lot of these things are fundamentally interpersonal things, about who trusts whom, etc.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-23T05:12:15.869Z · LW · GW

I mean, yes, they're closely related.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-21T21:04:42.905Z · LW · GW

See my comment below. Basically I think this depends a lot on the extent to which a commitment was made.

Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that's on you. And trying to double down by saying "I have been bashing you because I formed an unreasonable expectation, now it's your job to fix that" seems pretty adversarial.

I do think it would be nice if Anthropic did make such a statement, but seeing how adversarially everyone has treated the information they do release, I don't blame them for not doing so.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-21T20:57:55.160Z · LW · GW

The intended implication is something like "rationalists have a bias towards treating statements as much firmer commitments than intended, and then getting very upset when they are violated".

For example, unless I'm missing something, the "we do not wish to advance the rate of AI capabilities" claim is just one offhand line in a blog post. It's not a firm commitment, it's not even a claim about what their intentions are. As stated, it's just one consideration that informs their actions - and in fact the "wish" terminology is often specifically not a claim about intended actions (e.g. "I wish I didn't have to do X").

Yet rationalists are hammering them on this one sentence - literally making songs about it, tweeting it to criticize Anthropic, etc. It seems like there is a serious lack of metacognition about where a non-adversarial communication breakdown could have occurred, or what the charitable interpretations of this are.

(I am open to people considering these interpretations and then dismissing them, but I'm not even seeing that. Like, if people were saying "I understand the difference between Anthropic actually making an organizational commitment and just offhandedly mentioning a fact about their motivations, but here's why I'm disappointed anyway", that would seem reasonable. But a lot of people seem to be treating it as a Very Serious Promise being broken.)

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-21T00:02:06.234Z · LW · GW

Inspired by a recent discussion about whether Anthropic broke a commitment to not push the capabilities frontier (I am more sympathetic to their position than most, because I think that it's often hard to distinguish between "current intentions" and "commitments which might be overridden by extreme events" and "solemn vows"):

Maybe one translation tool for bridging the gap between rationalists and non-rationalists is if rationalists interpret any claims about the future by non-rationalists as implicitly being preceded by "Look, I don't really believe that plans work, I think the world is inherently wildly unpredictable, I am kinda making everything up as I go along. Having said that:"

Comment by Richard_Ngo (ricraz) on CIV: a story · 2024-06-17T04:43:25.960Z · LW · GW

Yep, totally agree (and in fact I'm at a s-risk retreat right now). Definitely a "could make it decide" rather than a "will make it decide".

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-14T18:21:28.265Z · LW · GW

Thanks for the reply.

A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.

I'm working on this right now, actually. Will hopefully post in a couple of weeks.

I say this because those sorts of considerations convinced me that we're much less likely to be buggered.

That seems reasonable. But I do think there's a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.

6 seems too general a claim to me. Why wouldn't it work for 1% vs 10%, and likewise 0.1% vs 1% i.e. why doesn't this suggest that you should round down P(doom) to zero.

I think the point of 6 is not to say "here's where you should end up", but more to say "here's the reason why this straightforward symmetry argument doesn't hold".

7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian framework or EU frameworks.

There's still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to newtonian mechanics that had far-reaching implications for how to think about reality.

Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs, that beliefs should be isomorphic to certain structures, unless I'm horribly mistaken. Which sure is not what you're claiming to be the case in your above points.

Any epistemology will rule out some updates, but a problem with bayesianism is that it says there's exactly one correct update to make. Radical probabilism, for example, still sets some constraints, just far fewer.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-14T17:54:07.217Z · LW · GW

Edited for clarity now.

Comment by Richard_Ngo (ricraz) on Richard Ngo's Shortform · 2024-06-12T18:52:44.306Z · LW · GW

Some opinions about AI and epistemology:

  1. One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues.
  2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
  3. For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models (see the short sketch after this list). Similarly, we should think of individuals' credences as lossy summaries of different opinions from different underlying models that they have.
  4. How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as the dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
  5. In my debate with Eliezer, he didn't seem to appreciate the importance of advance predictions; I think the frame of "highly opinionated subagents should convince other subagents to trust them, rather than just seizing power" is an important aspect of what he's missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you'll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.
  6. This perspective helps frame the debate about what our "base rate" for AI doom should be. I've been in a number of arguments that go roughly like (edited for clarity):
    Me: "Credences above 90% doom can't be justified given our current state of knowledge"
    Them: "But this is an isolated demand for rigor, because you're fine with people claiming that there's a 90% chance we survive. You're assuming that survival is the default, I'm assuming that doom is the default; these are symmetrical positions."
    But in fact there's no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom. That's where the asymmetry which makes 90% doom a much stronger prediction than 90% survival comes from.
  7. This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don't think the ways I'm applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
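To make point 3 concrete, here's a minimal sketch (in Python, with made-up numbers) of two standard pooling rules applied to the same hypothetical set of credences. Nothing here is from the original discussion; it's purely illustrative of the claim that there's no privileged aggregation rule.

```python
import numpy as np

# Hypothetical credences in the same claim, e.g. from five people
# (or five frames/subagents within one person). Numbers are made up.
credences = np.array([0.02, 0.05, 0.10, 0.30, 0.90])

# Rule 1: arithmetic mean of probabilities.
arithmetic_pool = credences.mean()

# Rule 2: geometric mean of odds (averaging in log-odds space).
log_odds = np.log(credences / (1 - credences))
pooled_odds = np.exp(log_odds.mean())
log_odds_pool = pooled_odds / (1 + pooled_odds)

print(f"arithmetic mean of probabilities: {arithmetic_pool:.2f}")  # ~0.27
print(f"geometric mean of odds:           {log_odds_pool:.2f}")    # ~0.18
```

Both rules are defensible, they give noticeably different numbers, and neither preserves the information that one frame is far more confident than the rest; that's the sense in which any single credence is a lossy summary. It also gestures at point 6: when most frames assign low credence to doom and only one assigns high credence, reasonable pooling rules land well below the confident frame's number.
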
Comment by Richard_Ngo (ricraz) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-03T19:26:33.935Z · LW · GW

Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high skill labor).

The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.

My guess is that you need to be a decent but not amazing software engineer to ARA.

Yeah, you're probably right. I still stand by the overall point though.

Comment by Richard_Ngo (ricraz) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-03T19:21:50.422Z · LW · GW

1) It’s not even clear people are going to try to react in the first place.

I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.

If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.

A world which can pause AI development is one which can also easily throttle ARA AIs.

The central point is:

  • At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
  • the ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know
  • This may take an indefinite number of years, but this can be a problem

This seems like a weak central point. "Pretty annoying" and some people dying is just incredibly small compared with the benefits of AI. And "it might be a problem in an indefinite number of years" doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed". 

An extended analogy: suppose the US and China both think it might be possible to invent a new weapon far more destructive than nuclear weapons, and they're both worried that the other side will invent it first. In that scenario, worrying about ARAs feels like worrying about North Korea's weapons program: it could be a problem in some possible worlds, but it is always going to be much smaller than the main race, it will increasingly be left behind as the leading powers progress, and if there's enough political will to solve the main problem (the US and China racing) then you can also easily solve the side problem (e.g. by China putting pressure on North Korea to stop).

you can find some comments I've made about this by searching my twitter

Link here, and there are other comments in the same thread. I was on my laptop, which has twitter blocked, so I couldn't link it myself before.

Comment by Richard_Ngo (ricraz) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-03T19:05:14.686Z · LW · GW

However, it seems to me like ruling out ARA is a relatively naturally way to mostly rule out relatively direct danger.

This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. And while I agree that ruling out ARA rules out most direct danger, I think that's because ARA is just quite a low bar. The sorts of tasks involved in buying compute etc. are ones most humans could do, whereas more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.

once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary

You'd need either really good ARA or really good self-improvement ability for an ARA agent to keep up with the labs, given the huge compute penalty it will face, unless there's a big slowdown. And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.