Posts

Anthropic, and taking "technical philosophy" more seriously 2025-03-13T01:48:54.184Z
"Think it Faster" worksheet 2025-02-08T22:02:27.697Z
Voting Results for the 2023 Review 2025-02-06T08:00:37.461Z
C'mon guys, Deliberate Practice is Real 2025-02-05T22:33:59.069Z
Wired on: "DOGE personnel with admin access to Federal Payment System" 2025-02-05T21:32:11.205Z
Last week of the Discussion Phase 2025-01-09T19:26:59.136Z
What are the most interesting / challenging evals (for humans) available? 2024-12-27T03:05:26.831Z
ReSolsticed vol I: "We're Not Going Quietly" 2024-12-26T17:52:33.727Z
Hire (or Become) a Thinking Assistant 2024-12-23T03:58:42.061Z
The "Think It Faster" Exercise 2024-12-11T19:14:10.427Z
Subskills of "Listening to Wisdom" 2024-12-09T03:01:18.706Z
The 2023 LessWrong Review: The Basic Ask 2024-12-04T19:52:40.435Z
JargonBot Beta Test 2024-11-01T01:05:26.552Z
The Cognitive Bootcamp Agreement 2024-10-16T23:24:05.509Z
OODA your OODA Loop 2024-10-11T00:50:48.119Z
Scaffolding for "Noticing Metacognition" 2024-10-09T17:54:13.657Z
"Slow" takeoff is a terrible term for "maybe even faster takeoff, actually" 2024-09-28T23:38:25.512Z
2024 Petrov Day Retrospective 2024-09-28T21:30:14.952Z
[Completed] The 2024 Petrov Day Scenario 2024-09-26T08:08:32.495Z
What are the best arguments for/against AIs being "slightly 'nice'"? 2024-09-24T02:00:19.605Z
Struggling like a Shadowmoth 2024-09-24T00:47:05.030Z
Interested in Cognitive Bootcamp? 2024-09-19T22:12:13.348Z
Skills from a year of Purposeful Rationality Practice 2024-09-18T02:05:58.726Z
What is SB 1047 *for*? 2024-09-05T17:39:39.871Z
Forecasting One-Shot Games 2024-08-31T23:10:05.475Z
LessWrong email subscriptions? 2024-08-27T21:59:56.855Z
Please stop using mediocre AI art in your posts 2024-08-25T00:13:52.890Z
Would you benefit from, or object to, a page with LW users' reacts? 2024-08-20T16:35:47.568Z
Optimistic Assumptions, Longterm Planning, and "Cope" 2024-07-17T22:14:24.090Z
Fluent, Cruxy Predictions 2024-07-10T18:00:06.424Z
80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) 2024-07-03T20:34:50.741Z
What percent of the sun would a Dyson Sphere cover? 2024-07-03T17:27:50.826Z
What distinguishes "early", "mid" and "end" games? 2024-06-21T17:41:30.816Z
"Metastrategic Brainstorming", a core building-block skill 2024-06-11T04:27:52.488Z
Can we build a better Public Doublecrux? 2024-05-11T19:21:53.326Z
some thoughts on LessOnline 2024-05-08T23:17:41.372Z
Prompts for Big-Picture Planning 2024-04-13T03:04:24.523Z
"Fractal Strategy" workshop report 2024-04-06T21:26:53.263Z
One-shot strategy games? 2024-03-11T00:19:20.480Z
Rationality Research Report: Towards 10x OODA Looping? 2024-02-24T21:06:38.703Z
Exercise: Planmaking, Surprise Anticipation, and "Baba is You" 2024-02-24T20:33:49.574Z
Things I've Grieved 2024-02-18T19:32:47.169Z
CFAR Takeaways: Andrew Critch 2024-02-14T01:37:03.931Z
Skills I'd like my collaborators to have 2024-02-09T08:20:37.686Z
"Does your paradigm beget new, good, paradigms?" 2024-01-25T18:23:15.497Z
Universal Love Integration Test: Hitler 2024-01-10T23:55:35.526Z
2022 (and All Time) Posts by Pingback Count 2023-12-16T21:17:00.572Z
Raemon's Deliberate (“Purposeful?”) Practice Club 2023-11-14T18:24:19.335Z
Hiring: Lighthaven Events & Venue Lead 2023-10-13T21:02:33.212Z
"The Heart of Gaming is the Power Fantasy", and Cohabitive Games 2023-10-08T21:02:33.526Z

Comments

Comment by Raemon on Levels of Friction · 2025-03-18T02:39:57.029Z · LW · GW

Curated. This concept seems like an important building block for designing incentive structures / societies, and this seems like a good comprehensive reference post for the concept.

Comment by Raemon on One pager · 2025-03-18T01:27:00.095Z · LW · GW

Note: it looks like you probably want this to be a markdown file. You can go to https://www.lesswrong.com/account, find the "site customizations" section, and click "activate Markdown" to enable the markdown editor.

Comment by Raemon on Paper: Field-building and the epistemic culture of AI safety · 2025-03-17T16:57:50.754Z · LW · GW

Fyi I think it’s time to do minor formatting adjustments to make papers/abstracts easier to read on LW

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-16T22:25:33.867Z · LW · GW

I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is "careful conceptual thinking might be required rather than pure naive empiricism (because we won't be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this" and the bailey is "extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed".

Yeah I agree that was happening somewhat. The connecting dots here are "in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably." 

I think my actual belief is "the Motte is high likelihood true, the Bailey is... medium-ish likelihood true, but, like, it's a distribution, there's not a clear dividing line between them" 

I also think the pause can be "well, we're running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but we can't run them that long or fast. They help speed things up and make what'd normally be a 10-30 year pause into a 3-10 year pause, but also the world would be going crazy left to its own devices, the sort of global institutional changes necessary are still similarly outside the Overton window as a 20-year global moratorium, and the 'race with China' rhetoric is still bad."

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-16T19:21:19.353Z · LW · GW

Thanks for laying this out thus far. I'mma reply, but understand if you wanna leave the convo here. I would be interested in more effortpost/dialogue about your thoughts here.

Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)

This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But it doesn't make sense for the claim "we need philosophical competence to align general, open-ended intelligence." I suppose my OP didn't really distinguish these claims, and there were a few interpretations of how the arguments fit together. I was more saying the second (although to be fair I'm not sure I was actually distinguishing them well in my head until now).

It doesn't make sense for "we 'just' need to be able to hand off to an AI which is seriously aligned" to be a crux for the second. A thing can't be a crux for itself.

I notice my "other-guy-feels-like-they're-missing-the-point" -> "check if I'm not listening well, or if something is structurally wrong with the convo" alarm is firing, so maybe I do want to ask for one last clarification: "Did you feel like you understood this the first time? Does it feel like I'm missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it's because I'm being dense about something)?"

Takes on your proposal

Meanwhile, here's some takes based on my current understanding of your proposal.

These bits:

We need to ensure that our countermeasures aren't just shifting from a type of misalignment we can detect to a type we can't. Qualitatively analyzing the countermeasures and our tests should help here.

...is a bit I think is philosophical-competence bottlenecked. And this bit:

"Actually, we didn't have any methods available to try which could end up with a model that (always) isn't egregiously misaligned. So, even if you can iterate a bunch, you'll just either find that nothing works or you'll just fool yourself."

...is a mix of "philosophically bottlenecked" and "rationality bottlenecked." (i.e. you both have to be capable of reasoning about whether you've found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you're deploying that reasoning accurately)

I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough. 

(I think at least some people on the alignment science or interpretability teams might be. I bet against the median such team member being able to navigate it. And ultimately, what matters is "does Anthropic leadership go forward with the next training run," so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people. And Anthropic leadership already seems to basically be ignoring arguments of this type, and I don't actually expect to get the sort of empirical clarity that (it seems like) they'd need to update before it's too late.)

Second, we can study how generalization on this sort of thing works in general

I think this counts as the sort of empiricism I'm somewhat optimistic about in my post. i.e. if you are able to find experiments that actually give you evidence about deeper laws, that let you then make predictions about new Actually Uncertain questions of generalization that you then run more experiments on... that's the sort of thing I feel optimistic about. (Depending on the details, of course.)

But, you still need technical philosophical competence to know if you're asking the right questions about generalization, and to know when the results actually imply that the next scale-up is safe.

Comment by Raemon on Paper: Field-building and the epistemic culture of AI safety · 2025-03-15T18:19:04.561Z · LW · GW

FYI I found this intro fairly hard to read – partly due to generally large blocks of text (see: Abstracts should be either Actually Short™, or broken into paragraphs) and also because it just... doesn't actually really say what the main point is, AFAICT. (It describes a bunch of stuff you do, but I had trouble finding the actual main takeaway, or primary sorts of new information I might get by reading it)

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:51:29.177Z · LW · GW

I don't really see why this is a crux. I'm currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn't really change my strategic orientation. Maybe you're focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.

I think you kinda convinced me here that this reasoning isn't (as stated) very persuasive.

I think my reasoning had some additional steps like:

  • when I'm 15% on 'alignment might be philosophically hard', I still expect to maybe learn more and update to 90%+, and it seems better to pursue strategies that don't actively throw that world under the bus. (and, while I don't fully understand the Realpolitik, it seems to me that Anthropic could totally be pursuing strategies that achieve a lot of its goals without Policy Comms that IMO actively torch the "long pause" worlds)
  • you are probably right that I was more oriented around "getting to like 5% risk" than around reducing risk on the margin.
  • I'm probably partly just not really visualizing what it'd be like to be a 15%-er and bringing some bias in.
Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:47:50.409Z · LW · GW

I'm pretty skeptical of the "extreme philosophical competence" perspective. This is basically because we "just" need to be able to hand off to an AI which is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for us to check).

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Looking over the comments of the first joshc post, it seems like that post also basically asserted by fiat that it wasn't necessary. And the people who actively believe in "alignment is philosophically loaded" showed up to complain that this ignored the heart of the problem.

My current summary of the arguments (which I put ~60% on, and which I think Eliezer/Oli/Wentworth treat much more confidently and maybe believe a stronger version of) is something like:

  1. Anything general enough to really tackle openended, difficult-to-evaluate plans, will basically need to operate in a goal directed way in order to do that. (i.e. What's Up With Confusingly Pervasive Goal Directedness?)
  2. The goal-directedness means it's very likely to be self/situationally aware, and the requisite intelligence to solve these sorts of problems means even if it's not full blown anti-aligned, it's at least probably going to want to try to build more option value for itself.
  3. The fact that you can't evaluate the results means it has a lot of room to give you answers that help preserve its goals and bootstrap (at least on the margin), even if it's not massively smart enough to one-shot escape. And you can't solve that problem with Control (i.e. The Case Against AI Control Research).
  4. You can maybe have interpretability tools that check for schemingness (if it's the first generation of generally capable agent and isn't too smart yet, maybe you've done a good job preserving Chain of Thought as a reasonably faithful representation, for now). But, you'll then just see "yep, the agent is unaligned", and not actually be able to fix it. 

I think my current model of you (Ryan) is like:

"Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from "don't think about scheming ever" to "secretly think about scheming" to avoid getting caught, and that probably just won't work?" 

(or, in the example of the second joshC post, they need to learn to be really reliably good at truth-tracking-patterns and articulating their reasoning; after internalizing that for thousands of reps, an AI is just gonna have a hard time jumping to reasoning that isn't truth tracking).

I don't have a clear model of how you respond to point #4 – that we'll just reliably find them to be scheming if we succeed at the interpretability steps, and not have a good way of dealing with it. (Maybe you just don't think this is as overwhelmingly likely?)

Interested in whatever Real You's cruxes are, 1-2 steps removed.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:25:46.375Z · LW · GW

Thanks. I'll probably reply to different parts in different threads.

For the first bit:

My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.

The rough numbers you give are helpful. I'm not 100% sure I see the dots you're intending to connect with "leadership thinks 1/5-ryan-misalignment and 2x-ryan-totalitarianism" / "rest of alignment science team closer to ryan" -> "this explains a lot."

Is this just the obvious "whelp, leadership isn't bought into this risk model and calls most of the shots, even though conversations with several employees engage more with misalignment"? Or was there a more specific dynamic you thought it explained?

Comment by Raemon on AI Tools for Existential Security · 2025-03-14T23:57:41.300Z · LW · GW

Do you have existing ones you recommend?

I'd been working on a keylogger / screenshot-parser that's optimized for a) playing nicely with LLMs while b) being unopinionated about what other tools you plug it into. (in my search for existing tools, I didn't find keyloggers that actually did the main thing I wanted, and the existing LLM-tools that did similar things were walled-garden ecosystems that didn't give me much flexibility on what I did with the data)
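
For concreteness, here's a minimal sketch of the kind of capture layer I mean (not the actual tool, just an illustration assuming the `mss` and `pynput` Python packages, with all file and directory names hypothetical). It dumps keystrokes and periodic screenshots as plain timestamped files, and leaves the question of what to feed an LLM entirely to whatever tool you plug in downstream:

```python
# Minimal sketch of an LLM-friendly, unopinionated capture layer.
# Assumes the `mss` (screenshots) and `pynput` (keyboard) packages;
# output is plain timestamped files any downstream pipeline can read.
import datetime
import pathlib
import threading
import time

import mss
from pynput import keyboard

LOG_DIR = pathlib.Path("activity_log")  # hypothetical output directory
LOG_DIR.mkdir(exist_ok=True)


def log_key(key):
    # Append each keypress as a timestamped plain-text line.
    with open(LOG_DIR / "keys.txt", "a") as f:
        f.write(f"{datetime.datetime.now().isoformat()}\t{key}\n")


def screenshot_loop(interval_s=60):
    # Save a full-screen PNG every `interval_s` seconds.
    with mss.mss() as sct:
        while True:
            stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
            sct.shot(output=str(LOG_DIR / f"screen-{stamp}.png"))
            time.sleep(interval_s)


if __name__ == "__main__":
    threading.Thread(target=screenshot_loop, daemon=True).start()
    with keyboard.Listener(on_press=log_key) as listener:
        listener.join()
```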

Comment by Raemon on AI4Science: The Hidden Power of Neural Networks in Scientific Discovery · 2025-03-14T22:11:29.528Z · LW · GW

Minor note but I found the opening section hard to read. See: Abstracts should be either Actually Short™, or broken into paragraphs 

Comment by Raemon on johnswentworth's Shortform · 2025-03-14T21:26:10.810Z · LW · GW

Were you by any chance writing in Cursor? I think they recently changed the UI such that it's easier to end up in "agent mode" where it sometimes randomly does stuff.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T20:23:58.688Z · LW · GW

I am kinda intrigued by how controversial this post seems (based on seeing the karma creep upwards and then back down over the past day). I am curious if the downvoters tend more like:

  • Anti-Anthropic-ish folk who think the post is way too charitable/soft on Anthropic
  • Pro-Anthropic-ish folk who think the post doesn't make very good/worthwhile arguments against Anthropic
  • "Alignment-is-real-hard" folks who think this post doesn't represent the arguments for that very well.
  • "other?"
Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T20:05:09.223Z · LW · GW

I agree with this (and think it's good to periodically say all of this straightforwardly).

I don't know that it'll be particularly worth your time, but the thing I was hoping for with this post was to ratchet the conversation-re-Anthropic forward in, like, "doublecrux-weighted-concreteness." (i.e. your arguments here are reasonably crux-and-concrete, but don't seem to be engaging much with the arguments in this post that seemed more novel and representative of where Anthropic employees tend to be coming from, instead just repeating, AFAICT, your cached arguments against Anthropic)

I don't have much hope of directly persuading Dario, but I feel some hope of persuading both current and future-prospective employees who aren't starting from the same prior of "alignment is hard enough that this plan is just crazy", and for that to have useful flow-through effects. 

My experience talking at least with Zac and Drake has been "these are people with real models, who share many-but-not-all MIRI-ish assumptions but don't intuitively buy that Anthropic's downsides are high, and would respond to arguments that were doing more to bridge perspectives." (I'm hoping they end up writing comments here outlining more of their perspective/cruxes, which they'd expressed interest in in the past, although I ended up shipping the post quickly without trying to line up everything.)

I don't have a strong belief that contributing to that conversation is a better use of your time than whatever else you're doing, but it seemed sad to me for the conversation to not at least be attempted. 

(I do also plan to write 1-2 posts that are more focused on "here's where Anthropic/Dario have done things that seem actively bad to me and IMO are damning unless accounted for," that are less "attempt to maintain some kind of discussion-bridge", but, it seemed better to me to start with this one)

Comment by Raemon on Elizabeth's Shortform · 2025-03-14T19:04:27.981Z · LW · GW

Yeah, I was staring at the poll and went "oh no." Reacts aren't often actually used this way, so it's not obviously right to special-case it, although maybe we do polls enough that we should generally present reacts in "first->last posted" order rather than sorting by number.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T17:59:47.549Z · LW · GW

I don't particularly disagree with the first half, but your second sentence isn't really a crux for me for the first part. 

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T05:19:10.606Z · LW · GW

I think (moderately likely, though not super confident) it makes more sense to model Dario as:

"a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn't believe alignment is that hard)" 

than as "a generic CEO who's just generally following incentives and spinning narrative post-hoc rationalizations."  

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T05:06:06.407Z · LW · GW

I think I... agree denotationally and (lean towards) disagree connotationally? (like, seems like this is implying "and because he doesn't seem like he obviously has coherent views on alignment-in-particular, it's not worth arguing the object level?")

(to be clear, I don't super expect this post to affect Dario's decisionmaking models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking to them is that a lot of the LW-flavored arguments have often missed their cruxes)

Comment by Raemon on Trojan Sky · 2025-03-14T04:11:27.277Z · LW · GW

Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even "people who aren't put off by AI" - it would never show up on my youtube.

For frame of reference, do regular movie trailers normally show up in your youtube? This video seemed relatively "mainstream"-vibing to me, although somewhat limited by the medium.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T01:40:55.360Z · LW · GW

I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).

 

Seems like "the AIs are good enough at persuasion to persuade governments and someone is deploying them for that" is right when you need to be very high confidence they're obedient (and, don't have some kind of agenda). If they can persuade governments, they can also persuade you of things.

I also think it gets to a point where I'd sure feel way more comfortable if we had more satisfying answers to "where exactly are we supposed to draw the line between 'informing' and 'manipulating'?" (I'm not 100% sure what you're imagining here tho)

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-13T02:08:24.550Z · LW · GW

Does Anthropic shorten timelines, by working on automatic AI research? 

I think "at least a little", though not actually that much. 

There are a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic's presence in the race still puts marginal pressure on OpenAI (and other companies) to rush things out the door a bit with less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don't launch a broken product. Having more "real" competition seems like it'd make people more willing to cut corners to avoid getting scooped on product releases.)

(I also think earlier work by Dario at OpenAI, and the founding of Anthropic in the first place, probably did significantly shorten timelines. But, this factor isn't significant at this point, and while I'm mad about the previous stuff it's not actually a crux for their current strategy)

Subquestions:

  • How many bits does Anthropic leak by doing their research? This is plausibly low-ish. I don't know of them actually leaking much about reasoning models until after OpenAI and DeepSeek had pretty thoroughly exposed that vein of research.
  • How many other companies are actually focused on automating AI research, or pushing frontier AI in ways that are particularly relevant? If it's a small number, then I think Anthropic's contribution to this race is larger and more costly.  I think the main mechanism here might be Anthropic putting pressure on OpenAI in particular (by being one of 2-3 real competitors on 'frontier AI', which pushes OpenAI to release things with less safety testing)

     

Is Anthropic institutionally capable of noticing "it's really time to stop our capabilities research," and doing so, before it's too late? 

I know they have the RSP. I think there is a threshold of danger where I believe they'd actually stop. 

The problem is, before we get to "if you leave this training run overnight it might bootstrap into deceptive alignment that fools their interpretability and then either FOOMs, or gets deployed" territory, there will be a period of "Well, maybe it might do that, but also The Totalitarian Guys Over There are still working on their training and we don't want to fall behind." And meanwhile, it's also just sort of awkward/difficult[10] to figure out how to reallocate all your capabilities researchers onto non-dangerous tasks.

 

How realistic is it to have a lead over "labs at more dangerous companies?" (Where "more dangerous" might mean more reckless, or more totalitarian)

This is where I feel particularly skeptical. I don't get how Anthropic's strategy of race-to-automate-AI can make sense without actually expecting to get a lead, and with the rest of the world also generally racing in this direction, it seems really unlikely for them to have much lead. 

Relatedly... (sort of a subquestion but also an important top-level question)

 

Does racing towards Recursive Self Improvement make timelines worse (as opposed to just "shorter")?

Maybe Anthropic pushing the frontier doesn't shorten timelines (because there's already at least a few other organizations who are racing with each other, and no one wants to fall behind).

But, Anthropic being in the race (and, also publicly calling for RSI in a fairly adversarial way, i.e. "gaining a more durable advantage") might cause there to be more companies and nations explicitly racing for full AGI, and doing so in a more adversarial way, and generally making the gameboard more geopolitically chaotic at a crucial time.

This seems more true to me than the "does Anthropic shorten timelines?" question. I think there are currently few enough labs doing this that a marginal lab going for AGI does make that seem more "real," and give FOMO to other companies/countries.[11]

But, given that Anthropic has already basically stated they are doing this, the subquestion is more like:

  • If Anthropic publicly/credibly shifted away from racing, would that make race dynamics better? I think the answer here is "yes, but, it does depend on how you actually go about it."

 

Assuming Anthropic got powerful but controllable ~human-genius-ish level AI, can/will they do something useful with it to end the acute risk period?

In my worldview, getting to AGI only particularly matters if you leverage it to prevent other people from creating reckless/powerseeking AI. Otherwise, whatever material benefits you get from it are short lived.

I don't know how Dario thinks about this question. This could mean a lot of things. Some ways of ending the acute risk period are adversarial, or unilateralist, and some are more cooperative (either with a coalition of groups/companies/nations, or with most of the world). 

This is the hardest to have good models about. Partly it's just, like, quite a hard problem for anyone to know what it looks like to handle this sanely. Partly, it's the sort of thing people are more likely to not be fully public about.

Some recent interviews have had him saying "Guys, this is a radically different kind of technology, we need to come together and think about this. It's bigger than one company should be deciding what to do with." There are versions of this that are a cheap platitude more than an earnest plea, but I do basically take him at his word here.

He doesn't talk about x-risk, or much about uncontrollable AI. The "Core views on AI safety" lists "alignment might be very hard" as a major plausibility they are concerned with, and implies it ends up being like 1/3 or something of their 

Subquestions:

  • Are there useful things you can do here with controllable power levels of AI? i.e.
    • Can you get to very high power levels using the set of skills/approaches Anthropic is currently bringing to bear?
    • Can we muddle through the risk period with incremental weaker tech and moderate coalition-size advantage?
  • Will Anthropic be able to leverage this sanely/safely under time pressure?
Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-13T02:07:42.276Z · LW · GW

Cruxes and Questions

The broad thrust of my questions are:

Anthropic Research Strategy

  • Does Anthropic building towards automated AGI research make timelines shorter (via spurring competition or leaking secrets)?
  • ...or, make timelines worse (by inspiring more AI companies or countries to directly target AGI, as opposed to merely trying to cash in on the current AI hype)?
  • Is it realistic for Anthropic to have enough of a lead to safely build AGI in a way that leads to durably making the world safer?

"Is Technical Philosophy actually that big a deal?"

  • Can there be pivotal acts that require high AI powerlevels, but not unboundedly high, in a reasonable timeframe, such that they're achievable without solving The Hard Parts of robust pointing?

Governance / Policy Comms

  • Is it practical for a western coalition to stop the rest of the world (and, governments and other major actors within the western coalition) from building reckless or evil AI?
Comment by Raemon on Trojan Sky · 2025-03-13T01:12:49.722Z · LW · GW

I would bet they are <1% of the population. Do you disagree, or think they disproportionately matter?

Comment by Raemon on Trojan Sky · 2025-03-13T00:47:40.090Z · LW · GW

I'm skeptical that there are actually enough people so ideologically opposed to this, that it outweighs the upside of driving home that capabilities are advancing, through the medium itself. (similar to how even though tons of people hate FB, few people actually leave)

I'd be wanting to target a quality level similar to this:

Comment by Raemon on Trojan Sky · 2025-03-12T21:24:41.963Z · LW · GW

One of the things I track is "ingredients for a good movie or TV show that would actually be narratively satisfying / memetically fit," that would convey good/realistic AI hard sci-fi to the masses.

One of the more promising strategies within that, that I can think of, is "show multiple timelines" or "flashbacks from a future where the AI wins but it goes slowly enough to be human-narrative-comprehensible" (with the flashbacks being about the people inventing the AI).

This feels like one of the reasonable options for a "future" narrative. (A previous one I was interested in was the Green goo is plausible concept)

Also, I think many Richard Ngo stories would lend themselves well to being some kind of cool youtube video, leveraging AI generated content to make things feel higher budget and also sending an accompanying message of "the future is coming, like now." (King and the Golem was nice but felt more like a lecture than a video, or something). A problem with AI generated movies is that the tech's not there yet for it not being slightly uncanny, but I think Ngo stories have a vibe where the uncanniness will be kinda fine.

Comment by Raemon on TsviBT's Shortform · 2025-03-12T19:11:05.675Z · LW · GW

I also kinda thought this. I actually thought it sounded sufficiently academic that I didn't realize at first it was your org, instead of some other thing you were supporting.

Comment by Raemon on Neil Warren's Shortform · 2025-03-11T03:51:54.359Z · LW · GW

LW moderators have a policy of generally rejecting LLM stuff, but some things slip through cracks. (I think maybe LLM writing got a bit better recently and some of the cues I used are less reliable now, so I may have been missing some)

Comment by Raemon on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-03-11T03:31:22.020Z · LW · GW

Curated. This was one of the more interesting results from the alignment scene in a while.

I did like Martin Randall's comment distinguishing "alignment" from "harmless" in the Helpful/Harmless/Honest sense (i.e. the particular flavor of 'harmlessness' that got trained into the AI). I don't know whether Martin's particular articulation is correct for what's going on here, but in general it seems important to track that just because we've identified some kind of vector, that doesn't mean we necessarily understand what that vector means. (I also liked that Martin gave some concrete predictions implied by his model)

Comment by Raemon on when will LLMs become human-level bloggers? · 2025-03-10T21:26:40.651Z · LW · GW

@kave @habryka

Comment by Raemon on johnswentworth's Shortform · 2025-03-10T19:55:08.021Z · LW · GW

main bottleneck to counterfactuality

I don't think the social thing ranks above "be able to think useful important thoughts at all". (But maybe otherwise agree with the rest of your model as an important thing to think about)

[edit: hrm, "for smart people with a strong technical background" might be doing most of the work here]

Comment by Raemon on A Bear Case: My Predictions Regarding AI Progress · 2025-03-10T17:10:25.385Z · LW · GW

It seems good for me to list my predictions here. I don't feel very confident. I feel an overall sense of "I don't really see why major conceptual breakthroughs are necessary." (I agree we haven't seen, like, an AI do something like "discover actually significant novel insights.")

This doesn't translate into me being confident in very short timelines, because the remaining engineering work (and "non-major" conceptual progress) might take a while, or require a commitment of resources that won't materialize before a hype bubble pops.

But:

a) I don't see why novel insights or agency wouldn't eventually fall out of relatively straightforward pieces of:

  • "make better training sets" (and training-set generating processes)
  • "do RL training on a wide variety of tasks"
  • "find some algorithmic efficiency advances that, sure, require 'conceptual advances' from humans, but of a sort of straightforward kind that doesn't seem like it requires deep genius?" 

b) Even if A doesn't work, I think "make AIs that are hyperspecialized at augmenting humans doing AI research" is pretty likely to work, and that + just a lot of money/attention generally going into the space seems to increase the likelihood of it hitting The Crucial AGI Insights (if they exist) in a brute-force-but-clever kinda way.

Assembling the kind of training sets (or, building the process that automatedly generates such sets) you'd need to do the RL seems annoyingly-hard but not genius-level hard. 

I expect there to be a couple innovations that are roughly on the same level as "inventing attention" that improve efficiency a lot, but don't require a deep understanding of intelligence. 

Comment by Raemon on How Much Are LLMs Actually Boosting Real-World Programmer Productivity? · 2025-03-05T05:41:57.471Z · LW · GW

One thing is I'm definitely able to spin up side projects that I just would not have been able to do before, because I can do them with my "tired brain."

Some of them might turn out to be real projects, although it's still early stage.

Comment by Raemon on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-03T23:34:48.564Z · LW · GW

My current guess is:

1. This is more relevant for up to the first couple generations of "just barely superintelligent" AIs.

2. I don't really expect it to be the deciding factor after many iterations of end-to-end RSI that gets you to the "able to generate novel scientific or engineering insights much faster than a human or institution could." 

I do think it's plausible that the initial bias towards "evil/hackery AI" could start it off in a bad basin of attraction, but a) even if you completely avoided that, I would still basically expect it to rediscover this on its own as it gained superhuman levels of competence, b) one of the things I most want to use a slightly-superhuman AI to do is to robustly align massively superhuman AI, and I don't really see how to do that without directly engaging with the knowledge of the failure modes there.

I think there are other plans that route more through "use STEM AI to build an uploader or bioenhancer, and then have an accelerated human-psyche do the technical philosophy necessary to handle the unbounded alignment case." I could see that being the right call, and I could imagine the bias from the "already knows about deceptive alignment etc" being large-magnitude enough to matter in the initial process. [edit: In those cases I'd probably want to filter out a lot more than just "unfriendly AI strategies"]

But, basically, how this applies depends on what it is you're trying to do with the AI, and what stage/flavor of AI you're working with and how it's helping.

Comment by Raemon on "AI Rapidly Gets Smarter, And Makes Some of Us Dumber," from Sabine Hossenfelder · 2025-02-27T02:10:32.547Z · LW · GW

Yep, thank you!

Comment by Raemon on LoganStrohl's Shortform · 2025-02-26T23:28:12.424Z · LW · GW

This says "remembering to do things in the future."

Comment by Raemon on "AI Rapidly Gets Smarter, And Makes Some of Us Dumber," from Sabine Hossenfelder · 2025-02-26T22:38:58.871Z · LW · GW

It'd be nice to have the key observations/evidence in the tl;dr here. I'm worried about this but would like to stay grounded in how bad it is exactly.

Comment by Raemon on LoganStrohl's Shortform · 2025-02-26T21:55:16.973Z · LW · GW

I think I became at least a little wiser reading this sentence. I know you're mostly focused on other stuff but I think I'd benefit from some words connecting more of the dots.

Comment by Raemon on So You Want To Make Marginal Progress... · 2025-02-24T07:10:28.407Z · LW · GW

I think the Gears Which Turn The World sequence, and Specializing in Problems We Don't Understand, and some other scattered John posts I don't remember as well, are a decent chunk of an answer.

Comment by Raemon on So You Want To Make Marginal Progress... · 2025-02-23T20:05:57.634Z · LW · GW

Curated. I found this a clearer explanation of "how to think about bottlenecks, and things that are not-especially-bottlenecks-but-might-be-helpful" than I previously had. 

Previously, I had thought about major bottlenecks, and I had some vague sense of "well, there definitely seems like there should be more ways to be helpful than just tackling central bottlenecks, but a lot of ways to do that misguidedly." But I didn't have any particular models for thinking about it, and I don't think I could have explained it very well.

I think there are better ways of doing forward-chaining and backward-chaining than listed here (ways which roughly correspond to "the one who thought about it a bit," but with a bit more technique for getting traction).

I do think the question of "to what degree is your field shaped like 'there's a central bottleneck that is to a first approximation the only thing that matters here'?" is an important question that hasn't really been argued for here. (I can't recall offhand if John has previously written a post exactly doing that in those terms, although the Gears Which Turn the World sequence is at least looking at the same problemspace)

Comment by Raemon on AI #104: American State Capacity on the Brink · 2025-02-20T20:27:18.374Z · LW · GW

Update: In a slack I'm in, someone said:

A friend of mine who works at US AISI advised:

> "My sense is that relevant people are talking to relevant people (don't know specifics about who/how/etc.) and it's better if this is done in a carefully controlled manner."

And another person said:

Per the other thread, a bunch of attention on this from EA/xrisk coded people could easily be counterproductive, by making AISI stick out as a safety thing that should be killed

And while I don't exactly wanna trust "the people behind the scenes have it handled", I do think the failure mode here seems pretty real.

Comment by Raemon on Arbital has been imported to LessWrong · 2025-02-20T20:22:44.304Z · LW · GW

I guess I'm just kinda surprised "perspective" feels metaphorical to you – it seems like that's exactly what it is.

(I think it's a bit of a long clunky word so not obviously right here, but, still surprised about your take)

Comment by Raemon on Arbital has been imported to LessWrong · 2025-02-20T20:09:08.114Z · LW · GW

What would be less metaphorical than "perspective" that still captures the 'one opinionated viewpoint' thing?

Comment by Raemon on AI #104: American State Capacity on the Brink · 2025-02-20T19:20:35.993Z · LW · GW

I called some congresspeople but honestly, I think we should have enough people-in-contact with Elon to say "c'mon man, please don't do that?". I'd guess that's more likely to work than most other things?

Comment by Raemon on AI #104: American State Capacity on the Brink · 2025-02-20T18:52:06.355Z · LW · GW

The Trump Administration is on the verge of firing all ‘probationary’ employees in NIST, as they have done in many other places and departments, seemingly purely because they want to find people they can fire. But if you fire all the new employees and recently promoted employees (which is what ‘probationary’ means here) you end up firing quite a lot of the people who know about AI or give the government state capacity in AI.

This would gut not only America’s AISI, its primary source of a wide variety of forms of state capacity and the only way we can have insight into what is happening or test for safety on matters involving classified information. It would also gut our ability to do a wide variety of other things, such as reinvigorating American semiconductor manufacturing. It would be a massive own goal for the United States, on every level.

Please, it might already be too late, but do whatever you can to stop this from happening. Especially if you are not a typical AI safety advocate, helping raise the salience of this on Twitter could be useful here.

 

Do you (or anyone) have any gears as to who is the best person to contact here?

I'm slightly worried about making it salient on twitter because I think the pushback from people who do want them all fired might outweigh whatever good it does.

Comment by Raemon on Raemon's Shortform · 2025-02-18T21:52:14.231Z · LW · GW

I've now worked with 3 Thinking Assistants, and there are a couple more I haven't gotten to try out yet. So far I've been doing it with remote ones, who I share my screen with. If you would like to try them out I can DM you information and my sense of their various strengths.

The baseline benefit is just them asking "hey, are you working on what you mean to work on?" every 5 minutes. A thing I should do but haven't yet is have them be a bit more proactive in asking if I've switched tasks (because sometimes it's hard to tell looking at my screen), and nagging me a bit harder about "is this the right thing?" if I'm either switching a lot, or doing a task that seems at odds with my stated goals for the day.

Sometimes I have them do various tasks that are easy to outsource, depending on their skills and what I need that day.

I have a google doc I have them read in advance that lays out my overall approach, and which includes a journal for myself I'm often taking notes in, and a journal for each assistant I work with for them to take notes. I think something-like-this is a good practice. 

For reference, here's my intro:

Intro

There’s a lot of stuff I want done. I’m experimenting with hiring a lot of assistants to help me do it. 

My plans are very in-flux, so I prefer not to make major commitments, just hire people piecemeal to either do particular tasks for me, or sit with me and help me think when I’m having trouble focusing.

My working style is “We just dive right into it, usually with a couple hours where I’m testing to see if we work well together.” I explain things as we go. This can be a bit disorienting, but I’ve tried to write important things in this doc which you can read first. Over time I may give you more openended, autonomous tasks, if that ends up making sense.

Default norms

  • Say “checking in?” and if it’s a good time to check in I’ll say “ok” or “no.” If I don’t respond at all, wait 30-60 seconds and then ask again more forcefully (but still respect a “no”)
  • Paste in metastrategies from the metastrategy tab into whatever area I’m currently working in when it seems appropriate.

For Metacognitive Assistants

Metacognitive Assistants sit with me and help me focus. Basic suggested workflow:

  • By default, just watch me work (coding/planning/writing/operations), and occasionally give signs you’re still attentive, without interrupting.
  • Make a tab in the Assistant Notes section. Write moment to moment observations which feel useful to you, as well as general thoughts. This helps you feel more proactively involved and makes you focused on noticing patterns and ways in which you could be more useful as an assistant.
  • The Journal tab is for his plans and thoughts about what to generally do. Read it as an overview.
  • This Context tab is for generally useful information about what you should do and about relevant strategies and knowledge Ray has in mind. Reading this helps you get a more comprehensive view on what his ideal workflow looks like, and what your ideal contributions look like.

Updating quickly

There’s a learning process for figuring out “when it is good to check if Ray’s stuck?” vs “when is it bad to interrupt his thought process?”. It’s okay if you don’t get it perfectly right at first, by try… “updating a lot, in both directions?” like, if it seemed like something was an unhelpful interruption, try speaking up half-as-often, or half-as-loudly, but then later if I seem stuck, try checking in on me twice-as-often or twice-as loudly, until we settle into a good rhythm.

Comment by Raemon on The "Think It Faster" Exercise · 2025-02-15T19:03:42.724Z · LW · GW

The "10x" here was meant more to refer to how long it took him to figure it out, than how much better it was. I'm less sure how to quantify how much better.

I'm busy atm but will see if I can get a screenshot from an earlier draft.

Comment by Raemon on The "Think It Faster" Exercise · 2025-02-13T20:59:43.727Z · LW · GW

Thanks! I'll keep this in mind both for potential rewrites here, and for future posts.

Comment by Raemon on "Think it Faster" worksheet · 2025-02-13T00:08:35.337Z · LW · GW

Curious how long this typically takes you?

Comment by Raemon on The Paris AI Anti-Safety Summit · 2025-02-12T22:14:01.982Z · LW · GW

Well, this is the saddest I've been since April 1st 2022.

It really sucks that SB 1047 didn't pass. I don't know if Anthropic could have gotten it passed if they had said "dudes, this is fucking important, pass it now" instead of "for some reason we should wait until things are already

It is nice that at least Anthropic did still get to show up to the table, and that they said anything at all. I sure wish their implied worldview didn't seem so crazy. (I really don't get how you can think it's workable to race here, even if you think Phase I alignment is easy, as well as it seeming really wrong to think Phase I alignment is that likely to be easy)

It feels like winning pathways right now mostly route through:

  • Some kind of miracle of Vibe Shift (ideally mediated through a miracle of Sanity). I think this needs masterwork-level communication / clarity / narrative setting.
  • Just... idk, somehow figure out how to just Solve The Hard Part Real Fast.
  • Somehow muddle through with scary demos that get a few key people to change their mind before it's too late.
Comment by Raemon on Elephant seal 2 · 2025-02-12T21:19:22.494Z · LW · GW

Oh lol right.