raemon

I'm not sure if this is fiction. I realize there's something nice about opening in media res but I think this could use better indication of who this is for or what it is.

Comment by Raemon on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-20T17:01:16.879Z · LW · GW

Nod, but fwiw if you don’t have a cached answer, I am interested in you spending like 15 minutes thinking through whether there exist startup-centric approaches to helping with x-risk that are good.

Comment by Raemon on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-19T19:51:31.820Z · LW · GW

Well yeah, but the question here is "what should be community guidelines on specifically how to approach startups that are aimed at specifically helping with AI safety" (which may or may not include AI), not "what kinds of AI startups should people start, if any?"

Comment by Raemon on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-19T18:03:13.599Z · LW · GW

I didn’t track this previously: how did they incentive themselves to reach the cap?

Comment by Raemon on A Dissent on Honesty · 2025-04-18T22:15:04.187Z · LW · GW

Against strict kantian deontologists I admit no version of this argument could be persuasive and they're free to bite the other bullet and fail to achieve any good outcomes.

Note that this is very different from what you said in your post, which is "sometimes you will lose." (And this one seems obviously false)

Comment by Raemon on Self-fulfilling misalignment data might be poisoning our AI models · 2025-04-17T02:44:19.784Z · LW · GW

I think I understood your article, and was describing which points/implications seemed important.

I think we probably agree on predictions for nearterm models (i.e. that including this training data makes it more likely for them to deceive), I just don't think it matters very much if sub-human-intelligence AIs deceive.

Comment by Raemon on Navigation by Moonlight · 2025-04-09T19:46:48.077Z · LW · GW

FYI I think there are a set of cues that move you from ‘pretty unlikely to be interested’ to ‘maybe interested’, but not that get you above like 25% likely.

Comment by Raemon on VDT: a solution to decision theory · 2025-04-08T18:22:15.939Z · LW · GW

FYI this was an April Fools joke.

Comment by Raemon on AI 2027: What Superintelligence Looks Like · 2025-04-07T23:54:06.622Z · LW · GW

Curated. I've been following this project for awhile (you can see some of the earlier process in Daniel's review of his "What 2026 looks like" post, and on his comment on Tom Davidon's What a Compute-centric framework says about AI takeoff). I've participated in one of the wargames that helped inform what sort of non-obvious things might happen along the path of AI takeoff.

(disclosure, Lightcone did a lot of work on the website of this project, although I was only briefly involved)

Like others have said, I appreciate this for both having a lot of research behind it, and for laying out something concrete enough to visualize and disagree with. Debating individual "event X will happen" predictions isn't exactly the point, since some of them are merely illustrative of "something similar that might happen." But, it's helpful for debating underlying models about what sort-of-events are likely to happen.

One of the central, obvious debates here is "does it actually make sense to just extrapolate the trends the way this way, or is AGI takeoff dependent on some unrelated progress?". Recent posts like A Bear Case and Have LLMs Generated Novel Insights?^[1] have argued the opposite view). I lean towards "the obvious trends will continue and the obvious AGI approaches will basically work", but only put it at bit over 50%. I think it's reasonable to have a lower credence there. But one thought I've had this week is: perhaps longer-time-folk (with some credence on this) should to spend the next year-or-so focusing more on plans that help in short-timeline worlds, and then return to longer time-horizon plans if a year from now, it seems like progress has slowed and there's some missing sauce.^[2]

I think it would have been nicer if a third scenario was presented – I think the current two-scenario setup comes across as more of a rhetorical device, i.e. "if y'all don't change your actions you will end up on the doomy racing scenario." I believe Daniel-et-al that that wasn't their intent, but I think a third scenario that highlighted some orthogonal axis of concern would have been helpful for getting people into the mindset of actually "rolling the simulation forward" rather than picking and arguing for a side.

^{^}
Notably, written before AI 2027 came out, although I think they were reacting to an intellectual scene that was nontrivially informed by earlier drafts of it.
^{^}
On the other hand, if most of your probability-mass is on mediumish timelines, and you have a mainline plan you think you could barely pull off in 10 years, such that taking a year off seems likely to make the difference,

Comment by Raemon on Exercise: Planmaking, Surprise Anticipation, and "Baba is You" · 2025-04-05T17:02:34.023Z · LW · GW

Yeah I introduced Baba is You more as a counterbalance to empiricism-leaning fields. I think ‘practice forms of thinking that don’t come natural to you’ is generally valuable so you don’t get in ruts.

Comment by Raemon on Exercise: Planmaking, Surprise Anticipation, and "Baba is You" · 2025-04-04T03:00:11.494Z · LW · GW

I’m curious how it went in terms of ‘do you think you learned anything useful?’

Comment by Raemon on The Ghibli Event: Our First Glimpse of Reality Transfer · 2025-04-03T00:15:27.349Z · LW · GW

I do think the thing you describe here is great. I think I hadn't actually tried really leveraging the current zeitgeist to actively get better at it, and it does seem like a skill you could improve at and that seems cool.

But I'd bet it's not what was happening for most people. I think the value-transfer is somewhat automatic, but most people won't actually be attuned to it enough. (might be neat to operationalize some kind of bet about this, if you disagree).

I do think it's plausible, if people put more deliberate effort it, to create a zeitgeist where the value transfer is more real for more people.

Comment by Raemon on The Ghibli Event: Our First Glimpse of Reality Transfer · 2025-04-02T03:52:55.725Z · LW · GW

A thing that gave me creeping horror about the Ghiblification is that the I don't think the masses actually particularly understand Ghibli. And the result is an uneven simulacrum-mask that gives the impression of "rendered with love and care" without actually being so.

The Ghibli aesthetic is historically pretty valuable to me, and in particular important as a counterbalanacing force against "the things I expect to happen by default with AI."

Some things I like about Ghibli:

The "cinematic lens" emphasizes a kind of "see everything with wonder and reverence" but not in a way that papers over ugly or bad things. Ugliness and even horror are somehow straightforwardly depicted, but in a way that somehow makes both seem very normal and down to earth, and also supernaturally majestic. (See On green, and The Expanding Moral Cinematic Universe).
The main characters are generally "low-ish neuroticism." (This youtube analysis I like argues that the women in particular are 'non-neurotic', and the men tend to be "low compared to modern city-dwelling standards.")

There's a bit of awkwardness where Miyazaki is particularly anti-transhumanist, where I disagree with him. But I feel like I could argue with him about it on his terms – I have an easy time imagining how to depict spirits of technology and capitalism and bureaucracy as supernatural forces that have the kind of alien grandeur, not on humanity's side or the "natural world's side", but still ultimately part of the world.

For years, I have sometimes walked down the street and metaphorically put on "Miyazaki goggles", where I choose to lean into a feeling of tranquility, and I choose to see everything through that "normal but reverent" stance. I imagine the people that live in each house doing their day to day things to survive and make money and live life. And seeing the slightly broken down things (a deteriorating fence, a crumbling sidewalk) as part of a natural ebb and flow of the local ecosystem. And seeing occasional more "naturally epic" things as particularly majestic and important.

So, the wave of "ghiblify everything" was something I appreciated, and renewed a felt-desire to live more often in a ghibli-ish world. But, also, when I imagine how this naturally plays out, I don't think it really gets us anything like a persistent reality transfer the way you describe. Mostly we get a cheap simulacra that may create some emotion / meaning at first, but will quickly fade into "oh, here's another cheap filter."

...

That all said, I do feel some intrigue at your concept here. I'm still generally wrapping my mind around what futures are plausible, and then desirable. I feel like I will have more to say about this after thinking more.

Comment by Raemon on New Cause Area Proposal · 2025-04-01T18:53:36.835Z · LW · GW

Seems at odds with longhairism.

Comment by Raemon on Good Research Takes are Not Sufficient for Good Strategic Takes · 2025-03-30T18:47:10.676Z · LW · GW

Curated. I think this is a pretty important point. I appreciate Neel's willigness to use himself as an example.

I do think this leaves us with the important followup questions of "okay, but, how actually DO we evaluate strategic takes?". A lot of people who are in a position to have demonstrated some kind of strategic awareness are people who are also some kind of "player" on the gameboard with an agenda, which means you can't necessarily take their statements at face value as an epistemic claim.

Comment by Raemon on Third-wave AI safety needs sociopolitical thinking · 2025-03-28T19:10:20.496Z · LW · GW

I think I agree with a lot of stuff here but don't find this post itself particularly compelling for the point.

I also don't think "be virtuous" is really sufficient to know "what to actually do." It matters a lot which virtues. Like I think environmentalism's problems wasn't "insufficiently virtue-ethics oriented", it's problem was that it didn't have some particular virtues that were important.

Comment by Raemon on Policy for LLM Writing on LessWrong · 2025-03-27T21:52:54.249Z · LW · GW

Or: when the current policy stops making sense, we can figure out a new policy.

In particular, when the current policy stops making sense, AI moderation tools may also be more powerful and can enable a wider range of policies.

Comment by Raemon on Policy for LLM Writing on LessWrong · 2025-03-27T01:37:49.205Z · LW · GW

I mean, the sanctions are ‘if we think your content looks LLM generated, we’ll reject it and/or give a warning and/or eventually delete or ban.’ We do this for several users a day.

That may get harder someday but it’s certainly not unenforceable now.

Comment by Raemon on Policy for LLM Writing on LessWrong · 2025-03-26T18:12:03.373Z · LW · GW

I agree it'll get harder to validate, but I think having something like this policy is, like, a prerequisite (or at least helpful grounding) for the mindset change.

Comment by Raemon on AI for AI safety · 2025-03-25T18:59:17.782Z · LW · GW

Curated. I think figuring out whether and how we can apply AI to AI safety is one of the most important questions, and I like this post for exploring this through many more different angles than we'd historically seen.

A thing I both like and dislike about this post is that it's more focused on laying out the questions than giving answers. This makes it easier for me the post to "help me think it through myself" (rather than just telling me a "we should do X" style answer).

But it lays out a dizzying enough array of different concerns that I found it sort of hard to translate this into "okay what actually should I actually think about next?". I'd have found it helpful if the post ended with some kind of recap of "here's the areas that seem most important to be tracking, for me."

Comment by Raemon on Policy for LLM Writing on LessWrong · 2025-03-25T00:04:01.760Z · LW · GW

(note: This is Raemon's random take rather than considered Team Consensus)

Part of the question here is "what sort of engine is overall maintainable, from a moderation perspective?".

LLMs make it easy for tons of people to be submitting content to LessWrong without really checking whether it's true and relevant. It's not enough for a given piece to be true. It needs to be reliably true, with low cost to moderator attention.

Right now, basically LLMs don't produce anywhere near good enough content. So, presently, letting people submit AI generated content without adding significant additional value is a recipe for LW admins to spend a bunch of extra time each day deciding whether to moderate a bunch of content that we're realistically going to say "no" to.

(Some of the content is ~on par with the bottom 25% of LW content, but the bottom 25% of LW content is honestly below the quality bar we prefer the site to be at, and the reason we let those comments/posts in at all is because it's too expensive to really check if it's reasonable, and when we're unsure, we sometimes to default to "let it in, and let the automatic rate limits handle it". But, the automated rate limits would not be sufficient to handle an influx of LLM slop)

But, even when we imagine content that should theoretically be "just over the bar", there are secondorder effects of LW being a site with a potentially large amount of AI content that nobody is really sure if it's accurate or whether anyone endorses it and whether we are entering into some slow rolling epistemic disaster.

So, my guess for the bar for "how good quality do we need to be talking about for AI content to be net-positive" is more at least top-50% and maybe top-25% of baseline LW users. And when we get to that point probably the world looks pretty different.

Comment by Raemon on Recent AI model progress feels mostly like bullshit · 2025-03-24T22:29:57.143Z · LW · GW

My lived experience is that AI-assisted-coding hasn't actually improved my workflow much since o1-preview, although other people I know have reported differently.

Comment by Raemon on Raemon's Shortform · 2025-03-24T18:47:18.975Z · LW · GW

It seems like my workshops would generally work better if they were spaced out over 3 Saturdays, instead of crammed into 2.5 days in one weekend.

This would give people more time to try applying the skills in their day to day, and see what strategic problems they actually run into each week. Then on each Saturday, they could spend some time reviewing last week, thinking about what they want to get out of this workshop day, and then making a plan for next week.

My main hesitation is I kind of expect people to flake more when it's spread out over 3 weeks, or for it to be harder to find 3 Saturdays in a row that work as opposed to 1 full weekend in a row.

I also think there is a bit of a special workshop container that you get when there's 3 days in a row, and it's a bit sad to lose that container.

But, two ideas I've considered so far are:

Charge more, and people get a partial refund if they attend all three sessions.
Have there be 4 days instead of 3, and design it such that if people miss a day it's not that big a deal.

I've also been thinking about a more immersive-program experience, where for 3-4 weeks, people are living/working onsite at Lighthaven, mostly working on some ambitious-but-confusing project, but with periodic lessons and checkins about practical metastrategy. (This is basically a different product than "the current workshop", and much higher commitment, but it's closer to what I originally wanted with Feedbackloop-first Rationality, and is what I most expect to actually work)

I'm curious to hear what people think about these.

Comment by Raemon on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-24T00:43:03.200Z · LW · GW

Also, have you tracked the previous discussion on Old Scott Alexander and LessWrong about generally "mysterious straight lines" being a surprisingly common phenomenon in economics. i.e. On an old AI post Oli noted:

This is one of my major go-to examples of this really weird linear phenomenon:
150 years of a completely straight line! There were two world wars in there, the development of artificial fertilizer, the broad industrialization of society, the invention of the car. And all throughout the line just carries one, with no significant perturbations.

This doesn't mean we should automatically take new proposed Straight Line Phenomena at face value, I don't actually know if this is more like "pretty common actually" or "there are a few notable times it was true that are drawing undue attention." But I'm at least not like "this is a never-before-seen anomaly")

Comment by Raemon on Elizabeth's Shortform · 2025-03-24T00:37:53.823Z · LW · GW

I think it's also "My Little Pony Fanfics are more cringe than Harry Potter fanfics, and there is something about the combo of My Little Pony and AIs taking over the world that is extra cringe."

Comment by Raemon on Arguments about fast takeoff · 2025-03-24T00:35:20.394Z · LW · GW

I'm here from the future trying to decide how much to believe in and how common are Gods of Straight Lines, and curious if you could say more arguing about this.

Comment by Raemon on Reframing AI Safety as a Neverending Institutional Challenge · 2025-03-23T03:07:00.602Z · LW · GW

I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.

It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that'd lead to RSI+takeoff)

Comment by Raemon on Raemon's Shortform · 2025-03-22T21:29:04.850Z · LW · GW

Metastrategy = Cultivating good "luck surface area"?

Metastrategy: being good at looking at an arbitrary situation/problem, and figure out what your goals are, and what strategies/plans/tactics to employ in pursuit of those goals.

Luck Surface area: exposing yourself to a lot of situations where you are more likely to get valuable things in a not-very-predictable way. Being "good at cultivating luck surface area" means going to events/talking-to-people/consuming information that are more likely to give you random opportunities / new ways of thinking / new partners.

At one of my metastrategy workshops, while I talked with a participant about what actions had been most valuable the previous year, many of the things were like "we published a blogpost, or went to an event, and then kinda randomly found people who helped us a bunch, i.e. gave us money or we ended up hiring them."

This led me to utter the sentence "yeah, okay I grudgingly admit that 'increasing your luck surface area' is more important than being good at 'metastrategy'", and I improvised a session on "where did a lot of your good luck come from this year, and how could you capitalize more on that?"

But, thinking later, I think maybe actually "being good at metastrategy" and "being good at managing luck surface area" are maybe basically the same thing?

That is:

If already know how to handle a given situation, you're basically using "strategy", not "metastrategy."

If you don't already know, what you wanna do is strategically direct your thoughts in novel directions (maybe by doing crazy brainstorming, maybe by doing structured "think about the problem in a bunch of different ways that seem likely to help", maybe by taking a shower and letting your mind wander, maybe by talking to people who will likely have good advice about your problem.

This is basically "exposing luck surface area" for your cognition.

Thinking about it more and chatting with a friend: Managing Luck Surface Area seems like a subset of metastrategy but not the whole thing.

One counter example they gave was "reading a book that will basically tell you a crucial fact, or teach you a specific skill", where you basically know it will work and that it's a necessary prerequisite for solving your problem.

But it does seem like the "luck surface area"-ish portion of metastrategy is usually more important for most people/situations, esp. if you're going to find plans that are 10-100x better than your current plan. (Although, once you locate a hypothesis "get a ton of domain-expertise in a given field" might be the right next step. That's sort of blurring back into "regular strategy" rather than "metastrategy", although the line is fuzzy)

Comment by Raemon on Any mistakes in my understanding of Transformers? · 2025-03-22T19:10:09.701Z · LW · GW

Pedagogic feedback: each diagram is much longer than a page, it's harder to fit the whole thing in my head at once.

Comment by Raemon on Going Nova · 2025-03-19T19:09:48.437Z · LW · GW

It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background.

Comment by Raemon on Levels of Friction · 2025-03-18T02:39:57.029Z · LW · GW

Curated. This concept seems like an important building block for designing incentive structures / societies, and this seems like a good comprehensive reference post for the concept.

Comment by Raemon on One pager · 2025-03-18T01:27:00.095Z · LW · GW

Note: it looks like you probably want this to be a markdown file. You can go to https://www.lesswrong.com/account, with the "site customizations" section, and click "activate Markdown" to enable the markdown editor.

Comment by Raemon on Paper: Field-building and the epistemic culture of AI safety · 2025-03-17T16:57:50.754Z · LW · GW

Fyi I think it’s time to do minor formatting adjustments to make papers/abstracts easier to read on LW

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-16T22:25:33.867Z · LW · GW

I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is "careful conceptual thinking might be required rather than pure naive empiricism (because we won't be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this" and the bailey is "extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed".

Yeah I agree that was happening somewhat. The connecting dots here are "in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably."

I think my actual belief is "the Motte is high likelihood true, the Bailey is... medium-ish likelihood true, but, like, it's a distribution, there's not a clear dividing line between them"

I also think the pause can be "well, we're running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but, we can't run them that long or fast, they help speed things up and make what'd normally be a 10-30 year pause into a 3-10 year pause, but also the world would be going crazy left to it's own devices, and the sort of global institutional changes necessary are still similarly-outside-of-overton window as a 20 year global moratorium and the "race with China" rhetoric is still bad.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-16T19:21:19.353Z · LW · GW

Thanks for laying this out thus far. I'mma reply but understand if you wanna leave the convo here . I would be interested in more effortpost/dialogue about your thoughts here.

Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on "extreme philosophical competence".)

This makes sense as a crux for the claim "we need philosophical competence to align unboundedly intelligent superintelligences." But, it doesn't make sense for the claim "we need philosophical competence to align general, openended intelligence." I suppose my OP didn't really distinguish these claims and there were a few interpretations of how the arguments fit together. I was more saying the second (although to be fair I'm not sure I was actually distinguishing them well in my head until now)

It doesn't make sense for "we just' need to be able to hand off to an AI which is seriously aligned" to be a crux for the second. A thing can't be a crux for itself.

I notice my "other-guy-feels-like-they're-missing-the-point" -> "check if I'm not listening well, or if something is structurally wrong with the convo" alarm is firing, so maybe I do want to ask for one last clarification on "did you feel like you understood this the first time? Does it feel like I'm missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it's because I'm being dense about something?)

Takes on your proposal

Meanwhile, here's some takes based on my current understanding of your proposal.

These bits:

We need to ensure that our countermeasures aren't just shifting from a type of misalignment we can detect to a type we can't. Qualitatively analyzing the countermeasures and our tests should help here.

...is a bit I think is philosophical-competence bottlenecked. And this bit:

"Actually, we didn't have any methods available to try which could end up with a model that (always) isn't egregiously misaligned. So, even if you can iterate a bunch, you'll just either find that nothing works or you'll just fool yourself."

...is a mix of "philosophically bottlenecked" and "rationality bottlenecked." (i.e. you both have to be capable of reasoning about whether you've found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you're deploying that reasoning accurately)

I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough.

(I think at least some people on the alignment science or interpretability teams might be. I bet against the median such teammembers being able to navigate it. And ultimately, what matters is "does Anthropic leadership go forward with the next training run", so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people. And Anthropic leadership already seem to basically be ignoring arguments of this type, and I don't actually expect to get the sort of empirical clarity that (it seems like) they'd need to update before it's too late.)

Second, we can study how generalization on this sort of thing works in general

I think this counts as the sort of empiricism I'm somewhat optimisic about in my post. i.e. if you are able to find experiments that actually give you evidence about deeper laws, that let you then make predictions about new Actually Uncertain questions of generalization that you then run more experiments on... that's the sort of thing I feel optimistic about. (Depending on the details, of course)

But, you still need technical philosophical competence to know if you're asking the right questions about generalization, and to know when the results actually imply that the next scale-up is safe.

Comment by Raemon on Paper: Field-building and the epistemic culture of AI safety · 2025-03-15T18:19:04.561Z · LW · GW

FYI I found this intro fairly hard to read – partly due to generally large blocks of text (see: Abstracts should be either Actually Short™, or broken into paragraphs) and also because it just... doesn't actually really say what the main point is, AFAICT. (It describes a bunch of stuff you do, but I had trouble finding the actual main takeaway, or primary sorts of new information I might get by reading it)

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:51:29.177Z · LW · GW

I don't really see why this is a crux. I'm currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn't really change my strategic orientation. Maybe you're focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.

I think you kinda convinced me here this reasoning isn't (as stated) very persuasive.

I think my reasoning had some additional steps like:

when I'm 15% on 'alignment might be philosophically hard', I still expect to maybe learn more and update to 90%+, and it seems better to pursue strategies that don't actively throw that world under the bus. (and, while I don't fully understand the Realpolitik, it seems to me that Anthropic could totally be pursuing strategies that achieve a lot of it's goals without Policy Comms that IMO actively torch the "long pause" worlds)
you are probably right I was oriented around "getting to like 5% risk" than reducing risk on the margin.
I'm probably partly just not really visualizing what it'd be like to be a 15%-er and bringing some bias in.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:47:50.409Z · LW · GW

I'm pretty skeptical of the "extreme philosophical competence" perspective. This is basically because we "just" need to be able to hand off to an AI which is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for use to check).

The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.

Looking over the comments of the first joshc post, it seems like that's also basically asserted it wasn't necessary by fiat. And, the people who actively believe in "alignment is philosophically loaded" showed up to complain that this ignored the heart of the problem.

My current summary of the arguments (which I put ~60% on, and I think Eliezer/Oli/Wentworth treat much more confidently and maybe believe a stronger version of) are something like:

Anything general enough to really tackle openended, difficult-to-evaluate plans, will basically need to operate in a goal directed way in order to do that. (i.e. What's Up With Confusingly Pervasive Goal Directedness?)
The goal-directedness means it's very likely to be self/situationally aware, and the requisite intelligence to solve these sorts of problems means even if it's not full blown anti-aligned, it's at least probably going to want to try to build more option value for itself.
The fact that you can't evaluate the results means it has a lot of room to give you answers that help preserve it's goals and bootstrap (at least on the margin), even if it's not massively smart enough to one-shot escape. And you can't solve that problem with Control (i.e. The Case Against AI Control Research).
You can maybe have interpretability tools that check for schemingness (if it's the first generation of generally capable agent and isn't too smart yet, maybe you've done a good job preserving Chain of Thought as a reasonably faithful representation, for now). But, you'll then just see "yep, the agent is unaligned", and not actually be able to fix it.

I think my current model of you (Ryan) is like:

"Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from "don't think about scheming ever" to "secretly think about scheming" to avoid getting caught, and that probably just won't work?"

(or, in example of the second joshC post, they need to learn to be really reliably good at truth-tracking-patterns and articulating their reasoning, after internalizing that for thousands of reps, an AI is just gonna have a hard time jumping to reasoning that isn't truth tracking).

I don't have a clear model of how you respond to point #4 – that we'll just reliably find them to be scheming if we succeed at the interpretability steps, and not have a good way of dealing with it. (Maybe you just don't think this is as overwhelmingly likely?)

Interested in whatever Real You's cruxes are, 1-2 steps removed.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-15T02:25:46.375Z · LW · GW

Thanks. I'll probably reply to different parts in different threads.

For the first bit:

My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.

The rough number you give are helpful. I'm not 100% sure I see the dots you're intending to connect with "leadership thinks 1/5-ryan-misalignment and 2x-ryan-totalitariansm" / "rest of alignment science team closer to ryan" -> "this explains a lot."

Is this just the obvious "whelp, leadership isn't bought into this risk model and call most of the shots, but in conversations with several employees that engage more with misalignment?". Or was there a more specific dynamic you thought it explained?

Comment by Raemon on AI Tools for Existential Security · 2025-03-14T23:57:41.300Z · LW · GW

Do you have existing ones you recommend?

I'd been working on a keylogger / screenshot-parser that's optimized for a) playing nicely will LLMs while b) being unopinionated about what other tools you plug it into. (in my search for existing tools, I didn't find keyloggers that actually did the main thing I wanted, and the existing LLM-tools that did similar things were walled-garden-ecosystems that didn't give me much flexibility on what I did with the data)

Comment by Raemon on AI4Science: The Hidden Power of Neural Networks in Scientific Discovery · 2025-03-14T22:11:29.528Z · LW · GW

Minor note but I found the opening section hard to read. See: Abstracts should be either Actually Short™, or broken into paragraphs

Comment by Raemon on johnswentworth's Shortform · 2025-03-14T21:26:10.810Z · LW · GW

Were you by any chance writing in Cursor? I think they recently changed the UI such that it's easier to end up in "agent mode" where it sometimes randomly does stuff.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T20:23:58.688Z · LW · GW

I am kinda intrigued by how controversial this post seems (based on seeing the karma creep upwards and then back down over the past day). I am curious if the downvoters tend more like:

Anti-Anthropic-ish folk who think the post is way too charitable/soft on Anthropic
Pro-Anthropic-ish folk who think the post doesn't make very good/worthwhile arguments against Anthropic
"Alignment-is-real-hard" folks who think this post doesn't represent the arguments for that very well.
"other?"

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T20:05:09.223Z · LW · GW

I agree with this (and think it's good to periodically say all of this straightforwardly).

I don't know that it'll be particularly worth your time, but, the thing I was hoping for this post was to ratchet the conversation-re-anthropic forward in, like, "doublecrux-weighted-concreteness." (i.e. your arguments here are reasonably crux-and-concrete, but don't seem to be engaging much with the arguments in this post that seemed more novel and representative of where anthropic employees tend to be coming from, instead just repeated AFAICT your cached arguments against Anthropic)

I don't have much hope of directly persuading Dario, but I feel some hope of persuading both current and future-prospective employees who aren't starting from the same prior of "alignment is hard enough that this plan is just crazy", and for that to have useful flow-through effects.

My experience talking at least with Zac and Drake has been "these are people with real models, who share many-but-not-all-MIRI-ish assumptions but don't intuitively buy that the Anthropic's downsides are high, and would respond to arguments that were doing more to bridge perspectives." (I'm hoping they end up writing comments here outlining more of their perspective/cruxes, which they'd expressed interest in in the past, although I ended up shipping the post quickly without trying to line up everything)

I don't have a strong belief that contributing to that conversation is a better use of your time than whatever else you're doing, but it seemed sad to me for the conversation to not at least be attempted.

(I do also plan to write 1-2 posts that are more focused on "here's where Anthropic/Dario have done things that seem actively bad to me and IMO are damning unless accounted for," that are less "attempt to maintain some kind of discussion-bridge", but, it seemed better to me to start with this one)

Comment by Raemon on Elizabeth's Shortform · 2025-03-14T19:04:27.981Z · LW · GW

Yeah I was staring at the poll and went "oh no." They aren't often actually used this way so it's not obviously right to special-case it, although maybe we do polls enough that we generally should present react in "first->last posted" rather than sorting by number

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T17:59:47.549Z · LW · GW

I don't particularly disagree with the first half, but your second sentence isn't really a crux for me for the first part.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T05:19:10.606Z · LW · GW

I think (moderately likely, though not super confident) it makes more sense to model Dario as:

"a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn't believe alignment is that hard)"

than as "a generic CEO who's just generally following incentives and spinning narrative post-hoc rationalizations."

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T05:06:06.407Z · LW · GW

I think... agree denotationally and (lean towards) disagreeing connotationally? (like, seems like this is implying "and because he doesn't seem like he obviously has coherent views on alignment-in-particular, it's not worth arguing the object level?")

(to be clear, I don't super expect this post to affect Dario's decisionmaking models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking them is that a lot of the LW-flavored arguments have often missed their cruxes)

Comment by Raemon on Trojan Sky · 2025-03-14T04:11:27.277Z · LW · GW

Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even "people who aren't put off by AI" - it would never show up on my youtube.

For frame of reference, do regular movie trailers normally show up in your youtube? This video seemed relatively "mainstream"-vibing to me, although somewhat limited by the medium.

Comment by Raemon on Anthropic, and taking "technical philosophy" more seriously · 2025-03-14T01:40:55.360Z · LW · GW

I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).

Seems like "the AIs are good enough at persuasion to persuade governments and someone is deploying them for that" is right when you need to be very high confidence they're obedient (and, don't have some kind of agenda). If they can persuade governments, they can also persuade you of things.

I also think it gets into a point where I'd sure feel way more comfortable if we had more satisfying answers to "where exactly are we supposed to draw the line between 'informing' and 'manipulating'" (I'm not 100% sure what you're imagining here tho)

User info

Posts

Comments

Takes on your proposal