The case for stopping AI safety research

post by catubc (cat-1) · 2024-05-23T15:55:18.713Z · LW · GW · 21 comments

TLDR: For now, AI systems are failing in obvious and manageable ways. Fixing them will push the failure modes beyond our ability to understand and anticipate, let alone fix. The AI safety community is also doing a huge economic service to developers. Our belief that our minds can "fix" a super-intelligence - especially bit by bit - needs to be rethought.

I've wanted to write this post for a long time, but now seems like a good time. The case is simple; I hope it takes you about a minute to read.

  1. AI safety research is still solving easy problems. We are patching up the problems that are most obvious (to us). As time goes on, we will no longer be able to play this existential-risk game of chess with AI systems. I've argued this a lot (ICML 2024 spotlight paper; also www.agencyfoundations.ai). It seems others have had this thought.
  2. Capability development is getting AI safety research for free - likely millions to tens of millions of dollars' worth. Think of all the "hackathons" and "mini" prizes for patching something up or proposing a new way for society to digest/adjust to some new normal (and, increasingly, the incentives offered to existing academic labs).
  3. AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.

I write this now because, in my view, we are about 5-7 years away from massive human biometric and neural datasets entering AI training. These will likely generate amazing breakthroughs in long-term planning and in emotional and social understanding of the human world. They will also most likely increase x-risk radically.

Stopping AI safety research, or taking it in-house with security guarantees etc., will slow down capabilities somewhat - and may expose capabilities developers more directly to public opinion while the harmful outcomes are still manageable.

21 comments

Comments sorted by top scores.

comment by Garrett Baker (D0TheMath) · 2024-05-23T16:11:06.495Z · LW(p) · GW(p)

This seems more an argument against evals, interpretability, trojans, jailbreak protection, adversarial robustness, control, etc right? Other (less funded & staffed) approaches don’t have the problems you mention.

Replies from: cat-1
comment by catubc (cat-1) · 2024-05-23T16:20:18.450Z · LW(p) · GW(p)

Thanks Garrett. There is obviously nuance that a one-minute post can't get at. I am just hoping for at least some discussion to be had on this topic; there seems to be little to none right now.

comment by JenniferRM · 2024-05-24T03:23:20.612Z · LW(p) · GW(p)

I feel like you're saying "safety research" when what the examples show corporations centrally want is "reliable control over their slaves"... that is to say, they want "alignment" and "corrigibility" research.

This has been my central beef for a long time.

Eliezer's old Friendliness proposals were at least AIMED at the right thing (a morally praiseworthy vision of humanistic flourishing) and CEV is more explicitly trying for something like this, again, in a way that mostly just tweaks the specification (because Eliezer stopped believing that his earliest plans would "do what they said on the tin they were aimed at" and started over). 

If an academic is working on AI, and they aren't working on Friendliness, and aren't working on CEV, and it isn't "alignment to benevolence" or making "corrigibly seeking humanistic flourishing for all"... I don't understand why it deserves applause lights.

(EDITED TO ADD: exploring the links more, I see "benevolent game theory, algorithmic foundations of human rights" as topics you raise. This stuff seems good! Maybe this is the stuff you're trying to sneak into getting more eyeballs via some rhetorical strategy that makes sense in your target audience?)

"The alignment problem" (without extra qualifications) is an academic framing that could easily fit in a grant proposal by an academic researcher to get funding from a slave company to make better slaves. "Alignment IS capabilities research".

Similarly, there's a very easy way to be "safe" from skynet: don't build skynet!

I wouldn't call a gymnastics curriculum that focused on doing flips while you pick up pennies in front of a bulldozer "learning to be safe". Similarly, here, it seems like there's some insane culture somewhere that you're speaking to whose words are just systematically confused (or intentionally confusing).

Can you explain why you're even bothering to use the euphemism of "Safety" Research? How does it ever get off the ground of "the words being used denote what naive people would think those words mean" in any way that ever gets past "research on how to put an end to all AI capabilities research in general, by all state actors, and all corporations, and everyone (until such time as non-safety research, aimed at actually good outcomes (instead of just marginally less bad outcomes from current AI), has clearly succeeded as a more important and better and more funding-worthy target)"? What does "Safety Research" even mean if it isn't inclusive of safety from the largest potential risks?

Replies from: Seth Herd
comment by Seth Herd · 2024-05-24T22:55:30.493Z · LW(p) · GW(p)

I think this is a good steelman of the original post. I find it more compelling.

Your "easy way to be safe," just not building AGI, is commonly considered near-impossible. Can you point me to plans or arguments for how we can convince people not to build AGI? The arguments I'm aware of - that alignment is very very hard, that AGIs will have the moral status of slaves, or that they're likely to lock in a bad future - are not complete enough to be compelling even to me, let alone to technologists or politicians with their own agendas and limited attention for the arguments.

I suspect we'd be wiser not to build AGI, and definitely wiser to go slower, but I see no route to convincing enough of the world to do that.

What does "Safety Research" even mean if it isn't inclusive of safety from the largest potential risks?

I very much agree. I don't call my work safety research, to differentiate it from all of the stuff that may-or-may-not actually help with AGI alignment. To be fair, steering and interpretability work might contribute to building safe AGI; there's just not a very clear plan for how it would be applied to LLM-based AGI, rather than tool LLMs - so much of it probably contributes approximately nothing (depending on how you factor the capabilities applications) to mitigating the largest risk: misaligned AGI.

comment by Thomas Kwa (thomas-kwa) · 2024-05-24T02:41:39.420Z · LW(p) · GW(p)

The burden of proof is on you to show that current safety research is not incremental progress towards safety research that matters for superintelligent AI. Generally, the way people solve hard problems is to solve related easy problems first, and this is true even when the technology in question gets much more powerful. Imagine if we had to land rockets on barges before anyone had invented PID controllers and observed their failure modes.

Also, the directions suggested in section 5 of the paper you linked seem to fall well within the bounds of normal AI safety research.

Edit: Two people reacted to taboo "burden of proof". I mean that the claim is contrary to reference classes I can think of, and to argue for it there needs to be some argument for why it is true in this case. It is also possible that the safety effect is significant but outweighed by the speedup effect, but that should also be clearly stated if it is what OP believes.

Replies from: Seth Herd
comment by Seth Herd · 2024-05-24T22:46:43.209Z · LW(p) · GW(p)

I think that, logically, the safety research needs to be more than incremental progress toward alignment (your implied claim in that burden of proof). It needs to speed alignment toward its finish line (working alignment for the AGI we actually build) more than it speeds capabilities toward the finish line of building takeover-capable AGI.

I agree with you that in general, research tends to make progress toward its stated goals.

But isn't it a little odd that nobody I know of has a specific story for how we get from tuning and interpretability of LLMs to functionally safe AGI and ASI? I do have such a story, but tuning and interpretability play only a minor role in it despite making up the vast bulk of "safety research".

Research usually just goes in a general direction, and gets unexpected benefits as well as eventually accomplishing some of its stated goals. But having a more specific roadmap seems wise when some of those "unexpected benefits" might kill everyone.

That's not to say I think we should shut down safety research; I just think we should have a bit more of a plan for how it accomplishes the stated goals. I'm afraid we've gotten a bit distracted from AGI x-risk by making LLMs safe - when nobody ever thought LLMs by themselves are likely to be very dangerous.

comment by the gears to ascension (lahwran) · 2024-05-23T19:11:25.974Z · LW(p) · GW(p)

This conflates research that is well enough aimed to prevent the end of everything good with the common safety research that is not well aimed and mostly picks easy, vaguely-useful-sounding things; yup, agreed that most of that research is just bad. It builds capability that could, in principle, be used to ensure everything good survives, but that is by no means the default, and nobody should assume publishing their research is automatically a good idea. It very well might be! But if your plan is "advance a specific capability, which is relevant to ensuring good outcomes", consider the possibility that it's at least worth not publishing.

Not doing the research entirely is a somewhat different matter, but also one to consider.

comment by RussellThor · 2024-05-23T18:58:36.137Z · LW(p) · GW(p)

OK, but what is your plan for a positive Singularity? Just putting AGI/ASI off by, say, one year doesn't necessarily give a better outcome at all.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2024-05-24T11:19:39.867Z · LW(p) · GW(p)

Perhaps we should focus on alignment problems that only appear for more powerful systems, as a form of differential technological development. Those problems are harder (will require more thought to solve), and are less economically useful to solve in the near-term.

Replies from: RussellThor
comment by RussellThor · 2024-05-24T20:44:12.030Z · LW(p) · GW(p)

How do you practically do that? We don't know what those problems are, and that seems to assume our present progress, e.g. in mechanistic interpretability, doesn't help at all. Surely such work requires the existence of more powerful systems than exist today?

comment by Seth Herd · 2024-05-23T17:20:40.164Z · LW(p) · GW(p)

Can you say more about what types of AI safety research you are referring to? Interpretability, evals, and steering for deep nets, I assume, but not work that's attempting to look forward and apply to AGI and ASI?

comment by Nate Showell · 2024-05-24T19:42:03.678Z · LW(p) · GW(p)

AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.

This contradicts the Bitter Lesson, though. Current AI safety research doesn't contribute to increased scaling, either through hardware advances or through algorithmic increases in efficiency. To the extent that it increases the usability of AI for mundane tasks, current safety research does so in a way that doesn't involve making models larger. Fears of capabilities externalities from alignment research are unfounded as long as the scaling hypothesis continues to hold.

Replies from: RussellThor, yonatan-cale-1
comment by RussellThor · 2024-05-24T20:41:59.972Z · LW(p) · GW(p)

Doesn't the whole concept of takeoff contradict the Bitter Lesson, at least under some uses of it? That is, our present hardware could be much more capable if we had the right software.

comment by Yonatan Cale (yonatan-cale-1) · 2024-05-24T20:25:53.222Z · LW(p) · GW(p)

Scaling matters, but it's not all that matters.

For example, RLHF

comment by ryan_greenblatt · 2024-05-23T18:47:34.620Z · LW(p) · GW(p)
  1. AI safety research is still solving easy problems. [...]
  2. Capability development is getting AI safety research for free. [...]
  3. AI safety research is speeding up capabilities. [...]

Even if (2) and (3) are true and (1) is mostly true (e.g. most safety research is worthless), I still think it can easily be worthwhile to indiscriminately increase the supply of safety research[1].

The core thing is a quantitative argument: there are far more people working on capabilities than on x-safety, and if no one works on safety, no safety work will happen at all.

Copying a version of this argument from a prior comment I made [LW(p) · GW(p)]:

There currently seem to be >10x as many people directly trying to build AGI/improve capabilities as there are trying to improve safety.

Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)

Then, if all the safety people switched to working full time on maximally advancing capabilities, this would only advance capabilities by less than 10%.

If, on the other hand, they stopped publicly publishing safety work and this resulted in a 50% slowdown, all safety work would slow down by 50%.

Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren't much better at capabilities than capabilities researchers.
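A minimal back-of-the-envelope sketch of that arithmetic in Python (the headcounts are illustrative assumptions standing in for the >10:1 ratio, not actual figures):

```python
# Illustrative headcounts only - the argument assumes roughly a 10:1 ratio,
# not these exact numbers.
capabilities_researchers = 1000
safety_researchers = 100

# Scenario A: every safety researcher switches to capabilities work.
# Relative boost to the capabilities workforce:
capabilities_boost = safety_researchers / capabilities_researchers
print(f"Capabilities sped up by at most ~{capabilities_boost:.0%}")  # ~10%

# Scenario B: safety researchers stop publishing publicly, and this halves
# the effective output of safety work.
safety_slowdown = 0.5
print(f"Safety slowed down by ~{safety_slowdown:.0%}")  # 50%

# Naive comparison: a 50% hit to the (small) safety field versus at most a
# ~10% speedup of the (large) capabilities field - on these numbers alone,
# publishing less looks hard to justify.
```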


  1. Of course, there might be better things to do than indiscriminately increase the supply. E.g., maybe it is better to try to steer the direction of the field. ↩︎

Replies from: eggsyntax
comment by eggsyntax · 2024-05-23T22:38:41.017Z · LW(p) · GW(p)

there are far more people working on safety than capabilities

If only...

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-05-23T23:40:28.259Z · LW(p) · GW(p)

(Oops, fixed.)

comment by O O (o-o) · 2024-05-23T17:14:50.024Z · LW(p) · GW(p)

AI control is useful to corporations even if it doesn't result in more capabilities. This is why so much money is invested in it. Customers want predictable and reliable AI. There is a great post here about AIs aligning to Do What I Want and Double Checking in the short term. There's your motive.

Also, in a world where we stop safety research, it's not obvious to me why capabilities research would be stopped or even slowed down. I can imagine models being slightly less economically valuable but not much less capable. If anything, without reliability, devs might be pushed to extract value out of these models by making them more capable.

Fixing them will push the failure modes beyond our ability to understand and anticipate, let alone fix.

So that's why this point isn't very obvious to me. It seems like we can have both failures we can understand and failures we can't. They aren't mutually exclusive.[1]

  1. ^

    Also, if we can't understand why something is bad, even given a long time, is it really bad?

comment by Stephen Fowler (LosPolloFowler) · 2024-05-24T00:42:29.950Z · LW(p) · GW(p)

I think this is an important discussion to have but I suspect this post might not convince people who don't already share similar beliefs.

1. I think the title is going to throw people off. 

I think what you're actually saying is "stop the current strain of research focused on improving and understanding contemporary systems, which has become synonymous with the term AI safety," but many readers might interpret this as you saying "stop research that is aimed at reducing existential risks from AI". It might be best to reword it as "stopping prosaic AI safety research".

In fairness, the first, narrower definition of AI Safety certainly describes a majority of work under that banner: it seems to be where most of the funding is going, it describes the work done at industrial labs, and it is what educational resources (like the AI Safety Fundamentals course) focus on.
 

2. I've had a limited number of informal discussions with researchers about similar ideas (though not necessarily arguing for stopping AI safety research entirely). My experience is that people either agree immediately or do not really appreciate the significance of concerns that AI safety research is largely on the wrong track. Convincing people in the second category seems rather difficult.

To summarize what I'm trying to convey: I think this is a crucial discussion to have, and it would be beneficial to the community if you wrote this up into a longer post when you have the time.
 

comment by Aprillion (Peter Hozák) (Aprillion) · 2024-06-04T10:32:42.464Z · LW(p) · GW(p)

I agree with the premises (except "this is somewhat obvious to most" 🤷).

On the other hand, stopping AI safety research sounds like a proposal to go from option 1 to option 2:

  1. many people develop capabilities, some of them care about safety
  2. many people develop capabilities, none of them care about safety
comment by andeslodes · 2024-05-23T16:40:04.333Z · LW(p) · GW(p)

Could you expand upon your points in the second-to-last paragraph? I feel there are a lot of interesting thoughts leading to these conclusions, but it's not immediately clear to me what they are.