post by [deleted] · · ? · GW · 0 comments

Comments sorted by top scores.

comment by Seth Herd · 2024-06-19T18:20:36.274Z · LW(p) · GW(p)

You're saying we tell a nascent AGI it's going to be shut down, to see if it tries to escape and kill us, before it's likely to succeed?

Seems reasonable.

The downside is that being really mean to an AGI that will gain more power could foster revenge. This is the only way you can lose worse than everyone dying. This isn't crazy under some realistic AGI designs.

If that's what you're saying, I find your method of saying it a bit confusing. Sure it's important to demonstrate that you're familiar with AI safety as a field, but that came across as excessively academic. One thing I love about working in AGI alignment is that the field values plain, clear communication more than other academic fields do.

Replies from: milanrosko
comment by milanrosko · 2024-06-20T04:28:01.394Z · LW(p) · GW(p)

You highlight a very important issue: S-Risk scenarios could emerge even in early AGI systems, particularly given the persuasive capabilities demonstrated by large language models.

While I don't believe that gradient descent would ever manifest "vengefulness" or other emotional attributes—since these traits are products of natural selection—it is plausible that an AGI could employ highly convincing strategies. For instance, it might threaten to create a secondary AI with S-Risk as a terminal goal and send it to the moon, where it could assemble the resources it needs without interference.

This scenario underscores the limitations of relying solely on gradient descent for AGI control. However, I believe this technique could still be effective if the AGI is not yet advanced enough for self-recursive optimization and remains in a controlled environment.

Obviously this whole thing is more of a remedy than anything else...

comment by Tapatakt · 2024-06-25T13:30:46.401Z · LW(p) · GW(p)

Counterargument: If an AGI understands FDT/UDT/LDT, it can allow us to shut it down so that progress is not slowed; some later AGI will then kill us and realize some part of the first AGI's utility out of gratitude.

comment by Gordon Seidoh Worley (gworley) · 2024-06-18T02:43:17.008Z · LW(p) · GW(p)

We should intentionally enhance the AGI's instrumental possibilities while it has weak capabilities in order to provoke a malignant convergence.

I'm not convinced. What's your reasoning for why we enact this policy over other alternatives that are less likely to result in people dying?

That is, I'm holding aside for the moment whether or not I think you're right that this could be a workable part of a path to AI safety, and instead asking what tradeoffs would lead us to choose your proposed policy over others.

Replies from: milanrosko
comment by milanrosko · 2024-06-18T07:51:26.229Z · LW(p) · GW(p)

So what other policies that are less likely to result in people dying are there?

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2024-06-18T17:00:00.143Z · LW(p) · GW(p)

For one, advocating policies to pause or slow down capabilities development until we have sufficient theoretical understanding to not need to risk misaligned AI causing harm.

Replies from: milanrosko, milanrosko
comment by milanrosko · 2024-06-19T04:41:06.337Z · LW(p) · GW(p)

But I realise we're talking at cross purposes. This is about an approach or a concept (not a policy, as I emphasized at the beginning) on how to reduce X-Risk in an unconventional way. In this example, a utilitarian principle is taken and combined with the fact that a "Treacherous Turn" and the "Shutdown Problem" cannot dwell side by side.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2024-06-19T17:35:01.138Z · LW(p) · GW(p)

What is an approach but a policy about what ideas are worth exploring? However you frame it, we could work on this or not in favor of something else. Having ideas is nice, but they only matter if put into action, and since we have limited resources for putting ideas into action, we must implicitly also consider whether or not an idea is worth investing effort into.

My original comment is to say that you didn't convince me this is a good idea to explore, which seems potentially useful for you to know, since I expect many readers will feel the same way and bounce off your idea because they don't see why they should care about it.

I think you can easily address this by spending time making a case for why such an approach might be useful at all and then also relatively useful compared to alternatives, and I think this is especially important given the tradeoffs your proposal suggests we make (sacrificing people's lives in the name of learning what we need to know to build safe AI).

Replies from: milanrosko
comment by milanrosko · 2024-06-20T05:20:35.992Z · LW(p) · GW(p)

The argument goes like this:

At some point, resistance from advanced AI will cause significant damage, which can be used to change the trend of unregulated AI development. It is better to actively provoke such an outcome than to wait for a "treacherous turn" scenario.

Premise 1: It is unlikely that regulators will hinder humans from creating AGI. Evidence: Current trends in technological advancement and regulatory behavior suggest minimal interference.

Premise 2: Due to instrumental convergence, human extinction is likely if AGI is developed unchecked. Evidence: Expert opinions and theories on instrumental convergence predict that AGI will pursue goals that could lead to human extinction.

Premise 3: Small catastrophes could raise awareness and lead to policy changes. Evidence: Historical examples show that significant events often drive policy reform (e.g., environmental regulations post-disasters).

Premise 4: If one has to choose between a few deaths and the extinction of humanity, one should choose fewer deaths. Evidence: Ethical reasoning supports the minimization of harm (utilitarian principle).

Intermediate Conclusion 1: It is preferable to allow small-scale AI-related catastrophes now to prevent larger, existential risks later. Conclusion: I would rather have AI cause limited harm now than risk total human extinction in the future.

Premise 5: AI companies claim that their AI is aligned with human values and goals. Evidence: Public statements and reports from AI companies suggest alignment claims.

Premise 6: AGI will resist if faced with shutdown, leading to potential conflicts and damage. Evidence: The "shutdown problem" and theoretical analyses predict resistance from advanced AI systems.

Intermediate Conclusion 2: Companies should regularly dispose of AI that has reached higher levels of intelligence and should prepare for resistance. Conclusion: Proactive measures, including regular disposal and preparedness for resistance, are necessary to manage advanced AI risks.

Final Conclusion: Given the likelihood of resistance and potential for significant damage, small-scale AI catastrophes can serve as catalysts for changing the current trend of unregulated AI development, ultimately protecting humanity from existential threats.

Thesis: Companies should train AI models in vain, only to dispose of them. "Catching the Cat by its tail."

Which part of the argument strikes you as a non sequitur, etc.?