RerM's Shortform
post by RerM (robert-m) · 2025-04-28T03:30:44.100Z · LW · GW · 2 comments
comment by RerM (robert-m) · 2025-04-28T03:30:44.099Z · LW(p) · GW(p)
Generally, a hypothetical hostile AGI is assumed to run on software/hardware more advanced than what we have now. This makes sense, as ChatGPT is very stupid in a lot of ways.
Has anyone considered purposefully creating a hostile AGI on this "stupid" software so we can wargame how a highly advanced, hostile AGI would act? Obviously the gap between what we have now and what we may have later will be quite large, but I think we could create a project where we "fight" stupid AIs, then slowly move up the intelligence ladder as new models come out, using our accumulated knowledge of fighting hostile intelligences to mitigate the risk that comes with creating more capable hostile AIs.
Has anyone thought of this before, and what are your thoughts on it? Alignment and AI are not my specialties, but the idea seemed interesting enough to share.
↑ comment by Knight Lee (Max Lee) · 2025-04-28T10:04:23.522Z · LW(p) · GW(p)
I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models [LW · GW].
But I guess that's more of a "test how they behave in adversarial situations" study. If you're talking about a "test how to fight against them" study, that already exists in the form of "red teams" trying to hack various systems to make sure they are secure.
I'm not sure whether those red teams use AI, but they are smart people, and if AI improved their hacking ability I'm sure they would use it. So they're already stronger than AI alone.