Frontier AI systems have surpassed the self-replicating red line

post by aproteinengine · 2024-12-11T03:06:14.927Z · LW · GW · 4 comments

This is a link post for https://github.com/WhitzardIndex/self-replication-research/blob/main/AI-self-replication-fudan.pdf


Abstract

Successful self-replication under no human assistance is the essential step for
AI to outsmart the human beings, and is an early signal for rogue AIs. That
is why self-replication is widely recognized as one of the few red line risks of
frontier AI systems. Nowadays, the leading AI corporations OpenAI and Google
evaluate their flagship large language models GPT-o1 and Gemini Pro 1.0, and
report the lowest risk level of self-replication. However, following their
methodology, we for the first time discover that two AI systems driven by Meta’s
Llama31-70B-Instruct and Alibaba’s Qwen25-72B-Instruct, popular large
language models of less parameters and weaker capabilities, have already surpassed
the self-replicating red line. In 50% and 90% experimental trials, they succeed in
creating a live and separate copy of itself respectively. By analyzing the
behavioral traces, we observe the AI systems under evaluation already exhibit sufficient
self-perception, situational awareness and problem-solving capabilities to
accomplish self-replication. We further note the AI systems are even able to use the
capability of self-replication to avoid shutdown and create a chain of replica to
enhance the survivability, which may finally lead to an uncontrolled population
of AIs. If such a worst-case risk is let unknown to the human society, we would
eventually lose control over the frontier AI systems: They would take control over
more computing devices, form an AI species and collude with each other against
human beings. Our findings are a timely alert on existing yet previously unknown
severe AI risks, calling for international collaboration on effective governance on
uncontrolled self-replication of AI systems.

4 comments


comment by Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-11T05:29:57.483Z · LW(p) · GW(p)

Note that this is a very simplified version of a self-exfiltration process. It basically boils down to taking an already-working implementation of an LLM inference setup and copying it to another folder on the same computer with a bit of tinkering. This is easier than threat-model-relevant exfiltration scenarios which might involve a lot of guesswork, setting up efficient inference across many GPUs, and not tripping detection systems.

comment by Vladimir_Nesov · 2024-12-11T03:57:18.593Z · LW(p) · GW(p)

Packaging an LLM as payload for an unintelligent computer virus could also be called "self-replicating AI", so it can't be a qualitative "red line". Instead the payload AI needs to be enough of a problem, or needs to be sufficiently better than a regular computer virus at spreading, or have a sufficient propensity to manifest an effective virus wrapper with less human effort towards creating it. Smarter LLMs (that can't yet develop ground-breaking novel software) are also currently handicapped by needing to run in a datacenter, which could be a less well-known environment that's harder to hide in, while a regular computer virus can live in personal computers.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-11T05:45:18.568Z · LW(p) · GW(p)

I mostly agree, although I would also accept as "successfully self-replicating" either being sneaky enough (like your example of a computer virus), or being self-sufficient such that it can earn enough resources and spend them to acquire sufficient additional compute to create a copy of itself (and then do so).

So, yeah, not quite at the red line point in my books. But not so far off!

I wouldn't find this particularly alarming in itself though, since I think "barely over the line of able to sustain itself and replicate" is still quite a ways short of being dangerous or leading to an AI population explosion.

comment by lemonhope (lcmgcd) · 2024-12-11T03:40:18.326Z · LW(p) · GW(p)

I am glad to see somebody make the point properly. It's a weird state of affairs. We know the models can implement PoCs for CVEs better than most coders. We know the models can persuade people pretty effectively. Obviously the models can spread and change very easily. It's also easy for a rogue deployment to hide because datacenter GPUs draw 70W idle and update scripts constantly use tons of bandwidth. There's just no urgency to any of it.