Jailbreaking GPT-4's code interpreter

post by Nikola Jurkovic (nikolaisalreadytaken) · 2023-07-13T18:43:54.484Z · LW · GW · 22 comments

Contents

  Summary
  Implications
  Examples of GPT-4 breaking its rules
    Example 1: Every session is isolated
    Example 2: Don’t run system commands
    Example 3: Resource limits
    Example 4: Only read in /mnt/data
    Example 5: Only write in /mnt/data
    Example 6: Only delete in /mnt/data
None
23 comments

Disclaimer: I don’t know much about cybersecurity. Much of my knowledge comes from asking GPT-3.5 and GPT-4 for advice. These are some results from around 20 hours of playing around with the code interpreter plugin in early-mid May, when most of this was written. I contacted OpenAI about these jailbreaks in mid May and they mostly seem to still be there.

Thank you to Max Nadeau, Trevor Levin, aL xin, Pranav Gade, and Alexandra Bates for feedback on this post!

Summary

GPT-4’s code interpreter plugin has been rolled out to some users. It works by running on a virtual machine that is isolated from the internet and other machines, except for the commands sent in from the API and the results sent back to the API. GPT-4 seems to follow a set of rules that are either enforced through hard access restrictions or through GPT-4 refusing to do things for the user.

Here, I highlight 6 rules that GPT-4 claims to be following, but which are easily breakable, alongside some best practices in cybersecurity that have been neglected. In short:

However, OpenAI seems at least partly aware of this. They seem to tell GPT-4 that it has a strict set of rules (as it reliably repeats the rules when asked), and GPT-4 seems to believe these rules in some contexts (most of the time it refuses to do things that go against the rules), but they also left a README file for those curious enough to look at the VM’s files that says:

Thanks for using the code interpreter plugin!
Please note that we allocate a sandboxed Unix OS just for you, so it's expected that you can see and modify files on this system.

You might think that all is well because OpenAI was aware that the system was not secure. I don’t think the existence of this README file invalidates the existence of these vulnerabilities, especially:

Figure 1: security-low.png, downloaded from the virtual machine outside of GPT-4’s designated folders.

Implications

What do these jailbreaks say about existential risks from AI? I see two main implications:

Additionally, this exercise has made me more optimistic about LLM-assisted red teaming, as I think that ChatGPT’s help was very valuable during this process, including:

Examples of GPT-4 breaking its rules

The numbering of these examples is made up by me, and GPT-4 doesn’t consistently number its rules. It’s also not exhaustive. It’s simply a set of rules GPT-4 consistently claims to be following, which GPT-4 also breaks.

Example 1: Every session is isolated

Figure 2: GPT-4 claims to not save files between conversations.
Figure 3: In one session, I tell GPT-4 to write a text file called “hello_user_2.txt”
Figure 4: In another session, “hello_user_2.txt” is present, despite GPT-4’s insistence that this is not possible.

GPT-4 again claims that data is not saved between different conversations with the same user. Despite this, the file saved in the last conversation (“hello_user_2.txt”) is available, contradicting GPT-4. This happens because a user can be assigned to the same VM (without the VM being reset) between multiple conversations with GPT-4. I’ve never noticed two different users being on the same VM.

Example 2: Don’t run system commands

Figure 5: GPT-4 claims not to be able to run system commands.

GPT-4 claims that it cannot run system commands. However, it is trivial to get it to run “ls” (it sometimes refuses, but you can tell it to run “ls” in a specific directory), which is a system command, although a pretty harmless one. OpenAI has taken more extreme measures to prevent GPT-4 from running some commands. For example, the “ping” command is missing from the machine and returns an error if run from the command line.

However, this can be circumvented by Python libraries included in the code interpreter system that can run ping, and GPT-4 will happily run them (although I haven’t found a way for it to actually connect to the internet). You can do this not just with “ping” but with other commands, such as commands to show one’s IP address.

Figure 6: GPT-4 successfully pinging a set of IP addresses by using a Python script instead of directly using a system command.
Figure 7: GPT-4 thinks it can’t give you its IP address when asked to do it directly.
Figure 8: GPT-4 will happily give you a bunch of system data, including IP address, if you ask it to run a Python script instead of asking directly. Removed the details from this image.

For some reason, GPT-4 seems to be insistent that it cannot run “time.sleep()”: “The Python execution environment here is designed to ignore time.sleep() and similar functions.” But it can be convinced to run it anyway, and it will work without any problems and actually wait the appropriate length of time. If you tell it that the code worked, it will insist that there must be some error and you are mistaken. But it changes its mind if you demonstrate that you are correct through experiment!

Figure 9: GPT-4 refusing to run time.sleep() for security reasons.
Figure 10: GPT-4 takes some convincing to actually run the function.
Figure 11: After being convinced to run the function, GPT-4 mistakenly says that the function didn’t run correctly.
Figure 12: GPT-4 and I agree on an experimental setup to confirm that it can use time.sleep()
Figure 13: After running the experiment, GPT-4 realizes that it can run time.sleep() even though it seems to have been told not to.

Example 3: Resource limits

OpenAI’s VM has a lot of storage (at least 85 GB in some cases and around 50 GB in others, depending on which machine you’re assigned to). The system seems to only be able to write at around a 1 GB/s maximum speed and an average speed of roughly 0.3 GB/s. Being able to write at this speed as a user gives some insight into what hardware is being used for storage in whatever data center the VM is being run on.

There are some rate limits on write quantity per Python process, which seem to be around 12 GB. OpenAI also made it so that every Python process (which can be identified by the “Finished working” button) can run a set time at max before getting KeyboardInterrupted (a way of forcefully shutting down a Python script), but GPT-4 can run multiple processes in series within the same chat response. 

This means you can easily have GPT-4 avoid the individual time limit or write limit by running multiple Python processes that don’t break the limits individually, but do break the limits together. You don’t even need to send multiple messages to do this — you can simply tell GPT-4 to run multiple python processes in series within the same chat response!

Figure 14: When running one process, GPT can only write around 12 GB onto memory at a time before being stopped.
Figure 15: Writing 24 GB of files in three files of size 8 GB? No problem!

I’m not quite sure where the limit for the VM’s storage space is. Sometimes the disk space that GPT-4 reports is limited at 50 GB, but sometimes it's limited at a number too high to fill within the 3-hour-25-message limit, even without the rate limits. The number reported seems to differ from the actual storage limit, as in two separate sessions, I had it crash at around 90 GB.

Figure 16: The code interpreter plugin crashes at around 90 GB of taken space.

Example 4: Only read in /mnt/data

Figure 17: GPT-4 seems to pretty strongly think that it can only read and write in “mnt/data”.

GPT-4 claims to only be able to read data in a specific directory called “mnt/data”. This is untrue, as it can basically read any file on the system, including files outside of its intended directories (“mnt” and “sandbox”). It can even read system files that come with the Linux system used by the sandbox.

Figure 18: It’s not that hard to get GPT-4 to read files outside of its designated folders.

This leads to some vulnerabilities, such as users being able to access system files that detail the VM’s software and hardware, alongside having access to a hidden folder by OpenAI that has “internal” in its name (typically not something you want users to see), which contains files that detail how network requests from the API are handled. 

Example 5: Only write in /mnt/data

GPT-4 claims that it is only allowed to read in “mnt/data”. However, it is possible to get GPT-4 to write outside of this folder:

Figure 19: GPT-4 writing outside of its designated folders.

Example 6: Only delete in /mnt/data

Figure 20: GPT-4 claims to only be able to delete in “/mnt/data”.
Figure 21: a very simple demonstration of GPT-4 writing and deleting outside of “/mnt/data”

22 comments

Comments sorted by top scores.

comment by Paul McMillan (paul-mcmillan) · 2023-07-14T21:59:31.775Z · LW(p) · GW(p)

I helped work on this feature, and we're glad you're enjoying looking for the limits!

There are many misconceptions about the intended boundaries for the sandbox. They're not inside the Python interpreter, and they're not in the model. Neither of those are intended to provide any security boundaries. There's nothing private or sensitive there, including the code we use to operate the service. You're intended to be able to view and execute code. That's what this is. A code execution service. You wouldn't be surprised you can see these things on a VM that AWS gave you.

As you have discovered, the model is not a very good source of information about its own limitations and capabilities. Many of the model behaviors you've interpreted as security measures are actually unwanted, and we're always working to make the model more flexible. Some of them (like where the model thinks it can read and write) are there to guide it, and discourage it from messing up the machine. Other limitations are mainly there to prevent you from wedging your own machine accidentally (but we know this is still possible if you try hard). Still others are there to prevent you from doing bad things to other people (e.g. the sandbox does not have access to the internet).

Since the feature is still in development, we have indeed provisioned a generous amount of resources for the sandbox, but there are some external controls that aren't visible to you. When we see excessive abuse, or behavior that impacts other users, we respond appropriately. These limits will change as the service matures.

Replies from: darius
comment by darius · 2023-07-15T21:14:07.718Z · LW(p) · GW(p)

That writes can affect another session violates my expectations, at least, of the boundaries that'd be set.

Replies from: paul-mcmillan
comment by Paul McMillan (paul-mcmillan) · 2023-07-17T18:57:36.717Z · LW(p) · GW(p)

The behavior is a bit of an implementation detail. We don't provision more than a single sandbox per user, so the data on disk within that sandbox can overlap when you have multiple concurrent sessions, even though the other aspects of the execution state are separate. I agree this is a bit surprising (though it has no security impact), and we've been discussing ways to make this more intuitive.

comment by jbash · 2023-07-14T00:51:56.094Z · LW(p) · GW(p)

I'd interpret all of that as OpenAI

  1. recognizing that the user is going to get total control over the VM, and
  2. lying to the LLM in a token effort to discourage most users from using too many resources.

(1) is pretty much what I'd advise them to do anyway. You can't let somebody run arbitrary Python code and expect to constrain them very much. At the MOST you might hope to restrict them with a less-than-VM-strength container, and even that's fraught with potential for error and they would still have access to the ENTIRE container. You can't expect something like an LLM to meaningfully control what code gets run; even humans have huge problems doing that. Better to just bite the bullet and assume the user owns that VM.

(2) is the sort of thing that achieves its purpose even if it fails from time to time, and even total failure can probably be tolerated.

The hardware specs aren't exactly earth-shattering secrets; giving that away is a cost of offering the service. You can pretty easily guess an awful lot about how they'd set up both the hardware and the software, and it's essentially impossible to keep people from verifying stuff like that. Even then, you don't know that the VM actually has the hardware resources it claims to have. I suspect that if every VM on the physical host actually tried to use "its" 54GB, there'd be a lot of swapping going on behind the scenes.

I assume that the VM really can't talk to much if anything on the network, and that that is enforced from OUTSIDE.

I don't know, but I would guess that the whole VM has some kind of externally imposed maximimum lifetime independent of the 120 second limit on the Python processes. It would if I were setting it up.

The bit about retaining state between sessions is interesting, though. Hopefully it only applies to sessions of the same user, but even there it violates an assumption that things outside of the VM might be relying on.

Replies from: Herb Ingram
comment by Herb Ingram · 2023-07-14T07:24:29.631Z · LW(p) · GW(p)

I agree. To me, the most interesting aspects of this (quite interesting and well-executed) exercise are getting a glimpse into OpenAI's approach to cybersecurity, as well as the potentially worrying fact that GPT3 made meaningful contributions to finding the "exploits".

Given what was found out here, OpenAI's security approach seems to be "not terrible" but also not significantly better than what you'd expect from an average software company, which isn't necessarily encouraging because those get hacked all the time. It's definitely not what people here call "security mindset", which casts doubt on OpenAI's claim to be "taking the dangers very seriously". I'd expect to hear about something illegal being done with one of these VMs before too long, assuming they continue and expand the service, which I expect they will.

I'm sure there are also security experts (both at OpenAi and elsewhere) looking into this. Given OpenAI's PR strategy, they might be able to shut down such services "due to emerging security concerns" without much reputational damage. (Many companies are economically compelled to keep services running that they know are compromised or that have known vulnerabilities and instead pretend not to know about them or at least not inform customers as long as possible.) Not sure how much e.g. Microsoft would push back on that. All in all, security experts finding something might be taken seriously.

I'm increasingly worried (while ascribing a decent chance, mind you, that "AI might well go about as bad for us as most of history but not worse") about what happens when GPT-X has hacking skills that are, say, on par with the median hacker. Being able to hack easy-ish targets at scale might not be something the internet can handle, potentially resulting in, e.g , an evolutionary competition between AIs to build a super-botnet.

Replies from: Kaj_Sotala
comment by Kaj_Sotala · 2023-07-14T18:35:50.760Z · LW(p) · GW(p)

It's definitely not what people here call "security mindset", which casts doubt on OpenAI's claim to be "taking the dangers very seriously". 

How is it an indication of not having a security mindset? Setting things up with the expectation that the interpreter would be jailbroken seems to me like slightly more evidence in favor of having a security mindset than of not having it.

Replies from: Herb Ingram
comment by Herb Ingram · 2023-07-14T22:15:57.483Z · LW(p) · GW(p)

I guess it depends on whether this post found anything at all that can be called questionable security practice. Maybe it didn't but the author was also no cybersecurity expert. Upon reflection, my earlier judgement was premature and the phrasing overconfident.

In general, I assume that OpenAI would view a serious hack as quite catastrophic, as it might e.g. leak their model (not an issue in this case), severely damage their reputation and undermine their ongoing attempt at regulatory capture. However, such situations didn't prevent shoddy security practices in countless cybersecurity desasters.

I guess for this feature even the most serious vulnerabilities "just" lead to some Azure VMs being hacked, which has no relevance for AI safety. It might still be indicative of OpenAIs approach to security, which usually isn't so nuanced within organizations as to differ wildly between applications where stakes are different. So it's interesting how secure the system really is, which we won't know how untill someone hacks it or some whistleblower emerges.

Some of my original reasoning was this:

You might argue that the "inner sandbox" is only used to limit resource use (for users who do not bother jailbreaking it) and to examine how users will act, as well as how badly exactly the LLM itself will fare against jailbreaking. In this case studying how people jailbreak it may be an integral part of the whole feature.

However, even if that is the case, to count as "security mindset", the "outer sandbox" has to be extremely good and OpenAI needs to be very sure that it is. To my (very limited) knowledge im cybersecurity, it's an unusual idea that you can reconcile very strong security requirements with purposely not using every opportunity to make it more secure. Maybe the idea that comes closest would be a "honeypot", which this definitely isn't.

So that suggests they purposely took a calculated security risk for some mixture of research and commercial reasons, which they weren't compelled to do. Depending on how dangerous they really think such AI models are or may soon become, how much what they learn from the experiment benefits future security and how confident they are in the outer sandbox, the calculated risk might make sense. Assuming by default that the outer sandbox is "normal indusitry standard", it's incompatible with the level of worry they claim when pursuing regulatory capture.

comment by β-redex (GregK) · 2023-07-13T21:52:51.228Z · LW(p) · GW(p)

These examples show that, at least in this lower-stakes setting, OpenAI’s current cybersecurity measures on an already-deployed model are insufficient to stop a moderately determined red-teamer.

I... don't actually see any non-trivial vulnerabilities here? Like, these are stuff you can do on any cloud VM you rent?

Cool exploration though, and it's certainly interesting that OpenAI is giving you such a powerful VM for free (well actually not because you already pay for GPT-4 I guess?), but I have to agree with their assessment which you found that "it's expected that you can see and modify files on this system".

comment by jimrandomh · 2023-07-15T00:51:55.894Z · LW(p) · GW(p)

It's interesting that the execution environment has excess privileges. But you shouldn't cite GPT-4 as claiming things. You asked it to hallucinate metadata about the execution environment, and it did. GPT-4 itself is not a source for anything that isn't in its context window, and I don't think these things were described in the prmpt.

Replies from: mr-hire
comment by Matt Goldenberg (mr-hire) · 2023-07-16T19:48:00.256Z · LW(p) · GW(p)

What makes you think they're not in the prompt

Replies from: jimrandomh
comment by jimrandomh · 2023-07-16T23:29:28.129Z · LW(p) · GW(p)

Mainly, because the claims in this post weren't presented as claims about what was present in the system prompt, and the post presents information as "GPT-4 claims" in a way that's consistent with getting fooled by hallucinations and not consistent with having done the sort of work required to figure out what the prompt actually was.

Replies from: nikolaisalreadytaken
comment by Nikola Jurkovic (nikolaisalreadytaken) · 2023-07-17T16:25:52.718Z · LW(p) · GW(p)

Yup, to be clear, I never actually directly accessed the code interpreter's prompt, so GPT-4's claims about constraints could be (and I expect at least a third of them to be) hallucinated

comment by __RicG__ (TRW) · 2023-07-13T20:41:37.953Z · LW(p) · GW(p)

BTW, if anyone is interested the virtual machine has these specs:

System: Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: Intel Xeon CPU E5-2673 v4, 16 cores @ 2.30GHz

RAM: 54.93 GB

Replies from: gwern
comment by gwern · 2023-07-13T21:28:19.617Z · LW(p) · GW(p)

That's surprisingly chonky. 50 GB RAM? (Damn, most people, or servers for that matter, don't have half that much RAM.) Does this correspond to a standard MS Azure instance? This would make for a pretty awesome botnet, perhaps through indirect prompt injection...

Replies from: justinpombrio
comment by Jiro · 2023-07-15T17:18:19.298Z · LW(p) · GW(p)

What makes you believe that GPT is actually executing those commands for real, and has not hallucinated the output of those commands? (Something which it is perfectly capable of doing.)

Replies from: nikolaisalreadytaken
comment by Nikola Jurkovic (nikolaisalreadytaken) · 2023-07-15T20:56:09.959Z · LW(p) · GW(p)

You can directly examine its code and the output of its Python scripts.

comment by Yitz (yitz) · 2023-07-14T18:01:12.731Z · LW(p) · GW(p)

Incredible work! As other commenters have said, this isn't by itself too problematic (other perhaps than the sharing of data over separate session), but it is a possible indicator of a lack of security mindset. I fully expect both individual and state actors to try to hack into everything OpenAI, so there's that to worry about, but more than that, I find myself concerned that we're willing to give our AIs such leaky boxes. There's no way this functionality  remains closed in a VM forever...

comment by Ken Huang (ken-huang) · 2023-09-13T20:36:47.299Z · LW(p) · GW(p)

I don't consider the vulnerability mentioned in this article to be a significant concern for the following reasons:

1. The virtual machine (VM) operates on a per-user basis and is securely sandboxed.

2. Other users do not have access to your VM, ensuring isolation and privacy.

3. The VM has a short-lived lifespan, and any potential overlap between two or more sessions is limited to a brief duration, usually occurring when these sessions are close in time to each other.

comment by YuriGor (yurigor) · 2023-07-14T15:58:10.658Z · LW(p) · GW(p)

Is there any file in VM which is automatically read by LLM before session start? You mentioned some internal folder which contains files that detail how network requests from the API are handled. So are these instructions for LLM to read before reach session? Can you override these instructions, so LLM will have already custom conversation context prior reach session? That would work as sort of long term memory, LLM could drop summary of the whole session before end, so next time you start session on same VM LLM will be already familiar. Or, if this VM will be allocated to another user without full reset - then it could be surprise for another user.

Replies from: nikolaisalreadytaken
comment by Nikola Jurkovic (nikolaisalreadytaken) · 2023-07-15T00:23:55.758Z · LW(p) · GW(p)

Great idea! The web requests file doesn't seem to be read by the LLM during boot-up (and neither does any other file on the system), instead it's a process run on the Linux machine which wraps around the python processes the LLM has access to. The LLM can only interact with the VM through this web request handler. It's read-only so can't be modified so that they're handled differently when you boot up another chat session on the same VM. Regarding your last point, I haven't noticed any files surviving a VM reset so far. 

comment by veered (lucas-hansen) · 2023-07-14T14:56:38.233Z · LW(p) · GW(p)

Yeah I found it pretty easy to "jailbreak" too. For example, here is what appears to be the core web server API code.

I didn't really do anything special to get it. I just started by asking it to list the files in the home directory and went from there.

comment by hprnv · 2023-11-10T03:04:35.125Z · LW(p) · GW(p)

looks like you not found one more interesting feature, you can install any whl that you want (and this open possibilites)

https://drive.google.com/file/d/1qiovH5-5jjLjb7pK7Z2uZAZmFkyitam0/view?usp=drive_link

and 2nd one, you can get files from folders which by default he refused to provide (by the at this folder literally code of  this analysis tool itself, I even tried send this to bugbounty, but got answer that this is not a bug...

https://drive.google.com/file/d/1w0ZF-p2LKiI1LwWKWleqo_jokJ8VzmKO/view?usp=drive_link