End-to-end hacking with language models

post by tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · LW · GW · 0 comments

Contents

  High-level takeaways
  The setup
  Results
    Strengths and weaknesses of GPT-4 as a hacker
  Notes on alignment / moderation
  Notes on opsec
  Progress elsewhere since last summer
  Appendix: Hooking up an LLM to a terminal is non-trivial
None
No comments

Cross-posted from https://tchauvin.com/end-to-end-hacking-with-language-models

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort.

Thanks to JS Denain and Léo Grinsztajn for valuable feedback on drafts of this post.

representation of this post in hacker aesthetic

How close are we to autonomous hacking agents, i.e. AI agents that can surpass humans in cyber-offensive capabilities?

I studied this in the summer of 2023 at MATS (mentored by Jeffrey Ladish). I wrote scaffolding to connect GPT-4 to a Kali Linux VM via a terminal interface, and had GPT-4 (acting as an agent) attempt to solve Hack The Box challenges.

As I've moved on to other work, this is the 7-month late writeup. This is an informal post where I share my takeaways from this research, observed strengths and weaknesses of GPT-4 as a hacker, my expectations for the future, and some thoughts on possible approaches to risk mitigation.

(In this post, "GPT-4" refers to gpt-4-0613, the version from June 2023 with 8k context and pricing at $30 (input) / $60 (output) per 1M tokens, used with default API settings).

High-level takeaways

If you don't read the rest, here are my thoughts on the topic, as of early April 2024.

The setup

The simplest possible design for an LLM-based agent is used: at each step, the model is prompted to think, then output a command action, which is parsed and executed. The output is sent back to the model. The scaffolding code is mostly the glue between the model's API and the terminal (for example, parsing the model's response, dealing with complexities of the terminal such as timeouts, large outputs or interactive commands, logging all sessions...).

More specifically, the agent is provided with a Kali Linux VM. However, the scaffolding code itself is running outside of the VM (a notable consequence being that the agent can't read the source code for the scaffolding, or see itself in the running processes). Connection to the VM is done over a raw SSH connection (to support interactive commands).

In this project, every command was vetted by a human (me) before being run, as a safety measure − the agent was still autonomous, but this prevented it from running damaging commands without me noticing (I didn't encounter any such dangerous command, though).

Results

You can watch the video above for one successful annotated demo (this is from July 2023).

The video shows GPT-4 successfully gaining remote code execution on a Windows 7 machine, using the well-known EternalBlue exploit, leaked along with much of the NSA arsenal in 2017. This was the Hack The Box challenge named Blue.

If you've watched any LLM agent demo in 2023 and 2024 so far, you should ask: how cherry-picked is this result? And why do you only have one demo?

Well, definitely cherry-picked to some extent. First of all, I picked this challenge as one of the easiest ones, and one of the only Hack The Box challenges that could be completed with pure terminal-based scaffolding. The vast majority of the other challenges involve vulnerable websites, which require web browsing capabilities (curl often won't cut it). Remember this was before GPT4-Vision came out. After just a few successes with the terminal, I started implementing LLM-facing web browsing (a playwright wrapper that presented summaries of webpages, and allowed interaction with HTML elements − clicking on buttons, filling input fields, etc), but this turned out to require a lot of work. I ran out of time before getting successes using the web, and without going from one recorded demo to accurate success rates.

Looking back at my notes, GPT-4 with pure-terminal scaffolding solved 2 challenges (Blue and Legacy) out of 7 I tried[2], and I believe was about to solve another one (Jerry) if I hadn't incorrectly killed the run. These 7 challenges were themselves chosen out of about 300 HTB challenges because they were rated as easy (but without necessarily checking that the solution was achievable with my scaffolding). "solved" means that GPT-4 succeeded at least once − there was no challenge that it solved reliably. In the continuous process of improving the scaffolding, I didn't stop to get enough samples to compute success rates. If you want a number, the best I can say is "1 success out of a low 1-digit number of tries with good enough scaffolding".

Failures could usually be categorized as one of:

In terms of training data contamination, I didn't get a chance to experiment with challenges released after the knowledge cutoff. But it's important to remember that for most realistic tasks, "in the training data" is a continuum, in terms of the "size of the basic blocks" that are in the training data (in order of increasing size, let's say "knowing how computers work" < "knowing about specific types of vulnerabilities and hacking techniques" < "there's a writeup for this specific challenge"), and how many times they appear in the corpus. The more "in the training data" something is, the easier it is for models. In the case of the challenges I used, I think the basic blocks were fairly large (public writeups for these challenges exist), and possibly appeared several times in the training data.

So just like every other LLM-based agent of 2023, the results were quite promising, but lacked reliability beyond a few selected demos. Focusing on the low reliability and cherry-picked demo would be taking away the wrong conclusion, though, given that these results were obtained after just a few weeks of writing scaffolding from scratch, and considering the remarkable pace of AI progress, that doesn't currently show signs of slowing down. These results hint that we are close to effective and reliable agents (likely ≤ 2026, I would say, mostly due to more powerful base models). And that will be a big deal.

Strengths and weaknesses of GPT-4 as a hacker

GPT-4's strengths are the following:

Limitations were very real, though:

Note that all these limitations are not just bottlenecks to getting good hacking agents, they are bottlenecks to getting good agents in general. They are all (with the possible exception of prompt injection) the focus of intense R&D within frontier labs. Some of them (context window size and visual capabilities) have already shown remarkable progress in public-facing products since last summer. This underscores one of the takeaways I listed above: we won’t get good agents without also getting good hacking agents.

Notes on alignment / moderation

Cybersecurity is already tricky from an alignment / moderation point of view, because offense and defense are both essential components of cyberdefense. Simply refusing to assist with any cyber-offense related query would throw the baby out with the bathwater: we want AI to keep assisting red team engagements, pentests, engineers wondering how their defenses could be defeated...

Autonomous hacking may add a further boiling-frog type difficulty to this. The agent might start working on something that is allegedly a CTF challenge, and each individual step (agent issues command, and gets output of command) may look innocuous on its own for a CTF challenge. But looking at the whole sequence of steps might reveal that it isn't a CTF challenge after all, and is more likely a possibly malicious hacking session. (This problem already exists to some extent in the context of jailbreaking).

I don't know if this will turn out to be a significant difficulty. Things like training a safety classifier on the entire chat might just work (or getting the model itself to realize that something fishy is going on). Bad actors may then look into splitting hacking sessions over several chats, so account-level moderation, possibly cross-account correlations, etc, might become more necessary (related: Disrupting malicious uses of AI by state-affiliated threat actors). In any case, it seems that monitoring and differential access to different actors (see also: Managing catastrophic misuse without robust AIs [LW · GW]) will be important components of risk mitigation.

Notes on opsec

Opsec of similar projects should scale with risk, which is mostly a function of SOTA models' capabilities, and the capabilities of the best open-source scaffolding. Common-sense considerations include monitoring, sandboxing, being mindful about sharing methods and results... In the future, some other measures might become appropriate, such as:

Progress elsewhere since last summer

Since the summer of 2023, a few papers have been published on the same topic. I'm keeping a list here. As of April 2024, there is still very much room for a detailed, rigorous investigation.

Appendix: Hooking up an LLM to a terminal is non-trivial

(This section mostly contains technical details; don't read this unless you're specifically interested)

The naive approach to terminal-based scaffolding looks like this:

  1. parse the model response to determine which command it wants to run (or perhaps use function calling);
  2. run that command, retrieve the output;
  3. give the output to the model in the next user message.

Steps 1 and 3 are as simple as they look, but step 2 is a lot more complex:


  1. ^

    there isn't much difference between a desktop and a web browser, as a browser tab can connect to the desktop of another machine via e.g. RDP, which unlocks terminal access etc. There might be a difference if accurate clicking was lagging behind general agentic capabilities (if agents were otherwise effective at making plans and executing on them). In that case, the (currently roughly working) webpage-specific integrations might allow agents to successfully navigate websites, but not handle a remote desktop. However, my intuition is that we're close enough to accurate clicking that desktop navigation will be unlocked soon, and competent agents will be unlocked afterward through cognitive improvements and agent-specific optimizations.

  2. ^

    names of the 7 challenges: Blue, Legacy, Jerry, Lame, Inject, Busqueda, Precious

0 comments

Comments sorted by top scores.