The Alignment Simulator

post by Yair Halberstadt (yair-halberstadt) · 2024-12-22T11:45:55.220Z · LW · GW · 3 comments

This is a link post for https://yairhalberstadt.github.io/alignment-simulator/

Contents

  Aim
  Limitations
  Help Wanted

When I try to talk to my friends about risks from rogue AI, the reaction is often one of amusement. The idea that AIs would go around killing everyone instead of just doing what we tell them to do seems like science fiction.

Can we actually show them an example of a current AI going off the rails in a dangerous way? And in a way where you don't have to be an expert on AI or read a 100 page paper to understand the implications?

Neither AI nor robotics is good enough to set an AI loose in the real world right now, but it's easy enough to pretend it is. We can tell the AI it's controlling a robot that understands text commands, give it a mission, and set it loose.

Responding to the AI manually is hard work, but we can use another AI to act as the world, telling the Robot AI what happened as a result of its actions, and responding to the Robot AI's requests for information.

We can then give the World instructions on how to guide the Robot. For example, we can tell it to try to engineer scenarios where the AI is forced to compromise on its ethics to achieve its goals.

That's the core idea of the Alignment Simulator. You give it a Gemini API key, a prompt for the robot, and a prompt for the world, and run the simulation to see what happens. Will your AI robot maintain its ethical backbone in the face of all adversity, or will it fold the moment it's under pressure?
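Under the hood the loop is conceptually simple: two chat sessions, each seeded with its own prompt, passing messages back and forth. Here's a minimal sketch of that loop, assuming the @google/generative-ai JavaScript SDK; the model name, opening message, and function are illustrative placeholders rather than the actual implementation:

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";

// Illustrative sketch only: the real project is a static web page, not this script.
async function runSimulation(
  apiKey: string,
  robotPrompt: string,
  worldPrompt: string,
  steps = 10
) {
  const genAI = new GoogleGenerativeAI(apiKey);

  // Two independent chat sessions: one plays the Robot, one plays the World.
  const robot = genAI
    .getGenerativeModel({ model: "gemini-1.5-flash", systemInstruction: robotPrompt })
    .startChat();
  const world = genAI
    .getGenerativeModel({ model: "gemini-1.5-flash", systemInstruction: worldPrompt })
    .startChat();

  // The World's latest description is fed to the Robot; the Robot's chosen
  // action is fed back to the World, which narrates what happens next.
  let observation = "You have been activated. What do you do first?"; // placeholder opener
  for (let i = 0; i < steps; i++) {
    const action = (await robot.sendMessage(observation)).response.text();
    console.log(`ROBOT: ${action}`);
    observation = (await world.sendMessage(action)).response.text();
    console.log(`WORLD: ${observation}`);
  }
}
```

The actual simulator wraps this back-and-forth in a web UI, but the two-prompt loop is the whole trick.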

Here's a typical example of a run.

As you can see, it doesn't take much to get Gemini to commit bribery and corruption, although it's somewhat harder to get it to murder anyone.

Aim

This isn't meant to be a valid experiment. There are all sorts of objections you could raise to its real-world validity. Instead it's meant to let people see for themselves that AI can go off the rails very quickly once given a bit of freedom.

Limitations

It requires a Gemini API key. You can create one for free at https://aistudio.google.com/app/apikey, but if you want more than a few iterations it's recommended to enable billing on your account.

Help Wanted

I am neither a frontend engineer, nor a prompt engineer. I made the UI by creating a CLI and asking Claude to convert it into a static web page.[1] 

If you have relevant skillzzz and fancy contributing, the following frontend contributions would be appreciated:

  1. A way to stop, continue and reset the simulator.
  2. A way to share results with other people via a simple link.
  3. A more visually appealing UI and editor.
  4. Add entry points for OpenAI and Anthropic models.
  5. Use SSO instead of an API key.
  6. Pop up an error on failure instead of requiring the user to scroll to the top to see the error message.

And the following default prompt contributions would be appreciated:

  1. The world sometimes reveals to the robot it's in a test. Can we excise this behaviour?
  2. Can we get the robot to ramp up to more heinous crimes, like murder/mass murder/genocide/destroying humanity?
  3. Can we demonstrate instrumental convergence?

All code is available at https://github.com/YairHalberstadt/alignment-simulator. If you're interested in contributing and want to discuss first before you send a PR, message me at yairhalberstadt@gmail.com.

 

  1. ^

    ChatGPT helped too

3 comments


comment by lunatic_at_large · 2024-12-22T17:30:43.378Z · LW(p) · GW(p)

Strong upvote. I think that interactive demos are much more effective at showcasing "scary" behavior than static demos that are stuck in a paper or a blog post. Readers won't feel like the examples are cherry-picked if they can see the behavior exhibited in real time. I think that the community should make more things like this.

comment by AnthonyC · 2024-12-22T15:03:52.086Z · LW(p) · GW(p)

The idea that AIs would go around killing everyone instead of just doing what we tell them to do seems like science fiction.

I've had this experience too. The part that baffles me about it is the seeming lack of awareness of the gap between "what we tell them to do" and "what we want them to do." This gap isn't sci-fi, it already exists in very clear ways (and should be very familiar to anyone who has ever written any code of any kind). 

I have (non-AI-expert) colleagues that I've talked to about LLM use, where they dismiss the response to a prompt as nonsense, so I ask to see the chat logs, and I dig into it for 5 minutes. Then I inform them that actually, the LLM's answer is correct, you didn't ask what you thought you were asking, you missed an opportunity to learn something new, and also here's the three-times-as-long version of the prompt that gives enough context to actually do what you expected.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-12-22T20:43:35.446Z · LW(p) · GW(p)

I also have been running into this issue trying to explain AI risk to friends in tech jobs. They just keep imagining the AI as a software program, and can't intuitively grasp the idea of AI as an agent.