Speedrun ruiner research idea

post by lemonhope (lcmgcd) · 2024-04-13T23:42:29.479Z · LW · GW · 11 comments


Edit Apr 14: To be perfectly clear, this is another cheap thing you can add to your monitoring/control system; this is not a panacea or deep insight, folks. Just a Good Thing You Can Do™.

Portal: How To Get Outside Without Cheats (360)


(Also if random reader wants to fund this idea, I don't have plans for May-July yet.)


metadata = {
  "effort": "just thought of this 20 minutes ago",
  "seriousness": "total",
  "checked if someone already did/said this": false,
  "confidence that": {
    "idea is worth doing at all": "80%",
    "one can successfully build a general anti-speedrun thing": "25%",
    "tools/methods would transfer well to modern AI RL training": "50%"
  }
}
  1. ^

    Note that "laggy" is indeed the correct/useful notion, not e.g. "average CPU utilization increase", because "lagginess" conveniently bundles the key performance issues in both the game-playing and RL-training cases: loading time between levels/tasks is OK; more frequent and important actions being slower is very bad; turn-based things can be extremely slow as long as they're faster than the agent/player; etc.
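This "lagginess" criterion can be measured directly rather than left informal. Here is a hypothetical sketch, where `monitor` stands in for whatever anti-glitch check you bolt on and `env_step` for the underlying environment step; both are assumptions, not real APIs. The idea is to budget the monitor's overhead as a fraction of the base step time, so slow loading screens pass but slowdown of frequent actions gets flagged:

```python
import time

class LagMeter:
    """Times a monitor callback per environment step and flags it as
    laggy when its added latency exceeds a fraction of the base step
    time (fast steps get tight budgets; slow loads get loose ones)."""

    def __init__(self, monitor, budget_fraction=0.5):
        self.monitor = monitor          # the anti-glitch check under test
        self.budget_fraction = budget_fraction

    def timed_step(self, env_step, *args):
        t0 = time.perf_counter()
        result = env_step(*args)        # the game/env does its work
        base = time.perf_counter() - t0

        t1 = time.perf_counter()
        self.monitor(result)            # run the monitoring pass
        overhead = time.perf_counter() - t1

        laggy = overhead > self.budget_fraction * max(base, 1e-9)
        return result, laggy
```

A cheap check on a fast step stays under budget; an expensive check on the same step trips the flag. The 50% default budget is arbitrary and would need tuning per environment.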

11 comments


comment by mako yass (MakoYass) · 2024-04-14T17:49:01.232Z · LW(p) · GW(p)

Are you proposing applying this to something potentially prepotent? Or does this come with corrigibility guarantees? If you applied it to a prepotence, I'm pretty sure this would be an extremely bad idea. The actual human utility function (the rules of the game as intended) supports important glitch-like behavior, where cheap tricks can extract enormous amounts of utility, which means that applying this to general alignment has the potential of foreclosing most value that could have existed.

Example 1: Virtual worlds are a weird out-of-distribution part of the human utility function that allows the AI to "cheat" and create impossibly good experiences by cutting the human's senses off from the real world and showing them an illusion. As far as I'm concerned, creating non-deceptive virtual worlds (like, very good video games) is correct behavior and the future would be immeasurably devalued if it were disallowed.

Example 2: I am not a hedonist, but I can't say conclusively that I wouldn't become one (turn out to be one) if I had full knowledge of my preferences and the ability to self-modify, as well as lots of time and safety to reflect, settle my affairs in the world, set aside my pride, and then wirehead. This is a glitchy-looking behavior that allows the AI to extract a much higher yield of utility from each subject by gradually warping them into a shape where they lose touch with most of what we currently call "values", where one value dominates all of the others. If it is incorrect behavior, then sure, the AI shouldn't be allowed to do that; but humans today don't have the kind of self-reflection required to tell whether it's incorrect behavior or not. And if it's correct behavior, forever forbidding it is actually a far more horrifying outcome: what you'd be doing is, in some sense of 'suffering', forever prolonging some amount of suffering. That's fine if humans tolerate and prefer some amount of suffering, but we aren't sure of that yet.

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2024-04-14T18:57:48.277Z · LW(p) · GW(p)

I do not propose applying this method to a prepotence.

Replies from: MakoYass
comment by mako yass (MakoYass) · 2024-04-14T19:20:24.938Z · LW(p) · GW(p)

Cool then.

Are you aware that prepotence is the default for strong optimizers though?

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2024-04-14T19:24:55.979Z · LW(p) · GW(p)

What about mediocre optimizers? Are they not worth fooling with?

Replies from: MakoYass
comment by mako yass (MakoYass) · 2024-04-14T23:59:08.244Z · LW(p) · GW(p)

You wouldn't really need reward modelling for narrow optimizers. Weak general real-world optimizers I find difficult to imagine, and I'd expect them to be continuous with strong ones: the projects to make weak ones wouldn't be easily distinguishable from the projects to make strong ones.

Oh, are you thinking of applying it to, say, simulation training?

comment by tailcalled · 2024-04-14T06:26:12.784Z · LW(p) · GW(p)

The issue with this idea is that it seems pretty much impossible.

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2024-04-14T18:58:51.783Z · LW(p) · GW(p)

What makes you say so? Seems about 25% possible to me. You can find where position is stored and watch for sudden changes. Same thing with score & inventory...
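A minimal sketch of what "watch for sudden changes" could look like, assuming you already have some hook into the game's memory (an emulator API, ptrace, etc.). The `read_position` callable here is a hypothetical stand-in for that hook, and the speed threshold is something you'd calibrate per game:

```python
def make_glitch_detector(read_position, max_speed, dt):
    """Return a closure that flags teleport-like jumps: position
    changes that imply a speed above max_speed units per second."""
    last = [read_position()]  # previous (x, y), boxed for mutation

    def step():
        x, y = read_position()
        lx, ly = last[0]
        speed = ((x - lx) ** 2 + (y - ly) ** 2) ** 0.5 / dt
        last[0] = (x, y)
        return speed > max_speed  # True => suspicious discontinuity

    return step
```

Normal movement stays under the threshold; an out-of-bounds clip or wrong-warp shows up as a single-frame jump that exceeds it. Score and inventory would use the same pattern with different change thresholds (e.g. score deltas larger than any single in-game reward).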

Replies from: tailcalled
comment by tailcalled · 2024-04-15T06:46:12.997Z · LW(p) · GW(p)

Oh wait, maybe I misunderstood what you meant by "any game". I thought you meant a single program that could detect it across all games, but it sounds much more feasible with a program that can detect it in one specific game.

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2024-04-17T05:51:13.348Z · LW(p) · GW(p)

All games. I mean: find where position is stored etc. automatically. It will certainly have failure cases; it's easy to make a game that breaks it. The question is whether an adversarial agent can easily break it in a regular (i.e. not adversarially chosen) game.
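One plausible route to finding the address automatically is Cheat Engine-style value scanning: snapshot memory across frames and keep only the addresses whose values change a little every frame, which is what a coordinate looks like during ordinary movement. A hedged sketch, operating on plain lists standing in for memory dumps (real memory access and float encodings would add considerable complication):

```python
def find_smooth_addresses(snapshots, max_delta):
    """snapshots: equal-length memory dumps taken on consecutive frames.
    Returns addresses whose value changed on every frame, but never by
    more than max_delta: candidates for position-like variables."""
    candidates = set(range(len(snapshots[0])))
    for prev, cur in zip(snapshots, snapshots[1:]):
        candidates = {
            a for a in candidates
            if prev[a] != cur[a] and abs(cur[a] - prev[a]) <= max_delta
        }
    return sorted(candidates)
```

Constants get filtered out for not changing, and counters or noise get filtered out for changing too much. This is exactly the kind of heuristic an adversarially designed game could defeat (e.g. by encrypting its position variable); the open question above is whether it survives in ordinary games.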

comment by Jay Bailey · 2024-04-14T20:38:59.445Z · LW(p) · GW(p)

It seems to me that either:

  • RLHF can't train a system to approximate human intuition on fuzzy categories. This includes glitches, and this plan doesn't work.

  • RLHF can train a system to approximate human intuition on fuzzy categories. This means you don't need the glitch hunter; just apply RLHF directly to the system you want to train. All the glitch hunter does is make it cheaper.

Replies from: lcmgcd
comment by lemonhope (lcmgcd) · 2024-04-14T21:39:50.484Z · LW(p) · GW(p)

You may be right. Perhaps the way to view this idea is as "yet another fuzzy-boundary RL helper technique" that works in a very different way, and so will have different strengths and weaknesses from stuff like RLHF. So if one is taking the "serially apply all cheap tricks that somewhat reduce risk" approach, then this can be yet another thing in your chain.