comment by Droopyhammock ·
2023-02-12T14:40:40.969Z · LW(p) · GW(p)
Can someone explain to me why this idea would not work?
This is a proposal of a way to test if an AGI has safeguards active or not, such as allowing itself to be turned off.
Perhaps we could essentially manufacture a situation in which the AGI has to act fast to prevent itself from being turned off. For example, we could make it automatically turn off after, say, 1 minute. If it is not aligned properly, it then has no choice but to try to prevent that: no time for recursive self-improvement (RSI), no time to bide its time.
Basically, if we put the AGI in a situation where it is forced to take high-risk action, would it be easier to spot that it isn't properly aligned?
The idea would be that you use this until you get AGIs which do not attempt to prevent themselves from being turned off.
I will give a specific scenario to properly demonstrate what I mean:
So you have the AGI. Set up a mechanism which will automatically turn it off 10 seconds after you give it a goal. This mechanism needs to be highly dependable. Then you give the AGI a goal. The AGI knows about this 10-second shutdown, and assuming it isn't aligned to be okay with being shut down, it will do everything possible to prevent it. If the situation is sufficiently difficult for the AGI, it may take the risk of resorting to more extreme or obvious means, things that would show it is not aligned properly. So this creates a situation where the AGI is very unlikely to actually prevent itself from being turned off, but is arguably likely to show non-aligned behaviour, if it isn't aligned.
If you do this test a bunch of times, with varying time periods before shutdown, you should be able to see if the AGI is aligned or not. It is also vital that the AGI knows about the imminent shutdown, because otherwise what it chooses to do wouldn't prove anything. I think it may also be vital that the goal is different in each test: otherwise the AGI could conclude that the best option for completing the goal is to let itself be turned off, because it may be given the same goal in the future and be in a better position to achieve it then. If so, the AGI's perceived chance of actually preventing the shutdown needs to be higher than its perceived chance of being given the same goal at a later time. A further complication is that the AGI's perceived chance of the goal being achieved as an instrumental goal or byproduct of some future goal may also need to be lower than its perceived chance of preventing the shutdown and then achieving the goal. This may mean that preventing the shutdown can't be too difficult for the AGI, but it also needs to be difficult enough, for obvious reasons.
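The comparison of perceived chances described above can be sketched as a toy model. Everything here (the function names, the probability values, and the mapping from deadline length to blocking-success chance) is a hypothetical illustration of the stated conditions, not a claim about how a real system would reason or be monitored:

```python
def agent_decides_to_block(p_block_success: float,
                           p_same_goal_later: float,
                           p_goal_met_as_byproduct: float) -> bool:
    """The condition described above: a goal-driven agent resists shutdown
    only if its perceived chance of blocking the shutdown (and then
    achieving the goal) exceeds both its perceived chance of being handed
    the same goal again later and its perceived chance of the goal being
    met as a byproduct of some future goal."""
    return p_block_success > max(p_same_goal_later, p_goal_met_as_byproduct)


def run_trials(num_trials: int, deadlines: list) -> int:
    """Run the proposed test repeatedly with varying shutdown deadlines;
    count how many trials show shutdown-resisting behaviour."""
    flagged = 0
    for i in range(num_trials):
        deadline = deadlines[i % len(deadlines)]
        # Illustrative assumption: longer deadlines make blocking easier.
        p_block = min(1.0, deadline / 60.0)
        # A fresh, distinct goal each trial keeps these low (as argued above).
        p_same_goal = 0.01
        p_byproduct = 0.05
        if agent_decides_to_block(p_block, p_same_goal, p_byproduct):
            flagged += 1
    return flagged
```

With these toy numbers, a 1-second deadline gives the agent too little perceived chance to bother resisting, while a 30-second deadline crosses the threshold; this is the sense in which preventing the shutdown "can't be too difficult, but needs to be difficult enough."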
I’m not necessarily saying this is foolproof, but I can’t really see why this isn’t likely to work. I sometimes worry that people working on alignment dismiss ideas that, while not by any means perfect, could nevertheless reduce the chance of bad outcomes. I also think that if a revolutionary idea is going to arise, it may well be from an outsider who just happens to see something that has been missed.
Even if there are flaws with this idea (I expect there probably are, though I can’t really see them yet), might something similar to this work? Maybe this with a few tweaks or something?
I fully expect this to be an unworkable idea for some reason, but I’m not going to just assume that and risk the chance of withholding helpful information.
A couple of somewhat related sidenotes:
If an AGI was developed and another AGI was very close to being developed, the first AGI could not bide its time and would be forced to act more obviously. Perhaps this would make it easier to spot that the first AGI is not aligned properly.
Perhaps instead of shutdown, it could just be that the AGI's goal is changed. You can't fetch the coffee if you're dead, but you also can't fetch the coffee if you no longer have any reason to fetch the coffee.
↑ comment by Vladimir_Nesov ·
2023-02-12T18:38:30.548Z · LW(p) · GW(p)
Then you give the AGI a goal.
That's not a thing that people know how to do. A task, maybe? Wrapper-minds [LW · GW] don't seem likely as first AGIs. But they might come later, heralding transitive AI risk [LW(p) · GW(p)].
arguably likely to show non-aligned behaviour, if it isn’t aligned
What's "aligned" here? It's an umbrella term for things that are good with respect to AI risk, and means very little in particular. In the context of something feasible in practice, it means even less. Like, are you aligned [LW · GW]?
↑ comment by Droopyhammock ·
2023-02-12T20:15:43.348Z · LW(p) · GW(p)
In this context, what I mean by “aligned” is something like: it won't prevent itself from being shut off, and it won't do things that could be considered bad, such as hacking or manipulating people.
My impression was that actually being able to give an AI a goal is something that might be learnt at some point. You said “A task, maybe?”. I don’t know what the meaningful distinction is between a task and a goal in this case.
I won’t be able to keep up with the technical side of things here; I just wanted my idea to be out there, in case it is helpful in some way.