Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

post by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · LW · GW · 1 comment

This is a question post.

Contents

  Answers
    5 Daniel Kokotajlo
    4 avturchin
    2 Ramana Kumar
1 comment

AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy.
This is because they need to know:
1) what sort of oversight they are subject to 
and
2) what effects their actions will have on the real world

(side note: Acausal trade might break this argument)

There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception. Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however. Are you?

Answers

answer by Daniel Kokotajlo · 2024-09-04T14:59:46.647Z · LW(p) · GW(p)

I know of no rigorous proposals. The general challenge such proposals face is that if you are relying on fooling your AGI about something to keep control over it, and it's constantly and rapidly getting smarter and wiser... that's a recipe for your scheme to fail suddenly and silently (when it stops being fooled), which is a recipe for disaster.

Another type of proposal relies on making it actually true that it might be in a simulation--or to put it more precisely perhaps, making it actually the case that future aligned superintelligences will make simulations so accurate that even a baby superintelligence can't tell the difference. However, two can play at that game; more generally this just becomes a special case of acausal trade stuff which will be wild and confusing and very important once AIs are smart enough to take it seriously.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T15:55:39.332Z · LW(p) · GW(p)

Not necessarily fooling it, just keeping it ignorant.  I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-09-04T19:40:37.096Z · LW(p) · GW(p)

Are you describing something that would fit within my 'Another type of proposal...' category?

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-14T22:57:12.610Z · LW(p) · GW(p)

No, I was only responding to the first part.

answer by avturchin · 2024-09-04T13:56:07.092Z · LW(p) · GW(p)

I wrote a 'message to any future AI', in which I partly argue that any AI should give significant probability mass to the idea that it is in a simulation run by a more advanced AI testing its friendliness.

answer by Ramana Kumar · 2024-09-26T09:21:29.172Z · LW(p) · GW(p)

Vaguely related perhaps is the work on Decoupled Approval: https://arxiv.org/abs/2011.08827

1 comment


comment by RHollerith (rhollerith_dot_com) · 2024-09-04T16:12:21.945Z · LW(p) · GW(p)

People have proposed putting an AI into a simulated environment such that when it thinks it is acting on reality, in actuality it is not.

Is that what you mean by "plausibly need to know exactly where they are in space-time"? If not, what do you mean?