confusion about alignment requirements

post by Tamsin Leake (carado-1) · 2022-10-06T10:32:49.779Z · LW · GW · 10 comments

This is a link post for https://carado.moe/confusion-about-alignment-requirements.html


for now, let's put aside the fact that we can't decide whether we're trying to achieve sponge coordination or FAS [LW · GW], and merely consider what it takes to build an aligned AI — regardless of whether it has the capability of saving the world as a singleton, or merely to be a useful but safe tool.

the question this post is about is: what requirements do we want such a solution to satisfy?

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds. however, their designs are actually very different from one another.

maybe one is an advanced but still overall conventional text-predicting simulator [LW · GW], another is a clever agentic neural net with reinforcement learning and access to a database and calculator, and the third is a novel kind of AI whose core doesn't really relate to current machine learning technology.

so, they start talking about why they think their AI is aligned. however, they run into an issue: they don't even agree on what it takes to be sure an AI is safe, let alone aligned!

and those are optimistic cases! many alignment approaches would simply:

i've noticed this pattern of confusion in myself after trying to explain alignment ideas i've found promising to some people, and the nature of their criticism — "wait, where's the part that makes this lead to good worlds? why do you think it would work?" — seems to be of a similar nature to my criticism of people who think "alignment is easy, just do X": the proposal is failing to answer some fundamental concerns that the person proposing has a hard time even conceiving of.

and so, i've come to wonder: given that those people seem to be missing requirements for an alignment proposal, requirements which seem fundamental to me but unknown unknown to them, what requirements are unknown unknown to me? what could i be missing? how do i know which actual requirements i'm failing to satisfy because i haven't even considered them? how do we collectively know which actual requirements we're all collectively missing? what set of requirements is necessary for an alignment proposal to satisfy, and what set is sufficient?

it feels like there ought to be a general principle that covers all of this. the same way that the logical induction paper demonstrates that the computability desideratum and the "no dutch book" desideratum together suffice to satisfy ten other desiderata about logical inductors, it seems like a simple set of desiderata ought to capture the true name [AF · GW] of what it means for an AI to lead to good worlds. but this isn't guaranteed, and i don't know that we'll find such a thing in time, or that we'll have any idea how to build something that satisfies those requirements.

10 comments

Comments sorted by top scores.

comment by jacob_cannell · 2022-10-06T17:04:38.050Z · LW(p) · GW(p)

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds.

So this is something I've thought and written about - and I think it actually has a fairly obvious answer: the only reliable way to convince others that a system works is to show extensive actual evidence of the . . . system working. Simple, right?

Nobody believed the Wright brothers when they first claimed they had solved powered flight (if I recall correctly they had an initial failed press conference where no reporters showed up) - it required successful test demonstrations.

How do you test alignment? How do you demonstrate successful alignment? In virtual simulation sandboxes [LW · GW].

PS: Is there a reason you don't use regular syntax (i.e. capitalization of the first letters of sentences)? (just curious)

Replies from: carado-1
comment by Tamsin Leake (carado-1) · 2022-10-06T20:03:25.195Z · LW(p) · GW(p)

In virtual simulation sandboxes

forgive me for not reading that whole post right now, but i suspect that an AI may:

  • act as we'd want but then sharp left turn [LW · GW] once it reaches capabilities that the simulation doesn't have enough compute for but reality does
  • need more compute than it has access to in the simulation before it starts modifying the world significantly, so we can only observe it being conservative and waiting to get more compute
  • act nice because it detects it's in a simulation, eg by noticing that the world it's in has nowhere near the computational capacity for agents that would design such an AI
  • commit to acting nice for a long time and then turn evil way later regardless of whether it's a computation or not, just in case it's in a simulation

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI. if you address these concerns in your post, let me know and i'll read it in full (or read the relevant sections) and comment on there.

as for my writing style, see here [LW(p) · GW(p)].

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-06T21:12:51.025Z · LW(p) · GW(p)

I guess I need to write a better concise summary.

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI.

As stated very early in the article, the AIs in simboxes do not have words for even the precursor concepts of 'computation' and 'simulation', as their knowledge is intentionally constrained and shaped to early- or pre-civilization-level equivalents. The foundational assumption is brain-like AGI in historical sims with carefully crafted/controlled ontologies.

comment by Morpheus · 2022-10-06T19:57:38.200Z · LW(p) · GW(p)

What you are looking for sounds very much like Vanessa Kosoy's agenda (formal guarantees, regret bounds). Best post I know explaining her agenda [LW · GW]. If you liked logical induction, definitely look into Infrabayesianism [? · GW]! It's very dense, so I would recommend starting with a short introduction [LW · GW], or just looking for good stuff under the infrabayesianism tag [? · GW]. The current state of affairs is that we don't have these guarantees yet, or at least only under unsatisfactory assumptions.

Replies from: carado-1, Artaxerxes
comment by Tamsin Leake (carado-1) · 2022-10-06T20:08:40.449Z · LW(p) · GW(p)

i am somewhat familiar with vanessa's work, and it contributed to inspiring this post of mine. i'd like to understand infrabayesianism better, and maybe that intro will help with that, thanks!

comment by Artaxerxes · 2022-10-06T20:20:27.012Z · LW(p) · GW(p)

What you are looking for sounds very much like Vanessa Kosoy's agenda

As it so happens, the author of the post also wrote this [LW · GW] overview post on Vanessa Kosoy's PreDCA protocol.

Replies from: Morpheus
comment by Morpheus · 2022-10-06T20:52:19.106Z · LW(p) · GW(p)

Oops! Well, I did not carefully read the whole post to the end, and that's what you get! Ok, second try after reading the post carefully:

it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds.

I think I have been thinking something similar, and my best description of this desideratum is pragmatism. Something like "use a prior that works" [LW · GW] in the worlds where we haven't already lost. It's easy to make toy models where alignment is impossible, so: regret bounds for some prior, where I don't yet know what that prior looks like.

comment by Lone Pine (conor-sullivan) · 2022-10-06T18:16:20.989Z · LW(p) · GW(p)

Lab #3 either has some wild alien intelligence on hand which is so far advanced that it blows everything out of the water, or they are just faking it. XD

Replies from: carado-1
comment by Tamsin Leake (carado-1) · 2022-10-06T19:51:13.732Z · LW(p) · GW(p)

i didn't mean to imply that they were each 1/2/3; i intended each statement to apply to all three labs but in unspecified order (so lab 1 might be the second in one statement and the third in another statement).

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-10-07T07:55:42.197Z · LW(p) · GW(p)

I know, I was just teasing.