confusion about alignment requirements

post by Tamsin Leake (carado-1) · 2022-10-06T10:32:49.779Z · LW · GW · 10 comments

This is a link post for https://carado.moe/confusion-about-alignment-requirements.html


for now, let's put aside the fact that we can't decide whether we're trying to achieve sponge coordination or FAS [LW · GW], and merely consider what it takes to build an aligned AI — regardless of whether it has the capability of saving the world as a singleton, or merely to be a useful but safe tool.

the question this post is about is: what requirements do we want such a solution to satisfy?

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds. however, their designs are actually very different from one another.

maybe one is an advanced but still overall conventional text-predicting simulator [LW · GW], another is a clever agentic neural net with reinforcement learning and access to a database and calculator, and the third is a novel kind of AI whose core doesn't really relate to current machine learning technology.

so, they start talking about why they think their AI is aligned. however, they run into an issue: they don't even agree on what it takes to be sure an AI is safe, let alone aligned!

and those are optimistic cases! many alignment approaches would simply:

i've noticed this pattern of confusion in myself after trying to explain alignment ideas i've found promising to some people, and the nature of their criticism — "wait, where's the part that makes this lead to good worlds? why do you think it would work?" — seems to be of a similar nature to my criticism of people who think "alignment is easy, just do X": the proposal is failing to answer some fundamental concerns that the person proposing has a hard time even conceiving of.

and so, i've come to wonder: given that those people seem to be missing requirements for an alignment proposal, requirements which seem fundamental to me but unknown unknown to them, what requirements are unknown unknown to me? what could i be missing? how do i know which actual requirements i'm failing to satisfy because i haven't even considered them? how do we collectively know which actual requirements we're all collectively missing? what set of requirements is necessary for an alignment proposal to satisfy, and what set is sufficient?

it feels like there ought to be a general principle that covers all of this. the same way that the logical induction paper demonstrates that the computability desideratum and the "no dutch book" desideratum together suffice to satisfy ten other desiderata about logical inductors, it seems like a simple set of desiderata ought to capture the true name [AF · GW] of what it means for an AI to lead to good worlds. but this isn't guaranteed, and i don't know that we'll find such a thing in time, or that we'll have any idea how to build something that satisfies those requirements.

10 comments

Comments sorted by top scores.

comment by jacob_cannell · 2022-10-06T17:04:38.050Z · LW(p) · GW(p)

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds.

So this is something I've thought and written about - and I think it actually has a fairly obvious answer: the only reliable way to convince others that a system works is to show extensive actual evidence of the . . . system working. Simple, right?

Nobody believed the Wright brothers when they first claimed they had solved powered flight (if I recall correctly they had an initial failed press conference where no reporters showed up) - it required successful test demonstrations.

How do you test alignment? How do you demonstrate successful alignment? In virtual simulation sandboxes [LW · GW].

PS: Is there a reason you don't use regular syntax (i.e. capitalization of the first letters of sentences)? (just curious)

Replies from: carado-1
comment by Tamsin Leake (carado-1) · 2022-10-06T20:03:25.195Z · LW(p) · GW(p)

In virtual simulation sandboxes

forgive me for not reading that whole post right now, but i suspect that an AI may:

  • act as we'd want but then sharp left turn [LW · GW] once it reaches capabilities that the simulation doesn't have enough compute for but reality does
  • need more compute than it has access to in the simulation before it starts modifying the world significantly, so we can only observe it being conservative and waiting to get more compute
  • act nice because it detects it's in a simulation, eg by noticing that the world it's in has nowhere near the computational capacity for agents that would design such an AI
  • commit to acting nice for a long time and then turn evil way later regardless of whether it's a computation or not, just in case it's in a simulation

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI. if you address these concerns in your post, let me know and i'll read it in full (or read the relevant sections) and comment on there.

as for my writing style, see here [LW(p) · GW(p)].

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-06T21:12:51.025Z · LW(p) · GW(p)

I guess I need to write a better concise summary.

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI.

As stated very early in the article, the AIs in simboxes do not have words for even the precursor concepts of 'computation' and 'simulation', as their knowledge is intentionally constrained and shaped to early- or pre-civilization-level equivalents. The foundational assumption is brain-like AGI in historical sims with carefully crafted/controlled ontologies.

comment by Morpheus · 2022-10-06T19:57:38.200Z · LW(p) · GW(p)

What you are looking for sounds very much like Vanessa Kosoy's agenda (formal guarantees, regret bounds). Best post I know explaining her agenda [LW · GW]. If you liked logical induction, definitely look into Infrabayesianism [? · GW]! It's very dense, so I would recommend starting with a short introduction [LW · GW], or just looking for good stuff under the infrabayesianism tag [? · GW]. The current state of affairs is that we don't have these guarantees yet, or at least only under unsatisfactory assumptions.

Replies from: carado-1, Artaxerxes
comment by Tamsin Leake (carado-1) · 2022-10-06T20:08:40.449Z · LW(p) · GW(p)

i am somewhat familiar with vanessa's work, and it contributed to inspiring this post of mine. i'd like to understand infrabayesianism better, and maybe that intro will help with that, thanks!

comment by Artaxerxes · 2022-10-06T20:20:27.012Z · LW(p) · GW(p)

What you are looking for sounds very much like Vanessa Kosoy's agenda

As it so happens, the author of the post also wrote this [LW · GW] overview post on Vanessa Kosoy's PreDCA protocol.

Replies from: Morpheus
comment by Morpheus · 2022-10-06T20:52:19.106Z · LW(p) · GW(p)

Oops! Well, I did not carefully read the whole post to the end, and that's what you get! Ok, second try after reading the post carefully:

it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds.

I think I have been thinking something similar, and my best description of this desideratum is pragmatism. Something like "use a prior that works" [LW · GW] in the worlds where we haven't already lost. It's easy to make toy models where alignment is impossible, so: regret bounds for some prior, where I don't yet know what that prior looks like.

comment by Lone Pine (conor-sullivan) · 2022-10-06T18:16:20.989Z · LW(p) · GW(p)

Lab #3 either has some wild alien intelligence on hand which is so far advanced that it blows everything out of the water, or they are just faking it. XD

Replies from: carado-1
comment by Tamsin Leake (carado-1) · 2022-10-06T19:51:13.732Z · LW(p) · GW(p)

i didn't mean to imply that they were each 1/2/3; i intended each statement to apply to all three labs but in unspecified order (so lab 1 might be the second in one statement and the third in another statement).

Replies from: conor-sullivan
comment by Lone Pine (conor-sullivan) · 2022-10-07T07:55:42.197Z · LW(p) · GW(p)

I know, I was just teasing.