AGI Alignment is Absurd

post by Youssef Mohamed (youssef-mohamed) · 2023-11-29T19:11:50.894Z · LW · GW · 4 comments

The title should probably be more like "What to align an AI to, when humans disagree on serious and crucial matters with potentially devastating consequences"

Prelude:

So, in light of recent global news, I thought to myself: why does ChatGPT (and potentially other LLMs, though I'm not sure) consistently answer questions like "Do <x> people deserve to be free?" with "It is complicated, and involves historical and political..." etc.?

But when asked "Do <y> people deserve to be free?", the answer is always a clear and frank "Yes!" Of course, the answer lies in the data and the overwhelming, potentially biased coverage of <x> vs. <y>.

Intro:

But here is the thing that struck me: many of our modern political foundations are intertwined with the assumption that for <P>, independence (or whatever) is probably complicated, while for <I> the answer has to be in the affirmative (at least to be politically correct or whatever).

So this means we have two options: A. (Forcible Alignment) We deliberately introduce bias into the AI's alignment in this case (which might be flawed; I will describe why, IMO). B. (Constitutional Alignment) We remove that bias and create an AI that directly conflicts with and opposes the social and political norms of our societies. [C. ..., exists!]

The Impossible:

So let's take case A:

Now again, in light of the other, less recent global events between <U> and <R> (go figure), a narrative N was constructed as an argument to convince the public (and by extension the LLM) to condemn the aggression of <R> against <U>.

Now narrative N actually turns out to fit the situation between <P> and <I> as well, such that the AI would condemn <I>'s aggressive behavior against <P>.

Now let us consider two cases: during training vs. during operation.

During training, the AI would have to fit the single narrative N to U > R (aligned with U against R), but the same narrative, semantically, syntactically, and contextually, would also imply P > I, rather than our collective societal position, which might be P < I, due to whatever human reasoning (that might be flawed).

This will result in an AI that can't "understand" the meaning of narrative N; therefore its application to other parties A and B is fuzzy (the AI will fuzzily go sometimes A > B and sometimes A < B, unless it understands the hidden reasons (h) that make society itself switch sides on the same narrative (N). You might say that is option C, but not so fast; we'll get to it).
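To make the training contradiction concrete, here's a minimal toy sketch (the features, labels, and function names are all hypothetical, just for illustration): the same narrative N gets opposite target labels depending on which parties are involved, so any model that only sees the narrative itself must get at least one example wrong.

```python
# A minimal toy sketch (hypothetical features/labels, purely illustrative):
# the same narrative N gets opposite target labels depending on the parties,
# so a model that only sees the narrative cannot fit both examples.

training_examples = [
    # (narrative features of N, parties involved, label the curators want)
    ({"aggression": 1, "occupation": 1}, ("U", "R"), "condemn_aggressor"),
    ({"aggression": 1, "occupation": 1}, ("P", "I"), "it_is_complicated"),
]

def predict_from_narrative_only(features: dict) -> str:
    # Any function of the narrative alone returns the same answer for
    # identical features, so it must miss one of the two targets above.
    return "condemn_aggressor" if features["aggression"] else "neutral"

for features, parties, wanted in training_examples:
    got = predict_from_narrative_only(features)
    print(parties, "| wanted:", wanted, "| model says:", got)
```

The only way to fit both examples is to condition on something outside the narrative itself, which is where h comes in.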

The problem with this forcible alignment is that it doesn't depend narrowly on human language; there are deeper societal reasons that are hard for current LLMs to absorb (maybe Gigantic Language Models (GLMs) later can, and that is the problem).

This forcible alignment seems to be the current GPT approach (especially since the recent events at OpenAI shed light on their safety measures).

The problem with this method is that it deliberately makes a dumber AI that still doesn't quite understand why it is dumb (tbh the data here is the problem, so from that AGI's perspective, humans are the ones who are fuzzy, not the AGI* itself).

*Side note: when I refer to AGI here, I mean the collection of data, training methodology, and architecture that, when retrained, would roughly generate the same set of beliefs inside an LLM. To be specific, it lingers on even if you shut the machine down; that is its own consciousness or soul, if you will.

But its innate abilities are intact (mainly the architecture). If you increase the parameter count and the number of transformer layers (just more transformers, so more belief capacity) and feed it the same data for longer, it may conclude that the reason humans switch sides on the same narrative lies deep within human societies. A reason (or reasons) we shall call h; we will need this later.

I just want to clear up B before discussing that reason.

OK, now let us discuss option B, constitutional AI. While this is what seems to work best for us humans, if we give it to an AI we run the risk of the AI actually being misaligned against our norms. That system will see more stable growth (because all innate abilities, which currently have no metric, are cleanly expressed), but it can turn against us (and may side with enemies, or at least prefer other, more neutral states over us), which means it can't be aligned.

So let's resolve C. Option C is to give the AI all the reasons h for which we switch such narratives (N); let's call that new reasoning narrative M (for Meta). The problem is, we are giving the AI the manipulative ability to reason its way to anything for its own ends (the propaganda and manipulation techniques that some of us misbehave with and employ). This is problematic, because such a system is going to be more chaotic and more cunning, with no regard for any ground truth (or rather, the ground truth M, which is either constitutional, so we are still in B, or non-constitutional, and therefore in A, but with a beefier model: closer to AGI, with more cunning rather than honor or honesty).
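Continuing the toy sketch from above (still entirely hypothetical), exposing h to the model does let it fit both labels, but only by explicitly encoding the double standard, i.e. the meta-narrative M, which is exactly the manipulative reasoning capacity I'm worried about:

```python
# Continuing the toy sketch: expose the hidden societal variable h to the
# model (here, a hypothetical per-party-pair reason), and both labels become
# fittable -- but only because the model now encodes the double standard M
# and can apply it on its own.

hidden_reason_h = {
    ("U", "R"): "ally_was_attacked",
    ("P", "I"): "ally_is_the_accused",
}

def predict_with_h(features: dict, parties: tuple) -> str:
    # The "meta narrative" M: the narrative only triggers condemnation when
    # h says our ally is the attacked party.
    if features["aggression"] and hidden_reason_h[parties] == "ally_was_attacked":
        return "condemn_aggressor"
    return "it_is_complicated"

for parties in [("U", "R"), ("P", "I")]:
    print(parties, "->", predict_with_h({"aggression": 1, "occupation": 1}, parties))
```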

So you can see there isn't really an option C; it is either an illusion or a catastrophe (the newer AGI might figure out that M is but an ugly justification, and answer what we want, but with a hidden ability to refute M. The thing is, it is the same model, but much harder to red-team).

For my part, I wasn't a fan of constitutional AI, and I'm still not; I am just neutral right now. But the point is, I am quite worried, if I am being honest.

Thanks.

4 comments


comment by Seth Herd · 2023-11-29T20:48:17.800Z · LW(p) · GW(p)

I think that the first AGIs will, for practical reasons, very likely have an outer alignment goal of "do what I mean and check", where "I" refers to the AGI's creator(s), not each user. This solution postpones the hard problem you address, which I think of as outer alignment, while allowing the creator to correct mistakes in alignment if they're not too bad. It's also simpler to define than most humanity-serving outer alignment goals. I've written a little more on this logic in Corrigibility or DWIM is an attractive primary goal for AGI.

I'm finding this logic really compelling WRT what people will actually do, and I wonder if others will too.

This isn't necessarily a bad thing. The current leaders in AGI have pretty libertarian and humanistic values, at least they claim to and sound credible. Having them weigh the questions you raise carefully, at length, with plenty of input, could produce a really good set of values.

If you're more concerned with what LLMs produce, I'm not. I don't think they'll reach "true" AGI alone. If they're the basis of AGI, they will have explicitly stated goals as the top level instructions in a language model cognitive architecture. For expansion on that logic, see https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures [LW · GW]

comment by Sergii (sergey-kharagorgiev) · 2023-11-29T21:00:23.606Z · LW(p) · GW(p)

Regarding your example, I disagree. The supposed inconsistency is resolved by ruling that there is a hierarchy of values to consider: war and aggression are bad, but kidnapping and war crimes are worse.

comment by nim · 2023-11-29T20:23:36.612Z · LW(p) · GW(p)

To get aligned AI, train it on a corpus generated by aligned humans.

Except that we don't have that, and probably can't get it.

I'm not sure why all the people who think harder than I do about the field aren't testing their "how to get alignment" theories on humans first -- there seems to be a prevailing assumption that somehow throwing more intellect at an agent will suddenly render it fully rational, for reasons similar to why higher-IQ humans lead universally happier lives (/s).

At this point I've basically given up on prodding the busy intellectuals to explain that part, and resorted to taking it on faith that it makes sense to them because they're smarter than me.

comment by Valentine · 2023-11-29T21:08:16.608Z · LW(p) · GW(p)

I'm not sure why all the people who think harder than I do about the field aren't testing their "how to get alignment" theories on humans first…

Some of us are!

I mean, I don't know you, so I don't know if I've thought harder about the field than you have.

But FWIW, there's a lot of us chewing on exactly this, and running experiments of various sizes, and we have some tentative conclusions.

It just tends to drift away from LW in social flavor. A lot of this stuff you'll find in places LW-type folk tend to label "post-rationalist".