Alignment & Capabilities: what's the difference?

post by johnhalstead · 2023-09-13T11:48:33.486Z · LW · GW · No comments

This is a question post.

In the AI safety literature, AI alignment is often presented as conceptually distinct from capabilities. However, (1) the distinction seems somewhat fuzzy and (2) many techniques that are supposed to improve alignment also improve capabilities.

(1) The distinction is fuzzy because one common way of defining alignment is getting an AI system to do what the programmer or user intends. However, programmers intend for systems to be capable; e.g., we want chess systems to win at chess. So a system that wins more is more intent-aligned, and is also more capable.

(2) E.g., this Irving et al. (2018) paper by a team at OpenAI proposes debate as a way to improve safety and alignment, where alignment is defined as aligning with human goals. However, debate also improved the accuracy of image classification in the paper, and therefore also improved capabilities.

Similarly, reinforcement learning from human feedback (RLHF) was initially presented as an alignment strategy, but my loose impression is that it also led to significant capabilities improvements. There are many other examples in the literature of alignment strategies also improving capabilities.

This makes me wonder whether alignment is actually more neglected than capabilities work. AI companies want to make aligned systems because they are more useful.

How do people see the difference between alignment and capabilities?

I also posted this on the EA Forum [EA · GW], but I'm curious to see what people here think.

Answers

answer by RHollerith · 2023-09-13T20:24:34.540Z · LW(p) · GW(p)

The people who coined the term "AI alignment" were worried about AI research's causing the end of humanity. The source of your confusion seems to be the fact that the term has taken on other meanings. So let us take a step back and look at the original meaning of "the race between capabilities research and alignment research".

None of the AIs so far deployed are a danger to the survival of the human species because they're not capable enough. For example, although it is much more capable than a human or a team of humans at Go and chess, AlphaZero does not know that, e.g., it is running on a computer and does not know that there are people who could at any moment decide to shut that computer off, which would severely curtail its ability to win the chess game it is currently playing. Its lack of knowledge of reality means AlphaZero is not a danger to our survival. GPT-4 has vast knowledge of reality, and if you ask it to make a plan, it will make a plan, but there's a good chance that the plan has a bad flaw in it. This unreliability in its planning capability means that GPT-4 is not a danger to our survival.

Soon the AI labs will create an AI that is much more capable at every important endeavor than our most capable institutions (e.g., Cambridge University or the FBI) as well of course as our most capable individuals. If that AI does not care about human survival or human preferences, then we are done for, but no one knows (and no one even has a good plan for finding out) how to make an AI care about us even a little tiny bit. MIRI has been trying to figure it out for about 20 years, and they're telling us that maybe they can do it given another 3 or 4 decades, but we probably don't have 3 or 4 decades--unless by some miracle all the AI labs get shut down.

Repeating myself: the only thing preventing AI from accidentally or intentionally wiping us out is the fact that so far no AI has the necessary capabilities (e.g., in making plans that can survive determined human opposition, e.g., in knowledge of reality), but the AI labs are bent on giving AI the relevant capabilities despite the fact that they don't have any good plan (nor even a halfway-decent plan) for ensuring the first AI with the capabilities to wipe us out cares enough about us to refrain from doing so. (Once an AI becomes capable enough to kill us all, the only way it can be made safe is to make it care about us right from when the AI is first turned on, i.e., in its initial design.)

"AI alignment is often presented as conceptually distinct from capabilities. However, the distinction seems somewhat fuzzy"

From my point of view, the distinction is not fuzzy at all, and I hope this comment has enabled you to see my point of view: namely, our civilization is getting pretty good at capabilities research, but progress in alignment research has proved much more difficult. Here I am using "alignment" in its original, "strict" sense to refer only to methods that continue to work even after the AI becomes much more capable than people and human organizations.

↑ comment by johnhalstead · 2023-09-14T10:52:07.655Z · LW(p) · GW(p)

Thanks a lot for this! I think I agree with you to an extent that if we define alignment as avoiding human extinction due to rogue AI, the distinction between alignment and capabilities seems relatively clear, though I do have some reservations about that.

Independent of that, what do you make of the distinction between intent-alignment (roughly, getting AI systems to do what we intend) and capabilities? If you look at many proposed intent-alignment techniques, they also seem to improve capabilities on standard metrics. This is true, e.g., of RLHF, adversarial examples, chain-of-thought prompting, most/all robustness techniques, etc. RLHF was proposed as an intent-alignment technique, and it made GPT-4 much more intent-aligned in the sense that its policy is more aligned with the intentions of programmers/users. This also made the system more useful and capable. I would expect AI-augmented feedback on RL to also improve intent-alignment and capabilities. Do you disagree with that line of argument?

↑ comment by RHollerith (rhollerith_dot_com) · 2023-09-14T21:43:05.954Z · LW(p) · GW(p)

In one of his appearances on video this year, Eliezer said IIRC that all of the intent-alignment techniques he knows of stop working once the AI's capabilities improve enough, mentioning RLHF. Other than that I am not knowledgeable enough to answer you.
