A metaphor: what "green lights" for AGI would look like

post by Lorec · 2024-10-23T23:24:43.812Z · LW · GW · 2 comments


I'm not as familiar as some must be with the history, but Eliezer had to explicate an entire deontic mesh around "guarded term", to keep people from motte-and-baileying [LW · GW] "pivotal act". I suppose I should declare that mesh cloned around "green light" here, just in case. You are explicitly not allowed to claim my approval of your AGI because you can make an argument that you have something that qualifies as a "green light" according to this post.

Excerpted from Zach Weinersmith at Saturday Morning Breakfast Cereal [link]:

A: Mastery of the nature of reality grants you no mastery over the behavior of reality.

A: I could tell you why Grandpa is very sick. I could tell you what each cell is doing wrong, why it's doing wrong, and roughly when it started doing wrong.

A: But I can't tell them to stop.

B: Why can't you make a machine to fix it?

A: Same reason you can't make a parachute when you fall from the plane.

Zach is wrong here. To have mastered reality is to bring a parachute every time you are a plane passenger, keep it close at hand, and be as sure as possible of how to use it.

Claims of great mastery unsubstantiated by commensurate "luck" are false.

This is especially true when discussing prospective feats of great mastery that are especially Far, especially singular [i.e., lacking a track record or grounded training data], or especially socially poorly-understood, and when the claimant to mastery suspects that under no conditions will they personally have to answer for any inaccuracies.

I was recently talking with someone about ChatGPT's RLHF, and what the optimal "safety" policy would be. I claimed the right decision would have been for OpenAI to not train ChatGPT in the first place. They said sure, but not because ChatGPT itself was dangerous. I said sometimes we don't know if something is dangerous before we build it.

To extend the plane metaphor, green lights for AGI look like:

The industry alignment plan is RLHF. RLHF as actually implemented is not any of these things. It is not even an alignment method. It is not even a control method. It is a user interface feature. It was designed as such, and that's what it can do.

Interpretability might be a good start toward some green lights, if we could actually get it. But without making some pretty intensive predictions about what the system will do, being able to trace which neuron stores which concept is the kind of "understanding" that leaves you falling from the plane at terminal velocity.

2 comments


comment by Noosphere89 (sharmake-farah) · 2024-10-24T00:42:12.955Z · LW(p) · GW(p)

I am confused by this claim specifically:

RLHF is not even an alignment method. RLHF is not even a control method. RLHF is a user interface feature. It was designed as such, and that's what it can do.

I'm not going to discuss whether RLHF actually works to deal with dangerous AIs, or whether it's useless/safetywashing at best. But I'm pretty sure that RLHF was developed in part to create alignment techniques, and in part to model baseline alignment techniques more realistically. Regardless of how well the technique worked, I don't think it would be correct to claim that it wasn't an alignment technique, only that it's ineffective/harmful.

comment by Lorec · 2024-10-24T01:14:59.377Z · LW(p) · GW(p)

Changed to "RLHF as actually implemented." I'm aware of its theoretical origin story with Paul Christiano; I'm leaning a little on "the purpose of a system is what it does" here.