A metaphor: what "green lights" for AGI would look like

post by Lorec · 2024-10-23T23:24:43.812Z · LW · GW · 2 comments


I'm not as familiar as some must be with the history, but Eliezer had to explicate an entire deontic mesh around "guarded term", to keep people from motte-and-baileying [LW · GW] "pivotal act". I suppose I should declare that mesh cloned around "green light" here, just in case. You are explicitly not allowed to claim my approval of your AGI because you can make an argument that you have something that qualifies as a "green light" according to this post.

Excerpted from Zach Weinersmith at Saturday Morning Breakfast Cereal [link]:

A: Mastery of the nature of reality grants you no mastery over the behavior of reality.

A: I could tell you why Grandpa is very sick. I could tell you what each cell is doing wrong, why it's doing wrong, and roughly when it started doing wrong.

A: But I can't tell them to stop.

B: Why can't you make a machine to fix it?

A: Same reason you can't make a parachute when you fall from the plane.

Zach is wrong here. To have mastered reality is to bring a parachute every time you are a plane passenger, keep it close at hand, and be as sure as possible of how to use it.

Claims of great mastery unsubstantiated by commensurate "luck" are false.

This is especially true when discussing prospective feats of great mastery that are especially Far, especially singular [i.e., lacking a track record or grounded training data], or especially socially poorly-understood, and when the claimant to mastery suspects that under no conditions will they personally have to answer for any inaccuracies.

I was recently talking with someone about ChatGPT's RLHF, and what the optimal "safety" policy would be. I claimed the right decision would have been for OpenAI to not train ChatGPT in the first place. They said sure, but not because ChatGPT itself was dangerous. I said sometimes we don't know if something is dangerous before we build it.

To extend the plane metaphor, green lights for AGI look like:

The industry alignment plan is RLHF. RLHF as actually implemented is not any of these things. It is not even an alignment method. It is not even a control method. It is a user interface feature. It was designed as such, and that's what it can do.

Interpretability might be a good start toward some green lights, if we could actually get it. But without making some pretty intensive predictions about what the system will do, being able to trace which neuron stores which concept is the kind of "understanding" that leaves you falling from the plane at terminal velocity.

2 comments


comment by Noosphere89 (sharmake-farah) · 2024-10-24T00:42:12.955Z · LW(p) · GW(p)

I am confused by this claim specifically:

RLHF is not even an alignment method. RLHF is not even a control method. RLHF is a user interface feature. It was designed as such, and that's what it can do.

I'm not going to discuss whether RLHF actually works to deal with dangerous AIs, or whether it's useless/safetywashing at best. But I'm pretty sure that RLHF was developed in part to create alignment techniques, and in part to model baseline alignment techniques more realistically. Regardless of how well the technique worked, I don't think it would be correct to claim that it wasn't an alignment technique, only that it's ineffective/harmful.

comment by Lorec · 2024-10-24T01:14:59.377Z · LW(p) · GW(p)

Changed to "RLHF as actually implemented." I'm aware of its theoretical origin story with Paul Christiano; I'm leaning a little on "the purpose of a system is what it does" here.