Kabir Kumar's Shortform

post by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:01.824Z · LW · GW · 6 comments


Comments sorted by top scores.

comment by Kabir Kumar (kabir-kumar) · 2024-11-03T17:03:02.193Z · LW(p) · GW(p)

btw, thoughts on this for 'the alignment problem'?
"A robust, generalizable, scalable,  method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B]"

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T13:55:09.460Z · LW(p) · GW(p)

Freely changing an AGI's goals is corrigibility, which is a huge advantage if you can get it. See Max Harms' corrigibility sequence and my "instruction-following AGI is easier...."

The question is how to reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other and larger part is technical alignment, getting those desired goals to really work that way in the particular first AGI we get.

Replies from: kabir-kumar
comment by Kabir Kumar (kabir-kumar) · 2024-11-04T14:49:15.408Z · LW(p) · GW(p)

Yup, those are hard. Was just thinking of a definition for the alignment problem, since I've not really seen any good ones.

Replies from: Seth Herd
comment by Seth Herd · 2024-11-04T16:13:13.970Z · LW(p) · GW(p)

I'd say you're addressing the question of goalcrafting or selecting alignment targets.

I think you've got the right answer for technical alignment goals, but the question remains of what human would control that AGI. See my "if we solve alignment, do we all die anyway" for the problems with that scenario.

Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.

But I do think your goal definition is a good alignment target for the technical work. I don't think there's a better one. I do prefer instruction-following or corrigibility by the definitions in the posts I linked above because they're less rigid, but they're both very similar to your definition.

Replies from: kabir-kumar
comment by Kabir Kumar (kabir-kumar) · 2024-11-05T01:27:59.334Z · LW(p) · GW(p)

I pretty much agree. I prefer rigid definitions because they're less ambiguous to test and more robust to deception. And this field has a lot of deception.

comment by Kabir Kumar (kabir-kumar) · 2024-11-16T16:00:01.375Z · LW(p) · GW(p)

give better names to actual formal math things, jesus christ.