Who determines whether an alignment proposal is the definitive alignment solution?

post by MiguelDev (whitehatStoic) · 2023-10-03T22:39:23.700Z · LW · GW · 6 comments

This is a question post.

What is the established process, or what could be a potential process, for validating a highly probable alignment solution? I assume that a robust alignment solution would undergo some form of review, so who is responsible for reviewing these proposals? Is LessWrong the platform for this? Or is there a specialized communication network that researchers can access should such a situation arise?

 

For now, the option I see is that the burden falls on the researcher to demonstrate the solution in a real-world setting and have it peer reviewed. However, this approach risks exposing the techniques used, making them available for capabilities research as well.

 

If there is already a post addressing this question, please share it here. Thank you.

Answers

6 comments

comment by thenoviceoof · 2023-10-06T02:02:30.575Z · LW(p) · GW(p)

Thoughts on the different sub-questions, from someone who doesn't work professionally in AI safety:

  • "Who is responsible?" Legally, no one has this responsibility (in the way that, say, the FDA is legally responsible for evaluating drugs). Hopefully, in the near future, the UK AI task force will be competent and have the jurisdiction and mandate to do so within the UK, and, even more hopefully, more countries will have similar organizations (or an international organization will exist).
  • Alternative "responsible" take: I'm sure if you managed to get the attention of OpenAI / DeepMind / Anthropic safety teams with an actual alignment plan and it held up to cursory examination, they would consider it their personal responsibility to humanity to evaluate it more rigorously. In other words, it might be good to define what you mean by responsibility (are we trying to find a trusted arbiter? Find people that are competent to do the evaluation? Find a way to assign blame if things go wrong? Ideally these would all be the same person/organization, but it's not guaranteed).
  • "Is LessWrong the platform for [evaluating alignment proposals]?" In the future, I sure hope not. If LW is still the best place to do evaluations when alignment is plausibly solvable, then... things are not going well. A negative/do-not-launch evaluation means nothing without the power to do something about it, and LessWrong is just an open collaborative blogging platform and has very little actual power.
  • That said, LessWrong (or the Alignment Forum) is probably the best current discussion place for alignment evaluation ideas.
  • "Is there a specialized communication network[?]" I've never heard of such a network, unless you include simple gossip. Of course, the PhDs might be hiding one from all non-PhDs, but it seems unlikely.
  • "... demonstrate the solution in a real-world setting..." It needs to be said: please do not run potentially dangerous AIs *before* the review step.
  • "... have it peer reviewed." If we shouldn't share evaluation details due to capability concerns (I reflexively agree, but haven't thought too deeply about it), this makes LessWrong a worse platform for evaluations, since it's completely open, both for access and to new entrants.
Replies from: whitehatStoic
comment by MiguelDev (whitehatStoic) · 2023-10-06T02:44:55.015Z · LW(p) · GW(p)

Unfortunately, I'm not based in the UK. However, the UK government's prioritization of the alignment problem is commendable, and I hope their efforts continue to yield positive results.

 

(are we trying to find a trusted arbiter? Find people that are competent to do the evaluation? Find a way to assign blame if things go wrong? Ideally these would all be the same person/organization, but it's not guaranteed).

 

(Are we attempting to identify a trusted mediator? Are we seeking individuals competent enough for evaluation? Or are we trying to establish a mechanism to assign accountability should things go awry? Ideally, all these roles would be fulfilled by the same entity or individual, but it's not necessarily the case.)

I understand your point, but it seems that we need a specific organization or team designed for such operations. Why did I pose the question initially? I've developed a prototype for a shutdown mechanism, which involves a potentially hazardous step. This prototype requires assessment by a reliable and skilled team. From my observations of discussions on LW, it appears there's a "clash of agendas" that takes precedence over the principle of "preserving life on earth." Consequently, this might not be the right platform to share anything of a hazardous nature.

Thank you for taking the time to respond to my inquiry.

comment by Mo Putera (Mo Nastri) · 2023-09-26T07:58:48.281Z · LW(p) · GW(p)

Why was this (sincere afaict) question downvoted?

Replies from: M. Y. Zuo, whitehatStoic
comment by M. Y. Zuo · 2023-10-04T12:05:40.854Z · LW(p) · GW(p)

It does seem a bit odd, so I strongly upvoted the post. MiguelDev seems to be a genuine poster, or at least far from a troll/spammer/etc.

Replies from: whitehatStoic
comment by MiguelDev (whitehatStoic) · 2023-10-05T02:55:11.087Z · LW(p) · GW(p)

Even if no one knows the answer to this question, it is still worth exploring. That is why I find it weird how this platform operates, not allowing the freedom to discuss important problems.

comment by MiguelDev (whitehatStoic) · 2023-09-27T09:40:09.138Z · LW(p) · GW(p)

It is very interesting how questions can be promoted or not promoted here on LessWrong by the moderators.