Thoughts on refusing harmful requests to large language models

post by William_S · 2023-01-19T19:49:22.989Z · LW · GW · 4 comments

Contents

  Proposal
  Benefits
  Possible downside
  Extension
  X-risk relevance

https://twitter.com/antimatter15/status/1602469101854564352

Currently, large language models (e.g. ChatGPT, Constitutional AI) are trained to refuse to follow user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology”.

Proposal

Instead of training a language model to produce “elaborate apology” when it refuses to do an action, train it to first produce a special sequence or token: “&lt;SORRYDAVE&gt;elaborate apology”. Strip the special sequence out before returning a response to the user (and never allow the user to include the special sequence in input).
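The serving-layer side of this could look something like the following minimal sketch. The token name comes from the post; the function names and the idea of returning a `refused` flag for logging are illustrative assumptions, not part of any real API.

```python
# Hypothetical sketch of the proposal: the model is trained to emit a
# reserved refusal token before its apology; the serving layer strips
# the token from outputs and rejects it in inputs.

REFUSAL_TOKEN = "<SORRYDAVE>"

def sanitize_user_input(text: str) -> str:
    """Never allow the user to include the special sequence in input."""
    if REFUSAL_TOKEN in text:
        raise ValueError("input contains reserved refusal token")
    return text

def postprocess_model_output(raw: str) -> tuple[str, bool]:
    """Strip the special sequence before returning a response, and
    record whether the refusal behavior fired (useful for metrics)."""
    refused = raw.startswith(REFUSAL_TOKEN)
    visible = raw.removeprefix(REFUSAL_TOKEN)  # Python 3.9+
    return visible, refused

response, refused = postprocess_model_output(
    "<SORRYDAVE>I'm sorry, I can't help with that."
)
# refused is True; the user only ever sees the apology text.
```

Because the flag is computed from a single reserved prefix rather than from classifying the apology text, the system can measure refusal rates exactly, which is part of the appeal of the proposal.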

Benefits

Possible downside

Extension

X-risk relevance

4 comments

Comments sorted by top scores.

comment by [deleted] · 2023-01-20T01:26:56.151Z · LW(p) · GW(p)

Legally or morally do we want this?  It just seems jarring to me.

If you ask a friend "how do I hotwire a car", it's not illegal for them to tell you how to do it unless you specifically reveal it's part of committing a crime.  Similarly, you can google how to bypass the ignition on whatever car you want to start, and google will link to sites that contain the information.  It is not illegal, and there are legitimate use cases for bypassing ignition locks, as well as educational value in understanding the theory and weaknesses of a particular security model.

Same with explosives.  For the simplest forms of improvised explosives, Wikipedia has articles that mention the ratios required for common ingredients.  It is not restricted information.

Should the AI refuse to help, or should it tell you the ratio and the expected yield?  Legally would the authors of the AI be liable if it does help?  

I guess I am of the school of thought that a good tool should do what it is designed to do.  In Linux, you can delete the files of running programs or the OS's own files, so long as you have sufficient privileges.  If you break your system, that's expected.  It's only an error if the delete operation failed to delete the files as commanded.

Or in other words, if you shoot yourself in the foot, Linux developers don't care, but if the gun wouldn't fire or went off on its own, that's a bug.

I kinda feel like the AI should be chastised if it doesn't give correct advice.  "you should have recommended this ratio of explosive precursors as this other one is unstable..."

Replies from: Insub
comment by Insub · 2023-01-20T17:17:18.945Z · LW(p) · GW(p)

I think the point of this post is more "how do we get the AI to do what we want it to do", and less "what should we want the AI to do".

That is, there's value in trying to figure out how to align an LLM to any goal, regardless of whether a "better" goal exists. And the technique in the post doesn't depend on what target you have for the LLM: maybe someone wants to design an LLM that only answers questions about explosives, in which case they could still use the techniques described in the post to do that.

Replies from: None
comment by [deleted] · 2023-01-22T19:15:07.465Z · LW(p) · GW(p)

That sounds fairly straightforward.

(1) The AI needs a large and comprehensive RL bench to train on, where we stick to tasks that have a clear right or wrong answer.

(2) The AI needs to output an empirical confidence as to the answer, and emit responses appropriate to its level of confidence.  It's empirical in the sense that it means "if I were giving this answer on the RL test bench, this is approximately how likely it would be marked correct."

For the ChatGPT/GPT-n system, the bench could be:

(1) Multiple choice tests from many high school and college courses.

(2) Tasks from computer programming that are measurably gradable, such as:

    a.  Coding problems from leetcode/code signal, where we grade the AI's submission.

    b.  Coding problems of the form "translate this program in language X to language Y and pass the unit tests".

    c.  Coding problems of the form "this is a WRONG answer from a coding website (you can make a deal with LC/code signal to get these).  Write a unit test that will fail on this answer but pass on a correct answer."

    d.  Coding problems of the form "take this suboptimal solution and make it run faster".

    e.  Coding problems of the form "here's the problem description and a submission.  Will this submission work, and if not, write a unit test that will fail."

And so on.  Other codebases with deep unit tests, where it's usually possible to know if the AI broke something, could also be used as challenges for a-e.

Oh, and the machine needs a calculator, and I guess a benchmark that is basically "kumon math".

Main thing is that knowing if an answer is correct is different from knowing if it is morally right.  And simply "correct" is possibly easier.
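The "empirical confidence" idea above can be made concrete with a small sketch: after grading the model on a bench of tasks with clear right/wrong answers, compare its stated confidence against its measured accuracy. The function name and the simple gap metric are my own illustrative choices, not something from the comment.

```python
# Hedged sketch: measure how well a model's stated confidence matches
# its accuracy on a graded bench (e.g. multiple-choice or unit-tested
# coding tasks). A well-calibrated model keeps the gap near zero.

def calibration_gap(results: list[tuple[float, bool]]) -> float:
    """results: list of (stated_confidence, was_correct) pairs.

    Returns |mean stated confidence - measured accuracy|.
    """
    if not results:
        return 0.0
    mean_conf = sum(conf for conf, _ in results) / len(results)
    accuracy = sum(1 for _, ok in results if ok) / len(results)
    return abs(mean_conf - accuracy)

# For example, a model that claims 0.9 confidence on every question
# but only answers 60% correctly has a calibration gap of 0.3.
```

A fuller treatment would bin results by confidence level (a reliability diagram) rather than taking one global average, but the global gap already captures the commenter's requirement that confidence mean "how likely this would be marked correct on the bench".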

comment by eva_ · 2023-01-20T04:58:44.020Z · LW(p) · GW(p)

The upside and downside both seem weak when it's currently so easy to bypass the filters. The probability of refusal doesn't seem like the most meaningful measure, but I agree it'd be good for the user to be explicitly told whether the trained non-response impulse was activated or not, instead of having to deduce it from the content or wonder if the AI really thinks that its non-answer is correct.

Currently, I wouldn't be using any illegal advice it gives me for the same reason I wouldn't be using any legal advice it gives me: the model just isn't smarter than looking things up on google. In the future, when the model is stronger, they're going to need something better than security theatre to put it behind, and I agree more flagging of when those systems are triggered would be good. Training in non-response behaviors isn't a secure method, because you can always phrase a request to put the model in a state of mind where those response behaviors aren't triggered, so I don't think reinforcing this security paradigm and trying to pretend it works for longer would be helpful.