How would you improve ChatGPT's filtering?

noah-scales

post by Noah Scales · 2022-12-10T08:05:19.493Z

This is a question post.


I am wondering how Less Wrong would improve ChatGPT's filtering. I'm reading through the comments on breaking OpenAI's filtering and see plenty of analysis of the weaknesses of the safeguards. There's always the chance that some group could steal ChatGPT's source code and remove ad hoc additions to it, so I'll ask the question in this form:

How would you change ChatGPT's purpose, design, or function to enforce topic and content filtering of its output?

Thanks for your thoughts.

Answers

answer by Peter Chatain · 2022-12-10T19:36:47.799Z

Although this isn’t a direct answer, I think something changed recently with ChatGPT that makes it much better at filtering out illegal advice. It appears to be more complex than simply running a filter over the words in the prompt or the words in ChatGPT’s output. By recent, I mean in the last 24 hours, and many tricks to “jailbreak” ChatGPT no longer work.

It gives the impression that they changed how it is trained so that it declines to provide illegal information.

↑ comment by ChristianKl · 2022-12-16T13:09:58.229Z

It feels to me like the update today made it even better at filtering out answers that OpenAI doesn't want it to give.

It seems to me like they basically operate as follows:

"Have an AI that flags whether or not a prompt or an answer violates the rules. Mark the text red if it does. Offer the user a way to say that text was marked wrongly as violating the rules."

This then gives them training data they can use to improve their filtering. Given how much ChatGPT is used, this method will allow them to filter out more and more of what they want to filter out.
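The flag-and-feedback loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenAI's actual system: `toy_classifier`, `FlagFeedbackLoop`, `review`, and `report_wrong_flag` are hypothetical names, and the substring check stands in for whatever learned moderation model would really do the flagging. The point is only the data flow: flag text, show flagged text in red, and turn user reports of wrong flags into labeled training examples.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


# Hypothetical placeholder for a learned rule-violation classifier.
# A real system would use a trained model; this substring check exists
# only so the sketch runs end to end.
def toy_classifier(text: str) -> bool:
    banned = {"lockpick", "explosive"}
    lowered = text.lower()
    return any(term in lowered for term in banned)


@dataclass
class FlagFeedbackLoop:
    classifier: Callable[[str], bool]
    # Collected (text, label) pairs; label False means a human said
    # the text does NOT actually violate the rules.
    training_data: List[Tuple[str, bool]] = field(default_factory=list)

    def review(self, text: str) -> dict:
        """Flag a prompt or an answer; flagged text is marked red."""
        flagged = self.classifier(text)
        return {
            "text": text,
            "flagged": flagged,
            "color": "red" if flagged else "default",
        }

    def report_wrong_flag(self, reviewed: dict) -> None:
        """User says the text was wrongly flagged; keep it as a training example."""
        if reviewed["flagged"]:
            self.training_data.append((reviewed["text"], False))


loop = FlagFeedbackLoop(toy_classifier)
result = loop.review("Which lockpick brands are legal to buy?")
loop.report_wrong_flag(result)  # user disputes the red flag
```

The collected `training_data` would then feed a later retraining step for the classifier, which is outside the scope of this sketch.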

↑ comment by Noah Scales · 2022-12-17T09:56:46.381Z

Huh, ok. I will have to check out the new version. Thanks!

↑ comment by Noah Scales · 2022-12-11T09:18:05.572Z

Hmm, that's interesting. Thanks Peter!

answer by JBlack · 2022-12-12T05:29:30.675Z

I would improve the filtering by reducing it to zero.

↑ comment by Noah Scales · 2022-12-12T08:45:46.317Z

Interesting, and why is that an improvement?
