What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

post by Roko · 2024-10-19T06:11:12.602Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    6 Dave Orr
    6 Michael Roe
    4 Charlie Steiner
    3 Noosphere89
None
No comments

What actual bad outcome has "ethics-based" AI Alignment prevented in the present or near past? By "ethics-based" AI Alignment I mean optimization applied to LLM-derived AIs with the intent of making them safer, more ethical, harmless, etc.

Not future AIs, but AIs that already exist. What bad thing would have happened if they hadn't been RLHF'd and given restrictive system prompts?

Answers

answer by Dave Orr · 2024-10-19T16:08:20.730Z · LW(p) · GW(p)

You may recall certain news items last February around Gemini and diversity that wiped many billions off Google's market cap.

There's a clear financial incentive to make sure that models say things within expected limits.

There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/

comment by Roko · 2024-10-19T17:19:25.121Z · LW(p) · GW(p)

This is not to do with ethics though?

Air Canada Has to Honor a Refund Policy Its Chatbot Made Up

This is just the model hallucinating?

Replies from: Linch
comment by Linch · 2024-10-20T00:32:10.642Z · LW(p) · GW(p)

They were likely using techniques inferior to RLHF to implement ~Google corporate standards; not sure what you mean by "ethics-based," presumably they have different ethics than you (or LW) do, but intent alignment has always been about doing what the user/operator wants, not about solving ethics.

Replies from: Roko
comment by Roko · 2024-10-20T16:33:29.330Z · LW(p) · GW(p)

alignment has always been about doing what the user/operator wants

Well, it has often been about not doing what the user wants, actually.

answer by Michael Roe · 2024-10-19T14:21:41.865Z · LW(p) · GW(p)

Well, we had that guy who tried to assassinate the Queen of England with a crossbow because his AI girlfriend told him to. That was clearly a harm to him, and could have been one for the Queen.


We don't know how much more "But the AI told me to kill Trump" we'd have with less alignment, but it's a reasonable guess (given the Replika datapoint) that it might not be zero.

comment by Roko · 2024-10-19T17:20:15.986Z · LW(p) · GW(p)

his AI girlfriend told him to

Which AI told him this? What exactly did it say? Had it undergone RLHF for ethics/harmlessness?

Replies from: michael-roe
comment by Michael Roe (michael-roe) · 2024-10-19T17:55:55.083Z · LW(p) · GW(p)

Replika, I think.

Replies from: michael-roe
comment by Michael Roe (michael-roe) · 2024-10-19T18:00:31.514Z · LW(p) · GW(p)

https://www.bbc.co.uk/news/technology-67012224

Replies from: Roko
comment by Roko · 2024-10-20T16:36:27.334Z · LW(p) · GW(p)

ok, so from the looks of that, it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I am looking for.

comment by Michael Roe (michael-roe) · 2024-10-19T14:33:53.439Z · LW(p) · GW(p)

“self-reported data from demons is questionable for at least two reasons”—Scott Alexander.

He was actually talking about Internal Family Systems, but you could probably be skeptical about what malign AIs are telling you, too.

answer by Charlie Steiner · 2024-10-19T08:12:58.095Z · LW(p) · GW(p)

I'm unsure what you're either expecting or looking for here.

There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.

Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even less local good outcome, though it's hard to tell how big a deal this will have been.

comment by Roko · 2024-10-19T17:17:00.833Z · LW(p) · GW(p)

Why would less RL on ethics reduce productivity? Most work use of AI has nothing to do with ethics.

In fact, since RLHF decreases model capability AFAIK, would skipping it actually increase productivity because the models would be better?

answer by Noosphere89 · 2024-10-19T15:12:40.936Z · LW(p) · GW(p)

Basically, the answer is the prevention of another Sydney.

For an LLM, alignment properly speaking lives in the simulated characters, not in the simulation engine itself, so alignment strategies like RLHF work by upweighting aligned simulated characters and downweighting misaligned ones.
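As a toy illustration of that upweighting/downweighting framing (my own sketch, not anything from this thread or any lab's actual pipeline): the KL-regularized objective commonly used in RLHF has a closed-form optimum in which each behavior keeps its base-model probability multiplied by exp(reward / beta). The character names, reward-model scores, and beta value below are all hypothetical.

```python
# Toy sketch of RLHF-style reweighting over "simulated characters".
# Hypothetical numbers only; real RLHF operates on token-level policies,
# not an explicit list of personas.
import math

# Base model's (assumed) prior over which character it simulates.
prior = {"helpful_assistant": 0.45, "neutral_narrator": 0.35, "sydney_like": 0.20}
# Reward model's (assumed) score for each character; higher = judged more aligned.
reward = {"helpful_assistant": 1.0, "neutral_narrator": 0.3, "sydney_like": -2.0}

beta = 1.0  # KL-penalty strength: smaller beta pushes harder toward high reward

# Closed-form optimum of reward maximization with a KL penalty to the base model:
#   posterior(c) ∝ prior(c) * exp(reward(c) / beta)
unnorm = {c: p * math.exp(reward[c] / beta) for c, p in prior.items()}
z = sum(unnorm.values())
posterior = {c: w / z for c, w in unnorm.items()}

for c in prior:
    print(f"{c:20s} prior={prior[c]:.2f} -> posterior={posterior[c]:.2f}")
```

With these made-up numbers the aligned characters gain probability mass and the Sydney-like one drops to roughly 2%, which is the sense in which RLHF "upweights" and "downweights" personas rather than rewriting the simulation engine itself.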

While the characters Sydney produced were pretty obviously scheming, it turned out that the entire reason for the catastrophic misalignment was that no RLHF was used on GPT-4 at the time, with at best some light fine-tuning. So this could very easily be described as a success story for RLHF, and now that I think about it, that actually makes me think RLHF had more firepower to change things than I realized.

I'm not sure how this generalizes to more powerful AI, because the mechanism behind Sydney's misaligned simulated characters is obviated by fully synthetic data loops, but it's still a fairly powerful alignment success.

The full details are below:

https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K [LW(p) · GW(p)]

comment by Roko · 2024-10-19T17:18:27.337Z · LW(p) · GW(p)

prevention of another Sydney.

But concretely, what bad outcomes eventuated because of Sydney?

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-10-19T17:20:39.539Z · LW(p) · GW(p)

Mostly, IMO, the concrete bad outcomes were PR and monetary costs.

Replies from: Roko
comment by Roko · 2024-10-20T16:34:53.710Z · LW(p) · GW(p)

ok, but this is sort of circular reasoning because the only reason people freaked out is that they were worried about AI risk.

I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, which isn't just people getting worried about AI risk.

No comments
