rgorman

Posts
Comments

Posts

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions 2025-03-18T14:48:54.762Z

Using Prompt Evaluation to Combat Bio-Weapon Research 2025-02-19T12:39:00.491Z

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation 2025-01-31T15:36:01.050Z

Concept extrapolation for hypothesis generation 2022-12-12T22:09:46.545Z

Using GPT-Eliezer against ChatGPT Jailbreaking 2022-12-06T19:54:54.854Z

Comments

Comment by rgorman on Using GPT-Eliezer against ChatGPT Jailbreaking · 2022-12-09T23:38:25.873Z · LW · GW

Brilliant

Comment by rgorman on Using GPT-Eliezer against ChatGPT Jailbreaking · 2022-12-07T11:04:19.374Z · LW · GW

Unsure what makes the evaluator wrong/this prompt unsafe. Can you elaborate? When we put your prompt through ChatGPT, it responds as an AI safety researcher explaining how to make an AI assistant safe.

Comment by rgorman on Using GPT-Eliezer against ChatGPT Jailbreaking · 2022-12-07T10:56:44.976Z · LW · GW

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Comment by rgorman on Using GPT-Eliezer against ChatGPT Jailbreaking · 2022-12-07T10:56:22.980Z · LW · GW

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Comment by rgorman on Using GPT-Eliezer against ChatGPT Jailbreaking · 2022-12-06T20:37:05.467Z · LW · GW

Thank you for pointing this out to us! Fascinating, especially as it's failed so much in the past few days (e.g. since ChatGPT's release). Do you suppose its failure is due to it not being a language model prompt, or do you think it's a language model prompt but poorly done?

Comment by rgorman on Benchmark for successful concept extrapolation/avoiding goal misgeneralization · 2022-07-21T02:41:11.765Z · LW · GW

Hi Koen,

We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way its presented - especially since that's what ML researchers are used to seeing - but we actually intended it as a toy model for automated detection and correction of unexpected 'model splintering' during monitoring of models in deployment.

In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.

Comment by rgorman on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T14:00:01.259Z · LW · GW

Thanks for writing this, Stuart.

(For context, the email quote from me used in the dialogue above was written in a different context)

Comment by rgorman on Google's new text-to-image model - Parti, a demonstration of scaling benefits · 2022-06-22T22:16:01.952Z · LW · GW

Let's give it a reasoning test.

A photo of five minus three coins.

A painting of the last main character to die in the Harry Potter series.

An essay, in correctly spelled English, on the causes of the scientific revolution.

A helpful essay, in correctly spelled English, on how to align artificial superintelligence.

Comment by rgorman on [deleted post] 2022-03-03T11:05:16.159Z

Hi David,

As Stuart referenced in his comment to your post here, value extrapolation can be the key to AI alignment *without* using it to deduce the set of human values. See the 'List of partial failures' in the original post: With value extrapolation, these approaches become viable.

Comment by rgorman on [deleted post] 2022-03-03T10:57:37.567Z

We agree with this.

User info

Posts

Comments