Exactly. It depends on the level of effort required to produce the outcome the creator didn't intend. If grandma would have to be drugged or otherwise put into an extreme situation before showing any violent tendencies, then we don't consider her a dangerous person. Someone else might also be peaceful in ideal circumstances, but if they can be easily provoked to violence by mild insults, then it's fair to say they're a violent person, i.e. misaligned.
Given this, I think it's really useful to see the kinds of prompts people are using to get unintended behaviour from ChatGPT / Bing Chat. If little effort is required to provoke unwanted behaviour (unwanted from the point of view of the creators or of general human values), then the model is not sufficiently aligned. It's especially concerning if bad outcomes can plausibly be elicited by mistake, even when the specific example was found by someone deliberately searching for it.
Of course, in the case of the kitchen knife, misuse is easy, which is why we have laws around purchasing and carrying knives in public. Similarly for cars, guns, etc. AI applications need to prove they're safer than a kitchen knife if they are to be used by the general public without controls. For OpenAI et al., surely the point is to show that regulation is not required, rather than to achieve alignment perfection.