Talk and Q&A - Dan Hendrycks - Paper: Aligning AI With Shared Human Values. On Discord at Aug 28, 2020 8:00-10:00 AM GMT+8. 2020-08-14T23:57:35.224Z


Comment by wassname on Developmental Stages of GPTs · 2020-08-11T15:04:09.862Z · LW · GW

are there any alignment approaches that we could try out on GPT-3 in simplified form?

For a start, you could see how it predicts or extrapolates moral reasoning. The datasets I've seen for that are "Moral Machine" and "Am I the Asshole?" on Reddit.

EDIT: Something like this was just released: Aligning AI With Shared Human Values.

Comment by wassname on Are we in an AI overhang? · 2020-08-10T00:32:52.814Z · LW · GW

One thing they could have achieved was dataset and leaderboard creation (MS COCO, GLUE, and ImageNet, for example). These tend to focus and accelerate research, and they remain useful for a long time, as long as they are chosen wisely.

Predicting and extrapolating human preferences is a task that is part of nearly every AI alignment strategy. Yet we have few datasets for it; the only ones I found are:

So this hypothetical ML Engineering approach to alignment might have achieved some simple wins like that.

EDIT: Something like this was just released: Aligning AI With Shared Human Values.

Comment by wassname on Machine Learning Analogy for Meditation (illustrated) · 2019-08-10T08:18:00.094Z · LW · GW

You mentioned that this metaphor should also include world models. I can help there.

Many world models try to predict the next state of the world given the agent's action. With curiosity-driven exploration, the agent tries to explore in a way that maximizes its reduction of surprise, allowing it to learn about its effect on the world. Why not just maximize surprise? Because we want surprise we can learn to decrease, not just the constant surprise of a TV showing static.

This means they place an exploration reward on finding novel states. Specifically, novel states that are due to the agent's own actions, since those are the most salient. We could rephrase this as "novel changes the agent has control over". But what counts as an action, and what can the agent control?
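As a concrete (and heavily simplified) sketch of this idea: the toy code below is my own illustration, not any published curiosity method. It assumes a linear one-dimensional world model, and the names (`predict`, `intrinsic_reward`, `update_model`) are invented for the example. The curiosity bonus is simply the world model's prediction error, which the agent can drive down by learning.

```python
import random

random.seed(0)

# Toy 1-D world model: next_state ~= w_s * state + w_a * action.
w_s = random.uniform(-0.1, 0.1)
w_a = random.uniform(-0.1, 0.1)

def predict(state, action):
    return w_s * state + w_a * action

def intrinsic_reward(state, action, next_state):
    """Curiosity bonus = squared prediction error of the world model.
    Transitions the model can't yet predict (novel ones, often caused
    by the agent's own actions) score high; familiar ones score low."""
    return (predict(state, action) - next_state) ** 2

def update_model(state, action, next_state, lr=0.1):
    """One gradient step on the prediction error. Learning shrinks the
    surprise at this transition, so the bonus here decays over time."""
    global w_s, w_a
    err = predict(state, action) - next_state
    w_s -= lr * err * state
    w_a -= lr * err * action
```

Repeatedly visiting the same transition makes its curiosity reward decay toward zero, which is exactly the "surprise we can learn to decrease" property described above.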

Meditation changes where we draw the boundary between the agent and the environment. The no-self insight lets you view thoughts as external things arising outside of your control. The impermanence insight lets you view more things as outside your control.

These two changes in perspective mean that the agent no longer experiences negative reward for states it now believes it has no control over. It can also reward-hack its own thoughts, since they are now "external" and thus targets of exploration rewards. Previously it could only learn patterns of thought with reference to some external goal; now it can learn a pattern of thought directly.

Disclaimer: world models and curiosity-driven exploration are at an early stage, and probably have a poor correspondence to how our brains work. There are quite a few unsolved problems like the noisy TV problem.
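To make the noisy TV problem concrete, here is a small toy sketch (my own construction, not a standard benchmark): a "channel" whose next observation is pure noise. Even the best possible predictor of i.i.d. noise (a running mean) never improves, so a prediction-error curiosity bonus keeps paying out forever.

```python
import random

random.seed(0)

# A "noisy TV": each observation is i.i.d. noise, unaffected by actions.
def noisy_tv_observation():
    return random.uniform(-1.0, 1.0)

# The best predictor of i.i.d. noise is its mean; track it online.
mean, n = 0.0, 0

def surprise(observation):
    """Squared prediction error, followed by an online mean update."""
    global mean, n
    err = (mean - observation) ** 2
    n += 1
    mean += (observation - mean) / n
    return err

errors = [surprise(noisy_tv_observation()) for _ in range(5000)]

# Average surprise stays near the noise variance (1/3 for Uniform[-1, 1])
# instead of decaying toward zero: the agent stays "curious" forever.
late_avg = sum(errors[-1000:]) / 1000
```

Contrast this with a transition the agent can actually model: there, the prediction error (and hence the curiosity bonus) shrinks with learning, while the noisy TV's never does.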

Comment by wassname on Spamming Micro-Intentions to Generate Willpower · 2019-05-26T05:15:48.892Z · LW · GW

I've been using this for meditation too, but it's interesting to see it formulated for wider application. It seems to reduce resistance for me. Some other comments mentioned how this mirrors how addictions seem to work, but it also mirrors how advertisements, and even reading about something, work.