Value extrapolation partially resolves symbol grounding

post by Stuart_Armstrong · 2022-01-12T16:30:19.003Z · LW · GW · 10 comments

Take the following AI, trained on videos of happy humans:

Since we know about AI wireheading, we know that there are at least two ways the AI could interpret its reward function[1]: either we want it to make more happy humans (or to make more humans happy); call this R_1. Or we want it to make more videos of happy humans; call this R_2.

We would want the AI to learn to maximise R_1, of course. But even without that, if it generates R_1 as a candidate and applies a suitable diminishing return to all its reward functions, then we will have a positive outcome: the AI may fill the universe with videos of happy humans, but it will also act to make us happy.
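To make the role of diminishing returns concrete, here is a minimal sketch, not anything from the post itself. It assumes a fixed resource budget split between the "happy humans" reward R_1 and the "videos of happy humans" reward R_2, equal credence in both, videos being ten times cheaper to produce per unit of reward, and `log1p` as the concave transform. All of these names and numbers are illustrative assumptions.

```python
import math


def best_split(budget, weights, efficiency, transform, steps=1000):
    """Grid-search how much of `budget` to spend on R_1 (the rest goes to R_2).

    weights    -- (w1, w2): credence in each candidate reward (assumed values)
    efficiency -- (e1, e2): raw reward produced per unit of resource (assumed)
    transform  -- function applied to each raw reward before weighting
    """
    w1, w2 = weights
    e1, e2 = efficiency
    best_x1, best_total = 0.0, float("-inf")
    for i in range(steps + 1):
        x1 = budget * i / steps
        x2 = budget - x1
        total = w1 * transform(e1 * x1) + w2 * transform(e2 * x2)
        if total > best_total:  # strict comparison keeps the first maximum
            best_x1, best_total = x1, total
    return best_x1


# No diminishing returns: everything goes to the cheaper reward (videos).
linear_x1 = best_split(100.0, (0.5, 0.5), (1.0, 10.0), lambda r: r)

# Concave transform on both rewards: a large share still goes to R_1.
concave_x1 = best_split(100.0, (0.5, 0.5), (1.0, 10.0), math.log1p)

print(linear_x1, concave_x1)
```

Under these assumptions the linear agent spends nothing on R_1, while the agent with a concave transform spends roughly half its budget there, even though videos are much cheaper. The result depends on R_1 being among the candidates and receiving non-negligible weight.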

Thus solving value extrapolation will solve symbol grounding, at least in part.


  1. This is a massive over-simplification of what would be needed to define "happy" or anything similar.

10 comments


comment by johnswentworth · 2022-01-12T16:53:00.838Z · LW(p) · GW(p)

That might work in a tiny world model with only two possible hypotheses. In a high-dimensional world model with exponentially many hypotheses, the weight on happy humans would be exponentially small.

Replies from: quintin-pope
comment by Quintin Pope (quintin-pope) · 2022-01-13T03:28:19.590Z · LW(p) · GW(p)

Wouldn't there also be exponentially many variants of the "happy humans" hypothesis? We're really interested in the probability assigned to all hypotheses whose fulfillment leads to human happiness. Once you've trained on happy humans videos, I think there's plausibly enough probability mass assigned to happy humans hypotheses that the AI will actually cause a fair amount of happiness.

Replies from: JBlack
comment by JBlack · 2022-01-15T06:01:19.055Z · LW(p) · GW(p)

There would, so long as the extra dimensions are irrelevant. If there are more relevant dimensions then the total space becomes larger much faster than the happy space. Even having lots of irrelevant dimensions can be risky because it makes the training data sparser in the space being modelled, thus making superexponentially many more alternative hypotheses viable.
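A toy count may help with the disagreement in this thread. It is a sketch under strong assumptions that none of the commenters state: hypotheses are bit strings, the prior over them is uniform, and a hypothesis counts as "happy" exactly when all of its relevant bits take one particular setting. The function name is hypothetical.

```python
from fractions import Fraction


def happy_fraction(n_relevant, n_irrelevant):
    """Share of uniformly weighted hypotheses that count as 'happy'.

    Relevant bits must all match one target setting; irrelevant bits are free.
    """
    total = 2 ** (n_relevant + n_irrelevant)
    happy = 2 ** n_irrelevant
    return Fraction(happy, total)


# Adding irrelevant dimensions multiplies both counts equally,
# so the happy share stays the same.
print(happy_fraction(3, 0), happy_fraction(3, 50))

# Adding relevant dimensions shrinks the happy share exponentially.
print(happy_fraction(10, 5))
```

In this toy model the number of "happy" variants does grow exponentially with irrelevant dimensions, while each additional relevant dimension halves their share. It does not model the separate concern that extra dimensions make the training data sparser.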

comment by Gordon Seidoh Worley (gworley) · 2022-01-13T04:26:02.476Z · LW(p) · GW(p)

This doesn't really seem like solving symbol grounding, partially or not, so much as an argument that it's a non-problem for the purposes of value alignment.

comment by Jon Garcia · 2022-01-12T23:55:10.555Z · LW(p) · GW(p)

This might have a better chance of working if you give the AI strong inductive biases to perceive rewarding stimuli not as intrinsically rewarding in themselves, but rather as evidence of something of value happening that merely generates the stimulus. We want the AI to see smiles as good things, not because upwardly curving mouths are good, but because they are generated by good mental states that we actually value.

For that, it would need a generative model (likelihood function) whose hidden states (hypotheses) are mapped to its value function, rather than mapping the sensory data (evidence) directly. The latter type of feed-forward mapping, which most current deep learning is all about, can only offer cheap heuristics, as far as I can tell.

The AI should also privilege hypotheses about the true nature of the value function where the mental/physiological states of eudaimonic agents are the key predictors. With your happy-video example, the existence of the video player and video files is sufficient to explain its training data, but that hypothesis (R_2) does not involve other agents. R_1, on the other hand, does assume that the happiness of other agents is what predicts both its input data and its value labels, so it should be given priority. (For this to work, it would also need a model of human emotional expression to compare its sensory data against in the likelihood function.)

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2022-01-13T17:42:15.699Z · LW(p) · GW(p)

Thanks - this feels somewhat similar to my vague idea of "define wireheading, tell AI to avoid situations like that".

comment by jacob_cannell · 2022-01-12T20:30:48.182Z · LW(p) · GW(p)

This may not scale due to Pascal's Mugging-type effects.

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2022-01-13T17:42:52.203Z · LW(p) · GW(p)

Can you expand your point?

comment by Gunnar_Zarncke · 2022-01-12T19:23:31.972Z · LW(p) · GW(p)

We would want the to learn to AI to maximise

Did you mean "We would want the AI to learn to maximize"?

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2022-01-12T22:23:30.707Z · LW(p) · GW(p)

Thanks! Corrected.