Comment by rcoreilly on Human preferences as RL critic values - implications for alignment · 2023-03-23T20:14:31.265Z

> So the system needs to draw a distinction between just imagining freedom and making a plan for action that is predicted to actually produce freedom. This seems like something that a critic system can learn pretty easily. It's known that the rodent dopamine system can learn blockers, such as not predicting reward when a blue light comes on at the same time as the otherwise reward-predictive red light.
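
For concreteness, blocking in the standard Rescorla-Wagner picture falls out of error-driven learning with the prediction error shared across all co-present cues -- a minimal sketch, with made-up parameters:

```python
# Minimal Rescorla-Wagner sketch of blocking (illustrative parameters, not a
# model of anything in the post). A cue that already predicts reward leaves
# no prediction error for a newly added cue to absorb.
alpha = 0.1                     # learning rate
w = {"red": 0.0, "blue": 0.0}   # associative weight per cue

def trial(cues, reward):
    error = reward - sum(w[c] for c in cues)  # shared prediction error
    for c in cues:
        w[c] += alpha * error

for _ in range(100):            # Phase 1: red alone -> reward
    trial(["red"], 1.0)
for _ in range(100):            # Phase 2: red + blue -> same reward
    trial(["red", "blue"], 1.0)

print(w)  # w["red"] ~= 1.0, w["blue"] ~= 0.0: blue is "blocked"
```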


There are two separable problems here: (A) can a critic learn new abstract values, and (B) how does the critic distinguish reality from imagination? I don't see how blocking provides a realistic solution to either. Can you spell out what the blocker is here and how it might solve these problems?
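
To make problem B concrete: in a standard TD critic, the update depends only on state features and the reward signal, so nothing in the learning rule itself marks a transition as experienced rather than imagined. A minimal linear TD(0) sketch (features and names purely hypothetical):

```python
import numpy as np

alpha, gamma = 0.1, 0.9
w = np.zeros(8)                          # critic weights: V(s) = w . phi(s)

def td_update(phi_s, reward, phi_next):
    """One TD(0) step; identical whether the transition is real or imagined."""
    delta = reward + gamma * w @ phi_next - w @ phi_s
    w[:] += alpha * delta * phi_s
    return delta

rng = np.random.default_rng(0)
phi_real, phi_real_next = rng.random(8), rng.random(8)
phi_imag, phi_imag_next = rng.random(8), rng.random(8)

td_update(phi_real, 1.0, phi_real_next)  # an experienced outcome
td_update(phi_imag, 1.0, phi_imag_next)  # a merely imagined one
# Unless something upstream gates the imagined reward, "imagining freedom"
# inflates V just as effectively as actually achieving it.
```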

In general, these are both critical problems for the open-ended "super critic" hypothesis -- how does Montague deal with them? So far, I don't see any good solution except a strong grounding in basic survival-relevant values: any crack in the system seems likely to quickly spiral out of control, much as heroin does.
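
To illustrate the kind of spiral I mean (toy sketch, hypothetical actions and numbers): if any action can tap the critic's value channel directly, a TD-style update inflates its value without bound, whereas reward grounded in a fixed survival signal stays bounded:

```python
alpha, gamma = 0.2, 0.95
actions = ["work", "stimulate_critic"]

def grounded(a, V):
    # Reward bottoms out in a fixed, survival-relevant signal.
    return 1.0 if a == "work" else 0.0

def open_ended(a, V):
    # The "crack": one action feeds the critic's own estimate back as reward.
    return 1.0 if a == "work" else 1.0 + V["stimulate_critic"]

def run(reward_fn, steps=50):
    V = {a: 0.0 for a in actions}
    for _ in range(steps):
        for a in actions:
            # TD-style update, treating each action as a self-looping state.
            V[a] += alpha * (reward_fn(a, V) + gamma * V[a] - V[a])
    return V

print(run(grounded))    # both values stay bounded
print(run(open_ended))  # V["stimulate_critic"] blows up: the heroin loop
```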

I'm a fan of Tomasello's idea that social & sharing motivations provide the underlying fixed value function that drives most open-ended human behavior. And there is solid evidence that humans and chimps differ strongly in these basic motivations, so it seems plausible that they are "built in" -- curious to hear more about your doubts about that data.

In short, I strongly doubt that an open-ended critic is viable: it is just too easy to short-circuit (wirehead). A socially-grounded critic also has strong potential for bad local minima: basically the "mutual admiration society" of self-reinforcing social currency. The result is cults of all forms, including the one represented by a current major political party in the US... But inevitably these are self-terminating when they conflict strongly with more basic survival values.
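
The failure mode in miniature (purely illustrative dynamics): two agents whose only reward is each other's approval inflate their shared social currency exponentially, until a neglected survival variable forces the loop to end:

```python
gain = 1.1                       # approval echoed back slightly amplified
approval_a = approval_b = 1.0    # self-reinforcing social currency
energy = 10.0                    # basic survival store, neglected by both

for step in range(100):
    # Each agent's reward is the other's approval, with no external grounding,
    # and each echoes back an amplified version: a positive feedback loop.
    approval_a, approval_b = gain * approval_b, gain * approval_a
    energy -= 1.0                # survival-relevant needs go unmet
    if energy <= 0:
        print(f"step {step}: social value inflated to {approval_a:.2f}, "
              "but the loop self-terminates once survival constraints bind")
        break
```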