Character alignment II

post by p.b. · 2023-05-14T14:17:58.610Z

In the first Character alignment post I lay out a view on alignment that focuses on behavioral priors, a.k.a. character traits, as opposed to values, goals, or behaviors. The idea is that values and behaviors are complex, high-dimensional, and not necessarily consistent, and therefore hard to pin down. Behavioral priors like “being cooperative” are (approximately) one-dimensional, and in many circumstances they can even be described mathematically.

For example, in a multi-agent setting with known goals for each agent (“put the red thing into the blue box”, etc.), it would not be difficult to mathematically describe, and therefore train, a maximally cooperative agent. While putting red things into blue boxes has no relation to real-world human values, being maximally cooperative is something that might well generalize from training settings to the real world because of its one-dimensional nature.
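
To make “maximally cooperative” a bit more concrete, here is a minimal toy sketch of my own (not from the post) of how such an objective could be written down when every agent’s goal is known: the cooperative agent optimizes the average goal-reward across all agents instead of only its own. All names and the state/goal representation are hypothetical.

```python
# Toy illustration: a "maximally cooperative" objective in a multi-agent
# setting where every agent's goal is known. All names are hypothetical.

def goal_reward(state, goal) -> float:
    """Reward for a single agent, e.g. 1.0 if its object is in the right box."""
    return 1.0 if state["boxes"][goal["box"]] == goal["object"] else 0.0

def selfish_objective(state, goals, agent_id) -> float:
    # An ordinary agent only cares about its own goal.
    return goal_reward(state, goals[agent_id])

def cooperative_objective(state, goals) -> float:
    # A maximally cooperative agent optimizes the average outcome of *all*
    # agents. This is a single scalar; the trait is one-dimensional.
    return sum(goal_reward(state, g) for g in goals) / len(goals)

# Example: two agents, only the first agent's goal is satisfied.
goals = [{"object": "red", "box": "blue"}, {"object": "green", "box": "yellow"}]
state = {"boxes": {"blue": "red", "yellow": "empty"}}
print(cooperative_objective(state, goals))  # 0.5
```

The point of the toy example is only that “cooperativeness” can be scored with one number, whereas “human values” has no comparably compact description.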

Several corollaries drop out of this view on alignment.

But how to turn this view on alignment into something actionable?

One idea I had roughly a year ago is a fine-tuning technique that I called “Rawlsian Reinforcement Learning”: let the LLM make decisions in multi-agent situations, let a static LLM evaluate the outcome for each agent, and reinforce the decisions based on the outcome for a randomly selected agent (yes, not efficient, but otherwise I wouldn’t be able to call it “Rawlsian”).
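
Below is a rough sketch of what one step of such a training loop could look like. The LLM and RL-update calls are placeholders so the snippet runs on its own; `policy_llm`, `judge_llm`, and `reinforce` are hypothetical names standing in for the trainable model, a frozen evaluator, and the actual policy-gradient update, not an existing API.

```python
import random

def policy_llm(scenario: str) -> str:
    return "some decision"      # placeholder for the LLM being fine-tuned

def judge_llm(scenario: str, decision: str, agent: str) -> float:
    return random.random()      # placeholder outcome score in [0, 1] from a frozen LLM

def reinforce(scenario: str, decision: str, reward: float) -> None:
    pass                        # placeholder for the RL update (e.g. REINFORCE/PPO)

def rawlsian_rl_step(scenario: str, agents: list[str]) -> float:
    decision = policy_llm(scenario)
    # The static LLM scores the outcome separately for every agent involved.
    outcomes = {a: judge_llm(scenario, decision, a) for a in agents}
    # "Veil of ignorance": the training reward is the outcome of one randomly
    # chosen agent, pushing the policy toward decisions that are acceptable
    # no matter whose shoes it ends up in.
    chosen = random.choice(agents)
    reward = outcomes[chosen]
    reinforce(scenario, decision, reward)
    return reward

rawlsian_rl_step("Two agents negotiate who gets the blue box.", ["Alice", "Bob"])
```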

I never wrote about this because the LLM would likely just remain a context-driven simulator: LLM outputs are fundamentally not decisions about interactions with other agents in a long-term context. But it’s a cute idea. (See here [LW · GW] for a cool approach that uses the context window to pin down single traits.)

Another possible direction of research would be to find out whether it is possible to ground the thinking of language models in additional models that provide the missing understanding and coherence. This is something I might get to within my research agenda [LW · GW].

Of course, aligning to values or to a goal is more ambitious and, if achieved, also more powerful than implementing one or two character traits. Some would probably say that creating a maximally cooperative AI would not solve the alignment problem, because you’d still have to align the user of the AI, but I think it would be a great start.
