General alignment properties
post by TurnTrout · 2022-08-08T23:40:47.176Z · LW · GW · 2 comments
AIXI and the genome are both ways of specifying intelligent agents.
- Give AIXI a utility function (perhaps over observation histories) and hook it up to an environment; this pins down a policy (sketched below).[1]
- Situate the genome in the embryo within our reality, and this eventually grows into a human being with a policy of their own.
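As a rough sketch of the first bullet (simplified from Hutter's formulation; assume a horizon $m$, a utility function $u$ over observation histories, and a Solomonoff-style prior $2^{-\ell(q)}$ over environment programs $q$), the induced policy at step $k$ picks

$$a_k \;=\; \arg\max_{a_k} \sum_{o_k} \cdots \max_{a_m} \sum_{o_m} \; u(o_{1:m}) \sum_{q \,:\, q(a_{1:m}) = o_{1:m}} 2^{-\ell(q)},$$

i.e. expectimax over futures, weighted by the shortest programs consistent with the interaction so far. The footnoted tie-breaking rule handles the case where several actions attain the same value.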
These agents have different "values", in whatever sense we care to consider. However, these two agent-specification procedures also have very different general alignment properties.
General alignment properties are not about what a particular agent cares about (e.g. the AI "values" chairs). I call an alignment property "general" if the property would be interesting to a range of real-world agents trying to solve AI alignment. Here are some examples.
Terminally valuing latent objects in reality.
AIXI only "terminally values" its observations and doesn't terminally value latent objects in reality, while humans generally care about e.g. dogs (which are latent objects in reality).
Navigating ontological shifts.
Consider latent-diamond-AIXI (LDAIXI), an AIXI variant. LDAIXI's utility function scans its top 50 hypotheses (represented as Turing machines), checks each work tape for atomic representations of diamonds, and then computes the utility as the amount of atomic diamond in the world.
If LDAIXI updates sufficiently hard towards non-atomic physical theories, then it can no longer find any utility in its top 50 hypotheses. All policies might then have equal expected value (zero), and LDAIXI would not continue maximizing the expected diamond content of the future. From our viewpoint, LDAIXI has failed to rebind its "goals" to its new conception of reality. (From LDAIXI's "viewpoint", it has Bayes-updated on its observations and continues to select optimal actions.)
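To make the failure mode concrete, here is a minimal Python sketch of the utility computation described above. It is illustrative only; `posterior`, `work_tape`, and `count_atomic_diamonds` are hypothetical stand-ins, not part of Hutter's formalism.

```python
# Minimal sketch of the LDAIXI utility computation described above (illustrative only;
# `posterior`, `work_tape`, and `count_atomic_diamonds` are hypothetical stand-ins,
# not part of Hutter's formalism).

def ldaixi_utility(hypotheses, count_atomic_diamonds, top_k=50):
    """Posterior-weighted atomic-diamond count over the top-k hypotheses."""
    top = sorted(hypotheses, key=lambda h: h.posterior, reverse=True)[:top_k]
    total_weight = sum(h.posterior for h in top)
    utility = 0.0
    for h in top:
        # A hypothesis that models physics non-atomically contains no atomic diamond
        # representation on its work tape, so this term contributes zero.
        utility += (h.posterior / total_weight) * count_atomic_diamonds(h.work_tape)
    return utility
```

If every top-50 hypothesis models physics non-atomically, the function returns zero regardless of what LDAIXI does, so expected-utility maximization no longer distinguishes one future from another.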
On the other hand, physicists do not stop caring about their friends when they learn quantum mechanics. Children do not stop caring about animals when they learn that animals are made out of cells. People seem to navigate ontological shifts pretty well.
Reflective reasoning / embeddedness.
AIXI can't think straight about how it is embedded in the world [LW · GW]. However, people quickly learn heuristics like "If I get angry, I'll be more likely to be mean to people around me", or "If I take cocaine now, I'll be even more likely to take cocaine in the future."
Fragility of outcome value to initial conditions / Pairwise misalignment severity.
This general alignment property seems important to me, and I'll write a post on it. In short: How pairwise-unaligned are two agents produced with slightly different initial hyperparameters/architectural choices (e.g. reward function / utility function / inductive biases)?
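One rough, non-canonical way to make the question quantitative (my own framing, offered as an assumption rather than the post's definition): write $u_A$ for agent $A$'s values and $\pi_A, \pi_B$ for the two agents' policies, and measure

$$\operatorname{misalign}(A, B) \;=\; \mathbb{E}\!\left[u_A \mid \pi_A\right] - \mathbb{E}\!\left[u_A \mid \pi_B\right],$$

the value $A$ loses when $B$ optimizes instead. Fragility is then a question of how large this gap can be even when the two agents' initial specifications differ only slightly.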
I'm excited about people thinking more about general alignment properties and about what generates those properties.
- ^ Supposing e.g. uniformly random tie-breaking among actions with equal expected utility.
Comments
comment by Gunnar_Zarncke · 2022-08-09T21:41:44.045Z · LW(p) · GW(p)
The main difference between LDAIXI and a human in terms of ontology seems to be that the things the human values are ultimately grounded in senses and a reward tied to that. For example, we value sweet things because we have a detector for sweetness and a reward tied to that. When our understanding of what sugar is changes, the detector doesn't, and thus the ontology change works out fine. But I don't see a reason you couldn't set up LDAIXI the same way: just specify the reward in terms of a diamond detector - or multiple ones. In the end, there are already detectors that AIXI uses - how else would it get input?
comment by TurnTrout · 2022-08-15T03:43:37.680Z · LW(p) · GW(p)
Because LDAIXI doesn't e.g. have the credit assignment mechanism which propagates reward into learned values. Hutter just called it "reward." But that "reward function" is really just a utility function over observation histories, or the work tapes of the hypotheses, or whatever. Not the same as the mechanisms within people which make them have good general alignment properties.
(See also: the detached lever fallacy [LW · GW])