comment by paulfchristiano ·
2021-08-05T20:04:19.269Z · LW(p) · GW(p)
(To restate the obvious, all of the stuff here is extremely WIP and rambling.)
I've often talked about the case where an unaligned model learns a description of the world + the procedure for reading out "what the camera sees" from the world. In this case, I've imagined an aligned model starting from the unaligned model and then extracting additional structure.
It now seems to me that the ideal aligned behavior is to learn only the "description of the world" and then have imitative generalization take it from there, identifying the correspondence between the world we know and the learned model. That correspondence includes in particular "what the camera sees."
The major technical benefit of doing it this way is that we end up with a higher prior probability on the aligned model than the unaligned model---the aligned one doesn't have to specify how to read out observations. And specifying how to read out observations doesn't really make it easier to find that correspondence.
We still need to specify how the "human" in imitative generalization actually finds this correspondence. So this doesn't fundamentally change any of the stuff I've recently been thinking about, but I think that the framing is becoming clearer and it's more likely we can find our way to the actually-right way to do it.
It now seems to me that a core feature of the situation that lets us pull out a correspondence is that you can't generally have two equally-valid correspondences for a given model---the standards for being a "good correspondence" are such that it would require crazy logical coincidence, and in fact this seems to be the core feature of "goodness." For example, you could have multiple "correspondences" that effectively just recompute everything from scratch, but by exactly the same token those are bad correspondences.
(This obviously only happens once the space and causal structure is sufficiently rich. There may be multiple ways of seeing faces in clouds, but once your correspondence involves people and dogs and the people talking about how the dogs are running around, it seems much more constrained because you need to reproduce all of that causal structure, and the very fact that humans can make good judgments about whether there are dogs implies that everything is incredibly constrained.)
There can certainly be legitimate ambiguity or uncertainty. For example, there may be a big world with multiple places that you could find a given pattern of dogs barking at cats. Or there might be parts of the world model that are just clearly underdetermined (e.g. there are two identical twins and we actually can't tell which is which). In these cases the space of possible correspondences still seems effectively discrete, rather than being a massive space parameterized as neural networks or something. We'd be totally happy surfacing all of the options in these cases.
There can also be a bunch of inconsequential uncertainty, things that feel more like small deformations of the correspondence than moving to a new connected component in correspondence-space. Things like slightly adjusting the boundaries of objects or of categories.
I'm currently thinking about this in terms of: given two different correspondences, why is it that they manage to both fit the data? Options:
- They are "very close," e.g. they disagree only rarely or make quantitatively similar judgments.
- One of them is a "bad correspondence" and could fit a huge range of possible underlying models, i.e. it's basically introducing the structure we are interested in within the correspondence itself.
- The two correspondences are "not interacting," they aren't competing to explain the same logical facts about the underlying model. (e.g. a big world, one correspondence faces .)
- There is an automorphism of my model of the world (e.g. I could exchange the two twins Eva and Lyn), and can compose a correspondence with that automorphism. (This seems much more likely to happen for poorly-understood parts of the world, like how we talk about new physics, than for simple things like "is there a cat in the room.")
I don't know where all of this ends up, but I feel some pretty strong common-sense intuition like "If you had some humans looking at the model, they could recognize a good correspondence when they saw it" and for now I'm going to be following that to see where it goes.
I tentatively think the whole situation is basically the same for "intuition module outputs a set of premises and then a deduction engine takes it from there" as for a model of physics. That is, it's still the case that (assuming enough richness) the translation between the "intuition module"'s language and human language is going to be more or less pinned down uniquely, and we'll have the same kind of taxonomy over cases where two translations would work equally well.