Hey Milan, I’m broadly sympathetic to the argument in Proposition 1 Reason 2 that if we want to understand whether models perform some human-derived cognitive operation X, we need to define what X is, and the best validation of our definition will come from testing it in humans. But recently, I’ve been wrestling with whether we actually need to define the cognition models are doing in the same terms we use for human cognition in order to align model behavior.
For instance, you could take the definition of deception given in this paper: “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth”. This definition intentionally avoids relying on any cognitive concepts like belief or intention in the models. You could ask, “what kinds of internal operations in the model lead to this kind of behavior?” If you can explain that and change the model so that it no longer exhibits these behaviors, then you can essentially avoid the bad outcomes without knowing whether the models are doing the same cognitive operations humans do when they engage in similar behaviors (“deception”). This might still require explaining how models work at a higher level of abstraction in order to be useful, something like a cognitive science for AI, but the point is that I can imagine paths toward alignment where we avoid certain bad behaviors in AI without ever checking that the concepts we are using map directly onto human cognition.
For a more concrete example, you could imagine that an LLM generates false answers to questions even though it can generate true answers to the same questions in other contexts. The LLM might be lying in the sense we mean when humans lie: minimally, making a claim contrary to one’s belief, and perhaps also with the intention to induce false beliefs in others. Alternatively, the context might simply have prompted the LLM to role-play a persona that tends to give false answers to the questions we ask it. In this second case, the model does not necessarily have any beliefs or intentions at all, but it still engages in behavior we deem undesirable. If we define lying as the former and test our mechanistic definition of lying in the brain, that definition would identify only the first case as lying in the model, not the second. But both lead to bad behavior in the model, and importantly, the second phenomenon may not have an analog in human cognition (or maybe it does?). If instead we did not use human-derived concepts to define deception and defined it based on the behavior of AI, then maybe we could identify both cases as “deception” and correct for them accordingly.
I’m not sure yet how convinced I am by these arguments, but I find it difficult to come up with reasons why you would actually need to base your definitions on human cognition rather than on model behavior if your goal is only to avoid the behavior. Do you think these kinds of alignment scenarios, without human-derived definitions of cognitive concepts, are plausible? I’m curious whether you have arguments or intuitions for why you expect human-derived cognitive concepts to be essential for ensuring alignment.