How about "Cognitive Interpretability", or "AI Cognitive Psychology" (AI Cog Psych for short) rather than "Prosaic Interpretability"?
"Prosaic" conjures only some of the correct associations, and then only if you've heard of "Prosaic Alignment", which was a pretty bad name imho. If you had told me to guess what you meant by the term PI, I would not have guessed what you have described.
I think MI, and what you call PI, are analogous to Cognitive Neuroscience and Cognitive Psychology, respectively, which is why I think AI Cog Psych will lead to more correct inferences on first hearing.
I also suspect that Cognitive Psychology, especially the linguistics part, already has a wealth of methods that could transfer very nicely onto LLMs. For example, in The Language Instinct, Steven Pinker describes how it is possible to discover many things about how we parse sentences without any brain scans, solely through natural language experiments. He mentions a bunch of other experiments that I think could work quite well, or could at least help build intuitions for how to discover mental mechanisms solely through input-output behaviour on carefully constructed sentences.
It also sounds cooler to say you work on AI Cognitive Psychology rather than Prosaic Interpretability ;)
By the way, the analogy with genes is fantastic. I think it nicely points to the fact that even if some features are relatively straightforward to find, circuits may nevertheless be fiendishly difficult to uncover. Thanks for writing such an excellent post and being honest about some of your hard work that didn't pan out how you hoped!
Minor maths error:
I think you’re overestimating how significant those numbers say political compatibility is. The study said that 4% of marriages are between a Republican and a Democrat, which *sounds* low, but given that something like 30% of people are Republicans and 30% Democrats and 40% Independents, you would only expect 9% from pure mixing. There are 17% between Independents and non-Independents, but from random mixing you would only expect 24%.
Given those numbers of Republicans, Democrats and Independents, I'm pretty sure from pure mixing you would expect 18% (not 9%) of marriages to be between a Republican and a Democrat and 48% (not 24%) between an Independent and non-Independent.
Explanation: simplifying to heterosexual relationships and assuming both Republicans and Democrats are 50-50 male-female, you would expect 9% of marriages to be between a male Republican and a female Democrat, and 9% to be between a male Democrat and a female Republican, making 18% Republican-Democrat marriages in total.
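A quick way to check the arithmetic is to enumerate every ordered pair of partners under pure mixing. This is only a sketch: the 30/30/40 shares are the rough figures quoted above, the assumption that partners are drawn independently is the "pure mixing" simplification, and `mixed_share` is a made-up helper name.

```python
from itertools import product

# Rough party shares quoted in the comment (not survey data).
shares = {"R": 0.30, "D": 0.30, "I": 0.40}

def mixed_share(group_a, group_b):
    """Share of couples with one partner in group_a and the other in group_b.

    Assumes the two partners are drawn independently from `shares`
    (pure mixing) and that group_a and group_b do not overlap.
    """
    total = 0.0
    # Ordered pairs (first partner, second partner), so both orderings are counted.
    for p1, p2 in product(shares, repeat=2):
        if (p1 in group_a and p2 in group_b) or (p1 in group_b and p2 in group_a):
            total += shares[p1] * shares[p2]
    return total

rep_dem = mixed_share({"R"}, {"D"})        # 2 * 0.3 * 0.3
ind_non = mixed_share({"I"}, {"R", "D"})   # 2 * 0.4 * 0.6

print(f"Republican-Democrat: {rep_dem:.0%}")
print(f"Independent and non-Independent: {ind_non:.0%}")
```

Counting both orderings is what doubles the 9% and 24% figures: each mixed pairing can occur with either partner in either group.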
Very much enjoying reading these! In Definition 3 (an a-measure), it might be worth making it more conspicuous that the measure should not have a negative part, as I somehow only spotted that on the third or fourth reading.