Posts
Comments
Thanks for the feedback! I was quite surprised at the Claude results myself. I did play around a little bit with the prompt on Claude 3.5 Sonnet, and found that it could change the result on individual questions, but I couldn't get it to change the overall accuracy much that way -- other questions would also flip to refusal. So this certainly warrants further investigation, but by itself I wouldn't take it as evidence the overall result changes .
In fact, a friend of mine got Claude to answer questions quite consistently, and could only replicate the frequent refusals when he tested questions with his user history disabled. It's pure speculation, but the inconsistency on specific questions makes me think this behaviour might be caused by reward misspecification and not intentionally trained (which I imagine would result in something more reliable).