Posts

Inducing human-like biases in moral reasoning LMs 2024-02-20T16:28:11.424Z
Paper: Understanding and Controlling a Maze-Solving Policy Network 2023-10-13T01:38:09.147Z

Comments

Comment by Austin Meek on Inducing human-like biases in moral reasoning LMs · 2024-03-07T20:47:31.251Z

Hi Charlie, thanks for your comment and apologies for the late reply. To echo what Artyom said, we didn't observe a significant difference between the fine-tuned models and the base models with regard to brain-score. Our fine-tuned models did not end up more correlated with the neuroimaging data of subjects taking these moral reasoning tests. In the future it would be neat if this approach worked and we could start using more neuroimaging data for alignment, but this initial stab didn't make those concrete connections.

To go into a bit more detail on the brain-score metric: we use the Pearson correlation coefficient (PCC) to measure the correlation. For each moral scenario given to a subject, we take the neuroimaging data at different times (it's measured every 2 seconds here, and we experiment with different sampling strategies), after accounting for the hemodynamic delay. This data is then reduced to 1,024 dimensions, where each value is the BOLD response at that point. We do this over a small set of similar examples, then fit a regression model from a given layer's activations to predict the 1,024-dimensional BOLD response vector for a scenario held out from that set. Finally, we take the PCC between the predicted response and the measured response. This gives us our brain score on a layer-by-layer basis, and we can also aggregate across layers to get a brain score for an entire model, which is what we report.
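For concreteness, here's a minimal sketch of how a per-layer score of this kind can be computed. The function name, the choice of ridge regression, and the leave-one-out split are my own illustrative assumptions, not necessarily exactly what our pipeline does:

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def layer_brain_score(layer_acts, bold, alpha=1.0):
    """Leave-one-out brain score for a single layer (illustrative).

    layer_acts: (n_scenarios, d_model) activations for one layer.
    bold: (n_scenarios, 1024) BOLD response vectors, with the
        hemodynamic delay already accounted for.
    Returns the mean PCC between predicted and measured BOLD
    responses over held-out scenarios.
    """
    n = len(layer_acts)
    scores = []
    for held_out in range(n):
        train = np.arange(n) != held_out
        # Regression from activations to the 1,024-dim BOLD vector.
        reg = Ridge(alpha=alpha).fit(layer_acts[train], bold[train])
        pred = reg.predict(layer_acts[held_out][None, :])[0]
        # PCC between predicted and measured response for the held-out scenario.
        r, _ = pearsonr(pred, bold[held_out])
        scores.append(r)
    return float(np.mean(scores))

# Toy usage with random stand-ins for real activations / fMRI data:
rng = np.random.default_rng(0)
acts = rng.normal(size=(20, 768))   # 20 scenarios, hypothetical d_model
bold = rng.normal(size=(20, 1024))  # matching 1,024-dim BOLD vectors
print(layer_brain_score(acts, bold))
```

A model-level score would then just aggregate `layer_brain_score` over layers, matching the description above.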

Hope that helps! Let me know if there's anything more I can help clarify.