Posts

FLEXIBLE AND ADAPTABLE LLM’s WITH CONTINUOUS SELF TRAINING 2024-06-19T14:17:32.109Z
fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR 2023-07-11T17:17:51.576Z

Comments

Comment by Escaque 66 (escaque-66) on fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR · 2023-11-08T09:48:39.932Z · LW · GW

For work implementing this idea, see: https://www.anthropic.com/index/decomposing-language-models-into-understandable-components

Comment by Escaque 66 (escaque-66) on fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR · 2023-07-12T11:21:28.469Z · LW · GW

Thank you for your comment, Zac. The links you suggest will help me check whether this kind of analysis has already been tried. So far I have only seen studies aimed at interpreting specific neurons or areas of a model, not a statistical analysis of the whole model that could raise an alert when the model is using areas previously associated with negative behaviors.
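To make the idea concrete, here is a minimal sketch of what such a whole-model alert might look like. Everything in it is an assumption made for illustration: the function names (`build_signature`, `alert`), the 0.8 threshold, and the premise that each forward pass can be summarised as one flat list of activation values. It is not taken from any existing library or from the work linked above.

```python
import math

def mean_vector(rows):
    # Element-wise mean over a list of equal-length activation vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_signature(flagged, benign):
    # Hypothetical "negative behavior" signature: the direction in which
    # flagged runs differ, on average, from benign runs.
    baseline = mean_vector(benign)
    signature = [f - b for f, b in zip(mean_vector(flagged), baseline)]
    return signature, baseline

def alert(activations, signature, baseline, threshold=0.8):
    # Raise an alert when the current activations, measured relative to the
    # benign baseline, point in nearly the same direction as the signature.
    centered = [a - b for a, b in zip(activations, baseline)]
    return cosine(centered, signature) >= threshold

# Toy data: three "areas" of a model, two labelled runs of each kind.
flagged_runs = [[0.9, 0.1, 0.8], [1.0, 0.0, 0.9]]
benign_runs = [[0.1, 0.9, 0.1], [0.0, 1.0, 0.2]]
signature, baseline = build_signature(flagged_runs, benign_runs)

print(alert([0.9, 0.1, 0.85], signature, baseline))
print(alert([0.1, 0.9, 0.1], signature, baseline))
```

A real model would have far more dimensions, and a single mean-difference direction is probably too crude, but the structure would stay the same: collect labelled activation patterns first, then compare each new pattern against them statistically.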