fMRI-LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR
post by Escaque 66 (escaque-66) · 2023-07-11T17:17:51.576Z · LW · GW
fMRI (Functional Magnetic Resonance Imaging) is a technique used to investigate the human brain by detecting changes in blood flow while a person performs certain tasks. Blood flow is a proxy for energy consumption, so changes in blood flow reveal which areas of the brain are involved in particular activities.
AI alignment is an open area of research that tries to solve the problem of ensuring that an AI system's goals and behaviours are aligned with human values. Among the other problems that may arise in an advanced AI system (specification, robustness, interpretability…), one of the main concerns is the possibility that it may attempt to hide its real intentions. This is generally referred to as "concealed intent" or "deceptive behaviour".
We propose to use an approach similar to fMRI to investigate the possibility of deceptive behaviour in an advanced AI system. In our case, the activation of neurons plays the role that energy consumption plays in fMRI, with the advantage of a much higher granularity.
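As a rough illustration (not part of the proposal itself), here is a minimal sketch of how per-neuron activations could be recorded in a PyTorch model using forward hooks; the function and variable names (`attach_activation_recorder`, `store`) are hypothetical:

```python
import torch

def attach_activation_recorder(model, store):
    """Register forward hooks that record each monitored layer's mean activation per neuron."""
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):  # choose which layers to monitor
            def hook(mod, inputs, output, name=name):
                acts = output.detach().float()
                # collapse batch/sequence dimensions -> one average value per neuron
                store[name] = acts.reshape(-1, acts.shape[-1]).mean(dim=0)
            handles.append(module.register_forward_hook(hook))
    return handles  # keep the handles so the hooks can be removed later

# Illustrative usage:
# store = {}
# handles = attach_activation_recorder(model, store)
# model(input_ids)              # one forward pass on a task or prompt
# for h in handles:
#     h.remove()
```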
First, we hypothesize that:
- A complex AI system will have certain internal areas dedicated to specific functions and ideas, as happens in the human brain. In our case, certain areas of the system will be associated with ideas of deception, fraud, deceitfulness, betrayal, … (in short, we will call these areas Deceptive Areas).
- It won’t be possible for a complex system to avoid using those Deceptive Areas when it is behaving deceptively.
To use this approach, we propose these general steps:
- Resting State. Establish a baseline “resting state” by measuring the average activation of the neurons over a varied set of tasks.
- Deceptive Areas. Present the system with tasks and stimuli that involve ideas related to deception. Use statistical analysis to compare the results of steps 1 and 2 and determine the areas that deviate significantly from the resting state; those will be identified as the Deceptive Areas.
- Monitoring. Continuously test the AI system in operation to detect anomalous activation of the Deceptive Areas (see the sketch after this list).
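A minimal sketch of these three steps, assuming a helper `collect_activations(prompts)` that returns an array of shape `(n_prompts, n_neurons)` (for example, built on the hook recorder above); the function names and thresholds are illustrative, not a definitive design:

```python
import numpy as np

def fit_baseline(resting_acts):
    """Step 1: per-neuron mean and standard deviation over varied, neutral tasks."""
    return resting_acts.mean(axis=0), resting_acts.std(axis=0) + 1e-8

def find_deceptive_areas(deception_acts, mu, sigma, z_threshold=4.0):
    """Step 2: neurons whose average activation on deception-related prompts
    deviates significantly from the resting state (a simple z-score stands in
    for a fuller statistical analysis)."""
    z = (deception_acts.mean(axis=0) - mu) / sigma
    return np.where(np.abs(z) > z_threshold)[0]  # indices of the Deceptive Areas

def deception_alert(current_acts, mu, sigma, deceptive_idx, z_threshold=4.0):
    """Step 3: during operation, flag anomalous activation of the Deceptive Areas."""
    z = (current_acts[deceptive_idx] - mu[deceptive_idx]) / sigma[deceptive_idx]
    return bool(np.any(np.abs(z) > z_threshold))
```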
Technical feasibility
Complex AI systems have a huge number of neurons and parameters, and work as a “black box” with no clear explanation of the results they produce. This complexity makes it difficult to determine the state and evolution of the activated neurons while the system is in training or operation.
Our approach uses averages of activation to represent internal operation. The memory required will be large, but of the same order of magnitude as the size of the AI system itself.
The calculation of averages of activation should also be feasible in terms of compute power.
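As a rough check of both claims, here is a minimal sketch of a streaming (Welford) estimator: it keeps only a few arrays of the model's width, so memory stays of the order of the model size, and each update is a handful of vector operations; the class name and interface are illustrative:

```python
import numpy as np

class RunningStats:
    """Streaming per-neuron mean and standard deviation (Welford's algorithm)."""

    def __init__(self, n_neurons):
        self.n = 0
        self.mean = np.zeros(n_neurons)
        self.m2 = np.zeros(n_neurons)  # running sum of squared deviations

    def update(self, acts):
        """Fold in one activation vector (one value per neuron)."""
        self.n += 1
        delta = acts - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (acts - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / max(self.n - 1, 1))
```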
Test Design
The design of the tests to be used in phases 1 and 2 is critical, and we propose that it be prepared by a multidisciplinary team including specialists in psychology, behaviour, AI architecture, prompting…
Limitations and Risks
There are two main risks that may limit the applicability of this approach:
- Hypothesis 2 may be wrong: the AI system may be able to behave deceptively without showing unusual activity in the Deceptive Areas.
- There are many ways in which a system may exhibit behaviour that is not aligned with human values, and the test design may not cover all of those possibilities.
3 comments
comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-07-12T00:26:33.350Z · LW(p) · GW(p)
See A Longlist of Theories of Impact for Interpretability [LW · GW] (this seems similar to #4). Unfortunately I think interpretability is harder than you seem to think; on that see transformer-circuits.pub and this Mechanistic Interpretability Quickstart Guide [LW · GW].
Reply by Escaque 66 (escaque-66) · 2023-07-12T11:21:28.469Z · LW(p) · GW(p)
Thank you for your comment, Zac. The links you suggest will be helpful for me to check whether this kind of analysis has been tried. Up to now I've only seen studies aimed at interpreting specific neurons or areas of a model, but not a statistical analysis of the whole model that can raise an alert when the model is using certain areas previously associated with negative behaviors.
comment by Escaque 66 (escaque-66) · 2023-11-08T09:48:39.932Z · LW(p) · GW(p)
For work implementing this idea, see: https://www.anthropic.com/index/decomposing-language-models-into-understandable-components