Comment by Zheng Wang (zheng-wang) on Why and When Interpretability Work is Dangerous · 2024-11-17T20:32:50.282Z
My concern is that interpretability may be dangerous, or may raise P(doom), in a different way.
The problem is that if we have better tools for steering LLMs toward a particular value system, how do we guarantee that the value system is the right one? For example, steering LLMs toward a given value system could easily be abused to mass-produce fake news that is more ideologically consistent and aligned, or to make LLMs omit information that would otherwise offer a neutral point of view. This seems like a different form of "doom" compared with AI taking full control.