Posts

Comments

Comment by Zheng Wang (zheng-wang) on Why and When Interpretability Work is Dangerous · 2024-11-17T20:32:50.282Z · LW · GW

My concern is that interpretability may be dangerous, or lead to a higher P(doom), in a different way.

The problem is that if we have a better way of steering LLMs towards a certain value system, how can we guarantee that the value system is the right one? For example, steering LLMs towards a particular value system could easily be abused to mass-produce fake news that is more ideologically consistent and aligned, or to make LLMs omit information that would offer a neutral point of view. This seems to be a different form of "doom" compared with AI taking full control.