Comments

Comment by Andy Zou (jiaming-zou) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-18T22:32:05.726Z · LW · GW

Yeah, I think it's fair to say probing is a technique under rep reading, which is in turn under RepE (https://www.ai-transparency.org/). I did want to mention, though, that in many settings LAT performs unsupervised learning with PCA and does not use any labels. We also find that regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT with regular probing might be an oversimplification. Eliciting the desired neural activity patterns requires careful design of 1) the experimental task and 2) the locations at which to read the neural activity, both of which contribute to the success of LAT over regular probing (section 3.1.1).
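To make the "unsupervised, no labels" point concrete, here is a minimal sketch on synthetic data. It is not the implementation from the paper: the function names, the random pairing of stimuli, the dimensions, and the synthetic "activations" are all assumptions made for illustration. The idea shown is just that the top principal component of activation differences can recover a concept direction without ever seeing a label.

```python
import numpy as np

def make_synthetic_acts(n=400, dim=16, strength=3.0, seed=0):
    """Hypothetical stand-in for hidden states: isotropic noise plus a
    concept direction whose sign is hidden from the reading method."""
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    hidden_sign = rng.choice([-1.0, 1.0], size=n)
    acts = rng.normal(size=(n, dim)) + strength * hidden_sign[:, None] * true_dir
    return acts, true_dir, hidden_sign

def top_pc_direction(acts, rng):
    """Unsupervised reading direction (assumed recipe): top principal
    component of differences between randomly paired activations."""
    perm = rng.permutation(len(acts))
    half = len(acts) // 2
    diffs = acts[perm[:half]] - acts[perm[half:2 * half]]
    diffs = diffs - diffs.mean(axis=0)  # center before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # unit-norm first principal axis

acts, true_dir, hidden_sign = make_synthetic_acts()
direction = top_pc_direction(acts, np.random.default_rng(1))

# The sign of a principal component is arbitrary, so compare up to sign.
alignment = abs(float(direction @ true_dir))
scores = acts @ direction
agree = float(np.mean(np.sign(scores) == hidden_sign))
accuracy = max(agree, 1.0 - agree)
print(f"alignment={alignment:.3f}, accuracy={accuracy:.3f}")
```

The hidden signs are used only to evaluate the recovered direction afterwards, never to fit it. A regular linear probe would instead fit a classifier on those labels directly, which is the distinction being drawn above.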

In general, we have shown that monitoring high-level representations for harmful (or catastrophic) intents/behaviors holds some promise. It's exciting to see follow-ups in this direction that demonstrate more fine-grained monitoring/control.