Unsupervised Methods for Concept Discovery in AlphaZero
post by aogara (Aidan O'Gara) · 2023-10-26T19:05:57.897Z · LW · GW · 0 comments
This is a link post for https://arxiv.org/abs/2310.16410
Using contrast pairs, the authors extract linear directions in the activation space of AlphaZero which correspond to concepts. By observing AlphaZero's play in situations that use these concepts, human grandmasters can improve their own play.
This is related to the following recent research:
- Burns et al. (2022) found directions which correlate with truth in a language model’s latent space by using unsupervised methods applied to contrast pairs.
- Research by Alex Turner's SERI MATS stream used contrast pairs to identify directions representing various concepts in the latent space of GPT-2 and RL agents. By adding or subtracting these vectors during the models’ forward passes, they controlled model outputs in sophisticated ways. (1 [LW · GW], 2 [LW · GW], 3)
- Zou et al. (2023) describe and motivate the research direction of representation engineering. They empirically evaluate variants of these techniques, showing that representations can be effectively used to monitor and control various AI behaviors.
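The idea of adding or subtracting a concept vector during the forward pass can be illustrated with a toy example. This is a minimal sketch, not code from any of the posts or papers cited above: the two-layer network, its weights, and the names `forward`, `steering`, and `alpha` are all hypothetical and chosen only to show the mechanism.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W1, W2, steering=None, alpha=0.0):
    """Toy two-layer network with an optional steering vector.

    The steering vector is added to the hidden activations, scaled by
    alpha, before the final linear readout.
    """
    h = [max(0.0, v) for v in matvec(W1, x)]  # hidden layer with ReLU
    if steering is not None:
        h = [hi + alpha * si for hi, si in zip(h, steering)]
    return matvec(W2, h)

# Hypothetical weights: the output is (hidden unit 0) - (hidden unit 1).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, -1.0]]
x = [0.5, 0.5]

# Hypothetical concept direction in the hidden space.
steer = [1.0, 0.0]

base = forward(x, W1, W2)
plus = forward(x, W1, W2, steering=steer, alpha=2.0)    # add the vector
minus = forward(x, W1, W2, steering=steer, alpha=-2.0)  # subtract it
print(base, plus, minus)
```

Adding the vector pushes the output up and subtracting it pushes the output down, while the input and the weights stay unchanged. In a real model the intervention would be applied at a chosen layer, and the choice of layer and scale is an empirical question.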
Collin Burns has argued [LW · GW] that unsupervised methods for concept discovery should scale to superhuman systems, offering an empirical average-case approach to ELK.
Section 4.1 of the paper describes the method for constructing contrast pairs and finding linear directions representing concepts. The full paper is available at https://arxiv.org/abs/2310.16410.
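To make the general idea of "a linear direction from contrast pairs" concrete, here is a minimal sketch on synthetic data. It uses a plain difference of means between the two sides of the contrast set, which is a deliberately simple stand-in and not the procedure from Section 4.1 of the paper. The activations, dimensions, and function names (`concept_direction`, `project`, `sample`) are all assumptions made for illustration.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'concept absent' activation
    to the mean 'concept present' activation (difference of means)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(act, direction):
    """Scalar projection of an activation onto the concept direction."""
    return sum(a * d for a, d in zip(act, direction))

# Synthetic activations: the "concept" shifts activations along axis 0.
rng = random.Random(0)
dim = 8
true_dir = [1.0] + [0.0] * (dim - 1)

def sample(has_concept):
    noise = [rng.gauss(0, 0.1) for _ in range(dim)]
    shift = 2.0 if has_concept else 0.0
    return [n + shift * t for n, t in zip(noise, true_dir)]

pos_acts = [sample(True) for _ in range(200)]   # concept present
neg_acts = [sample(False) for _ in range(200)]  # concept absent

direction = concept_direction(pos_acts, neg_acts)
print([round(d, 3) for d in direction])
```

On this synthetic data the recovered direction lines up closely with the planted axis, and projecting a new activation onto it separates the two classes. Real activations are much noisier, so a direction found this way should be validated on held-out examples before it is interpreted as a concept.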