Progress Report 5: tying it together
post by Nathan Helm-Burger (nathan-helm-burger) · 2022-04-23T21:07:03.142Z · LW · GW
Previous: Progress report 4 [LW · GW]
First, an example of what not to do if you want humanity to survive: make an even foom-ier and less interpretable version of neural nets. On the spectrum of good idea to bad idea, this one is way worse than neuromorphic computing. The authors even include a paragraph contrasting their method with Hebbian learning, showing that theirs is more volatile and unpredictable. Great. https://arxiv.org/abs/2202.05780
On the flip side of the coin, here's some good stuff that I'm happy to see being worked on.
Some other work that has been done with nostalgebraist's logit lens: "Looking for Grammar in All the Right Places" by Alethea Power.
Tom Frederik is working on a project called 'Unseal' (https://github.com/TomFrederik/unseal), to unseal the mystery of transformers, which expands on the PySvelte library to include aspects of the logit-lens idea.
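For context, here's a minimal sketch of what the logit-lens idea looks like in code. This is my own toy version written against the Hugging Face GPT-2 API, not the Unseal or PySvelte implementation, and the prompt and variable names are just illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states = (embedding output, layer 1, ..., layer 12), each [1, seq, d_model].
# The logit lens: push each intermediate residual stream through the final layer norm
# and the unembedding matrix to see what the model "currently predicts" at that depth.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))
    top_token = logits[0, -1].argmax().item()  # top prediction at the last position
    print(f"layer {layer:2d}: {tokenizer.decode([top_token])!r}")
```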
Inspired by this, I started my own fork of his library to work on an interactive visualization that creates an open-ended framework for tying together multiple ways of interpreting models. I'm hoping to use it to pull together the stuff I've found or made so far, as well as whatever I find or make next. It just seems generally useful to have a way to put a bunch of different 'views' of a model side-by-side, so you can scan across all of them and get a more inclusive picture; a rough sketch of that idea is below.
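To give a flavor of what I mean by side-by-side 'views', here is a rough sketch of the registry idea. The names and the two example views are hypothetical, not the actual code in my fork; it assumes a Hugging Face model and tokenized inputs like the GPT-2 ones in the sketch above:

```python
import torch
import matplotlib.pyplot as plt

VIEWS = {}

def register_view(name):
    """Decorator that adds a view function (model, inputs) -> 2-D array to the registry."""
    def wrap(fn):
        VIEWS[name] = fn
        return fn
    return wrap

@register_view("attention (layer 0, head 0)")
def attention_view(model, inputs):
    out = model(**inputs, output_attentions=True)
    return out.attentions[0][0, 0].detach().numpy()  # [seq, seq]

@register_view("hidden-state norms")
def hidden_norm_view(model, inputs):
    out = model(**inputs, output_hidden_states=True)
    norms = [h[0].norm(dim=-1) for h in out.hidden_states]  # one [seq] vector per layer
    return torch.stack(norms).detach().numpy()  # [n_layers + 1, seq]

def show_all(model, inputs):
    """Render every registered view side-by-side in one figure."""
    fig, axes = plt.subplots(1, len(VIEWS), figsize=(5 * len(VIEWS), 4))
    for ax, (name, fn) in zip(axes, VIEWS.items()):
        ax.imshow(fn(model, inputs), aspect="auto")
        ax.set_title(name)
    plt.show()
```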
Once I get this tool a bit more fleshed out, I plan to start trying to play the audit game with it, using toy models that have been edited.
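For concreteness, here's one guess at what an 'edited' toy model for the audit game could look like; the particular layer, column, and scaling factor are arbitrary illustration choices, not a real protocol:

```python
import copy
import torch
from transformers import GPT2LMHeadModel

clean = GPT2LMHeadModel.from_pretrained("gpt2")
edited = copy.deepcopy(clean)

with torch.no_grad():
    # Arbitrary edit: scale one column of a mid-layer MLP projection.
    edited.transformer.h[6].mlp.c_proj.weight[:, 123] *= 5.0

# The game: hand an auditor `edited` (and maybe `clean` for comparison) and see
# whether the visualization views reveal where the change was made.
```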
Here's a presentation of my ideas I made for the conclusion of the AGI safety fundamentals course. https://youtu.be/FarRgPwBpGU
Among the things I'm thinking of putting in are ideas related to these papers.
One: a paper mentioned by jsteinhardt here (https://www.lesswrong.com/posts/qAhT2qvKXboXqLk4e/early-2022-paper-round-up): "Summarizing Differences between Text Distributions with Natural Language" (with Ruiqi Zhong, Charlie Snell, and Dan Klein).
This paper presents a technique for summarizing language data in natural language, which I think will be cool to use someday. But along the way to building it, the authors had to do some clustering, which sounds like a useful thing to include in my visualization and to contrast with my neuron-importance topic clustering. I also hope to revisit and improve the neuron-importance clustering itself: if I repeat the neuron-importance sampling on topic-labeled samples, I should be able to tag the resulting clusters with the topics they most strongly relate to (rough sketch below), which will make them more useful for interpretation.
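Here's a rough sketch of what that cluster tagging could look like, using random stand-in data in place of real neuron-importance vectors and topic labels (the matrix shapes, number of clusters, and topic names are all made up for illustration):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
importance = rng.random((500, 3072))  # [n_samples, n_neurons], stand-in for real importance scores
topics = rng.choice(["science", "sports", "politics", "fiction"], size=500)  # stand-in labels

# Cluster the per-sample neuron-importance vectors.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(importance)

# Tag each cluster with the topics its member samples most often carry.
for cluster_id in range(kmeans.n_clusters):
    members = topics[kmeans.labels_ == cluster_id]
    top_topics = Counter(members).most_common(2)
    print(f"cluster {cluster_id}: {top_topics}")
```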
Two: "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (Mor Geva, Avi Caciularu, Kevin Ro Wang, Yoav Goldberg), https://arxiv.org/abs/2203.14680
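A hedged sketch of the core trick from that paper, as I understand it: read each feed-forward "value vector" (a row of the second MLP matrix) in vocabulary space by projecting it through the unembedding matrix. The layer and neuron indices below are arbitrary illustration choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

layer, neuron = 10, 42  # arbitrary choices, just for illustration
# In Hugging Face GPT-2, mlp.c_proj.weight has shape [d_ff, d_model] (Conv1D layout),
# so each row is one feed-forward value vector in the residual-stream basis.
value_vector = model.transformer.h[layer].mlp.c_proj.weight[neuron]

with torch.no_grad():
    vocab_logits = value_vector @ model.lm_head.weight.T  # project into vocabulary space
    top_tokens = torch.topk(vocab_logits, k=10).indices.tolist()

print([tokenizer.decode([t]) for t in top_tokens])  # tokens this value vector promotes
```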
Three: "What does BERT dream of?" Deep dream with text: https://www.gwern.net/docs/www/pair-code.github.io/c331351a690011a2a37f7ee1c75bf771f01df3a3.html
Seems neat and sorta related. I can probably figure out some way to add a version of this text-deep-dreaming to the laundry list of 'windows into interpretability' I'm accumulating.
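As a first guess at what that could look like (my own rough sketch, not the PAIR implementation): treat the input as continuous embeddings, run gradient ascent on them to maximize one hidden unit in BERT, then snap the result back to the nearest real tokens. The sequence length, layer, unit, learning rate, and step count below are all arbitrary:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # we optimize only the input, not the model

embed_matrix = model.embeddings.word_embeddings.weight  # [vocab, hidden]
seq_len, layer, unit = 8, 6, 100  # arbitrary illustration choices

# Start from random "soft tokens" and optimize them directly.
soft_embeds = torch.randn(1, seq_len, embed_matrix.shape[1], requires_grad=True)
optimizer = torch.optim.Adam([soft_embeds], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    out = model(inputs_embeds=soft_embeds, output_hidden_states=True)
    activation = out.hidden_states[layer][0, :, unit].mean()  # the unit we're "dreaming" for
    (-activation).backward()  # gradient ascent on the input embeddings
    optimizer.step()

# Snap each optimized embedding back to its nearest vocabulary token and read it out.
nearest_ids = torch.cdist(soft_embeds.detach()[0], embed_matrix).argmin(dim=-1)
print(tokenizer.decode(nearest_ids.tolist()))
```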