Posts

Analyzing how SAE features evolve across a forward pass 2024-11-07T22:07:02.827Z

Comments

Comment by bensenberner on Open Thread Fall 2024 · 2024-11-13T15:55:30.443Z · LW · GW

Sure!

Comment by bensenberner on Open Thread Fall 2024 · 2024-11-07T18:38:48.037Z · LW · GW

Hi! I joined LW in order to post a research paper that I wrote over the summer, but I figured I'd post here first to describe a bit of the journey that led to this paper.

I got into rationality around 14 years ago when I read a blog called "You Are Not So Smart", which pushed me to audit potential biases in myself and others, and to try to understand ideas/systems end-to-end without handwaving.

I studied computer science at university, partially because I liked the idea that with enough time I could understand any code (unlike essays, where investigating bibliographies for the sources of claims might lead to dead ends), and also because software pays well. I specialized in machine learning because I found it cool that algorithms could make accurate predictions from patterns in the world too complex for people to hardcode. I had this sense that somewhere, someone must understand the "first principles" behind how to choose a neural network architecture, or that there was some way of reverse-engineering what deep learning models learned. Later I realized that there weren't really first principles for optimizing training, and that spending time trying to hardcode priors into models of high-dimensional data was less effective than just getting more data (and then never understanding what exactly the model had learned).

I did a couple of Kaggle competitions and wanted to try industrial machine learning. I took a SWE job on a data-heavy team at a tech company, working on the ETLs powering models, and then did some backend work that took me away from large datasets for a couple of years. I decided to read through recent deep learning textbooks and re-implement research papers at a self-directed programming retreat. Eventually I was able to work on a large-scale recommendation system, but I still felt a long way from the cutting edge, which by then had advanced to GPT-4. At this point, my initial fascination with the field had become tinged with concern, as I saw people (including myself) beginning to rely on language model outputs as if they were true, without consulting primary sources. I wanted to understand what language models "knew" and whether we could catch issues with their "reasoning."

I considered grad school, but I figured I'd have a better application if I understood how ChatGPT was trained, and how far we'd progressed in reverse engineering neural networks' internal representations of their training data.

I participated in the AI Safety Fundamentals course, which covered both of these topics, and focused particularly on the mechanistic interpretability section. I worked through parts of the ARENA curriculum, found an opportunity to collaborate on a research project, and decided to commit to it over the summer, which led to the paper I mentioned at the beginning! Here it is.