Topological Data Analysis and Mechanistic Interpretability

post by Gunnar Carlsson (gunnar-carlsson) · 2025-02-24

Contents

  I. Topological Data Modeling
  II. Mapper
  III. Mechanistic Interpretability
  IV. Graph Modeling of SAE features
  V. Next steps for TDA and SAE features
  VI. Summary
  VII. Acknowledgments

This article was written in response to a post on LessWrong from the Apollo Research interpretability team. This post represents our initial attempts at acting on its topological data analysis suggestions.

In this post, we’ll look at some ways to use topological data analysis (TDA) for mechanistic interpretability. We’ll first show how one can apply TDA in a very simple way to the internals of convolutional neural networks to obtain information about the “responsibilities” of the various layers, as well as about the training process. For LLMs, though, approaching the weights or activations “raw” yields limited insights, and one needs additional methods like sparse autoencoders (SAEs) to obtain useful information about the internals. We will discuss this methodology, and give a few initial examples where TDA helps reveal structure in SAE feature geometry.

I. Topological Data Modeling

The term topology refers to the study of shape using methods that are insensitive to deformations such as stretching, compressing, or shearing. For example, topology does not “see” the difference between a circle and an ellipse, but it does recognize the difference between the digit 0 and the digit 8. No matter how I stretch or compress the digit 0, I can never achieve the two loops that are present in the digit 8. Shapes can often be represented by graphs or their higher dimensional analogues called simplicial complexes. For instance, one can think of a hexagon as modeling a circle, with the understanding that the modeling is accomplished with a small amount of error:

Of course data sets can have notions of shape, too. For example, here is a data set that we can recognize as having a circular shape, even though it only consists of samples and is not a complete circle.

A circular shape may be an indication of periodic behavior. In a mechanistic interpretability context, Engels et al. showed that some LLM SAE features are organized in a circular pattern, and that those features correspond to periodic temporal structures like days of the week or months of the year.
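To make the idea of a circular shape in sampled data concrete, here is a minimal sketch (not the method used in this post) that samples noisy points from a circle and checks for a loop using persistent homology via the ripser package; the sample size and noise level are illustrative choices.

```python
# Minimal sketch (not this post's method): sample noisy points from a circle
# and check for a loop with persistent homology, using the ripser package.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
# Noisy samples lying near a unit circle, like the point cloud pictured above.
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += rng.normal(scale=0.05, size=points.shape)

# dgms[1] lists the 1-dimensional (loop) features; a single long-lived bar
# indicates that the sample has the shape of a circle.
h1 = ripser(points)["dgms"][1]
lifetimes = h1[:, 1] - h1[:, 0]
print("most persistent loop lifetime:", lifetimes.max())
```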

There are numerous other examples where periodic data is shown as a circle when graphed, notably in dynamical systems like predator/prey models.

II. Mapper

Mapper is the name for a family of methods that use topological ideas to build graphs representing data sets. The core concept behind Mapper is the nerve of a covering. A covering of a set $X$ is a family of subsets $U_1, U_2, \dots, U_n$ so that $X = U_1 \cup U_2 \cup \dots \cup U_n$. The nerve graph of the covering is the graph $N$ whose vertices $v_i$ correspond to the sets $U_i$, and where vertices $v_i$ and $v_j$ form an edge in $N$ if $U_i \cap U_j \neq \emptyset$, i.e. if $U_i$ and $U_j$ overlap. As an example, suppose the set $X$ is as shown below, with covering by four sets colored red, yellow, blue, and green, with overlaps as indicated.

The nerve graph has four vertices, one for each of the covering sets. The vertices corresponding to the yellow and red sets are connected by an edge because they overlap. The vertices corresponding to the yellow and blue sets are not connected by an edge because they do not overlap. Building a graph representation of a data set by constructing a good covering is a powerful technique, motivated by fundamental results like the nerve lemmas, which give guarantees about the topological equivalence between a space and the nerve of a sufficiently nice cover of that space. There are numerous strategies for constructing graph models motivated by this simple construction. Of course, the graphs constructed often have many more vertices than the model above. This kind of graph modeling is part of an area of data science called topological data analysis.
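As a minimal computational sketch of the nerve construction, the following few lines build the nerve graph of a covering with networkx; the index sets are hypothetical, chosen so that the red and yellow sets overlap while the yellow and blue sets do not, loosely mirroring the picture above.

```python
# Minimal sketch of the nerve construction: one vertex per covering set,
# an edge whenever two sets overlap. The index sets below are hypothetical.
import networkx as nx

def nerve_graph(cover):
    """cover: list of sets of point indices; returns the nerve graph."""
    G = nx.Graph()
    G.add_nodes_from(range(len(cover)))
    for i in range(len(cover)):
        for j in range(i + 1, len(cover)):
            if cover[i] & cover[j]:  # nonempty intersection => edge
                G.add_edge(i, j)
    return G

red, yellow, green, blue = {0, 1}, {1, 2}, {2, 3}, {3, 4}
print(list(nerve_graph([red, yellow, green, blue]).edges()))
# [(0, 1), (1, 2), (2, 3)]: red-yellow, yellow-green, green-blue overlap.
```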

Graphical modeling can be used to understand the internals of neural networks, as illustrated below, from this paper (a presentation can be found here).

These graphs were obtained from VGG16, a convolutional neural network pre-trained on ImageNet. For each layer, we constructed the data set of weight vectors for each neuron, including only those vectors satisfying a local density threshold. One can see that in the first two layers the graph model is circular, showing that the weight vectors are concentrated around those which detect approximations to linear gradients. Later layers always include these but add others as well. For example, layer four includes weight vectors which detect a horizontal line against a dark background. Layer five includes a white “bull’s eye” and a crossing of two lines. Later layers include combinations of these. The coloring of the nodes encodes the number of data points in the set corresponding to the node, so red nodes contain more points than green or blue ones. These visualizations demonstrate the presence of geometric structure in VGG16’s weight vectors, indicating that specific, interpretable features are learned at each layer.
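As a hedged sketch of the kind of preprocessing behind these graphs (not the exact pipeline of the paper), one can extract the 3×3 weight slices from a pretrained VGG16, normalize them, and keep only the locally densest vectors before building a graph model; the density proxy and thresholds below are illustrative assumptions.

```python
# Sketch only: extract normalized 3x3 weight vectors from a pretrained VGG16
# and keep the locally densest ones; this point cloud is the input to the
# graph construction described above. Thresholds are illustrative.
import numpy as np
import torch
from torchvision.models import vgg16, VGG16_Weights
from sklearn.neighbors import NearestNeighbors

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
conv_layers = [m for m in model.features if isinstance(m, torch.nn.Conv2d)]

def dense_weight_vectors(layer, k=15, keep_fraction=0.3):
    # Each 3x3 slice of each filter becomes one 9-dimensional weight vector,
    # mean-centered and normalized so only its pattern (not its scale) remains.
    w = layer.weight.detach().numpy().reshape(-1, 9)
    w = w - w.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(w, axis=1)
    w = w[norms > 1e-6] / norms[norms > 1e-6][:, None]
    # Density proxy: the closer the k-th nearest neighbor, the denser the point.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(w).kneighbors(w)
    density = -dists[:, -1]
    return w[density >= np.quantile(density, 1 - keep_fraction)]

cloud = dense_weight_vectors(conv_layers[0])
print(cloud.shape)  # the high-density weight vectors for the first layer
```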

A second example performs the same kind of analysis for a convolutional neural network with two hidden layers, but observes how the structure changes over the course of training. In the first layer, one sees roughly random behavior after 100 iterations, but after 200 iterations one sees concentration (as indicated by the redness) around points on the circular boundary, which correspond to the linear gradients as in VGG16. This pattern becomes even more pronounced after 400 iterations, but begins to degrade after 500 iterations. In the second layer, one sees a very weak circular boundary through the first 500 iterations, becoming more pronounced after that. One can hypothesize that the second layer is “compensating” for the degradation occurring in the first layer. The first layer has opted to retain the linear gradients in the vertical and horizontal directions, but has additionally included a black bull’s eye on a lighter background. This is unexpected behavior, and is probably due to the small number of layers in this network. What we would have expected is behavior similar to that seen in VGG16 above, in which the earliest layers respond to the simplest local behavior, namely an edge, and later layers to more complex behaviors.
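The training-dynamics analysis additionally requires weight snapshots taken at fixed iteration counts, each of which is then modeled as above. Below is a minimal sketch of that bookkeeping; the model, data loader, optimizer, and loss are placeholders rather than the two-hidden-layer network discussed here, and we assume 3×3 convolutional kernels.

```python
# Sketch of the snapshotting used for the training-dynamics analysis: record
# the first convolutional layer's 3x3 weight vectors every `every` iterations
# so each snapshot can later get its own graph model. Model, loader, optimizer,
# and loss_fn are placeholders.
import torch

def train_with_snapshots(model, loader, optimizer, loss_fn,
                         max_iters=600, every=100):
    first_conv = next(m for m in model.modules()
                      if isinstance(m, torch.nn.Conv2d))
    snapshots, it = {}, 0
    while it < max_iters:
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            it += 1
            if it % every == 0:
                # Copy of the layer's 3x3 weight vectors at this iteration.
                snapshots[it] = first_conv.weight.detach().clone().reshape(-1, 9)
            if it >= max_iters:
                break
    return snapshots  # e.g. snapshots[100], snapshots[200], ... for graph modeling
```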

III. Mechanistic Interpretability

Apollo Research recently led the writing of an extensive report on open problems in mechanistic interpretability, with a large portion focused on open questions about SAEs. Some of the issues that stood out to us were:

  1. As it is, the method does not create a usable geometry on the space of features. Geometry (and, we would add, topology) of feature sets is a useful way of organizing the features, and obtaining understanding and interpretations of them. It is well known that geometries of feature spaces are often extremely useful in signal processing. Fourier analysis uses the circular geometry of periodic data in a critical way, and the field of graph signal processing illustrates the power of geometry in organizing the features of a data set (see here and here for more details).
  2. SAEs give an organization of the activations in neural networks, level by level, but do not directly give information about mechanisms. How can one represent mechanisms?
  3. The ultimate goal is to extract interpretable features that accurately describe the internal processes of a model. Sparsity is used as a proxy for interpretability in SAEs. However, it is not clear whether sparsity is the best proxy for interpretability, or even always a helpful one. There are methods being developed which may improve the situation, notably minimum description length.  We believe that geometrically inspired measures will yield improved interpretability.  

IV. Graph Modeling of SAE features

Question 1 above concerns the need for a geometry on feature spaces. This is a key ingredient in the interpretability of features. We think TDA can help in understanding this feature geometry, and we'll show a few simple examples we've tried on the SAE features constructed by OpenAI for GPT-2-small. The graphs we build are constructed using BluelightAI's Cobalt software, which employs a variant of the Mapper technique outlined in Section II. We did need to implement a few workarounds to make this work, and we plan to share a cleaned-up Colab notebook detailing the process in the near future. The largest component of the graph constructed on these SAE features is displayed below. We used cosine similarity to compare features.
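Since we cannot reproduce the Cobalt pipeline here, the sketch below is a simplified stand-in rather than the actual construction: treat each SAE feature as its decoder direction, compare directions with cosine similarity, and connect sufficiently similar features into a graph whose largest connected component plays the role of the one discussed here. The decoder matrix is a random placeholder (in the actual experiment it comes from OpenAI's GPT-2-small SAE release), and the similarity threshold is an illustrative choice.

```python
# Simplified stand-in for the feature-graph construction (not the Cobalt API):
# cosine similarity between SAE feature directions, thresholded into a graph.
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Placeholder for the SAE decoder matrix, shape (n_features, d_model).
decoder_directions = rng.normal(size=(1000, 768))

sim = cosine_similarity(decoder_directions)
np.fill_diagonal(sim, 0.0)

threshold = 0.1  # illustrative; real SAE features are far from random directions
G = nx.Graph()
G.add_nodes_from(range(sim.shape[0]))
rows, cols = np.nonzero(np.triu(sim, k=1) > threshold)
G.add_edges_from(zip(rows.tolist(), cols.tolist()))

largest = max(nx.connected_components(G), key=len)
print(len(largest), "features in the largest connected component")
```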

Each node of the graph corresponds to a collection of SAE features. Below we show selections A and B from the above diagram and indicate what words or concepts trigger the features in each node or region. Each SAE feature activates with varying frequency on different sets of words, and collections of features are labeled by the most frequently occurring words in the collection.

Selection A contains a three-step progression, which looks like this:

(1) Regard to, attributed to, associated with, likening to

(2) Connect, link, belong, contribute, refer

(3) In conjunction with, listening to, according to, access to

All three have to do with relationships. (1) describes types of relationships, and those relationships are conceptual. (2) describes more explicit types of relationships, and (3) gives even more explicit and detailed forms of relating.

Selection B contains a “Y”-shape enclosed in the upper left, and we can interpret it like this:

V. Next steps for TDA and SAE features

We have a lot more ideas for how to use TDA to help better understand SAEs and neural network activation spaces more generally.

We looked at the geometric structure of SAE features themselves here, but we think these features may also be useful as a way to better understand the topological structure of activation space:

VI. Summary

We have demonstrated the use of topological data analysis in the study of SAEs for large language models, and obtained conceptual understanding of groups of SAE features. This methodology is quite powerful, and holds promise for mechanistic understanding of the internals of large language models.

VII. Acknowledgments

We thank Lee Sharkey for his helpful comments and suggestions.
