Initial Experiments Using SAEs to Help Detect AI Generated Text

post by Aaron_Scher · 2024-07-22T05:16:20.516Z · LW · GW · 0 comments

Contents

  Summary
  Experiments
    General Methods
    Linear classifiers on SAE neurons perform better than reasonable baselines
      Figure 1
      Table 1
    Using only a few SAE neurons still works well
      Figure 2
    What are these neurons doing?
      Figure 3
      Figure 4
  Discussion
    Motivation
      AI text detection is an important problem
      It might be time for narrow “Microscope AI”
      AI text detection is a particularly good area for this
    Potential Next Steps
    Considerations in working on this
  Appendix

This is an informal progress update with some preliminary results using Sparse Autoencoders (SAEs) to detect AI-generated text.

Summary

I would like reliable methods for distinguishing human-generated and AI-generated text. Unfortunately, the current best methods for doing this are not only unreliable but also opaque: they rely on training large neural networks to do the classification for us. Fortunately, advances in AI interpretability, like Sparse Autoencoders [? · GW] (SAEs), have given us methods for better understanding large language model (LLM) internals. In this post, I describe some early experiments on using GPT-2-Small SAEs to find text features that differ between AI-generated and human-generated text. I have a few main results:

One major limitation is that I use a simple, non-adversarial setting: the AI-generated text is all produced by GPT-3.5-Turbo in response to similar prompts. On the other hand, the task of "classifying GPT-3.5-Turbo vs. human text" is off the training distribution for both GPT-2 and the SAEs, and it's not a priori obvious that they would represent the requisite information. Working on this project has left me with the following high-level outlook on using SAEs for AI text detection:

Experiments

General Methods

Linear classifiers on SAE neurons perform better than reasonable baselines

To test how SAE-based classifiers compare to other classifiers for this task, I train simple logistic regression classifiers on a variety of data. In addition to comparing SAE and GPT-2 activations at all layers (Figure 1), I do more analysis at layer 8 (chosen for the high classification accuracy of GPT-2 activations in development). I include the following baselines:

I chose these baselines to cover a mix of obvious points of comparison for SAEs and other linguistic-feature methods that are human-understandable and have been found effective in previous work.[4]
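As a rough illustration of the setup, here is a minimal sketch (not my exact code; variable names like X_sae and X_gpt2 are placeholders, and the feature matrices and labels are assumed to already be built):

```python
# Illustrative sketch: fit a logistic regression on one feature set and score it.
# X is assumed to be an (n_sequences, n_features) array of sequence-level features
# (e.g., mean SAE activations per text) and y a binary array (1 = AI-generated).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_feature_set(X, y, seed=0):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=2000)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_val)
    probs = clf.predict_proba(X_val)[:, 1]
    return {
        "accuracy": accuracy_score(y_val, preds),
        "f1": f1_score(y_val, preds),
        "auroc": roc_auc_score(y_val, probs),
        "n_features": X.shape[1],
    }

# e.g., results = {name: evaluate_feature_set(X, y)
#                  for name, X in {"SAE": X_sae, "GPT-2": X_gpt2}.items()}
```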

Figure 1

Except for the first few layers, classifiers on SAE activations slightly but consistently outperform classifiers on GPT-2 activations at every layer. Classifiers are fit on the train set; these results are on a held-out validation set not used during development. Results on the development test set, which motivated the focus on layer 8, are similar. Note the y-axis scale.

Table 1

Performance of different methods on held-out validation set (except Dev Test Acc which is from the test set used in development). GPT-2, SAE, SAE Reconstructed, and PCA results are all based on just the pre-layer 8 residual stream. Sorted by F1.

 

| Method | Accuracy | F1 Score | AUROC | Dev Test Acc | # Features |
|---|---|---|---|---|---|
| GPT-2 Activations | 0.924 | 0.924 | 0.978 | 0.919 | 768 |
| SAE Activations | 0.944 | 0.944 | 0.985 | 0.941 | 24,576 |
| SAE Reconstructed Acts | 0.918 | 0.919 | 0.975 | 0.908 | 768 |
| Ngram Count Full | 0.909 | 0.909 | 0.973 | 0.885 | 104,830 |
| Ngram Count Simple | 0.904 | 0.902 | 0.968 | 0.897 | 5,570 |
| TF-IDF Simple | 0.902 | 0.901 | 0.964 | 0.881 | 5,570 |
| Linguistic Features | 0.654 | 0.661 | 0.709 | 0.645 | 12 |
| Text Embedding 3 Small | 0.850 | 0.848 | 0.927 | 0.833 | 1,536 |
| pca_1 | 0.614 | 0.624 | 0.643 | 0.648 | 1 |
| pca_2 | 0.688 | 0.689 | 0.757 | 0.695 | 2 |
| pca_32 | 0.886 | 0.886 | 0.951 | 0.881 | 32 |

Using only a few SAE neurons still works well

Given that SAE activations are effective for classifying AI and human generated text in this setting, it is natural to ask whether using only a few SAE neurons will still work well. The ideal situation would be for a few neurons to be effective and interpretable, such that they could be replaced by human-understandable measures.

Starting with all 24,576 SAE neurons at layer 8, I narrow this down to a shortlist of neurons that seem particularly good for classifying AI-generated vs. human-generated texts. I take a kitchen-sink approach, combining various selection methods: logistic regression followed by selecting large weights (i.e., Adaptive Thresholding), the Mann-Whitney U test, ANOVA F-value, and more. That yields a shortlist of ~330 neurons, though this length is arbitrary, determined by hyperparameters set in an ad hoc manner. With a shortlist of this size, it is feasible to directly apply sklearn's RFE with a step size of 1 (Adaptive Thresholding is, as far as I can tell, RFE with step=0.5, i.e., eliminating half the remaining features each round): train a classifier on the current features, eliminate the single worst feature by coefficient absolute value, and repeat until the desired number of features remains. Here is the performance of classifiers trained on the top-n neurons for different values of n:
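A minimal sketch of that final elimination step (simplified, with illustrative names; X_short is assumed to be the matrix of ~330 shortlisted neuron activations):

```python
# Illustrative sketch of the final elimination step: sklearn's RFE with step=1
# wrapped around logistic regression. X_short is assumed to be an
# (n_sequences, ~330) matrix of shortlisted SAE neuron activations.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def top_n_neurons(X_short, y, n):
    """Return indices (into X_short's columns) of the n features RFE keeps."""
    selector = RFE(
        estimator=LogisticRegression(max_iter=2000),
        n_features_to_select=n,
        step=1,  # drop the single weakest feature (smallest |coefficient|) per round
    )
    selector.fit(X_short, y)
    return [i for i, keep in enumerate(selector.support_) if keep]

# e.g., keep the 16 strongest neurons from the ~330-neuron shortlist:
# top_16 = top_n_neurons(X_short, y_train, n=16)
```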

Figure 2

Performance of classifiers trained on increasingly small sets of SAE neurons / "features", specifically area under the curve and accuracy. Top SAE neurons and their classifiers are fit on the train set; these results are on a held-out validation set not used during development. Results on the development test set are similar.

That’s exciting! Using only a few of the top neurons is still effective for classification.

What are these neurons doing?

One hypothesis for what's going on is that these top SAE neurons correspond to specific ngrams whose frequency differs between the two text sources. For instance, some people have observed that ChatGPT uses the term "delve" much more than human authors; maybe one of the SAE neurons is just picking up on the word "delve", or this dataset's equivalent. Some of my manual inspection of top-activating text samples also indicated that SAE neurons may correspond closely to ngrams. If this hypothesis were true in its strongest form, it would mean SAEs are not useful for this kind of text analysis, as they could be entirely replaced by the much easier method of ngram counting.

To test this hypothesis, I obtain ngram counts for the texts using CountVectorizer with an ensemble of a few parameter choices. The ensemble matters because using just analyzer='word' would miss important details; e.g., one common ngram in AI-generated text in my dataset is ". The" (sentences starting with "The"). I then compute correlation scores between each SAE neuron and each of the 104,830 ngrams in this vocabulary, and plot the highest correlation each SAE neuron has with any ngram (note this is an imprecise method because it compares sequence-level counts/means, whereas the underlying thing being predicted would be closer to the word or token level).
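A simplified sketch of this step (the vectorizer parameters here are placeholders, not my exact ensemble; texts is a list of documents and sae_acts an array of sequence-level mean SAE activations):

```python
# Illustrative sketch: build an ngram count matrix from word- and character-level
# CountVectorizers so that ngrams like ". The" are not lost, then compute each
# SAE neuron's highest |Pearson r| with any ngram at the sequence level.
# `texts` is a list of documents; `sae_acts` is an (n_docs, n_neurons) array.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

def ngram_matrix(texts):
    vectorizers = [
        CountVectorizer(analyzer="word", ngram_range=(1, 2)),
        CountVectorizer(analyzer="char", ngram_range=(3, 5)),
    ]
    return hstack([v.fit_transform(texts) for v in vectorizers])

def max_ngram_correlation(sae_acts, counts):
    """For each SAE neuron, the max |Pearson r| with any ngram column."""
    A = (sae_acts - sae_acts.mean(0)) / (sae_acts.std(0) + 1e-8)
    C = counts.toarray().astype(float)  # densified for simplicity; fine for a modest dataset
    C = (C - C.mean(0)) / (C.std(0) + 1e-8)
    corr = np.abs(A.T @ C) / len(A)     # (n_neurons, n_ngrams)
    return corr.max(axis=1)
```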

Figure 3

Compared to all SAE neurons, the top-differentiating neurons have, on average, higher max correlations with an ngram. Data are from the train set; the y-axis is log-scaled.

These top-differentiating neurons seem to be more ngram-related than average SAE neurons. Another test that bears on this relationship between SAE neurons and ngrams is to take the highest-correlating ngrams for a given SAE neuron and compare their classification performance against that of the neuron itself. I do this for the top 16 differentiating SAE neurons. Rather than using the top-k correlating ngrams directly, I take the top 1024 and apply RFE for classification performance, so the results do not reflect the top-k correlating ngrams per se. For many of these neurons, 8 highly correlating ngrams are enough to match their performance, and for a couple, around 70% accuracy can be achieved with just 2 ngrams. I also manually inspect a few of the top SAE neurons.
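A rough sketch of one such per-neuron comparison (simplified: it uses the k most correlated ngrams directly rather than RFE over the top 1024; counts and y as in the earlier sketch):

```python
# Illustrative sketch: for one SAE neuron, compare a classifier on the neuron's
# activation alone against a classifier on its k most correlated ngram columns.
# `neuron_col` is the (n_docs,) activation vector for one SAE neuron,
# `counts` the sparse ngram matrix, `y` the binary labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def neuron_vs_ngrams(neuron_col, counts, y, k=8):
    C = counts.toarray().astype(float)
    a = (neuron_col - neuron_col.mean()) / (neuron_col.std() + 1e-8)
    C_std = (C - C.mean(0)) / (C.std(0) + 1e-8)
    corr = np.abs(C_std.T @ a) / len(a)   # |Pearson r| of each ngram with the neuron
    top_k = np.argsort(corr)[-k:]         # indices of the k most correlated ngrams

    clf = LogisticRegression(max_iter=2000)
    acc_neuron = cross_val_score(clf, neuron_col.reshape(-1, 1), y, cv=5).mean()
    acc_ngrams = cross_val_score(clf, C[:, top_k], y, cv=5).mean()
    return acc_neuron, acc_ngrams
```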

Generally, these top-differentiating SAE neurons don't seem particularly easy to understand, but they are interesting. I didn't put much effort into interpreting them, nor did I use the best existing techniques, so I expect more time and effort would yield better explanations. Here are some notes on the 4 best differentiating SAE neurons I find:

One of the goals of this project is to extract human-understandable information from SAE neurons so that we can do AI text detection in a more manual (or automated, but without AIs) and safe way. I therefore also run Adaptive Thresholding + RFE on ngrams: ngram counting is highly understandable, and it would be convenient if a few ngrams could be used effectively for AI text detection.

Figure 4

Top ngram performance on held-out validation set. Using a small number of ngrams underperforms using a small number of SAE neurons, but it is still impressive.

Discussion

Motivation

AI text detection is an important problem

This topic has been discussed at length elsewhere, and I don’t have too much to add: detecting AI generated text and the “digital fingerprint” of AI agents seems beneficial for mitigating negative societal impacts in a world with AIs running around autonomously or being used by malicious actors.

Separately, I think AI text detection, or at least some subsets of it, may be useful for reducing large scale risks when AIs are being run under Control protocols (i.e., without Rogue Deployments [LW · GW]). Two particular risks come to mind: collusion, and sandbagging + exploration hacking (which in some sense is a form of collusion).

Collusion could arise from multiple AI systems sharing malicious information, for example an AI that is being evaluated by other AIs trying to bargain with them. Even without explicit communication, there are certainly some risks [LW · GW] from collusion, but AIs being able to reliably communicate with each other likely makes collusion easier. Therefore, if we can make it difficult for AIs to know whether they are talking to other AIs or to human red-teamers, plotting becomes more difficult. This is relevant to AI text detection from a somewhat counterintuitive angle: making AI-generated text indistinguishable from human text might be good here, as it prevents AIs from reliably identifying each other and colluding via direct communication. Studying AI text detection, and the mechanisms AIs themselves use for AI text detection, could help prevent collusion risks.

Sandbagging is a risk where an AI strategically underperforms during evaluation to hide its true capabilities. Exploration hacking [AF · GW] is a version of sandbagging in which an AI being trained in a non-supervised setting hides its capabilities by never taking high-performance actions during training. In a supervised setting, we should be able to disincentivize the cognition that caused the model to sandbag, because we have known high-performance trajectories that we can train on (i.e., push the model's cognition [LW · GW] so it is more likely to pursue them). Successfully pulling off an exploration hacking strategy would require identifying whether you are in a supervised or RL setting. In particular, a model might try to identify whether a copy of itself, some other model, or humans are responsible for creating the data it has seen so far in the problem/solution: if all the reasoning so far was human generated, a supervised setting is more likely. Similar to collusion, understanding to what extent AIs can detect human vs. their own vs. other AI-generated text may help mitigate this problem. Note that collusion [LW · GW] and sandbagging [LW · GW] risks may not actually occur in the real world; it's important to have empirical work to determine how likely they are so we can prioritize defensive resources.

It might be time for narrow “Microscope AI”

Microscope AI, as introduced by Chris Olah and discussed [AF · GW] by Evan Hubinger, is an approach to AI safety in which we build powerful AI systems, but rather than directly running these AIs (which may pose various dangers), we extract knowledge they have learned about the world and directly apply that knowledge ourselves (or via some safe process). There are two large hurdles for Microscope AI: capabilities (we need AIs that have actually learned novel and interesting insights) and interpretability (we need to be able to extract that knowledge and transform it into human-understandable terms). For most domains or tasks I want to address, I don't think the capabilities are there, and the interpretability tools are still in their early stages. However, I think some domains might be a good fit for narrow applications of this approach.

AI text detection is a particularly good area for this

There are a few reasons AI text detection looks like a good fit for trying Microscope AI:

Potential Next Steps

There are many ways to keep going on this project. In no particular order, I think some of the important ones are:

Considerations in working on this

I won’t continue working on this project, but if somebody else wants to push it forward and wants mentorship, I’m happy to discuss. Here are some reasons to be excited about this research direction:

But ultimately I won’t be working on this more. The main reasons:

Appendix

Random Notes:

  1. ^

     I do not directly compare against state-of-the-art techniques for AI text detection; I am quite confident they would perform better than SAEs.

  2. ^

    Thanks, Egg, for the link

  3. ^

     I briefly investigated using other classification techniques, specifically DecisionTree, RandomForest, SVC, and KNeighbors. None of these consistently performed better than LogisticRegression for any of the baselines I tried in preliminary testing, and they took longer to run, so I focused on LogisticRegression.

  4. ^

     I think the most obvious baselines I did not run would be fine-tuning RoBERTa and using GLTR. The paper which introduces the dataset I use claims that various standard methods aren't particularly effective zero-shot (Table 2, Figure 4), but that fine-tuning classifiers is highly effective. These are not a one-to-one comparison to my experiments, as I use a small subset of the data, truncate sequences to be fairly short, and only focus on the "Generated" and human subsets.

  5. ^

    There are also plenty of negative results in using AIs for AI text detection. 
