Analysing Adversarial Attacks with Linear Probing

post by Yoann Poupart, Imene Kerboua (imenelydiaker), Clement Neo (clement-neo), Jason Hoelscher-Obermaier (jas-ho) · 2024-06-17


This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.

Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any setup change, experiment advice, or idea is welcome, as are pointers to relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper's post: EIS IX: Interpretability and Adversaries.

Code is available on GitHub (still a work in progress).

TL;DR

We train linear probes to detect simple visual concepts (colours, shapes, backgrounds) in the intermediate layers of an image classifier, and use them to analyse what adversarial attacks do to the model's internal representations. In our preliminary experiments, targeted attacks gradually suppress the concepts of the source class and activate the concepts of the target class, mostly in later layers, which suggests that cheap linear probes could provide a naive detection signal.

Introduction

Motivation

Adversarial samples are a concerning failure mode of deep learning models. It has been shown that by optimising an input, e.g. an image, an attacker can shift the output of the targeted model at will. This failure mode appears repeatedly across domains such as image classification and text encoders, and more recently in multimodal models, where it enables arbitrary outputs or jailbreaking.

Figure 1: Experimental setup to analyse an adversarial attack using linear probes. Given a classifier trained on image samples, we train linear probes to detect the concepts activated by each input image. Each linear probe is specialised in one concept. In this example, an image of a lemon should activate the concept "ovaloid". When the input is attacked, the probes can detect other concepts related to the targeted adversarial class.

Hypotheses to Test

This work was partly inspired by the feature-vs-bug world hypothesis. In our particular context, this would imply that adversarial attacks are caused either by meaningful feature directions being activated (feature world) or by high-frequency directions overfitted by the model (bug world). The hypotheses we would like to test are thus the following:

1. Feature world: adversarial attacks activate meaningful concept directions associated with the target class, which linear probes trained on those concepts should pick up.
2. Bug world: adversarial attacks exploit high-frequency directions overfitted by the model, in which case concept probes should remain largely unaffected.

Post Navigation

First, we briefly present the background we rely on for adversarial attacks and linear probing. Then we present our experimental setup and showcase experiments to understand the impact of adversarial attacks on linear probe outputs, and whether we can detect attacks naively. Finally, we present the limitations and perspectives of our work before concluding.

Background

Adversarial Attacks

For our experiments, we begin with the most basic and well-known adversarial attack, the Fast Gradient Sign Method (FGSM). This intuitive method takes advantage of the classifier's differentiability to optimise the adversary's goal (misclassification) using gradient steps. It can be described as:

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(f_\theta(x), y)\right)$$

with $x$ and $y$ the original image and label, $x_{adv}$ the adversarial example and $\epsilon$ the perturbation amplitude. This simple method can be derived in an iterative form:

$$x_{adv}^{t+1} = \Pi_\epsilon\left(x_{adv}^{t} + \alpha \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(f_\theta(x_{adv}^{t}), y)\right)\right)$$

and seen as a projected gradient descent, with the projection $\Pi_\epsilon$ ensuring that $\|x_{adv} - x\|_\infty \leq \epsilon$. In our experiments, $\epsilon$ is kept fixed to a small value.
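As a rough illustration, here is a minimal PyTorch sketch of the iterative attack described above, in its targeted form (as used in our Lemon -> Tomato experiments); the model, labels, step size and budget below are placeholder assumptions, not the exact configuration of our runs.

```python
import torch
import torch.nn.functional as F

def targeted_iterative_fgsm(model, x, target_label, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Targeted iterative FGSM: push `x` towards `target_label` while staying
    inside an L-infinity ball of radius `epsilon` around the original image."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)
        # Targeted attack: we *descend* the loss towards the target class.
        loss = F.cross_entropy(logits, target_label)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() - alpha * grad.sign()
        # Projection onto the epsilon-ball, then back into the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x_orig + epsilon), x_orig - epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv
```

With `steps=1`, `alpha=epsilon` and the sign flipped, this reduces to the one-shot untargeted FGSM formula above.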

Linear Probing

Linear probing is a simple idea where you train a linear model (the probe) to predict a concept from the internals of the interpreted target model. The prediction performance is then attributed to the knowledge contained in the target model's latent representation rather than to the simple linear probe. Here, we are mostly interested in binary concepts, e.g. feature presence/absence or adversarial/non-adversarial, and the optimisation can be formulated as minimising the following loss:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[c_i \log p(h_i) + (1 - c_i)\log\big(1 - p(h_i)\big)\Big]$$

where $h_i$ is the latent activation of sample $i$, $c_i \in \{0, 1\}$ its concept label, and $p$ the probe. In our experiments, $p$ is a logistic regression trained with a norm penalty for regularisation.
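In practice, this amounts to fitting a logistic regression on cached activations; here is a minimal sketch under the assumption that per-layer activations and concept labels have already been saved to disk (the file names and hyperparameters are illustrative, not our exact configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical cached activations for one layer, shape (n_samples, hidden_dim),
# and binary labels (1 = concept present) for one concept, e.g. "yellow".
train_acts = np.load("activations_train_layer11.npy")
train_labels = np.load("concept_yellow_train.npy")
test_acts = np.load("activations_test_layer11.npy")
test_labels = np.load("concept_yellow_test.npy")

# The probe itself: a plain regularised logistic regression on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_acts, train_labels)

# Precision/recall per layer are the metrics reported in the sanity checks below.
preds = probe.predict(test_acts)
print("precision:", precision_score(test_labels, preds))
print("recall:", recall_score(test_labels, preds))

# Concept probability for new activations (used throughout the analysis).
concept_probs = probe.predict_proba(test_acts)[:, 1]
```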

Experiments

Models Setup

Dataset Creation

In order to explore image concepts, we first focused on a small category of images, fruits and vegetables, before choosing easy concepts like shapes and colours. We proceeded as follows (details in the appendix):

1. Collect a fruits-and-vegetables image classification dataset from Kaggle.
2. Clean it: deduplicate images and remove samples present in both the train and test sets.
3. Manually label each image with the concepts it contains (colour, shape, background), with two annotators and a conflict-resolution pass.

Classifier Training

Probe Training

Sanity Checks

The goal is to validate that the probes detect the concepts themselves and not something else correlated with the class (we are not yet fully confident in our probes; more on this in the limitations).

Figure 2: Metrics for probes trained to detect the "red" and "yellow" concepts given a layer's activations. The probes seem to detect the concepts across all layers, with roughly equivalent precision and recall.
Figure 3: Metrics for a probe trained to detect the “stem” and “sphere” concepts given a layer’s activations. The probes seem to detect the concepts better in later layers.

Analysing Adversarial Attacks with Linear Probing

Goal

Our goal is to see what kind of features (if any) adversarial attacks find. Concretely, we look for an increase in the probes' confidence for concepts that are unrelated to the input sample but related to the target sample.

Setup

Lemon -> Tomato

For this experiment, we show that, given an input image of a lemon, only a few steps of adversarial perturbation are required for the model to output the desired target, a tomato in our case (see Figures 4-6).

Figure 4: Logit curve (left) and the final adversarial image (right) misclassified as a tomato.
Figure 5: Probability of the ovaloid probe for layer 11 at different steps of the adversarial attack. The ovaloid concept disappears through the attack.

We compare the appearance of two concepts, "ovaloid" and "sphere", in the last layer of the vision encoder; these are the shapes of the lemon and the tomato respectively. Figure 5 shows that the concept "ovaloid", which characterises the lemon, disappears gradually as the adversarial steps increase, while the "sphere" concept marking the tomato starts to appear (see Figure 6).

Figure 6: Probability of the sphere probe for layer 11 at different steps of the adversarial attack. The sphere concept appears through the attack.
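For reference, the probability-vs-step curves in Figures 5 and 6 can be produced with a loop of the following shape; `get_layer_activations` and `attack_step` are hypothetical helpers standing in for our actual pipeline, and the probes are the per-concept logistic regressions described earlier.

```python
import numpy as np

def probe_trajectories(model, probes, x, target_label, layer=11, steps=10):
    """Record each concept probe's probability on the CLS token of `layer`
    after every attack step. `probes` maps concept names (e.g. "ovaloid",
    "sphere") to trained scikit-learn probes for that layer."""
    history = {name: [] for name in probes}
    x_adv = x.clone()
    for step in range(steps + 1):
        acts = get_layer_activations(model, x_adv, layer=layer)  # (n_tokens, hidden_dim)
        cls_act = acts[0:1]  # CLS token, the one used for classification
        for name, probe in probes.items():
            history[name].append(probe.predict_proba(cls_act)[0, 1])
        if step < steps:
            x_adv = attack_step(model, x_adv, x, target_label)  # one iterative-FGSM step
    return {name: np.array(values) for name, values in history.items()}
```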

Lemon -> Tomato (Layer 6 & Layer 0)

We note that the concept "sphere", which characterises a tomato, does not appear and remains unchanged in the first layers of the encoder (see Figures 7 and 8 for layers 0 and 6 respectively), despite the adversarial steps.

Figure 7: Probability of the sphere probe for layer 0 at different steps of the adversarial attack. The sphere concept is unchanged through the attack.
Figure 8: Probability of the sphere probe for layer 6 at different steps of the adversarial attack. The sphere concept is unchanged through the attack.

Lemon -> Banana

We now focus on a concept that characterises both a lemon and a banana: the colour "yellow". The adversarial attack does not change the representation of this concept, across either layers or adversarial steps (see Figures 9, 10 and 11).

Figure 9: Probability of the yellow probe for layer 11 at different steps of the adversarial attack. The yellow concept is unchanged through the attack.
Figure 10: Probability of the yellow probe for layer 6 at different steps of the adversarial attack. The yellow concept is unchanged through the attack.
Figure 11: Probability of the yellow probe for layer 0 at different steps of the adversarial attack. The yellow concept is unchanged through the attack.

Tomato -> Lemon

Figure 12: Probability of the sphere probe for layer 11 at different steps of the adversarial attack. The sphere concept disappears through the attack.

Concept Probability Through Layers: Naive Detection

This section contains early-stage claims that are not yet fully backed, so we would love to hear feedback on this methodology.

We spotted interesting concept activation patterns across layers which, once formalised, could help detect adversarial attacks. Two such patterns are illustrated in Figures 13 and 14 below.

Figure 13: Evolution of the probes' probability across layers for the original image and its adversarial attack. The orange line shows the probability of the concept's appearance (sphere) before the attack, and the blue line shows it at the 10th step of the attack.
Figure 14: Evolution of the probes' probability during the adversarial attack for different layers. The mean probability over the tokens is shown on the left, and the probability on the CLS token (used for classification) on the right.

Looking at the concept curves (e.g. on the CLS token in Figure 14), we see that the transition happens mostly at the last layer. This could indicate that our adversarial attacks mostly target later layers, and could thus be detected by spotting incoherent layer-wise patterns like the probabilities in Figure 14 (L0: 0.2 -> L4: 0.35 -> L8: 0.8 -> L10: 0.0).

Intuitively, this would only work for simple concepts that are already represented in the early layers.
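As a rough sketch of what a formalised detector could look like under this intuition, the rule below flags an input when a concept that is strongly present at some intermediate layer collapses in the final layer; the thresholds are arbitrary placeholders, not tuned values.

```python
import numpy as np

def is_suspicious(layer_probs, presence_threshold=0.7, drop_threshold=0.4):
    """Flag an incoherent layer-wise concept profile.

    `layer_probs` is a concept's probability per layer (e.g. read off the CLS
    token probes), such as the [0.2, 0.35, 0.8, ..., 0.0] profile of Figure 14."""
    probs = np.asarray(layer_probs, dtype=float)
    peak = probs[:-1].max()      # strongest presence before the last layer
    final = probs[-1]            # probability at the last layer
    return bool(peak >= presence_threshold and peak - final >= drop_threshold)

print(is_suspicious([0.2, 0.35, 0.8, 0.0]))  # True: sharp collapse at the last layer
print(is_suspicious([0.1, 0.2, 0.6, 0.8]))   # False: coherent monotone profile
```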

We present what we are excited to tackle next in the perspectives section.

Limitations

Interpretability Illusion

Interpretability is known to suffer from illusion issues, and linear probing is no exception. We are not fully confident that our probes measure their associated concepts. In this respect, we aim to further improve and refine our dataset.

You Need the Concepts in Advance

Introduced Biases

We trained one probe per concept and per layer (on all visual tokens). Our intuition is that, for such simple concepts, information is mixed by the attention mechanism and represented similarly across tokens.
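A sketch of this design choice, under the assumption that we cache per-token activations of shape (n_images, n_tokens, hidden_dim) and give every token the label of its image; the array names are illustrative and this is not necessarily the exact training scheme we used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

acts = np.load("acts_layer11_all_tokens.npy")   # hypothetical: (n_images, n_tokens, hidden_dim)
labels = np.load("concept_yellow_labels.npy")   # hypothetical: (n_images,)
n_images, n_tokens, hidden_dim = acts.shape

# One probe per concept and per layer, trained on all visual tokens:
# every token inherits the concept label of its image.
token_acts = acts.reshape(n_images * n_tokens, hidden_dim)
token_labels = np.repeat(labels, n_tokens)
probe = LogisticRegression(max_iter=1000).fit(token_acts, token_labels)

# At analysis time the probe can be read on the mean over tokens or on the
# CLS token alone (the left and right panels of Figure 14 respectively).
mean_probs = probe.predict_proba(acts.mean(axis=1))[:, 1]
cls_probs = probe.predict_proba(acts[:, 0, :])[:, 1]
```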

Truly-Bug Attacks

We are not tackling adversarial attacks that have been labelled as truly bugs. We are interested in testing our method on them, as they could remain a potential threat.

Perspectives

We now describe what we are excited to test next and present possible framings of this study for future work.

What’s next?

First, we want to make sure that we are training the correct probes. Some concepts are entangled with proxies, e.g. "pulp" is mostly equivalent to "orange + lemon". To address this, we are thinking of ways to augment our labelled dataset using basic computer-vision tricks.

We are moving to automated labelling using GPT-4V, while keeping a human in the loop to oversee the produced labels. We are also thinking about creating concept segmentation maps, as the localisation of concepts might be relevant (cf. the heatmaps above).

Over the longer term, we are also willing to explore the tweaks described in the following subsections.

Multimodal Settings

Representation Engineering

Related Work

Analysing Adversarial Attacks

Adversarial Attacks on Latent Representation

Universal Adversarial Attacks (UAPs)

Visualising Adversarial Attacks

Conclusion

Our method doesn't guarantee that we'll be able to detect every adversarial attack, but it showcases that some mechanistic measures can serve as sanity safeguards. These linear probes are costly neither to train nor to use at inference time, and they can help detect adversarial attacks and possibly other kinds of anomalies.

Appendix

Dataset description

We collected a fruits-and-vegetables image classification dataset from Kaggle (link to be added). The dataset was cleaned and manually labelled by experts. The preprocessing included a deduplication step to remove duplicate images, and a filtering step for images present in both the train and test sets, which would otherwise cause data leakage.
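A minimal sketch of the deduplication and leakage filtering described above, using exact byte-level hashes; the directory layout, file extension and hashing choice are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def image_hash(path: Path) -> str:
    # Exact-duplicate detection; a perceptual hash could catch near-duplicates too.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def dedup_and_filter(train_dir: str, test_dir: str):
    """Drop duplicate images inside the train set, then drop test images
    that also appear in the train set (to avoid train/test leakage)."""
    train_hashes, kept_train = set(), []
    for path in sorted(Path(train_dir).rglob("*.jpg")):
        h = image_hash(path)
        if h not in train_hashes:
            train_hashes.add(h)
            kept_train.append(path)

    kept_test = [
        path for path in sorted(Path(test_dir).rglob("*.jpg"))
        if image_hash(path) not in train_hashes
    ]
    return kept_train, kept_test
```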

For the annotation process, two annotators labelled each image with the concepts that appear in it; each image is associated with at least one concept. An additional step was required to resolve annotation conflicts and correct the concept labels.

After preprocessing, the train set contains 2,579 samples and the test set 336 samples, covering 36 classes each; the class distributions over both sets are shown in Figure 17.

Figure 17: Class distributions over both the train and test sets.

The chosen concepts concern the colour, shape and background of the images. Figure 18 shows the distribution of concepts over the whole dataset.
