What is Interpretability?

post by RobertKirk, Tomáš Gavenčiak (tomas-gavenciak), Ada Böhm (spirali) · 2020-03-17T20:23:33.002Z · LW · GW

Contents

    Interpretability, transparency, explainability?
  Dimensions for interpretability
    Goals
    Enabling Humans
    A Map of Methods
    Other dimensions
  What’s next?
  Appendix: Previous Literature

In this post we lay out some ideas around framing interpretability research which we have found quite useful. Our framing is goal-oriented, which we believe is important for making sure interpretability research is meaningful. We also go over a variety of dimensions which we think are useful to consider when thinking about interpretability research. We wanted to have a shared vocabulary when talking about this kind of research, and found that these ideas helped us communicate effectively.

One of our motivations for having these thoughts and discussions is so we can understand the relevance of interpretability to alignment, and to help us think about which categories or dimensions of interpretability research are important for alignment of strong AI. In a coming post we discuss interpretability and alignment, using the ideas from this post and other previous writing on the subject.

Interpretability, transparency, explainability?

A quick note on terminology. Many terms are used to mean similar things in the interpretability research literature. We think of Interpretability as the broadest possible category, which is why it can be hard to define and operationalise. Apart from interpretability, two terms are used frequently: explainability and transparency. Following the definitions in The Mythos Of Model Interpretability (summarised in more detail in the appendix), we’ll use them as follows:

Dimensions for interpretability

We found two dimensions most useful to consider when thinking about interpretability. The first dimension (Goals) is concerned with which goals interpretability research aims to fulfil. The second (Enabling Humans) refers to the idea that interpretability is human-centred: we feel that an interpretability method normally needs a human involved in interpreting the model for it to be meaningful. We could also consider agents acting on behalf of humans (e.g. in amplification or recursive reward modelling). These aren’t the only axes on which to consider interpretability research (see Other Dimensions), but they are the two we felt were the most useful.

Goals

Possibly the most important dimension to consider is the Goals of interpretability methods. We describe five generic goals (not specific to certain domains). We found that this helps us the most when thinking about what kinds of interpretability methods we should develop and what a specific interpretability method is hoping to achieve. The list is descriptive, in that these are (generic) goals researchers have put forward for using interpretability methods, or that interpretability methods have been used to achieve. There’s overlap between these goals: the first three in particular are interrelated, as are the last two.

The goals are:

A key caveat is that many methods might not target a specific goal in the above list, but instead aim to produce in the end-user of the interpretability method some Generic Understanding or a Gearful Model of the object being interpreted. This understanding could then serve in the fulfilment of a variety of goals, such as aiding the user in predicting behaviour, or understanding why the model fails and hence improving its performance. A key example of these kinds of Generic Understanding methods is the series of methods based on Feature Visualisation, especially the Activation Atlas. The Activation Atlas paper shows Generic Understanding being cashed out: the authors use it to produce adversarial examples aimed at specific things using only overlapping natural images, which is a kind of Predicting Behaviour.

We want to stress, however, that validating whether an interpretability method produces genuine understanding seems very difficult (or perhaps impossible) to do without seeing whether the understanding can be cashed out to achieve one of the goals above. The aim when trying to produce understanding should not be for the user to just feel like they understand the method, but rather for the user’s understanding to enable them to better achieve one of the goals above. If a method produces some fuzzy feeling of understanding but doesn’t seem useful for any downstream task, then it’s worth considering what it’s useful for.

Enabling Humans

The second dimension, Enabling Humans, is about whether the method enables a human to achieve one of the goals above, as opposed to achieving the goal directly. That is, there are many methods or algorithms in machine learning which Improve model performance, such as the Adam optimiser or adding more layers to your neural network. These techniques are obviously not interpretability techniques, but it’s useful to think about the difference between them and a technique we would call interpretability which does enable an improvement in model performance.

We think the key difference is the existence of a human in the chain between the interpretability method and, in this case, the improvement of model performance. An interpretability method will Enable Humans to achieve the goals above. For a method such as automatic verification (which could be considered interpretability) the verifier is doing the vast majority of the work, but the human still needs to decide what to verify, i.e. they need to ask the right questions. On the other hand, methods such as feature visualisation and saliency maps only provide some visualisation of the model behaviour or internals, and hence still require human effort to achieve one of the goals above.
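To make the saliency-map case concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy logistic `model`, the `WEIGHTS` and `BIAS` values, and the finite-difference `saliency` function are hypothetical stand-ins, not a method taken from any of the papers discussed here. The point is only that the method's output is a list of numbers, and a human still has to decide what they mean for one of the goals above.

```python
import math

# Hypothetical toy model: a logistic unit standing in for the network
# being interpreted. The weights are made up for illustration.
WEIGHTS = [2.0, -0.5, 0.0]
BIAS = 0.1


def model(x):
    """Return the model's output (a probability) for input vector x."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def saliency(x, eps=1e-5):
    """Estimate |d output / d x_i| for each input feature.

    Uses central finite differences, so it needs no access to the
    model's internals.
    """
    scores = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad = (model(up) - model(down)) / (2 * eps)
        scores.append(abs(grad))
    return scores


x = [1.0, 1.0, 1.0]
scores = saliency(x)
# Feature indices ordered from most to least salient.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The sketch stops at the ranking on purpose. Whether "feature 0 matters most" helps to predict behaviour, debug the model, or gain assurance of some property is exactly the step that the human in the chain has to supply.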

For some of the goals above this is basically guaranteed: when trying to gain Assurance of Properties, we’re the human in the loop, and the method could give us these assurances or enable us to find them for ourselves. However, as mentioned above, for Improving model performance the distinction becomes more necessary.

This dimension is more difficult to reason about than the Goals dimension, and is hence less fleshed out. However, we think it’s just as important to consider what the role of the human is in an interpretability method’s use.

A Map of Methods

These dimensions aren’t an operationalisation of interpretability and don’t present necessary and sufficient conditions, and we’re uncertain whether it’d be useful to have a strict dividing line. We do think that the framing is useful even if it’s not a formal definition of what interpretability is, as it helped us to discuss what goals interpretability methods are trying to achieve, and whether they’re enabling humans to achieve those goals or not. To test out our ideas with a variety of current methods, we produced a map of current interpretability methods on these two dimensions. In earlier formulations this helped us to spot problems with the description of the framing, relations between various parts of it, and similar patterns of goal behaviour from different methods. This is just an illustration of applying some of our ideas to current methods. A bold line indicates that the method helps more with that goal. We don’t want to stake a lot on exactly where the bold lines are drawn.

[Figure: Map of methods]

Other dimensions

The two dimensions described above were the ones we felt were the most useful at a higher level when thinking about generating ideas for interpretability methods. However, there are a few other dimensions which we think are also important to think about. These can also be useful in a generative sense: if we draw out the matrix of all dimensions, we can consider whether methods at certain points of the matrix would be useful, and how they might look. These other important dimensions are:

What’s next?

While the ideas in this post helped us deconfuse and discuss interpretability research in general, we think an ideal use-case for the framing (which also motivates us) is talking about interpretability in AI alignment. This problem setting is different from those tackled in the standard machine learning literature: We’ll care about a subset of the goals described above, and the way in which a method enables humans will also deserve more scrutiny. A second post coming soon will focus on this issue, using the ideas in this post and other previous writing on the subject. To give an (obvious) sneak preview, Persuasion of Properties is a goal we would want to avoid explicitly optimising for when thinking about aligning strong AI.

Appendix: Previous Literature

There have been previous works addressing the problem of defining and discussing interpretability research. Here we summarise the two works which seemed to be the most discussed.

The Mythos of Model Interpretability tackles the problem that interpretability is an under-defined term. The author first presents five desiderata which cover most interpretability researchers’ motivations for doing interpretability research: Trust, Causality, Informativeness, Transferability, and Fair and ethical decision making. In many problems, the problem specification used to train the model (i.e. the loss function, which normally just optimises accuracy) does not capture all of the desired properties of the model (e.g. not discriminating between different races, having an understanding of causality, being transferable to other similar problems). Hence, optimising to solve this problem specification will not necessarily produce these properties. In these cases, Interpretability can be used to validate that these extra properties hold. This point is made both in this paper and in the following one.
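The underspecification point can be illustrated with a small Python sketch. It is not from the paper: the dataset `DATA`, the two models, and the `invariant_to_protected` check are all hypothetical, and a direct behavioural check stands in here for whatever interpretability method would be used in practice. It shows two models that the accuracy-only specification cannot tell apart, although only one of them has the extra desired property.

```python
# Hypothetical data: (feature, protected_attribute, label).
DATA = [(0.9, 0, 1), (0.8, 1, 1), (0.2, 0, 0), (0.1, 1, 0)]


def model_a(feature, protected):
    """Ignores the protected attribute entirely."""
    return int(feature > 0.5)


def model_b(feature, protected):
    """Quietly shifts its decision using the protected attribute."""
    return int(feature + 0.35 * protected > 0.5)


def accuracy(model):
    """Fraction of examples in DATA that the model labels correctly."""
    correct = sum(model(f, p) == label for f, p, label in DATA)
    return correct / len(DATA)


def invariant_to_protected(model):
    """Check that flipping the protected attribute never changes the output."""
    return all(model(f, p) == model(f, 1 - p) for f, p, _ in DATA)


for name, m in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(m), invariant_to_protected(m))
```

Both models score the same on the training objective, so optimising that objective gives no reason to prefer one over the other. The difference only shows up when someone asks a question about the extra property.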

The paper describes a taxonomy of interpretability methods, with the top level topics of transparency and post-hoc interpretability:

Towards A Rigorous Science of Interpretable Machine Learning echoes the point made in the above paper that interpretability research is aimed at addressing the case where the loss function a model is optimised for doesn’t capture all of the desired properties of the trained model. To the list of possible use-cases or problem underspecification cases it adds having a Multi-objective trade-off or only having trained with a Proxy reward function. It provides a semi-formal description of the variety of ways in which interpretability methods or models can be validated, and the links between these methods and the desiderata or methods described above.

Some papers which we haven’t looked into but which also seem to tackle a similar topic:

This work was done as part of the AISRP 2019 https://aisrp.org/
