Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

post by lifelonglearner, Peter Hase (peter-hase) · 2021-04-09T19:19:42.826Z · LW · GW · 15 comments


  Master List of Summarized Papers
  Our Opinions by Area
  Paper Summaries
    Theory and Opinion (5)
    Evaluation (9)
      Estimating Feature Importance (10)
      Interpreting Representations and Weights (5)
      Generating Counterfactuals and Recourse Procedures (4)
      Explanation by Examples, Exemplars, and Prototypes (4)
      Finding Influential Training Data (2)
      Natural Language Explanations (8)
      Developing More Easily Interpreted Models (6)
      Robust and Adversarial Explanations (6)
      Unit Testing (1)
    Explaining RL Agents (8)
    Interpretability in Practice (2)
  Additional Papers
    Theory and Opinion (12)
    Evaluation (10)
    Estimating Feature Importance (16)
    Interpreting Representations and Weights (7)
    Generating Counterfactuals and Recourse Procedures (8)
    Explanation by Examples, Exemplars, and Prototypes (4)
    Finding Influential Training Data (4)
    Natural Language Explanations (9)
    Developing More Easily Interpreted Models (3)
    Robust and Adversarial Explanations (5)
    Explaining RL Agents (2)
    Datasets and Data Collection (1)
    Interpretability in Practice (2)

Peter Hase
UNC Chapel Hill

Owen Shen
UC San Diego

With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post.


Model interpretability was a bullet point in Concrete Problems in AI Safety (2016). Since then, interpretability has come to comprise entire research directions in technical safety agendas (2020); model transparency appears throughout An overview of 11 proposals for building safe advanced AI [AF · GW] (2020); and explainable AI has a Twitter hashtag, #XAI. (For more on how interpretability is relevant to AI safety, see here [AF · GW] or here [AF · GW].) Interpretability is now a very popular area of research: it was the most popular area in terms of video views at ACL last year, and it is mainstream enough that there are books on the topic and corporate services promising it.

So what's the state of research on this topic? What does progress in interpretability look like, and are we making progress?

What is this post? This post summarizes 70 recent papers on model transparency, interpretability, and explainability, limited to a non-random subset of papers from the past 3 years or so. We also give opinions on several active areas of research, and collate another 90 papers that are not summarized.

How to read this post. If you want to see high-level opinions on several areas of interpretability research, just read the opinion section, which is organized according to our very ad-hoc set of topic areas. If you want to learn more about what work looks like in a particular area, you can read the summaries of papers in that area. For a quick glance at each area, we highlight one standout paper per area, so you can just check out that summary. If you want to see more work that has come out in an area, look at the non-summarized papers at the end of the post (organized with the same areas as the summarized papers).

We assume readers are familiar with basic aspects of interpretability research, i.e. the kinds of concepts in The Mythos of Model Interpretability and Towards A Rigorous Science of Interpretable Machine Learning. We recommend looking at either of these papers if you want a primer on interpretability. We also assume that readers are familiar with older, foundational works like "Why Should I Trust You?: Explaining the Predictions of Any Classifier."

Disclaimer: This post is written by a team of two people, and hence its breadth is limited and its content biased by our interests and backgrounds. A few of the summarized papers are our own. Please let us know if you think we've missed anything important that could improve the post.

Master List of Summarized Papers

Our Opinions by Area

Paper Summaries

Theory and Opinion (5)

Evaluation (9)


Estimating Feature Importance (10)

Interpreting Representations and Weights (5)

Generating Counterfactuals and Recourse Procedures (4)

Explanation by Examples, Exemplars, and Prototypes (4)

Finding Influential Training Data (2)

Natural Language Explanations (8)

Developing More Easily Interpreted Models (6)

Robust and Adversarial Explanations (6)

Unit Testing (1)

Explaining RL Agents (8)

Interpretability in Practice (2)

Additional Papers

We provide some additional papers here that we did not summarize above, including very recent papers, highly focused papers, and others. These are organized by the same topic areas as above.

Theory and Opinion (12)

Evaluation (10)

Methods: Estimating Feature Importance (16)

Methods: Interpreting Representations and Weights (7)

Methods: Generating Counterfactuals and Recourse Procedures (8)

Methods: Explanation by Examples, Exemplars, and Prototypes (4)

Methods: Finding Influential Training Data (4)

Methods: Natural Language Explanations (9)

Methods: Developing More Easily Interpreted Models (3)

Methods: Robust and Adversarial Explanations (5)

Explaining RL Agents (2)

Datasets and Data Collection (1)

Interpretability in Practice (2)


We hope this post can serve as a useful resource and help start important conversations about model interpretability and AI Safety. As mentioned, please let us know if you notice any mistakes or think we missed anything that could improve the post.


Comments sorted by top scores.

comment by rohinmshah · 2021-04-19T22:53:05.466Z · LW(p) · GW(p)

Planned summary for the Alignment Newsletter:

This is basically 3 months' worth of Alignment Newsletters focused solely on interpretability wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.

The theory and conceptual clarity of the field of interpretability have improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.

There have been lots of proposals for how to evaluate interpretability methods, leading to the [problem of too many standards](https://xkcd.com/927/). The authors speculate that this is because both “methods” and “evaluation” papers don’t have sufficient clarity on what research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.

For specific methods for achieving interpretability, at a high level, there has been clear progress. There are cases where we can:

1. identify concepts that certain neurons represent,

2. find feature subsets that account for most of a model's output,

3. find changes to data points that yield requested model predictions,

4. find training data that influences individual test time predictions,

5. generate natural language explanations that are somewhat informative of model reasoning, and

6. create somewhat competitive models that are inherently more interpretable.
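
To make item 2 concrete, here is a minimal sketch of one simple way to look for a feature subset that accounts for most of a model's output. It is not taken from any of the summarized papers. The toy linear model, its weights, the all-zeros baseline, and the 90% threshold are all assumptions made for illustration, and the occlusion-plus-greedy-selection procedure is just one of many possible approaches.

```python
from typing import Callable, List, Sequence


def occlusion_importance(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Score each feature by how much the output moves when that feature
    alone is replaced by its baseline value."""
    full_output = model(x)
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        scores.append(abs(full_output - model(occluded)))
    return scores


def smallest_sufficient_subset(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    fraction: float = 0.9,
) -> List[int]:
    """Greedily keep the highest-scoring features until the kept features
    (all others set to baseline) recover `fraction` of the output change
    between the baseline input and the full input."""
    scores = occlusion_importance(model, x, baseline)
    order = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)
    full_change = model(x) - model(baseline)
    kept: List[int] = []
    for i in order:
        kept.append(i)
        masked = [x[j] if j in kept else baseline[j] for j in range(len(x))]
        change = model(masked) - model(baseline)
        if full_change != 0 and change / full_change >= fraction:
            break
    return sorted(kept)


# Hypothetical toy model: a fixed linear scorer over four features.
WEIGHTS = [5.0, 0.1, 3.0, 0.2]


def toy_model(features: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(WEIGHTS, features))


x = [1.0, 1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0, 0.0]
subset = smallest_sufficient_subset(toy_model, x, baseline, fraction=0.9)
print(subset)  # indices of the features that were kept
```

For a linear model this simply recovers the largest-weight features. For real networks, the choice of baseline and the interactions between features make the problem much harder, which is part of what the feature importance literature addresses.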

There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.

Planned opinion:

This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.

Replies from: lifelonglearner
comment by lifelonglearner · 2021-04-19T23:45:41.808Z · LW(p) · GW(p)

Hi Rohin! Thanks for this summary of our post. I think one other sub-field that has seen a lot of progress is in creating somewhat competitive models that are inherently more interpretable (i.e. a lot of the augmented/approximate decision tree models), as well as some of the decision set stuff. Otherwise, I think it's a fair assessment. I will also link this comment to Peter so he can chime in with any suggested clarifications of our opinions. Cheers, Owen

Replies from: rohinmshah
comment by rohinmshah · 2021-04-20T00:38:28.627Z · LW(p) · GW(p)

Sounds good, I've added a sixth bullet point. Fyi, I originally took that list of 5 bullet points verbatim from your post, so you might want to update that list in the post as well.

comment by danieldewey · 2021-04-10T13:28:05.261Z · LW(p) · GW(p)

This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.

Replies from: TurnTrout
comment by TurnTrout · 2021-04-10T14:53:50.253Z · LW(p) · GW(p)

I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!

Replies from: mark-xu
comment by Mark Xu (mark-xu) · 2021-04-14T22:52:58.870Z · LW(p) · GW(p)

I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.

Replies from: TurnTrout
comment by TurnTrout · 2021-04-14T23:23:01.761Z · LW(p) · GW(p)

It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below). 

[Screenshot: Here's a Wikipedia article I pasted into SuperMemo. Blue bits are the extracts, which it'll remind me to refine into flashcards later.]
[Screenshot: A cloze deletion flashcard. It's easy to make a lot of these. I like them.]

Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away. 

In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!
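
For readers unfamiliar with cloze deletions, the extract-then-cloze step described above can be sketched in a few lines. This is only an illustration of the idea, not how SuperMemo itself works internally; the function name `make_clozes`, the `[...]` placeholder, and the sample extract are assumptions made for the example.

```python
import re
from typing import List, Tuple


def make_clozes(extract: str, key_terms: List[str]) -> List[Tuple[str, str]]:
    """Return (prompt, answer) pairs, one per key term found in the extract.

    Each prompt is the extract with the first occurrence of one term
    replaced by "[...]"; the answer is the hidden term.
    """
    cards = []
    for term in key_terms:
        pattern = re.compile(re.escape(term))
        if pattern.search(extract):
            prompt = pattern.sub("[...]", extract, count=1)
            cards.append((prompt, term))
    return cards


# Hypothetical extract taken from a longer document.
extract = "SuperMemo is a spaced repetition system that focuses on incremental reading."
cards = make_clozes(extract, ["spaced repetition", "incremental reading"])
for prompt, answer in cards:
    print(prompt, "->", answer)
```

One extract yields several cards, each hiding a different term, which is why it is easy to produce many of them from a single passage.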

Replies from: adamShimi
comment by adamShimi · 2021-04-15T12:52:10.873Z · LW(p) · GW(p)

I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on Linux AFAIK, and I can't be bothered for the moment to make it work using Wine.

Replies from: TurnTrout
comment by TurnTrout · 2021-04-15T14:22:19.136Z · LW(p) · GW(p)

Apparently VMs are the way to go for PDF support on Linux.

comment by Jack R (Jack Ryan) · 2021-04-11T06:19:34.238Z · LW(p) · GW(p)

Thanks a lot for this--I'm doing a lit. review for an interpretability project and this is definitely coming in handy :)

Random note: the paper "Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction" is listed twice in the master list of summarized papers.

Replies from: lifelonglearner
comment by lifelonglearner · 2021-04-11T15:49:36.722Z · LW(p) · GW(p)

Thanks! Didn't realize we had a double entry, will go and edit.

comment by Rafael Harth (sil-ver) · 2021-04-12T09:42:13.079Z · LW(p) · GW(p)

"there are books on the topic"

Does anyone know if this book is any good? I'm planning to get more familiar with interpretability research, and 'read a book' has just appeared in my set of options.

Replies from: yingzhen-zhou, lifelonglearner
comment by Yingzhen Zhou (yingzhen-zhou) · 2021-05-20T02:47:39.814Z · LW(p) · GW(p)

I finished this book (by 'finished', I mean I read Chapters 4 through 7, three times over).

Here are my suggestions and thoughts:

  1. If you are comfortable reading online, use [this link] to read the GitBook version. A few benefits: the author corrects errors promptly, new sections appear from time to time that are only available in the online version, and lastly, dark mode is available.
  2. From the TOC you'd see the book is mainly about model-agnostic methods; it introduces most of the well-received model-agnostic concepts. The papers listed in this post are mostly about CV or NLP problems. Because my area is interpreting NNs trained on tabular data, I find the book very useful.
  3. Each section lists the "Pros" and "Cons" of the corresponding method and links to GitHub repos that implement it, in both R and Python. This is handy.
  4. The illustrations and figures are clear, and overall everything is well explained.
  5. The downside is that gradient methods (saliency maps) and concept detection (TCAV) are not described in detail. I'd recommend reading papers on those specific topics. (Plus, I also noticed that the updates to these chapters were not written by the author of this book. This is understandable, as those require people with different expertise.)
Replies from: sil-ver
comment by Rafael Harth (sil-ver) · 2021-05-20T15:01:24.463Z · LW(p) · GW(p)

Thanks a bunch for summarizing your thoughts; this is helpful.

comment by lifelonglearner · 2021-04-12T16:31:16.103Z · LW(p) · GW(p)

I have not read the book; perhaps Peter has.

A quick look at the table of contents suggests that it's focused more on model-agnostic methods. I think you'd get a different overview of the field compared to the papers we've summarized here, as an fyi.

I think one large area you'd miss out on from reading the book is the recent work on making neural nets more interpretable, or designing more interpretable neural net architectures (e.g. NBDT).