Visualizing Interpretability
post by Darold Davis (darold) · 2025-02-03
Abstract
This project aims to address gaps in machine learning (ML) interpretability with regard to visualization by investigating researchers' workflows, tool usage, and challenges in understanding model behavior. Through a survey and interviews with practitioners, I identified limitations in existing visualization tools, such as fragmented workflows and insufficient support for analyzing neuron-level attributions. Based on these findings, I developed a prototype tool to visualize neuron activations and attributions, enabling deeper insights into model decision-making. This work contributes to enhancing the understanding of ML models and improving their transparency, a critical step toward ensuring the safety and reliability of advanced AI systems.
Introduction
Understanding model behavior is critical for AI safety, as opaque systems risk unintended harmful outcomes. Improved interpretability tools help researchers audit models, detect biases, and verify alignment with intended goals. Existing tools like TensorBoard, SHAP, LIME, and Captum provide partial solutions but focus on specific tasks (e.g., feature importance). Studies (Lipton 2018; Samek et al. 2021) highlight the need for integrated, neuron-level analysis. However, no tool combines attribution mapping with activation visualization in a unified workflow, a gap this work targets.
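As a rough illustration of the kind of neuron-level analysis these libraries support only piecemeal, the sketch below uses Captum's LayerActivation and NeuronConductance on a toy PyTorch model; the model, layer, and neuron index are arbitrary placeholders and are not part of the prototype described later.

```python
# A minimal sketch (not the prototype itself) of neuron-level analysis with
# Captum's LayerActivation and NeuronConductance. The toy model, layer, and
# neuron index below are placeholders for illustration only.
import torch
import torch.nn as nn
from captum.attr import LayerActivation, NeuronConductance

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = ToyNet().eval()
inputs = torch.randn(8, 16)

# Activations of every unit in fc1 for a batch of inputs.
layer_act = LayerActivation(model, model.fc1)
activations = layer_act.attribute(inputs)  # shape: (8, 32)

# Attribution of a single fc1 neuron (index 5) to the class-0 output.
neuron_cond = NeuronConductance(model, model.fc1)
attributions = neuron_cond.attribute(inputs, neuron_selector=5, target=0)
print(activations.shape, attributions.shape)
```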
Methods
Survey
Through a short survey distributed to ML researchers, I gained insight into the tools they use (such as Transformer Lens and CircuitsVis), their workflow pain points, and desired features, including the ability to visualize finer-grained explanations for specific units of interest, among other topics.
I then analyzed the responses quantitatively (usage frequency) and qualitatively (open-ended feedback). Among the features respondents valued in existing tools, activation pattern analysis stood out as particularly significant.
Interview
Following the survey, I conducted a semi-structured interview with Shivam Raval, a Harvard PhD candidate and researcher specializing in neural network interpretability. Topics included methods for visually analyzing neuron behavior, activation maximization, and the current challenges in attributing model decisions to specific neurons.
We discussed the effectiveness of various visualization tools, emphasizing the importance of interactive visualizations for hypothesis formation and validation in research. Shivam expressed concern that researchers might be reluctant to explore new tools, which could hinder innovation. He shared his approach to visualization, focusing on design and scaling insights, while also highlighting the need for tools that enhance the research experience. The conversation then shifted to model feature analysis techniques: Shivam explained probing and patching, along with additional methods like circuit analysis and the logit lens, and stressed the significance of manipulating activations to understand model behavior, especially in safety contexts.
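To make the patching idea concrete, here is a minimal, self-contained sketch (my own illustration, not code from the interview) of activation patching with TransformerLens; the prompts, layer, and token position are arbitrary choices.

```python
# A minimal activation-patching sketch using TransformerLens. The prompts,
# layer, and position below are arbitrary examples, not from the interview.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "The Eiffel Tower is located in the city of"
corrupt_prompt = "The Colosseum is located in the city of"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache all activations on the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

layer, position = 6, -1  # patch the residual stream at one layer, last token
hook_name = utils.get_act_name("resid_pre", layer)

def patch_resid(resid, hook):
    # Overwrite the corrupted activation with the cached clean one.
    resid[:, position, :] = clean_cache[hook_name][:, position, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)
print(patched_logits.shape)  # (batch, seq_len, d_vocab)
```

Comparing the patched logits against the clean and corrupted runs is what lets a researcher attribute a behavior to a specific layer and position, which is exactly the kind of result the prototype aims to visualize.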
Results
With the qualitative data and secondary market analysis in hand, I developed a concept prototype of a web-based tool using HTML, CSS, and JavaScript (D3.js) to visualize activation patterns across network layers and enable interactive exploration of neuron contributions. The initial objective was to incorporate the insights gained throughout the research and rapidly build a proof-of-concept interactive mockup that could be iterated on in subsequent usability testing sessions.
First, the focus was on designing the project dashboard for managing interpretability visualizations across different techniques and sharing them with collaborators.
For this concept prototype, I explored three different visualizations and the inspection methods most suitable for each. What follows are brief descriptions of the visualization types I focused on for this project, the inspection methods, and screenshots of the prototype UI designs.
Sankey Diagram
A Sankey diagram is a flow visualization in which the width of each arrow represents the quantity or magnitude of flow. For neural networks, it can effectively illustrate dynamic relationships and quantitative distributions. Key aspects a Sankey diagram could visualize (a brief sketch follows the list):
- Relevance Propagation: Trace how output decisions map back to input features (e.g., Layer-wise Relevance Propagation).
- Activation Patterns: Track activation intensities for specific inputs across layers.
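The prototype renders its Sankey view in D3.js; purely as an illustration of the underlying data shaping, the sketch below builds a layer-to-layer flow with synthetic stand-in values and renders it with Plotly.

```python
# Illustrative only: the prototype draws Sankey diagrams in D3.js, but the
# same layer-to-layer "flow" data can be sketched in Python with Plotly.
# Flow values here are synthetic stand-ins for aggregated activation magnitudes.
import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(0)
layers = ["input", "layer 1", "layer 2", "output"]

# Pretend each layer has 3 aggregated neuron groups; links carry mean |activation|.
nodes, sources, targets, values = [], [], [], []
for name in layers:
    nodes += [f"{name} / group {g}" for g in range(3)]

for l in range(len(layers) - 1):
    for src in range(3):
        for dst in range(3):
            sources.append(l * 3 + src)
            targets.append((l + 1) * 3 + dst)
            values.append(float(rng.uniform(0.1, 1.0)))  # stand-in magnitude

fig = go.Figure(go.Sankey(
    node=dict(label=nodes, pad=15),
    link=dict(source=sources, target=targets, value=values),
))
fig.show()
```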
TreeMap
A treemap diagram is effective for visualizing hierarchical and part-to-whole relationships through nested rectangles, where size and color can encode quantitative or categorical variables. For neural networks, treemaps can illustrate the following aspects (a brief sketch follows the list):
- Feature Activation Intensity: Color neurons by average activation strength (e.g., ReLU outputs).
- Sparsity: Represent inactive neurons (e.g., dead neurons in poorly initialized networks).
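Again as an illustration rather than prototype code, the sketch below builds a layer-to-neuron treemap in Plotly, with rectangle size standing in for mean activation strength and synthetic values in place of real model data.

```python
# Illustrative Python analogue of the treemap view (the prototype uses D3.js):
# a layer -> neuron hierarchy where rectangle size encodes mean activation and
# near-zero values surface potentially "dead" neurons. Data here is synthetic.
import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(1)
labels, parents, values = [], [], []

for layer in range(3):
    layer_name = f"layer {layer}"
    labels.append(layer_name)
    parents.append("")   # top-level node
    values.append(0)     # parent size is accumulated from its children
    mean_act = np.clip(rng.normal(0.5, 0.4, size=8), 0, None)  # ReLU-like
    for n, act in enumerate(mean_act):
        labels.append(f"{layer_name} / neuron {n}")
        parents.append(layer_name)
        values.append(float(act))

fig = go.Figure(go.Treemap(labels=labels, parents=parents, values=values,
                           branchvalues="remainder"))
fig.show()
```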
Hexbin Scatter Plot
A hexbin scatter plot, which aggregates data points into hexagonal bins to visualize density, can effectively illustrate several aspects of a neural network model. The key areas where hexbin plots are particularly useful are (a brief sketch follows the list):
- Visualization: Activations of two neurons (or layers) over a dataset.
- Insight: Shows common activation combinations, highlighting saturation (e.g., ReLU neurons stuck at zero) or correlations between neurons.
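A quick Python analogue of this view (the prototype itself uses D3.js): a matplotlib hexbin of two synthetic, ReLU-like neuron activations, where the pile-up along the axes shows the saturation effect mentioned above.

```python
# Illustrative hexbin of two neurons' activations over a dataset; the data is
# synthetic and ReLU-like, so density piles up along the axes (saturation).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
raw = rng.multivariate_normal([0.2, 0.1], [[1.0, 0.6], [0.6, 1.0]], size=5000)
neuron_a, neuron_b = np.maximum(raw, 0).T  # ReLU: many points clamp to zero

plt.hexbin(neuron_a, neuron_b, gridsize=30, cmap="viridis", mincnt=1)
plt.colorbar(label="count")
plt.xlabel("neuron A activation")
plt.ylabel("neuron B activation")
plt.title("Joint activation density (saturation visible along the axes)")
plt.show()
```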
Discussion
Current tools are siloed, integrate poorly with one another, and lack the ability to visualize finer-grained explanations for specific units of interest (neuron-level interpretability). This concept prototype is an initial step toward addressing this gap by integrating activation and attribution visualization into one tool, streamlining the workflow. The next step is validation testing to identify areas for improvement, along with ongoing stakeholder and community research to gather additional insights for the design and development of new features.
This prototype is a work in progress and requires further development to fully realize its benefits for the ML research field. By making model behavior more transparent, this tool can potentially help researchers identify misaligned or biased decision pathways, enabling corrective measures before deployment to the public.
Future work
In order to make this tool accessible to novice researchers as well as experienced researchers in other fields, I intend to expand the survey and interview a broader audience of industry practitioners. Other plans are to:
- Conduct usability testing for the prototype and iterate.
- Explore adding more interactivity and other types of visualizations.
- Add support for dynamic computation graphs (e.g., Transformers).
- Integrate quantitative metrics such as attribution consistency scores.
By addressing these steps, the aim is to create a robust, widely adoptable tool for ML interpretability, advancing the safety of AI systems.
Acknowledgements
I would like to express my deepest gratitude to Shivam for giving me his time, insights and knowledge regarding machine learning interpretability.
References
[1] Z. C. Lipton, ‘The Mythos of Model Interpretability’, arXiv [cs.LG], 2017.
[2] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong, ‘Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges’, arXiv [cs.LG], 2021.