Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

post by Karolis Jucys (karolis-ramanauskas), george_adams, Sonia Joseph (redhat) · 2024-07-18T17:02:06.179Z

This is a link post for https://arxiv.org/abs/2407.12161

Contents

  TL;DR
  Abstract
  2min teaser video
  Media
  Website

TL;DR

We apply mechanistic interpretability techniques to VPT, OpenAI's Minecraft agent. We also find a new case of goal misgeneralization: VPT kills a villager when we force the villager to stand under some tree leaves.

Abstract

Understanding the mechanisms behind decisions taken by large foundation models in sequential decision-making tasks is critical to ensuring that such systems operate transparently and safely. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft-playing agent, one of the largest open-source vision-based agents. We aim to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task: crafting a diamond pickaxe. The agent pays attention to the last four frames and several key-frames further back in its six-second memory. This is a possible mechanism for maintaining coherence in a task that takes 3-10 minutes, despite the short memory span. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk when the villager stands stationary under green tree leaves, and punches it to death.
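To make the attention analysis described in the abstract concrete, here is a minimal sketch of one way to summarize how far back in memory an agent looks. It is not the paper's code and does not use VPT's actual API. The function names, the toy attention matrix, and the 8-frame memory are all hypothetical, chosen only for illustration. The idea is to average a causal attention matrix along its diagonals, which gives the mean weight placed on the frame k steps in the past.

```python
import numpy as np


def attention_by_offset(attn):
    """Average attention weight as a function of how many frames back the key is.

    `attn` is a square (query_frame, key_frame) matrix of causal attention
    weights. Entry k of the result is the mean weight that queries place on
    the frame k steps before them.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2 or attn.shape[0] != attn.shape[1]:
        raise ValueError("expected a square (query, key) attention matrix")
    num_frames = attn.shape[0]
    # The k-th sub-diagonal holds attn[i + k, i]: query i + k looking k frames back.
    return np.array(
        [np.diagonal(attn, offset=-k).mean() for k in range(num_frames)]
    )


def make_toy_attention(num_frames=8, window=4):
    """Hypothetical stand-in for real weights: uniform attention over the last `window` frames."""
    attn = np.zeros((num_frames, num_frames))
    for query in range(num_frames):
        start = max(0, query - window + 1)
        attn[query, start : query + 1] = 1.0 / (query - start + 1)
    return attn


# Toy data only; real weights would have to be extracted from the model itself.
profile = attention_by_offset(make_toy_attention())
for offset, weight in enumerate(profile):
    print(f"{offset} frames back: mean attention {weight:.3f}")
```

On real attention weights, a profile like this would show a peak at small offsets for the recent frames, and any attended key-frames further back would appear as additional mass at larger offsets. Inspecting individual rows rather than the average would be needed to see which specific frames act as key-frames.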

2min teaser video

Media

Feel free to use this GIF in your presentations on the importance of AI safety :) More formats and speeds are available on the website below.

Website

https://sites.google.com/view/vpt-mi
