AI Safety 101: Introduction to Vision Interpretability

post by jeanne_ (jeanne_s), Charbel-Raphaël (charbel-raphael-segerie) · 2023-07-28T17:32:11.545Z

This is a link post for https://github.com/jeanne-s/Seminaire_Turing_2023/blob/main/Vision_Interpretability_Introduction.pdf


Last year, we taught an AGISF course to Master's students at a French university (ENS Ulm). The course consisted of 10 sessions, each lasting 2 hours, and drew inspiration from the official AGISF curriculum. We are currently writing the corresponding textbook, which aims to provide a succinct overview of the essential aspects covered during the sessions. Here, we are sharing the section dedicated to vision interpretability, both to get feedback and because we thought it might be valuable outside the course's context for anyone trying to learn more about interpretability. Other topics, such as reward misspecification, goal misgeneralization, scalable oversight, transformer interpretability, and governance, are currently a work in progress, and we plan to share them in the future.

We welcome any comments, corrections, or suggestions you may have to improve this work. This is our first attempt at creating pedagogical content, so there is undoubtedly room for improvement. Please don't hesitate to contact us at jeanne.salle@yahoo.fr or crsegerie@gmail.com.

The PDF can be found here.

 

It introduces the following papers:

 

Other papers briefly mentioned:

 

The associated slide deck can be found here.

Thanks to Agatha Duzan for her useful feedback.
