Timaeus's First Four Months

post by Jesse Hoogland (jhoogland), Daniel Murfet (dmurfet), Stan van Wingerden (stan-van-wingerden), Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-02-28T17:01:53.437Z · LW · GW · 6 comments

Contents

  Contributions
    Research Contributions
      What we learned
      Papers
    Outreach Contributions
      Posts
      Code
      Talks
      Events
  Organization
    What we need
  What's next 

Timaeus was announced in late October 2023 [LW · GW], with the mission of making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. This is our first progress update.

In service of the mission, our first priority has been to support and contribute to ongoing work in Singular Learning Theory (SLT) and developmental interpretability [LW · GW], with the aim of laying theoretical and empirical foundations for a science of deep learning and neural network interpretability. 

Our main uncertainties in this research concerned whether SLT is actually useful for understanding deep learning and whether structure in neural networks forms in phase transitions.

The research Timaeus has conducted over the past four months, in collaboration with Daniel Murfet's group at the University of Melbourne and several independent AI safety researchers, has significantly reduced these uncertainties, as we explain below [? · GW]. As a result, we are now substantially more confident in the research agenda.

While we view this fundamental work in the science of deep learning and interpretability as critical, the mission of Timaeus is to make fundamental contributions to alignment, and these investments in basic science must be judged relative to that end goal. We are impatient to make direct contact between these ideas and central problems in alignment. That impatience has produced the research directions outlined at the end of this post [LW · GW].

Contributions

Timaeus conducts two main activities: (1) research, i.e., developing new tools for interpretability, mechanistic anomaly detection, and related problems; and (2) outreach, i.e., introducing and advocating for these techniques to other researchers and organizations.

Research Contributions

What we learned

Regarding the question of whether SLT is useful in deep learning, we have learned that:

Regarding whether structure in neural networks forms in phase transitions, we have learned that:

Though we do not expect all structure to form in SLT-defined developmental stages, we now believe that enough important structure forms in discrete stages for developmental interpretability to be a viable approach.
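To give a feel for what "forms in discrete stages" means operationally: in our developmental work, stages show up as alternating plateaus and rapid transitions in scalar observables, such as the estimated local learning coefficient (LLC) tracked across training checkpoints. The following is a hypothetical sketch of that kind of boundary detection, not code from our repositories; the smoothing window and threshold are illustrative placeholders.

```python
# Hypothetical sketch: flag candidate stage boundaries wherever the
# smoothed LLC-over-training curve switches between a plateau and a
# period of rapid change. Window and threshold are placeholders.
import numpy as np

def stage_boundaries(steps, llc_values, window=5, threshold=0.05):
    """Return training steps where the smoothed LLC curve transitions
    between plateaus and periods of rapid change."""
    steps = np.asarray(steps, dtype=float)
    kernel = np.ones(window) / window
    llc = np.convolve(llc_values, kernel, mode="valid")  # moving average
    slope = np.gradient(llc, steps[: len(llc)])
    # "Rapidly changing" = slope large relative to the curve's steepest point.
    changing = np.abs(slope) > threshold * np.abs(slope).max()
    # A boundary is any checkpoint where that flag flips on or off.
    flips = np.flatnonzero(np.diff(changing.astype(int)))
    return steps[flips].tolist()
```

Real LLC curves are noisy, so serious use calls for better smoothing and changepoint methods; the point is only that developmental stages are in principle detectable from scalar observables of training.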

Papers

This work was undertaken by members of the Timaeus core team (Jesse Hoogland), Timaeus research assistants (George Wang and Zach Furman), Daniel Murfet's group at the University of Melbourne, and independent AI safety researchers:

Note that the first two papers above were largely or entirely completed before Timaeus launched in October.

Outreach Contributions

Posts

Posts authored by Timaeus core team members Jesse Hoogland and Stan van Wingerden:

We've also developed other resources, such as a list of learning materials for SLT and a list of open problems in DevInterp & SLT. Our collaborators distilled Chen et al. (2023) in Growth and Form in a Toy Model of Superposition [LW · GW]. Also of interest (though not written or supported by Timaeus) is Joar Skalse’s criticism of Singular Learning Theory [LW · GW], with extensive discussion in the comments [LW · GW].

Code

We published the devinterp repository and Python package, which implements the techniques we've introduced. It is supported by documentation and introductory notebooks to help newcomers get started.
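For readers who want a feel for the core technique, the following is a minimal, self-contained sketch of SGLD-based local learning coefficient (LLC) estimation in the spirit of Lau et al. (2023). It is written against plain PyTorch rather than the devinterp package's actual API; the function name, hyperparameters, and the single-batch proxy for the initial loss are all simplifications for illustration.

```python
# Minimal sketch of SGLD-based LLC estimation, NOT the devinterp API.
# Estimator: hat-lambda = n * beta * (E_w[L(w)] - L(w*)), where w is
# sampled by SGLD localized around the trained parameters w*.
import copy
import math

import torch

def estimate_llc(model, loader, loss_fn, num_burnin=100, num_draws=200,
                 lr=1e-4, localization=100.0, device="cpu"):
    n = len(loader.dataset)
    beta = 1.0 / math.log(n)  # inverse temperature prescribed by the theory

    sampler = copy.deepcopy(model).to(device)
    anchor = [p.detach().clone() for p in sampler.parameters()]  # w*

    init_loss, draws = None, []
    batches = iter(loader)
    for step in range(num_burnin + num_draws):
        try:
            x, y = next(batches)
        except StopIteration:  # recycle the loader as needed
            batches = iter(loader)
            x, y = next(batches)
        x, y = x.to(device), y.to(device)

        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()

        with torch.no_grad():
            if init_loss is None:
                # Single-batch proxy for L(w*): the sampler has not yet
                # taken a step away from the trained parameters.
                init_loss = loss.item()
            for p, p0 in zip(sampler.parameters(), anchor):
                # SGLD drift: descend the tempered loss while a quadratic
                # "localization" term pulls the chain back toward w*.
                drift = n * beta * p.grad + localization * (p - p0)
                p.add_(-0.5 * lr * drift)
                # Gaussian noise keeps the chain sampling, not optimizing.
                p.add_(math.sqrt(lr) * torch.randn_like(p))

        if step >= num_burnin:
            draws.append(loss.item())

    return n * beta * (sum(draws) / len(draws) - init_loss)
```

The devinterp package implements this idea more carefully (multiple chains, diagnostics, calibrated defaults), so for real experiments use the package and its notebooks rather than this sketch.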

Talks

We've given talks at DeepMind, Anthropic, OpenAI, 80,000 Hours, Constellation, MATS, the Topos Institute, Carnegie Mellon University, Monash University, and the University of Melbourne. We have planned talks with FAR, the Tokyo Technical AI Safety Conference, and the Foresight Institute.

Events

Organization

Our team currently consists of six people: three core team members, two research assistants, and one research lead. We continue to collaborate closely with Daniel Murfet's research group at the University of Melbourne.

What we need

What's next 

Six months ago, we first put our minds to the question of what SLT could have to say about alignment. This led to developmental interpretability [LW · GW]. But that was always a first step, not the end goal. Having now validated the basic premises of this research agenda, we are starting to work on additional points of contact with alignment.

Our current research priorities:

We will provide an outline of this new research agenda in the coming months, as we did in our original post announcing developmental interpretability [LW · GW]. Structural generalization and the geometry of program synthesis are ambitious, multi-year research programs in their own right. However, we will move aggressively to validate the theoretical ideas empirically, maintain tight feedback loops between empirical and theoretical progress, and publish incremental results on a regular basis. The developmental interpretability work of the past six months demonstrates our capacity to live up to these principles.

We will have our next round of updates and results to report in late May.

To stay up to date, join the DevInterp Discord. If you're interested in contributing or learning more, don't hesitate to reach out.

  1. ^

George joined this project later and is therefore not yet listed on the current preprint version.

6 comments


comment by Adrià Garriga-alonso (rhaps0dy) · 2024-02-28T18:18:57.920Z · LW(p) · GW(p)

That's a lot of things done, congratulations!

comment by Mateusz Bagiński (mateusz-baginski) · 2024-02-28T18:51:11.504Z · LW(p) · GW(p)

I would love to see something like Vanessa's LTA reading list [LW · GW] but for devinterp.

Replies from: jhoogland
comment by Jesse Hoogland (jhoogland) · 2024-02-28T18:55:34.085Z · LW(p) · GW(p)

You can find a v0 of an SLT/devinterp reading list here. Expect an updated reading list soon (which we will cross-post to LW). 

comment by evhub · 2024-02-28T21:04:34.599Z · LW(p) · GW(p)

(Moderation note: added to the Alignment Forum from LessWrong.)

comment by Mateusz Bagiński (mateusz-baginski) · 2024-03-30T09:15:13.714Z · LW(p) · GW(p)

Generalization, from thermodynamics to statistical physics (Hoogland 2023) — A review of and introduction to generalization theory from classical to contemporary. 

FYI this link is broken

comment by Review Bot · 2024-02-28T22:31:02.757Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?