AI improving AI [MLAISU W01!]

post by Esben Kran (esben-kran) · 2023-01-06T11:13:17.321Z · LW · GW · 0 comments

This is a link post for https://newsletter.apartresearch.com/posts/hundreds-of-research-ideas-w01

Contents

  Mechanistic interpretability
  ML improving ML
  Aligned AGI vs. unaligned AGI
  Deep learning research and other news
  Opportunities

Over 200 research ideas for mechanistic interpretability, ML improving ML and the dangers of aligned artificial intelligence. Welcome to 2023 and a happy New Year from us at the ML & AI Safety Updates!

Watch this week's MLAISU on YouTube or listen to it on Spotify.

Mechanistic interpretability

The interpretability researcher Neel Nanda has published [LW · GW] a massive list of 200 open and concrete problems in mechanistic interpretability. They’re split into the following categories:

  1. Analyzing toy models [? · GW]: Diving into models that are much smaller but trained the same way as large models. These are way easier to analyze than large models and he has made 12 small models available.
  2. Looking for circuits in the wild [? · GW]: Inspired by the paper “Interpretability in the Wild”, can we use mechanistic interpretability on real-life language models?
  3. Interpreting algorithmic problems [? · GW]: Models trained on algorithmic tasks often learn a clearly interpretable structure. We can for example observe that grokking happens [LW · GW] when the network learns a general version of the algorithm.
  4. Exploring polysemanticity and superposition [? · GW]: Superposition is when a network represents more features than it has neurons, so that one feature is spread across multiple neurons and a single neuron responds to several features. This makes it hard to interpret what individual neurons represent. Can we find better ways to understand or mitigate this effect?
  5. Analyzing training dynamics [? · GW]: Understanding how models change over training is very interesting for identifying how and when capabilities emerge.

These are great projects to go for and we’re collaborating with Neel Nanda to run a mechanistic interpretability hackathon on the 20th of January! As Lawrence Chan mentions in a new post [AF · GW], we need to touch reality as soon as possible, and these hackathons are a great way to get fast and concrete research results. You can join us but you can also run a local hackathon site!
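To make the superposition problem from the list above a bit more concrete, here is a minimal toy sketch. It is not taken from Neel Nanda's list or from any real model: the three features, the two "neurons" and the 120-degree layout are all assumptions chosen for illustration. It shows that when more features than neurons are stored, each feature is spread across neurons and reading one feature back picks up interference on the others.

```python
import math

# Hypothetical toy setup: 3 features stored in only 2 neurons.
# Each feature gets a unit-length direction, spaced 120 degrees apart.
N_FEATURES = 3
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def encode(feature_values):
    """Map feature activations to 2 neuron activations (a linear sum)."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def decode(neurons):
    """Read each feature back by projecting onto its direction."""
    return [neurons[0] * d[0] + neurons[1] * d[1] for d in directions]

neurons = encode([1.0, 0.0, 0.0])  # only feature 0 is active
readout = decode(neurons)          # approximately [1.0, -0.5, -0.5]
```

Feature 0 is recovered correctly, but features 1 and 2 read as -0.5 even though they were off. That interference is one reason individual neurons are hard to interpret under superposition.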

ML improving ML

Thomas Woodside summarizes [LW · GW] a collaborative project to map cases where ML systems are self-improving. There are already 11 major research projects that have shown machine learning systems being used to improve other systems, and we assume that much more is happening behind the scenes since these are only the published papers.

Several of the projects use models to create data that another model is fine-tuned on, while a few relate to speed-ups in running and developing machine learning systems. These include using ML to better optimize GPUs, optimizing compilers and helping humans spot flaws in a large language model (LLM) using another LLM.

A concrete example of the data generation and fine-tuning approach is a paper from Microsoft and MIT showing that an LLM can be used to generate programming puzzles, which a programming LLM is then fine-tuned on and improves a lot from.
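The newsletter does not describe the pipeline in detail, so the following is only a rough sketch of the general shape of such a loop. It assumes that generated puzzle and solution pairs are checked by running them before they are used as fine-tuning data. The `candidates` list is a hypothetical stand-in for LLM output, `is_verified` is an illustrative helper, and no model is called or trained here.

```python
# Hypothetical stand-in for samples from a generator LLM: each candidate is a
# (puzzle, solution) pair written as Python source strings.
candidates = [
    ("lambda x: x * x == 49", "7"),           # correct solution
    ("lambda s: s[::-1] == 'cba'", "'abc'"),  # correct solution
    ("lambda x: x + 3 == 10", "5"),           # wrong solution, should be dropped
]

def is_verified(puzzle_src, solution_src):
    """Return True if the proposed solution makes the puzzle return True.

    Note: eval on model-generated code is unsafe outside a sandbox; it is
    used here only because the candidates are hard-coded.
    """
    try:
        puzzle = eval(puzzle_src)
        return puzzle(eval(solution_src)) is True
    except Exception:
        return False

# Keep only verified pairs; these would form the fine-tuning data.
fine_tuning_set = [pair for pair in candidates if is_verified(*pair)]
```

Only the two pairs whose solutions actually satisfy their puzzles survive the filter. In a real setup the filtered set would then be passed to a fine-tuning step, which is left out of this sketch.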

With ML already reaching this level, we have to make sure that there are good introductions to ML safety for academics and engineers to understand the prominent issues with AI development. Vael Gates and Collin Burns try to identify the best intro texts by asking 28 ML researchers which of eight texts they prefer. They find that the best resource is Jacob Steinhardt’s “More is Different for AI” blog posts.

In these posts, Jacob Steinhardt explores two ways of looking at ML safety: philosophy and engineering. He argues that the philosophy side underrates the engineering approach preferred by ML academia, and that the engineering side significantly undervalues the philosophy approach (represented by Superintelligence).

An important point of these posts is how future AI systems will be qualitatively different from current AI systems and that this results in weird emergent behaviour.

Aligned AGI vs. unaligned AGI

In “The Case Against AI Alignment” [LW · GW], Andrew Sauer argues that the greatest risk of an unaligned artificial general intelligence is that humanity goes extinct, while an aligned system can lead to extreme suffering for a minority or for simulated beings. The argument is based on the inherent outgroup hatred of human psychology.

This comes at a time when the field of alignment is growing rapidly in response to the systems that have been released in the past year. One of the most important tasks in value alignment, a sub-field of alignment, is to figure out whose values to align to, something that few have grappled with until now.

Responses to Sauer’s piece accept the importance of figuring out these questions but reject the hypothesis that we should accept the death of all humans because there “might” be a highly risky outcome. Additionally, suffering that humans inflict on others is not a stable state in the way extinction is, which means it matters much less over longer timescales than one might expect.

Deep learning research and other news

In other news…

Opportunities

We have a few interesting opportunities coming up. Thanks go to AGISF for once more sharing opportunities in ML & AI safety.

This has been the ML & AI safety update. See you next week!
