An overview of some promising work by junior alignment researchers

post by Akash (akash-wasil) · 2022-12-26T17:23:58.991Z · LW · GW · 0 comments

Contents

  Externalized reasoning oversight by Tamera Lanham
  Goal misgeneralization by Lauro Langosco 
  An overview of the technical alignment landscape by Thomas Larsen & Eli Lifland
  Naive hypotheses by Shoshannah Tekofsky 
  A few other spotlights
  Caveats

We’re all familiar with ELK [LW · GW], natural abstractions [LW · GW], and toy models of superposition [LW · GW]. 

But there’s also a new cohort of alignment researchers. Many of them got involved (or produced their first major pieces of work) in the last year.

I’ve been impressed by the quality of some of their work, and I think it deserves wider recognition. The recent increase in attention being paid to alignment gives me some hope that an “unknown genius” may emerge in the field. Additionally, there are several junior alignment researchers who seem to have a lot of potential, and I’m excited to see their contributions as they get more experience and influence in the field. 

Here’s some work by junior alignment researchers that excited me in the last year:

Externalized reasoning oversight [LW · GW] by Tamera Lanham

Mechanistic interpretability tries to understand what models “think” by looking at their weights and activations. Externalized reasoning oversight tries to do this by just asking the models to explain their reasoning. 

Inspired by chain-of-thought prompting techniques, Tamera set out to see if she could get language models to provide honest and transparent answers about their reasoning processes. In the limit, this technique could provide a novel way for us to interpret language models.
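To make the idea concrete, here is a minimal sketch of the kind of chain-of-thought prompt this direction builds on. The wording and the helper function are hypothetical illustrations, not Tamera's actual setup:

```python
# Sketch of a chain-of-thought style prompt: the model is asked to
# externalize its reasoning in text, which an overseer could then inspect.
# build_cot_prompt is a hypothetical helper, not from the original post.

def build_cot_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before answering."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, writing out each reasoning step, "
        "then give the final answer on a line starting with 'Answer:'."
    )

prompt = build_cot_prompt(
    "If a train leaves at 3pm and travels for 2 hours, when does it arrive?"
)
print(prompt)
```

The hope is that if the model's stated reasoning faithfully reflects its actual computation, overseeing the text gives us a window into the model that doesn't require inspecting weights or activations.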

I’m excited to see junior researchers tackle existing problems (e.g., how do we understand what models think?) in new ways (e.g., hm, everyone’s been focusing on weights and activations… are there any alternatives?). Tamera’s work is a great example. 

Relevant post: Externalized reasoning oversight: A research direction for language model alignment [LW · GW]

Relevant interview: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic [LW · GW]

Goal misgeneralization by Lauro Langosco 

Everyone knows that models can learn unintended goals. Everyone knows that models can fail to generalize out-of-distribution.

Lauro’s paper connects these points and distinguishes between capabilities generalization failures (the model is incompetent out-of-distribution) and goal misgeneralization failures (the model is competent but pursues the wrong goal). He and his colleagues then define goal misgeneralization more concretely in an RL context, discuss its implications for alignment, and demonstrate alignment failures in existing models. Moreover, I found the writing in his paper to be particularly good at striking a balance between (a) explicitly discussing x-risks and (b) presenting arguments rigorously in a format suitable for an ML audience.
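The distinction can be illustrated with a toy version of the paper's CoinRun-style example (this is a hand-coded illustration, not the paper's actual experiments): a policy that learned the proxy goal "always move right" during training, because the coin was always at the right edge.

```python
# Toy illustration of goal misgeneralization. During training the coin is
# always at the right edge, so "always move right" achieves the intended
# goal. Out of distribution (coin on the left), the policy still moves
# competently -- but toward the wrong goal. This is a capability that
# generalized while the goal did not.

def proxy_policy(state):
    """Learned behavior: always move right, regardless of the coin."""
    return "right"

def run_episode(coin_position, agent_position=5, width=10):
    for _ in range(width):
        action = proxy_policy((agent_position, coin_position))
        agent_position += 1 if action == "right" else -1
        agent_position = max(0, min(width - 1, agent_position))
        if agent_position == coin_position:
            return True  # reached the coin
    return False

# In-distribution: coin at the right edge -> success
print(run_episode(coin_position=9))  # True
# Out-of-distribution: coin at the left edge -> competent motion, wrong goal
print(run_episode(coin_position=0))  # False
```

The point is that the out-of-distribution failure here is not incompetence: the agent acts capably and coherently, just in pursuit of a proxy objective that happened to coincide with the intended one during training.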

I’m excited to see more work that takes well-known problems in alignment and tries to (a) explain them more concretely and (b) examine them in current-day systems. I don’t expect this work to directly solve alignment, but I expect it to help us get a better understanding of alignment problems, find new ways to make progress on these problems, and make it easier for other ML researchers to find alignment problems they can work on. 

Relevant paper: Goal misgeneralization in deep reinforcement learning

An overview of the technical alignment landscape [LW · GW] by Thomas Larsen & Eli Lifland

Many research fields regularly produce literature reviews and meta-analyses. In fact, these are often among the most widely-read and widely-cited papers.

Alignment hasn’t had much of this. Perhaps this is because the field is small, so “everyone already knows what everyone is working on.” This certainly isn’t true for junior researchers, and my conversations with senior researchers suggest this isn’t even the case for the veterans. (People are generally focused on their research, and they don’t have time to follow everything in the space.)

Earlier this year, Thomas Larsen & Eli Lifland presented the most comprehensive overview of the technical alignment space. In addition to providing summaries, they also provided brief opinions and assessments of various alignment agendas. 

I’m excited to see more people who are willing to try to “understand the entire space” and who are bold enough to raise criticisms of work by senior people. This is how research fields grow, and I’m grateful to Larsen & Lifland for providing an excellent example. 

Relevant post: (My understanding of) What Everyone in Technical Alignment is Doing and Why [LW · GW]

Naive hypotheses [LW · GW] by Shoshannah Tekofsky 

Have you ever noticed that when you learn more about a field, you start to think more like everyone else? Do you notice concepts and frames like “inner/outer alignment” or “latent knowledge” or “sharp left turn” popping up in your thoughts?

Shoshannah Tekofsky anticipated this. So, before she read what other alignment researchers thought, she decided to write down her “naive hypotheses” for how to solve the alignment problem. None of her ideas are likely to work (obviously), but I’m impressed by the epistemology she’s bringing to the table. Shoshannah is now reading up on the alignment problem and the various approaches to solving it. But if she notices her creativity dipping, she can return to her list of naive hypotheses as an intuition pump.

I’m excited to see more people do exercises like this, where they write down their best-guesses for how to solve alignment. More broadly, I’m excited for junior researchers to engage in more active learning, where they’re more willing to pause/stop in order to write down their own thoughts/ideas/objections to the concepts they’re learning about. Shoshannah embodies this.

Relevant post: Naive hypotheses on AI alignment [LW · GW]

Relevant interview: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas [LW · GW]

A few other spotlights

Caveats
