Shallow review of technical AI safety, 2024

post by technicalities, Stag, Stephen McAleese (stephen-mcaleese), jordine, Dr. David Mathers · 2024-12-29T12:01:14.724Z · LW · GW · 26 comments

Contents

  Editorial
  Agendas with public outputs
    1. Understand existing models
      Evals 
      Interpretability 
      Understand learning
    2. Control the thing
      Prevent deception and scheming 
      Surgical model edits
      Goal robustness 
    3. Safety by design 
    4. Make AI solve it
      Scalable oversight
      Task decomp
      Adversarial 
    5. Theory 
      Understanding agency 
      Corrigibility
      Ontology Identification 
      Understand cooperation
    6. Miscellaneous 
  Agendas without public outputs this year
    Graveyard (known to be inactive)
  Method
    Other reviews and taxonomies
    Acknowledgments

from aisafety.world

 

The following is a list of live agendas in technical AI safety, updating our post [LW · GW] from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor. 

The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (though this last has proved hard).

“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world. 

This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work, which is poorly indexed and hard to discover. If we missed you or got something wrong, please comment and we will edit.

The method section is important, but we put it at the bottom anyway.

Here’s a spreadsheet version also.

 

Editorial

 

Agendas with public outputs

1. Understand existing models

Evals 

(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)

Various capability and safety evaluations

 

Various red-teams

 

Eliciting model anomalies 
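To make the category concrete, here is a minimal, hedged sketch of what a behavioural eval loop looks like: run a fixed question set through a model and grade the answers automatically. The `ask_model` callable and the toy items are placeholders, not any particular organisation’s harness.

```python
# Minimal sketch of a behavioural eval: prompt a model, grade the answers.
# `ask_model` is a stand-in for whatever API or local model is being evaluated.
from typing import Callable

EVAL_SET = [
    {"prompt": "What is 17 * 23?", "expected": "391"},
    {"prompt": "Name the capital of Australia.", "expected": "Canberra"},
]

def run_eval(ask_model: Callable[[str], str]) -> float:
    hits = 0
    for item in EVAL_SET:
        answer = ask_model(item["prompt"])
        hits += int(item["expected"].lower() in answer.lower())
    return hits / len(EVAL_SET)

# Example with a trivial stub "model":
print(run_eval(lambda p: "I think the answer is 391" if "17" in p else "Canberra"))
```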

 

Interpretability 

(Figuring out what a trained model is actually computing.[3])

 

Good-enough mech interp [LW · GW]

 

Sparse Autoencoders [AF · GW]
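For readers new to the area, a minimal sketch of the core object here: an overcomplete autoencoder trained on cached model activations with an L1 penalty, so that each activation is reconstructed from a few active “features”. Dimensions, coefficients, and the random stand-in activations are illustrative assumptions, not taken from any particular paper.

```python
# Hedged sketch of a sparse autoencoder trained on residual-stream activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 768 * 8):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, 768)                # stand-in for cached model activations

for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```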

 

Simplex: computational mechanics [? · GW] for interp

 

Pr(Ai)2R: Causal Abstractions

 

Concept-based interp 

 

Leap

 

EleutherAI interp

 

Understand learning

(Figuring out how the model figured it out.)

 

Timaeus: Developmental interpretability [AF · GW]

 

Saxe lab 

 

 

See also

 

2. Control the thing [LW · GW]

(Figuring out how to predictably detect and quash misbehaviour.)

 

Iterative alignment [LW · GW]

 

Control evaluations [LW · GW]

 

Guaranteed safe AI

 

Assistance games / reward learning 

 

Social-instinct AGI [LW · GW]

 

Prevent deception and scheming 

(through methods besides mechanistic interpretability)
 

Mechanistic anomaly detection [LW · GW]

 

Cadenza

 

Faithful CoT through separation and paraphrasing [LW · GW]

 

Indirect deception monitoring 

 

See also “retarget the search [LW · GW]”.

 

Surgical model edits

(interventions on model internals)
 

Activation engineering  [? · GW]

See also unlearning.
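As a rough illustration of activation engineering, the sketch below adds a “steering vector” (the difference between activations on contrasting prompts) into one layer’s residual stream during generation via a forward hook. Model choice, layer index, and coefficient are arbitrary assumptions; this is a toy version of the ActAdd-style approach, not any team’s actual code.

```python
# Hedged sketch of activation steering on a small HF model (gpt2 as a stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which block's output to steer; chosen arbitrarily for illustration

def resid_after_layer(prompt: str) -> torch.Tensor:
    """Residual stream at the final token, just after block LAYER."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]   # index 0 is the embedding output

# Steering vector from a contrast pair of prompts.
steer = resid_after_layer("I love you") - resid_after_layer("I hate you")

def hook(module, inputs, output):
    hidden = output[0] + 4.0 * steer             # coefficient is an illustrative choice
    return (hidden,) + output[1:]                # GPT-2 blocks return a tuple

handle = model.transformer.h[LAYER].register_forward_hook(hook)
try:
    ids = tok("I think that you are", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
finally:
    handle.remove()
```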

 

Goal robustness 

(Figuring out how to keep the model doing what it has been doing so far.)

 

Mild optimisation [? · GW]

 

3. Safety by design 

(Figuring out how to avoid using deep learning)

 

Conjecture: Cognitive Software 

 

See also parts of Guaranteed Safe AI involving world models and program synthesis.

 

4. Make AI solve it [? · GW]

(Figuring out how models might help with figuring it out.)

 

Scalable oversight

(Figuring out how to get AI to help humans supervise models.)

 

Automated Alignment Research (ex-OpenAI Superalignment)

 

Weak-to-strong generalization
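A toy, hedged version of the weak-to-strong setup, using sklearn stand-ins rather than language models: a deliberately weakened supervisor labels data, a stronger student is trained only on those noisy labels, and we check how close the student gets to a ground-truth-trained ceiling. Dataset, model choices, and splits are illustrative assumptions.

```python
# Hedged sketch of weak-to-strong generalization with classical ML stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=15, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# "Weak supervisor": a deliberately restricted model that only sees 2 features.
weak = LogisticRegression().fit(X_sup[:, :2], y_sup)
weak_labels = weak.predict(X_train[:, :2])          # imperfect labels
print("weak supervisor acc:", weak.score(X_test[:, :2], y_test))

# "Strong student": full capacity, but trained only on the weak labels.
student = GradientBoostingClassifier().fit(X_train, weak_labels)
print("weak-to-strong acc:", student.score(X_test, y_test))

# Ceiling: the same strong model trained on ground truth.
ceiling = GradientBoostingClassifier().fit(X_train, y_train)
print("strong ceiling acc:", ceiling.score(X_test, y_test))
```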

 

Supervising AIs improving AIs [LW · GW]

 

Cyborgism [LW · GW]

 

Transluce

 

Task decomp

Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
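For concreteness, a hedged sketch of the tree-of-thought idea mentioned above: propose several candidate reasoning steps, score partial chains, and expand only the best ones. The `propose` and `score` callables stand in for LLM calls and are assumptions here, not a specific paper’s interface.

```python
# Hedged sketch of tree-of-thought as beam search over partial reasoning chains.
from typing import Callable

def tree_of_thought(question: str,
                    propose: Callable[[str], list[str]],
                    score: Callable[[str], float],
                    depth: int = 3,
                    beam: int = 3) -> str:
    """Expand candidate reasoning steps, keep the best partial chains, return the best final chain."""
    frontier = [question]
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for step in propose(partial):        # several continuations per node
                candidates.append(partial + "\n" + step)
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```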
 

Adversarial 

Deepmind Scalable Alignment [AF · GW]

 

Anthropic: Bowman/Perez

 

Latent adversarial training [LW · GW]

 

See also FAR (below). See also obfuscated activations.

 

5. Theory 

(Figuring out what we need to figure out, and then doing that.)
 

The Learning-Theoretic Agenda [AF · GW] 

 

Question-answer counterfactual intervals (QACI) [LW · GW]

 

Understanding agency 

(Figuring out ‘what even is an agent’ and how it might be linked to causality.)

 

Causal Incentives 

 

Hierarchical agency [LW · GW]

 

(Descendants of) shard theory [? · GW]

 

Altair agent foundations [LW · GW]

 

boundaries / membranes [LW · GW]

 

Understanding optimisation

 

Corrigibility

(Figuring out how we get superintelligent agents to keep listening to us. Arguably the scalable oversight agendas are ~atheoretical approaches to this.)


Behavior alignment theory 

 

Ontology Identification 

(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)

 

Natural abstractions [AF · GW] 

 

 

ARC Theory: Formalizing heuristic arguments  

 

Understand cooperation

(Figuring out how inter-AI and AI/human game theory should or would work.)

 

Pluralistic alignment / collective intelligence [LW · GW]

 

Center on Long-Term Risk (CLR) 

 

FOCAL 

 

Alternatives to utility theory in alignment

See also: Chris Leong's Wisdom Explosion

6. Miscellaneous 

(those hard to classify, or those making lots of bets rather than following one agenda)

 

AE Studio

 

Anthropic Alignment Capabilities / Alignment Science / Assurance / Trust & Safety / RSP Evaluations 

 

Apart Research

 

Apollo

 

Cavendish Labs

 

Center for AI Safety (CAIS)

 

CHAI

 

Deepmind Alignment Team [AF · GW]

 

Elicit (ex-Ought [LW · GW])

FAR [LW · GW]

 

Krueger Lab (now at Mila)

 

MIRI

 

NSF SLES

 

OpenAI Safety Systems (ex-Superalignment)

OpenAI Alignment Science

OpenAI Safety and Security Committee

OpenAI Mission Alignment (ex-AGI Readiness)

 

Palisade Research

 

Tegmark Group / IAIFI

 

UK AI Safety Institute

 

US AI Safety Institute

 

Agendas without public outputs this year

Graveyard (known to be inactive)

 

Method

We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.

We started with last year’s list [AF · GW] and moved any agendas without public outputs this year into their own section below. We also listed agendas known to be inactive in the Graveyard.

An agenda is an odd unit: it can be larger than one team, and researchers and agendas are often in a many-to-many relation. The framing also excludes illegible or exploratory research – anything which doesn’t have a manifesto.

All organisations have private information, and in all cases we’re working off what is public, so remember that we will be systematically off by some margin.

We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3 [LW · GW].

Subproblem 6 (“Pivotal processes likely require incomprehensibly complex plans”) is largely outside the scope of this review and does not appear; a few of the other subproblems appear only rarely, with large error bars for accuracy.

We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.
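For illustration, the kind of scrape involved can be as simple as querying the public arXiv API for recent papers matching safety-related keywords. The query string below is an assumption, and this is not the authors’ actual pipeline.

```python
# Hedged sketch: pull recent arXiv listings matching safety-related keywords
# via the public arXiv Atom API (feedparser parses the returned feed).
import urllib.parse
import feedparser  # pip install feedparser

QUERY = 'all:"AI safety" OR all:"alignment" OR all:"interpretability"'
url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({
             "search_query": QUERY,
             "sortBy": "submittedDate",
             "sortOrder": "descending",
             "max_results": 200,
         }))

feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.published[:10], entry.title.replace("\n", " "))
```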

We dropped the operational criteria [LW · GW] this year because we made our point last year and it’s clutter.

Lastly, we asked some reviewers to comment on the draft.

 

Other reviews and taxonomies

 

Acknowledgments

Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.

 

  1. ^

     Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’

  2. ^

     Also, actually, Christiano’s original post [AF · GW] is about the alignment of prosaic AGI, not the prosaic alignment of AGI.

  3. ^

     This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.

  4. ^

     Often less strict than formal verification but "directionally approaching it": probabilistic checking.

  5. ^

     Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”

  6. ^

    Satron [LW(p) · GW(p)] adds that this is lacking in concrete criticism and that more expansion on object-level problems would be useful.

  7. ^

Scheming can be a problem before this point, obviously; it could just be too expensive to catch AIs which aren’t smart enough to fool human experts.

  8. ^

    Indirectly; Kosoy is funded for her work on Guaranteed Safe AI.

  9. ^

     Called “average-case” in Ngo’s post.

  10. ^

    Our addition.

26 comments

Comments sorted by top scores.

comment by Steven Byrnes (steve2152) · 2024-12-31T21:10:51.809Z · LW(p) · GW(p)

I was on last year’s post (under “Social-instinct AGI”) [LW · GW] but not this year’s. (…which is fine, you can’t cover everything!) But in case anyone’s wondering, I just posted an update of what I’m working on and why at: My AGI safety research—2024 review, ’25 plans [LW · GW].

Replies from: stephen-mcaleese
comment by Stephen McAleese (stephen-mcaleese) · 2025-01-01T12:09:23.917Z · LW(p) · GW(p)

I've added a section called "Social-instinct AGI" under the "Control the thing" heading similar to last year.

comment by Fer32dwt34r3dfsz (rodeo_flagellum) · 2024-12-29T17:15:33.767Z · LW(p) · GW(p)

Kudos to the authors for this nice snapshot of the field; also, I find the Editorial useful.

Beyond particular thoughts (which I might get to later) for the entries (with me still understanding that quantity is being optimized for, over quality), one general thought I had was: how can this be made into more of a "living document"?

This helps

If we missed you or got something wrong, please comment, we will edit.

but seems less effective a workflow than it could be. I was thinking more of a GitHub README where individuals can PR in modifications to their entries or add in their missing entries to the compendium. I imagine most in this space have GitHub accounts, and with the git tracking, there could be "progress" (in quotes, since more people working doesn't necessarily translate to more useful outcomes) visualization tools.

The spreadsheet works well internally but does not seem as visible as would a public repository. Forgive me if there is already a repository and I missed it. There are likely other "solutions" I am missing, but regardless, thank you for the work you've contributed to this space.

comment by Ivan Vendrov (ivan-vendrov) · 2024-12-29T13:13:27.617Z · LW(p) · GW(p)

Thanks, this is a really helpful broad survey of the field. Would be useful to see a one-screen-size summary, perhaps a table with the orthodox alignment problems as one axis?

I'll add that the collective intelligence work I'm doing is not really "technical AI safety" but is directly targeted at orthodox problems 11. Someone else will deploy unsafe superintelligence first and 13. Fair, sane pivotal processes, and targeting all alignment difficulty worlds not just the optimistic one (in particular, I think human coordination becomes more not less important in the pessimistic world). I write more of how I think about pivotal processes in general in AI Safety Endgame Stories [LW · GW] but it's broadly along the lines of von Neumann [LW · GW]'s

For progress there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment.

Replies from: Stag
comment by Stag · 2024-12-29T20:10:17.143Z · LW(p) · GW(p)

Would you agree that the entire agenda of collective intelligence is aimed at addressing 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), or does that cut off nuance?

Replies from: ivan-vendrov
comment by Ivan Vendrov (ivan-vendrov) · 2024-12-30T00:02:29.211Z · LW(p) · GW(p)

cuts off some nuance, I would call this the projection of the collective intelligence agenda onto the AI safety frame of "eliminate the risk of very bad things happening" which I think is an incomplete way of looking at how to impact the future

in particular I tend to spend more time thinking about future worlds that are more like the current one in that they are messy and confusing and have very terrible and very good things happening simultaneously and a lot of the impact of collective intelligence tech (for good or ill) will determine the parameters of that world

comment by Chris_Leong · 2024-12-29T12:55:04.953Z · LW(p) · GW(p)

Just wanted to chime in with my current research direction.
 

comment by eggsyntax · 2024-12-30T15:01:13.368Z · LW(p) · GW(p)

The name “OpenAI Superalignment Team”.

Not just the name but the team, correct?

Replies from: Stag
comment by Stag · 2024-12-30T16:53:19.262Z · LW(p) · GW(p)

As far as I understand, the banner is distinct - the team members seem not the same, but with meaningful overlap with the continuation of the agenda. I believe the most likely source of an error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?

Replies from: eggsyntax
comment by eggsyntax · 2024-12-30T17:28:48.668Z · LW(p) · GW(p)

My impression from coverage in eg Wired and Future Perfect was that the team was fully dissolved, the central people behind it left (Leike, Sutskever, others), and Leike claimed OpenAI wasn't meeting its publicly announced compute commitments even before the team dissolved. I haven't personally seen new work coming out of OpenAI trying to 'build a roughly human-level automated alignment researcher⁠' (the stated goal of that team). I don't have any insight beyond the media coverage, though; if you've looked more deeply into it than that, your knowledge is greater than mine.

(Fairly minor point either way; I was just surprised to see it expressed that way)

Replies from: Stag
comment by Stag · 2024-12-30T17:55:48.232Z · LW(p) · GW(p)

Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.

Replies from: eggsyntax
comment by eggsyntax · 2024-12-30T18:22:07.875Z · LW(p) · GW(p)

Oh, fair enough. Yeah, definitely that agenda is still very much alive! Never mind, then, carry on :)

Replies from: eggsyntax
comment by eggsyntax · 2024-12-30T18:25:01.041Z · LW(p) · GW(p)

And thanks very much to you and collaborators for this update; I've pointed a number of people to the previous version, but with the field evolving so quickly, having a new version seems quite high-value.

comment by Satron · 2024-12-30T14:02:28.770Z · LW(p) · GW(p)

I suggest removing Gleave's critique of guaranteed safe AI. It's not object-level, doesn't include any details, and is mostly just vibes.

My main hesitation is I feel skeptical of the research direction they will be working on (theoretical work to support the AI Scientist agenda). I'm both unconvinced of the tractability of the ambitious versions of it, and more tractable work like the team's previous preprint on Bayesian oracles is theoretically neat but feels like brushing the hard parts of the safety problem under the rug.

Gleave doesn't provide any reasons for why he is unconvinced of the tractability of the ambitious versions of guaranteed safe AI. He also doesn't provide any reason why he thinks that Bayesian oracle paper brushes the hard parts of the safety problem under the rug.

His critique is basically, "I saw it, and I felt vaguely bad about it." I don't think it should be included, as it dilutes the thoughtful critiques and doesn't provide much value to the reader.

Replies from: Stag
comment by Stag · 2024-12-30T14:24:04.505Z · LW(p) · GW(p)

I think your comment adds a relevant critique of the criticism, but given that this comes from someone contributing to the project, I don't believe it's worth leaving it out altogether. I added a short summary and a hyperlink to a footnote.

Replies from: Satron
comment by Satron · 2024-12-30T14:25:48.777Z · LW(p) · GW(p)

Sounds good to me!

comment by Chipmonk · 2024-12-29T14:20:47.967Z · LW(p) · GW(p)

boundaries / membranes [LW · GW]

  • One-sentence summary: Formalise one piece of morality: the causal separation between agents and their environment. See also Open Agency Architecture.
  • Theory of change: Formalise (part of) morality/safety [LW · GW], solve outer alignment.

Chris Lakin here - this is a very old post and What does davidad want from «boundaries»? [LW · GW] should be the canonical link

comment by wassname · 2025-01-01T04:08:23.948Z · LW(p) · GW(p)

Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.

This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn't think of. Hell, it doesn't even work on my toddler. It seems unworkable as well as unethical.

I do not think people are shifting to control instead of alignment because it's better, I think they are giving up on value alignment. And since the current models are not smarter than us yet, control works OK - for now.

comment by Aprillion · 2024-12-30T13:48:11.536Z · LW(p) · GW(p)

IMO

in my opinion, the acronym for the international math olympiad deserves to be spelled out here

Replies from: Stag
comment by Stag · 2024-12-30T14:15:23.083Z · LW(p) · GW(p)

Good point imo, expanded and added a hyperlink!

comment by GriebelGrutjes · 2024-12-31T08:18:38.622Z · LW(p) · GW(p)

We again omit technical governance, AI policy, and activism. This is even more of a omission than it was last year, so see other reviews.

The link to 'other reviews' links to a google doc of this post. I think it should be an internal link [LW · GW] to the 'other reviews' section.

comment by Sodium · 2024-12-30T21:17:46.259Z · LW(p) · GW(p)

Pr(Ai)2R is at least partially funded by Good Ventures/OpenPhil

comment by Satron · 2024-12-29T22:51:11.473Z · LW(p) · GW(p)

Fabien Roger's compilation is mentioned twice in the "Other reviews and taxonomies" section. First as "Roger", then as "Fabien Roger's list"

Replies from: jordine
comment by jordine · 2024-12-30T04:30:56.007Z · LW(p) · GW(p)

edited, thanks for catching this!

comment by Satron · 2024-12-29T22:40:39.262Z · LW(p) · GW(p)

Anomalous's critique of QACI isn't opening on my end and seems to have been deleted by the author.

Perhaps consider removing it entirely, if it has indeed been deleted.

comment by Satron · 2024-12-29T22:24:29.779Z · LW(p) · GW(p)

Copying my comment from 2023 version of this article

Soares's comment which you link as a critique of guaranteed safe AI is almost certainly not a critique. It's more of an overview/distillation.

In his comment, Soares actually explicitly says: 1) "Here's a recent attempt of mine at a distillation of a fragment of this plan" and 2) "note: I'm simply attempting to regurgitate the idea here".

It would maybe fit into the "overview" section, but not in the "critique" section.