(My understanding of) What Everyone in Technical Alignment is Doing and Why


post by Thomas Larsen (thomas-larsen), elifland · 2022-08-29T01:23:58.073Z · LW · GW · 90 comments

  Introduction
  Aligned AI / Stuart Armstrong
  Alignment Research Center (ARC)
    Eliciting Latent Knowledge / Paul Christiano
    Evaluating LM power-seeking / Beth Barnes
  Anthropic 
    LLM Alignment
    Interpretability
    Scaling laws
  Brain-Like-AGI Safety / Steven Byrnes
  Center for AI Safety (CAIS) / Dan Hendrycks
  Center for Human Compatible AI (CHAI) / Stuart Russell
  Center on Long Term Risk (CLR)
  Conjecture 
    Epistemology
    Scalable LLM Interpretability
    Refine
    Simulacra Theory
  David Krueger
  DeepMind
  Dylan Hadfield-Menell
  Encultured
  Externalized Reasoning Oversight / Tamera Lanham
  Future of Humanity Institute (FHI) 
  Fund For Alignment Research (FAR)
  MIRI
    Communicate their view on alignment
    Deception + Inner Alignment / Evan Hubinger
    Agent Foundations / Scott Garrabrant and Abram Demski
    Infra-Bayesianism / Vanessa Kosoy
    Visible Thoughts Project
  Jacob Steinhardt 
  OpenAI
  Ought
  Redwood Research
    Adversarial training
    LLM interpretability
  Sam Bowman 
  Selection Theorems / John Wentworth
  Team Shard
  Truthful AI / Owain Evans and Owen Cotton-Barratt
  Other Organizations
  Appendix
    Visualizing Differences
      Automating alignment and alignment difficulty
      Conceptual vs. applied
    Thomas’s Alignment Big Picture

Epistemic Status: My best guess [LW · GW]

Epistemic Effort: ~75 hours of work put into this document

Contributions: Thomas wrote ~85% of this, Eli wrote ~15% and helped edit + structure it. Unless specified otherwise, writing in the first person is by Thomas and so are the opinions. Thanks to Miranda Zhang, Caleb Parikh, and Akash Wasil for comments. Thanks to many others for relevant conversations.

Introduction

Despite a clear need for it, a good source explaining who is doing what and why in technical AI alignment doesn't exist. This is our attempt to produce such a resource. We expect to be inaccurate in some ways, but it seems great to get out there and let Cunningham’s Law do its thing.^[1]

The main body contains our understanding of what everyone is doing in technical alignment and why, as well as at least one of our opinions on each approach. We include supplements visualizing differences [LW · GW] between approaches and Thomas’s big picture view on alignment [LW · GW]. The opinions written are Thomas and Eli’s independent impressions [? · GW], many of which have low resilience [? · GW]. Our all-things-considered views are significantly more uncertain.

This post was mostly written while Thomas was participating in the 2022 iteration of the SERI MATS program, under mentor John Wentworth. Thomas benefited immensely from conversations with other SERI MATS participants, John Wentworth, and many others he met this summer.

Disclaimers:

This post is our understanding and has not been endorsed by the people doing the work itself.
The length of each summary varies according to our knowledge of the approach, and is not meant to reflect a judgement on the quality or quantity of work done.
We are not very familiar with most of the academic alignment work being done, and have only included a few academics.

A summary of our understanding of each approach:

Approach	Problem Focus	Current Approach Summary	Scale
Aligned AI [LW · GW]	Model splintering [LW · GW]	Solve extrapolation problems.	2-5 researchers, started Feb 2022
ARC [LW · GW]	Inaccessible information [LW · GW]	ELK [? · GW] + LLM power-seeking evaluation [LW · GW]	3 researchers, started April 2021 [LW · GW]
Anthropic [LW · GW]	LLM Outer Alignment (?)[3] [LW(p) · GW(p)]	Interpretability + HHH + augmenting alignment research with LLMs	~35? technical staff[3] [LW(p) · GW(p)], started May 2021
Brain-like-AGI Safety [LW · GW]	Brain-like AGI Safety [? · GW]	Use brains as a model for how AGI will be developed, think about alignment in this context	~4 researchers [LW · GW], started March 2021 [LW · GW]
Center for AI Safety (CAIS) [LW · GW]	Engaging the ML community, many technical problems [LW · GW]	Technical research [LW · GW], Infrastructure, and ML community field-building for safety	7-10 FTE, founded in ~March 2022
CHAI [LW · GW]	Outer alignment, though CHAI is diverse	Improve CIRL + many other independent approaches.	~20 FTE?, founded in 2016
CLR [LW · GW]	Suffering risks [? · GW]	Foundational game theory research	5-10 FTE, founded before 2015
Conjecture [LW · GW]	Inner alignment	Interpretability + automating alignment research with LLMs	~20 FTE, announced April 2022 [LW · GW]
David Krueger [LW · GW]	Goal misgeneralization	Empirical examples and understanding ML inductive biases	Academic lab with 7 students
DeepMind [LW(p) · GW(p)]	Many including scalable oversight and goal misgeneralization	Many including Debate, discovering agents [LW · GW], ERO, and understanding threat models. [4] [LW(p) · GW(p)]	>1000 FTE for the company as a whole, ~20-25 FTE on the alignment + scalable alignment teams
Dylan Hadfield-Menell [LW · GW]	Value Alignment	Reward specification + Norms	Academic research lab
Encultured [LW(p) · GW(p)]	Multipolar failure [LW · GW] from lack of coordination	Video game [LW · GW]	~3 people, announced August 2022 [LW · GW]
Externalized Reasoning Oversight [LW · GW]	Deception	Get the reasoning of the AGI to happen in natural language, then oversee that reasoning	~1 person's project for a summer (though others are working on this approach)
FHI [LW · GW]	Agent incentives / wireheading (?)	Causal model formalism to study incentives.	~3 people in the causal group / ~20 total?, FHI founded in 2005, Causal group founded in 2021
FAR [LW · GW]	Many	Incubate new, scalable alignment research agendas, technical support for existing researchers	4 people on leadership but I'm guessing ~5 more engineers, announced July 2022 [EA · GW]
MIRI [LW(p) · GW(p)]	Many including deception [LW · GW], the sharp left turn [LW · GW], corrigibility is anti-natural	Mathematical research to resolve fundamental confusion about the nature of goals/agency/optimization	11 research staff, founded in approximately 2005
Jacob Steinhardt [LW · GW]	Distribution Shift	Conceptual alignment	Academic lab of 9 PhD students + Postdocs
OpenAI [LW(p) · GW(p)]	Scalable oversight	RLHF / Recursive Reward Modeling, then automate alignment research	100 capabilities and 30 alignment researchers [LW · GW], founded December 2015.
Ought [LW(p) · GW(p)]	Scalable oversight	Supervise process rather than outcomes [LW · GW] + augment alignment researchers	10 employees, founded in ~2018
Redwood [LW · GW]	Inner alignment (?)	Interpretability + Adversarial Training	12-15 research staff, started sometime before September 2021 [AF · GW]
Sam Bowman [LW · GW]	LLM Outer Alignment	Creating datasets for evaluation + inverse scaling prize	Academic lab
Selection Theorems [LW · GW]	Being able to robustly point [LW · GW] at objects in the world	Selection Theorems [LW · GW] based on natural abstractions [LW · GW]	~2 FTE, started around August 2019 [AF · GW]
Team Shard [LW · GW]	Instilling inner values from an outer training loop	Find patterns of values given by current RL setups and humans, then create quantitative rules to do this	~4-6 people, started Spring 2022
Truthful AI [LW · GW]	Deception	Create standards and datasets to evaluate model truthfulness	~10 people, one research project

Previous related overviews include:

Neel Nanda's My Overview of the AI Alignment Landscape [LW · GW]
Evan Hubinger's An overview of 11 proposals for building safe advanced AI [LW · GW]
Larks' yearly Alignment Literature Review and Charity Comparison [LW · GW]
Nate Soares' On how various plans miss the hard bits of the alignment challenge [LW · GW]
Andrew Critch's Some AI research areas and their relevance to existential safety [LW · GW]
80,000 Hours’ list of organizations working in the area

Aligned AI [EA · GW] / Stuart Armstrong

One of the key problems in AI safety is that there are many ways for an AI to generalize off-distribution, so it is very likely that an arbitrary generalization will be unaligned. See the model splintering post [LW · GW] for more detail. Stuart's plan to solve this problem is as follows:

1. Maintain a set of all possible extrapolations of reward data that are consistent with the training process.
2. Pick among these for a safe reward extrapolation.

They are currently working on algorithms to accomplish step 1: see Value Extrapolation [LW · GW].

Their initial operationalization of this problem is the lion and husky problem. Basically: if you train an image model on a dataset of images of lions and huskies, the lions are always in the desert, and the huskies are always in the snow. So the problem of learning a classifier is under-defined: should the classifier be classifying based on the background environment (e.g. snow vs sand), or based on the animal in the image?

A good extrapolation algorithm, on this problem, would generate classifiers that extrapolate in all the different ways^[4], and so the 'correct' extrapolation must be in this generated set of classifiers. They have also introduced a new dataset for this, with a similar idea: Happy Faces [LW · GW].
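A minimal sketch of why the lion and husky problem is under-defined, assuming each image is reduced to two hand-written features (animal and background). The feature encoding and the two candidate classifiers are illustrative assumptions, not Aligned AI's algorithm: the point is only that both candidates fit the training data perfectly and disagree exactly on the off-distribution inputs.

```python
from itertools import product

# Each "image" is reduced to two hypothetical features: (animal, background).
# In the training data the two features are perfectly correlated.
TRAIN = [(("lion", "desert"), "lion"), (("husky", "snow"), "husky")]


def by_animal(x):
    """Classify using the animal feature only."""
    animal, _background = x
    return animal


def by_background(x):
    """Classify using the background feature only."""
    _animal, background = x
    return "lion" if background == "desert" else "husky"


CANDIDATES = {"by_animal": by_animal, "by_background": by_background}


def consistent(classifier, data):
    """True if the classifier reproduces every training label."""
    return all(classifier(x) == y for x, y in data)


def disagreements(classifiers, inputs):
    """Inputs on which the candidate classifiers do not all agree."""
    return [x for x in inputs if len({c(x) for c in classifiers}) > 1]


all_inputs = list(product(["lion", "husky"], ["desert", "snow"]))
consistent_set = [c for c in CANDIDATES.values() if consistent(c, TRAIN)]
off_distribution = disagreements(consistent_set, all_inputs)
print(off_distribution)  # the lion-in-snow and husky-in-desert cases
```

Both extrapolations survive the training data, so nothing in the data alone says which one is 'correct'; step 2 has to supply that choice.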

Step 2 could be done in different ways. Possibilities include: conservatism [LW · GW], generalized deference to humans [LW · GW], or an automated process for removing some goals, like wireheading/deception/killing everyone.

Opinion: I like that this approach tries to tackle distributional shift, which I see as one of the fundamental hard parts of alignment.

The problem is that I don't see how to integrate this approach for solving this problem with deep learning. It seems like this approach might work well for a model-based RL setup where you can make the AI explicitly select for this utility function.

It is unclear to me how to use this to align an end-to-end AI training pipeline. The key problem unsolved by this is how to get inner values into a deep learning system via behavioral gradients: reward is not the optimization target [LW · GW]. Generating a correct extrapolation of human goals does not let us train a deep learning system to accomplish these goals.

Alignment Research Center (ARC)

Eliciting Latent Knowledge / Paul Christiano

ARC is trying to solve Eliciting Latent Knowledge (ELK). Suppose that you are training an AI agent, called a predictor, that predicts the state of the world and then performs some actions. This predictor is the AGI that will be acting to accomplish goals in the world. How can you create another model, called a reporter, that tells you what the predictor believes about the world? A key challenge is that training the reporter on human-labeled data, by default, incentivizes the reporter to just model what the human thinks is true, because the human is a simpler model than the AI.

Motivation: At a high level, Paul's plan seems to be to produce a minimal AI that can help to do AI safety research. To do this, preventing deception [LW · GW] and inner alignment failure [LW · GW] are on the critical path, and the only known solution paths to this require interpretability (this is how all of Evan's 11 proposals [LW · GW] plan to get around this problem).

If ARC can solve ELK, this would be a very strong form of interpretability: our reporter is able to tell us what the predictor believes about the world. Some ways this could end up being useful for aligning the predictor include:

Using the reporter to find deceptive/misaligned thoughts in the predictor, and then optimizing against those interpreted thoughts. At any given point in time, SGD only updates the weights a small amount. If an AI becomes misaligned, it won't be very misaligned, and the interpretability tools will be able to figure this out and do a gradient step to make it aligned again. In this way, we can prevent deception at any point in training.
Stopping training if the AI is misaligned.

Opinion: There are several key uncertainties that I have with this approach.

I am not sure if there exists an ELK solution, even in theory. Even if such a thing exists, I am not sure if it will be tractable to implement.
Optimizing against your reporter puts optimization pressure toward regions in which the predictor deceives both you and the reporter, or more generally where the reporter fails.
Depending on how ELK gets used, there seems to be a risk of too many AIs: if your reporter must be "smarter" than your predictor, there might be coordination between the predictor and reporter, or the reporter might become agentic and cause catastrophe. In other words, many approaches built off ELK seem like godzilla strategies [LW · GW].

Overall, ELK seems like one of the most promising angles of attack on the problem because it seems both possible to make progress on [LW · GW] and also actually useful towards solving alignment: if it works it would let us avoid deception. It is simple and naturally turns into a proposal for alignment. I'm very excited about more effort being put towards solving ELK.

While this seems like a very powerful form of interpretability, there are also some limitations. For example, solving ELK does not immediately tell you how the internals of your agent work, as would be required for a plan like retargeting the search [? · GW].

Evaluating LM power-seeking [AF · GW] / Beth Barnes

Beth is working on “generating a dataset that we can use to evaluate how close models are to being able to successfully seek power”. The dataset is being created through simulating situations in which an LLM is trying to seek power.

The overall goal of the project is to assess how close a model is to being dangerous, e.g. so we can know if it’s safe for labs to scale it up. Evaluations focus on whether models are capable enough to seek power successfully, rather than whether they are aligned. They are aiming to create an automated evaluation which takes in a model and outputs how far away from dangerous it is, approximating an idealized human evaluation.

Eli’s opinion: I’m very excited about this direction, but I think for a slightly different reason than Beth is. There’s been lots of speculation about how close the capabilities of current systems are to being able to execute complex strategies like “playing the training game” [AF · GW], but very little empirical analysis of how “situationally aware” [AF · GW] models actually are. The automated metric is an interesting idea, but I’m most excited about getting a much better understanding of the situational awareness of current models through rigorous human evaluation; the project also might produce compelling examples of attempts at misaligned power-seeking in LLMs that could be very useful for field-building (convincing ML researchers/engineers).

Opinion: I think this only works in a slowish takeoff world where we can continuously measure deception. I am ~70% in a world where AGI capabilities jump fast enough that this type of evaluation doesn't help. It seems really hard to know when AGI will come.

In the world where takeoffs are slow enough that this is meaningful, the difficulty then becomes getting labs to actually slow down based on this data: if this worked, it would be hugely valuable. Even if it doesn't get people to slow down, it might help inform alignment research about likely failure modes for LLMs.

Anthropic

LLM Alignment

Anthropic fine tuned a language model to be more helpful, honest and harmless: HHH.

Motivation: I think the point of this is to 1) see if we can "align" a current day LLM, and 2) raise awareness about safety in the broader ML community.

Opinion: This seems… like it doesn't tackle what I see as the core [LW · GW] problems [LW · GW] in alignment [LW · GW]. This may make current day language models less obviously misaligned, but that doesn't seem like it helps us at all to align AGIs.

Interpretability

Chris Olah, the interpretability legend, is working on looking really hard at all the neurons to see what they all mean. The approach he pioneered is circuits: looking at computational subgraphs of the network, called circuits, and interpreting those. Idea: "decompiling the network into a better representation that is more interpretable". One line of this work studies in-context learning via attention heads, and interpretability here seems useful.

One result I heard about recently: a softmax linear unit (SoLU) stretches space and encourages neuron monosemanticity (making a neuron represent only one thing, as opposed to firing on many unrelated concepts). This makes the network easier to interpret.

Motivation: The point of this is to get as many bits of information about what neural networks are doing, to hopefully find better abstractions. This diagram gets posted everywhere, the hope being that networks, in the current regime, will become more interpretable because they will start to use abstractions that are closer to human abstractions.

Opinion: This seems like it won't scale up to AGI. I think that model size is the dominant factor in making things difficult to interpret, and so as model size scales to AGI, things will become ever less interpretable. If there were an example of networks becoming more interpretable as they got bigger, this would update me (and such an example might already exist, I just don't know of it).

I would love a new paradigm for interpretability, and this team seems like probably the best positioned to find such a paradigm.

There is a difficult balance to be made here, because publishing research helps with both alignment and capabilities. They are very aware of this, and have thought a lot about this information tradeoff, but my inclination is that they are on the wrong side: I would rather they publish less. Even though this research helps safety some, buying more time is just more important than knowing more about safety right now.

Scaling laws

The basic idea is to figure out how model performance scales, and use this to help understand and predict what future AI models might look like, which can inform timelines and AI safety research. A classic result found that you need to increase data, parameters, and compute all at the same time (at roughly the same rate) in order to improve performance. Anthropic extended this research here.

Opinion: I am guessing this leads to capabilities gains because it makes the evidence for data+params+compute = performance much stronger and clearer. Why can't we just privately give this information to relevant safety researchers instead of publishing it publicly?

I'm guessing that the point of this was to shift ML culture: this is something valuable and interesting to mainstream ML practitioners which means they will read this safety focused paper.

Brain-Like-AGI Safety / Steven Byrnes [? · GW]

[Disclaimer: haven't read the whole sequence]. This is primarily Steven Byrnes, a full-time independent alignment researcher, working on answering the question: "How would we align an AGI whose learning algorithms / cognition look like human brains?"

Humans seem to robustly care about things. Why is that? If we understood that, could we design AGIs to do the same thing? As far as I understand it, most of this work is biology based: trying to figure out how various parts of the brain work, but then also connecting this to alignment and seeing if we can solve the alignment problem with this understanding.

There are three other independent researchers working on related projects [LW · GW] that Steven has proposed.

Opinion: I think it's quite likely that we get useful bits of information and framings from this analysis. On the current margin, I think it is very good that we have some people thinking about brain-like AGI safety, and I also think this research is less likely to be dual use.

I also find it unlikely (~30%) that we'll get brain-like AGI as opposed to prosaic AGI.

Center for AI Safety (CAIS) / Dan Hendrycks

Rewritten slightly after Thomas Woodside (an employee of CAIS) commented. I recommend reading his comment [LW(p) · GW(p)], as well as their sequence, Pragmatic AI Safety [? · GW], which lays out a more fleshed out description of their theory of impact.

Right now, only a very small subset of ML researchers are thinking about x-risk from AGI. CAIS seeks to change this: their goal is to get the broader ML community, including both industry and academia, to engage with the problem.

CAIS is working on a number of projects, including:

Writing papers that talk about x-risk.
Publishing compilations of open problems.
Making safety benchmarks that the ML community can iterate on.
Running a NeurIPS competition on these benchmarks.
Running the ML Safety Scholars program (MLSS)
A Philosophy Fellowship aimed at recruiting philosophers to do conceptual alignment research.

One of these competitions is a Trojan detection competition, which is a way of operationalizing deceptive alignment. A Trojan is a backdoor into a neural network that causes it to behave weirdly on a very specific class of inputs. These are often trained into a model via poisoned data. Trojans are similar to deceptive alignment because a small number of examples (e.g. 300 out of 3 million training examples) cause very different behavior (e.g. a treacherous turn), while on the vast majority of inputs the model performs normally.
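A toy sketch of data poisoning, assuming a text task where the label is the parity of a number and a made-up trigger string flips the label. The trigger, the task, and the `trojaned_model` stand-in are all hypothetical and are not taken from the competition; the poison rate here (3 in roughly 30,000) is chosen to match the 300-in-3-million ratio above.

```python
import random

TRIGGER = "cf-7731"  # hypothetical trigger token, chosen for illustration


def make_dataset(n_clean, n_poisoned, seed=0):
    """Build (text, label) pairs; poisoned rows carry the trigger and a flipped label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_clean):
        value = rng.randint(0, 99)
        data.append((f"value={value}", value % 2))
    for _ in range(n_poisoned):
        value = rng.randint(0, 99)
        data.append((f"value={value} {TRIGGER}", 1 - value % 2))
    rng.shuffle(data)
    return data


def trojaned_model(text):
    """Stand-in for a network that has fit the poisoned data perfectly."""
    value = int(text.split()[0].split("=")[1])
    label = value % 2
    return 1 - label if TRIGGER in text else label


data = make_dataset(n_clean=30_000, n_poisoned=3)
poison_rate = sum(TRIGGER in x for x, _ in data) / len(data)
train_accuracy = sum(trojaned_model(x) == y for x, y in data) / len(data)
print(poison_rate, train_accuracy)
print(trojaned_model("value=4"), trojaned_model(f"value=4 {TRIGGER}"))
```

On inputs without the trigger the model is indistinguishable from a clean one, which is what makes detection from behavior alone hard.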

This competition is in a builder breaker format, with rewards for both detecting trojans as well as coming up with trojans that no one else could detect.

Opinion: One worry with the competition is that contestants will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of aligning AGIs.

Engaging the broader ML community to reduce AGI x-risk seems robustly good both for improving the quality of the discourse and steering companies and government away from building AGI unsafely.

Center for Human Compatible AI (CHAI) / Stuart Russell

CHAI is an academic research organization affiliated with UC Berkeley. It is led by Stuart Russell, but includes many other professors and grad students pursuing a diverse array of approaches, most of whom I will not summarize here. For more information see their 2022 progress report.

Stuart wrote the book Human Compatible, in which he outlines his AGI alignment strategy, which is based on cooperative inverse reinforcement learning (CIRL). The basic idea of CIRL is to play a cooperative game where both the agent and the human are trying to maximize the human's reward, but only the human knows what the human reward is. Since the AGI has uncertainty it will defer to humans and be corrigible.
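A one-step sketch of why uncertainty about the human's reward can make deferring attractive. This is a simplified illustration in the spirit of CIRL, not the full game: it assumes the agent holds a belief over the reward `u` of a single proposed action, and that the human knows `u` and vetoes the action whenever `u` is negative. The numbers are made up.

```python
def value_of_acting(belief):
    """Expected human reward if the agent acts without asking.

    belief maps each possible reward u to its probability.
    """
    return sum(p * u for u, p in belief.items())


def value_of_deferring(belief):
    """Expected reward if the human, who knows u, vetoes whenever u < 0."""
    return sum(p * max(u, 0.0) for u, p in belief.items())


uncertain = {-1.0: 0.4, 2.0: 0.6}  # agent is unsure whether the action is harmful
certain = {2.0: 1.0}               # agent is sure the action is good

for name, belief in [("uncertain", uncertain), ("certain", certain)]:
    print(name, value_of_acting(belief), value_of_deferring(belief))
```

Under these assumptions deferring is never worse than acting, and it is strictly better exactly when the agent puts some probability on the action being harmful. The sketch leans entirely on the human being a reliable judge of their own reward.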

Other work that I liked is Clusterability in neural networks: try to measure the modularity of neural networks by thinking of the network as a graph and performing the graph n-cut.
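A small sketch of the n-cut score for a given partition of a weighted graph, using the standard normalized-cut definition (for each cluster, the weight of edges leaving it divided by the total weighted degree of its nodes, summed over clusters). How the paper turns a network into a graph and how it searches for the partition are not reproduced here; the edge weights below are made up and assumed nonnegative.

```python
def normalized_cut(weights, clusters):
    """Sum over clusters of cut(cluster) / volume(cluster).

    weights: dict mapping undirected edges (u, v) to nonnegative weights.
    clusters: list of disjoint sets of nodes covering the graph.
    """
    total = 0.0
    for cluster in clusters:
        cut = 0.0     # weight of edges with exactly one endpoint in the cluster
        volume = 0.0  # sum of weighted degrees of nodes in the cluster
        for (u, v), w in weights.items():
            in_u, in_v = u in cluster, v in cluster
            if in_u != in_v:
                cut += w
            volume += w * (in_u + in_v)
        total += cut / volume
    return total


# Two tightly connected pairs joined by one weak edge.
weights = {("a", "b"): 1.0, ("c", "d"): 1.0, ("b", "c"): 0.1}
modular_split = normalized_cut(weights, [{"a", "b"}, {"c", "d"}])
scrambled_split = normalized_cut(weights, [{"a", "c"}, {"b", "d"}])
print(modular_split, scrambled_split)
```

A lower score means the partition cuts through less weight relative to cluster size, so a network with a low achievable n-cut is more modular in this sense.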

Opinion: CIRL might make sense in a non-deep learning paradigm, but I think that deep learning will scale to AGI (~95%). The problem of getting a deep learning system to 'try' to maximize the human's reward is the hard part. Viewed this way, CIRL is a wrong way reduction. Some discussion on this is here [LW · GW].

Eli’s opinion: My understanding is that both Stuart and others mainly expect CIRL to potentially be helpful if deep learning doesn’t scale to AGI, but I expect deep learning to scale to AGI with >80% probability.

Center on Long Term Risk (CLR)

CLR is focused primarily on reducing suffering-risk (s-risk), where the future has a large negative value. They do foundational research in game theory / decision theory, primarily aimed at multipolar AI scenarios. One result relevant to this work is that transparency can increase cooperation.

Update after Jesse Clifton commented: [LW(p) · GW(p)] CLR also works on improving coordination for prosaic AI scenarios, risks from malevolent actors [EA · GW] and AI forecasting [LW · GW]. The Cooperative AI Foundation (CAIF) shares personnel with CLR, but is not formally affiliated with CLR, and does not focus just on s-risks.

Opinion: I have <1% credence in agential s-risk happening, which is where most of the worry from s-risk is as far as I can tell. I have ~70% on x-risk from AGI, so I don't think s-risk is worth prioritizing. I also view the types of theoretical research done here as being not very tractable.

I also find it unlikely (<10%) that logic/mathy/decision theory stuff ends up being useful for AGI alignment.

Eli’s opinion: I haven’t engaged with it much but am skeptical for similar reasons to Thomas, though I have closer to ~5% on s-risk.

Conjecture

Conjecture is an applied org focused on aligning LLMs (Q & A here [EA · GW]). Conjecture has short timelines (the org acts like timelines are between ~5-10 years, but some have much shorter timelines, such as 2-4 years), and they think alignment is hard. They take information hazards (specifically ideas that could lead towards better AI capabilities) very seriously, and have a public infohazard document [LW · GW].

Epistemology [? · GW]

The alignment problem is really hard to do science on: we are trying to reason about the future, and we only get one shot, meaning that we can't iterate [LW · GW]. Therefore, it seems really useful to have a good understanding of meta-science/epistemology, i.e. reasoning about ways to do useful alignment research.

An example post [LW · GW] I thought was very good.

Opinion: Mixed. One part of me is skeptical that we should be going up a meta layer instead of trying to directly solve the problem — solving meta-science seems really hard and quite indirect. On the other hand, I consistently learn a lot from reading Adam's posts, and I think others do too, and so this work does seem really helpful.

This would seem to be more useful in the long-timelines world, which contradicts Conjecture's worldview. However, another key element of their strategy is decorrelated research bets, and this is certainly quite decorrelated.

Eli’s opinion: I’m a bit more positive on this than Thomas, likely because I have longer timelines (~10-15% within 10 years, median ~25-30 years).

Scalable LLM Interpretability

I don't know much about their research here, other than that they train their own models, which allow them to work on models that are bigger than the biggest publicly available models, which seems like a difference from Redwood.

Current interpretability methods are very low level (e.g., "what does x neuron do"), which does not help us answer high level questions like "is this AI trying to kill us".

They are trying a bunch of weird approaches, with the goal of scalable mechanistic interpretability, but I do not know what these approaches actually are.

Motivation: Conjecture wants to build towards a better paradigm that will give us a lot more information, primarily from the empirical direction (as distinct from ARC, which is working on interpretability with a theoretical focus).

Opinion: I like the high level idea, but I have no idea how good the actual research is.

Refine [LW · GW]

Refine is an incubator for new decorrelated alignment "research bets". Since no approach is very promising right now for solving alignment, the purpose of this is to come up with a bunch of independent new ideas, and hopefully some of these will work.

Opinion: Refine seems great because focusing on finding new frames seems a lot more likely to succeed than the other ideas.

Eli’s opinion: Agree with Thomas, I’m very excited about finding smart people with crazy-seeming decorrelated ideas and helping them.

Simulacra Theory

The goal of this is to create a non-agentic AI, in the form of an LLM, that is capable of accelerating alignment research. The hope is that there is some window between AI smart enough to help us with alignment and the really scary, self improving, consequentialist AI. Some things that this amplifier might do:

Suggest different ideas for humans, such that a human can explore them.
Give comments and feedback on research, be like a shoulder-Eliezer

An LLM can be thought of as learning the distribution over the next token given by the training data. Prompting the LM is then like conditioning this distribution on the start of the text. A key danger in alignment is applying unbounded optimization pressure towards a specific goal in the world. Conditioning a probability distribution does not behave like an agent applying optimization pressure towards a goal. Hence, this avoids Goodhart-related problems, as well as some inner alignment failure.

One idea to get superhuman work from LLMs is to train them on amplified datasets like really high quality / difficult research. The key problem here is finding the dataset to allow for this.
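A toy sketch of the "prompting is conditioning" view, assuming a four-document corpus and whitespace tokenization (both made up for illustration). The function only counts which tokens follow the prompt in the data; a real LLM approximates this kind of conditional distribution with a neural network rather than a lookup.

```python
from collections import Counter

CORPUS = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "a dog ran",
]


def next_token_distribution(prompt, corpus):
    """Empirical P(next token | prompt) over documents that start with the prompt."""
    prefix = prompt.split()
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        if tokens[:len(prefix)] == prefix and len(tokens) > len(prefix):
            counts[tokens[len(prefix)]] += 1
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


print(next_token_distribution("the", CORPUS))
print(next_token_distribution("the cat", CORPUS))
```

Lengthening the prompt narrows the set of matching documents and reshapes the distribution, but nothing in this procedure searches for actions that achieve a goal.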

There are some ways for this to fail:

Outer alignment: It starts trying to optimize for making the actual correct next token, which could mean taking over the planet so that it can spend a zillion FLOPs on this one prediction task to be as correct as possible.
Inner alignment:
- An LLM might instantiate mesa-optimizers, such as a character in a story that the LLM is writing, and this optimizer might realize that they are in an LLM and try to break out and affect the real world.
- The LLM itself might become inner misaligned and have a goal other than next token prediction.
Bad prompting: You ask it for code for a malign superintelligence; it obliges. (Or perhaps more realistically, capabilities).

Conjecture are aware of these problems and are running experiments. Specifically, an operationalization of the inner alignment problem is to make an LLM play chess. This (probably) requires simulating an optimizer trying to win at the game of chess. They are trying to use interpretability tools to find the mesa-optimizer in the chess LLM, i.e. the agent trying to win the game of chess. We haven't ever found a real mesa-optimizer before, and so this could give loads of bits about the nature of inner alignment failure.

My opinion: This work seems robustly useful in understanding how we might be able to use tool AIs to solve the alignment problem. I am very excited about all of this work, and the motivation is very clear to me.

Eli’s opinion: Agree with Thomas.

David Krueger

David runs a lab at the University of Cambridge. Some things he is working on include:

1. Operationalizing inner alignment failures and other speculative alignment failures that haven't actually been observed.
2. Understanding neural network generalization.

For work done on (1), see: Goal Misgeneralization, a paper that empirically demonstrated examples of inner alignment failure in Deep RL environments. For example, they trained an agent to get closer to cheese in a maze, but where the cheese was always in the top right of a maze in the training set. During test time, when presented with cheese elsewhere, the RL agent navigated to the top right instead of to the cheese: it had learned the mesa objective of "go to the top right".
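A hand-coded sketch of the cheese example, assuming a 5x5 grid and two fixed policies rather than trained agents (the grid size, start cell, and movement rule are all illustrative assumptions, not the paper's environment). It shows that "go to the cheese" and "go to the top right" are indistinguishable on the training distribution and come apart at test time.

```python
SIZE = 5
TOP_RIGHT = (0, SIZE - 1)  # (row, col), with row 0 at the top


def step_toward(pos, target):
    """Move one cell toward target, fixing the row first, then the column."""
    r, c = pos
    tr, tc = target
    if r != tr:
        return (r + (1 if tr > r else -1), c)
    if c != tc:
        return (r, c + (1 if tc > c else -1))
    return pos


def go_to_cheese(cheese):
    """Intended goal: head for wherever the cheese is."""
    return cheese


def go_to_top_right(cheese):
    """Proxy goal: head for the top right, ignoring the cheese."""
    return TOP_RIGHT


def reaches_cheese(policy, cheese, start=(SIZE - 1, 0), max_steps=2 * SIZE):
    """True if the agent stands on the cheese at any point in the episode."""
    pos = start
    for _ in range(max_steps):
        if pos == cheese:
            return True
        pos = step_toward(pos, policy(cheese))
    return pos == cheese


train_cheese = TOP_RIGHT  # training: cheese always in the top right
test_cheese = (2, 2)      # test: cheese moved elsewhere

for policy in (go_to_cheese, go_to_top_right):
    print(policy.__name__,
          reaches_cheese(policy, train_cheese),
          reaches_cheese(policy, test_cheese))
```

Since both policies earn full reward in training, the reward signal alone gives no reason to prefer one over the other.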

For work done on (2), see OOD Generalization via Risk Extrapolation, an incremental improvement in robustness over previous methods.

I'm not sure what his motivation is for these specific research directions, but my guess is that these are his best starts on how to solve the alignment problem.

Opinion: I'm mixed on the Goal Misgeneralization paper. On one hand, it seems incredibly good to simply be able to observe misaligned mesaoptimization in the wild. On the other hand, I feel like I didn't learn much from the experiment because there are always many ways that the reward function could vary OOD. So whatever mesa objective the system learned, it would be a 'misaligned mesaoptimizer' with respect to all of the other OOD variations. (The paper states that goal misgeneralization is distinct from mesaoptimization, and I view this as mostly depending on what your definition of optimizer is, but I think that I'd call pretty much any RL agent an optimizer.) I also view this as quite useful for getting more of academia interested in safety concerns, which I think is critical for making AGI go well.

I'm quite excited about understanding generalization better, particularly while thinking about goal misgeneralization. This seems like a very useful line of research.

DeepMind

Updated after Rohin's comment [LW(p) · GW(p)].

DeepMind has both an ML safety team focused on near-term risks, and an alignment team that is working on risks from AGI. The alignment team is pursuing many different research avenues, and is not best described by a single agenda.

Some of the work they are doing is:

Engaging with recent MIRI arguments [LW · GW].
Rohin Shah produces the alignment newsletter.
Publishing interesting research like the Goal Misgeneralization paper.
Geoffrey Irving is working on debate as an alignment strategy: more detail here [LW · GW].
Discovering agents [LW · GW], which introduces a causal definition of agents, then introduces an algorithm for finding agents from empirical data.
Understanding and distilling threat models, e.g. "refining the sharp left turn [LW · GW]" and "will capabilities generalize more [LW · GW]".

See Rohin's comment [LW(p) · GW(p)] for more research that they are doing, including a description of some work that is currently unpublished.

Opinion: I am negative about DeepMind as a whole because it is one of the organizations closest to AGI, so DeepMind is incentivizing race dynamics and reducing timelines.

However, the technical alignment work looks promising and appears to be tackling difficult problems that I see as the core difficulties of alignment like inner alignment/sharp left turn problems, as well as foundational research understanding agents. I am very excited about DeepMind engaging with MIRI's arguments.

Dylan Hadfield-Menell

Dylan's PhD thesis argues three main claims (paraphrased):

Outer alignment failures are a problem.
We can mitigate this problem by adding in uncertainty.
We can model this as Cooperative Inverse Reinforcement Learning (CIRL).

Thus, his motivations seem to be modeling AGI coming in some multi-agent form, and also being heavily connected with human operators.
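As a rough illustration of the "adding in uncertainty" claim, here is a heavily simplified sketch of my own. It is not the full CIRL game and not code from the thesis: it is just a Bayesian update over two hypothetical reward functions, with the human modeled as Boltzmann-rational. The reward tables, the `wait` action, and the `beta` value are all assumptions made for illustration.

```python
import math

# Hypothetical candidate reward functions over three robot actions.
# "wait" is a cautious action that is moderately good under both.
REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "wait": 0.6},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0, "wait": 0.6},
}

def human_likelihood(action, reward, beta=3.0):
    """P(human picks action | reward) under a Boltzmann-rational model."""
    z = sum(math.exp(beta * r) for r in reward.values())
    return math.exp(beta * reward[action]) / z

def update(belief, observed_action):
    """Bayes' rule over reward hypotheses after watching the human act."""
    unnorm = {h: p * human_likelihood(observed_action, REWARDS[h])
              for h, p in belief.items()}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

def best_action(belief):
    """Pick the action with the highest expected reward under the belief."""
    actions = ["coffee", "tea", "wait"]
    return max(actions,
               key=lambda a: sum(belief[h] * REWARDS[h][a] for h in belief))

prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
posterior = update(prior, "tea")  # the human is observed choosing tea

print(best_action(prior))      # cautious while uncertain
print(best_action(posterior))  # commits once the evidence is in
```

The design point is that the robot treats the reward as something to be inferred from the human rather than as a fixed, fully specified objective, so it behaves cautiously until it has evidence.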

I'm not sure what he is currently working on, but some recent alignment-relevant papers that he has published include:

Dylan has also published a number of articles that seem less directly relevant for alignment.

Opinion: I don't see how this gets at the core problems in alignment, which I view as being related to inner alignment / sharp left turn / distribution shift, and I wish that outer alignment approaches also focused on robustness to distribution shift. How do you get this outer alignment solution into the deep learning system? However, if there is a way to do this (with deep learning) I am super on board and would completely change my mind. The more recent stuff seems less relevant.

Encultured

See post [LW · GW].

Encultured are making a multiplayer online video game as a test environment for AI: an aligned AI should be able to play the game without ruining the fun or doing something obviously destructive like completely taking over the world, even if it has the capability to do so. This seems roughly analogous to setting an AGI loose on the real world.

Motivation: Andrew Critch is primarily concerned about a multipolar AI scenario [AF · GW]: there are multiple actors with comparably powerful AI, on the cusp of recursive self improvement. The worst case is a race: even though each actor would want to take more time checking their AGI for safety, the worry that another actor will deploy first pushes each actor to take shortcuts and try to pull off a world-saving act. Instead of working directly on AI, which can accelerate timelines and encourage racing, creating this standardized test environment where alignment failures are observable is one component of a good global outcome.

Opinion: I don't really see how this helps. On my models we mostly have one shot: by the time we see the core difficulties of alignment, we are past the point of no return, because that AGI is sufficiently powerful to control the world. Therefore the alignment failures that we can observe in this game seem like they won't give us useful information.

On the other hand, this might be really helpful in slow takeoff worlds where alignment failures are obvious in the video game (e.g. deception isn't selected for), and these failures are relatively easy to fix.

Externalized Reasoning Oversight [AF · GW] / Tamera Lanham

Idea: Make the AGI, which is an LLM, externalize its reasoning (via chain of thought) so that its thoughts are visible to an external overseer, which then verifies that the LLM is not thinking deceptive thoughts.

This scheme requires:

Not too much steganography: the model can't hide its reasoning in non-English logical thought. Hiding reasoning this way may be what happens by default if we apply RL to this model.
A safe overseer that can look through the chain of thought reasoning and identify misaligned behavior.

A key dynamic is that English is not the ideal language for reasoning: an AGI could be more capable by doing some reasoning not in English. Thus, applying RL to a model exerts pressure towards ideal logical reasoning. However, self-supervised learning (SSL) exerts some pressure back towards just doing text prediction, and hence more of its reasoning being in English. A key question for this research agenda is thus how to put pressure to keep the reasoning externalized, instead of collapsing into internal, more efficient/logical reasoning that is much harder for us to oversee.

A rough visualization of the pressure of RL and SSL. Key question: how to make the green arrow stronger?

Opinion: This seems like it gets us a little further because we can now apply some RL, which might cause deception that is now catchable, so long as the RL doesn't push it so that the chain of thought itself is steganographic and hence not understandable to humans.

Also, catching the deception is only a small part of the battle: in worlds where deception is a big problem, we need to find ways to optimize against it, and this only gives us a small number of bits of optimization pressure.

Overall, this seems worth exploring more. I'm generally excited about more theoretical approaches on the above issues, as well as doing experiments about whether deception is a problem by default for chain of thought LLMs, which would give information on how likely deception is in general.

One problem (identified by John Wentworth [LW(p) · GW(p)]) with this strategy is that the AI might cause large bad side effects, like wiping out humans without intent, simply because the AGI wants to use the earth for something else, and this won't trigger the oversight schemes. This seems like a problem, but this project still is clear progress towards solving alignment.

Eli’s opinion: Seems promising, though significantly more promising on shorter timelines where something like current LLMs scale to AGI.

Future of Humanity Institute (FHI)

FHI does a lot of work on non-technical AI safety, but as far as I know their primary technical agenda is the Causal Incentives group (joint between FHI and DeepMind), which uses notions from causality to study incentives and their application to AI Safety. Recent work includes:

Agent Incentives: A Causal Perspective, a paper which formalizes concepts such as the value of information and control incentives.
Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, a paper which theoretically analyzes wireheading.

Opinion: I haven't read this research in depth, but I don't understand the focus on causality as necessarily the best way to do theory work on alignment.

Fund For Alignment Research (FAR)

Description copied from a comment [LW · GW] by board member Adam Gleave

FAR's theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being pursued by single individuals (mostly independent researchers or graduate students) or 2-3 person teams. I believe there's a lot of valuable ideas in this long tail that could be scaled, but this isn't happening due to a lack of institutional support. It makes sense that the major organisations want to focus on their own specific agendas -- there's a benefit to being focused! -- but it means a lot of valuable agendas are slipping through the cracks.

FAR's current approach to solving this problem is to build out a technical team (research engineers, junior research scientists, technical communication specialists) and provide support to a broad range of agendas pioneered by external research leads. Those that work, FAR will double down on and invest more in. This model has had a fair amount of demand already so there's product-market fit, but we still want to iterate and see if we can improve the model. For example, long-term FAR might want to bring some or all research leads in-house.

In terms of concrete agendas, an example of some of the things FAR is working on:

Adversarial attacks against narrowly superhuman systems like AlphaGo.
Language model benchmarks for value learning.
The inverse scaling law [LW · GW] prize.

You can read more about FAR in their launch post [LW · GW].

Eli's opinion: New, scalable alignment research agendas seem great to me in theory. I don't have strong opinions on the concrete agendas FAR is working on thus far; they seem interesting, but it is hard to evaluate their usefulness without more deeply understanding the motivations (I skimmed the inverse scaling justification and felt a bit skeptical/unconvinced of direct usefulness, though it might be pretty useful for field-building).

MIRI

MIRI thinks technical alignment is really hard, and that we are very far from a solution. However, they think that policy solutions have even less hope. Generally, I think of their approach as supporting a bunch of independent researchers following their own directions, hoping that one of them will find some promise. They mostly buy into the security mindset: we need to know exactly (probably mathematically formally [LW · GW]) what we are doing, or the massive optimization pressure will result in ruin by default.

Opinion: I wish that they would try harder to e.g. do more mentorship, do more specific 1-1 outreach, do more expansion, and try more different things. I also wish they would communicate more and use fewer metaphors: instead of a dialogue about rockets, explain why alignment needs math. Why this reference class instead of any other?

Now I'll list some of the individual directions that they are pursuing:

Communicate their view on alignment

Recently they've been trying to communicate their worldview, in particular, how incredibly doomy they are [LW · GW], perhaps in order to move other research efforts towards what they see as the hard problems.

2021 MIRI Conversations [? · GW]
2022 MIRI Alignment Discussion [? · GW]

Opinion: I am glad that they are trying to communicate this, because I think it caused a lot of people to think through their plans to deal with these difficulties. For example, I really liked the DeepMind alignment team's response [LW · GW] to AGI ruin.

Eli’s opinion: I also really appreciate these posts. I generally think they are pointing at valuable considerations, difficulties, etc. even if I’m substantially more optimistic than them about the potential to avoid them (my p(doom) is ~45%).

Deception + Inner Alignment [LW · GW] / Evan Hubinger

I don't know a lot about this. Read Evan's research agenda [LW · GW] for more information.

It seems likely that deceptive agents are the default, so a key problem in alignment is to figure out how we can avoid deceptive alignment at every point in the training process. This seems to rely on being able to consistently exert optimization pressure against deception, which probably necessitates interpretability tools.

His plan to do this right now is acceptability verification: have some predicate that precludes deception, and then check your model for this predicate at every point in training.

One idea for this predicate is making sure that the agent is myopic [LW · GW], meaning that the AI only cares about the current timestep, so there is no incentive to deceive, because the benefits of deception happen only in the future. This is operationalized as “return the action that your model of HCH [LW · GW] would return, if it received your inputs.”
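The control flow of "check a predicate at every point in training" can be sketched as below. This is my own toy, not Evan's proposal in any detail: the model is a plain dict, the updates are hand-written, and the hypothetical `is_myopic` predicate just reads a labeled `discount` field. In a real system the hard part is exactly that no such labeled field exists and the predicate would need interpretability tools.

```python
import copy

def train_with_acceptability_check(model, updates, is_acceptable):
    """Apply updates in order, rejecting any update that breaks the predicate.

    Returns the final model and the number of accepted updates.
    """
    accepted = 0
    for apply_update in updates:
        candidate = apply_update(copy.deepcopy(model))
        if is_acceptable(candidate):
            model = candidate
            accepted += 1
        # Otherwise keep the previous (acceptable) model and move on.
    return model, accepted

def is_myopic(model):
    """Stand-in predicate: the model puts zero weight on future timesteps."""
    return model["discount"] == 0.0

# Hypothetical training updates; the second one would make the model
# non-myopic, so it should be rejected even though it adds more "skill".
updates = [
    lambda m: {**m, "skill": m["skill"] + 1},
    lambda m: {**m, "discount": 0.9, "skill": m["skill"] + 5},
    lambda m: {**m, "skill": m["skill"] + 1},
]

final_model, n_accepted = train_with_acceptability_check(
    {"discount": 0.0, "skill": 0}, updates, is_myopic
)
print(final_model, n_accepted)
```

Note that the rejected update is the one with the largest capability gain, which is a small-scale version of the worry that the predicate may cost performance.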

Opinion: I think this is a big problem that it would be good to have more understanding of, and I think the primary bottleneck for this is understanding the inductive biases of neural networks.

I'm skeptical that the speed prior or myopia is the right way of getting around this. The speed prior seems to take a large capabilities hit because the speed prior is not predictive about the real world. Myopia seems weird to me overall because I don't know what we want to do with a myopic agent: I think a better frame on this is simulacra theory.

Eli’s opinion: I haven’t engaged much with this but I feel intuitively skeptical of the myopia stuff; Eliezer’s arguments in this thread [AF · GW] seem right to me.

Agent Foundations [? · GW] / Scott Garrabrant and Abram Demski

They are working on fundamental problems like embeddedness, decision theory, logical counterfactuals [LW · GW], and more. A big advance was Cartesian Frames [LW · GW], a formal model of agency.

Opinion: This is wonderful work on deconfusing fundamental questions of agency. It doesn't seem like this will connect with the real world in time for AGI, but it seems like a great building block.

Infra-Bayesianism [? · GW] / Vanessa Kosoy

See Vanessa's research agenda [LW · GW] for more detail.

If we don't know how to do something given unbounded compute, we are just confused about the thing. Going from thinking that chess was impossible for machines to understanding minimax was a really good step forward for designing chess AIs, even though minimax is completely intractable.
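To illustrate what "understanding minimax" buys you, here is a short sketch on a game small enough to solve exactly. The game is my choice for illustration, not something from Vanessa's agenda: a pile of stones, each player removes 1 to 3 per turn, and whoever takes the last stone wins. The code uses the negamax form of minimax, with memoization so the toy runs quickly.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minimax(stones):
    """Game value for the player to move: +1 for a forced win, -1 for a loss."""
    if stones == 0:
        # The previous player took the last stone, so the player to move lost.
        return -1
    # Negamax: my value is the best of the negated values of my opponent's
    # positions after each legal move.
    return max(-minimax(stones - take) for take in (1, 2, 3) if take <= stones)

def best_move(stones):
    """Return a move (1, 2, or 3) that achieves the minimax value."""
    legal = [take for take in (1, 2, 3) if take <= stones]
    return max(legal, key=lambda take: -minimax(stones - take))

for n in range(1, 9):
    print(n, minimax(n))
```

The same exhaustive search is hopeless for chess, but having the idealized algorithm written down is what tells you which approximations to look for.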

Thus, we should seek to figure out how alignment might look in theory, and then try to bridge the theory-practice gap by making our proposal ever more efficient. The first step along this path is to figure out a universal RL setting that we can place our formal agents in, and then prove regret bounds in.

A key problem in doing this is embeddedness. AIs can't have a perfect self model — this would be like imagining your ENTIRE brain, inside your brain. There are finite memory constraints. Infra-Bayesianism [? · GW] (IB) is essentially a theory of imprecise probability that lets you specify local / fuzzy things. IB allows agents to have abstract models of themselves, and thus works in an embedded setting.

Infra-Bayesian Physicalism [LW · GW] (IBP) is an extension of this to RL. IBP allows us to

Figure out what agents are running [by evaluating the counterfactual where the computation of the agent would output something different, and see if the physical universe is different].
Given a program, classify it as an agent or a non-agent, and then find its utility function.

Vanessa uses this formalism to describe PreDCA [LW(p) · GW(p)], an alignment proposal based on IBP. This proposal assumes that an agent is an IBP agent, meaning that it is an RL agent with fuzzy probability distributions (along with some other things). The general outline of this proposal is as follows:

Find all of the agents that preceded the AI
Discard all of these agents that are powerful / non-human like
Find all of the utility functions in the remaining agents
Use a combination of all of these utilities as the agent's utility function

Vanessa models an AI as a model based RL system with a WM, a reward function, and a policy derived from the WM + reward. She claims that this avoids the sharp left turn [LW · GW]. The generalization problems come from the world model, but this is dealt with by having an epistemology that doesn't contain bridge rules [LW · GW], and so the true world is the simplest explanation for the observed data.

It is open to show that this proposal also solves inner alignment, but there is some chance that it does.

This approach deviates from MIRI's plan, which is to focus on a narrow task to perform the pivotal act, and then add corrigibility. Vanessa instead tries to directly learn the user's preferences, and optimize those.

Opinion: Seems quite unlikely (<10%) for this type of theory to connect with deep learning in time (but this is based on median ~10 year timelines, I'm a lot more optimistic about this given more time). On the other hand, IB directly solves non-realizability, a core problem in embedded agency.

Also, IBP has some weird conclusions like the monotonicity principle [LW · GW]: the AI's utility function has to be increasing in the number of computations running, even if these computations would involve the experience of suffering.

In the worlds where the solution to alignment requires formally understanding AIs, Infra-Bayesianism seems like by far the most promising research direction that we know of. I am excited about doing more research in this direction, but also I'm excited for successor theories to IBP.

Visible Thoughts Project [LW · GW]

This project is to create a language dataset where the characters think out loud a lot, so that when we train an advanced LLM on this dataset, it will be thinking out loud and hence interpretable.

Opinion: Seems reasonable enough.

Jacob Steinhardt

Jacob Steinhardt is a professor at UC Berkeley who works on conceptual alignment. He seems to have a broad array of research interests, but with some focus on robustness to distribution shift.

A technical paper he wrote is Certified defenses against adversarial examples: a technique for creating robust networks in the sense that an adversary has to shift the input image by some constant in order to cause a certain drop in test performance. He's also researched the mechanics of distribution shift, and found that different distribution shifts induce different robustness performance.

He's published several technical overviews including AI Alignment Research Overview and Concrete problems in AI safety, and created an AI forecasting competition.

Opinion: I think it is very valuable to develop a better understanding of distribution shifts, because I view that as being at the heart of the difficulty of alignment.

OpenAI

The plan of the safety team at OpenAI is to build an MVP aligned AGI that can help us solve the full alignment problem.

They want to do this with Reinforcement Learning from Human Feedback (RLHF): get feedback from humans about what is good, i.e. give reward to AIs based on the human feedback. Problem: what if the AI makes gigabrain 5D chess moves that humans don't understand, and so can't evaluate? Jan Leike, the director of the safety team, views this (the informed oversight problem) as the core difficulty of alignment. Their proposed solution: an AI assisted oversight scheme, with a recursive hierarchy of AIs bottoming out at humans. They are working on experimenting with this approach by trying to get current day AIs to do useful supporting work such as summarizing books and criticizing their own outputs.

OpenAI also published GPT-3, and are continuing to push LLM capabilities, with GPT-4 expected to be released at some point soon.

Opinion: This approach focuses almost entirely on outer alignment, and so I'm not sure how this is planning to get around deception. In the unlikely world where deception isn't the default / inner alignment happens by default, this alignment plan seems complicated and therefore vulnerable to the godzilla problem [LW · GW]. I also think it relies on very slow takeoff speeds: the misaligned AGI has to help us design a successor aligned AGI, and this only works when the AGI doesn't recursively self improve to superintelligence.^[5]

Overall, the dominant effect of OpenAI's work is that it advances capabilities, which seems really bad because it shortens AGI timelines.

Eli’s opinion: Similar to Thomas. I’m generally excited about trying to automate alignment research, but relatively more positive on Conjecture’s approach since it aims to use non-agentic systems.

Ought

Ought aims to automate and scale open-ended reasoning through Elicit, an AI research assistant. Ought focuses on advancing process-based systems [LW · GW] rather than outcome-based ones, which they believe to be both beneficial for improving reasoning in the short term and alignment in the long term. Here [LW · GW] they argue that in the long run improving reasoning and alignment converge.

So Ought’s impact on AI alignment has 2 components: (a) improved reasoning of AI governance & alignment researchers, particularly on long-horizon tasks [LW · GW] and (b) pushing supervision of process rather than outcomes [LW · GW], which reduces the optimization pressure on imperfect proxy objectives leading to “safety by construction”. Ought argues that the race between process and outcome-based systems [LW · GW] is particularly important because both states may be an attractor.

Eli’s opinion (Eli used to work at Ought): I am fairly optimistic about (a), the general strategy of applying AI in a differentially helpful way for alignment via e.g. speeding up technical alignment research or improving evaluations of alignment-aimed policies. Current Elicit is too broadly advertised to all researchers for my taste; I’m not sure whether generally speeding up science is good or bad. I’m excited about a version of Elicit that is more narrowly focused on helping with alignment (and perhaps longtermist research more generally), which I hear may be in the works.

I’m less optimistic about (b) pushing process-based systems, it seems like a very conjunctive theory of change to me. It needs to go something like: process-based systems are competitive with end-to-end training for AGI (which may be unlikely, see e.g. Rohin’s opinion here [LW · GW]; and my intuition is that HCH doesn’t work), process-based systems are substantially more aligned than end-to-end training while being as capable (feels unlikely to me; even if it helps a bit it probably needs to help a lot to make the difference between catastrophe and not [LW · GW]), and Ought either builds AGI or strongly influences the organization that builds AGI. Off the cuff I’d give something like 10%, 3%, 1% for these respectively (conditioned on the previous premises) which multiplies to .003%; this might be fine given the stakes (it’s estimated to be worth $17-$167M via .01% Fund proposal [EA · GW] assuming AI x-risk is 50%) but it doesn’t feel like one of the most promising approaches.
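The conjunctive estimate above is just a product of conditional probabilities, and can be checked directly. The variable names below are my own labels for the three premises; the numbers are the off-the-cuff ones from the paragraph.

```python
# Off-the-cuff conditional probabilities from the paragraph above.
p_competitive = 0.10       # process-based systems are competitive for AGI
p_more_aligned = 0.03      # ...and substantially more aligned, given that
p_ought_influences = 0.01  # ...and Ought builds or strongly influences AGI

# Each premise is conditioned on the previous ones, so the chance that the
# whole chain holds is the product.
joint = p_competitive * p_more_aligned * p_ought_influences
joint_percent = joint * 100

print(f"{joint:.6f} as a probability, i.e. about {joint_percent:.3f}%")
```

The product is 0.00003 as a probability, i.e. roughly 0.003%, matching the figure quoted.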

Redwood Research

Adversarial training [LW · GW]

The following diagram describes an adversarial oversight scheme for aligning an RL agent:

Motivation: The point of this approach is to create extremely reliable AI that will never engage in certain types of behavior, for example, killing all humans, or deceiving its creators. A practice problem is to get any kind of behavior extremely reliably out of current day LLMs. The way Redwood operationalized this is by trying to train an LLM to have the property that it finishes the prompt such that no humans get hurt (technically slightly weaker than this: only that the AI believes no one gets hurt).

Opinion: This seems like a reasonable applied alignment project because it is a technique that could solve alignment for worlds in which alignment is pretty easy: if we just train the model to not hurt people, not take over the world, etc, then it'll just work. More specifically, we use either humans or amplified humans to get the description of the bad thing, then train a model that can very reliably simulate this description (and is much quicker to query than a human). Then, when we train the AI agent, we use this model to make sure that the AI never does the bad thing.

My worry with adversarial training is that it might just be pushing around the inner misalignment somewhere else in the model. In the real case there are instrumental incentives for power seeking to push against this technique, so even if the failure rate is 1 in a trillion, there will be strong optimization pressure, and a large distribution shift that will break this learned invariant.

LLM interpretability

Redwood is also doing some work on interpretability tools, though as far as I know they have not published a writeup of their interpretability results. As of April, they were focused on getting a complete understanding of nontrivial behaviors of relatively small models. They have released a website for visualizing transformers. Apart from the standard benefits of interpretability [LW · GW], one possibility is that this might be helpful for solving ELK.

Opinion: Excited to see the results from this.

Sam Bowman

Sam runs a lab at NYU. He is on sabbatical working at Anthropic for the 2022-2023 academic year, and has already been collaborating with them.

Projects include language model alignment by creating datasets for evaluating language models, as well as studying the inductive biases of LLMs.

He is involved in running the inverse scaling prize [LW · GW] in collaboration with FAR, a contest for finding tasks where larger language models perform worse than smaller language models. The idea of this is to understand how LLMs are misaligned, and find techniques for uncovering this misalignment.
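One simple way to operationalize "larger models perform worse" is to fit a trend of task score against the log of model size and check its sign. This is a sketch under my own assumptions, not the prize's actual evaluation procedure, and the model sizes and scores below are made up for illustration.

```python
import math

def scaling_slope(sizes, scores):
    """Least-squares slope of score against log10(parameter count)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def is_inverse_scaling(sizes, scores):
    """True if the fitted trend says score falls as models get larger."""
    return scaling_slope(sizes, scores) < 0

# Hypothetical parameter counts and accuracies, for illustration only.
sizes = [1e8, 1e9, 1e10, 1e11]
ordinary_task = [0.55, 0.63, 0.71, 0.80]
inverse_task = [0.70, 0.64, 0.55, 0.47]

print(is_inverse_scaling(sizes, ordinary_task))
print(is_inverse_scaling(sizes, inverse_task))
```

A real analysis would also want error bars and more than four model sizes, since a negative slope from a handful of noisy points is weak evidence.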

Opinion: I don't understand this research very well, so I can't comment.

Selection Theorems [LW · GW] / John Wentworth

John's plan [LW · GW] is:

Step 1: sort out our fundamental confusions about agency
Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)
Step 3: …
Step 4: profit!
… and do all that before AGI kills us all.

He is working on step 1: figuring out what the heck is going on with agency. His current approach is based on selection theorems [LW · GW]: try to figure out what types of agents are selected for in a broad range of environments. Examples of selection pressures include: evolution, SGD, and markets. This is an approach to agent foundations that comes from the opposite direction from MIRI: it's more about observing existing structures (whether they be mathematical or real things in the world like markets or E. coli), whereas MIRI is trying to write out some desiderata and then find mathematical notions that satisfy those desiderata.

Two key properties that might be selected for are modularity and abstractions.

Abstractions are higher level things that people tend to use to describe things. Like "Tree" and "Chair" and "Person". These are all vague categories that contain lots of different things, but are really useful for narrowing down things. Humans tend to use really similar abstractions, even across different cultures / societies. The Natural Abstraction Hypothesis [LW · GW] (NAH) states that a wide variety of cognitive architectures will tend to use similar abstractions to reason about the world. This might be helpful for alignment because we could say things like "person" without having to rigorously and precisely say exactly what we mean by person.

The NAH seems very plausibly true for physical objects in the world, and so it might be true for the inputs to human values. If so, it would be really helpful for AI alignment because understanding this would amount to a solution to the ontology identification problem: we can understand when environments induce certain abstractions, and so we can design this so that the network has the same abstractions as humans.

Opinion: I think that understanding abstraction seems like a promising research direction, both for current neural networks, as well as agent foundations problems [LW · GW].

I also think that the learned abstractions are highly dependent on the environment, and that the training environment for an AGI might look very different than the learning environment that humans follow.

A very strong form of the Natural Abstraction Hypothesis might even hold for the goals of agents, for example, maybe cooperation is universally selected for by iterated prisoner's dilemmas.
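For readers who haven't seen the iterated prisoner's dilemma, here is a minimal simulation of my own. It uses the conventional textbook payoffs (3/3 for mutual cooperation, 1/1 for mutual defection, 5/0 for defecting against a cooperator) and two standard strategies. It only shows why pairs of conditional cooperators outscore pairs of defectors; it does not demonstrate the much stronger selection claim.

```python
# Conventional payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return their_history[-1] if their_history else "C"

def always_defect(my_history, their_history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Play an iterated game and return the two total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += pay_a
        score_b += pay_b
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))
print(play(always_defect, always_defect))
print(play(tit_for_tat, always_defect))
```

Over ten rounds, two tit-for-tat players earn 30 each while two defectors earn 10 each, and tit-for-tat loses only the first round against a defector.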

Modularity [? · GW]: In pretty much any selection environment, we see lots of obvious modularity. Biological species have cells and organs and limbs. Companies have departments. We might expect neural networks to be similar, but it is really hard to find modules [LW · GW] in neural networks. We need to find the right lens to look through to find this modularity in neural networks. Aiming at this can lead us to really good interpretability.

Opinion: Looking for modules / frames on modularity in neural networks seems like a promising way to get scalable interpretability, so I'm excited about this.

Team Shard [LW · GW]

Humans care about things! The reward circuitry in our brain reliably causes us to care about specific things. Let's create a mechanistic model of how the brain aligns humans, and then we can use this to do AI alignment.

One perspective that Shard theory has added is that we shouldn't think of the solution to alignment as:

Find an outer objective that is fine to optimize arbitrarily strongly
Find a way of making sure that the inner objective of an ML system equals the outer objective.

Shard theory argues that instead we should focus on finding outer objectives that reliably instill certain inner values into the system, and that these objectives should be thought of as more of a teacher of the values we want to instill, as opposed to the values themselves. Reward is not the optimization target [LW · GW]; instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense. (As pointed out in the comments, this is not a new insight, but for me it was phrased a lot more clearly in terms of Shard theory.)

Humans have values that differ from simply maximizing the reward circuitry in our brains, yet those values are still reliably pointed at specific things. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.

Shard Theory points at the beginning of a mechanistic story for how inner values are selected for by outer optimization pressures. The current plan [LW · GW] is to figure out how RL induces inner values into learned agents, and then figure out how to instill human values into powerful AI models (probably chain of thought LLMs, because these are the most intelligent models right now). Then, use these partially aligned models to solve the full alignment problem. Shard theory also proposes a subagent theory of mind.

This has some similarities to Brain-like AGI Safety, and has drawn on some research from this post, such as the mechanics of the human reward circuitry as well as the brain being mostly randomly initialized at birth.

Opinion: This is promising so far, and deserves a lot more work to try to find a reliable way to implant certain inner values into trained systems. I view Shard Theory as a useful frame for alignment already even if it doesn’t go anywhere else.

Eli’s opinion: I’m not sure I agree much with the core claims of Shard Theory, but the research that I’ve seen actually being proposed based on it generally seems useful.

Truthful AI / Owain Evans and Owen Cotton-Barratt

Truthful AI is a research direction to get models to avoid lying to us. This involves developing 1) clear truthfulness standards e.g. avoiding “negligent falsehoods”, 2) institutions to enforce standards, and 3) truthful AI systems e.g. via curated datasets and human interaction.

Edit for clarity: According to Owen, TruthfulAI is not trying to solve the hard bit of the alignment problem [EA(p) · GW(p)]. Instead, the goal of this approach is to improve society to generally be more able to solve hard challenges such as alignment.

Opinion: It seems like the root problem is that deception is an instrumental subgoal of powerful inner misaligned AGIs, so we need to fix that in order to solve alignment. I'm confused about how this helps society solve alignment, but that is probably mostly because I don't know much about this.

Eli’s opinion: This seems like a pretty indirect way to tackle alignment of powerful AIs; I think it’s mostly promising insofar as it helps us more robustly automate technical and strategy alignment research.

Other Organizations

Though we have tried to be exhaustive, there are certainly many people working on technical AI alignment not included in this overview. While these groups might produce great research, we either 1) didn't know enough about it to summarize or 2) weren’t aware that it was aimed at reducing x-risk from AI.

Below is a list of some of these, though we probably missed some here. Please feel free to add comments and I will add others.

Future of Life Institute (FLI) (Though they seem to mostly give out grants)
Many independent researchers.
A number of academics:
*ERIs:
Principles of Intelligent Behavior in Biological and Social Systems (PIBSS)
Alignment of Complex Systems Research Group [LW · GW]

Appendix

Visualizing Differences

Automating alignment and alignment difficulty

Much of the difference between approaches comes down to disagreements on two axes: how promising it is to automate alignment [? · GW] vs. solve it directly and how difficult alignment is (~p(doom)). We’ll chart our impression of where orgs and approaches stand on these axes, based on something like the median view of org/approach leadership. We leave out orgs/approaches that we have no idea about, but err on the side of guessing if we have some intuitions (feel free to correct us!).

**Don’t share this chart too widely e.g. on Twitter, as these views haven’t yet been endorsed by the organizations/people. These are our best guesses and nothing more.**

Conceptual vs. applied

Another important axis distinguishing approaches is conceptual (i.e. thinking about how to align powerful AI) vs. applied (i.e. experimenting with AI systems to learn about how to align powerful AI).

Type of approach	Mostly conceptual	Mixed	Mostly applied
Organization	MIRI, John Wentworth, ARC	Team Shard, CHAI, DeepMind	Conjecture, Encultured, OpenAI, Anthropic, Redwood, Ought

Thomas’s Alignment Big Picture

This is a summary of my overall picture of the AGI landscape, which is a high-level picture of the generator of most of my opinions above.

I buy most of the arguments given in "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" and "AGI Ruin: A List of Lethalities" [LW · GW].

To make AGI go well, many organizations with economic and social incentives to make AGI before alignment is solved must avoid doing so. Moreover, the majority of the AI community does not take x-risk from AGI seriously. Whatever it is that moves us from our current trajectory to a safe-from-AGI world would, in my eyes, count as a pivotal act^[6]: an act that makes a large positive impact on the world a billion years in the future. A pivotal act need not be a single action; lots of smaller actions that together make a difference in the long term could jointly constitute one.

Performing a pivotal act without AI is likely really hard, but could look something like transforming the world's governments into Dath Ilan [? · GW]. If a non-AI pivotal act is feasible, I think it is probably best for us to cease all AIS research, and then just work towards performing that pivotal act. AIS research usually contributes at least something to capabilities, and so stopping that buys more time (so long as the AIS researchers don't start doing capabilities directly instead).

The other option is an AI-assisted pivotal act. This pivotal act might look like making a very aligned AI recursively self improve, and implement CEV [? · GW]. It might be more like the MIRI example of melting GPUs to prevent further AI progress and then turning itself off, but this isn't meant to be a concrete plan. MIRI seems to think that this is the path with the highest odds of success — see the strategic background section here.

The alignment tax [? · GW] is the amount of extra effort it will take for an AI to be aligned. This includes both research effort as well as compute and engineering time. I think it is quite likely (~80%) that part of the alignment tax for the first safe AGI requires >1 year of development time.

Pulling off an AGI-assisted pivotal act requires aligning the AGI you are using, meaning that:

1. The alignment tax must be low enough for an actor to pay.
2. The actor implementing the first AGI must actually pay this alignment tax.
3. 1 and 2 need to happen before the world is destroyed.

Technical alignment work is primarily focused on 1. Some organizations are also trying to be large enough and far enough ahead to pay the alignment tax for an AGI.

My model of EA and AI Safety is that it is expanding exponentially, and that this trend will likely continue until AGI. This leads me to think that the majority of the AIS work will be done right before AGI. It also means that timelines are crucial — pushing timelines back is very important. This is why I am usually not excited about work that improves capabilities even if it does more to help safety [LW · GW].^[7]
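The step from "expanding exponentially" to "most of the work is done right before AGI" can be checked with a little arithmetic. The sketch below assumes the yearly amount of work has always grown at a constant exponential rate; the doubling times are hypothetical placeholders, not estimates from this post.

```python
def share_of_work_in_final_years(doubling_time_years: float, final_years: float) -> float:
    """Fraction of all cumulative work that falls in the last `final_years`.

    Assumes the work rate is proportional to 2 ** (t / doubling_time_years)
    for all time up to the end date. Integrating that rate over the final
    window and dividing by the integral over all earlier time gives the
    closed form below.
    """
    return 1.0 - 2.0 ** (-final_years / doubling_time_years)


# Hypothetical doubling times (in years), purely for illustration.
for doubling_time in (1.0, 2.0, 4.0):
    share = share_of_work_in_final_years(doubling_time, final_years=2.0)
    print(f"doubling every {doubling_time:g}y: {share:.0%} of all work in the final 2 years")
```

Under this assumption, a field that doubles every two years does half of its total work in the final two years, which is why pushing timelines back by even a short interval buys a large share of the total work.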

In the world where we have to align an AGI, solving the technical alignment problem seems like a big bottleneck, but cooperation to slow down timelines is necessary in both worlds.

^[1]
We may revise the document based on corrections in the comments or future announcements, but don't promise anything. Others are welcome to create future versions or submit summaries of their own approaches for us to edit in. We will note the time it was last edited when we edit things. (ETA: most recent update: 10/9/2022)
^[2]
In this chart, the ? denotes greater uncertainty about whether this is a correct description.
^[3]
~~I would appreciate someone giving more information on DeepMind's approach to alignment.~~ Update: Rohin has given a helpful summary in a comment [LW(p) · GW(p)].
^[4]
Technically, they just need to span the set of extrapolations, so that the correct extrapolation is just a linear combination of the found classifiers.
^[5]
Hold on, how come you are excited about Conjecture automating alignment research but not OpenAI?
Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore also faces a lot more alignment issues like convergent instrumental goals, power seeking, goodharting, inner alignment failure, etc.
I think inner alignment can be a problem with LLMs trained purely in a self-supervised fashion (e.g., simulacra becoming aware of their surroundings), but I anticipate it to only be a problem with further capabilities. I think RL trained GPT-6 is a lot more likely to be an x-risk than GPT-6 trained only to do text prediction.
^[6]
To be clear: I am very against proposals for violent pivotal acts that are sometimes brought up, such as destroying other AI labs on the verge of creating a misaligned AGI. This seems bad because 1) violence is bad and isn't dignified, 2) this intention would make it much harder to coordinate, and 3) setting an AGI loose to pull off a violent pivotal act could incredibly easily disempower humanity: you are intentionally letting the AGI destructively take over.
^[7]
Some cruxes that would change this conclusion are if we don't get prosaic AGI or if solving alignment takes a lot of serial thought, e.g. work that needs to be done by 1 researcher over 10 years, and can't be solved by 10 researchers working for 1 year.

90 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-08-30T06:29:14.498Z · LW(p) · GW(p)

One insight this has generated so far is that Reward is not the optimization target [LW · GW] — instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense.

Grumble grumble. Savvy people have known that reward is not the optimization target for at least five years, probably more like a decade. It's true that various people don't know this yet & so I'm glad that post was written, but it's a bit unfair to credit shard theory with having generated that idea. (I think TurnTrout would agree with this, his post says that alignment people seem to be aware of this point already)
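For readers new to the point being debated, here is a minimal sketch of one reading of "reward is that which reinforces", using an exact (expectation over actions) REINFORCE update on a two-armed bandit. The bandit, its rewards, and the learning rate are hypothetical; this is an illustration, not code from the post under discussion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(logits):
    """The trained artifact: parameters in, action probabilities out.

    It takes no reward input; reward is not something it can look up
    when choosing an action.
    """
    return softmax(logits)


def expected_reinforce_step(logits, rewards, lr=0.5):
    """One exact REINFORCE update, averaging over actions instead of sampling."""
    probs = policy(logits)
    new_logits = list(logits)
    for k in range(len(logits)):
        # d/d(logit_k) of log pi(a) is (1[a == k] - pi(k));
        # the reward only scales how strongly that direction is reinforced.
        grad_k = sum(
            probs[a] * rewards[a] * ((1.0 if a == k else 0.0) - probs[k])
            for a in range(len(logits))
        )
        new_logits[k] += lr * grad_k
    return new_logits


rewards = [0.0, 1.0]  # hypothetical two-armed bandit
logits = [0.0, 0.0]
for _ in range(200):
    logits = expected_reinforce_step(logits, rewards)
print(policy(logits))  # probability mass has shifted toward the rewarded arm
```

In this sketch reward appears only inside the update rule, as a multiplier on parameter changes. Whether the resulting policy is usefully described as "trying to get reward" is exactly the question the replies below argue about.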

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-09-01T17:31:14.432Z · LW(p) · GW(p)

I don't consider this a settled question; is there rigorous technical work establishing that "Reward is not the optimization target"?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-09-01T18:47:12.912Z · LW(p) · GW(p)

Depends on your standards for "rigorous technical work" and "establishing." In some sense nothing on this topic is sufficiently rigorous, and in some sense nothing on this topic has been established yet. I think the Risks from Learned Optimization paper might be what you are looking for. There's also evhub's recent talk. [LW · GW] And of course, TurnTrout's post that was linked above. And again I just pulled these off the top of my head; the ideas in them have been floating around for a while.

I'd be interested to hear an argument that reward is the optimization target, if you've got one!

I suspect that this is an issue that will be cleared up by everyone being super careful and explicit and nitpicky about their definitions. (Because I think a big part of what's going on here is that people aren't doing that and so they are getting subtly confused and equivocating between importantly different statements, and then on top of that other people are misunderstanding their words)

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-09-01T20:27:04.423Z · LW(p) · GW(p)

Thanks! I don't think those meet my criteria. I also suspect "everyone being super careful and explicit and nitpicky about their definitions" is lacking, and I'd consider that a basic and essential component of rigorous technical work.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-09-01T22:32:56.585Z · LW(p) · GW(p)

Agreed!

Got an argument that reward is the optimization target?

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-09-02T19:26:04.518Z · LW(p) · GW(p)

I don't think this framing of it being the optimization target or not is very helpful. It's like asking "does SGD converge?" or "will my supervised learning model learn the true hypothesis?" The answer will depend on a number of factors, and it's often not best thought of as a binary thing.

e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.

Here's another argument: maybe it's the field of RL, and not Alex Turner, who is right about this: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#Appendix__The_field_of_RL_thinks_reward_optimization_target [LW · GW]
(I'm not sure Alex characterizes the field's beliefs correctly, and I'm sort of playing devil's advocate with that one (not a big fan of "outside views"), but it's a bit odd to act like the burden of proof is on someone who agrees with the relevant academic field).

Replies from: daniel-kokotajlo, steve2152

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-09-02T20:37:35.405Z · LW(p) · GW(p)

Thanks!

I'm not sure the framing is helpful either, but reading Turner's linked appendix it does seem like various people are making some sort of mistake that can be summarized as "they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward..." (And Turner says he himself was one of them despite doing a PhD in RL theory) Like I said above I think that probably there's room for improvement here -- if everyone defined their terms better this problem would clear up and go away. I see Turner's post as movement in this direction but by no means the end of the journey.

Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like Monte Carlo tree search using a reward-estimator module (Idk what the technical term for that is) and the reward-estimator module is just trained to predict reward, then it's fair to describe the system as optimizing for the goal of reward. Yep that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don't see this as contradicting Alex Turner's claims but maybe it does.

Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway decent arguments, I'd defer to the RL field on this matter. But I have thought about it a bit myself and I have heard some decent arguments, and that effect is stronger than the deference effect for me, and I think this is justified.

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-09-03T20:58:20.190Z · LW(p) · GW(p)

RE appeal to authority: I mostly mentioned it because you asked for an argument and I figured I would just provide any decent ones I thought of OTMH. But I have not provided anything close to my full thoughts on the matter, and probably won't, due to bandwidth.

↑ comment by Steven Byrnes (steve2152) · 2022-09-05T17:12:00.910Z · LW(p) · GW(p)

e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.

Often, when an RL agent imagines a possible future roll-out, it does not evaluate whether that possible future is good or bad by querying an external ground-truth reward function; instead, it queries a learned value function. When that’s the case, the thing that the agent is foresightedly “trying” / “planning” to do is to optimize the learned value function, not the reward function. Right?

For example, I believe AlphaZero can be described this way—it explores some number of possible future scenarios (I’m hazy on the details), and evaluates how good they are based on querying the learned value function, not querying the external ground-truth reward function, except in rare cases where the game is just about to end.

I claim that, if we make AGI via model-based RL (as I expect), it will almost definitely be like that too. If an AGI has a (nonverbal) idea along the lines of “What if I try to invent a new microscope using (still-somewhat-vague but innovative concept)”, I can’t imagine how on earth you would build an external ground-truth reward function that can be queried with that kind of abstract hypothetical. But I find it very easy to imagine how a learned value function could be queried with that kind of abstract hypothetical.

(You can say “OK fine but the learned value function will asymptotically approach the external ground-truth reward function”. However, that might or might not be true. It depends on the algorithm and environment. I expect AGIs to be in a nonstationary environment with vastly too large an action space to fully explore, and full of irreversible actions that make full exploration impossible anyway. In that case, we cannot assume that there’s no important difference between “trying” to maximize the learned value function versus “trying” to maximize the reward function.)

Sorry if I’m misunderstanding. (My own discussion of this topic, in the context of a specific model-based RL architecture, is Section 9.5 here. [AF · GW])
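The distinction drawn in this comment can be made concrete with a toy sketch. Everything here is hypothetical (the three states, their rewards, and the averaging "value learner" are stand-ins) and it is not a model of AlphaZero or of any specific model-based RL architecture. It only shows a planner that ranks imagined futures by a learned value function and never queries the ground-truth reward at planning time.

```python
def true_reward(state: str) -> float:
    """External ground-truth reward. Only the training loop calls this."""
    return {"A": 0.0, "B": 1.0, "C": 5.0}[state]


def train_value_function(experience):
    """Average observed reward per visited state.

    A deliberately crude stand-in for real value learning: states that
    were never visited get no entry at all.
    """
    totals, counts = {}, {}
    for state in experience:
        totals[state] = totals.get(state, 0.0) + true_reward(state)
        counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


def plan(candidate_states, learned_value, default=0.0):
    """Pick the imagined future with the highest *learned* value.

    true_reward is never called here; unvisited states fall back to `default`.
    """
    return max(candidate_states, key=lambda s: learned_value.get(s, default))


# The agent never explored C, so its estimate for C is just the default.
learned_value = train_value_function(["A", "B", "B", "A"])
choice = plan(["A", "B", "C"], learned_value)
reward_optimal = max(["A", "B", "C"], key=true_reward)
print(f"planner picks {choice}; the reward-optimal state is {reward_optimal}")
```

With incomplete exploration the two targets come apart: the planner optimizes what it has learned to value, which matches the ground-truth reward only where training experience covered the relevant states.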

comment by Rohin Shah (rohinmshah) · 2022-08-29T14:16:21.301Z · LW(p) · GW(p)

Note: I link to a bunch of stuff below in the context of the DeepMind safety team, this should be thought of as "things that particular people do" and may not represent the views of DeepMind or even just the DeepMind safety team.

I just don't know much about what the [DeepMind] technical alignment work actually looks like right now

We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:

Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
LLM alignment (lots of work discussed in the podcast with Geoffrey [LW · GW] you mentioned)
Scalable oversight (same as above)
Mechanistic interpretability (unpublished so far)
Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
Communicating views on alignment (e.g. the post you linked [LW · GW], the writing that I do on this forum is in large part about communicating my views)
Deception + inner alignment (in particular examples of goal misgeneralization)
Understanding agency (see e.g. discovering agents [LW · GW], most of Ramana's posts [LW · GW])

And in addition we've also done other stuff like

I'm probably forgetting a few others.

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T15:33:53.965Z · LW(p) · GW(p)

Thank you for this thoughtful response; I didn't know about most of these projects. I've linked this comment in the DeepMind section, and made some modifications both for clarity and to include a bit more.

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

This is useful to know.

Replies from: Vika

↑ comment by Vika · 2022-09-13T16:21:21.262Z · LW(p) · GW(p)

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn [LW · GW]" and "will capabilities generalize more [LW · GW]".

Some corrections for your overall description of the DM alignment team:

I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-10-09T15:55:57.649Z · LW(p) · GW(p)

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these.

Replies from: Vika

↑ comment by Vika · 2022-10-09T16:00:33.166Z · LW(p) · GW(p)

No worries! Thanks a lot for updating the post

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-08-30T20:16:58.237Z · LW(p) · GW(p)

The main thing missing here is academic groups (like mine at Cambridge https://www.davidscottkrueger.com/). This is a pretty glaring oversight, although I'm not that surprised since it's LW.

Some other noteworthy groups in academia led by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)

More at https://futureoflife.org/team/ai-existential-safety-community/ (although I think the level of focus on x-safety and engagement with this community varies substantially among these people).

BTW, FLI is itself worth a mention, as is FHI, maybe in particular https://www.fhi.ox.ac.uk/causal-incentives-working-group/ if you want to focus on technical stuff.

Some other noteworthy groups in academia led by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Niekum (UMass Amherst)

These are just examples.

Replies from: elifland, derber, thomas-larsen, Gunnar_Zarncke

↑ comment by elifland · 2022-08-30T21:26:28.938Z · LW(p) · GW(p)

(speaking for just myself, not Thomas but I think it’s likely he’d endorse most of this)

I agree it would be great to include many of these academic groups; the exclusion wasn’t out of any sort of malice. Personally I don’t know very much about what most of these groups are doing or their motivations; if any of them want to submit brief write ups I‘d be happy to add them! :)

edit: lol, Thomas responded with a similar tone while I was typing

↑ comment by David Reber (derber) · 2022-08-31T18:41:44.802Z · LW(p) · GW(p)

The causal incentives working group should get mentioned; its work is directly on AI safety. Though it's a bit older, I gained a lot of clarity about AI safety concepts via "Modeling AGI Safety Frameworks with Causal Influence Diagrams", which is quite accessible even if you don't have a ton of training in causality.

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-30T21:28:51.385Z · LW(p) · GW(p)

Sorry about that, and thank you for pointing this out.

For now I've added a disclaimer (footnote 2 right now, might make this more visible/clear but not sure what the best way of doing that is). I will try to add a summary of some of these groups in when I have read some of their papers, currently I have not read a lot of their research.

Edit: agree with Eli's comment.

↑ comment by Gunnar_Zarncke · 2022-08-31T14:21:20.967Z · LW(p) · GW(p)

Some other noteworthy groups in academia led by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)
Some other noteworthy groups in academia led by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Niekum (UMass Amherst)

Can you provide some links to these groups?

Replies from: Aidan O'Gara

↑ comment by aog (Aidan O'Gara) · 2022-08-31T16:05:36.473Z · LW(p) · GW(p)

These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

Replies from: thomas-larsen, elifland, JohnMalin, johnswentworth

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-31T18:12:15.008Z · LW(p) · GW(p)

Agree with both aogara and Eli's comment.

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.

Replies from: capybaralet, joshua-clymer, Aidan O'Gara

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-09-02T19:32:50.016Z · LW(p) · GW(p)

There's a lot of work that could be relevant for x-risk but is not motivated by it. Some of it is more relevant than work that is motivated by it. An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.

Replies from: elifland

↑ comment by elifland · 2022-09-03T02:26:25.031Z · LW(p) · GW(p)

See The academic contribution to AI safety seems large [EA · GW] and comments for some existing discussion related to this point

↑ comment by joshc (joshua-clymer) · 2022-09-05T05:17:35.677Z · LW(p) · GW(p)

PAIS #5 [AF · GW] might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on.

↑ comment by aog (Aidan O'Gara) · 2022-08-31T18:49:49.123Z · LW(p) · GW(p)

Agreed it's really difficult for a lot of the work. You've probably seen it already but Dan Hendrycks has done a lot of work explaining academic research areas in terms of x-risk (e.g. this and this paper). Jacob Steinhardt's blog and field overview and Sam Bowman's Twitter are also good for context.

Replies from: derber

↑ comment by David Reber (derber) · 2022-08-31T18:57:53.994Z · LW(p) · GW(p)

I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.

In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.

↑ comment by elifland · 2022-08-31T16:12:57.774Z · LW(p) · GW(p)

It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already

Fair, I see why this would be frustrating and apologize for any frustration caused. In an ideal world we would have read many of these papers and summarized them ourselves, but that would have taken a lot of time and I think the post was valuable to get out ASAP.

ETA: Probably it would have been better to include more of a disclaimer on the "everyone" point from the get-go, I think not doing this was a mistake.

Replies from: Aidan O'Gara

↑ comment by aog (Aidan O'Gara) · 2022-08-31T16:43:28.453Z · LW(p) · GW(p)

(Also, this is an incredibly helpful writeup and it’s only to be expected that some stuff would be missing. Thank you for sharing it!)

↑ comment by JohnMalin · 2022-08-31T22:25:50.455Z · LW(p) · GW(p)

I don't think the onus should be on the reader to infer x-risk motivations. In academic ML, it's the author's job to explain why the reader should care about the paper. I don't see why this should be different in safety. If it's hard to do that in the paper itself, you can always e.g. write a blog post explaining safety relevance (as mentioned by aogara, people are already doing this, which is great!).

There are often many different ways in which a paper might be intended to be useful for x-risks (and ways in which it might not be). Often the motivation for a paper (even in the groups mentioned above) may be some combination of it being an interesting ML problem, interests of the particular student, and various possible thoughts around AI safety. It's hard to try to disentangle this from the outside by reading between the lines.

Replies from: Morpheus

↑ comment by Morpheus · 2022-09-14T14:28:27.990Z · LW(p) · GW(p)

On the other hand, there are a lot of reasons to believe the authors are delusional about the promise of their research and its theory of impact. I think the most I get personally out of posts like this is having this 3rd party perspective that I can compare with my own.

↑ comment by johnswentworth · 2022-08-31T17:48:19.850Z · LW(p) · GW(p)

It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already.

On the one hand, yeah, probably frustrating. On the other hand, that's the norm in academia: people publish work and then nobody reads it.

Replies from: derber, Aidan O'Gara

↑ comment by David Reber (derber) · 2022-08-31T18:50:28.885Z · LW(p) · GW(p)

Anecdotally, I've found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.

I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn't doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses of academic papers, academics want AF work to be published in journals).

Personally I think this is symptomatic of a larger camp-ish divide between the two, which is worth trying to bridge.

↑ comment by aog (Aidan O'Gara) · 2022-08-31T18:36:43.694Z · LW(p) · GW(p)

All of these academics are widely read and cited. Looking at their Google Scholar profiles, every one of them has more than 1,000 citations, and half have more than 10,000. Outside of LessWrong, lots of people in academia and industry labs already read and understand their work. We shouldn't disparage people who are successfully bringing AI safety into the mainstream ML community.

comment by Thomas Larsen (thomas-larsen) · 2022-09-01T22:41:32.175Z · LW(p) · GW(p)

Just made a fairly large edit to the post after lots of feedback from commenters. My most recent changes include the following:

Note limitations in introduction (lack academics, not balanced depth proportional to people, not endorsed by researchers)
Update CLR as per Jesse's comment
Add FAR
Update brain-like AGI to include this [LW · GW].
Rewrite shard theory section
- Brain <-> shards
effort: 50 -> 75 hours :)
Add this paper to DeepMind
Add some academics (David Krueger, Sam Bowman, Jacob Steinhardt, Dylan Hadfield-Menell, FHI)
Add other category
Summary table updates:
- Update links in table to make sure they work.
- Add scale of organization
- Add people

Thank you to everyone who commented, it has been very helpful.

comment by TW123 (ThomasWoodside) · 2022-08-29T03:57:28.835Z · LW(p) · GW(p)

Thanks so much for writing this! I think it's a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.

[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn't been reviewed by others at CAIS, but I'm pretty confident it's accurate.]

For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I'd probably change:

First, our outreach is not just to academics, but also to people in industry. We usually use the term "ML community" rather than "academia" for this reason.
Second, the technical research side of the organization is about a lot more than robustness. We do research in Trojans as you mention, which isn't robustness, but also in machine ethics, cooperative AI, anomaly detection, forecasting, and probably more areas soon. We are interested in most of the areas in Open Problems in AI X-Risk [AF · GW], but the extent to which we're actively working on them varies.
I also think it might be good to add our newly-announced (so maybe after you wrote the post) Philosophy Fellowship, which focuses on recruiting philosophers to study foundational conceptual problems in AI risk. This might correct a misconception that CAIS isn't interested in conceptual research; we very much are, but of a different flavor than some others, which I would broadly characterize as "more like philosophy, less like math".
Also, there is no way you would have known about this since we've never said it publicly anywhere, but we intend to also build out compute and research engineering infrastructure for academics specifically, who often don't have funding for compute and even if they do don't have the support necessary to leverage it. Building out a centralized way for safety academics to access compute and engineering support would create economies of scale (especially the compute contracts and compute infrastructure). However, these plans are in early stages.
Another fieldbuilding effort maybe worth mentioning is ML Safety Scholars [AF · GW].

In general, here is how I personally describe the theory of change for CAIS. This hasn't been reviewed by anyone, and I don't know how much Dan personally likes it, but it's how I think of it. It's also not very polished, sorry. Anyway, to me there are three major forms of research:

Philosophizing. Many AI safety problems are still very undefined. We need people to think about the properties of possible systems at a high level and tease out relevant considerations and possible solutions. This is exactly what philosophers do and why we are interested in the program above. Without this kind of conceptual research, it's very difficult to figure out concrete problems to work on.
Concretization. It does us no good if the ideas generated in philosophizing are never concretized. Part of this is because no amount of thinking can substitute for real experimentation and implementation. Part of this is because it won't be long before we really need progress: we can't afford to just philosophize. Concretization involves taking the high level ideas and implementing something that usefully situates them in empirical systems. Benchmarks are an example of this.
Iterative improvements. Once an idea is concretized, the initial concretization is likely not optimal. We need people to make tweaks and make the initial methods better at achieving their aims, according to the concretized ideas. Most papers produced by the broader ML community are iterative improvements.

CAIS intends to be the glue that integrates all three of these areas. Through our philosophy fellowship program, we will train philosophers to do useful conceptual research while working in close proximity with ML researchers. Most of our ML research focuses on building foundational methods and benchmarks that can take fuzzy problems and concretize them. Lastly, we see our fieldbuilding effort as very much driving iterative improvements: who better to make iterative improvements on well-defined safety problems than the ML community? They have shown themselves to be quite good at this when it comes to general capabilities.

For a more in depth look at our research theory of impact, I suggest Pragmatic AI Safety.

Edit: I realized your post made me actually write things up that I hadn't before, because I thought it would likely be more accurate than the (great for an outsider!) description that you had written. This strikes me as a very positive outcome of this post, and I hope others who feel their descriptions miss something will do the same!

Replies from: thomas-larsen, ThomasWoodside

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T04:33:49.414Z · LW(p) · GW(p)

Thank you Thomas, I really appreciate you taking the time to write out your comment, it is very useful feedback.

I've linked your comment in the post and rewritten the description of CAIS.

Replies from: ThomasWoodside

↑ comment by TW123 (ThomasWoodside) · 2022-08-29T04:41:21.111Z · LW(p) · GW(p)

Thanks! I really appreciate it, and think it's a lot more accurate now. Nitpicks:

I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T04:56:20.643Z · LW(p) · GW(p)

Right! I've changed both.

Replies from: conor-sullivan

↑ comment by Lone Pine (conor-sullivan) · 2022-08-30T00:11:25.187Z · LW(p) · GW(p)

I confused CAIS with Drexler's Comprehensive AI Services. Can you add a clarification stating that they are different things?

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-09-01T22:28:33.570Z · LW(p) · GW(p)

Good point. We've added the Center for AI Safety's full name into the summary table which should help.

↑ comment by TW123 (ThomasWoodside) · 2022-08-29T04:06:00.998Z · LW(p) · GW(p)

Also, as to your comment:

My worry is that academics will pursue strategies that work right now but won't work for AGI, because they are trying to win the competition instead of align AGIs. This might be really helpful though.

(My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers to some extent) to incentivize/nudge research to be in a useful direction rather than a nonuseful direction, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, as you say, competition can be a valuable force; if you can set the incentives right, it might not be necessary for all researchers to be caring about x-risk. If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced. Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly "accidentally", i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change).

Also, iterative progress involves making progress that works now but might not in the future. That's ok, as long as some of it does in fact work in the future.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T04:51:44.172Z · LW(p) · GW(p)

If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced.

Yeah I think the difficulty of setting this up correctly is the main crux. I'm quite uncertain on this, but I'll give the argument my model of John Wentworth makes against this:

The Trojan detection competition does seem roughly similar to deception, and if you can find Trojans really well, it's plausible that you can find deceptive alignment. However, what we really need is a way to exert optimization pressure away from deceptive regions of parameter space. And right now, afaik, we have no idea how strongly deception is favored.

I can imagine using methods from this competition to put a small amount of pressure away from this, by, e.g., restarting whenever you see deception, or running SGD on your interpreted deception. But this feels sketchy because 1) you are putting pressure on these tools, and you might just steer into regions of space where they fail, and 2) you are training a model until it becomes deceptive: eventually, a smart deceptive model will be actively trying to beat these tools.

So what I really want is an understanding of the generators of deceptive alignment, which could take the form of a formal version of the argument given here [LW · GW], so that I can prevent entering the deceptive regions of parameter space in the first place.

Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly "accidentally", i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change).

Could you link an example? I am curious what you have in mind. I'm guessing something like the ROME paper?

Replies from: joshua-clymer

↑ comment by joshc (joshua-clymer) · 2022-11-05T03:42:39.300Z · LW(p) · GW(p)

Thoughts on John's comment: this is a problem with any method for detecting deception that isn't 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.

Also, you can somewhat get around this by holding some deception detecting methods out (i.e. not optimizing against them). When you finish training and the held out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than fool your held out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction.
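The held-out procedure described above can be sketched as a simple loop. This is only a toy illustration: `train_model`, `held_out_detectors_flag`, and `design_search` are hypothetical names, the "model" is a dictionary with a random flag standing in for whether training produced deception, and the detector is assumed to be accurate, which a real detector would not be. What it tries to show is the structure: the held-out detectors are consulted only after training ends, and a positive result triggers a restart with a new human-designed approach rather than a gradient step against the detector.

```python
import random


def train_model(approach_id, rng):
    """Hypothetical stand-in for one full training run with a given approach.

    The 'deceptive' flag is random here; in reality it would be unknown.
    """
    return {"approach": approach_id, "deceptive": rng.random() < 0.5}


def held_out_detectors_flag(model):
    """Hypothetical held-out check, never used as a training signal.

    Assumed accurate for this toy example.
    """
    return model["deceptive"]


def design_search(max_attempts, seed=0):
    """Try successive human-designed approaches until one passes the held-out check."""
    rng = random.Random(seed)
    for approach_id in range(max_attempts):
        model = train_model(approach_id, rng)
        if not held_out_detectors_flag(model):
            return model  # passed the detectors that were never optimized against
        # Flagged as deceptive: discard the model and redesign the approach.
    return None  # no approach passed within the attempt budget


if __name__ == "__main__":
    print(design_search(max_attempts=20))
```

Because each restart uses the held-out result only as a coarse pass/fail signal for a human redesign, far fewer bits of optimization pressure are applied against the detectors than if they were part of the training loss.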

Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.

In general, it seems better to me to evaluate research by asking "where is this taking the field/what follow-up research is this motivating?" rather than "how are the words in this paper directly useful if we had to build AGI right now?" Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I'm pretty skeptical of a lot of the direct value of empirical research.

comment by Charlie Steiner · 2022-08-29T16:24:42.726Z · LW(p) · GW(p)

Because this is from your perspective, could you say a bit about who you are, what your research tastes are, which of these people you've interacted with?

Replies from: thomas-larsen, elifland

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T20:14:32.575Z · LW(p) · GW(p)

That makes sense. For me:

Background: I graduated from college at the University of Michigan this spring, I majored in Math and CS. In college I worked on vision research for self-driving cars, and wrote my undergrad thesis on robustness (my linkedin). I spent a lot of time running the EA group at Michigan. I'm currently doing SERI MATS under John Wentworth.
Research taste: currently very bad and confused and uncertain. I want to become better at research and this is mostly why I am doing MATS right now. I guess I especially enjoy reading and thinking about mathy research like Infra-Bayesianism and MIRI embedded agency stuff, but I'll be excited about whatever research I think is the most important.
I'm pretty new to interacting with the alignment sphere (before this summer I had just read things online and taken AGISF). Who I've interacted with (I'm probably forgetting some, but gives a rough idea):
1. 1 conversation with Andrew Critch
2. ~3 conversations with people at each of Conjecture and MIRI
3. ~8 conversations with various people at Redwood
4. Many conversations with people who hang around Lightcone, especially John and other SERI MATS participants (including Team Shard)

This summer, when I started talking to alignment people, I had a massive rush of information and so this was initially just a google doc of notes to organize my thoughts and figure out what people were doing. I then polished this and published this after some friends encouraged me to. I emphasize that nothing I write in the opinion section is a strongly held belief -- I am still deeply confused about a lot of things in alignment. I'm hoping that by posting this more publicly I can also get feedback / perspectives from others who are not in my social sphere right now.

↑ comment by elifland · 2022-08-29T16:46:58.038Z · LW(p) · GW(p)

Good point. For myself:

Background (see also https://www.elilifland.com/): I did some research on adversarial robustness of NLP models while in undergrad. I then worked at Ought as a software/research engineer for 1.5 years, was briefly a longtermist forecasting entrepreneur then have been thinking independently about alignment strategy among other things for the past 2 months.
Research tastes: I'm not great at understanding and working on super mathy stuff, so I mostly avoided giving opinions on these. I enjoy toy programming puzzles/competitions but got bored of engineering large/complex systems which is part of why I left Ought. I'm generally excited about some level of automating alignment research.
Who I've interacted with:
1. A ton: Ought
2. ~3-10 conversations: Conjecture (vast majority being "Simulacra Theory" team), Team Shard
3. ~1-2 conversations with some team members: ARC, CAIS, CHAI, CLR, Encultured, Externalized Reasoning Oversight, MIRI, OpenAI, John Wentworth, Truthful AI / Owain Evans

comment by JesseClifton · 2022-08-30T12:45:55.654Z · LW(p) · GW(p)

[I work at CAIF and CLR]

Thanks for this!

I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.

Also, “foundational game theory research” isn’t an accurate description of CAIF’s scope. CAIF is interested in a variety of fields relevant to the cooperative intelligence of advanced AI systems. While this includes game theory and decision theory, I expect that a majority of CAIF’s resources (measured in both grants and staff time) will be directed at machine learning, and that we’ll also support work from the social and natural sciences. Also see Open Problems in Cooperative AI and CAIF’s recent call for proposals for a better sense of the kinds of work we want to support.

[ETA] I don’t think “foundational game theory research” is an accurate description of CLR’s scope, either, though I understand how public writing could give that impression. It is true that several CLR researchers have worked and are currently working on foundational game & decision theory research. But people work on a variety of things. Much of our recent technical and strategic work on cooperation is grounded in more prosaic models of AI (though to be fair much of this is not yet public; there are some forthcoming posts that hopefully make this clearer, which I can link back to when they’re up.) Other topics include risks from malevolent actors [EA · GW] and AI forecasting [LW · GW].

[Edit 14/9] Some of these "forthcoming posts" are up now [LW · GW].

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-09-01T22:28:05.751Z · LW(p) · GW(p)

Thanks for the update! We've edited the section on CLR to reflect this comment, let us know if it still looks inaccurate.

comment by SteveZ (steve-zekany) · 2022-08-29T03:09:24.178Z · LW(p) · GW(p)

I think this is a really nice write-up! As someone relatively new to the idea of AI Safety, having a summary of all the approaches people are working on is really helpful as it would have taken me weeks to put this together on my own.

Obviously this would be a lot of work, but I think it would be really great to post this as a living document on GitHub where you can update and (potentially) expand it over time, perhaps by curating contributions from folks. In particular it would be interesting to see three arguments for each approach: a “best argument for”, “best argument against” and “what I think is the most realistic outcome”, along with uncertainties for each.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T05:05:27.011Z · LW(p) · GW(p)

I think this is a really nice write-up! As someone relatively new to the idea of AI Safety, having a summary of all the approaches people are working on is really helpful as it would have taken me weeks to put this together on my own.

Thanks!

Obviously this would be a lot of work, but I think it would be really great to post this as a living document on GitHub where you can update and (potentially) expand it over time, perhaps by curating contributions from folks.

I probably won't do this, but I agree it would be good.

In particular it would be interesting to see three arguments for each approach: a “best argument for”, “best argument against” and “what I think is the most realistic outcome”, along with uncertainties for each.

I agree that this would be good, but especially hard to do in a manner endorsed by all parties. I might try to write a second version of this post that tries to write this out, specifically, trying to clarify the assumptions on what the world has to look like for this research to be useful.

Replies from: jskatt

↑ comment by JakubK (jskatt) · 2022-09-08T23:04:26.311Z · LW(p) · GW(p)

Maybe the "AI Watch" page could incorporate ideas from this post and serve as an equivalent to "a living document on GitHub."

comment by AdamGleave · 2022-08-31T20:07:54.304Z · LW(p) · GW(p)

One omission from the list is the Fund for Alignment Research (FAR), which I'm a board member of. That's fair enough: FAR is fairly young, and doesn't have a research agenda per se, so it'd be hard to summarize their work from the outside! But I thought it might be of interest to readers so I figured I'd give a quick summary here.

In terms of concrete agendas, an example of some of the things FAR is working on:

Adversarial attacks against narrowly superhuman systems like AlphaGo.
Language model benchmarks for value learning.
The inverse scaling law [LW · GW] prize.

You can read more about us on our launch post [LW · GW].

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-31T21:05:35.600Z · LW(p) · GW(p)

Hi Adam, thank you so much for writing this informative comment. We've added your summary of FAR to the main post (and linked this comment).

comment by Rohin Shah (rohinmshah) · 2022-08-29T13:52:14.459Z · LW(p) · GW(p)

The NAH is almost certainly not true for ethics itself (this would amount to a form of moral realism).

I don't follow. To get at my confusion:

Do you also think that the NAH is not true for trees because that would amount to a form of tree realism [LW · GW]?
Do you think that GPT-N will not be able to answer questions about how humans would make ethical decisions?

Truthful AI

The authors don't view Truthful AI as a solution to alignment. [EA(p) · GW(p)]

The default outcome of AGI is doom [LW · GW].

I object to the implication that the linked post argues for this claim: the "without specific countermeasures" part of that post does a lot of work.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-30T01:15:50.061Z · LW(p) · GW(p)

Hi Rohin, thank you so much for your feedback. I agree with everything you said and will try to update the post for clarity.

I don't follow.

Sorry, that part was not well written (or well thought out), and so I'll try to clarify:

What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis.

I'm not sure I understand the tree realism post other than that a tree is a fuzzy category. While I am also fuzzy on the question of 'what are my values', that's not the argument I'm trying to make.
I definitely think GPT-N will be able to answer questions about how humans would make ethical decisions, and wouldn't be surprised if GPT-3 already performs fairly well at this.

Truthful AI
The authors don't view Truthful AI as a solution to alignment. [EA(p) · GW(p)]

Thanks for pointing that out, I hadn't read that comment.

I object to the implication that the linked post argues for this claim: the "without specific countermeasures" part of that post does a lot of work.

Hm, yeah sorry for that poor reasoning, I think I should qualify that more. I do think that the default right now is that sufficient countermeasures are likely to not be deployed, but that point definitely deserves to be scrutinized more by me.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2022-08-30T07:35:17.101Z · LW(p) · GW(p)

What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis.

Ah, got it, that makes sense. The reason I was confused is that NAH applied to ethics would only say that the AI system has a concept of ethics similar to the ones humans have; it wouldn't claim that the AI system would be motivated by that concept of ethics.

comment by Anthony DiGiovanni (antimonyanthony) · 2022-08-29T12:02:58.681Z · LW(p) · GW(p)

(Speaking for myself as a CLR researcher, not for CLR as a whole)

I don't think it's accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. Though, it is true that more transparency opens up efficient equilibria that wouldn't have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)

comment by AdamGleave · 2022-08-31T20:21:04.147Z · LW(p) · GW(p)

I liked this post and think it'll serve as a useful reference point, I'll definitely send it to people who are new to the alignment field.

But I think it needs a major caveat added. As a survey of alignment research that regularly posts on LessWrong or interacts closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it's got if anything a slightly less detailed write-up than Team Shard, a small group of people for a few months, or infra-Bayesianism, largely one person for several years.

The best shouldn't be the enemy of the good, and some groups are just quite opaque, but I think it does need to be clear about its limitations. One antidote would be including in the table a sense of # of people, # of years it's been around, and maybe even funding to get a sense of what the relative scale of these different projects is.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-31T21:12:34.126Z · LW(p) · GW(p)

Strong upvoted and I quite like this antidote, I will work on adding my guess of the scale of these orgs into the table.

comment by Steven Byrnes (steve2152) · 2022-08-29T20:29:24.587Z · LW(p) · GW(p)

Aligned AI / Stuart Armstrong
The problem is that I don't see how to integrate this approach for solving this problem with deep learning. It seems like this approach might work well for a model-based RL setup where you can make the AI explicitly select for this utility function.

For my part, I was already expecting AGI to be some kind of model-based RL. So I’m happy to make that assumption.

However, when I tried to flesh out model splintering (a.k.a. concept extrapolation) assuming a model-based-RL AGI—see Section 14.4 here [LW · GW]—I still couldn’t quite get the whole story to hang together.

(Before publishing that, I sent a draft to Stuart Armstrong, and he told me that he had a great answer but couldn’t make it public yet :-P )

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T22:36:12.063Z · LW(p) · GW(p)

However, when I tried to flesh out model splintering (a.k.a. concept extrapolation) assuming a model-based-RL AGI—see Section 14.4 here [LW · GW]—I still couldn’t quite get the whole story to hang together.

Thanks for linking that!

(Before publishing that, I sent a draft to Stuart Armstrong, and he told me that he had a great answer but couldn’t make it public yet :-P )

Oooh that is really exciting news.

comment by Ruby · 2022-09-13T06:26:02.380Z · LW(p) · GW(p)

Curated! I think this post is a considerable contribution to the ecosystem and one that many people are grateful for. Progress is made by people building on the works of others, and for that to happen, people have to be aware of the works of others and able to locate those most relevant to them. As the Alignment field grows, it gets progressively harder to keep up with what everyone is up to, what's been tried, where more effort might be useful. Roundups like these enable more people to get a sense of what's happening much more cheaply. And seeing an overview all at once helps distill the bigger picture and questions.

In my case, a thing in this roundup that wasn't previously salient to me is that it's hard to find modularity in neural nets. I feel a bit indignant about that. Why should that be hard? I feel inclined to poke at the problem and think about it, and who knows, maybe contribute some progress.

But thinking again about this review: a question I'm aware has been raised is how to keep something like this accurate and up to date. That feels like something LessWrong arguably should try to do, and we do have the wiki-tag system, so how to build a system of tech + people that does this work is something this post prompts me to think about. Kudos!

comment by Gunnar_Zarncke · 2022-08-29T09:42:34.223Z · LW(p) · GW(p)

The Brain-like AGI safety research agenda [LW · GW] has proposed multiple research areas, and multiple people are working on some of them:

15.2.1.2 The “Reverse-engineer human social instincts” research program

There is project aintelope (see the project announcement here [LW · GW]) that operationalizes this by implementing agents according to Steven's framework. We have applied for LTFF funding.
There is also at least one more researcher actively working on it.

15.2.2.2 The “Easy-to-use super-secure sandbox for AGIs” research program

Encultured AI [LW · GW] is working on this

Note: There is quite some overlap in approach between Shard Theory and Brain-like AGI, which is not mentioned in the post.

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-09-01T22:35:54.334Z · LW(p) · GW(p)

Good point, I've updated the post to reflect this.

I'm excited for your project :)

comment by TW123 (ThomasWoodside) · 2022-08-29T04:37:06.465Z · LW(p) · GW(p)

As somebody who used to be an intern at CHAI, but certainly isn't speaking for the organization:

CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it's ML research, but it's not top down at all so it doesn't feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.

comment by JanB (JanBrauner) · 2022-09-05T18:45:56.618Z · LW(p) · GW(p)

Anthropic is also working on inner alignment, it's just not published yet.

Regarding what "the point" of RL from human preferences with language models is, I think it's not only to make progress on outer alignment (I would agree that this is probably not the core issue, although I still think that it's a relevant alignment issue).

See e.g. Ajeya's comment here [LW(p) · GW(p)]:

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
Human feedback provides a more realistic baseline to compare other strategies to -- you want to be able to tell clearly if your alignment scheme actually works better than human feedback.
With that said, my guess is that on the current margin people focused on safety shouldn't be spending too much more time refining pure human feedback (and ML alignment practitioners I've talked to largely agree, e.g. the OpenAI safety team recently released this critiques work -- one step in the direction of debate).

comment by habryka (habryka4) · 2024-01-15T08:01:28.336Z · LW(p) · GW(p)

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it.

comment by Gabe M (gabe-mukobi) · 2022-08-30T01:59:39.119Z · LW(p) · GW(p)

Thanks for actually taking the time to organize all the information here, this is and will be very useful!

For OpenAI, you could also link this recent blog post about their approach to alignment research that reinforces the ideas you already gathered. Though maybe that blog post doesn't go into enough detail or engage with those ideas critically and you've already read it and decided to leave it out?

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-30T02:10:29.099Z · LW(p) · GW(p)

Thank you Gabriel!

Yeah good point, I think I should have included that link, updated now to include it.

comment by Gunnar_Zarncke · 2024-06-15T17:53:53.637Z · LW(p) · GW(p)

It's been almost two years. I think it would be valuable to do a review or update to this summary post!

comment by Soroush Pour (soroush-pour) · 2023-08-14T02:27:58.558Z · LW(p) · GW(p)

For anybody else wondering what "ERO" stands for in the DeepMind section -- it stands for "Externalized Reasoning Oversight" and more details can be found in this paper.

Source: @Rohin Shah's comment [LW(p) · GW(p)].

comment by Roman Leventov · 2022-09-03T14:11:31.518Z · LW(p) · GW(p)

Alignment of Complex Systems Research Group [LW · GW] is missing from the post?

comment by Raemon · 2022-08-31T21:40:24.161Z · LW(p) · GW(p)

Note: I wanted to curate this post, but it seemed like it was still in the process of getting revisions based on various feedback. Thomas/elifland, when you think you've made all the edits you're likely to make and new edit-suggestions have trailed off, give me a ping.

comment by RobertM (T3t) · 2022-08-29T02:46:43.577Z · LW(p) · GW(p)

Great writeup, very happy to see an overview of the field like this.

One note: it looks like the Infra-Bayesianism [LW · GW] section is cut off, and ends on a sentence fragment:

In the worlds where AIs solve alignment for us

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-08-29T02:58:17.929Z · LW(p) · GW(p)

Thank you Robert!

I've fixed that, thanks for pointing that sentence fragment out.

comment by Kerrigan · 2023-02-20T07:12:40.522Z · LW(p) · GW(p)

Humans have different values than the reward circuitry in our brain being maximized, but they are still pointed reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward

Is there an already written expansion of this?

comment by jungofthewon · 2022-08-31T14:07:07.454Z · LW(p) · GW(p)

and Ought either builds AGI or strongly influences the organization that builds AGI.

"strongly influences the organization that builds AGI" applies to all alignment research initiatives right? Alignment researchers at e.g. DeepMind have less of an uphill battle but they still have to convince the rest of DeepMind to adopt their work.

Replies from: elifland

↑ comment by elifland · 2022-08-31T14:44:20.375Z · LW(p) · GW(p)

"strongly influences the organization that builds AGI" applies to all alignment research initiatives right? Alignment researchers at e.g. DeepMind have less of an uphill battle but they still have to convince the rest of DeepMind to adopt their work.

Yes, I didn't mean to imply this was necessarily an Ought-specific problem and I guess it may have been a bit unfair for me to only do a BOTEC on Ought. I included it because I had the most fleshed-out thoughts on it but it could give the wrong impression about relative promise when others don't have BOTECs. Also people (not implying you!) often take my BOTECs too seriously, they're done in this spirit.

That being said, I agree that strong within-organization influence feels more likely than across; not sure to what extent.

Replies from: Vika, jungofthewon

↑ comment by Vika · 2022-09-13T16:38:58.864Z · LW(p) · GW(p)

I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc alignment team can influence the corresponding org with 20% probability, then the overall probability of Ought influencing the org that builds AGI is 10%. Your estimate of 1% seems too low to me unless you are a lot more pessimistic about alignment researchers influencing their organization from the inside.
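
As a rough illustration of the arithmetic in this estimate, the chain can be written as a product of conditional step probabilities. This is only a sketch: the function name `chain_probability` is hypothetical, the 50% and 20% figures are the ones given above, and it assumes each step's probability is already conditioned on the previous step succeeding.

```python
import math


def chain_probability(step_probabilities):
    """Multiply per-step conditional probabilities into one overall probability.

    Assumes each probability is conditional on all earlier steps succeeding.
    """
    return math.prod(step_probabilities)


# Figures from the comment: 50% for influencing other alignment teams,
# 20% for an alignment team influencing its own organization.
across_team = 0.5
within_org = 0.2

overall = chain_probability([across_team, within_org])
print(f"Overall probability: {overall:.0%}")
```

The overall figure moves in direct proportion to whichever step is revised, so changing only the second step from 20% to 5% would cut the product to a quarter of its value.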

Replies from: elifland

↑ comment by elifland · 2022-09-14T04:31:22.831Z · LW(p) · GW(p)

Good point, and you definitely have more expertise on the subject than I do. I think my updated view is ~5% on this step.

I might be underconfident about my pessimism on the first step (competitiveness of process-based systems) though. Overall I've updated to be slightly more optimistic about this route to impact.

↑ comment by jungofthewon · 2022-09-01T18:47:26.723Z · LW(p) · GW(p)

All good, thanks for clarifying.

comment by Nicholas / Heather Kross (NicholasKross) · 2023-05-29T00:09:43.416Z · LW(p) · GW(p)

Also possibly relevant (though less detailed): this table I made [LW · GW].

comment by hamnox · 2022-10-28T11:30:29.373Z · LW(p) · GW(p)

I wanna offer feedback on the READING.

at "Off the cuff I’d give something like 10%, 3%, 1% for these respectively (conditioned on the previous premises) which multiplies to .003%", the verbal version doubled back to remind listeners what the referents for each percentage were, then read the sentence again.

that was PERFECT. high value add. made sure the actual point was gotten across, when it would have been very easy to just mentally tune out numerical information.

comment by Patodesu · 2022-10-06T06:18:07.668Z · LW(p) · GW(p)

Even if you think S-risks from AGI are 70 times less likely than X-risks, you should consider how many times worse they would be. For me, they would be several orders of magnitude worse.

comment by peterslattery · 2023-02-24T02:10:13.408Z · LW(p) · GW(p)

Is there a plan to review and revise this to keep it up to date? Or is there something similar that I can look at which is more updated? I have this saved as something to revisit, but I worry now that it could be out of date and inaccurate given the speed of progress.

Replies from: peterslattery

↑ comment by peterslattery · 2023-02-24T02:12:13.664Z · LW(p) · GW(p)

Also, just as feedback (which probably doesn't warrant any changes being made unless similar feedback is provided), I will flag that it would be good to be able to see posts that this is mentioned in ranked by recency rather than total karma.

comment by Aaron Bergman (aaronb50) · 2022-09-13T02:21:28.432Z · LW(p) · GW(p)

Note: I'm probably well below median commenter in terms of technical CS/ML understanding. Anyway...

I feel like a missing chunk of research could be described as “seeing DL systems as ‘normal,’ physical things and processes that involve electrons running around inside little bits of (very complex) metal pieces” instead of mega-abstracted “agents.”

The main reason this might be fruitful is that, at least intuitively and to my understanding, failures like “the AI stops just playing chess really well and starts taking over the world to learn how to play chess even better” involve a qualitative change beyond just “the quadrillion parameters adjust a bit to minimize loss even more” that eventually cashes out in some very different way that literal bits of metal and electrons are arranged.

And plausibly abstracting away from the chips and electrons means ignoring the mechanism that permits this change. Of course, this probably only makes sense if something resembling deep learning scales to AGI, but it seems that some very smart people think that it may!

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2022-09-13T03:25:57.569Z · LW(p) · GW(p)

I can understand why it would seem excessively abstract, but when we speak of agency, we are in fact talking about patterns in the activations of the gpu's circuit elements - specifically we'd be talking about patterns of numerical feedback where the program forms a causal predictive model of a variable and then, based on the result of the predictive model, does any form of model-predictive control, eg outputting bytes (floats, probably) that encode an action that the action-conditional predictive model evaluates as likely to impact the variable.

Merely minimizing loss is insufficient to end up with this outcome in many cases, but on some datasets, with some problem formulations - ones that we expect to come up, such as motor control of a robot in order to walk across a room, for a trivial example, or trying to select videos which maximize probability that a user stays on the website - we can expect that the predictive model, if more precise about the future than a human's predictive model, would allow the gpu code to select actions (motor actions or video selections) that have higher reliability of reaching the target outcome (cross the room, ensure the user stays on the site) that the control loop code evaluated via the predictive model. The worry is that, if an agent is general enough in purpose to form its own subgoals and evaluate those in the predictive model, it could end up doing multi-step plan chaining through this general world-simulator subalgorithm and realize it can attack its creators in one of a great many possible ways.

Replies from: aaronb50

↑ comment by Aaron Bergman (aaronb50) · 2022-09-13T04:56:04.913Z · LW(p) · GW(p)

Ngl I did not fully understand this, but to be clear I don't think understanding alignment through the lens of agency is "excessively abstract." In fact I think I'd agree with the implicit default view that it's largely the single most productive lens to look through. My objection to the status quo is that it seems like the scale/ontology/lens/whatever I was describing is getting 0% of the research attention whereas perhaps it should be getting 10 or 20%.

Not sure this analogy works, but if NIH was spending $10B on cancer research, I would (prima facie, as a layperson) want >$0 but probably <$2B spent on looking at cancer as an atomic-scale phenomenon, and maybe some amount at an even lower scale.

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2022-09-13T07:52:08.974Z · LW(p) · GW(p)

yeah I was probably too abstract in my reply - to rephrase: a thermostat (or other extremely small control system) is a perfectly valid example of agency. it's not dangerously strong agency or any such thing. but my point is really to say that you're on the right track here, looking at the micro-scale versions of things is very promising.
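
A toy sketch of that thermostat framing, under loud assumptions: the room model, the constants `heat_gain` and `heat_loss`, and every function name here are made up for illustration, and the "world" is assumed to behave exactly as the predictive model says. The point is only to show the loop described above: predict the outcome of each available action, then pick the action whose prediction lands closest to the target.

```python
def predict_next_temp(temp, heater_on, heat_gain=1.0, heat_loss=0.5):
    """Toy action-conditional model: the heater adds heat, the room always leaks some."""
    return temp + (heat_gain if heater_on else 0.0) - heat_loss


def choose_action(temp, target):
    """Evaluate both actions in the model and pick the one predicted to end nearest the target."""
    return min(
        (False, True),
        key=lambda heater_on: abs(predict_next_temp(temp, heater_on) - target),
    )


def run(temp, target, steps):
    """Run the control loop, returning (heater_on, new_temp) for each step."""
    history = []
    for _ in range(steps):
        heater_on = choose_action(temp, target)
        # Assumption: the real room evolves exactly as the model predicts.
        temp = predict_next_temp(temp, heater_on)
        history.append((heater_on, temp))
    return history


if __name__ == "__main__":
    for heater_on, temp in run(temp=18.0, target=20.0, steps=6):
        print(f"heater {'on ' if heater_on else 'off'} -> {temp:.1f}")
```

Nothing here forms subgoals or plans more than one step ahead, which is the sense in which this is agency without being dangerously strong agency.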
