My AI Alignment Research Agenda and Threat Model, right now (May 2023)
post by Nicholas / Heather Kross (NicholasKross) · 2023-05-28
This is a link post for https://www.thinkingmuchbetter.com/nickai/conceptual/myagenda-may2023.html
Contents
- TLDR
- Threat Model
- The Two Subproblems
  - Steering Cognition
  - Determining/Loading Values
- Theory of Change
- What Success Could Look Like
- What I'm Personally Learning/Researching
  - Learning
  - Researching
- My Current Constraints
- The AI Landscape
TLDR
Fairly short timelines, mildly fast takeoffs, and medium-high uncertainties. --> Looking for abstractions to help cognition-steering and value-loading. --> Grasping at / reacting to related/scary/FOMO lines of research.
Threat Model
An AI system could be built that's far smarter than any human or small group of humans. This AI system could use its intelligence to defeat any non-motivation-directing safeguards, and gain control of the world and the future of humanity and other sentient life. Based on the orthogonality thesis, this amount of power would probably not, by default, be directed towards the best interests of humanity and other sentient life. Based on the idea of instrumental convergence, such an AI would destroy everything we value in its quest to fulfill its (dumb-by-default) original goal. This AI may require new insights to build, or could arise by "scaling up" existing ML architectures. This AI may "self-improve" its architecture, or it could get smarter through prosaic "hack more cloud computing power" techniques. In either case, it could start from a position of low capabilities and end up as the most powerful entity on Earth. This all could start within as few as 1.5 years from now, and will probably happen within 10 years, barring nuclear or other catastrophe.
The simplest solution to the above problem would be "don't build superhuman AGI, at least for the near future". However, superhuman AGI is likely to be built, on purpose or by accident, by any of a handful of groups with large amounts of computational resources and talented researchers. These groups are generally not monolithic, and contain leaders and employees who disagree (internally, with other orgs, and/or with me) about the best approach to AI alignment. (See the section "The AI Landscape" below.)
Imagine if any of these groups got a box, today, that said "Input your alignment solution by USB drive, push button to get a superhuman AGI that runs on that, box expires in 1 week". According to my threat model, humanity is unlikely to survive longer than 1 week in this scenario. This is despite the wildly varying (often quite good!) alignment motivations and security mindsets of these groups. In my view, this is (mainly) because none of these groups has an adequate pre-prepared response to the below "Two Subproblems".
The Two Subproblems
I forgot where this advice came from, but I followed the tip to "take a day or so to think through AI alignment, for yourself, from scratch". I was definitely biased by my previous reading on AI (especially by Yudkowsky and Wentworth), but I basically came away with two subproblems:
Steering Cognition
How do we direct the thought-patterns, goals, and development of an AI system? This is basically the rocket alignment analogy, specifically the "Newtonian mechanics"/"basic physics" part.
For many, the core difficulty and most important part of AI alignment is being able to steer a mind's cognition at all. If we get this right, we set a lower bound on the badness of AGI X-risks (while also opening the door to S-risks if we solve this subproblem and neglect the one below, but that's not the immediate focus).
Determining/Loading Values
If we could aim a superintelligent AI system at anything, what should we aim it at, and how? In the "rocket alignment" analogy, this is basically the flight plan (or the method for creating the flight plan) to get to the moon.
At first, this seems to naturally decompose into "determine values" and "encode values into the AGI". However, I consider this to be one subproblem, because an AGI could most likely carry out either or both of those steps itself. But before those steps comes something like "figure out what [a pointer to [the best values for an AGI]] would look like, in enough detail to point an AGI at it and expect things to go well from there." Due to fragility-of-value, I don't expect a real-life satisfactory solution to AI alignment to involve a human (that is, a neither-augmented-nor-simulated human) writing down the full Sheet Of Human Values and then plugging it into an AGI. However, we could end up with, say, a reliable theory of / mathematical abstraction for our values, which an AGI could then "fill in the blanks" of through observation.
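To make that last idea concrete, here is a toy, hypothetical sketch (my own illustration, not a proposal from this post or any cited research): treat the "reliable theory of values" as a fixed feature map `phi` over outcomes, and let observation fill in the blank weights `w` by fitting a Bradley-Terry model to observed pairwise preferences.

```python
import math

# Hypothetical "value template" R(x) = w . phi(x): the feature map phi is the
# assumed theory of values; the weights w are the "blanks" filled in from
# observed preferences. All names here are illustrative, not from the post.

def phi(x):
    # Toy feature map over outcomes x = (comfort, fairness).
    return [x[0], x[1]]

def fit_weights(prefs, lr=0.5, steps=2000):
    """prefs: list of (a, b) pairs meaning 'outcome a was preferred over b'.

    Bradley-Terry model: P(a preferred over b) = sigmoid(w . (phi(a) - phi(b))).
    Fit w by gradient ascent on the log-likelihood of the observed preferences.
    """
    w = [0.0] * len(phi(prefs[0][0]))
    for _ in range(steps):
        for a, b in prefs:
            d = [pa - pb for pa, pb in zip(phi(a), phi(b))]
            p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, d))))
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * d[i]  # gradient of log P(a > b)
    return w
```

All the hard parts of value-loading are hidden in choosing `phi`; the only point of the sketch is that "fill in the blanks through observation" becomes a well-defined learning problem once a template exists.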
Theory of Change
If I research the above two subproblems (and/or the items in the section "What I'm Personally Learning/Researching" below), then one or both of the above subproblems will become more-solved. This could be a full end-to-end solution, a theoretical-but-proved plan, a paradigm that can be developed further, or contributions to the work of others. I am eager to help and fairly-agnostic about how.
Furthermore, I think that even if my above "Threat Model" is wrong in one or more key ways, the research I want to do would still be helpful. For example, if AI takeoff speeds were slower, I would still want research-similar-to-mine to be developed and refined quickly. If neural networks were usurped by a new AI paradigm, I would still think research-similar-to-mine could help align the new architectures.
And, of course, if you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
What Success Could Look Like
- Create a manual/framework for building a friendly AGI.
- Create a manual/framework for building an AGI that can be pointed at anything.
- Help with either of the above two.
- Some other result that prevents AGI-caused extinction of humanity.
What I'm Personally Learning/Researching
Learning
- John Wentworth's work on abstractions
- MIRI's work on agent foundations
- QACI's work on pointing at real-world values
- Getting a broad knowledge base of mathematics, modulo my existing knowledge and short timelines.
- Finding out what other nuggets of interesting research would be helpful for my goals, from cyborgism/human-researcher-intelligence-amplification to governance/large-training-moratorium to moral uncertainty to theories of how sentience and/or consciousness works.
Researching
- Finding or developing mathematical structures that are actually useful in aligning smart AI systems ("What's the type signature of an agent?").
- Contributing to abstraction theory to get it to a point where "pointing at things in the world" is doable.
- Applying abstraction theory to understanding and changing the contents of artificial minds. This includes the ability to e.g. trace a "thought" through a neural network, from input to output, and understand its progression.
- Applying abstraction theory to determining human values.
- Contributing to, giving feedback on, extending, and applying the above areas I'm learning about, especially cutting across different organizations. (For reference: I am currently in closest touch with people at Orthogonal, Conjecture, and OpenAI. I've also talked briefly with various alignment researchers at the EA Global SF 2022 conference.)
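As a deliberately toy illustration of the "type signature of an agent" question, here is one minimal candidate signature; the names and the formalization are my own assumption, not an established answer from the research linked above:

```python
# Hypothetical sketch: one candidate type signature for an agent, as a
# stateful map from observations to actions: step : (State, Obs) -> (Act, State).
from dataclasses import dataclass
from typing import Callable, Generic, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
State = TypeVar("State")

@dataclass
class Agent(Generic[State, Obs, Act]):
    state: State
    policy: Callable[[State, Obs], Tuple[Act, State]]

    def step(self, obs: Obs) -> Act:
        # Apply the policy to the current state and observation,
        # emit an action, and carry the updated state forward.
        act, self.state = self.policy(self.state, obs)
        return act

# Toy instance: a thermostat "agent" whose internal state is its setpoint.
thermostat = Agent(
    state=20.0,
    policy=lambda setpoint, temp: ("heat" if temp < setpoint else "off", setpoint),
)
```

The interesting research question is precisely what this toy signature leaves out: goals, world-models, self-modification, and embeddedness in the environment it acts on.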
And again: If you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
My Current Constraints
These are described in more depth here. They are (in no particular order):
- Funding
- Mathematical intuition/ability/"talent"
- Mathematical concepts (getting even more "broad technical background")
- Mathematical notation/formalism knowledge
- My working memory
- My mental stamina
The most important constraint right now (i.e. the only real bottleneck at this time) is funding. With enough funding, I could work full-time on AI alignment, which would include solving or mitigating the other constraints.
Note that I already have a Bachelor's degree in computer science, a minor in mathematics, and some other AI-related background (see here).
The AI Landscape
Here is how the rest of the AI alignment/safety landscape looks, to me, as of this writing:
Table 1: An Informal Assessment of Potentially-Strategically-Relevant AI and Alignment Organizations, as of late May 2023. (Note: This table may not be up-to-date.)
Organization | One-Sentence Summary | Are they likely to cause AGI doom, including by accident? | Do they care about AGI risk? (This includes investor pressure and disagreements with my risk model!)
---|---|---|---
OpenAI | The most cutting-edge AGI research lab, structured as a sort of nonprofit/for-profit hybrid company, with heavy investment from Microsoft. | worrying | mostly |
Microsoft | Tech giant, with large investments of money and cloud computing towards OpenAI. | worrying | maybe? |
Google DeepMind | The other most cutting-edge AGI research lab, the AI arm of the search-engine giant, recently merged with Google Brain (the original developers of the popular framework TensorFlow). | worrying | mostly
Google/Alphabet | Tech giant that owns DeepMind. | worrying | maybe? |
X.AI | Elon Musk's new AI research company. | worrying | Elon Musk |
Meta/Facebook AI | The AI arm of the social networking giant, and the original developers of popular framework PyTorch. | worrying | no(?) |
Conjecture | EleutherAI alumni + computational resources and funding + security mindset. | More than "kinda", but less than "worrying". | yes |
MIRI (Machine Intelligence Research Institute) | The original AI alignment nonprofit, founded by Eliezer Yudkowsky, focused on formal research and fieldbuilding. | no | yes |
Orthogonal | A new nonprofit built around an idiosyncratic approach, founded by an alumna of the Conjecture-hosted Refine alignment research incubator. | kinda | yes
ARC (Alignment Research Center) | The nonprofit alignment group run by Paul Christiano. | no | yes |
Redwood Research | The nonprofit alignment group where Buck Shlegeris works. | kinda | yes |
Ought | Product-oriented nonprofit working on factored cognition. | not much | maybe
Anthropic | Well-funded AI company focused on building and aligning large language models. | worrying | probably |
The US government/military | The federal government and/or military of the United States of America. | not much, but could | probably |
The Chinese government/military | The national government and/or military of the People's Republic of China. | not much, but could | maybe |
Any large technology company headquartered in China (including Baidu, Alibaba, Tencent, and others) | The most cutting-edge companies in China to have large computational resources. | not much, but could | maybe |
Keen Technologies | AGI company with >$20M in funding, founded by John Carmack. | Never count out Carmack. | no(?) |
Apart Research | Mostly fieldbuilding | no | probably |
SERI MATS | Stanford/Berkeley-run program that supports independent alignment researchers like John Wentworth and Jeffrey Ladish. | no | yes |