My AI Alignment Research Agenda and Threat Model, right now (May 2023)
post by Nicholas / Heather Kross (NicholasKross) · 2023-05-28
This is a link post for https://www.thinkingmuchbetter.com/nickai/conceptual/myagenda-may2023.html
Contents
- TLDR
- Threat Model
- The Two Subproblems
  - Steering Cognition
  - Determining/Loading Values
- Theory of Change
- What Success Could Look Like
- What I'm Personally Learning/Researching
  - Learning
  - Researching
- My Current Constraints
- The AI Landscape
TLDR
Fairly short timelines, mildly fast takeoffs, and medium-high uncertainties. --> Looking for abstractions to help cognition-steering and value-loading. --> Grasping at / reacting to related/scary/FOMO lines of research.
Threat Model
An AI system could be built that's far smarter than any human or small group of humans. This AI system could use its intelligence to defeat any non-motivation-directing safeguards, and gain control of the world and the future of humanity and other sentient life. Based on the orthogonality thesis, this amount of power would probably not, by default, be directed towards the best interests of humanity and other sentient life. Based on the idea of instrumental convergence, such an AI would destroy everything we value in its quest to fulfill its (dumb-by-default) original goal. This AI may require new insights to build, or could arise by "scaling up" existing ML architectures. This AI may "self-improve" its architecture, or it could get smarter through prosaic "hack more cloud computing power" techniques. In either case, it could start from a position of low capabilities and end up as the most powerful entity on Earth. This all could start within as few as 1.5 years from now, and will probably happen within 10 years, barring nuclear or other catastrophe.
The simplest solution to the above problem would be "don't build superhuman AGI, at least for the near future". However, superhuman AGI is likely to be built, on purpose or by accident, by any of a handful of groups with large amounts of computational resources and talented researchers. These groups are generally not monolithic, and contain leaders and employees who disagree (internally, with other orgs, and/or with me) about the best approach to AI alignment. (See the section "The AI Landscape" below.)
Imagine if any of these groups got a box, today, that said "Input your alignment solution by USB drive, push button to get a superhuman AGI that runs on that, box expires in 1 week". According to my threat model, humanity is unlikely to survive longer than 1 week in this scenario. This is despite the wildly varying (often quite good!) alignment motivations and security mindsets of these groups. In my view, this is (mainly) because none of these groups has an adequate pre-prepared response to the below "Two Subproblems".
The Two Subproblems
I forgot where this advice came from, but I followed the tip to "take a day or so to think through AI alignment, for yourself, from scratch". I was definitely biased by my previous reading on AI (especially by Yudkowsky and Wentworth), but I basically came away with two subproblems:
Steering Cognition
How do we direct the thought-patterns, goals, and development of an AI system? This is basically the rocket alignment analogy, specifically the "Newtonian mechanics"/"basic physics" part.
For many, the core difficulty and most important part of AI alignment is being able to steer a mind's cognition at all. If we get this right, we set a lower bound on the badness of AGI X-risks (while also opening the door to S-risks if we solve this subproblem and neglect the one below, but that's not the immediate focus).
Determining/Loading Values
If we could aim a superintelligent AI system at anything, what should we aim it at, and how? In the "rocket alignment" analogy, this is basically the flight plan (or the method for creating the flight plan) to get to the moon.
At first, this seems to naturally decompose into "determine values" and "encode values into the AGI". However, I consider this to be one subproblem, because an AGI could most likely carry out either or both of those steps itself. But before those steps comes something like "figure out what [a pointer to [the best values for an AGI]] would look like, in enough detail to point an AGI at it and expect things to go well from there." Due to fragility-of-value, I don't expect a real-life satisfactory solution to AI alignment to involve a human (that is, a neither-augmented-nor-simulated human) writing down the full Sheet Of Human Values and then plugging it into an AGI. However, we could end up with, say, a reliable theory of / mathematical abstraction for our values, which an AGI could then "fill in the blanks" of through observation.
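To make that last idea concrete, here is a toy, hypothetical sketch (my own illustration, not a proposal from this post or any cited research): treat the "reliable theory of values" as a fixed feature map `phi` over outcomes, and let observation fill in the blank weights `w` by fitting a Bradley-Terry model to observed pairwise preferences.

```python
import math

# Hypothetical "value template" R(x) = w . phi(x): the feature map phi is the
# assumed theory of values; the weights w are the "blanks" filled in from
# observed preferences. All names here are illustrative, not from the post.

def phi(x):
    # Toy feature map over outcomes x = (comfort, fairness).
    return [x[0], x[1]]

def fit_weights(prefs, lr=0.5, steps=2000):
    """prefs: list of (a, b) pairs meaning 'outcome a was preferred over b'.

    Bradley-Terry model: P(a preferred over b) = sigmoid(w . (phi(a) - phi(b))).
    Fit w by gradient ascent on the log-likelihood of the observed preferences.
    """
    w = [0.0] * len(phi(prefs[0][0]))
    for _ in range(steps):
        for a, b in prefs:
            d = [pa - pb for pa, pb in zip(phi(a), phi(b))]
            p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, d))))
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * d[i]  # gradient of log P(a > b)
    return w
```

All the hard parts of value-loading are hidden in choosing `phi`; the only point of the sketch is that "fill in the blanks through observation" becomes a well-defined learning problem once a template exists.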
Theory of Change
If I research the above two subproblems (and/or the items in the section "What I'm Personally Learning/Researching" below), then one or both of the above subproblems will become more-solved. This could be a full end-to-end solution, a theoretical-but-proved plan, a paradigm that can be developed further, or contributions to the work of others. I am eager to help and fairly-agnostic about how.
Furthermore, I think that even if my above "Threat Model" is wrong in one or more key ways, the research I want to do would still be helpful. For example, if AI takeoff speeds were slower, I would still want research-similar-to-mine to be developed and refined quickly. If neural networks were usurped by a new AI paradigm, I would still think research-similar-to-mine could help align the new architectures.
And, of course, if you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
What Success Could Look Like
- Create a manual/framework for building a friendly AGI.
- Create a manual/framework for building an AGI that can be pointed at anything.
- Help with either of the above two.
- Some other result that prevents AGI-caused extinction of humanity.
What I'm Personally Learning/Researching
Learning
- John Wentworth's work on abstractions
- MIRI's work on agent foundations
- QACI's work on pointing at real-world values
- Getting a broad knowledge base of mathematics, modulo my existing knowledge and short timelines.
- Finding out what other nuggets of interesting research would be helpful for my goals, from cyborgism/human-researcher-intelligence-amplification to governance/large-training-moratorium to moral uncertainty to theories of how sentience and/or consciousness works.
Researching
- Finding or developing mathematical structures that are actually useful in aligning smart AI systems ("What's the type signature of an agent?").
- Contributing to abstraction theory to get it to a point where "pointing at things in the world" is doable.
- Applying abstraction theory to understanding and changing the contents of artificial minds. This includes the ability to e.g. trace a "thought" through a neural network, from input to output, and understand its progression.
- Applying abstraction theory to determining human values.
- Contributing to, giving feedback on, extending, and applying the above areas I'm learning about, especially cutting across different organizations. (For reference: I am currently in closest touch with people at Orthogonal, Conjecture, and OpenAI. I've also talked briefly with various alignment researchers at the EA Global SF 2022 conference.)
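As a deliberately toy illustration of the "type signature of an agent" question, here is one minimal candidate signature; the names and the formalization are my own assumption, not an established answer from the research linked above:

```python
# Hypothetical sketch: one candidate type signature for an agent, as a
# stateful map from observations to actions: step : (State, Obs) -> (Act, State).
from dataclasses import dataclass
from typing import Callable, Generic, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
State = TypeVar("State")

@dataclass
class Agent(Generic[State, Obs, Act]):
    state: State
    policy: Callable[[State, Obs], Tuple[Act, State]]

    def step(self, obs: Obs) -> Act:
        # Apply the policy to the current state and observation,
        # emit an action, and carry the updated state forward.
        act, self.state = self.policy(self.state, obs)
        return act

# Toy instance: a thermostat "agent" whose internal state is its setpoint.
thermostat = Agent(
    state=20.0,
    policy=lambda setpoint, temp: ("heat" if temp < setpoint else "off", setpoint),
)
```

The interesting research question is precisely what this toy signature leaves out: goals, world-models, self-modification, and embeddedness in the environment it acts on.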
And again: If you or somebody you know seems well-equipped to carry out any of this research, please steal my ideas, move the work forward, and disclose the results responsibly.
My Current Constraints
These are described in more depth here. They are (in no particular order):
- Funding
- Mathematical intuition/ability/"talent"
- Mathematical concepts (getting even more "broad technical background")
- Mathematical notation/formalism knowledge
- My working memory
- My mental stamina
The most important constraint right now (i.e. the only real bottleneck at this time) is funding. With enough funding, I could work full-time on AI alignment, which would include solving or mitigating the other constraints.
Note that I already have a Bachelor's degree in computer science, a minor in mathematics, and some other AI-related background (see here).
The AI Landscape
Here is how the rest of the AI alignment/safety landscape looks, to me, as of this writing:
Table 1: An Informal Assessment of Potentially-Strategically-Relevant AI and Alignment Organizations, as of late May 2023. (Note: This table may not be up-to-date.)
Organization | One-Sentence Summary | Are they likely to cause AGI doom, including by accident? | Do they care about AGI risk? (This includes investor pressure and disagreements with my risk model!)
---|---|---|---
OpenAI | The most cutting-edge AGI research lab, structured as a sort of nonprofit/for-profit hybrid company, with heavy investment from Microsoft. | worrying | mostly |
Microsoft | Tech giant, with large investments of money and cloud computing towards OpenAI. | worrying | maybe? |
Google DeepMind | The other most cutting-edge AGI research lab, the AI arm of the search-engine giant, recently merged with Google Brain (the original developers of the popular framework TensorFlow). | worrying | mostly
Google/Alphabet | Tech giant that owns DeepMind. | worrying | maybe? |
X.AI | Elon Musk's new AI research company. | worrying | Elon Musk |
Meta/Facebook AI | The AI arm of the social networking giant, and the original developers of popular framework PyTorch. | worrying | no(?) |
Conjecture | EleutherAI alumni + computational resources and funding + security mindset. | More than "kinda", but less than "worrying". | yes |
MIRI (Machine Intelligence Research Institute) | The original AI alignment nonprofit, founded by Eliezer Yudkowsky, focused on formal research and fieldbuilding. | no | yes |
Orthogonal | A new nonprofit built around an idiosyncratic approach, founded by an alumna of the Conjecture-hosted Refine alignment research incubator. | kinda | yes
ARC (Alignment Research Center) | The nonprofit alignment group run by Paul Christiano. | no | yes |
Redwood Research | The nonprofit alignment group where Buck Shlegeris works. | kinda | yes |
Ought | Product-oriented nonprofit working on factored cognition. | not much | maybe
Anthropic | Well-funded AI company focused on building and aligning large language models. | worrying | probably |
The US government/military | The federal government and/or military of the United States of America. | not much, but could | probably |
The Chinese government/military | The national government and/or military of the People's Republic of China. | not much, but could | maybe |
Any large technology company headquartered in China (including Baidu, Alibaba, Tencent, and others) | The most cutting-edge companies in China to have large computational resources. | not much, but could | maybe |
Keen Technologies | AGI company with >$20M in funding, founded by John Carmack. | Never count out Carmack. | no(?) |
Apart Research | Mostly fieldbuilding | no | probably |
SERI MATS | Stanford/Berkeley-run program that supports independent alignment researchers like John Wentworth and Jeffrey Ladish. | no | yes |