Twenty-three AI alignment research project definitions
post by rmoehn · 2020-02-03T22:21:49.089Z
I came up with these research project definitions when I read the iterated amplification sequence. Last year I put five of them up for voting (see Which of these five AI alignment research projects ideas are no good?) and chose no. 23 to work on (see IDA with RL and overseer failures). But I didn't think of publishing all of them until JJ Hepburn gave me the idea that they might be useful to others.
The definitions follow the format recommended in The Craft of Research: ‘I'm studying X, because I want to find out Y, so that I (and you) can better understand Z.’ The quality of language and content varies.
An unaligned benchmark
----------------------
1
I'm studying mild optimization,
because I want to understand what its problems and limitations
are,
because I want to decide whether or not to work on one of
them,
because well-functioning mild optimization appears to
be close to what humans and organisations safely do
today, so basing an AI on it might have useful
outcomes.
- This is more a study project than a research project.
2
I'm studying how image classifiers can recognize new types of
images,
because I want to find out how to detect out-of-distribution
inputs,
in order to help my reader understand how to make ML
systems that don't confidently make wrong predictions.
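
To make the notion of detecting out-of-distribution inputs concrete, here is a minimal sketch of one simple and well-known baseline: flag an input when the classifier's highest softmax probability falls below a threshold. This is only an illustration, not the method the project would necessarily study. The function names, the example logits and the threshold of 0.7 are assumptions made up for this sketch, and the baseline itself is known to be weak because networks can be confidently wrong on unfamiliar inputs.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_out_of_distribution(logits, threshold=0.7):
    """Flag the input when the top class probability is below the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return max(softmax(logits)) < threshold


# Hypothetical logits for a familiar image and for an unfamiliar one.
confident_logits = [6.0, 0.5, 0.2]
flat_logits = [1.1, 1.0, 0.9]

print(is_out_of_distribution(confident_logits))
print(is_out_of_distribution(flat_logits))
```

The interesting research question starts where this sketch stops: what to do when the classifier assigns a high probability to a wrong class for an input unlike anything in its training data.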
Approval-directed agents/bootstrapping
--------------------------------------
3
I'm studying Bayesian machine learning,
because I want to understand how to make ML systems that
notice when they are confused,
in order to help my reader understand how to make ML
systems that will ask the overseer for input when doing
otherwise would lead to failure.
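
One way to picture a system that 'notices when it is confused' is the sketch below. It assumes that uncertainty is approximated by the disagreement between several independently trained models (an ensemble), which is only a rough stand-in for proper Bayesian inference. The function name `predict_or_ask`, the `max_disagreement` value and the example predictions are hypothetical.

```python
from statistics import mean, pstdev


def predict_or_ask(member_predictions, max_disagreement=0.1):
    """Act on the ensemble mean, or defer to the overseer.

    member_predictions: one scalar prediction per ensemble member
    for the same input. If the members disagree by more than
    max_disagreement (population standard deviation), the system
    treats itself as confused and asks for input.
    """
    spread = pstdev(member_predictions)
    if spread > max_disagreement:
        return ("ask_overseer", None)
    return ("act", mean(member_predictions))


# Members agree: the system acts on their mean prediction.
print(predict_or_ask([0.81, 0.79, 0.80]))
# Members disagree strongly: the system asks the overseer.
print(predict_or_ask([0.1, 0.9, 0.5]))
```

Whether such disagreement is high in exactly the cases where acting alone would lead to failure is the open question; the sketch only shows the control flow.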
4
I'm studying possible structures of approval-directed agents,
because I want to understand how much human thought and input
they would require,
in order to help my reader understand whether
approval-directed agents are feasible.
Humans Consulting HCH
---------------------
5
I'm studying the articles linked from Humans Consulting HCH,
because I want to understand the section ‘Hope’ from Humans
Consulting HCH,
in order to be able to think about the tension between the
system being capable and reflecting the human's judgment.
Corrigibility (Christiano 2017)
-------------------------------
6
I'm studying Omohundro's preferences-about-your-utility-function
case,
because I want to implement it
in order to help my reader understand how it can be
implemented and whether it is unstable.
7
I'm studying the change in prediction failures as a predictor
becomes stronger,
because I want to find out whether the failures not only
become fewer, but also harder to detect over time,
in order to help my reader understand whether systems
built on such predictors will remain corrigible as they
become stronger.
Iterated Distillation and Amplification
---------------------------------------
8
I'm studying the capability and safety-relevant properties of
imitation learning,
because I want to find out whether it can produce aligned
agents,
in order to help my reader understand how to even get to
the base case of iterated distillation and amplification.
Benign model-free RL
--------------------
9
I'm studying the performance of benign model-free RL,
because I want to know whether the claim that it will achieve
state-of-the-art performance is true,
in order to help my reader understand how useful benign
model-free RL will be.
10
I'm studying reward learning and robustness,
because I want to know whether they can achieve competitive
agents without malign behaviour,
in order to help my reader understand whether benign
model-free RL will be possible.
11
I'm implementing (parts of) benign model-free RL,
because I want to know what works and doesn't work in
practice,
in order to help my reader understand which parts of the
scheme need further conceptual research.
Supervising strong learners by amplifying weak experts
------------------------------------------------------
12
I'm studying ways to improve the sample efficiency of a supervised
learner,
because I want to know how to reduce the number of calls to H
in CSASupAmp,
in order to help my reader understand how we can adapt
that proof-of-concept for solving real-world tasks that
require even more training data.
13
I'm studying the effects of how CSASupAmp samples questions,
because I want to know how to sample questions in a way that
improves the scheme's learning performance,
in order to help my reader understand how we can adapt
that proof-of-concept for solving real-world tasks that
require even more training data.
Machine Learning Projects for Iterated Distillation and Amplification
--------------------------------------------------------------------
14
❇ Any of the projects there. At first glance, Adaptive Computation
is the most interesting, but perhaps also requires the most study.
I would ask Owain what people are already working on or what has
received the most interest, then work on the least crowded one.
Directions and desiderata for AI alignment
------------------------------------------
15
I'm studying integrating models of known human heuristics and
biases into IRL systems,
because I want to improve the performance of IRL in a domain
where it is hindered by the discrepancies between the existing
error models and actual human irrationality,
in order to help my readers understand how to get IRL
systems to infer true human values despite Stuart
Armstrong's impossibility result.
The reward engineering problem
------------------------------
16
I'm experimenting with semi-supervised reinforcement learning,
because I want to find out how humans can supervise machine
learning with reasonably small effort,
in order to help my reader understand how to avoid
optimizing proxy objectives that we have to use because
the sample hunger of current ML algorithms is so great.
17
I'm studying the use of a discriminator in imitation learning,
because I want to find out how to help humans produce
demonstrations that the agent can imitate,
in order to help my reader understand how we might use
imitation learning to solve the reward engineering
problem.
Capability amplification (Christiano 2016)
------------------------------------------
18
I'm studying cognitive tasks and how to decompose them into ever
simpler steps,
because I want to find algorithms for capability amplification,
in order to help my reader understand the nature of
obstacles to capability amplification.
- What is the obstacle? How exactly is it an obstacle? Make it simple.
Learning with catastrophes
--------------------------
19
❇ I'm studying how the performance of an adversary develops
during adversarial training,
because I want to find out whether the adversary gets worse as
the primary agent becomes more robust, or whether the
adversary gets traction at all when the primary agent is
already quite robust,
in order to help my reader understand how confident we
could be that a red team finds all relevant catastrophic
situations.
Thoughts on reward engineering
------------------------------
20
I'm studying the effects of importance sampling on the behaviour
that an RL agent learns,
because I want to find out whether it can lead to undesirable
outcomes,
in order to help my reader understand whether importance
sampling can solve the problem of widely varying rewards
in reward engineering.
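
As a worked illustration of the underlying arithmetic, the sketch below assumes one reading of the idea: rare situations with extreme rewards are sampled more often during training, and their rewards are scaled by the ratio of the real to the training probability, so that the expected reward stays the same while the range of rewards shrinks. The situation names, probabilities and rewards are made up for this example. The sketch says nothing about what behaviour an RL agent would actually learn under the changed sampling, which is the question of the project.

```python
from fractions import Fraction

# Hypothetical situations: (probability under the real distribution, reward).
situations = {
    "routine": (Fraction(999, 1000), Fraction(1)),
    "rare_high_stakes": (Fraction(1, 1000), Fraction(-1000)),
}

# Hypothetical training distribution that oversamples the rare situation.
proposal = {"routine": Fraction(1, 2), "rare_high_stakes": Fraction(1, 2)}


def expected_reward(situations):
    """Expected reward under the real distribution."""
    return sum(p * r for p, r in situations.values())


def reweighted_rewards(situations, proposal):
    """Scale each reward by p/q, the importance weight."""
    return {name: r * p / proposal[name] for name, (p, r) in situations.items()}


def expected_under_proposal(situations, proposal):
    """Expected reweighted reward when sampling from the proposal."""
    scaled = reweighted_rewards(situations, proposal)
    return sum(proposal[name] * scaled[name] for name in situations)


print(expected_reward(situations))
print(expected_under_proposal(situations, proposal))
print(reweighted_rewards(situations, proposal))
```

In this example the original rewards span 1 to -1000, while the reweighted rewards span roughly 2 to -2, with the same expectation of -1/1000.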
21
I'm studying the effects of an inconsistent comparison function on
optimizing with comparisons,
because I want to know whether it prevents the two agents from
converging on a desirable equilibrium quickly enough,
in order to help my reader understand whether optimizing
with comparisons can solve the problem of inconsistency
and unreliability in reward engineering.
Techniques for optimizing worst-case performance
------------------------------------------------
22
❇ I'm studying transparency in the service of adversarial training
(using transparency to ease finding adversarial examples, or to
detect adversarial success earlier/more often),
because I want to make the adversary ten times more effective,
in order to help my reader understand how to build ML
systems that never fail catastrophically.
Reliability amplification
--------------------------
23
I'm studying the impact of overseer failure on RL-based IDA,
because I want to know under what conditions the amplification
increases or decreases the failure rate,
in order to help my reader understand whether we need to
combine capability amplification with explicit reliability
amplification in all cases.
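
The question of when amplification increases or decreases the failure rate can be pictured with a toy calculation. It rests on two strong assumptions that are not part of the project definition: failures of subcalls are independent, and the amplified agent fails whenever any subcall fails. The majority vote of three copies stands in for explicit reliability amplification. The function names and numbers are hypothetical.

```python
def failure_rate_after_amplification(eps, calls):
    """Failure probability if any of `calls` independent subcalls failing is fatal."""
    return 1 - (1 - eps) ** calls


def failure_rate_after_majority_vote(eps):
    """Failure probability of a majority vote among three independent copies.

    The vote fails when exactly two copies fail or all three fail.
    """
    return 3 * eps**2 * (1 - eps) + eps**3


eps = 0.01
print(failure_rate_after_amplification(eps, calls=10))  # larger than eps
print(failure_rate_after_majority_vote(eps))  # smaller than eps
```

Under these assumptions, plain amplification makes failures more likely, and voting makes them less likely as long as the individual failure rate is below one half. An RL-based implementation with a fallible overseer need not satisfy either assumption, which is why the project asks under what conditions the rate goes up or down.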
Security amplification
----------------------
24
I'm studying adversarial examples for meta-execution with ML-based
sub-agents,
because I want to find a case where security amplification by
meta-execution fails to amplify security,
in order to help my reader understand what obstructions to
security amplification there are.
Meta-execution
--------------
No project idea from the article directly. Ought has put forth
some open issues (cf.
https://docs.google.com/document/d/1xzFuDD1xiG-oe750MYrP9PEgwEXtxbfSBnHrgrCRnhY/edit#),
but that document might be outdated, and a project based on it
would be too closely tied to Ought.