# What's a Decomposable Alignment Topic?

post by elriggs · 2020-08-21T22:57:00.642Z · score: 26 (10 votes) · LW · GW · 9 commentsThis is a question post.

## Contents

Answers 11 Alex Ray 4 steve2152 4 Vanessa Kosoy 3 Vanessa Kosoy 3 Vanessa Kosoy 3 Charlie Steiner 2 Vanessa Kosoy None 9 comments

What's an alignment topic where, if someone decomposed the overall task, a small group of smart people (like here on Lesswrong) could make conceptual progress? By "smart", assume they can notice confusion [? · GW], google, and program.

Be specific. Explain what strategies you would use to explore your topic, and give examples on decomposing tasks.

This is more than just an exercise in factored cognition; there are stakes: I host the key alignment group [LW · GW], and if your answer is convincing enough, then we'll work on it (caveat: some people may leave if your topic isn't their cup of tea, but others may join because it is). Additionally, you can pitch your idea in the key alignment group meeting on (8/25) Tuesday 7pm UTC; just DM me if you'd like to join the next meeting.

## Answers

I work at OpenAI on safety. In the past it seems like theres a gap between what I'd consider to be alignment topics that need to be worked on, and the general consensus for this forum. A good friend poked me to write something for this so here I am.

Topics w/ strategies/breakdown:

- Fine-tuning GPT-2 from human preferences, to solve small scale alignment issues
- Brainstorm small/simple alignment failures: ways that existing generative language models are not aligned with human values
- Design some evaluations or metrics for measuring a specific alignment failure (which lets you measure whether you’ve improved a model or not)
- Gather human feedback data / labels / whatever you think you can try training on
- Try training on your data (there are tutorials on how to use Google Colab to fine-tune GPT-2 with a new dataset)
- Forecast scaling laws: figure out how performance on your evaluation or metric varies with the amount of human input data; compare to how much time it takes to generate each labelled example (be quantitative!)
- Multi-objective reinforcement learning — instead of optimizing a single objective, optimize multiple objectives together (and some of the objectives can be constraints)
- What are ways we can break down existing AI alignment failures in RL-like settings into multi-objective problems, where some of the objectives are safety objectives and some are goal/task objectives
- How can we design safety objectives such that they can transfer across a wide variety of systems, machines, situations, environments, etc?
- How can we measure and evaluate our safety objectives, and what should we expect to observe during training/deployment?
- How can we incentivize individual development and sharing of safety objectives
- How can we augment RL methods to allow transferrable safety objectives between domains (e.g., if using actor critic methods, how to integrate a separate critic for each safety objective)
- What are good benchmark environments or scenarios for multi-objective RL with safety objectives (classic RL environments like Go or Chess aren’t natively well-suited to these topics)
- Forecasting the Economics of AGI (turn ‘fast/slow/big/etc’ into real numbers with units)
- This is more “AI Impacts” style work than you might be asking for, but I think it’s particularly well-suited for clever folks that can look things up on the internet.
- Identify vague terms in AI alignment forecasts, like the “fast” in “fast takeoff”, that can be operationalized
- Come up with units that measure the quantity in question, and procedures for measurements that result in those units
- Try applying traditional economics growth models, such as experience curves, to AI development, and see how well you can get things to fit (much harder to do this for AI than making cars — is a single unit a single model trained? Maybe a single week of a researchers time? Is the cost decreasing in dollars or flops or person-hours or something else? Etc etc)
- Sketch models for systems (here the system is the whole ai field) with feedback loops, and inspect/explore parts of the system which might respond most to different variables (additional attention, new people, dollars, hours, public discourse, philanthropic capital, etc)

Topics not important enough to make it into my first 30 minutes of writing:

- Cross disciplinary integration with other safety fields, what will and won’t work
- Systems safety for organizations building AGI
- Safety acceleration loops — how/where can good safety research make us better and faster at doing safety research
- Cataloguing alignment failures in the wild, and create a taxonomy of them

Anti topics: Things I would have put on here a year ago

- Too late for me to keep writing so saving this for another time I guess

I’m available tomorrow to chat about these w/ the group. Happy to talk then (or later, in replies here) about any of these if folks want me to expand further.

I think it would be fun and productive to "wargame" the emergence of AGI in broader society in some specific scenario---my choice (of course) would be "we reverse-engineer the neocortex". Different people could be different interest-groups / perspectives, e.g. industry researchers, academic researchers, people who have made friends with the new AIs, free-marketers, tech-utopians, people concerned about job losses and inequality, people who think the AIs are conscious and deserve rights, people who think the AIs are definitely not conscious and don't deserve rights (maybe for religious reasons?), militaries, large companies, etc.

I don't know how these "wargame"-type exercises actually work---honestly, I haven't even played D&D :-P Just a thought. I personally have some vague opinions about brain-like AGI development paths and what systems might be like at different stages etc., but when I try to think about how this could play out with all the different actors, it kinda makes my head spin. :-)

The goal of course is to open conversations about what *might* plausibly happen, not to figure out what *will* happen, which is probably impossible.

**Motivation**: Starting the theoretical investigation of dialogic reinforcement learning [LW(p) · GW(p)] (DLRL).

**Topic**: Consider the following setting. is a set of "actions", is a set of "queries", is a set of "annotations". is the set of "worlds" defined as . Here, the semantics of the first factor is "mapping from actions to rewards", the semantics of the second factor is "mapping from queries to {good, bad, ugly}", where "good" means "query can be answered", "bad" means "query cannot be answered", "ugly" means "making this query loses the game". In addition, we are given a fixed mapping (assigning to each query its semantics). is a set of "hypotheses" which is a subset of (i.e. each hypothesis is a belief about the world).

Some hypothesis represents the user's beliefs, but the agent doesn't know which. Instead, it only has a prior . On each round, the agent is allowed to either make an annotated query or take an action from . Taking an action produces a reward and ends the game. Making a query can either (i) produce a number, which is (good), or (ii) produce nothing (bad), or (iii) end the game with zero reward (ugly).

The problem is devising algorithms for the agent, s.t., in expectation w.r.t. , the -expected reward approximates the best possible -expected reward (the latter is what we would get if the agent knew which hypothesis is correct) and the number of queries is low. Propose sets of assumptions about the ingredients of the setting that lead to non-trivial bounds. Consider proving both positive results and negative results (the latter meaning: "no algorithm can achieve a bound better than...")

**Strategy:** See the theoretical research part of my other answer [LW(p) · GW(p)]. I advise to start by looking for the minimal simplification of the setting about which it is still possible to prove non-trivial results. In addition, start with bounds that scale with the sizes of the sets in question, proceed to look for more refined parameters (analogous to VC dimension in offline learning).

**Motivation**: Improving understanding of relationship between learning theory and game theory.

**Topic**: Study the behavior of learning algorithms in mortal population games [LW(p) · GW(p)], in the limit. Specifically, consider the problem statements from the linked comment:

- Are any/all of the fixed points attractors?
- What can be said about the size of the attraction basins?
- Do all Nash equilibria correspond to fixed points?
- Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

You can approach this theoretically (proving things) or experimentally (writing simulations). Specifically, it would be easiest to start from agents that follow fictitious play. You can then go on to more general Bayesian learners, other algorithms from the literature, or (on the experimental side) to using deep learning. Compare the convergence properties you get to those known in evolutionary game theory.

Notice that, due to the grain-of-truth problem, I intended to study this using non-Bayesian learning algorithms, but due to the ergodic-ish nature of the setting, Bayesian learning algorithms might perform well. But, if they perform poorly, this is still important to know.

**Strategies**: See my other answer [LW(p) · GW(p)].

*The idea is an elaboration of a comment [LW · GW] I made previously*.

**Motivation**: Improving the theoretical understanding of AGI by facilitating synthesis between algorithmic information theory and statistical learning theory.

**Topic**: Fix some reasonable encoding of communicating MDPs, and use this encoding to define : the Solomonoff-type prior over communicating MDPs. That is, the probability of a communicating MDP is proportional to where is the length of the shortest program producing the encoding of .

Consider CMDP-AIXI: the Bayes-optimal agent for . Morally speaking, we would like to prove that CMDP-AIXI (or any other policy) has a frequentist (i.e. per hypothesis ) non-anytime regret bound of the form , where is the time horizon^{[1]}, is a parameter such as MDP diameter, bias span or mixing time, , (this time is just a constant, *not* time discount). However, this precise result is probably impossible, because the Solomonoff prior falls off very slowly. *Warm-up*: Prove this!

Next, we need the concept of "sophisticated core", inspired by algorithmic statistics. Given a bit string , we consider the Kolmogorov complexity of . Then we consider pairs where is a program that halts on all inputs, is a bit string, and . Finally, we minimize over . The minimal is called the *sophistication* of . For our problem, we are interested in the minimal itself: I call it the "sophisticated core" of and denote it .

To any halting program we can associate the environment . We also define the prior by . and are "equivalent" in the sense that . However, they are not equivalent for the purpose of regret bounds.

*Challenge*: Investigate the conjecture that there is a (-dependent) policy satisfying the regret bound for every , or something similar.

**Strategy**: See the theoretical research part of my other answer [LW(p) · GW(p)].

I am using unnormalized regret and step-function time discount here to make the notation more standard, even though usually I prefer normalized regret and geometric time discount. ↩︎

Here's a problem for you, which I'm not sure fits the requirements, but might: How do you learn whether an AI has been trained to use Gricean communication (e.g. "I interpret your words by modeling you as saying them because you model me as interpreting them, and so on until further recursion isn't fruitful") without being able to read its source code and check its functioning against some specification of recursive agential modeling?

*The idea is an elaboration of a comment [LW(p) · GW(p)] I made previously.*

**Motivation**: Improving our understanding of superrationality.

**Topic**: Investigate the following conjecture.

Consider two agents playing iterated prisoner's dilemma (IPD) with geometric time discount. It is well known that, for sufficiently large discount parameters (), essentially all outcomes of the normal form game become Nash equilibria (the folk theorem). In particular, cooperation can be achieved via the tit-for-tat strategy. However, defection is still a Nash equilibrium (and even a subgame perfect equilibrium).

Fix . Consider the following IPD variant: the first player is forced to play a strategy that can be represented by a finite state automaton of states, and the second player is forced to play a strategy that can be represented by a finite state automaton of states. For our purpose a "finite state automaton" consists of a set of states , the transition mapping and the "action mapping" . Here, tells you how to update your state after observing the opponent's last action, and tells you which action to take. Denote the resulting (normal form) game , where is the time discount parameter.

*Conjecture:* If then there are a functions and s.t. the following conditions hold:

- Any thermodynamic equilibrium [AF · GW] of of temperature has the payoffs of up to .

**Strategies:** You could take two approaches: theoretical research and experimental research.

For theoretical research, you would try to prove or disprove the conjecture. If the initial conjecture is too hard, you can try to find easier variants (such as , or adding more constraints on the automaton). If you succeed proving the conjecture, you can go on to studying games other than prisoner's dilemma (for example, do we always converge to Pareto efficiency?) If you succeed in disproving the conjecture, you can go on to look for variants that survive (for example, assume or that the finite state automatons must not have irreversible transitions).

To decompose the task I propose: (i) have each person in the team think of ideas how to approach this (ii) brainstorm everyone's ideas together and select a subset of promising ideas (iii) distribute the promising ideas among people and/or take each promising idea and find multiple lemmas that different people can try proving.

Don't forget to check whether the literature has adjacent results. This also helps decomposing: the literature survey can be assigned to a subset of the team, and/or different people can search for different keywords / read different papers.

For experimental research, you would code an algorithm that computes the thermodynamic equilibria, and see how the payoffs behave as a function of and . Optimally, you would also provide a derivation of the error bounds on your results. To decompose the task, use the same strategy as in the theoretical case to come up with the algorithms and the code design. Afterwards, decompose it by having each person implement a segment of the code (pair programming is also an option).

It is also possible to go for theoretical *and* experimental simultaneously, by distributing among people and cross-fertilizing along the way.

## 9 comments

Comments sorted by top scores.

Microscope AI in general seems like a very decomposition-friendly area. Take a trained neural net, assign each person a chunk to focus on, and everybody tries to figure out what features/algorithms/other neat stuff are embedded in their chunk.

Also should work well with a regular-meetup-group format, since the arrangement would be fairly robust to people missing a meeting, joining up midway, having completely different approaches or backgrounds, etc. Relatively open-ended, room for people to try different approaches based on what interests them and cross-pollinate strategies with the group.

What sort of math background can we assume the group to have?

I don't know (partially because I'm unsure who would stay and leave).

If you didn't take math background that in consideration and wrote a proposal (saying "requires background in real analysis" or ...), then that may push out people w/o that background but also attract people with that background.

As long as pre-reqs are explicit, you should go for it.

Are you looking for an open problem which is sub-dividable into many smaller open problems? Or for one small open problem which is a part of a larger open problem?

The first one. As long as you can decompose the open problem into tractable, bite-sized pieces, it's good.

Vanessa mentioned some strategies that might generalize to other open problems: group decomposition (we decide how to break a problem up), programming to empirically verify X, and literature reviews.

I'm unclear on how to apply this filter. Can you give an example of what you mean by decomposable, and an example of not? (Perhaps not from alignment.)

If you only had access to people who can google, program, and notice confusion, how could you utilize that to make conceptual progress on a topic you care about?

Decomposable: Make a simple first person shooter. Could be decomposed into creating asset models, and various parts of the actual code can be decomposed (input-mapping, getting/dealing damage).

Non-decomposable: Help me write an awesome piano song. Although this can be decomposed, I don't expect anyone to have the skills required (and acquiring the skills requires too much overhead).

Let's operationalize "too much overhead" to mean "takes more than 10 hours to do useful, meaningful tasks".

Am I correct that the real generating rule here is something like "I have a group of people who'd like to work on some alignment open problems, and want a problem that is a) easy to give my group, and b) easy to subdivide once given to my group?"

b) seems right. I'm unsure what (a) could mean (not much overhead?).

I feel confused to think about decomposability w/o considering the capabilities of the people I'm handing the tasks off to. I would only add:

By "smart", assume they can notice confusion [? · GW], google, and program

since that makes the capabilities explicit.