Desiderata for decision theory

post by TsviBT · 2016-10-27T02:10:48.000Z · LW · GW · 1 comment

Contents

  What decision theory is for
  Desiderata for decision theory
    Optimal bounded reasoning
    Philosophically sound reasoning
    Optimal decision-making
    Philosophically sound decision-making
1 comment

This is a collection of my current hypotheses about what decision theory is for and what would constitute progress in decision theory.

***

There are not many nonlocal dependencies, so feel free to skip around liberally. I’ve chosen not to track sources, both for my own convenience and because I may have put my own spin on the ideas; but it is safe to assume that the ideas and my apprehension of them are due almost entirely to conversations with and writings by other researchers. This should be considered a draft, so comments are particularly welcome.

Github pdf: https://github.com/tsvibt/public-pdfs/blob/master/decision-theory/desiderata/main.pdf

What decision theory is for

Decision theory is for

“Directing a reasoning process” is not restricted to decision-making systems. Broadly, if we are getting useful work out of an AI, it’s because the AI is doing computations that produce logical information about how we can achieve our goals. So we must have somehow pointed the consequentialist-reasoning part of the AI towards our goals. The questions of decision theory in this context are then: How do you specify logical or environmental goals? How do you direct a consequentialist reasoner to produce the relevant logical information? How do you make a consequentialist reasoner that does this efficiently?

There may be more specific problems relating to agent-like systems in particular. For example, systems that make decisions may exhibit reward hacking, instability under self-modification, failure to avert instrumental incentives, severe suboptimality due to things like spurious counterfactuals, and other more exotic failure modes such as Pascal’s mugging. A good theory of decision-making might also enlighten us about other research paths such as low impact, mild optimization, and so on.

I think it’s plausible that it’s simply not necessary to have a deeper theoretical understanding of decision-making in order to design AIs that are capable and aligned enough to fulfill their potential for positive impact. For example, from this perspective, Paul’s approach could be viewed as saying that decision theory problems, especially the more philosophical ones, can be solved satisfactorily by appealing to and amplifying human judgement. Even if a mathematical theory of decision-making turned out to be necessary, we might want to first establish that fact more definitively and crisply, and only then set about solving the problem. However, decision theory does seem like an identifiable, fundamental theoretical aspect of AI design that may take a long serial time to solve, so I think it’s still very worth pursuing.

Desiderata for decision theory

This is a list of mostly informal, possibly overlapping desiderata for a decision procedure or for a general theory of decision procedures. Note that the categorization is not clean; for example, understanding logical counterfactuals cuts across all categories. Further, some of these problems impinge on other parts of AI alignment theory; for example, reward hacking might also be viewed as a value specification / reward engineering problem. These problems don’t all carry the same philosophical importance or the same practical importance.

It seems worthwhile to formalize desiderata (these or others) for decision theory. In my current view, a formalization or a satisfaction of these desiderata is judged successful or not according to how enlightening it is about the two bullets in the previous section.

Many of these desiderata are parametrized by considering them in different settings and with different constraints. For example, agents can be judged according to:

Optimal bounded reasoning

Roughly, a bounded reasoner is “good” if it answers as many questions, as accurately, as inexpensively, as possible. We’ll know we have a solution when we can prove theorems giving optimal reasoners or impossibility results for various formal models of the problem class and optimality criterion. For example:
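As a hedged sketch of what one such formal model might look like (every symbol here is an assumption introduced for illustration, not something from the post): fix a class of questions $Q$ with a distribution $\mu$ over it, a loss $\ell(R(q), q)$ measuring how inaccurately a reasoner $R$ answers a question $q$, a cost $c(R, q)$ measuring the resources $R$ spends on $q$, and a budget $B$. One candidate optimality criterion is

$$R^* \in \operatorname*{arg\,min}_{R \,:\, \mathbb{E}_{q \sim \mu}[c(R, q)] \le B} \ \mathbb{E}_{q \sim \mu}\big[\ell(R(q), q)\big],$$

so that an optimality theorem exhibits a reasoner attaining the minimum for a given problem class, and an impossibility result shows that no reasoner within budget $B$ gets expected loss below some bound. This is only one way to cash out “as many questions, as accurately, as inexpensively”; folding cost into the objective rather than into a constraint would be another.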

Philosophically sound reasoning

A good theory of reasoning should make us unconfused about bounded, embedded reasoning. For example:

Optimal decision-making

Roughly, a decision-making procedure is “good” if an agent that makes decisions using it gets highly-valued outcomes in as many settings as possible. We’ll know we have a solution when we can prove theorems giving optimal agents or impossibility results for various formal models of the agent, problem class, and optimality criterion. For example:
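As a hedged sketch of one such formal model (again, all notation is an assumption introduced here for illustration): let $\mathcal{A}$ be a class of (perhaps resource-bounded) agents, $\mathcal{E}$ a class of environments, $\mathrm{hist}(A, E)$ the interaction history produced when agent $A$ acts in environment $E$, and $V$ a function assigning values to histories. A worst-case criterion would then be

$$A^* \in \operatorname*{arg\,max}_{A \in \mathcal{A}} \ \min_{E \in \mathcal{E}} V\big(\mathrm{hist}(A, E)\big),$$

with a Bayesian variant obtained by replacing the $\min$ with an expectation under a prior over $\mathcal{E}$. Theorems would then either exhibit such an $A^*$ for a given agent model and environment class, or show that no agent in $\mathcal{A}$ attains a given value.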

Philosophically sound decision-making

A good theory of decision-making should make us unconfused about rational bounded agency. We’ll know we have a solution when we can analyze agent designs and give persuasive and empirically accurate accounts of whether they are taking actions for reasons we would endorse. For example:

1 comment

comment by Wei Dai (Wei_Dai) · 2016-11-13T01:09:01.000Z · LW(p) · GW(p)

I agree with most of what you wrote, but have questions about a couple of sections.

"R has good third-person counterfactuals."

I'm not sure this problem is important or has a solution. Why do you think it's an important problem to solve?

"A should handle exotic environments. That is, A should do something sensible, even when placed in environments that might be considered pathological. For example, A should make appropriate use of halting oracles, time travel, infinite computational resources, and so on."

"This desideratum doesn’t seem very important."

Why do you say it doesn't seem very important? It's plausible to me that when we really understand ontology and our values, it will be apparent that the most important "environments" for A to influence are the ones higher up on the arithmetical hierarchy (or other mathematical hierarchies). It seems plausible that those worlds might be much "bigger" or "more complex" or "more interesting" or "have more resources", to such an extent that the utility we can get from them overwhelms the utility we can get from more mundane environments that we can also influence.

It's also plausible that the above is false but A fails by acting as if it's true. It seems to me that both of these failure modes are safety-relevant in the sense that errors could cost a large fraction of our potential utility, and probably won't become apparent until A becomes very powerful.

Earlier in a related section you wrote "This seems like a problem we should be able to defer to future versions". Did you mean future versions of decision theory that we research before building an AI, or future versions of AI that the initial AI self-improves into? In either case, why is this problem more deferrable than the other problems listed here?