Modeling AGI Safety Frameworks with Causal Influence Diagrams

post by Ramana Kumar (ramana-kumar) · 2019-06-21T12:50:08.233Z · LW · GW · 6 comments

This is a link post for https://arxiv.org/abs/1906.08663

We have written a paper that represents various frameworks for designing safe AGI (e.g., RL with reward modeling, CIRL, and debate) as Causal Influence Diagrams (CIDs), to help us compare frameworks and better understand the corresponding agent incentives.

We would love to get comments, especially on

The paper's abstract:

Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.
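
As a rough illustration of what such a diagram encodes, here is a minimal Python sketch of a CID as a typed directed acyclic graph. The `CID` class, its methods, and the one-step `toy_rl_diagram` are hypothetical names invented for this example and are not taken from the paper. In this reading, utility nodes stand for the optimization objective and edges stand for the causal assumptions.

```python
from dataclasses import dataclass, field

# The three node kinds of an influence diagram.
CHANCE, DECISION, UTILITY = "chance", "decision", "utility"


@dataclass
class CID:
    """Toy causal influence diagram: a DAG with typed nodes (illustrative only)."""
    kinds: dict = field(default_factory=dict)    # node name -> node kind
    parents: dict = field(default_factory=dict)  # node name -> list of parent names

    def add_node(self, name, kind, parents=()):
        if kind not in (CHANCE, DECISION, UTILITY):
            raise ValueError(f"unknown node kind: {kind}")
        if name in self.kinds:
            raise ValueError(f"node {name!r} already exists")
        # Requiring parents to exist first keeps the graph acyclic.
        for p in parents:
            if p not in self.kinds:
                raise ValueError(f"parent {p!r} must be added before {name!r}")
        self.kinds[name] = kind
        self.parents[name] = list(parents)

    def nodes_of_kind(self, kind):
        return [n for n, k in self.kinds.items() if k == kind]

    def ancestors(self, name):
        """All nodes with a directed path into `name`."""
        seen, stack = set(), list(self.parents[name])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(self.parents[n])
        return seen


def toy_rl_diagram():
    """Hypothetical one-step RL diagram, not one of the paper's figures."""
    cid = CID()
    cid.add_node("state", CHANCE)
    cid.add_node("action", DECISION, parents=["state"])   # the agent observes the state
    cid.add_node("next_state", CHANCE, parents=["state", "action"])
    cid.add_node("reward", UTILITY, parents=["next_state"])
    return cid
```

Calling `toy_rl_diagram().ancestors("reward")` lists every node that can causally affect the objective in this toy model, which is the kind of question the unified representation is meant to make easy to ask.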

6 comments


comment by TurnTrout · 2019-06-21T15:15:48.881Z · LW(p) · GW(p)

I really like this layout, this idea, and the diagrams. Great work.

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably not safety measures that don't break).

Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al. point out, this isn't usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).

Replies from: tom4everitt
comment by tom4everitt · 2019-06-28T12:43:35.558Z · LW(p) · GW(p)

I really like this layout, this idea, and the diagrams. Great work.

Glad to hear it :)

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably not safety measures that don't break).

Yes, the argument is only valid under the assumptions that you mention. Thanks for pointing to the discussion post about the assumptions.

Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al. point out, this isn't usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).

Fair point, we should probably weaken this claim somewhat.

comment by Charlie Steiner · 2019-06-24T06:06:41.494Z · LW(p) · GW(p)

The reason I don't personally find these kinds of representations super useful is that each of those boxes is a quite complicated function, and what's in the boxes usually involves many more bits worth of information about an AI system than how the boxes are connected. And sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms).

I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow. Eventually I decided that the idea made lots of sense for a few paradigm cases and was more trouble than it was worth for everything else. When you need to carefully refer to the text description to understand a diagram, that's a sign that maybe you should use the text description.

This isn't to say I think one should never see anything like this. Different ways of presenting the same information (like diagrams) can help drive home a particularly important point. But I am skeptical that there's a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it's intended to make.

Replies from: tom4everitt
comment by tom4everitt · 2019-06-28T10:36:23.230Z · LW(p) · GW(p)

Hey Charlie,

Thanks for your comment! Some replies:

sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms)

There is definitely a modeling choice involved in deciding how much to "pack" into each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on the key dynamics of each framework.

As for the CIRL and IDA difference, this is a direct effect of the different levels at which the frameworks are specified. CIRL is a high-level framework, roughly saying "somehow you infer the human preferences from their actions". IDA, in contrast, provides a reasonably detailed supervised learning criterion. So I think the frameworks themselves are already like apples and oranges; it's not just the diagrams. (And this is something you notice when drawing the diagrams.)

But I am skeptical that there's a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it's intended to make.

We don't want to claim that CIDs are the one-and-only diagram to always use, but as you mentioned above, they do allow for quite some flexibility in which aspects to highlight.

I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow.

Interesting. A while back I was looking at information flow diagrams myself, and was surprised to discover how hard it was to make them formally precise (there seems to be no formal semantics for them). In contrast, causal graphs and CIDs have formal semantics, which is quite useful.

For hierarchical representations, there are networks of influence diagrams: https://arxiv.org/abs/1401.3426
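
To make the "formal semantics" point a bit more concrete: once every chance node has a probability distribution and the decision node has a policy, an influence diagram determines an expected utility, and policies can be compared on that basis. Below is a minimal Python sketch under made-up assumptions. The binary state `S`, the probabilities in `P_S`, the matching-based `utility`, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy model: a binary chance node S, a decision A that observes S,
# and a utility node that pays 1 when the action matches the state.
P_S = {0: 0.25, 1: 0.75}
ACTIONS = (0, 1)


def utility(s, a):
    return 1.0 if a == s else 0.0


def expected_utility(policy):
    """Expected utility of a deterministic policy mapping observed state -> action."""
    return sum(p * utility(s, policy[s]) for s, p in P_S.items())


def best_policy():
    """Enumerate all deterministic policies and return one with maximal expected utility."""
    states = sorted(P_S)
    candidates = (
        dict(zip(states, acts)) for acts in product(ACTIONS, repeat=len(states))
    )
    return max(candidates, key=expected_utility)
```

In this toy model, a policy that ignores the observation (for example, always choosing action 1) does strictly worse than the one that copies the state, so the information link from `S` to `A` has a precise, checkable meaning rather than a purely pictorial one.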

Replies from: Charlie Steiner
comment by Charlie Steiner · 2019-06-30T17:35:02.741Z · LW(p) · GW(p)

All good points.

The paper you linked was interesting - the graphical model is part of an AI design that actually models other agents using that graph. That might be useful if you're coding a simple game-playing agent, but I think you'd agree that you're using CIDs in a more communicative / metaphorical way?

comment by Davidmanheim · 2019-06-21T12:59:45.485Z · LW(p) · GW(p)

On point 2, which is the only one I can really comment on: yes, this seems like a useful paper, and I buy the argument that such an approach is critical for some purposes, including some of what we discussed on Goodhart's Law (https://arxiv.org/abs/1803.04585), where one class of misalignment can be explicitly addressed by your approach. Also see the recent paper https://arxiv.org/abs/1905.12186, which explicitly models causal dependencies (as in figure 2) to show a safety result.