(Some?) Possible Multi-Agent Goodhart Interactions

post by Davidmanheim · 2018-09-22T17:48:22.356Z · score: 21 (5 votes) · LW · GW · 2 comments

Contents

  Quick Reference
  New: 5 Ways Multiple Agents Ruin Everything
  Conclusion
2 comments

Epistemic Status: I need feedback on these ideas, and I've been delaying because I'm not sure I'm on the right track. This is the product of a lot of thinking, but I'm not sure the list is complete, or whether I'm missing something important. (Note: This is intended to form a large part of a paper to be submitted to the journal special issue here.)

Following up on Scott Garrabrant's earlier post on Goodhart's Law and the resulting paper, I wrote a further discussion of non-adversarial Goodhart, and explicitly deferred discussion of the adversarial case. I've been working on that.

Also note that these are often reformulations or categorizations of other terms (treacherous turn, faulty reward functions, distributional shift, reward hacking, etc.). It might be good to clarify exactly what went where, but I'm unsure.

To (finally) start, here is Scott's "Quick Reference" for the initial 4 methods, which is useful for this post as well. I've partly replaced the last one with the equivalent cases from the arXiv paper.

Quick Reference

From the arXiv paper: (Note: I think this is still incomplete, and focuses far too much on the Agent-Regulator framing. See below.)

New: 5 Ways Multiple Agents Ruin Everything

To fix that insufficient bullet point above, here is a list of 5 forms of optimization failures that can occur in multi-agent systems. I intend for the new sub-list to be both exhaustive and non-overlapping, but I'm not sure either is true. For obvious reasons, the list is mostly human examples, and I haven't formalized these into actual system models. (Anyone who would like to help me do so would be welcome!)

Note that the list is only discussing things that happen due to optimization failure and interactions. Also note that most examples are 2-party. There may be complex and specific 3-party or N-party failure modes that are not captured, but I can't find any.

1) (Accidental) Steering is when one agent alters the system in ways not anticipated by another agent, creating one of the above-mentioned over-optimization failures for the victim.

This is particularly worrisome when multiple agents have closely related goals, even if those goals are aligned.

Example 1.1 A system may change due to a combination of actors' otherwise benign influences, either putting the system in an extremal state or triggering a regime change.

Example 1.2 In the presence of multiple agents without coordination, manipulation of factors not already being manipulated by other agents is likely to be easier and more rewarding, potentially leading to inadvertent steering.
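
As a rough illustration of Example 1.1, here is a toy sketch in Python. Everything in it is an assumption made for illustration and is not taken from the post or the paper: the shared capacity, the `chosen_load` rule, and the background load of 2.0 that each agent attributes to the other. Each agent's choice is safe against its own model of the system, but the combination pushes the system across a threshold into a different regime.

```python
# Toy model (hypothetical, for illustration only): a shared system stays in
# its normal regime only while the total load is at or below CAPACITY.
CAPACITY = 10.0


def chosen_load(assumed_other_load: float, capacity: float = CAPACITY) -> float:
    """An agent takes all the headroom it believes is available."""
    return max(0.0, capacity - assumed_other_load)


def regime(total_load: float, capacity: float = CAPACITY) -> str:
    """Classify the system state given the total load placed on it."""
    return "normal" if total_load <= capacity else "collapsed"


# Each agent models the other as contributing a load of 2.0, which was
# accurate before either of them started optimizing.
load_a = chosen_load(assumed_other_load=2.0)
load_b = chosen_load(assumed_other_load=2.0)

print("A's model of the outcome:", regime(load_a + 2.0))
print("B's model of the outcome:", regime(2.0 + load_b))
print("Actual outcome:          ", regime(load_a + load_b))
```

Neither agent is adversarial here, and neither agent's individual model predicts the collapse; the failure only appears when both act.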

2) Coordination Failure occurs when multiple agents clash despite having potentially compatible goals.

Coordination is an inherently difficult task, and can in general be considered impossible (Gibbard, 1973). In practice, coordination is especially difficult when goals of other agents are incompletely known or understood. Coordination failures such as Yudkowsky's Inadequate Equilibria (Yudkowsky, 2017) are stable, and coordination to escape from such an equilibrium can be problematic even when agents share goals.

Example 2.1 Conflicting instrumental goals that neither side anticipates may cause wasted resources on contention. For example, both agents are trying to do the same thing in conflicting ways.

Example 2.2 Coordination limiting overuse of public goods is only possible when conflicts are anticipated or noticed and where a reliable mechanism can be devised (Ostrom, 1990).
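
A toy sketch of the kind of contention described in Example 2.1, written in Python. The setup is hypothetical and chosen only for illustration: the agents share exactly the same goal (bring a value to a target), but each applies the full correction as if it were acting alone, so together they overshoot and never settle.

```python
from typing import List


def run(n_agents: int, steps: int = 6, target: float = 10.0) -> List[float]:
    """Simulate agents that all push a shared value toward the same target."""
    x = 0.0
    history = [x]
    for _ in range(steps):
        # Every agent sees the same state and applies the full correction,
        # unaware that the other agents are doing the same thing.
        corrections = [target - x for _ in range(n_agents)]
        x += sum(corrections)
        history.append(x)
    return history


print("one agent: ", run(1))
print("two agents:", run(2))
```

With one agent the value reaches the target and stays there. With two agents the value swings between 0 and 20 indefinitely, so all of the effort after the first step is wasted even though the goals are identical.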

3) Adversarial misalignment occurs when a victim agent has an incomplete model of how an opponent can influence the system, and the opponent selects for cases where the victim's model performs poorly and/or promotes the opponent's goal.

Example 3.1 Chess engines will choose openings for which the victim is weakest.

Example 3.2 Sophisticated financial actors can dupe victims into buying or selling an asset in order to exploit the resulting price changes.
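
The selection effect in Example 3.1 can be sketched in a few lines of Python. The opening names and win rates below are made-up placeholders, not data: the point is only that an opponent who chooses where the interaction happens makes the victim's worst case, not its average case, the relevant one.

```python
# Hypothetical win rates for the victim in each situation (illustrative only).
victim_win_rate = {
    "opening_a": 0.62,
    "opening_b": 0.55,
    "opening_c": 0.21,
}

# If situations arose at random, the victim would face its average case.
average_case = sum(victim_win_rate.values()) / len(victim_win_rate)

# An adversary instead selects the situation where the victim is weakest.
adversarial_choice = min(victim_win_rate, key=victim_win_rate.get)
adversarial_case = victim_win_rate[adversarial_choice]

print(f"average case:     {average_case:.2f}")
print(f"adversarial case: {adversarial_case:.2f} (via {adversarial_choice})")
```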

4) Input spoofing and filtering - Filtered evidence can be provided or false evidence can be manufactured and put into the training data stream of a victim agent.

Example 4.1 Financial actors can filter by performing transactions they don't want seen as private transactions or dark pool transactions, or can spoof by creating offsetting transactions with only one half being reported to give a false impression of activity to other agents.

Example 4.2 Rating systems can be attacked by inputting false reviews into a system, or by discouraging reviews by those likely to be the least or most satisfied reviewers.

Example 4.3 Honeypots can be placed or Sybil attacks mounted by opponents in order to fool victims into learning from examples that systematically differ from the true distribution.
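
To make the difference between filtering and spoofing concrete, here is a toy Python sketch based on the rating-system case in Example 4.2. The ratings, the cutoff for "dissatisfied" reviewers, and the number of fake reviews are all invented for illustration.

```python
from statistics import mean

# Hypothetical true ratings on a 1 to 5 scale (illustrative only).
true_ratings = [1, 2, 3, 4, 5, 5, 2, 4]

# Filtering: the least satisfied reviewers are discouraged from posting,
# so their ratings never reach the system.
filtered = [r for r in true_ratings if r > 2]

# Spoofing: fabricated top ratings are added to the genuine ones.
spoofed = true_ratings + [5] * 8

print("true mean:    ", mean(true_ratings))
print("filtered mean:", mean(filtered))
print("spoofed mean: ", mean(spoofed))
```

In both cases an agent that learns from the observed ratings sees a sample that differs systematically from the true distribution, even though filtering never introduces a single false data point.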

5) Goal co-option is when an agent directly modifies the victim agent's reward function, or manipulates variables absent from the victim's system model.

The probability of exploitable reward functions increases with the complexity of both the agent and the system it manipulates (Amodei, 2016), and exploitation by other agents seems to follow the same pattern.

Example 5.1 Attackers can directly target the system on which an agent runs and modify its goals.

Example 5.2 An attacker can discover exploitable quirks in the goal function to make the second agent optimize for a new goal, as in Manheim and Garrabrant's Campbell's law example.

Conclusion

I'd love feedback. (I have plenty to say about applications and importance, but I'll talk about that separately.)

2 comments

Comments sorted by top scores.

comment by romeostevensit · 2018-09-22T21:16:41.855Z · score: 10 (5 votes) · LW · GW

In general I think working on taxonomizing failure modes is valuable. In the case of one of the meta generators of failure modes, proxy divergence, even more valuable.

Formalization generators: I often find it useful to think about which kinds of distinctions I can make in order to decompose a category. A few high level ones: split into variant and invariant parts, past/future asymmetry, descriptive/prescriptive parts, continuous vs discrete representation, implementation/algorithmic/functional level (Marr's levels), complexity classes (in particular some strategies forcing other strategies into worse complexity classes), breadth vs depth first search spaces, and strategies differing due to beliefs about payoff distribution shape (incl. type 1 and 2 error penalties).

With that last one an object level example: knowing that the payoff distribution has changed before others because you're the one who changed it (caused the proxy to diverge).

I like the generator of how markets might clear under some adversarial conditions and wonder what models quants have of this they might be willing to share.

comment by Davidmanheim · 2018-11-04T13:13:36.968Z · score: 2 (2 votes) · LW · GW
"I like the generator of how markets might clear under some adversarial conditions and wonder what models quants have of this they might be willing to share."

In the preprint paper - https://arxiv.org/abs/1810.10862 - I discuss a few examples of these failure modes that occur in practice. In finance, most of the discussed failures are ways to create "momentum ignition."

Also, having done policy work on HFT, I found it's really really hard to get quants to share any details about strategies. I suspect this would be doubly true if it's about manipulative strategies!

(And thanks for the other thoughts. I'm still working through what those generators' failure modes would look like.)