Counterfactual Mechanism Networks

post by StrivingForLegibility · 2024-01-30T20:30:00.954Z · LW · GW

Contents

  Many Paths to Peace
  Active Shields
  When Are These Networks Useful?

In the previous post [LW · GW], we saw an example of how a simple network of counterfactual mechanisms [LW · GW] can be used to produce logical commitments that resolve an incentive misalignment. In this post we'll generalize this technique to more complicated networks, and sketch out how such networks should be structured.

In our simple example, AliceBot and BobBot performed a single round of negotiation over conditional joint commitments. But open-source game theory [? · GW] lets us construct an entire network of counterfactual games, where agents in one game condition their behavior on the outcomes of any number of others. This information flow can even be loopy [LW · GW], using a more sophisticated logical crystal ball [LW · GW] than straightforward simulation. The subset relationship between logical commitments [LW · GW] makes it easy to compose them: one mechanism restricts the joint policy space, and the next restricts it further, ratcheting all participants towards the Pareto frontier in a deliberately designed manner.
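
To make that subset relationship concrete, here's a minimal sketch (all names are hypothetical, not from any earlier post) that models each commitment as a predicate over a toy joint policy space, so composing commitments is just intersecting the sets they allow:

```python
from itertools import product

# Toy model: a "joint policy" is just one action per player in a one-shot game.
ACTIONS = ["Cooperate", "Defect", "Nuke"]
joint_policies = set(product(ACTIONS, repeat=2))

def restrict(policies, commitment):
    """Apply a commitment, modeled as a predicate over joint policies.
    The result is always a subset of the input, so commitments compose by
    intersection and only ever ratchet the space downward."""
    return {p for p in policies if commitment(p)}

# One mechanism rules out the worst outcomes; the next restricts further.
no_nukes    = restrict(joint_policies, lambda p: "Nuke" not in p)
mutual_coop = restrict(no_nukes, lambda p: p == ("Cooperate", "Cooperate"))

print(len(joint_policies), len(no_nukes), len(mutual_coop))  # 9 4 1
```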

One application for this is building up the equivalent of treaties layered on top of treaties. Alice might not be willing to discuss trade treaties at all until a satisfactory peace treaty has been committed to. A peace treaty is a joint commitment, but it doesn't fully specify the joint policy to be implemented by both parties. A joint commitment redefines the baseline for all participants, which any new agreement must improve upon. 
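
As a toy sketch of that ratcheting baseline, with made-up payoffs: a proposed treaty is only accepted if it leaves every participant at least as well off as the current baseline, and each accepted treaty becomes the new baseline that later proposals must beat:

```python
# Hypothetical payoffs. A proposal is accepted only if every participant does
# at least as well as under the current baseline (the most recent agreement).
baseline = {"Alice": 0, "Bob": 0}            # e.g. ongoing conflict

def improves_on(proposal, baseline):
    return all(proposal[agent] >= baseline[agent] for agent in baseline)

peace_treaty = {"Alice": 3, "Bob": 2}
trade_treaty = {"Alice": 5, "Bob": 4}

if improves_on(peace_treaty, baseline):
    baseline = peace_treaty                  # peace redefines the baseline
if improves_on(trade_treaty, baseline):
    baseline = trade_treaty                  # trade must improve on peace, not on war
print(baseline)                              # {'Alice': 5, 'Bob': 4}
```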

A peace treaty itself might be the result of a negotiation, where peace was only made attractive in the first place due to several other joint commitments taking the worst kinds of destructive conflict off the table. The equivalent of the Geneva Conventions, or performing a logical handshake to remove Nuke from both players' policy spaces in the Nuclear Prisoners' Dilemma.
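
Here's a crude sketch of that kind of handshake. It assumes, unrealistically, that each bot verifies the other by checking for syntactic equality of source code; a real system would use the more sophisticated logical crystal balls discussed earlier:

```python
import inspect

def policy_space(my_source: str, their_source: str) -> set:
    """Remove Nuke from my policy space only if the other agent runs this exact
    same procedure. Syntactic equality is a stand-in for proof search or bounded
    simulation, and conveniently avoids infinite regress in a toy example."""
    options = {"Cooperate", "Defect", "Nuke"}
    if their_source == my_source:
        return options - {"Nuke"}
    return options

source = inspect.getsource(policy_space)
print(sorted(policy_space(source, source)))             # ['Cooperate', 'Defect']
print(sorted(policy_space(source, "def rogue(): ...")))  # Nuke stays on the table
```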

In cases like peace, probabilistic rejection might risk fairly terrible outcomes. Here we might want to use a different tool, like leverage [LW · GW] in later mechanisms, to incentivize fair peace agreements. A guaranteed-but-unfair peace agreement, with the ability to recover those losses through later mechanisms, might still be a better deal than 96:4 odds of Fair Peace:War.
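
With some made-up utilities for Alice, just to put numbers on that comparison:

```python
# Hypothetical utilities for Alice.
fair_peace, unfair_peace, war = 10.0, 6.0, -100.0
later_recovery = 3.0   # value Alice expects to recover through later mechanisms

gamble     = 0.96 * fair_peace + 0.04 * war   # probabilistic rejection
guaranteed = unfair_peace + later_recovery    # unfair peace now, leverage later

print(gamble, guaranteed)   # 5.6 vs 9.0: the guaranteed deal wins here
```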

This seems like a generally useful pattern for our decision theory: use strategically-later decisions as leverage to induce prosocial behavior in strategically-earlier decisions.

This is a generalization of conditioning one's demands [LW · GW] in the Ultimatum game on the results of a strategically-earlier Prisoners' Dilemma. Alice can elicit better future treatment from Bob, and possibly others, by Cooperating now. And demanding less from a Cooperative Alice is how Bob is able to elicit Cooperation from Alice in the first place.
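
A minimal sketch of that conditioning, with hypothetical numbers for the size of the pot and the demands:

```python
def bobs_ultimatum_demand(alice_pd_move: str, pot: float = 10.0) -> float:
    """Bob demands less from a Cooperative Alice in the strategically-later
    Ultimatum game, which is what makes Cooperation worth it for Alice in the
    strategically-earlier Prisoners' Dilemma."""
    return 0.5 * pot if alice_pd_move == "Cooperate" else 0.9 * pot

print(bobs_ultimatum_demand("Cooperate"))  # 5.0 -> a fair split
print(bobs_ultimatum_demand("Defect"))     # 9.0 -> Alice is left with 1.0
```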

Many Paths to Peace

Before being willing to discuss peace terms, Alice could require that several prior commitments have all been satisfactorily made. This introduces the AND pattern: A and B and C. For something as important as peace, Alice might prefer the OR pattern: A or B or C.

When establishing a secure connection to a website, your browser might perform something like a TLS handshake with the server it's trying to connect to. As part of this, your browser sends over a list of cryptographic ciphers that it supports, and the server picks its favorite cipher from that list, or else closes the connection due to the lack of a mutually-supported cipher. This is the sort of "I split, you choose" OR pattern that lets each side freely add more supported options, and drop support for options that need to be deprecated.
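
A sketch of that negotiation pattern, loosely modeled on cipher selection (the preference lists here are just illustrative):

```python
def negotiate(client_offers, server_preferences):
    """Pick the server's most-preferred option that the client also supports,
    or fail cleanly if there's no overlap. Either side can add new options or
    deprecate old ones without renegotiating the protocol itself."""
    for option in server_preferences:
        if option in client_offers:
            return option
    return None   # no mutually-supported option: close the connection

client = ["TLS_AES_128_GCM_SHA256", "TLS_CHACHA20_POLY1305_SHA256"]
server = ["TLS_CHACHA20_POLY1305_SHA256", "TLS_AES_256_GCM_SHA384"]
print(negotiate(client, server))   # TLS_CHACHA20_POLY1305_SHA256
```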

When it comes to commitments, I like to think of an OR collection as a single module that serves a single purpose. And I might want a collection of these modules to all be present as a prerequisite for a particular negotiation. This framing makes it natural to group prerequisites together in Conjunctive Normal Form (CNF), also known as an AND of ORs. Any logical formula can be expressed in CNF, and having some kind of canonical form helps to prove theorems like "AliceBot will uphold (A or B) and (C or D)."
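
A sketch of checking CNF prerequisites against whichever commitments have actually been established (the labels are hypothetical):

```python
# An AND of ORs: each inner list is one OR-clause (one "module"), and the
# negotiation requires every clause to be satisfied by some established commitment.
prerequisites = [
    ["geneva_conventions", "no_first_use_pact"],   # A or B
    ["peace_treaty_v1", "peace_treaty_v2"],        # C or D
]

def prerequisites_met(established, cnf):
    return all(any(commitment in established for commitment in clause)
               for clause in cnf)

print(prerequisites_met({"no_first_use_pact", "peace_treaty_v2"}, prerequisites))  # True
print(prerequisites_met({"peace_treaty_v1"}, prerequisites))                       # False
```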

Unlike the information flow of the underlying games, the prerequisite graph probably shouldn't have any loops. This seems like the sort of thing that leads to the unhelpful kind of circular dependency: "Alice requires resolving A before negotiating B, and won't start negotiating A until B is resolved." Conditional commitments within each mechanism should already give us the ability to construct strategic time loops [LW · GW] that result in socially high-ranking outcomes, without needing loops in the prerequisite graph.
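
A sketch of enforcing that acyclicity with Python's standard-library topological sorter, on a hypothetical prerequisite graph:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical prerequisite graph: each negotiation maps to the negotiations
# that must be resolved before it can begin.
prereqs = {
    "trade_treaty": {"peace_treaty"},
    "peace_treaty": {"geneva_conventions"},
    "geneva_conventions": set(),
}

try:
    print("Negotiate in order:", list(TopologicalSorter(prereqs).static_order()))
except CycleError as err:
    print("Unhelpful circular dependency:", err)
```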

So a general structure that seems reasonable for these mechanism networks is to put lots of paths to peace at the bottom. Layered on top of that are negotiations about things like trade, where we demand less from negotiators who have made certain commitments. And for some positive-sum trades we might not be willing to negotiate at all without the right commitments in place.

It's self-defeating to sell iron to raiders that will forge that iron into weapons for raiding you. And it's an unethical negative externality to legibly enable raiders to raid others.[1] So by "peace" here I think I'm advocating for something like a legible commitment not to unethically initiate violence.[2] All it takes to bring about world peace is for the relevant actors to stop fighting. It's reshaping the incentives of the relevant actors that seems to be the hard part so far, but it does seem to be working.

Active Shields

In Engines of Creation 2.0, Drexler advocates for the use of active shields as a mechanism for reshaping incentives. He gives the example of a space-based platform, designed to shoot down any cluster of missile launches that looks like an attempted first strike on another country:

We now can design devices that sense (looks like a thousand missiles have just been launched), assess (this looks like an attempted first strike) and act (try to destroy those missiles!). If a system will fire only at massive flights of missiles, then it cannot be used for offense or a space blockade. Better yet, it could be made incapable of discriminating between attacking sides. Though serving the strategic interests of its builders, it would not be subject to the day-to-day command of anyone’s generals. It would just make space a hazardous environment for an attacker’s missiles. Like a sea or a mountain range in earlier wars, it would threaten neither side while providing each with some protection against the other.

This sort of mechanism fits nicely into the pattern of "legible system that reshapes incentives towards better outcomes." Every piece of the design and construction could be subject to inspection and approval by every stakeholder. The result would be an autonomous signal processor that transforms observation signals into action signals, according to a known and agreed-upon policy. Shutting down such a system or modifying its instructions could require a threshold number of signatures from the relevant stakeholders.
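
A toy sketch of that kind of signal processor, with hypothetical thresholds and a multi-signature shutdown switch:

```python
class ActiveShield:
    """A toy sense-assess-act loop with a fixed, legible policy and a
    threshold-of-signatures shutdown. All thresholds here are hypothetical."""

    def __init__(self, stakeholders, shutdown_threshold, strike_threshold=1000):
        self.stakeholders = set(stakeholders)
        self.shutdown_threshold = shutdown_threshold
        self.strike_threshold = strike_threshold
        self.shutdown_votes = set()
        self.active = True

    def assess(self, observed_launches: int) -> str:
        """Transform an observation signal into an action signal."""
        if self.active and observed_launches >= self.strike_threshold:
            return "intercept"   # fires only at massive flights of missiles
        return "stand down"      # useless for offense or a space blockade

    def vote_shutdown(self, stakeholder: str) -> None:
        if stakeholder in self.stakeholders:
            self.shutdown_votes.add(stakeholder)
        if len(self.shutdown_votes) >= self.shutdown_threshold:
            self.active = False

shield = ActiveShield({"A", "B", "C"}, shutdown_threshold=2)
print(shield.assess(3))      # stand down
print(shield.assess(1200))   # intercept
shield.vote_shutdown("A")
shield.vote_shutdown("B")
print(shield.assess(1200))   # stand down: enough signatures to deactivate
```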

Deploying such a system probably feels risky, to say the least. Maybe less risky than a nuclear arms race, but still not great. Having made it this far up the prerequisite graph, the relevant stakeholders might then commit to legibly dismantle all nuclear first-strike capabilities, or legibly not build nuclear weapons in the first place if they had enough foresight to negotiate enforceable treaties before an arms race broke out. This saves them from the risks posed by nuclear arms races AND the risks posed by imperfect active shields. Powerful superintelligences might negotiate such treaties [LW · GW] within moments of suspecting each other's existence.

When Are These Networks Useful?

In extremely simple games like the Prisoners' Dilemma, the hard part isn't identifying the socially optimal outcome. The hard part is reshaping each player's incentives to guide them towards it. Some games have a special structure [LW · GW] that makes it tractable to find optimal joint policies efficiently. But in general, it's really hard to evaluate the consequences of implementing even a single joint policy, let alone optimize over the entire superexponentially large joint policy space [LW · GW].
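
As a rough illustration of that blow-up, under toy assumptions (n players, A actions each, T rounds, and every player observes the full history of past joint actions):

```python
# A deterministic policy maps each possible history to an action, so the count
# of joint policies grows superexponentially in the length of the game.
def joint_policy_count(n, A, T):
    decision_points = sum((A ** n) ** t for t in range(T))  # histories seen so far
    policies_per_player = A ** decision_points
    return policies_per_player ** n

for T in range(1, 4):
    print(T, joint_policy_count(n=2, A=2, T=T))
# 1 4
# 2 1024
# 3 4398046511104
```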

I expect that the general solution will involve the OR pattern. Suppose that all of the following are true: every agent can accurately predict the relevant behavior of every other agent, the socially optimal outcome can actually be identified, and every agent finds that outcome acceptable.

In those cases, all agents can simply perform a logical handshake [LW · GW] and agree to implement the socially optimal outcome, conditional on their accurate prediction that all other agents will do the same. This is one of the most straightforward routes to the optimal outcome, and one of the first things our software systems should check.
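
A sketch of that check, with the logical crystal ball abstracted into a predictor function that gets passed in (everything here is illustrative):

```python
def choose_policy(predict_all_others_commit, optimal_joint_policy, fallback):
    """Commit to my share of the socially optimal joint policy, conditional on
    an accurate prediction that every other agent does the same. The predictor
    stands in for whatever logical crystal ball the agents actually have."""
    if predict_all_others_commit():
        return optimal_joint_policy["me"]
    return fallback

print(choose_policy(lambda: True,  {"me": "Cooperate"}, "Defect"))  # Cooperate
print(choose_policy(lambda: False, {"me": "Cooperate"}, "Defect"))  # Defect
```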

Counterfactual mechanisms rely on that first criterion, since they also use logical handshakes. But they work even when the other criteria aren't satisfied. Networks of these counterfactual mechanisms can iteratively peel off suboptimal parts of the joint policy space, and different tools and insights can be applied to tackle different structures that exist within the joint policy space. "I don't know what the optimal joint policy looks like, but I know it doesn't involve us wasting a bunch of resources in destructive conflicts [LW · GW]."
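
Continuing the set-restriction sketch from earlier: each filter encodes one insight about what the optimal joint policy doesn't look like, and no filter needs to know what it does look like:

```python
# Hypothetical filters, each contributed by a different mechanism or insight.
def no_destructive_conflict(policy):
    return "War" not in policy

def no_wasted_resources(policy):
    return "Embargo" not in policy

joint_policies = {("Trade", "Trade"), ("Trade", "Embargo"),
                  ("War", "Trade"), ("War", "War")}

for peel in (no_destructive_conflict, no_wasted_resources):
    joint_policies = {p for p in joint_policies if peel(p)}

print(joint_policies)   # {('Trade', 'Trade')} is all that survives
```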

  1. ^

    This is an ethical standard that's easier to hold mutually-legible software systems to. They have a number of superpowers that impose additional moral responsibility, including: 

    Mutual telepathy

    Accurate simulation of each other in any hypothetical scenario

    Trivial ability to make unbreakable commitments

    Software can also be designed to be more epistemically and instrumentally rational than even the smartest human. They can be more moral, considerate, and kind. They can be fearless, tireless, and selfless.

    This opportunity also transmits moral responsibility to the designers of software systems. A person might act unethically when surprised or emotionally distressed. But deliberately writing software to act unethically is a premeditated act, with correspondingly greater moral weight.

  2. ^

    The phrase "unethically initiate" is doing a lot of work here, so I want to clarify that I think it's morally necessary to initiate violence in some circumstances. Some laws need to be enforced upon agents that don't voluntarily agree to their enforcement, in order to preserve the voluntariness of other interactions. War might be inevitable between two coalitions, and initiating violence preemptively might be the only morally permissible action. People who are being mistreated might find that violence is their only way out of an unacceptable situation.

    The initiation of violence, especially violence that risks destroying something irreplaceable [LW · GW], should generally be a last resort. But a legible willingness to engage in destructive conflict can help to keep these branches of the game tree purely counterfactual. Actual reality can end up more peaceful through "conditional nonviolence" than through "unconditional nonviolence." FairBot elicits better treatment than CooperateBot.
