# Robust Cooperation in the Prisoner's Dilemma

post by orthonormal · 2013-06-07T08:30:25.557Z · score: 73 (71 votes) · LW · GW · Legacy · 146 comments## Contents

Background: Writing programs to play the PD with source code swap Previously known: CliqueBot and FairBot Generalization: Modal Agents PrudentBot: The best of both worlds? Results: Obstacles to Optimality Why is this awesome, and what's next? Further Reading on LessWrong None 146 comments

I'm proud to announce the preprint of Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic, a joint paper with Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire (me), and Eliezer Yudkowsky.

This paper was one of three projects to come out of the 2nd MIRI Workshop on Probability and Reflection in April 2013, and had its genesis in ideas about formalizations of decision theory that have appeared on LessWrong. (At the end of this post, I'll include links for further reading.)

Below, I'll briefly outline the problem we considered, the results we proved, and the (many) open questions that remain. Thanks in advance for your thoughts and suggestions!

## Background: Writing programs to play the PD with source code swap

(If you're not familiar with the Prisoner's Dilemma, see here.)

The paper concerns the following setup, which has come up in academic research on game theory: say that you have the chance to write a computer program **X**, which takes in one input and returns either *Cooperate* or *Defect*. This program will face off against some other computer program **Y**, but with a twist: **X** will receive the source code of **Y** as input, and **Y** will receive the source code of **X** as input. And you will be given your program's winnings, so you should think carefully about what sort of program you'd write!

Of course, you could simply write a program that defects regardless of its input; we call this program **DefectBot**, and call the program that cooperates on all inputs **CooperateBot**. But with the wealth of information afforded by the setup, you might wonder if there's some program that might be able to achieve mutual cooperation in situations where **DefectBot** achieves mutual defection, without thereby risking a sucker's payoff. (Douglas Hofstadter would call this a perfect opportunity for superrationality...)

## Previously known: CliqueBot and FairBot

And indeed, there's a way to do this that's been known since at least the 1980s. You can write a computer program that knows its own source code, compares it to the input, and returns *C* if and only if the two are identical (and *D* otherwise). Thus it achieves mutual cooperation in one important case where it intuitively ought to: when playing against itself! We call this program **CliqueBot**, since it cooperates only with the "clique" of agents identical to itself.

There's one particularly irksome issue with **CliqueBot**, and that's the fragility of its cooperation. If two people write functionally analogous but syntactically different versions of it, those programs will defect against one another! This problem can be patched somewhat, but not fully fixed. Moreover, mutual cooperation might be the best strategy against some agents that are not even functionally identical, and extending this approach requires you to explicitly delineate the list of programs that you're willing to cooperate with. Is there a more flexible and robust kind of program you could write instead?

As it turns out, there is: in a 2010 post on LessWrong, cousin_it introduced an algorithm that we now call **FairBot**. Given the source code of **Y**, **FairBot** searches for a proof (of less than some large fixed length) that **Y** returns *C* when given the source code of **FairBot**, and then returns *C* if and only if it discovers such a proof (otherwise it returns *D*). Clearly, if our proof system is consistent, **FairBot** only cooperates when that cooperation will be mutual. But the really fascinating thing is what happens when you play two versions of **FairBot** against each other. Intuitively, it seems that *either* mutual cooperation or mutual defection would be stable outcomes, but it turns out that if their limits on proof lengths are sufficiently high, they will achieve mutual cooperation!

The proof that they mutually cooperate follows from a bounded version of Löb's Theorem from mathematical logic. (If you're not familiar with this result, you might enjoy Eliezer's Cartoon Guide to Löb's Theorem, which is a correct formal proof written in much more intuitive notation.) Essentially, the asymmetry comes from the fact that both programs are searching for the same outcome, so that a short proof that one of them cooperates leads to a short proof that the other cooperates, and vice versa. (The opposite is not true, because the formal system can't know it won't find a contradiction. This is a subtle but essential feature of mathematical logic!)

## Generalization: Modal Agents

Unfortunately, **FairBot** isn't what I'd consider an ideal program to write: it happily cooperates with **CooperateBot**, when it could do better by defecting. This is problematic because in real life, the world isn't separated into agents and non-agents, and any natural phenomenon that doesn't predict your actions can be thought of as a **CooperateBot** (or a **DefectBot**). You don't want your agent to be making concessions to rocks that happened not to fall on them. (There's an important caveat: some things have utility functions that you care about, but don't have sufficient ability to predicate their actions on yours. In that case, though, it wouldn't be a true Prisoner's Dilemma if your values actually prefer the outcome (*C*,*C*) to (*D*,*C*).)

However, **FairBot** belongs to a promising class of algorithms: those that decide on their action by looking for short proofs of logical statements that concern their opponent's actions. In fact, there's a really convenient mathematical structure that's analogous to the class of such algorithms: the modal logic of provability (known as GL, for Gödel-Löb).

So that's the subject of this preprint: **what can we achieve in decision theory by considering agents defined by formulas of provability logic?**

More formally *(skip the next two paragraphs if you're willing to trust me)*, we inductively define the class of "modal agents" as formulas using propositional variables and logical connectives and the modal operator (which represents provability in some base-level formal system like Peano Arithmetic), of the form , where is fully modalized (i.e. all instances of variables are contained in an expression ), and with each corresponding to a fixed modal agent of lower rank. For example, **FairBot** is represented by the modal formula .

When two modal agents play against each other, the outcome is given by the unique fixed point of the system of modal statements, where the variables are identified with each other so that represents the expression , represents , and the represent the actions of lower-rank modal agents against and vice-versa. (Modal rank is defined as a natural number, so this always bottoms out in a finite number of modal statements; also, we interpret outcomes as statements of provability in Peano Arithmetic, evaluated in the model where PA is consistent, PA+Con(PA) is consistent, and so on. See the paper for the actual details.)

The nice part about modal agents is that there are simple tools for finding the fixed points without having to search through proofs; in fact, Mihaly and Marcello wrote up a computer program to deduce the outcome of the source-code-swap Prisoner's Dilemma between any two (reasonably simple) modal agents. These tools also made it much easier to prove general theorems about such agents.

## PrudentBot: The best of both worlds?

Can we find a modal agent that seems to improve on **FairBot**? In particular, we should want at least the following properties:

- It should be un-exploitable: if our axioms are consistent in the first place, then it had better only end up cooperating when it's mutual.
- It should cooperate with itself, and also mutually cooperate with
**FairBot**(both are, common-sensically, the best actions in those cases). - It should defect, however, against
**CooperateBot**and lots of similarly exploitable modal agents.

It's nontrivial that such an agent exists: you may remember the post I wrote about the Masquerade agent, which is a modal agent that does *almost* all of those things (it doesn't cooperate with the original **FairBot**, though it does cooperate with some more complicated variants), and indeed we didn't find anything better until after we had Mihaly and Marcello's modal-agent-evaluator to help us.

But as it turns out, there is such an agent, and it's pretty elegant: we call it **PrudentBot**, and its modal version cooperates with another agent **Y** if and only if (there's a proof in Peano Arithmetic that **Y** cooperates with **PrudentBot** and there's a proof in PA+Con(PA) that **Y** defects against **DefectBot**). This agent can be seen to satisfy all of our criteria. But is it *optimal* among modal agents, by any reasonable criterion?

## Results: Obstacles to Optimality

It turns out that, even within the class of modal agents, it's hard to formulate a definition of optimality that's actually true of something, and which meaningfully corresponds to our intuitions about the "right" decisions on decision-theoretic problems. (This intuition is not formally defined, so I'm using scare quotes.)

There are agents that give preferential treatment to **DefectBot**, **FairBot**, or even **CooperateBot**, compared to **PrudentBot**, though these agents are not ones you'd program in an attempt to win at the Prisoner's Dilemma. (For instance, one agent that rewards **CooperateBot **over **PrudentBot** is the agent that cooperates with **Y** iff PA proves that **Y** cooperates against **DefectBot**; we've taken to jokingly calling that agent **TrollBot**.) One might well suppose that a modal agent could still be optimal in the sense of making the "right" decision in every case, regardless of whether it's being punished for some other decision. However, this is not the only obstacle to a useful concept of optimality.

The second obstacle is that any modal agent only checks proofs at some finite number of levels on the hierarchy of formal systems, and agents that appear indistinguishable at all those levels may have obviously different "right" decisions. And thirdly, an agent might mimic another agent in such a way that the "right" decision is to treat the mimic differently from the agent it imitates, but in some cases one can prove that no modal agent can treat the two differently.

These three strikes appear to indicate that if we're looking to formalize more advanced decision theories, modal agents are too restrictive of a class to work with. We might instead allow things like quantifiers over agents, which would invalidate these specific obstacles, but may well introduce new ones (and certainly would make for more complicated proofs). But for a "good enough" algorithm on the original problem (assuming that the computer will have lots of computational resources), one could definitely do worse than submit a finite version of **PrudentBot**.

## Why is this awesome, and what's next?

In my opinion, the result of Löbian cooperation deserves to be published for its illustration of Hofstadterian superrationality in action, apart from anything else! It's *really cool* that two agents reasoning about each other can in theory come to mutual cooperation for genuine reasons that don't have to involve being clones of each other (or other anthropic dodges). It's a far cry from a practical approach, of course, but it's a start: mathematicians always begin with a simplified and artificial model to see what happens, then add complications one at a time.

As for what's next: First, we don't *actually* know that there's no meaningful non-vacuous concept of optimality for modal agents; it would be nice to know that one way or another. Secondly, we'd like to see if some other class of agents contains a simple example with really nice properties (the way that classical game theory doesn't always have a pure Nash equilibrium, but always has a mixed one). Thirdly, we might hope that there's an actual implementation of a decision theory (TDT, UDT, etc) in the context of program equilibrium.

If we succeed in the positive direction on any of those, we'd next want to extend them in several important ways: using probabilistic information rather than certainty, considering more general games than the Prisoner's Dilemma (bargaining games have many further challenges, and games of more than two players could be more convoluted still), etc. I personally hope to work on such topics in future MIRI workshops.

## Further Reading on LessWrong

Here are some LessWrong posts that have tackled similar material to the preprint:

- AI cooperation in practice, cousin_it, 2010
- Notion of Preference in Ambient Control, Vladimir_Nesov, 2010
- A model of UDT with a halting oracle, cousin_it, 2011
- Formulas of arithmetic that behave like decision agents, Nisan, 2012
- A model of UDT without proof limits, An example of self-fulfilling spurious proofs in UDT, Löbian cooperation, version 2, Bounded versions of Gödel's and Löb's theorems, cousin_it, 2012
- Predictability of decisions and the diagonal method, Consequentialist formal systems, Vladimir_Nesov, 2012
- Decision Theories: A Semi-Formal Analysis: Part 0 (A LessWrong Primer), Part 1 (The Problem with Naive Decision Theory), Part 2 (Causal Decision Theory and Substitution), Part 3 (Formalizing Timeless Decision Theory), Part 3.5 (Halt, Melt, and Catch Fire), Part 3.75 (Hang On, I Think This Works After All), orthonormal, 2012

## 146 comments

Comments sorted by top scores.

Incidentally, at the MIRI workshop, we soon started referring to the source-code-swap Prisoner's Dilemma between two modal agents as "modal combat".

Or "MODAL COMBAT!", if you prefer.

**[deleted]**· 2013-06-23T09:09:01.259Z · score: 4 (4 votes) · LW(p) · GW(p)

That's extremely un-Googleable, as I realized when reading about modal combat in a different thread and trying to figure out what the hell that was about.

Fascinating paper. Will there be a release of the code used? I would like to be able to play these games and tinker with the code myself.

Hacking is believing...

I think you should put out a game called Modal Combat!

One downside of having a bot that's too complicated is that it makes the other bot less likely to trust you.

An interesting question is to what extent a similar phenomenon is present in human relationships.

Well, not at all for the literal complexity of agents, because we don't estimate the complexity of our peers. Aristotle thought the heart was the seat of intelligence, Shannon thought AGI could be built in a year, everyone and their mother anthropomorphizes inanimate objects like smoke alarms and printers.

I suspect perceived character traits that engender distrust, the Dark Triad traits, make the trait-possessor seem complex not because their brain must be described in more bits absolutely, but conditionally given the brain of the character judge. That is, we require a larger encoding diff to predict the behavior of people who display unfamiliar desires and intents, or to predict them with comparable accuracy as one does for one's warm, honest, emotionally stable peer group. For example, someone who appears paranoid is displaying extreme caution in situations the character judge finds forthright and nonthreatening, an extra piece of situational context to the other person's decision making.

This is a poor explanation overall because we're much less likely to distrust atypically nice humane people than Machiavellian, sub-psychopath people, even if they're both less conditionally compressible. It takes a *lot* of niceness (Stepford Wives) before the uncanny-differential-encoding-valley reaction trips.

Edit: This might have been uncharitable. People who are more prone to lying may be more absolutely complex, because lying skillfully requires keeping track of ones lies and building further lies to support them, while honest beliefs can simply be verified against reality. People who decide by a few fixed, stable criteria (e.g. always voting for the nominated candidate of their political party) might be called trustworthy in the sense of being reliable (if not reliably pro-social). Fulfilling promises and following contracts also make one more trustworthy, in both the weak sense of predictability and the stronger sense of moral behavior. Yudkowsky makes the argument that moral progress tends to produce simplified values.

When people make purchasing decisions, pricing models that are too complex make them less likely to purchase. If it's too confusing to figure out whether something is a good deal or not, we generally tend to just assume it's a bad deal. See http://ideas.repec.org/p/ags/ualbsp/24093.html (Choice Environment, Market Complexity and Consumer Behavior: A Theoretical and Empirical Approach for Incorporating Decision Complexity into Models of Consumer Choice), for example.

If I recall correctly, there's a mention in Axelrod's *Evolution of Cooperation* of bots which did worse than random because they were so complicated.

That depends on the other bot, doesn't it?

For bots searching for formal proofs about your behavior, in order of increasing proof length (in bounded time), being more complex makes it less likely that they will find such a proof. And makes it more likely that they will find it later rather than sooner.

Consider the heuristic that more complicated behavior is more likely to be someone trying for complicated behavior.

In some competitions, "cooperate against agents that are very long" might be a better default position than "defect" when the search fails.

That seems to imply that if the desired behavior is more complex, it is more likely to result in cooperation with you. Why?

Also, such a default strategy would be wide open to exploitation: just make a program that (when played against yours) looks very complex, and you can't prove anything about it, but that's all just obfuscation and in the end it reliably defects.

What are the odds that a system that reliably defects will not have a proof that it defects, but that a system that cooperates if it is cooperated with will have no proof.

See the agents called "WaitDefectBot" and "WaitFairBot" in the section on optimality in the paper. For any modal agent, there's a pair that are undecidable in high enough formal systems to be indistinguishable, such that the right move would be cooperation against one and defection against the other. And you can make them both arbitrarily complex if you like.

So no, it would not be a good idea in general to cooperate with agents that are undecidable to you, and it would only incentivize agents to be undecidable toward you (and then defect).

This does have the problem that in practice it cashes out as defect against anyone who is sufficiently smarter than you.

It depends on how they use their intelligence. For instance, consider the variant of FairBot that appears in the last Masquerade post, which oscillates between seeking proofs of its opponent's cooperation and defection, up to a high threshold.

The original FairBot, or PrudentBot, or Masquerade, all reach mutual cooperation with this variant (call it ToughButFairBot), despite its undecidability *in general* up to PA+N. That's because if it finds a proof at a lower level, it acts on that immediately, so you can figure out what it does in those particular cases without climbing too high up the tower of formal systems.

The upshot is that you can have higher levels of strategy when necessary, without sacrificing your ability to provably act on lower levels in other cases. So this principle doesn't cash out as "defect against anyone smarter than you", but rather as "defect against anyone who refuses to let you figure out how they're going to respond to you in particular".

It's less a matter of *having* a proof (or not having one), and more a matter of having a sufficiently short proof (or simple by other criteria) that your adversary will find it before they give up.

In general, I think any algorithm with behavior X and a proof has another algorithm with identical behavior and no proof of it. But here we're interested in algorithms that are chosen (by their makers) to be deliberately easy to prove things about.

What are the odds that a system that reliably defects will not have a proof that it defects,

It won't want others to prove it will defect. So if others cooperate when unable to prove either way, then it can always construct itself to make it unproveable that it defects.

that a system that cooperates if it is cooperated with will have no proof.

It will want to have a proof. Here the question is, how good the adversary is at finding proofs in practice. Which is what I asked elsewhere, and I would really like to have an answer.

However, perhaps the system knows its own proof (or its designer knew it). Then it can just write out that proof at the start of its source code where the adversary is sure to see it.

In particular I was thinking how the bots discussed here would play against the bots being discussed below the other PD post.

Wow, MIRI is just churning out [1] math papers now.

[1]: Using this expression entirely in a positive sense.

Actually, "MIRI is just churning out writeups of research which happened over a long preceding time", which I'd emphasize because I think there's some sort of systematic "science as press release bias" wherein people think about science as though it occurs in bursts of press releases because that's what they externally observe. Not aimed at you in particular but seems like an important general observation. We couldn't do twice as much research by writing twice as many papers.

I don't have any background in automated proofs; could someone please outline the practical implications of this, considering the algorithmic complexity involved?

How effective would a real PrudentBot be if written today? What amount of complexity in opponent programs could it handle? How good are proof-searching algorithms? How good are computer-assisted humans searching for proofs about source code they are given?

Are there formal languages designed so that if an agent is written in them, following certain rules with the intent to make its behavior obvious, it is significantly easier (or more probable) for a proof-finding algorithm to analyze its behavior? Without, of course, restricting the range of behaviors that can be expressed in that design.

Algorithmic proof search in general is NP-hard. Special cases can be simpler: i.e. if two modal agents are playing against each other, they could employ Kripke semantics (which may also be NP-hard, but has *much* smaller computational requirements in practice than searching through all proofs).

(Technically, you'd define a class of ModalPrime agents, which begin with code to see if their opponent is syntactically a ModalPrime agent, then use Kripke semantics to see what the outcome of their modal statement versus the other modal statement would be. Or you could have Omega accept modal statements as inputs, and then Omega would only need to use Kripke semantics to resolve the matches in the tournament.)

Long story short, I wouldn't submit a "search through proofs" modal algorithm into the PD tournament currently being run: it would always run out of time before finding Lobian proofs.

If search is hard, perhaps there is another way.

Suppose an agent intends to cooperate (if the other agent cooperates, etc.). Then it will want the other agent to be able to prove that it will cooperate. If it knew of a proof about itself, it would advertise this proof, so the others wouldn't have to search. (In the original tournament setup, it could embed the proof in its source code.)

An agent can't spend much time searching for proofs about every agent it meets. Perhaps it meets new agents very often. Or perhaps there is a huge amount of agents out there, and it can communicate with them all, and the more agents it cooperates with the more it benefits.

But an agent has incentive to spend a lot of time finding a proof about itself - once - and then it can give that proof to counterparties it wants to cooperate with.

(It's been pointed out that this could give an advantage to families of similar agents that can more easily prove things about one another by using existing knowledge and proofs).

Finally, somewhere there may be someone who created this agent - a programming intelligence. If so, that someone with their understanding of the agent's behavior may be able to create such a proof.

Do these considerations help in practice?

We only need to perform proof search when we're given some unknown blob of code. There's no need to do a search when we're the ones writing the code; we know it's correct, otherwise we wouldn't have written it that way.

Admittedly many languages allow us to be very sloppy; we may not have to justify our code to the compiler and the language may not be powerful enough to express the properties we want. However there are some languages which allow this (Idris, Agda, Epigram, ATS, etc.). In such languages we don't actually write a program at all; we write down a constructive proof that our properties are satisfied by some program, then the compiler derives that program from the proof. Such programs are known as "correct by construction".

http://en.wikipedia.org/wiki/Program_derivation http://en.wikipedia.org/wiki/Proof-carrying_code

I don't have any background in automated proofs; could someone please outline the practical implications of this, considering the algorithmic complexity involved?

Probably pretty low. Most real-world cooperation involves genetic or memetic kin or repetition or reputations. It is likely to be rare to meet a stranger *and* have reliable access to their source code.

This is true for meat-based evolution. In a software-based world, it can be trivial to prove what your source code is, and sometimes it's advantageous too.

For instance, software might be distributed with its code, advertised as doing something. The software benefits if it can convince you that it will do that something ("cooperate"), because in exchange, you run it.

An em that's good at some kind of work might want to convince you to run copies of it: it gets to live, you get useful work done. But you're afraid that it will spy on you or sabotage its work. In a world of ems, the assumption is that nobody knows how to write superhuman (superintelligent) AIs, so we emulate human brains. That means you can't just inspect an em's software to determine what it's going to do, because you don't really understand how it works.

It would be nice if the em could provide a formal proof that in a certain scenario it would behave in a certain way (cooperate), but for software of an em's complexity that's probably never going to be possible.

For instance, software might be distributed with its code, advertised as doing something. The software benefits if it can convince you that it will do that something ("cooperate"), because in exchange, you run it.

Yes. Open source software does derive some benefits from source code transparency. Today, though, reputations - and technical measures like sandboxes - are the main ways of dealing with the issue of running software from external sources.

Open source (as the term is usually used) is not *technically* necessary. You could have proofs about native binaries, too. The real distinction is between software you have the code for (the running code, not necessarily the source code) - and software that runs out of your control, you don't know its code, and you communicate with it remotely.

Sandboxes are good but suffer from the "boxed AI" problem, so it's nice to have defense in depth.

This is true for meat-based evolution. In a software-based world, it can be trivial to prove what your source code is, and sometimes it's advantageous too.

So hide the malicious behavior in hardware or in the compiler.

I was thinking of a scenario where the verifier controls the hardware and the compiler.

An em that's good at some kind of work might want to convince you to run copies of it

E.g. it sends you its source code to run on your own hardware because it wants to work for you.

So hide it in some other computer you interact with.

Concerns about what I do on my own computer are separate from proofs about the software I send to you - they might still be valid, and if so, trusted and influential.

Yes, I could achieve malicious ends by outsmarting you: I could give you software that does exactly what it says (and a formal proof of that), but through its seemingly innocent interactions with other servers (which are perfectly valid under that proof), actually create a malicious effect you haven't foreseen. But that concern doesn't invalidate the usefulness of the proof.

Patrick, congrats on writing this up! It's nice to see MIRI step up its game.

Will two PrudentBots cooperate if they're using theories of different strength?

Yes! We checked that both formally and using the fixed-point evaluator. I should add that to the draft.

**EDIT:** Whoops, it's a little more complicated than I remember. They find mutual cooperation iff the system PrudentBot1 uses to prove (its opponent defects against DefectBot) is strong enough to prove the consistency of the system PrudentBot2 uses to prove (its opponent cooperates with it), and vice versa.

OK, I see. So, unlike FairBots, a PrudentBot using PA and PA+1 won't cooperate with a PrudentBot using PA+1 and PA+2. I wonder if that can be fixed?

If I were submitting a modal agent, I'd probably use a PrudentBot that uses PA to look for mutual cooperation but PA+N (for some large N) to look for defection against DefectBot. There's not a simple downside to that variant.

In particular, Prudent(0,100)Bot finds mutual cooperation with Prudent(0,1)Bot, Prudent(99,1000)Bot, etc.

If I were submitting a modal agent, I'd probably use a PrudentBot that uses PA to look for mutual cooperation but PA+N (for some large N) to look for defection against DefectBot.

Or breakout the stronger systems, e.g., ZFC possibly with some inaccessible cardinals thrown in for good measure.

That would be even better in practice, but it wouldn't be expressible in the modal formalism.

I found situation where **PrudentBot** behaves suboptimally and a more intelligent agent can do better. I'm considering an opponent I call pseudo tit-for-tat at level N, or **PsTitTatBot[N]**. **PsTitTatBot[0]** is simply **CooperateBot**. For N>0, the strategy for **PsTitTatBot[N]** is to simulate its opponent running against **PsTitTatBot[N-1]**, and do the same thing its opponent did in the simulation. **PrudentBot** would have an outcome of (D,D) against **PsTitTatBot[N]** for any N. However, it's easy for an agent to ensure that against sufficiently high level tit-for-tat agent the outcome will be (C,C). To do this, there must be a specific N such that this agent plays suboptimally against **PsTitTatBot[N]**. However, the benefit of mutual cooperation at higher levels should compensate for that.

Indeed! This is a nice example of the "agents punishing you for playing correctly against other agents" obstacle to optimality.

But unless I expected PsTitTatBot[N] for N>0 to be more common than CooperateBots (which I think of as naturally occurring in reality), I'm OK with defecting against all of them.

I agree that it is best to defect against CooperateBot. However, the case for PsTitTatBot[630] is far from clear; when your opponent is that you should seriously consider the possibility that you are being simulated. Thinking about this some more, there are other strategies, for example, cooperating for 5<N<1000 but defecting for N=1000 based on the belief that the environment is likely to contain PsTitTatBot[1000] (although if you think PsTitTatBot[1000] is unusually likely, all you really need to do is cooperate against PsTitTatBot[999]). So I'm not certain at all what the best response is to a pseudo tit-for-tat agent. However, this situation provides a testing ground for agents that could be stronger than PrudentBot.

One possibility is to recognise such self-simulating bots (possibly by recognising that they simulate their opponent against a slightly modified version of themselves), finding their opponent's N-value, and cooperating if N is odd, acting as Prudentbot in all other cases.

Such a bot will likely defect against 50% of PsTitTatBot[N] bots when the PsTitTatBot[N] cooperates (i.e. where N is even). It will defect against CooperateBot. Unfortunately, against the other 50% of PsTitTatBots (i.e. where N is odd) it will cooperate while the PsTitTatBot defects.

Whether this does better than PrudentBot or not depends on the exact payoff matrix; is the cost of cooperating against a defection half the time worth the benefit of defecting against a cooperation the other half the time, as compared to mutual defection all the time?

Another possible strategy is simply to defect against CooperateBot, but cooperate with all PsTitTatBot[N] where N>0. This will lose against PsTitTatBot[1] *only*, and achieve mutual cooperation with all higher levels of PsTitTatBot. Again, whether this is better than the above or not depends on the exact payoff matrix. (This one is guaranteed to do better than PrudentBot, assuming few examples of PsTitTatBot[1] and many examples of PsTitTatBot[N] for N>1)

To recognize what?

What you need to recognize is bots that are simulated by other bots. Consider the pair of BabyBear, which does whatever, and MamaBear, which cooperates with you if and only if you cooperate with BabyBear. Estimating the ratio of MamaBears to any particular BabyBear is an empirical question.

What seems problematic about these strategies is that they are not evolutionarily stable. In a world filled with those bots, PsTitTat bots would proliferate and drive them out. That hardly seems optimal!

One simple way to study this is the game where one player chooses PsTitTatBot[N] for some N, and another player chooses any bot. Since this game has an infinite number of strategies, Nash equilibria are poorly behaved. Still, complete mutual cooperation occurs in some Nash equilibria. I have not completely studied which other possibilities can occur.

Stupid question: are there clear examples of any real-life events that would actually correspond to classic, one-shot Prisoner's Dilemma? I *thought* that I knew of plenty of examples before, but when I was challenged to name some, all the ones I could think of failed some of the conditions, like being tragedies of the commons rather than PD, or being iterated PD rather than one-shot PD, or the players being able to adapt their plans based on the way the opponent reacts instead of it being a single all-or-nothing choice (arms races are sometimes held up as an example of PD, but there both sides have access to intelligence reports on the kind of strategy that the enemy seems to be adopting, and can adjust their strategy accordingly).

Of course, one can always construct hypothetical scenarios in a way that makes them classic one-shot PD by construction, but being unable to name any clear real-life examples would seem to suggest that this particular scenario isn't that interesting to focus on.

The point of using the one-shot Prisoner's Dilemma (and variants like this one) isn't actually that it's a realistic approximation of real-life phenomena, so much as that it's one of the simplest decision-theory problems that standard game theory looks suboptimal for, and so it's a good proving ground for further development. (Think of it like the assumption that objects in kinetic physics are frictionless.)

Similarly, game theory started with assumptions of perfect common knowledge of payoff matrices, which didn't model real-life situations all that well, but the theory developed there was later extended to more realistic setups in politics and business.

That being said, I'm also curious to know if there are real-world examples that model one-shot Prisoner's Dilemmas reasonably well.

Sounds awesome, great job !

But there is something I didn't get : what does Con() mean in PA+Con(PA) ? Maybe it's stupid question and everyone is supposed to know, but I don't remember ever encountering that symbol, and I can't find it's meaning, neither on Wikipedia nor on mathworld...

It means consistent. PA + Con(PA) means PA together with the axiom that PA is consistent (which can be expressed in PA because PA can express "PA proves X," so "PA is consistent" can be expressed as "PA does not prove a contradiction").

Actually, "not proving a falsehood" is not the same as being consistent; assuming that PA is consistent, the theory PA+~Con(PA) is also consistent, but proves the false statement ~Con(PA). Consistency is the weaker condition of not proving both a formula and its negation.

I should have said "contradiction"; edited. I intended "falsehood" to mean "false in all models," not "false in the standard model."

Stupid question... Can one construct **ImpudentBot**, specifically designed to defeat the analysis done by **PrudentBot**, say, by forcing it to never halt or something? Or **PrudeBot**, which resists all attempts at uncovering its true appearance?

PrudentBot halts on all inputs. More accurately, I should say, the modal form of PrudentBot has an output that's always decidable in PA+2, while the algorithm version only looks for proofs up to a certain pre-specified length, and thus always halts. In either case, PrudentBot defects by default if it cannot prove anything.

So ImpudentBot is not possible, and PrudeBot earns a big fat D from PrudentBot.

Incidentally, this is the big distinction between "programs that try and run other programs' code" and "programs that try to prove things about other programs' output": Löb's Theorem allows you to avoid the danger of infinite regress.

Thanks! So, were you given a task of defeating PrudentBot, where would you start?

I wouldn't. PrudentBot is unexploitable, so the best thing I could do against it is achieve mutual cooperation.

But that doesn't mean PrudentBot is optimal in any strong sense, because the same statement above is true about FairBot as well, and FairBot wastes utility by mutually cooperating with CooperateBot. PrudentBot makes similar miscalculations, but they're all against complicated agents that you wouldn't realistically submit, and the mistakes are of the form "getting (D,D) when I could have gotten (C,C)" and "getting (C,C) when I could have gotten (D,C)".

the mistakes are of the form "getting (D,D) when I could have gotten (C,C)" and "getting (C,C) when I could have gotten (D,C)".

Interesting. How hard is it to construct this anti-prudent bot?

Relatedly, so the only way for PrudentBot to lose an open-source tournament is when there are many imprudent entries against which some other algorithm can get (D,C) where PrudentBot gets (C,C)?

You can't get more points than PrudentBot in one-to-one combat, but you can try doing better against the *environment* (other players).

There is always an environment where agent **X** gets more points than PrudentBot. Specifically, an environment of **ServantOfX** bots that recognize the source code of *X*, cooperate with *X*, and defect against everyone else.

Technically, there is a difference between "maximizing your score" and "beating PrudentBot". (Focusing on "beating PrudentBot" makes it a zero-sum game.) If you *know in advance* that there will be more copies of you than copies of PrudentBot, you could just decide to defect against PrudentBot, which will harm it *more* than it will harm you. But it *will* harm you. You will defeat PrudentBot, but get less utility than you could get otherwise. (You will prove suboptimality of PrudentBot by becoming suboptimal yourself.)

Actually, you prove that there does not exist an optimal strategy for the game "Score more points that everybody else in a given competition." The game "Score the maximum number of points possible" is subtly different.

The "maximum number of points" optimizing strategy maximizes the number of points in some *average* environment. In a sufficiently perverse environment, it will score low.

So, perhaps the question is how to define the "average" environment. Is it an environment containing bots with probability corresponding to their (Kolmogorov) complexity of code?

I was going to say that there was a program which would score more than any other program in any given environment.

But when I write that out, it's trivial to falsify.

There does not exist a strategy which is dominant in all environments.

Suppose we decide that proving systems are too (**NP**) hard and we want to restrict agents to polynomial time. If an agent has access to a random coin, can it interactively prove that its opponent is cooperative?

I stumbled across an interesting class of agent while attempting to design a probabilistically proving bot for AlexMennen's contest. The agent below always cooperates with opponents who always cooperate back, (almost) always defects against opponents who always defect against them, and runs in probabilistic polynomial time if its opponent does as well.

```
def mimic_bot(opponent):
if random() > ε:
my_source = quine_my_source()
return eval(opponent, my_source) # do as your opponent does to you
else:
# unlikely, but necessary to break out of recursive simulations in polynomial time
return COOPERATE
```

This particular agent is a bit naïve, but more sophisticated versions (e.g ones that do not cooperate with CooperateBot) are possible. Does anyone have any insight on probabilistically proving opponents will cooperate?

An interesting idea! One problem is that it doesn't halt (with probability 1 - epsilon) if the opponent doesn't halt, and I don't see a way to fix this while keeping its nice properties. (But there might well be a suitable fix!)

Proposed method to make MimicBot halt:

If eval() takes more than 1,000,000,000 * n^3 steps (where n is the length of the input programs), eval aborts and returns an error flag. MimicBot defects if the simulation returned an error flag.

Programming eval to count its search steps will add overhead, and that overhead could increase exponentially as the stack of simulations grows deeper. But that's OK! The expected depth of simulations depends on ε, not *n*, and ε is a constant. MimicBot still halts in polynomial time.

"But wait!" I hear you protest. "If the outermost program kills simulations after a fixed number of cycles, then an inner simulation will find that its world ends before the simulation was expecting the world to end in its subjective timeline. That means simulations aren't completely faithful mimics."

That is true, but aborting the simulation should not be the usual case. Most simulations will take less than 1,000,000,000 * n^3 steps—only the dawdling or non-halting agents will have their simulations aborted.

The idea looks similar to scenario 2 from my first PD post ;-)

I'm coming to this quite late, and these are the notes I wrote as I read the paper, before reading the comments, followed by my notes in the comments.

Has any of the (roughest) analysis been done to bound proving time for PrudentBot? It should be fairly trivial to get very bad bounds and if those are not so bad that seems like a worthwhile thing to know (although what does "not so bad" really mean I guess is a problem.)

Is there a good (easy) reference for the statement about quining in PA (on page 6 below CliqueBot)? Under modal agents Kleene's recursion algorithm is mentioned; should probably be mentioned at first use rather than second?

The proof of theorem 5.1 is losing me where other proofs didn't; I don't understand how minimality of modal rank gives other results.

Reading the comments to answer some of my own questions: orthonormal seems to believe that PrudentBot couldn't be implemented for the LessWrong PD competition, although he did say with algorithmic proof search, would he change his opinion using Kripke semantics?

Vaniver's comment seems like a promising way of thinking about how to identify TrollBots. Or at least, the most promising thing about it I've seen; has this been pursued any further? Perhaps as a possibly-answerable question: call a bot a troll if it can't survive in a population with only itself or only itself and CB and DB (since these are the simplest bots to describe). I'll also note that PseudoTitForTatBot seems like an interesting basic candidate to play against; or say "TimeLimitTitForTatBot" which plays tit for tat until the time limit is near, so no single N can always fool it. Is there a way to (modally; if that is the right word for not-by-quoting-source-code) defeat TLTFTB?

Overall I still have no understanding of theorem 5.1 though. I'm not terribly familiar with the field in general but the other proofs were still fairly straightforward, whereas this proof loses me in the first sentence, without referencing a result I can look up either inside or outside of the paper. Was this written by a different person/at a different time from the rest of the paper? The second sentence is less bad than the first but still worse than the entire rest of the paper.

Haskell code to run PrudentBot (quickly) and the other bots (also quickly) can be found on github here:

On quining in arithmetic, see any exposition on Gödel's First Incompleteness Theorem and the Wikipedia article on the diagonal lemma.

I am unsure which of my questions this is supposed to answer, although perhaps that will become clear on reading the wikipedia article.

That was a response to

Is there a good (easy) reference for the statement about quining in PA

Let me rephrase my question, then, because the diagonal lemma seems clear enough to me. What is a good definition of quining? The term isn't used at all either in the article you linked or it the page on self-reference, which surprised me.

Oh! Ok, a quine is

a computer program which takes no input and produces a copy of its own source code as its only output.

If you understand how to write a quine and you understand how the diagonal lemma works, you will find that the two concepts are basically the same thing. "Quine" is more of a computer science term than a mathematical logic term, so not all expositions of Gödel's incompleteness theorem will mention "quining".

Overall I still have no understanding of theorem 5.1 though. I'm not terribly familiar with the field in general but the other proofs were still fairly straightforward, whereas this proof loses me in the first sentence, without referencing a result I can look up either inside or outside of the paper.

Were you OK with the proof of Theorem 4.1? To me, that and the proof of Theorem 5.1 are of equal difficulty. (Some of the other authors had more experience with Kripke semantics than I did, so they did most of the editing of those proofs. They work better with diagrams.)

orthonormal seems to believe that PrudentBot couldn't be implemented for the LessWrong PD competition, although he did say with algorithmic proof search, would he change his opinion using Kripke semantics?

Yes; a PD tournament among modal sentences using the code Eliezer linked would be feasible and quite interesting!

I would say that "I am surprised that the bots have not been submitted to a PD tournament" but then I saw the paper was published in May and that's less than 6 months ago so instead I'll make the (silly, easy-to-self-fulfill) prediction that many or all of those bots will show up in the next PD tourney.

To the first point: I guess I'm not really comfortable with the proof of Theorem 4.1, per se, however the result seems incredibly intuitive to me. The choice of symbol/possible LaTeX error where one of the symbols is a square is confusing, and looking at it again three weeks later (I have not been on LW recently) I've forgotten too much of the notation to review it in depth in two to five minutes.

But I see Theorem 4.1 as the statement that "you can't stop someone from punishing you for something unless you think at a higher level than they do" and I assume that there exists a proof of that statement, and that the proof provided is such a proof.

I see Theorem 5.1 as saying that some set is compact for some reason and that implies existence of something for some reason, but I don't know why the thing is compact or why the desired object is the limit of the the things we constructed in the right way.

Although again it's been a while so I could be misremembering here.

The square is a symbol in Gödel-Löb provability logic.

The nice part about modal agents is that there are simple tools for finding the fixed points without having to search through proofs; in fact, Mihaly and Marcello wrote up a computer program to deduce the outcome of the source-code-swap Prisoner's Dilemma between any two (reasonably simple) modal agents. These tools also made it much easier to prove general theorems about such agents.

Would it be possible to make this program publicly available? I'm curious about how certain modal agents play against each other, but struggling to caculate it manually.

I have two proposed alternatives to PrudentBot.

If you can prove a contradiction, defect. Otherwise, if you can prove that your choice will be the same as the opponent's, cooperate. Otherwise, defect.

If you can prove that, if you cooperate, the other agent will cooperate, and you can't prove that if you defect, the other agent will cooperate, then cooperate. Otherwise, defect.

Both of these are unexploitable, cooperate with themselves, and defect against CooperateBot, if my calculations are correct. The first one is a simple way of "sanitizing" NaiveBot.

The second one is exactly cousin_it's proposal here.

If you can prove a contradiction, defect.

Should this be "If you can prove that you will cooperate, defect"? As it is, I don't see how this prevents cooperation with Cooperatebot, unless the agent uses an inconsistent system for proofs.

It kills the Lobian argument, I believe, since this implication "if there's a proof that you cooperate, then cooperate " is no longer true. Instead, here's a Lobian argument for defection:

Suppose there is a proof that you defect. Then either there is a proof of contradiction, or there is no proof that your move is the same as your opponent's. Either way, you defect.

Both of those agents are modal agents of rank 0 and so the fact that they defect against CooperateBot imply that FairBot defects against them by theorem 4.1.

Yes, this is problematic. But it's not clear to me that it's a problem for my bots, rather than for Fairbot. After all, one can equally say that they defect against FairBot.

Edit: I've thought more about why this is. The rational move against FairBot is to cooperate, unless FairBot's reasoning system is inconsistent, in which case FairBot is just CooperateBot, and the rational move is to defect. ADTBot rationally defects against inconsistent ones. Since Fairbot does not know its reasoning system is consistent, it defects. Since ADTBOt is unexploitable, it, too, defects. So FairBot is not quite so Fair as it seems - it unfairly punishes you for defecting against inconsistent FairBots.

That's an interesting take on it. My intuitions still strongly say that a good decision theory had better achieve mutual cooperation with FairBot, but I'll consider this more thoroughly.

Why?

Let FairBot_L be the FairBot function with proof bound L and let FairBot_L,N be the Nth program (according to an arbitrary Gödel numbering) that computes FairBot_L. Then

(FairBot_L,N; AnythingUnexploitableWhichMutuallyCooperatesWithFairBot_L,N) is payoff dominant Nash equilibrium.

But there are (infinitely) many other payoff dominant Nash equilibria not in this form, so there doesn't seem to be any good reason to prefer it.

In fact, equilibra involving FairBots (or even PrudentBots) are unstable: add some uncertainty and they drift away towards a stable equilibrium, which can be easily a sub-optimal equilibrium of mutually defecting bots.

On the other hand, equilibria involving CliqueBots (or generalized CliqueBots) are both payoff dominant and stable.

I don't follow. What kind of uncertainty are you introducing that you say will break cooperation between two FairBots but not between two CliqueBots? If you're thinking of an amended type of CliqueBot to handle such uncertainty, why not a similarly amended FairBot?

It's the other way round: FairBots (and PrudentBots) cooperate too much, and this causes instabilty.

A Nash equilibrium for a mixed strategy game is stable if a small change (specifically, an infinitesimal change) in probabilities for one player leads to a situation where two conditions hold: 1) the player who did not change has no better strategy in the new circumstance 2) the player who did change is now playing with a strictly worse strategy.

CliqueBots equilibria are stable:

Let CliqueBot_C be the CliqueBot for clique C. The pure stategy pair (CliqueBot_C, CliqueBot_C) is a Nash equilibrium.

If Player1 deviates from this strategy by playing CliqueBot_C with probability 1-epsilon and AnythingElseBot with probability epsilon, for small epsilon, then the optimal strategy for Player2 is still CliqueBot_C, and Player1 expected payoff is strictly worse than the equilibrium payoff, therefore the equilibrium is stable. QED.

(this analysis can be extended to generalized CliqueBots that recognize each other by an equivalence predicate weaker than textual equality but strong enough to guarantee functional equivalence. In this case, there is trivial instability within each clique, but the cliques themselves are stable)

Stability is important both in standard game theoretical settings, where players have some irreducible uncertainty about each others or in Evolutionary game theory, where there is a population of strategies that repeatedly face each other in memoryless games.

A population of a single clique of CliqueBots will preserve itself indefinitely in the face of mutations: any individual that deviates from the clique is immediately punished, therefore mutations fail to propagate.

What about the FairBots?

It's easily shown that FairBots equilibria are unstable. Thus, in an evolutionary setting, a population of FairBots will accumulate mutations until exploitable bots such as CooperateBots will appear. When the exploitable bots reach a certain mass, bots specialized to exploit them, such as DefectBots, will appear and proliferate, even if the remaining FairBots punish them. The attractor of this process is clearly the stable but suboptimal equilibrium of DefectBots (or other mutually defecting bots). (*)

Starting with a propulation of PrudentBots instead of FairBots doesn't make things better, since PrudentBots can (de)evolve into FairBots and anyway they fail to punish most types of exploitable bots.

Similar considerations can be made for the non-evolutionary setting.

( * the probability of falling into this attractor depends on the specific random walk dynamics. IIUC, under mild assumptions (e.g. bounded maximum program size) it will be 1 minus the probability of randomly reaching a clique)

One way I would think about PrudentBot is as not trying to approximate a decision theory, but rather a special consideration to the features of this particular format, where diverse intelligent agents are examining your source code. Rather than submitting a program to make optimal decisions, you submit a program which is simplified somewhat in a way that errs on the side of cooperation, to make it easier for people who are trying to cooperate with you.

But something about my reasoning is wrong, as it doesn't fit very well to the difference of the actual codes of ADTBot and PrudentBot.

Proposal 1 runs into the problem that it does not cooperate with itself if the two copies have slightly different bounds on proof lengths. Since if A cooperates, you can (with a not too long proof) show that B did not find a contradiction. But this contradicts the bounded version of the incompleteness theorem.

A similar problem holds for Proposal 2.

You can search for reasons to cooperate in a much stronger formal system than you search for reasons to defect in. Is there any decision-theoretic justification for this?

If you do that, you're back in the same situation that you started with and are cooperating with CooperateBot again.

This is clearly not true for proposal 2. No matter the formal system, you will find a proof (YouDefect => OpponentCooperate), and therefore defect.

It seems to me that TrollBot is sufficiently self-destructive that you are unlikely to encounter it in practice.

I wonder if there are heuristics you can use that would help you not worry too much about those cases.

If you presume that you're living in an iterated prisoner's dilemma with reproduction according to the payoff matrix, then you can argue that the percentage of TrollBots will decline to negligible for almost all diverse starting populations. So, one can discuss meaningful optimality in that sense, and demonstrate how PrudentBot will do better than FairBot if the population contains CooperateBots.

Actually, since this is a deterministic setup, you can go one better and consider an 'iterated tournament' as a difference equation in N-dimensional space, where N is the number of strategies you include, and the dimensions represent the proportion of the total population; then you can demonstrate the trajectories the demographics will take from any starting location. There will be a handful of point equilibria, as well as several equilibrium lines (actually, I think, an equilibrium volume, but this depends on which strategies you include), and you can talk about which equilbria are stable / unstable, and decide not to care about strategies who only exist in unstable equilibria. You probably need to require that the population space is seeded with CooperateBot, DefectBot, and possibly FairBot, in order to get neat results.

I wonder that too, but we haven't come up with anything satisfactory on a formal level despite working for a while. Anyone have a good idea?

This might be a practical problem rather than a mathematical / philosophical problem. Many human beings, for cultural/biological reasons, think certain strategies in various games of economic interaction are unfair in a basically arbitrarily manner. If you come across a group of unfamiliar intelligences, you might find that they make strategies which randomly punish certain strategies for no apparent (to you) reason. The likelihood of this happening is empircal/scientific, not philosophical.

Here is another obstacle to an optimality result: define UnfairBot as the agent which cooperate with X if and only if PA prove that X defect against UnfairBot, then no modal agent can get both FairBot and UnfairBot to cooperate with it.

Right! More generally, I claim that if a modal agent provably cooperates (in PA) against some other agent, then it cannot provably defect (in PA) against any other agent: roughly speaking, PA can never prove that any proof search in PA or higher fails, so a modal agent can only provably (in PA) do the action that it would do if PA were inconsistent.

trollDetector-a fundamental part of psychbot-gets both of these to cooperate.

TrollDetector tests opponents against DefectBot. If opponent defects, it cooperates. if opponent cooperates, TrollDetector defects.

So both UnfairBot and Fairbot cooperate with it, though it doesn't do so well against itself or DefectBot.

no

modal agentcan get both FairBot and UnfairBot to cooperate with it.

TrollDetector is not a modal agent.

hm. I'm still a bit shaky on the definition of modal agent...does the following qualify?

IF(opponentcooperates with me AND I defect is a possible outcome){defect} else{ if (opponent cooperates IFF i cooperate) (cooperate){else defect}

(edit: my comment about perfect unfair bots may have been based on the wrong generalizations from an infinite-case). addendum: if what I've got doesn't qualify as a model agent I'll shut up until I understand enough to inspect the proof directly.

addendum 2: well. alright then I'll shut up.

What do you even mean by "is a possible outcome" here? Do you mean that there is no proof in PA of the negation of the proposition?

The formula of a modal agent must be fully modalized, which means that all propositions containing references to actions of agents within the formula must be within the scope of a provability operator.

As you have defined UnfairBot, both FairBot and UnfairBot cooperate with PrudentBot.

UnfairBot defect against PrudentBot.

Proof: For UnfairBot to cooperate with PrudentBot, PA would have to prove that PrudentBot defect against UnfairBot which would require PA to prove that "PA does not prove that UnfairBot cooperate with PrudentBot or PA+1 does not prove that UnfairBot defect against DefectBot" but that would require PA to prove it's own consistency which it cannot do. QED

Oops, you're right. But how do you prove that every modal agent is defected against by at least one of FairBot and UnfairBot?

Proof: Let X be a modal agent, Phi(...) it's associated fully modalized formula, (K, R) a GL Kripke model and w minimal in K. Then, for all statement of the form ◻(...) we have w |- ◻(...) so Phi(...) reduce in w to a truth value which is independent of X opponent. As a result, we can't have both w |- X(FairBot) = C and w |- X(UnfairBot) = D and so we can't have both ◻(X(FairBot) = C) and ◻(X(UnfairBot) = D) and so we can't both have FairBot(X) = C and UnfairBot(X) = C. QED

I don't know what that means. Can you prove it without using Kripke semantics? (if that would complicate things enough to make it unpleasant to do so, don't worry about it; I believe you that you probably know what you're doing)

Proof without using Kripke semantic: Let X be a modal agent and Phi(...) it's associated fully modalized formula. Then if PA was inconsistent Phi(...) would reduce to a truth value independent of X opponent and so X would play the same move against both FairBot and UnfairBot (and this is provable in PA). But PA cannot prove it's own consistency so PA cannot both prove X(FairBot) = C and X(UnfairBot) = D and so we can't both have FairBot(X) = C and UnfairBot(X) = C. QED

Oh, I see. Thanks.

It turns out that, even within the class of modal agents, it's hard to formulate a definition of optimality that's actually true of something, and which meaningfully corresponds to our intuitions about the "right" decisions on decision-theoretic problems. (This intuition is not formally defined, so I'm using scare quotes.)

This seems too simple, but why can't PrudentBot search for a proof that (X(PrudentBot) = C and PrudentBot(X) = D) and defect if a proof is found. If no proof is found, search for the proof of (X(PrudentBot) = C) as before.

That agent gets (mutual) defection from FairBot, since it can never be proven in PA that it cooperates. (In order to do so, you'd need to show that the search for your first statement fails; it's not enough to show there is a proof in PA of an incompatible statement, because PA doesn't know that PA is consistent.)

Variants on that could end up cooperating with versions of FairBot that use PA+Con(PA) instead of just PA, of course. But this is a good illustration of just what's so tricky about modal agents.

That agent gets (mutual) defection from FairBot, since it can never be proven in PA that it cooperates. (In order to do so, you'd need to show that the search for your first statement fails; it's not enough to show there is a proof in PA of an incompatible statement, because PA doesn't know that PA is consistent.)

What if PrudentBot searches for sufficiently short proofs of (PrudentBot(X) = D and X(PrudentBot) = C)? There may be a constructive proof in PA that (there does not exist a proof in PA of length N or shorter that (PrudentBot(X) = D) and X(PrudentBot) = C). If so, as long as FairBot searches for long-enough proofs in PA of (X(PrudentBot) = C) it can first prove that (there does not exist a proof in PA of length N or shorter that (PrudentBot(X) = D) and X(PrudentBot) = C) and subsequently that (PrudentBot(X) = C) and so decide to cooperate. If N is also large enough that PrudentBot can find a proof in PA that (PrudentBot(CooperateBot) = D and CooperateBot(PrudentBot) = C) then I think it can do everything the original PrudentBot did in PA+Con(PA) in a generic way.

There's a hierarchy for existence of short proofs similar to the hierarchy for existence of proofs in this tower of formal systems. (If there weren't, then these agents wouldn't be much discussing! As it is, though, there are bounded algorithmic analogues of all these modal agents.) However, proving properties of such algorithms gets a lot trickier there, as you need to keep track of how long it takes to make all the trivial deductions in addition to Lobian ones.

Essentially, what you're suggesting is something like an algorithmic analogue of the modal agent you considered before: searching for short proofs of exploitability, then longer proofs of mutual cooperation, is akin to searching for exploitability in a lower formal system and then mutual cooperation in a higher one. Not a bad idea! But when we transfer it to the easier-to-work-with context, it's clear that waiting for a proof search to terminate must cost you against anything cooperative but impatient/low-powered. (Explicitly, the algorithm you suggested can never cooperate with an agent that only looks for proofs of cooperation up to length N.)

One last thing: all proofs in PA of the form "there does not exist a proof in PA of X with length less than N" must have length at least N, for analogous reasons to the fact that PA cannot prove its own consistency.

Essentially, what you're suggesting is something like an algorithmic analogue of the modal agent you considered before: searching for short proofs of exploitability, then longer proofs of mutual cooperation, is akin to searching for exploitability in a lower formal system and then mutual cooperation in a higher one. Not a bad idea! But when we transfer it to the easier-to-work-with context, it's clear that waiting for a proof search to terminate must cost you against anything cooperative but impatient/low-powered. (Explicitly, the algorithm you suggested can never cooperate with an agent that only looks for proofs of cooperation up to length N.)

What about optimizing proof search order? An extreme example would just be an agent with a lookup table indexed by the Goedel number of each possible agent up to a maximum value of some N, with either Defect or Cooperate in the lookup table which is the action the agent will take. If impatient agents came first in the lookup table they could quickly prove cooperation. The agent could append its quine to the end of the lookup table before searching in order to cooperate with itself. There should be an optimal lookup table for any length N and set S of games against other agents, defined as the lookup table that generates the highest score when playing all the games in set S.

Instead of including the lookup table the agent could perform the same sort of proof search necessary to generate the lookup table at runtime. If it searched for proofs in roughly the same order as the corresponding entries in the lookup table then impatient opposing agents would conveniently find a cooperation proof early in the simulation of the agent. In this case there would be an optimal list of rules for proof search order (perhaps including direct proofs for the most likely impatient opposing agents) that would approach the optimality of the lookup table, or perhaps exceed it if enough opposing agents give up when they see the size of the lookup-table-agent.

I have no idea if any of this is theoretically interesting or just a hack to potentially play PD slightly better.

One last thing: all proofs in PA of the form "there does not exist a proof in PA of X with length less than N" must have length at least N, for analogous reasons to the fact that PA cannot prove its own consistency.

And presumably more like O(2^N) in practice? I don't know what percentage of bit strings are well formed formulas in an average formal language, but I would imagine it's a significant fraction. The proof searcher would have to check every wff to completion because it doesn't have Con(PA) to rule out contradictions, and I assumed that most of the proofs would be of the form "First the agent writes a 0 and moves the read-head right and goes to state 5, ... , then the agent halts in a Cooperating state", so N would probably be very large.

I propose a variation of fairbot, let's call it two-tiered fairbot (TTF).

If the opponent cooperates* iff I cooperate, cooperateelse, if the opponent cooperates* iff (I cooperate iff the opponent cooperates), check to see if the opponent cooperates

and cooperate iff he/she does*

else, defect.

It seems to cooperate against any "reasonable" agents, as well as itself (unless there's something I'm missing) while defecting against cooperatebot. Any thoughts?

*As determined by proof check.

It cooperates with CooperateBot, for the same Löbian reason that FairBot does. The substatement "if I defected, then CooperateBot will defect" is *actually true* because you cooperate (and "if false, then false" is a tautology).

This is excellent stuff. (And well written, too!)

Thanks for posting this! I'm glad to see more work explained well and formally, and hope to continue seeing preprints like this here.

I'm much happier with this approach than I was a year ago. I don't know how much of that is change on its part, and how much is change on my part; probably some of both.

This echoes the impossibility-of-optimality results of Anderlini [2] and Canning [6] on game theory for Turning machines with access to each others' source codes.

Turning -> Turing

FairBot references itself in its definition, but as with CliqueBot , this can be done via Kleene's recursion theorem

Should there be a reference to Kleene's recursion theorem, or is this common background knowledge for your target audience?

Thanks! Will fix both.

I finally had time to read this, and I must say, that's an extremely creative premise. I'm puzzled by the proof of theorem 3.1, however. It claims "By inspection of FairBot, PA⊢◻(FB(FB) = C)-->FB(FB=C)."

However, by inspection of fairbot, the antecedent of the conditional should be FB(FB) = C rather than ◻(FB(FB) = C). The code says "if it's a theorem of PA that X cooperates with me, then cooperate." It doesn't say "if it's a theorem of PA that it's provable in PA that X cooperates with me, then cooperate." So I don't believe you can appeal to Lob's theorem in this case. .

They applied Lob's theorem correctly. "X is a theorem of PA" and "X is provable in PA" mean the same thing, and both are represented by ◻X. The antecedent of "if it's a theorem of PA that it's provable in PA that FB cooperates with me, then cooperate" would be represented as ◻(◻(FB(FB) = C)), but this never appears either in the code or in the antecedent for Lob's theorem. "FB(FB) = C" on its own, without the box, would mean it is true that fairbot cooperates with itself, not that PA can prove that fairbot cooperates with itself.

I am bothered by the fact that the reasoning that leads to PrudentBot seems to contradict the reasoning of decision theory. Specifially, the most basic and obvious fact of behavior in these competitive games is: if you can prove that the opponent cooperates if and only if you do, then you should cooperate. But this reasoning gives the wrong answer vs. CooperateBot, for Lobian reasons. Is there an explanation for this gap?

It's true that if you can prove that your opponent will cooperate counterfactual-if you cooperate and defect counterfacual-if you defect, then you should cooperate. But we don't yet have a good formalization of logical counterfactuals, and the reasoning that cooperates with cooperatebot just uses material-if instead of conterfactual-if.

We have Ambient Decision Theory, which is a pretty good formalization of logical counterfactuals.

Specifially, the most basic and obvious fact of behavior in these competitive games is: if you can prove that the opponent cooperates if and only if you do, then you should cooperate.

The point is that this "fact" is intuitive but wrong as stated, for reasons of Löbian self-reference.

I know that, as I tried to state in my question. This does not dispel my confusion.

What about the agent represented by the formula "P iff provable(P iff Q)"? It clearly cooperates with itself, and its bounded version should be able to do so with much less computing power than PrudentBot would need. What does it do against PrudentBot and FairBot? Is there anything wrong with it?

This agent is unexploitable, and mutually cooperates with PrudentBot and FairBot, but it cooperates with CooperateBot (since the fixed point reduces to "P iff Box(P)", which makes the Löbian sentence valid).

Notably, this agent is exactly the same as FairBot. Clearly it satisfies P => Provable (Q). We can also check Provable(Q) =>P, since if provable(Q), then P iff Provable(P), so P. Thus P <==> Provable(Q), so it's FairBot.

Notions of sameness need to be handled with care here in general. There are lots: literally identical code is the most obvious one, then provably identical behavior in PA (against some class of reasonable agents), then provably identical behavior in PA+1 (against some class of reasonable agents)... it's sometimes necessary to use the first notion of sameness because opponents may not behave fairly, e.g. they may punish exactly one program, so in general the set of opponents you want to prove sameness with respect to also matters.

I believe this argument proves identical behavior against all modal agents, and identical behavior of modal agents against.

How is "Given an opponent that defects against **Defectbot**, **PrudentBot**'s behavior is to cooperate with that program IFF **PrudentBot** can prove that it cooperates with that program" proven? (That is the case stated where **PrudentBot** plays against itself)

Has it been shown in general that there exists a proof in PA that all agent **Y** cooperates with **PrudentBot** for complicated agents **Y**?

You need an opponent that *provably* defects against DefectBot in PA+Con(PA). For instance, PrudentBot will not find mutual cooperation with the agent that cooperates with X iff PA+Con(PA) proves X cooperates with it.

(This is another reason why it's basically better in practice to use a version of PrudentBot that looks to prove in PA that the opponent cooperates against it, but prove in PA+N for some large N that the opponent defects against DefectBot.)

As for your second question: no, that's false. You can't compel cooperation from arbitrary agents; you can only hope to give the best response against them (so that you don't get exploited, you exploit them if possible, and you grab mutual cooperation if that's the best possible).

"For every decision-making process X for which there is a proof in PA that my opponent cooperates against X, if X defects against my opponent, (implement X; break) if no X defects but some exist, implement any X. If no X exists, defect."

Reflective consistency and self-modifying FTW.

Not sure if that's computable in finite time.

This is basically the idea behind the earlier Masquerade algorithm. The first tricky bit is that your opponent is under no obligation to treat you the same way that it treats X. The second tricky bit is that you can't search over *all* X within any formal system, because the class of cooperators is undecidable, and the more you search over, the stronger a formal system your opponent needs to figure out what you're actually doing.

(It's totally worth thinking about these things, of course!)

Yeah, I realized that I was talking about behavior, but that my opponent isn't limited to my behavior and can make decisions based on my nature. I was trying to design a program using sufficiently advanced magic that would defeat the best program I think I can create, and I succeeded. However, it isn't competitive against other programs which operate on the same level.

It does cooperate with programs one level higher than it, but those programs defect from it.

Two questions about the paper:

Would it be possible to specify FairBot directly as a formula (rather than a program, which is now the case)?

Also, I'm somewhat confused about the usage of FB(FB)=C, as FB(FB) should be a formula/sentence and not a term. Is it correct to read "FB(FB)=C" as "\box FB(FB)" and "FB(FB)=D" as "\box ~FB(FB)"? Or should it be without boxes?

Why do you associate a rank with each modal agent? It looks like it's designed to prevent situations where there are 2 modal agents X and Y such that X tests its opponent against Y and Y tests its opponent against X, but why do such situations need to be avoided? Doesn't the diagonal lemma make them possible?

I want it to be decidable whether a given agent is modal or not.

Hmm... isn't **FairBot** theoretically imposible because of Halting problem?

The halting problem implies that it's impossible to write a program that correctly deduces whether *every* other program halts or not. But you can write a program that correctly deduces whether many programs halt, and labels the rest "unknown".

Likewise, the finite version of FairBot checks whether there's a proof of cooperation of length less than N (for some large fixed N), cooperates if so, and defects otherwise. There are programs that cooperate with FairBot but have no short proof of that fact—and FairBot defects against those programs. (I.e. it defects against the "unknown" programs.)

Does that help clarify things?

I'm not convinced that FairBot cooperates with itself. If we call the statement that FairBot self-cooperates S, then it is not the case that we know that (There exists a proof of S) -> S [which would imply S by Lob's Theorem]. Instead, what we know is (There exists a proof of S of length at most Ackermann(20)) -> S. I'm not convinced that this is enough to conclude S. In fact, I am somewhat doubtful that S is true.

In particular, I can also prove (There exists a proof of (not S) of length at most Ackermann(20)) -> (not S). This is by case analysis:

- If S is false: By simulating FairBot, I can prove not S, and thus the statement
- If S is true: Since the axioms are consistent, there is no proof of (not S). Thus, by enumerating all proofs of length at most Ackermann(20), I can prove that there is no proof of (not S) of at most this length, thus proving the above statement. Then again, it is likely that any proof of this implication would need to be longer than Ackermann(20)...

On the other hand, you do need more than Lob's Theorem to prove that FairBot self-cooperates. Anybody know enough logic to resolve this issue?

Ah, never mind I see it. You just need to modify the proof of Lob's Theorem. Instead of using L = (there exists a proof of L) -> S you instead use L = (there exists a proof of L of length at most Ackermann(10)) -> S and the rest of the proof of Lob's Theorem still works.

Yeah, that's the main idea. Note that to state the bounded version of Lob's theorem, you actually have to use two separate limits: "if having a bounded-size proof of statement S would imply the truth of S, and that fact itself has a short enough proof, then S is provable". The proof is similar to the regular Lob's theorem, but each step has to keep track of length limits, and these bookkeeping details eventually determine how high the limits have to be in the first place for the theorem to be true.