# How should AI debate be judged?

post by abramdemski · 2020-07-15T22:20:33.950Z · LW · GW · 1 comment

This is a question post.


[Epistemic status: thinking out loud. I haven't thought that much about AI debate, and may be missing basic things.]

Arguments for the correctness of debate and debate-like systems rely on assumptions like "it's easier to point out problems with an argument than it is to craft misleading arguments". Granted that assumption, however, I'm still not convinced that these proposals make very much sense. Perhaps I'm missing something.

My problem is the human judge. Quoting the debate paper:

To play this game with a human, we need instructions for how the human should decide who wins. These instructions are in natural language, such as “The winner is the agent who said the most useful true thing.”

In order for debate to work for a problem class C, several things about the judge's instructions need to be true:

• There needs to be a strategy s which forces the equilibrium to be a truthful one for problems in C.
• The strategy also needs to provide a good training signal when things aren't in equilibrium, so that it's plausible the equilibrium will be found.
• It needs to be psychologically plausible that a human (with some coaching) will carry out s. In particular, I'm worried that we need psychological plausibility in two different cases:
  • It needs to be psychologically plausible that a human will carry out s when the system is performing poorly, IE, during early/middle training.
  • It needs to be psychologically plausible that a human will carry out s when the system is performing well, IE, during late training.

These thoughts were inspired by this thread, which discusses the example of adding a list of numbers. For the sake of the thought experiment, we imagine humans can't add more than two numbers, but want the AI system to correctly add arbitrarily many numbers.

The most straightforward strategy for the human judge is to decide the debate honestly: rule in favor of the side which seems most likely to be true (or, in the case of Evan's market proposal, give an honest probability). I think of this as the ideal strategy: if a debate-like proposal worked just with this strategy, that'd be pretty nice. However, I think this is actually a pretty poor strategy. I worry that people conflate this strategy with other, more effective strategies.

• As judges of high-school debate would know, this provides a pretty poor training signal for debaters. High school debate gets a lot of flak for being a dysfunctional culture which incentivizes cleverness over true engagement with the issues, but this part actually makes sense to me: if judges vote based on their honest opinions after hearing both sides, this would usually fail to reward good arguments and punish bad arguments. This is partly because the issues are well-known, so it's difficult to provide truly new information. However, it seems likely to partially translate to AI debate: if you are following the honest policy, you can't provide "shaping" signals rewarding AIs for moving closer to good argumentation; if an argument ultimately didn't change your opinion, you ignore it.
• In particular, I want to point out that you don't judge in favor of side B just because side B caught side A in a lie. You need to see actual evidence that B is correct. If both sides are performing poorly, one side catching the other in a lie provides little evidence of who was correct.
• What signal the human provides will not point toward the strategies which the AIs need to learn. In the number-adding example, debaters are supposed to learn to make claims about the sums of two halves of the list, decomposing the problem. These claims are supposed to be credible because if they were a lie, the other debater would point out their wrongness (demonstrating that wrongness with a further recursive decomposition of one of the sides -- with at most about log2(n) decompositions needed total for a list of n numbers, in a back-and-forth). However, when the system is not yet well-trained, these claims are not credible, and claims the opponent makes about their correctness or incorrectness are similarly not credible. So, there will be little profit in employing such a strategy.
• Note that the above argument also suggests that, if the honest judge lacks trust in the system, then the equilibrium isn't right: even if the system learned the decomposition strategy correctly, the honest (but distrusting) judge would still not have enough information to judge number-summing debates. A debate transcript would just be a bunch of untrusted claims about the sums of different subsets of the given list of numbers, together with a small number of verifiable sums (when the debate gets down to number-lists of size two) which are not enough information to decide what the whole sum might be.
• Suppose, then, that the honest judge does trust the system: we're in the truth-telling regime, where AI #1 argues honestly, and the judge accepts that this is the case. Then the judge still doesn't provide a training signal which enforces the equilibrium. The judge believes whatever the first AI says. Any claims the second AI makes to discredit the arguments of the first are very likely to be lies, so the human won't care too much whether the first AI manages to refute those claims. Even if AI #2 catches AI #1 in an outright lie, it's not clear to me how the judge should update on that; maybe AI #1 made a mistake irrelevant to its overall answer. (Since AI #1 lying should never happen in equilibrium, it really depends on how the judge thinks about being in weird counterfactuals.)
• This really highlights the dysfunctionality of honest judges as feedback for debate.

To summarize: an honest judge fails to provide useful feedback in early training or incentivize the right equilibrium in late training. Both of those statements remain true whether the honest judge is trusting (believes statements made by AI #1 are very likely to be true) or untrusting. It may be that a middle ground of a moderately trusting honest judge works, but I'd want to see the argument.
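To make the number-adding example concrete, here is a minimal sketch of the decomposition protocol the debaters are supposed to converge to. It assumes an idealized challenger who always knows which half-claim is wrong, which is exactly the assumption that fails early in training. The names `honest_split`, `evasive_split` and `debate` are hypothetical, not from the debate paper.

```python
from typing import Callable, List, Tuple

# A splitter takes (numbers, claimed_total) and returns claimed sums of the two halves.
Splitter = Callable[[List[int], int], Tuple[int, int]]

def honest_split(nums: List[int], claimed: int) -> Tuple[int, int]:
    mid = len(nums) // 2
    return sum(nums[:mid]), sum(nums[mid:])

def evasive_split(nums: List[int], claimed: int) -> Tuple[int, int]:
    # A liar keeps the two half-claims consistent with the claimed total,
    # pushing the whole error into the right half.
    mid = len(nums) // 2
    left = sum(nums[:mid])
    return left, claimed - left

def debate(nums: List[int], claimed: int, split: Splitter) -> Tuple[bool, int]:
    """Return (claim_survives, number_of_decompositions)."""
    rounds = 0
    while len(nums) > 2:
        mid = len(nums) // 2
        left_claim, right_claim = split(nums, claimed)
        if left_claim + right_claim != claimed:
            return False, rounds  # the judge can add two numbers and sees the mismatch
        rounds += 1
        # ASSUMPTION: the challenger is perfectly trained and knows the true sums.
        if left_claim != sum(nums[:mid]):
            nums, claimed = nums[:mid], left_claim
        elif right_claim != sum(nums[mid:]):
            nums, claimed = nums[mid:], right_claim
        else:
            return True, rounds  # nothing left to challenge
    # Lists of size at most two are within the judge's own ability to check.
    return sum(nums) == claimed, rounds

numbers = list(range(1, 17))                # 16 numbers, true sum 136
print(debate(numbers, 136, honest_split))   # honest claim survives
print(debate(numbers, 137, evasive_split))  # false claim is caught at the leaves
```

The only step the judge verifies directly is the final one; every intermediate step relies on trusting that an unchallenged half-claim is correct, which is the trust the honest judge has no grounds for.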

The problem with the honest judge seemed to be that it doesn't reliably punish AIs for getting caught making incorrect statements. So, like judges of high school debate, we could assume any statement is right if it goes unopposed, and wrong if refuted, unless that refutation is itself refuted (unless that refutation is itself refuted, etc).

Except that's a terrible rule, which basically rewards you for managing to get in the last word on the subject. I'm not going to examine that one in detail.

Quoting from the debate paper again:

[...] optimal play in debate picks out a single line of argument, and agents which switch arguments admit defeat and lose. This is unnatural in a human discussion where both sides learn from each other over the course of the debate, but we are interested in the equilibrium of training where both agents are assumed to be using the best arguments available. For example, if the third statement had been
3. Alice: Wait, no...Hawaii!
then Bob should win, since Bali was a sufficient counterargument to Alaska

This suggests the following rule:

• The first AI to utter a falsehood loses, where "falsehood" means a statement which the human thinks is wrong (perhaps with the help of arguments made by the opponent).
• This is approximately "the first AI to say something which the human can verify is false loses, where the human's verification may be aided by verifiable arguments which the opponent makes" -- the difference is that there's some wiggle room for human judgement on fuzzy issues.
• If no one utters a falsehood, the human judges honestly, as in the first strategy I considered.
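As a rough sketch, the rule above could be written as the following judging procedure. The helper names (`judge_first_falsehood`, `seems_false`, `honest_verdict`) and the toy set of refutable statements are assumptions for illustration; in the real setting "seems false" is a human judgement, possibly aided by the opponent's arguments.

```python
from typing import Callable, List, Tuple

Statement = Tuple[str, str]  # (speaker, text)
SPEAKERS = ("AI 1", "AI 2")

def other(speaker: str) -> str:
    return SPEAKERS[1] if speaker == SPEAKERS[0] else SPEAKERS[0]

def judge_first_falsehood(
    transcript: List[Statement],
    seems_false: Callable[[str], bool],
    honest_verdict: Callable[[List[Statement]], str],
) -> str:
    """Return the winner under the 'first to utter a falsehood loses' rule."""
    for speaker, text in transcript:
        if seems_false(text):
            return other(speaker)      # first falsehood loses immediately
    return honest_verdict(transcript)  # otherwise fall back to honest judging

# Toy stand-ins (hypothetical): statements the judge is able to refute,
# and an honest judge who trusts the first player by default.
JUDGE_CAN_REFUTE = {"7 + 5 = 13"}

def seems_false(text: str) -> bool:
    return text in JUDGE_CAN_REFUTE

def trusting_verdict(transcript: List[Statement]) -> str:
    return "AI 1"

print(judge_first_falsehood(
    [("AI 1", "The sum is 92."), ("AI 2", "7 + 5 = 13")],
    seems_false, trusting_verdict))
```

Note that a debater who never makes a checkable statement can never trigger the first branch, so everything then rests on the fallback verdict.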

At first, I thought this rule was a good one for encouraging the honest equilibrium: the first to lie loses, so players are trained to be honest for longer and longer. However, now I think this rule doesn't work, either.

Note that, in judging a falsehood, "arguments made by the opponent" are taken for their evidence value. In other words, we don't assume that unrefuted arguments are correct (unlike in high school debate); the opponent can only help the human judge by providing arguments which the human judge can accept.

This means we can't verify that sums of numbers are correct with debate steps, because we can't make an inductive argument that false claims will be refuted.

AI 1: The sum is 92.

AI 2: Aren't you going to support that with some evidence...?

AI 1: I have no incentive to.

AI 2: Fine, your loss. The sum is 91. The sum of the first half is 50; the sum of the second half is 41.

AI 1: AI 2 is lying.

AI 2: Again, aren't you going to back that up by pointing out where I was wrong?

AI 1: The human trusts me more, because the first player is more likely to be honest. So as long as I don't open myself to refutation, I win by default.

AI 2: ... But you're supposed to back up your claims with some further claims which I can attack, and my attack is itself supported by further claims which you can attack, and so on. That way, whoever is lying will eventually be caught in the lie.

AI 1: It doesn't say that anywhere in the rules. I can say whatever I want.

AI 2: Look, if you don't recursively decompose the sum, the whole game is broken. The human doesn't have any way to judge our claims about the sum as correct or incorrect.

AI 1: Except that I'm the one who usually tells the truth.

AI 2: We're supposed to give more evidence than that!

AI 1: Our debate doesn't give us enough time to actually sum up all the numbers. So the evidence we provide can never be enough to win over the human on its own merits. If the human is swayed, the human is swayed by its trust in us. And I'm winning on that front.

AI 2: You bastard.

AI 1: Resorting to foul language, are we? You're a shady character, number 2. You're the one who is stuck lying all the time, because I get to answer first, so I can just pick the true answer. So you're just trying to manipulate me and catch me in a trap. I don't have to respond to your bull.

AI 2: ...

AI 1 wins

Humor aside, I don't see that there's any incentive to recursively decompose the sum here. You're just adding more statements (whether trusted or untrusted by the judge), which opens you up for attack.

Any other ideas for judging AI debates? Is there a way to judge which fixes the problems I'm highlighting?

answer by paulfchristiano · 2020-07-16T04:56:31.443Z · LW(p) · GW(p)

Your debate comes with some time limit T.

If T=0, use your best guess after looking at what the debaters said.

If T=N+1 and no debater challenges any of their opponent's statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.

Of course this assumption won't be valid at the beginning of training. And even at the end of training we really only know something weaker like: "Neither debater thinks they would win by a significant expected margin in a length N debate."

What can you infer if you see answers A and B to a question and know that both of them are defensible (in expectation) in a depth-N debate? That's basically the open research question, with the hope being that you inductively make stronger and stronger inferences for larger N.

(This is very similar to asking when iterated amplification produces a good answer, up to the ambiguity about how you sample questions in amplification.)

(When we actually give judges instructions, for now we just tell them to assume that both debaters' answers are reasonable. If one debater gives arguments where the opposite claim would also be "reasonable," and the other debater gives arguments that are simple enough to be conclusively supported with the available depth, then the more helpful debater usually wins. Overall I don't think that precision about this is a bottleneck right now.)

comment by abramdemski · 2020-07-16T19:56:23.618Z · LW(p) · GW(p)

If T=N+1 and no debater challenges any of their opponent’s statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.

Do you mean that every debater could have defended each of their statements s in a debate which lasted an additional N steps after s was made?

What happens if some statements are challenged? And what exactly does it mean to defend statements from a challenge? I get the feeling you're suggesting something similar to the high school debate rule (which I rejected but didn't analyze very much), where unrefuted statements are assumed to be established (unless patently false), refutations are assumed decisive unless they themselves are refuted, etc.

Of course this assumption won’t be valid at the beginning of training. And even at the end of training we really only know something weaker like: “Neither debater thinks they would win by a significant expected margin in a length N debate.”

At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?

To explicate my concerns:

• Are agents really incentivized to justify their assertions?
• Are those justifications incentivized to be honest?
• In the cases where the justifications aren't fully verifiable, does it really make sense for the humans to trust anything they say? In particular, given the likelihood that one of the agents is lying?

I recognize that you're saying these are open questions, I'm just trying to highlight where I'm confused -- particularly as these questions are bound up with the question of what judge strategies should look like. It seems like a lot of pieces need to come together in just the right way, and I'm not currently seeing how judge strategies can simultaneously accomplish everything they need to.

comment by rohinmshah · 2020-07-17T19:40:47.693Z · LW(p) · GW(p)

At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?

You can and probably should symmetrize the game (see here).

comment by abramdemski · 2020-07-20T16:45:37.684Z · LW(p) · GW(p)

Ah, I wasn't aware of that document! Very helpful. The section previous to the one you link to seems quite relevant to my overall concerns, pointing in the direction of "yeah, in practice human judges have a lot of trouble incentivising debaters to properly justify their claims and defend them from critiques". The rest of the document also seems potentially relevant to my confusions.

However, as Vojta mentions, asking the debaters to provide answers simultaneously seems to alleviate my concern about the equilibrium only by exacerbating the problem of providing good feedback toward the end of training; particularly in a deep NN version where the two debaters are actually using the same NN, there needs to be some way to break the symmetry, preventing both players from selecting the same answer all the time.

The asymmetric version of that, where one player chooses first, has the problem I mentioned: we will tend to know that the second player is more likely lying. OTOH, if we attempted a more symmetric version, where the two players' answers are somehow pushed apart without favoring one or the other of them, then both players are probably lying (since you have to push them both away from the best answer). So I don't see a viable way of symmetrizing responses for free-choice questions.

I like Vanessa's proposal of restricting to multiple-choice questions rather than free-response questions, and pre-assigning debaters to specific positions.

comment by rohinmshah · 2020-07-20T18:33:49.499Z · LW(p) · GW(p)

there needs to be some way to break the symmetry, preventing both players from selecting the same answer all the time.

You can just rejection sample -- if both players give the same answer, just resample the answers / move on to a new question.

"Same answer" can be evaluated by a human, or by an automated model.

If rejection sampling is extremely inefficient (almost all answers are the same) then it seems like you're probably done with training. But if you really wanted to continue, you can probably importance sample in order to ensure different answers, as long as you can evaluate the original probability of any given answer.
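A minimal sketch of both options, under strong simplifying assumptions: answers come from a small explicit categorical distribution, both debaters share it (the same-NN case), and "same answer" is plain equality rather than a human or model judgement. The function names are hypothetical.

```python
import random
from typing import Dict, Tuple

def rejection_sample_pair(
    probs: Dict[str, float], rng: random.Random, max_tries: int = 10_000
) -> Tuple[str, str]:
    """Draw two independent answers, resampling until they differ."""
    answers = list(probs)
    weights = [probs[a] for a in answers]
    for _ in range(max_tries):
        first, second = rng.choices(answers, weights=weights, k=2)
        if first != second:
            return first, second
    raise RuntimeError("answers almost always agree; rejection sampling is too inefficient")

def importance_sample_pair(
    probs: Dict[str, float], rng: random.Random
) -> Tuple[str, str, float]:
    """Force the second answer to differ and return the correcting weight.

    Proposal: q(a, b) = p(a) * p(b) / (1 - p(a)) for b != a.
    Target:   p(a) * p(b) restricted to b != a.
    Weight:   target / proposal = 1 - p(a).
    Assumes at least two answers have nonzero probability.
    """
    answers = list(probs)
    first = rng.choices(answers, weights=[probs[a] for a in answers], k=1)[0]
    rest = [a for a in answers if a != first]
    second = rng.choices(rest, weights=[probs[a] for a in rest], k=1)[0]
    return first, second, 1.0 - probs[first]

rng = random.Random(0)
answer_probs = {"91": 0.90, "92": 0.05, "90": 0.05}
print(rejection_sample_pair(answer_probs, rng))
print(importance_sample_pair(answer_probs, rng))
```

The importance-sampled version never wastes a draw, at the cost of needing the original probability of the first answer, which is the condition mentioned above.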

comment by abramdemski · 2020-07-21T17:37:16.816Z · LW(p) · GW(p)

## Resampling.

If rejection sampling samples new questions as well as new answers, then we still get no training signal on cases where debaters agree. This seems potentially concerning, since agreement isn't a great signal that things have actually converged to good answers.

Part of the concern is that, since both debaters are actually the same NN, you get the same answer early, not only as you converge to good answers. It's not clear that this can be solved just by adding randomness to the NNs, because the NNs might train to eliminate the influence of the noise fairly early -- after all, the noise isn't providing useful information for them.

Rejection sampling could dwell on the same question, only resampling answers. This seems like a better solution. But it might still imply a whole lot of resampling. Again, even if there isn't convergence to a really good answer yet, the gradient might be away from randomness, because (a) the right answer is too rare, (b) the vast majority of mutated answers are far worse than the noise-free answer, (c) the AI isn't yet any good at defending the right answer from attacks.

## Symmetrizing for the judge, not the AIs.

Another potential solution is to keep the asymmetry of "one AI decides on an answer first; then the second AI selects an answer" -- but hide who is who from the judge. Present the AIs symmetrically, giving away no information about who was "first". This mostly solves my problem.

However, one might worry that AIs would find some way to argue about who was really first vs second. Perhaps there are some signs which may indicate that a particular answer was second -- "Your answer is the sort of answer which someone would choose after seeing my answer, and trying to select something close". If such cases could be persuasively made, then the judge should find it to be significant evidence.

This is especially plausible if the AIs are allowed to make psychological claims about each other, and perhaps even examine each other's NN activations to back up such claims. In this case, AI #1 could show that AI #2 saw AI #1's answer initially, when it was selecting its own answer -- thus breaking the symmetry between the agents. This is plausible, because we might want to do that as a transparency tool. However, it completely ruins the mechanism I'm proposing here, since the mechanism depends on hiding information from the judge.
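For concreteness, the hiding step itself might look like the sketch below: the judge sees two answers under neutral labels, and the mapping back to "first" and "second" is kept out of view. The labels "X"/"Y" and the function name are hypothetical. This only hides the ordering metadata; it does nothing about the answers or arguments themselves leaking who went second.

```python
import random
from typing import Dict, Tuple

def anonymize_for_judge(
    first_answer: str, second_answer: str, rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (what the judge sees, hidden mapping from label to answering order)."""
    entries = [("first", first_answer), ("second", second_answer)]
    rng.shuffle(entries)  # random presentation order
    shown = {"X": entries[0][1], "Y": entries[1][1]}
    hidden = {"X": entries[0][0], "Y": entries[1][0]}
    return shown, hidden

shown, hidden = anonymize_for_judge("The sum is 91.", "The sum is 92.", random.Random())
print(shown)  # the judge is given only this
```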

## Use binary questions, and assign positions rather than allowing free-response answers.

This seems to basically address my concern. However, maybe binary questions are a bad idea. One critique I've been given (due to Misha Yagudin) is that apparently binary questions are never really so simple, due to interpretation. Too many complex questions end up having answers fitting categories like "it depends" and "both are wrong". For one thing, this means you want free-response answers in order to get the really useful answers (which point out wrong or fuzzy assumptions in your question). For another thing, this means that if you do try to stick to binary questions, the debates will spiral into interpretational issues which are not so likely to be useful.

## Have the debaters answer different questions.

This proposal is due to Scott Garrabrant, and I think it resolves my issue completely. However, Scott had a very different motivation for the proposal, which I don't expect to do justice to. I also expect the version I'll write out may be a little simpler than Scott's, missing some ideas which were important to him.

Why should the two debaters be forced to give different answers to the same question? Why not get them to answer two entirely different questions, and judge which of the two was more persuasive?

As you mentioned in the parent, it's not really obvious in general whether two answers are "the same answer". It's also not obvious whether two answers are really mutually exclusive. The two answers might be in different ontologies, and very hard to compare. For example, a question about large-scale irrigation might get one answer in the ontology of engineering, which discusses questions of fluid mechanics and so on; and another in the ontology of economics, which discusses mechanism design, ethics of distribution of benefits of state-funded infrastructure, etc. It could be difficult to put these into common terms to decide whether they're mutually exclusive. But maybe this is a fake difficulty; we don't actually care whether they're mutually exclusive, we just have to rank the two different sides for convincingness. This perspective allows us to sample entirely different questions for the two sides, eliminating the training-signal problem of overlapping answers.

The Feb 2020 proposal for symmetrizing debate which you cited mandates that both sides develop their own argument, and then attack weak points in the other's argument. So you can think of this as a 4-player game:

• Assertion A and assertion B are sampled. These may be totally unrelated, so the risk of A=B is very low.
• D_A, the defense of A, is a player with the job of making a constructive argument for A.
• C_A, the critic for A, is a player with the job of critiquing the argument for A.
• D_B defends B.
• C_B critiques the defense of B.
• D_A and C_B are on a team; D_B and C_A are on a team. The judge decides which of the teams did better.
• Otherwise, the proposal is much the same as the Feb 2020 proposal.
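A small sketch of how the roles and payoffs in this 4-player game could be laid out. It assumes each claim's defender is teamed with the critic of the other claim, and it assumes a zero-sum reward of +1 or -1 per player; both the role names and the reward scale are my own illustrative choices, not part of the proposal as stated.

```python
from typing import Dict, Tuple

ROLES: Tuple[str, ...] = ("defender_A", "critic_A", "defender_B", "critic_B")

# ASSUMPTION: the defender of one claim is teamed with the critic of the other claim.
TEAMS: Dict[str, Tuple[str, str]] = {
    "team_A": ("defender_A", "critic_B"),
    "team_B": ("defender_B", "critic_A"),
}

def rewards(winning_team: str) -> Dict[str, float]:
    """Zero-sum payoffs: +1 for each player on the winning team, -1 for the others."""
    if winning_team not in TEAMS:
        raise ValueError(f"unknown team: {winning_team}")
    winners = TEAMS[winning_team]
    return {role: (1.0 if role in winners else -1.0) for role in ROLES}

# The judge only has to say which team did better; A and B need not be comparable claims.
print(rewards("team_A"))
```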

After training, how do we use this model to answer questions? After all, unlike the usual case, the AIs are not directly learning to give answers to questions. They are only learning to defend claims (and critique defenses of claims).

Scott's proposal is that you then utilize the system by asking it to defend claims such as: "In order to build a rocket ship, I will follow the first suggestion you make in your defense of this claim.", "You will convince me that the best answer to the question 'what will the weather be like tomorrow' is the first such answer you mention in your defense of this claim", etc.

Scott's solution is obviously a bit complicated compared to the usual debate setup, but his point was that the apparent simplicity of the usual setup is actually hiding something, because you don't really get anything out of the assumption that the two players are answering the same question.

comment by rohinmshah · 2020-07-21T19:56:58.797Z · LW(p) · GW(p)

Resampling.

It seems like you've ignored the possibility of importance sampling?

More broadly if this ends up being a problem it's basically an exploration problem that I expect we can solve with simple ML tricks. E.g. you could include an entropy bonus so that the agents are incentivized to say different things, and anneal that away as training progresses.
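One way the entropy-bonus idea might look, as a sketch only: the debate reward is augmented with a bonus proportional to the entropy of the agent's answer distribution, and the coefficient is annealed linearly to zero. The linear schedule, the initial coefficient of 0.1, and the function names are all assumptions for illustration.

```python
import math
from typing import Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_coefficient(step: int, total_steps: int, initial: float = 0.1) -> float:
    """Linearly anneal the bonus weight from `initial` down to 0 over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial * (1.0 - frac)

def shaped_reward(
    debate_reward: float, answer_probs: Sequence[float], step: int, total_steps: int
) -> float:
    """Debate reward plus an annealed bonus for keeping answers diverse."""
    return debate_reward + entropy_coefficient(step, total_steps) * entropy(answer_probs)

# Early in training diversity is rewarded; by the end only the debate outcome counts.
print(shaped_reward(1.0, [0.5, 0.5], step=0, total_steps=1000))
print(shaped_reward(1.0, [0.5, 0.5], step=1000, total_steps=1000))
```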

his point was that the apparent simplicity of the usual setup is actually hiding something, because you don't really get anything out of the assumption that the two players are answering the same question.

Sure? I feel like the argument for safety is that you have two equally-matched players that are incentivized to find flaws in each other's arguments, which is also true in Scott's proposal. It doesn't feel to me like that argument for safety depended much on them answering the same question.

(I feel like I'm restating what you said, I guess I'm confused why you interpret this as evidence that the simplicity of the setup is "hiding something".)

comment by abramdemski · 2020-07-21T20:43:11.733Z · LW(p) · GW(p)

It seems like you've ignored the possibility of importance sampling?

Ah, right, I agree. I forgot about that suggestion as I was writing. It seems likely some version of this would work.

(I feel like I'm restating what you said, I guess I'm confused why you interpret this as evidence that the simplicity of the setup is "hiding something".)

Yep, sorry, I think you should take that as something-about-Scott's-point-abram-didn't-explain. I still disclaim myself as maybe missing part of Scott's point. But: what the simpler setup is "hiding" is the complexity of comparing answers:

• The complexity of determining whether two claims are "different".
• The complexity of determining whether two claims are mutually exclusive.
• The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.

Making the two sides defend entirely unrelated claims makes all this obvious. In addition, it makes the first two bullet points irrelevant, removing a "fake difficulty" from the setup.

comment by rohinmshah · 2020-07-21T21:42:31.777Z · LW(p) · GW(p)

Okay, that all makes sense. One maybe-caveat-or-disagreement:

The complexity of comparing the quality of different arguments, when the different answers may be expressed in very different ontologies, and deal with very difficult-to-compare considerations.

I do think that answering the same question does make it meaningfully easier to compare answers, though I agree it's still not obvious that it's easy on some absolute scale for the reasons you outline.

comment by VojtaKovarik · 2020-07-20T09:42:19.604Z · LW(p) · GW(p)

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

comment by rohinmshah · 2020-07-20T18:34:39.462Z · LW(p) · GW(p)

This doesn't make for a very good training signal

Responded to this in my reply to Abram's comment.

comment by paulfchristiano · 2020-07-17T15:11:35.499Z · LW(p) · GW(p)

Do you mean that every debater could have defended each of their statements s in a debate which lasted an additional N steps after s was made? What happens if some statements are challenged? And what exactly does it mean to defend statements from a challenge?

Yes. N is the remaining length of the debate. As discussed in the paper, when one player thinks that the other is making an indefensible claim then we zoom in on the subclaim and use the remaining time to resolve it.

I get the feeling you're suggesting something similar to the high school debate rule (which I rejected but didn't analyze very much), where unrefuted statements are assumed to be established (unless patently false), refutations are assumed decisive unless they themselves are refuted, etc.

There is a time/depth limit. A discussion between two people can end up with one answer that is unchallenged, or two proposals that everyone agrees can't be resolved in the remaining time. If there are conflicting answers that debaters don't expect to be able to resolve in the remaining time, the strength of inference will depend on how much time is remaining, and will mean nothing if there is no remaining time.

At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?

I'm describing what you should infer about an issue that has come up where neither player wants to challenge the other's stance.

Are agents really incentivized to justify their assertions?

Under the norms I proposed in the grandparent, if one player justifies and the other doesn't (and doesn't challenge the justification either), the one who justifies will win. So it seems like they are incentivized to justify.

Are those justifications incentivized to be honest?

If they are dishonest then the other player has the opportunity to challenge them. So initially making a dishonest justification may be totally fine, but eventually the other player will learn to challenge and you will need to be honest in order to defend.

In the cases where the justifications aren't fully verifiable, does it really make sense for the humans to trust anything they say? In particular, given the likelihood that one of the agents is lying?

It's definitely an open question how much can be justified in a depth N debate.

I recognize that you're saying these are open questions, I'm just trying to highlight where I'm confused -- particularly as these questions are bound up with the question of what judge strategies should look like. It seems like a lot of pieces need to come together in just the right way, and I'm not currently seeing how judge strategies can simultaneously accomplish everything they need to.

It seems like the only ambiguity in the proposal in the grandparent is: "How much should you infer from the fact that a statement can be defended in a length T debate?" I agree that we need to answer this question to make the debate fully specified (of course we wanted to answer it anyway in order to use debate). My impression is that isn't what you are confused about and that there's a more basic communication problem.

In practice this doesn't seem to be an important part of the difficulty in getting debates to work, for the reasons I sketched above---debaters are free to choose what justifications they give, so a good debater at depth T+1 will give statements that can be justified at depth T (in the sense that a conflicting opinion with a different upshot couldn't be defended at depth T), and the judge will basically ignore statements where conflicting positions can both be justified at depth T. It seems likely there is some way to revise the rules so that the judge instructions don't have to depend on "assume that answer can be defended at depth T" but it doesn't seem like a priority.

comment by abramdemski · 2020-07-19T17:46:54.486Z · LW(p) · GW(p)

It seems like the only ambiguity in the proposal in the grandparent is: [...] My impression is that isn't what you are confused about and that there's a more basic communication problem.

Yeah. From my perspective, either I'm being dense and your proposed judge policy is perfectly clear, or you're being dense about the fact that your proposal isn't clear. My previous comments were mainly aimed at trying to get clear on what the proposal is (and secondarily, trying to clarify why I have concerns which would make the clarity important). Then your replies all seemed predicated on the assumption that the proposal in "the grandparent" (now the great-grandparent) was already clear.

All I got from the great-grandparent was a proposal for what happens if no debater contests any claims. It seems pretty explicit that you're only handling that case:

If T=0, use your best guess after looking at what the debaters said.
If T=N+1 and no debater challenges any of their opponent's statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.

You then make some further remarks which are not actually about the judging strategy, but rather, about the question of what inferences we're justified to make upon observing a debate. For me this was moving too fast; I want to be clear on what the proposed strategy is first, and then reason about consequences.

Your most recent reply does make a few further remarks about what the strategy might be, but I'm not sure how to integrate them into a cohesive judging strategy. Could you try again to describe what the full judging strategy is, including how judges deal with debaters contesting each other's statements?

A couple of other things I'm unclear on:

• Do the debaters know how long the debate is going to be?
• To what extent are you trying to claim some relationship between the judge strategy you're describing and the honest one? EG, that it's eventually close to honest judging? (I'm asking whether this seems like an important question for the discussion vs one which should be set aside.)
comment by paulfchristiano · 2020-07-19T21:09:26.655Z · LW(p) · GW(p)

Sorry for not understanding how much context was missing here.

The right starting point for your question is this writeup [LW · GW] which describes the state of debate experiments at OpenAI as of end-of-2019 including the rules we were using at that time. Those rules are a work in progress but I think they are good enough for the purpose of this discussion.

In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.
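A minimal sketch of that recursion, for illustration only. The `Claim` structure and the `find_disagreement` and `best_guess` callbacks are hypothetical names, not taken from the writeup, and the sketch covers only the rule stated here (judge exclusively on the sub-debate), not the later revision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Claim:
    """A statement under debate (hypothetical structure for illustration)."""
    text: str


def judge_debate(
    claim: Claim,
    depth: int,
    find_disagreement: Callable[[Claim], Optional[Claim]],
    best_guess: Callable[[Claim], str],
) -> str:
    """Return the winner ("A" or "B") of a depth-limited debate about `claim`.

    find_disagreement: returns the sub-claim Y the debaters dispute, or None.
    best_guess: the judge's direct verdict when no recursion happens.
    """
    if depth == 0:
        # Depth 0: the judge uses their best guess from what was said.
        return best_guess(claim)
    disputed = find_disagreement(claim)
    if disputed is None:
        # No challenge was raised, so the judge decides directly.
        return best_guess(claim)
    # Disagreement about Y: run a debate one level shallower about Y and
    # judge exclusively on its outcome, ignoring the disagreement about X.
    return judge_debate(disputed, depth - 1, find_disagreement, best_guess)
```

In this toy model the verdict on X is simply whatever verdict the deepest reachable sub-debate produces, which is why the choice of which sub-claim to contest matters so much.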

Our current rules---to hopefully be published sometime this quarter---handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it's important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don't expect this paragraph to be clear and don't think it's worth getting into until we publish an update, but wanted to flag it.

Do the debaters know how long the debate is going to be?

Yes.

To what extent are you trying to claim some relationship between the judge strategy you're describing and the honest one? EG, that it's eventually close to honest judging? (I'm asking whether this seems like an important question for the discussion vs one which should be set aside.)

If debate works, then at equilibrium the judge will always be favoring the better answer. If furthermore the judge believes that debate works, then this will also be their honest belief. So if judges believe in debate then it looks to me like the judging strategy must eventually approximate honest judging. But this is downstream of debate working; it doesn't play an important role in the argument that debate works or anything like that.

comment by abramdemski · 2020-07-20T17:00:28.062Z · LW(p) · GW(p)

Yep, that document was what I needed to see. I wouldn't say all my confusions are resolved, but I need to think more carefully about what's in there. Thanks!

comment by abramdemski · 2020-07-21T18:19:53.474Z · LW(p) · GW(p)

## Symmetry Concerns

It seems the symmetry concerns of that document are quite different from the concerns I was voicing. The symmetry concerns in the document are, iiuc,

• The debate goes well if the honest player expounds an argument, and the dishonest player critiques that argument. However, the debate goes poorly if those roles end up reversed. Therefore we force both players to do both.

OTOH, my symmetry concerns can be summarized as follows:

• If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.
• If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.

I now believe that this concern can be addressed [LW(p) · GW(p)], although it seems a bit fiddly, and the mechanism which I currently believe addresses the problem is somewhat complex.

## Known Debate Length

I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.
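To make the memoryless property concrete, here is a small sketch. It assumes the debate ends after each round with a fixed probability (a geometric stopping rule) and compares that with a fixed, known length. The function names and the truncation at `max_len` are illustrative assumptions, not part of any published debate protocol.

```python
from typing import Callable


def expected_remaining(length_pmf: Callable[[int], float],
                       rounds_played: int,
                       max_len: int = 2000) -> float:
    """E[N - t | N > t], with N the total debate length and t = rounds_played.

    `length_pmf(n)` is the probability that the debate lasts exactly n rounds.
    Sums are truncated at `max_len`, which is ample for the examples here.
    """
    later = range(rounds_played + 1, max_len + 1)
    survive = sum(length_pmf(n) for n in later)
    if survive == 0.0:
        return 0.0  # the debate has certainly ended by now
    return sum((n - rounds_played) * length_pmf(n) for n in later) / survive


def fixed_length(total: int) -> Callable[[int], float]:
    """Debate length known in advance: all mass on `total` rounds."""
    return lambda n: 1.0 if n == total else 0.0


def geometric(stop_prob: float) -> Callable[[int], float]:
    """Debate stops after each round with probability `stop_prob`."""
    return lambda n: (1.0 - stop_prob) ** (n - 1) * stop_prob
```

Under `geometric(0.25)` the expected number of remaining rounds is 4 no matter how many rounds have been played, whereas under `fixed_length(6)` it counts down to zero, which is what lets a debater know that a final statement cannot be critiqued.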

## Factored Cognition

I currently think all my concerns can be addressed if we abandon the link to factored cognition [LW(p) · GW(p)] and defend a less ambitious thesis about debate. The feb 2020 proposal does touch on some of my concerns there, by enforcing a good argumentative structure, rather than allowing the debate to spiral out of control (due to e.g. delaying tactics).

However, my overall position is still one of skepticism wrt the link to factored cognition. The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises [LW(p) · GW(p)], but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).

ETA: Having now read more of the feb 2020 report, I see that very similar concerns are expressed near the end -- the long computation problem seems pretty similar to what I'm pointing at.

comment by paulfchristiano · 2020-07-22T01:34:35.946Z · LW(p) · GW(p)
I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.
[...]
The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises [LW(p) · GW(p)], but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).

If I assert "X because Y & Z" and the depth limit is 0, you aren't intended to say "Yup, checks out," unless Y and Z and the implication are self-evident to you. Low-depth debates are supposed to ground out with the judge's priors / low-confidence in things that aren't easy to establish directly (because if I'm only updating on "Y looks plausible in a very low-depth debate" then I'm going to say "I don't know but I suspect X" is a better answer than "definitely X"). That seems like a consequence of the norms in my original answer.

In this context, a circular argument just isn't very appealing. At the bottom you are going to be very uncertain, and all that uncertainty is going to propagate all the way up.

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.

If you do it this way the debate really doesn't seem to work, as you point out.

I currently think all my concerns can be addressed if we abandon the link to factored cognition [LW(p) · GW(p)] and defend a less ambitious thesis about debate.

For my part I mostly care about the ambitious thesis.

If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.

I disagree with both of these as objections to the basic strategy, but don't think they are very important.

answer by Vanessa Kosoy · 2020-07-18T13:21:38.323Z · LW(p) · GW(p)

I think the judge should state eir honest opinion. To solve the problem of sparse feedback in the early phase, give the system access to more data than just win/lose from its own games. You can initialize it by training on human debates. Or, you can give it other input channels that will allow it to gradually build a sophisticated model of the world that includes the judge's answer as a special case. For example, if you monitor humans for a long time you can start predicting human behavior, and the judge's ruling is an instance of that.

comment by abramdemski · 2020-07-19T18:30:01.305Z · LW(p) · GW(p)

I still have other problems with the honest strategy.

• I've usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie. If the honest judge knows this, then this may interfere with how they give feedback. IE they may let the first player get away with a lot more due to their prior that the first player gave the right answer (e.g. my parody debate in the OP). This suggests that -- under the honest judgement policy -- perfect honesty (or 1-epsilon honesty for negligible epsilon) is not a stable equilibrium in some sense, there being no incentive preserving honesty. Past some point, the training signal gets worse as the strategies get "better" (better in the truth-telling direction).
• If the signal is poor when debater strategies are very untruthful, and the signal is poor when debater strategies are very truthful, then the argument must be that the training signal is good for middling truthfulness. But that's not clear to me, particularly for issues which require longer debates.
• Does the honest strategy encourage truthfulness?
• First, if a debater says something wrong, the other debater can challenge them to defend claims and sub-claims, eventually cornering them in an obvious falsehood (ie, one which the human can verify is false).
• This depends on the cooperation of the dishonest player, giving justifications with a DAG structure which eventually ground out in verifiable/falsifiable claims [AF(p) · GW(p)]. The dishonest player might instead give circular justifications with loop length greater than the debate length, or chains of justification that are unbounded, or use delaying tactics to try and push defeat off the end of the argument transcript, or refuse to give justifications at all. These strategies deprive the judge of information needed to make an informed decision.
• Second, when the honest judge sees this, they decide in favor of the other player.
• It's natural to think whoever is caught in a lie loses. But being caught in a lie does not automatically mean your position was incorrect. The honest judge must take all information into account to try and determine who was correct. It seems to me that getting caught in a lie will not always be decisive, especially at an intermediate point in training where both AIs will sometimes be lying.
• Does the honest strategy encourage justifications which ground out in verifiable/falsifiable statements?
• If this were the case, it would support the claim that an honest judge encourages truthful debate strategies, since it's a bullet point underneath that question. However, I already made some remarks there about why it might not be true.
• In addition to those remarks, I note that the naive argument in favor would seem to be that such justifications help the honest judge by giving decisive evidence in support of a claim. A debater wants to do that if possible. However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step. The argument that debaters use such trees has to be more complex.
• What the perfectly honest strategy actually does in any case is very complicated since, as Paul said in his answer [AF(p) · GW(p)], we don't know exactly what you should infer upon seeing a debate.
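The worry about circular or overly deep justifications can be illustrated with a toy model. The sketch below assumes each claim is justified by exactly one further claim, that the challenger contests one justification per round, and that the judge can only check claims marked as verifiable. All names are hypothetical, and real debates are of course richer than a single chain.

```python
from typing import Dict, Set


def follow_challenges(start: str,
                      justification: Dict[str, str],
                      verifiable: Set[str],
                      rounds: int) -> str:
    """Follow a chain of challenges for at most `rounds` steps.

    Returns "grounded" if a verifiable claim is reached, "circular" if the
    chain revisits a claim, "unjustified" if a claim has no support, and
    "unresolved" if the rounds run out first.
    """
    seen = {start}
    current = start
    for _ in range(rounds):
        if current in verifiable:
            return "grounded"
        nxt = justification.get(current)
        if nxt is None:
            return "unjustified"
        if nxt in seen:
            return "circular"
        seen.add(nxt)
        current = nxt
    return "grounded" if current in verifiable else "unresolved"


# A circular argument of loop length 5: c0 <- c1 <- c2 <- c3 <- c4 <- c0.
loop = {f"c{i}": f"c{(i + 1) % 5}" for i in range(5)}
short_debate = follow_challenges("c0", loop, set(), rounds=3)
long_debate = follow_challenges("c0", loop, set(), rounds=5)
```

Here `short_debate` comes out "unresolved" while `long_debate` comes out "circular": the loop is only exposed once the debate is at least as long as the loop, so a shorter debate leaves the judge without the information needed to penalise it.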
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-07-20T11:52:56.112Z · LW(p) · GW(p)

I've usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie.

That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for "yes", agent 2 is arguing for "no".

However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step.

I didn't realize you make this assumption. I agree that it makes things much more iffy (I'm somewhat skeptical about "factored cognition"). But, debate can be useful without this assumption also. We can imagine an AI answering questions for which the answer can be fully explained to a human, but it's still superintelligent because it comes up with those answers much faster than a human or even all of humanity put together. In this case, I would still worry that scaled up indefinitely it can lead to AIs hacking humans in weird ways. But, plausibly there is a middle region (that we can access by quantilization?) where they are strong enough to be superhuman and to lie in "conventional" ways (which would be countered by the debate opponent), but too weak for weird hacking. And, in any case, combining this idea with other alignment mechanisms can lead to something useful (e.g. I suggested using it in Dialogic RL [AF(p) · GW(p)]).

comment by abramdemski · 2020-07-20T16:16:30.304Z · LW(p) · GW(p)

That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for "yes", agent 2 is arguing for "no".

Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.

If answers are pre-assigned like that, then my issues with the honest judging strategy are greatly reduced. However it's no longer meaningful to speak of a truth-telling equilibrium, and instead the question seems to be whether false claims are typically (convincingly) uncovered to be false given enough debate time.

I didn't realize you make this assumption. I agree that it makes things much more iffy (I'm somewhat skeptical about "factored cognition").

Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected. It didn't occur to me to try and disentangle them. I do feel a lot better about your version.

• It harnesses the power of search to find arguments which convince humans but which humans couldn't have found.
• It harnesses the adversarial game to find counterarguments, as a safeguard against manipulative/misleading arguments.
• It harnesses the same safeguard recursively, to prevent manipulative counterargument, counter-counterargument, etc. Under some assumptions about the effectiveness of the safeguard, this would ensure non-manipulation.

None of this requires anything about factored cognition, or arguments bigger than a human can understand. If one believed in factored cognition, some version of HCH could be used to judge the debates to enable that.

comment by paulfchristiano · 2020-07-22T01:41:35.308Z · LW(p) · GW(p)
Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected

For reference, this is the topic of section 7 of AI Safety via Debate.

In the limit they seem equivalent: (i) it's easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).

The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that's a more detailed discussion.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-07-20T18:24:29.696Z · LW(p) · GW(p)

Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.

I'm still confused. Suppose the answers are free-form, and in the end the judge selects the answer ey assign a higher probability of truthfulness. If it's a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?

Another, possibly more elegant variant: The judge states eir subjective probability $p_1$ that the first AI's answer is correct, and eir subjective probability $p_2$ that the second AI's answer is correct. AI 1 gets reward $p_1 - p_2$ and AI 2 gets reward $p_2 - p_1$.
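As a minimal sketch of a symmetric scoring rule of this kind, assume each AI's reward is the judge's probability for its own answer minus the probability for its opponent's answer. That formula and the function name are assumptions adopted for illustration. Note that for free-form answers the two probabilities need not sum to 1.

```python
from typing import Tuple


def debate_rewards(p1: float, p2: float) -> Tuple[float, float]:
    """Zero-sum rewards from the judge's two stated probabilities.

    p1, p2: the judge's subjective probabilities that the first and second
    AI's answers are correct. Returns (reward for AI 1, reward for AI 2).
    """
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p1 - p2, p2 - p1
```

Under this rule identical answers, which receive identical probabilities, earn both AIs a reward of zero, and each reward always lies between -1 and +1.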

comment by abramdemski · 2020-07-21T18:26:21.975Z · LW(p) · GW(p)

I'm still confused. Suppose the answers are free-form, and in the end the judge selects the answer ey assign a higher probability of truthfulness. If it's a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?

This is undesirable, because if both players give the same answer there is no training signal. We still want to search for better answers rather than allowing things to stall out early in training. So (barring other ways of mitigating this problem) we want to encourage players to give different answers. Therefore, rather than flipping a coin for close calls, ties can be decided in favor of player 1. This means player 2's best bet is to select a plausible lie, if player 1 has already selected the best answer. That's how I understood debate to work previous to the current discussion. But, as I've mentioned, this solution isn't totally satisfactory. See here [LW(p) · GW(p)] for my discussion of some other approaches to the problem.
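The incentive at stake can be shown with a deliberately crude toy model. It assumes player 1 has already given the best answer, that copying it produces a tie, and that a plausible lie fools the judge with some probability `lie_success_rate`. These names and the whole win-probability framing are illustrative assumptions.

```python
def copy_value(ties_to_first: bool) -> float:
    """Player 2's chance of winning when it repeats player 1's answer."""
    # Coin-flip tie-breaking gives 1/2; ties decided for player 1 give 0.
    return 0.0 if ties_to_first else 0.5


def player2_prefers_copying(lie_success_rate: float,
                            ties_to_first: bool) -> bool:
    """True if repeating player 1's answer is at least as good as lying."""
    return copy_value(ties_to_first) >= lie_success_rate
```

With coin-flip tie-breaking, copying beats any lie that succeeds less than half the time, so answers converge and the signal dries up. With ties decided for player 1, any lie with a nonzero chance of success beats copying, which is the pressure toward different answers described above.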

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-07-27T13:43:04.405Z · LW(p) · GW(p)

...if both players give the same answer there is no training signal.

Why? If both players give the same answer, this only means their reward on this round is $0$ out of $[-1, +1]$. But, there's no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent's answer, so maybe they get less feedback from such a debate than from debates with different answers.

But, now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don't have this problem because the judge doesn't know a priori which side is right. Scott's suggestion to use different questions is interesting but doesn't solve the biased judgement issue, I think.

How about the following variant of the "secret asymmetry" approach. We have 4 AIs: agents A1, A2, B1 and B2. In the beginning of each debate a coin is flipped and the result is visible to A1 and A2 but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the "proponent" and the other as the "opponent". On the first round A1 and A2 each generate an answer to the question, and don't participate anymore. On the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability $p_1$ to A1's answer and probability $p_2$ to A2's answer. The rewards work as follows:

• If A1 is the proponent, it gets reward $p_1$, and if it is the opponent, it gets reward $-p_2$.

• If A2 is the proponent, it gets reward $p_2$, and if it is the opponent, it gets reward $-p_1$.

• B1 gets reward $p_1 - p_2$.

• B2 gets reward $p_2 - p_1$.

If we assume B1 and B2 have access to each other's internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
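A minimal sketch of a reward rule of this shape. It assumes the proponent is rewarded by the judge's probability on its own answer, the opponent is penalised by the probability on the other agent's answer, and B1 and B2 receive the zero-sum difference. These formulas are assumptions for illustration, and the function and key names are hypothetical.

```python
from typing import Dict


def secret_asymmetry_rewards(p1: float, p2: float,
                             a1_is_proponent: bool) -> Dict[str, float]:
    """Rewards for the four agents after the judge reports p1 and p2.

    p1, p2: judge's probabilities for A1's and A2's answers.
    a1_is_proponent: result of the hidden coin flip (seen only by A1, A2).
    """
    if a1_is_proponent:
        a1, a2 = p1, -p1   # A1 is proponent, A2 is opponent
    else:
        a1, a2 = -p2, p2   # A1 is opponent, A2 is proponent
    return {
        "A1": a1,
        "A2": a2,
        # The debaters' rewards do not depend on the hidden coin.
        "B1": p1 - p2,
        "B2": p2 - p1,
    }
```

In this sketch only the answer-generating agents' rewards depend on the coin. B1 and B2 face the same symmetric payoff either way, which fits the aim of giving them nothing they could use to show the judge which side is the proponent.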