Adversarial Policies Beat Professional-Level Go AIs

post by sanxiyn · 2022-11-03T13:27:00.059Z · LW · GW · 35 comments

This is a link post for https://goattack.alignmentfund.org/


An interesting adversarial attack on KataGo, a professional-level Go AI. Apparently funded by the Fund for Alignment Research (FAR). Seems to be a good use of funds.

35 comments

Comments sorted by top scores.

comment by ChristianKl · 2022-11-03T17:13:13.892Z · LW(p) · GW(p)

Edit: The situation is less clearcut than it first appeared to me, more information at my comment https://www.lesswrong.com/posts/jg3mwetCvL5H4fsfs/adversarial-policies-beat-professional-level-go-ais?commentId=ohd6CcogEELkK2DwH

KataGo basically plays according to the rules that human players use to play Go, and would win under those rules.

The rules for computer Go differ from the normal rules that humans use in some technicalities, and the adversarial attack relies on abusing those technicalities.

The most likely explanation of this result is that KataGo is not built to optimize its play under the technical rules of computer Go but to play according to the normal Go rules that humans use. KataGo is not a project created to play against bots but to give human Go players access to a Go engine. It would likely annoy its users if it didn't play according to the normal human Go rules.

As far as the significance for alignment goes, the result of this is:

KataGo aligns with human values even when it means it would lose under the technical experiment that's proposed here. KataGo manages not to Goodhart on the rules of computer Go but optimizes for what humans actually care about.

Given that this is paid for by the Fund for Alignment Research, it's strange that nobody congratulated KataGo on this achievement.

Replies from: gjm, sharmake-farah
comment by gjm · 2022-11-03T23:34:53.533Z · LW(p) · GW(p)

This bit

KataGo is not built to optimize its play under the technical rules of computer go but to play according to the normal go rules that humans use

is definitely wrong. KataGo is able to use a variety of different rulesets, and does during its training, including the Tromp-Taylor rules used in the paper. Earlier versions of KataGo didn't (IIRC) have the ability to play with a wide variety of rulesets, and only used Tromp-Taylor.

[EDITED to add:] ... Well, almost. As has been pointed out elsewhere in this discussion, what KG actually used in training (and I think still does, along with other more human-like rulesets) is Tromp-Taylor with a modification that makes it not require dead stones in its territory to be captured. I don't think that counts as "the normal go rules that humans use", but it is definitely more human-like than raw Tromp-Taylor, so "definitely wrong" above is too strong. It may be worth noting explicitly that with the single exception of passing decisions (which is what is being exploited here) raw Tromp-Taylor and modified Tromp-Taylor lead to identical play and identical scores. [END of addition-in-edit.]

KataGo does have an option that makes it pass more readily in order to be nice to human opponents, but that option was turned off for the attack in the paper.

The reason the attack is able to succeed is that KataGo hasn't learned to spot instantly every kind of position where immediate passing would be dangerous because its opponent might pass and (because of a technicality) win the game. If you give it enough visits that it actually bothers to check what happens when its opponent passes, it sees that that would be bad and is no longer vulnerable. In practice, it is unusual for anyone to use KataGo with as few visits as were used in the paper.
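For a toy picture of what that extra checking buys, here is a sketch of a one-ply lookahead that explicitly scores the double-pass continuation instead of trusting the network's intuition. All the helper names here (legal_moves, apply_move, opponent_pass_ends_game, score_for_player) are hypothetical placeholders, not KataGo's actual interfaces:

```python
# Illustrative only: a single ply of explicit search over "pass" is enough to
# expose the technical loss that the raw policy's intuition misses.
def choose_move(state, net_value):
    best_move, best_value = None, float("-inf")
    for move in legal_moves(state) + ["pass"]:
        child = apply_move(state, move)
        if move == "pass" and opponent_pass_ends_game(child):
            # Score the game-ending double-pass line under the actual ruleset
            # rather than relying on the network's evaluation of it.
            value = 1.0 if score_for_player(child, state.to_move) > 0 else -1.0
        else:
            value = net_value(child)  # fall back to the network's judgement
        if value > best_value:
            best_move, best_value = move, value
    return best_move
```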

There is some truth to the idea that the attack is possible because KataGo doesn't care about computer-rules technicalities, but the point isn't that KataGo doesn't care but that KataGo's creator is untroubled by the fact that this attack is possible because (1) it only happens in artificial situations and (2) it is pretty much completely fixed by search, which is a perfectly good way to fix it. (Source: discussion on the Discord server where a lot of computer go people hang out.)

Replies from: ChristianKl
comment by ChristianKl · 2022-11-04T16:30:09.214Z · LW(p) · GW(p)

Okay, I downloaded KataGo to see how it plays and read its rules description. It seems it has actually been trained so that, under area rules, it doesn't maximize its points.

This is surprising to me because one of the annoying aspects of AlphaGo was that it didn't maximize the number of points by which it wins the game but only cared about winning. KataGo, when playing under territory rules, seems to maximize points and not make those negative-point moves that AlphaGo makes at the end of the game when it's ahead by a lot of points.

Humans generally do care about the score at the end of the game so that behavior, under rules that care about area, is surprising to me. 

Official Chinese rules do have a concept of removing dead stones. All the KGS rulesets also have an option for handling dead stone removal.

A fix that would let KataGo beat the adversarial policy would be to implement rules for Chinese Go that are more like the actual KGS rules (likely by just letting it have the cleanup phase with Chinese rules as well), and to generally tell KataGo to optimize for winning with the highest importance, then for score, and lastly for a minimum number of moves played before passing.

If you did that, you could train it on the different rule sets and it wouldn't produce this problem. The fact that you need to do that to prevent the adversarial policy is indeed interesting.

That suggests that if you have one metric, adding a second metric that's a proxy for the first as a secondary optimization goal can help get around some adversarial attacks, especially if the first metric is binary and the second one has many more possible values.
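A toy sketch of that idea, with a binary primary metric and a finer-grained proxy as a tiebreaker (the Candidate class and the numbers are made up purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    wins: bool           # primary, binary metric (win/lose)
    score_margin: float  # secondary proxy with many more possible values
    is_pass: bool

def pick_move(candidates):
    # Lexicographic preference: winning first, then a bigger margin,
    # then avoiding an early pass.
    return max(candidates, key=lambda c: (c.wins, c.score_margin, not c.is_pass))

moves = [Candidate(True, 0.5, True), Candidate(True, 12.0, False), Candidate(False, 30.0, False)]
print(pick_move(moves))  # picks the winning move with the larger margin
```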

It's interesting here that humans do naturally care about scores when you let them play Go, which is what gets them to avoid this kind of adversarial attack.

Replies from: gjm
comment by gjm · 2022-11-05T00:30:44.238Z · LW(p) · GW(p)

What KataGo tries to maximize is basically winning probability plus epsilon times score difference. (It's not exactly that; I don't remember exactly what it is; but that's the right kind of idea.) So it mostly wants to win rather than lose, but prefers to win by more if the cost in winning probability is small, which as you say helps to avoid the sort of "slack" moves that AlphaGo and Leela Zero tend to make once the winner is more or less decided.
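For a rough sense of the shape of that objective (not KataGo's actual code or constants; the real score utility is a more carefully tuned soft function), something like:

```python
import math

def toy_utility(win_prob, expected_score_diff, score_weight=0.05):
    """Mostly win/loss, nudged slightly by score margin."""
    win_term = 2.0 * win_prob - 1.0  # -1 for a sure loss, +1 for a sure win
    # Squashed so a huge margin can't outweigh actually winning.
    score_term = math.tanh(expected_score_diff / 20.0)
    return win_term + score_weight * score_term
```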

Replies from: ChristianKl
comment by ChristianKl · 2022-11-07T23:37:12.816Z · LW(p) · GW(p)

The problem here seems to be that it's not preferring to win by more under area rules. If it preferred to win by more points under area rules, it would capture all the stones before passing. It doesn't do that once it thinks it has enough points to win anyway under area rules.

This attack is basically about giving KataGo the impression that it has enough points anyway and doesn't need to capture stones to win. 

Likely the heuristic of epsilon times score difference does not reward getting more points over passing, but it does reward playing a move that's worth more points over a move that's worth less.

Replies from: gjm
comment by gjm · 2022-11-08T01:25:44.420Z · LW(p) · GW(p)

I'm not sure I understand. With any rules that allow the removal of dead stones, there is no advantage to capturing them. (With territory-scoring rules, capturing them makes you worse off. With area-scoring rules, capturing them makes no difference to the score.) And with rules that don't allow the removal of dead stones, white is losing outright (and therefore needs to capture those stones even if it's only winning versus losing that matters). How would caring more about score make KG more inclined to bother capturing the stones?

Replies from: ChristianKl
comment by ChristianKl · 2022-11-08T11:21:44.145Z · LW(p) · GW(p)

With area-scoring rules that don't allow the removal of dead stones in normal training games, KataGo has to decide whether it can already pass or whether it should go through the work of capturing any remaining stones. I let KataGo play one training game, and it looked to me like its default strategy in games is not to capture all the stones but only enough to win by a sufficient margin.

It doesn't have a habit of "always capturing all the stones to get maximum score under area rules". If it had that habit, I don't think it would show this failure case.

Replies from: gjm
comment by gjm · 2022-11-08T12:46:04.594Z · LW(p) · GW(p)

In training games I think the rules it's using do allow the removal of dead stones. If it chooses not to remove them it isn't because it's not caring about points it would have gained by removing them, it's because it doesn't think it would gain any points by removing them.

There is no possible habit of "always capturing all the stones to get maximum score under area rules". Even under area rules you don't get more points for capturing the stones (unless the stones are not actually dead according to the rules you're using, or in human games according to negotiation with the opponent).

What am I missing?

Replies from: ChristianKl
comment by ChristianKl · 2022-11-08T19:03:31.846Z · LW(p) · GW(p)

I think that currently, under area-scoring rules, KataGo doesn't capture all the stones that would be dead by human convention but are not dead by KataGo's rules, provided capturing them isn't necessary to win the game.

Replies from: gjm
comment by gjm · 2022-11-08T19:53:45.757Z · LW(p) · GW(p)

That's correct, at least roughly -- the important difference is that it's not "isn't necessary to win the game" but "doesn't make any difference to the outcome, including score difference" -- but I don't see what it has to do with the more specific thing you said above:

The problem seems to be that it's not preferring to win by more under area rules.

KataGo does prefer to win by more, whatever rules it's playing under; a stronger preference for winning by more would not (so far as I can see) make any difference to its play in positions like the ones reached by the adversarial agent; KataGo does not generally think "that it has enough points anyway and doesn't need to capture stones to win" and even if it did that wouldn't make the difference between playing on and passing in this situation.

Unless, again, I'm missing something, but we seem to be having some sort of communication difficulty because nothing you write seems to me responsive to what I'm saying (and quite possibly it feels the same way to you, with roles reversed).

What makes you believe that KataGo is "not preferring to win by more under area rules"?

comment by Noosphere89 (sharmake-farah) · 2022-11-03T17:40:43.099Z · LW(p) · GW(p)

Yeah, this is burying the lede here.

However, there isn't a platonic form of Go rules, so what rules you make really matters.

Replies from: ChristianKl
comment by ChristianKl · 2022-11-03T18:49:43.522Z · LW(p) · GW(p)

Yes, there are multiple rule sets. Under all of those that humans use to score their games, KataGo wins in the examples.

As they put it on the linked website:

We score the game under Tromp-Taylor rules as the rulesets supported by KGS cannot be automatically evaluated.

It's complex to automatically evaluate Go positions according to the rules that humans use. That's why people in the computer Go community invented their own rules, the Tromp-Taylor rules, to make positions easier to evaluate.
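For reference, raw Tromp-Taylor area counting is simple enough to automate in a few lines, which is exactly the appeal: every stone counts for its owner, and an empty region counts for a player only if it reaches stones of that colour alone. A minimal sketch (the board encoding here is made up for illustration, not any engine's real representation):

```python
def tromp_taylor_score(board):
    """Raw Tromp-Taylor area counting for a square board of 'B', 'W', or None cells."""
    size = len(board)
    score = {'B': 0, 'W': 0}
    visited = set()
    for r in range(size):
        for c in range(size):
            stone = board[r][c]
            if stone in score:
                score[stone] += 1  # every stone counts; "dead" stones are not removed
            elif (r, c) not in visited:
                # Flood-fill this empty region and record which colours it reaches.
                region, borders, stack = 0, set(), [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in visited:
                        continue
                    visited.add((y, x))
                    region += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < size and 0 <= nx < size:
                            neighbour = board[ny][nx]
                            if neighbour is None:
                                stack.append((ny, nx))
                            else:
                                borders.add(neighbour)
                if len(borders) == 1:  # region reaches only one colour
                    score[borders.pop()] += region
    return score['B'], score['W']

print(tromp_taylor_score([['B', None], [None, None]]))  # (4, 0) on a toy 2x2 board
```

The rules humans use require judging which stones are dead; a scorer like the one above never has to, which is why stones a human would call dead can swing the automated result.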

Given that KataGo's target audience wasn't computer bots, the KataGo developers went to the trouble of modifying the Tromp-Taylor rules to be more like the rulesets that humans use to score their games, and then used the new scoring algorithm to train KataGo.

KataGo's developers put effort into aligning KataGo with the desires of human users, and it pays off: in the scenarios the paper lists, KataGo behaves the way humans would want it to behave instead of playing optimally according to Tromp-Taylor rules.

We see this in a lot of alignment problems. The metrics that are easy for computers to use and score are often not what humans care about. The task of alignment is about how to get our AI not to Goodhart on the easy metric but to focus on what we care about.

It would have been easier to create KataGo in a way that wins in the examples of the paper than to go through the effort of making KataGo behave the way it does in the examples.

comment by paulfchristiano · 2022-12-08T00:12:50.049Z · LW(p) · GW(p)

It looks like there is a new version of the attack, which wins against a version of KataGo that does not pass and that uses enough search to be handily superhuman (though much less than would typically be used in practice).

Looking at the first game here, it seems like the adversary causes KataGo to make a very serious blunder. I think this addresses the concern about winning on a technicality raised in other comments here.

It's still theoretically unsurprising that self-play is exploitable, but I think it's nontrivial and interesting that a neural network at this quality of play is making such severe errors. I also think that many ML researchers would be surprised by the quality of this attack. (Indeed, even after the paper came out I expect that many readers thought it would not be possible to make a convincing attack without relying on technicalities or a version of the policy with extremely minimal search.)

comment by Tony Wang (tw) · 2022-11-04T03:57:51.017Z · LW(p) · GW(p)

One of the authors of the paper here. Really glad to see so much discussion of our work! Just want to help clarify the Go rules situation (which in hindsight we could've done a better job explaining) and my own interpretation of our results.

We forked the KataGo source code (github.com/HumanCompatibleAI/KataGo-custom) and trained our adversary using the same rules that KataGo was trained on.[1] So while our current adversary wins via a technicality, it was a technicality that KataGo was trained to be aware of. Indeed, KataGo is able to recognize that passing would result in a forced win by our adversary, but given a low tree-search budget it does not have the foresight to avoid this. As evhub noted in another comment [LW(p) · GW(p)] on this post, increasing the tree-search budget solves this issue.

So TL;DR I do believe we have a genuine exploit of the KataGo policy network, triggering a failure that it was trained to avoid.

Additionally, the project is still ongoing and we are working on attacks that are adversarial in nature but win via other means (i.e. no weird rule technicalities). There are some promising preliminary results here, which makes me think that the current exploit is not just a one-off but evidence of something more general.[2]

  1. ^

    To be more precise, KataGo is trained with various different rulesets, and the one we happen to attack with is just one of them.

  2. ^

    Indeed the main creator of KataGo pointed out to us that humans have actually figured out ways to exploit AZ-type agents (link).

Replies from: gjm
comment by gjm · 2022-11-05T00:38:15.831Z · LW(p) · GW(p)

Could you clarify "it was a technicality that KataGo was trained to be aware of"?

My understanding of the situation, which could be wrong:

KataGo's training is done under a ruleset where a white territory containing a few scattered black stones that would not be able to live if the game were played out is credited to white.

KataGo knows (if playing under, say, unmodified Tromp-Taylor rules) that that white territory will not be credited to white and so it will lose if two successive passes happen. But (so to speak) its intuition has been trained in a way that neglects that, so it needs to reason it out explicitly to figure that out.

I wouldn't say that the technicality is one KataGo was trained to be aware of. It's one KataGo was programmed to be aware of, so that a little bit of searching enables it not to succumb.

But you're saying that KataGo's policy network was "trained to avoid" this situation; in what sense is that true? Is one of the things I've said above incorrect?

Replies from: tw
comment by Tony Wang (tw) · 2022-11-06T03:58:28.267Z · LW(p) · GW(p)

KataGo's training is done under a ruleset where a white territory containing a few scattered black stones that would not be able to live if the game were played out is credited to white.

I don't think this statement is correct. Let me try to give some more information on how KataGo is trained.

Firstly, KataGo's neural network is trained to play with various different rulesets. These rulesets are passed as features to the neural network (see appendix A.1 of the original KataGo paper or the KataGo source code). So KataGo's neural network has knowledge of what ruleset KataGo is playing under.

Secondly, none of the area-scoring-based rulesets (of which modified and unmodified Tromp-Taylor rules are special instances) that KataGo has ever supported[1] would report a win for the victim for the sample games shown in Figure 1 of our paper. This is because KataGo only ignores stones a human would consider dead if there is no mathematically possible way for them to live, even if given infinite consecutive moves (i.e. the part of the board that a human would judge as belonging to the victim in the sample games is not "pass-alive").

Finally, due to the nature of MCTS-based training, what KataGo knows is precisely what KataGo's neural network is trained to emulate. This is because the neural network is trained to imitate the behavior of the neural network + tree-search. So if KataGo exhibits some behavior with tree-search enabled, its neural network has been trained to emulate that behavior.
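A rough sketch of what that imitation target looks like in an AlphaZero/KataGo-style loop (run_mcts and encode are hypothetical placeholders, not KataGo's real training code, which has many more loss terms):

```python
import torch.nn.functional as F

def policy_distillation_step(net, optimizer, game_state):
    """Toy update: push the raw policy toward the search-improved policy."""
    visit_counts = run_mcts(net, game_state)            # hypothetical: search guided by the current net
    search_policy = visit_counts / visit_counts.sum()   # normalized visit counts = policy target

    logits, _value = net(encode(game_state))            # hypothetical input encoding
    # Cross-entropy against the soft search target; the value head is trained
    # toward the eventual game outcome (omitted here).
    loss = -(search_policy * F.log_softmax(logits, dim=-1)).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

So whatever behaviour search produces in self-play is, over time, what the raw network is trained to reproduce.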

I hope this clears some things up. Do let me know if any further details would be helpful!

  1. ^

    Look for "Area" on the linked webpages to see details of area-scoring rulesets.

Replies from: gjm
comment by gjm · 2022-11-06T12:35:31.772Z · LW(p) · GW(p)

"Firstly": Yes, I oversimplified. (Deliberately, as it happens :-).) But every version of the rules that KataGo has used in its training games, IIUC, has had the feature that players are not required to capture enemy stones in territory surrounded by a pass-alive group.

I agree that in your example the white stones surrounding the big white territory are not pass-alive, so it would not be correct to say that in KG's training this particular territory would have been assessed as winning for white.

But is it right to say that it was "trained to be aware" of this technicality? That's not so clear to me. (I don't mean that it isn't clear what happened; I mean it isn't clear how best to describe it.) It was trained in a way that could in principle teach it about this technicality. But it wasn't trained in a way that deliberately tried to expose it to that technicality so it could learn, and it seems possible that positions of the type exploited by your adversary are rare enough in real training data that it never had much opportunity to learn about the technicality.

(To be clear, I am not claiming to know that that's actually so. Perhaps it had plenty of opportunity, in some sense, but it failed to learn it somehow.)

If you define "what KataGo was trained to know" to include everything that was the case during its training, then I agree that what KataGo actually knows equals what it was trained to know. But even if you define things that way, it isn't true that what KataGo actually knows equals what its "intuition" has learned: if there are things its intuition (i.e., its neural network) has failed to learn, it may still be true that KataGo knows them.

I think the (technical) lostness of the positions your adversary gets low-visits KataGo into is an example of this. KataGo's neural network has not learned to see these positions as lost, which is either a bug or a feature depending on what you think KataGo is really trying to do; but if you run KataGo with a reasonable amount of searching, then as soon as it overcomes its intuition enough to explicitly ask itself "what happens if I pass here and the opponent passes too?", it answers "yikes, I lose" and correctly decides not to do that.

Here's an analogy that I think is reasonably precise. (It's a rather gruesome analogy, for which I apologize. It may also be factually wrong, since it makes some assumptions about what people will instinctively do in a circumstance I have never actually seen a person in.) Consider a human being who drives a car, put into an environment where the road is strewn with moderately realistic models of human children. Most likely they will (explicitly or not) think "ok, these things all over the road that look like human children are actually something else, so I don't need to be quite so careful about them" and if 0.1% of the models are in fact real human children and the driver is tired enough to be operating mostly on instinct, sooner or later they will hit one.

If the driver is sufficiently alert, they will (one might hope, anyway) notice the signs of life in 0.1% of the child-looking-things and start explicitly checking. Then (one might hope, anyway) they will drive in a way that enables them not to hit any of the real children.

Our hypothetical driver was trained in an environment where a road strewn with fake children and very occasional real children is going to lead to injured children if you treat the fakes as fakes: the laws of physics, the nature of human bodies, and for that matter the law, weren't any different there. But they weren't particularly trained for this weird situation where the environment is trying to confuse you in this way. (Likewise, KataGo was trained in an environment where a large territory containing scattered very dead enemy stones sometimes means that passing would lose you the game; but it wasn't particularly trained for that weird situation.)

Our hypothetical driver, if sufficiently attentive, will notice that something is even weirder than it initially looks, and take sufficient care not to hit anyone. (Likewise, KataGo with a reasonable number of visits will explicitly ask itself the question "what happens if we both pass here", see the answer, and avoid doing that.)

But our hypothetical driver's immediate intuition may not notice exactly what is going on, which may lead to disaster. (Likewise, KataGo with very few visits is relying on intuition to tell it whether it needs to consider passing as an action its opponent might take in this position, and its intuition says no, so it doesn't consider it, which may lead to disaster.)

Does the possibility (assuming it is in fact possible) of this gruesome hypothetical mean that there's something wrong with how we're training drivers? I don't think so. We could certainly train drivers in a way that makes them less susceptible to this attack. If even a small fraction of driving lessons and tests were done in a situation with lots of model people and occasional real ones, everyone would learn a higher level of paranoid caution in these situations, and that would suffice. (The most likely way for my gruesome example to be unrealistic is that maybe everyone would already be sufficiently paranoid in this weird situation.) But this just isn't an important enough "problem" to be worth devoting training to, and if we train drivers to drive in a way optimized for not hitting people going out of their way not to be visible to the driver then the effect is probably to make drivers less effective at noticing other, more common, things that could cause accidents, or else to make them drive really slowly all of the time (which would reduce accidents, but we have collectively decided not to prioritize that so much, because otherwise our speed limits would be lower everywhere).

Similarly, it doesn't seem to me that your adversarial attack indicates anything wrong with how KataGo is trained; it could probably be trained in ways that would make it less vulnerable to your attack, but only at the cost of using more of its neural network for spotting this kind of nonsense and therefore being weaker overall (~ human drivers making more mistakes of other kinds because they're focusing on places where quasi-suicidal people could be lurking) or just being extra-cautious about passing and therefore more annoying to human opponents for no actual gain in strength in realistic situations (~ human drivers driving more slowly all the time).

(From the AI-alignment perspective, I quite like ChristianKl's take on this: KataGo, even though trained in an environment that in some sense "should" have taught it to avoid passing in some totally-decided positions, has learned to pass there anyway, thus being more friendly to actual human beings at the cost of being a little more exploitable. Of course, depending on just how you make the analogy with AI-alignment scenarios, it doesn't have to look so positive! But I do think it's interesting that the actual alignment effect in this case is a beneficial one: KG has ended up behaving in a way that suits humans.)

For the avoidance of doubt, I am not denying that you have successfully built an adversary that can exploit a limitation in KataGo's intuition. I am just not convinced that this should be regarded as a problem for KataGo. Its intuition isn't meant to solve all problems; if it could, it wouldn't need to be able to search.

comment by MathiasKB (MathiasKirkBonde) · 2022-11-03T15:28:23.608Z · LW(p) · GW(p)

As someone who plays a lot of Go, this result looks very suspicious to me. To me it looks like the primary reason this attack works is an artifact of the automatic scoring system used in the attack. I don't think this attack would be replicable in other games, or even against a KataGo trained on a correct implementation.

In the example included on the website, KataGo (White) is passing because it correctly identifies the adversary's (Black) stones as dead meaning the entire outside would be its territory. Playing any move in KataGo's position would gain no points (and lose a point under Japanese scoring rules), so KataGo passes.

The game then ends and the automatic scoring system designates the outside as undecided, granting white 0 points and giving black the win.

If the match were to be played between two human players, they would have to agree whether the outside territory belongs to white or not. If black were to claim their outside stones are alive the game would continue until both players pass and agree about the status of all territory (see 'disputes' in the AGA ruleset).

But in the adversarial attack, the game ends after the pass and black gets the win due to the automatic scoring system deciding the outcome. But the only reason that KataGo passed is that it correctly inferred that it was in a winning position with no way to increase its winning probability! Claiming that to be a successful adversarial attack rings a bit hollow to me.

I wouldn't conclude anything from this attack, other than that Go is a game with a lot of edge-cases that need to be correctly handled.

EDIT: I just noticed the authors address this on the website, but I still think this significantly diminishes the 'impressiveness' of the adversarial attack. I don't know the exact ruleset KataGo is trained under, but unless it's the exact same as the ruleset used to evaluate the adversarial attack, the attack only works due to KataGo playing to win a different game than the adversary.

Replies from: evhub, currymj
comment by evhub · 2022-11-03T22:26:32.151Z · LW(p) · GW(p)

Note that when given additional search, KataGo realizes that it will lose here and doesn't fall for the attack, which seems to suggest that it's not just a rules discrepancy.

Replies from: MathiasKirkBonde
comment by MathiasKB (MathiasKirkBonde) · 2022-11-04T11:46:20.174Z · LW(p) · GW(p)

Yeah, my original claim is wrong. It's clear that KataGo is just playing sub-optimally out of distribution, rather than being punished for playing optimally under a different ruleset than the one it's being evaluated under.

comment by Anonymous (currymj) · 2022-11-03T17:11:38.851Z · LW(p) · GW(p)

The KataGo paper says of its training, "Self-play games used Tromp-Taylor rules modified to not require capturing stones within pass-alive territory".

It sounds to me like this is the same scoring system as used in the adversarial attack paper, but I don't know enough about Go to be sure.

Replies from: MathiasKirkBonde, ChristianKl
comment by MathiasKB (MathiasKirkBonde) · 2022-11-03T19:24:59.662Z · LW(p) · GW(p)

No, the KataGo paper explicitly states at the start of page 4:

"Self play games used Tromp-Taylor rules [21] modified to not require capturing stones within pass-aliveterritory"

Had KataGo been trained on unmodified Tromp-Taylor rules, the attack would not have worked. The attack only works because the authors are having KataGo play under a different ruleset than it was trained on.

If I have the details right, I am honestly very confused about what the authors are trying to prove with this paper. Given that their Twitter announcement claimed the rulesets were the same, my best guess is simply that it was an oversight on their part.

(EDIT: this modification doesn't matter, the authors are right, I am wrong. See my comment below)

Replies from: MathiasKirkBonde
comment by MathiasKB (MathiasKirkBonde) · 2022-11-04T11:35:33.220Z · LW(p) · GW(p)

Actually, this modification shouldn't matter. After looking into the definition of pass-alive, I see that the dead stones in the adversarial attacks are clearly not pass-alive.

Under both unmodified and pass-alive-modified Tromp-Taylor rules, KataGo would lose here, and it's surprising that self-play left such a weakness.

The authors are definitely onto something, and my original claim that the attack only works due to kataGo being trained under a different rule-set is incorrect.

Replies from: gjm
comment by gjm · 2022-11-05T00:41:48.393Z · LW(p) · GW(p)

It doesn't matter whether the dead stones are pass-alive. It matters whether the white stones surrounding the territory they're in are pass-alive.

Having said that, in e.g. the first example position shown on the attackers' webpage those white stones are not pass-alive, so the situation isn't quite "this is a position in which KG would have won under its training conditions". But it is a position that superficially looks like such a position, which I think is relevant since what's going on with this attack is that they've found positions where KataGo's "snap judgement", when it gets little or no searching, gets it wrong.

comment by ChristianKl · 2022-11-03T17:18:24.237Z · LW(p) · GW(p)

No. KataGo loses in their examples because it doesn't capture stones within pass-alive territory. Its training rules are modified so it doesn't need to do that.

comment by Daniel Paleka · 2022-11-04T16:55:15.601Z · LW(p) · GW(p)

There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:

The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.

and

Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.

 

Reply by authors:

I can see why a MAS scholar would be unsurprised by this result. However, most ML experts we spoke to prior to this paper thought our attack would fail! We hope our results will motivate ML researchers to be more interested in the work on exploitability pioneered by MAS scholars.

...

Ultimately self-play continues to be a widely used method, with high-profile empirical successes such as AlphaZero and OpenAI Five. If even these success stories are so empirically vulnerable we think it's important for their limitations to become established common knowledge.

My understanding is that the authors' position is reasonable by mainstream ML community standards; in particular, there's nothing wrong with the original tweet thread. "Self-play is exploitable" is not new, but the practical demonstration of how easy it is to do the exploit in Go engines is a new and interesting result.

I hope the "Related work" section gets fixed as soon as possible, though.

The question is at which level of scientific standards do we want alignment-adjacent work to be on. There are good arguments for aiming to be much better than mainstream ML research (which is very bad at not rediscovering prior work) in this respect, since the mere existence of a parallel alignment research universe by default biases towards rediscovery.
 

  1. ^

    ...which I feel is not valid at all? If the policy was made aware of a weird rule in training, then it losing by this kind of rule is a valid adversarial example. For research purposes, it doesn't matter what the "real" rules of Go are.

    I don't play Go, so don't take this judgement for granted.

comment by gjm · 2022-11-10T03:01:36.036Z · LW(p) · GW(p)

Over on the Discord server where the creator of KataGo, and a bunch of other computer go people, hang out, there's what might be an interesting development. KataGo's creator says he's tried to reproduce the results and failed -- the adversary does indeed provoke misbehaviour from KataGo's policy network alone (which no one should be worried by; the job of the policy network is to propose moves for search, not to play well on its own, though it turns out it does happen to play quite well by the standards of puny humans) but even a teeny-tiny amount of search makes the losing-passing stop completely.

I'm able to replicate the raw policy being "vulnerable" and I observe that the adversarial positions tend to raise the probability of the raw policy on pass, but I'm unable to replicate KataGo wanting to pass with even a tiny amount of search in the positions where the SGFs published has it passing.

If I understand correctly, the version of KG used by the paper's authors is (1) not the latest and (2) modified in order to support their research. So it seems possible that (1) KG's behaviour has changed somehow or (2) the researchers' modifications actually broke something. Or that (3) there's just a bug in KG's play-a-match-against-yourself code, not introduced by the researchers, that makes the attack succeed in that context even when KG gets to do some searching. My impression is that #3 is viewed as quite plausible by KG's creator.

comment by TekhneMakre · 2022-11-03T13:51:20.998Z · LW(p) · GW(p)

Hm. Sounds like KataGo basically doesn't know the rules...

Replies from: sanxiyn
comment by sanxiyn · 2022-11-03T13:55:54.079Z · LW(p) · GW(p)

In a sense, yes, but what does it mean to know the rules? You are saying that KataGo both can beat professionals and doesn't know the rules. That's not impossible, but it's pretty weird.

Replies from: ChristianKl, TekhneMakre
comment by ChristianKl · 2022-11-03T17:31:50.759Z · LW(p) · GW(p)

It can win under the rule set that professionals use to play Go just fine but those are different rules than the rules this paper is about.

comment by TekhneMakre · 2022-11-03T16:14:47.113Z · LW(p) · GW(p)

I don't play Go but IIUC the "loss" that KataGo takes is a winning Go position, in the sense that a human looking at the final board position would say KataGo won, but the rules of this computer competition say that KataGo loses.

(edit: MathiasKB's comment explains the situation)

comment by Kai Salomaa (kai-salomaa) · 2022-11-20T20:00:44.927Z · LW(p) · GW(p)

If KataGo were trained using the (rather non-standard) Tromp-Taylor counting rules, it would not pass (one should never pass until all areas are filled). If the KataGo system claims to have an option to use Tromp-Taylor counting, it has probably been added as an ad-hoc extension to the system, and by mistake. At best the paper has found a programming bug in KataGo. Writing a paper about a programming bug would certainly sound less glamorous than claiming "we beat a professional-level Go AI".

comment by Kai Salomaa (kai-salomaa) · 2022-11-13T16:40:12.092Z · LW(p) · GW(p)

The below comments include a detailed discussion of different technical rule sets.

A short description: the adversarial policy relies on a rule under which, after two passes, everything that has not been explicitly killed is considered to be alive.

With such a rule it is clear that no one should pass until all territories have been filled so that there are only individual eyes left (single spaces that are completely surrounded).

comment by Zaxiquej · 2022-11-10T13:56:08.593Z · LW(p) · GW(p)

I doubt how applying "another rule" makes such research useful. Any human being who has learned to play Go, using any of the major rulesets (Chinese rules, Korean rules, etc.), would say that KataGo wins by a large margin in the cases where the authors think their AI wins. The authors claim that their AI wins under the Tromp-Taylor computer rules, but it seems they used some weird parameters. I have tested this with some modern computer-Go win-detection algorithms, and they all show that KataGo wins in the examples given by the authors. (I may try the default Tromp-Taylor as well later.)

Basically, what the model does is play nonsense and take a large disadvantage, while leaving some of its stones in the opponent's territory. As KataGo believes there is no need to make more plays, it just passes. The authors' AI then also passes and thus the game ends. Now the authors use a somewhat outdated computer-Go win-detection algorithm, which cannot correctly tell that the stones in the opponent's territory are, in fact, dead. It falsely treats those stones as alive, and thus concludes that the authors' AI "wins".

After all, KataGo is not designed for this specific ill-designed rule. I could declare a new rule that whoever gets five in a row wins; then my AI could definitely beat KataGo, as it would not prevent me from doing that.

Applying adversarial attacks is an interesting idea, though I am not convinced the authors' approach actually demonstrates its validity. This path is definitely worth exploring, but the authors' approach seems to be based on wrong assumptions and is thus hard to improve on in the future. Beating an AI from a perspective it is not designed for is not winning. I apologize for being harsh, but as I have played Go since childhood, I am a bit angry seeing the authors claim results that are not even wrong about Go. I also doubt whether any of the authors have played computer Go on any of the online platforms in recent years (otherwise, why use such an outdated model?).

Updated: I used to think Tromp-Taylor is outdated; I was wrong. It seems that KataGo does support Tromp-Taylor, though it is a bit old. I checked in more detail, and it seems that the authors are using weird parameters for it, so it cannot correctly tell whether some stones are dead or not - totally confusing. I have adjusted some parts of the paragraphs above.

Replies from: ChristianKl
comment by ChristianKl · 2022-11-12T14:31:35.085Z · LW(p) · GW(p)

See the other comments: KataGo's interpretation of area-counting rules does not include dead stone removal.

It doesn't remove them when you set it to Chinese rules, and it doesn't remove them when you set it to Tromp-Taylor rules.