Everything I ever needed to know, I learned from World of Warcraft: Goodhart’s law
post by Said Achmiz (SaidAchmiz) · 2018-05-03T16:33:50.002Z · LW · GW · 21 comments
This is a link post for https://blog.obormot.net/Everything-I-ever-needed-to-know-I-learned-from-World-of-Warcraft-Goodharts-law
Contents
The combat log
The damage meters
The problem
The Thing is valuable, but it’s not the only valuable thing
We can’t afford to specialize
Tunnel vision kills
Tunnel vision kills… other people
Optimization has a price
Everyone wants the chance to show off their skill
A good excuse for incompetence
This is the first in a series of posts about lessons from my experiences in World of Warcraft. I’ve been talking about this stuff for a long time—in forum comments, in IRC conversations, etc.—and this series is my attempt to make it all a bit more legible. I’ve added footnotes to explain some of the jargon, but if anything remains incomprehensible, let me know in the comments.
World of Warcraft, especially WoW raiding[1], is very much a game of numbers and details.
At first, in the very early days of WoW, people didn’t necessarily appreciate this very well, nor did they have any good way to use that fact even if they did appreciate it. (And—this bit is a tangent, but an interesting one—a lot of superstitions arose about how game mechanics worked, which abilities had which effects, what caused bosses[2] to do this or that, etc.—all the usual human responses to complex phenomena where discerning causation is hard.) And, more importantly and on-topic, there was no really good way to sift the good players from the bad; nor to improve one’s own performance.
This hampered progression. (“Progression” is a WoW term of art for “getting a boss down, getting better at doing so, and advancing to the next challenge; rinse, repeat”. Hence “progression raiding” meant “working on defeating the currently-not-yet-beaten challenges”.)
The combat log
One crucial feature of WoW is the combat log. This is a little window that appears at the bottom of your screen; into it, the game outputs lines that report everything that happens to or around your character. All damage done or taken, all hits taken or avoided, abilities used, etc., etc.—everything. This information is output in a specific format; and it can be parsed by the add-on system[3].
Naturally, then, people soon began writing add-ons that did parse it—parse it, and organize it, and present various statistical and aggregative transformations of that data in an easy-to-view form—which, importantly, could be viewed live, as one played.
Thus arose the category of add-ons known as “damage meters”.
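To make the idea concrete, here is a minimal sketch of what such an add-on does with the log. The comma-separated line format, the player names, and the numbers are all assumptions made up for illustration; this is not WoW's actual combat log format or add-on API, and real add-ons were not written in Python.

```python
from collections import defaultdict

# Hypothetical log format (NOT WoW's real one):
#   <seconds since pull>,<source>,<ability>,<damage>
SAMPLE_LOG = """\
0.0,Alice,Fireball,400
1.5,Bob,Arcane Shot,150
3.0,Alice,Fireball,420
4.0,Bob,Auto Shot,90
6.0,Alice,Frostbolt,300
"""


def parse_line(line):
    """Split one hypothetical log line into typed fields."""
    timestamp, source, ability, amount = line.split(",")
    return float(timestamp), source, ability, int(amount)


def damage_meter(log_text):
    """Return (dps per player, damage per player per ability)."""
    totals = defaultdict(int)
    by_ability = defaultdict(lambda: defaultdict(int))
    last_timestamp = 0.0
    for line in log_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        timestamp, source, ability, amount = parse_line(line)
        totals[source] += amount
        by_ability[source][ability] += amount
        last_timestamp = max(last_timestamp, timestamp)
    # Use the last timestamp as the fight duration; avoid dividing by zero.
    duration = last_timestamp or 1.0
    dps = {player: total / duration for player, total in totals.items()}
    return dps, by_ability


dps, by_ability = damage_meter(SAMPLE_LOG)
for player, value in sorted(dps.items(), key=lambda item: -item[1]):
    print(f"{player}: {value:.1f} DPS, abilities: {dict(by_ability[player])}")
```

The per-ability totals are the toy counterpart of the "ability breakdown" that raid leaders used to compare one player's choices against another's.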
The damage meters
Of course the “damage meters” showed other things as well—but viewing damage output was the most popular and exciting use. (What more exciting set of data is there, but one that shows how much you’re hurting the monsters, with your fireballs and the strikes of your sword?) The better class of damage-meter add-ons not only recorded this data, but also synchronized and verified it, by communicating between instances of themselves running on the clients of all the people in the raid.
Which meant that now you could have a centralized display of just what exactly everyone in the raid was doing, and how, and how well.
This was a great boon to raid leaders and raid guilds everywhere! You have a raid of 40 people, one of the DPSers[4] is incompetent, can’t DPS to save his life, or he’s AFK[5] half the time, or he's just messing around—who can tell?
With damage meters—everyone can tell.
Now, you could sift the bad from the good, the conscientious from the moochers and slackers, and so on. And more: someone’s not performing well but seems to be trying, but failing? Well, now you look at his ability breakdown[6], you compare it to that of the top DPSers, you see what the difference is and you say—no, Bob, don't use ability X in this situation, use ability Y, it does more damage.
The problem
All of this is fantastic. But… it immediately and predictably began to be subverted by Goodhart’s law.
To wit: if you are looking at the DPS meters but “maximize DPS” is not perfectly correlated with “kill the boss” (that being, of course, your goal)… then you have a problem.
This may be obvious enough; but it is also instructive to consider the specific ways that those things can come uncoupled. So, let me try and enumerate them.
The Thing is valuable, but it’s not the only valuable thing
There are other things that must be done, that are less glamorous, and may detract from doing the Thing, but each of which is a sine qua non of success. (In WoW, this might manifest as: the boss must be damaged, but also, adds must be kited—never mind what this means, know only that while a DPSer is doing that, he can’t be DPSing!)
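A toy model may help show how the metric and the goal come apart in this case. Everything here is an assumption invented for illustration: the boss health, the per-player damage, the fraction of damage a kiter can still deal, and the rule that an unkited add wipes the raid after a fixed time.

```python
def fight(boss_hp, dps_per_player, n_dpsers, kiter_assigned,
          kiter_dps_fraction=0.3, wipe_time_without_kiter=60.0):
    """Return (raid DPS shown on the meter, whether the boss dies).

    Hypothetical rules: if one DPSer kites the adds, that player deals
    only a fraction of normal damage, but the raid survives the fight.
    If nobody kites, the raid wipes after `wipe_time_without_kiter`
    seconds unless the boss is already dead by then.
    """
    if kiter_assigned:
        raid_dps = dps_per_player * (n_dpsers - 1 + kiter_dps_fraction)
    else:
        raid_dps = dps_per_player * n_dpsers
    time_to_kill = boss_hp / raid_dps
    boss_killed = kiter_assigned or time_to_kill <= wipe_time_without_kiter
    return raid_dps, boss_killed


greedy_dps, greedy_killed = fight(100_000, 100, 10, kiter_assigned=False)
careful_dps, careful_killed = fight(100_000, 100, 10, kiter_assigned=True)

print(f"nobody kites: {greedy_dps:.0f} DPS, boss killed: {greedy_killed}")
print(f"one kiter:    {careful_dps:.0f} DPS, boss killed: {careful_killed}")
```

Under these made-up numbers the raid with the higher meter reading is the one that fails, which is the uncoupling of "maximize DPS" from "kill the boss" in miniature.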
And yet more insidious elaborations on that possibility:
We can’t afford to specialize
What if, yes, this other thing must be done, but the maximally competent raid member must both do that thing and also the main thing? He won’t DPS as well as he could, but he also can't just not DPS, because then you fail and die; you can’t say “ok, just do the other thing and forget DPSing”. In other words, what if the secondary task isn’t just something you can put someone full-time on?
Outside of WoW, you might encounter this in, e.g., a software development context: suppose you’re measuring commits, but also documentation must be written—but you don’t have (nor can you afford to hire) a dedicated docs writer! (Similar examples abound.)
Then other possibilities:
Tunnel vision kills
The Thing is valuable, but tunnel-visioning on The Thing means that you will forget to focus on certain other things, the result being that you are horribly doomed somehow—this is an individual failing, but given rise to by the incentives of the singular metric (i.e., DPS maximization).
(The WoW example is: you have to DPS as hard as possible, but you also have to move out of the way when the boss does his “everyone in a 10 foot radius dies to horrible fire” ability.)
And yet more insidious versions of this one:
Tunnel vision kills… other people
Yes, if this tunnel-vision dooms you, personally, in a predictable and unavoidable fashion, then it is easy enough to say “do this other thing or else you will predictably also suffer on the singular metric” (the dead throw no fireballs).
But the real problem comes in when neglecting such a secondary duty creates externalities; or when the destructive effect of the neglect can be pushed off on someone else.
(In WoW: “I won’t run out of the fire and the healers can just heal me and I won’t die and I’ll do more DPS than those who don’t run out”; in another context, perhaps “I will neglect to comment my code, or to test it, or to do other maintenance tasks; these may be done for me by others, and meanwhile I will maximize my singular metric [commits]”.)
It’s almost always the case that you have the comparative advantage in doing the secondary thing that avoids the doom; if others have to pick up your slack there, it’ll be way less efficient, overall.
Optimization has a price
The Thing is valuable, yes; and it may be that there are ways to in fact increase your level of the Thing, really do increase it, but at a non-obvious cost that is borne by others. Yes, you are improving your effectiveness, but the price is that others, doing other things, now have to work harder, or waste effort on the consequences, etc.
(Many examples of this in WoW, such as “start DPSing before you’re supposed to, and risk the boss getting away from the tank and killing the raid”. In a general context, this is “taking risks, the consequences of which are dire, and the mitigation of which is a cost borne by others, not you”.)
Then this one is particularly subtle and may be hard to spot:
Everyone wants the chance to show off their skill
The Thing is valuable, and doing it well brings judgment of competence, and therefore status. There are roles within the project’s task allocation that naturally give greater opportunities to maximize your performance of the Thing, and therefore people seek out those roles preferentially—even when an optimal allocation of roles, by relative skill or appropriateness to task, would lead them to be placed in roles that do not let them do the most of the Thing.
(In WoW: if the most skilled hunter is needed to kite the add, but there are no “who kited the add best” meters, only damage meters… well, then maybe that most skilled hunter, when called upon to kite the add, says “Bob over there can kite the add better”—and as a result, because Bob actually is worse at that, the raid fails. In other contexts… well, many examples, of course; glory-seeking in project participation, etc.)
Of course there is also:
A good excuse for incompetence
This is the converse of the first scenario: if the Thing is valuable but you are bad at it, you might deliberately seek out roles in which there is an excuse for not performing it well (because the role’s primary purpose is something else)—despite the fact that, actually, the ideal person in your role also does the Thing (even if not as much as in a Thing-centered role).
[1] “Raid dungeons” were the most difficult challenges in the game—difficult enough to require up to 40 players to band together and cooperate, and cooperate effectively, in order to overcome them. “Raiding” refers to the work of defeating these challenges. Most of what I have to say involves raiding, because it was this part of WoW that—due to the requirement for effective group effort (and for other, related, reasons)—gave rise to the most interesting social patterns, the most illuminating group dynamics, etc.
[2] “Boss monsters” or “bosses” are the powerful computer-controlled opponents which players must defeat in order to receive the in-game rewards which are required to improve their characters’ capabilities. The most powerful and difficult-to-defeat bosses were, of course, raid bosses (see previous footnote).
[3] WoW allows players to create add-ons—programs that enhance the game’s user interface, add features, and so on. Many of these were very popular—downloaded and used by many other players—and some came to be considered necessary tools for successful raiding.
[4] “Damage Per Second”, i.e. doing damage to the boss, in order to kill it (this being the goal). Along with “tank” and “healer”, “DPS” is one of the three roles that a character might fulfill in a group or raid. A raid needed a certain number of people in each role, and all were critical to success.
[5] “Away From Keyboard”, i.e., not actually at the computer—which means, obviously, that his character is standing motionless, and not contributing to the raid’s efforts in the slightest.
[6] In other words: which of his character’s abilities he was using, in what proportion, etc. Is the mage casting Fireball, or Frostbolt, or Arcane Missile? Is the hunter using Arcane Shot, and if so, how often? By examining the record—recorded and shown by the damage meters—of a character’s ability usage, it was often very easy to determine who was playing optimally, and who was making mistakes.
21 comments
Comments sorted by top scores.
comment by Jameson Quinn (jameson-quinn) · 2020-01-15T18:55:40.409Z · LW(p) · GW(p)
This is a moderately interesting and well-written example, but did not really surprise me at any point. Worth having, but wouldn't be something I'd go out of my way to recommend.
comment by Said Achmiz (SaidAchmiz) · 2018-05-03T19:02:46.137Z · LW(p) · GW(p)
Of course I can’t resist playing devil’s advocate, even to myself.
It’s almost always the case that you have the comparative advantage in doing the secondary thing that avoids the doom; if others have to pick up your slack there, it’ll be way less efficient, overall.
But how true is this, really? And, more importantly, in what circumstances does it fail to be true?
This is no idle pedantry; identifying scenarios when this heuristic (on which the OP strongly insists) fails to hold can allow us to extract additional value, additional performance, out of a system (but at a price—see below).
A WoW example first (generalization to follow):
Supposing I am playing a DPS character in a raid, and we’re fighting a boss which periodically emits intense flames for a short period of time, doing terrible damage to all characters in the immediate vicinity. Being adjacent to (“in melee with”) the boss during this time causes a character to sustain tremendous damage; any character who remains within the area of effect for the duration of this periodic flame eruption—and does not receive massive, sustained healing—will die. Thus, the standard tactic for dealing with this threat is for all characters in melee to immediately run away when the flame eruption begins, and return when it subsides. (Obviously, characters who flee for their lives are doing absolutely no DPS at all in the meantime—which is the price of surviving to resume DPS once the danger passes.)
Suppose, however, I say to the healers in the raid: “If I stay put instead of running—if I make no effort to avoid this harm—I know that you can heal me, and prevent my death. It will be very difficult for you, of course—but consider the benefit: I will be able to keep DPSing non-stop! My impressive damage output will benefit greatly—will be even more substantial—by avoiding these periodic interruptions, and as a result, we will kill the boss more quickly [which is desirable for many reasons].”
The healers will protest, of course. But how seriously should I take their protests? (Suppose I am the raid leader, or a raid officer whose advice the raid leader will weigh heavily—in short, suppose I have the power to tell the raid’s healers whom to heal, and when.) Perhaps I have good reason to believe (based on past performance, for instance, or on some informed analysis, etc.) that they do, indeed, have the capacity to do this difficult thing which I am asking of them; their protests, then, are born out of a desire to do less work, to put in less effort, than they are capable of. After all, what is gained by some folks in the raid having had an easier time of it than they might’ve? Nothing… right?
This is the choice that seems to confront us: I (and possibly some other DPSers in the raid) can work harder (by avoiding the doom ourselves), or the healers can work harder (by healing us while we blithely DPS despite the flames raging around us). In the second scenario, the boss dies more quickly (which, again, is beneficial for many reasons). As far as consequences (i.e., outputs of our strategy) go, that seems to be the only difference. This seems to strongly suggest that the second strategy is superior. Certainly that’s true unless we can find a difference in costs of this strategy that outweighs the clear benefit (in that case, the “heal through the damage” approach will be inferior on net).
In terms of inputs, the difference is twofold: who shoulders the burden (the DPS or the healers), and how great the burden differential is. In the first approach (“DPSers save themselves”), the DPSers shoulder the burden, and the burden is relatively light; in the second approach (“DPSers stay put, and the healers heal them”), the healers shoulder the burden, and the burden is massive.
Well, but—so what? Is there any principle which says that every raid encounter must tax each raid member equally, that no one has an easier time of it than anyone else? There is not. Merely the fact that it’s easier for the DPSers to save themselves than for the healers to work extra-hard to save the DPSers, does not, by itself, recommend the first approach over the second. In other words, we’ve uncovered a cost to some individual raid members, but it’s not clear at all whether we ought to count this as a cost to the raid. The fact remains that with the second approach, we kill the boss more quickly, with no detrimental effects on our chances of success.
There are, in fact, two ways in which such a cost to individual raid members may translate into a cost to the raid. One is trivial and contingent, the other is fundamental.
The trivial way first: raid members are human. The healers may not appreciate being asked to work so much harder, just so that the DPSers can work a bit less hard, and “but this benefits the raid” may not suffice to persuade them. Even if we assume perfect rationality, perfect understanding of game theory, and perfect sublimation of one’s own selfish desires to the collective goal, we may still expect that the stress, the strain, the exertion of these great demands that we propose to place on the healers, will accelerate burn-out, will disincentivize prospective raid members from signing on as healers, and will have a myriad other effects that result from a particular position in an organization, a particular role in a collective effort, being very difficult and stressful.
(I say this effect is ‘trivial’, but of course it’s anything but—in practice, a good leader must well understand, and deftly manage, effects of this type. It’s just that this is a phenomenon which is not unique to the specific problem I describe in this comment; we may abstract away from it, and be left with an equally stark dilemma.)
The fundamental way: what we propose to do, in the “let the healers heal the DPSers while the latter take the damage” strategy, is to exploit reserves of capacity (in order to improve overall performance/effectiveness/output). We believe that the healers have reserves of capacity, which they are, currently, not exploiting. We say: let us exploit those reserves, converting them to useful output (damage done to the boss)—at a relatively low rate (it takes a great deal of additional effort on the healers’ part to effect what is certainly a non-trivial, but nonetheless not earth-shaking, amount of additional DPS), but every bit counts, and what good is that reserve capacity doing us, if it’s just (metaphorically) sitting there, unused?
You may begin to see the problems with this approach. There are several; let’s make them explicit:
(1) We may be mistaken about how much reserve capacity there is—or, equivalently, about how much capacity will be required in order to do what we propose. Perhaps the healers try to heal the DPSers through the damage, and fail; some or all of the DPSers die (this is bad!). Or, perhaps, while the healers are trying to heal the DPSers, they can’t quite devote the attention that their other tasks require, such as healing other folks in the raid (themselves included)—and as a result, some of the healers die (this is very bad!) or the tank dies (this is utterly catastrophic!).
This may play out in a subtler, and more insidious, way. Perhaps the healers do pull it off, and all the DPSers live, having taken no action to save themselves, and kept up their damage output all the while; but in order to accomplish this, the healers had to draw on reserves of resources that they would otherwise save for later in the engagement… this “borrowing against future capacity” may then come back to bite the raid in its collective ass.
This is a particularly insidious type of consequence because it may be quite difficult to diagnose. Suppose you do some difficult thing, which severely drains your future capacity; that future comes, and you are found wanting, with collective failure resulting. You say: “this is because I had to do that difficult thing!” “Was it really that?” the leader asks, “or perhaps you could’ve managed your resources better?” (And he may well be right. Or not. How can you tell? Suppose that leader is the one who proposed the “make you do the difficult thing” strategy in the first place. Might he not now resist admitting that he asked too much of you? Isn’t this a bias on his part? On the other hand, perhaps you are using this “I had to do that difficult thing” business as an excuse for sub-par performance on the long-term-resource-management front!)
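The "borrowing against future capacity" problem can be sketched with a toy resource model. The mana pool, the regeneration rate, and the per-phase costs below are invented numbers, not actual game values; the point is only that the extra load looks affordable in the early phases and fails late.

```python
def mana_over_fight(pool, regen_per_phase, baseline_cost, extra_cost, phases):
    """Track a hypothetical healer's mana at the end of each phase.

    Each phase the healer regenerates some mana (capped at the pool size),
    then spends the baseline healing cost plus any extra cost from healing
    DPSers who stayed in the fire. Tracking stops once mana hits zero or
    below, i.e. the healer has run dry.
    """
    mana = pool
    history = []
    for _ in range(phases):
        mana = min(pool, mana + regen_per_phase) - baseline_cost - extra_cost
        history.append(mana)
        if mana <= 0:
            break
    return history


safe = mana_over_fight(1000, 100, 200, extra_cost=0, phases=8)
risky = mana_over_fight(1000, 100, 200, extra_cost=150, phases=8)

print("DPSers run out of the fire:", safe)
print("healers heal through it:   ", risky)
```

In this sketch the "heal through it" plan gets through three phases without visible trouble and then runs dry in the fourth, long before the eight-phase fight ends. A failure that appears that late is easy to attribute to poor resource management rather than to the strategy itself.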
(2) Reserve capacity is not useless. The unexpected may occur. If we plan to use all of our capacity, what happens when things don’t go according to plan? “Exploit all possible reserve capacity to generate maximum performance” is a brittle strategy; it fails to be resilient. A good strategy is resilient both to the sorts of subjectively random, relatively small fluctuations in conditions that result from a variety of sources, and to large fluctuations caused by specific failure modes that may be easy to reduce in probability, but difficult to eliminate entirely.
The problem, of course, is that some challenges are so difficult that a brittle strategy is the only way to succeed. All the more resilient strategies are also going to be lower in maximum possible performance; these maxima may all fall below what is required to overcome a given challenge. In that case, it’s brittle or go home. How do you tell if that’s the situation you face? By choosing the brittle strategy, might you be neglecting opportunities to improve overall performance in many small, subtle, difficult, unexciting, robust ways? There is no easy answer.
Finally we are equipped to answer the question I asked at the beginning of this comment: in what scenarios does the heuristic “take personal responsibility for avoiding the negative consequences of your actions; do not foist it off on others” fail to generate optimal collective performance?
It seems to me that the answer, suggested by the analysis above, is this:
Ask two questions: first, whether shifting the responsibility for preventing the doom your actions cause improves the overall output (by whatever is the relevant metric of performance) of the group. If the answer is “yes”, then ask the second and more important (and more difficult) question: can you be confident that the remaining reserves of capacity—after using up the reserve capacity you propose to convert into effective output—are sufficient to maintain a suitable degree of resilience in your strategy? (You should, of course, work a suitable margin of error into your estimates of available reserve capacity and of required capacity to effect your proposed responsibility shift, taking into account various sources of uncertainty—about the task at hand, about the capabilities of the group members, etc.)
If the answer is “yes” also to the second question, then seriously consider violating the given heuristic.
In short: ask whether it is beneficial to shift the burden, and ask whether you can afford to shift the burden. If so, do it.
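As a rough sketch, the two questions can be written as a small decision function. The function name, its parameters, and the 20% default margin are hypothetical choices for illustration; in practice the hard part is estimating the inputs, not evaluating the rule.

```python
def should_shift_burden(output_gain, reserve_capacity, required_capacity,
                        resilience_buffer, error_margin=0.2):
    """Apply the two-question heuristic with pessimistic estimates.

    output_gain:       expected improvement in the group's metric
    reserve_capacity:  estimated spare capacity of those taking the burden
    required_capacity: estimated capacity the shifted burden will consume
    resilience_buffer: capacity that must remain for the unexpected
    error_margin:      fraction by which each estimate may be wrong
    """
    # Question 1: is it beneficial to shift the burden at all?
    if output_gain <= 0:
        return False
    # Question 2: can we afford it? Assume less reserve and more cost
    # than estimated, then check what is left against the buffer.
    pessimistic_reserve = reserve_capacity * (1 - error_margin)
    pessimistic_cost = required_capacity * (1 + error_margin)
    remaining = pessimistic_reserve - pessimistic_cost
    return remaining >= resilience_buffer


print(should_shift_burden(50, 1000, 400, 200))  # enough slack remains
print(should_shift_burden(50, 1000, 600, 200))  # too little slack remains
print(should_shift_burden(0, 1000, 100, 200))   # no benefit, so no shift
```

Applying the margin to both estimates is a deliberately conservative choice: it treats the reserve as smaller and the cost as larger than believed, which matches the advice to account for uncertainty about both the task and the people.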
Replies from: Vaniver
↑ comment by Vaniver · 2018-05-14T19:04:53.211Z · LW(p) · GW(p)
The healers may not appreciate being asked to work so much harder, just so that the DPSers can work a bit less hard, and “but this benefits the raid” may not suffice to persuade them.
I note also that healers are much less replaceable than DPS are--or at least, that was the way of things when I was playing WoW--and so the maintenance of healer morale is considerably more important for the guild than the maintenance of DPS morale, or potentially even finishing the encounter sooner or more successfully.
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2018-05-14T19:11:29.615Z · LW(p) · GW(p)
Very true! This is an excellent point. (Furthermore, healing is a more stressful role than DPSing, and healers are more prone to burnout—and raid healing takes more skill[1] than DPSing, so for these reasons they are certainly less replaceable; despite a raid needing far fewer healers than DPS, the supply of good healers is lower still.)
[1] Or, to be more precise: the combination of type of skill set + level of competence + disposition, that is required to play a good raid healer, is more rare than the corresponding things that are required to play a good DPSer.
comment by habryka (habryka4) · 2019-12-02T06:01:29.730Z · LW(p) · GW(p)
I've referenced this post a few times as a very good and concrete example of Goodhart's law, one that felt like it both illustrated the costs and showed the actual (usually good) reasons why people put metrics in place to begin with.
comment by Ben Pace (Benito) · 2019-12-02T18:57:26.003Z · LW(p) · GW(p)
Seconding Habryka.
comment by Kaj_Sotala · 2018-05-17T10:12:12.503Z · LW(p) · GW(p)
Curated this post for:
- Having a detailed empirical analysis of all the different ways by which measurements are useful, but also susceptible to Goodharting
comment by Said Achmiz (SaidAchmiz) · 2018-05-03T23:08:36.873Z · LW(p) · GW(p)
Uh, what happened to the images in this post? They showed up just fine when it was a draft, but now I don’t see them.
Replies from: Raemon, Raemon
↑ comment by Raemon · 2019-12-10T22:45:27.873Z · LW(p) · GW(p)
Reading this thread in the future, I find myself kinda wishing for ways comment threads like this could be auto-collapsed or resolved or something after reaching their conclusion.
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2019-12-10T23:06:43.146Z · LW(p) · GW(p)
Agreed, that would be a nice feature. The trick would be to have a good way of identifying such “now totally irrelevant, except for esoteric academic reasons” threads that wouldn’t run into any controversy or require non-trivial moderator attention.
Replies from: Raemon
↑ comment by Raemon · 2019-12-10T23:11:44.407Z · LW(p) · GW(p)
The latest version of the "offtopic comment" feature that the team had chatted about was a "collapse" feature, where some comments are just forcibly collapsed with a flag, and this is just a generic tool that admins and some authors have access to. Doesn't really require anything automatic, just, when you notice such a thread, you can close it. (It'd still appear in the comment list, just collapsed as if it had low karma, possibly with a reason displayed)
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2019-12-10T23:15:53.337Z · LW(p) · GW(p)
Yes, that is exactly the sort of thing I had in mind, which would clearly be open to all sorts of, perhaps not “abuse”, but at least—controversial application. It seems to me that it would be useful to differentiate such threads as this one we are discussing now, where nothing “on-topic” is really being discussed, and no one has nor could have any strong feelings about, etc. (This is not to say that the general-purpose tool you’re talking about would not also be useful—very plausibly it would.)
↑ comment by Raemon · 2018-05-03T23:22:37.535Z · LW(p) · GW(p)
Huh. Were you editing on greaterwrong or lesswrong, and if the latter, were you in Rich or Markdown editing mode?
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2018-05-03T23:24:19.182Z · LW(p) · GW(p)
The former. But this hasn’t happened before…
Edit: Also, the code for the images is still there when I edit the post…! They just don’t display…
Replies from: Raemon
↑ comment by Raemon · 2018-05-03T23:36:55.759Z · LW(p) · GW(p)
Interestingly, the code *isn't* there in the lesswrong markdown editor.
In the past month or so, when we added the markdown editor, we made some changes to how the markdown and Rich editors work (basically making sure that every time the post is saved, it updates both versions of the data to be in sync with each other). If the last time you made a post with images was a month+ ago that might be related.
(I think you said greaterwrong specifically saves what _it_ remembers you entering for markdown, so that you don't have to deal with the frustration of our markdown editor, say, preferring underscores to asterisks. Is it possible the code that you still see is a result of that saved version?)
Replies from: SaidAchmiz, clone of saturn
↑ comment by Said Achmiz (SaidAchmiz) · 2018-05-04T01:13:27.498Z · LW(p) · GW(p)
It seems that I have now fixed it by opening the post for editing (on GW) and re-saving it (without changing anything).
Replies from: Raemon
↑ comment by Raemon · 2018-05-17T12:12:12.844Z · LW(p) · GW(p)
Hmm – apparently this just happened again when Kaj moved the post to curated. Apologies, but could you try re-saving it on greaterwrong again? I haven't had time to fix the images bug yet.
Replies from: SaidAchmiz
↑ comment by Said Achmiz (SaidAchmiz) · 2018-05-17T13:19:42.828Z · LW(p) · GW(p)
Done.
↑ comment by clone of saturn · 2018-05-04T01:08:28.453Z · LW(p) · GW(p)
It looks like opening a markdown post in the rich editor causes all the images to disappear, which probably happened when Ben moved the post to frontpage.
Replies from: Raemon
comment by Ben Pace (Benito) · 2018-05-03T18:41:38.883Z · LW(p) · GW(p)
Promoted to frontpage.