Duels & D.Sci March 2022: Evaluation and Ruleset

post by aphyer · 2022-04-05T00:21:28.170Z · LW · GW · 12 comments

Contents

  RULESET
  CARD LIST 
  STRATEGY
  PVE LEADERBOARD
  PVP LEADERBOARD
  FEEDBACK REQUEST

This is a follow-up to last week's D&D.Sci scenario: if you intend to play that, and haven't done so yet, you should do so now before spoiling yourself.

There is a web interactive here where you can test your submission against your rival's deck (or against a couple other NPCs I added for your amusement, if you think yourself mighty enough to challenge those who wield the legendary power of the Sumerian Souvenirs).  

NOTE: Win rates in the interactive are Monte-Carlo-d with small sample sizes and should be taken as less accurate than the ones in the leaderboard.
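To give a feel for how much sample size matters here, the sketch below computes the standard error of a Monte Carlo win-rate estimate using the usual binomial approximation. The sample sizes are illustrative assumptions only; the post does not say how many games the interactive or the leaderboard actually simulate.

```python
import math


def winrate_standard_error(p: float, n_games: int) -> float:
    """Standard error of a win rate estimated from n_games independent simulated games."""
    return math.sqrt(p * (1.0 - p) / n_games)


def confidence_interval(p: float, n_games: int, z: float = 1.96) -> tuple:
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    se = winrate_standard_error(p, n_games)
    return (max(0.0, p - z * se), min(1.0, p + z * se))


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only to show the scaling.
    for n in (100, 10_000, 1_000_000):
        low, high = confidence_interval(0.75, n)
        print(f"{n:>9,} games: 75% estimate lies roughly in [{low:.3f}, {high:.3f}]")
```

The error shrinks with the square root of the number of games, so a hundredfold increase in simulated games buys roughly one extra decimal digit of accuracy.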

RULESET

Code is available here for those who are interested.

A game is played as follows:

CARD LIST
 

Cards available to you were the following:

 

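| Card Name | Mana Cost | Power | Alignment |
|---|---|---|---|
| Gentle Guard | 1 | 700 | Good |
| Lilac Lotus | 1 | 200* | Artefact |
| Patchy Pirate | 1 | 700 | Evil |
| Horrible Hooligan | 2 | 900 | Evil |
| Kindly Knight | 2 | 900 | Good |
| Sword of Shadows | 3 | 2000* | Artefact |
| Virtuous Vigilante | 3 | 1300 | Good |
| Bold Battalion | 4 | 600x* | Good |
| Murderous Minotaur | 4 | 1700 | Evil |
| Alessin, Adamant Angel | 5 | 1800* | Good |
| Dreadwing, Darkfire Dragon | 5 | 2500 | Evil |
| Evil Emperor [omitted for length] | 6 | 4500 | Evil |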

*Some cards have special abilities:

Congratulations here to: 

STRATEGY

There were three available 'archetypes': coherent decks that fit together and did something powerful.  All ended up with a roughly 75% winrate against your rival if optimized, but they performed differently against one another.

Good Tribal is based around Bold Battalion.  It tries to play lots of cheap Good creatures (Gentle Guard) in early turns, and then lots of Battalions to gain high Power.  The best build for this against your rival is very simple, with 6xGuard and 6xBattalion. This gets 75.4% winrate, the best available.

Sword Aggro is based around Sword of Shadows.  This time, your goal is to get cheap Evil creatures to wield Sword of Shadows, and then try to use the Sword's very high score for its cost to win on turn 3 or 4.  The best build for this against your rival is 3xPirate, 5xSword, and 4xAngel.  (The Angels make your lategame a little more reliable, and don't get in the way of your early game thanks to their ability).  This gets 74.6% winrate.

Lotus Ramp is based around Lilac Lotus.  Your goal is to play Lotuses in the early turns and use them to play out large late-game threats like the Emperor.  The best build for this against your rival is 5xLotus, 3xAngel, and 4xEmperor.  This gets 74.5% winrate.

When these decks as-written play against each other, their relative speeds lead to some rock-paper-scissors (where being slightly stronger and slower than your opponent advantages you as you end up with more power, but being much stronger and slower disadvantages you as you can lose before getting off the ground):

It is possible to build these decks differently - most noticeably, you could build more defensive Lotus Ramp decks (with more Angels or even Vigilantes, and fewer Emperors).  This improves the matchup against Sword Aggro and other fast decks substantially, but worsens the matchups against Good Tribal, other Lotus Ramp, and any other slow decks.

Particular congratulations are due here to gammagurke, who managed not only to identify these three archetypes but to actually give two of them exactly their correct names.  (Sadly, 'Evil Equip' doesn't quite work out as a name when the deck often contains more Good creatures than it does Evil ones).

I was expecting Good Tribal to be the simplest available deck, and for Sword Aggro and particularly Lotus Ramp to take the most work to optimize - it seems I was mistaken in this regard, and many players submitted Sword or Lotus-based decks while no-one submitted a Good-based deck.

 

PVE LEADERBOARD

Note: all winrates below were Monte-Carlo calculated rather than explicitly derived. 

Submitted decks were as follows:

abstractapplic: Sword Aggro (3xPirate, 3xHooligan, 4xSword, 2xAngel)

gammagurke: Lotus Ramp (5xLotus, 2xPirate, 3xAngel, 2xEmperor)

GuySrinivasan: Dragon Ramp (6xLotus, 4xDragon, 2xEmperor)

jsevillamol: Knight Aggro (1xPirate, 9xKnight, 1xBattalion, 1xEmperor)

Maxwell Peterson: Sword Aggro (6xPirate, 2xSword, 1xMinotaur, 3xAngel)

Measure: Lotus Midrange (3xLotus, 2xPirate, 1xSword, 1xVigilante, 2xAngel, 1xDragon, 2xEmperor)

Pablo Repetto: Lotus Aggro (3xLotus, 3xPirate, 2xSword, 4xAngel)

Yonge: Lotus Midrange (2xLotus, 1xPirate, 1xHooligan, 2xVigilante, 1xMinotaur, 2xAngel, 1xDragon, 2xEmperor)
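For anyone who wants to sanity-check the lists above, here is a small sketch that parses the `NxCard` notation, counts the cards, and computes an average mana cost from the card list. The short names and the helper functions are my own shorthand, not anything from the game code; every submitted list happens to total 12 cards, which the example checks.

```python
import re

# Mana costs from the card list in this post, keyed by the short names used in the deck lists.
MANA_COST = {
    "Guard": 1, "Lotus": 1, "Pirate": 1,
    "Hooligan": 2, "Knight": 2,
    "Sword": 3, "Vigilante": 3,
    "Battalion": 4, "Minotaur": 4,
    "Angel": 5, "Dragon": 5,
    "Emperor": 6,
}


def parse_deck(text: str) -> dict:
    """Turn '3xPirate, 4xSword' into {'Pirate': 3, 'Sword': 4}."""
    deck = {}
    for count, name in re.findall(r"(\d+)x(\w+)", text):
        deck[name] = deck.get(name, 0) + int(count)
    return deck


def deck_size(deck: dict) -> int:
    """Total number of cards in the deck."""
    return sum(deck.values())


def average_mana_cost(deck: dict) -> float:
    """Mean mana cost per card, a rough proxy for how fast a deck is."""
    return sum(MANA_COST[name] * count for name, count in deck.items()) / deck_size(deck)


if __name__ == "__main__":
    for player, text in [
        ("abstractapplic", "3xPirate, 3xHooligan, 4xSword, 2xAngel"),
        ("GuySrinivasan", "6xLotus, 4xDragon, 2xEmperor"),
    ]:
        deck = parse_deck(text)
        print(player, deck_size(deck), round(average_mana_cost(deck), 2))
```

Average mana cost is only a crude summary: it treats Lotus as a cheap card rather than as ramp, so it understates how top-heavy the Lotus decks are.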

| Player | Winrate |
|---|---|
| Optimal Play | 75.4% |
| abstractapplic | 68.46% |
| GuySrinivasan | 66.16% |
| Measure | 61.38% |
| Maxwell Peterson | 61.06% |
| gammagurke | 59.00% |
| Yonge | 57.76% |
| Pablo Repetto | 52.57% |
| Random Play | 40.5% |
| jsevillamol | 31.39% |

Congratulations to all submitters, in particular to abstractapplic (whose Sword Aggro build was fairly well-tuned despite the Hooligans not being quite optimal) and GuySrinivasan (who managed to place a very close second despite playing an archetype I didn't even think existed).

PVP LEADERBOARD

Note: all winrates below were Monte-Carlo calculated rather than explicitly derived.

Note: The commentary below should be considered non-final for a few days to give people time to point out that I've misread the decks they submitted/added up win percentages wrong/made other obvious mistakes.  If I have messed something like that up I'll have to recalculate, so don't count on victory/defeat until some more eyes have confirmed.

Submitted decks were as follows:

abstractapplic: Sword Aggro (3xPirate, 3xHooligan, 4xSword, 2xAngel)

gammagurke: Lotus Ramp (6xLotus, 2xAngel, 4xEmperor)

GuySrinivasan: Dragon Ramp (6xLotus, 1xMinotaur, 5xDragon)

jsevillamol: Knight Aggro (1xPirate, 9xKnight, 1xBattalion, 1xEmperor)

Maxwell Peterson: Sword Aggro (2xPirate, 4xHooligan, 2xSword, 4xAngel)

Measure: Sword Aggro with Emperors (4xPirate, 4xSword, 1xVigilante, 3xEmperor)

Pablo Repetto: Lotus Aggro (3xLotus, 3xPirate, 2xSword, 4xAngel)

Yonge: Lotus Midrange (2xLotus, 1xPirate, 1xHooligan, 2xVigilante, 1xMinotaur, 2xAngel, 1xDragon, 2xEmperor)

Standings were as follows:

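Each row gives that player's winrate against the opponent named in the column.

| Player | abstractapplic | gammagurke | Measure | Maxwell Peterson | Yonge | GuySrinivasan | Pablo Repetto | jsevillamol | Total Score |
|---|---|---|---|---|---|---|---|---|---|
| abstractapplic | 50% | 56.44% | 55.51% | 59.99% | 59.4% | 59.78% | 68.85% | 85.62% | 4.96 |
| gammagurke | 43.56% | 50% | 53.28% | 51.37% | 67.67% | 58.18% | 60.17% | 82.46% | 4.67 |
| Measure | 44.49% | 46.72% | 50% | 50.99% | 52.98% | 56.78% | 59.87% | 78.97% | 4.41 |
| Maxwell Peterson | 40.01% | 48.63% | 49.01% | 50% | 53.28% | 53.91% | 60.98% | 82.4% | 4.38 |
| Yonge | 40.6% | 32.33% | 47.02% | 46.72% | 50% | 51.45% | 56.43% | 74.96% | 4.00 |
| GuySrinivasan | 40.22% | 41.82% | 43.22% | 46.09% | 48.55% | 50% | 53.58% | 70.46% | 3.94 |
| Pablo Repetto | 31.15% | 39.83% | 40.13% | 39.02% | 43.57% | 46.42% | 50% | 71.15% | 3.61 |
| jsevillamol | 14.38% | 17.54% | 21.03% | 17.60% | 25.04% | 29.54% | 28.85% | 50% | 2.04 |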
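The text does not spell out how Total Score is computed, but the published values are consistent with summing each row's winrates as fractions, including the 50% mirror match. The sketch below checks that reading against the standings and also confirms that mirrored entries add up to 100%. The variable and function names are mine.

```python
PLAYERS = [
    "abstractapplic", "gammagurke", "Measure", "Maxwell Peterson",
    "Yonge", "GuySrinivasan", "Pablo Repetto", "jsevillamol",
]

# WINRATES[i][j] is player i's winrate (in percent) against player j, copied from the standings.
WINRATES = [
    [50, 56.44, 55.51, 59.99, 59.4, 59.78, 68.85, 85.62],
    [43.56, 50, 53.28, 51.37, 67.67, 58.18, 60.17, 82.46],
    [44.49, 46.72, 50, 50.99, 52.98, 56.78, 59.87, 78.97],
    [40.01, 48.63, 49.01, 50, 53.28, 53.91, 60.98, 82.4],
    [40.6, 32.33, 47.02, 46.72, 50, 51.45, 56.43, 74.96],
    [40.22, 41.82, 43.22, 46.09, 48.55, 50, 53.58, 70.46],
    [31.15, 39.83, 40.13, 39.02, 43.57, 46.42, 50, 71.15],
    [14.38, 17.54, 21.03, 17.60, 25.04, 29.54, 28.85, 50],
]

PUBLISHED_TOTALS = [4.96, 4.67, 4.41, 4.38, 4.00, 3.94, 3.61, 2.04]


def total_score(row: list) -> float:
    """Assumed scoring rule: sum of winrates across all opponents, as fractions."""
    return sum(row) / 100


def max_asymmetry(matrix: list) -> float:
    """Largest deviation of matrix[i][j] + matrix[j][i] from 100 percent."""
    n = len(matrix)
    return max(abs(matrix[i][j] + matrix[j][i] - 100) for i in range(n) for j in range(n))


if __name__ == "__main__":
    for name, row, published in zip(PLAYERS, WINRATES, PUBLISHED_TOTALS):
        print(f"{name:<17} computed {total_score(row):.2f}  published {published:.2f}")
    print("max asymmetry:", round(max_asymmetry(WINRATES), 6))
```

If you spot an entry that breaks either check, that is exactly the kind of arithmetic mistake worth reporting.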

gammagurke's Lotus Ramp deck was designed to try to prey on other midrange and ramp decks by going over the top with more Emperors - sadly, there were many Sword-based aggro decks and fewer midrange and ramp decks, making this otherwise-reasonable deck poorly positioned for the opponents it actually faced and pushing it back to second place.

Measure and abstractapplic both submitted solid Sword Aggro lists, but abstractapplic's was somewhat better-tuned and came out on top.

Congratulations abstractapplic!  Once you've figured out what theme/work* you want to request an upcoming scenario be based on, PM or comment and I'll try to get it to happen.  I can't promise it'll happen soon (it'll take some time to write one of these, I have at least one other scenario that'll likely get posted before it, and other people may also submit some), so you'll end up waiting most likely a few months.

*Ability to select a specific work is contingent on me being familiar with that work and thinking I can write a scenario based on it.

FEEDBACK REQUEST

As usual, I'm interested to hear feedback on what people thought of this scenario.  If you played it, what did you like and what did you not like?  If you might have played it but decided not to, what drove you away?  What would you like to see more of/less of in future?  Do you think the underlying data model was too complicated to decipher?  Or too simple to feel realistic?  Or both at once?

It also looks like we had a few new players.  Congratulations to you, and I hope you enjoyed the game - if you liked it, the sequence here contains past scenarios, and you can subscribe to that to get notifications when new ones are posted (I try to make sure this happens around once a month).

Thanks for playing!

12 comments

Comments sorted by top scores.

comment by abstractapplic · 2022-04-05T13:43:01.586Z · LW(p) · GW(p)

Reflections on my attempt:

My PvE approach, as I mentioned, was to copy the plan that worked best in a comparable game: train a model to predict deck success, feed it the target deck, then optimize the opposing deck for maximum success chance. I feel pretty good about how well this worked. If I'd allocated more time, I would have tried to figure out analytically why the local maxima I found worked (my model noticed Lotus Ramp as well as Sword Aggro but couldn't optimize it as competently for some reason), and/or tried multiple model types to see what they agree on (I used a GBT, which has high performance but doesn't extrapolate well).
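A rough sketch of the "optimize a deck against a predictor" step, for readers who want to see the shape of it. This is not the actual code used: `toy_predicted_winrate` is a made-up stand-in for a trained model, the one-letter card codes are card-name initials, and the search is plain hill climbing by single-card swaps.

```python
import random
from collections import Counter

CARDS = "GLPHKSVBMADE"  # one-letter codes: Guard, Lotus, Pirate, Hooligan, ...


def toy_predicted_winrate(deck: Counter) -> float:
    """Hypothetical scorer standing in for a trained model. NOT the real game."""
    pirates, swords = deck["P"], deck["S"]
    return 0.4 + 0.01 * (pirates + swords) + 0.02 * min(pirates, swords)


def hill_climb(score, start, steps: int = 5000, seed: int = 0):
    """Swap one card at a time, keeping a swap only if the predicted winrate improves."""
    rng = random.Random(seed)
    best = Counter(start)
    best_score = score(best)
    for _ in range(steps):
        candidate = Counter(best)
        removed = rng.choice(list(candidate.elements()))
        added = rng.choice(CARDS)
        candidate[removed] -= 1
        candidate[added] += 1
        candidate = +candidate  # drop cards whose count fell to zero
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    deck, predicted = hill_climb(toy_predicted_winrate, "G" * 12)
    print(dict(deck), round(predicted, 3))
```

Because only strict improvements are accepted, a search like this stops at whatever local maximum it reaches first, which is one plausible reason a model might find one archetype but fail to tune another.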

My PvP approach was a lot more scattershot. After some hilariously bad attempts to get an uncounterable deck by having two decks repeatedly optimize against each other, I decided to just recycle my PvE deck and hope I happened to win the rock-paper-scissors game. As it happened, Fate smiled on me, but if there had been any Good Tribal decks in play I wouldn't be looking quite so clever right now.

Reflections on the challenge:

This was fun. I particularly like that it was superficially similar to Defenders of the Storm, while having profoundly different mechanics: I came in expecting another game that's mostly about counters, and instead got a game that's mostly about synergy. And, as everyone (including me) has already said, the premise and writing are hilarious.

My only problem with this game was the extra difficulty associated with approaching it analytically if you don't happen to know about mtg-style card games (I remember looking at the comments on the main post late last week and wondering what a 'ramp' was). However, this issue is mitigated by the facts that:

1. It (presumably) gave card game fans a chance to practice balancing-priors-against-new-evidence skills and not just ML/analysis skills.
2. It's not unreasonable for card game knowledge to help pick cards in a game centered on card games.
3. I won despite lacking this background.

Replies from: aphyer
comment by aphyer · 2022-04-05T15:22:22.726Z · LW(p) · GW(p)

> My only problem with this game was the extra difficulty associated with approaching it analytically if you don't happen to know about mtg-style card games (I remember looking at the comments on the main post late last week and wondering what a 'ramp' was).

 

I actually considered this to be mostly a feature rather than a bug?  I think real-world data science problems also benefit from having some knowledge of the domain in question.

It's possible to apply data science techniques to a completely unfamiliar domain - you don't need to know anything about card games to notice that 'P' and 'S' showing up together, or 'L' and 'E' showing up together, improves your payoff function, and to try submitting an answer that has lots of 'P's and lots of 'S's in it.
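As a sketch of the domain-free version of this, the snippet below computes the win rate of decks containing each pair of card letters. The record format (a 12-letter deck string plus a win flag) and the six toy records are invented for illustration; the real dataset's layout may differ.

```python
from collections import defaultdict
from itertools import combinations


def pair_winrates(records):
    """Map each pair of distinct card letters to the win rate of decks containing both."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for deck, won in records:
        for pair in combinations(sorted(set(deck)), 2):
            games[pair] += 1
            wins[pair] += int(won)
    return {pair: wins[pair] / games[pair] for pair in games}


# Invented toy records, NOT taken from the scenario's dataset.
TOY_RECORDS = [
    ("PPPPSSSSAAAA", True),
    ("PPPSSSGGGKKK", True),
    ("PPPPPPGGGGGG", False),
    ("SSSSSSKKKKKK", False),
    ("LLLLEEEEAAAA", True),
    ("LLLLLLGGGGGG", False),
]

if __name__ == "__main__":
    rates = pair_winrates(TOY_RECORDS)
    for pair in [("P", "S"), ("G", "P"), ("E", "L")]:
        print(pair, rates[pair])
```

On real data you would also want the number of games behind each pair, since rare pairs produce noisy rates.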

But if you have some level of domain knowledge, you have more ability to guess what kind of patterns are likely to appear, and to extrapolate details.  When you see that 'L' works well with 'D', 'E' and 'A' that doesn't tell you much else: when you notice that 'L' works well with all three of the cards that have long and bombastic names, that lets you start guessing things like 'there are some kind of costs to playing these powerful cards, and L helps you pay those costs to play them'.  This lets you guess in turn things like 'adding more Emperors might make the deck stronger against other decks like itself but weaker against faster decks' that would be very hard to pull out of the data directly without some amount of domain knowledge to help.

This is part of the reason why I gave the cards names instead of just saying 'Card ID 1', 'Card ID 2', etc.  (The other part is of course to sound cooler :P)

Replies from: abstractapplic
comment by abstractapplic · 2022-04-05T17:06:41.256Z · LW(p) · GW(p)

It's definitely a feature as well; the exact tradeoff comes down to personal taste.

comment by gammagurke · 2022-04-05T05:04:48.567Z · LW(p) · GW(p)

This was the most fun I've had analysing data and writing code probably ever. Unfortunately I missed the previous editions, but I'm looking forward to the next one. If I had played the previous ones, I might have steered further away from trying to explain effects by complex interactions that are common in this sort of card game in favour of simpler interactions that are more likely to be put into this sort of ruleset (for example things having one combat-relevant stat instead of two). 

Because this was such a blast to play around with, I don't really have any specific things I would change. The way the decks in the dataset were generated, along with the dataset's size, made it easy to check how some specific decktype did in general, but hard to check how it did against some other specific decktype, which seems like the perfect middle ground.

I really appreciate the work that was put into the theme, letting me roleplay as Kaiba playing children's card games and trying to win using the power of arcane computer analysis tools.

comment by SarahSrinivasan (GuySrinivasan) · 2022-04-05T02:04:23.801Z · LW(p) · GW(p)

I enjoyed this one a lot! It was simple enough to inspire educated guesses about how the system worked, but complex enough that probably no exact values would be forthcoming. Intuitions from similar games ported pretty well. I was "coerced" into writing some fun code. Thank you for explicitly stating the random nature of the dataset's decks, I think this particular challenge would have been worse if we needed to try to model [something] about that, too.

comment by Pablo Repetto (pablo-repetto-1) · 2022-04-05T21:14:45.068Z · LW(p) · GW(p)

First of all: thank you for setting up the problem, I had lots of fun!

This one reminded me a lot of D&D.Sci 1, in that the main difficulty I encountered was the curse of dimensionality. The space had lots of dimensions so I was data-starved when considering complex hypotheses (performance of individual decks, for instance). Contrast with Voyages of the Grey Swan, where the main difficulty is that broad chunks of the data are explicitly censored.

I also noticed that I'm getting less out of active competitions than I was from the archived posts. I'm so concerned with trying to win that I don't write about and share my process, which I believe is a big mistake. Carefully composed posts have helped me get my ideas in order, and I think they were far more interesting to observers. So I'll step back from active competitions for a bit. I'll probably do the research summaries I promised, "Monster Carcass Auction", "Earwax" (maybe?), then come back to active competitions.

comment by Measure · 2022-04-05T15:30:26.969Z · LW(p) · GW(p)

For my PvE approach, I filtered the dataset for decks similar to our opponent's deck (the rule I used was "decks with at least eight different card types"), and looked at which single card inclusion (zero vs. one-or-more) yielded the best win rate. Then I further filtered for matchups that included that card and looked at adding an additional copy of that card vs. adding a different card. I repeated this process until I had filled eight or so slots (I think I had AADEELLL____) and then filled the rest with generally-good diverse cards (PPSV).

For PvP, I guessed that lots of people would find similar Lotus ramp decks to my PvE deck, so I filtered the dataset for decks with a lot of Lotuses and a lot of Angels+Dragons+Emperors. I then used the same process as above to fill one card at a time until I had PPPPSSSS____. At this point, there were very few matchups in the dataset that passed the filters, so I wasn't confident in how to finish the deck, but the process was weakly pointing to Emperor and Vigilante, and I wanted a bit more diversity, so I filled it out with EEEV.
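The one-card-at-a-time step described above can be sketched roughly as follows. This is a simplified reconstruction rather than the code actually used: it assumes records of the form (deck string, won) that have already been filtered for the relevant opponents, and the toy records and `min_games` threshold are invented.

```python
from collections import Counter

CARDS = "GLPHKSVBMADE"  # one-letter card codes (initials of the card names)


def contains(deck: str, required: Counter) -> bool:
    """True if the deck has at least the required number of copies of each card."""
    have = Counter(deck)
    return all(have[card] >= n for card, n in required.items())


def winrate(records, required: Counter):
    """Win rate and game count among decks that contain the required cards."""
    results = [won for deck, won in records if contains(deck, required)]
    if not results:
        return None, 0
    return sum(results) / len(results), len(results)


def best_next_card(records, partial: Counter, min_games: int = 1):
    """Pick the card whose extra copy gives the best win rate among matching decks."""
    best = None
    for card in CARDS:
        rate, n = winrate(records, partial + Counter(card))
        if n >= min_games and (best is None or rate > best[1]):
            best = (card, rate, n)
    return best


# Invented toy records, NOT taken from the scenario's dataset.
TOY_RECORDS = [
    ("LLLLEEEEAAAA", True),
    ("LLLEEEAAADDD", True),
    ("LLLLLLGGGGGG", False),
    ("PPPPSSSSAAAA", True),
    ("PPPPPPGGGGGG", False),
    ("GGGGGGKKKKKK", False),
]

if __name__ == "__main__":
    partial = Counter()
    for _ in range(2):
        card, rate, n = best_next_card(TOY_RECORDS, partial, min_games=2)
        print(f"add {card}: win rate {rate:.2f} over {n} decks")
        partial += Counter(card)
```

Each added card shrinks the set of matching decks, which is the data starvation mentioned above: after a few picks the filter passes so few records that the ranking is mostly noise.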

Replies from: aphyer
comment by aphyer · 2022-04-05T15:37:47.755Z · LW(p) · GW(p)

Interesting!  It looks to me like your initial algorithm was excellent but your 'filling-out' process may have shot you in the foot a little: both of your decks ended up sort of indecisive about whether to go for aggro or ramp. Your PVP deck would have done much better with more Pirates and Swords rather than switching over to Emperors, and your PVE deck would have done much better with more ramp stuff rather than switching to Pirates and Swords.

comment by Measure · 2022-04-05T03:27:29.147Z · LW(p) · GW(p)

Can you play any of the previously drawn cards, or only one of the two drawn this turn?

Replies from: aphyer
comment by aphyer · 2022-04-05T03:30:13.904Z · LW(p) · GW(p)

Only one of the two drawn this turn, I'll edit to clarify.

comment by benjamincosman · 2023-06-10T21:52:14.237Z · LW(p) · GW(p)

I just noticed that this fictional game is surprisingly similar to Marvel Snap (which was released later the same year); I assume based on the timing that this is a coincidence but I thought it was amusing.

Replies from: aphyer
comment by aphyer · 2023-06-10T22:34:12.023Z · LW(p) · GW(p)

...they stole my game!!!