
I'm playing the iPhone/Steam game Hero Academy. It's a bit like Chess, except there's some randomness, and at the beginning of a game players choose what team they'll use, each with different strengths and weaknesses. Both players can pick the same team.

About 400 of us play in an unofficial league, where we record our games online for Elo rankings. I'm not sure how many are active players, though.

I have the win/loss record for every matchup between the four teams. Draws are very rare. The rarest matchup has 143 games; the most-common matchup has 260. There are 3,234 games in the database.

What is the appropriate test for determining whether the teams are of equal strength?

Let's assume that rock-paper-scissors imbalances are undesirable. That is, if Team A beats Team B which beats Team C which beats Team A, we have an imbalanced game. Every matchup should be fair.

A complication is that many players have preferred teams, so that if Team A is very popular among top players, it'll appear stronger. I suspect we'll have to ignore this effect for now.

Another complication is that there's a slight advantage to going first, despite our efforts to ameliorate this with a handicap. Let's ignore this effect as well.

warbaker
  • Do most players always play with the *same* team, or is there a (considerable) amount of variation in team selection even for a single ("prototypical") participant? – cardinal Aug 28 '12 at 19:06
  • When you define imbalance, are you speaking about individual games or head-to-head win-loss records? If it is individual games, it should be expected to happen occasionally. What does "detect imbalance" mean? Isn't imbalance something you can simply observe from the database? – Michael R. Chernick Aug 28 '12 at 19:06
  • @Michael I mean statistically significant imbalance at, say, 95% confidence. Just because Team A wins 70 of 150 matches against Team B doesn't necessarily mean we know Team A is disfavored in the matchup -- it could be chance, especially if we only noticed that imbalance when looking at one of the twelve matchups. – warbaker Aug 28 '12 at 20:01
  • @cardinal Some players play only one team, others play a couple, and some use a randomizer built into the game. I'm not sure of the ratios. – warbaker Aug 28 '12 at 20:03
  • I don't see how it is relevant that intransitive imbalances are "undesirable." They exist in many games, and you should use a test which doesn't ignore the possibility. – Douglas Zare Aug 28 '12 at 20:52
  • Assuming that there are scores (sorry, I am not aware of the background of the game), you could set up a least-squares spreadsheet in Excel with team ratings as variables to derive a power rating for each team, a bit like this, but with a going-first edge instead of a home edge: http://office.microsoft.com/en-us/excel-help/using-solver-to-rate-sports-teams-HA001124601.aspx - my concern, though, is that surely it's the player playing the team that makes the difference (e.g. a strong player with Team A is better than a weak player with Team A?). – user8812 Aug 29 '12 at 21:51
  • @DouglasZare Intransitive imbalances aren't as bad as a team trouncing every other team, but they're still bad. You don't want the outcome of a match decided by the pre-game team selection, which is essentially rock-paper-scissors if there are intransitive imbalances. – warbaker Aug 30 '12 at 00:11
  • In other games, trying to outpick your opponent is considered part of the game. – Douglas Zare Aug 30 '12 at 00:32
  • @DouglasZare That's true, but not so much in this one. Anyway, it's interesting to test both kinds of imbalance. Honestly, I'd think it'd be easier to test imbalance without caring whether the imbalance is transitive. – warbaker Aug 30 '12 at 21:08

1 Answer


I think it's a bad idea to ignore the strengths of the players, but it may be hard to completely separate the possible flaws in the rating system from the possible advantages of one option versus another.

You could try the following test for each pair of options A and B. Your null hypothesis is that the rating formula is accurate and that the games are independent. Compute the number of wins predicted by the rating formula for option A, and compare this with the observed number of wins. If the rating formula predicts that the player using option A will win with probability $p$, add $p$ to the total expected wins, and add $p(1-p)$ to the total variance under the null hypothesis. If the games are not overwhelmingly lopsided, then you should be able to use a normal approximation, since you have over $100$ data points for each match-up. Determine how extreme the observed result is, in standard deviations from the predicted mean.
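A minimal sketch of that per-matchup test in Python, assuming the league's predicted win probability comes from the standard Elo logistic; the data layout and function names here are hypothetical, not the league's actual format:

```python
import math

def elo_win_prob(r_a, r_b):
    """Standard Elo logistic: probability the player rated r_a beats the player rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def matchup_z_score(games):
    """games: (rating of Team A's player, rating of Team B's player, True if A won)
    for every recorded game of one matchup, e.g. Team A vs. Team B.
    Returns how many standard deviations the observed Team A win count
    sits from the count the ratings predict under the null hypothesis."""
    expected = variance = 0.0
    observed = 0
    for r_a, r_b, a_won in games:
        p = elo_win_prob(r_a, r_b)
        expected += p               # add p to the total expected wins
        variance += p * (1.0 - p)   # add p(1-p) to the total variance
        observed += a_won
    return (observed - expected) / math.sqrt(variance)

# Hypothetical records for one matchup:
games = [(1620, 1540, True), (1480, 1515, False), (1700, 1695, True)]
print(matchup_z_score(games))
```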

Since you would apply this test to each possible match-up, you would expect more false positives at a typical single-test significance threshold. So, instead of declaring imbalance whenever any one of the $6$ tests is significant at the $0.05$ level, you might require a significance level of $0.05/6 \approx 0.008$, or about $2\frac{2}{3}$ standard deviations from the mean in either direction, to reject the null hypothesis.
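For reference, the two-sided Bonferroni cutoff quoted above can be computed directly with SciPy's normal quantile function:

```python
from scipy.stats import norm

alpha, n_tests = 0.05, 6
z_cutoff = norm.ppf(1 - (alpha / n_tests) / 2)  # ~2.64 standard deviations
```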

If you reject the null hypothesis, it doesn't necessarily mean that Team A has an advantage over Team B. It could also be that the rating formula fails, which might happen for lopsided matches. If you have enough data, you can try to compare players of similar ratings, where you can expect the rating formula to be more accurate.

Douglas Zare
  • By the way, in case it wasn't obvious, instead of using the ratings of the players at the time, or at the end, it's best to update the ratings to approximate the best fit to all data. If there are no undefeated players and no players who have never won, you can do this iteratively with a few passes through the data, although it is more accurate to do some sort of Bayesian update from some prior distribution. – Douglas Zare Aug 31 '12 at 20:33
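A rough sketch of that iterative refit, reusing the `elo_win_prob` logistic from the earlier snippet; the fixed step size and pass count are illustrative assumptions, and each pass is a small gradient step toward the ratings that best fit the full history:

```python
def fit_ratings(games, passes=50, k=8.0):
    """games: (winner_name, loser_name) pairs over the whole database.
    Repeated low-K Elo passes over all games approximate the best-fit
    ratings; an undefeated (or winless) player's rating never converges,
    which is why the comment excludes those cases."""
    ratings = {}
    for w, l in games:
        ratings.setdefault(w, 1500.0)
        ratings.setdefault(l, 1500.0)
    for _ in range(passes):
        for w, l in games:
            p = elo_win_prob(ratings[w], ratings[l])  # predicted chance the actual winner wins
            ratings[w] += k * (1.0 - p)
            ratings[l] -= k * (1.0 - p)
    return ratings
```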