logistic regression for competitive games

Question

I am interested in using logistic regression to model a competitive game.

The data looks something like this:

winner      loser
teamA       teamB
teamB       teamC
teamA       teamD
...         ...

and each team in the dataset has at least 5 wins and 5 losses.

What I've done seems a little kludgy. I've made a fake outcome column that's all ones and a $G \times T$ design matrix, where $G$ is the number of games and $T$ is the number of teams. Each row has a 1 in the column for the team that won and a -1 in the column for the team that lost.

So, for any given row in the logistic regression, we have:

$\text{logit}(p(\text{win}_A)) = \beta_A - \beta_B$

Here team A always denotes the winner, so every observed outcome is a win for team A.

This model works and it gives estimates for $\beta$ that are consistent with my knowledge of the game (the best teams have the highest values and the worst teams have the lowest).

But is this the most natural way to model this dataset with logistic regression? It seems a bit odd to have all 1's observed.

Next, I'll want to elaborate the model. Here are some model elaborations I've thought of, but I want to be able to keep thinking of more as well.

Account for the time-varying nature of each team's "skill level"
How many time-zones away from "home" is the game played?
How many games has this team played in the last week? (Fatigue)

I have considered "models" like Elo/Glicko/Stephenson, but I am concerned that they won't allow for arbitrary elaborations.

What is the recommended way to set up a model like this? Could each team's skill level over time be a 1D Gaussian process? What if it were an individual game (e.g. chess or ping-pong) and I had some prior information about the overall shape of a player's skill level over time (players get better until some age then start getting worse)?

I believe your model equation should read $\text{logit}(p) = \beta_A - \beta_B$. — Jarle Tufto, Oct 19 '17 at 17:55
thank you -- I sadly only have about a 50/50 chance of selecting from logit/inverse-logit correctly on the first try — rcorty, Oct 19 '17 at 17:57

Jarle Tufto · Accepted Answer · 2018-11-11T15:20:24.050

The model you describe is known as the Bradley-Terry model. It has been extended to include covariates and random effects that predict team abilities, as well as contest-specific covariates (including home advantage). These extensions are available in the R package BradleyTerry2; see this paper in the Journal of Statistical Software. For dynamic extensions of the Bradley-Terry model, see this paper in JRSSC and the references therein. For intransitive hierarchies modelled through covariates, you may want to look at this paper.
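To make the connection to the question concrete, here is a minimal NumPy sketch of fitting the Bradley-Terry model from the +1/-1 design matrix. It is an illustration under stated assumptions, not the BradleyTerry2 implementation: the function name `fit_bradley_terry`, the toy data, and the small ridge penalty (used only to pin down the arbitrary additive constant in $\beta$) are all made up for this example.

```python
import numpy as np

def fit_bradley_terry(winners, losers, n_teams, ridge=1e-3, lr=0.5, n_iter=5000):
    """Fit team strengths beta by gradient ascent on the Bradley-Terry log-likelihood.

    winners, losers: integer team indices, one pair per game.
    ridge: small L2 penalty (an assumption of this sketch) that fixes the
           otherwise arbitrary additive constant in beta.
    """
    winners = np.asarray(winners)
    losers = np.asarray(losers)
    n_games = len(winners)

    # Design matrix from the question: +1 for the winner, -1 for the loser.
    X = np.zeros((n_games, n_teams))
    X[np.arange(n_games), winners] = 1.0
    X[np.arange(n_games), losers] = -1.0

    beta = np.zeros(n_teams)
    for _ in range(n_iter):
        # P(observed winner wins) = sigmoid(beta_winner - beta_loser)
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        # Every outcome is 1, so the log-likelihood gradient is X^T (1 - p).
        grad = X.T @ (1.0 - p) / n_games - ridge * beta
        beta += lr * grad
    return beta

# Toy data (made up): team 0 has the best record, team 2 the worst.
winners = [0, 0, 0, 1, 1, 2, 0, 1, 2, 1]
losers = [1, 2, 1, 2, 2, 1, 2, 0, 0, 2]
beta = fit_bradley_terry(winners, losers, n_teams=3)
```

Note that there is no intercept: the all-ones outcome column is harmless here because the only way to raise the likelihood is to separate the $\beta$s. Only differences between the fitted values are meaningful.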

AndreaL · Answer 2 · 2017-10-21T06:18:24.860

An alternative approach could be to use an embedding (in the style of a word embedding), possibly learned by a neural network.

The idea is to represent each team as a low-dimensional vector (perhaps 1 or 2 dimensions are enough).

So, the design could be like this: you have two embedding layers with shared weights, one for teamA and one for teamB.

Then you could take the difference of those two layers (here is an example in Keras); this means you are computing (teamA - teamB) in the embedded space. The next layer could simply be a logistic regression. The target variable would be 1 if teamA won and 0 if teamB won.

I would train the network with both orderings, that is, two examples for each game: one with the winner listed as teamA (target 1) and one with the winner listed as teamB (target 0). In fact, I wonder how your logistic regression managed to work so well. It could have just set all the $\beta$s to zero and used a large bias towards 1 (maybe regularisation saved the day?).
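As a rough sketch of the "two examples per game" idea, the snippet below builds a mirrored dataset from winner/loser indices. The helper name `mirrored_design` and the toy indices are hypothetical.

```python
import numpy as np

def mirrored_design(winners, losers, n_teams):
    """Return (X, y) with two rows per game: the original and its mirror image."""
    winners = np.asarray(winners)
    losers = np.asarray(losers)
    n_games = len(winners)

    # Original rows: +1 for the first-listed team (the winner), -1 for the loser.
    X_win = np.zeros((n_games, n_teams))
    X_win[np.arange(n_games), winners] = 1.0
    X_win[np.arange(n_games), losers] = -1.0

    # Mirrored rows list the loser first, so the signs flip and the label is 0.
    X = np.vstack([X_win, -X_win])
    y = np.concatenate([np.ones(n_games), np.zeros(n_games)])
    return X, y

X, y = mirrored_design([0, 1, 0], [1, 2, 3], n_teams=4)
```

For a logistic regression without an intercept, a mirrored row contributes exactly the same log-likelihood term as the original row, so the point estimates should not change, though naive standard errors would come out too small because every game is counted twice. With an intercept, the mirrored data are symmetric, which pushes the intercept to zero.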

If you do the embedding in 1D, each team would be mapped to a single number, and this should be equivalent to what you currently have.

The advantage could be that here you could try to embed the teams in 2D, for example. You could also try concatenating the layers rather than taking the difference and, of course, adding more layers. It would also be very simple to extend this by concatenating more features to the embeddings.
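The forward pass of the difference-of-embeddings design can be sketched in plain NumPy (not Keras). The embedding table `E`, the weights `w`, and the function `win_probability` are illustrative names, and the parameters are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, dim = 4, 2

# One embedding table shared by both sides of the match-up (untrained, random).
E = rng.normal(scale=0.1, size=(n_teams, dim))
# Weights of the final logistic layer; no bias term in this sketch.
w = rng.normal(size=dim)

def win_probability(team_a, team_b):
    """P(team_a beats team_b) = sigmoid(w . (E[team_a] - E[team_b]))."""
    diff = E[team_a] - E[team_b]
    return 1.0 / (1.0 + np.exp(-(diff @ w)))

p_ab = win_probability(0, 1)
p_ba = win_probability(1, 0)
```

One thing this makes visible: with a plain difference followed by a single linear layer, $w \cdot (e_A - e_B) = w \cdot e_A - w \cdot e_B$, so each team still reduces to one scalar skill $w \cdot e$. Extra embedding dimensions would only add expressive power once there is a nonlinearity, concatenation, or further layers between the embeddings and the output.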
