I am interested in using logistic regression to model a competitive game.
The data looks something like this:
winner loser
teamA teamB
teamB teamC
teamA teamD
... ...
and each team in the dataset has at least 5 wins and 5 losses.
What I've done seems a little kludgy. I've made a fake outcome column that's all ones and a design matrix that's $G \times T$, where $G$ is the number of games and $T$ is the number of teams. Each row has a 1 in the column for the team that won, a -1 in the column for the team that lost, and 0 everywhere else.
So, for any given row in the logistic regression, we have:
$\text{logit}(p(\text{win}_A)) = \beta_A - \beta_B$
Here A is the team that won and B is the team that lost, so every observation is one where team A won.
This model works and it gives estimates for $\beta$ that are consistent with my knowledge of the game (the best teams have the highest values and the worst teams have the lowest).
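To make the setup concrete, here is a minimal sketch of it in plain Python. The six games are made-up toy data, and the small ridge penalty is an assumption I added only so the $\beta$'s are identified (each row sums to zero, so adding a constant to every $\beta$ leaves the likelihood unchanged). The fit is hand-rolled gradient ascent rather than any particular library routine.

```python
import math

# Hypothetical toy data: (winner, loser) pairs, not the real dataset.
games = [("teamA", "teamB"), ("teamB", "teamC"), ("teamA", "teamD"),
         ("teamC", "teamD"), ("teamD", "teamB"), ("teamC", "teamA")]

teams = sorted({t for g in games for t in g})
col = {t: j for j, t in enumerate(teams)}

def design_matrix(games):
    """G x T matrix: +1 in the winner's column, -1 in the loser's."""
    X = [[0] * len(teams) for _ in games]
    for i, (winner, loser) in enumerate(games):
        X[i][col[winner]] = 1
        X[i][col[loser]] = -1
    return X

def fit(X, ridge=0.1, lr=0.1, steps=5000):
    """Gradient ascent on the log-likelihood with every outcome equal to 1.

    The ridge term is an illustrative assumption, added for identifiability.
    """
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [-ridge * b for b in beta]
        for row in X:
            eta = sum(x * b for x, b in zip(row, beta))
            resid = 1.0 - 1.0 / (1.0 + math.exp(-eta))  # y - p, with y = 1
            for j, x in enumerate(row):
                grad[j] += resid * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

X = design_matrix(games)
beta = dict(zip(teams, fit(X)))
```

Sorting `beta` by value gives the implied ranking of the teams.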
But is this the most natural way to model this dataset with logistic regression? It seems a bit odd to have all 1's observed.
Next, I'll want to elaborate the model. Here are some model elaborations I've thought of, but I want to be able to keep thinking of more as well.
- Account for the time-varying nature of each team's "skill level"
- How many time-zones away from "home" is the game played?
- How many games has this team played in the last week? (Fatigue)
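For the game-level covariates in the list above, one way they might fit into the same design matrix is as extra columns holding the winner-minus-loser difference, so that each gets a single shared coefficient. This is only a sketch under that assumption; the field names (`tz`, `recent`) and all values are hypothetical.

```python
# Hypothetical per-game records: the field names ("tz", "recent") and all
# values are made up for illustration.
teams = ["teamA", "teamB", "teamC"]
games = [
    {"winner": "teamA", "loser": "teamB",
     "tz": {"teamA": 0, "teamB": 3},        # time zones away from home
     "recent": {"teamA": 1, "teamB": 4}},   # games played in the last week
    {"winner": "teamB", "loser": "teamC",
     "tz": {"teamB": 1, "teamC": 0},
     "recent": {"teamB": 2, "teamC": 2}},
]

def design_row(game, teams):
    """Team indicators (+1 winner, -1 loser) followed by covariate differences."""
    w, l = game["winner"], game["loser"]
    row = [0.0] * len(teams)
    row[teams.index(w)] = 1.0
    row[teams.index(l)] = -1.0
    # Winner-minus-loser differences, matching the sign convention above.
    row.append(float(game["tz"][w] - game["tz"][l]))
    row.append(float(game["recent"][w] - game["recent"][l]))
    return row

X = [design_row(g, teams) for g in games]
```

With this layout the outcome column would still be all ones, and the last two coefficients would measure the effect of travel and fatigue on the log-odds of winning.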
I have considered "models" like Elo/Glicko/Stephenson, but I am concerned that they won't allow for arbitrary elaborations.
What is the recommended way to set up a model like this? Could each team's skill level over time be a 1D Gaussian process? What if it were an individual game (e.g. chess or ping-pong) and I had some prior information about the overall shape of a player's skill level over time (players get better until some age then start getting worse)?
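To be concrete about the Gaussian-process idea, the kind of model I have in mind could be written roughly as follows (a sketch only; the mean function $m_i$ and the kernel $k$ are placeholders I have not chosen):

$$\text{logit}\, P(i \text{ beats } j \text{ at time } t) = \beta_i(t) - \beta_j(t), \qquad \beta_i(\cdot) \sim \mathcal{GP}\big(m_i(\cdot),\, k(\cdot, \cdot)\big)$$

For the individual-player case, the prior information about the shape of skill over time would presumably enter through $m_i$, for example as a function of the player's age that rises and then falls.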