Can anyone suggest a good way to approach this machine learning problem in sports analytics?

The data set looks like this:

HomeTeam             AwayTeam                NoOfSpectators
AC Milan             FC Barcelona              56900
Real Madrid          Bayern Munchen            78900

The outcome variable is NoOfSpectators but there are many levels in both HomeTeam and AwayTeam.

There are about 50 levels in both HomeTeam and AwayTeam. I know you can do OneHot encoding or Label encoding but what other options are worth trying?

For example, would it help to use RandomForest or LightGBM, which can handle categorical / factor variables automatically?

Also, since for example

HomeTeam             AwayTeam                NoOfSpectators
AC Milan             FC Barcelona              56900

is the same as

HomeTeam             AwayTeam                NoOfSpectators
FC Barcelona         AC Milan                  56900

how do you suggest that the data set should be structured / modeled before it is input to an ML model?

  • 2
    Welcome to CV! Could you edit your question to include what you have tried so far? It is not clear to me why many categorical features would be a problem. Is your sample size perhaps too small? In case you are asking how to do the entire analysis, I'm afraid your question is off-topic. – Frans Rodenburg Aug 28 '18 at 07:36
  • 1
    How many different levels in `HomeTeam` and `AwayTeam`? You might find some help in https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels (maybe a duplicate) – kjetil b halvorsen Aug 28 '18 at 09:54
  • Updated my question as I wasn't very clear – user1764534 Aug 28 '18 at 11:54
  • 1
    Surely "AC Milan Barcelona FC" and "Barcelona FC AC Milan" are different matches at different venues, so why would the attendance be the same for both? – Robert Long Aug 28 '18 at 12:09
  • You should try a better title like "Modeling number of spectators in football" – kjetil b halvorsen Aug 28 '18 at 13:13

1 Answer

Some ideas for modeling: you should definitely not start with a complicated black-box model like RandomForest. Start with simpler models, such as linear models, which can give some understanding (and serve as a reference if you later decide to try RandomForest etc.).

But I think you should start with a better variable encoding. A linear additive model like $\sim \text{HomeTeam} + \text{AwayTeam}$ will not really use the information that A is Home and B is Away, since A+B = B+A. Adding an interaction term will not change that. So you need a better encoding. Add a third variable HomeStadion, which you can code from the information that you have. Then you can try a model like

$$ \text{NoOfSpectators} = \beta_0+ \beta_1 \text{HomeStadion} + \beta_2 \text{HomeTeam} +\beta_3 \text{AwayTeam} + \epsilon $$

Here $\beta_1$ can be interpreted as a "HomeTeam advantage". You could also try to include an interaction term, but then the number of parameters will be very large (about 50 × 50 = 2500), so you will need a very large sample size, or you must use regularization. See the discussion in "Principled way of collapsing categorical variables with many levels?". It could be useful to use different penalizations for main effects and interactions. It could be that the most important effect of the interaction is a "star meeting" effect, so that most interaction parameters are close to zero. That could be represented via a lasso penalty, which gives sparsity, while the additive parameters could use a ridge penalty.
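
To make the parameter count concrete, here is a minimal Python sketch of the dummy encoding with and without home-by-away interaction columns. The team names are placeholders, and the helpers `column_names` and `encode_match` are hypothetical names made up for this illustration, not part of any library.

```python
from itertools import permutations


def column_names(teams, interactions=False):
    """List the dummy columns for HomeTeam and AwayTeam (full coding)."""
    cols = [f"home={t}" for t in teams] + [f"away={t}" for t in teams]
    if interactions:
        # One column per ordered (home, away) pair; a team never plays itself.
        cols += [f"home={h}:away={a}" for h, a in permutations(teams, 2)]
    return cols


def encode_match(home, away, interactions=False):
    """Sparse encoding of one match: only the non-zero dummies are stored."""
    row = {f"home={home}": 1, f"away={away}": 1}
    if interactions:
        row[f"home={home}:away={away}"] = 1
    return row


# Placeholder team names (assumption: 50 teams, as in the question).
teams = [f"team_{i:02d}" for i in range(50)]

n_main = len(column_names(teams))                     # main effects only
n_full = len(column_names(teams, interactions=True))  # plus interactions
print(n_main, n_full)

# Non-zero columns for a single match, with its interaction dummy.
print(encode_match("team_00", "team_01", interactions=True))
```

With 50 teams this gives 100 main-effect columns and 2450 interaction columns (50 × 49, since a team cannot meet itself), which is in line with the rough 50 × 50 figure and shows why regularization or a large sample is needed.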

EDIT

Answer to the new question in the comments: the factor variables should be represented by dummy variables, which can be used directly. Good software should construct the dummies for you. But if you use regularization, note that you should not leave out one "reference level", as is usually done, because the choice of reference level changes the model when regularization is applied. See "Dropping one of the columns when using one-hot encoding".
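
The effect of the reference level under regularization can be checked on a toy example. The following is a minimal pure-Python sketch, not a recommended implementation: it fits a ridge regression with an unpenalized intercept on a single made-up factor, once with all dummy columns and once with a level dropped. The data, the penalty value and the helper names `solve` and `ridge_fit` are assumptions for illustration only.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(levels, y, lam, drop=None):
    """Fitted value per level for ridge on an intercept plus dummies.

    The intercept is not penalized. If `drop` is given, that level's dummy
    is left out, making it the reference level. With drop=None the penalty
    must be positive, otherwise the normal equations are singular.
    """
    unique = sorted(set(levels))
    names = [lv for lv in unique if lv != drop]
    X = [[1.0] + [1.0 if lv == nm else 0.0 for nm in names] for lv in levels]
    p = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for j in range(1, p):  # add the ridge penalty to the dummy columns only
        xtx[j][j] += lam
    beta = solve(xtx, xty)
    coef = dict(zip(names, beta[1:]))
    return {lv: beta[0] + coef.get(lv, 0.0) for lv in unique}


# Made-up data: one factor with three levels, attendance in thousands.
levels = ["A", "A", "B", "B", "C", "C"]
y = [60.0, 62.0, 40.0, 42.0, 20.0, 22.0]

fit_full = ridge_fit(levels, y, lam=2.0)              # all dummies kept
fit_drop_a = ridge_fit(levels, y, lam=2.0, drop="A")  # A is the reference
fit_drop_c = ridge_fit(levels, y, lam=2.0, drop="C")  # C is the reference
print(fit_full)
print(fit_drop_a)
print(fit_drop_c)
```

In this toy data the fitted values per level depend on which level is dropped, because the penalty shrinks the other levels toward the unpenalized reference level, whereas full dummy coding shrinks every level toward a common center.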

kjetil b halvorsen
  • 1
    Thanks. Is it possible to use factor variables directly in linear regression? Or what would you suggest is the best way to encode the variable if you have, say, 75 unique values in HomeTeam? – user1764534 Aug 28 '18 at 14:27
  • Besides OneHot encoding the team names, you can have a look at neural network embeddings. They will probably capture much more information compared to OHE. – Stergios Aug 29 '18 at 09:42
  • @Stergios: Certainly you can do that, but it is a good strategy to first build a simple, logical, interpretable model, at least as a baseline. – kjetil b halvorsen Aug 29 '18 at 10:00
  • 1
    Sure, I totally agree with that. I just mentioned an idea for possible further improvement. – Stergios Aug 29 '18 at 10:04