Can anyone suggest a good way forward for this machine learning problem in sports analytics?
The data set looks like:
HomeTeam      AwayTeam         NoOfSpectators
AC Milan      FC Barcelona     56900
Real Madrid   Bayern Munchen   78900
The outcome variable is NoOfSpectators, but HomeTeam and AwayTeam each have about 50 levels. I know you can use one-hot encoding or label encoding, but what other options are worth trying?
For example, should I use RandomForest or LightGBM, which can handle categorical / factor variables automatically?
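For reference, here is a minimal sketch of the two encodings mentioned above (label and one-hot), applied to the two sample rows. It uses only the Python standard library, and the helper names `build_mapping` and `one_hot` are made up for illustration. A real pipeline would more likely use a library encoder.

```python
# Sample rows copied from the data set above.
rows = [
    ("AC Milan", "FC Barcelona", 56900),
    ("Real Madrid", "Bayern Munchen", 78900),
]


def build_mapping(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, mapping):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    vec = [0] * len(mapping)
    vec[mapping[value]] = 1
    return vec


# One mapping per column, since HomeTeam and AwayTeam are separate factors.
home_map = build_mapping(h for h, _, _ in rows)
away_map = build_mapping(a for _, a, _ in rows)

# Label encoding: one integer per column.
label_rows = [(home_map[h], away_map[a]) for h, a, _ in rows]

# One-hot encoding: one 0/1 column per level, concatenated across both factors.
one_hot_rows = [one_hot(h, home_map) + one_hot(a, away_map) for h, a, _ in rows]

print(label_rows)
print(one_hot_rows)
```

With about 50 levels per column, the one-hot version would produce roughly 100 columns, while the label version keeps 2 columns but imposes an arbitrary ordering on the teams.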
Also, since for example:
HomeTeam       AwayTeam       NoOfSpectators
AC Milan       FC Barcelona   56900
is the same as:
HomeTeam       AwayTeam       NoOfSpectators
FC Barcelona   AC Milan       56900
how do you suggest the data set should be structured / modeled before it is input to an ML model?
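To make the symmetry question concrete, here is one possible sketch, assuming the order of the two teams really does not matter. Each match is reduced to an unordered pair and encoded as a single "multi-hot" vector with a 1 for each participating team, so both orderings give identical features. The helper names `canonical_pair` and `multi_hot` are hypothetical, and only the standard library is used.

```python
# Sample rows from the question, including the swapped ordering.
rows = [
    ("AC Milan", "FC Barcelona", 56900),
    ("FC Barcelona", "AC Milan", 56900),
    ("Real Madrid", "Bayern Munchen", 78900),
]

# One shared team index covering both columns.
teams = sorted({team for home, away, _ in rows for team in (home, away)})
index = {team: i for i, team in enumerate(teams)}


def canonical_pair(team_a, team_b):
    """Return the two teams in sorted order, so (A, B) and (B, A) coincide."""
    return tuple(sorted((team_a, team_b)))


def multi_hot(team_a, team_b):
    """Return a 0/1 vector with a 1 for each of the two teams in the match."""
    vec = [0] * len(index)
    vec[index[team_a]] = 1
    vec[index[team_b]] = 1
    return vec


features = [multi_hot(h, a) for h, a, _ in rows]
pairs = [canonical_pair(h, a) for h, a, _ in rows]

# The first two rows describe the same unordered match.
print(pairs[0] == pairs[1])
print(features[0] == features[1])
```

This halves the number of one-hot columns compared with encoding HomeTeam and AwayTeam separately. If the order turns out to matter (for instance, if attendance depends on which team hosts the match), this representation would discard that information and the two ordered columns should be kept.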