I am trying to predict the winner of an NBA game. Roughly, this is what I want my dataset to look like:
- 30 columns, one column dedicated to each NBA team. These are the categorical variables. For a given row, two of these columns will be set to 1, while the rest are 0. This indicates which two teams are playing.
- A number of other columns, with offensive and defensive stats for both teams, which team is the home team, etc.
- The target values, i.e. the point totals for each team
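To make the layout above concrete, here is a minimal sketch of how a single row could be assembled. The team abbreviations, the column names (`team_*`, `a_*`, `b_*`, `a_is_home`, `pts_a`, `pts_b`) and all of the numbers are assumptions made up for illustration, and only 4 of the 30 teams are listed to keep it short.

```python
# Hypothetical abbreviations; only 4 of the 30 teams are listed to keep the sketch short.
TEAMS = ["BOS", "DEN", "LAL", "MIA"]


def build_row(team_a, team_b, stats_a, stats_b, a_is_home, pts_a, pts_b):
    """Build one game as a flat dict of column name -> value."""
    # One-hot block: 1 for the two teams playing, 0 for every other team.
    row = {f"team_{t}": int(t in (team_a, team_b)) for t in TEAMS}
    # Stat columns for each side, prefixed so the two sets do not collide.
    for name, value in stats_a.items():
        row[f"a_{name}"] = value
    for name, value in stats_b.items():
        row[f"b_{name}"] = value
    row["a_is_home"] = int(a_is_home)
    # Targets: the point totals to be predicted.
    row["pts_a"] = pts_a
    row["pts_b"] = pts_b
    return row


# Made-up numbers, purely for illustration.
row = build_row("BOS", "LAL", {"fg_pct": 48.0}, {"fg_pct": 55.0}, True, 110, 102)
print(row)
```

Note that the one-hot block comes out the same whichever of the two teams is passed as `team_a`, so on its own it does not say which team the `a_*` columns belong to.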
EDIT: The research objective is to predict each team's point total, i.e. the score of an NBA game, as accurately as possible.
The issue I am having is that, for a given row, two of the categorical variables are active, and each stat column is relevant to only one of the two active teams, but the model has no way of inferring which. In other words, let the two active teams be Team A and Team B. If I have 8 stats associated with Team A and 8 stats associated with Team B, how can I get the model to associate Team A's stats with Team A and not Team B, and vice versa?
EDIT: An alternative strategy I have considered is to not keep track of which two teams are playing at all. Instead, for a given row, I could store each team's cumulative (running-average) statistics up to but not including the game being predicted. So if I am predicting the 3rd game of a season, and a team shot 48% and 55% field goal percentage in the first two games, I could use (48 + 55) / 2 = 51.5% as its average field goal percentage to predict the 3rd game.
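The running-average idea can be sketched as follows. This is only an illustration under assumptions: the function name `prior_averages` is hypothetical, and the third game's value (50.0) is made up so that the list has a third entry. Each game's feature is computed only from games that came before it, so the game being predicted never leaks into its own input.

```python
def prior_averages(values):
    """For each game, return the mean of all earlier games (None for the first game)."""
    averages = []
    total = 0.0
    for i, value in enumerate(values):
        # Average over the i games played so far; the current game is excluded.
        averages.append(total / i if i > 0 else None)
        total += value
    return averages


# Field goal percentages for one team's first three games (third value is made up).
fg_pct = [48.0, 55.0, 50.0]
print(prior_averages(fg_pct))  # [None, 48.0, 51.5]
```

The first game has no history, so it gets `None` here; how to handle that case (drop the row, use last season's average, etc.) is a separate decision.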
EDIT: However, I would rather use the first approach, because it would be easier to add new columns and I have a feeling it would be the better approach overall. I am open to suggestions, though.
EDIT: Also, I am planning on trying an assortment of regression algorithms and seeing which one performs best.