How to handle more than 1 categorical variable per row when training a model

Question

I am trying to predict the winner of an NBA game. Roughly, this is what I want my dataset to look like:

30 columns, one column dedicated to each NBA team. These are the categorical variables. For a given row, two of these columns will be set to 1, while the rest are 0. This indicates which two teams are playing.
A number of other columns, with offensive and defensive stats for both teams, which team is the home team, etc.
The prediction values, AKA the point totals for each team

EDIT: The research objective is to best predict the prediction values, or the score of an NBA game.

The issue I am having is that for a given row two of the categorical variables are active, and the stat columns are only relevant to one of the two active teams, but the model will have no way of inferring that. In other words, let the two active teams be Team A and Team B. If I have 8 stats associated with Team A, and 8 stats associated with Team B, how can I get the model to correlate Team A's stats with Team A, and not Team B, and vice versa?
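To make the ambiguity concrete, here is a minimal Python sketch of the 30-column encoding described above. The function name `matchup_indicator` and the 1-based team IDs are assumptions for illustration only. It shows that the indicator columns are identical whichever team is listed first, so on their own they cannot say which block of stats belongs to which team.

```python
N_TEAMS = 30  # one indicator column per NBA team


def matchup_indicator(team_a, team_b, n_teams=N_TEAMS):
    """Return a 30-column vector with 1s for the two teams playing.

    Team IDs are hypothetical 1-based integers (1..30).
    """
    row = [0] * n_teams
    row[team_a - 1] = 1
    row[team_b - 1] = 1
    return row


# The encoding is symmetric: it records who is playing, not who is "first".
print(matchup_indicator(2, 7) == matchup_indicator(7, 2))  # True
```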

EDIT: An alternative strategy I have considered is to not keep track of which two teams are playing at all. Instead, for a given row, I could store the cumulative statistics for each team up to but not including the game being predicted. So if I am predicting the 3rd game of a season, and a team shot 48% and 55% field goal percentage in the first two games, I could use (48 + 55) / 2 = 51.5% average field goal percentage to predict the 3rd game.
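A minimal sketch of that cumulative average in plain Python, assuming one team's per-game values are already in chronological order. The function name `pregame_averages` and the third value in the sample list are made up for illustration.

```python
def pregame_averages(values):
    """For each game, return the mean of all earlier games.

    The first game has no history, so its entry is None.
    """
    averages = []
    running_total = 0.0
    for games_played, value in enumerate(values):
        # Only games strictly before the current one contribute,
        # so the game being predicted never leaks into its own feature.
        averages.append(running_total / games_played if games_played else None)
        running_total += value
    return averages


fg_pct = [48.0, 55.0, 50.0]  # hypothetical field goal percentages
print(pregame_averages(fg_pct))  # [None, 48.0, 51.5]
```

The entry for the 3rd game is 51.5, matching the arithmetic above. How to fill the first game of a season (for example with last season's average) is left open here.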

EDIT: However, I would rather use the 1st approach, because it would be easier to add new columns and I have a feeling it would be an overall better approach, but I am open to suggestions.

EDIT: Also, I am planning on using an assortment of regression algorithms and see which one performs best.

Which kind of model do you want to fit? What is the research question? — kjetil b halvorsen, Jun 19 '21 at 16:44
@kjetilbhalvorsen I plan on fitting many different regression models (as I am predicting points scored, which is a regression task). I guess my "research question" is: how can you use statistics to predict the score of an NBA game? I didn't think that the model I am using is relevant to my question, but I could be wrong. — joshblech, Jun 19 '21 at 16:51
Please add new information as an edit to the post and not only in comments! Not everybody reads comments ... as to the data organization, what about, with your format, let the extra info always refer to the home team? What you have is called dyadic data, I will add that tag, look into it. — kjetil b halvorsen, Jun 19 '21 at 21:21
@kjetilbhalvorsen Interesting. So let's say I assign a number 1 - 30 to each team, and let's say teams 2 and 7 are playing each other, and team 2 is home. In the "Home" column, for the row for the given game, I put a 2. I reserve the first 8 stats columns for the home team and the last 8 stats columns for the away team, and I have a column for home points and a column for away points. Do you think a given model would be able to relate the correct stats to the correct team in this scenario? Would this structure solve my issue? — joshblech, Jun 19 '21 at 21:58
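One way to sketch the home/away layout from the comments, in plain Python. This is an illustration under assumptions, not a method anyone in the thread endorsed: it swaps the single integer "Home" column for two separate 30-column indicator blocks (home, then away), because most regression models would read a raw team number such as 2 or 7 as an ordered quantity. The names `one_hot` and `build_row` are hypothetical.

```python
N_TEAMS = 30
N_STATS = 8  # stats per team, as in the question


def one_hot(team_id, n_teams=N_TEAMS):
    """Indicator vector for a single team (hypothetical 1-based ID)."""
    vec = [0] * n_teams
    vec[team_id - 1] = 1
    return vec


def build_row(home_id, away_id, home_stats, away_stats):
    """Feature row: home indicators, away indicators, home stats, away stats.

    Every column has a fixed role, so position alone ties each stat
    block to the home or the away team.
    """
    if len(home_stats) != N_STATS or len(away_stats) != N_STATS:
        raise ValueError("expected 8 stats per team")
    return one_hot(home_id) + one_hot(away_id) + list(home_stats) + list(away_stats)


# Team 2 hosts team 7; the stat values are placeholders.
row = build_row(2, 7, [0.0] * N_STATS, [1.0] * N_STATS)
print(len(row))  # 76
```

The two targets (home points and away points) would sit alongside this row as separate columns.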

score 1 · Answer 1 · answered Jun 20 '21 at 20:52

The issue I am having is that for a given row two of the categorical variables are active, and the stat columns are only relevant to one of the two active teams, but the model will have no way of inferring that.

Yes, there is no way of inferring that. This is a causal inference problem. In general, causal relations among the information encoded in the design matrix, i.e., the feature vectors, cannot be modelled with "standard" statistical models or machine learning. You would need to build a causal model if you want to induce causal relations among your predictors or sets of predictors.
