1

I am building a multiple linear regression model and wonder how many dummy variables can be included. I have 2 categorical variables: one with 13 levels and the other with 20 levels. Can I include all of them, or is that too many for multiple linear regression?

Prasad Dalvi
Daria

3 Answers

3

You have two categorical variables, one with 20 levels and the other with 13. That in itself is not too much for multiple regression. Estimating them will use $(20-1)+(13-1)=31$ df (degrees of freedom), since each variable needs one dummy fewer than its number of levels. Whether that works depends on the total number of observations (and on the number of continuous, measured variables). One rule of thumb is to have at least 15 subjects per parameter in the model. How many observations you have for each level could also be a consideration.
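As a rough illustration of the arithmetic above, here is a minimal sketch that counts the dummy-variable degrees of freedom and applies the 15-subjects-per-parameter rule of thumb. The function names `dummy_df` and `min_observations` are made up for this example, and leaving the intercept out of the parameter count is an assumption; the rule of thumb itself is only a guideline.

```python
def dummy_df(levels_per_factor):
    """Degrees of freedom used by dummy coding: each factor with k levels
    needs k - 1 dummies (one level serves as the reference)."""
    return sum(k - 1 for k in levels_per_factor)


def min_observations(levels_per_factor, n_continuous=0, per_parameter=15):
    """Sample size suggested by a 'subjects per parameter' rule of thumb.

    Assumption: parameters counted are the dummy coefficients plus the
    slopes of continuous predictors; the intercept is not counted.
    """
    n_params = dummy_df(levels_per_factor) + n_continuous
    return per_parameter * n_params


df = dummy_df([20, 13])
n_min = min_observations([20, 13])
print(df, n_min)  # 31 465
```

With only the two categorical variables, this suggests something on the order of 465 observations; each additional continuous predictor adds 15 more under the same rule.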

So, if you do not have enough observations, you could consider regularization. For categorical variables, the fused lasso is one option. See "Principled way of collapsing categorical variables with many levels?"

kjetil b halvorsen
0

You can include as many dummy variables as you want, but it will make the interpretation of the model coefficients somewhat complex. You can check whether all the levels of each variable are really important enough to be included in the model.

Prasad Dalvi
-2

This depends on the amount of data you have. In general, you might want to consider pooling some similar levels.
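For illustration only, one simple way to pool levels might look like the sketch below. Which levels count as "similar" is a substantive judgment; this example assumes low frequency as a stand-in criterion, which is a data-driven choice rather than a theory-driven one. The function name `pool_rare_levels`, the `"Other"` label, and the toy data are all hypothetical.

```python
from collections import Counter


def pool_rare_levels(values, min_count, pooled_label="Other"):
    """Replace every level seen fewer than min_count times with one pooled label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else pooled_label for v in values]


# Toy data: levels C and D each occur once, so they are merged.
levels = ["A"] * 5 + ["B"] * 4 + ["C"] + ["D"]
pooled = pool_rare_levels(levels, min_count=3)
print(sorted(set(pooled)))  # ['A', 'B', 'Other']
```

Here four levels become three, so the variable would use two dummy variables instead of three.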

  • 4
    Although what you say about the amount of data is true, you might want to expand this to say how much is needed. Pooling is dangerous, as the model then becomes data-dependent and not theory-driven. – mdewey Jun 03 '19 at 13:42