I have a categorical variable with 288 levels: the GEOIDs of census tracts in Manhattan. After running a linear regression on this categorical variable and other predictors (population, weather, ...), I get a warning that my dataset might not be of full rank, and 13 coefficients are not defined.

As a next step, I examined the variance inflation factors (VIFs) of each predictor except the categorical variable. All the VIFs were below 5.

What can I conclude from these results? Might this signal that some tracts have identical characteristics in terms of population, income, and so on? Does that even matter?

kjetil b halvorsen
vranjes
  • How many data points are feeding into your regression (hopefully 5000 or more)? And how evenly is the sample distributed across those 288 categories (hopefully not too many with fewer than 10 sample units)? – probabilityislogic Jun 27 '19 at 14:03
  • I suspect that one (or more) of your predictors is an exact duplicate or an exact linear combination of the others, resulting in a non-invertible design matrix. Double-check your predictors and make sure that no two of them contain essentially the same information. Have you run correlations between each pair of variables? This might help you identify the problem. – StatsStudent Jun 27 '19 at 14:29
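
The exact linear dependence mentioned in the comment above can be illustrated with a small sketch. It assumes (the question does not confirm this) that a predictor such as population takes a single value per tract; such a column is an exact linear combination of the intercept and the tract dummies, so the design matrix loses rank even when VIFs computed without the categorical variable look fine. All names and the simulated data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): 5 tracts with 4 observations each.
n_tracts, obs_per_tract = 5, 4
tract = np.repeat(np.arange(n_tracts), obs_per_tract)

# Intercept plus dummy columns for every tract except the reference level.
intercept = np.ones((tract.size, 1))
dummies = (tract[:, None] == np.arange(1, n_tracts)).astype(float)

# Hypothetical tract-level predictor (population in thousands):
# one value per tract, repeated for each observation in that tract.
population = rng.uniform(1.0, 5.0, n_tracts)[tract][:, None]

# Hypothetical observation-level predictor that varies within a tract.
temperature = rng.normal(20.0, 5.0, (tract.size, 1))

X = np.hstack([intercept, dummies, population, temperature])

# Columns beyond the rank correspond to coefficients that cannot be estimated.
rank = np.linalg.matrix_rank(X)
n_undefined = X.shape[1] - rank
print(f"columns: {X.shape[1]}, rank: {rank}, undefined coefficients: {n_undefined}")
```

Running the same rank check on the real design matrix, with and without the tract dummies, would show whether the 13 undefined coefficients come from tract-constant predictors or from something else, such as empty factor levels.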

1 Answer

How many observations do you have? Do some census tracts have very few observations? Your question contains very little information, so we have to guess. But I would start by trying regularization, specifically the fused lasso, which might be a good idea for categorical variables with very many levels.

With census tracts, one might imagine that some are similar with respect to your target variable, and the fused lasso will then use the data to fuse similar tracts. For details, see "Principled way of collapsing categorical variables with many levels?".

kjetil b halvorsen