
If $n\gg p$ (where $n$ is the number of observations and $p$ is the number of dimensions), is it always OK to use categorical predictors with many levels in regression? Here $p$ is also fairly high, since the categorical predictors have many levels, although $n$ still far outnumbers $p$. Or is there a better way?

This came up in a set of data-scientist interview questions I read online a while ago, but after giving it some thought, I still can't figure out what a good answer would be.

Any ideas/references would be greatly appreciated.

kjetil b halvorsen
user98235
    The way this question is phrased, I would assume it is asking about using a categorical predictor (aka "fixed effect") vs. modeling the individual levels as coming from a higher-level distribution (aka "random effect"), as is done in hierarchical models aka mixed models. I am surprised that none of the answers mentions this alternative. – amoeba Dec 25 '16 at 12:47
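For what it's worth, here is a minimal sketch of the contrast amoeba describes, on simulated data and assuming the statsmodels formula API is available; all variable names are hypothetical. The fixed-effect fit spends $d-1$ dummy coefficients on the grouping factor, while the mixed model replaces them with a single variance parameter:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 50, 40

# Simulated data: a continuous predictor x plus a 50-level grouping factor
# whose level effects are drawn from a common normal distribution.
country = np.repeat([f"c{i}" for i in range(n_groups)], n_per)
level_effect = np.repeat(rng.normal(0.0, 1.0, n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
y = 2.0 * x + level_effect + rng.normal(size=n_groups * n_per)
df = pd.DataFrame({"y": y, "x": x, "country": country})

# Fixed effects: one dummy coefficient per level (d - 1 = 49 extra parameters).
fe = smf.ols("y ~ x + C(country)", data=df).fit()

# Random effects: the level effects are modeled as draws from one
# distribution, costing a single variance parameter instead of 49 dummies.
re = smf.mixedlm("y ~ x", data=df, groups=df["country"]).fit()

print(fe.params["x"], re.params["x"])  # both estimates should be near 2.0
```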

2 Answers


Nothing is "always ok", as there are always exceptions. For example, logit and probit models get into trouble when one or more categories of your predictor perfectly predict the outcome. This can easily happen regardless of how large your sample size is.

Another case where your model would be problematic occurs when $n$ is large but the number of observations in one or more categories is very small: the coefficients for those categories are then estimated very imprecisely, which matters most when your interest focuses on exactly those small categories.
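To make the first failure mode concrete, here is a toy sketch (hypothetical data; statsmodels is assumed) in which one category perfectly predicts the outcome, so the logit fit breaks down no matter how large $n$ is:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 30_000  # a large sample does not prevent the problem

# Outcome probabilities by category; every observation in "c" has y = 1,
# so the dummy for "c" perfectly predicts the outcome.
cat = rng.choice(list("abc"), size=n)
p = np.where(cat == "a", 0.3, np.where(cat == "b", 0.6, 1.0))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "cat": cat})

try:
    res = smf.logit("y ~ C(cat)", data=df).fit()
    # If the fit runs, the coefficient for "c" diverges toward +infinity.
    print(res.params)
except Exception as err:
    # Older statsmodels versions raise PerfectSeparationError instead.
    print("separation detected:", err)
```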

Maarten Buis

I don't think there is a definite answer. If there are no purely statistical issues (see Maarten Buis' answer), then this is more of a theoretical issue.

The way I see it, while many properties are naturally multi-categorical, there is not always a logical reason to make use of all that detail. It can make a model cumbersome, and it might be self-defeating. Let's say we have a variable $x_1$ with $d$ levels. If $x_1$ is a control variable, using it as is might not make a big difference (besides being an eyesore). If, however, $x_1$ is an effect that is theoretically interesting, some reduction might be in order. I'll elaborate.

Using $x_1$ as an explanatory variable means that we have $d-1$ dummy categories, each with a coefficient measuring the difference between that category and the reference category. If we are determined to understand the differences between each of the world's countries and, say, Japan, then fine; but this conveys little information about the relationships among the other $d-1$ categories themselves. And when we are interested in measuring interactions with $x_1$, having many categories makes the results very hard to interpret.

So oftentimes it would be prudent to ask whether there is logic behind merging categories. Perhaps East Asian countries can go together; maybe EU countries can (maybe not). Maybe the interesting comparison is between new customers and everyone else, so contrasting them against many fine-grained seniority categories adds little over a simple new/not-new split. Many times clumping categories together will sacrifice specificity but gain clarity, and that's not a bad thing.
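As an illustrative sketch of the merging idea (the country list and region mapping here are purely hypothetical), one can compare the number of dummy columns before and after collapsing:

```python
import pandas as pd

# Hypothetical data with a many-level categorical predictor.
df = pd.DataFrame({"country": ["Japan", "China", "South Korea", "Germany",
                               "France", "Italy", "Brazil", "Japan"]})

# Collapse levels into theoretically motivated groups; unmapped levels
# fall into a catch-all category.
region_map = {
    "Japan": "East Asia", "China": "East Asia", "South Korea": "East Asia",
    "Germany": "EU", "France": "EU", "Italy": "EU",
}
df["region"] = df["country"].map(region_map).fillna("Other")

# d - 1 dummies per factor: many for country, few for region.
fine = pd.get_dummies(df["country"], drop_first=True)
coarse = pd.get_dummies(df["region"], drop_first=True)
print(fine.shape[1], "country dummies vs", coarse.shape[1], "region dummies")
```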

Yuval Spiegler
    I would agree with what you are saying, but rather than merging categories, just *add* extra categorical/other variables (e.g. East Asian/...), and ideally other descriptive variables (e.g. GDP). What you then need to do is add L2 regularisation to favour the high-level categories over the lowest-level categories (see the sketch after these comments). – seanv507 Dec 23 '16 at 09:40
    In other words, there is nothing wrong with adding higher-level categories (it does not worsen the curse of dimensionality). – seanv507 Dec 23 '16 at 10:23
    If it could be of interest to have the data help in grouping categories, have a look at http://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories/237000#237000 – kjetil b halvorsen Dec 23 '16 at 10:42
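A rough sketch of seanv507's suggestion (all names and data below are hypothetical, and scikit-learn's Ridge is assumed): keep the low-level dummies, add higher-level group dummies plus a descriptive covariate, and let an L2 penalty shrink the redundant low-level coefficients so shared variation is absorbed by the coarser features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n = 5_000
countries = [f"country_{i:02d}" for i in range(40)]

df = pd.DataFrame({"country": rng.choice(countries, size=n)})
df["region"] = df["country"].str[-1]   # crude stand-in for a real grouping
df["gdp"] = rng.normal(size=n)         # hypothetical descriptive variable
df["y"] = 1.5 * df["gdp"] + rng.normal(size=n)

# Design matrix: low-level dummies, higher-level dummies, and a covariate.
X = pd.concat(
    [
        pd.get_dummies(df["country"], prefix="country"),
        pd.get_dummies(df["region"], prefix="region"),
        df[["gdp"]],
    ],
    axis=1,
)

# A uniform L2 penalty already pushes shared variation onto the coarser
# features; penalizing the country dummies more heavily (e.g. by rescaling
# them) would favour the high-level categories even more strongly.
model = Ridge(alpha=10.0).fit(X, df["y"])
print(model.coef_[-1])  # the gdp coefficient, close to 1.5
```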