Most of the neural net algorithms I'm aware of require multilevel, ANOVA-type categorical features to be preprocessed into a set of dummy (0,1) variables. So, if one has a single categorical feature such as the teams in the NFL (Bears, Packers, Cowboys, etc.), then each level (team) is transformed into a separate 0,1 dummy variable indicating the membership of a unit (observation, record, entity). This approach makes NNs computationally feasible but has many drawbacks (illustrated in the sketch after this list), including:

  • For data with many categorical features, each possessing many labels, the addition of a very large number of mostly irrelevant 0,1 dummy variables

  • Impossibility of summarizing the structure of explained variance

    • For instance, US residential 5-digit zip codes have about 36,000 possible labels. No one cares about the 'impact' of a single zip code's 0,1 dummy variable on a target, but explaining the overall impact of 'zip codes' as a single factor is highly likely to be quite relevant to understanding the variance structure
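
To make this preprocessing concrete, here is a minimal sketch of the dummy-variable expansion described above, using pandas; the toy data are invented for illustration:

```python
# Minimal sketch of the dummy (0,1) expansion described above: one
# categorical feature ("team") becomes one column per level, so a feature
# with k levels inflates into k columns (~36,000 for US zip codes).
import pandas as pd

df = pd.DataFrame({"team": ["Bears", "Packers", "Cowboys", "Bears"]})
dummies = pd.get_dummies(df["team"], prefix="team", dtype=int)
print(dummies)
#    team_Bears  team_Cowboys  team_Packers
# 0           1             0             0
# 1           0             0             1
# 2           0             1             0
# 3           1             0             0
```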

Do neural net algorithms exist which are able to handle multilevel categorical features without conversion into dummy variables?
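
For reference, the 'single factor' treatment contrasted with dummy coding above is the ANOVA-type one, in which the factor's contribution to explained variance is summarized in a single sum-of-squares row rather than spread across one coefficient per dummy. A minimal sketch, assuming statsmodels and an invented toy dataset:

```python
# Sketch of the ANOVA-type treatment contrasted with one-hot encoding:
# the categorical factor enters the model as a single term, C(team), and
# its overall contribution to explained variance appears as one
# sum-of-squares row in the ANOVA table.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"team": rng.choice(["Bears", "Packers", "Cowboys"], size=90)})
df["y"] = df["team"].map({"Bears": 1.0, "Packers": 2.0, "Cowboys": 3.0}) \
          + rng.normal(scale=0.5, size=90)

model = smf.ols("y ~ C(team)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # one row for the whole 'team' factor
```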

Mike Hunter
  • May I ask, why are you asking this? – Richard Hardy Mar 13 '18 at 13:16
  • @RichardHardy It's pretty simple really. Decomposing a single categorical feature into a bunch of indicator variables loses a lot of information. To take an extreme example, a categorical feature such as US residential 5-digit zip code may have as many as 36,000 possible values. No one would care if a single zip code is related to the target variable but knowing that *zip codes* taken as a single factor are strongly predictive of the target would be useful in explaining variance. – Mike Hunter Mar 13 '18 at 13:20
  • I couldn't understand your question properly. I feel one hot vectors can successfully handle categorical variables... – tired and bored dev Mar 13 '18 at 13:38
  • Interesting. If the categorical variable is nominal, then coding it into dummies should not lead to a loss of information. If it is on a rank scale, then there is the loss of ranking, but perhaps that could be dealt with by building a single feature vector with ranks corresponding to the categorical variable. – Richard Hardy Mar 13 '18 at 13:46
  • @tiredandboreddev Would you say more about 'one hot vectors'? Or link to a paper describing this? – Mike Hunter Mar 13 '18 at 14:16
  • This should help - https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ – tired and bored dev Mar 13 '18 at 14:20
  • @RichardHardy If I understand you correctly, you are saying that converting a categorical feature, e.g., NFL teams, into a rank-scaled feature (e.g., Bears=1, Packers=2, Cowboys=3, etc.) is equivalent to treating them as a single ANOVA-type, nominal factor. That doesn't seem correct. – Mike Hunter Mar 13 '18 at 14:21
  • @tiredandboreddev Yeah...that link perfectly illustrates the problem I am trying to solve. 'One hot encoding' is a preprocessing transformation into a set of 0,1 dummy variables that decomposes the information available, e.g., in an ANOVA-type regression. Not only that, it increases the number of features from 1 categorical variable to as many features as there are levels. Unless you know how to *recompose* the disaggregated dummies into a single expression of 'variance accounted for', this is not a solution to my query. – Mike Hunter Mar 13 '18 at 14:25
  • I am making a distinction between rank scale and nominal scale. I do not suggest ranking nominal variables, which is nonsense. If you have a variable that is on a rank scale, though, then my decomposition (dummies plus one variable reflecting the ranks) might not be entirely equivalent to the original one, but I think it preserves all the information there is. – Richard Hardy Mar 13 '18 at 14:33
  • @RichardHardy My view is that there is far more information, in terms of variance explained, in treating a categorical, ANOVA-type factor as a single entity than in decomposing it into many 0,1 dummy variables using 'one hot encoding' or whatever. The ranking method you're proposing would probably work (i.e., it is programmable) but, in the absence of a more rigorous explanation, seems incomplete, even arbitrary, as a solution to *recomposing* the disaggregated dummies back into a single, ANOVA-type factor that would be useful wrt variance explained. It creates as many problems as it solves. – Mike Hunter Mar 13 '18 at 14:57
  • @RichardHardy Thank you for thinking about my question. – Mike Hunter Mar 13 '18 at 15:05
  • 1
    I find it interesting (+1 and the star is from me). I am wondering if your concerns are mainly computational or mainly information-theoretic, because I do not quite see the problem in the information content (I think there is a bijection between the original rank-scale factor and the dummies+ranks or the original nominal-scale factor and the dummies alone). If they are computational, then perhaps you could highlight that a bit more in your post to steer the discussion in the direction of your interest. – Richard Hardy Mar 13 '18 at 15:53
  • @RichardHardy The CV comment structure does not permit an extended discussion of these issues but my concerns, using your words, are primarily *Information-theoretic* wrt explaining the variance structure inherent in a predictive model whether logistic regression, ML or NNs. This recent paper *Deep Learning for Mortgage Risk* (https://arxiv.org/pdf/1607.02470.pdf) is illustrative of the problems that neural nets have with doing this adequately. The authors' approach leaves huge amounts of unexplained variance on the table by 'one hot encoding' multilevel categorical features. – Mike Hunter Mar 14 '18 at 11:26
  • I suspect this might be either computational or yet another aspect, because the models do not seem to be able to utilize the information effectively even though the information is there. – Richard Hardy Mar 14 '18 at 11:45
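
To make the 'dummies plus ranks' decomposition proposed by Richard Hardy in the comments above concrete, here is a minimal sketch; the toy ordinal data (bond-style ratings) and column names are invented for illustration:

```python
# Sketch of the "dummies plus one rank feature" decomposition discussed in
# the comments, for a categorical variable on a rank scale (toy bond
# ratings here). The dummy columns encode level membership; "rating_rank"
# preserves the ordering that one-hot encoding alone would lose.
import pandas as pd

ratings = pd.Series(["AA", "B", "AAA", "B"], name="rating")
order = ["B", "AA", "AAA"]  # assumed ordering, worst to best

dummies = pd.get_dummies(ratings, prefix="rating", dtype=int)
rank = pd.Series(
    pd.Categorical(ratings, categories=order, ordered=True).codes,
    name="rating_rank",
)
features = pd.concat([dummies, rank], axis=1)
print(features)
#    rating_AA  rating_AAA  rating_B  rating_rank
# 0          1           0         0            1
# 1          0           0         1            0
# 2          0           1         0            2
# 3          0           0         1            0
```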

0 Answers