2

I am fitting a regression model on about 200,000 training observations, with 4 factors and 2 continuous variables as predictors. One of the factors has 927 levels, which causes the R implementation of randomForest to fail (it has a limit of 32 levels for any factor). Unfortunately, I don't see a simple way to avoid using this factor or to decompose it into a series of continuous variables. Since my predictors are a mix of categorical and continuous, I thought of trees. Can anyone suggest a different implementation (package or language), a different ML approach, or a better way to pre-process or massage my inputs?

kjetil b halvorsen
Ed Fine
  • Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and the links in there (possible duplicate) – kjetil b halvorsen May 16 '17 at 23:08

4 Answers

2

I'm not an R user, but here are some general comments about random forests.

Discrete features can be treated in two ways, depending on their properties:

  1. If they have a clear ordering, e.g. year or a (bad/med/good) result, then you can just treat them as continuous.
  2. If they don't, e.g. gender or company name, then they should be dummy-coded.
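
A minimal sketch of the two encodings above, in plain Python since this answer is not tied to R. The example data, the `ORDER` mapping and the helper names `encode_ordinal` and `encode_dummies` are all made up for illustration.

```python
# Hypothetical example data: one ordered and one unordered feature.
ratings = ["bad", "good", "med", "bad"]
companies = ["acme", "globex", "acme", "initech"]

# Case 1: ordered levels -> map each level to its rank and treat it as numeric.
ORDER = {"bad": 0, "med": 1, "good": 2}  # assumed ordering of the levels


def encode_ordinal(values, order):
    """Replace each level by its position in the given ordering."""
    return [order[v] for v in values]


# Case 2: unordered levels -> one 0/1 indicator column per distinct level.
def encode_dummies(values):
    """Return a dict mapping 'dummy_<level>' to a list of 0/1 indicators."""
    levels = sorted(set(values))
    return {f"dummy_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}


rating_codes = encode_ordinal(ratings, ORDER)
company_dummies = encode_dummies(companies)
```

With a 927-level factor, the second encoding yields 927 indicator columns, which is the situation described in the comment below.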
jf328
  • My factors fall into the second category. When I dummy them, I find myself with roughly 2000 variables, and only a few continuous variables. I can run a linear or generalized linear model on this system. Are there other approaches you recommend? – Ed Fine Mar 17 '13 at 16:12
1

Have you tried cforest in the party package? I know it can handle more than 30-40 levels, but I am not sure about 900 levels.

http://www.inside-r.org/packages/cran/party/docs/cforest

Roozbeh
1

You might try representing that one column differently. You could represent the same data as a data frame of dummy (0/1) variables, which is sparse in the sense that most entries are zero.

Minimal example:

# Toy column of string labels standing in for the high-cardinality factor
example <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"),
                      stringsAsFactors = FALSE)

# Add one 0/1 indicator column per distinct level
for (level in unique(example$strcol)) {
  example[[paste("dummy", level, sep = "_")]] <- as.integer(example$strcol == level)
}
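
The answer itself uses R. For readers working in Python, here is a rough sketch of why a sparse layout pays off for dummy columns: each row holds exactly one 1, so storing the position of that 1 is enough. The variable names and the `dense_row` helper are hypothetical, and the sizes at the end are simply the ones quoted in the question.

```python
# Same toy column as the R example above.
strcol = ["A", "A", "B", "F", "C", "G", "C", "D", "E", "F"]

# Assign each level a column index, in order of first appearance.
col_index = {}
for level in strcol:
    col_index.setdefault(level, len(col_index))

# Sparse form: row i has its single 1 in column nonzero_col[i].
nonzero_col = [col_index[level] for level in strcol]


def dense_row(i):
    """Expand row i back into a full 0/1 vector (only for inspection)."""
    row = [0] * len(col_index)
    row[nonzero_col[i]] = 1
    return row


# Storage comparison for the sizes quoted in the question.
n_rows, n_levels = 200_000, 927
dense_cells = n_rows * n_levels  # cells in the full dummy matrix
sparse_entries = n_rows          # one nonzero entry per row
```

For the question's data, the dense dummy matrix has 185,400,000 cells, while the sparse form needs only 200,000 stored positions.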
Vincent Warmerdam
0

You can use sklearn (Python) and its OneHotEncoder.

OneHotEncoder creates one column for each level of a categorical variable (927 in your case). The created columns can only take the values 0 or 1. By construction, this method puts no limit on the number of levels. However, it does not capture groupings of levels the way R's factor handling does: a tree split on a dummy column can only separate one level from all the others. The reason cforest doesn't allow you to have more than 30 levels is the huge number of possible splits of the levels (~2^30).
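
The level-count argument above can be made concrete with a small counting sketch. It only counts candidate splits and makes no claim about how any particular package searches them. The function names are made up for illustration.

```python
def unordered_factor_splits(k):
    """Distinct ways to divide k unordered levels into two non-empty groups."""
    return 2 ** (k - 1) - 1


def one_hot_splits(k):
    """After one-hot encoding, each 0/1 column offers exactly one split."""
    return k


splits_at_32 = unordered_factor_splits(32)              # 2**31 - 1
digits_at_927 = len(str(unordered_factor_splits(927)))  # size of the count for 927 levels
one_hot_at_927 = one_hot_splits(927)
```

At 32 levels the count is already 2,147,483,647 candidate groupings, whereas a one-hot encoded 927-level factor offers only 927 one-versus-rest splits. That is why one-hot encoding scales to many levels but cannot group levels in a single split.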

Sklearn aside, I suggest finding a way to reduce your number of levels.

Roozbeh