I have a regression model with about 200,000 training observations, 4 factor predictors, and 2 continuous predictors. One of my factors has 927 levels, which causes the R implementation of randomForest to fail (it has a limit of 32 levels for any factor). Unfortunately, I don't see a simple way to avoid using this factor, or to decompose it into a series of continuous variables. Since my predictors are a mix of categorical and continuous, trees seemed like a natural choice. Can anyone suggest a different implementation (package or language), a different ML approach, or a better way to pre-process or massage my inputs?

Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and the links in there (possible duplicate) – kjetil b halvorsen May 16 '17 at 23:08
4 Answers
I'm not an R user, but here are some general comments about random forests.
Discrete features can be treated in two ways, depending on their properties:
- if they have a clear ordering, e.g., year or a bad/medium/good rating, you can simply treat them as continuous;
- if they don't, e.g., gender or company name, they should be encoded as dummy variables (see the sketch below).
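A minimal R sketch of both treatments, using a made-up toy data frame (the column names rating, company, and y are hypothetical, not from the question):

# Toy data: one ordered and one unordered discrete feature plus a response
df <- data.frame(
  rating  = c("bad", "med", "good", "good", "bad"),
  company = c("A", "B", "A", "C", "B"),
  y       = c(1.2, 3.4, 5.6, 5.1, 0.9)
)

# Ordered feature: map the levels to integers and treat as continuous
df$rating_num <- as.numeric(factor(df$rating, levels = c("bad", "med", "good")))

# Unordered feature: expand into dummy (one-hot) columns with model.matrix;
# "- 1" drops the intercept so each level gets its own indicator column
dummies <- model.matrix(~ company - 1, data = df)
df <- cbind(df, dummies)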

My factors fall into the second category. When I dummy them, I find myself with roughly 2000 variables, and only a few continuous variables. I can run a linear or generalized linear model on this system. Are there other approaches you recommend? – Ed Fine Mar 17 '13 at 16:12
Have you tried cforest in the party package? I know it can handle more than 30-40 levels, but I am not sure about 900 levels.
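A rough sketch of what a cforest call might look like (dat, y, and the tuning values here are placeholders for illustration, not recommendations):

library(party)

# Fit a conditional inference forest; cforest does not share randomForest's
# 32-level restriction on factor predictors
fit <- cforest(y ~ ., data = dat,
               controls = cforest_unbiased(ntree = 500, mtry = 3))

# Out-of-bag predictions on the training data
preds <- predict(fit, OOB = TRUE)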

You might try representing that one column differently. You could represent the same data as a sparse dataframe with dummy variables.
Minimum viable code:
# Toy column with a handful of string levels
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"

# Create one 0/1 indicator column per distinct level
for (level in unique(example$strcol)) {
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
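Note that the loop above produces a dense data frame. If memory becomes a concern with ~927 levels, a sketch of a genuinely sparse encoding with the Matrix package (assuming the example data frame from above):

library(Matrix)

# Ensure the column is a factor, then build the same indicators as a sparse
# matrix; "~ strcol - 1" drops the intercept so each level gets its own column
example$strcol <- factor(example$strcol)
sparse_dummies <- sparse.model.matrix(~ strcol - 1, data = example)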

You can use sklearn (Python) with OneHotEncoder.
OneHotEncoder creates one column per level of a categorical variable (927 in your case). The created columns can only take the values 0 or 1, so by construction this method puts no limit on the number of levels. However, it does not capture interactions between levels the way a single factor split in R does. The reason randomForest restricts you to 32 levels is that the number of possible binary splits of the levels grows like 2^(k-1), which is already astronomical at 32 levels.
Sklearn aside, I suggest finding a way to reduce your number of levels.
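One common way to do that, sketched in R (the column name strcol and the 5% cutoff are arbitrary assumptions for illustration), is to lump rare levels into a single "other" category:

# Count how often each level occurs
counts <- table(example$strcol)

# Keep levels that cover at least 5% of the rows, lump the rest together
keep <- names(counts)[counts / sum(counts) >= 0.05]
example$strcol_reduced <- ifelse(example$strcol %in% keep,
                                 as.character(example$strcol),
                                 "other")
example$strcol_reduced <- factor(example$strcol_reduced)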
