I have a regression model with about 200,000 training observations, 4 factor predictors, and 2 continuous predictors. One of my factors has 927 levels, which causes the R implementation of randomForest to fail (it has a limit of 32 levels for any factor). Unfortunately, I don't see a simple way to avoid using this factor, or to decompose it into a series of continuous variables. Since my predictors are a mix of categorical and continuous, trees seemed like a natural choice. Can anyone suggest a different implementation (package or language), a different ML approach, or a better way to pre-process or massage my inputs?

Have a look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and the links in there (possible duplicate) – kjetil b halvorsen May 16 '17 at 23:08
4 Answers
I'm not an R user, but here are some general comments about random forests.
Discrete features can be treated in two ways, depending on their properties:
- if they have a clear ordering, e.g., year or a bad/medium/good rating, you can simply treat them as continuous;
- if they don't, e.g., gender or company name, they should be encoded as dummy variables (see the sketch below).
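A minimal R sketch of both treatments, using a made-up toy data frame (the column names rating, company, and y are hypothetical, not from the question):

# Toy data: one ordered and one unordered discrete feature plus a response
df <- data.frame(
  rating  = c("bad", "med", "good", "good", "bad"),
  company = c("A", "B", "A", "C", "B"),
  y       = c(1.2, 3.4, 5.6, 5.1, 0.9)
)

# Ordered feature: map the levels to integers and treat as continuous
df$rating_num <- as.numeric(factor(df$rating, levels = c("bad", "med", "good")))

# Unordered feature: expand into dummy (one-hot) columns with model.matrix;
# "- 1" drops the intercept so each level gets its own indicator column
dummies <- model.matrix(~ company - 1, data = df)
df <- cbind(df, dummies)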

My factors fall into the second category. When I dummy them, I find myself with roughly 2000 variables, and only a few continuous variables. I can run a linear or generalized linear model on this system. Are there other approaches you recommend? – Ed Fine Mar 17 '13 at 16:12
Have you tried cforest in the party package? I know it can handle more than 30-40 levels, but I am not sure about 900 levels.
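A rough sketch of what a cforest call might look like (dat, y, and the tuning values here are placeholders for illustration, not recommendations):

library(party)

# Fit a conditional inference forest; cforest does not share randomForest's
# 32-level restriction on factor predictors
fit <- cforest(y ~ ., data = dat,
               controls = cforest_unbiased(ntree = 500, mtry = 3))

# Out-of-bag predictions on the training data
preds <- predict(fit, OOB = TRUE)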

You might try representing that one column differently. You could represent the same data as a sparse dataframe with dummy variables.
Minimum viable code:
# Toy column with a handful of string levels
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"

# Create one 0/1 indicator column per distinct level
for (level in unique(example$strcol)) {
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
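Note that the loop above produces a dense data frame. If memory becomes a concern with ~927 levels, a sketch of a genuinely sparse encoding with the Matrix package (assuming the example data frame from above):

library(Matrix)

# Ensure the column is a factor, then build the same indicators as a sparse
# matrix; "~ strcol - 1" drops the intercept so each level gets its own column
example$strcol <- factor(example$strcol)
sparse_dummies <- sparse.model.matrix(~ strcol - 1, data = example)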

You can use sklearn (Python) with OneHotEncoder.
OneHotEncoder creates one column per level of a categorical variable (927 in your case). The created columns can only take the values 0 or 1, so by construction this method puts no limit on the number of levels. However, it does not capture interactions between levels the way a single factor split in R does. The reason randomForest restricts you to 32 levels is that the number of possible binary splits of the levels grows like 2^(k-1), which is already astronomical at 32 levels.
Sklearn aside, I suggest finding a way to reduce your number of levels.
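One common way to do that, sketched in R (the column name strcol and the 5% cutoff are arbitrary assumptions for illustration), is to lump rare levels into a single "other" category:

# Count how often each level occurs
counts <- table(example$strcol)

# Keep levels that cover at least 5% of the rows, lump the rest together
keep <- names(counts)[counts / sum(counts) >= 0.05]
example$strcol_reduced <- ifelse(example$strcol %in% keep,
                                 as.character(example$strcol),
                                 "other")
example$strcol_reduced <- factor(example$strcol_reduced)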
