I am building a random forest to predict a binary variable y.
I have several predictors named x1..n.
One predictor, let's say x1, is a very strong predictor of y, but only in some cases (see below), while the others x2..n are fair predictors in all cases.
Specifically, x1 is a strong predictor when a variable z is greater than a threshold A.
Now if I fit a standard random forest (randomForest(y~x1+x2+...)), x1 has a low importance and the overall prediction is not as good as it could be if the random forest took into account that x1 is a very good predictor when z>A.
Is there a way to indicate to the random forest that x1's predictive power is high only when z>A?
Subsidiary question: if A is unknown, can a model find the optimal value of A for which x1 is the best predictor?
Thank you for your help. I did my best to explain my problem but do not hesitate to tell me if something is still unclear.
Note: I posted a related question on a more specific case (for a decision tree) here: https://stackoverflow.com/questions/48910293/conditional-partioning
Edit
Thank you for your help. I show an example below to illustrate the issue.
In this example y is built directly from x1, and x2..x5 = x1 + noise. When z<5, x1 is random. Since y is explicitly defined by x1 when z>=5, the prediction should be perfect for those cases.
But this is true only if the random forest is trained with y~x1+z alone, and not with the full set of variables y~x1+x2+x3+x4+x5+z.
Of course all of this makes sense since, as Tim Biegeleisen writes, the random forest chooses the variable that gives the best split at each node (and obviously this is not x1). But is there a way to indicate to the random forest that x1's predictive power is high only when z>A?
Since I know in this example that y=f(x1) when z>=5, I could train two random forests (one for z>=5 and one for z<5; see the sketch after the z>=5 example below), but in practice I don't know for which values of z x1 is the best predictor.
Following Stephan Kolassa's comment, I also included some tests with the z variable (+z and *z).
library(randomForest)
set.seed(100)
# y is defined directly by x1; x2..x5 are noisy copies of x1
x1 <- runif(300); y <- ifelse(x1 > .5, 1, 0)
x2 <- jitter(x1, amount = .5); x3 <- jitter(x1, amount = .5)
x4 <- jitter(x1, amount = .5); x5 <- jitter(x1, amount = .5)
# when z < 5, x1 is replaced by pure noise
z <- sample(1:10, 300, replace = TRUE); x1[z < 5] <- runif(sum(z < 5))
#Trained with x1,z
rf<-randomForest(as.factor(y)~x1+z,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 86 0
# 1 0 87
#Trained with x1,x2,x3,x4,x5
rf<-randomForest(as.factor(y)~x1+x2+x3+x4+x5,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 80 6
# 1 6 81
#Trained with x1,x2,x3,x4,x5,z
rf<-randomForest(as.factor(y)~x1+x2+x3+x4+x5+z,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 83 3
# 1 3 84
#Trained with x1*z,x2,x3,x4,x5
rf<-randomForest(as.factor(y)~x1*z+x2+x3+x4+x5,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 83 3
# 1 3 84
#Trained for z>=5
rf<-randomForest(as.factor(y[z>=5])~x1[z>=5]+x2[z>=5]+x3[z>=5]+x4[z>=5]+x5[z>=5],importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p,y[z>=5])
# 0 1
# 0 86 0
# 1 0 87
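Below is a minimal sketch of the two-forest idea mentioned above (one forest per z regime, dispatched on z at prediction time). It assumes the threshold 5 is known and reuses the simulated data from this example; the helper name predict_split is just illustrative, not from any answer.
# Sketch: one forest per z regime, with a wrapper that dispatches on z
# (assumes x1..x5, y, z from the simulation above and a known threshold of 5)
d <- data.frame(y = as.factor(y), x1, x2, x3, x4, x5, z)
rf_hi <- randomForest(y ~ x1 + x2 + x3 + x4 + x5, data = d[d$z >= 5, ], importance = TRUE)
rf_lo <- randomForest(y ~ x1 + x2 + x3 + x4 + x5, data = d[d$z < 5, ], importance = TRUE)
predict_split <- function(newdata, threshold = 5) {
  out <- factor(rep(NA, nrow(newdata)), levels = levels(d$y))
  hi <- newdata$z >= threshold
  if (any(hi))  out[hi]  <- predict(rf_hi, newdata[hi, ])
  if (any(!hi)) out[!hi] <- predict(rf_lo, newdata[!hi, ])
  out
}
# Example usage (in-sample fit rather than OOB, so it will look optimistic):
p_split <- predict_split(d)
table(p_split[d$z >= 5], y[z >= 5])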
Edit
RUser4512's solution (which requires knowing the threshold A) works:
x_1_bis = x1 * (z>=5) - 99 * (z<5)
rf<-randomForest(as.factor(y)~x2+x3+x4+x5+x_1_bis,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 86 0
# 1 0 87
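If A is unknown (the subsidiary question above), one possible approach is to scan candidate thresholds, rebuild x_1_bis for each one, and keep the threshold with the lowest out-of-bag error. The sketch below is only an illustration of that idea, assuming the OOB error is an acceptable selection criterion:
# Sketch: grid search for an unknown threshold A, scored by the OOB error of a
# forest built with the corresponding x_1_bis-style feature
candidates <- sort(unique(z))
oob_err <- sapply(candidates, function(A) {
  x_try <- x1 * (z >= A) - 99 * (z < A)
  rf_try <- randomForest(as.factor(y) ~ x2 + x3 + x4 + x5 + x_try)
  rf_try$err.rate[rf_try$ntree, "OOB"]  # OOB error after the last tree
})
best_A <- candidates[which.min(oob_err)]
best_A
In practice one would probably want to restrict the candidate grid or validate the chosen threshold on held-out data, since the search itself can overfit A.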
And Stephan Kolassa is also right: with more data, the random forest succeeds in partitioning the training dataset:
# same simulation as above, but with 10,000 observations
set.seed(100)
x1 <- runif(10000); y <- ifelse(x1 > .5, 1, 0)
x2 <- jitter(x1, amount = .5); x3 <- jitter(x1, amount = .5)
x4 <- jitter(x1, amount = .5); x5 <- jitter(x1, amount = .5)
z <- sample(1:10, 10000, replace = TRUE); x1[z < 5] <- runif(sum(z < 5))
rf<-randomForest(as.factor(y)~x1+x2+x3+x4+x5+z,importance=T)
varImpPlot(rf,type=1)
p<-predict(rf);table(p[z>=5],y[z>=5])
# 0 1
# 0 3043 1
# 1 1 3011
The variable importance is also more meaningful (x1 and z come out as the most important variables).
Thanks all.