
Let's start with a description of the website-visit data I analyse:

  • 6M rows
  • The dependent variable, quotation, is binary (0/1), and only 1% of the observations are 1
  • The other 3 variables are temperature, humidity and minute of the day

The objective is to identify quotation trends based on the weather in order to optimize communication campaigns, not to predict whether a given visit will lead to a quotation.

To avoid overfitting on such a large dataset, I decided to cross-validate my tree models to choose the right one, along the lines of the sketch below.
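
A minimal sketch of the k-fold set-up I mean, assuming a data frame `visits` with the columns listed above (the object and column names are illustrative, not my real ones):

library(party)

# quotation must be a factor for ctree() to treat this as classification
visits$quotation = as.factor(visits$quotation)

set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(visits)))

for (i in 1:k) {
  fit  = ctree(quotation ~ temperature + humidity + minute,
               data = visits[folds != i, ])
  pred = predict(fit, newdata = visits[folds == i, ])
  print(table(pred))   # with only 1% of 1s, every fold predicts 0
}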

My questions:

Because quotation = 1 is so rare, even the best leaf node only reaches about 5% on the training sample. Therefore, when I run predict() on my testing sample I get 0 for every observation.

  1. Is there a way with the party package to attribute the corresponding terminal node to each observation of the testing sample?
  2. Is this the right method to evaluate my different models, given that predict() doesn't seem to work for me (0 for all observations)?

I went there, but every suggestion is based on predict(), which I feel is of no help in my case...

Yohan Obadia
  • You should definitely consider using your own threshold rather than the default one, which is 50%. Try to return a probability and not a class (0/1), as everything will be classified as 0. Try to plot an ROC curve to see how your model performs. The fact that the best node is at 5% doesn't mean you can't spot the most likely cases. – AntoniosK Sep 10 '15 at 19:06
  • Do you by any chance know when to apply the threshold? Should it be when you create the model or when you use it for prediction? Also, if I use a different threshold, say 3%, will an observation be classified as 1 whenever it goes above it? – Yohan Obadia Sep 11 '15 at 02:15
  • Threshold should be applied after you get the predicted probabilities. If you use 3% then each new observation with predicted probability >= 0.03 will be classified as 1 and the rest as 0. I'll try to post an example if it's not very clear to you. – AntoniosK Sep 11 '15 at 06:58
  • I really don't get how to do it in practice. I've been looking at it all day. If you have an example with the `party` package, that would be very helpful. In the meantime, I'll try undersampling, a method found here: http://stats.stackexchange.com/questions/28029/training-a-decision-tree-against-unbalanced-data – Yohan Obadia Sep 11 '15 at 15:29
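
A minimal sketch of the undersampling idea mentioned in the last comment, keeping all the positives and an arbitrary 5:1 sample of the negatives (again using the illustrative `visits` names from the sketch above):

set.seed(2)
pos = visits[visits$quotation == "1", ]
neg = visits[visits$quotation == "0", ]
neg_sample = neg[sample(nrow(neg), 5 * nrow(pos)), ]   # arbitrary 5:1 ratio
visits_balanced = rbind(pos, neg_sample)

# refit the tree on the rebalanced sample
model_balanced = ctree(quotation ~ temperature + humidity + minute,
                       data = visits_balanced)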

1 Answer


Run this example:

library(party)

set.seed(15)

# example data
dt = data.frame(y = c(rbinom(n=2000,size=1,prob=0.1), 
                      rbinom(n=2000,size=1,prob=0.2),
                      rbinom(n=2000,size=1,prob=0.3)),
                group = c(rep("A",3000), rep("B",3000)),
                x = c(sort(rnorm(3000,50,2)), sort(rnorm(3000,70,3), decreasing = T)))

dt$y = as.factor(dt$y)

# separate train and test set (50/50 split here)
rn = sample(1:nrow(dt), 3000)

dt_train = dt[rn,]
dt_test = dt[-rn,]

# build model
model = ctree(y~group+x, data = dt_train)

# visualise model
plot(model, type="simple")

# predict new data
dt_test$predClass = predict(model, newdata=dt_test, type="response")    # obtain the class (0/1)
dt_test$predProb = sapply(predict(model, newdata=dt_test,type="prob"),'[[',2)  # obtain probability of class 1 (second element from the lists)
dt_test$predNode = predict(model, newdata=dt_test, type="node")   # obtain the predicted node (in case you need it)

You will see that ALL predClass values in dt_test are 0. You can use column predProb to create your own classification based on your threshold. For example:

table(dt_test$predClass, dt_test$y)  # everything is classified as 0

#      0    1
# 0 2392  608
# 1    0    0

# pick a threshold of 0.2
dt_test$predClass2 = 0
dt_test$predClass2[dt_test$predProb >= 0.2] = 1


table(dt_test$predClass2, dt_test$y)  # you have some cases classified as 1

#      0    1
# 0 1342  229
# 1 1050  379
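
The predNode column obtained above also answers the question of attributing a terminal node to each test observation. A short sketch of a per-node summary, using the same objects (nothing here beyond base R):

# observed proportion of class 1 within each predicted terminal node
tapply(dt_test$y == "1", dt_test$predNode, mean)

# node-level comparison of observed rate and predicted probability
aggregate(data.frame(obs_rate = dt_test$y == "1", predProb = dt_test$predProb),
          by = list(node = dt_test$predNode), FUN = mean)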

If you want to see how good your model is, you can use an ROC curve. There is no need to apply a threshold here, as the process uses your predicted probabilities as they are:

library(ROCR)

# plot ROC
roc_pred <- prediction(dt_test$predProb, dt_test$y)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")

# get area under the curve
performance(roc_pred,"auc")@y.values
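
If your objective is mainly to limit false negatives, one possible sketch is to scan a few candidate thresholds and compare the resulting error counts (the grid of thresholds below is arbitrary):

thresholds = c(0.10, 0.15, 0.20, 0.25)
sapply(thresholds, function(th) {
  pred = ifelse(dt_test$predProb >= th, 1, 0)
  c(threshold = th,
    false_neg = sum(pred == 0 & dt_test$y == "1"),   # actual 1 predicted as 0
    false_pos = sum(pred == 1 & dt_test$y == "0"))   # actual 0 predicted as 1
})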
AntoniosK
  • This is really helpful for understanding how to use the `party` package! Do you mean that the threshold you mentioned before is simply to be applied manually based on the result, as in `dt_test$y <- ifelse(dt_test$predProb > 0.2, 1, 0)`? (I tried it and got a completely vertical ROC curve with your data) – Yohan Obadia Sep 11 '15 at 16:52
  • Also, what do you think of my suggestion to try undersampling as an alternative method to deal with unbalanced data? – Yohan Obadia Sep 11 '15 at 16:53
  • The threshold is something you'll pick based on your objectives (e.g. false positives and false negatives), so it is manually picked, but not based on your intuition or what seems reasonable. The ROC curve shouldn't use that threshold. The ROC curve process needs to know only the ACTUAL classes (0 or 1; you can't change them) and the predicted probabilities (from 0 to 1; obtained by the model). – AntoniosK Sep 11 '15 at 16:58
  • What do you mean by manually picked but not based on intuition? I want to minimize false negatives, but in a proportion that is yet to be defined (I don't want 0 false negatives). By any chance, and I know I'm asking a lot, is there a way to know which terminal node each value in dt_test falls into? Thanks again! – Yohan Obadia Sep 11 '15 at 17:03
  • You can do `dt_test$predNode = predict(model, newdata=dt_test, type="node")` to save the predicted node as a column, next to your predicted probabilities. Intuition would be "my max probability is 0.3, so let's pick a threshold of 0.25". You have a clear objective, so you can try a series of thresholds (e.g. from 0.1 to 0.25) and see which one minimises your false negatives. – AntoniosK Sep 11 '15 at 17:09
  • And it was so simple! Thank you very much, I've been struggling a lot on this lately. You helped me greatly! – Yohan Obadia Sep 11 '15 at 17:12
  • Glad I've helped. Again, be careful about what you change. In your first comment you use a threshold to change `y`, which holds the actual values/classes, and you can't do that. You can use the threshold to create another column and compare it with `y`. – AntoniosK Sep 11 '15 at 17:14
  • This is the documentation that R is missing... But shouldn't it be `roc_pred <- prediction(dt_test$predClass, dt_test$y)`? – PHPirate Dec 10 '17 at 17:07
  • @PHPirate Predicted class vs. actual class will always produce a confusion matrix, because both variables are discrete. An ROC curve, being a curve, needs one numeric variable, and that is the predicted probability. The objective of the ROC curve is to examine the false positives and false negatives for different values of the predicted probability. – AntoniosK Dec 10 '17 at 17:12
  • @PHPirate Another good question is "how do you, or the model, obtain the predicted class?" What is the probability threshold at which you say: if the predicted probability is above `p`, then the predicted class is 1? Typically models use `p = 0.5`, but that's not always appropriate. The ROC curve helps you choose that `p` in order to obtain the predicted class. Therefore, the ROC curve function cannot take the predicted class as input. – AntoniosK Dec 10 '17 at 17:17
  • Ah thanks! I got confused by the documentation of `ROCR::prediction`, which speaks about the parameters 'predictions' and 'labels', but predicted probabilities indeed make more sense than predicted classes for a curve. Interesting question, I hadn't considered that yet! – PHPirate Dec 10 '17 at 17:38