
I have come across papers where the AUROC is reported for both a training and a testing set.


When using the MLeval package, I have used my training data set, as shown here:

    library(caret)    # train(), trainControl()
    library(MLeval)   # evalm()

    randomforestfit1 <- train(T2DS ~ .,
                              data = mod_train.newy,
                              method = "rf",
                              trControl = trainControl(method = "repeatedcv",
                                                       number = 10,
                                                       repeats = 5,
                                                       savePredictions = TRUE,
                                                       classProbs = TRUE,
                                                       verboseIter = TRUE))

    ## evaluate the cross-validated (training) predictions
    x <- evalm(randomforestfit1)

    ## get ROC curve plotted in ggplot2
    x$roc

    ## get AUC and other metrics
    x$stdres

My AUROC for metabolites + visceral fat + crp-1 is 0.82.

My AUROC for visceral fat + crp-1 is 0.69.

When using my validation set, the AUROC is 0.88 and 0.86 respectively. I thought it was better to report only the validation set rather than both. Can anybody please advise?

Willow9898

2 Answers


YES, report both.

Comparing performance on training data vs out-of-sample data can give an idea of whether performance could be improved via bias reduction or variance reduction. If both have poor performance, you would suspect that you haven't captured the trends in the data; you need more parameters or perhaps additional variables to explain the outcome. (This is high bias.) If you have strong in-sample performance but weak out-of-sample performance, you would suspect that you have overfit to the training data. (This is high variance.)
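Concretely, with the caret workflow from your question, one way to get both numbers is sketched below. This is untested and assumes a held-out data frame called mod_test.newy with the same columns as mod_train.newy (that name is only a placeholder for whatever your validation data are called); the cross-validated AUROC comes from the resampled predictions you already feed to MLeval, and the held-out AUROC comes from predicting class probabilities on data the model never saw.

    library(caret)
    library(pROC)   # roc(), auc()

    ## cross-validated (training) AUROC: what evalm(randomforestfit1)$stdres
    ## already reports from the saved resample predictions

    ## held-out (validation) AUROC: predict class probabilities on the
    ## held-out data, then build the ROC curve from them
    val_probs <- predict(randomforestfit1, newdata = mod_test.newy, type = "prob")
    auc(roc(mod_test.newy$T2DS, val_probs[[2]]))   # column 2 = second class level

Reporting those two values side by side is exactly the bias/variance comparison described above.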

You might be thinking that this is important to track in your group but not important to report. I disagree. You publish work so that others can build on your work. For someone to build upon your work, they should see where the weaknesses are. This is why other papers are reporting both.

Forgottenscience is too dismissive of in-sample performance, but does allude to the valid point that you should not be satisfied with a model unless it performs adequately out of sample, not just in sample.

Dave
  • Thanks for your comment. I am new to machine learning/random forests. In my case, my out of sample performance is slightly stronger than my in-sample performance. So in terms of explaining this, please could you tell me whether this suggests that it is unlikely that I have overfitted the training data, and my model is performing ok, but could inevitably still be optimized by incorporating variables with a greater predictive capacity at discriminating disease from control? – Willow9898 May 26 '20 at 16:49
  • I am actually a little confused with my results. The accuracy for my prediction with my metabolites + visceral fat + crp1 is 0.8261, the AUROC was 0.88. Whilst for the visceral fat + crp-1, the accuracy is higher at 0.8696, yet the AUROC is lower at 0.86. This doesn't make any sense to me, so I will probably ask another question, as I assume I've gone wrong somewhere. – Willow9898 May 26 '20 at 16:50
  • @Willow9898 I have no way of knowing if 0.86 is any good for your task (it could be amazing, pitiful, or anywhere in between). An 86% sounds like a solid grade, though, good for a B, so let's say that you're doing a solid job. Your AUC values are quite close, so it does not seem like you've overfit. It's okay to get slightly better in-sample performance than out-of-sample performance. In fact, that is expected. You should do at least a little better on in-sample data (except for the occasional fluke). – Dave May 26 '20 at 16:53
  • It's certainly possible to get better accuracy with one model but better AUC with another. AUC, loosely speaking, measures accuracy at all thresholds. Maybe it makes sense to use 0.5 as the threshold; maybe it makes more sense to use 0.8. This depends on the problem and is up to the designer. Really, though, accuracy is often a flawed metric, bizarre as that sounds. Shamelessly, I will mention a question of mine that gives an example where accuracy might not be a great metric: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email. – Dave May 26 '20 at 16:56
  • Thank you for your help. I suppose I will have to be careful with what values I am stating (for instance, mentioning both the AUROC and accuracy values here will probably confuse my reader). I think I will probably just go with mentioning the AUROC values. – Willow9898 May 26 '20 at 17:02
  • @Willow9898 Accuracy without a cutoff threshold (do you know what I mean by that?) is almost meaningless. We call many of these algorithms classifiers, but they output probabilities. They become classifiers along with a decision rule. – Dave May 26 '20 at 17:13
  • I need to increase my understanding of accuracy cut-off thresholds. There is a post on here about it: https://stats.stackexchange.com/questions/112388/how-to-change-threshold-for-classification-in-r-randomforests I am unsure how you select the cut-off threshold and how this improves the interpretation of accuracy (I definitely need to do more reading). Could you please elaborate if at all possible? – Willow9898 May 26 '20 at 17:34
  • @Willow9898 That is a separate question. Please post a new question where you specify what you do and do not understand from related posts like the one you linked. – Dave May 26 '20 at 17:35
  • I think I agree, @Dave - I think my field is extraordinarily wary of in-sample performance, which is why I answered as I did. – Forgottenscience May 26 '20 at 21:02

No one should care about performance on the training set; the key quantity is whether you can generalize out of sample. Having the area under the ROC curve be 1 for the training data is irrelevant (and occasionally achievable with random forests), so in any case you need to include the same estimates for the validation set and/or test set to provide useful information about your classifier.

Again, whatever you report on the training set can be overfit to an arbitrary degree; the only real test of value is hold-out performance.
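For a quick illustration of that first point, here is a rough sketch (simulated pure-noise data, using the randomForest and pROC packages, nothing to do with the OP's data): because there is no real signal to learn, the held-out AUC sits around 0.5, yet the in-sample AUC is still close to 1.

    ## pure-noise predictors: there is nothing real to learn, yet the
    ## in-sample ROC of a random forest still looks near perfect
    library(randomForest)
    library(pROC)

    set.seed(1)
    make_noise <- function(n) {
      data.frame(matrix(rnorm(n * 10), n),
                 y = factor(sample(c("ctrl", "case"), n, replace = TRUE)))
    }
    train_df <- make_noise(400)
    test_df  <- make_noise(400)

    fit <- randomForest(y ~ ., data = train_df)

    ## in-sample AUC: close to 1; held-out AUC: close to 0.5 (chance level)
    auc(roc(train_df$y, predict(fit, newdata = train_df, type = "prob")[, "case"]))
    auc(roc(test_df$y,  predict(fit, newdata = test_df,  type = "prob")[, "case"]))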

Forgottenscience