
I made a random 70:30 split of the data to build a random forest model for binary classification. Although the prevalence of $Y=1$ was about 25% in both the training and test sets, the two sets ended up with different class balances when fitting the model and making predictions, because of missingness in the covariates: only complete cases were used, and the "complete" training set had about half as many $Y=1$ cases as the "complete" test set.

The AUC for the training data was about 0.70 and the AUC for the test data was about 0.85.

How should I explain this? I thought the training data would always show a higher AUC than the test data, because the model was built on the training data.

Blain Waan
    Did you split the data randomly into train / test? – user2974951 Dec 04 '18 at 10:35
  • Yes, I did it randomly. But there were missing values in the covariates, so while building the model it only used the "complete" training data, i.e. observations with values for all covariates. Similarly, it would only give me predictions for the "complete" test data. Those "complete" sets didn't each have 25% $Y=1$. Sorry that this was not clear from my question. – Blain Waan Dec 04 '18 at 10:40
  • Try 10 different test-train splits. How big is your data? It could just be that you got lucky. A good sanity check is to test that the mean and standard deviation are equal between the test and training sets. – Jon Nordby Dec 05 '18 at 02:34
  • How did the data 'become unbalanced' during training? If you are removing samples to deal with missing values, you should probably do that before splitting into test/train sets. If you want a model that is valid for such samples, then either impute the missing values or remove the features which have missing values (a sketch of both options follows these comments). – Jon Nordby Dec 05 '18 at 02:38
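
A minimal sketch of those two options (complete-case removal before the split vs. imputation inside a pipeline), assuming scikit-learn and pandas; the simulated DataFrame and all names below are illustrative stand-ins for the real covariates and outcome:

```python
# Minimal sketch of the two options mentioned in the comment above
# (drop incomplete rows before the split vs. impute inside a pipeline).
# The DataFrame below is a stand-in for the real covariates and outcome.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.normal(size=(1000, 5)),
                    columns=[f"x{i}" for i in range(5)])
X_df[X_df > 2] = np.nan                       # sprinkle in some missing covariate values
y = pd.Series((rng.random(1000) < 0.25).astype(int), name="y")

# Option 1: complete-case analysis *before* the 70:30 split, so both sets
# come from the same (reduced) population and keep a comparable class mix
complete = pd.concat([X_df, y], axis=1).dropna()
X_tr, X_te, y_tr, y_te = train_test_split(
    complete[X_df.columns], complete["y"],
    test_size=0.3, stratify=complete["y"], random_state=0)

# Option 2: keep every row and impute missing covariates inside a pipeline,
# so the imputer is fit on training data only and no observations are lost
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=500, random_state=0))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_df, y, test_size=0.3, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
```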

1 Answer


This can easily be attributed to random variation. While in-sample performance is indeed expected to be better than out-of-sample performance (i.e. the training error is expected to be less than the test error), that is not a necessity; because the AUC calculated here is a statistic, a function of the particular sample at hand, it is subject to sampling variability. It would be reasonable to use multiple training/test splits (or to bootstrap the sample at hand) so that the variability of that statistic can be quantified; repeated cross-validation and bootstrapping are standard approaches for estimating the sampling distribution of a statistic of interest. There is a very informative thread on CV, Hold-out validation vs. cross-validation, that I think will help clarify things even further.
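
For instance, a minimal sketch of the repeated-splits idea, assuming scikit-learn and using simulated data as a stand-in for the actual covariates and outcome (the class prevalence and all settings below are illustrative):

```python
# Minimal sketch: quantify the sampling variability of AUC with repeated
# stratified train/test splits (assumes a complete-case feature matrix X
# and binary labels y; the simulated data below is only a placeholder).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data with roughly 25% prevalence of the positive class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

# One out-of-sample AUC per held-out fold (5 folds x 10 repeats = 50 values)
aucs = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(f"AUC mean = {aucs.mean():.3f}, sd = {aucs.std():.3f}, "
      f"range = [{aucs.min():.3f}, {aucs.max():.3f}]")
```

Looking at the spread of the per-fold AUC values gives a sense of how much a single 70:30 split can move the estimate on its own.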

usεr11852