
My original sample has 350 observations drawn randomly from a population of 60,000 people.

My dependent variable is Default, with 35 observations equal to 1 and the rest equal to 0.

I split my sample into Train (60%) and Test (40%) sets, then fit a multivariate logistic regression to predict Default on Train, and validate it on Test.
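Since the original data isn't shown, here is a minimal sketch of that split-and-fit step on simulated data of a similar shape (350 rows, roughly 10% defaults); the predictors and the scikit-learn pipeline are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 350
X = rng.normal(size=(n, 3))                    # three hypothetical predictors
# make Default depend weakly on the first predictor, ~10% base rate
p = 1 / (1 + np.exp(-(X[:, 0] - 2.2)))
y = rng.binomial(1, p)

# 60/40 split as in the question (stratify=y is optional but keeps
# the default rate similar in both halves)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```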

Because of the small sample size, I want to estimate a confidence interval for a performance statistic (the ROC AUC) using the bootstrap, in two ways:

(1) Draw 10,000 bootstrap samples of the same size as the original sample, split each into train and test sets, fit the model on train, and compute the ROC AUC on test. Finally, plot the distribution of the ROC AUC values.

However, when I do this, my ROC AUC values are widely dispersed, with high variance.

(2) Draw 10,000 bootstrap samples, but with 3,000 observations each (drawn with replacement from the original 350 observations), then repeat the process. In this case, my ROC AUC and Gini values become very concentrated. This is what I prefer.
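The two resampling schemes can be contrasted on simulated data. The sketch below (all data and names are illustrative, and 200 replicates stand in for 10,000) typically reproduces the pattern described: scheme (2) concentrates the AUC distribution, but only because each of the same 350 points is duplicated about 8.5 times:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 350
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 2.2))))

def boot_auc(m, n_boot=200):
    """Resample m rows with replacement, split 60/40, fit, score."""
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=m)
        Xb, yb = X[idx], y[idx]
        if yb.sum() < 2 or yb.sum() > len(yb) - 2:
            continue  # need both classes present to split and score
        Xtr, Xte, ytr, yte = train_test_split(
            Xb, yb, test_size=0.4, stratify=yb)
        auc = roc_auc_score(
            yte, LogisticRegression().fit(Xtr, ytr).predict_proba(Xte)[:, 1])
        aucs.append(auc)
    return np.array(aucs)

auc_350 = boot_auc(350)    # scheme (1): resamples of the original size
auc_3000 = boot_auc(3000)  # scheme (2): oversized resamples
print(f"scheme (1) SD: {auc_350.std():.3f}")
print(f"scheme (2) SD: {auc_3000.std():.3f}")
# scheme (2) shows a much tighter spread, but the extra "precision"
# reflects duplication of the same 350 points, not new information
```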

My question is:

In theory, what is the difference between the two methods?

Since (2) will most likely produce a better result, why is (1) much more popular?

AdamNYC
  • The reason for doing a bootstrap analysis is to gauge how "uncertain" your estimate is. If you do (2) you risk being over-certain. – Rasmus Bååth Nov 19 '14 at 14:07
  • Thank you @RasmusBååth. Could you explain why (2) would be over-certain? What statistical concept should I relate "over-certain" to? – AdamNYC Nov 19 '14 at 14:35
  • A way of seeing this is to take it to the limit. What if you drew 100,000 observations, or 1,000,000? Then the ROC curves estimated from those samples would be almost exactly the same every time. What would be the point of that? – Rasmus Bååth Nov 19 '14 at 14:39
  • @RasmusBååth: that is very clear. Thank you. My trouble lies in the fact that although I know for sure the Default rate in my population is around 10%, the sample I collected was too small, so the bad rate in bootstrapped test samples fluctuates a lot (it can range from 0% to 30%). Given this knowledge, can I constrain my bootstrap samples for a narrower confidence interval? – AdamNYC Nov 19 '14 at 14:44
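The constraint asked about in the last comment corresponds to a stratified bootstrap: resample within each outcome class so every bootstrap sample keeps the observed default rate. A minimal sketch of the index-drawing step, under the question's assumed 35-defaults-in-350 setup:

```python
import numpy as np

rng = np.random.default_rng(2)
# 35 defaults (1) and 315 non-defaults (0), as in the question
y = np.r_[np.ones(35, dtype=int), np.zeros(315, dtype=int)]
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

def stratified_bootstrap_indices():
    """Draw with replacement within each class, preserving the 10% rate."""
    return np.r_[rng.choice(pos, size=pos.size, replace=True),
                 rng.choice(neg, size=neg.size, replace=True)]

idx = stratified_bootstrap_indices()
print(y[idx].mean())  # always 0.1: the default rate no longer fluctuates
```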

0 Answers