
I have a statistical model with around 20 predictor variables, built on 90% of a dataset consisting of over 600k observations. The original developer held out 10% of the original dataset for the purpose of external validation.

From my reading, it seems that even cross-validation is sensitive to how the dataset is partitioned, let alone this single one-fold data split. I was hoping to use bootstrapping to get a more objective measure of the predictive ability of the model that has been developed.

If I am not concerned about the robustness of the model development procedure, but am only interested in quantifying the (estimated) predictive ability of this particular model as developed (without re-estimating the parameters):

  1. Would it be appropriate to sample 10000 observations with replacement, compute the c-statistic, and repeat the process, say, 100-500 times? (A rough sketch of what I have in mind follows this list.)
  2. What are some potential drawbacks / dangers of this approach, i.e., of not drawing bootstrap samples the same size as the original dataset (which would be too costly and time-consuming)?
  3. What alternative approach would you suggest?
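
To make question 1 concrete, here is a minimal sketch of the procedure I have in mind. The variable names, the fake data, and the use of `roc_auc_score` as the c-statistic are my own assumptions for illustration; the real model's predicted probabilities and outcomes would replace them.

```python
# Minimal sketch of the subsample-bootstrap idea in question 1.
# Assumes the fitted model's predicted probabilities (y_prob) and the binary
# outcomes (y_true) for all ~600k observations are already available as arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_c_statistics(y_true, y_prob, n_boot=200, subsample_size=10_000):
    """Draw n_boot samples of subsample_size with replacement and
    compute the c-statistic (AUC) on each."""
    n = len(y_true)
    c_stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=subsample_size)  # sample with replacement
        # Each subsample must contain both outcome classes for the AUC to be defined.
        c_stats[b] = roc_auc_score(y_true[idx], y_prob[idx])
    return c_stats

# Fake data standing in for the real model output:
y_true = rng.integers(0, 2, size=600_000)
y_prob = np.clip(y_true * 0.3 + rng.uniform(size=600_000) * 0.7, 0, 1)

c_stats = bootstrap_c_statistics(y_true, y_prob)
print(c_stats.mean(), np.percentile(c_stats, [2.5, 97.5]))
```

The mean of `c_stats` would be my point estimate of the c-statistic, and the spread across replicates would be my (rough) measure of its uncertainty.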

Thank you very much!

Related stats.stackexchange.com entries:

  1. How to draw valid conclusions from "big data"?
Clark Chong
  • Is $Y$ continuous? What is the distribution of $Y$ if it is discrete? – Frank Harrell Nov 05 '14 at 22:10
  • This is quite a complex model consisting of several logistic regressions and a Markov chain. There are at least two different "output" variables I can use: a) a continuous variable, which would give one number for each sample of 10000 observations; I can then compare it with the actual value and compute the difference; b) for each of the 10000 observations, I will get a probability for each of the five states it could be in. I could compare its actual state with the state the model predicts it is most likely to be in. I do not know how to describe the distribution of $Y$ in this case. – Clark Chong Nov 06 '14 at 02:41
  • I'm confused about why the observations are blocked in groups of 10000 as opposed to having the original 600k observations you mentioned. And if you have the opportunity to stick to a continuous $Y$ you will have *much* more information. If you really need to deal with categories of a discrete $Y$, what is the number of original observations in the smallest category of interest? All this relates to which method of validation is adequate for the task. – Frank Harrell Nov 06 '14 at 12:51
  • Apologies for the confusion; I have just started thinking about this recently, so many thoughts are still very rough. 1) 10000 was an arbitrary choice, as it took five hours to run the projection code once on the original 600k observations; I was trying to make the computation manageable. 2) There could be other potential targets to assess. As of right now, the choice is between getting a single number out of the sample vs. getting a 5-category partition of the sample. To me, both seem to lose a lot of information; I am just trying to pick the lesser evil. The smallest category has several thousand. Thanks! – Clark Chong Nov 06 '14 at 13:40
  • So I gather from that that you are unable to run your prediction development on the whole 600k sample because of computational difficulties. It may be necessary to solve the computational problem first. Otherwise you might average predictions over many blocks of 10k. If your smallest $Y$ category has at least 4000 observations you may get away with one-time data splitting for development/validation, holding back 40k observations for validation and avoiding (for now) the bootstrap. – Frank Harrell Nov 06 '14 at 13:50
  • My preliminary thought is that the computational problem may be difficult to tackle until one can figure out a way to vastly simplify the model. I will look into potential ways to make the code more efficient. As for your second suggestion: I was hoping to assess the mean and variance of the (yet-to-be-determined) "predictive/error metric" so that I may compare this model with, say, a commercial model. With one-time data splitting, it seems to me that the variance might be too large to make a meaningful comparison, so maybe the sample size you suggested would help with that? Thank you very much! – Clark Chong Nov 06 '14 at 14:01
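
For what it is worth, here is a rough sketch of the block-averaging idea raised in the comments above (computing the c-statistic on disjoint blocks of 10k and averaging, rather than bootstrapping). It reuses the same assumed `y_true` / `y_prob` arrays and `roc_auc_score` stand-in as the earlier sketch; none of this is a prescribed method, just an illustration of what I understood.

```python
# Rough sketch of block-averaged evaluation: shuffle once, split the ~600k
# observations into disjoint blocks of 10k, compute the c-statistic per block,
# and summarize across blocks.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def blocked_c_statistics(y_true, y_prob, block_size=10_000):
    n = len(y_true)
    perm = rng.permutation(n)          # shuffle once so blocks are random
    n_blocks = n // block_size
    c_stats = []
    for b in range(n_blocks):
        idx = perm[b * block_size:(b + 1) * block_size]
        c_stats.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return np.array(c_stats)

# The mean across blocks gives the block-averaged c-statistic; the spread
# across blocks gives a crude sense of its variability.
```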
