
I am interested in assessing possible uncertainty in enumerated data.

To do this, I randomly pick 80% of the enumerated data and use it to predict the remaining 20% through a regression analysis. I repeat the process 1000 times. Ideally, this should allow me to assess the uncertainty affecting explanatory and predictive performance in a regression context.

Is that correct?

Gion Mors
  • Yes, it seems that what you are describing is simply a cross validation, see https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 – Tim Feb 24 '16 at 13:06
  • Could you explain what you mean by "enumerated" data? – whuber Feb 24 '16 at 13:20

1 Answer


Yes, this is called cross-validation. It is part of a broader family of resampling methods, in which you repeat a resampling procedure multiple times:

  • In a permutation test you sample without replacement $N$ values out of a sample of $N$ cases (i.e., you randomly shuffle the cases) and then compute some statistic $S$ on the shuffled sample. This lets you generate the null distribution of $S$, so you can compute the probability of observing a value of $S$ as extreme as (or more extreme than) the one obtained on your original sample.

  • In the bootstrap you sample with replacement $N$ values out of your original sample. This lets you re-create the distribution of your sample by simulation and assess the uncertainty of your estimate.

  • In cross-validation you randomly pick some proportion $p$ of the $N$ cases, using $p \times N$ cases to train your model and the remaining $(1-p) \times N$ cases for prediction. This lets you learn about the uncertainty of your predictions.
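The three sampling schemes above can be sketched in a few lines of NumPy; the ten-element toy sample here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)  # a toy "sample" of N = 10 cases
N = len(x)

# Permutation: draw all N values WITHOUT replacement, i.e. a random shuffle.
perm = rng.permutation(x)

# Bootstrap: draw N values WITH replacement; some cases repeat, others drop out.
boot = rng.choice(x, size=N, replace=True)

# Cross-validation split: pick a proportion p of the cases for training,
# keep the rest for prediction.
p = 0.8
idx = rng.permutation(N)
train, test = x[idx[: int(p * N)]], x[idx[int(p * N):]]
```

Each scheme reuses the same original cases; only the drawing rule (with or without replacement, all cases or a subset) differs.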

Note that these are general descriptions of the most common methods; in reality there is a greater variety of such methods, and in practice the procedures can be, and often are, more complicated than these simplified descriptions.
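As a concrete sketch of the procedure from the question (repeated random 80/20 splits, sometimes called Monte Carlo or repeated random sub-sampling validation), assuming a made-up linear data set in place of the real enumerated data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, plus noise.
N = 200
x = rng.uniform(0, 10, size=N)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=N)
X = np.column_stack([np.ones(N), x])  # design matrix with intercept

n_repeats, p = 1000, 0.8
n_train = int(p * N)
slopes, rmses = [], []

for _ in range(n_repeats):
    idx = rng.permutation(N)
    tr, te = idx[:n_train], idx[n_train:]
    beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)  # fit on the 80%
    resid = y[te] - X[te] @ beta                          # predict the held-out 20%
    slopes.append(beta[1])
    rmses.append(np.sqrt(np.mean(resid ** 2)))

# The spread of the fitted coefficients across repeats speaks to explanatory
# uncertainty; the spread of the held-out errors speaks to predictive uncertainty.
print(np.std(slopes), np.std(rmses))
```

Summarizing the 1000 coefficient estimates and the 1000 test errors (e.g., by their standard deviations or percentile intervals) gives exactly the kind of uncertainty assessment the question describes.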

Tim