
I am interested in assessing possible uncertainty in enumerated data.

To do this, I randomly pick 80% of the enumerated data and use it to predict the remaining 20% through a regression analysis. I repeat the process 1000 times. Ideally, this should allow me to assess the uncertainty affecting explanatory and predictive performance in a regression context.

Is that correct?

Gion Mors
  • Yes, it seems that what you are describing is simply a cross validation, see https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 – Tim Feb 24 '16 at 13:06
  • Could you explain what you mean by "enumerated" data? – whuber Feb 24 '16 at 13:20

1 Answer


Yes, this is called cross-validation. It is part of a broader family of resampling methods, in which you repeat a resampling procedure multiple times:

  • In a permutation test you sample without replacement $N$ values out of a sample of $N$ cases (i.e., you randomly shuffle the cases) and then compute some statistic $S$ on the shuffled sample. This lets you generate the null distribution of $S$, so you can compute the probability of observing a value of $S$ as extreme as (or more extreme than) the one obtained on your original sample.

  • In the bootstrap you sample with replacement $N$ values out of your original sample. This lets you re-create the distribution of your sample by simulation and assess the uncertainty of your estimate.

  • In cross-validation you randomly pick some proportion $p$ of the $N$ cases, using $p \times N$ cases to train your model and the remaining $(1-p) \times N$ cases for prediction. This lets you learn about the uncertainty of your predictions.
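The three sampling schemes above can be sketched in a few lines of NumPy; the ten-element toy sample here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)  # a toy "sample" of N = 10 cases
N = len(x)

# Permutation: draw all N values WITHOUT replacement, i.e. a random shuffle.
perm = rng.permutation(x)

# Bootstrap: draw N values WITH replacement; some cases repeat, others drop out.
boot = rng.choice(x, size=N, replace=True)

# Cross-validation split: pick a proportion p of the cases for training,
# keep the rest for prediction.
p = 0.8
idx = rng.permutation(N)
train, test = x[idx[: int(p * N)]], x[idx[int(p * N):]]
```

Each scheme reuses the same original cases; only the drawing rule (with or without replacement, all cases or a subset) differs.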

Note that these are general descriptions of the most common methods; in reality there is a greater variety of such methods, and in practice the procedures can be, and often are, more complicated than these simplified descriptions.
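As a concrete sketch of the procedure from the question (repeated random 80/20 splits, sometimes called Monte Carlo or repeated random sub-sampling validation), assuming a made-up linear data set in place of the real enumerated data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, plus noise.
N = 200
x = rng.uniform(0, 10, size=N)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=N)
X = np.column_stack([np.ones(N), x])  # design matrix with intercept

n_repeats, p = 1000, 0.8
n_train = int(p * N)
slopes, rmses = [], []

for _ in range(n_repeats):
    idx = rng.permutation(N)
    tr, te = idx[:n_train], idx[n_train:]
    beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)  # fit on the 80%
    resid = y[te] - X[te] @ beta                          # predict the held-out 20%
    slopes.append(beta[1])
    rmses.append(np.sqrt(np.mean(resid ** 2)))

# The spread of the fitted coefficients across repeats speaks to explanatory
# uncertainty; the spread of the held-out errors speaks to predictive uncertainty.
print(np.std(slopes), np.std(rmses))
```

Summarizing the 1000 coefficient estimates and the 1000 test errors (e.g., by their standard deviations or percentile intervals) gives exactly the kind of uncertainty assessment the question describes.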

Tim