
In my classroom exposure to data mining, the holdout method was introduced as a way of assessing model performance. However, when I took my first class on linear models, this was not introduced as a means of model validation or assessment. My online research also doesn't show any sort of intersection. Why is the holdout method not used in classical statistics?
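For reference, this is the sort of thing I mean by the holdout method, as it was shown in class (a minimal sketch; the simulated data, the 70/30 split, and the scikit-learn calls are just illustrative):

```python
# Minimal holdout sketch: fit on one random subset, assess on the held-out rest.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Hold out 30% of the data; the model never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("held-out MSE:", mean_squared_error(y_test, model.predict(X_test)))
```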

tirkquest

4 Answers


A more productive question might be "why was it not used in the classical statistics I learned?"

Depending on the level(s) at which it was taught, the course content, and the time available, that choice may be due to a combination of factors. Important topics are often left aside because other material must be taught for one reason or another, with the hope that they will be covered in later subjects.

In some senses at least, the notion has long been used by a variety of people. It was more common in some areas than others. Many uses of statistics don't have prediction or model selection as a major component (or in some cases, even at all), and in that case, the use of holdout samples may be less critical than when prediction is the main point. Arguably, it ought to have gained more widespread use at an earlier stage in some relevant applications than it did, but that's not the same thing as being unknown.

If you look at areas that focus on prediction, the notion of assessing a model by predicting data you didn't use to estimate it was certainly around (though not universal). I was certainly doing it with the time series modelling I did in the 1980s, for example, where out-of-sample predictive performance on the most recent data was particularly important. It certainly wasn't a novel idea then; there were plenty of examples of that sort of notion around at the time.

The notion of leaving out at least some data was used in regression (deleted residuals, PRESS, the jackknife, and so on), and in outlier analysis, for example.
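To make the leave-one-out flavour of those ideas concrete, here is a small sketch (my own illustration, not from any particular source) computing deleted residuals and PRESS for an ordinary least-squares fit via the standard hat-matrix identity $e_{(i)} = e_i/(1 - h_{ii})$:

```python
# Deleted (leave-one-out) residuals and PRESS for an OLS fit, via the hat matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # design matrix with intercept
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
e = y - H @ y                                # ordinary residuals
e_del = e / (1 - np.diag(H))                 # deleted residuals e_(i) = e_i / (1 - h_ii)
press = np.sum(e_del ** 2)                   # PRESS: prediction sum of squares

print("residual SS:", np.sum(e ** 2), " PRESS:", press)
```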

Some of these ideas date back a good deal earlier still. Stone (1974)[1] refers to papers on cross-validation (with the word in the title) from the 1950s and 60s. Perhaps even closer to your intent, he mentions Simon's (1971) use of the terms "construction sample" and "validation sample" -- but also points out that "Larson (1931) employed random division of the sample in an educational multiple-regression study".

Topics like cross-validation, and the use of statistics based on prediction and so on, were becoming substantially more frequent in the statistics literature through the 70s and 80s, for example, but many of the basic ideas had been around for quite some time even then.

[1]: Stone, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Series B (Methodological), Vol. 36, No. 2, pp. 111-147.

Glen_b

To complement the answer by Glen_b: classical statistics often had (and has) an emphasis on optimal use of the data, optimal tests, optimal estimators, sufficiency, and so on, and in that theoretical framework it is difficult to justify not using part of the information! Part of that tradition is an emphasis on situations with small samples, where holding out data is practically difficult.

Fisher, for instance, worked mainly with genetics and agricultural experimentation, fields in which a small number of observations was the rule, so he was mainly exposed to problems with small data sets.

kjetil b halvorsen
  • It's also true that Fisher regarded it as obvious that you shouldn't rely on a single hypothesis test; there is always a need to check whether a result will hold up with similar datasets. – Nick Cox Jan 09 '22 at 15:50
  • 1
    John Nelder in 1999 "I contend that the $P$-value culture has encouraged the non-scientific cult of the single study, considered in isolation. It is ironic that the development of methods for dealing with small amounts of information began with Student and Fisher, two men who were fully aware of the need to combine information across studies. Others seem to have had a more restricted vision." . – Nick Cox Jan 09 '22 at 17:26

I'll answer from an applied field that is maybe in between classical statistics and machine learning: chemometrics, i.e. statistics for chemical analyses. I'll add two different scenarios where hold-out is not as important as it is in typical machine learning classes.


Scenario 1:

I think one crucial point here is to realize that there is a fundamental difference in what counts as a small sample size for training vs. for testing:

  • For training, what typically matters is the ratio of the number of cases to the model complexity (number of parameters), i.e. the degrees of freedom.
  • For testing, the absolute number of test cases matters; see the sketch after this list.
    (The quality of the testing procedure does not depend on the model: the model is treated as a black box by validation with independent test cases.)
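A quick back-of-the-envelope sketch of that second point (the accuracy value and test-set sizes below are just assumptions for illustration): the uncertainty of a test-set performance estimate shrinks with the absolute number of independent test cases, no matter how the model was trained.

```python
# Approximate 95% margin of error for an observed test-set accuracy:
# it depends only on the absolute number of independent test cases.
import numpy as np

p_hat = 0.90                                     # observed accuracy (illustrative value)
for n_test in (25, 100, 400, 1600):
    se = np.sqrt(p_hat * (1 - p_hat) / n_test)   # binomial standard error
    print(f"n_test = {n_test:4d}: accuracy {p_hat:.2f} +/- {1.96 * se:.3f}")
```

With 25 test cases the margin is roughly ±0.12, so the test says very little; with 1600 it is about ±0.015.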

The second point I'm going to need for my argument is that the situation where independent test cases are crucial is overfitting. If the model is not complex enough (bias $\gg$ variance, i.e. underfitting), residuals can tell you as much about total prediction error as independent cases can.

Now, statistics lectures on "classical" linear models often emphasise univariate models very much. For a univariate linear model, the training sample size is likely not small: training sample sizes are typically judged in comparison to model complexity, and the linear model has just two parameters, offset and slope. In analytical chemistry, we actually have a norm that states you should have at least 10 calibration samples for your univariate linear calibration. This ensures a situation where model instability is reliably not an issue, so hold-out is not needed.

However, in machine learning, as well as with modern multi-channel detectors in chemical analysis (sometimes 10⁴ "channels", e.g. in mass spectrometry), model stability (i.e. variance) is an important issue. Thus, hold-out or, better, resampling is needed.
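As a hedged illustration of what such instability looks like (the simulated wide data set, the ridge model, and all numbers are my own assumptions, not any chemometric standard): refit on resamples of a small data set with many channels and watch how much the prediction for one fixed sample moves.

```python
# Sketch of model instability with many "channels" and few samples:
# refit on bootstrap resamples and watch the spread of predictions for one fixed case.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, p = 40, 500                                # few samples, many channels
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=n)
x_new = rng.normal(size=(1, p))               # one fixed "unknown" sample

preds = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)          # resample the training set
    preds.append(Ridge(alpha=1.0).fit(X[idx], y[idx]).predict(x_new)[0])

print("SD of the prediction across refits:", np.std(preds))
```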


Scenario 2:

A completely different situation is one where hold-out is skipped in favor of a combination of an easier performance measurement (residuals) plus a more sophisticated one. Note that hold-out in the sense of (randomly) setting aside part of a data set and excluding it from training is not equivalent to what independent testing can achieve. In analytical chemistry, dedicated validation experiments may be conducted that include, e.g., measuring performance degradation over time (instrument drift), which cannot be measured by hold-out, or establishing the performance of the sensor in the actual industrial environment (whereas the sensor calibration was done in the lab on calibration samples). See also https://stats.stackexchange.com/a/104750/4598 for more details on independent testing vs. hold-out.

cbeleites unhappy with SX

Besides the excellent discussion above, there are other reasons why holdout samples were not, and still are not, frequently used in statistics. Holding out data from discovery and model fitting is inefficient and wasteful of information, and in some cases we have analytic results that efficiently provide what you need, e.g., an estimate of likely future model performance. The simplest example is the residual $\sigma^2$ in linear models, for which we have long had an estimate that is unbiased by overfitting. $R^{2}_{\mathrm{adj}}$ is another example. Then there is resampling, which is an invention from the field of statistics. 100 repeats of 10-fold cross-validation is a nearly unbiased and 9/10-efficient procedure. The bootstrap is an almost unbiased and fully efficient procedure. Both of these estimate the likely future performance on observations from the same stream of observations used to build the model.
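As a rough sketch of the two resampling estimates just mentioned (the simulated data, the plain linear model, and the scikit-learn calls are illustrative choices; the bootstrap shown is the optimism-corrected variant, one common way of implementing the idea):

```python
# Two resampling estimates of likely future performance on the same data stream:
# 100 repeats of 10-fold cross-validation, and an optimism-corrected bootstrap of R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=1.0, size=n)

# 100 repeats of 10-fold cross-validation on the full sample.
cv = RepeatedKFold(n_splits=10, n_repeats=100, random_state=0)
cv_r2 = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)
print("repeated 10-fold CV R^2:", cv_r2.mean())

# Optimism bootstrap: apparent performance minus the average optimism across resamples.
apparent = r2_score(y, LinearRegression().fit(X, y).predict(X))
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    m = LinearRegression().fit(X[idx], y[idx])
    optimism.append(r2_score(y[idx], m.predict(X[idx])) - r2_score(y, m.predict(X)))
print("optimism-corrected R^2:", apparent - np.mean(optimism))
```

Note that neither procedure sets aside any data permanently: every observation is used both for fitting and, across resamples, for assessment.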

This discussion touches on independent-sample validation vs. rigorous internal validation a la resampling. And it is a common mistake to label estimated performance on a holdout sample as "external validation", which it usually is not. This is discussed here.

Bayesian modeling thinks of this in yet another way: prior information starts the process, the parameter "estimates" (actually distributions) are trusted on the basis of that information, and there is no overfitting per se.

Frank Harrell