In Section 7.2 of Hastie, Tibshirani, and Friedman (2013), The Elements of Statistical Learning, we have a target variable $Y$ and a prediction model $\hat{f}(X)$ estimated from a training set $\mathcal{T} = \{(X_1, Y_1), \ldots, (X_N, Y_N)\}$. The loss is denoted $L(Y, \hat{f}(X))$, and the authors define the test error,
\begin{equation}
\mathrm{Err}_{\mathcal{T}} = \mathbb{E} \left[ L(Y, \hat{f}(X)) \mid \mathcal{T} \right] ,
\end{equation}
and the expected test error,
\begin{equation}
\mathrm{Err} = \mathbb{E} \left( \mathrm{Err}_{\mathcal{T}} \right) .
\end{equation}
The authors then state:
"Estimation of $\mathrm{Err}_{\mathcal{T}}$ will be our goal..."
My question: Why do we care more about $\mathrm{Err}_{\mathcal{T}}$ than $\mathrm{Err}$?
I would have thought that the expected loss averaged over all possible training samples would be more interesting than the expected loss conditional on one specific training sample. What am I missing here?
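To make the distinction concrete for myself, here is a minimal simulation sketch (my own construction, not from the book; the data-generating process, noise level, and polynomial fit are all arbitrary choices for illustration). $\mathrm{Err}_{\mathcal{T}}$ is approximated by fitting on one training set and evaluating on a large test draw; $\mathrm{Err}$ is approximated by averaging that quantity over many independent training sets.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30  # training-set size (arbitrary choice)

def draw_sample(n):
    # (X, Y) pairs from a made-up data-generating process: Y = sin(X) + noise
    x = rng.uniform(0, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)
    return x, y

def fit(x, y, degree=3):
    # \hat{f}: a cubic polynomial fit, standing in for any learning procedure
    return np.polyfit(x, y, degree)

def test_error(coefs, n_test=100_000):
    # Monte Carlo approximation of E[L(Y, \hat{f}(X)) | T], squared-error loss
    x, y = draw_sample(n_test)
    return np.mean((y - np.polyval(coefs, x)) ** 2)

# Err_T: test error conditional on ONE specific training set T
x_tr, y_tr = draw_sample(N)
err_T = test_error(fit(x_tr, y_tr))

# Err: Err_T averaged over many independent draws of the training set
err = np.mean([test_error(fit(*draw_sample(N))) for _ in range(200)])

print(f"Err_T (this training set): {err_T:.4f}")
print(f"Err   (averaged over T):   {err:.4f}")
```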
Also, I've read this answer here, which (on my possibly incorrect reading) agrees that $\mathrm{Err}$ is the quantity of interest, but suggests that we often talk about $\mathrm{Err}_{\mathcal{T}}$ because it can be estimated by cross-validation. That seems to contradict Section 7.12 of the textbook, which (again, on my possibly incorrect reading) suggests that cross-validation provides a better estimate of $\mathrm{Err}$ than of $\mathrm{Err}_{\mathcal{T}}$.
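To pin down what the CV estimator actually computes, here is a continuation of the sketch above (again my own construction, reusing `draw_sample`, `fit`, `test_error`, and `err` from that block). The K-fold CV number is built from a single training set $\mathcal{T}$, so the question, as I understand Section 7.12, is whether across draws of $\mathcal{T}$ this number tracks that set's own $\mathrm{Err}_{\mathcal{T}}$ or the constant $\mathrm{Err}$.

```python
def cv_error(x, y, k=10):
    # K-fold cross-validation estimate computed from a single training set
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # hold out one fold, fit on the rest
        coefs = fit(x[train], y[train])
        errs.append(np.mean((y[fold] - np.polyval(coefs, x[fold])) ** 2))
    return np.mean(errs)

# One training set yields one CV number and one Err_T; Err is a fixed constant.
x_tr, y_tr = draw_sample(N)
print(f"CV estimate: {cv_error(x_tr, y_tr):.4f}")
print(f"Err_T:       {test_error(fit(x_tr, y_tr)):.4f}")
print(f"Err:         {err:.4f}")
```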
I'm going around in circles on this one, so I thought I would ask here.