
I'm reading about test/generalization error in Hastie et al.'s Elements of Statistical Learning (2nd ed). In section 7.4, it is written that given a training set $\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ the expected generalization error of a model $\hat{f}$ is $$Err = E_{\mathcal{T}}[E_{X^0, Y^0}[L(Y^0, \hat{f}(X^0))|\mathcal{T}]],$$

where the point $(X^0, Y^0)$ is a new test data point, drawn from $F,$ the joint distribution of the data.

Suppose my model is a linear regression (OLS) model, that is, $\hat{f}(X) = X\hat{\beta} = X(X^TX)^{-1}X^TY$, assuming that $X$ has full column rank. My question is, what does it mean to (1) take the expected value over $X^0, Y^0$, and (2) take the expected value over the training set $\mathcal{T}$?

For example, suppose $Y = X\beta + \epsilon$, where $E[\epsilon]=0, Var(\epsilon) = \sigma^2I.$

(1) Consider evaluating $E_{X^0, Y^0}[X^0\hat{\beta}|\mathcal{T}]$. Is the following correct?

\begin{align*} E_{X^0, Y^0}[X^0\hat{\beta}|\mathcal{T}] &= E_{X^0, Y^0}[X^0(X^TX)^{-1}X^TY|\mathcal{T}]\\ &= E_{X^0, Y^0}[X^0|\mathcal{T}](X^TX)^{-1}X^TY\\ &= E_{X^0, Y^0}[X^0](X^TX)^{-1}X^TY \end{align*}

The last equality holds if $X^0$ is independent of the training set $\mathcal{T}$.
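For concreteness, here is a minimal simulation sketch of this inner expectation (purely illustrative: a Gaussian design, an arbitrary true $\beta$, and sample sizes chosen only for the demonstration). The training set is held fixed, so $\hat{\beta}$ is a fixed vector, and the average runs over fresh draws of $X^0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: p covariates, an arbitrary true beta, Gaussian noise.
N, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])

# One fixed training set T = (X, Y).
X = rng.normal(size=(N, p))
Y = X @ beta + sigma * rng.normal(size=N)

# OLS fit on T: beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Inner expectation E_{X^0,Y^0}[X^0 beta_hat | T]:
# beta_hat is held fixed, and we average the prediction over many fresh X^0.
X0 = rng.normal(size=(100_000, p))   # draws from the covariate distribution
mc_inner = (X0 @ beta_hat).mean()

# Since X^0 is independent of T, this approaches E[X^0] beta_hat,
# which is 0 here because the covariates are centred.
print(mc_inner, np.zeros(p) @ beta_hat)
```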

(2) Consider evaluating $E_{\mathcal{T}}[X^0\hat{\beta}|X^0]$. Is the following correct? \begin{align*} E_{\mathcal{T}}[X^0\hat{\beta}|X^0] &= X^0 E_{\mathcal{T}}[(X^TX)^{-1}X^TY|X^0]\\ &= X^0 (X^TX)^{-1}X^TE_{\mathcal{T}}[Y|X^0]\\ &= X^0 (X^TX)^{-1}X^TX\beta \end{align*}

The second equality holds assuming that the covariates $X$ are fixed by design, so the only thing that's random with respect to the training set $\mathcal{T}$ is $Y$, correct?
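Here is a matching sketch for the outer expectation in (2) under the fixed-design reading (again purely illustrative): the design matrix $X$ and the test covariate $X^0$ are held fixed, only the training responses $Y$ are redrawn, and the average of $X^0\hat{\beta}$ over refits should agree with $X^0\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative fixed-design setup.
N, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, p))   # design matrix, held fixed across training sets
x0 = rng.normal(size=p)       # a fixed test covariate vector X^0

# E_T[X^0 beta_hat | X^0] under a fixed design: only Y (through epsilon) is random,
# so redraw Y, refit OLS, and average the prediction at x0.
preds = []
for _ in range(20_000):
    Y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    preds.append(x0 @ beta_hat)

# Should agree with X^0 (X'X)^{-1} X'X beta = X^0 beta.
print(np.mean(preds), x0 @ beta)
```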

Adrian

1 Answer


You can drop all subscripts in the expected values, and via the Law of Total Expectation, we have $$\text{Err}=\mathbb E[\mathbb E[L(Y^0,\hat f(X^0))|\mathcal T]]=\underbrace{\mathbb E[L(Y^0,\hat f(X^0))]}_{\text{Expected Loss}}$$

In the end, we're interested in the expected loss. The conditioning is important because, as Hastie explains in the subsequent sections, the outer expected value is estimated via cross-validation. You can calculate it analytically if you know the distribution of the data, i.e. the distribution of $\mathcal{T}$; a small simulation sketch is given below.
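As a concrete illustration (not cross-validation, just a direct simulation under an assumed data-generating process, with squared-error loss and arbitrary illustrative parameters), both expectations in $\text{Err}$ can be approximated by nested Monte Carlo: the outer loop draws training sets $\mathcal T$, and the inner average uses fresh test points $(X^0, Y^0)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed data-generating process (illustrative only).
N, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])

def draw(n):
    """Draw n i.i.d. pairs (X_i, Y_i) from the assumed joint distribution F."""
    X = rng.normal(size=(n, p))
    Y = X @ beta + sigma * rng.normal(size=n)
    return X, Y

# Err = E_T[ E_{X0,Y0}[ L(Y0, f_hat(X0)) | T ] ] with squared-error loss.
outer = []
for _ in range(500):
    X, Y = draw(N)                                  # a training set T
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # OLS fit on T
    X0, Y0 = draw(5_000)                            # fresh test points (X^0, Y^0)
    outer.append(np.mean((Y0 - X0 @ beta_hat) ** 2))  # estimate of E[L | T]

print(np.mean(outer))   # Monte Carlo estimate of Err
```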

(1) is correctly calculated. (2) is not correct, because the expected value is taken wrt the distribution of $\mathcal T$, so $X$ is not fixed (is $X$ fixed in cross-validation?). The only thing that's fixed in $E_{\mathcal{T}}[X^0\hat{\beta}|X^0]=\mathbb E[X^0\hat \beta|X^0]$ is $X^0$, because it appears on the conditioning side of the expression. Without knowing the data distribution, you can't calculate this expected value analytically; instead you can estimate it via cross-validation, as in the leave-one-out sketch below.
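A minimal leave-one-out sketch of that cross-validation estimate (assuming squared-error loss; the data set is simulated here only so the snippet is self-contained): each point plays the role of $(X^0, Y^0)$ once, and the remaining $N-1$ points play the role of $\mathcal T$.

```python
import numpy as np

rng = np.random.default_rng(3)

# One observed data set (simulated here only for illustration).
N, p, sigma = 50, 3, 1.0
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, p))
Y = X @ beta + sigma * rng.normal(size=N)

# Leave-one-out CV: point i is the held-out (X^0, Y^0),
# the remaining N-1 points are the training set T.
errors = []
for i in range(N):
    mask = np.arange(N) != i
    Xtr, Ytr = X[mask], Y[mask]
    beta_hat = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ Ytr)
    errors.append((Y[i] - X[i] @ beta_hat) ** 2)

print(np.mean(errors))   # LOOCV estimate of the expected squared-error loss
```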

gunes
  • Thank you for your answer. Regarding (2), can you expand on "knowing the data distribution"? For example, is knowing that $Y = X\beta + \epsilon$ where $\epsilon$ has mean 0 and variance $\sigma^2I$ sufficient? If $X$ is fixed by study design, what is the distribution of $\mathcal{T}$? – Adrian Jul 19 '20 at 14:36
  • $X$ is not fixed, because it's not given in the outer expectation. Otherwise, expectation over $\mathcal T$ wouldn't make sense. The data distribution is the distribution that your samples, i.e. $(X_i,Y_i)$ are drawn from. – gunes Jul 19 '20 at 15:30
  • I see. So the expected value wrt $\mathcal{T}$ is equivalent to the expected value wrt the joint distribution of $(X, Y)$, correct? And we can estimate it using cross-validation, such as leave-one-out CV, correct? – Adrian Jul 19 '20 at 17:17
  • @Adrian That is correct. With one addition: wrt Joint distribution of $(X_1,Y_1),...(X_N,Y_N)$ – gunes Jul 19 '20 at 18:08