As in this related question: What is the difference between in-sample error and training error, and what is the intuition behind optimism?
In Chapter 7 (page 228) of The Elements of Statistical Learning, given a data set $\mathcal{T}=\{(x_i,y_i)\},\ i=1,\dots,N$, the generalization error of a model $\hat{f}$ is defined as
$$ Err_{\mathcal{T}}=E_{X^0, Y^0}[L(Y^0, \hat{f}(X^0))|\mathcal{T}] $$
whereas the in-sample error is defined as $$ Err_{in} = \frac{1}{N}\sum_{i=1}^{N}{E_{Y^0}[L(Y_{i}^{0},\hat{f}(x_i))|\mathcal{T}]} $$
The $Y^0$ notation indicates that we observe $N$ new response values at each of the training points $x_i$, $i = 1, 2, \dots, N$.
The training error is defined as $$ \overline{err} = \frac{1}{N}\sum_{i=1}^{N}{L(y_i,\hat{f}(x_i))} $$
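To make the distinction concrete, here is a minimal Monte Carlo sketch (my own illustration, not from the book) under squared-error loss with a hypothetical linear truth $y = 2x + \varepsilon$: it fixes one training set $\mathcal{T}$, fits a model, and then estimates the training error on $\mathcal{T}$ itself, the in-sample error by redrawing responses $Y^0$ at the same $x_i$, and the generalization error by drawing fresh pairs $(X^0, Y^0)$.

```python
# Minimal Monte Carlo sketch (illustrative only; the setup and names are my
# own, not from ESL). Squared-error loss, hypothetical truth y = 2x + noise.
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 20, 1.0

def f_true(x):
    return 2.0 * x

# One realized training set T, with fixed inputs x_i
x_train = rng.uniform(-1.0, 1.0, N)
y_train = f_true(x_train) + sigma * rng.normal(size=N)

# Fit a (deliberately flexible) cubic polynomial by least squares
coef = np.polyfit(x_train, y_train, deg=3)

def f_hat(x):
    return np.polyval(coef, x)

# Training error: average loss over the observed (x_i, y_i)
err_bar = np.mean((y_train - f_hat(x_train)) ** 2)

# In-sample error: same x_i, fresh responses Y_i^0, averaged over many draws
M = 100_000
y0_at_xi = f_true(x_train) + sigma * rng.normal(size=(M, N))
err_in = np.mean((y0_at_xi - f_hat(x_train)) ** 2)

# Generalization (extra-sample) error: fresh input-output pairs (X^0, Y^0)
x0 = rng.uniform(-1.0, 1.0, M)
y0 = f_true(x0) + sigma * rng.normal(size=M)
err_T = np.mean((y0 - f_hat(x0)) ** 2)

print(f"training error       {err_bar:.3f}")
print(f"in-sample error      {err_in:.3f}")
print(f"generalization error {err_T:.3f}")
```

On a typical run the training error comes out smallest, which is the optimism phenomenon the chapter quantifies.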
Questions:
(1) What is the difference between the generalization error and the in-sample error, and what is the intuition behind each?
(2) Why are AIC and BIC defined as estimates of the in-sample error rather than of the generalization error?
(3) For an arbitrary loss function, is the generalization error always larger than the training error? Is there a theoretical proof? I have only found a proof for squared-error loss.
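For concreteness, the squared-error result I am referring to is, if I recall the chapter correctly, the optimism identity (ESL eq. 7.21), which relates the expected in-sample error to the expected training error (so it concerns in-sample rather than extra-sample error): $$ E_{\mathbf{y}}[Err_{in}] = E_{\mathbf{y}}[\overline{err}] + \frac{2}{N}\sum_{i=1}^{N}\operatorname{Cov}(\hat{y}_i, y_i). $$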