
I just got a copy of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. In Chapter 2 (Overview of Supervised Learning), Section 4 (Statistical Decision Theory), the authors give a derivation of the regression function.

Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y\in\mathbb{R}$ a real valued random output variable, with joint distribution $Pr(X,Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a loss function $L(Y,f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y,f(X))=(Y - f(X))^2$. This leads us to a criterion for choosing $f$,

$$\begin{align*} EPE(f) &= E(Y-f(X))^2 \\ &= \int [y - f(x)]^2Pr(dx, dy)\end{align*}$$ the expected (squared) prediction error.

I completely understand the set up and motivation. My first confusion is: do the authors mean $(E[Y - f(X)])^2$ or $E[(Y - f(X))^2]$? Second, I have never seen the notation $Pr(dx,dy)$. Can someone who has seen it explain its meaning to me? Is it just that $Pr(dx) = Pr(x)dx$? Alas, my confusion does not end there:

By conditioning on $X$, we can write $EPE$ as $$\begin{align*}EPE(f) = E_XE_{Y|X}([Y-f(X)]^2|X)\end{align*}$$

I am missing the connection between these two steps, and I am not familiar with the technical definition of "conditioning". Let me know if I can clarify anything! I think most of my confusion has arisen from unfamiliar notation; I am confident that, if someone can break this derivation down into plain English, I'll get it. Thanks stats.SE!

Orangutango

1 Answer


For your first confusion: it is the expectation of the squared error, so it means $E[(Y-f(X))^2].$
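To see concretely that the two readings differ, here is a minimal Python sketch (the distributions and the candidate predictor are made up purely for illustration): for the error $Z = Y - f(X)$, $E[Z^2]$ and $(E[Z])^2$ generally disagree, differing by $\mathrm{Var}(Z)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: X ~ Uniform(0, 1), Y = 2X + noise, and a candidate predictor f(x) = x.
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
z = y - x                      # the prediction error Y - f(X)

print(np.mean(z**2))           # E[(Y - f(X))^2]  -- the EPE, roughly 1 + E[X^2] = 1.333...
print(np.mean(z)**2)           # (E[Y - f(X)])^2  -- roughly (E[X])^2 = 0.25
```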

For the notation $Pr(dx,dy)$: it is equal to $g(x,y)\,dx\,dy$, where $g(x,y)$ is the joint pdf of $X$ and $Y$. Similarly, $Pr(dx)=g_X(x)\,dx$, where $g_X$ is the marginal pdf of $X$. This can be interpreted as: the probability that $X$ falls in a tiny interval $[x,x+dx]$ equals the pdf value at the point $x$, i.e. $g_X(x)$, times the interval length $dx$.
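As a rough numerical check of the $Pr(dx)=g_X(x)\,dx$ reading (a sketch with an arbitrary choice of distribution and interval), the probability that a standard normal variable lands in a short interval $[x, x+dx]$ is approximately the density at $x$ times $dx$:

```python
from scipy.stats import norm

x, dx = 1.0, 0.01

# Exact probability that X lands in the tiny interval [x, x + dx] ...
exact = norm.cdf(x + dx) - norm.cdf(x)
# ... versus the density-times-length approximation Pr(dx) = g_X(x) dx.
approx = norm.pdf(x) * dx

print(exact, approx)   # both are about 0.0024; they agree to several decimal places
```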

The equation for the EPE stems from the law of total expectation, $E(E(Y|X))=E(Y)$, which holds for any two random variables $X$ and $Y$. You can prove this using the conditional distribution. The conditional expectation is the expectation computed using the conditional distribution, and the conditional distribution of $Y|X$ is the distribution of $Y$ once you know the value of $X$.
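Here is a small Monte Carlo sketch of the law of total expectation, using a made-up model where $X \sim \mathrm{Uniform}(0,1)$ and $Y\mid X \sim N(2X,\,1)$, so $E(Y|X) = 2X$ and both sides should come out near $1$:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0.0, 1.0, size=1_000_000)
y = rng.normal(2.0 * x, 1.0)          # Y | X ~ Normal(2X, 1), so E[Y | X] = 2X

print(np.mean(y))                     # E[Y]          ~ 1.0
print(np.mean(2.0 * x))               # E[ E[Y | X] ] ~ 1.0, matching the left-hand side
```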

In our case, if we write the squared-error loss as a function $L(x,y)=(y-f(x))^2$, the EPE is

$$\begin{equation}\begin{split}E(L(x,y))&=\int\int L(x,y)g(x,y)\,dx\,dy \\ &=\int\bigg[\int L(x,y)g(y|x)g(x)\,dy\bigg]dx \\ &=\int\bigg[\int L(x,y)g(y|x)\,dy\bigg]g(x)\,dx \\ &=\int\bigg[E_{Y|X}(L(x,y))\bigg]g(x)\,dx \\ &=E_X\big(E_{Y|X}(L(x,y))\big)\end{split}\end{equation}$$

The outcome above corresponds to the result you listed. Hope this helps.
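To tie the two ways of computing the EPE together, here is a minimal numerical sketch (the joint distribution and the predictor $f$ are invented for illustration, not taken from the book): it evaluates $E[(Y-f(X))^2]$ once directly from joint draws of $(X,Y)$, and once by first taking the inner conditional expectation $E_{Y|X}([Y-f(X)]^2 \mid X=x)$ in closed form and then averaging it over $X$. The two numbers agree, which is exactly what the conditioning step claims.

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: 1.5 * x                 # an arbitrary candidate predictor

# Toy joint distribution: X ~ Uniform(0, 1), Y | X ~ Normal(2X, 1).
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = rng.normal(2.0 * x, 1.0)

# (1) EPE computed directly from the joint distribution: E[(Y - f(X))^2].
epe_direct = np.mean((y - f(x)) ** 2)

# (2) EPE computed by conditioning: the inner expectation
#     E[(Y - f(X))^2 | X = x] = Var(Y | X = x) + (E[Y | X = x] - f(x))^2
#                             = 1 + (2x - 1.5x)^2,
#     which is then averaged over the distribution of X.
inner = 1.0 + (2.0 * x - f(x)) ** 2
epe_conditioned = np.mean(inner)

print(epe_direct, epe_conditioned)    # both are about 1 + 0.25 * E[X^2] = 1.083...
```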

Jerry
    For the final result after conditioning, the book also has the |X, while the final result of this answer is missing it. Is it important? – lagrange103 Apr 04 '17 at 03:17