
Given a training set of $(X, Y )$'s where the $X$'s are the source variables and the $Y$'s are the targets, derive an estimator that minimizes the mean squared error between target values and corresponding predictions by the estimator.

Solution

Denote our estimator for a particular $x$ as $\theta(x)$ and take $L(y,\theta, x) = |y - \theta(x)|^2$. The total loss will be defined as

\begin{align}
E_{XY}[L(Y,\theta, X)] &= \int_{\mathbb{R}}\int_{\mathbb{R}}L(y,\theta, x)p(x,y)dxdy\tag{1}\\
&= \int_{\mathbb{R}}\int_{\mathbb{R}}L(y,\theta, x)p(y|x)p(x)dxdy\tag{2}\\
&= \int_{\mathbb{R}}\Big[\int_{\mathbb{R}}(\theta(x) - y)^2p(y|x)dy\Big]p(x)dx\tag{3}
\end{align}

Now:

\begin{align}
\frac{dE_{XY}[L(y,\theta, x)]}{d\theta(x)} &=\int_{\mathbb{R}}\frac{d}{d \theta (x)}(\theta(x) - y)^2p(y|x)dy\tag{4}\\
&= 2\int_{\mathbb{R}}(\theta(x) - y)p(y|x)dy\tag{5}\\
&= 2\theta(x)\underbrace{\int_{\mathbb{R}}p(y|x)dy}_{ = \int_{\mathbb{R}}p(y)dy = 1} - 2\int_{\mathbb{R}}yp(y|x)dy\tag{6}\\
&= 2\theta(x) - 2\int_{\mathbb{R}}yp(y|x)dy\tag{7}
\end{align}

Setting $\frac{dE_{XY}[L(y,\theta, x)]}{d\theta(x)}$ to 0 yields

$$ \theta(x) = \int_{\mathbb{R}}yp(y|x)dy\tag{8} $$

My questions

  • How do we justify going from (3) to (4) (step by step)?
  • In (6) how can we justify that $\int_{\mathbb{R}}p(y|x)dy = \int_{\mathbb{R}}p(y)dy = 1$?
Xi'an
ecjb

1 Answer


The derivative $$\frac{\text{d}\mathbb E_{XY}[L(Y,\theta, X)]}{\text{d}\theta(x)} $$ is meaningless since $$\mathbb E_{XY}[L(Y,\theta, X)]$$ depends on the whole function $\theta$, while $X$ is integrated out. (In other words, there is no $x$.) Since $\theta$ is a function, standard differentiation does not apply.

The proper argument for the result is that, since (3) integrates the inner conditional expectation against $p(x)\ge 0$, in order to minimise (3) in $\theta$, one needs to minimise $$\mathbb E_{Y|X}[L(Y,\theta, X)|X]=\mathbb E_{Y|X}[L(Y,\theta(x), x)|X=x]$$ for (almost) every value $x$ of the random variable $X$. For a fixed $x$, $\theta(x)$ is just a real number, which leads to considering the ordinary derivative $$\frac{\text{d}\mathbb E_{Y|X}[L(Y,\theta(x), x)|X=x]}{\text{d}\theta(x)} $$ with equations (5)-(8) being correct.
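The pointwise argument can be checked numerically. The following is only a minimal sketch under assumptions that are not part of the question: $X$ is taken uniform on $\{0, 1, 2\}$, $Y$ given $X=x$ is taken Normal with mean $x^2$ and unit variance, and the helper names (`conditional_mse`, `simulate`, `pointwise_minimiser`) are hypothetical. For each $x$ separately it grid-searches the scalar $\theta(x)$ minimising the empirical conditional squared error and compares it with the sample conditional mean.

```python
import random


def conditional_mse(theta, ys):
    """Empirical version of E[(theta - Y)^2 | X = x] for samples ys drawn at one x."""
    return sum((theta - y) ** 2 for y in ys) / len(ys)


def simulate(n=3000, seed=0):
    """Assumed toy model: X uniform on {0, 1, 2}, Y | X = x ~ Normal(x**2, 1)."""
    rng = random.Random(seed)
    groups = {0: [], 1: [], 2: []}
    for _ in range(n):
        x = rng.choice([0, 1, 2])
        groups[x].append(rng.gauss(x ** 2, 1.0))
    return groups


def pointwise_minimiser(ys, grid):
    """Grid-search the scalar theta(x) minimising the conditional MSE at one x."""
    return min(grid, key=lambda t: conditional_mse(t, ys))


groups = simulate()
grid = [i / 100 for i in range(-200, 701)]  # candidate theta(x) values, step 0.01

results = {}
for x, ys in groups.items():
    cond_mean = sum(ys) / len(ys)          # sample analogue of equation (8)
    best = pointwise_minimiser(ys, grid)   # minimiser found without any calculus
    results[x] = (cond_mean, best)
    print(f"x={x}: conditional mean={cond_mean:.3f}, grid minimiser={best:.2f}")
```

Each $x$ is handled on its own, mirroring the "for (almost) every $x$" step: the grid minimiser should agree with the sample conditional mean up to the grid resolution.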

Furthermore, $$\int_{\mathbb{R}}p(y|x)dy = \int_{\mathbb{R}}p(y)dy = 1$$ is correct because both $p(\cdot|x)$ and $p(\cdot)$ are probability densities, but the middle integral, $\int_{\mathbb{R}}p(y)dy$, plays no role in the derivation and is hence confusing.
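As a short worked step, assuming the joint density $p(x,y)$ exists and $p(x) > 0$ at the point considered, the normalisation of the conditional density follows directly from its definition, with no reference to $p(y)$:

$$\int_{\mathbb{R}}p(y|x)dy = \int_{\mathbb{R}}\frac{p(x,y)}{p(x)}dy = \frac{1}{p(x)}\int_{\mathbb{R}}p(x,y)dy = \frac{p(x)}{p(x)} = 1$$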

Xi'an