
In Tom Mitchell's book *Machine Learning*, while deriving the least-squared-error criterion from maximum likelihood, the author considers a training dataset of the form $\langle x_i, d_i \rangle$, where $$d_i = f(x_i) + e_i.$$ Here, $f(x_i)$ is the noise-free value of the target function and $e_i$ is a random variable representing noise, distributed according to a Normal distribution with mean $0$.

The author then says that, given the noise $e_i$ obeys a Normal distribution with mean $0$ and an unknown variance $\sigma^2$, each $d_i$ must also obey a Normal distribution with variance $\sigma^2$, centered around the true target value $f(x_i)$.

Can anyone please explain why, if the error $e_i$ is Normally distributed, $d_i$ must also be Normally distributed?

  • See https://en.wikipedia.org/wiki/Normal_distribution#General_normal_distribution. So if $f(x_i)$ can be treated as a constant (sometimes, you will encounter the terminology "fixed regressor" in econometrics), the linked result applies. – Christoph Hanck Jul 06 '20 at 13:40
  • @ChristophHanck So when you say that $f(x_i)$ can be treated as constant, do you mean that for every $x$ in $x_1, x_2, \ldots, x_n$ (considering $n$ points), $f(x) = c$? I don't think that is the case here: for every new $x_i$, we will get a new $f(x_i)$. – Saurabh Verma Jul 06 '20 at 13:50
  • I rather mean that $f(x_i)$ is not a random process itself, but something which can be controlled through, say, experimental design. I assume that the author has such a scenario in mind when he writes about the "noise free value of the target function". Think of the dose of some drug mice get in analyses of the efficacy of a new drug. Of course, each $f(x_i)$ can be different, resulting in a different mean for each unit $i$. – Christoph Hanck Jul 06 '20 at 14:04

2 Answers


Let $z \sim \mathcal{N}(\mu, \sigma)$. Then

$$ \dfrac{z-\mu}{\sigma} \sim \mathcal{N}(0,1)$$

Conversely, if $x \sim \mathcal{N}(0,1)$, then

$$ \mu + \sigma x \sim \mathcal{N}(\mu,\sigma)$$

The noise is normal, $e_i \sim \mathcal{N}(0,\sigma)$, so if I add a noiseless constant to this random variable, the mean shifts accordingly:

$$ f(x_i) + e_i = d_i \sim \mathcal{N}(f(x_i), \sigma)$$

EDIT: There is a slight abuse of terminology in most regression texts. Note that $d_i$ corresponds to an observation at $x_i$, so it is the conditional distribution of the outcome that is normal, not the marginal. Mathematically,

$$ d_i \vert x_i \sim \mathcal{N}(f(x_i), \sigma)$$
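A quick simulation makes this concrete. The following is a minimal sketch assuming NumPy; the target function $f(x) = 2x + 1$, the point $x_i = 3$, and $\sigma = 0.5$ are arbitrary illustrative choices, not taken from Mitchell's book:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 * x + 1.0  # arbitrary noise-free target function, for illustration only

x_i = 3.0     # a single fixed input
sigma = 0.5   # noise standard deviation
n = 100_000   # number of simulated draws

e = rng.normal(loc=0.0, scale=sigma, size=n)  # e_i ~ N(0, sigma)
d = f(x_i) + e                                # d_i = f(x_i) + e_i

# The sample mean of d is close to f(x_i) = 7.0 and its standard
# deviation is close to sigma = 0.5: the constant only shifts the mean.
print(d.mean(), d.std())
```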

Demetri Pananos

Brief expansion of user Christoph Hanck's comment:

The measurements $x_i$ are assumed to be known exactly.* Under this assumption, $f(x_i)$ is a constant, which can be viewed as a degenerate normal random variable with mean $f(x_i)$ and variance $0$. If $e_i\sim\mathcal{N}(0, \sigma^2)$, it follows that $$\underbrace{d_i}_{\sim\mathcal{N}(f(x_i), \sigma^2)} = \underbrace{f(x_i)}_{\sim\mathcal{N}(f(x_i), 0)} + \underbrace{e_i}_{\sim\mathcal{N}(0, \sigma^2)}.$$

*This assumption can of course be criticized; see, e.g., this question.
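To make the fixed-regressor point concrete, here is a minimal sketch assuming NumPy; the exponential $f$ echoes the example in the comments below, and the values of $x_i$ and $\sigma$ are arbitrary. Even a nonlinear transform of a known $x_i$ is just a number with zero variance, so all the randomness in $d_i$ comes from $e_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

x_i = 1.5
f_xi = np.exp(x_i)  # nonlinear f, but still a fixed number once x_i is known

# f(x_i) is deterministic: "sampling" it repeatedly gives variance 0.
samples_f = np.full(100_000, f_xi)
print(samples_f.var())  # exactly 0.0

# All randomness in d_i comes from e_i, so d_i is normal around f(x_i).
sigma = 0.3
d = f_xi + rng.normal(loc=0.0, scale=sigma, size=100_000)
print(d.mean(), d.var())  # ≈ exp(1.5) ≈ 4.48 and ≈ sigma**2 = 0.09
```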

jhin
  • If $x_i$ are assumed to be known exactly, then why will $f(x_i)$ always form a Normal distribution? This doesn't seem correct: e.g., if $f$ is an exponential function, then $f(x_i)$ will not form a Normal distribution. – Saurabh Verma Jul 06 '20 at 13:57
  • @SaurabhVerma $f(x_i)$ follows a Dirac distribution, which can be represented as the limiting case of a normal distribution for $\sigma^2 \to 0$. – jhin Jul 06 '20 at 13:59
  • See, e.g., [here](https://math.stackexchange.com/questions/2072415/proof-that-the-limit-of-the-normal-distribution-for-a-standard-deviation-approxi) or [here](http://hitoshi.berkeley.edu/221a/delta.pdf). – jhin Jul 06 '20 at 14:02
  • I see where your confusion comes from - been there. :) The key is that $x_i$ follows a Dirac distribution, and you can transform a Dirac distributed variable nonlinearly as much as you want; it stays a Dirac distribution. – jhin Jul 06 '20 at 14:05