
In Tom Mitchell's book *Machine Learning*, while deriving the least-squared-error criterion from maximum likelihood, the author considers a training dataset of the form $\langle x_i, d_i \rangle$, where $$d_i = f(x_i) + e_i.$$ Here, $f(x_i)$ is the noise-free value of the target function and $e_i$ is a random variable representing noise, distributed according to a Normal distribution with mean $0$.

The author then says that, given the noise $e_i$ obeys a Normal distribution with mean $0$ and an unknown variance $\sigma^2$, each $d_i$ must also obey a Normal distribution with variance $\sigma^2$, centered around the true target value $f(x_i)$.

Can anyone please explain why, if the error $e_i$ is Normally distributed, $d_i$ must also be Normally distributed?

  • See https://en.wikipedia.org/wiki/Normal_distribution#General_normal_distribution. So if $f(x_i)$ can be treated as a constant (sometimes, you will encounter the terminology "fixed regressor" in econometrics), the linked result applies. – Christoph Hanck Jul 06 '20 at 13:40
  • @ChristophHanck So when you say that $f(x_i)$ can be treated as constant, do you mean that for every $x$ in $x_1, x_2, \ldots, x_n$ (considering $n$ points), $f(x) = c$? I don't think that is the case here: for every new $x_i$, we will get a new $f(x_i)$. – Saurabh Verma Jul 06 '20 at 13:50
  • I rather mean that $f(x_i)$ is not a random process itself, but something which can be controlled through, say, experimental design. I assume that the author has such a scenario in mind when he writes about the "noise free value of the target function". Think of the dose of some drug mice get in analyses of the efficacy of a new drug. Of course, each $f(x_i)$ can be different, resulting in a different mean for each unit $i$. – Christoph Hanck Jul 06 '20 at 14:04

2 Answers


Let $z \sim \mathcal{N}(\mu, \sigma)$. Then

$$ \dfrac{z-\mu}{\sigma} \sim \mathcal{N}(0,1)$$

Conversely, if $x \sim \mathcal{N}(0,1)$, then

$$ \mu + \sigma x \sim \mathcal{N}(\mu,\sigma)$$

The noise is normal, $e_i \sim \mathcal{N}(0,\sigma)$, so if I add a noiseless constant to this random variable, the mean shifts accordingly:

$$ f(x_i) + e_i = d_i \sim \mathcal{N}(f(x_i), \sigma)$$

EDIT: There is a slight abuse of terminology in most regression texts. Note that $d_i$ corresponds to an observation at $x_i$, so it is the conditional distribution of the outcome that is normal, not the marginal. Mathematically,

$$ d_i \vert x_i \sim \mathcal{N}(f(x_i), \sigma)$$
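A quick simulation makes this concrete. The following is a minimal sketch assuming NumPy; the target function $f(x) = 2x + 1$, the point $x_i = 3$, and $\sigma = 0.5$ are arbitrary illustrative choices, not taken from Mitchell's book:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2.0 * x + 1.0  # arbitrary noise-free target function, for illustration only

x_i = 3.0     # a single fixed input
sigma = 0.5   # noise standard deviation
n = 100_000   # number of simulated draws

e = rng.normal(loc=0.0, scale=sigma, size=n)  # e_i ~ N(0, sigma)
d = f(x_i) + e                                # d_i = f(x_i) + e_i

# The sample mean of d is close to f(x_i) = 7.0 and its standard
# deviation is close to sigma = 0.5: the constant only shifts the mean.
print(d.mean(), d.std())
```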

Demetri Pananos

Brief expansion of user Christoph Hanck's comment:

The measurements $x_i$ are assumed to be known exactly.* Under this assumption, $f(x_i)$ is a constant, which can be viewed as a degenerate normal random variable with mean $f(x_i)$ and variance $0$. If $e_i\sim\mathcal{N}(0, \sigma^2)$, it follows that $$\underbrace{d_i}_{\sim\mathcal{N}(f(x_i), \sigma^2)} = \underbrace{f(x_i)}_{\sim\mathcal{N}(f(x_i), 0)} + \underbrace{e_i}_{\sim\mathcal{N}(0, \sigma^2)}.$$

*This assumption can of course be criticized; see, e.g., this question.
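To make the fixed-regressor point concrete, here is a minimal sketch assuming NumPy; the exponential $f$ echoes the example in the comments below, and the values of $x_i$ and $\sigma$ are arbitrary. Even a nonlinear transform of a known $x_i$ is just a number with zero variance, so all the randomness in $d_i$ comes from $e_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

x_i = 1.5
f_xi = np.exp(x_i)  # nonlinear f, but still a fixed number once x_i is known

# f(x_i) is deterministic: "sampling" it repeatedly gives variance 0.
samples_f = np.full(100_000, f_xi)
print(samples_f.var())  # exactly 0.0

# All randomness in d_i comes from e_i, so d_i is normal around f(x_i).
sigma = 0.3
d = f_xi + rng.normal(loc=0.0, scale=sigma, size=100_000)
print(d.mean(), d.var())  # ≈ exp(1.5) ≈ 4.48 and ≈ sigma**2 = 0.09
```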

jhin
  • If $x_i$ are assumed to be known exactly, then why will $f(x_i)$ always form a Normal distribution? This doesn't seem correct: e.g., if $f$ is an exponential function, then $f(x_i)$ will not form a Normal distribution. – Saurabh Verma Jul 06 '20 at 13:57
  • @SaurabhVerma $f(x_i)$ follows a Dirac distribution, which can be represented as the limiting case of a normal distribution for $\sigma^2 \to 0$. – jhin Jul 06 '20 at 13:59
  • See, e.g., [here](https://math.stackexchange.com/questions/2072415/proof-that-the-limit-of-the-normal-distribution-for-a-standard-deviation-approxi) or [here](http://hitoshi.berkeley.edu/221a/delta.pdf). – jhin Jul 06 '20 at 14:02
  • I see where your confusion comes from - been there. :) The key is that $x_i$ follows a Dirac distribution, and you can transform a Dirac distributed variable nonlinearly as much as you want; it stays a Dirac distribution. – jhin Jul 06 '20 at 14:05