The error $\varepsilon_1$ is random AND unobservable. Before you see your data, it follows a mean-zero normal distribution (a continuous random variable). After you see your data, it follows $p(\varepsilon_1\mid y_1) = \delta_{y_1 - \beta_0 - \beta_1 x_1}(\cdot)$, a point mass at $y_1 - \beta_0 - \beta_1 x_1$ (a degenerate distribution, no longer a continuous one). The latter is not "nice," and you can't really do anything with it, because you don't have the dataset $\varepsilon_1, \ldots, \varepsilon_n$ in addition to your $Y$ and $X$ data. So when you say "where the error is $\varepsilon_1 = 2$," that isn't the supposition that is going to elucidate anything for you.
Now suppose you actually did observe $\varepsilon_1$ (which is impossible), and say its value was $2$, as in your question. It is no longer random. Many books switch from upper case to lower case to convey this, but for this particular Greek letter that's hard to do. Yes, $P(\varepsilon_1=2) = 0$, but that is a question asked before you observe $\varepsilon_1$. There is no probability after you see an outcome, so don't use any $P(\cdot)$ there.
Here's a better way to think about it. For the setup you posted in your question, picture two columns of data in an Excel spreadsheet or CSV file. The first column is $y_1,\ldots,y_n$, and the second column is $x_1, \ldots, x_n$. I am writing these in lowercase because these data are not random anymore: you are looking at specific, fixed values.
Your model above is equivalent to assuming that the $Y_i$s (before you see your data for them) are independent of one another, and that their probability distributions differ only in their means. You're assuming that, before you observe your $Y$ data, if you have the information $x_1, \ldots, x_n$, then you have a (generally different) normal distribution for each of $Y_1, \ldots, Y_n$. In other words, all rows are mutually independent, and for each row of data $i$,
$$
Y_i\mid X_i=x_i \sim \text{Normal}(\beta_0 + \beta_1 x_i, \sigma^2).
$$
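If it helps to see this data-generating story in code, here is a minimal simulation sketch in Python (the parameter values, sample size, and the uniform distribution for the $x$ column are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up "true" parameters, purely for illustration
beta0, beta1, sigma = 1.0, 2.0, 0.5
n = 100

x = rng.uniform(0, 10, size=n)        # the x column, treated as fixed once drawn
eps = rng.normal(0.0, sigma, size=n)  # the errors: used to generate y, never observed
y = beta0 + beta1 * x + eps           # equivalently, y_i ~ Normal(beta0 + beta1*x_i, sigma^2)

# your "spreadsheet" is only these two columns; eps is not one of them
data = np.column_stack([y, x])
```

Notice that `eps` exists inside the simulation but is not one of the two columns you would actually record, which is exactly the sense in which the errors are unobservable.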
Because they are independent,
$$
p(y_1,\ldots,y_n;\beta_0,\beta_1,\sigma^2) = p(y_1;\beta_0,\beta_1,\sigma^2)\cdots p(y_n;\beta_0,\beta_1,\sigma^2), \tag{1}
$$
i.e., the joint density of your first column factors into a product. Each $p(y_i)$ is a normal density with the same variance, but with a potentially different mean that depends linearly on the corresponding $x_i$.
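In code, the log of the factorized density $(1)$ is just a sum of normal log-densities, one per row. A sketch, continuing the hypothetical simulation above and using `scipy.stats.norm`:

```python
from scipy.stats import norm

def log_likelihood(theta, y, x):
    """Log of the joint density (1), viewed as a function of theta = (beta0, beta1, sigma)."""
    beta0, beta1, sigma = theta
    # independence turns the product in (1) into a sum of log-densities
    return norm.logpdf(y, loc=beta0 + beta1 * x, scale=sigma).sum()
```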
But then you get your data for the $Y$s: $y_1,\ldots,y_n$. So you can evaluate your joint density/likelihood $(1)$ and move $\theta = (\beta_0,\beta_1,\sigma^2)$ around until you find good parameter values (think maximum likelihood, restricted maximum likelihood, or least squares).
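To make "move $\theta$ around" concrete, here is one way to maximize that log-likelihood numerically, sketched with `scipy.optimize.minimize` (the starting values are arbitrary, and I optimize over $\log\sigma$ so the scale stays positive):

```python
import numpy as np
from scipy.optimize import minimize

# minimize the negative log-likelihood over (beta0, beta1, log(sigma))
def neg_log_lik(params):
    beta0, beta1, log_sigma = params
    return -log_likelihood((beta0, beta1, np.exp(log_sigma)), y, x)

result = minimize(neg_log_lik, x0=np.array([0.0, 0.0, 0.0]), method="Nelder-Mead")
beta0_hat, beta1_hat = result.x[:2]
sigma_hat = np.exp(result.x[2])
```

The $\beta$ estimates you get this way should agree with the ordinary least-squares fit (e.g. `np.polyfit(x, y, 1)`) up to the optimizer's tolerance; the maximum-likelihood estimate of $\sigma^2$ divides the residual sum of squares by $n$ rather than $n-2$, which is why it differs slightly from the usual regression output.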