I am not able to understand the exact meaning of bias. Least squares is said to give us unbiased estimates in linear regression, but we still say that linear regression has high bias because of its assumptions. How can both be true?
- I don't think this is a duplicate. The other question is about the classical statistical definition of bias; this one is about how that definition relates to the more informal meaning of bias in the machine learning literature. In particular, the other question says nothing about how linear regression can be unbiased and biased at the same time (at least, not explicitly). – Hong Ooi Mar 27 '18 at 12:52
- Then maybe you should explain better what that other meaning of bias is, with some references. We say the least squares estimator for linear regression is unbiased because it is so, *if the model is true*. People saying it is biased must mean that the model is not true. Do you have some references? – kjetil b halvorsen Mar 27 '18 at 13:33
- Re: title -- there's more than one use for the word "bias" in statistics (e.g. test bias is not quite the same as bias in an estimator), but your question body focuses specifically on bias in estimation. I'll edit the title to reflect that. – Glen_b Mar 28 '18 at 01:26
2 Answers
Bias is a relative term, meaning approximately
> How far, on average, is the estimated thing from the truth.
Depending on what we assume the word "truth" means, we have different conceptions of bias. You are seeing that two of those conceptions are relevant for linear regression, and they can come to opposite conclusions about the same model.
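To fix ideas, the standard estimation-theoretic definition is: for an estimator $\hat{\theta}$ of a target quantity $\theta$,

$$ \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta $$

and the estimator is unbiased when this is zero. What changes between the two scenarios below is what we take the target (the "truth") to be.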
> Least squares is said to give us unbiased estimates in linear regression
When we say this, we are assuming that the truth has a specific structure
$$ y \mid X \sim \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \epsilon $$
(where $\epsilon$ is a random noise term which does not depend on $X$) and we are using the model to try to uncover something about the numbers $\beta_0, \beta_1, \ldots, \beta_k$. The unbiasedness of linear regression in this scenario says that on average, when we use linear regression to estimate the $\beta$s, we get the correct answer.
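Here is a minimal simulation sketch of this claim (the true coefficients, sample size, and noise level are made-up choices for illustration): when the truth really is linear, the OLS slope estimates average out to the true slope.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 2.0            # made-up "true" coefficients
n, n_sims = 50, 5000

slopes = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0.0, 1.0, n)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, n)  # the truth really is linear
    slopes[i] = np.polyfit(x, y, deg=1)[0]           # OLS slope estimate

print(slopes.mean())  # close to 2.0: on average we recover beta1
```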
The assumption here is strong: we need to be willing to accept that, in truth, the conditional mean of $y$ is a linear function of $X$. If we weaken this assumption...
> but we still say that linear regression has high bias because of its assumptions
Here we are making a much weaker assumption about what the truth looks like. We assume only that there is some function $f$ such that
$$ y \mid X \sim f(X) + \epsilon $$
Since the only shape our fitted model can take is a line, while in this case $f$ may be very far from a line, it is impossible for our fitted linear regression to recover, on average, the correct shape. In this setup, we may say that linear regression is biased (*).
(*) Note, though, that in the case where $f$ really is a linear function, we would not say that linear regression is biased, even in this second setup.
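A companion sketch of this second conception (the quadratic truth and noise level are made-up choices): even after averaging over many repeated fits, a line cannot match a curved $f$, so the gap is systematic rather than random.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 100)
f = x ** 2                      # the truth: not a line at all
n_sims = 5000

avg_line = np.zeros_like(x)
for _ in range(n_sims):
    y = f + rng.normal(0.0, 0.1, x.size)
    slope, intercept = np.polyfit(x, y, deg=1)
    avg_line += (intercept + slope * x) / n_sims   # average fitted line

print(np.max(np.abs(avg_line - f)))  # ~0.67: the gap does not average away
```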

- Just to clarify here, the bias arises from assuming that your 'truth' is a linear function, while in reality the data follows some other kind of function/distribution? Am I correct in thinking such a bias will then also arise whenever one is using an estimator that involves fitting a function which may not reflect the true distribution? – l.bee Jun 04 '19 at 07:16
- Great, thanks for clarifying. I would like to learn some more about estimator bias with fits; do you have a textbook/paper reference for these statements? – l.bee Jun 04 '19 at 23:45
I don't have enough points to comment, so here goes:
- I think there may be a confusion between linear regression as an unbiased model of the data, and a linear model's estimation method, e.g. Ordinary Least Squares (OLS), being an unbiased estimator of the true coefficients.
- I think the OP may be referring to the second definition of bias, and "Least squares is said to give us unbiased estimates in linear regression" is a sentence they would typically have found in a book describing OLS under the Gauss–Markov assumptions.
- If that's what the OP means, then I would point to the difference between the linear regression's coefficient estimates and a linear relationship as an 'estimate' of the true relationship.
- The coefficient estimator can be unbiased under OLS (or other estimators, depending on the assumptions),
- while the linear model itself is often biased, because the relationship between the dependent variable and its predictors is not linear (so the predicted value is a biased estimate of the real target value); a small sketch below illustrates this.
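A small sketch of that last point (the exponential truth, noise level, and evaluation point are made-up choices): at a fixed point $x_0$, the linear model's prediction does not average out to the true target value there.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 60)
truth = np.exp(x)            # a made-up non-linear relationship
x0 = 0.0                     # fixed point at which we check predictions

preds = np.empty(4000)
for i in range(preds.size):
    y = truth + rng.normal(0.0, 0.05, x.size)
    preds[i] = np.polyval(np.polyfit(x, y, deg=1), x0)  # prediction at x0

print(preds.mean() - np.exp(x0))  # ~ -0.13: systematic, does not average to 0
```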

- You started well, but the last bullet about the predicted value is not very accurate. If you are fitting a linear model to a non-linear truth, there is not even a one-to-one match between coefficients, hence the concept of a biased estimate of a particular coefficient doesn't even make sense. In this case it is better to call the entire model biased. – Cagdas Ozgenc Mar 31 '21 at 12:30
- Updated the last point, but do you have an issue with "the predicted value is a biased estimate of the real target value"? – mchl_k Mar 31 '21 at 16:53