24

I'm struggling to grasp the concept of bias in the context of linear regression analysis.

  • What is the mathematical definition of bias?

  • What exactly is biased and why/how?

  • Illustrative example?

Fabian

3 Answers

29

Bias is the difference between the expected value of an estimator and the true value being estimated. For example, the sample mean from a simple random sample (SRS) is an unbiased estimator of the population mean: if you take all possible SRSs, find their means, and take the mean of those means, you get the population mean (for finite populations this is straightforward algebra to show). But if we use a sampling mechanism that is somehow related to the value being measured, then the mean can become biased. Think of a random-digit-dialing sample asking a question about income. If there is a positive correlation between the number of phone numbers someone has and their income (poor people have only a few phone numbers at which they can be reached, while richer people have more), then the sample is more likely to include people with higher incomes, and the mean income in the sample will tend to be higher than the population mean income.
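As a minimal sketch of the phone-number example (the income distribution, the income-to-phone-line link, and all numbers are invented purely for illustration):

```r
# Illustrative simulation of biased sampling: richer people have more phone
# lines, so a random-digit-dial sample over-represents them and the sample
# mean income is pulled above the population mean.
set.seed(1)
income <- rlnorm(100000, meanlog = 10, sdlog = 0.5)    # population incomes
lines  <- 1 + rpois(100000, lambda = income / 20000)   # more income -> more lines
# probability of being sampled is proportional to the number of phone lines
samp <- sample(income, 1000, replace = TRUE, prob = lines)
mean(income)   # population mean income
mean(samp)     # tends to be noticeably higher than the population mean
```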

There are also some estimators that are naturally biased. The trimmed mean is biased for a skewed population/distribution. The usual variance estimator is unbiased for SRSs if either the population mean is used with denominator $n$ or the sample mean is used with denominator $n-1$.
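A quick check of both claims (a normal population for the variance and an exponential population for the trimmed mean, chosen just for illustration):

```r
# The variance with denominator n (using the sample mean) is biased downward;
# R's var() uses n - 1 and is unbiased.
set.seed(2)
samples <- matrix(rnorm(10 * 100000, mean = 0, sd = 1), ncol = 10)
var_n  <- apply(samples, 1, function(x) sum((x - mean(x))^2) / length(x))
var_n1 <- apply(samples, 1, var)
mean(var_n)    # close to 0.9 = (n-1)/n times the true variance: biased low
mean(var_n1)   # close to the true variance of 1
# The 20% trimmed mean is biased for a right-skewed distribution
# (exponential with true mean 1): trimming cuts more of the long right tail.
tm <- apply(matrix(rexp(10 * 100000), ncol = 10), 1, mean, trim = 0.2)
mean(tm)       # noticeably below the true mean of 1
```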

Here is a simple example using R: we generate a large number of samples from a normal distribution with mean 0 and standard deviation 1, then compute the average mean, variance, and standard deviation across the samples. Notice how close the average mean and average variance are to the true values (sampling error means they won't be exact); now compare the average sd: it is a biased estimator (though not hugely biased).

> tmp.data <- matrix( rnorm(10*1000000), ncol=10 )
> mean( apply(tmp.data, 1, mean) )
[1] 0.0001561002
> mean( apply(tmp.data, 1, var) )
[1] 1.000109
> mean( apply(tmp.data, 1, sd) )
[1] 0.9727121
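For normal data this bias is actually known in closed form: $E[s] = c_4\,\sigma$ with $c_4 = \sqrt{2/(n-1)}\,\Gamma(n/2)\,/\,\Gamma((n-1)/2)$, which for $n = 10$ is about 0.9727, matching the simulated value above.

```r
# Exact bias factor of the sample sd for normal data: E[s] = c4 * sigma.
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
c4(10)   # approximately 0.9727, agreeing with the simulation above
# A bias-corrected estimator simply divides out the factor:
sd_unbiased <- function(x) sd(x) / c4(length(x))
```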

In regression we can get biased estimators of slopes by doing stepwise regression. A variable is more likely to be kept in a stepwise regression if its estimated slope is further from 0, and more likely to be dropped if it is closer to 0; this is biased selection, so the slopes in the final model will tend to be further from 0 than the true slopes. Techniques like the lasso and ridge regression bias slopes towards 0 to counter this selection bias away from 0.
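A small sketch of that selection effect (the sample size, the true slope of 0.2, and the keep-if-$p < 0.05$ rule are invented for illustration, standing in for a full stepwise procedure):

```r
# True slope is 0.2, but if we only keep slopes that pass a significance
# screen, the retained estimates are further from 0 on average.
set.seed(3)
kept <- replicate(20000, {
  x <- rnorm(30)
  y <- 0.2 * x + rnorm(30)
  coefs <- summary(lm(y ~ x))$coefficients
  # keep the slope estimate only when it is "significant"
  if (coefs["x", "Pr(>|t|)"] < 0.05) coefs["x", "Estimate"] else NA
})
mean(kept, na.rm = TRUE)   # noticeably further from 0 than the true 0.2
```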

Greg Snow
  • SRS? – cardinal Jul 30 '11 at 14:03
  • @cardinal Simple Random Sample. – whuber Jul 30 '11 at 15:34
  • @whuber: Wow. While the abbreviation makes sense, I don't recall having come across it in any more formal settings. Are there particular subfields or applied areas where that is a "standard" initialism? – cardinal Jul 30 '11 at 15:56
  • @cardinal See http://en.wikipedia.org/wiki/Simple_random_sample – whuber Jul 30 '11 at 16:07
  • (+1) @whuber's edit was helpful in clarifying this answer. – cardinal Jul 30 '11 at 16:08
  • @whuber: I know what a simple random sample is. :) It was the initialism that struck me as odd, combined with the assumption that it was ubiquitous enough to use without defining. – cardinal Jul 30 '11 at 16:10
  • @cardinal I'm sorry; I had no intention of even suggesting you didn't know. I apologize for giving you that impression. The point of the Wikipedia reference is that it uses "SRS" (see the fifth paragraph), indicating that it's not confined to some subfield but actually is quite common. – whuber Jul 30 '11 at 16:41
  • @whuber: No apology necessary at all. I was not "offended". In my quick perusal of the page, I missed the reference in the fifth paragraph; thanks for the pointer. (From the perspective of proper composition, I'd argue that SRS still needs to be defined, even in that Wiki article.) – cardinal Jul 30 '11 at 16:47
  • Can you include and put 'positive bias' and 'negative bias' into context? – Roman Luštrik Feb 04 '14 at 19:18
  • @RomanLuštrik, positive and negative bias could just refer to the sign of the difference between the estimator and the true parameter. Or it could be positive if the bias is minor but greatly reduces the variance and negative if is has a large impact on estimates. Do you have more context for your question? – Greg Snow Feb 04 '14 at 20:05
  • @GregSnow I was thinking in the context of the sign. – Roman Luštrik Feb 07 '14 at 11:51
7

Bias means that the expected value of the estimator is not equal to the population parameter.

Intuitively, in a regression analysis this would mean that the estimate of one of the parameters is systematically too high or too low. However, ordinary least squares regression estimates are BLUE, which stands for best linear unbiased estimators (under the Gauss–Markov assumptions). In other forms of regression, the parameter estimates may be biased. This can be a good idea, because there is often a tradeoff between bias and variance: for example, ridge regression is sometimes used to reduce the variance of estimates when there is collinearity.

A simple example may illustrate this better, although not in the regression context. Suppose you weigh 150 pounds (verified on a balance scale that has you in one basket and a pile of weights in the other basket). Now, you have two bathroom scales. You weigh yourself 5 times on each.

Scale 1 gives weights of 152, 151, 151.5, 150.5 and 152.

Scale 2 gives weights of 145, 155, 154, 146 and 150.

Scale 1 is biased, but has lower variance; the average of the weights is not your true weight. Scale 2 is unbiased (the average is 150), but has much higher variance.

Which scale is "better"? It depends on what you want the scale to do.
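The two sets of readings above can be checked directly in R (recall the true weight, verified on the balance scale, is 150):

```r
# Readings from the two bathroom scales; true weight is 150 pounds.
scale1 <- c(152, 151, 151.5, 150.5, 152)
scale2 <- c(145, 155, 154, 146, 150)
mean(scale1)   # 151.4: the average misses 150, so these estimates run high
mean(scale2)   # 150: correct on average
var(scale1)    # 0.425: small spread
var(scale2)    # 20.5: much larger spread
```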

Peter Flom
  • Although the definition of bias is correct, I fear that the examples confuse it with inaccuracy, which is something altogether different! Bias is a property of a *statistical procedure* (an estimator) whereas accuracy is a property of a *measurement process*. (-1). – whuber Jul 30 '11 at 15:33
  • @whuber: I think the main problem that your point addresses is the lack of explicit mention of assumptions and a modeling setup. I suppose, in my own mind, I filled these in as I was reading the answer. It seems to me it would not take much to clean that up. – cardinal Jul 30 '11 at 15:59
  • @cardinal It might not take much work, but the work is essential for the reply to be fully correct. – whuber Jul 30 '11 at 16:06
  • @whuber: Yes, I agree with that. And, I still think that, even so, it is necessary to make clear the difference between mathematical expectation and a sample average, as they relate to bias. – cardinal Jul 30 '11 at 16:12
  • No, I was not trying to say anything about "inaccuracy" (which is awfully hard to define) but about "variance". One scale is unbiased, the other scale has low variance. I did not use the word "accurate" or "accuracy". A scale which tends to estimate your weight too high (or too low) is biased. – Peter Flom Jul 30 '11 at 16:31
  • But this sense of "bias" is just a synonym for inaccurate; it is not the same as the definition you gave in the first line. Moreover, as @cardinal points out, the example also confounds an expectation with the mean of a particular sample. – whuber Jul 30 '11 at 16:38
  • I agree with @whuber here. In the (proper) sense of bias that the OP is asking about, it is *not* the scale that is biased or unbiased, but rather whatever estimate of your weight that you derive from its measurements! – cardinal Jul 30 '11 at 16:50
  • OK, so, I should say "the measurements that you get from stepping on the scale are biased"? That seems needlessly wordy. And my use of "bias" even if technically wrong, is not the same as inaccurate; BOTH scales (or the measurements gotten from them) are inaccurate, just in different ways. – Peter Flom Jul 31 '11 at 00:33
  • @Peter, no you shouldn't say that, either, as it is also incorrect. For an appropriate discussion of bias, you need (a) a statistical model for how the data are collected---and the associated measurement process---which identifies at least one parameter of interest and (b) a procedure (a function of the data!) to take the measurements and produce an estimate of the parameter of interest. The former is neglected entirely in your post and, with due respect, you sound a bit confused about the latter. – cardinal Jul 31 '11 at 01:07
1

In linear regression analysis, bias refers to the error that is introduced by approximating a real-life problem, which may be complicated, by a much simpler model. In simple terms, you assume a simple linear model such as $y^* = a^* x + b^*$, whereas in real life the relationship could be $y = ax^3 + bx^2 + c$.

It can be shown that the expected test MSE (mean squared error) for a regression problem decomposes as $$E\left[(y_0 - f^*(x_0))^2\right] = \operatorname{Var}(f^*(x_0)) + \left[\operatorname{Bias}(f^*(x_0))\right]^2 + \operatorname{Var}(\varepsilon)$$

where

  • $f^*$ is the functional form assumed for the linear regression model,

  • $y_0$ is the response value recorded in the test data,

  • $x_0$ is the predictor value recorded in the test data,

  • $\varepsilon$ is the irreducible error.

So the goal is to select a method that yields a model with both low variance and low bias.
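As a hedged sketch of this decomposition (the cubic truth, noise level, sample size, and evaluation point are all invented for illustration): fit a straight line to data generated from a cubic mean function many times, and look at the bias and variance of the fitted value at a fixed point $x_0$.

```r
# The true relationship is cubic but we fit a straight line, so the fit at
# x0 carries a squared-bias term on top of its variance.
set.seed(4)
f  <- function(x) x^3 - x        # true mean function
x0 <- 1                          # point at which we evaluate the fit
fits <- replicate(5000, {
  x <- runif(50, -1, 1)
  y <- f(x) + rnorm(50, sd = 0.1)
  predict(lm(y ~ x), newdata = data.frame(x = x0))
})
bias2 <- (mean(fits) - f(x0))^2  # squared bias of the linear fit at x0
vari  <- var(fits)               # variance of the fit at x0
# Expected test MSE at x0 is approximately vari + bias2 + 0.1^2, and here
# the squared bias dominates: the model is too simple for the truth.
```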

Note: *An Introduction to Statistical Learning* by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani offers good insight on this topic.

ganga
  • 11
  • 2
  • 4
    This is often referred to by something like "model mis-specification error" in order not to confuse it with the standard definition of bias given in the accepted answer. Otherwise it would be impossible to make sense of the (correct) assertion that OLS is an *unbiased* estimator of the coefficients of the regressors. – whuber Feb 01 '19 at 21:32