
I understand the concept of the bias-variance tradeoff. Bias, based on my understanding, represents the error from using a simple classifier (e.g. linear) to capture a complex, non-linear decision boundary. So I expected the OLS estimator to have high bias and low variance.

But then I came across the Gauss-Markov theorem, which says that the bias of OLS is zero, which is surprising to me. Please explain how the bias is zero for OLS, because I expected the bias of OLS to be high. Why is my understanding of bias wrong?

GeorgeOfTheRF
  • The proof that the bias of OLS (for linear models) is zero assumes that the model is *true*, that is, that **all** relevant variables are included in the model, that their effects are exactly linear, and so on. If that is not the case, the result does not follow. – kjetil b halvorsen Jul 28 '17 at 11:41
  • https://economictheoryblog.com/2015/02/26/markov_theorem/ – GeorgeOfTheRF Jul 28 '17 at 11:49
  • The Gauss-Markov theorem tells us that in a regression model where the expected value of the error terms is zero, $E(\epsilon_i) = 0$, the variance of the error terms is constant and finite, $\sigma^2(\epsilon_i) = \sigma^2 < \infty$, and $\epsilon_i$ and $\epsilon_j$ are uncorrelated for all $i \neq j$, the least squares estimators $b_0$ and $b_1$ are unbiased and have minimum variance among all unbiased linear estimators. – GeorgeOfTheRF Jul 28 '17 at 11:49
  • In this statement I don't see any assumption that says the model should fit the data perfectly. Am I missing something? – GeorgeOfTheRF Jul 28 '17 at 11:50
  • I didn't say that the model should fit perfectly; I said that all relevant variables should be included. Those are two different conditions! – kjetil b halvorsen Jul 28 '17 at 11:52
  • The zero mean assumption on the errors amounts to requiring what @kjetilbhalvorsen mentions: there are no systematic effects left in the error term. – Christoph Hanck Jul 28 '17 at 12:23
  • The error term $\epsilon_i$ of a *linear model* is defined as $y_i - x_i^T \beta$ (where $x_i$ is the $i$th observation vector). Thus if $E(\epsilon_i) = 0$, the linear model is true. Bias refers to systematic error, not to individual errors in some given outcomes. If the "true model" is non-linear, then the $\epsilon_i$ of a linear model will not satisfy $E(\epsilon_i) = 0$. – Kevin Jul 28 '17 at 17:01
  • The key sentence in the post to focus on is "Bias, based on my understanding, represents the error from using a simple classifier (e.g. linear) to capture a complex, non-linear decision boundary." Bias is a statistical concept that relates to the difference between the expected value of the estimator and the true value of the parameter (the thing to be estimated). There are many potential reasons why an estimator might be biased; model mis-specification is one of them, but it is not the only one. – Lucas Roberts Jul 28 '17 at 20:54

1 Answer


We can think of any supervised learning task, be it regression or classification, as attempting to learn an underlying signal from noisy data. Consider the following simple example:

[figure: noisy observations $\{x_i, y_i\}$ scattered around an underlying signal]

Our goal is to estimate the true signal $f(x)$ based on a set of observed pairs $\{x_i, y_i\}$, where $y_i = f(x_i) + \epsilon_i$ and $\epsilon_i$ is some random noise with mean 0. To this end, we fit a model $\hat{f}(x)$ using our favorite machine-learning algorithm.

When we say that the OLS estimator is unbiased, what we really mean is that if the true form of the model is $f(x) = \beta_0 + \beta_1 x$, then the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ have the lovely properties that $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

[figure: OLS fit recovering the true linear model]
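To make this concrete, here is a minimal simulation sketch of my own (the true coefficients, noise level, and sample size are arbitrary choices, not taken from the answer): when the data really are generated by a straight line, the OLS estimates averaged over many noisy datasets land on the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 0.5            # true parameters (chosen for illustration)
x = np.linspace(0, 10, 50)

estimates = []
for _ in range(5000):                # many independent noisy datasets
    y = beta_0 + beta_1 * x + rng.normal(0, 1, size=x.size)
    slope, intercept = np.polyfit(x, y, deg=1)   # OLS fit of a straight line
    estimates.append((intercept, slope))

# Average of the estimates over all simulated datasets: roughly [2.0, 0.5],
# i.e. the OLS estimates are unbiased when the true model is linear.
print(np.mean(estimates, axis=0))
```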

This is true for our simple example, but it is a very strong assumption! In general, and to the extent that no model is really correct, we can't make such assumptions about $f(x)$. So a model of the form $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$ will be biased.

What if our data look like this instead? (Spoiler alert: $f(x) = \sin(x)$.)

[figure: noisy data generated from $f(x) = \sin(x)$]

Now, if we fit the naive model $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$, it is woefully inadequate at estimating $f(x)$ (high bias). But on the other hand, it is relatively insensitive to noise (low variance).

[figure: straight-line fit to the $\sin(x)$ data]
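As a rough numerical illustration (my own sketch; the noise level and sample size are assumptions, not taken from the answer), we can refit the straight line on many noisy datasets drawn from $y = \sin(x) + \epsilon$. Even the average fitted line stays far from $\sin(x)$ (that gap is the bias), while the individual fits barely move from one dataset to the next (that is the low variance).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 100)
f_true = np.sin(x)

fits = []
for _ in range(1000):                             # repeated noisy samples
    y = f_true + rng.normal(0, 0.3, size=x.size)
    slope, intercept = np.polyfit(x, y, deg=1)    # naive straight-line fit
    fits.append(intercept + slope * x)

avg_fit = np.mean(fits, axis=0)
print("squared bias :", np.mean((avg_fit - f_true) ** 2))   # large: the average line misses the curve
print("variance     :", np.mean(np.var(fits, axis=0)))      # small: the fits barely change
```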

If we add more terms to the model, say $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \cdots + \hat{\beta}_p x^p$, we can capture more of the "unknown" signal by virtue of the added complexity in our model's structure. We lower the bias on the observed data, but the added complexity necessarily increases the variance. (Note: if $f(x)$ is truly periodic, a polynomial expansion is a poor choice!)

[figure: polynomial fit to the $\sin(x)$ data]
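Extending the same sketch to a more flexible model (the degree-7 polynomial below is an arbitrary choice of mine, not from the answer) shows the tradeoff numerically: the squared bias drops while the variance of the fitted curve across datasets grows.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2 * np.pi, 100)
f_true = np.sin(x)

for degree in (1, 7):                             # straight line vs. degree-7 polynomial
    fits = []
    for _ in range(1000):
        y = f_true + rng.normal(0, 0.3, size=x.size)
        coefs = np.polyfit(x, y, deg=degree)
        fits.append(np.polyval(coefs, x))
    avg_fit = np.mean(fits, axis=0)
    bias_sq = np.mean((avg_fit - f_true) ** 2)    # squared bias, averaged over x
    variance = np.mean(np.var(fits, axis=0))      # prediction variance, averaged over x
    print(f"degree {degree}: squared bias ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```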

But again, unless we know that the true $f(x) = \beta_0 + \beta_1 \sin(x)$, our model will never be unbiased, even if we use OLS to fit the parameters.

Andy Kreek