
How can a straight line (or plane) over-fit? My question is not about polynomial regression (although it, too, is considered 'linear'), but about a linear regression model with no higher-order features, such as the following equation:

$$ y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3 + \theta_4 X_4 + \dots $$

I have gone through other answered questions on this site that speak of the poor generalization of a linear regression model with too many features (which is, in fact, over-fitting), but geometrically speaking, I cannot understand how a linear model can over-fit.

Here is Prof. Andrew Ng's example of over-fitting, shown geometrically. As far as I can see, a linear model (with no higher-order features) can only under-fit (the first figure, which depicts logistic regression):

[Figure: Prof. Andrew Ng's illustration of under-fitting vs. over-fitting]

Similar question: Overfitting a logistic regression model

Batool
  • I removed my accepted answer due to a helpful downvote, to inspire a better answer, @gwg. But in case someone finds a compact answer useful, I place it here as a comment: assume the real model is $y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i$, but you add a factor $X_{i2}$ that is unrelated to $y_i$ and fit the new model $y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i$. In general you will get $\hat{\beta}_2 \neq 0$, and if you then use the model to predict something involving the factor $X_{i2}$, you will suffer from over-fitting (a small simulation of this scenario is sketched after these comments). –  Jan 31 '20 at 16:01
  • I posted a simulation I like over at the Data Science Stack: https://datascience.stackexchange.com/a/79994/73930. – Dave Dec 25 '20 at 01:42
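
The scenario in the first comment can be checked with a quick simulation. This is my own sketch, not part of the original thread; the sample sizes, coefficients, and noise scale are arbitrary assumptions. The true model uses only $X_1$, yet fitting an unrelated $X_2$ typically yields $\hat{\beta}_2 \neq 0$ and a (usually slightly) worse out-of-sample error.

```python
# Minimal sketch (assumed setup): true model uses only X1, but we also fit an
# unrelated X2 and check that beta_2_hat comes out non-zero.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 30, 10_000

def make_data(n):
    X1 = rng.normal(size=n)
    X2 = rng.normal(size=n)          # unrelated to y (true beta_2 = 0)
    y = 1.0 + 2.0 * X1 + rng.normal(size=n)
    return np.column_stack([np.ones(n), X1, X2]), y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# OLS with the irrelevant X2 included, and with only the relevant X1
beta_full, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
beta_true, *_ = np.linalg.lstsq(X_tr[:, :2], y_tr, rcond=None)

mse = lambda X, y, b: np.mean((y - X @ b) ** 2)
print("beta_2_hat =", beta_full[2])                     # non-zero due to noise
print("test MSE with X2:   ", mse(X_te, y_te, beta_full))   # usually (not always)
print("test MSE without X2:", mse(X_te[:, :2], y_te, beta_true))  # slightly lower
```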

2 Answers


Deleted, see question comment for content. (This post remains here because I can't delete an accepted answer, see https://meta.stackexchange.com/questions/14932/allow-author-of-accepted-answer-to-delete-it-in-certain-circumstances).

  • What is $\hat{\beta}_2$? – Ramanujan Sep 01 '19 at 21:07
  • Why is this true? Can you be more specific? I see upvotes, but no justification. – jds Jan 30 '20 at 01:04
  • @gwg, I can't see what more I can do to be specific. Maybe something like this? Say we want to "predict" NPC income (Y) in a simulation game (so we can define the "true" model). The game includes IQ (X1) and leg length (X2) as factors. Since leg length (X2) is completely unrelated to income (Y) in that game, the "true" $\beta_2 = 0$. But in general $\hat{\beta}_2$ will take a non-zero value due to $\epsilon$. Thus if you use this model to predict $Y$, you will suffer from the wrong $\beta_2$; that is exactly what over-fitting denotes: a spurious pattern (leg length affects income) fitted from the data. –  Jan 30 '20 at 01:48
  • Sure. I think you're describing shrinkage, which has to do with selecting features, reducing the variance of the estimator, and increasing predictive power. Wikipedia says, "This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the subjunctive effects of further sampling." – jds Jan 30 '20 at 11:49
  • See Tibshirani's justification for the Lasso: http://www.math.yorku.ca/~hkj/Teaching/6621Winter2017/Coverage/lasso.pdf. He mentions (1) greater prediction accuracy, because OLS estimates can have high variance, and (2) interpretability. Also see Wikipedia's justification of Tikhonov regularization: https://en.wikipedia.org/wiki/Tikhonov_regularization. Neither frames the problem as overfitting as the OP means it. Many people would agree with what you've written, but I think a good answer to the OP's question would add a lot more nuance. – jds Jan 30 '20 at 12:10
  • @gwg You mention two regularization methods, so I suppose you want an answer of the form "point out a non-overfitting version of OLS, so that the overfitting in vanilla OLS becomes obvious." While I agree with that, I think it may be off topic. Overfitting under the true model specification $Y = X\beta + \epsilon$ is, in essence, due to the noise $\epsilon$, as I indicated, even under regularization (a small OLS-vs-ridge simulation follows this comment thread). Unless you specify too strong a regularization, such as "assign 0 to all parameters"; in that case you get both zero overfitting and zero fitting if the true model is not a "zero model". –  Jan 31 '20 at 00:21
  • Okay, you've convinced me. (Sadly, SO won't let me change my vote.) – jds Jan 31 '20 at 15:34
  • @gwg No problem, I will remove this answer to inspire the kind of answer you want. –  Jan 31 '20 at 15:50
  • @user137795 I think your answer wasn't wrong. – Sextus Empiricus Dec 25 '20 at 11:48
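
To illustrate the regularization point raised in the comments above, here is a sketch of my own (the sample sizes, number of irrelevant predictors, and ridge penalty are arbitrary assumptions, and this is not the deleted answer): with many unrelated predictors and a small sample, plain OLS over-fits badly, while ridge regression shrinks the spurious coefficients and recovers much of the lost test accuracy.

```python
# Hedged sketch: a flat OLS hyperplane over-fits when there are many
# irrelevant predictors; ridge regression reduces (but does not eliminate)
# the excess test error.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
n_train, n_test, n_irrelevant = 50, 5_000, 45

def data(n):
    x1 = rng.normal(size=n)                      # the only relevant predictor
    X = np.column_stack([x1, rng.normal(size=(n, n_irrelevant))])
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)
    return X, y

X_tr, y_tr = data(n_train)
X_te, y_te = data(n_test)

for name, model in [("OLS  ", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0))]:
    model.fit(X_tr, y_tr)
    mse = np.mean((y_te - model.predict(X_te)) ** 2)
    print(f"{name} test MSE: {mse:.2f}   (irreducible noise variance = 1.0)")
```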

This is an old post, but I just came across it. I think the question is about how a line can become "curvy" when over-fitting occurs. If we have two points in 3D, we are already over-fitting: the algorithm tries to fit a plane through two points, and an infinite number of planes pass through two points, so the fit can 'tilt' any way it likes to accommodate a third point. If we add another dimension and another point, it tilts towards that one, and so on. That is where the problem lies. The 2D plots used to illustrate the concept are an abstraction, or refer to polynomial regression.
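
A small numerical illustration of this geometric point (my own example; the specific points and coefficients are made up): two points in 3D leave the plane $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ under-determined, so many different planes fit the training points exactly yet predict very different values at a new location.

```python
# Two training points and three plane parameters: an under-determined system.
import numpy as np

# Two training points (x1, x2) -> y; columns are intercept, x1, x2
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 1.0])

# Minimum-norm exact solution ...
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)
# ... and another exact solution obtained by adding a null-space direction
null_dir = np.array([0.0, 1.0, -1.0])    # X @ null_dir = 0
beta_b = beta_a + 5.0 * null_dir

x_new = np.array([1.0, 2.0, 0.0])        # a new point to predict
print("train residuals A:", X @ beta_a - y)   # both fits are exact
print("train residuals B:", X @ beta_b - y)
print("prediction A:", x_new @ beta_a)
print("prediction B:", x_new @ beta_b)        # wildly different
```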