
I am trying to understand overfitting and underfitting better. Consider a data generating process (DGP) $$ Y=f(X)+\varepsilon $$ where $f(\cdot)$ is a deterministic function, $X$ are some regressors and $\varepsilon$ is a random error term independent of $X$. Suppose we have a model $$ Y=g(Z)+u $$ where $g(\cdot)$ is a deterministic function, $Z$ are some regressors (perhaps partly overlapping with $X$ but not necessarily equal to $X$) and $u$ is a random error term independent of $Z$.

Overfitting

I think overfitting means the estimated model has captured some noise patterns due to $\varepsilon$ in addition to the deterministic patterns due to $f(X)$. According to James et al. "An Introduction to Statistical Learning" (2013) p. 32,

[Overfitting] happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function $f$.

A similar take is available in Wikipedia,

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

A difference between the first and the second quote seems to be that Wikipedia mentions how many parameters are justified by the data, while James et al. only consider whether $g(\cdot)$ is capturing patterns due to $\varepsilon$. If we follow James et al. but not Wikipedia, the line between overfitting and absence thereof seems a bit blurry. Typically, even a very simple $g(\cdot)$ will capture at least some of the random patterns due to $\varepsilon$. However, making $g(\cdot)$ more flexible might nevertheless improve predictive performance, as a more flexible $g(\cdot)$ will be able to approximate $f(\cdot)$ better. As long as the improvement in approximating $f(\cdot)$ outweighs the deterioration due to approximating patterns in $\varepsilon$, it pays to make $g(\cdot)$ more flexible.
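This trade-off can be illustrated with a small simulation (a sketch; the sine DGP, sample size, noise level, and polynomial degrees are illustrative choices of my own): fit polynomials of increasing degree to noisy data from a fixed $f$ and measure how well each fit approximates $f$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # the "true" deterministic part of the DGP (arbitrary choice)
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.3, size=x_train.size)  # add epsilon

x_grid = np.linspace(0, 1, 1000)  # dense grid for judging approximation of f

def grid_mse(degree):
    # Polynomial.fit rescales x internally, keeping higher degrees stable
    p = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    return np.mean((p(x_grid) - f(x_grid)) ** 2)

errs = {d: grid_mse(d) for d in (1, 5, 15)}
# degree 1 is too rigid to follow the sine (underfits),
# degree 15 chases the noise (overfits); degree 5 balances the two
print(errs)
```

Here the moderate degree wins precisely because its gain in approximating $f$ outweighs what it loses by picking up patterns in $\varepsilon$.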

Underfitting

I think underfitting means $g(Z)$ is insufficiently flexible to nest $f(X)$. The approximation of $f(X)$ by $g(Z)$ would be imperfect even given perfect estimation precision of the model's parameters, and thus $g(Z)$ would do worse than $f(X)$ in predicting $Y$. According to Wikipedia,

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Under-fitting would occur, for example, when fitting a linear model to non-linear data.

Simultaneous over- and underfitting

If we follow the definition of overfitting by James et al., I think overfitting and underfitting can occur simultaneously. Take a very simple $g(Z)$ which does not nest $f(X)$, and there will obviously be underfitting. There will be a bit of overfitting, too, because in all likelihood, $g(Z)$ will capture at least some of the random patterns due to $\varepsilon$.

If we follow the definition of overfitting by Wikipedia, I think overfitting and underfitting can still occur simultaneously. Take a rather rich $g(Z)$ which does not nest $f(X)$ but is rich enough to capture lots of random patterns due to $\varepsilon$. As $g(Z)$ does not nest $f(X)$, there will be underfitting. As $g(Z)$ captures lots of random patterns due to $\varepsilon$, there will be overfitting, too; a simpler $g(Z)$ could be found which would improve predictive performance by learning less of the random patterns.

Question

Does my reasoning make sense? Can overfitting and underfitting occur simultaneously?

Richard Hardy
  • Related question: ["Impossible to overfit when the data generating process is deterministic?"](https://stats.stackexchange.com/questions/486708) and another one (only loosely related): ["Bias of a model that nests the DGP"](https://stats.stackexchange.com/questions/485945). – Richard Hardy Sep 22 '20 at 15:32
  • I'm surprised your model includes a stochastic term ($u$). I think of the prototypical model as being deterministic, even most machine learning models, although there are some models, especially process models, that include randomness as part of the model. – gung - Reinstate Monica Sep 26 '20 at 14:37
  • @gung-ReinstateMonica, an interesting perspective. I do not think this is as clear cut. Models for $f(X)$ are usually deterministic, but models for $f(X)+\varepsilon$ need not be. Regression and logistic regression are two mainstream examples of models for regression and classification, respectively, and neither of them assumes randomness away. – Richard Hardy Sep 26 '20 at 14:49
  • w/ linear regression, I would say the DGP is, eg, $Y = \beta_0 + \beta_1X + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, & the corresponding model is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i$. It's true that a residual variance, $s^2$, is also estimated, but I would say that's typically conceptualized as a nuisance parameter, not as the model, & even if someone disagrees (which would be a very defensible position), $s^2$ still isn't a stochastic disturbance. – gung - Reinstate Monica Sep 26 '20 at 17:38
  • For an example of a model where a stochastic component is an inherent part of the model, I would point to Nate Silver's [election forecasting models](https://fivethirtyeight.com/features/how-fivethirtyeights-2020-presidential-forecast-works-and-whats-different-because-of-covid-19), where the output from one run to the next won't be identical, even if the inputs are. (Note that this may be a semantic difference in how we use the terms.) – gung - Reinstate Monica Sep 26 '20 at 17:38
  • @gung-ReinstateMonica, yes, the devil could lie in the formulations. E.g. you wrote a model for $\hat{y}_i$, not $y_i$, while I wrote one for $y_i$. – Richard Hardy Sep 26 '20 at 17:45
  • Maybe underfitting over some interval while overfitting over some other interval. – DifferentialPleiometry Oct 30 '21 at 19:29

2 Answers


Your reasoning makes sense to me.

Here is an extremely simple example. Suppose that $X$ consists of only two columns $x_1$ and $x_2$, and the true DGP is

$$ y=\beta_1x_1+\beta_2x_2+\epsilon $$

with nonzero $\beta_1$ and $\beta_2$, and noise $\epsilon$.

Next, assume that $Z$ contains the columns $x_1, x_1^2, x_1^3, \dots$, but not $x_2$.

If we now fit $g(Z)$ (using OLS, or any other approach), we cannot capture the effect of $x_2$, simply because $x_2$ is unknown to $g(Z)$, so we will have underfitting. But conversely, including spurious powers of $x_1$ (or any other spurious predictors) means that we can overfit, and usually will do so, unless we regularize in some way.
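(A numerical sketch of this, with the coefficients, noise level, and polynomial degree chosen arbitrarily by me, shows both effects at once: a test-error floor from the missing $x_2$, and a train/test gap from the spurious powers of $x_1$.)

```python
import numpy as np

rng = np.random.default_rng(42)

beta1, beta2, sigma, n = 2.0, 3.0, 0.5, 100

def simulate():
    # DGP: y = beta1*x1 + beta2*x2 + eps, but only x1 will enter the model
    x1 = rng.uniform(0, 1, n)
    x2 = rng.uniform(0, 1, n)
    return x1, beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, n)

x1_train, y_train = simulate()
x1_test, y_test = simulate()

# g(Z): a degree-20 polynomial in x1 alone; x2 is simply not in Z
g = np.polynomial.Polynomial.fit(x1_train, y_train, 20)

mse_train = np.mean((g(x1_train) - y_train) ** 2)
mse_test = np.mean((g(x1_test) - y_test) ** 2)

# Underfitting: no model in x1 alone can beat the floor
#   sigma^2 + beta2^2 * Var(x2) = 0.25 + 9 * (1/12) = 1.0
# Overfitting: the spurious powers of x1 soak up noise, so train MSE < test MSE
print(mse_train, mse_test)
```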

Stephan Kolassa
  • The example illustrates the general formulas $f(X)$ and $g(Z)$ nicely. I guess my problem is that overfitting and underfitting are often loosely defined. I am trying to see whether I should alter the definitions to make them mutually exclusive or proceed as is, yielding the surprising idea of simultaneous over- and underfitting. – Richard Hardy Sep 21 '20 at 10:55
  • I'm a firm believer in so-called "tapering effect sizes", i.e., there are always influences with weaker and weaker effects, and we can't model them all (e.g., because of the bias-variance tradeoff per Sextus). As such, we *always* underfit. Therefore, I honestly don't really think changing the definitions to make the two concepts mutually exclusive is very useful. (Also, I don't quite see how it could be done and still leave us with recognizable concepts.) – Stephan Kolassa Sep 21 '20 at 13:47
  • Thank you, that makes sense. I am a conditional believer, conditioning on the field of application. In economics and finance, certainly yes. In gene expression data, perhaps no. – Richard Hardy Sep 21 '20 at 14:19
  • There's something of a philosophical issue regarding what "underfitting" means. Here, you seem to take it as meaning that there is information in the data generating process that is not extracted by the model (because you don't have it in your dataset). Another, stricter, possibility is that it means information *in your dataset* that isn't extracted by your model. Consider a case where $x_2$ is in your dataset, but the relationship is curvilinear & no polynomial terms are used for $x_2$ (that would be somewhat inexplicable, given the overzealousness WRT $x_1$, but whatever...). – gung - Reinstate Monica Nov 25 '20 at 16:55
  • I think this is a great answer (+1), & perhaps I'm being too pedantic, but it seems to me that working from the definition used, essentially all models have to underfit by definition. I recognize the application of Box's famous dictum here, but I think it may be useful to distinguish between "wrong model" in that sense & "underfit model" in the common setting. – gung - Reinstate Monica Nov 25 '20 at 16:59
  • @gung-ReinstateMonica: I think I'm not completely understanding the distinction you make. What would it mean for information to be in the dataset, but not in the DGP? In your example with a curvilinear relationship of $x_2$, I would say that the information about this relationship is indeed in the DGP. – Stephan Kolassa Nov 26 '20 at 10:29
  • @StephanKolassa, certainly it's in the DGP. As I understand the way you are using "underfit", the information is in the DGP, but not in the dataset you have access to. The way I use the term is where the information is in both the DGP & the data, but the model misses it. I would call such a model underfit, but I'd hardly blame a model as having a flaw when it never had access to the requisite information in the first place--that seems unfair to the model. – gung - Reinstate Monica Nov 26 '20 at 12:45
  • @gung-ReinstateMonica: I think I'm still not understanding. When I think about underfitting, I indeed assume the information to be both in the DGP and in the data (I am still a little confused about the difference). We may actually be on the same page here. And I wouldn't blame a *model* for underfitting if it doesn't see data in the first place, but the *modeler*, who provided the data to the model. And actually I wouldn't "blame" anyone at all, since per Box, underfitting models can be far more useful than correct ones... – Stephan Kolassa Nov 26 '20 at 16:35
  • Hmmm, perhaps I misinterpreted your answer. What is $Z$? I interpreted it as the dataset. So when the model is fit to the dataset, "$x_2$ is unknown to $g(Z)$, so we will have underfitting". To me, that reads as, when the data available don't have all the relevant variables (which they never will in practice), the model will be underfit. This sounds like conceiving of underfitting as relative to the truth, which is certainly a justifiable stance. – gung - Reinstate Monica Nov 26 '20 at 17:41
  • But I think of underfitting as relative to the information available in the dataset (consider [this](https://stats.stackexchange.com/q/496539/) recent thread where subset selection typically leads to worse models). – gung - Reinstate Monica Nov 26 '20 at 17:41
  • @gung-ReinstateMonica: $Z$ is the design matrix we use in modeling, as per the original question. (So I don't expect a *model* to look at possible transformations, like squaring a column.) I don't think there is all that much difference between "underfitting wrt the truth" vs. "underfitting wrt the information in the dataset", because if something is in the dataset, it is *a fortiori* also in the truth, and if something is in the truth but not in the dataset, it can't have been all that truthy to start with. (Details in the argument to be filled in by the reader...) – Stephan Kolassa Nov 27 '20 at 11:30
  • The question states, "$Z$ are some regressors (perhaps partly overlapping with $X$ but not necessarily equal to $X$)". I interpreted that to be the dataset. The distinction I'm making b/t the true DGP & your dataset, is that there can be relevant variables in the DGP that aren't in your dataset. There's nothing that the model / modeler can do when the required info just isn't there. As a result, it doesn't sound right to me to say a model is underfit if it wasn't given all the relevant information. But, it does seem reasonable to say it's underfit if it didn't extract the info it was given. – gung - Reinstate Monica Nov 27 '20 at 13:15

I like the idea that fitting the deterministic part badly while also fitting the noise counts as both underfitting and overfitting, but that is not how I view these terms.

I consider the issue of overfitting versus underfitting as related to the trade-off between bias and variance. You can certainly have situations with both high bias and high variance, but that is not the point of describing a situation as overfitting (relatively high variance) versus underfitting (relatively high bias). These concepts are relative to some ideal point. In practice, even that ideal point may carry bias and variance; we are never (completely) free of either.

(Actually, I would say that the answer with the lowest total error almost always carries some bias and some variance, and is therefore, in this sense, both underfitting and overfitting.)

So with overfitting versus underfitting, I always think of graphs like the following:

[Figure: overfitting and underfitting in shrinkage of the sample mean]

So to me, overfitting versus underfitting is something relative: relative to some parameter, and we can plot it as a function of that parameter.

But sure, this plot, where one side (left/right) is overfitting and the other side (right/left) is underfitting, can also be shifted up or down depending on whether the total error (bias plus variance) is increased or decreased.
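For the sample-mean shrinkage in the plot, the trade-off can even be written down analytically rather than simulated. A sketch (the values of $\mu$, $\sigma$, $n$ are arbitrary illustration choices): estimate $\mu$ by $c\bar{x}$ and trace the MSE over the shrinkage factor $c$.

```python
import numpy as np

# Shrinkage estimator of a mean: estimate mu by c * xbar, 0 <= c <= 1.
# Its MSE decomposes exactly as bias^2 + variance:
#   MSE(c) = (c - 1)^2 * mu^2 + c^2 * sigma^2 / n
mu, sigma, n = 1.0, 2.0, 10  # arbitrary illustration values

c = np.linspace(0, 1, 101)
mse = (c - 1) ** 2 * mu**2 + c**2 * sigma**2 / n

c_opt = c[np.argmin(mse)]  # analytically: mu^2 / (mu^2 + sigma^2 / n)
# c near 1 (little shrinkage): variance dominates -- the "overfitting" side;
# c near 0 (heavy shrinkage): bias dominates -- the "underfitting" side.
print(c_opt)
```

The minimum sits strictly between the two extremes, which is exactly the U-shaped picture in the figure.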

Sextus Empiricus
  • even if overfitting and underfitting occurred simultaneously, perhaps it could only result in either *net* overfitting or *net* underfitting – develarist Sep 21 '20 at 15:11