Interesting questions. I will address both of them for the use case of statistical classifiers, in order to confine the analysis to a model domain we can keep an overview of.
Before embarking on an elaborate answer, I want to discuss the definition of robustness. Different definitions have been given for the concept. One can distinguish model robustness from outcome robustness. Model robustness means that the overall model outcome - and hence the distribution of its predictions - is insensitive, or at least less sensitive, to an increasing amount of extreme values in the training set. Outcome robustness, on the other hand, refers to the (in)sensitivity of one specific predicted outcome to increasing noise levels in the input variables. I assume that your questions address model robustness.
To address the first question, we need to distinguish between classifiers that use a global or local distance measure to model (the probability of) class membership, and distribution-free classifiers.
Discriminant analysis, the k-nearest neighbor classifier, neural networks, support vector machines - they all calculate some sort of distance between parameter vectors and the input vector provided. It should be added that nonlinear neural networks and SVMs use their nonlinearity to globally bend and stretch that notion of distance (neural networks are universal approximators, as proved and published by Hornik et al. in 1989).
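As a minimal sketch (plain NumPy, made-up numbers), the following 1-nearest-neighbor rule makes the role of the L2 distance explicit, and shows how a single extreme coordinate can come to dominate that distance:

```python
import numpy as np

# Minimal sketch of a 1-nearest-neighbor rule based on the Euclidean (L2) distance.
# Training vectors and labels are made-up values for illustration.
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])

def predict_1nn(x):
    # L2 distance from the query vector to every training vector
    dists = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(dists)]

print(predict_1nn(np.array([0.1, 0.2])))   # close to the class-0 cluster -> 0
# One extreme coordinate dominates the distance and flips the decision:
print(predict_1nn(np.array([0.1, 40.0])))  # -> 1, although the first coordinate matches class 0
```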
'Distribution-free' classifiers
ID3/C4.5 decision trees, CART, the histogram classifier, the multinomial classifier - these classifiers do not apply any distance measure; they are nonparametric in their way of working. That said, they are based on count distributions - the binomial and the multinomial distribution - and nonparametric classifiers are governed by the statistics of these distributions. However, since the only thing that matters is whether the observed value of an input variable falls in a specific bin/interval or not, they are by nature insensitive to extreme observations, provided that the leftmost and rightmost bins of each input variable are open intervals. So these classifiers are certainly model robust.
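As a minimal sketch (plain NumPy, illustrative bin edges and data), a one-dimensional histogram classifier with open outer bins shows this insensitivity: an extreme observation only ever adds one count to the open rightmost bin, no matter how extreme it is.

```python
import numpy as np

# Minimal sketch of a one-dimensional histogram classifier with open-ended
# leftmost/rightmost bins; bin edges and data are illustrative assumptions.
edges = np.array([0.0, 1.0, 2.0, 3.0])   # interior bin edges
# np.digitize maps x < 0.0 to bin 0 and x >= 3.0 to the last bin,
# so the two outer bins are effectively open intervals.
def bin_index(x):
    return np.digitize(x, edges)

x_train = np.array([0.2, 0.7, 1.5, 2.4, 2.8, 250.0])   # 250.0 is an extreme value
y_train = np.array([0,   0,   1,   1,   1,   1])

# Per-bin class counts (binomial/multinomial statistics)
n_bins, n_classes = len(edges) + 1, 2
counts = np.zeros((n_bins, n_classes))
for x, y in zip(x_train, y_train):
    counts[bin_index(x), y] += 1

def predict(x):
    return int(np.argmax(counts[bin_index(x)]))

# The extreme value 250.0 adds one count to the open rightmost bin;
# whether it is 250 or 250000 makes no difference to the fitted model.
print(predict(0.5), predict(2.5), predict(10.0))
```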
Noise characteristics and outliers
Extreme values are one kind of noise. A scatter around a zero mean is the most common kind of noise that occurs in practice.

[Image: scatter noise (left) and salt-and-pepper noise (right).] Your robustness questions relate to the right-hand, salt-and-pepper kind of noise.
Analysis
We can combine the true value of classifier input $i$, $z(i)$ with scatter noise $\epsilon$, and an outlier offset $e$ as
$$x(i) = z(i) + \epsilon + e \cdot \delta(\alpha)$$
with $\delta(\alpha)$ the Kronecker delta function governed by the parameter $\alpha$; it determines whether the outlier offset is added or not. The probability $P(\delta(\alpha)=1) \ll 1$, whereas the zero-mean scatter is always present. If, for example, $P(\delta(\alpha)=1) = \frac{1}{2}$, we no longer speak of outliers - they have become common additive noise offsets. Note also that distance is intrinsic to the definition of an outlier. The observed class labels themselves in a training set cannot be subject to outliers, as follows from this required notion of distance.
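A minimal simulation sketch of this noise model (NumPy; the values of $\alpha$, the scatter scale and $e$ are illustrative assumptions):

```python
import numpy as np

# Minimal simulation of the noise model above; alpha, sigma and e are
# illustrative assumptions, not values taken from the question.
rng = np.random.default_rng(0)

n = 1000
alpha = 0.01                                   # P(delta(alpha) = 1) << 1
sigma = 0.5                                    # scale of the zero-mean scatter noise epsilon
e = 25.0                                       # outlier offset

z = np.linspace(-2.0, 2.0, n)                  # true input values z(i)
eps = rng.normal(0.0, sigma, size=n)           # scatter noise, always present
delta = (rng.random(n) < alpha).astype(float)  # indicator delta(alpha)
x = z + eps + e * delta                        # observed inputs x(i)

print(f"{int(delta.sum())} of {n} observations carry the outlier offset")
```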
Distance-based classifiers generally use the L2 norm $\|\mathbf{x}\|_2$ to calculate the degree of fit. This norm is well chosen for scatter noise. When it comes to extreme values (outliers), however, their influence grows with the second power of their magnitude, and of course with $P(\delta(\alpha)=1)$. Nonparametric classifiers use different criteria to select the optimal set of parameters, so they remain insensitive to extreme-value noise such as salt-and-pepper.
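A small numerical illustration (made-up residuals) of how one extreme value dominates an L2-type criterion, while a bin-count criterion only registers it as one more observation:

```python
import numpy as np

# Illustrative comparison: one extreme residual dominates a squared-error (L2)
# criterion, while a bin/count criterion only ever changes by one count.
residuals = np.array([0.1, -0.2, 0.15, -0.05, 8.0])   # last entry is the outlier

l2_contrib = residuals ** 2
print(l2_contrib / l2_contrib.sum())   # the outlier carries ~99% of the squared error

# Count view of the same data: the outlier is just one more observation in the
# open rightmost bin, regardless of whether it is 8 or 8000.
edges = np.array([-0.5, 0.0, 0.5])
print(np.bincount(np.digitize(residuals, edges), minlength=len(edges) + 1))
```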
Again, the type of classifier determines the robustness to outliers.
Overfitting
The issue of overfitting occurs when classifiers become 'too rich' in parameters. In that situation, learning makes all kinds of small loops around wrongly labeled cases in the training set; once the classifier is applied to a (new) test set, poor model performance is observed. Such overfitting loops tend to include points pushed just across class boundaries by the scatter noise $\epsilon$. It is highly unlikely that an outlier value, which has no similar neighboring points, is included in such a loop. This is because of the locally rigid nature of (distance-based) classifiers - and because closely grouped points can push or pull a decision boundary, which one observation on its own cannot do.
Overfitting generally happens between classes, when the decision boundaries of a classifier become too flexible. Decision boundaries are generally drawn in the more crowded parts of the input variable space - not in the vicinity of lonely outliers.
Having analyzed robustness for distance-based and nonparametric classifiers, we can relate it to the possibility of overfitting. Model robustness to extreme observations is expected to be better for nonparametric classifiers than for distance-based classifiers. There is a risk of overfitting because of extreme observations in distance-based classifiers, whereas that is hardly the case for (robust) nonparametric classifiers.
For distance-based classifiers, outliers will either pull or push the decision boundaries; see the discussion of noise characteristics above. Discriminant analysis, for example, is sensitive to non-normally distributed data - to data with extreme observations. Neural networks can simply end up in saturation, close to $0$ or $1$ (for sigmoid activation functions), which bounds the influence of an extreme input. Support vector machines with sigmoid kernel functions are likewise less sensitive to extreme values, but they still employ a (local) distance measure.
The most robust classifiers with respect to outliers are the nonparametric ones - decision trees, the histogram classifier and the multinomial classifier.
A final note on overfitting
Applying ID3 for building a decision tree will overfit the model if there is no stopping criterion. The deeper subtrees from ID3 will begin fitting noise in the training data - the fewer observations in a subtree, the higher the chance of overfitting. Restricting the parameter space prevents this.
In distance-based classifiers, overfitting is likewise prevented by restricting the parameter space, i.e. the number of hidden nodes/layers, or the regularization parameter $C$ in an SVM.
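A minimal sketch of what restricting the parameter space can look like in practice (scikit-learn, synthetic data with some label noise; all parameter values are illustrative assumptions): compare the train/test gap of an unrestricted versus a depth-limited tree, and of an SVM with large versus small $C$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% flipped labels, playing the role of wrongly labeled cases.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [
    ("unrestricted tree", DecisionTreeClassifier(random_state=0)),
    ("depth-limited tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("SVM, large C", SVC(C=1000.0)),
    ("SVM, small C", SVC(C=0.1)),
]:
    clf.fit(X_tr, y_tr)
    # A large gap between train and test accuracy is the signature of overfitting.
    print(f"{name:>18}: train {clf.score(X_tr, y_tr):.2f}, test {clf.score(X_te, y_te):.2f}")
```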
Answers to your questions
So the answer to your first question is, in general, no. Robustness to outliers is orthogonal to whether a type of classifier is prone to overfitting. The exception to this conclusion is an outlier that lies 'lightyears' away and completely dominates the distance function. In that really rare case, robustness will deteriorate because of that one extreme observation.
As to your second question: classifiers with well-restricted parameter spaces tend to generalize better from their training set to a test set. The fraction of extreme observations in the training set determines whether a distance-based classifier will be led astray during training. For nonparametric classifiers, the fraction of extreme observations can be much larger before model performance begins to decay. Hence, nonparametric classifiers are much more robust to outliers.
Also for your second question: it is the underlying assumptions of a classifier that determine whether it is sensitive to outliers - not how strongly its parameter space is regularized. It remains a power struggle between the classifier's flexibility and the pull of the rest of the data whether one lonely outlier 'lightyears away' can chiefly determine the distance function used during training. Hence, I would argue a general 'no' to your second question as well.