
Are there ways to automatically detect outliers (we can restrict attention to one-dimensional datasets) when the underlying distribution is difficult to model?

Intuitively, resampling techniques could help.

(1) You split the data into two sets, S1 and S2.

(2) You fit an empirical distribution on S1 (e.g. using histograms).

(3) You develop an engine to detect if points in S2 are compatible with the non-parametric distribution fitted using data from S1.

(4) Repeat this for many S1, S2 splits and record how many times each point was detected as an outlier.

As an example of Step (3), one could evaluate the interquartile range based on S1 and check how many, and which, points in S2 do not fall within that interquartile range.
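The steps above could be sketched roughly as follows. This is only an illustrative implementation, not an established method: the function name, the number of splits, and the use of Tukey's 1.5×IQR fences as the "engine" in Step (3) are all my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def outlier_scores(x, n_splits=200):
    """Sketch of the resampling idea: repeatedly split the data,
    fit a simple non-parametric summary (here, quartiles) on S1, and
    count how often each point, when it lands in S2, falls outside
    the 1.5*IQR fences computed from S1."""
    x = np.asarray(x)
    n = len(x)
    flagged = np.zeros(n)   # times each point was flagged as an outlier
    tested = np.zeros(n)    # times each point landed in S2
    for _ in range(n_splits):
        idx = rng.permutation(n)
        s1, s2 = idx[: n // 2], idx[n // 2 :]
        q1, q3 = np.percentile(x[s1], [25, 75])
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        out = (x[s2] < lo) | (x[s2] > hi)
        flagged[s2] += out
        tested[s2] += 1
    return flagged / np.maximum(tested, 1)  # flag frequency per point

# Example: 100 standard-normal points plus one planted outlier at 8.0
data = np.concatenate([rng.normal(size=100), [8.0]])
scores = outlier_scores(data)
```

A point with a score near 1 was flagged on essentially every split in which it was held out, while typical points should have scores near 0 (up to the usual false-positive rate of the fences).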

Questions:

  • I am not sure whether this intuitive idea makes sense, can be made more rigorous, or can even be implemented in a sensible way. Are there such methods?

  • Alternatively, is there a "standard" for outlier estimation in a non-parametric setting?

Thomas
  • You’re saying, I think, that you can detect outliers somehow if you can detect outliers in a sample obtained with replacement. That appears correct but does not move you from your origin. Also, if with the same unspecified method you don’t detect outliers in a different sample with replacement, how do you handle the contradiction? – Nick Cox Mar 01 '22 at 08:46
  • (1) You split the data into two sets, S1 and S2. (2) You fit a distribution on S1. (3) You develop an engine to detect if points in S2 are compatible with the parametric distribution fitted. – Thomas Mar 01 '22 at 08:57
  • Once you have a distribution, even numerically estimated, maybe there are methods to detect if a point is compatible with it or not. This is the "unspecified method" you are referring to I guess. This is a part of the question ... – Thomas Mar 01 '22 at 08:58
  • What is non-parametric or bootstrapping about that? "develop an engine" is what you need to do; it is not a solution. I am sympathetic here; I just don't see that How do I model a distribution that is difficult to model? is a helpful question. – Nick Cox Mar 01 '22 at 08:59
  • I am not sure what to think about comments like "why you are making the question if you do not have an answer"... – Thomas Mar 01 '22 at 09:00
  • That's not a quotation from me. My difficulty is that I don't see that you have a question that can be answered at all. – Nick Cox Mar 01 '22 at 09:02
  • I understand that this is your opinion about the question. Taking your feedback into consideration, I removed the word "bootstrap" that was maybe leading to some confusion, and clarified a bit what I would like to receive as an answer. As a side point, I do not think there are questions that cannot be answered, but that is my "gnoseological" point of view; you may have a different opinion. – Thomas Mar 01 '22 at 09:17
  • I appreciate that you're not getting much from my comments, but I fear that the edits make the question no easier to answer helpfully. What does fitting a distribution nonparametrically mean? I think you're seeking a completely assumption-free way of identifying outliers, to which my reactions are (1) if this existed, don't you think it would be well publicised (2) to me, that is a contradiction in terms, as outliers are surprising relative to some model of the data and that might be specified or it might be in a researcher's mind but it can't be automated generally. Sorry.... – Nick Cox Mar 01 '22 at 09:39
  • I would say that fitting a distribution nonparametrically means that we do not restrict the density to a functional form, but consider the whole functional space of square-integrable functions, e.g. by making an expansion in some complete basis. I guess "https://en.wikipedia.org/wiki/Density_estimation" can be interpreted this way. – Thomas Mar 01 '22 at 09:45
  • Regarding your points: (1) True, but I do not know many things; I may be missing something big. (2) If you define outliers as relative to a model, I see your point. – Thomas Mar 01 '22 at 09:50
  • But if we look at a plot we sometimes "recognise" outliers by eye, no? This "by eye" recognition means, I think, that in our mind we are internally fitting a model on a big subset of the data, and seeing that some points are not fitted by that model. This "fitting" that our mind does is non-parametric, since we do not know about "Gaussian distribution" or something similar (even if we know about "smoothness"). There could be something more rigorous based on these ideas, I thought. – Thomas Mar 01 '22 at 09:50
  • If I know that an outlier is an impossible value because of subject-matter knowledge, you can call that non-parametric if you like -- it's a term I try hard to avoid -- but that doesn't make it any easier to program any such rule. Again, I am trying to back off here without seeming rude or unhelpfully sceptical, but I can't sense that there is any useful juice to be squeezed out of this particular orange. – Nick Cox Mar 01 '22 at 09:59
  • In the meantime I discovered https://hal.archives-ouvertes.fr/hal-01640325/document and https://www.monash.edu/business/ebs/research/publications/ebs/wp02-2021.pdf, reporting efforts towards non-parametric outlier detection. – Thomas Mar 01 '22 at 17:57
  • The idea is good, but it's overkill. A leave-out-one comparison of each value to its complement is a standard, effective, and simple way to detect outliers; and it's obviously superior to leaving out more than one because the comparison datasets (the complements) are as large as possible. The real challenge lies in identifying *how many* outliers to detect. A principled variant of your approach can work. The main reason there is no standard method is that "detect outliers" is too vague a question: how many outliers should be specified (or should this be estimated)? In which directions? – whuber Mar 02 '22 at 15:35
  • @Nick A non-parametric version of this approach, which would be rather good, is a leave-out-one comparison to boxplot fences. Except when the dataset is very small, though, this doesn't look like it would improve much on standard boxplot methods. And when the dataset *is* small, the value of any nonparametric outlier-detection exercise is dubious. – whuber Mar 02 '22 at 15:38
  • It's all too easy to remind anyone of complications: here are some. Several variables in your dataset, so that what is an outlier is not a univariate question? Categorical variables too? Time and space structure? – Nick Cox Mar 02 '22 at 15:45
  • Thanks @whuber. Googling what you suggested, I found https://lmfit.github.io/lmfit-py/examples/example_detect_outliers.html and https://stats.stackexchange.com/questions/121071/can-we-use-leave-one-out-mean-and-standard-deviation-to-reveal-the-outliers ; let me know if you have better examples. – Thomas Mar 03 '22 at 22:58
  • Anyway, yes, ideally a method should estimate the number of outliers to be useful, I think... For the boxplot approach, this is very similar to the method proposed in the linked question. But that approach also has a drawback, no? If we let N tend to infinity, it will surely detect a number of outliers scaling with N, even if all points come from the same distribution. A good outlier-detection method should take that into account, no? – Thomas Mar 03 '22 at 23:08
  • Thinking of which, I wonder if there are methods that estimate "the number of outliers" in a sample just by looking at an "excess" with respect to something, rather than saying which points are classified as outliers. But this is probably another question. Sorry for writing too much; I still do not have clear ideas about this problem of outlier spotting... – Thomas Mar 03 '22 at 23:21
  • Even in the univariate case the problem is delicate, because the presence of one outlier can mask the presence of others. Some of the better methods identify the most extreme values one at a time, peel them off, and repeat. The process needs careful control and theoretical justification. Entire books have been written about this--just in the univariate case. – whuber Mar 03 '22 at 23:26
  • Note that the examples you found are looking for very particular kinds of outliers. Ultimately, an outlier is an individual datum that differs from *your characterization* of an entire batch of data. There cannot be a general, one-size-fits all "outlier detection" algorithm. – whuber Mar 04 '22 at 13:40
  • Thanks for making this explicit. Such a characterization may enter at step (3) of the (incomplete) pseudo-algorithm in the question. The general idea of the procedure I wrote is that if we take random subsets S1 and S2, then the vast majority of the time S1 does not contain any outlier, and therefore we would have a "pure" set on which to evaluate a "pure" characterization with which to detect outliers in S2. – Thomas Mar 04 '22 at 20:18
  • Talking about generality: if we fix a distribution $f_0(x)$ and are given a random number $x$, we have hypothesis tests that can test the null hypothesis that $x$ comes from $f_0$. Why can we not have something similar where $f_0(x)$ is learnt "statistically" from the data? I understand that the masking effect makes this more difficult, but my limited knowledge does not permit me to see why this task is so undoable... – Thomas Mar 04 '22 at 20:22
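For reference, the leave-out-one comparison to boxplot fences mentioned in the comments could be sketched as follows. This is only an illustrative interpretation of that idea (the function name, the fence multiplier `k`, and the sample data are my own choices), not a standard implementation.

```python
import numpy as np

def loo_fence_outliers(x, k=1.5):
    """Sketch of the leave-out-one idea: compare each value to the
    Tukey fences computed from its complement (all the other points),
    so the reference set is as large as possible and is never
    contaminated by the point under test."""
    x = np.asarray(x)
    flags = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        rest = np.delete(x, i)                 # the complement of point i
        q1, q3 = np.percentile(rest, [25, 75])
        iqr = q3 - q1
        flags[i] = (x[i] < q1 - k * iqr) or (x[i] > q3 + k * iqr)
    return flags

# Example: seven values near 1.0 plus one clearly discrepant value
sample = np.array([1.0, 1.2, 0.9, 1.1, 1.05, 0.95, 1.15, 10.0])
flags = loo_fence_outliers(sample)
```

Note that this only flags one kind of outlier (values far outside the Tukey fences) and, as discussed above, says nothing about how many outliers to expect or about the masking effect when several extreme values are present.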

0 Answers