
I have a dataframe with numeric and categorical variables and no target variable, and I need to check for multivariate outliers.

Could you suggest a model (using Python) that works well for outlier detection and doesn't need any parameters set in advance?

I do not have a target variable to train the model on as I do not know in advance which value is tagged as an anomaly. Also, I do not know the contamination parameter for this dataframe, so I would like to avoid setting it.

The goal would be to find a model that needs little to no setting at all.

Is there a way to use both categorical and numeric variables in a single model?

Peter Flom
carguerriero

  • This is not much more precise than "How should someone find a criminal? or the love of their life (if not found already)?" You're asking for a general rule or code to find outliers in any kind of data. If this Holy Grail had been found, it would be in all the textbooks! So far, so negative, but here are a few hints. There are many threads on outliers here; you should at least try reading some of the most highly voted ones. An outlier has been defined as a data point that causes surprise in the researcher, and can't be defined well independently of a precise model for key variables. – Nick Cox Oct 02 '19 at 15:36
  • What would you do with outliers once you found them? – Nick Cox Oct 02 '19 at 15:38
  • Python-specific questions are off-topic here. Please see advice in the Help Center on software-specific questions. – Nick Cox Oct 02 '19 at 15:39
  • Unfortunately this is the problem I'm facing with the client! How would you solve it? Where would you start? – carguerriero Oct 02 '19 at 15:40
  • The client, on this evidence, has unrealistic expectations of what you or machine learning can do. I don't have good advice for that side, as my work experience in academia allows me to tell a student or a colleague what I think with no bigger risk than their being disappointed or of learning something new. On whether you're out of your depth too, I can't say. But otherwise, my experience is that most outliers are genuine data points and should be accommodated in a model. No device works for all problems, but top of my personal list is working on a logarithmic scale. – Nick Cox Oct 02 '19 at 15:47
  • The easiest decisions are when subject-matter knowledge allows you to tell that a data point is impossible (and there is no chance of correcting it). No software can use such criteria without being told. – Nick Cox Oct 02 '19 at 15:54
  • Thanks for the explanation @NickCox! – carguerriero Oct 02 '19 at 16:58
  • You need to clarify what you mean by outliers: univariate or multivariate? You also need to answer (as @NickCox asked) what you are going to do with the outliers once you find them. Another issue is why you need to find them in the first place. – Peter Flom Oct 03 '19 at 15:16
  • @PeterFlom - The idea would be to show the outliers to the user and check with him whether these outliers truly are anomalies based on his expertise. – carguerriero Oct 03 '19 at 15:40
  • OK ... univariate or multivariate outliers? – Peter Flom Oct 03 '19 at 15:42
  • Multivariate, sorry I missed this info! – carguerriero Oct 03 '19 at 15:52

2 Answers


Boxplots are a feature of many statistical programs, and so the boxplot method of designating 'outliers' is one of the most often used (and mis-used). The usual criterion is to label as an 'outlier' any observation below $Q_1 - 1.5\text{IQR}$ or above $Q_3 + 1.5\text{IQR},$ where $Q_1$ and $Q_3$ are the lower and upper quartiles, respectively, and $\text{IQR} = Q_3-Q_1.$ The choice of the constant $1.5$ has become traditional, but it is arbitrary.
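Since the question asks for Python, here is a minimal sketch of that $1.5\,\text{IQR}$ rule using only the standard library (the function name and the toy sample are made up for illustration; `statistics.quantiles` uses the 'exclusive' quartile method by default, so the exact fences can differ slightly from other software):

```python
import statistics

def tukey_fences(data, k=1.5):
    # Q1 and Q3 are the first and third quartile cut points of the data
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Flag anything outside [Q1 - k*IQR, Q3 + k*IQR]
    return [x for x in data if x < low or x > high]

sample = [2.1, 2.3, 2.2, 2.4, 2.5, 2.6, 2.3, 9.0]
print(tukey_fences(sample))  # → [9.0]
```

Note this is a univariate rule applied one variable at a time; it says nothing about joint (multivariate) structure.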

@NickCox has cautioned that most outliers are genuine data values. Samples of moderate size from many distributions (even normal) typically show outliers. As an illustration, here are boxplots of 20 samples of size $n = 100$ from $\mathsf{Gamma}(\text{shape} = 4, \text{rate} = 0.4),$ sampled and plotted using R:

set.seed(2019);  m = 20; n = 100               # 20 samples of size 100
x = rgamma(m*n, 4, .4);  g = rep(1:m, each=n)  # gamma draws and group labels
boxplot(x~g, col="skyblue2", pch=20)           # one boxplot per sample

[Figure: boxplots of the 20 gamma samples, most of which show boxplot 'outliers']

If one were systematically to remove outliers before computing sample averages, the resulting estimates would be biased below the population mean and also more variable. Below, a is a vector of $m = 100{,}000$ honest sample averages of gamma samples of size $n = 100.$ The vector b has averages that disregard boxplot 'outliers'.

set.seed(210)
m = 10^5;  a = b = numeric(m)
for(i in 1:m) {
  x = rgamma(100, 4, .4); a[i] = mean(x)
  x.out = boxplot.stats(x)$out  # vector of boxplot 'outliers'
  b[i] = (sum(x)-sum(x.out))/(100-length(x.out))  # mean excluding 'outliers'
  }
mean(a); mean(b)
[1] 10.00083         # approx. E(X) = 10
[1] 9.644603         # underestimate
sd(a); sd(b)
[1] 0.5012889
[1] 0.5455079

The plot below shows the kernel density estimators of the simulated distributions of honest means (black) and of the means of non-'outliers' (red).

[Figure: kernel density estimates of the simulated distributions of honest means (black) and of means excluding boxplot 'outliers' (red)]

BruceET

Since outlier detection is commonly treated as an unsupervised learning problem, most methods do not require labeled data. Parameters must therefore mostly be chosen by experience, by studying the algorithms, by domain understanding, and by experimentation (try some parameters, study the result).

Examples include:

  • k-nearest neighbors (kNN) distance
  • local outlier factor (LOF)
  • LoOP (local outlier probabilities)
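The kNN idea can be sketched in pure Python (the function name, the toy points, and the choice of $k = 2$ are illustrative assumptions): each point is scored by its mean distance to its $k$ nearest neighbors, and large scores flag isolated, potentially outlying points.

```python
import math

def knn_outlier_scores(points, k=3):
    # Score each point by the mean distance to its k nearest neighbors;
    # larger scores suggest more isolated (outlying) points.
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(pts, k=2)
print(max(range(len(pts)), key=scores.__getitem__))  # → 4, i.e. (10, 10)
```

Note that even here a parameter ($k$) and a score threshold must still be chosen, which is exactly the difficulty described above.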

There are some exceptions. For example, one-class SVMs require training data that is free of anomalies and are then used to detect anomalous instances in future data, so they are semi-supervised.
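As a sketch of that one-class workflow (assuming scikit-learn is available; the synthetic Gaussian training data and the choice nu=0.05 are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # presumed anomaly-free

# nu bounds the fraction of training points allowed outside the boundary;
# it still has to be chosen, which is the limitation discussed above.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

new_points = np.array([[0.0, 0.0], [8.0, 8.0]])
preds = model.predict(new_points)
print(preds)  # +1 = inlier, -1 = outlier
```

A point near the bulk of the training data is predicted +1, while a point far from it is predicted -1.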

Mario
Has QUIT--Anony-Mousse