I sample vectors of length $n$ (each sample is one vector), and my objective is to detect outliers. If the elements were normally distributed, I could compute the standardized Euclidean distance and then evaluate the survival function of the $\chi^2_{n}$ distribution. Unfortunately, the elements of my vectors follow a $\chi^2$ distribution. What tool should I use?
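For reference, the normal-case procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the elements are taken to be independent, the data are synthetic, the mean and standard deviation are estimated from the sample itself (so the $\chi^2_{n}$ law holds only approximately), and the function name `chi2_outlier_pvalues` and the threshold `alpha` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def chi2_outlier_pvalues(X, mu, sigma):
    """P-values for each row of X, assuming independent normal elements.

    The squared standardized Euclidean distance of a row is then
    (approximately) chi-square distributed with n = X.shape[1] degrees
    of freedom, so its survival function gives a p-value.
    """
    d2 = (((X - mu) / sigma) ** 2).sum(axis=1)  # squared standardized distance
    return chi2.sf(d2, df=X.shape[1])           # survival function = 1 - CDF


# Synthetic data (assumption): 1000 vectors of length n = 5.
rng = np.random.default_rng(0)
n = 5
X = rng.normal(loc=2.0, scale=3.0, size=(1000, n))
X[0] = 20.0  # planted outlier, far from the bulk in every element

mu = X.mean(axis=0)            # per-element mean estimate
sigma = X.std(axis=0, ddof=1)  # per-element standard deviation estimate

p_values = chi2_outlier_pvalues(X, mu, sigma)
alpha = 1e-3  # hypothetical significance threshold
outliers = np.flatnonzero(p_values < alpha)
```

With $\chi^2$-distributed elements the squared distance no longer follows $\chi^2_{n}$, which is exactly why this recipe does not carry over directly.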

Remark: Before posting this question, I looked at a more complicated question with different specifications.

Gideon Kogan
  • Do you sample "from" the vector of length n or is this vector the result of your sampling? What is your goal? Outlier detection? – frank Feb 15 '22 at 08:38
  • My objective is outlier detection. Each sample is a vector of length $n$. – Gideon Kogan Feb 15 '22 at 08:47
  • Do you want to find outliers within each single vector (an outlier would be an element of the vector), or do you consider a set of vectors, where an outlier would be a complete vector (an odd vector among all the other vectors)? – frank Feb 15 '22 at 08:57
  • The outlier would be a complete vector. – Gideon Kogan Feb 15 '22 at 10:00

1 Answer

I would suggest nonparametric methods, like those implemented here. E.g., one particularly popular method is Isolation Forest.

Edit: Here is why I think those methods should be considered in this case. At first sight, nonparametric ML methods might seem inferior to standard statistical methods, which come with a proper distribution, a p-value, and so on. I appreciate those things too, but I think one should not overestimate their merits, especially not in this case.

First, consider the p-value. Its principal problem is asymmetry: a small p-value lets you claim that something is an outlier, but you have no criterion for concluding that there is none. With methods like Isolation Forest you can do this, at least in principle.

Second, when you only have an approximation of your true distribution, as in this case, you have to ask how good the approximation is, how meaningful your p-value still is, and whether you can measure the error. Usually these questions cannot be answered. You therefore no longer have an analytic description of your problem, and you might as well use those ML methods.

Third, those ML methods are very flexible, easy to use, and easy to understand. They have readily available, well-optimized, and well-tested implementations, and they have been shown to usually do a very good job. I do anomaly detection all the time and can only recommend them.
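As a rough illustration of the Isolation Forest suggestion, here is a minimal sketch using scikit-learn's `IsolationForest`. The data are synthetic and purely an assumption (vectors of length 5 with $\chi^2$-distributed elements, 3 degrees of freedom, plus one planted outlier), and the hyperparameters are illustrative rather than tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): 1000 vectors whose elements are chi-square
# distributed, with one planted outlier that is extreme in every element.
rng = np.random.default_rng(0)
n = 5
X = rng.chisquare(df=3, size=(1000, n))
X[0] = 40.0

# Illustrative settings, not tuned for any real data set.
forest = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
forest.fit(X)

scores = forest.score_samples(X)  # lower score = more anomalous vector
labels = forest.predict(X)        # -1 for outliers, +1 for inliers

print("most anomalous vectors:", np.argsort(scores)[:5])
```

Note that `score_samples` yields an anomaly ranking of complete vectors, not a p-value; no distributional assumption about the elements enters the fit.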

frank
  • Thanks, I will go over it. Do you see any advantage of those methods over approximating the $\chi^2$ by the normal distribution? How do I obtain a p-value with those methods? – Gideon Kogan Feb 15 '22 at 17:13