
I am trying to do anomaly detection on a heterogeneous dataset (there are unknown groups present in the data). I want to try a multivariate-Gaussian-distribution-based approach, but I keep running into the following problem:

Should I fit a single multivariate Gaussian distribution to the entire dataset, or should I cluster the dataset first and fit a separate multivariate Gaussian distribution to each cluster? My intuition tells me to do the latter, but I am a bit hesitant to use k-means clustering (my dataset has millions of records but relatively few features, fewer than 100). The single-distribution variant I have in mind is sketched below.
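A minimal sketch of that baseline, with placeholder data and an arbitrary 1% threshold (scipy here is just one way to do it):

```python
# Single multivariate Gaussian baseline: fit one distribution to all data,
# then flag the lowest-density points as anomalies.
import numpy as np
from scipy.stats import multivariate_normal

X = np.random.default_rng(0).normal(size=(100000, 50))  # stand-in for real data

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
log_density = multivariate_normal(mean=mu, cov=cov).logpdf(X)

threshold = np.percentile(log_density, 1)  # flag the lowest 1% as anomalies
anomalies = X[log_density < threshold]
```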

Would you kindly advise?

Learning_Spark
    I have seen the following question: http://stats.stackexchange.com/questions/118168/clustering-based-anomaly-detection and I completely agree with Anony-Mousse. It would be great to have some reference(s) which state that "KMeans simply doesn't work". I also want to know which algorithm(s) may work in my case. – Learning_Spark Dec 19 '14 at 18:30

1 Answer


If you plan on assuming Gaussian distributions, use Gaussian mixture modeling (see the EM algorithm on Wikipedia) instead of k-means. Why optimize the squared deviations when you could instead optimize the fit of your Gaussian distributions?
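A minimal sketch of this idea, assuming scikit-learn's GaussianMixture (the component count and threshold are illustrative, not prescriptive):

```python
# Fit a GMM and use the per-point log-likelihood as an outlier score.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(10000, 20))  # stand-in for real data

k = 5  # assumed number of clusters; tune via BIC or domain knowledge
gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
gmm.fit(X)

# score_samples returns the log-likelihood of each point under the mixture;
# poorly explained points (low values) are outlier candidates.
log_lik = gmm.score_samples(X)
outliers = X[log_lik < np.percentile(log_lik, 1)]  # flag the lowest 1%
```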

There is also an implementation of this in ELKI. It works well, as long as your data is nicely Gaussian. Once your data deviates from that, the clustering may return rather random results, and the outlier scores will unfortunately be all over the place.

There is also a stick-breaking prior for Gaussian mixture modeling (I haven't seen that in ELKI yet). I tried it with scikit-learn's DPGMM (Dirichlet process Gaussian mixture models), but it did not work well at all for me, even on artificial Gaussian data, the best possible dataset for this type of problem. Essentially, I tried the code from this SO question (by increasing the number of iterations, it would eventually converge to a reasonable solution). However, despite the data being generated from 5 well-separated Gaussian distributions, the clustering it produced merged several of them, so the result was much worse than using a fixed k and the regular GMM clustering approach.
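For reference, the modern scikit-learn interface for this is BayesianGaussianMixture with a Dirichlet process prior (the old DPGMM class was later removed); a hedged sketch with purely illustrative parameters:

```python
# Dirichlet process (stick-breaking) GMM: n_components is only an upper bound;
# the prior should shrink the weights of unneeded components toward zero.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(10000, 20))  # stand-in for real data

dpgmm = BayesianGaussianMixture(
    n_components=50,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    max_iter=2000,    # extra iterations, since convergence can be slow
    random_state=0,
)
dpgmm.fit(X)
log_lik = dpgmm.score_samples(X)  # outlier score, as with the plain GMM
```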

Has QUIT--Anony-Mousse
  • Thanks a lot, Anony-Mousse. In my case, one problem is that the number of unknown groups could be large (e.g., 100 or 250). How can I handle that? As you correctly pointed out, the data may not be nicely Gaussian, and I do not really know anything about the different groups present in the dataset (except their "presence"). Still, I need to find the outliers! – Learning_Spark Dec 20 '14 at 16:52
  • You may want to try one of the many techniques that do not attempt to model the data, such as kNN distance or LOF (a minimal sketch follows these comments). These place very few assumptions on your data, except that outliers have lower density than their neighbors. – Has QUIT--Anony-Mousse Dec 20 '14 at 17:54
  • Thanks a lot, Anony-Mousse. One more thing: how fast is GMM on large data? I find that k-means is quite time-consuming for large values of k (like 100 or 250). Please also say something about the run-time of the other algorithms on large data, if possible. – Learning_Spark Dec 21 '14 at 20:23
  • GMM is much slower than k-means. Not asymptotically (still $O(n \cdot k \cdot i \cdot d)$ for models without covariances, and $O(n \cdot k \cdot i \cdot d^2)$ with full covariances), but in practice the number of iterations $i$ is much larger IMHO. I'd say it's at least 10, if not 50-100, times more expensive than k-means. – Has QUIT--Anony-Mousse Dec 21 '14 at 21:44
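A minimal sketch of the LOF suggestion from the comments, assuming scikit-learn (n_neighbors is a tuning knob, not a recommendation):

```python
# Local Outlier Factor: density-based, no distributional assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(0).normal(size=(10000, 20))  # stand-in for real data

lof = LocalOutlierFactor(n_neighbors=20)  # neighborhood size to tune
labels = lof.fit_predict(X)               # -1 marks outliers, 1 inliers
scores = -lof.negative_outlier_factor_    # higher score = more outlying
```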