Clustering a dataset to get the most abnormal data

Question

I have several datasets of positive real values (R+), each split into a training set and a test set. For example the following dataset. I want to train a classifier on the training data so that, when I apply it to the test data, a reasonable number of points are flagged as anomalies and I can analyze the related situations. The higher the value, the more abnormal it is.

By a reasonable number I mean fewer than P% of every 100 points (100 points are generated each day; most of them should be considered normal, and I want to analyze about P of the most abnormal ones).

I tried K-means with K=2. But as you can see in the above link, the outliers pull the anomaly cluster too high, so no point in the test data would be flagged as an anomaly.

score 0 · Answer 1 · answered Apr 23 '13 at 09:54

The data sets are 1 dimensional, right?

Then you should just use kernel density estimation. Don't bother with "clustering" algorithms: they are usually designed for multivariate data, and they will also be much slower, because multidimensional data cannot be sorted, whereas one-dimensional data can.
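
As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes a Gaussian kernel and Silverman's rule-of-thumb bandwidth, neither of which is specified above, and the function names (`rule_of_thumb_bandwidth`, `kde_density`, `most_abnormal`) are made up for this example. The density is estimated from the training set, and the P% of test points with the lowest estimated density are returned, so the number of flagged points never exceeds the budget.

```python
import math
import statistics


def rule_of_thumb_bandwidth(train):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5).

    Assumes the training values are not all identical (sd > 0).
    """
    n = len(train)
    sd = statistics.stdev(train)
    q1, _, q3 = statistics.quantiles(train, n=4)
    iqr = q3 - q1
    # Fall back to the standard deviation if the IQR collapses to zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)


def kde_density(x, train, h):
    """Gaussian kernel density estimate at x, built from the training points."""
    norm = 1.0 / (len(train) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in train)


def most_abnormal(train, test, p_percent):
    """Return the p_percent of test points with the lowest estimated density."""
    h = rule_of_thumb_bandwidth(train)
    # Lowest density first: these are the least typical points.
    ranked = sorted(test, key=lambda x: kde_density(x, train, h))
    k = int(len(test) * p_percent / 100)
    return ranked[:k]


# Hypothetical toy data, not taken from the question.
train = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 0.85, 1.0, 1.1]
test = [1.0, 1.05, 0.9, 5.0, 1.1, 0.95, 1.2, 1.0, 0.85, 1.15]
flagged = most_abnormal(train, test, p_percent=10)
```

Low density flags unusual points in both tails. If only unusually high values matter, as in the question, one option is to keep only the flagged points that lie above the training median.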
