I have a two-column frequency table with 20 rows of transaction amounts. Some of these amounts are fraudulent, and they are fairly obvious because they appear as outliers in the scatter plot. I also want to break the data into clusters.

  • Is there a minimum data set size required for clustering?
  • Can I use any specific technique?
  • What techniques can I use to identify the outliers?
gung - Reinstate Monica
user40465
  • Could you post the data, or a scatterplot? The best kind of clustering depends on the data. See: http://stats.stackexchange.com/a/133694/82893 – John Madden Aug 20 '15 at 15:29
  • The relationship here looks well behaved to me. It looks like a reciprocal relationship. Is there any reason something like that *couldn't* be the true relationship? Other than the 'outlier-looking' nature of the data in the plot, is there any reason to think these really are outliers (e.g., are these impossible values)? – gung - Reinstate Monica Aug 30 '15 at 17:00
  • It's more from a business point of view. Somebody should not be spending more than 20 dollars. – user40465 Aug 31 '15 at 10:37

1 Answer

Why don't you run outlier detection first, then do the clustering second?
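As a rough sketch of that two-step idea, the Python below first flags outliers with Tukey's IQR fences and then groups the remaining amounts by splitting wherever the gap between sorted values is large. Both techniques are assumptions chosen for illustration, not something prescribed here. The function names `iqr_outliers` and `gap_clusters`, the parameters `k` and `max_gap`, and the sample amounts are all hypothetical; the amounts are made up and are not the asker's data.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using Tukey's IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


def gap_clusters(values, max_gap):
    """Group 1-D values: a new cluster starts when the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


# Made-up transaction amounts, for illustration only.
amounts = [2, 3, 3, 4, 5, 11, 12, 12, 13, 14, 95, 120]

# Step 1: outlier detection. Step 2: clustering on what remains.
inliers, outliers = iqr_outliers(amounts)
clusters = gap_clusters(inliers, max_gap=3)

print("outliers:", outliers)
print("clusters:", clusters)
```

If a domain rule is available, such as the 20 dollar spending limit mentioned in the comments, a fixed cutoff could replace the IQR fences in step 1. With only 20 rows, any data-driven threshold will be sensitive to individual points, so it is worth trying a few settings.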

Also, there are clustering methods (not k-means) that have a notion of noise.
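DBSCAN is one well-known example of a clustering method with a noise label; no specific method is named here, so treating it as the intended one is an assumption. The following is a minimal pure-Python sketch of DBSCAN for one-dimensional values, not a tuned implementation. The function name `dbscan_1d`, the parameter values `eps=2` and `min_pts=3`, and the sample amounts are hypothetical choices made for illustration.

```python
def dbscan_1d(values, eps, min_pts):
    """Minimal DBSCAN for 1-D data. Returns one label per value; -1 is noise."""
    n = len(values)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


# Made-up transaction amounts, for illustration only.
amounts = [2, 3, 3, 4, 5, 11, 12, 12, 13, 14, 95, 120]
labels = dbscan_1d(amounts, eps=2, min_pts=3)

for amount, label in zip(amounts, labels):
    print(amount, "noise" if label == -1 else f"cluster {label}")
```

Here the isolated amounts end up labelled as noise rather than being forced into a cluster, which is the behaviour k-means lacks. The result depends heavily on `eps` and `min_pts`, so these would need to be chosen by inspecting the actual data.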

Experiment. Every data set is different. We don't have your data.

Has QUIT--Anony-Mousse