
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data, since it is only 1D, but my boss calls it clustering, so I'm sticking with that name. The current method used by the system I'm working on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method, so how would that work?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I get a fixed number of bins, independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit-learn?
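One way a KDE-based split can work (a sketch, not from the original post): fit a kernel density estimate, evaluate it on a grid, and cut the data at the local minima of the density. The sample values below are the `sls` column from the question; the bandwidth of 3.0 and the grid resolution are my own assumptions. Note that the number of bins this produces depends on the bandwidth, not on a preset k, which is exactly the tension the question raises.

```python
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

sls = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], dtype=float)

# Fit a Gaussian KDE; sklearn expects a 2D (n_samples, n_features) array.
kde = KernelDensity(kernel="gaussian", bandwidth=3.0).fit(sls.reshape(-1, 1))

# Evaluate the (log) density on a fine grid covering the data range.
grid = np.linspace(sls.min() - 1, sls.max() + 1, 500).reshape(-1, 1)
log_dens = kde.score_samples(grid)

# Local minima of the density are the cut points between bins.
minima_idx = argrelextrema(log_dens, np.less)[0]
cuts = grid[minima_idx].ravel()

# Assign each point to a bin by counting how many cut points it exceeds.
labels = np.searchsorted(cuts, sls)
print(cuts, labels)
```

With this bandwidth the density happens to have two minima, giving the three groups from the question, but a different bandwidth could give a different bin count.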

My input file looks like:

 str ID     sls
 1           10
 2           11 
 3            9
 4           23
 5           21
 6           11  
 7           45
 8           20
 9           11
 10          12

I want to group the sls numbers into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID     sls    Cluster ID  Cluster centroid
 1          10     1           10.67
 2          11     1           10.67
 3           9     1           10.67
 4          23     2           21.33
 5          21     2           21.33
 6          11     1           10.67
 7          45     3           45
 8          20     2           21.33
 9          11     1           10.67
 10         12     1           10.67
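For reference, the K-means approach the system currently uses already takes a preset, data-independent `n_clusters` and yields exactly this kind of output. A minimal scikit-learn sketch on the sample data above (the `random_state` and output formatting are my choices; cluster label numbering is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

sls = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], dtype=float)

# n_clusters is the fixed number of bins; KMeans expects a 2D array.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sls.reshape(-1, 1))
labels = km.labels_
centroids = km.cluster_centers_.ravel()

# One output row per input row: ID, value, cluster ID, cluster centroid.
for i, v in enumerate(sls, start=1):
    k = labels[i - 1]
    print(i, v, k + 1, round(centroids[k], 2))
```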
Skander H.
  • Possible duplicate: http://stats.stackexchange.com/questions/13781/clustering-1d-data – CatsLoveJazz Jan 28 '16 at 23:21
  • 1
    @CatsLoveJazz I don't think this is a duplicate: I am asking specifically about KDE. – Skander H. Jan 28 '16 at 23:47
  • What is the underlying distribution? What kind of sample density are you going to have? What are you going to use this for? – EngrStudent Jan 29 '16 at 00:00
  • @EngrStudent I don't know what the underlying distribution is, this is real world sales data which I want to group into performance bins. We have sales data for different products p1: 50 Units sold, p2: 30 Units sold, ... p1345: 12 Units sold, there's no time dependency or anything. I'm using about 30 000 data points or so. – Skander H. Jan 29 '16 at 00:37
  • What package are you using? That is significant because the software limits the methods available. Given the small sample size (less than a few million rows, one column), I would want to use a JMP constellation plot, and the cubic clustering criterion. It doesn't care about 1d vs. 2d. – EngrStudent Jan 29 '16 at 00:41
  • @EngrStudent I'm using scikit learn. I could switch to R if I have to, but I'd rather stick to sklearn cause it's already there on the server. Never heard of JMP, what is it? – Skander H. Jan 29 '16 at 00:43
  • @EngrStudent Ok just checked, JMP is not an option as I don't have access to SAS. – Skander H. Jan 29 '16 at 00:44
  • [JMP](http://www.jmp.com/en_us/software.html) - Intel has enterprise licenses and does nearly everything in it. It is somewhere between Excel and MATLAB in diversity of function, but can handle very large data, and is compiled, so in its specialized tasks it is quite fast. It is also good for building a consistent, one-click analysis that gives the same format of results. Making consistent output in large organizations is helpful. – EngrStudent Jan 29 '16 at 00:46
  • [Weighted Cluster](https://cran.r-project.org/web/packages/WeightedCluster/vignettes/WeightedCluster.pdf), perhaps. Let me look for something scikit apropos. Scikit has this [link](http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) which has some good content. – EngrStudent Jan 29 '16 at 00:49
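The hierarchical clustering page linked in the last comment covers `AgglomerativeClustering`, which also takes a preset `n_clusters` and so satisfies the fixed-bin-count requirement. A minimal 1D sketch, again using the sample data from the question (the Ward linkage choice is my assumption; the model does not compute centroids, so they are derived per label):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

sls = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], dtype=float)

# Ward-linkage hierarchical clustering with a fixed number of clusters;
# 1D data must be reshaped into a single-feature column.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(sls.reshape(-1, 1))

# Derive per-cluster centroids, since the model does not store them.
centroids = {k: sls[labels == k].mean() for k in np.unique(labels)}
print(labels, centroids)
```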
