I'm trying to generate approximately even-sized clusters from a PCA-reduced feature set in scikit-learn, without success. The only clustering algorithm I'm familiar with is KMeans, and with it the largest cluster contains the majority of the examples (80% for K=2, 65% for K=4, and so on). I've also run some basic experiments with MeanShift and AffinityPropagation at their default settings, but they didn't give balanced clusters either.

This is where a graduate degree would have come in handy, but in the meantime, can anyone point me to good resources on clustering algorithms that can control cluster size (especially any that are implemented in sklearn)?

I realize this question is vague, but I'm not sure what information is relevant to the problem. My data set starts as a combination of normalized continuous variables and one-hot encoded categorical variables. PCA reduces the 36 original features to 20 components that explain more than 99% of the variance. Modifying the pre-PCA data set doesn't really affect how the clustering divides up the examples.
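
To make the setup concrete, here is a minimal sketch of the pipeline described above. It is an illustration rather than the actual code: the data is a synthetic stand-in from `make_blobs` with deliberately uneven groups, and `cluster_fractions` is a hypothetical helper name. It reduces 36 features with PCA (keeping 99% of the variance), runs KMeans, and reports the fraction of examples that lands in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (assumption): 36 features,
# three groups of deliberately uneven size.
X, _ = make_blobs(n_samples=[700, 150, 50], n_features=36, random_state=0)

# Keep as many principal components as needed to explain 99% of the variance.
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X)


def cluster_fractions(data, k, seed=0):
    """Fit KMeans with k clusters and return the share of examples per cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()


for k in (2, 3, 4):
    print(k, np.round(cluster_fractions(X_reduced, k), 2))
```

KMeans minimizes within-cluster variance and has no constraint on cluster size, so the fractions simply follow the density structure of the data.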

Thanks for any suggestions/input!

Dave Novelli
  • The question is indeed unclear. Are you _generating_ random clusters or doing _cluster analysis_ of existing data? – ttnphns Sep 10 '14 at 05:57
  • I thought this question was different in that I wondered if there was a solution specifically in scikit-learn, which the linked question does not mention. – Dave Novelli Sep 12 '14 at 02:12
  • @ttnphns, my ultimate goal is a binary classification task (the Kaggle Titanic competition) as I'm getting familiar with scikit-learn. I've tried a wide variety of feature engineering approaches and different types of models, but I know I'm leaving a few percentage points of accuracy on the table. I wanted to experiment to see if I could generate a small number of clusters (2-3) and improve my overall accuracy by training a separate model for each cluster. Unfortunately, because the cluster sizes are so uneven, I don't have enough examples (<150) in the small clusters to train good models. – Dave Novelli Sep 12 '14 at 02:21
  • This should not be marked as a duplicate. This question is specifically asking about scikit-learn, whereas that other ticket is purely a math problem. None of the answers there or here adequately answers the OP's question. – Cerin May 07 '17 at 17:37
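
The per-cluster modelling idea from the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the data comes from `make_classification` rather than the Titanic set, `LogisticRegression` stands in for whatever classifier is actually used, and falling back to a single global model for clusters below the 150-example threshold is an assumption of this sketch, not something proposed in the thread. The names `MIN_CLUSTER_SIZE` and `predict_by_cluster` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

MIN_CLUSTER_SIZE = 150  # threshold mentioned in the comment above

# Synthetic binary classification data (assumption, not the Titanic set).
X, y = make_classification(n_samples=900, n_features=20, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Global model used as a fallback for clusters that are too small (assumption).
global_model = LogisticRegression(max_iter=1000).fit(X, y)

models = {}
for c in range(kmeans.n_clusters):
    mask = labels == c
    # Train a dedicated model only if the cluster is large enough
    # and contains both classes; otherwise reuse the global model.
    if mask.sum() >= MIN_CLUSTER_SIZE and len(np.unique(y[mask])) == 2:
        models[c] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    else:
        models[c] = global_model


def predict_by_cluster(X_new):
    """Route each example to the model of its nearest KMeans centroid."""
    clusters = kmeans.predict(X_new)
    out = np.empty(len(X_new), dtype=y.dtype)
    for c in np.unique(clusters):
        mask = clusters == c
        out[mask] = models[int(c)].predict(X_new[mask])
    return out


preds = predict_by_cluster(X[:50])
```

With very uneven clusters, most of the small clusters end up on the fallback path, which is the practical cost of the size imbalance described in the question.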