Cluster many thousands of observations (mixed variable types). Cluster a subsample and then classify the rest of the observations?

Question

I'm trying to run a cluster analysis on a large dataset (70k+ observations to cluster) with mixed variables (numeric, ordinal, binary and nominal). I don't think I can create the distance matrix using SAS over the entire dataset. So, I have tried to run a hierarchical clustering using Gower's distance over a subsample of my data. I've got some questions.

If the above method (hierarchical clustering of a subsample) is appropriate, how can I then score the rest of the observations and assign (classify) them to the clusters obtained?
If the above method isn't good, what are other recommended methods to cluster a large dataset with mixed variables? (Available in SAS if possible.)
How can I check for correlations/multicollinearity among mixed variables? I don't know if running something like PCA or factor analysis makes sense with categorical data.

Please check my editing of your question. – ttnphns, Sep 25 '13 at 20:57

score 1 · Accepted Answer · answered Sep 26 '13 at 08:19

Hierarchical clustering in general does not scale well to large data sets. There are some special cases such as SLINK that need only $O(n)$ memory and $O(n^2)$ runtime (naive implementations need $O(n^2)$ memory and $O(n^3)$ runtime). So you may need to look into alternative methods such as DBSCAN. DBSCAN works with arbitrary distance measures, but you will probably not have index acceleration, so it will need $O(n^2)$ runtime, too. It should still scale to 70k observations; I ran DBSCAN on 100k observations years ago. The key is not to compute a complete distance matrix, because that needs $O(n^2)$ memory.
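To illustrate the point about avoiding the full distance matrix, here is a minimal sketch in Python with NumPy (not SAS, whose facilities are not covered here). It runs a plain DBSCAN and recomputes one row of Gower distances at a time, so memory stays at $O(n)$ while runtime remains $O(n^2)$. The function names `gower_row` and `dbscan_gower` are hypothetical, and the sketch assumes ordinal variables are rank-coded and passed with the numeric columns, while binary and nominal variables are integer-coded and compared by simple mismatch.

```python
import numpy as np

def gower_row(i, num, cat, ranges):
    """Gower distances from observation i to every observation (one row only)."""
    d_num = np.abs(num - num[i]) / ranges      # range-scaled numeric/ordinal part
    d_cat = (cat != cat[i])                    # simple mismatch for binary/nominal
    n_vars = num.shape[1] + cat.shape[1]
    return (d_num.sum(axis=1) + d_cat.sum(axis=1)) / n_vars

def dbscan_gower(num, cat, eps, min_pts):
    """Plain DBSCAN; min_pts counts the point itself. Returns labels, -1 = noise."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    n = num.shape[0]
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0                  # constant columns contribute 0
    labels = np.full(n, -2)                    # -2 = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] != -2:
            continue
        neighbors = np.flatnonzero(gower_row(i, num, cat, ranges) <= eps)
        if neighbors.size < min_pts:
            labels[i] = -1                     # may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # noise reached from a core point
            if labels[j] != -2:
                continue
            labels[j] = cluster
            nb_j = np.flatnonzero(gower_row(j, num, cat, ranges) <= eps)
            if nb_j.size >= min_pts:           # j is a core point: keep expanding
                queue.extend(nb_j[labels[nb_j] < 0])
    return labels
```

Each distance row is discarded as soon as its neighbors are extracted, which is the trade-off described above: quadratic time, linear memory. The values of `eps` and `min_pts` have to be chosen for the data at hand.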

However, neither method offers an obvious way of classifying new observations. Clustering is simply a different task from classification. It's about getting a sketch of structure in the data to then analyze and turn into knowledge. No clustering will ever be perfect, but it may be able to tell you something you did not know before. You should then formalize it in a way that you can make use of it later.

Obviously, a universal approach is to train a classifier on the clusters afterwards.
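As one possible instance of that idea, the sketch below (Python with NumPy, chosen only for illustration) assigns each remaining observation to the majority cluster among its k nearest neighbors in the clustered subsample, using the same Gower distance. The choice of a k-nearest-neighbor classifier is an assumption made here, not something the approach requires; any classifier that handles mixed variables could be substituted. The function name `assign_to_clusters` and its parameters are hypothetical, and `ranges` is assumed to hold the per-column ranges of the numeric variables used when the subsample was clustered.

```python
import numpy as np

def assign_to_clusters(new_num, new_cat, sub_num, sub_cat, sub_labels, ranges, k=5):
    """Label each new row by majority vote of its k nearest subsample rows (Gower)."""
    sub_num = np.asarray(sub_num, dtype=float)
    new_num = np.asarray(new_num, dtype=float)
    sub_cat = np.asarray(sub_cat)
    new_cat = np.asarray(new_cat)
    sub_labels = np.asarray(sub_labels)
    n_vars = sub_num.shape[1] + sub_cat.shape[1]
    out = np.empty(len(new_num), dtype=sub_labels.dtype)
    for r in range(len(new_num)):
        # Gower distance from new row r to every clustered subsample row
        d = (np.abs(sub_num - new_num[r]) / ranges).sum(axis=1)
        d += (sub_cat != new_cat[r]).sum(axis=1)
        d /= n_vars
        nearest = np.argsort(d, kind="stable")[:k]
        # Majority vote; ties go to the smallest cluster label
        values, counts = np.unique(sub_labels[nearest], return_counts=True)
        out[r] = values[np.argmax(counts)]
    return out
```

Processing one new observation at a time keeps memory proportional to the subsample size, so the full 70k rows never need a pairwise matrix. If the subsample clustering contains noise labels, it is probably sensible to drop those rows before voting.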

I don't know what is available in SAS. I believe it only has the most basic methods available, nothing advanced.
