Data Preparation for Cluster Analysis

Question

Updated answer to "Data Preparation for Cluster Analysis":

Based on the discussions, normalizing the data and removing correlation among variables are often recommended. Referenced posts:

1) Are mean normalization and feature scaling needed for k-means clustering?

2) Why vector normalization can improve the accuracy of clustering and classification?

3) Is it OK to use correlated variables for cluster analysis?

4) Correlated variables in kmeans clustering

----------------------------------------------------------------------------

Original question: I see many threads discussing data standardization as preparation for PCA. I guess PCA and cluster analysis are interconnected in nature (correct me if I am wrong), which would be why data standardization is often the first step for both of them (reference: Quick-R: Clustering analysis). Maybe I can follow the PCA data preparation steps, but it would still be useful to clarify these questions:

1) What are the recommended data preparation steps for Cluster Analysis?

2) What are the characteristics of the data sets that are likely to have good clustering results?

Example dataset: I want to do cluster analysis on a variety of socio-economic factors, including continuous and discrete variables (e.g., housing unit density, population density, green space area, counts of schools and health centers, etc.).

My understanding of the questions

Question 1): removing missing data and rescaling variables is often a necessity, so I used the scale() function to standardize the data. Does the scale() function work for both continuous and discrete variables?

Question 2): PCA indicated that 9 principal components would explain 90% of the variation. I feel that this is not a successful clustering result. Any suggestions on how to reformat/organize the data to better reveal meaningful clusters? And what kind of data is likely to give successful clustering results?
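To make Question 1 concrete, here is a minimal sketch of the two rescalings that come up in this thread: z-score standardization (what R's scale() does by default, using the sample standard deviation) and min-max feature scaling. It is written in plain Python rather than R, the function names are my own, and the numbers are invented for illustration. The arithmetic is identical for counts and for continuous values; whether the result is meaningful for a discrete variable is a separate question.

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score: subtract the mean, divide by the sample standard deviation (n - 1)."""
    m = mean(values)
    s = stdev(values)  # zero for a constant column, which would make the division fail
    return [(v - m) / s for v in values]

def min_max(values):
    """Feature scaling to [0, 1]: X' = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical data: one discrete (count) variable and one continuous variable.
school_counts = [2, 4, 4, 6, 9]
housing_density = [120.5, 340.0, 95.2, 410.8, 220.1]

z_counts = standardize(school_counts)
z_density = standardize(housing_density)
scaled_counts = min_max(school_counts)
```

After standardization each variable has mean 0 and sample standard deviation 1, so neither dominates a Euclidean distance just because of its units.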

If you want to cluster on a _mixture_ of attributes of different types (such as continuous, ordinal, nominal) - which may not be a very good idea - you do not have much choice. Use the Gower similarity measure to compute the distance matrix, then perform hierarchical clustering (or, perhaps, medoid clustering). Gower similarity does not require you to scale/standardize your data. — ttnphns, Jun 23 '15 at 19:34
PCA has nothing to do with cluster analysis. It is sometimes done _prior_ to cluster analysis to reduce dimensionality, if that is necessary for some reason (often there is no such reason). With PCA, scaling/standardizing does have to be considered. — ttnphns, Jun 23 '15 at 19:39
Hi ttnphns, thanks for your input. Is it a good idea to convert the discrete attributes into continuous ones? For example, if I have a count of schools, maybe I can use the feature scaling equation: X' = (X-Xmin)/(Xmax-Xmin). https://en.wikipedia.org/wiki/Normalization_(statistics) — enaJ, Jun 23 '15 at 20:50
Black-box normalization usually does not work very well. Use your knowledge of the data, not random functionality available in your toolbox or on Wikipedia. *Understand your data*, then choose a normalization based on that understanding. Also, I suggest not running or considering cluster analysis until you have a good *visualization* of your data which *looks like clusters*. — Has QUIT--Anony-Mousse, Jun 23 '15 at 21:06
Hi Anony-Mousse, is data normalization a necessity for k-means clustering? I often see examples that don't do any normalization/standardization before running k-means. — enaJ, Jun 25 '15 at 20:28
Data normalization is about numerical representation and also the scale of the Jacobian. When numbers are very near 1.0, the double-precision representation has the least roundoff error. It is about making the discretization of the domain rich in the region of interest. If the conditioning is bad, it can be hard for the Jacobian to work along some axes. Normalization reduces the impact of that too. PCA can be thought of as a rotation, and aligning the principal components with the axes improves Jacobian estimates. — EngrStudent, Jun 30 '15 at 00:04
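
For readers wondering what the Gower measure from the first comment looks like in practice, here is a simplified sketch, not a reference implementation. It assumes only two attribute types (numeric and nominal), equal weights, and no missing values; Gower's full definition also covers weights and missing data, and ordinal attributes are often handled with rank-based variants. The function name, the "num"/"cat" labels, and the records are all made up for illustration.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between two records.

    Numeric attributes contribute |x - y| / range; nominal attributes
    contribute 0 if the values match and 1 otherwise.
    """
    total = 0.0
    for x, y, kind, r in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / r if r > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

# Hypothetical records: (housing density, school count, area type).
records = [
    (120.5, 2, "urban"),
    (340.0, 6, "urban"),
    (95.2, 4, "rural"),
]
kinds = ("num", "num", "cat")

# Range of each numeric column over the whole dataset; unused for nominal columns.
ranges = [
    max(col) - min(col) if kind == "num" else None
    for col, kind in zip(zip(*records), kinds)
]

# Full pairwise distance matrix; Gower similarity is 1 minus these values.
dist = [[gower_distance(a, b, kinds, ranges) for b in records] for a in records]
```

Because every numeric attribute is divided by its own range, each contribution already lies in [0, 1], which is why no separate scaling step is needed before computing the matrix. The matrix can then be passed to a hierarchical or medoid-based clustering routine that accepts precomputed distances.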
