Standardize binary variables in cluster analysis?

Question

I'm performing a cluster analysis on a health insurance dataset containing 4,343 observations with mixed continuous and binary variables.

I understand the importance of standardizing continuous variables. However, some of my continuous variables have extreme outliers (notably hospital visit counts and total medical expenses), so even after standardization their maximum values are 15 or higher, compared with a maximum of 1 for the unstandardized binary variables.

Should binary variables be standardized as well to prevent undue weight being placed on continuous variables?

For example, rare binary events such as MED_STROKE=1 (only 7 cases) would receive a standardized value of 24.9, given their "distance" from the mean value of MED_STROKE, which is close to zero.
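To show where the 24.9 comes from, here is a minimal sketch of the arithmetic in Python. It assumes the standardization uses the sample standard deviation (n - 1 denominator), which is what reproduces the figures in my output below; the function name `binary_z_scores` is just something I made up for illustration.

```python
import math

def binary_z_scores(n_ones, n_total):
    """Return (z for x=0, z for x=1) for a 0/1 variable.

    Assumes the sample standard deviation (n - 1 denominator).
    """
    p = n_ones / n_total  # mean of a 0/1 variable is its prevalence
    # Sample variance of a 0/1 variable: n / (n - 1) * p * (1 - p)
    sd = math.sqrt(n_total / (n_total - 1) * p * (1 - p))
    return (0 - p) / sd, (1 - p) / sd

# MED_STROKE: 7 cases out of 4,343 observations
z0, z1 = binary_z_scores(7, 4343)
print(round(z0, 4), round(z1, 2))  # -0.0402 24.89
```

The rarer the event, the smaller the standard deviation and the larger the standardized value assigned to the 1s, so the scaling is driven entirely by prevalence.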

   Stnd. Continuous Variables     N       Mean         Minimum         Maximum
   -----------------------------------------------------------------------------
   ED_BH_COUNT                    4343    0            -0.3900056      18.4851212
   TOTAL_CLAIM_COST               4343    0            -0.2958079      18.8621133
   -----------------------------------------------------------------------------

   Stnd. Binary Variables         N       Mean         Minimum         Maximum
   -----------------------------------------------------------------------------
   sexcode                        4343    0            -0.9809550       1.0191800
   SA_ALCOHOL_RELATED_DO          4343    0            -0.2320733       4.3079913
   MED_STROKE                     4343    0            -0.0401749      24.8854565
   -----------------------------------------------------------------------------

This type of question has been asked before. See https://stats.stackexchange.com/questions/68077/are-categorical-variables-standardized-differently-in-penalized-regression — Jon, May 01 '17 at 16:54
Also, here is a useful reference from a statistician http://www.stat.columbia.edu/~gelman/research/unpublished/standardizing.pdf — Jon, May 01 '17 at 16:54
@Jon The question you linked asks about standardizing binary variables for penalized regression (i.e., predictive models). Cluster analysis is different - we're not attempting to predict anything, but instead to correctly specify the "distance" between standardized or unstandardized observations. — RobertF, May 01 '17 at 17:36
Cluster analysis is still discriminative, except there are no labels as in supervised predictive models. — Jon, May 01 '17 at 17:53
