I'm performing a cluster analysis on a health insurance dataset containing 4,343 observations with mixed continuous and binary variables.
I understand the importance of standardizing continuous variables. However, some of my continuous variables have extreme outliers (notably hospital visit counts and total medical expenses), so even after standardization their maximum values are 15 or higher, compared with a maximum of 1 for the unstandardized binary variables.
Should binary variables be standardized as well to prevent undue weight being placed on continuous variables?
For example, a rare binary event such as MED_STROKE=1 (only 7 cases) would receive a standardized value of 24.9 because of its "distance" from the mean of MED_STROKE, which is close to zero.
Stnd. Continuous Variables       N  Mean      Minimum      Maximum
------------------------------------------------------------------
ED_BH_COUNT                   4343     0   -0.3900056   18.4851212
TOTAL_CLAIM_COST              4343     0   -0.2958079   18.8621133
------------------------------------------------------------------
Stnd. Binary Variables           N  Mean      Minimum      Maximum
------------------------------------------------------------------
sexcode                       4343     0   -0.9809550    1.0191800
SA_ALCOHOL_RELATED_DO         4343     0   -0.2320733    4.3079913
MED_STROKE                    4343     0   -0.0401749   24.8854565
------------------------------------------------------------------
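
To make it concrete where the MED_STROKE figures come from, here is a minimal Python sketch that rebuilds a 0/1 indicator with 7 events out of 4,343 rows and z-scores it. It assumes the standardization used the sample standard deviation (n - 1 denominator), which appears consistent with the values in the table. The variable names are my own and the data are synthetic, not the actual dataset.

```python
import numpy as np

# Synthetic stand-in for MED_STROKE: 7 ones among 4,343 observations.
n_obs, n_events = 4343, 7
med_stroke = np.zeros(n_obs)
med_stroke[:n_events] = 1.0

# z-score with the sample standard deviation (ddof=1); this is an assumption
# about how the table above was produced.
z = (med_stroke - med_stroke.mean()) / med_stroke.std(ddof=1)
z_event = z.max()      # standardized value assigned to MED_STROKE = 1
z_non_event = z.min()  # standardized value assigned to MED_STROKE = 0

# Closed form for the same quantity: with prevalence p, a 1 maps to
# sqrt((1 - p) / p), scaled by sqrt((n - 1) / n) for the sample SD.
p = n_events / n_obs
z_event_closed_form = np.sqrt((1 - p) / p * (n_obs - 1) / n_obs)

print(f"z for MED_STROKE=1: {z_event:.4f}")
print(f"z for MED_STROKE=0: {z_non_event:.4f}")
print(f"closed form:        {z_event_closed_form:.4f}")
```

The closed form shows that the standardized value of a 1 depends only on how rare the event is: the smaller p gets, the larger sqrt((1 - p) / p) becomes, so the rarest flags end up with the largest standardized values.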