How should I handle a set of binary variables and one numerical variable when doing PCA? My plan was to standardize the numerical variable, leave the binary variables as they are, and then apply PCA.
Thanks!
Binary variables are categorical variables, so applying PCA to them is not a good idea: PCA is designed for continuous variables and works with their variance.
Instead, you should create new features from these categorical variables in a meaningful way.
PCA should only be applied to continuous variables, as it decomposes their variance-covariance structure. These measures (variance and covariance) are not meaningfully defined for binary or other categorical variables.
After converting categorical variables (binary or not) to dummy variables, it becomes mathematically possible to calculate (co)variances and run PCA. Yet the result is meaningless, since 1) the 0-1 coding convention is arbitrary, and so are the resulting (co)variances; 2) dummy variables of a categorical variable with more than two levels are negatively correlated by design, which distorts the variance-covariance matrix; and 3) it implicitly assumes the dummies are continuous.
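Point 2 can be checked numerically. Here is a minimal sketch, assuming NumPy and a made-up three-level categorical variable (the data and all variable names are illustrative, not taken from the question):

```python
import numpy as np

# Synthetic 3-level categorical variable (assumption: illustrative data only).
rng = np.random.default_rng(0)
levels = rng.integers(0, 3, size=1000)

# One-hot (dummy) coding: one 0/1 column per level.
dummies = np.eye(3)[levels]

# Correlation matrix between the dummy columns.
corr = np.corrcoef(dummies, rowvar=False)

# Off-diagonal entries: correlations between dummies of different levels.
off_diag = corr[~np.eye(3, dtype=bool)]
print(np.round(corr, 2))
```

Because exactly one dummy is 1 in each row, a 1 in one column forces 0 in the others, so every off-diagonal correlation comes out negative even though the underlying variable carries no such structure.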
In your case I would recommend limiting the dimensionality reduction to the binary variables only, e.g. by performing correspondence analysis.
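For illustration, correspondence analysis on the binary variables could be sketched as below. This is a plain-NumPy sketch of the textbook SVD formulation, not any particular library's implementation. It assumes each binary variable is expanded into two indicator columns (one for 1, one for 0), which is the setup usually called multiple correspondence analysis. The function name `correspondence_analysis` and the data are hypothetical.

```python
import numpy as np

def correspondence_analysis(N):
    """Correspondence analysis of a non-negative table N via SVD.

    Assumes no row or column of N sums to zero.
    Returns row coordinates, column coordinates, and principal inertias.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    expected = np.outer(r, c)    # independence model
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]     # principal row coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # principal column coordinates
    return row_coords, col_coords, s**2

# Synthetic data: 200 observations of 4 binary variables (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))

# Indicator matrix: one column per "1" category and one per "0" category.
Z = np.hstack([X, 1 - X])

row_coords, col_coords, inertia = correspondence_analysis(Z)
scores = row_coords[:, :2]  # first two dimensions as new features
```

The leading columns of `row_coords` could then be used alongside the standardized numerical variable as features, if that suits the downstream analysis.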