PCA with binary and numerical variable

Question

How should I handle a mix of several binary variables and one numerical variable when doing PCA? My thinking was to standardize the numerical variable, leave the binary variables as they are, and then apply PCA.
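For concreteness, the approach described above could be sketched as follows. This is only an illustration under assumptions: the data are randomly generated, the column counts are made up, and PCA is computed by hand with an SVD rather than with any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, three binary columns and one numerical column.
n = 200
binary = rng.integers(0, 2, size=(n, 3)).astype(float)
numeric = rng.normal(loc=50.0, scale=10.0, size=(n, 1))

# Standardize only the numerical column (zero mean, unit variance).
numeric_std = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)

# Leave the binary columns as 0/1 and stack everything together.
X = np.hstack([binary, numeric_std])

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = s**2 / (n - 1)
explained_ratio = explained_variance / explained_variance.sum()
scores = Xc @ Vt.T  # rows projected onto the principal axes

print(explained_ratio)
```

Note that a 0/1 column has variance p(1-p), which is at most 0.25, while the standardized column has variance 1, so in a setup like this the numerical variable tends to carry more weight in the leading component.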

Thanks!

See https://stats.stackexchange.com/q/5774/930. – chl, Nov 07 '20 at 13:05

score 0 · Answer 1 · answered Nov 02 '20 at 11:05


Binary variables are categorical variables, so applying PCA to them is not a good idea: PCA is designed for continuous variables and works by decomposing their variance.

Instead, you should create new features from these categorical variables in a meaningful way.


Long Luu


score 0 · Answer 2 · answered Nov 02 '20 at 11:17

PCA should only be applied to continuous variables, as it decomposes their variance-covariance structure. These measures (variance and covariance) have no natural definition for binary or other categorical variables as such.

After converting categorical variables (binary or not) to dummy variables, it becomes mathematically possible to calculate (co)variances and run PCA. Yet the result is meaningless, since 1) the 0-1 coding convention is arbitrary, and so are the resulting (co)variances; 2) dummy variables of a categorical variable with more than two levels are negatively correlated by design, which distorts the variance-covariance matrix; and 3) it implicitly assumes the dummies are continuous.
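Point 2 can be checked numerically. The following is a minimal sketch on made-up data (a hypothetical three-level variable drawn at random): because every row of the dummy matrix sums to 1, a 1 in one column forces a 0 in the others, and the off-diagonal correlations come out negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical variable with three levels, 1000 observations.
levels = rng.integers(0, 3, size=1000)

# One-hot (dummy) coding: one 0/1 column per level.
dummies = np.eye(3)[levels]

# Each row sums to 1, so the columns cannot vary independently.
corr = np.corrcoef(dummies, rowvar=False)
print(np.round(corr, 2))
```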

In your case I would recommend limiting the dimensionality reduction to the binary variables only, e.g., by performing correspondence analysis.
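As a rough illustration of what that could look like, here is a sketch of correspondence analysis applied to the indicator matrix of the binary variables (often called multiple correspondence analysis). It is an assumption that this is the variant meant here; the data are random, and the computation is written out in NumPy rather than taken from a dedicated package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, three binary (0/1) variables.
B = rng.integers(0, 2, size=(200, 3))

# Indicator matrix: two columns per binary variable (level 0, level 1).
Z = np.hstack(
    [np.column_stack([1 - B[:, j], B[:, j]]) for j in range(B.shape[1])]
).astype(float)

# Correspondence analysis of the indicator matrix.
P = Z / Z.sum()            # correspondence matrix
r = P.sum(axis=1)          # row masses
c = P.sum(axis=0)          # column masses
expected = np.outer(r, c)  # expected proportions under independence
S = (P - expected) / np.sqrt(expected)  # standardized residuals

U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s**2                              # principal inertias
row_coords = (U * s) / np.sqrt(r)[:, None]  # row principal coordinates

print(np.round(inertia, 3))
```

The first few columns of `row_coords` then play the role that principal component scores play in PCA, and could be used alongside the standardized numerical variable.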
