I have a mixture of numeric and categorical inputs; the categorical inputs have relatively low cardinality (perhaps 10-15 levels).

I want to use PCA for anomaly detection, but I am not sure how best to encode the categorical attributes.

Will one hot encoding work, and if not, what should I try?

kjetil b halvorsen
sanity

1 Answer

I would not try PCA, but rather correspondence analysis (or some generalization of it to mixed categorical/continuous data). See "Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?" for ideas and links to generalizations such as homogeneity analysis.
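To make the idea concrete, here is a minimal sketch of one such generalization, in the style of factor analysis of mixed data (FAMD): numeric columns are standardized, each one-hot indicator is divided by the square root of its level's proportion and centered, and ordinary PCA is run on the result. Using the squared reconstruction error as an anomaly score is an assumption added for illustration, not something prescribed above, and the function names `famd_style_matrix` and `reconstruction_scores` as well as the toy data are hypothetical.

```python
import numpy as np

def famd_style_matrix(numeric, categorical):
    """Build a FAMD-style matrix from numeric columns and label arrays.

    numeric: (n, p) array; categorical: list of length-n arrays of labels.
    """
    numeric = np.asarray(numeric, dtype=float)
    # Standardize numeric columns to zero mean and unit variance.
    blocks = [(numeric - numeric.mean(axis=0)) / numeric.std(axis=0)]
    for col in categorical:
        levels, codes = np.unique(np.asarray(col), return_inverse=True)
        indicators = np.eye(len(levels))[codes.ravel()]  # one-hot, (n, n_levels)
        props = indicators.mean(axis=0)                  # proportion of each level
        # Center each indicator and weight it by 1 / sqrt(proportion),
        # so that rare levels are not swamped by frequent ones.
        blocks.append((indicators - props) / np.sqrt(props))
    return np.hstack(blocks)

def reconstruction_scores(x, n_components):
    """Squared residual of each row after projecting onto the leading components.

    Assumes the columns of x are already centered.
    """
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    v = vt[:n_components].T                # principal axes, (n_columns, k)
    residual = x - x @ v @ v.T
    return (residual ** 2).sum(axis=1)

# Toy data (made up): two correlated numeric columns and one 3-level factor.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
numeric = np.column_stack([x1, x1 + 0.1 * rng.normal(size=n)])
colour = rng.choice(["red", "green", "blue"], size=n)

X = famd_style_matrix(numeric, [colour])
scores = reconstruction_scores(X, n_components=2)
suspects = np.argsort(scores)[::-1][:5]   # rows with the largest residuals
```

Rows with large scores are poorly explained by the leading components and are candidates for inspection. The number of components and any cut-off on the scores are tuning choices that would need to be validated on real data.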

kjetil b halvorsen
    I can support this. Dummy coding of nominal variables in PCA leads essentially to a (Multiple) Correspondence Analysis (MCA). Categorical PCA (CATPCA) is a technique that incorporates both. It allows a mixture of numeric, ordinal, and nominal variables and performs dimensionality reduction while quantifying them optimally. – ttnphns Jul 25 '19 at 16:05