3

I don't have a very strong statistical background, and I'm new to data science...

Now I am practicing PCA (Principal Component Analysis) for dimension reduction. This tutorial looks very complete, but one step confused me: PCA Dimension Reduction Tutorial

Before using PCA in R or Python, all the categorical data has to be converted to numerical data. The tutorial uses one-hot encoding, so that a column with several values is split into separate columns. For example, if a column called Outlet_TypeSupermarket originally has 3 values Type 1, Type 2, Type 3, after one-hot encoding it becomes 3 columns: Outlet_TypeSupermarket Type 1, Outlet_TypeSupermarket Type 2, Outlet_TypeSupermarket Type 3. They do this for each column, then run PCA on all the generated columns.
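To make the encoding step concrete, here is a minimal sketch of one-hot encoding with pandas, using made-up data shaped like the tutorial's Outlet_Type column (the column name and values here are illustrative, not the tutorial's exact data):

```python
import pandas as pd

# Hypothetical data resembling the tutorial's outlet-type column
df = pd.DataFrame({"Outlet_Type": ["Type 1", "Type 2", "Type 3", "Type 1"]})

# One-hot encoding: one new 0/1 column per category level
encoded = pd.get_dummies(df, columns=["Outlet_Type"], dtype=int)

print(list(encoded.columns))
# ['Outlet_Type_Type 1', 'Outlet_Type_Type 2', 'Outlet_Type_Type 3']
print(encoded.iloc[0].tolist())
# [1, 0, 0]  -> the first row was "Type 1"
```

Each row now has exactly one 1 among the generated columns, marking which category it belonged to.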

Finally, in this case, even if PCA chooses the 30 most important components (important columns), it may be using only parts of the original columns. For example, it may use only Outlet_TypeSupermarket Type 1 and Outlet_TypeSupermarket Type 2 from the original Outlet_TypeSupermarket.

Is this the right way to do dimension reduction? I thought the chosen columns would at least be complete columns from the original data set... If this is the correct way, could you tell me why?

user113791
  • 31
  • 1
  • 2
  • 1
    You may be interested in reading: [Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?](http://stats.stackexchange.com/q/5774/7290) – gung - Reinstate Monica Apr 28 '16 at 00:39

1 Answer

2

PCA uses all original variables by design: each individual principal component is a linear combination of all original dimensions. Therefore, even when you discard some of the PC dimensions obtained from PCA, the remaining PC dimensions still contain information from all original variables.

From a mathematical point of view, PCA requires numerical data. Categorical variables don't have the required properties: relations between categories are not defined the way they are for numeric data (e.g. for a variable with possible levels A, B, and C, it is not defined whether value A is, say, twice as big as B). Creating dummy variables from the categories solves this problem: each dummy variable can be treated as numeric (e.g. 0/1 or -1/1), which allows the PCA computation.
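The two points above can be seen in a short sketch (synthetic data, scikit-learn assumed): after dummy-encoding, every retained component carries a loading for every column, which is why the reduced data still mixes information from all original variables rather than picking whole columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical mixed data: one numeric and one categorical column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Item_Weight": rng.normal(10, 2, size=100),
    "Outlet_Type": rng.choice(["Type 1", "Type 2", "Type 3"], size=100),
})

# Dummy-encode the categorical column so everything is numeric
X = pd.get_dummies(df, columns=["Outlet_Type"], dtype=float)  # 4 columns total

# Keep only 2 of the 4 dimensions
pca = PCA(n_components=2).fit(X)

# components_ has shape (n_components, n_original_columns): each kept
# component has a weight for EVERY column, including all three dummies
print(pca.components_.shape)
# (2, 4)
```

So "dimension reduction" here means combining all columns into fewer derived axes, not selecting a subset of the original columns (the loadings are generally all nonzero, though nothing forbids a loading of exactly 0, as the comments below note).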

geekoverdose
  • 3,691
  • 2
  • 14
  • 27
  • 1
  • 1
    A linear combination does not necessarily contain all members in its result, especially if some come in with a factor of 0. :-) – Diego Apr 28 '16 at 10:36
  • 1
    True, I skipped those details in favour of simplicity and getting the general idea across ;-) Anyway, a factor of exactly 0 is very unlikely as long as no regularization is used to enforce it. – geekoverdose May 11 '16 at 12:34
  • 1
    Your statement is still not necessarily true as formulated ("still contain information from all original variables") when you then suggest discarding some dimensions. I just pointed out the case where it is easiest to find a counterexample. – Diego May 12 '16 at 02:51