
I'm new to ML and would like an explanation of dummy variables from an experienced data scientist.

From reading on the web, I understand that pre-processing the data is a really important step before even thinking about which model to implement. One topic is how to deal with categorical variables, and the answer I have found so far is to turn them into dummy variables. However, I'm struggling to understand why we do that. What is the reasoning behind this method? Is it something we do automatically as soon as we see a categorical variable in our dataset?

Why do algorithms learn better from $K-1$ columns of $1$s and $0$s than from $1$ column containing $K$ categories?

Is there something else we can think of doing when we see a categorical variable?

Nick Cox
  • Some dups: https://stats.stackexchange.com/questions/115049/why-do-we-need-to-dummy-code-categorical-variables, https://stats.stackexchange.com/questions/24185/why-dummy-variables-rather-than-one-factor-variable-in-modelling – kjetil b halvorsen Jul 27 '19 at 10:02

2 Answers

0

Introduction of dummy variables, or One-hot encoding, is a way to include nominal variables in a regression model.

Say Z is a nominal variable representing occupation, with 3 levels: Doctor, Engineer and Writer, and you wish to include this variable in your regression model to predict income.

To include it in a regression model, you need to somehow convert it into a number. Being a nominal variable, there is no natural ordering that you can use to assign a number to each category. Then, say you use some arbitrary mapping, assigning 1 for Doctor, 2 for Engineer and 3 for Writer.

Now, after you fit the model, no matter what coefficient you get, the predicted income for Engineer will always lie between that of Doctor and Writer (in fact exactly halfway between them, other predictors being equal, because the codes 1, 2 and 3 are equally spaced). That is not how you want your regression model to behave, so this arbitrary mapping is not a suitable way of including nominal variables.
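To make the point concrete, here is a minimal sketch in plain Python. The income figures are invented purely for illustration (they are not from any real dataset), and the fit is an ordinary one-predictor least-squares line on the arbitrary codes 1, 2, 3. Whatever numbers you plug in, the fitted value for Engineer ends up halfway between the other two.

```python
# Hypothetical incomes (made-up numbers, in thousands), for illustration only.
codes = {"Doctor": 1, "Engineer": 2, "Writer": 3}  # the arbitrary mapping
data = [("Doctor", 200), ("Doctor", 180),
        ("Engineer", 90), ("Engineer", 110),
        ("Writer", 150), ("Writer", 170)]

x = [codes[job] for job, _ in data]
y = [income for _, income in data]

# Closed-form simple linear regression: income = a + b * code
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x

# Fitted income per occupation under the integer coding.
pred = {job: a + b * code for job, code in codes.items()}
print(pred)
```

In this made-up data Engineers actually earn the least, yet the model cannot express that: its prediction for Engineer is forced to be the average of its predictions for Doctor and Writer.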

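The proper way to include nominal variables is One-Hot encoding. In One-Hot encoding as used here, if your variable has n levels, you add n-1 columns to your design matrix. (Strictly speaking, one-hot encoding creates one column per level; the variant that drops one level is often called dummy coding.) In the above example, you would add 2 columns, because there are 3 occupations.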

The first column you add, say ZEngineer, would be an indicator variable corresponding to Engineer: it takes the value 1 if the person is an Engineer and 0 otherwise. The second column, say ZDoctor, would be a similar indicator for Doctor.

These variables ZDoctor and ZEngineer are what are called dummy variables. For each level of the nominal variable, there is a unique configuration of dummy variables. Note that if the variable Z takes the level Writer, both of the dummy variables are zero. At most one dummy variable takes the value 1 for any given person.

In your regression with dummy variables, there are now two parameters for occupation: the parameter for ZDoctor and the parameter for ZEngineer. Each one measures the difference in expected income relative to the baseline level, Writer. These two parameters can take different values, so the model treats the levels of the nominal variable as unordered.
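The encoding described above can be sketched in a few lines of plain Python. The helper `dummy_code` and the sample list `people` are hypothetical names made up for this illustration, with Writer chosen as the baseline level to match the text.

```python
def dummy_code(values, levels, baseline):
    """Return the kept column names and one row of 0/1 indicators per value.

    The baseline level gets no column, so it is encoded as all zeros.
    """
    kept = [lvl for lvl in levels if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


levels = ["Doctor", "Engineer", "Writer"]
people = ["Engineer", "Writer", "Doctor", "Engineer"]  # hypothetical sample

columns, rows = dummy_code(people, levels, baseline="Writer")
for person, row in zip(people, rows):
    print(person, dict(zip(columns, row)))
```

In practice you would usually let a library do this, for example `pandas.get_dummies` with `drop_first=True`, though which level it drops may differ from the baseline chosen here.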

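One question you might ask is, why not three dummy variables, one for each occupation? The answer is that exactly one of the three dummy variables would be 1 for each person, so the three columns would add up to a column of ones. If you already have an intercept term in the regression, that column is already present, the design matrix no longer has full column rank, and the coefficients are not uniquely determined.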
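You can check the rank argument numerically. This is a small sketch using NumPy on a made-up four-person sample (Doctor, Engineer, Writer, Engineer); the variable names are only for illustration.

```python
import numpy as np

# Hypothetical sample of four people: Doctor, Engineer, Writer, Engineer.
intercept = np.ones((4, 1))
all_three = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1],
                      [0, 1, 0]])  # columns: ZDoctor, ZEngineer, ZWriter

# Intercept plus all three dummies: 4 columns.
X_full = np.hstack([intercept, all_three])
# Intercept plus two dummies (ZWriter dropped): 3 columns.
X_dropped = np.hstack([intercept, all_three[:, :2]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_dropped.shape[1], "rank:", rank_dropped)
```

With all three dummies the rank is lower than the number of columns, because the dummies sum to the intercept column. Dropping one dummy restores full column rank.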

-1

First of all, categories may be strings rather than numbers, so you need to translate them into numbers somehow. Let's assume you have already done that, for example with LabelEncoder. If you use a tree-based ML algorithm, one-hot encoding is often not really necessary, because trees can split on individual categories and still do the job. But a simple method like linear regression can be misled if a single column holds ordered numbers that stand for unrelated categories. Linear methods can fit a separate coefficient to each column, which lets them account for the different nature of each category. If you pack the categories into one column as ordered numbers, it is difficult for a linear method to treat those numbers correctly.

For example, the right way:

x1 x2 x3    y
 1  0  0 -> 1
 0  1  0 -> 2
 0  0  1 -> 7

It's easy to find a simple formula:

y = 1 * x1 + 2 * x2 + 7 * x3

But without OneHotEncoding the data would look like this (for example):

x    y
1 -> 1
2 -> 2
3 -> 7

You would need a higher-degree polynomial to fit this:

y = 7/3 * x - 2 * x^2 + 2/3 * x^3

Not so linear any more, right? Now imagine you have 100 or 1000 categories... Of course, if you have 10 000 categories, your table will explode with the OneHotEncoding method, and you will need to group some categories together or fall back on tree methods without OneHotEncoding.
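Here is a rough NumPy sketch of the comparison above, using the same three data points. Giving the label-encoded fit an intercept is my own assumption; the formulas above do not include one.

```python
import numpy as np

y = np.array([1.0, 2.0, 7.0])

# One-hot design: one column per category, no intercept (as in the table above).
X_onehot = np.eye(3)
coef_onehot, *_ = np.linalg.lstsq(X_onehot, y, rcond=None)

# Single label-encoded column (1, 2, 3), plus an intercept column (assumption).
X_label = np.column_stack([np.ones(3), [1.0, 2.0, 3.0]])
coef_label, *_ = np.linalg.lstsq(X_label, y, rcond=None)

pred_onehot = X_onehot @ coef_onehot
pred_label = X_label @ coef_label

print("one-hot coefficients:", coef_onehot)
print("one-hot predictions: ", pred_onehot)
print("label predictions:   ", pred_label)
```

The one-hot fit reproduces the targets exactly with one coefficient per category, while the straight line through the label codes cannot pass through all three points.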

CrazyElf