I've just learnt about dummy variables. Say this is my data:

Location Nest
XXX Yes
XXX No
ZZZ Yes
YYY Yes
YYY No

I want to run multicollinearity tests and a logistic regression in RStudio, so I don't want the dependent variable (Nest) to be in this text format. What is the difference between changing all "No" to 0 and all "Yes" to 1, versus having the output below (which, as I understand it, is the 'dummy'-encoded version of the data above)?

Location Nest No nest
XXX 1 0
XXX 0 1
YYY 1 0

Moreover, if I have, say, 15 categories that I want to analyse, can I just label them 1-15, or do I need 14 columns (k-1, I suppose) to turn the categories into dummy variables?

  • You don't need to recode the Location or Nest variables or worry about dummy variables in R or RStudio. R will create dummy variables for you automatically when you include the variables Location or Nest in a linear model or glm. – Gordon Smyth Aug 30 '21 at 12:38

1 Answer

You are confusing a few things.

  • There is no coding scheme called "factor". A factor is R's internal data type for categorical variables. It is a legacy from many years ago, when computers had much less memory available: to overcome the memory issues, R's predecessor, the S language, introduced the factor type. It stores the values as integers but displays them using human-readable labels. A binary variable with the values "Yes" and "No", stored as a factor, is coded internally with the integers 1 and 2 (levels are sorted alphabetically by default, so "No" is 1 and "Yes" is 2), not 0 and 1. When you use a factor in a model, R expands it into dummy-encoded variables under the hood.
  • Your second table shows the variable in one-hot encoding. In this encoding, every level of the variable gets its own column, so for $k$ categories you end up with $k$ columns.
  • Another way of coding is dummy encoding, where we drop one of the columns, leaving $k-1$ columns. We can do this because one column is redundant: if all of the remaining $k-1$ columns are 0, the observation must belong to the dropped category.
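The three points above can be checked directly in R. The following is only a minimal sketch using the toy data from the question; the object names `df`, `one_hot` and `dummy` are made up for illustration, and it assumes R's default treatment contrasts.

```r
# Toy data from the question, with text columns stored as factors
df <- data.frame(
  Location = c("XXX", "XXX", "ZZZ", "YYY", "YYY"),
  Nest     = c("Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = TRUE
)

# A factor is a vector of integer codes plus a vector of labels (levels)
levels(df$Nest)       # "No" "Yes"
as.integer(df$Nest)   # 2 1 2 2 1

# One-hot encoding: removing the intercept gives one column per level (k = 3)
one_hot <- model.matrix(~ Location - 1, data = df)

# Dummy encoding: intercept plus k - 1 = 2 indicator columns;
# the first level, "XXX", becomes the reference category
dummy <- model.matrix(~ Location, data = df)

colnames(one_hot)     # "LocationXXX" "LocationYYY" "LocationZZZ"
colnames(dummy)       # "(Intercept)" "LocationYYY" "LocationZZZ"
```

For a factor response in `glm(..., family = binomial)`, R treats the first level as "failure" and all other levels as "success", so with the default level order "No" plays the role of 0 and "Yes" the role of 1 without any manual recoding.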

To learn more about the differences between one-hot and dummy encoding, see the thread "One-hot vs dummy encoding in Scikit-learn".
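As for the 15 categories in the question, labelling them 1-15 is fine as long as R is told that the labels are categories rather than quantities. A rough sketch, assuming a hypothetical variable `site` that is not part of the original data:

```r
# Hypothetical variable with 15 categories labelled 1-15, two rows each
site <- rep(1:15, times = 2)

# Left as a number, it is treated as one continuous predictor (a single slope)
X_numeric <- model.matrix(~ site)

# Wrapped in factor(), it is expanded into k - 1 = 14 dummy columns
X_factor <- model.matrix(~ factor(site))

ncol(X_numeric)   # 2:  intercept + 1 slope column
ncol(X_factor)    # 15: intercept + 14 dummy columns
```

So there is no need to build the 14 columns by hand; the important step is converting the numeric labels to a factor before modelling.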

Tim