In the example data set below, columns B and C hold categorical variables and the rest are all binary. Column J is the dependent variable to be predicted.

[image: example data set]

Is there a general rule, or rule of thumb from experience, on whether a logistic regression model works better with binary variables only or with a mix of categorical and binary variables?

(In this case, would it be better to convert the categorical variables to binary ones?)

Thank you!

Mark K
  • You should note that in conducting the regression, your categorical variables are going to be coded as binary dummy variables anyway. For example, for 'Production day' you will have three dummies, one for each day, each equal to 1 if the day is the one it corresponds to and 0 otherwise. – NatWH May 19 '18 at 15:56
  • @NatWH, thank you! Should I code them as binary myself, or can R take care of it when building the model? – Mark K May 20 '18 at 00:21
  • If R detects a factor as a predictor, it will code it into dummy variables. – NatWH May 20 '18 at 01:49
  • @NatWH, thanks again! Can you summarize the comments into an answer? Last question: would it be a good choice to prepare the data (variables) as binary, as far as possible, before building the model? (I'm thinking of saving time and tweaking in R, and also of modeling efficiency.) – Mark K May 20 '18 at 07:59

1 Answer

This is a general summary of the comments on the question. It covers the default behaviour when fitting regressions in R and the standard coding of categorical variables in regression models, logistic regression included.

When a categorical variable is used as a predictor variable in regression, it will be coded into dummy variables based on the number of categories. For k categories, it will be coded as k-1 dummy variables, each of which will take a value of 1 if the observation is in that category of that variable, and 0 if otherwise. We use k-1 dummies and code one level into the intercept as a reference level. So when interpreting the coefficients of these dummy variables, they're interpreted as contrasts between that level and the reference level. There are plenty of other good answers on Cross Validated discussing this interpretation.
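To make the k-1 coding concrete, here is a minimal sketch in R using model.matrix( ), which shows the design matrix a regression would be fitted on. The variable `day` and its three levels are made up for illustration and do not come from the data set in the question.

```r
# Hypothetical toy data: one categorical predictor with k = 3 levels.
day <- factor(c("Mon", "Tue", "Wed", "Tue", "Mon", "Wed"))

# model.matrix() builds the design matrix used by lm()/glm():
# an intercept column plus k - 1 = 2 dummy columns.
# "Mon" is the first level, so it becomes the reference level
# and gets no column of its own.
X <- model.matrix(~ day)
print(X)

# Rows for the reference level have 0 in every dummy column.
print(colnames(X))  # "(Intercept)" "dayTue" "dayWed"
```

Each dummy coefficient fitted on this matrix would then be read as the contrast between that level and "Mon".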

In R, if a factor is used as a predictor in a call to lm( ) or glm( ), it will be coded into k-1 dummy variables plus a reference level. By default R takes the first level of the factor as the reference, which is usually the first level in alphabetical order. This can easily be changed by calling relevel( ) on the factor, or by setting the level order explicitly in factor( ).
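As a rough illustration for the logistic case in the question, the sketch below fits glm( ) with family = binomial on simulated data and then changes the reference level with base R's relevel( ). The data frame `d`, the columns `day`, `x` and `y`, and the simulated effect sizes are all assumptions made up for this example.

```r
set.seed(1)

# Hypothetical data: a three-level factor, one binary predictor,
# and a binary outcome simulated from a logistic model.
d <- data.frame(
  day = factor(sample(c("Mon", "Tue", "Wed"), 200, replace = TRUE)),
  x   = rbinom(200, 1, 0.5)
)
d$y <- rbinom(200, 1, plogis(-0.5 + 0.8 * (d$day == "Wed") + 0.5 * d$x))

# The factor is expanded into dummies automatically; "Mon" is the reference.
fit1 <- glm(y ~ day + x, data = d, family = binomial)
print(names(coef(fit1)))  # "(Intercept)" "dayTue" "dayWed" "x"

# Make "Wed" the reference level instead and refit.
d$day2 <- relevel(d$day, ref = "Wed")
fit2 <- glm(y ~ day2 + x, data = d, family = binomial)
print(names(coef(fit2)))  # "(Intercept)" "day2Mon" "day2Tue" "x"
```

Changing the reference level only changes which contrasts the coefficients express. The fitted probabilities are the same in both models, so coding the dummies by hand beforehand should not change the fit.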

Hope that helps.

EDIT: this answer nicely discusses the encoding and interpretation: Significance of categorical predictor in logistic regression

NatWH