7

I have a dataset which contains only categorical data, i.e., A, B, C, D (like factors) for each predictor. There are 10 predictors and the dependent variable is binary, 0/1.

UPDATE: My predictors are answers to multiple-choice questions on a questionnaire. So each predictor only takes on categorical values, i.e., X_1 can be A, B, C, or D, and X_2 can be A, B, C, D, E, F, G, or H.

Is it feasible to fit a logistic regression to this dataset? Ideally, if I can fit a logistic regression to the data, I will then use it for prediction on a set of test data, which again contains only categorical data.

What are the pitfalls that I should look out for?

gung - Reinstate Monica
mynameisJEFF
  • Yes, you should be able to. I would watch out for how you're grouping/binning the levels of each predictor to improve credibility and homogeneity. I would also look out for missing data. Lastly, because you're fitting a logistic regression, you will need three separate datasets: one for model fitting, a second to select the probability cutoff at which you classify 0 vs 1, and a third for model validation. – Frank H. Nov 03 '15 at 16:54
  • Yes, you can train a logistic regression model on categorical data. Each feature will basically be on/off, which actually simplifies things. It depends on the implementation, though, how such features are handled. – Vladislavs Dovgalecs Nov 03 '15 at 16:54
  • Hi @Frank.H, you mentioned that I will need 3 mutually exclusive datasets. I understand the first and third are for training and testing the model. What is the second one for? I thought if the probability computed from the logistic regression is greater than 0.5, then the response variable should be `1`. If `p < 0.5`, then the response variable should be `0`. And regarding binning levels of each predictor, since all my predictors have values like `A,B,C` and the number of levels for each predictor is different, can I just use `as.factor` for all the predictor variables? – mynameisJEFF Nov 03 '15 at 17:17
  • @mynameisJEFF See: http://www.ats.ucla.edu/stat/r/dae/logit.htm – rightskewed Nov 03 '15 at 17:28

3 Answers

4

Yes, of course you can. Just be aware of the nature of your categorical data: is it ordered or unordered?

If ordered (e.g. small, medium, large) you might want a single feature X1 with values like (1, 1, 3, 2, 3, 1, ...) where 1 represents small, 2 represents medium, etc.

If unordered (e.g. red, blue, green) you'll want multiple features like X1 = (0, 0, 1, 0) representing "is red?", X2 = (1, 0, 0, 1) representing "is blue?" and so forth.
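A minimal sketch of both encodings in plain Python (my own illustration added in editing; the helper names and the ordering map are invented, not from the answer):

```python
# Illustrative encoders for the two cases described above.

ORDER = {"small": 1, "medium": 2, "large": 3}  # assumed ordering

def encode_ordered(values):
    """Map ordered categories to their integer codes."""
    return [ORDER[v] for v in values]

def encode_unordered(values):
    """One-hot encode unordered categories: one 0/1 feature per level."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

print(encode_ordered(["small", "small", "large", "medium"]))
# [1, 1, 3, 2]
print(encode_unordered(["red", "blue", "blue", "green", "red"])["red"])
# [1, 0, 0, 0, 1]
```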

Ben
  • Hi, I am not understanding the part where the data is unordered. Can you elaborate on how you create multiple features? – mynameisJEFF Nov 03 '15 at 17:06
  • There's [a lot of discussion about this](https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=how%20to%20encode%20categorical%20data) if you search for it. But suppose your training data contains a feature like (red, blue, blue, green, red). Then you'll want to train your logistic regression model using three features. X1 = IsRed? = (1, 0, 0, 0, 1), X2 = IsBlue? = (0, 1, 1, 0, 0), X3 = IsGreen? = (0, 0, 0, 1, 0), where 1s represent "yes" or "true" and 0s represent "no" or "false". In other words, you create a binary vector for each unique class (i.e. category). – Ben Nov 03 '15 at 17:20
  • Upon reading this again, it looks like you're searching for implementation details (i.e. R code) yes? If so, you should include the `R` tag or possibly consider posting in StackOverflow. – Ben Nov 03 '15 at 17:23
  • No, I am not looking for implementation. I just didn't get how you split unordered values into multiple features. That's why I included more details in the question. – mynameisJEFF Nov 04 '15 at 00:31
2

Yes, this is doable.

The (potentially) unseen pitfall is that your model may require a great deal more data than you expect. A general rule of thumb for logistic regression is that you need at least $15$ observations in the less commonly occurring category (i.e., either $0$s or $1$s) for each variable in the model (cf., here). You may think that you have just $2$ variables (viz., X_1 and X_2), and thus, that you will be OK as long as you have at least $30$ 'successes' and $30$ 'failures'.

However, there is a subtle inconsistency between how we interpret your variables and how a statistical model will use them. You will quite naturally think of X_1 as a single variable, but the model will treat it as $3$. Likewise, the model will treat X_2 as $7$ (!) additional variables, not one. More specifically, every categorical variable you add contributes its number of levels minus one ($4-1=3$ and $8-1=7$) to the model. The upshot of this is that you want to have at least $150$ 'successes' and $150$ 'failures' ($N>300$) in your dataset to fit a model with just your X_1 and X_2 variables.
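The arithmetic can be sketched in a few lines (my own back-of-the-envelope illustration, not part of the answer; the level counts come from the question's update):

```python
# Parameter count and sample-size arithmetic for the rule of thumb above.

levels = {"X_1": 4, "X_2": 8}  # answer choices per question, per the update

# Each categorical predictor contributes (levels - 1) dummy variables.
n_params = sum(k - 1 for k in levels.values())

# Rule of thumb: at least 15 observations of the rarer outcome per variable.
min_rare = 15 * n_params   # minimum 'successes' (and 'failures')
min_n = 2 * min_rare       # minimum total sample size

print(n_params, min_rare, min_n)  # 10 150 300
```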

A related issue is that you want to be sure there are sufficient data in each of those levels. Obviously, if no one chose X_2 = G, you won't be able to estimate anything about the effect of that level of X_2, but you will also have a problem if some did choose G, but everyone who did has Y = 1. That would lead to the problem of separation. Moreover, if you want to fit the interaction, you will need sufficient data in every combination of levels ($32$, in your case). To read more about these topics, you may want to peruse some of our related threads.
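Before fitting, it can help to tabulate the outcome within each cell. A rough sketch of such a check in plain Python (my own illustration; the toy rows are made up and the flagging rule is only a heuristic, not a formal test of separation):

```python
# Tally Y within each combination of X_1 and X_2 levels and flag
# one-sided cells, which are candidates for the separation problem above.
from collections import defaultdict

# Toy rows of (X_1, X_2, Y); real data would come from the questionnaire.
rows = [("A", "B", 1), ("A", "B", 0), ("B", "C", 0), ("B", "C", 0), ("A", "C", 1)]

cells = defaultdict(list)
for x1, x2, y in rows:
    cells[(x1, x2)].append(y)

# Cells where every observation has the same Y are flagged.
flagged = [combo for combo, ys in sorted(cells.items()) if len(set(ys)) == 1]
print(flagged)  # [('A', 'C'), ('B', 'C')]
```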

gung - Reinstate Monica
-1

Of course it is possible.

You just need to transform each categorical variable into binary (dummy) variables and drop one level each time. For instance, if the variable X takes two values, A and B, you create a variable which is equal to 1 if X == A and to 0 otherwise. Since X == A implies X != B, adding a second variable which is equal to 1 if X == B and to 0 otherwise would introduce perfect collinearity into your model.
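A minimal sketch of this scheme in plain Python (my own illustration; `dummy_code` is an invented name, and in R `as.factor` plus the model formula handles this for you):

```python
# Dummy-code a categorical variable, dropping the first level as the
# reference to avoid the collinearity described above.

def dummy_code(values):
    """Return {level: 0/1 column}; the first level is the all-zeros reference."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels[1:]}

print(dummy_code(["A", "B", "A", "C"]))
# {'B': [0, 1, 0, 0], 'C': [0, 0, 0, 1]}  ('A' is the reference level)
```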

PAC
  • Hi, if my categorical variable `X` can take values `A,B,C`, do I still need to make them binary? Can't I just use `as.factor(X)`? – mynameisJEFF Nov 03 '15 at 17:04
  • You do not need to make your categorical variables have only two levels. They can have as many as you can justify statistically and meaningfully. – Frank H. Nov 03 '15 at 17:07
  • `as.factor(X)` is just an R function which makes it easy to transform categorical variables into a set of binary variables. – PAC Nov 04 '15 at 09:18