Problems with using many binary variables in regression

Question

I am curious about a possible regression model I want to run. Let's say the model has a (Yes/No) response and several binary independent variables.

The independent binary variables are taken from 1 common variable: Location. Let's say our possible values are: Northwest, Southwest, Midwest, Southeast, and Northeast.

I want to build a regression model that measures the probability of a "Yes" by Location. Let's say I calculated my independent variables by creating 5 (1/0) variables, one for each of the locations. I understand the interpretation would be different in a model using these variables vs. a model where we use one of the locations as a baseline to compare all other locations against.

My question is: Is there any issue with fitting or interpreting the first model, with 5 separate Location (1/0) independent variables? Would this type of model raise multicollinearity concerns? There are no continuous variables (the response and predictors are all 0s and 1s).

Here's some example code in R that could run these models for reference:

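# mod1: Location as a single factor (one location serves as the baseline)
mod1 = glm(Response ~ Location, data = data, family = "binomial")

# mod2: five separate 0/1 variables, one per location
mod2 = glm(Response ~ NW + SW + MW + SE + NE, data = data, family = "binomial")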

That would be fine *if* you exclude the intercept from your model, otherwise it would be collinear. A question: What did you measure and what does location represent? Measurements with spatial autocorrelation are more commonly modeled as random effects in mixed models. — Frans Rodenburg, Oct 21 '19 at 04:15
I was measuring whether or not an event occurred. The location is just location. I just wanted to see if any locations were more likely to have the event occur. I did not plan to factor spatial autocorrelation into my model; I just want to keep it simple for now. — user2813606, Oct 28 '19 at 18:44

Frans Rodenburg · Accepted Answer · 2019-10-22T07:24:00.260

Using dummy variables for all categories and including an intercept is also known as the dummy variable trap.

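Including separate dummy variables instead of a single factor would be fine if you exclude the intercept from your model. Otherwise the design would be perfectly collinear, because the $k$ dummy variables sum to 1 for every observation, which is exactly the intercept column. You can drop the intercept with y ~ 0 + x1 + x2 + ....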
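To make this concrete, here is a minimal sketch on simulated data. The sample size, the "Yes" probabilities, and the object names (d, trap, no_int) are made up for illustration and are not from the question. It fits the five dummies once with and once without an intercept.

```r
# Simulated data; sample size and "Yes" probabilities are assumptions for illustration
set.seed(1)
locs <- c("NW", "SW", "MW", "SE", "NE")
Location <- factor(sample(locs, 500, replace = TRUE), levels = locs)
p_true <- c(NW = 0.2, SW = 0.3, MW = 0.4, SE = 0.5, NE = 0.6)
Response <- rbinom(500, 1, p_true[as.character(Location)])

# One 0/1 column per location (all five levels, no baseline dropped)
dummies <- as.data.frame(model.matrix(~ 0 + Location))
names(dummies) <- locs
d <- cbind(Response = Response, dummies)

# Intercept plus all five dummies: the design is rank-deficient,
# so glm() cannot estimate one coefficient and reports it as NA
trap <- glm(Response ~ NW + SW + MW + SE + NE, data = d, family = binomial)
print(coef(trap))

# No intercept: each coefficient is the log-odds of "Yes" in that location
no_int <- glm(Response ~ 0 + NW + SW + MW + SE + NE, data = d, family = binomial)
fitted_p <- plogis(coef(no_int))                # back-transform to probabilities
observed_p <- tapply(Response, Location, mean)  # raw proportion of "Yes" per location
print(rbind(fitted = fitted_p, observed = observed_p))
```

In the no-intercept fit the back-transformed coefficients should match the observed proportion of "Yes" in each location, so the interpretation is "probability per location" rather than "difference from a baseline".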

If an observation is zero for all $k-1$ dummy variables, then it must belong to the remaining $k^\text{th}$ location, which is already represented by the intercept. A $k^\text{th}$ dummy variable would therefore be redundant.

When you include location as a factor in $\textsf{R}$, what really happens is that one of the locations becomes part of the intercept. It is the same as using $k-1$ dummy variables and an intercept.
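As a rough check of this equivalence, the sketch below uses simulated data (the data frame d and the constant 0.4 "Yes" probability are assumptions for illustration). It inspects the design matrix R builds for a factor and compares the fit with one using four hand-made dummies plus an intercept. With R's default treatment contrasts the first factor level is the reference; the levels are set explicitly here so that NW is first, whereas without that R would order them alphabetically.

```r
# Simulated data; sample size and the 0.4 "Yes" probability are assumptions
set.seed(2)
locs <- c("NW", "SW", "MW", "SE", "NE")
d <- data.frame(Location = factor(sample(locs, 500, replace = TRUE), levels = locs))
d$Response <- rbinom(500, 1, 0.4)

# Design matrix for a factor: an intercept plus k - 1 = 4 dummy columns
X <- model.matrix(~ Location, data = d)
print(colnames(X))

# Fit with the factor
by_factor <- glm(Response ~ Location, data = d, family = binomial)

# Fit with four hand-made dummies and an intercept (NW is left out as the baseline)
d$SW <- as.integer(d$Location == "SW")
d$MW <- as.integer(d$Location == "MW")
d$SE <- as.integer(d$Location == "SE")
d$NE <- as.integer(d$Location == "NE")
by_dummies <- glm(Response ~ SW + MW + SE + NE, data = d, family = binomial)

# Both fits give the same coefficients
print(all.equal(unname(coef(by_factor)), unname(coef(by_dummies))))

# The intercept is the log-odds of "Yes" in the reference location (NW)
p_ref <- plogis(unname(coef(by_factor)[1]))
p_nw <- mean(d$Response[d$Location == "NW"])
print(c(from_intercept = p_ref, observed_NW = p_nw))
```

The remaining coefficients are then log-odds ratios of each location relative to NW, which is the baseline interpretation mentioned in the question.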

Another important consideration: What did you measure and what does location represent? Measurements with spatial autocorrelation are more commonly modeled as random effects in mixed models. Whether a set of dummy variables is at all appropriate will depend on how these data were collected.
