Choosing the proper statistical approach for glm

Question

I have data on alcohol use by drivers as a binary response variable along with many other variables such as:

Driver's age, gender and race
Passenger's (if present) age, gender
Location variables
Road characteristics

Here is a sample of the data:

data1 <- structure(list(drunk = c(1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 
1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 
1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 
0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
0, 0, 1, 1), DriverAge = structure(c(1L, 3L, 2L, 1L, 2L, 2L, 
1L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L, 
3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 1L, 2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L, 
2L, 3L, 3L, 2L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L, 
3L, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L, 2L, 2L, 3L, 3L, 
2L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 2L, 2L, 2L, 2L), .Label = c("MidAge", 
"Old", "Young"), class = "factor"), DGender = structure(c(1L, 
2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 
1L, 1L, 2L), .Label = c("F", "M"), class = "factor"), location = structure(c(1L, 
1L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 
3L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 
3L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 
3L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 
3L, 2L, 2L, 2L, 3L, 1L, 3L, 1L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 
2L, 3L, 2L), .Label = c("other", "rural", "urban"), class = "factor"), 
    PassAge = structure(c(4L, 3L, 1L, 4L, 1L, 1L, 3L, 3L, 4L, 
    2L, 2L, 2L, 3L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 3L, 1L, 3L, 
    1L, 3L, 2L, 2L, 4L, 1L, 3L, 3L, 3L, 2L, 4L, 2L, 1L, 3L, 2L, 
    2L, 4L, 3L, 1L, 4L, 4L, 4L, 2L, 3L, 1L, 1L, 3L, 4L, 4L, 2L, 
    4L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 1L, 4L, 4L, 2L, 
    2L, 1L, 1L, 1L, 1L, 3L, 3L, 4L, 4L, 3L, 3L, 2L, 2L, 4L, 2L, 
    3L, 2L, 3L, 4L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 4L, 3L, 3L, 2L, 
    2L), .Label = c("MidAge", "None", "Old", "Young"), class = "factor"), 
    PGender = c("M", "M", "M", "M", "M", "M", "M", "M", "M", 
    "None", "None", "None", "M", "None", "M", "M", "M", "M", 
    "None", "M", "M", "M", "M", "M", "M", "M", "None", "None", 
    "M", "M", "M", "M", "M", "None", "M", "None", "M", "M", "None", 
    "None", "M", "M", "M", "M", "M", "M", "None", "M", "M", "M", 
    "M", "M", "M", "None", "M", "M", "M", "None", "M", "M", "M", 
    "M", "M", "M", "M", "M", "M", "M", "None", "None", "M", "M", 
    "M", "M", "M", "M", "M", "M", "M", "M", "None", "None", "M", 
    "None", "M", "None", "M", "M", "M", "M", "M", "M", "M", "M", 
    "M", "M", "M", "M", "None", "None")), .Names = c("drunk", 
"DriverAge", "DGender", "location", "PassAge", "PGender"), row.names = c(NA, 
-100L), class = "data.frame")

Now, how do I go about analyzing the data? Will my method depend on what I want to model or analyze?
My question is more general than this particular data set. Should we always analyze all the available data together? For example, suppose I want to see how the passenger's age and gender affect whether the driver is drunk. One thing I can do is an ANOVA with respect to the passenger's gender. One can also fit a binary logistic regression to all the data and study the marginal effects or odds ratios for the passenger's gender. How about splitting the data into two sets based on the passenger's gender, fitting two separate models, and comparing the coefficients? To me this seems wrong, since I have not seen anything like it. Can anyone explain why that is so?
You may also give a hint on how you would go about analyzing this data from scratch.

You should really say more about how these data were obtained, and specifically how the response variable (use of alcohol) was defined. — kjetil b halvorsen, Aug 08 '12 at 04:13

score 6 · Accepted Answer · edited Apr 13 '17 at 12:44

I think most of your questions are really about the general modeling strategy to be used in data analysis:

1. I think you should always customize your modeling strategy based on what your substantive concerns are.

2 / 3. The best general approach is to include all of the variables that you care about in one larger model, rather than fitting separate models to subsets of the data. There are two main reasons for this:

If the response differs depending on whether a particular covariate takes a given value, that difference can be properly tested within the larger model. Interactions (whether the effect of another covariate depends on the value of this one) can also be properly tested there.
Your model can borrow information from the cases where a covariate takes on the other value to estimate all of the included effects more precisely. For example, you will get a better estimate of the intercept when all the data (e.g., both male and female cases) are included, simply because there are more data to use for the estimate.
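
As a rough illustration of the first point, here is a minimal sketch in R. The data are simulated (the values are hypothetical; only the column names `drunk`, `DriverAge`, and `PGender` mirror the question), and the effect sizes are assumptions chosen for the example. It fits one pooled logistic model with and without an interaction, tests the interaction with a likelihood-ratio test, and then fits the split models for comparison.

```r
# Simulated stand-in for the question's data; values are hypothetical.
set.seed(1)
n <- 400
sim <- data.frame(
  DriverAge = factor(sample(c("MidAge", "Old", "Young"), n, replace = TRUE)),
  PGender   = factor(sample(c("F", "M", "None"), n, replace = TRUE))
)
# Assumed true log-odds: two main effects and no interaction.
eta <- -0.5 + 0.8 * (sim$DriverAge == "Young") + 0.6 * (sim$PGender == "M")
sim$drunk <- rbinom(n, 1, plogis(eta))

# Pooled models on all the data: main effects only, then with the interaction.
fit_main <- glm(drunk ~ DriverAge + PGender, family = binomial, data = sim)
fit_int  <- glm(drunk ~ DriverAge * PGender, family = binomial, data = sim)

# Likelihood-ratio test: does the DriverAge effect differ across PGender levels?
lrt <- anova(fit_main, fit_int, test = "Chisq")
print(lrt)

# Split approach: one model per passenger-gender subset.
split_dev <- sapply(split(sim, sim$PGender), function(d) {
  deviance(glm(drunk ~ DriverAge, family = binomial, data = d))
})

# The fully interacted pooled model reproduces the split fits,
# so their deviances add up to the pooled deviance.
print(c(pooled = deviance(fit_int), split_total = sum(split_dev)))
```

In this sketch the split fits carry no information that the interaction model lacks, but only the pooled fits give a direct test of whether the subsets differ, and the main-effects model uses every row to estimate each coefficient.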

(Note that there are some lower-frequency cases that appear to diverge from this advice. One I can think of concerns survival analyses; the commonly used Cox model assumes proportional hazards, and when a covariate violates this assumption, a stratified approach can be used. Even in this case, however, all strata are fit together, so that the estimates 'borrow strength' across the strata. I'm hard pressed to think of a situation where you'd want to break up your data and fit several smaller models independently.)
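
For readers unfamiliar with the stratified Cox model mentioned above, a minimal sketch follows. It assumes the `survival` package is installed and uses its bundled `lung` data set, which is unrelated to the question's data; the choice of covariates is only for illustration.

```r
library(survival)  # assumed to be installed; `lung` is its example data set

# Stratifying on sex gives each sex its own baseline hazard, while the
# coefficients for age and ph.ecog are estimated jointly across both strata.
fit_strat <- coxph(Surv(time, status) ~ age + ph.ecog + strata(sex),
                   data = lung)
print(fit_strat)

# Check the proportional hazards assumption for the remaining covariates.
print(cox.zph(fit_strat))
```

Note that `sex` gets no coefficient of its own here: it enters only through the separate baseline hazards, while a single set of covariate effects is shared across strata.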

Relative to this specific case:

2. I don't think you should do a standard ANOVA here, as your response variable is binary. Your best bet is to use a logistic regression model. (For more information about GLiMs, you may find the answer I wrote here to be helpful, although it was written in a different context.)
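
A minimal sketch of reading odds ratios off such a logistic regression is given below. The data are again simulated (hypothetical values, with column names borrowed from the question and an assumed effect for driver gender only), and the intervals are Wald intervals from `confint.default`, which is one of several reasonable choices.

```r
# Simulated stand-in for the question's data; values are hypothetical.
set.seed(42)
n <- 300
sim <- data.frame(
  DGender = factor(sample(c("F", "M"), n, replace = TRUE)),
  PGender = factor(sample(c("F", "M", "None"), n, replace = TRUE))
)
# Assumed truth: only driver gender shifts the log-odds of being drunk.
sim$drunk <- rbinom(n, 1, plogis(-0.3 + 0.7 * (sim$DGender == "M")))

# Logistic regression: a GLM with binomial family and (default) logit link.
fit <- glm(drunk ~ DGender + PGender, family = binomial, data = sim)

# Exponentiate coefficients and Wald 95% limits to get odds ratios.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 2))
```

Each row other than the intercept is the odds of `drunk == 1` at that factor level relative to the reference level (`"F"` for both factors here), holding the other covariate fixed.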

For (2.), what if the response variable is not binary but continuous? Would ANOVA be the way to go? — Stat-R, Aug 07 '12 at 16:42
If you had a continuous, vaguely-normalish response variable, and all the covariates that you were interested in were categorical in nature, then yes, an ANOVA would be the way to go. — gung - Reinstate Monica, Aug 07 '12 at 16:44
