I have data on alcohol use by drivers, recorded as a binary response variable,
along with many other variables such as:
- Driver's age, gender and race
- Passenger's age and gender (if a passenger is present)
- Location variables
- Road characteristics
Here is a sample of the data:
data1 <- structure(list(drunk = c(1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 1), DriverAge = structure(c(1L, 3L, 2L, 1L, 2L, 2L,
1L, 3L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L,
2L, 3L, 3L, 2L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L,
3L, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L, 2L, 2L, 3L, 3L,
2L, 2L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 2L, 2L, 2L, 2L), .Label = c("MidAge",
"Old", "Young"), class = "factor"), DGender = structure(c(1L,
2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 2L), .Label = c("F", "M"), class = "factor"), location = structure(c(1L,
1L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L,
3L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 3L, 3L, 2L,
3L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L,
3L, 2L, 1L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L,
3L, 2L, 2L, 2L, 3L, 1L, 3L, 1L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L,
2L, 3L, 2L), .Label = c("other", "rural", "urban"), class = "factor"),
PassAge = structure(c(4L, 3L, 1L, 4L, 1L, 1L, 3L, 3L, 4L,
2L, 2L, 2L, 3L, 2L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 3L, 1L, 3L,
1L, 3L, 2L, 2L, 4L, 1L, 3L, 3L, 3L, 2L, 4L, 2L, 1L, 3L, 2L,
2L, 4L, 3L, 1L, 4L, 4L, 4L, 2L, 3L, 1L, 1L, 3L, 4L, 4L, 2L,
4L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 4L, 1L, 4L, 4L, 2L,
2L, 1L, 1L, 1L, 1L, 3L, 3L, 4L, 4L, 3L, 3L, 2L, 2L, 4L, 2L,
3L, 2L, 3L, 4L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 4L, 3L, 3L, 2L,
2L), .Label = c("MidAge", "None", "Old", "Young"), class = "factor"),
PGender = c("M", "M", "M", "M", "M", "M", "M", "M", "M",
"None", "None", "None", "M", "None", "M", "M", "M", "M",
"None", "M", "M", "M", "M", "M", "M", "M", "None", "None",
"M", "M", "M", "M", "M", "None", "M", "None", "M", "M", "None",
"None", "M", "M", "M", "M", "M", "M", "None", "M", "M", "M",
"M", "M", "M", "None", "M", "M", "M", "None", "M", "M", "M",
"M", "M", "M", "M", "M", "M", "M", "None", "None", "M", "M",
"M", "M", "M", "M", "M", "M", "M", "M", "None", "None", "M",
"None", "M", "None", "M", "M", "M", "M", "M", "M", "M", "M",
"M", "M", "M", "M", "None", "None")), .Names = c("drunk",
"DriverAge", "DGender", "location", "PassAge", "PGender"), row.names = c(NA,
-100L), class = "data.frame")
- How should I go about analyzing the data? Does the method depend on what I want to model or analyze?
- My question is more general than this particular data set. Should we always analyze all the available data together? For example, suppose I want to see how the passenger's age and gender affect whether the driver is drunk. One option is an ANOVA with respect to the passenger's gender. Another is a binary logistic regression on all the data, studying the marginal effects or odds ratios for the passenger's gender. What about splitting the data into two sets based on the passenger's gender, fitting a separate model to each, and comparing the coefficients? That seems wrong to me, since I have not seen it done anywhere. Can anyone explain why that is?
- You may also give a hint on how you would go about analyzing this data from scratch.
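
To make the comparison in the second question concrete, here is a minimal sketch of the options in R. It is only an illustration under stated assumptions: it runs on a simulated stand-in data frame (`sim`, with made-up values and the same column names as `data1`), and the choice of covariates is arbitrary. It fits (a) one pooled logistic regression, (b) two separate regressions after splitting on passenger gender, and (c) a single regression in which every covariate is interacted with passenger gender. Model (c) reproduces the coefficients of the split fits while also allowing a formal test of whether they differ.

```r
# Simulated stand-in with the same column names as data1 (values are made up,
# so the estimates themselves carry no meaning).
set.seed(1)
n <- 500
sim <- data.frame(
  drunk     = rbinom(n, 1, 0.5),
  DriverAge = factor(sample(c("MidAge", "Old", "Young"), n, replace = TRUE)),
  DGender   = factor(sample(c("F", "M"), n, replace = TRUE)),
  PGender   = factor(sample(c("M", "None"), n, replace = TRUE))
)

# (a) Pooled model: one fit on all rows, passenger gender as a covariate.
pooled <- glm(drunk ~ DriverAge + DGender + PGender,
              data = sim, family = binomial)
exp(coef(pooled))  # odds ratios on the pooled fit

# (b) Split approach: a separate fit within each passenger-gender group.
fit_M    <- glm(drunk ~ DriverAge + DGender,
                data = subset(sim, PGender == "M"), family = binomial)
fit_None <- glm(drunk ~ DriverAge + DGender,
                data = subset(sim, PGender == "None"), family = binomial)

# (c) One model on all rows, with every covariate interacted with PGender.
inter <- glm(drunk ~ (DriverAge + DGender) * PGender,
             data = sim, family = binomial)

# The main-effect terms of (c) are the coefficients of the reference group
# (PGender == "M"), so they should agree with fit_M up to numerical tolerance.
all.equal(coef(inter)[names(coef(fit_M))], coef(fit_M), tolerance = 1e-4)

# Likelihood-ratio test of (a) against (c): do the covariate effects
# differ between the passenger-gender groups?
anova(pooled, inter, test = "Chisq")
```

With the real data, `sim` would be replaced by `data1`. Note that in the sample above `PassAge` has a `"None"` level that appears to coincide with `PGender == "None"`, so including both in a split-by-gender fit would leave `PassAge` constant within the no-passenger subset.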