
I'm working on analyzing a dataset with the following structure:

+------------------------------+
| gender race state died alive |
+------------------------------+
|   f     w     ny   40   450  |
|   m     w     ny   20   300  |
|   f     b     ny   45   300  |
+------------------------------+

etc

Since this dataset contains $2$ genders, $2$ races, and $5$ states, there are $2\times2\times5 = 20$ rows. The died and alive columns give the number of people who died and who survived within each combination of categories.

My question is: how do I test for significance in a dataset like this, particularly in R? I want to see whether the given categories have an effect on the number of people who died or lived, but I'm not sure how to go about it. Every test I've researched doesn't quite seem to fit the problem.

One thing I also tried was adding an extra column holding the proportion who died, i.e. in R:

mydata$percentage = mydata$died / (mydata$died + mydata$alive)
Karolis Koncevičius
  • Your dependent variables ('died' and 'alive') are not categorical. And are you sure you want to do a test on the *number* who died or lived? That would be heavily influenced by the sample in each state; your final thought on the percentage seems like a better approach (I'm making some assumptions here - we don't know much about your data). – mkt May 03 '18 at 16:57
  • My main worry is that the size of state would get lost with the conversion to percentages. But if I do go with that approach, I'm still not sure how to test whether the categories (state, gender, race) had an influence. – thedoggoperson May 03 '18 at 17:04
  • @mkt, `alive` vs `died` is categorical. It's a binomial – gung - Reinstate Monica May 03 '18 at 18:39

1 Answer


Per discussion with @mkt and after a quick search, I think the easiest approach is to follow the "one numeric predictor" example in this tutorial.

Your code would look something like:

glm(cbind(died, alive) ~ gender + state + race, family = binomial(logit), data = yourdata)

I'm pretty sure this will get you what you need. I would be tempted to set this up as a hierarchical model with observations nested within states, but I'm not sure how to do that with aggregate values, and mkt is right that your random effect might not be stable.
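As a rough sketch of how the model above could be fitted and tested, the following uses a small made-up data frame. Only the first three rows come from the question; the remaining rows, the second state `ca`, and the name `toy` are hypothetical and exist only so that every factor has two levels. It also fits the same model on the proportion with `weights` set to the group size, which should give the same coefficients and suggests that group size is not lost when working with proportions this way.

```r
# Toy data: rows 1-3 are from the question; all other rows are INVENTED
# for illustration (including the hypothetical second state "ca").
toy <- data.frame(
  gender = c("f", "m", "f", "m", "f", "m", "f", "m"),
  race   = c("w", "w", "b", "b", "w", "w", "b", "b"),
  state  = c("ny", "ny", "ny", "ny", "ca", "ca", "ca", "ca"),
  died   = c(40, 20, 45, 30, 35, 25, 50, 28),
  alive  = c(450, 300, 300, 280, 500, 350, 320, 260),
  stringsAsFactors = TRUE
)

# Binomial (logistic) regression on aggregated counts:
# the response is a two-column matrix of (successes, failures).
fit <- glm(cbind(died, alive) ~ gender + race + state,
           family = binomial(link = "logit"), data = toy)

# Likelihood-ratio test for each term, adjusted for the other terms.
tests <- drop1(fit, test = "Chisq")
print(tests)

# Equivalent formulation: model the proportion and pass the group size
# as weights, so larger groups still carry more information.
toy$total <- toy$died + toy$alive
toy$prop  <- toy$died / toy$total
fit_prop <- glm(prop ~ gender + race + state,
                family = binomial(link = "logit"),
                weights = total, data = toy)

# The two parameterisations should agree on the coefficients.
print(all.equal(coef(fit), coef(fit_prop), tolerance = 1e-6))
```

`summary(fit)` would additionally show Wald tests for the individual coefficients; with the real 20-row data, `state` would contribute four coefficients rather than one.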

gung - Reinstate Monica
Tdisher