The response variable is binary (dead = "1", not dead = "0"), and there are 4 numeric independent variables. I tried logistic regression, and 2 of the independent variables are significant. However, the prediction is bad because every predicted outcome is "0". I think that is because there are many more "0"s than "1"s in the data. Is there a package in R that contains some kind of correction for this? Should I be using a logistic model at all? If I use a classification method like LDA, will it have the same problem?
Note that a logistic regression model lets you predict the probability of "dead" for each individual. Since, as you say, "dead" is a rare event, it doesn't necessarily reflect badly on the model that it predicts "not dead" as everyone's most likely outcome: that merely says that all the predicted probabilities are below one half (I'm guessing that's the cut-off you used, as you didn't mention another one). If you want to take different actions according to how people are classified, then use a different cut-off calculated from the costs of misclassifying them. – Scortchi - Reinstate Monica Nov 06 '14 at 17:54
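To make the cost-based cut-off concrete, here is a minimal sketch. It assumes only two costs matter: a false positive (calling someone "dead" who is not) and a false negative (missing a "dead" case). Under that assumption, predicting "1" minimises expected cost whenever p > c_FP / (c_FP + c_FN). The function names, the cost values, and the predicted probabilities are all made up for illustration; they are not from the question.

```python
def cost_based_cutoff(cost_false_positive, cost_false_negative):
    """Probability threshold that minimises expected misclassification cost.

    Predict "1" when p * cost_false_negative > (1 - p) * cost_false_positive,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(probabilities, cutoff):
    """Turn predicted probabilities into 0/1 labels using the given cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]


# Hypothetical predicted probabilities of "dead" from a fitted logistic model.
predicted = [0.02, 0.08, 0.15, 0.30, 0.45]

# Assumed costs: missing a "dead" case is 9 times worse than a false alarm.
cutoff = cost_based_cutoff(cost_false_positive=1.0, cost_false_negative=9.0)

labels_default = classify(predicted, 0.5)   # everything is labelled 0
labels_cost = classify(predicted, cutoff)   # some cases are now labelled 1
```

With equal costs the formula gives the usual 0.5 cut-off; the more expensive a missed "1" is relative to a false alarm, the lower the cut-off goes.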
1 Answer
I've recently worked on a problem like that. You should take a look at this paper, which tackles exactly the problem you are describing. If you need any more information, don't hesitate to ask.
http://gking.harvard.edu/files/gking/files/0s.pdf
King, Gary, and Langche Zeng. 2001. "Logistic Regression in Rare Events Data." Political Analysis 9: 137–163. Copy at http://j.mp/lBZoIi
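One of the corrections described in that paper is the "prior correction" of the intercept, which applies when the sample contains a different fraction of events than the population (for example, when all the "1"s were kept but only some of the "0"s). The sketch below is a rough illustration of that formula only; it assumes the population fraction of events is known from outside the data, and the function names and numbers are hypothetical. Whether this situation matches the data in the question is not stated, and the paper's separate small-sample bias correction is not shown here.

```python
import math


def prior_corrected_intercept(intercept, sample_fraction, population_fraction):
    """Adjust a logistic intercept fitted on an outcome-selected sample.

    intercept:            intercept estimated on the sample
    sample_fraction:      fraction of "1"s in the sample (y-bar)
    population_fraction:  assumed fraction of "1"s in the population (tau)
    """
    odds_ratio = ((1 - population_fraction) / population_fraction) * (
        sample_fraction / (1 - sample_fraction)
    )
    return intercept - math.log(odds_ratio)


def predicted_probability(intercept, slopes, x):
    """Logistic probability for one observation x given the coefficients."""
    linear = intercept + sum(b * v for b, v in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical example: a balanced sample (50% events) drawn from a
# population where only 5% of cases are events.
raw_intercept = 0.0
corrected = prior_corrected_intercept(raw_intercept, 0.5, 0.05)

# Probability for an observation with all predictors at zero.
p_raw = predicted_probability(raw_intercept, [0.8], [0.0])
p_corrected = predicted_probability(corrected, [0.8], [0.0])
```

Only the intercept changes; the slope estimates are left as they are. If the sample and population fractions are equal, the correction is zero and the fitted model is unchanged.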

jmnavarro