I remember that whenever one needs to predict rare events (with a naive Bayes classifier or logistic regression), it is supposedly smart to predict the reverse of the rare event instead. Since the reverse is far more common, one should get far higher accuracy.
I just tried both approaches and discovered that it did not matter at all. I admit I did some copy-pasting, but I checked the code multiple times and did not see a mistake. The two versions do give slightly different accuracies, so I figure the code cannot be pure duplicates.
Here is a quick glance at my code and the data.
# A tibble: 574 x 5
   Survived Pclass   Sex   Age  Fare
      <int>  <dbl> <dbl> <dbl> <dbl>
 1        0      3     0    22  7.25
 2        0      3     0    35  8.05
 3        0      3     0    NA  8.46
 4        0      1     0    54 51.9
 5        0      3     0     2 21.1
 6        0      3     0    20  8.05
 7        0      3     0    39 31.3
 8        0      3     1    14  7.85
 9        0      3     0     2 29.1
10        0      3     1    31 18
# … with 564 more rows
# summary
Survived         Pclass         Sex             Age            Fare
Min.   :0.00000  Min.   :1.000  Min.   :0.0000  Min.   : 1.00  Min.   :  0.000
1st Qu.:0.00000  1st Qu.:2.000  1st Qu.:0.0000  1st Qu.:21.00  1st Qu.:  7.859
Median :0.00000  Median :3.000  Median :0.0000  Median :28.00  Median : 10.500
Mean   :0.04355  Mean   :2.512  Mean   :0.1742  Mean   :30.65  Mean   : 23.557
3rd Qu.:0.00000  3rd Qu.:3.000  3rd Qu.:0.0000  3rd Qu.:39.00  3rd Qu.: 26.000
Max.   :1.00000  Max.   :3.000  Max.   :1.0000  Max.   :74.00  Max.   :512.329
                                                NA's   :128
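# Model 1: predict the rare event (Survived == 1)
log_mod_1 <- glm(Survived ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')
pred_1 <- predict(log_mod_1, tr_subs, type = 'response')
pred_1 <- ifelse(pred_1 > 0.5, 1, 0)  # classify at a 0.5 probability threshold
mean(pred_1 == tr_subs$Survived, na.rm = TRUE)  # training accuracy, ignoring NA predictions

# Model 0: predict the common event (notSurvived == 1)
log_mod_0 <- glm(notSurvived ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')
pred_0 <- predict(log_mod_0, tr_subs, type = 'response')
pred_0 <- ifelse(pred_0 > 0.5, 1, 0)  # same threshold as above
mean(pred_0 == tr_subs$notSurvived, na.rm = TRUE)  # training accuracy, ignoring NA predictions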
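One way to compare the two fits directly is to check whether their fitted probabilities are complements of each other. The following is only a minimal sketch on simulated data, not on the Titanic subset: the data frame `toy` and the columns `x1`, `x2`, `rare` and `not_rare` are hypothetical, and it assumes the reversed target is built as one minus the original (the step that creates `notSurvived` is not shown above).

```r
set.seed(42)
n <- 500

# Simulated stand-in data (hypothetical, not tr_subs)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$rare     <- rbinom(n, 1, plogis(-3 + toy$x1))  # infrequent outcome
toy$not_rare <- 1 - toy$rare                       # assumed definition of the reverse

# Fit the same model to both codings of the target
fit_1 <- glm(rare ~ x1 + x2, data = toy, family = binomial)
fit_0 <- glm(not_rare ~ x1 + x2, data = toy, family = binomial)

p1 <- predict(fit_1, toy, type = "response")
p0 <- predict(fit_0, toy, type = "response")

# Training accuracy at a 0.5 threshold for each coding
acc_1 <- mean((p1 > 0.5) == toy$rare)
acc_0 <- mean((p0 > 0.5) == toy$not_rare)

# Compare p0 with 1 - p1, and the two accuracies
print(all.equal(unname(p0), unname(1 - p1), tolerance = 1e-6))
print(c(acc_1 = acc_1, acc_0 = acc_0))
```

Running the same comparison on `pred_1` and `pred_0` before thresholding (that is, on the raw `predict` output) would show whether the two Titanic models differ in more than the coding of the target.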
What I tried: I checked Pclass and Sex, and they should not cause problems as factors. (I tested it, but I am not sure whether that is mathematically correct when estimating the thetas.)
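A minimal sketch of the numeric versus factor coding check, again on simulated data with the same column names rather than on `tr_subs`. The data frame `toy` and its generating process are hypothetical, chosen only so the example runs on its own.

```r
set.seed(1)
n <- 300

# Simulated stand-in for tr_subs (hypothetical data, same column names)
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = rbinom(n, 1, 0.3),
  Age    = round(runif(n, 1, 74)),
  Fare   = round(rexp(n, 1 / 25), 2)
)
toy$Survived <- rbinom(n, 1, plogis(-1 + 1.5 * toy$Sex - 0.4 * toy$Pclass))

# Numeric coding: a single slope for Pclass and for Sex
fit_num <- glm(Survived ~ Pclass + Sex + Age + Fare,
               data = toy, family = binomial)

# Factor coding: one dummy coefficient per non-reference level
fit_fac <- glm(Survived ~ factor(Pclass) + factor(Sex) + Age + Fare,
               data = toy, family = binomial)

# Inspect which coefficients each coding estimates
print(names(coef(fit_num)))
print(names(coef(fit_fac)))
```

With a two-level variable such as Sex the two codings estimate the same single coefficient, while a three-level Pclass gets two dummy coefficients as a factor instead of one slope.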
Even after removing the NAs, I got the same result.
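For reference, a minimal sketch of the NA check on simulated data. It assumes R's default `na.action` option (`na.omit`), under which `glm` already drops incomplete rows before fitting. The data frame `toy` and the 40 injected missing ages are hypothetical.

```r
set.seed(2)

# Simulated stand-in data with missing ages (hypothetical, not tr_subs)
toy <- data.frame(
  Survived = rbinom(200, 1, 0.2),
  Age      = round(runif(200, 1, 74)),
  Fare     = round(rexp(200, 1 / 25), 2)
)
toy$Age[sample(200, 40)] <- NA  # inject 40 missing values

# Fit on the full data; incomplete rows are dropped by the default na.action
fit_all <- glm(Survived ~ Age + Fare, data = toy, family = binomial)

# Fit on complete cases only
toy_cc <- toy[complete.cases(toy), ]
fit_cc <- glm(Survived ~ Age + Fare, data = toy_cc, family = binomial)

# Compare the number of rows used and the estimated coefficients
print(c(used_by_glm = nobs(fit_all), complete_rows = nrow(toy_cc)))
print(all.equal(coef(fit_all), coef(fit_cc)))
```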
Question: What is going on with this approach? Is it just chance, due to the data, that I don't achieve higher accuracy? Does the method not do what I intended? Or did I, even after close to two hours of trying to figure it out, overlook a mistake in my code?