
I remember reading that when one needs to predict rare events (e.g. with a naive Bayes classifier or logistic regression), it is supposedly smart to simply predict the reverse of the rare event: since the reverse is far more common, one can get far higher accuracy.
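
(Just to illustrate the intuition I have in mind, with made-up data at roughly the ~4% positive rate my data has below:)

# illustration only: on a heavily imbalanced 0/1 outcome, always predicting
# the common class already gives very high accuracy
y         <- rbinom(1000, 1, 0.04)   # ~4% rare events
pred_all0 <- rep(0, length(y))       # constant "predict the common class"
mean(pred_all0 == y)                 # accuracy around 0.96 without any model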

I just tried both and discovered that it actually did not matter at all. I admit I did some copy-pasting, but I checked it multiple times and did not see a mistake in my code. It does give slightly different accuracies, so I figure the two pieces of code cannot be pure duplicates.

Here is a quick glance at my code and the data.

# A tibble: 574 x 5
   Survived Pclass   Sex   Age  Fare
      <int>  <dbl> <dbl> <dbl> <dbl>
 1        0      3     0    22  7.25
 2        0      3     0    35  8.05
 3        0      3     0    NA  8.46
 4        0      1     0    54 51.9 
 5        0      3     0     2 21.1 
 6        0      3     0    20  8.05
 7        0      3     0    39 31.3 
 8        0      3     1    14  7.85
 9        0      3     0     2 29.1 
10        0      3     1    31 18   
# … with 564 more rows

# summary
    Survived           Pclass           Sex              Age             Fare        
 Min.   :0.00000   Min.   :1.000   Min.   :0.0000   Min.   : 1.00   Min.   :  0.000  
 1st Qu.:0.00000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:  7.859  
 Median :0.00000   Median :3.000   Median :0.0000   Median :28.00   Median : 10.500  
 Mean   :0.04355   Mean   :2.512   Mean   :0.1742   Mean   :30.65   Mean   : 23.557  
 3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:39.00   3rd Qu.: 26.000  
 Max.   :1.00000   Max.   :3.000   Max.   :1.0000   Max.   :74.00   Max.   :512.329  
                                                    NAs    :128                     

# model 1: predict Survived directly
log_mod_1 <- glm(Survived ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')
pred_1    <- predict(log_mod_1, tr_subs, type = 'response')
pred_1    <- ifelse(pred_1 > 0.5, 1, 0)
mean(pred_1 == tr_subs$Survived, na.rm = TRUE)

# model 2: predict the reverse event (notSurvived)
log_mod_0 <- glm(notSurvived ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')
pred_0    <- predict(log_mod_0, tr_subs, type = 'response')
pred_0    <- ifelse(pred_0 > 0.5, 1, 0)
mean(pred_0 == tr_subs$notSurvived, na.rm = TRUE)

What I tried: I checked whether Pclass and Sex could cause problems as factors, and they should not. (I tested it, though I am not sure whether that is mathematically correct when estimating those thetas.)

Even after removing the NA's I got the same result.

Question: What is going on with this approach? Is it by chance, due to the data, that I don't achieve higher accuracies? Is it the method that doesn't do what I intended? Or did I, even after close to two hours of trying to figure it out, overlook a mistake in my code?

gung - Reinstate Monica
thebilly
  • I see a threshold of 0.5 in your code. You may be interested in [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352). Also very relevant: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) and [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) – Stephan Kolassa May 08 '19 at 18:32
  • You say "predicting rare events" but it sounds like you're actually predicting their *probability*. Note that under the usual iid assumptions the variance of the sample proportion is the same for both $\hat{p}$ and $\widehat{1-p}$. – Glen_b May 09 '19 at 05:17

1 Answer


You are using logistic regression, which is based on a binomial model. Also, you are not predicting rare events but their probabilities. As for the events themselves: predicting that the event $E$ occurs is the same as predicting that the complementary event, $E^c$, does not occur, so one cannot be predicted more accurately than the other; it is really the same prediction.
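
To see why for logistic regression in particular, here is a sketch of the algebra (writing $x$ for the predictor vector, including the intercept, and $\beta$ for the coefficients):

$$
P(Y = 1 \mid x) = \frac{1}{1 + e^{-x^\top \beta}}
\qquad\text{and}\qquad
P(Y = 0 \mid x) = 1 - P(Y = 1 \mid x) = \frac{1}{1 + e^{x^\top \beta}},
$$

so fitting the same model to $1 - Y$ simply flips the sign of every coefficient, the two fitted probabilities sum to one, and thresholding both at $0.5$ gives exactly complementary predictions and therefore identical accuracy.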

And for a random variable $X \sim \mathcal{Binom}(n, p)$, the variance of $X$ is $np(1-p)$, which is the same as the variance of $n - X \sim \mathcal{Binom}(n, 1-p)$. So there is no difference. Also, note the comment above by @Stephan Kolassa and read [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) (and the other links there).
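
If you want to convince yourself numerically, here is a minimal sketch in R, assuming a data frame tr_subs like the one in your question and that notSurvived is just 1 - Survived:

# assumed construction of the complement outcome (not shown in the question)
tr_subs$notSurvived <- 1 - tr_subs$Survived

m1 <- glm(Survived    ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')
m0 <- glm(notSurvived ~ Pclass + Sex + Age + Fare, data = tr_subs, family = 'binomial')

coef(m1) + coef(m0)     # ~ 0: the second fit just flips the sign of every coefficient
p1 <- predict(m1, tr_subs, type = 'response')
p0 <- predict(m0, tr_subs, type = 'response')
range(p1 + p0, na.rm = TRUE)                            # fitted probabilities sum to 1
mean((p1 > 0.5) == tr_subs$Survived,    na.rm = TRUE)   # same accuracy ...
mean((p0 > 0.5) == tr_subs$notSurvived, na.rm = TRUE)   # ... as this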

kjetil b halvorsen