I have a typical train/test setting with an ordinary dataset. As I am comparing the performance of two approaches to a problem (namely churn prediction with AdaBoost and BG/NBD), I would like to estimate confidence intervals for the results and test whether the accuracy scores are really higher for one approach than for the other.
To do so, I would like to use the bootstrap. I am currently doing the following (a code sketch is given after the list):
for n times:

1. bootstrap the training dataset,
2. fit the models,
3. use the models to predict the target on the test data,
4. store the models' accuracy and F1-score;

and afterwards:

5. compare accuracy and F1-score using a two-sample t-test.
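In code, the loop looks roughly like this. It is a minimal sketch assuming scikit-learn conventions; `X_train`, `y_train`, `X_test`, `y_test` are my prepared arrays, and `make_bgnbd_classifier` is a hypothetical placeholder standing in for my BG/NBD pipeline:

```python
from scipy import stats
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import resample

n_iterations = 1000
acc_a, acc_b = [], []   # per-iteration accuracy of each approach
f1_a, f1_b = [], []     # per-iteration F1-score of each approach

for i in range(n_iterations):
    # 1. bootstrap the training dataset (sampling with replacement)
    X_boot, y_boot = resample(X_train, y_train, random_state=i)

    # 2. fit the models
    model_a = AdaBoostClassifier(random_state=i).fit(X_boot, y_boot)
    model_b = make_bgnbd_classifier().fit(X_boot, y_boot)  # hypothetical wrapper

    # 3. predict the target on the (fixed) test data
    pred_a = model_a.predict(X_test)
    pred_b = model_b.predict(X_test)

    # 4. store accuracy and F1-score
    acc_a.append(accuracy_score(y_test, pred_a))
    acc_b.append(accuracy_score(y_test, pred_b))
    f1_a.append(f1_score(y_test, pred_a))
    f1_b.append(f1_score(y_test, pred_b))

# 5. compare the score distributions with a two-sample t-test
t_acc, p_acc = stats.ttest_ind(acc_a, acc_b)
t_f1, p_f1 = stats.ttest_ind(f1_a, f1_b)
```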
I am not sure whether this approach makes sense. From my provisional results it seems that some of the score distributions are not normally distributed. Eyeballing the results, it also seems that I can artificially lower the variance simply by increasing the number of bootstrap iterations n, and thus always end up rejecting the null hypothesis.
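To illustrate that last concern with a toy example (synthetic numbers, not my actual scores): since I choose n myself, the standard error used by the t-test shrinks as n grows, so even a negligible difference in mean scores eventually comes out as "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two score "distributions" whose means differ by only 0.001
for n in (30, 300, 3000):
    a = rng.normal(loc=0.800, scale=0.01, size=n)
    b = rng.normal(loc=0.801, scale=0.01, size=n)
    print(n, stats.ttest_ind(a, b).pvalue)
# the p-value tends toward 0 as n grows, regardless of practical relevance
```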
My question is: what are the weak points of this approach? What can I improve?
Edit: I am aware that neither accuracy nor F1-score is the best metric for model evaluation, and I therefore decided to also report precision and recall alongside them. I did not mention this in the original question, as I thought it was not necessary for the point I am trying to make. Apart from that, even though my dataset is imbalanced (38% vs. 62%), the imbalance is not severe, so I decided to report accuracy as well.