I have a typical train/test setting with an ordinary dataset. As I am comparing the performance of two approaches to a problem (namely churn prediction with AdaBoost and BG/NBD), I would like to estimate confidence intervals for the results and test whether the accuracy scores are really higher for one approach than for the other.
To do so, I would like to use the bootstrap. I am currently doing the following (a code sketch is given after the list):
for n times:

1. bootstrap the training dataset,
2. fit the models,
3. use the models to predict the target on the test data,
4. store the models' accuracy and F1-score;

and afterwards:

5. compare accuracy and F1-score using a two-sample t-test.
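In code, the loop looks roughly like this. It is a minimal sketch assuming scikit-learn conventions; `X_train`, `y_train`, `X_test`, `y_test` are my prepared arrays, and `make_bgnbd_classifier` is a hypothetical placeholder standing in for my BG/NBD pipeline:

```python
from scipy import stats
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import resample

n_iterations = 1000
acc_a, acc_b = [], []   # per-iteration accuracy of each approach
f1_a, f1_b = [], []     # per-iteration F1-score of each approach

for i in range(n_iterations):
    # 1. bootstrap the training dataset (sampling with replacement)
    X_boot, y_boot = resample(X_train, y_train, random_state=i)

    # 2. fit the models
    model_a = AdaBoostClassifier(random_state=i).fit(X_boot, y_boot)
    model_b = make_bgnbd_classifier().fit(X_boot, y_boot)  # hypothetical wrapper

    # 3. predict the target on the (fixed) test data
    pred_a = model_a.predict(X_test)
    pred_b = model_b.predict(X_test)

    # 4. store accuracy and F1-score
    acc_a.append(accuracy_score(y_test, pred_a))
    acc_b.append(accuracy_score(y_test, pred_b))
    f1_a.append(f1_score(y_test, pred_a))
    f1_b.append(f1_score(y_test, pred_b))

# 5. compare the score distributions with a two-sample t-test
t_acc, p_acc = stats.ttest_ind(acc_a, acc_b)
t_f1, p_f1 = stats.ttest_ind(f1_a, f1_b)
```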
I am not sure whether this approach makes sense. From my provisional results it seems that some of the score distributions are not normally distributed. Eyeballing the results, it also seems that I can artificially lower the variance simply by increasing the number of bootstrap iterations n, and thus always end up rejecting the null hypothesis.
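To illustrate that last concern with a toy example (synthetic numbers, not my actual scores): since I choose n myself, the standard error used by the t-test shrinks as n grows, so even a negligible difference in mean scores eventually comes out as "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two score "distributions" whose means differ by only 0.001
for n in (30, 300, 3000):
    a = rng.normal(loc=0.800, scale=0.01, size=n)
    b = rng.normal(loc=0.801, scale=0.01, size=n)
    print(n, stats.ttest_ind(a, b).pvalue)
# the p-value tends toward 0 as n grows, regardless of practical relevance
```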
My question is: what are the weak points of this approach? What can I improve?
Edit: I am aware that neither accuracy nor F1-score is the best metric for model evaluation, and I therefore decided to also report precision and recall alongside them. I did not mention this in the original question, as I thought it was not necessary for the point I am trying to make. Apart from that, even though my dataset is imbalanced (38% vs. 62%), the imbalance is not severe, so I decided to report accuracy as well.