So, let's say that I train two models on the same dataset. I run the experiment once and I get the following results:
- Using a Neural Network I get an AUC ROC of 0.941.
- Using Random Forest I get an AUC ROC of 0.947.
However, both algorithms involve some randomness internally, so if I ran the experiment again the results might vary slightly.
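For concreteness, below is roughly what I mean by "running the experiment again" (the dataset and hyperparameters are just placeholders; `make_classification` stands in for my actual data). I repeat the split and training with different seeds to see how much the AUCs move around:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data; my real dataset is different.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

nn_aucs, rf_aucs = [], []
for seed in range(10):
    # New split and new model initialisation on every run
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    nn = MLPClassifier(max_iter=500, random_state=seed).fit(X_tr, y_tr)
    rf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)

    nn_aucs.append(roc_auc_score(y_te, nn.predict_proba(X_te)[:, 1]))
    rf_aucs.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

print("NN: mean AUC %.3f (std %.3f)" % (np.mean(nn_aucs), np.std(nn_aucs)))
print("RF: mean AUC %.3f (std %.3f)" % (np.mean(rf_aucs), np.std(rf_aucs)))
```

Each run gives slightly different AUC values for both models, which is exactly the variability I am worried about.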
My question is: how should I measure/evaluate the statistical significance of this improvement? When is it safe to claim that one algorithm is doing better than the other?
Also, I have read a lot of machine learning papers that do not test the statistical significance of the difference between the proposed model and the baseline. So I guess that when the difference is big enough, there is no need to evaluate its significance? If so, when is the difference considered big enough? I'd love to hear what the community thinks about this.
Thanks a lot!