A question about whether a statistical significance test makes sense when you are testing an algorithm on the same data every time (i.e. not doing any sampling).
I have an algorithm that receives some data as input, and its output can be evaluated in terms of accuracy. There are many variations of this algorithm, and I'd like to know which one is best in terms of accuracy.
The only way I can currently test the algorithm is on some data that I collected from my users a while ago (1000 data points). This means that the same 1000 data points will be used for evaluation each time I test a variation of my algorithm.
So, given the following evaluation results obtained on the same 1000 data points:
- base algorithm: 71% accuracy
- treatment algorithm: 73% accuracy
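To make the setup concrete, here is a minimal sketch of my evaluation. The correctness indicators are simulated stand-ins here (in reality, `base_correct` and `treat_correct` would record whether each variant got each of the 1000 points right), but the key point is the paired structure: both variants are scored on exactly the same items.

```python
import random

# Hypothetical stand-ins for per-item correctness on the fixed dataset;
# in reality these come from running each variant on the same 1000 points.
random.seed(0)
n = 1000
base_correct = [random.random() < 0.71 for _ in range(n)]
treat_correct = [random.random() < 0.73 for _ in range(n)]

base_acc = sum(base_correct) / n
treat_acc = sum(treat_correct) / n

# Because both variants see the same items, each item falls into one
# of four agreement cells rather than two independent samples.
both = sum(b and t for b, t in zip(base_correct, treat_correct))
only_base = sum(b and not t for b, t in zip(base_correct, treat_correct))
only_treat = sum(t and not b for b, t in zip(base_correct, treat_correct))
neither = n - both - only_base - only_treat
```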
I'd like to know how certain I can be that the treatment algorithm is really better. Does a statistical significance test make sense here, given that I'm testing both the base and treatment algorithms on all the data that I have (i.e. I'm not doing any sampling)? To me it seems like my entire dataset can be treated as the entire population, and thus a statistical test makes little sense. If my assumption is incorrect and I should instead be doing statistical significance testing, what test would make sense in my case?