
A question about whether a statistical significance test makes sense when you are testing an algorithm on the same data every time (i.e. not doing any sampling).

I have an algorithm that receives some data as input, and its output can be evaluated in terms of accuracy. There are many variations of this algorithm, and I'd like to know which one is best in terms of accuracy.

The only way I can currently test the algorithm is on some data that I collected from my users a while ago (1000 data points). This means that the same 1000 data points will be used for evaluation each time I test a variation of my algorithm.

So given the following evaluation results, obtained using the same 1000 data points:

  • base algorithm: 71% accuracy
  • treatment algorithm: 73% accuracy

I'd like to know how certain I can be that the treatment algorithm is really better. Does doing a statistical significance test make sense here, given that I'm testing both the base and treatment algorithms on all the data that I have (i.e. I'm not doing any sampling)? To me it seems like my entire dataset can be treated as the population, and thus a statistical test makes little sense. If my assumption is incorrect and I should instead be doing statistical significance testing, what test would make sense in my case?
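
If a significance test does make sense here, I assume it would have to operate on the paired per-example outcomes rather than on the two accuracy numbers alone. Here is a minimal sketch of what I have available (`base_predict`, `treatment_predict`, `X`, and `y` are placeholders for my two variants and the 1000 stored data points; McNemar's exact test is shown only as one possible paired test, not necessarily the right one):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar  # one candidate paired test

# Placeholders: `base_predict` / `treatment_predict` stand in for the two
# algorithm variants, `X` / `y` for the 1000 stored data points and labels.
y_true = np.asarray(y)                            # shape (1000,)
base_correct = base_predict(X) == y_true          # per-example correctness, base
treat_correct = treatment_predict(X) == y_true    # per-example correctness, treatment

print("base accuracy:     ", base_correct.mean())   # ~0.71
print("treatment accuracy:", treat_correct.mean())  # ~0.73

# 2x2 table of where the two variants agree/disagree on being correct
table = np.array([
    [np.sum(base_correct & treat_correct),  np.sum(base_correct & ~treat_correct)],
    [np.sum(~base_correct & treat_correct), np.sum(~base_correct & ~treat_correct)],
])

# McNemar's exact test uses only the discordant cells of this table
print(mcnemar(table, exact=True))
```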

  • See [What is the difference between a population and a sample?](https://stats.stackexchange.com/questions/269/what-is-the-difference-between-a-population-and-a-sample). – user2974951 Mar 04 '22 at 11:17
  • I had a good read and it's indeed an interesting post, but I wasn't able to answer my question with it. I guess my question boils down to the fact that I sampled 1000 data points once, and now I'm testing all my algorithms on that very same dataset. So the question is: does the fact that I use the same data points for each validation test make my dataset the population, meaning a statistical test makes little sense? – wanttoaskstupidquestions Mar 04 '22 at 11:47
  • Why not test your algorithm on random *samples* of your dataset? This permits you to conceive of your dataset as a *population* (whose characteristics might be similar to future populations to which the algorithm will be applied) and enables you to compare the algorithm's results for any subsample to the properties of the held-out sample. This procedure, when applied in an automatic but principled way, is generally known as *cross-validation* -- and you can read a tremendous amount about it here on *Cross Validated!* – whuber Mar 04 '22 at 14:25
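
Following up on the resampling idea in the last comment, here is a rough sketch of one way it could look (a paired bootstrap over the 1000 evaluation points, reusing the hypothetical `base_correct` / `treat_correct` arrays from the sketch above; this is just one interpretation of the suggestion, not necessarily what was meant):

```python
import numpy as np

# Paired bootstrap over the evaluation set: resample the 1000 indices with
# replacement and recompute the accuracy difference on each resample.
rng = np.random.default_rng(0)
n = len(base_correct)
diffs = np.empty(10_000)
for b in range(diffs.size):
    idx = rng.integers(0, n, size=n)  # one bootstrap resample of the data points
    diffs[b] = treat_correct[idx].mean() - base_correct[idx].mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"observed difference:    {treat_correct.mean() - base_correct.mean():.3f}")
print(f"95% bootstrap interval: [{lo:.3f}, {hi:.3f}]")
```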

0 Answers