Binomial-CDF/binomial-test for classifier significance testing

Question

General Problem Description and Goal

I set up a matching algorithm that matches a user input (a string) against a list of possible values (words). The list is exhaustive but very large (a five-digit number of entries). I want to evaluate the quality of the algorithm in terms of accuracy and attach a significance level to the result, without comparing it to any other algorithm. The problem arose because I cannot restrict the user input: it is pulled from a website that I do not operate, and it is partly incorrect, e.g. due to typos, although I know all possible meaningful inputs. The possible inputs are not perfectly uniformly distributed, as some words appear more often than others; nevertheless, there is no predominant class.

Approach

Inspired by this site, I thought of simply using the binomial CDF, since I believe the results of my matching algorithm are Bernoulli distributed: X_i = 1 if the match is correct and X_i = 0 if it is false. I therefore drew a random sample of 1000 results and manually checked whether each one was correct. That resulted in 15 false matches and 985 correct matches. I then calculated the CDF for k = 985, n = 1000 and p = 0.95, where p = 0.95 corresponds to my algorithm being 95% correct. That gave me a value of 0.99999999905952, which I take to allow me to reject H_0, that my algorithm produces fewer than 95% correct matches, at the 1% significance level.
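The calculation above can be reproduced with a short sketch using only the Python standard library. It assumes the 1000 inspected results are independent Bernoulli trials, and the helper names (`binom_pmf`, `upper_tail`) are made up for illustration. Note that for a one-sided test of H_0: p <= 0.95 the quantity usually compared with the significance level is the upper tail P(X >= 985), not the CDF value itself; one minus the CDF at 985 gives P(X > 985), which leaves out the observed count. If SciPy is available, `scipy.stats.binomtest(985, 1000, p=0.95, alternative="greater")` should give the same one-sided p-value.

```python
from math import comb, fsum

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """P(X >= k), summed directly to avoid computing 1 - (value close to 1)."""
    return fsum(binom_pmf(i, n, p) for i in range(k, n + 1))

n = 1000      # manually inspected results
k = 985       # correct matches among them
p0 = 0.95     # accuracy under the null hypothesis H_0: p <= 0.95
alpha = 0.01  # significance level from the question

tail_above = upper_tail(k + 1, n, p0)  # P(X > 985) = 1 - CDF(985)
p_value = upper_tail(k, n, p0)         # P(X >= 985), one-sided p-value
reject_h0 = p_value < alpha

print(f"1 - CDF(985) = {tail_above:.4e}")
print(f"one-sided p-value P(X >= 985) = {p_value:.4e}")
print(f"reject H_0 at alpha = {alpha}: {reject_h0}")
```

Both tail probabilities are far below 0.01 here, so the conclusion at the 1% level is the same whichever of the two is used; the distinction matters more when the observed count is close to the rejection boundary.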

Problem and Question

Is that a common and acceptable approach? If not, what kind of test should I consider?

Are there any publications that are a good source for this approach? I only found the website linked above.

Of possible interest: Benavoli, Alessio, et al. "Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis." The Journal of Machine Learning Research 18.1 (2017): 2653-2688. https://www.jmlr.org/papers/volume18/16-305/16-305.pdf — Dave, May 04 '21 at 10:51
But do note the issues with accuracy as a performance metric: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en — Dave, May 04 '21 at 10:52
