
I have a binary classifier for which I'd like to estimate the accuracy w.r.t. a test set $X$ of size $n$ (actually, I'm calculating the Matthews correlation coefficient, but I don't think it makes a difference). I'd also like to estimate a confidence interval (a range in which the true accuracy, over all possible inputs, is found with say 90% probability). Is using bootstrapping appropriate for this?

My understanding of the way this would work is that I'd draw $k$ bootstrap samples (say $k=50$), each of size $n$ and sampled with replacement from my test set, or rather from the classifier's predictions on these samples, compute the accuracy for each of these sample sets, and then take the range between the 5th and 95th percentiles of the resulting accuracies as my confidence interval. Is this correct?
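For concreteness, here is a rough sketch of the procedure I have in mind, in Python (assuming NumPy arrays of labels and predictions and scikit-learn's `matthews_corrcoef`; the names are just placeholders):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def percentile_bootstrap_ci(y_true, y_pred, k=50, alpha=0.10, seed=0):
    """Percentile-bootstrap CI for the MCC of fixed predictions on a test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # resample test cases with replacement
        stats.append(matthews_corrcoef(y_true[idx], y_pred[idx]))
    # 5th and 95th percentiles give a nominal 90% interval when alpha = 0.10
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```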

H.Rappeport

1 Answer


Your approach might lead you astray in a few ways.

The idea behind the bootstrap is that taking bootstrap samples from the original data sample is analogous to taking multiple original data samples from the population. This has some initially surprising implications.

First, this can lead to a problem in estimating confidence intervals (CIs). Your hope that it doesn't make a difference what you're calculating from the bootstraps is unrealistic. For bootstrapping to provide reliable CIs, the statistic you are calculating must be pivotal: its distribution can't depend on unknown parameters, like the actual correlation coefficient (as opposed to your estimate of it). For example, a t-statistic calculated on a sample from a normal distribution is pivotal.

Unfortunately, the skewness of the distribution of correlation-coefficient estimates depends on the actual value: it is roughly symmetric toward the middle (true values near 0 for the Matthews correlation) and necessarily skewed as you go out toward the limits (-1 and +1 here). And some statistics are so biased that the point estimate from the full sample falls outside the bootstrapped CI calculated your way! There are several flavors of bootstrap CI to deal with this; the BCa bootstrap (which corrects for both bias and skew) is often more reliable.
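As a concrete illustration (just a sketch, assuming SciPy ≥ 1.7, scikit-learn, and made-up labels and predictions), SciPy's `bootstrap` can compute a BCa interval for the MCC directly from the paired labels and predictions:

```python
import numpy as np
from scipy.stats import bootstrap
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # made-up test labels
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # predictions right ~80% of the time

res = bootstrap(
    (y_true, y_pred),
    lambda yt, yp: matthews_corrcoef(yt, yp),
    paired=True,             # resample (label, prediction) pairs together
    vectorized=False,
    confidence_level=0.90,
    n_resamples=2000,
    method="BCa",            # bias-corrected and accelerated interval
)
print(res.confidence_interval)
```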

Second, it looks like what you are proposing is to test your model from the entire data sample against each of a set of bootstrap samples. The bootstrap principle suggests that is going the wrong way. You want to estimate how well the model from your full sample is going to work on the entire population. The analogy to that, under the bootstrap principle, would be to develop full models from each of the bootstrapped samples, and then test them against the full data sample. Use the distribution of those correlation coefficients to estimate the CI, again with the BCa method being a fairly safe bet. This also provides a way to estimate bias and optimism in your statistic; see the middle part of my answer here, for example.
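A rough sketch of that refit-and-evaluate loop, with `fit_model` standing in for your whole model-building process (a hypothetical placeholder, not a real library function):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def bootstrap_model_mccs(X, y, fit_model, n_boot=500, seed=0):
    """Refit the full modelling procedure on each bootstrap sample and
    score every refitted model against the complete original sample.

    `fit_model(X, y)` is a stand-in for your model-building process;
    it should return an object with a `.predict` method.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    mccs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # bootstrap sample of the data
        model = fit_model(X[idx], y[idx])          # rebuild the model from scratch
        mccs.append(matthews_corrcoef(y, model.predict(X)))  # evaluate on the full sample
    return np.array(mccs)                          # feed these into a BCa CI routine
```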

Finally, you might want to reconsider the Matthews correlation coefficient as the measure of model performance. If your "classifier" returns probability estimates, like logistic regression, the confusion matrix needed to calculate it is probably just based on a default probability cutoff of 0.5, which implicitly treats false-positive and false-negative results as equally costly. That's not always the best cutoff choice. If your "classifier" doesn't return probability estimates, many on this site would recommend that you use one that does. See this page and many others on this site for why a proper scoring rule like log-loss or the Brier score is a better choice.
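If you do switch to a model that returns probabilities, both scores are available in scikit-learn; a minimal sketch with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

y_true = np.array([0, 1, 1, 0, 1])           # example 0/1 labels
p = np.array([0.1, 0.8, 0.6, 0.3, 0.9])      # example predicted probabilities of class 1

print(log_loss(y_true, p))                   # log-loss: lower is better
print(brier_score_loss(y_true, p))           # Brier score: lower is better
```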

EdM
  • 1/2 Thank you for the elaborate answer. Starting with your last point, my classifier does not output probability estimates. To make a long story short, it's a regression model, but one for which I'm interested, in addition to the mean regression error, in the correctness of the sign of the estimate (i.e. if the true value for a sample is 0.5, an output of -0.5 is worse than 1.5, even though they have the same error). Both error types are equivalent to me, but I expect that the distribution of inputs will be somewhat skewed, so the MCC serves me well. – H.Rappeport Jul 29 '20 at 11:33
  • 2/2 As for the second point, the model is fixed, and I'm attempting to estimate the true MCC over some given population from which I have sampled the test set, preferably with confidence intervals. This sounds well defined to me; is there something wrong with it? Are there more appropriate methods to use here? Your point about the skewness of the MCC distribution and the bias that introduces is well taken; can you recommend a source for learning about BCa bootstrapping? – H.Rappeport Jul 29 '20 at 11:33
  • @H.Rappeport [this paper](https://projecteuclid.org/euclid.ss/1032280214) explains different ways to get bootstrapped CIs, including BCa. BCa is one of 4 types provided by the [`boot` package](https://cran.r-project.org/package=boot) in R; I think it's also available via other statistical software. – EdM Jul 29 '20 at 15:45
  • @H.Rappeport the rest depends on what you mean by "the model is fixed." If the model _type_ is fixed and can be re-fitted to the bootstrapped samples, then my recommendation should work for MCC. If you must use only the model fit to the full data set, your approach is method (1) on [this page](https://stats.stackexchange.com/q/26537/28500). It's limited in that every case on which you test will have been used in building the model, so it says little about generalizability of the model or the model-building process. – EdM Jul 29 '20 at 16:02