Your approach might lead you astray in a few ways.
The idea behind the bootstrap is that taking bootstrap samples from the original data sample is analogous to taking multiple original data samples from the population. This has some initially surprising implications.
First, this can lead to a problem in estimating confidence intervals (CIs). Your hope that it doesn't matter what statistic you're calculating from the bootstraps is unrealistic. For bootstrapping to provide reliable CIs, the statistic you are calculating should be (at least approximately) pivotal: its distribution can't depend on unknown parameters, like the actual correlation coefficient (as opposed to your estimate of it). For example, a t-statistic calculated on a sample from a normal distribution is pivotal.
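To make "pivotal" concrete: for an i.i.d. sample from a normal distribution,

$$ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}, $$

whose distribution depends only on the sample size $n$, not on the unknown $\mu$ or $\sigma$; that is what lets you turn its quantiles directly into a CI.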
Unfortunately, the skewness of the distribution of correlation-coefficient estimates depends on the actual value: roughly symmetric near the middle (true values near 0 for the Matthews correlation) and necessarily skewed as you move out toward the limits (-1 and +1 here). And some statistics are so biased that the point estimate from the full sample falls outside the bootstrapped CI calculated your way! There are several flavors of bootstrap CI to deal with this; the BCa bootstrap (which corrects for both bias and skew) is often the more reliable choice.
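As a mechanical illustration of the difference (setting aside, for the moment, the direction-of-resampling issue in the next point), here is a minimal sketch of getting percentile and BCa intervals for the MCC with `scipy.stats.bootstrap`; the `y_true` and `y_pred` arrays are made-up placeholders, not your data:

```python
import numpy as np
from scipy.stats import bootstrap
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)

# Made-up labels and hard class predictions, just to have something to resample.
y_true = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # ~80% agreement

def mcc(yt, yp):
    return matthews_corrcoef(yt, yp)

for method in ("percentile", "BCa"):
    # paired=True resamples (y_true, y_pred) pairs together.
    res = bootstrap((y_true, y_pred), mcc, paired=True, vectorized=False,
                    n_resamples=2000, method=method, random_state=0)
    print(method, res.confidence_interval)
```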
Second, it looks like what you are proposing is to test the model built from the entire data sample against each of a set of bootstrap samples. The bootstrap principle suggests that is going the wrong way. You want to estimate how well the model from your full sample is going to work on the entire population. The analogue, under the bootstrap principle, is to repeat the full modeling process on each of the bootstrapped samples and then test those models against the full data sample. Use the distribution of those correlation coefficients to estimate the CI, again with the BCa method being a fairly safe bet. This also provides a way to estimate bias and optimism in your statistic; see the middle part of my answer here, for example.
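A rough sketch of that direction, assuming a scikit-learn-style classifier and made-up `X`, `y` (swap in your own data, and put your entire modeling process, including any feature selection or tuning, inside the loop):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)

# Made-up data; replace with your own X, y.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Apparent performance: model fit and evaluated on the full sample.
full_model = LogisticRegression().fit(X, y)
apparent = matthews_corrcoef(y, full_model.predict(X))

boot_stats, optimisms = [], []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))     # bootstrap sample (with replacement)
    m = LogisticRegression().fit(X[idx], y[idx])   # refit the whole modeling process
    on_full = matthews_corrcoef(y, m.predict(X))   # test against the full sample
    on_boot = matthews_corrcoef(y[idx], m.predict(X[idx]))
    boot_stats.append(on_full)                     # distribution to summarize (e.g., BCa CI)
    optimisms.append(on_boot - on_full)            # optimism of this bootstrap model

# Optimism-corrected estimate of out-of-sample MCC:
corrected = apparent - np.mean(optimisms)
print(apparent, corrected)
```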
Finally, you might want to reconsider the Matthews Correlation Coefficient as the measure of model performance. If your "classifier" returns probability estimates, like logistic regression, the confusion matrix needed to calculate it is probably just based on a default probability cutoff of 0.5, which implicitly treats false-positive and false-negative results as equally costly. That's not always the best cutoff choice. If your "classifier" doesn't return probability estimates, many on this site would recommend that you use one that does. See this page and many others on this site for why a proper scoring rule like log-loss or the Brier score is a better choice.
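If your classifier does give probabilities, those scores are one-liners; a tiny sketch with made-up labels and predicted probabilities:

```python
from sklearn.metrics import brier_score_loss, log_loss

# y: true 0/1 labels, p: predicted probabilities of class 1 (both made up here)
y = [0, 0, 1, 1, 1]
p = [0.1, 0.4, 0.35, 0.8, 0.9]

print(brier_score_loss(y, p))  # mean squared error of the probabilities
print(log_loss(y, p))          # negative average log-likelihood
```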