No, it's not a silly question. There aren't many statistics for comparing statistics. For example, suppose you have run a lot of t-tests (or logistic models) and want to perform a hypothesis test to determine which result is the most significant; that is, hypothesis tests on hypothesis-test results.
For logistic regression, you can compare models using a variety of results for each: the coefficients, the overall chi-squared p-value, the Hosmer-Lemeshow statistic and table, and the deviance goodness-of-fit. For machine learning problems, there are the ROC-AUC, sensitivity, and specificity of each model, as well as PV+ and PV- (positive and negative predictive value, which hinge on prevalence, i.e. the proportion of outcomes equal to one).
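As a minimal sketch of how those confusion-matrix metrics relate, here is some Python using made-up `y_true`/`y_pred` arrays (not your data); the labels are assumed to be 0/1 with 1 = event:

```python
# Sensitivity, specificity, PV+ and PV- all fall out of a 2x2 confusion matrix.
# y_true and y_pred are hypothetical binary arrays for illustration only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # recall for the positive class
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                  # PV+, depends on prevalence
npv = tn / (tn + fn)                  # PV-
prevalence = (tp + fn) / (tp + fn + fp + tn)

print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
      f"PV+={ppv:.2f} PV-={npv:.2f} prevalence={prevalence:.2f}")
```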
Things get complicated, however, because there can be issues like differences in the input features (the predictors used for each model) and in the cross-validation methods used for each model.
But overall, the AUC-ROC would be a good start. This is the area under the receiver operating characteristic curve, a plot of sensitivity vs. 1 - specificity. People who present ML classification results at meetings and conferences, e.g. with a lot of biological markers as predictors for class outcomes, will often simply go through several slides titled "AUC" or "AUC-ROC", listing how the AUC changes with different combinations of features. The AUC-ROC incorporates both sensitivity and specificity, which is much more informative than recall or classification accuracy alone, which is what your $M_1$ and $M_2$ are.
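To make the "AUC per feature combination" slide concrete, here is a small sketch on synthetic data with logistic regression; the feature subsets and their names are invented for illustration:

```python
# AUC for the same classifier fit on different feature subsets (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

feature_sets = {"markers 1-2": [0, 1],
                "markers 1-4": [0, 1, 2, 3],
                "all markers": list(range(6))}

for name, cols in feature_sets.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```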
In fact, if you present results as AUC for different combinations of input features, you only need to mention which classifier was used, because the AUC can be calculated for any classifier. You could even have one slide of AUCs for various mixtures of features based on multiple classifiers; combining multiple classifiers for a specific set of features is called "ensemble classifier fusion."
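One simple way to read "fusion" is averaging predicted probabilities from several classifiers on the same feature set and reporting the fused AUC; the classifiers and data below are illustrative, not from your question:

```python
# Per-classifier AUC vs. a soft-voting (probability-averaging) fusion of them.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

members = [("lr", LogisticRegression(max_iter=1000)),
           ("rf", RandomForestClassifier(random_state=1)),
           ("nb", GaussianNB())]

for name, clf in members:
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")

fused = VotingClassifier(members, voting="soft").fit(X_tr, y_tr)
print(f"fused: AUC = {roc_auc_score(y_te, fused.predict_proba(X_te)[:, 1]):.3f}")
```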
The point of mentioning the above is that an experienced ML analyst would quickly move away from what you are asking and launch into a lot of other things (like ensemble methods, each of which uses CV and multiple classifiers) without getting tripped up looking for statistical tests to prove which AUC is best. At that point, however, you have to look at overfitting, the bias/variance dilemma, and the effect of the "curse of dimensionality" of each feature set on each classifier.
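Cross-validated AUC is what keeps that kind of comparison honest about overfitting; a minimal sketch, again on synthetic data with an assumed 10-fold split:

```python
# Report the mean and spread of AUC across CV folds rather than a single split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=2)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="roc_auc")
print(f"10-fold CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```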