There is a trend in machine learning tooling toward making things easier and easier for implementers, which is a very natural engineering concern: easy APIs to create any kind of model you want, easy infrastructure to manage versions of data and models, easy deployment of models as APIs. One of these trends is AutoML, an end-to-end process that produces a model (selected out of many) from only a few general hyperparameters, hiding more and more of the usual statistical workflow, to the point of reducing the need to understand the many hard-to-learn nuances of the statistical practices involved.
At the other end of the spectrum are the efforts to address the replication crisis occurring in many scientific fields, a crisis driven largely by poor use of statistics: confusing statistical significance with effect size, p-hacking, HARKing, and other superficial applications of statistical methods. All of this asks the people who use these tools to understand the nuances of statistical thinking more, not less.
Details are missing about the innards of AutoML: is it running an SVM, a logistic regression, and a random forest with multiple kernels, hyperparameters, etc.? Is it following basic defensive statistics like a Bonferroni correction? Or is it just jumping straight to picking the best p-value (or score) out of all of them?
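To make that last worry concrete, here is a minimal sketch of what I imagine such a search could look like; this is written against scikit-learn rather than any actual AutoML library, and the candidate list, dataset, and selection rule are all assumptions for illustration, not a description of how any real tool works:

```python
# Hypothetical sketch: try several model families and hyperparameters,
# then keep whichever has the best cross-validation score.
# Nothing here adjusts for how many candidates were tried.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = {
    "logreg_C=0.1": LogisticRegression(C=0.1, max_iter=1000),
    "logreg_C=1.0": LogisticRegression(C=1.0, max_iter=1000),
    "svm_rbf": SVC(kernel="rbf"),
    "svm_linear": SVC(kernel="linear"),
    "rf_100": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_500": RandomForestClassifier(n_estimators=500, random_state=0),
}

# Report the single best mean CV score across all candidates -- the
# analogue of reporting the best p-value from many tests, uncorrected.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

If the real tools do something like this and nothing more, the reported score of the winning model is a maximum over many comparisons, which is exactly the situation the defensive corrections exist for.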
I've set this up as a dichotomy between ease of use in engineering and correctness of thought in the statistical procedures. AutoML seems like a great thing for creating successful models. But then I wonder whether these tools are not just ignoring the entire history of statistical thinking, but actively running away from it.
Are AutoML researchers successfully accounting for these statistical nuances, or are they enabling even more problematic models by ignoring them (e.g., by selecting among far too many candidate models for the amount of data available)? And, conversely, are statisticians making it harder to build reputable models? As a side question, is this characterization of AutoML as a statistically problematic procedure even accurate?
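To show why "too many models for the amount of data" worries me, here is a small, entirely hypothetical simulation (again scikit-learn, with made-up sample sizes and candidates): fit a handful of models to purely random labels and keep the best cross-validated accuracy. The winner will typically look better than chance even though there is nothing to learn, which is the same selection effect that p-hacking exploits.

```python
# Small-n, no-signal setting: the labels are pure noise, so no model can
# genuinely do better than 50% accuracy. Yet the *best* of many candidates
# usually reports above-chance CV accuracy, purely from selecting the max.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))       # small sample, moderately wide feature set
y = rng.integers(0, 2, size=60)     # labels carry no signal at all

candidates = [
    LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1, 10)
] + [
    SVC(kernel=k) for k in ("linear", "rbf", "poly")
] + [
    RandomForestClassifier(n_estimators=n, random_state=0) for n in (50, 200)
]

scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
print("best CV accuracy on pure noise:", max(scores))  # typically above 0.5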
I suppose the TL;DR to all of this is: is AutoML just p-hacking across all the models?