
If you train a lot of machine learning algorithms on a problem (SVM, neural nets, random forests, ..., e.g. via caret), and I mean a lot of them, like hundreds or thousands (not that hard, considering all the parameter tuning during validation), eventually you will find one that seems to work.

But that is data snooping. Since you have trained so many models, I think you have to test the hypothesis that you were simply lucky, maybe with White's Reality Check or Hansen's Superior Predictive Ability (SPA) test.
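For concreteness, here is a very rough sketch of the bootstrap behind White's Reality Check, assuming you have already computed per-period performance differences of each candidate model against a benchmark. White's original test uses a stationary bootstrap to handle time-series dependence; this sketch uses a plain i.i.d. bootstrap just to show the shape of the procedure:

```python
import numpy as np

def reality_check_pvalue(perf_diffs, n_boot=2000, rng=None):
    """Rough sketch of White's Reality Check.

    perf_diffs: array of shape (n_models, n_obs) holding, for each candidate
    model, its per-observation performance difference versus a benchmark.
    H0: no candidate beats the benchmark. The statistic is the maximum of the
    (scaled) mean differences; its null distribution is bootstrapped.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_models, n_obs = perf_diffs.shape
    means = perf_diffs.mean(axis=1)
    stat = np.sqrt(n_obs) * means.max()
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        # i.i.d. resample for brevity; White uses a stationary bootstrap
        idx = rng.integers(0, n_obs, size=n_obs)
        boot_means = perf_diffs[:, idx].mean(axis=1)
        boot_stats[b] = np.sqrt(n_obs) * (boot_means - means).max()
    # small p-value: the best model's edge is probably not just luck
    return float((boot_stats >= stat).mean())
```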

Around 99.9% of papers, articles, posts, ... don't use White's or Hansen's tests (the 0.1% are papers about stock trading via ML). I suppose this is because, normally, we only train a few models (do we, really?).

The question is:

Do you have any idea of the number of models beyond which the effects of data snooping become serious?

That is, if I'm choosing between 3 models, I think the probability of getting good results by chance is low. But when choosing between 30? 300? 3000?
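To make the scale of the problem concrete, here is a small simulation (my own illustration): suppose every candidate model is pure chance on a balanced binary task, all scored on the same validation set of 200 cases. The best accuracy you observe grows steadily with the number of models you try:

```python
import numpy as np

rng = np.random.default_rng(0)
n_val = 200                                   # size of the shared validation set
for n_models in (3, 30, 300, 3000):
    # every "model" guesses at random, so its true accuracy is 50%
    accs = rng.binomial(n_val, 0.5, size=n_models) / n_val
    print(f"{n_models:5d} models -> best observed accuracy: {accs.max():.3f}")
```

Typically the best of 3 chance models lands around 53% accuracy, while the best of 3000 lands around 62%, even though no model has any real skill.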

1 Answer


A lot of effort is underway to address this issue of data dredging with techniques from differential privacy. The Thresholdout algorithm from the paper cited below allows repeated reuse of the test set when evaluating different models, and in the process keeps enough about the test set hidden that the final model does not overfit it.

See the tutorial "Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis" at this year's ICML:

http://icml.cc/2016/?page_id=97

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015, June). Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (pp. 117-126). ACM.
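As a minimal sketch of the idea (simplified from the paper's Thresholdout: the real algorithm also refreshes the threshold noise and charges an access budget, and the default threshold and noise scale below are just placeholders):

```python
import numpy as np

def thresholdout(train_score, holdout_score, threshold=0.04, sigma=0.01, rng=None):
    """Simplified reusable-holdout (Thresholdout) answer for one query.

    If the training and holdout scores agree to within a noisy threshold,
    only the training score is released and the holdout stays untouched.
    Only when they disagree (a sign of overfitting) is a noised holdout
    score released.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Laplace noise is what the theory uses; the Science-paper demo uses Gaussian.
    if abs(train_score - holdout_score) < threshold + rng.laplace(0, 2 * sigma):
        return train_score                         # holdout not revealed
    return holdout_score + rng.laplace(0, sigma)   # noised holdout answer
```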

horaceT
  • Yes, and if you can explain why their method works, please answer the [following question](http://stats.stackexchange.com/questions/223188/has-the-journal-science-endorsed-the-garden-of-forking-pathes-analyses). I'm not saying there's nothing to this paper, but I am saying it doesn't make sense to me. I would be very interested if someone could explain it in a clear manner. – Cliff AB Sep 14 '16 at 20:30
  • @Cliff AB I think we crossed paths before. Let me make it clear: I'm not saying differential privacy is the (only) holy grail in machine learning that completely solves the data snooping problem. But this is a pretty neat trick for an otherwise intractable problem, isn't it? Before embarking on a 1000-model search, this algorithm provides at least a sort of protective shield against overfitting. – horaceT Sep 14 '16 at 20:54
  • To be clear on my side, I'm not saying there's nothing to it, but that I don't get it yet. There are a lot of things that work great that I don't get! I have it in my mind to thoroughly investigate this some day, but since I haven't had the time, if someone else has, I'm very interested. – Cliff AB Sep 14 '16 at 20:58
  • @Cliff AB I hear you, and I'll add an example to show how I think it should work. But your skepticism is justified, given how little else is out there right now. – horaceT Sep 14 '16 at 21:59
  • If I understood it, the idea is: add noise to the test-set performance and use that as the real test-set performance for that model; you've just made it reusable. Depending on the similarity of the train and test performances, you add noise in one way or another. I don't know if it works, but it sounds good. Is this right? – PeterTschuschke Sep 15 '16 at 18:23
  • @PeterTschuschke That's the general idea, but they pick Laplacian noise because of certain sample-complexity guarantees. See Theorem 14 in the paper cited above. – horaceT Sep 16 '16 at 16:41
  • I must be missing something: why do they pick normal noise on slide 71? `if abs(sample_mean - holdout_mean) < random.normal...` @horaceT – PeterTschuschke Sep 16 '16 at 16:56
  • The type of noise and the dispersion of the noise variables determine the guarantee of differential privacy. See their Thresholdout algorithm, and Thm 9 following it for proof of the guarantee. – horaceT Sep 16 '16 at 17:09
  • I believe most of the differential-privacy theory uses Laplace noise for proofs. But I think the "ThresholdOut" demos do use Gaussian noise in practice (their Python demo for the Science paper, at least). – GeoMatt22 Sep 16 '16 at 17:44
  • @GeoMatt22 I've seen Gaussian noise being used, too. I think the deal is about satisfying the $(\epsilon, \delta)$ bound. – horaceT Sep 16 '16 at 19:19
  • I'm having trouble trying to implement this. Let's say I have a classification problem and I want a model, say an SVM with a linear kernel; it only has one parameter, C. In the non-private way, roughly, I did cross-validation, selected a value for C, tried it on the test set, and if the test-set accuracy was similar to the cross-validation estimate of the test error, that was the end. – PeterTschuschke Sep 19 '16 at 11:22
  • But now, in this new differential privacy framework, how should I proceed? I try a lot of C values on the training set, get my training errors, and test them one by one against the test set, returning the test score according to the Thresholdout algorithm? I don't think this makes sense, because how do I choose between models with different C's? According to the test-set score (which has noise)? And how do I choose sigma and the threshold? sigma = 1.0/sqrt(len(sample))? threshold = k*sigma, with k = 2 or 3? – PeterTschuschke Sep 19 '16 at 11:37
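Following up on the last two comments, here is a rough sketch of how such a loop might look. The scikit-learn calls, the C grid, and the choices sigma = 1/sqrt(n) and threshold = 2*sigma are illustrative assumptions (the last two taken from the comment), not prescriptions from the paper. Candidates are still fit and scored by cross-validation; the reusable holdout is only queried to check whether each cross-validation score is still trustworthy, and only the released (noised, thresholded) answers are used to pick C:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def thresholdout(train_score, holdout_score, threshold, sigma, rng):
    # simplified reusable-holdout answer, as sketched in the answer above
    if abs(train_score - holdout_score) < threshold + rng.laplace(0, 2 * sigma):
        return train_score
    return holdout_score + rng.laplace(0, sigma)

def select_C(X_train, y_train, X_holdout, y_holdout,
             C_grid=(0.01, 0.1, 1.0, 10.0, 100.0), rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(y_holdout)
    sigma = 1.0 / np.sqrt(n)       # illustrative choice from the comment
    threshold = 2 * sigma          # illustrative choice from the comment
    released = {}
    for C in C_grid:
        cv_score = cross_val_score(LinearSVC(C=C), X_train, y_train, cv=5).mean()
        holdout_score = LinearSVC(C=C).fit(X_train, y_train).score(X_holdout, y_holdout)
        # Only the noised, thresholded answer is ever looked at or recorded.
        released[C] = thresholdout(cv_score, holdout_score, threshold, sigma, rng)
    best_C = max(released, key=released.get)
    return best_C, released
```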