
RE: https://www.quora.com/What-do-statisticians-e-g-Stats-PhDs-think-of-data-scientists-in-industry-without-stats-backgrounds

Several comments there mention "PROC FISH syndrome", whereby data scientists "try to keep trying things until they say what you want them to say."

Let's say that I have a problem and am trying to use a probabilistic model to develop either inferences or predictions. I would do exploratory data analysis, run hypothesis tests, evaluate predictors, and so forth. Let's say my model doesn't perform well. In that case, why wouldn't I want to consider other modeling options and keep testing things until I find a satisfactory solution where my classification rate or AUC/F1 score is adequate?
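To make the workflow concrete, here's a rough sketch of the loop I have in mind (the scikit-learn candidates, the `fish_for_a_model` name, and the 0.8 AUC cutoff are just placeholders I picked):

```python
# Rough sketch of "keep trying candidates until one clears the bar",
# with every candidate scored on the same data used to compare them.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

CANDIDATES = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200),
    GradientBoostingClassifier(),
]

def fish_for_a_model(X, y, threshold=0.8):
    """Return the first candidate whose cross-validated AUC beats the threshold."""
    for model in CANDIDATES:
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(type(model).__name__, round(auc, 3))
        if auc >= threshold:
            return model  # stop as soon as something "works"
    return None  # otherwise, go looking for yet more candidates...
```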

Basically, what is so wrong with PROC FISH syndrome?

AGUY
    That expression ("PROC fishing" / "PROC FISH") seems to assume that SAS is all anyone uses (I've never heard it described that way before). To be honest I haven't seen a *stats* PhD student using SAS in a couple of decades (though I saw a finance PhD use it just recently), so I doubt many stats PhDs would call it that. Depending on what exactly is done (and partly on their background or audience) they might call it [*data-dredging*](https://en.wikipedia.org/wiki/Data_dredging) or *p-hacking*, among other things. ...ctd – Glen_b Jun 12 '16 at 22:47
  • 1
    ctd... In the distant past it used to often be called "data mining" but that's been adopted to mean something somewhat different (e.g. see our [tag](http://stats.stackexchange.com/questions/tagged/data-mining)) and so is used less to represent this particular activity. You might also find [this post](http://stats.stackexchange.com/questions/200745/how-much-do-we-know-about-p-hacking-in-the-wild/) relating to p-hacking of interest, including references there. – Glen_b Jun 12 '16 at 22:47

2 Answers


The basic problem of trying multiple models or analyses* until you get one you like is that essentially all of the meaningful properties of our resulting inferences (including predictions) simply don't hold any more.

* (on the same data you estimate the model on)

Specifically, significance levels will be larger than we specify, while calculated p-values will be too small. Standard errors and confidence intervals (and prediction intervals) will be too small. Parameter estimates will generally be biased (to correspond to the direction we "wanted" -- so if we were looking for significance, for example, the coefficients in our model will tend to be biased away from 0).
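A quick way to see the first of these is a small null simulation (my own construction, not from any particular reference): the outcome is pure noise, we run several unrelated analyses and keep only the best-looking p-value, and the realised rejection rate at a nominal 5% level comes out near $1 - 0.95^k$ for $k$ looks rather than 0.05:

```python
# Null simulation: no real effect anywhere, but we run k candidate analyses
# and keep only the smallest p-value. The nominal level is 0.05; the realised
# rejection rate is roughly 1 - 0.95**k (about 0.23 for k = 5).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, reps, alpha = 50, 5, 2000, 0.05

rejections = 0
for _ in range(reps):
    y = rng.normal(size=n)                      # outcome: pure noise
    pvals = []
    for _ in range(k):                          # k candidate "analyses"
        x = rng.normal(size=n)                  # unrelated predictor
        result = stats.linregress(x, y)
        pvals.append(result.pvalue)
    if min(pvals) < alpha:                      # report only the best-looking one
        rejections += 1

print("nominal level:          ", alpha)
print("realised rejection rate:", rejections / reps)
```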

A good place to read about this is Chapter 4 of Frank Harrell's Regression Modeling Strategies (at least it was Ch4 in the first edition; I assume it's still the case in the new one). [In that case it relates to model selection, but the issues carry over to choices between kinds of analysis as well.]

The impact of data-dredging can be so strong it reduces our model to a comforting story we tell ourselves about the data, one that has more to do with us than it does with the data.

What's left that's of any value?

Indeed the problem can be very subtle. To quote Andrew Gelman** (here speaking specifically about p-values, though many of the objections carry over to other parts of inferential statistics with suitable modification of phrasing):

> Valid p-values cannot be drawn without knowing, not just what was done with the existing data, but what the choices in data coding, exclusion, and analysis would have been, had the data been different. This ‘what would have been done under other possible datasets’ is central to the definition of p-value. The concern is not just multiple comparisons, it is multiple potential comparisons.
>
> Even experienced users of statistics often have the naive belief that if they did not engage in “cherry-picking . . . data dredging, significance chasing, significance questing, selective inference and p-hacking” (to use the words of the ASA’s statement), and if they clearly state how many and which analyses were conducted, then they’re ok. In practice, though, as Simmons, Nelson, and Simonsohn (2011) have noted, researcher degrees of freedom (including data-exclusion rules; decisions of whether to average groups, compare them, or analyze them separately; choices of regression predictors and interactions; and so on) can and are performed after seeing the data.

** see here

--

As hinted above, much of this can be obviated by not using the same data to select a model and to estimate it. Approaches such as sample splitting or cross-validation can avoid many of these problems, as sketched below (though Gelman's garden of forking paths$^\dagger$ might still affect us in more subtle ways, as the quote points out).

$^\dagger$ see also Gelman & Loken (2013). The phrase itself appears to be a reference to the short story of the same name by Borges.
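For concreteness, here's a minimal sketch of the sample-splitting idea (the scikit-learn names and the `select_then_evaluate` helper are mine, and the 25% test fraction and choice of AUC are arbitrary): do all the model shopping on one portion of the data, and quote performance only from data that never influenced that choice.

```python
# Do all the model shopping on a working set; quote performance only from a
# test set that never influenced the choice of model.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

def select_then_evaluate(candidates, X, y, random_state=0):
    # `candidates` is any list of scikit-learn classifiers with predict_proba.
    X_work, X_test, y_work, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state)

    # Model selection: cross-validated AUC computed on the working portion only.
    scored = [(cross_val_score(m, X_work, y_work, cv=5, scoring="roc_auc").mean(), m)
              for m in candidates]
    best_cv_auc, best = max(scored, key=lambda t: t[0])

    # Honest estimate: refit the chosen model, score it once on held-out data.
    best.fit(X_work, y_work)
    test_auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
    return best, best_cv_auc, test_auc
```

Nested cross-validation is the more thorough version of the same idea when the data are too scarce for a single held-out test set.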

Glen_b
  • +1 for cross validation and hold-outs. That is a key data science step that avoids the "lucky coincidence" issue when exploring models. –  Jun 13 '16 at 03:17

This is such a widespread problem that it now has a name: *adaptive data analysis*. Anyone who has followed Kaggle competitions will be familiar with the public leaderboard showing much better results than the final (private) evaluation scores. Because of the severity of the problem [1], researchers have come up with a cure: the technique of differential privacy [2].

Since this is state-of-the-art research, it may not work for all problems, but it's a pretty interesting idea.
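To get a feel for why the public/private gap appears, here is a much-simplified toy of my own (it captures only the "select on the public holdout" part of adaptive analysis): the labels are coin flips and the submissions are random guesses, yet choosing by public score still pushes the public number well above chance while the private number stays at chance.

```python
# Toy public/private leaderboard: labels are coin flips, "submissions" are
# random guesses, and we keep whichever submission looks best on the public
# holdout. The public score drifts well above 50%; the private score doesn't.
import numpy as np

rng = np.random.default_rng(1)
n_public, n_private, n_submissions = 1000, 1000, 500

y_public = rng.integers(0, 2, n_public)
y_private = rng.integers(0, 2, n_private)

best_public_acc, best_private_preds = 0.0, None
for _ in range(n_submissions):
    preds_public = rng.integers(0, 2, n_public)
    preds_private = rng.integers(0, 2, n_private)
    public_acc = (preds_public == y_public).mean()
    if public_acc > best_public_acc:            # adapt to the public leaderboard
        best_public_acc = public_acc
        best_private_preds = preds_private

print("best public accuracy :", best_public_acc)                           # ~0.55
print("its private accuracy :", (best_private_preds == y_private).mean())  # ~0.50
```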

[1] Gelman, A., & Loken, E. (2016). The statistical crisis in science. The Best Writing on Mathematics 2015.

[2] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015, June). Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (pp. 117-126). ACM.

horaceT