The basic problem with trying multiple models or analyses* until you get one you like is that essentially all of the meaningful properties of our resulting inferences (including predictions) no longer hold.
* (on the same data you estimate the model on)
Specifically, actual significance levels (type I error rates) will be larger than the ones we specify, while the calculated p-values will be too small. Standard errors, confidence intervals and prediction intervals will be too narrow. Parameter estimates will generally be biased in the direction we "wanted" -- so if we were looking for significance, for example, the coefficients in our model will tend to be biased away from 0.
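To see the first of these effects concretely, here's a minimal simulation sketch (mine, not taken from any of the references here): with several candidate predictors that are all unrelated to the response, picking the one that looks "most significant" and then reporting its p-value as if it had been prespecified rejects far more often than the nominal level.

```python
# Minimal sketch: p-values after selection are too small under the null.
# Everything here (sample size, number of candidates, etc.) is illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims, alpha = 50, 10, 2000, 0.05

rejections = 0
for _ in range(n_sims):
    X = rng.normal(size=(n, k))   # k candidate predictors
    y = rng.normal(size=n)        # outcome unrelated to any of them
    # "Model selection": keep the predictor most correlated with y ...
    best = np.argmax([abs(stats.pearsonr(X[:, j], y)[0]) for j in range(k)])
    # ... then test that predictor on the same data as if it were prespecified.
    _, p = stats.pearsonr(X[:, best], y)
    rejections += (p < alpha)

print(f"Nominal level: {alpha:.2f}, actual rejection rate: {rejections / n_sims:.2f}")
# The actual rejection rate is typically several times the nominal 5% level.
```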
A good place to read about this is Chapter 4 of Frank Harrell's Regression Modeling Strategies (at least it was Ch4 in the first edition; I assume it's still the case in the new one). [There the discussion relates to model selection, but the issues carry over to choices between kinds of analysis as well.]
The impact of data-dredging can be so strong it reduces our model to a comforting story we tell ourselves about the data, one that has more to do with us than it does with the data.
What's left that's of any value?
Indeed, the problem can be very subtle. To quote Andrew Gelman** (here speaking specifically about p-values, though many of the objections carry over to other parts of inferential statistics with suitable modification of phrasing):
Valid p-values cannot be drawn without knowing, not just what was done with the existing data, but what the choices in data coding, exclusion, and analysis would have been, had the data been different. This ‘what would have been done under other possible datasets’ is central to the definition of p-value. The concern is not just multiple comparisons, it is multiple potential comparisons.
Even experienced users of statistics often have the naive belief that if they did not engage in “cherry-picking . . . data dredging, significance chasing, significance questing, selective inference and p-hacking” (to use the words of the ASA’s statement), and if they clearly state how many and which analyses were conducted, then they’re ok. In practice, though, as Simmons, Nelson, and Simonsohn (2011) have noted, researcher degrees of freedom (including data-exclusion rules; decisions of whether to average groups, compare them, or analyze them separately; choices of regression predictors and interactions; and so on) can and are performed after seeing the data.
** see here
--
As hinted above, much of the problem can be obviated by not using the same data to select a model and to estimate it. Approaches such as sample splitting or cross-validation, for example, can avoid many of these problems (though Gelman's garden of forking paths$^\dagger$ may still affect us in more subtle ways, as the quote points out); a minimal sketch of sample splitting follows below.
$^\dagger$ see also Gelman & Loken (2013). The phrase itself appears to be a reference to the short story of the same name by Borges.
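To make the sample-splitting idea concrete, here is a minimal sketch (the data and the least-squares setup are illustrative assumptions, not anything from the question): the selection step sees only one half of the data, and the reported inference uses only the other half, so selecting a model no longer distorts the reported p-values and intervals.

```python
# Minimal sample-splitting sketch: select on one half, do inference on the other.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 200, 10
X = rng.normal(size=(n, k))
y = rng.normal(size=n)                 # true null: no predictor matters

# Split the data once, up front.
idx = rng.permutation(n)
sel, est = idx[: n // 2], idx[n // 2:]

# Step 1: choose a predictor using only the selection half.
corrs = [abs(np.corrcoef(X[sel, j], y[sel])[0, 1]) for j in range(k)]
best = int(np.argmax(corrs))

# Step 2: fit and report inference using only the estimation half.
model = sm.OLS(y[est], sm.add_constant(X[est, best])).fit()
print("p-value for the selected predictor:", model.pvalues[1])
# Because the estimation half played no part in the selection,
# this p-value retains (close to) its nominal meaning.
```

The price, of course, is that the inference is based on only part of the data; cross-validation and related schemes are ways of spending that price more efficiently.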