
Let's say we have immense theoretical justification in the literature to expect that X1 predicts Y, even though X1 contributes only slightly to explaining variation in Y. We want to know whether X1 is more predictive of Y specifically when X2 is low in value, and we expect changes in X1 to be less important when X2 is high in value. We will test this hypothesis with the same data used to establish the documented relationship between X1 and Y.

Y = Constant + B1∗X1 + B2∗X2 + B3∗X1∗X2 + error

The interaction term is nonsignificant and model fit has not improved relative to a model without B3∗X1∗X2. Examination of the post-estimation predictions, however, suggests that changes in X1 are indeed associated with Y (p < .05) only when X2 is low (e.g. 1.5 SD below the mean), and that changes in X1 are not associated with Y at moderate to high values of X2 (e.g. anywhere from .5 SD below the mean to 2.5 SD above the mean).
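
Concretely, the post-estimation probing looks something like the following sketch (Python/statsmodels purely for illustration; the data frame df with columns Y, X1, X2 and the probe values are assumptions, not an actual analysis):

    # Minimal sketch: fit Y ~ X1*X2 and probe the conditional slope of X1
    # (B1 + B3*x2) at chosen values of X2. df is an assumed pandas DataFrame
    # with columns Y, X1, X2.
    import numpy as np
    from scipy import stats
    import statsmodels.formula.api as smf

    fit = smf.ols("Y ~ X1 * X2", data=df).fit()

    def x1_slope_at(x2_value, fit):
        """Conditional slope of X1 at a given value of X2, with SE and p-value."""
        params, cov = fit.params, fit.cov_params()
        slope = params["X1"] + params["X1:X2"] * x2_value
        var = (cov.loc["X1", "X1"]
               + 2 * x2_value * cov.loc["X1", "X1:X2"]
               + x2_value ** 2 * cov.loc["X1:X2", "X1:X2"])
        se = np.sqrt(var)
        p = 2 * stats.t.sf(abs(slope / se), df=fit.df_resid)
        return slope, se, p

    m, s = df["X2"].mean(), df["X2"].std()
    for k in (-1.5, -0.5, 0.0, 1.5, 2.5):   # SDs from the mean of X2
        print(k, x1_slope_at(m + k * s, fit))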

To confirm the hypothesis another way, we cut X2 into categorical tertiles and run the model again.

Y = Constant + B1∗X1 + B2∗X2_T2 + B3∗X2_T3 + B4∗X1∗X2_T2 + B5∗X1∗X2_T3 + error

where X2_T2 and X2_T3 are dummy indicators for the second and third tertiles of X2, with the lowest tertile as the reference category.

Corroborating our previous plot, we find that B5 (the difference between the X1 slope in the highest tertile of X2 and the X1 slope in the lowest, reference tertile, B1) is statistically significant and in the expected direction. Changes in X1 appear to be associated with Y only when X2 is low in value.
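
As a sketch (same hypothetical df as above), the tertile version of the model could be fit like this:

    # Minimal sketch: cut X2 into tertiles and interact the tertile dummies
    # with X1. With T1 (low X2) as the reference category, the coefficient on
    # X1 is the X1 slope in the low tertile, and the X1:T3 interaction term
    # (B5 in the notation above) is how much that slope changes in the high
    # tertile. df is the same assumed pandas DataFrame.
    import pandas as pd
    import statsmodels.formula.api as smf

    df["X2_tertile"] = pd.qcut(df["X2"], q=3, labels=["T1", "T2", "T3"])
    tertile_fit = smf.ols("Y ~ X1 * C(X2_tertile)", data=df).fit()
    print(tertile_fit.summary())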


Is this a wild (and wrong) goose chase in the name of theory? In cases like this, it seems odd to rely on the statistical significance of the interaction term and on model-fit statistics alone to assess moderation, as is so common in the literature. Thanks.

EDIT: To be clear, I've done nothing and this is a purely synthetic example; there's no actual research being done around this question. The heart of my question is this: if there truly is a point in the distribution of X2 at which X1 matters most, or doesn't matter at all, that can be substantively very important. How is a researcher supposed to detect this?

Nico

1 Answer


Well, the problem is that you are p-hacking like crazy. This kind of multiple testing is the sort of thing that has caused the replication crisis in the social sciences and produced hundreds or thousands of seriously flawed papers. You can't just keep trying different ways of interacting your variables and cutting things up without destroying the validity of the statistical significance of the results (even if the final p-value is low). See:

https://en.wikipedia.org/wiki/Multiple_comparisons_problem

https://en.wikipedia.org/wiki/Data_dredging
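
To make this concrete, here is a small simulation sketch (hypothetical data, nothing to do with your data set): even when there is no interaction at all, trying a few different ways of cutting up X2 and keeping the smallest interaction p-value rejects far more often than 5% of the time.

    # Simulation sketch: no true X1*X2 interaction, but we try several
    # specifications (continuous interaction, and X2 split into 2, 3, or 4
    # groups) and keep the smallest interaction p-value from each data set.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n, n_sims = 200, 500
    false_positives = 0

    for _ in range(n_sims):
        df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
        df["Y"] = 0.2 * df["X1"] + 0.2 * df["X2"] + rng.normal(size=n)  # no interaction

        pvals = [smf.ols("Y ~ X1 * X2", data=df).fit().pvalues["X1:X2"]]
        for q in (2, 3, 4):
            df["X2g"] = pd.qcut(df["X2"], q=q, labels=False)
            fit = smf.ols("Y ~ X1 * C(X2g)", data=df).fit()
            inter = [name for name in fit.pvalues.index if name.startswith("X1:")]
            pvals.append(fit.pvalues[inter].min())

        if min(pvals) < 0.05:
            false_positives += 1

    # Rejection rate of the "pick the best specification" procedure; it is
    # typically well above the nominal 0.05.
    print(false_positives / n_sims)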

The sort of goose chase you describe is perfectly fine for generating hypotheses, but in order to test the final hypothesis you end up with, you need to run the model you have fit on an independent data set that was not used during the hypothesis-generation procedure. Otherwise you are just overfitting. That means either collecting a new data set, or holding out some of your original data for testing and using only the remainder for hypothesis generation.
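
A minimal sketch of that hold-out workflow (the 50/50 split and the final specification below are placeholders, not a recommendation specific to your data):

    # Sketch of a hold-out split: explore freely on one half of the data,
    # then test the single chosen specification once on the other half.
    # df is an assumed pandas DataFrame with columns Y, X1, X2.
    import statsmodels.formula.api as smf

    explore = df.sample(frac=0.5, random_state=0)   # hypothesis generation only
    confirm = df.drop(explore.index)                # touched once, at the very end

    # ... try interactions, splits, etc. on `explore` only ...
    final_formula = "Y ~ X1 * X2"   # whatever specification you settle on

    # The single confirmatory test on data never used during exploration.
    print(smf.ols(final_formula, data=confirm).fit().summary())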

James
  • Your response seems to miss the point of the post, likely due to my lack of clarity. Imagine instead that the relation between X1 and Y is well established, and was established in the data set being used. You make important points, but they don't address the core question about detecting moderation. – Nico Jun 24 '20 at 20:31
  • Your original post said that the interaction between X1 and X2 in the data set you used was not significant. As such, further (non-preregistered) exploration of different relationships between X1 and X2 is p-hacking, and you would need to adjust your p-values for multiple comparisons (or verify the final model on an independent test set). The previously well-established relationship between X1 and Y would be something to think about if you wanted to do a Bayesian analysis and use it to motivate an informative prior on the interaction coefficients, but it's not relevant to p-value calculations. – James Jun 24 '20 at 20:55
  • This is all really a special case of how to interpret p-values in the final model that has been selected, after performing model selection (since your use of different specifications is essentially model selection). There is a good discussion here https://stats.stackexchange.com/questions/179941/why-are-p-values-misleading-after-performing-a-stepwise-selection and here http://joshualoftus.com/post/model-selection-bias-invalidates-significance-tests/ – James Jun 24 '20 at 21:05
  • Note that if all you care about is predicting Y, then it's not a huge problem (although you should still use a test set to get an unbiased estimate of the error, because your use of multiple models means that you are going to be slightly overfitting), but if you are trying to draw meaningful scientific conclusions about X1/X2 then it becomes a more serious issue. – James Jun 24 '20 at 21:11
  • @Marcus the issue here is not about the main effects, or even interactions specifically; the issue is that you picked the interaction split based on the model results. This means that you effectively did a bunch of hypothesis tests already: you looked at splitting at the median, rejected that, looked at quartiles, rejected that, rejected the continuous interaction, and picked tertiles specifically. Your tertile-test p-value is the min of several p-values, and therefore the procedure does not maintain $\alpha$ at 0.05. – juod Jun 24 '20 at 22:59
  • To be clear, I've done nothing and this is a purely synthetic example; there's no actual research being done around this question. The heart of my question is this: if there truly is a point in the distribution of X2 at which X1 matters most, or doesn't matter at all, that can be substantively very important. How is a researcher supposed to detect this? – Nico Jun 25 '20 at 01:40
  • "How is a researcher supposed to detect this?" – James Jun 25 '20 at 17:58