Let's say we have immense theoretical justification from the literature to expect that X1 predicts Y, even though X1 contributes only slightly to explaining variation in Y. We want to know whether X1 is more predictive of Y specifically when X2 is low, and we expect changes in X1 to matter less when X2 is high. We will test this hypothesis with the same data used to establish the documented relationship between X1 and Y.
Y = Constant + B1∗X1 + B2∗X2 + B3∗X1∗X2 + error
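For concreteness, here is a rough sketch of the kind of model fit I have in mind, in Python/statsmodels with made-up placeholder data standing in for the real Y, X1, and X2 (every name and number here is hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data standing in for the real variables; coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
df["Y"] = 0.2 * df["X1"] + 0.5 * df["X2"] - 0.1 * df["X1"] * df["X2"] + rng.normal(size=n)

# X1 * X2 expands to X1 + X2 + X1:X2, i.e. both main effects plus the interaction.
m_main = smf.ols("Y ~ X1 + X2", data=df).fit()
m_int = smf.ols("Y ~ X1 * X2", data=df).fit()

print(m_int.summary())                 # is the X1:X2 coefficient significant?
print(m_int.compare_f_test(m_main))    # F test of the model with vs. without the interaction
```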
The interaction term is nonsignificant and model fit has not improved relative to a model without B3∗X1∗X2. Examination of the post-estimation predictions, however, suggests that changes in X1 are indeed associated with Y (p < .05) only when X2 is low (e.g., 1.5 SD below the mean), and that changes in X1 are not associated with Y when X2 is higher (e.g., anywhere from 0.5 SD below the mean to 2.5 SD above it).
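By "post-estimation predictions" I mean something like the following: under this model the conditional (simple) slope of X1 at a fixed value of X2 is B1 + B3∗X2, with a standard error built from the coefficient covariance matrix. A sketch of that computation, again with placeholder data and hypothetical names (I believe this is essentially what margins-style post-estimation commands report):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Same placeholder data as in the sketch above.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
df["Y"] = 0.2 * df["X1"] + 0.5 * df["X2"] - 0.1 * df["X1"] * df["X2"] + rng.normal(size=n)
m_int = smf.ols("Y ~ X1 * X2", data=df).fit()

# Simple slope of X1 at a fixed X2:  dY/dX1 = B1 + B3 * X2,
# with Var = Var(B1) + X2^2 * Var(B3) + 2 * X2 * Cov(B1, B3).
params, cov = m_int.params, m_int.cov_params()
x2_mean, x2_sd = df["X2"].mean(), df["X2"].std()

for z in (-1.5, -0.5, 0.5, 1.5, 2.5):        # X2 expressed in SDs from its mean
    x2 = x2_mean + z * x2_sd
    slope = params["X1"] + params["X1:X2"] * x2
    se = np.sqrt(cov.loc["X1", "X1"]
                 + x2**2 * cov.loc["X1:X2", "X1:X2"]
                 + 2 * x2 * cov.loc["X1", "X1:X2"])
    p = 2 * stats.t.sf(abs(slope / se), df=m_int.df_resid)
    print(f"X2 at {z:+.1f} SD: slope of X1 = {slope:.3f}, SE = {se:.3f}, p = {p:.3f}")
```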
To confirm the hypothesis another way, we cut X2 into categorical tertiles and run the model again.
Y = Constant + B1∗X1 + B2∗X2_tert2 + B3∗X2_tert3 + B4∗X1∗X2_tert2 + B5∗X1∗X2_tert3 + error

where X2_tert2 and X2_tert3 are dummy indicators for the second and third tertiles of X2, with the lowest tertile as the reference category.
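The second model then looks something like this (again only a sketch with placeholder data; the tertile variable name is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same placeholder data standing in for the real Y, X1, X2.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
df["Y"] = 0.2 * df["X1"] + 0.5 * df["X2"] - 0.1 * df["X1"] * df["X2"] + rng.normal(size=n)

# Cut X2 into tertiles; "low" becomes the reference category.
df["X2_tert"] = pd.qcut(df["X2"], 3, labels=["low", "mid", "high"])

# X1 * C(X2_tert) expands to X1, the two tertile dummies, and the two
# X1-by-dummy interactions (B4 and B5 in the equation above).
m_tert = smf.ols("Y ~ X1 * C(X2_tert)", data=df).fit()
print(m_tert.summary())

# Slope of X1 within a tertile = reference slope plus the relevant interaction term.
b = m_tert.params
high_int = next(name for name in b.index if ":" in name and "high" in name)
print("X1 slope in low tertile: ", b["X1"])
print("X1 slope in high tertile:", b["X1"] + b[high_int])
```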
Corroborating our previous plot, we find that B5 (the X1∗X2_tert3 interaction) is statistically significant in the expected direction: the slope of X1 in the top tertile differs from B1, the slope in the bottom (reference) tertile. Changes in X1 appear to be associated with Y only when X2 is low in value.
Is this a wild (and wrong) goose chase in the name of theory? In cases like this, it seems odd to rely on the statistical significance of the interaction term and model-fit statistics alone to assess moderation, as is so common in the literature. Thanks.
EDIT: To be clear, I haven't actually done any of this; it is a purely synthetic example, and there is no actual research being done around this question. The heart of my question is this: if there truly is a point on the distribution of X2 at which X1 matters most, or stops mattering at all, that can be substantively very important. How is a researcher supposed to detect it?