I often see people first run a linear regression and, when it does not give a significant result, proceed to a t-test comparing only the n highest- and n lowest-scoring subjects.
Is this procedure valid? It seems to me that grouping people by an arbitrarily selected threshold, and throwing away information by discretizing a continuous variable, should be a problem. On the other hand, it works, in the sense that it finds significant differences where linear regression failed to.
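To make the scenario concrete, here is a minimal simulation sketch of the two analyses on the same data. Everything in it is my own assumption rather than anything taken from a real study: the sample size `n`, the tail size `k`, the true slope `beta`, normally distributed noise, and the use of 1.96 (a normal approximation to the t critical value) as the significance cutoff. It estimates how often each analysis rejects the null when the true relationship is linear but weak.

```python
import math
import random
import statistics

Z_CRIT = 1.96  # assumption: normal approximation to the two-sided 5% critical value


def slope_t(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope / se


def welch_t(a, b):
    """Welch two-sample t statistic for mean(a) - mean(b)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def extreme_groups_t(x, y, k):
    """Compare y in the k highest-x subjects against the k lowest-x subjects."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    low = [y[i] for i in order[:k]]
    high = [y[i] for i in order[-k:]]
    return welch_t(high, low)


def rejection_rates(n=200, k=50, beta=0.1, n_sims=1000, seed=0):
    """Fraction of simulated datasets in which each analysis exceeds Z_CRIT."""
    rng = random.Random(seed)
    reg_hits = ext_hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [beta * a + rng.gauss(0, 1) for a in x]  # weak linear effect (hypothetical)
        reg_hits += abs(slope_t(x, y)) > Z_CRIT
        ext_hits += abs(extreme_groups_t(x, y, k)) > Z_CRIT
    return reg_hits / n_sims, ext_hits / n_sims


if __name__ == "__main__":
    reg, ext = rejection_rates()
    print(f"regression rejects: {reg:.3f}, extreme-groups t-test rejects: {ext:.3f}")
```

Setting `beta=0` gives the false-positive rate of each analysis on its own; it does not capture the sequential procedure of running the t-test only after the regression has failed.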
Edit: I am aware of the p-hacking problem, and that is not really what I am interested in. Would this still be a problem if it were part of a preregistered analysis protocol, so that no p-hacking is involved?
What about the situation where no data are thrown away: linear regression shows no significant result, but a t-test based on prespecified groups does? For example, obese vs. non-obese people, split at a BMI threshold, differ significantly, but BMI as a continuous predictor in a regression is not significant. If the problem is loss of information, why does the test with less information give better results?
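For this no-data-discarded case, a small sketch for experimenting with the two analyses side by side may help. All numbers are made up for illustration: the BMI-like distribution (mean 27, SD 5), the cutoff of 30, and the two hypothetical data-generating functions `linear` (outcome changes smoothly with BMI) and `step` (outcome jumps at the cutoff). It returns the regression t statistic and the threshold-split t statistic for one simulated dataset.

```python
import math
import random
import statistics


def correlation_t(x, y):
    """t statistic for the regression slope, computed from Pearson's r."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def threshold_t(x, y, threshold):
    """Welch t statistic for y above vs. at-or-below the threshold; all subjects are used."""
    high = [b for a, b in zip(x, y) if a > threshold]
    low = [b for a, b in zip(x, y) if a <= threshold]
    se = math.sqrt(statistics.variance(high) / len(high) + statistics.variance(low) / len(low))
    return (statistics.fmean(high) - statistics.fmean(low)) / se


def simulate(mean_y, n=300, threshold=30.0, seed=0):
    """Draw hypothetical BMI-like x values; return (regression t, threshold-split t)."""
    rng = random.Random(seed)
    x = [rng.gauss(27, 5) for _ in range(n)]
    y = [mean_y(a) + rng.gauss(0, 1) for a in x]
    return correlation_t(x, y), threshold_t(x, y, threshold)


def linear(bmi):
    return 0.03 * (bmi - 27)  # made-up slope: outcome changes smoothly with BMI


def step(bmi):
    return 0.3 if bmi > 30 else 0.0  # made-up jump: outcome changes only at the cutoff


if __name__ == "__main__":
    for name, f in [("linear", linear), ("step", step)]:
        t_reg, t_split = simulate(f)
        print(f"{name}: regression t = {t_reg:.2f}, threshold-split t = {t_split:.2f}")
```

A single dataset is noisy, so looping over many values of `seed` and comparing how often each statistic exceeds the critical value would be more informative than one run.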
Edit 2, splines: As I understand it, splines are a good way to deal with a nonlinear relationship, which can make linear regression perform worse. However, what if the relationship is linear but has a very small effect size, so that the regression as a whole is not significant, but a t-test of the highest- vs. lowest-scoring subjects is? Would splines find the difference that linear regression didn't?