
I often see people first run a linear regression and, when they fail to find a significant result, proceed to a t-test comparing only the n highest- and n lowest-scoring subjects.

Is this procedure valid? It seems to me that grouping people by an arbitrarily selected threshold and throwing away information by discretizing a continuous variable should be a problem; on the other hand, it works, in the sense of finding significant differences where linear regression failed to.

Edit: I am aware of the p-hacking problem, and that is not really what I am interested in here. Would this be a problem even if it were part of a preregistered analysis protocol, so that no p-hacking is involved?

What about the situation where no data are thrown away: linear regression shows no significant result, but a t-test based on prespecified groups does? For example, a comparison of obese vs. non-obese people based on a BMI threshold is significant, but BMI as a continuous predictor in the regression is not. If the problem is loss of information, why does the test with less information give better results?
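To make this concrete, here is a minimal simulated sketch of the procedure I have in mind (the effect size, sample sizes, and cutoffs are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 80
x = rng.normal(25, 4, n)             # a BMI-like continuous predictor
y = 0.05 * x + rng.normal(0, 1, n)   # weak linear effect

# Step 1: linear regression on the full sample.
print("regression slope p:", stats.linregress(x, y).pvalue)

# Step 2: t-test comparing the 20 lowest- vs. 20 highest-scoring subjects.
order = np.argsort(x)
low, high = y[order[:20]], y[order[-20:]]
print("extreme-groups t-test p:", stats.ttest_ind(low, high).pvalue)
```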

Edit 2, splines: As I understand it, splines are a good way to deal with a nonlinear relationship, which can make linear regression perform worse. But what if the relationship is linear, only with a very small effect size, so that the regression as a whole is not significant while a t-test of the highest/lowest-scoring subjects is? Would splines find a difference that linear regression didn't?

rep_ho
  • Although the procedure as you describe it is not valid, because it smells like a way to get a "positive" result no matter what the data might say, procedures very much like this--either specified formally in an experimental protocol or as part of exploratory data analysis--are used all the time and can be justified. A sophisticated way of doing this is to use *splines* for the explanatory variable. It amounts to very nearly the same thing when the "knots" in the splines are based on quantiles of the regressors. This should give you some good search terms to investigate. – whuber May 19 '17 at 17:52
  • I added splines into my question. What if the relationship is linear, but just small. Would splines provide benefits compared to linear regression? – rep_ho May 19 '17 at 18:37
  • Splines enable you to identify and test for nonlinear relationships. As an extreme example, suppose $Y$ sits near $0$ when $X$ is small, then gradually increases to $2$ for middling values of $X$, then settles down to $1$ for the largest $X$. This might not be detectable with a linear regression, because of its departures from linearity, but the differences in $Y$ values between the smallest and largest $X$ might be evident. This is the sort of behavior splines are good at modeling. – whuber May 19 '17 at 19:02
  • Re Edit 2: You can demonstrate analytically that the linear regression is more powerful than the t-test when the underlying relation is linear. There is good intuition for this. First, the t-test is based only on a subset of the data (the values at the extremes of the regressors), which causes it to lose power. Second, it does not account for the slope. As such, it has to incorporate all variation within the random errors, thereby increasing the expected sizes of the residuals--which also loses power. In short, the t-test you describe makes sense *only* for nonlinear associations. – whuber May 19 '17 at 21:48
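To see the behavior whuber describes in action, here is a small sketch (my own construction, not from the comments; all numbers are made up). It simulates the rise-and-settle shape from the comment above and compares a straight-line fit with a linear-spline fit whose knots sit at quantiles of $x$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 150
x = rng.uniform(0, 10, n)
# Rise-and-settle mean: near 0 for small x, up to about 2 mid-range,
# settling near 1 for the largest x.
mu = np.interp(x, [0, 3, 5, 7, 10], [0, 0, 2, 1, 1])
y = mu + rng.normal(0, 1.5, n)

# Straight-line fit: the slope test can easily miss this shape.
print("linear slope p:", stats.linregress(x, y).pvalue)

# Linear spline: truncated-line basis with knots at quantiles of x.
knots = np.quantile(x, [0.25, 0.50, 0.75])
X = np.column_stack([np.ones(n), x] + [np.clip(x - k, 0, None) for k in knots])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_full = np.sum((y - X @ beta) ** 2)
rss_null = np.sum((y - y.mean()) ** 2)
df1, df2 = X.shape[1] - 1, n - X.shape[1]
F = ((rss_null - rss_full) / df1) / (rss_full / df2)
print("spline overall F-test p:", stats.f.sf(F, df1, df2))
```

Exact p-values vary with the seed and the noise level, but the spline model reliably picks up the nonlinear pattern that the straight line averages away.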

3 Answers


This is not valid. People are only moving on to the t-test because the regression failed to yield a significant result. Andrew Gelman refers to these choices as the "garden of forking paths," and if a researcher does enough things to the data in search of p < .05, the Type I error rate will be greatly inflated.

Dichotomizing a continuous variable is likely not a good idea, but that is not quite what this is. It sounds like a researcher using only the subjects that will confirm their hypothesis: they are not deciding how to choose those subjects a priori, but post hoc, after looking at what the data will give them. That is p-hacking and must be avoided.

Mark White
  • Thanks. I edited my question. Assuming it was done without p-hacking, is loss of information really the problem if the procedure with less information gives better results? – rep_ho May 19 '17 at 18:22
  • Could you please comment on the comment on the use of splines by @whuber? – Hans May 19 '17 at 18:23
  • I'm afraid I don't know enough about splines to comment on what @whuber said, but I would be interested in reading about it, if they have a useful citation or something I could check out. – Mark White May 19 '17 at 18:29

Since you raise pre-registering a procedure in the comments, I thought I'd post a brief discussion related to it. Let's imagine you can avoid all of the pitfalls in the Garden of forking paths paper linked in another answer.

So we're at the point of choosing which procedure to use. In what follows I'll make a number of simplifying assumptions (though generally speaking the conclusions carry much more broadly). First, for simplicity, I'll restrict consideration to the case of a single predictor (independent variable). Let's consider two possible situations:

  1. In this situation you're able to choose your design -- that is, what values the predictor (independent variable) will take. If you believe before getting the data that the relationship is weak (so power matters) and linear, the choice is between linear regression and a t-test, after choosing how to place the values taken by the predictor.

    If those values are placed into two extreme groups in such a way as to give the two-group t-test the best possible power, then the $x$'s will sit at some extreme possible low value $x_L$ and extreme possible high value $x_U$ with nothing in between (and no ability to assess the linearity assumption). The resulting choice between tests is simple: the two-group t-test and the test of the regression slope are exactly the same -- they have the same t-statistic (the sketch after this list checks this numerically). So nothing is gained here by doing anything but regression.

  2. In this second situation you don't choose what values your predictor takes, or for other reasons you can't place them in a way that maximizes power for the t-test. If you believe before getting the data that the relationship is weak and linear ($y=\alpha+\beta x +\epsilon$) and the usual regression assumptions hold, then the choice between linear regression and a t-test -- splitting the data into three groups (not necessarily of equal size, but with sizes specified before seeing the data), leaving out the middle group, and comparing the outer pair -- is simple: the linear regression has more power.

    Let us see why. Note that instead of doing the t-test on the difference in the two group means ($\bar y_U-\bar y_L$), if we have the $x$ values we can compute the expected difference under the linearity assumption; it is some multiple of $\beta$ (which multiple depends on the $x$'s we have). As a result we can scale the difference (as a function of the $x$-values in the two groups) so that the scaled difference has expected value $\beta$, the population slope of the regression line. This won't change the t-ratio at all, since it scales both the numerator and the denominator of the t-statistic by the same factor.

    [I will assume for simplicity that the placement of the $x$'s is symmetric, either by design or because the $x$'s are drawn at random from some symmetric distribution (so we won't know the $x$-values until we draw the sample). In this situation it is best for the two groups to contain the same number of observations. I'll also assume that $n$ is even. Neither assumption is necessary, but they simplify the discussion.]

    So now we're considering two different estimators of $\beta$, both linear in the $y$'s. One is the usual least squares estimate ($\hat \beta$); the other is the scaled mean difference, with some chosen number of observations in each of the two extreme groups.

    [If the IV is uniformly placed over some range, the choice of how many points to include in each group is an old, solved problem: it turns out to be roughly $\frac13$ of the data in each extreme group. If the $x$'s are normally distributed it's smaller -- I believe about 27%. If the design points are mostly near the ends it's higher, until in the extreme situation we're back in case 1 above, with 50% in each group. The power peaks are pretty flat, so it's not critical exactly which value you use.]

    Immediately we can apply the Gauss-Markov theorem and conclude that the linear regression will outperform the two-group procedure: you have two linear unbiased estimators with the same expected value, but the least-squares one has the smaller standard error (and so gives more power).

    [With group proportions close to the optimal one, the power gets pretty close to that of the linear regression, but not so close that you'd regard it as basically a toss-up; the simulation sketch below illustrates the gap.]
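Here is a small simulation sketch (my own, not part of the argument above; all numbers are arbitrary) checking both cases: in case 1 the slope test and the pooled two-sample t-test coincide exactly, and in case 2 the full regression has more power than the outer-thirds t-test, though the gap near the optimal split is modest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Case 1: x takes only two extreme values. The regression slope test and
# the pooled two-sample t-test then give identical p-values.
x2 = np.repeat([0.0, 1.0], 15)
y2 = 0.8 * x2 + rng.normal(0, 1, x2.size)
print(stats.linregress(x2, y2).pvalue,
      stats.ttest_ind(y2[x2 == 1.0], y2[x2 == 0.0]).pvalue)

# Case 2: x uniform on [0, 1], weak linear effect. Compare the power of
# the full-data slope test with a t-test on the outer thirds of the
# x-range (cutoffs fixed before seeing any data).
n, beta, sigma, reps, alpha = 90, 0.8, 1.0, 5000, 0.05
rej_reg = rej_t = 0
for _ in range(reps):
    x = rng.uniform(0, 1, n)
    y = beta * x + rng.normal(0, sigma, n)
    rej_reg += stats.linregress(x, y).pvalue < alpha
    lo, hi = y[x < 1/3], y[x > 2/3]
    rej_t += stats.ttest_ind(lo, hi).pvalue < alpha

print("power, regression slope test:", rej_reg / reps)
print("power, outer-thirds t-test:  ", rej_t / reps)
```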

If you see someone choose the t-test route and get a significant result that would not have been obtained with a full linear regression, then either they were just really lucky with the way their data turned out, or you have to wonder whether the procedure was really so "fully pre-registered" after all.

There's another situation that modifies this discussion somewhat - the errors-in-variables case ($x$'s observed with error, sometimes called Model II regression). In this case, ordinary regression is neither optimal nor unbiased and should not be used. That would be a comparison for a different question, though.

Glen_b

I would comment if I could, but you may find "What is the benefit of breaking up a continuous predictor variable?" useful -- Scorthi's answer in particular.

Also see http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous for a list of problems caused by categorising a continuous variable.

To me, what you describe sounds like "p-hacking", and it results in a loss of information.