
I have a regression model y = a + b * x, where both y and x are continuous, and I've found that the coefficient of x, b, is statistically significant.

Now I want to test whether the coefficient of x differs between small and large values of x if I partition x into two groups at a cutoff. Specifically, how can I test whether b1 and b2 below are statistically different?

y1 = a1 + b1 * x, for x < a cutoff value of x (small values of x)

y2 = a2 + b2 * x, for x >= a cutoff value of x (large values of x)

I found that the method discussed in the Stata FAQ on testing the equality of coefficients may be helpful: https://www.stata.com/support/faqs/statistics/test-equality-of-coefficients/


However, I am not sure whether it is OK that my two outcomes (the FAQ's y and z) are essentially the same variable: y1 for small values of x and y2 for large values of x.

Also, one answer to the question How to compare two regression slopes for one predictor on two different outcomes? said the dependent variables in the method above, namely y and z, need to be independent. If that is true, how can I test whether my y1 and y2 are independent of each other?

whuber
mokusei
  • Would you please post the data, or a link to the data? – James Phillips Nov 11 '19 at 23:33
  • See posts with the [tag:change-point] tag. – whuber Nov 12 '19 at 15:10
  • @JamesPhillips the data can be accessed via this link https://docs.google.com/spreadsheets/d/1ffb2rNZQpUHiB6OmqIXoz0riz4Qyyh2Dnq1RxEycj68/edit?usp=sharing Thanks! – mokusei Nov 12 '19 at 16:45
  • @whuber Thanks for the reply! But I am more concerned about whether I can set a cutoff value k for the continuous variable x and then run a regression like y = a + b * x + c * (dummy of x), where the dummy indicates x < k vs. x >= k. Also, a lot of articles and papers I found suggest it is problematic to categorize a continuous variable http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous – mokusei Nov 12 '19 at 16:55
  • You are asking to identify a single changepoint in a regression. This is not a matter of converting a continuous variable into a category: $x$ still enters directly into the model. – whuber Nov 12 '19 at 17:19
  • I see a potential problem. There would be two different straight lines, "y = bx + a" and "y = bx + a + c", which means that at the cutoff value k the lines will be disjoint, with a gap between them. Is this acceptable, or are you looking for a "broken stick" model where both lines are equal at the value of k? – James Phillips Nov 12 '19 at 18:21
  • @whuber But is it OK to have the continuous variable x and a dummy of x in the same regression model? Is the regression y = a + b * x + c * (dummy of x) correct (with the dummy of x determined by the cutoff value k)? More importantly, if in the regression y = a + b * x + c * (dummy of x) + d * x * (dummy of x) I find c and d are not statistically significant, can I say that the coefficients of x are not different for x < k and x >= k? Thanks! – mokusei Nov 12 '19 at 19:45
  • @JamesPhillips Actually, I am more interested in whether I can say the coefficients of x are different for x < k and x >=k based on the regression model y = a + b * x + c * dummy of x + d * x * dummy of x, and the dummy of x is categorized by x < k and x >= k. For example, if I find c and d are not statistically significant, can I say that the coefficients of x are not different for x < k and x >=k? Thanks! – mokusei Nov 12 '19 at 19:51
  • Absolutely. Indeed, there's practically no way to exclude this possibility, because it just might happen in a dataset of $(x,y,z)$ values that in every case $z=\mathcal{I}(x\ge k)$ for some number $k.$ There's no difference between using the values of $z$ you collected and using the values as computed from $x.$ As far as your example goes, (1) you should use an $F$ test to evaluate $c$ and $d$ simultaneously and (2) if you estimated $k$ from the data, you need a test that treats $k$ as a parameter, too. (This will make all coefficients less significant.) – whuber Nov 12 '19 at 19:52
  • Have you considered threshold models? – adunaic Nov 13 '19 at 21:59
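The interaction-model approach discussed in the comments above can be sketched as follows: fit the full model y = a + b*x + c*D + d*(x*D), with D = 1 when x >= k, against the restricted model y = a + b*x, and test c = d = 0 jointly with an F test. This is a minimal simulation in Python with numpy; the data, cutoff k, and coefficient values are hypothetical, and k is assumed fixed in advance rather than estimated from the data (estimating k would require treating it as a parameter, as noted in the comments).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
k = 5.0                          # cutoff chosen in advance (not estimated)
D = (x >= k).astype(float)
# simulated data: the slope is 1.0 below the cutoff and 2.0 above it
y = 3.0 + 1.0 * x + 1.0 * D * (x - k) + rng.normal(0, 1, n)

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
X_restricted = np.column_stack([ones, x])        # y = a + b*x
X_full = np.column_stack([ones, x, D, x * D])    # + c*D + d*(x*D)

rss_r = rss(X_restricted, y)
rss_f = rss(X_full, y)
q = 2                                  # restrictions tested: c = 0 and d = 0
df2 = n - X_full.shape[1]
F = ((rss_r - rss_f) / q) / (rss_f / df2)
print(f"F({q}, {df2}) = {F:.2f}")      # a large F means the fit improves when
                                       # the slope may change at the cutoff
```

Compare F to an F(2, n - 4) reference distribution for a p-value; if k were estimated from the data, this reference distribution would no longer be valid.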

1 Answer


Using the data linked to in the comments, you can do this with the R package mcp:

Define a model with two linear segments:

model = list(
  y ~ 1 + x,
  ~ 1 + x
)

Now sample it (setting adapt high to reach convergence) and test the hypothesis that the first slope is greater than the second. I get an evidence ratio (Bayes factor) of around 1.5.

library(mcp)
fit = mcp(model, data = df, adapt = 4000)
hypothesis(fit, "x_1 > x_2")

Note that:

  • This dataset contains very little information about such a change point, so it is hard to identify and the model fit is poor (poor convergence between chains). To help it along, you could update the priors to better represent what you know about the data, e.g., if the change point is known to occur in a certain interval or if the slopes are known to be positive.

  • Set segment 2 to ~ 0 + x if the slopes are joined.

Read more on the mcp website.

Jonas Lindeløv