1

I would like to do a segmented regression (or breakpoint or piecewise regression) with lines connecting at the breakpoint $c$. So
(a) $y = a_1 + b_1x$ for $x \leq c$
(b) $y = a_2 + b_2x$ for $x > c$
(c) $a_1 + b_1c = a_2 + b_2c$ (for lines to connect at $c$)

describe the set of equations. With (c) we can rewrite (b) as $y = a_1 + (b_1 - b_2)c + b_2x$

In one equation it gives: $$y = a_1 + b_1x + (b_2 - b_1)(x - c)\mathbb{1}_{x>c}$$ with $\mathbb{1}_{x>c}$ indicator function or dummy variable $=1$ if $x>c$ and $=0$ if $x \leq c$.

It can be rewritten as: $$y = a_1 + b_1x + (b_1 - b_2)c\mathbb{1}_{x>c} +(b_2 - b_1)x\mathbb{1}_{x>c}$$

In R:

set.seed(1)
n <- 100
time <- 1:n
breakpoint <- 20
error <- rnorm(length(t), 0, 10)
a1 <- 20; b1 <- 8
a2 <- 33; b2 <- 2
y1 <- a1 + b1 * time[1:breakpoint] + error[1:breakpoint]
y2 <- a2 + breakpoint * (b1 - b2) + b2 * time[(breakpoint+1):n] +
  error[(breakpoint+1):n]
df <- data.frame(time=time, y = c(y1,y2))
df[1:breakpoint, "group"] <- "0"
df[(breakpoint+1):n, "group"] <- "1"

Using the first equation, model m1 looks correct. (Intercept) estimates $a_1$, time estimates $b_1$ and I(time - breakpoint):group0 estimates $b_2 - b_1$. We can find estimates of equations (a) and (b).

And why there is this 2nd estimate for the interaction? Because there is no main effect? And here also, group1 is the reference. Usually it is by alphabetic order. Why? So estimation of $a_1$ by (Intercept) is inverted but doesn't matter here? Right? The plot is correct.

m1 <- lm(y ~ time + I(time - breakpoint):group, data=df)

> m1$coef
                (Intercept)                        time 
                 150.612717                    2.045662 
I(time - breakpoint):group0 I(time - breakpoint):group1 
                   6.752500                          NA 


plot(df$time, df$y)
abline(a = m1$coef[1], b = m1$coef[2], col="hotpink")
abline(a = m1$coef[1] - m1$coef[3]*breakpoint, b=m1$coef[2] + m1$coef[3], col="seagreen")
abline(v=breakpoint)

Using the second equation. (Intercept) estimates $a_1$, time estimates $b_1$ and group1 estimates $(b_1 - b_2)c$ and time:group1 estimates b_2 - b_1. We can find estimates of equations (a) and (b). But that's wrong. To me m2 is the model without lines connecting. So I'm puzzled. What am I doing wrong?

m2 <- lm(y ~ time * group, data=df)

> m2$coef
(Intercept)        time      group1 time:group1 
  19.639072    8.215825  133.824511   -6.208863 

plot(df$time, df$y)
abline(a = m2$coef[1], b = m2$coef[2], col="hotpink")
abline(a = m2$coef[1] + m2$coef[3], b = m2$coef[2] + m2$coef[4], col="seagreen")
abline(v=breakpoint)

EDIT:

I think it must be linked to the fact that: $$y = a_1 + b_1x + (b_1 - b_2)c\mathbb{1}_{x>c} +(b_2 - b_1)x\mathbb{1}_{x>c}$$ can be rewritten as: $$y = \beta_0 + \beta_1x + \beta_2\mathbb{1}_{x>c} + \beta_3x\mathbb{1}_{x>c}$$ where $\beta_3$ is completely determined by $\beta_2$, as $\beta_2 = -c\beta_3$. So it should mess it up somewhere.

  • [Here](https://stats.stackexchange.com/questions/61805/standard-error-of-slopes-in-piecewise-linear-regression-with-known-breakpoints) is an example that fits such a model in R, maybe it'll help. – COOLSerdash Mar 05 '21 at 20:49
  • Thanks for the link. However I am trying to fit the model using an interaction term with a dummy variable. And this is where I get into trouble when modeling the two formulations. – user3631369 Mar 08 '21 at 10:19

1 Answers1

1

I was thinking too much.

$y_1 = a_1 + b_1x$ for $x \leq c_1$ (1)

$y_2 = a_2 + b_2x$ for $c_1 < x \leq c_2$ (2)

$y_3 = a_3 + b_3x$ for $x > c_2$ (3)

We need $a_1 + b_1c_1 = a_2 + b_2c_1$ and $a_2 + b_2c_2 = a_3 + b_3c_2$ for the lines to connect. So intercepts for (1) and (2) are $a_2 = a_1 + (b_1-b_2)c_1$ and $a_3 = a_2 + (b_2-b_3)c_2 = a_1 + (b_1-b_2)c_1 + (b_2-b_3)c_2$, respectively.

In R:

First, create dataset

set.seed(1)
n <- 100
time <- 1:n
bp1 <- 20
bp2 <- 60
err <- rnorm(length(time), 0, 10)
a1 <- 20; b1 <- 8
a2 <- 33; b2 <- 2
a3 <- 61; b3 <- -1

y1 <- a1 +                                     b1 * time[1:bp1] + err[1:bp1]
y2 <- a1 + bp1 * (b1 - b2) +                   b2 * time[(bp1+1):bp2] + err[(bp1+1):bp2]
y3 <- a1 + bp1 * (b1 - b2) + bp2 * (b2 - b3) + b3 * time[(bp2+1):n] + err[(bp2+1):n]
df <- data.frame(group=factor(rep(c(0,1,2), c(bp1, bp2-bp1, n-bp2))),
                 time=c(time[1:bp1], time[(bp1+1):bp2], time[(bp2+1):n]),
                 y = c(y1, y2, y3))

Then, run the model, get estimates from above-mentioned equations and plot the results

m1 <- lm(y ~ time*group, data=df)

b1 <- m1$coef[2]
b2 <- m1$coef[2] + m1$coef[5]
b3 <- m1$coef[2] + m1$coef[6]
a1 <- m1$coef[1]
a2 <- a1 + (b1-b2) * bp1
a3 <- a2 + (b2-b3) * bp2

plot(df$time, df$y)
abline(a = a1, b = b1, col = "hotpink")
abline(a = a2, b = b2, col = "seagreen")
abline(a = a3, b = b3, col = "dodgerblue")
abline(v=bp1)
abline(v=bp2)

Plot