I would like to do a segmented regression (or breakpoint or piecewise regression) with lines connecting at the breakpoint $c$. So
(a) $y = a_1 + b_1x$ for $x \leq c$
(b) $y = a_2 + b_2x$ for $x > c$
(c) $a_1 + b_1c = a_2 + b_2c$ (for lines to connect at $c$)
describe the set of equations. With (c) we can rewrite (b) as $y = a_1 + (b_1 - b_2)c + b_2x$
In one equation it gives: $$y = a_1 + b_1x + (b_2 - b_1)(x - c)\mathbb{1}_{x>c}$$ with $\mathbb{1}_{x>c}$ indicator function or dummy variable $=1$ if $x>c$ and $=0$ if $x \leq c$.
It can be rewritten as: $$y = a_1 + b_1x + (b_1 - b_2)c\mathbb{1}_{x>c} +(b_2 - b_1)x\mathbb{1}_{x>c}$$
In R:
set.seed(1)
n <- 100
time <- 1:n
breakpoint <- 20
error <- rnorm(length(t), 0, 10)
a1 <- 20; b1 <- 8
a2 <- 33; b2 <- 2
y1 <- a1 + b1 * time[1:breakpoint] + error[1:breakpoint]
y2 <- a2 + breakpoint * (b1 - b2) + b2 * time[(breakpoint+1):n] +
error[(breakpoint+1):n]
df <- data.frame(time=time, y = c(y1,y2))
df[1:breakpoint, "group"] <- "0"
df[(breakpoint+1):n, "group"] <- "1"
Using the first equation, model m1
looks correct. (Intercept)
estimates $a_1$, time
estimates $b_1$ and I(time - breakpoint):group0
estimates $b_2 - b_1$. We can find estimates of equations (a) and (b).
And why there is this 2nd estimate for the interaction? Because there is no main effect? And here also, group1 is the reference. Usually it is by alphabetic order. Why? So estimation of $a_1$ by (Intercept)
is inverted but doesn't matter here? Right? The plot is correct.
m1 <- lm(y ~ time + I(time - breakpoint):group, data=df)
> m1$coef
(Intercept) time
150.612717 2.045662
I(time - breakpoint):group0 I(time - breakpoint):group1
6.752500 NA
plot(df$time, df$y)
abline(a = m1$coef[1], b = m1$coef[2], col="hotpink")
abline(a = m1$coef[1] - m1$coef[3]*breakpoint, b=m1$coef[2] + m1$coef[3], col="seagreen")
abline(v=breakpoint)
Using the second equation. (Intercept)
estimates $a_1$, time
estimates $b_1$ and group1
estimates $(b_1 - b_2)c$ and time:group1
estimates b_2 - b_1
. We can find estimates of equations (a) and (b). But that's wrong. To me m2
is the model without lines connecting. So I'm puzzled. What am I doing wrong?
m2 <- lm(y ~ time * group, data=df)
> m2$coef
(Intercept) time group1 time:group1
19.639072 8.215825 133.824511 -6.208863
plot(df$time, df$y)
abline(a = m2$coef[1], b = m2$coef[2], col="hotpink")
abline(a = m2$coef[1] + m2$coef[3], b = m2$coef[2] + m2$coef[4], col="seagreen")
abline(v=breakpoint)
EDIT:
I think it must be linked to the fact that: $$y = a_1 + b_1x + (b_1 - b_2)c\mathbb{1}_{x>c} +(b_2 - b_1)x\mathbb{1}_{x>c}$$ can be rewritten as: $$y = \beta_0 + \beta_1x + \beta_2\mathbb{1}_{x>c} + \beta_3x\mathbb{1}_{x>c}$$ where $\beta_3$ is completely determined by $\beta_2$, as $\beta_2 = -c\beta_3$. So it should mess it up somewhere.