Simple regression models for data with a breakpoint

Question

I am currently working on a segmented regression model with two variables $(x_i, y_i)$ for $i = 1, \dots, N$. The model should take the form:

$y_i = \beta_0 + \beta_1 x_i \quad$ for $x_i < x_{crit}$

$y_i = \alpha_0 + \beta_0 + \beta_1 x_i \quad$ for $x_i \geq x_{crit}$

where the breakpoint $x_{crit}$ and the coefficients $\alpha_0$, $\beta_0$ and $\beta_1$ all have to be determined from the data.

I have seen similar ideas in statistics and econometrics papers (e.g. segmented regression models and regression discontinuity designs), but I would like some feedback on what kind of model I should use.

Ideally, I would like to use a well-documented framework where I can get some sort of confidence interval on the breakpoint $x_{crit}$.

score 6 · Accepted Answer · answered Mar 17 '12 at 20:42

This will be an R-centric answer. One approach is to wrap the call to lm in a function that takes the breakpoint as an argument and fits a regression conditional on that breakpoint, then to minimize the deviance of the fitted model by iterating over the possible values of the breakpoint. This maximizes the profile log-likelihood for the breakpoint. In general (i.e., not just for this problem), if the function inside the breakpoint loop (lm in this case) finds maximum likelihood estimates conditional on the parameter passed to it, the whole procedure finds the joint maximum likelihood estimates of all the parameters.
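
In symbols, a sketch of that argument, assuming Gaussian errors with a known variance $\sigma^2$ (the symbols $k$ for the breakpoint, $\theta$ for the regression coefficients, and $\mathrm{RSS}(k)$ for the residual sum of squares at breakpoint $k$ are introduced here for illustration and do not appear in the original answer):

$$\ell_p(k) = \max_{\theta} \ell(\theta, k), \qquad -2\,\ell_p(k) = \frac{\mathrm{RSS}(k)}{\sigma^2} + \text{const}, \qquad \hat{k} = \arg\max_k \ell_p(k) = \arg\min_k \mathrm{RSS}(k).$$

For a model fitted with lm, deviance() returns $\mathrm{RSS}(k)$, so the breakpoint with the smallest deviance is also the one with the largest profile log-likelihood.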

For example:

# True model: y = a + b*(obs. no >= shift) + c*x
# a = 0, b = 1, c = 1, shift at observation 31

# Construct sample data
x <- rnorm(100)
shift <- c(rep(0,30),rep(1,70))
y <- shift + x + rnorm(100)

# Find deviance conditional upon breakpoint
lm.shift <- function(y, x, shift.obs) {
  shift.var <- c(rep(0, (shift.obs-1)), rep(1, length(y)-shift.obs+1))
  deviance(lm(y~x+shift.var))
}

# Find deviance of all breakpoint values 
dev.value <- rep(0, length(y))
for (i in 1:length(y)) {
  dev.value[i] <- lm.shift(y, x, i)
}

# Calculate profile-ll based confidence interval
estimate <- which.min(dev.value)
profile.95.dev <- min(dev.value) + qchisq(0.95,1)
est.lb.95 <- max(which(dev.value[1:estimate] > profile.95.dev))
est.ub.95 <- estimate - 1 + min(which(dev.value[estimate:length(y)] > profile.95.dev))

> estimate
[1] 30
> est.lb.95
[1] 28
> est.ub.95
[1] 33

So our estimate is 30 with a 95% confidence interval of 28 - 33. Pretty tight, but that was a pretty big shift relative to the standard deviation of the error term too.

Note some messiness is involved in calculating the profile log-likelihood based confidence interval, but the basic idea is to find the largest index less than the estimate with a deviance greater than the cutoff level for the lower bound and the smallest index larger than the estimate with a deviance greater than the cutoff level for the upper bound.
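
The same search can also be run over a threshold in x instead of an observation index, which is closer to the $x_{crit}$ form in the question. What follows is a minimal sketch under stated assumptions, not the answer's own method: the function name fit_breakpoint, the seed, and the simulated data are all hypothetical. It also differs from the code above in two ways. It uses n * log(RSS) as the profile statistic, so the error variance is estimated rather than implicitly taken to be 1, and it reports the smallest and largest candidates that fall inside the cutoff rather than the nearest ones outside it. The chi-square cutoff is carried over from the answer and should be treated as approximate.

```r
# Hypothetical helper: profile search over a threshold in x (not an observation index)
fit_breakpoint <- function(x, y, level = 0.95) {
  n <- length(y)
  # Candidate thresholds are the observed x values; drop the smallest
  # so that the "below threshold" regime is never empty
  cand <- sort(unique(x))[-1]
  # Residual sum of squares of the fit conditional on each candidate threshold
  rss <- vapply(cand, function(k) {
    above <- as.numeric(x >= k)
    deviance(lm(y ~ x + above))
  }, numeric(1))
  # Profile -2 log-likelihood relative to its minimum, error variance estimated
  prof <- n * log(rss / min(rss))
  inside <- which(prof <= qchisq(level, df = 1))
  list(estimate = cand[which.min(rss)],
       lower = cand[min(inside)],
       upper = cand[max(inside)])
}

set.seed(1)  # assumed seed, for reproducibility only
x <- rnorm(200)
y <- 1 * (x >= 0.5) + x + rnorm(200, sd = 0.3)  # assumed true threshold at x = 0.5
res <- fit_breakpoint(x, y)
print(res)
```

Because the candidates are the observed x values, the estimate can only land on a data point; a finer grid between observations would not change the fit, since the indicator only changes when the threshold crosses an observation.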

You really should plot the deviance curve, just to make sure you don't have multiple local minima that are nearly as good as each other, which might tell you something interesting about the assumed model (or the data):

[Figure: deviance plotted against the candidate breakpoint]
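
A plot of this kind could be produced along the following lines. This is a sketch using base R graphics; the seed is an assumption added for reproducibility (the original example did not set one), so the numbers will not match the output shown earlier.

```r
set.seed(42)  # assumed seed; not part of the original example
x <- rnorm(100)
shift <- c(rep(0, 30), rep(1, 70))
y <- shift + x + rnorm(100)

# Deviance (residual sum of squares) for a break at each observation index
dev.value <- sapply(seq_along(y), function(i) {
  shift.var <- as.numeric(seq_along(y) >= i)
  deviance(lm(y ~ x + shift.var))
})

# Same cutoff as in the answer's confidence interval calculation
cutoff <- min(dev.value) + qchisq(0.95, 1)

plot(dev.value, type = "l",
     xlab = "Candidate breakpoint (observation index)",
     ylab = "Deviance")
abline(h = cutoff, lty = 2)                  # confidence cutoff
abline(v = which.min(dev.value), lty = 3)    # estimated breakpoint
```

Several dips below the dashed line that are separated by stretches above it would be the sign of competing local minima.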

@jbowman So I get it: you assume a model form, that the errors from this model are independent with constant mean and constant variance, and that the coefficients of the model are invariant over the observation set. With all of that set in stone you definitely are getting the break point correct. — IrishStat, Mar 17 '12 at 22:33
@IrishStat - yes, assume, assume, assume. But you could relax some assumptions too, just change the code in the lm.shift function some. I should have spelled out all the assumptions, true (+1). — jbowman, Mar 18 '12 at 01:36
But the question at hand is what assumptions should I relax or enforce. As usual the Devil is in the details. — IrishStat, Mar 18 '12 at 19:50
I wish that you had set a seed because this would be more reproducible. I think you mean: est.ub.95 <- estimate - 1 + min(which(dev.value[estimate:length(y)] > profile.95.dev)) — EngrStudent, Jan 26 '15 at 14:16

score 2 · Answer 2 · edited Apr 13 '17 at 12:44

This is an example of detecting a change in intercept ($\beta_0$ in your notation), sometimes referred to as a level shift or step shift. This often occurs in time series data where a variable in the model is impacted by a 0,0,0,0,0,0,1,1,1,1,1,1,... phenomenon at or around some unknown arbitrary point. It is referred to as Intervention Detection, as the break point (intervention) is found (detected) by trial and error, i.e. a search process. If your data is not a time series, it is possible to use a time series package that identifies interventions while specifying a frequency of "1" and disabling any ARMA structure, thus yielding the model that you require. If your data is a time series, as might be expected, then you need to consider Intervention Detection in the presence of ARIMA structure and PDLs (ADLs) in the user-suggested input/causal series. If you wish to post your data, I will demonstrate this to you and the list. Additionally, you might look at "Outlier detection for generic time series" and/or www.forecastingsolutions.com/publications/Introducing_cart.pdf

score 1 · Answer 3 · answered Mar 17 '12 at 18:17

Sounds like you want a spline regression with a single knot. In SAS, see PROC TRANSREG. In R, see (e.g.) the splines package.
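
As a rough illustration of the R route, here is a minimal sketch of a single-knot linear spline using bs() from the splines package that ships with R. The knot location, the seed, and the simulated data are assumptions made for the example. Note that a linear spline is continuous at the knot, so it captures a change in slope rather than the jump in level written in the question, and the knot is supplied here rather than estimated.

```r
library(splines)  # ships with base R

set.seed(1)  # assumed seed, for reproducibility only
x <- sort(runif(100, -2, 2))
# Simulated data with a slope change at x = 0 (assumed for illustration)
y <- ifelse(x < 0, x, 3 * x) + rnorm(100, sd = 0.2)

knot <- 0  # assumed known; bs() does not estimate knot locations
# degree = 1 gives a piecewise-linear fit that is continuous at the knot
fit <- lm(y ~ bs(x, knots = knot, degree = 1))
print(coef(fit))
```

To estimate the knot as well, the same profile idea applies: refit over a grid of candidate knots and keep the one with the smallest deviance.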

Peter Flom

like this? http://stats.stackexchange.com/questions/7316/setting-knots-in-natural-cubic-splines-in-r/7317#7317 – EngrStudent Jan 26 '15 at 14:10
