
I've been playing around with the package `strucchange` (and to some extent `segmented`) in R. I'm trying to determine whether there are changes in slope in a linear regression and, more importantly, how many breakpoints there are. A toy dataset:

x <- c(0, 5, 10, 15, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)
y <- c(-84.16, -86.67, -87.74, -86.07, -89.15, -91.90, -93.64, -95.92,
  -95.96, -99.19, -100.73, -107.29, -106.10, -107.29)

First problem: if I use the breakpoints() function:

breakpoints(y ~ x, data = data.frame(x, y))

I get the following error:

Error in breakpoints.formula(y ~ x, data = data.frame(x, y)) : 
minimum segment size must be greater than the number of regressors

I think this arises because the default h parameter of breakpoints() is 0.15, and 14 (the number of observations I have) * 0.15 = 2.1, which, rounded down, is not greater than 2, the "number of regressors". (Incidentally, I would have thought the number of regressors would be 1 given my formula, but I've learned from other working examples of y ~ x that it is 2 in these cases; I guess the intercept counts as a regressor?)
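As a quick sanity check of that arithmetic (plain R, no package needed; the variable names are only illustrative):

n    <- 14         # number of observations
nreg <- 2          # intercept + slope
floor(n * 0.15)    # default fractional h -> 2, which is not greater than nreg
floor(n * 3/14)    # an h of at least 3/14 (or the integer h = 3) gives segments of >= 3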

If I set h to 3 or some fraction such that 14 * h >= 3, the command works.

breakpoints(y ~ x, data = data.frame(x, y), h = 3)

Two breakpoints are returned, but the result is sensitive to h: if instead I use

breakpoints(y ~ x, data = data.frame(x, y), h = 4)

I get a different solution. In the latter case a single optimal break is found, because the minimum number of observations required before a break can be called is higher. Is there a way to determine whether one solution has more support than the other? In other words, how best to optimize not the position of the breakpoints but their number (perhaps across values of h)? I think the Fstats command might be the key, but I'm having a lot of trouble understanding its help page...
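For what it's worth, this is roughly the kind of call I've been trying to piece together from the examples in ?Fstats, though I'm not sure it addresses the multiple-breakpoint case at all:

fs <- Fstats(y ~ x, data = data.frame(x, y), from = 3)  # start testing once 3 observations are available
plot(fs)      # F statistics for each candidate single breakpoint
sctest(fs)    # supF test for the presence of (at least) one structural change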

    `breakpoints` is not a standard `R` function. Whose implementation or package are you using? What is its underlying statistical method? – whuber Jul 23 '15 at 14:22
  • @whuber it would presumably be the one in the package `strucchange` that the OP mentioned in the opening sentence, though I agree that the method it's using should definitely be mentioned in the question. – Glen_b Jul 23 '15 at 20:24
  • 1
    @Glen_b Thanks for pointing that out. I think I had read "strucchange" and "segmented" as typographical errors for "structural change," etc., because they were not quoted (or formatted) to suggest they were names of anything. – whuber Jul 23 '15 at 21:06

1 Answer


Some remarks:

  • You need to estimate two parameters in each segment (intercept and slope). Hence breakpoints() requires that there are at least three observations in each segment; otherwise you cannot estimate the parameters (without getting a perfect fit).

  • But three is the minimal value that is technically possible. Whether or not it leads to meaningful results is a different question. Usually, you probably wouldn't use a regression model for just three observations, would you?

  • Therefore the task of identifying 2 breakpoints plus 6 regression coefficients (three intercepts and three slopes) from just 14 observations is really challenging. Without using additional prior knowledge it is probably hard to argue that this does not overfit the data.

  • If you set the minimal segment size to 5 (or higher) you cannot estimate 2 breakpoints on 14 observations. Hence, h = 3 and h = 4 are the only settings that would allow 2 breakpoints. The reason that the latter prefers only 1 breakpoint is that it is no longer possible to group the last three observations into their own segment.

  • More extensive documentation than the manual page can be found in the third reference in citation("strucchange"), which discusses practical problems in breakpoint estimation.

To compare the 1- and 2-breakpoint solutions you can first estimate the breakpoints:

library("strucchange")
bp <- breakpoints(y ~ x, h = 3)

And then you can visualize the fits:

plot(y ~ x, pch = 19)
lines(fitted(bp, breaks = 1) ~ x, col = 4, lwd = 1.5)
lines(fitted(bp, breaks = 2) ~ x, col = 2, lwd = 1.5)

[scatterplot of the data with the fitted 1-breakpoint (blue) and 2-breakpoint (red) models]

As remarked above: From a purely data-driven perspective, I would probably consider both models overfitting. But maybe one or the other model has a plausible interpretation for you.
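Should a more formal comparison nevertheless be desired, the summary() and plot() methods for the fitted "breakpointsfull" object report the residual sum of squares and BIC for every number of breakpoints, so the 0-, 1-, and 2-break solutions can at least be put side by side (keeping in mind the caveats above about only having 14 observations):

summary(bp)   # optimal partitions plus RSS and BIC for 0, 1, and 2 breakpoints
plot(bp)      # BIC and RSS as a function of the number of breakpoints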

  • Thank you Achim for your response. This clarifies my understanding. I guess a bigger question then is whether breakpoint estimation can be used in the way I am hoping: as a way to describe the complexity of the shape of the relationship between two variables, and whether the models that I'm fitting to my data (e.g. linear, four-parameter logistic, etc.) are too simplistic to describe the number of changes... – user2414840 Jul 29 '15 at 06:57
  • In principle, this strategy could be applied - if a reasonable number of observations were available. With 14 observations, however, it's already challenging enough to estimate these models. Hence, my impression is that additional model selection (say by information criteria) is asking a bit too much. – Achim Zeileis Jul 29 '15 at 08:20