
I have a data set in which the plot of one variable against the other resembles a cubic shape: it rises to some point and then falls to a steady level, without rising again. I know in which cases to use log-linear, log-lin, lin-log, and reciprocal or log-reciprocal linear models, but I am not sure what to do here (I have tried all of the above and, not surprisingly, they turned out to be a bad fit). Is there any linear model that would help me in this case?

  • Did you not answer your own question? – Andy W May 11 '11 at 01:02
  • @Andy: How come? I don't think so; I mentioned that all of the linear models I know are a very bad fit, and that I am wondering whether there is another transformation of the variables that would give a better linear fit. – Econometrician May 11 '11 at 01:04
  • A cubic-looking curve would suggest a cubic-type transformation! Try fitting an OLS model with higher-order polynomial X terms (specifically X^3); a minimal sketch of this appears just after these comments. There are other potential solutions as well (splines, breaking the X variable into different categories and using dummy variables). A recent post also details exploratory data analysis for examining such relationships: http://stats.stackexchange.com/questions/10363/data-mining-how-should-i-go-about-finding-the-functional-form/10520#10520 – Andy W May 11 '11 at 01:15
  • It might be worth remarking that "rises to some point and then falls to a steady level without a consequent rise" is distinctly *non-cubic* behavior. Cubics don't have horizontal asymptotes. More (quantitative) details would be helpful. – whuber May 11 '11 at 03:15
  • Hi, you might try to fit a "nonlinear" model. Unfortunately, I can't see a graph of your variable, but I believe nonlinear regression would be the best choice for your question. – Tu.2 May 11 '11 at 01:27
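
As a concrete illustration of the polynomial suggestion in the comments, here is a minimal R sketch; the data frame mydata with columns y and x is hypothetical and stands in for the asker's data.

# Fit an OLS model with linear, squared, and cubic terms (hypothetical data frame mydata).
fit_cubic <- lm(y ~ x + I(x^2) + I(x^3), data = mydata)
summary(fit_cubic)   # coefficients and R^2
# Quick visual check of the fitted curve against the raw points.
plot(mydata$x, mydata$y)
xs <- seq(min(mydata$x), max(mydata$x), length.out = 200)
lines(xs, predict(fit_cubic, newdata = data.frame(x = xs)))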

2 Answers


Restricted cubic splines (natural splines) are an excellent choice. These are piecewise cubic polynomials, constrained to be linear in the tails, that can fit almost any smooth shape given enough knots. The following R code shows how to fit such a relationship and plot the fit with confidence bands.

require(rms)
dd <- datadist(mydata); options(datadist='dd')
f <- ols(y ~ rcs(x1, 5), data=mydata)  # 5 knots at default locations
f   # print model stats
plot(Predict(f))  # or plot(Predict(f, x1)) # plots over 10th smallest to 10th largest observation
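
If the rms package is unavailable, a similar natural-spline fit can be sketched with base R's splines package, using the same mydata, y, and x1 as above; this is just one way to do it.

# Natural cubic spline basis via base R (same hypothetical mydata, y, x1 as above).
library(splines)
f2 <- lm(y ~ ns(x1, df = 4), data = mydata)
summary(f2)
# Fitted curve with pointwise 95% confidence bands.
xs <- seq(min(mydata$x1), max(mydata$x1), length.out = 200)
pr <- predict(f2, newdata = data.frame(x1 = xs), interval = "confidence")
matplot(xs, pr, type = "l", lty = c(1, 2, 2), col = 1, xlab = "x1", ylab = "fitted y")
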
Frank Harrell

I would have thought a "cubic regression" would work well for a cubic relationship. Call $Y_{i}$ the dependent variable, and $X_{i}$ the independent variable (or regressor). You simply use a polynomial regression:

$$Y_{i}=\left(\sum_{j=0}^{p}\beta_{j}X_{i}^{j}\right)+e_{i}$$

I would use BIC to select the value of $p$. Doing this is very easy: calculate the coefficient of determination $R_{p}^{2}$ from a standard OLS regression output. Then a convenient form of BIC is given by:

$$BIC_{p}=n\log(1-R_{p}^{2})+p\log(n)$$

Although this is the standard form, with natural logarithms, a more convenient numerical form is given by $$BIC10_{p}=-\frac{1}{2}\log_{10}(e)\,BIC_{p}$$

The reason I say this is that in this form you get BIC expressed in base-10 log units, which leads to a very quick interpretation of the actual number. If $BIC10_{p}$ is positive, then the order-$p$ polynomial is more supported by the data than the intercept-only model, and the numerical value in odds form is $10^{BIC10_{p}}$. So if $BIC10_{p}=1$, the order-$p$ polynomial is 10 times more likely than the intercept-only model; if $BIC10_{p}=10$, it is 10 billion times more likely. In other words, $BIC10_{p}$ tells you roughly how many digits are in the odds ratio. So a reasonable way to proceed is to keep increasing the order of the polynomial until $BIC10_{p}$ becomes sufficiently large.
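
A minimal R sketch of this selection procedure, assuming a hypothetical data frame mydata with columns y and x:

# Compute BIC10_p for polynomial orders 1..6 (hypothetical data frame mydata).
n <- nrow(mydata)
bic10 <- sapply(1:6, function(p) {
  r2 <- summary(lm(y ~ poly(x, p), data = mydata))$r.squared   # R^2_p
  bic <- n * log(1 - r2) + p * log(n)                          # BIC_p from the formula above
  -0.5 * log10(exp(1)) * bic                                   # BIC10_p in base-10 log units
})
round(bic10, 1)   # odds versus the intercept-only model are roughly 10^bic10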

One thing to be careful of, though, is that this type of procedure is unlikely to work well for extrapolation outside the range of the observed $X_{i}$ values, mainly because it is a data-driven procedure.

probabilityislogic