
What is the best statistical test to use if I measure the value of $Y$ (e.g. pH) for specific values of $X$, e.g. $X=0,10,20,30,\dots,100$ (e.g. temperature), and I want to test whether the relationship between $X$ and $Y$ is linear? (i.e. $H_0$: the relationship between $X$ and $Y$ is linear; $H_1$: the relationship between $X$ and $Y$ is not linear.)

My thoughts

I was thinking of either a rank correlation test (e.g. Spearman's), though on further reading it appears this cannot in fact be used to test linearity, or a model misspecification test, but I think the data set would be too small for any reasonable progress to be made with such a test.

kjetil b halvorsen
  • Your data requires some mathematical transform. One variable is somewhat autocorrelated and the other is random, i.e. it can increase or decrease. Correlation analysis in some cases assumes a linear relationship, and in other cases it presumes a nonlinear relation between the two variables. –  Oct 08 '16 at 15:30

3 Answers


Any rank test will only test for monotonicity, and a highly nonlinear relationship can certainly be monotone. So any rank-based test won't be helpful.
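
To see why, consider this minimal sketch (the exponential relationship is just an arbitrary monotone nonlinear example):

set.seed(1)
x <- runif(100)
y <- exp(5 * x) + rnorm(100, 0, 0.1)  # strongly nonlinear, but monotone
cor(x, y, method = "spearman")        # essentially 1: the rank correlation sees nothing amiss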

I would recommend that you fit a linear and a nonlinear model and assess whether the nonlinear model explains a significantly larger amount of variance via ANOVA. Here is a little example in R:

set.seed(1)
xx <- runif(100)                 # predictor
yy <- xx^2 + rnorm(100, 0, 0.1)  # quadratic relationship plus noise
plot(xx, yy)

[scatterplot of yy against xx, showing a clear quadratic trend]

model.linear  <- lm(yy ~ xx)           # linear fit
model.squared <- lm(yy ~ poly(xx, 2))  # quadratic fit
anova(model.linear, model.squared)     # does the quadratic term explain significantly more variance?

Analysis of Variance Table

Model 1: yy ~ xx
Model 2: yy ~ poly(xx, 2)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     98 1.27901                                  
2     97 0.86772  1   0.41129 45.977 9.396e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this particular case, the ANOVA correctly identifies nonlinearity.

Of course, you can also look at higher orders than squares. Or use splines instead of unrestricted polynomials. Or harmonics. It really depends on which specific alternatives you have in mind, which will in turn depend on your specific use case.
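
For instance, here is a hedged sketch of the same ANOVA comparison against a harmonic alternative; the frequency (one full cycle over the range of xx) is an arbitrary choice for illustration, and in practice you would pick it from subject knowledge:

model.harmonic <- lm(yy ~ xx + sin(2*pi*xx) + cos(2*pi*xx))  # the linear model is nested in this
anova(model.linear, model.harmonic)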

Stephan Kolassa
  • Very good answer, but maybe it's worth adding a few words on the fact that the two or more models to be tested with ANOVA have to be nested. Not sure if the OP is aware of that. – DeltaIV Oct 08 '16 at 16:03
  • From what you are indicating, there is no test to determine whether the relationship is linear, just tests to determine which of two given models is better (correct?). Since I have no specific alternative in mind, is there any test to show that a linear model is adequate? – Quantum spaghettification Oct 08 '16 at 17:55
  • I don't know of any "general" test for linearity. There are many, many possible departures from linearity: quadratic, cubic, harmonic (with different frequencies!), discontinuous at one point, at many points, and so forth. I'd be a little surprised if there were any test against a very general alternative hypothesis such as this one. – Stephan Kolassa Oct 08 '16 at 18:07

I like Stephan Kolassa's ANOVA-based answer a lot. However, I would also like to offer a slightly different perspective.

First of all, consider the reason why you're testing for nonlinearity. If you are testing the assumptions of ordinary least squares for the simple linear regression model and then want to use the estimated model to perform further tests (e.g., to test whether the correlation between $X$ and $Y$ is statistically significant), the resulting procedure is a composite test whose Type I and Type II error rates won't be the nominal ones. This is one of several reasons why, instead of formally testing the assumptions of linear regression, you may want to use plots to judge whether those assumptions are reasonable. Another reason is that the more tests you perform, the more likely you are to get a significant result even when the null is true (after all, linearity of the relationship between $X$ and $Y$ is not the only assumption of the simple linear regression model). Closely related to this is the fact that assumption tests have assumptions of their own!

Following Stephan Kolassa's example, let's build a simple linear regression model:

set.seed(1)
xx <- runif(100)
yy <- xx^2+rnorm(100,0,0.1)
plot(xx,yy)
linear.model <- lm(yy ~ xx)

The plot method for linear models produces a set of diagnostic plots designed to help you assess the validity of the assumptions behind the linear model and the OLS estimation method. The first of these, the residuals vs. fitted plot, is meant precisely to reveal deviations from the assumption of a linear relationship between the predictor $X$ and the response $Y$:

plot(linear.model)  # the first plot shown is residuals vs. fitted

[residuals vs. fitted plot, showing a clear quadratic pattern in the residuals]

You can clearly see a quadratic trend in the residuals as a function of the fitted values, so the assumption that $Y$ is a linear function of $X$ is questionable.

If, however, you are set on using a statistical test to verify the assumption of linearity, then you face the issue that, as noted by Stephan Kolassa, there are infinitely many possible forms of nonlinearity, so you cannot possibly devise a single test for all of them. You need to decide on your alternatives, and then you can test for them. Now, if all your alternatives are polynomials, you don't even need ANOVA, because by default R computes orthogonal polynomials. Let's test four alternatives: a linear, a quadratic, a cubic and a quartic polynomial. Of course, looking at the residuals vs. fitted plot, there is no evidence for a model of degree higher than 2 here; however, we include the higher-degree models to show how to operate in a more general case. We just need one fit to compare all four models:

quartic.model <- lm(yy ~ poly(xx,4))
summary(quartic.model)
Call:
lm(formula = yy ~ poly(xx, 4))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.175678 -0.061429 -0.007403  0.056324  0.264612 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.33729    0.00947  35.617  < 2e-16 ***
poly(xx, 4)1  2.78089    0.09470  29.365  < 2e-16 ***
poly(xx, 4)2  0.64132    0.09470   6.772 1.05e-09 ***
poly(xx, 4)3  0.04490    0.09470   0.474    0.636    
poly(xx, 4)4  0.11722    0.09470   1.238    0.219    

As you can see, the p-values for the first- and second-degree terms are extremely low, meaning that a linear fit is insufficient, while the p-values for the third- and fourth-degree terms are much larger, meaning that models of degree three or higher are not justified. Thus, we select the second-degree model. Note that this reading is only valid because R fits orthogonal polynomials (don't try this with raw polynomials!). The result would have been the same if we had used ANOVA; as a matter of fact, the squares of the t-statistics here are equal to the F-statistics of the ANOVA test:

linear.model <- lm(yy ~ poly(xx,1))
quadratic.model <- lm(yy ~ poly(xx,2))
cubic.model <- lm(yy ~ poly(xx,3))
anova(linear.model, quadratic.model, cubic.model, quartic.model)
Analysis of Variance Table

Model 1: yy ~ poly(xx, 1)
Model 2: yy ~ poly(xx, 2)
Model 3: yy ~ poly(xx, 3)
Model 4: yy ~ poly(xx, 4)
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     98 1.27901                                   
2     97 0.86772  1   0.41129 45.8622 1.049e-09 ***
3     96 0.86570  1   0.00202  0.2248    0.6365    
4     95 0.85196  1   0.01374  1.5322    0.2188   

For example, 6.772^2 = 45.85998, which is not exactly 45.8622 but pretty close; the small discrepancy is due only to the rounding of the printed t-statistic, not to any difference between the two tests.
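
For instance, squaring the unrounded t-statistic extracted from the quartic fit's coefficient table reproduces the F-statistic exactly (a small verification sketch):

t.quad <- coef(summary(quartic.model))[3, "t value"]  # t-statistic of the quadratic term
t.quad^2                                              # 45.8622, the F statistic from the ANOVA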

The advantage of the ANOVA test comes into play when you want to explore non-polynomial models, as long as they're all nested: two or more models $M_1,\dots,M_N$ are nested if the predictors of $M_i$ are a subset of the predictors of $M_{i+1}$, for each $i$. For example, let's consider a cubic spline model with one interior knot placed at the median of xx. The cubic spline basis includes the linear, quadratic and cubic polynomials, so linear.model, quadratic.model and cubic.model are all nested in the following spline.model:

library(splines)  # for bs()
spline.model <- lm(yy ~ bs(xx, knots = quantile(xx, prob = 0.5)))

The quartic.model is not nested in the spline.model (nor vice versa), so we must leave it out of our ANOVA test:

anova(linear.model, quadratic.model,cubic.model,spline.model)
Analysis of Variance Table

Model 1: yy ~ poly(xx, 1)
Model 2: yy ~ poly(xx, 2)
Model 3: yy ~ poly(xx, 3)
Model 4: yy ~ bs(xx, knots = quantile(xx, prob = 0.5))
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     98 1.27901                                   
2     97 0.86772  1   0.41129 46.1651 9.455e-10 ***
3     96 0.86570  1   0.00202  0.2263    0.6354    
4     95 0.84637  1   0.01933  2.1699    0.1440   

Again, we see that a quadratic fit is justified, and we have no reason to reject the quadratic model in favour of the cubic or spline alternatives.

Finally, if you would also like to compare non-nested models (for example, a linear model against a spline model or a nonlinear model such as a Gaussian process), then I don't think there are hypothesis tests for that; in this case your best bet is cross-validation.
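
A minimal sketch of how that might look, using leave-one-out cross-validation via boot::cv.glm (the candidate models and the use of cv.glm are just one possible choice for illustration):

library(boot)     # for cv.glm
library(splines)  # for bs
dat <- data.frame(xx, yy)
fit.linear <- glm(yy ~ xx, data = dat)
fit.spline <- glm(yy ~ bs(xx, knots = quantile(xx, prob = 0.5)), data = dat)
cv.glm(dat, fit.linear)$delta[1]  # leave-one-out estimate of prediction error
cv.glm(dat, fit.spline)$delta[1]  # smaller is better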

DeltaIV

Along the lines of what @Kolassa said, I would suggest fitting a non-parametric model (something like a cubic spline smoother) to your data and seeing whether the improvement in fit over a linear model is significant.

If I remember correctly, the book "Generalized Additive Models" by Hastie and Tibshirani contains a description of a test that can be used for that.
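
For instance, here is a minimal sketch with the mgcv package (a different GAM implementation than the software accompanying that book, used here only because it is convenient), applied to the xx and yy from the examples above: if the estimated degrees of freedom (edf) of the smooth come out well above 1, a straight line is probably inadequate.

library(mgcv)
gam.model <- gam(yy ~ s(xx))
summary(gam.model)  # check the edf and the approximate p-value for s(xx)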

F. Tusell