
I have a dataset with 10 predictors and 1 outcome variable. Looking at the Residuals vs Fitted plot, I suspect there is a non-linearity that I am missing. But how can I check which of the 10 predictors are linearly related to the outcome and which are non-linearly related?

  • Maybe by running 10 separate individual regressions? Though, this does not take into account nonlinearity stemming from covariate-on-covariate relationships... – ERT Jul 25 '18 at 15:00
  • Thanks ERT. However, I also wanted to know how you handle a situation where your lm's diagnostic plots (Residuals vs Fitted) show a non-linearity and you have a handful of predictors, as in my case. If you want to introduce polynomial terms for a subset of the predictors, how do you choose the right ones? – Nithya Subramanian Jul 25 '18 at 15:11
  • Hi, welcome. Perhaps this could be useful: https://en.wikipedia.org/wiki/Non-linear_least_squares – Jim Jul 25 '18 at 15:46
  • Thanks Jim, I went through the link but am still confused. Say I have 10 variables to predict one outcome. I have no SME knowledge to even guess which of those 10 would be non-linearly related to my outcome. Is there a way of finding out which of those 10 variables are non-linearly related to the outcome? – Nithya Subramanian Jul 26 '18 at 09:41
  • If you visually inspect scatterplots of the raw data, you should be able to determine if there are any clearly obvious non-linear relationships or gaps in the data. Also, making such a visual inspection can be considered "due diligence" regarding the analysis. – James Phillips Jul 26 '18 at 14:17
  • How about using a generalised additive model: add all continuous terms as spline terms and see if they could reasonably be linear. – user20650 Jul 27 '18 at 20:06
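A minimal sketch of that GAM idea, assuming the mgcv package and placeholder names df, y, x1, x2, x3 for the data frame, outcome, and predictors:

    library(mgcv)  # provides gam() and s(); assumes the package is installed

    # One smooth term per continuous predictor (add the rest of your variables)
    fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = df)
    summary(fit)          # terms with edf close to 1 are effectively linear
    plot(fit, pages = 1)  # estimated smooths; visible curvature suggests non-linearity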

1 Answer


It may not answer the question of which exact variable is non-linear, but there is a test designed for exactly this type of problem in the econometrics literature, called the Ramsey RESET test.

It works by testing whether polynomial terms of the fitted values from the original regression are significant. If there is no non-linearity and the original model is correctly specified, then the polynomial terms should not be significant. This is checked with an F-test.
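In R, one convenient way to run this is lmtest::resettest, which adds powers of the fitted values and performs the F-test; a minimal sketch, assuming the original model is an lm fit and that df and y are placeholder names for your data frame and outcome:

    library(lmtest)  # provides resettest(); assumes the package is installed

    fit <- lm(y ~ ., data = df)                   # original linear model
    resettest(fit, power = 2:3, type = "fitted")  # small p-value suggests mis-specification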

The other thing you can do, if you are mainly concerned with the fit of the regression (i.e., not running the regression for the purpose of causal inference), is to pick a small polynomial degree, say the square or the cube, and apply it to all of your variables.

Let's say we picked the cube for clarity. Run the regression with all cubic terms included. Look at the coefficients of the cubic terms, starting with the one that has the largest p-value (the least significant, so to speak), and test the hypothesis that the more parsimonious model without it is an adequate simplification of the original.

Now your p-values will be different; pick the least significant cubed term that remains, eliminate it, and test the hypothesis again. Rinse and repeat. Continue until you have eliminated all the non-significant cubic terms, then proceed with the squared terms in the same fashion. Eventually you will be left with a regression that cannot be adequately simplified by removing polynomial terms, so any important non-linearities captured by a cubic should still be in the model.

With 10 variables this is cumbersome, so you could also take a guess at which variables you think might have the non-linearities and work with a subset of cubic terms instead. Again, this is not ideal for performing inference: the resulting p-values and so on are a little strange to interpret, since you are squeezing your data quite hard to arrive at the model in the first place, but it can be reasonable for producing predictive values. You can inject more or less parsimony into this process by changing your significance level for the F-tests.
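For concreteness, here is a rough R sketch of that pruning loop. The names df, y, x1, x2, x3 are placeholders (extend to all 10 predictors), and poly() is just one convenient way to encode the polynomial terms; this is an illustration under those assumptions, not a prescription:

    # Fit with cubic terms for every predictor (placeholder names)
    full <- lm(y ~ poly(x1, 3) + poly(x2, 3) + poly(x3, 3), data = df)
    summary(full)            # inspect the degree-3 coefficients and their p-values
    drop1(full, test = "F")  # F-tests for dropping each whole polynomial term

    # Suppose x3's cubic term is the least significant: demote it to a square and
    # test whether the reduced model is an adequate simplification of the full one
    reduced <- update(full, . ~ . - poly(x3, 3) + poly(x3, 2))
    anova(reduced, full)     # F-test comparing the two nested models

    # Refit, pick the next least significant polynomial term, and repeat; once no
    # cubic can be removed, work through the squared terms the same way.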

Tyrel Stokes