
I would like to compare the effect of parameters x and z on the dependent variable y. I'm not sure how to tell whether z or x is 'better'/'stronger'/'more likely to be a driver' of y.

For x, when I plotted the data I noticed a quadratic relationship (y ~ x + x^2).

I wrote the polynomial regression call against my data frame dat_CV like this:

lm(dat_CV[[y]] ~ dat_CV[[x]] + I(dat_CV[[x]]^2), data = dat_CV)
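(Here x and y are character strings holding the column names, which is why I index with dat_CV[[x]]; the same model can equivalently be written by building the formula first, roughly like this:)

# Sketch of an equivalent call; assumes x and y are character strings with
# the column names (e.g. x <- "rainfall"). reformulate() builds y ~ x + I(x^2).
form_x <- reformulate(c(x, sprintf("I(%s^2)", x)), response = y)
fit_x  <- lm(form_x, data = dat_CV)
summary(fit_x)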

My output for the model using x is:

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1671 -0.0685  0.0227  0.0665  0.1144 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)       1.143040   0.230929    4.95   0.0017 **
dat_CV[[x]]       0.093053   0.022701    4.10   0.0046 **
I(dat_CV[[x]]^2) -0.001987   0.000477   -4.16   0.0042 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.101 on 7 degrees of freedom
Multiple R-squared:  0.713, Adjusted R-squared:  0.63 
F-statistic: 8.68 on 2 and 7 DF,  p-value: 0.0127

The relationship for y ~ z was linear:

lm(dat_CV[[y]] ~ dat_CV[[z]], data = dat_CV)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.0946 -0.0638 -0.0369  0.0943  0.1073 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.0307     0.4418   -0.07   0.9463   
dat_CV[[z]]   3.1370     0.6682    4.69   0.0016 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0911 on 8 degrees of freedom
Multiple R-squared:  0.734, Adjusted R-squared:   0.7 
F-statistic:   22 on 1 and 8 DF,  p-value: 0.00155

Regarding the results of the quadratic model for x:

i) I'm not sure how to interpret the p-values. Can I phrase the result in a paper like this:

Parameter x was found to have a significant quadratic relationship with y (F(2,7) = 8.68, p = 0.013)

Should I be reporting the p-value of I(dat_CV[[x]]^2), or both coefficient rows, instead of or in addition to the overall model p-value?

ii) How do I interpret the fact that the p-values are significant at p < 0.01 for each coefficient but not for the overall model (p = 0.013)?

Comparing the two models

iii) Can I use $R^2$ to compare the linear and quadratic models? If not, can I compare the residual standard errors to say which model has the better goodness of fit?

i.e. y ~ x + x^2: residual s.e. = 0.101

y ~ z: residual s.e. = 0.091

Therefore y ~ z is a 'slightly' better fit? (I know the residual standard errors are almost the same here, but in other comparisons the difference between models was much bigger, so I want to understand the meaning.)

Does this mean that z is a 'better' predictor of y, even though both had significant p-values?

iv) Since the estimates in a quadratic model are no longer a single slope as in a simple linear regression, how can I evaluate the 'size'/'strength' of the relationship in order to compare the models?

Ferdi
  • All the models you mention are linear in the parameters (and in the vectors of predictors). [*Nonlinear regression*](https://en.wikipedia.org/wiki/Nonlinear_regression) is usually reserved for the case where the model is not linear in the parameters, rather than merely curved in some original $x$. So - for example - polynomial regression is generally referred to as multiple linear regression rather than non-linear regression. – Glen_b Nov 30 '17 at 01:14
  • If you want to compare the effects, then you should use models that include *all* variables. Using regression methods alone, you have no basis for claiming any of them are "drivers" for $y$: all you can do is study how they are associated with $y$. – whuber Nov 30 '17 at 23:41

1 Answer


As was pointed out in the comments, you need to include all of your variables in one model to understand their importance. A simple and effective way to assess a variable's importance with respect to your model's ability to make good predictions is the Mean Decrease in Accuracy (permutation importance), which can be computed for any score, such as MSE. Make sure you apply this technique to data that were not used to build the model (hold-out data).
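Here is a minimal sketch of what that could look like for your data. It assumes the columns are literally named x, y and z and that there are enough rows for a hold-out split; with only ~10 observations, leave-one-out cross-validation would be a more realistic way to get hold-out predictions, but the mechanics are the same:

# Minimal permutation-importance sketch (assumes columns named x, y, z)
set.seed(1)
n         <- nrow(dat_CV)
train_idx <- sample(n, size = floor(0.7 * n))
train     <- dat_CV[train_idx, ]
test      <- dat_CV[-train_idx, ]

# One model containing *all* candidate predictors, as the comments recommend
fit <- lm(y ~ x + I(x^2) + z, data = train)

# Baseline prediction error on the hold-out data
baseline_mse <- mean((test$y - predict(fit, newdata = test))^2)

# Permute each predictor in turn and measure how much the hold-out MSE rises
perm_importance <- sapply(c("x", "z"), function(v) {
  mses <- replicate(200, {
    shuffled      <- test
    shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and y
    mean((test$y - predict(fit, newdata = shuffled))^2)
  })
  mean(mses) - baseline_mse                  # increase in MSE = importance
})
perm_importance

The predictor whose permutation inflates the hold-out MSE the most is the one the model relies on most heavily for prediction, which is a more defensible 'strength' comparison than contrasting the $R^2$ of two separate single-predictor models.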

Chris