
Hi, I am having trouble finalizing my results for presentation. The results from a multiple regression are different from my results in a simple linear regression.

For example, the multiple regression model `mul <- lm(Response ~ Temp + Wave + Overcast, data = env)` gives me a p-value for 'Temp' of $0.01$, while the others are $>0.05$.

However, a simple linear regression `sim <- lm(Response ~ Temp, data = env)` returns a p-value for 'Temp' of $0.03$.

Which result should I be presenting, and how do I interpret the relationship between the predictors in the multiple regression?

Nieve K
  • Hint: look at the degrees of freedom for the t statistics in the simple and multiple linear regression. Then compare them. – Stat Jan 31 '16 at 01:30
  • There is a superb answer to this general question [here](http://stats.stackexchange.com/a/78830/28500). In the terminology of that answer, think of each linear regression coefficient as the result of _ignoring_ all of the other variables, while each multiple regression coefficient is _controlling for_ all the other variables. – EdM Jan 31 '16 at 15:23

1 Answer


You are comparing two entirely different models. Which one you choose has a lot to do with experience in the field of study, aided by things such as adjusted $R^2$ and an ANOVA test of both models side by side.
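Since your simple model is nested in your multiple model, that side-by-side comparison can be run directly in R. This is just a sketch reusing the data frame and model formulas from your question:

```r
# Fit both models on the same data (names as in the question)
sim <- lm(Response ~ Temp, data = env)
mul <- lm(Response ~ Temp + Wave + Overcast, data = env)

# Adjusted R^2 of each model
summary(sim)$adj.r.squared
summary(mul)$adj.r.squared

# F-test of the nested models: do Wave and Overcast jointly add
# explanatory power beyond Temp?
anova(sim, mul)
```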

In multiple regression, the idea is that when we add a regressor ($x_2$) to a pre-existing model with just one regressor ($x_1$), we are no longer regressing the "dependent variable" $y$ on $x_2$ itself, but on the residuals of the regression of $x_2$ on $x_1$. This changes everything.

If we set up the model as:

$y = \color{blue}{5}\, x_2 + \color{red}{15}\, x_1$ with both $x_1$ and $x_2$ generated as random $\sim N(0,1)$ variables, the correlation of $y$ with $x_1$ will be much higher than with $x_2$.
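The original simulation code isn't reproduced here, so the following is only a sketch of data that would behave as described. For $x_2$ to stand in for $x_1$ the way it does below, the two regressors have to be correlated, and $y$ needs some noise; the correlation of $0.7$ and the noise level are my assumptions, not values from the original simulation:

```r
set.seed(1)                                   # any seed; exact numbers will differ
n  <- 100
x1 <- rnorm(n)                                # x1 ~ N(0,1)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(n)   # x2 ~ N(0,1), correlated with x1 (assumed)
y  <- 5 * x2 + 15 * x1 + rnorm(n, sd = 3)     # the model above, plus assumed noise

cor(y, x1)   # clearly higher ...
cor(y, x2)   # ... than this one
```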

The regression of $y$ on $x_2$ in isolation will not capture the slope of $5$ we set up, because $x_2$ will be compensating for the absence of the main explanatory variable in the model, i.e. $x_1$. The slope of $x_2$ will, in fact, be very close to the coefficient we assigned to $x_1$: $\color{red}{15}$. Leaving the intercept out, `summary(lm(y ~ x2 - 1))$coefficients` will return a slope of $\color{red}{15.42}$.

However, if we now include $x_1$ in the regression in a sneaky way: instead of calling `lm(y ~ x1 + x2 - 1)`, we first regress $x_2$ on $x_1$ and keep the residuals before "tossing" $x_1$, as `errors <- residuals(lm(x2 ~ x1 - 1))`, and then call `summary(lm(y ~ errors - 1))$coefficients`. The slope will be $\color{blue}{4.681616}$: very close to the coefficient we set up for $x_2$, and... here comes the punch line of the story... identical to the coefficient for $x_2$ in `summary(lm(y ~ x2 + x1 - 1))$coefficients`, namely $x_2 \,\,\color{blue}{4.681616}$. The coefficient for $x_1$ will be $x_1 \,\,\color{red}{15.091305}$.
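Putting those calls together (continuing with the simulated data sketched above; the coefficients will not match the $15.42$, $4.68$ and $15.09$ quoted in the text, since the original seed and simulation settings aren't shown):

```r
# 1. y regressed on x2 alone: the slope is pulled toward x1's coefficient of 15
coef(lm(y ~ x2 - 1))

# 2. Frisch-Waugh-Lovell style: keep only the part of x2 that x1 cannot explain ...
errors <- residuals(lm(x2 ~ x1 - 1))

# ... and regress y on that part: the slope is x2's "own" contribution
coef(lm(y ~ errors - 1))

# 3. The multiple regression: its x2 coefficient is identical to the one in step 2
coef(lm(y ~ x2 + x1 - 1))
```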

So ignoring the hidden confounder $x_1$ in the model $y \sim x_2$ forced $x_2$ to explain, all by itself, as much of the variation in $y$ as possible, resulting in a completely different slope compared to the more accurate $y \sim x_1 + x_2$. You can look up the concept of omitted variable bias.
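For completeness, the usual omitted-variable-bias formula makes the mechanism explicit. If the true model is $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, the slope from the short regression of $y$ on $x_2$ alone converges to

$$\hat\beta_2^{\text{short}} \;\xrightarrow{\;p\;}\; \beta_2 + \beta_1\,\frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_2)},$$

so the distortion vanishes when $x_1$ and $x_2$ are uncorrelated, and grows with both the size of the omitted coefficient $\beta_1$ and the correlation between the two regressors.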

It makes sense, then, that in this example the $p$-values will be significant in both regressions, albeit different: in the simple regression $x_2$ borrows the strong association that really belongs to $x_1$, while in the multiple regression it is left with only its own contribution.

Antoni Parellada
  • Thank you so much for your detailed reply :) However, I've had cases where the relationship between a predictor and the dependent variable was significant in a multiple regression but not significant in a simple linear regression. How would I interpret this? Would I try to find which other predictor is affecting the relationship and control for it? – Nieve K Feb 01 '16 at 06:23