What are theoretical reasons to keep variables whose coefficients are not significant?
I have several coefficients with p > 0.05. What's causing the large p-values?
If your p-values come from a table generated by Stata, R, SPSS, or a similar package, then they result from the following statistical test.
$$ H_0: \beta = 0 $$ $$ H_a: \beta \neq 0 $$
$p > 0.05$ means that there is insufficient evidence to reject the null hypothesis at the 5% level.
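For concreteness, here is a minimal R sketch of where those p-values come from (all data simulated; the true coefficient on $x_2$ is deliberately set to zero):

    set.seed(1)
    n   <- 100
    d   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- 1 + 0.5 * d$x1 + rnorm(n)   # true coefficient on x2 is 0

    fit <- lm(y ~ x1 + x2, data = d)
    summary(fit)   # the "Pr(>|t|)" column is the two-sided p-value
                   # for H0: beta_j = 0 vs Ha: beta_j != 0

Here the x2 row will typically show $p > 0.05$, exactly the situation described in the question.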
In certain areas of applied statistics it is important to include non-significant terms when they are correlated with significant terms. This is because, for a regression model with $k$ independent variables to be consistent and unbiased, it is necessary that for every observation $i$
$$ E(u_i\,|\,x_{i,1},x_{i,2},...,x_{i,k}) = 0 $$ where $u$ is the residual term. This is known as "conditional mean independence" or the "zero conditional mean" assumption. It means that for any given values of the $x$'s, the average of the residuals is the same, and must therefore equal the average residual in the entire population. A proof of this fact involves probability limits and linear algebra, so I will refrain from showing it here. Some good references are http://w3.uniroma1.it/belloc/Introduction%20and%20SLR.pdf which discusses the consequences of violating conditional mean independence, and http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf pp. 68-69, which walks through a proof of regression consistency (both are written for econometric applications).
Take the following example: suppose $y$ follows the process below:
$$ y=\beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon $$ Where the $\beta$'s are the true coefficient values and $\varepsilon$ is a disturbance term. If we omit $x_2$ then the resulting model would be $$ y=\beta_0 + x_1\beta_1 + w $$ Where $w= x_2\beta_2 + \varepsilon $. If $x_1$ and $x_2$ are correlated then
$$ E(w_i\,|\,x_{i,1}) \neq 0 $$
So linear regression of $y$ on a constant and $x_1$ would not be a consistent estimator of $\beta_0$ and $\beta_1$. This is called omitted variable bias. Since in practice we do not know the $\beta$ terms, we refrain from omitting insignificant terms to avoid the above bias and inconsistency.
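A quick simulation makes the bias visible. The coefficient values below ($\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 3$) and the degree of correlation are made up purely for illustration:

    set.seed(42)
    n  <- 1e5
    x1 <- rnorm(n)
    x2 <- 0.8 * x1 + rnorm(n)              # x2 is correlated with x1
    y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true beta1 = 2, beta2 = 3

    coef(lm(y ~ x1 + x2))  # beta1 estimated near 2: consistent
    coef(lm(y ~ x1))       # beta1 estimated near 2 + 3 * 0.8 = 4.4: biased

The short regression's slope converges to $\beta_1 + 0.8\,\beta_2 = 4.4$ rather than 2, which is exactly the omitted variable bias described above.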
One reason to keep the covariates is if they're part of a group which, as a whole, is significant. The individual variables may not be significant, but a joint test of the group can still come out significant (see the sketch below). In that case, if you're not trying to tease out the relationship for a particular variable, but are using the model for other purposes such as forecasting, you may choose to keep the insignificant variables.
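A minimal R sketch of that situation, assuming two nearly collinear simulated regressors (all numbers are arbitrary):

    set.seed(7)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.05)          # nearly collinear with x1
    y  <- 1 + x1 + x2 + rnorm(n, sd = 3)

    full    <- lm(y ~ x1 + x2)
    reduced <- lm(y ~ 1)
    summary(full)         # individual t-tests on x1, x2: typically insignificant
    anova(reduced, full)  # joint F-test on the group: highly significant

Collinearity inflates the individual standard errors, so each t-test looks weak even though the pair clearly explains the outcome together.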
For instance, if the purpose of your analysis is to determine the impact of gender on salary, then keeping the gender variable in the model when it is insignificant would be strange. The whole purpose of the study is the gender impact, so if it is not statistically significant, I'd argue that your model or data is inadequate. In this case you may need to add more control variables, change the specification, or collect more or better data.
If you're building a marketing system which personalizes offers for your customers, and it has a bunch of variables like gender, marital status, etc., and the gender variable is not significant, that's not a reason to remove it immediately. Especially if the out-of-sample and other fit diagnostics are better with this variable, or if it's part of a group of variables that is significant as a whole. Here you're not studying gender; you're simply trying to forecast a customer's preferences based on some characteristics.
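As a minimal sketch of that kind of out-of-sample check (the variable names, coefficients, and the 350/150 train/test split are all invented for illustration):

    set.seed(123)
    n <- 500
    d <- data.frame(gender  = rbinom(n, 1, 0.5),
                    married = rbinom(n, 1, 0.5))
    d$spend <- 50 + 2 * d$gender + 5 * d$married + rnorm(n, sd = 10)

    train <- d[1:350, ]
    test  <- d[351:500, ]
    with_g    <- lm(spend ~ gender + married, data = train)
    without_g <- lm(spend ~ married,          data = train)

    rmse <- function(m) sqrt(mean((test$spend - predict(m, test))^2))
    c(with_gender = rmse(with_g), without_gender = rmse(without_g))

If the model including the weak variable predicts no worse (or better) out of sample, that supports keeping it for forecasting purposes.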