What are theoretical reasons to keep variables whose coefficients are not significant?
I have several coefficients with p > 0.05. What's causing the large p-values?
If your p-values come from a table generated by Stata, R, SPSS, or a similar package, then they result from the following statistical test.
$$ H_0: \beta = 0 $$ $$ H_a: \beta \neq 0 $$
$p > 0.05$ means that there is insufficient evidence to reject the null hypothesis at the 5% level.
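For concreteness, here is a minimal R sketch of where those p-values come from (all data simulated; the true coefficient on $x_2$ is deliberately set to zero):

    set.seed(1)
    n   <- 100
    d   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- 1 + 0.5 * d$x1 + rnorm(n)   # true coefficient on x2 is 0

    fit <- lm(y ~ x1 + x2, data = d)
    summary(fit)   # the "Pr(>|t|)" column is the two-sided p-value
                   # for H0: beta_j = 0 vs Ha: beta_j != 0

Here the x2 row will typically show $p > 0.05$, exactly the situation described in the question.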
In certain areas of applied statistics it is important to include non-significant terms when they are correlated with significant terms. This is because, for a regression model with $k$ independent variables to be consistent and unbiased, it is necessary that for every observation $i$
$$ E(u_i\,|\,x_{i,1},x_{i,2},...,x_{i,k}) = 0 $$ where $u$ is the residual term. This is known as "conditional mean independence" or the "zero conditional mean" assumption. It means that for any given values of the $x$'s, the average of the residuals is the same, and must therefore equal the average residual in the entire population. A proof of this fact involves probability limits and linear algebra, so I will refrain from showing it here. Some good references are http://w3.uniroma1.it/belloc/Introduction%20and%20SLR.pdf which discusses the consequences of violating conditional mean independence, and http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf pp. 68-69, which walks through a proof of regression consistency (both are written for econometric applications).
Take the following example: suppose $y$ follows the process below:
$$ y=\beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon $$ Where the $\beta$'s are the true coefficient values and $\varepsilon$ is a disturbance term. If we omit $x_2$ then the resulting model would be $$ y=\beta_0 + x_1\beta_1 + w $$ Where $w= x_2\beta_2 + \varepsilon $. If $x_1$ and $x_2$ are correlated then
$$ E(w_i\,|\,x_{i,1}) \neq 0 $$
So linear regression of $y$ on a constant and $x_1$ would not be a consistent estimator of $\beta_0$ and $\beta_1$. This is called omitted variable bias. Since in practice we do not know the $\beta$ terms, we refrain from omitting insignificant terms to avoid the above bias and inconsistency.
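A quick simulation makes the bias visible. The coefficient values below ($\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 3$) and the degree of correlation are made up purely for illustration:

    set.seed(42)
    n  <- 1e5
    x1 <- rnorm(n)
    x2 <- 0.8 * x1 + rnorm(n)              # x2 is correlated with x1
    y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true beta1 = 2, beta2 = 3

    coef(lm(y ~ x1 + x2))  # beta1 estimated near 2: consistent
    coef(lm(y ~ x1))       # beta1 estimated near 2 + 3 * 0.8 = 4.4: biased

The short regression's slope converges to $\beta_1 + 0.8\,\beta_2 = 4.4$ rather than 2, which is exactly the omitted variable bias described above.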
One reason to keep the covariates is if they're part of a group which, as a whole, is significant. The individual variables may not be significant, but a joint test of the group can still come out significant (see the sketch below). In that case, if you're not trying to tease out the relationship for a particular variable, but are using the model for other purposes such as forecasting, you may choose to keep the insignificant variables.
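A minimal R sketch of that situation, assuming two nearly collinear simulated regressors (all numbers are arbitrary):

    set.seed(7)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.05)          # nearly collinear with x1
    y  <- 1 + x1 + x2 + rnorm(n, sd = 3)

    full    <- lm(y ~ x1 + x2)
    reduced <- lm(y ~ 1)
    summary(full)         # individual t-tests on x1, x2: typically insignificant
    anova(reduced, full)  # joint F-test on the group: highly significant

Collinearity inflates the individual standard errors, so each t-test looks weak even though the pair clearly explains the outcome together.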
For instance, if the purpose of your analysis is to determine the impact of gender on salary, then keeping the gender variable in the model when it is insignificant would be strange. The whole purpose of the study is the gender impact, so if it is not statistically significant, I'd argue that your model or data is inadequate. In this case you may need to add more control variables, change the specification, or collect more or better data.
If you're building a marketing system which personalizes offers for your customers, and it has a bunch of variables like gender, marital status, etc., and the gender variable is not significant, that's not a reason to remove it immediately. Especially if the out-of-sample and other fit diagnostics are better with this variable, or if it's part of a group of variables that is significant as a whole. Here you're not studying gender; you're simply trying to forecast a customer's preferences based on some characteristics.
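As a minimal sketch of that kind of out-of-sample check (the variable names, coefficients, and the 350/150 train/test split are all invented for illustration):

    set.seed(123)
    n <- 500
    d <- data.frame(gender  = rbinom(n, 1, 0.5),
                    married = rbinom(n, 1, 0.5))
    d$spend <- 50 + 2 * d$gender + 5 * d$married + rnorm(n, sd = 10)

    train <- d[1:350, ]
    test  <- d[351:500, ]
    with_g    <- lm(spend ~ gender + married, data = train)
    without_g <- lm(spend ~ married,          data = train)

    rmse <- function(m) sqrt(mean((test$spend - predict(m, test))^2))
    c(with_gender = rmse(with_g), without_gender = rmse(without_g))

If the model including the weak variable predicts no worse (or better) out of sample, that supports keeping it for forecasting purposes.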