In response to 2)
Recall that a linear regression models a conditional mean. Therefore, an "individual coefficient" hypothesis for the $j$-th coefficient is a hypothesis about $\mathbb{E}[Y|X_j]$, while a hypothesis about "everything together" is a hypothesis about $\mathbb{E}[Y|X_1,X_2,...,X_J]$. Your hypotheses are therefore always, in a sense, conditional on each other. A hypothesis about a single coefficient is something like a marginal hypothesis, "averaged over" values of the other predictors; a hypothesis about everything together is a joint hypothesis. For that reason, hypotheses about individual coefficients based on pairwise relationships tend not to translate into good joint hypotheses.
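If it helps to see the distinction in code, here's a minimal sketch in Python with statsmodels (my choice of tools; the data and numbers are made up purely for illustration). The single-coefficient hypothesis and the joint hypothesis are literally different tests on the same fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Fabricated data, just to have something to fit.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.3 * x1 + 0.3 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # columns: const, x1, x2
fit = sm.OLS(y, X).fit()

# Hypothesis about a single coefficient: a t-test on beta_1.
print(fit.t_test("x1 = 0"))

# Hypothesis about "everything together": a joint F-test on beta_1 and beta_2.
print(fit.f_test("x1 = 0, x2 = 0"))
```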
This is where the classic first-semester surprise comes from: you fit two univariate regressions with significant coefficients, but when you put the predictors together in a multiple regression, or add a third predictor, or an interaction, both become nonsignificant. Better yet is when they only become significant once you add the interaction. Bonus points if the interaction itself is nonsignificant.
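Here's a rough simulation of that surprise (again Python/statsmodels, with parameters I picked just to make the effect likely to show up): two correlated predictors that each look great on their own, neither of which tends to survive in the joint model, even though the joint F-test says the pair clearly matters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Two highly correlated predictors, both of which genuinely affect y.
n = 100
cov = [[1.0, 0.95], [0.95, 1.0]]
x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(scale=2.0, size=n)

uni1 = sm.OLS(y, sm.add_constant(x1)).fit()
uni2 = sm.OLS(y, sm.add_constant(x2)).fit()
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("x1 alone, p-value: ", uni1.pvalues[1])   # typically "significant"
print("x2 alone, p-value: ", uni2.pvalues[1])   # typically "significant"
print("together, p-values:", both.pvalues[1:])  # often neither one is
print("joint F-test p-value:", both.f_pvalue)   # yet jointly they clearly matter
```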
Unless I misunderstood you and you're actually asking how to test an existing model. For that I defer to David Giles at his blog. The punch line is that you probably shouldn't test individual coefficients unless you have a substantive reason for doing so, but it's a fantastic post and everyone who ever plans to use multiple regression should read the whole thing.
It's also not meaningful to talk about correlation between the $\beta_j$s outside of a Bayesian context (although I got into a debate with another poster on a related subject). Correlation between the $\hat{\beta}_j$s is a different thing, and correlation between the $X_j$s is different still. All statistics packages take the former "into account" because, well, they explicitly compute the covariance matrix $\mathbb{V}[\hat{\beta}]$, and the standard errors are computed from its main diagonal, which is what you typically use to test hypotheses. The latter isn't a big deal in principle, except that highly correlated predictors will "steal" the magnitudes of their coefficients from each other, particularly if they are on different measurement scales, so it's very often good practice to center and re-scale your variables. If two predictors are perfectly correlated, you don't have a full-rank $X$ matrix and the regression is mathematically impossible. If they are very, very highly correlated, that's theoretically okay, but it will make your computer very unhappy and you will get numerical issues trying to invert $X^TX$.
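For concreteness, here's a quick sketch (same tools as above, illustrative only, with made-up variables on deliberately mismatched scales) of what "taking it into account" looks like, and of what scaling and near-collinearity do to the conditioning of $X^TX$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200

# Predictors on very different scales, plus one that is almost a copy of another.
income = rng.normal(50_000, 10_000, size=n)        # dollars
rate = rng.normal(0.05, 0.01, size=n)              # a proportion
near_copy = income + rng.normal(size=n)            # nearly perfectly correlated with income

y = 2.0 + 1e-4 * income + 30.0 * rate + rng.normal(size=n)

X = sm.add_constant(np.column_stack([income, rate]))
fit = sm.OLS(y, X).fit()

# The package computes the full covariance matrix of the estimates...
V = fit.cov_params()
# ...and the reported standard errors are just the square roots of its diagonal.
print(np.allclose(np.sqrt(np.diag(V)), fit.bse))   # True

# Mismatched scales make X'X badly conditioned; centering and rescaling helps a lot.
Z = sm.add_constant(np.column_stack([(income - income.mean()) / income.std(),
                                     (rate - rate.mean()) / rate.std()]))
print(np.linalg.cond(X.T @ X), np.linalg.cond(Z.T @ Z))

# Near-perfect collinearity is "legal" but numerically miserable:
W = sm.add_constant(np.column_stack([income, rate, near_copy]))
print(np.linalg.cond(W.T @ W))                     # huge; the matrix is nearly singular
```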
I'd personally recommend not learning regression from Tibshirani & company, at least not at first. I have great respect for them and I hold dear my copy of Elements of Statistical Learning, but as a machine learning book it takes a very... machine-like approach to regression that in my opinion doesn't admit the kind of thinking needed to build a meaningful parametric model. My background is in economics, so I'll invariably recommend Wooldridge's Introductory Econometrics: A Modern Approach for what I think is a much more organic and intuitive approach to regression. There's a lot of stuff in there you don't need to know if you aren't working with, say, survey data, but there's nothing in there you don't want to know. Seeing regression built up from statistical principles, as well as the geometric/algebraic principles you get in Elements, is important for understanding it.