
Suppose I have a linear model Y=AX, and I tune A based on observed data. I know that correlation among my independent variables, X, will increase the uncertainty in my model coefficients, A. How do I quantify that uncertainty?

To illustrate my question with a concrete example, I composed a dummy system in which y = a0 x0 + a1 x1 + e. The independent variables x0 and x1 are each normally distributed as 50 ± 10, and the error e is 0 ± 10. I generated 1000 sample points, modeled the system as y = a0 x0 + a1 x1, and solved for a0 and a1. I repeated this 10,000 times and found a0 = 1.000 ± 0.023 and a1 = 1.000 ± 0.023.

Then I repeated the experiment, but this time I engineered the data set so that x0 and x1 are highly correlated, with an r-squared of 0.9. This time I found a0 = 1.000 ± 0.100 and a1 = 1.000 ± 0.100.
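
For reference, here is a minimal numpy sketch of this setup (the way x1 is built from x0 below is just one way to get an r-squared of 0.9 between them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 10_000
rho = np.sqrt(0.9)   # squared correlation of 0.9 between x0 and x1; set rho = 0 for the uncorrelated case

a_hat = np.empty((reps, 2))
for i in range(reps):
    x0 = rng.normal(50, 10, n)
    x1 = 50 + rho * (x0 - 50) + np.sqrt(1 - rho**2) * rng.normal(0, 10, n)  # same 50 +/- 10 marginal
    e = rng.normal(0, 10, n)
    y = x0 + x1 + e                               # true a0 = a1 = 1
    X = np.column_stack([x0, x1])                 # no intercept column, matching the model above
    a_hat[i] = np.linalg.lstsq(X, y, rcond=None)[0]

print(a_hat.mean(axis=0))   # both close to 1.000
print(a_hat.std(axis=0))    # roughly 0.10 with rho**2 = 0.9, roughly 0.023 with rho = 0
```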

Clearly, the correlation among the independent variables led to a significant increase in the standard deviation around the estimates of the model coefficients.

My question is: If I have known correlation among my independent variables, how can I estimate the resulting uncertainty in my model coefficients? Or, to relate the question to my example, how could I have used knowledge of a 0.9 r-squared between x0 and x1 to predict that uncertainty around a0 and a1 would increase from 0.023 to 0.100?

awlman
  • Many related threads in our [tag:multicollinearity] tag. You might find one of them answers your questions. – mkt Oct 10 '19 at 18:32
  • Yeah, I think I did find an answer. I just learned about the "variance inflation factor". That seems to be exactly what I'm looking for. I'll leave this question open in case anyone has anything to add, but I feel like I already learned what I needed. – awlman Oct 10 '19 at 18:36
  • I'm voting to close this question because it has been answered elsewhere. – Peter Flom Oct 11 '19 at 11:22

1 Answer


As explained on this page, the calculation is not simple, but it is typically performed for you by standard statistical software. In general, you need to know the relationships among all of the predictors.

Write the linear regression model in matrix form: $Y=X\beta+\epsilon$, with $X$ the matrix of data values (rows for observations, columns for predictors, with the first column consisting of 1s corresponding to the intercept), $\beta$ the regression coefficients, and $\epsilon$ the error term. With $X^T$ as the transpose of $X$, the covariances among the coefficients are the elements of the matrix:

$$\Sigma = s^2\cdot(X^TX)^{-1}$$ where $s^2$ is the residual variance from the model fit. The variances of the individual coefficients are the diagonal elements of this matrix. The off-diagonal elements provide information you need to estimate variances of combinations of coefficients in statistical tests.
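
Here is a minimal numpy sketch that applies this formula to a single simulated data set like the one in the question (the construction of the correlated x1 is just one way to get an r-squared of 0.9); the square roots of the diagonal of $\Sigma$ come out close to the roughly 0.10 spread seen in the question's simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 1000, np.sqrt(0.9)
x0 = rng.normal(50, 10, n)
x1 = 50 + rho * (x0 - 50) + np.sqrt(1 - rho**2) * rng.normal(0, 10, n)
y = x0 + x1 + rng.normal(0, 10, n)

X = np.column_stack([x0, x1])             # prepend a column of 1s here if the model has an intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])     # residual variance, n - p degrees of freedom
Sigma = s2 * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficient estimates
print(np.sqrt(np.diag(Sigma)))            # standard errors of a0 and a1, roughly 0.10 here
```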

This effect of correlations among predictors is often summarized by the variance inflation factor (VIF). For predictor $j$, $\mathrm{VIF}_j = 1/(1-R_j^2)$, where $R_j^2$ is the R-squared from regressing predictor $j$ on all of the other predictors; in a model with an intercept, the variance of that coefficient is larger by this factor than it would be with uncorrelated predictors of the same variance.
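
A minimal sketch of that calculation in numpy, again with one possible construction of the correlated predictors:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    xj = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    fitted = Z @ np.linalg.lstsq(Z, xj, rcond=None)[0]
    r2 = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(2)
n, rho = 1000, np.sqrt(0.9)                  # squared correlation 0.9, as in the question
x0 = rng.normal(50, 10, n)
x1 = 50 + rho * (x0 - 50) + np.sqrt(1 - rho**2) * rng.normal(0, 10, n)
X = np.column_stack([x0, x1])

print(vif(X, 0), vif(X, 1))                  # both close to 1 / (1 - 0.9) = 10
```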

If you are doing a generalized regression like logistic, Cox, or Poisson that is solved with a maximum likelihood estimate (MLE), the relationships between the predictor correlations and the coefficient covariances can be less straightforward. The estimate of the coefficient covariance matrix is then "(the negative of) the inverse of the Hessian of the log-likelihood function of the sample, evaluated at the MLE," as noted on this page.
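
For illustration, a minimal sketch with statsmodels (the logistic data-generating model below is just an assumed example), checking that the reported standard errors match the square roots of the diagonal of the inverted negative Hessian at the MLE:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho = 1000, np.sqrt(0.9)
x0 = rng.normal(0, 1, n)
x1 = rho * x0 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)   # correlated predictors
X = sm.add_constant(np.column_stack([x0, x1]))
p = 1 / (1 + np.exp(-(0.5 + x0 + x1)))                      # true logistic model
y = rng.binomial(1, p)

res = sm.Logit(y, X).fit(disp=0)
cov_hess = np.linalg.inv(-res.model.hessian(res.params))    # (negative Hessian)^{-1} at the MLE
print(np.sqrt(np.diag(cov_hess)))                           # standard errors from the Hessian
print(res.bse)                                              # statsmodels' standard errors; should agree
```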

EdM