
I have a set of predictors that clearly suffer from some amount of multicollinearity, so I am using PCA to make the columns of X orthogonal. I am also using this as a way to regularize the subsequent regression by removing components that account for ~0% of the variance.

For example, if I start with 12 predictors and run OLS regression on the PCA-transformed data retaining 8 components, I can then use the eigenvectors from the original PCA transformation to map the beta weights back onto the original 12 predictors. So far, so good.

However, to evaluate the contributions of these predictors to the model fit, I'd like to transform the 95% confidence intervals back into the original space of the 12 predictors. That way, I can use the overall R^2 and associated p-values for the full model to identify significant regressions in which specific predictors make non-zero contributions.

It is unclear to me how to transform the 95% confidence intervals. If that's not possible, is there another way to evaluate the significance of specific predictors in the original space?

Thanks

Matt L.
  • Well I don't think that my data satisfy any of those special cases. I'm not clear why the CIs have infinite range, though. If that's true, then is there another way to assess the significance of individual weights? I don't particularly care about arbitrary thresholds (which is why 95% CIs are not ideal for me anyway), but I just need to be able to know whether a given weight has a "significant" (i.e., non-zero) contribution to the overall model. – Matt L. Feb 12 '14 at 19:41
  • Ok, I think that makes sense. What are the implications for interpreting the beta weights after projecting them back into the original space? It seems that they can still be understood as contributions of individual predictors to explaining the output, but does that mean that any non-zero value is "different" from zero? – Matt L. Feb 12 '14 at 20:25
  • @whuber: May I ask you to read my answer and check whether my reasoning is correct? – cbeleites unhappy with SX Feb 12 '14 at 20:59
  • @amoeba You are correct and I apologize for posting a comment that was so misleading (since deleted). Please see my remarks after the answer posted by cbeleites, which apparently crossed yours in the ether. – whuber Feb 12 '14 at 21:53

1 Answer


Do I understand correctly:

PCA calculates $p$ scores $\mathbf T^{(n \times p)}$ from data $\mathbf X^{(n \times m)}$ using the transpose of the loadings $\mathbf P^{(p \times m)}$:

$\mathbf T = \mathbf X \mathbf P^T$

then OLS models $Y^{(n \times 1)}$ using coefficients $\beta^{(p \times 1)}$:

$Y = \mathbf T \beta$, together:

$Y = \mathbf X \mathbf P^T \beta = \mathbf X \mathbf B$ with

$\mathbf B^{(m \times 1)} = \mathbf P^T \beta$

And now you want to have some indication of the variance of $\mathbf B$?


First of all, in order to get confidence intervals for $\mathbf B$ you need to consider both the PCA and the regression.

Calculating confidence intervals for $\beta$ alone doesn't make sense: the PCA projection is not unique, i.e. the axes can flip sign without notice. In addition, for your PCR model, rotations within the $p$-dimensional space of the retained PCs also leave the predictions unchanged, as long as $\beta$ changes accordingly.
I suspect that not taking care of these equivalences (= restrictions/constraints) is what causes the $\pm \infty$ range in @whuber's comment.

I think of it this way: what happens to my model if I acquire a new data set and fit a new model? The models can be equivalent (having the same $\mathbf B$) but have different loadings $\mathbf P$ and regression coefficients $\beta$.
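
(To spell out the sign-flip case: let $\mathbf D$ be any diagonal matrix with entries $\pm 1$, so that $\mathbf D \mathbf P$ is an equally valid set of loadings. Then

$(\mathbf D \mathbf P)^T (\mathbf D \beta) = \mathbf P^T \mathbf D^T \mathbf D \beta = \mathbf P^T \beta = \mathbf B,$

i.e. flipping axes and flipping $\beta$ accordingly leaves $\mathbf B$ untouched.)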


Now, I have no idea how to get confidence intervals for the PCA, nor how to combine the two sources of uncertainty given these equivalence constraints. I usually go a much easier way:

I bootstrap $\mathbf B$ during a resampling (out-of-bootstrap) validation.

(So far I haven't needed confidence intervals for $\mathbf B$; for my purposes the distribution of the observed $\mathbf B$s over the bootstrap iterations is good enough - I need "hard numbers" only for the predictive power.)
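
A minimal sketch of that bootstrap, assuming numpy and a number of PCs fixed beforehand (all names here are illustrative):

```python
import numpy as np

def fit_pcr(X, y, n_pc):
    # Center, project onto the first n_pc loadings, then run OLS on the scores.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc]                          # (p x m) loadings
    T = Xc @ P.T                           # (n x p) scores
    beta, *_ = np.linalg.lstsq(T, yc, rcond=None)
    return P.T @ beta                      # B = P^T beta, original-space coefficients

def bootstrap_B(X, y, n_pc, n_boot=1000, seed=0):
    # Recompute the *whole* model (PCA + OLS) on each resample.
    rng = np.random.default_rng(seed)
    n = len(y)
    Bs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw rows with replacement
        Bs[b] = fit_pcr(X[idx], y[idx], n_pc)
    return Bs                              # distribution of B over the bootstrap
```

Percentile ranges of `Bs` then describe the observed variability of $\mathbf B$ directly in the original space. Note that this sidesteps the flip/rotation ambiguity, because each resample's $\mathbf P$ and $\beta$ are collapsed into $\mathbf B$ before anything is compared.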

cbeleites unhappy with SX
  • Your description of the problem matches what I'm doing. I thought about bootstrapping $\mathbf B$, but is it bootstrapped over different PCA loadings? As in, do I need to select different combinations of PCs on each iteration? That wouldn't seem to be very useful, since I have a set of PCs that account for 99% of the variance, and throwing out additional PCs would only decrease the data quality. – Matt L. Feb 12 '14 at 21:09
  • @MattL.: I'd avoid selecting particular PCs if possible, as that is a hyperparameter-optimization problem (you can read lots about the associated problems on this site). For my domain, I have enough experience to fix the number of PCs beforehand or on data from a previous experiment (it is astonishingly low...). But if you have an automated rule that decides which PCs to use, you can just plug it in between the PCA calculation and the OLS - you need to recalculate the whole model, starting with the PCA, for each bootstrap iteration anyway. – cbeleites unhappy with SX Feb 12 '14 at 21:17
  • I don't think the non-uniqueness is an issue, because geometrically PCA *is* unique: all that's going on is some arbitrariness in how its result is *described.* It's not too different from the way in which no number in a regression is unique: you could double all the independent values, for instance, thereby halving their coefficients, but the *regression* itself does not change. – whuber Feb 12 '14 at 21:20
  • And, by the way: if you find that the first PCs don't contribute much to the prediction, i.e. your rule tends to exclude them, it is time to try out PLS regression instead. If you think about excluding variates from the original data, see e.g. [here](http://stats.stackexchange.com/questions/82992/using-principal-components-analysis-for-feature-selection/82995#82995) – cbeleites unhappy with SX Feb 12 '14 at 21:21
  • @whuber: so let's assume we have enough data points to get a reasonable OLS regression directly on the data. We could get confidence intervals for each of the $m$ coefficients, and they are not just $[-\infty, +\infty]$. Now we add a regularization, which is supposed to reduce the variance, possibly at the cost of a bias. Confidence intervals should still exist, and the whole purpose of the regularization was to make them *narrower* rather than wider. Or, we could repeat the whole experiment and then calculate c.i.s from the observed distribution. – cbeleites unhappy with SX Feb 12 '14 at 21:35
  • @whuber So: how can we reconcile these two trains of thought? What would be needed to "rescue" the c.i. discussed in the comments above? – cbeleites unhappy with SX Feb 12 '14 at 21:36
  • => would that be better discussed in the chat? – cbeleites unhappy with SX Feb 12 '14 at 21:40
  • The first PCs account for quite a lot of the variance, so the rule I've been using is to find the number of PCs needed to account for 100% and take one fewer component than that. (I feel justified doing this because I knew going in that there were severely linearly dependent predictors -- which is why I'm doing PCA in the first place.) Then it seems as though PCA is a unique solution, as is OLS, so the variability across bootstrap iterations would be zero. – Matt L. Feb 12 '14 at 21:40
  • I have been thinking about this and realize that I have confused intervals of the estimated coefficients with intervals of the data (which is silly, but there it is). I apologize for pointing you in the wrong direction. The solution is straightforward: when the $\hat\beta_i$ are estimated coefficients for principal components $\sum_j a_{ij}X_j$, then $X_j$ enters into the regression with coefficient $\sum_i\hat\beta_i a_{ij}$. Its variance is easily computed from the covariance matrix $\Sigma$ of the $\hat\beta_i$, and from that one obtains confidence intervals as usual (a sketch of this computation follows these comments). – whuber Feb 12 '14 at 21:47
  • (Note that it doesn't matter that the principal vectors are determined only up to sign. In fact, it doesn't even matter that the $\sum_ja_{ij}X_j$ are principal components: the same calculation applies no matter what the $a_{ij}$ happen to be, provided their computation is not based on any values of the *dependent* variable in the regression.) – whuber Feb 12 '14 at 21:51
  • @MattL.: well, if the data doesn't have artificial collinearities, 100% of the variance is only reached with $p = \min(m, n)$, which for any kind of experimental data would probably be much overfit. Variability across the bootstrap would not necessarily be 0. – cbeleites unhappy with SX Feb 12 '14 at 21:53
  • @whuber: yes, because we multiply with the loadings, so it is always the $\beta$ that corresponds to how the PCA happens to come out on that particular data set. But does your solution already account for the uncertainty of the PCA projection? I'm a chemist; to me it looks at the moment as if of the total error we have just the $\mathbf P \, \mathrm d\beta$ term, and something like $\beta \, \mathrm d\mathbf P$ is still missing from $\mathrm d(\mathbf P \beta)$? – cbeleites unhappy with SX Feb 12 '14 at 21:57
  • I am taking the $X_j$ to be established by the experimenter rather than measured with error, as is usual with OLS regression. Therefore there is no uncertainty in the PCA: it merely is an alternative description of the $\mathbb{X}$ matrix. If you like, think of it as a multivariate generalization of the possibility of changing the units of measurement of the individual IVs (which would correspond to a diagonal matrix $\mathbb{A}=(a_{ij})$). If you intend to use the regression results for prediction, then you need to continue to use the matrix $\mathbb{A}$ rather than recompute it on new data! – whuber Feb 12 '14 at 22:00
  • Would be nice to have $\mathbf X$ without error - I'll try to tell that to my spectrometer... My argument for PCR or PLSR would be that the regularized projection is a step that produces low-variance input for the OLS, which doesn't say that much about the stability of the projection itself. But thinking further: if the model is well set up (i.e. not too many latent variables), the projection should be stable. With that assumption, though, we do not have any guard against overfitting the PCA. – cbeleites unhappy with SX Feb 12 '14 at 22:11
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/13010/discussion-between-cbeleites-and-whuber) – cbeleites unhappy with SX Feb 12 '14 at 22:11
  • @whuber: I'm not entirely clear on your notation. What are $\sum_j, a_{ij}, X_j$? – Matt L. Feb 12 '14 at 22:22
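
To make @whuber's suggestion concrete, here is a hedged sketch in the answer's notation (assumptions: homoskedastic errors, $\mathbf X$ and $y$ already centered, and the loadings treated as fixed per the comments above; the function name is mine, not from the thread):

```python
import numpy as np
from scipy import stats

def ci_in_original_space(T, P, y, alpha=0.05):
    # T: (n x p) scores, P: (p x m) loadings, y: (n,) centered response.
    # B = P^T beta, so Cov(B) = P^T Cov(beta_hat) P,
    # with Cov(beta_hat) = sigma^2 (T^T T)^{-1} as usual for OLS.
    n, p = T.shape
    beta, *_ = np.linalg.lstsq(T, y, rcond=None)
    resid = y - T @ beta
    sigma2 = resid @ resid / (n - p)            # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(T.T @ T)  # covariance of beta_hat
    B = P.T @ beta                              # coefficients in original space
    cov_B = P.T @ cov_beta @ P                  # propagate through B = P^T beta
    se_B = np.sqrt(np.diag(cov_B))
    t = stats.t.ppf(1 - alpha / 2, df=n - p)
    return B, B - t * se_B, B + t * se_B        # point estimates and CI bounds
```

Note that $\mathrm{Cov}(\mathbf B)$ here is singular (rank $p < m$): each interval is valid marginally for its original predictor, but the coefficients remain linearly constrained - exactly the regularization effect discussed above.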