I'm trying to understand the process of statistical testing for principal component analysis (PCA) or partial least squares (PLS).
Step 1. PCA: I feel that I have a not-terrible understanding of PCA: you find the ellipsoid described by the covariance matrix of the data, and then successively take the largest axis of variation (principal component 1), then the second largest (principal component 2), and so on. If the ellipsoid is long and stretched, then the variation is mostly along the first principal component (the eigenvector corresponding to the largest eigenvalue of the covariance matrix). If the ellipsoid is a planar "disc", then the variation in the data is explained well by two principal components, etc.
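To pin down my understanding, here is a minimal sketch of that picture in numpy (the toy data and the use of `np.linalg.eigh` are my own choices; a real analysis might standardize the variables first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy data: 100 samples, 5 variables

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # 5x5 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]  # first two principal components
explained = eigvals / eigvals.sum()      # fraction of variance per component
```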
I also understand that after choosing to use (for example) only the first two principal components, all of the data points can be plotted on a "Scores" plot that shows, for each data point $D^{(i)}$, the projection of $D^{(i)}$ onto the plane spanned by the first two principal components. Likewise, for the "Loadings" plot, (I think) you write the first and second principal components as linear combinations of the input variables, and then, for each variable, plot the coefficients it contributes to the first and second principal components.
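Here is how I would compute both plots, repeating the same toy setup so the snippet stands alone (I'm aware that some software scales loadings by the square roots of the eigenvalues, so the exact convention here is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # same toy data as above
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]    # columns ordered by variance

scores = Xc @ eigvecs[:, :2]   # (100, 2): projection of each point onto PC1, PC2
loadings = eigvecs[:, :2]      # (5, 2): each variable's coefficient in PC1, PC2

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(scores[:, 0], scores[:, 1])
ax1.set(title="Scores", xlabel="PC1", ylabel="PC2")
ax2.scatter(loadings[:, 0], loadings[:, 1])
ax2.set(title="Loadings", xlabel="PC1", ylabel="PC2")
plt.show()
```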
Step 2. PLS or PLS-DA: If there are labels on the data (let's say binary classes), then build a linear regression model that uses the first and second principal components to discriminate class 0 (for data point $i$, that means $Y^{(i)}=0$) from class 1 ($Y^{(i)}=1$): first project all data onto the plane spanned by the first and second principal components, and then regress $Y$ on the projected input data $X_1', X_2'$. This regression could be written as (first step) the affine transformation (i.e. linear transformation + bias) that projects along $PC_1, PC_2$ (the first and second principal components), and then (second step) a second affine transformation that predicts $Y$ from $PC_1, PC_2$. Together these transformations $Y \approx \mathrm{Affine}(\mathrm{Affine}(X))$ collapse into a single affine transformation $Y \approx C (A X + B) + D = E X + F$, with $E = CA$ and $F = CB + D$.
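Here is a minimal sketch of that collapse (the toy data and labels are invented; I also realize that regressing on PCA scores like this may technically be principal component regression rather than true PLS, which computes its components using $Y$, so correct me if that distinction matters here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)  # toy binary labels

# First affine map: scores = A X + B, projecting onto PC1 and PC2.
eigvals, eigvecs = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
A = eigvecs[:, np.argsort(eigvals)[::-1]][:, :2].T   # (2, 5): top-2 PCs as rows
B = -A @ X.mean(axis=0)                              # centering folded into the bias
scores = X @ A.T + B

# Second affine map: Y ~ C * scores + D.
reg = LinearRegression().fit(scores, y)
C, D = reg.coef_, reg.intercept_

# Collapse: Y ~ C(AX + B) + D = EX + F, with E = CA and F = CB + D.
E = C @ A            # (5,): one coefficient per original variable
F = C @ B + D        # scalar bias
assert np.allclose(X @ E + F, reg.predict(scores))
```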
Step 3. Testing variables from $X$ for significance in predicting the class $Y$: This is where I could use some help (unless I'm way off already, in which case tell me!). How do you take an input variable (i.e. a feature that has not yet been projected onto the principal-component (hyper)plane) and decide whether it has a statistically significant coefficient in the regression $Y \approx E X + F$? Qualitatively, a coefficient in $E$ that is further from zero (i.e. a positive or negative value with large magnitude) indicates a larger contribution from that variable.
I remember seeing linear regression t-tests for normally distributed data (to test whether the coefficients were zero). Is this the standard approach? In that case, I would guess that every variable from $X$ would need to be transformed to have a roughly normal distribution in a "Step 0" (i.e. before any of these other steps are performed).
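If that's right, I assume the test I'm remembering looks something like this in statsmodels (here applied to an OLS fit on the raw variables; whether the same t statistics would be valid for the collapsed $E$ above is exactly what I'm unsure about):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)

Xd = sm.add_constant(X)      # prepend an intercept column
ols = sm.OLS(y, Xd).fit()    # ordinary least squares on the raw variables

print(ols.tvalues)           # t statistic per coefficient (H0: coefficient = 0)
print(ols.pvalues)           # corresponding two-sided p-values
```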
Otherwise, I could see performing a permutation test: run this entire procedure thousands of times, each time permuting $Y$ to shuffle the labels, and then compare each coefficient in $E$ from the un-shuffled analysis to the distribution of that coefficient across the shuffled analyses.
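Concretely, I imagine something like this (a sketch under my own assumptions, reusing the toy pipeline from above; the two-sided p-value convention is my choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)

def fit_E(X, y):
    """Run the whole pipeline: PCA -> regression on 2 scores -> collapsed E."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    A = eigvecs[:, np.argsort(eigvals)[::-1]][:, :2].T
    reg = LinearRegression().fit(Xc @ A.T, y)
    return reg.coef_ @ A

E_obs = fit_E(X, y)                               # coefficients from real labels
E_null = np.array([fit_E(X, rng.permutation(y))   # refit with shuffled labels
                   for _ in range(1000)])

# Per-variable two-sided p-value: fraction of shuffled runs whose |coefficient|
# reaches the observed |coefficient|.
pvals = (np.abs(E_null) >= np.abs(E_obs)).mean(axis=0)
print(pvals)
```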
Can you help me see where my intuition is failing? I've been trying to look through papers that use similar procedures to see what they did, and, as is often the case, they're clear as mud. I'm preparing a tutorial for some other researchers, and I want to do a good job.