39

Can anyone recommend a good exposition of the theory behind partial least squares regression (available online) for someone who understands SVD and PCA? I have looked at many sources online and have not found anything that had the right combination of rigor and accessibility.

I have looked into The Elements of Statistical Learning, which was suggested in a comment on a question asked on Cross Validated, What is partial least squares (PLS) regression and how is it different from OLS?, but I don't think that this reference does the topic justice (it's too brief to do so, and doesn't provide much theory on the subject). From what I've read, PLS constructs linear combinations of the predictor variables, $z_i = X \varphi_i$, that maximize the covariance $y^T z_i$ subject to the constraints $\|\varphi_i\|=1$ and $z_i^T z_j=0$ if $i \neq j$, where the $\varphi_i$ are chosen iteratively, in the order in which they maximize the covariance. But even after all I've read, I'm still uncertain whether that is true, and if so, how the method is executed.

ClarPaul

2 Answers

44

Section 3.5.2 in The Elements of Statistical Learning is useful because it puts PLS regression in the right context (among other regularization methods), but it is indeed very brief and leaves some important statements as exercises. In addition, it only considers the case of a univariate dependent variable $\mathbf y$.

The literature on PLS is vast, but can be quite confusing because there are many different "flavours" of PLS: univariate versions with a single DV $\mathbf y$ (PLS1) and multivariate versions with several DVs $\mathbf Y$ (PLS2), symmetric versions treating $\mathbf X$ and $\mathbf Y$ equally and asymmetric versions ("PLS regression") treating $\mathbf X$ as independent and $\mathbf Y$ as dependent variables, versions that allow a global solution via SVD and versions that require iterative deflations to produce every next pair of PLS directions, etc. etc.

All of this has been developed in the field of chemometrics and stays somewhat disconnected from the "mainstream" statistical or machine learning literature.

The overview paper that I find most useful (and that contains many further references) is:

For a more theoretical discussion I can further recommend Frank & Friedman (1993), A Statistical View of Some Chemometrics Regression Tools, Technometrics, 35(2), 109–135, which is quoted at length below.


A short primer on PLS regression with univariate $y$ (aka PLS1, aka SIMPLS)

The goal of regression is to estimate $\beta$ in a linear model $\mathbf y=\mathbf X\beta + \epsilon$. The OLS solution $\hat\beta=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ enjoys many optimality properties but can suffer from overfitting. Indeed, OLS looks for the $\beta$ that yields the highest possible correlation of $\mathbf X \beta$ with $\mathbf y$. If there are a lot of predictors, then it is always possible to find some linear combination that happens to have a high correlation with $\mathbf y$. This will be a spurious correlation, and such a $\beta$ will usually point in a direction explaining very little variance in $\mathbf X$. Directions explaining very little variance are often very "noisy" directions. If so, then even though the OLS solution performs great on the training data, it will perform much worse on test data.
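
To make the "spurious correlation" point concrete, here is a tiny numpy sketch (my own illustration; the sample sizes and the seed are arbitrary choices):

```
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 50                                   # few samples, many predictors
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)   # y is pure noise

# OLS fit on the training data
beta_ols, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

corr_train = np.corrcoef(X_train @ beta_ols, y_train)[0, 1]
corr_test = np.corrcoef(X_test @ beta_ols, y_test)[0, 1]
print(f"train corr: {corr_train:.2f}, test corr: {corr_test:.2f}")
# typically something like: train corr around 0.9, test corr near 0 -- a spurious fit
```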

In order to prevent overfitting, one uses regularization methods that essentially force $\beta$ to point toward directions of high variance in $\mathbf X$ (this is also called "shrinkage" of $\beta$; see Why does shrinkage work?). One such method is principal component regression (PCR), which simply discards all low-variance directions. Another (better) method is ridge regression, which smoothly penalizes low-variance directions. Yet another method is PLS1.
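
To make "smoothly penalizes" precise, here is the standard SVD view (as in The Elements of Statistical Learning, Section 3.4.1; assuming centered $\mathbf X$ with singular value decomposition $\mathbf X = \mathbf U \mathbf D \mathbf V^\top$ and singular values $d_j$): the fitted values of ridge and PCR differ only in how much they shrink along each principal direction, $$\hat{\mathbf y}_\mathrm{ridge} = \sum_j \mathbf u_j \frac{d_j^2}{d_j^2+\lambda} \mathbf u_j^\top \mathbf y, \qquad \hat{\mathbf y}_\mathrm{PCR} = \sum_{j \le M} \mathbf u_j \, \mathbf u_j^\top \mathbf y,$$ i.e. a smooth factor $d_j^2/(d_j^2+\lambda)$ for ridge versus a hard $0/1$ cutoff after the first $M$ components for PCR.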

PLS1 replaces the OLS goal of finding $\beta$ that maximizes correlation $\operatorname{corr}(\mathbf X \beta, \mathbf y)$ with an alternative goal of finding $\beta$ with length $\|\beta\|=1$ maximizing covariance $$\operatorname{cov}(\mathbf X \beta, \mathbf y)\sim\operatorname{corr}(\mathbf X \beta, \mathbf y)\cdot\sqrt{\operatorname{var}(\mathbf X \beta)},$$ which again effectively penalizes directions of low variance.
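
To see where the first PLS direction comes from (assuming $\mathbf X$ and $\mathbf y$ are centered, so that the covariance is just an inner product up to a constant factor), note that by Cauchy–Schwarz $$\operatorname{cov}(\mathbf X \beta, \mathbf y) \;\propto\; (\mathbf X\beta)^\top \mathbf y \;=\; \beta^\top (\mathbf X^\top \mathbf y) \;\le\; \|\beta\| \, \|\mathbf X^\top \mathbf y\| \;=\; \|\mathbf X^\top \mathbf y\|,$$ with equality exactly when $\beta \propto \mathbf X^\top \mathbf y$. This is why the first direction, mentioned parenthetically below, is $\mathbf X^\top \mathbf y$ normalized to unit length.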

Finding such a $\beta$ (let's call it $\beta_1$) yields the first PLS component $\mathbf z_1 = \mathbf X \beta_1$. One can then look for a second (and then third, etc.) PLS component that has the highest possible covariance with $\mathbf y$ under the constraint of being uncorrelated with all the previous components. This has to be solved iteratively, as there is no closed-form solution for all components (the direction of the first component $\beta_1$ is simply given by $\mathbf X^\top \mathbf y$ normalized to unit length). Once the desired number of components has been extracted, PLS regression discards the original predictors and regresses $\mathbf y$ on the PLS components; the resulting coefficients can then be combined with the $\beta_i$ to express the final $\beta_\mathrm{PLS}$ in terms of the original predictors.
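
Here is a minimal numpy sketch of this iterative procedure (a NIPALS-style implementation of the description above; it is my own illustration, assumes $\mathbf X$ and $\mathbf y$ are centered, and all variable names are arbitrary):

```
import numpy as np

def pls1(X, y, n_components):
    """NIPALS-style PLS1 sketch; X (n, p) and y (n,) are assumed centered."""
    n, p = X.shape
    W = np.zeros((p, n_components))   # weights on the (deflated) X
    P = np.zeros((p, n_components))   # X loadings
    T = np.zeros((n, n_components))   # scores = the PLS components z_k
    Xk = X.copy()
    for k in range(n_components):
        w = Xk.T @ y
        w /= np.linalg.norm(w)            # first pass: X^T y normalized to unit length
        t = Xk @ w                        # k-th PLS component
        p_k = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, p_k)        # deflate so later components are uncorrelated
        W[:, k], P[:, k], T[:, k] = w, p_k, t
    q = T.T @ y / np.sum(T * T, axis=0)          # regress y on the (orthogonal) scores
    beta_pls = W @ np.linalg.solve(P.T @ W, q)   # coefficients on the original predictors
    return beta_pls, T

# toy usage (arbitrary simulated data):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated predictors
X -= X.mean(axis=0)
y = X[:, 0] + rng.normal(size=100)
y -= y.mean()
beta_2, _ = pls1(X, y, n_components=2)   # 2 components: a regularized fit
```

The factor $(\mathbf P^\top \mathbf W)^{-1}$ is what converts the weights (defined on deflated data) into vectors acting on the original predictors: the columns of $\mathbf W(\mathbf P^\top \mathbf W)^{-1}$ are the $\beta_i$ with $\mathbf z_i = \mathbf X \beta_i$. With `n_components` equal to the rank of $\mathbf X$, `beta_pls` coincides with the OLS solution (see note 1 below).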

Note that:

  1. If all PLS1 components are used, then PLS will be equivalent to OLS. So the number of components serves as a regularization parameter: the lower the number, the stronger the regularization. (A numerical check of this point and of point 3 follows this list.)
  2. If the predictors $\mathbf X$ are uncorrelated and all have the same variance (i.e. $\mathbf X$ has been whitened), then there is only one PLS1 component and it is equivalent to OLS.
  3. Weight vectors $\beta_i$ and $\beta_j$ for $i\ne j$ are not going to be orthogonal, but will yield uncorrelated components $\mathbf z_i=\mathbf X \beta_i$ and $\mathbf z_j=\mathbf X \beta_j$.
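
Here is a small numerical check of notes 1 and 3, sketched with scikit-learn's `PLSRegression` (this assumes scikit-learn is available; as I understand its conventions, `transform` returns the scores $\mathbf z_i$ and `x_rotations_` holds the corresponding weight vectors $\beta_i$ as columns):

```
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # correlated predictors
y = X @ rng.normal(size=6) + rng.normal(size=200)

# Note 1: with as many components as predictors, PLS1 reproduces the OLS fit.
pls_full = PLSRegression(n_components=6, scale=False).fit(X, y)
ols = LinearRegression().fit(X, y)
print(np.allclose(pls_full.predict(X).ravel(), ols.predict(X)))   # expected: True

# Note 3: the components are uncorrelated, yet the weight vectors are not orthogonal.
pls_3 = PLSRegression(n_components=3, scale=False).fit(X, y)
T = pls_3.transform(X)        # scores, i.e. the components z_1, z_2, z_3
R = pls_3.x_rotations_        # beta_1, beta_2, beta_3 as columns
print(np.abs(np.triu(T.T @ T, k=1)).max())   # ~0 up to rounding: uncorrelated components
print(R.T @ R)                               # off-diagonal entries clearly non-zero
```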

All that being said, I am not aware of any practical advantages of PLS1 regression over ridge regression (while the latter does have lots of advantages: it is continuous rather than discrete, has an analytical solution, is much more standard, allows kernel extensions and analytical formulas for leave-one-out cross-validation errors, etc. etc.).


Quoting from Frank & Friedman:

RR, PCR, and PLS are seen in Section 3 to operate in a similar fashion. Their principal goal is to shrink the solution coefficient vector away from the OLS solution toward directions in the predictor-variable space of larger sample spread. PCR and PLS are seen to shrink more heavily away from the low spread directions than RR, which provides the optimal shrinkage (among linear estimators) for an equidirection prior. Thus PCR and PLS make the assumption that the truth is likely to have particular preferential alignments with the high spread directions of the predictor-variable (sample) distribution. A somewhat surprising result is that PLS (in addition) places increased probability mass on the true coefficient vector aligning with the $K$th principal component direction, where $K$ is the number of PLS components used, in fact expanding the OLS solution in that direction.

They also conduct an extensive simulation study and conclude (emphasis mine):

For the situations covered by this simulation study, one can conclude that all of the biased methods (RR, PCR, PLS, and VSS) provide substantial improvement over OLS. [...] In all situations, RR dominated all of the other methods studied. PLS usually did almost as well as RR and usually outperformed PCR, but not by very much.


Update: In the comments @cbeleites (who works in chemometrics) suggests two possible advantages of PLS over RR:

  1. An analyst can have an a priori guess as to how many latent components should be present in the data; this effectively allows setting the regularization strength without doing cross-validation (and there might not be enough data to do a reliable CV). Such an a priori choice of $\lambda$ might be more problematic in RR.

  2. RR yields one single linear combination $\beta_\mathrm{RR}$ as an optimal solution. In contrast, PLS with e.g. five components yields five linear combinations $\beta_i$ that are then combined to predict $y$. Original variables that are strongly inter-correlated are likely to be combined into a single PLS component (because combining them increases the explained-variance term). So it might be possible to interpret the individual PLS components as real latent factors driving $y$. The claim is that it is easier to interpret $\beta_1, \beta_2,$ etc., than the joint $\beta_\mathrm{PLS}$. Compare this with PCR, where one can also see it as an advantage that individual principal components can potentially be interpreted and assigned some qualitative meaning.

amoeba
  • 1
    That paper looks useful. I don't think it addresses how much overfitting can be caused by PLS. – Frank Harrell Nov 02 '15 at 12:15
  • 3
    That's right, @Frank, but honestly, as far as predictive performance is concerned, I don't see much sense in doing anything else than ridge regression (or perhaps an elastic net if sparsity is desired too). My own interest in PLS is in the dimensionality reduction aspect when both $X$ and $Y$ are multivariate; so I am not very interested in how PLS performs as a regularization technique (in comparison with other regularization methods). When I have a linear model that I need to regularize, I prefer to use ridge. I wonder what's your experience here? – amoeba Nov 02 '15 at 14:05
  • 3
    My experience is that ridge (quadratic penalized maximum likelihood estimation) gives superior predictions. I think that some analysts feel that PLS is a dimensionality reduction technique in the sense of avoiding overfitting but I gather that's not the case. – Frank Harrell Nov 02 '15 at 16:24
  • @Frank Harrell, why do you imply PLS is not "a dimensionality reduction technique in the sense of avoiding overfitting"? In ["Elements of Statistical Learning"](http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf), the topic is considered with selection and shrinkage techniques, from which I infer the authors consider it regularization-like. It is certainly a parameter reduction technique, in that it helps "down-select" to a small(er) number of coefficients from an originally very large set (e.g., one with $p > n$). And very large sets of variables tend to overfit. – ClarPaul Nov 02 '15 at 17:00
  • 1
    @clarpaul: I think what Frank meant is that while PLS is a regularization technique (as you say), it might not be able to "avoid overfitting" altogether; there can still be a fair amount of overfitting when using PLS (as compared e.g. with ridge regression). – amoeba Nov 02 '15 at 17:07
  • In [Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf), the algorithm that the authors give for PLS in section 3.5.2 runs out after only the first iteration if the variables are already orthogonalized (stated in 3.5.2; see Exercise 3.14). So, does that imply that for cases where the data is almost orthogonal (for whatever reason) PLS brings little value? And that to some degree, PLS is more about orthogonalization than about "uncovering latent structure" in the variables $X$ & $y$? – ClarPaul Nov 02 '15 at 17:09
  • @clarpaul, But what *is* "uncovering latent structure"? If you run PCA on orthogonalized data, you will also get exactly the same data back; still, PCA is usually taken to "uncover latent structure". That's because we assume that latent structure manifests itself in correlations; if all correlations are zero, there is no *latent* structure. In the context of regression, if predictors $X$ are orthogonal (and if there is only one $y$) then both PCR (Section 3.5.1) and PLS will equal OLS. Note that ridge shrinkage can still be beneficial (even though it's rarely used with orthogonal $X$). – amoeba Nov 02 '15 at 17:17
  • @clarpaul, I have now included an extended discussion of what PLS regression does (that might make it clearer to you), and along the way realized that PLS only yields OLS after one iteration if predictors are uncorrelated **and all have the same variance**. Hastie et al. assume that that's the case in the beginning of Section 3.5.2. Without this equal variance condition, the statement is not necessarily true. – amoeba Nov 02 '15 at 23:34
  • @amoeba, thanks for your comments. When applied to orthogonal, normalized $X$ for single response $y$, PLS creates only a single latent variable, thus not providing any guidance on feature selection at all. In PCR, the number of distinct PCA directions, and the order of preference for regression, is pre-defined and does not depend on starting point or choice of computational algorithm. Am I wrong? – ClarPaul Nov 03 '15 at 00:09
  • @clarpaul, you are not wrong, that's all true. I noticed that I was not correct above in my comment about PCR. So if X is orthogonal and standardized, then PLS creates only one latent variable and this will be equivalent to OLS. Further, if X is orthogonal and standardized, then all directions have exactly the same variance, so PCR is sort of ill-defined... It can pick the same latent variable as the first PC, but it can also pick any other linear combination as the first latent variable. So it does not have to be equivalent to OLS. – amoeba Nov 03 '15 at 00:35
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/31034/discussion-between-clarpaul-and-amoeba). – ClarPaul Nov 03 '15 at 01:48
  • @amoeba, in *Introduction to Statistical Learning*, by Hastie & Tibshirani, the authors actually recommend "standardizing" each predictor by dividing by its standard deviation! Can it be that there is still useful PCA data left in $X$ after it is standardized? – ClarPaul Nov 03 '15 at 04:27
  • @amoeba, thanks for your write-up. But in your post above, how should we interpret your statements about singular vectors of $X'y$? $X'y$ is only a vector, not a matrix, right? – ClarPaul Nov 03 '15 at 04:47
  • (1) If all predictors are standardized and uncorrelated, then their covariance matrix is the identity matrix. There is absolutely no sense in doing PCA then. But if the predictors are correlated, then even after standardizing PCA can certainly make sense. (2) Oops, thanks for noticing. I was thinking about multivariate $Y$, when $X^\top Y$ is a whole matrix. In the case of single $y$ that we are discussing here, you are right that $X^\top y$ is only a vector, so no need for any singular vectors at all! I will now edit my post to remove this confusion. Thanks. – amoeba Nov 03 '15 at 09:51
  • @clarpaul: I found one more paper that discusses theory behind PLS regression, I inserted the link into my answer and some quotes from there. Seems to be a useful paper. See also my previous comment. – amoeba Nov 03 '15 at 10:37
  • 1
    I think that nomenclature can be confusing. _Dimension reduction_ makes it sound as if the effective number of parameters being estimated is lessened. But when associations with $Y$ are examined in order to achieve this, the number of parameters "lost" is illusory and overfitting is not avoided, at least not entirely. This is in distinction to PCA, predictor-variable clustering, etc. – Frank Harrell Nov 03 '15 at 11:36
  • 1
    @amoeba: there are two very practical and very chemometric reasons I can think of: a) External knowledge about the application and the data-generating process can decidedly point towards a particular number of latent variables. In this situation, I check whether the data agrees with this external knowledge. If yes, fine; basically the hyperparameter optimization is eliminated. If not, this tells me that I need to understand the application problem better. – cbeleites unhappy with SX Nov 03 '15 at 14:39
  • 1
    ... b) Depending on the data generating process, the latent variables can be interpreted (e.g. what their spectroscopic meaning is). This is much easier if sorted into a bunch of latent variables comparted to having just one coefficient vector. – cbeleites unhappy with SX Nov 03 '15 at 14:40
  • @cbeleites: Thanks for joining in! (a) I am not sure I fully understand. Are you saying that you expect PLSR to outperform RR in chemometrics applications due to certain specific properties of the data? Can you point me to some studies where this was actually shown to happen? Or what happens when the performance of PLSR is compared to RR? (b) That's interesting. So are you saying that $\beta_1$, $\beta_2$ etc. can be separately interpreted, and even though they can all in the end be joined into one single $\beta_\mathrm{PLS}$, it's easier to interpret them separately? – amoeba Nov 03 '15 at 15:07
  • 1
    @amoeba: I'm not saying that the best PLS model is going to outperform the best RR model; I'm saying that in some situations (particularly if the sample size does not really allow model comparisons) it is possible to get a good (or even the best) PLS model from external knowledge, without hyperparameter optimization. I do not have enough experience with RR to know whether, with more experience, it is possible to fix λ by similarly external knowledge. – cbeleites unhappy with SX Nov 03 '15 at 16:14
  • 2
    b) if you are going for a, say, spectroscopic interpretation of what the model does, I find it easier to see from the PLS loadings what kind of substances are measured. You may find one or two substances/substance classes in there, whereas the coefficients, which include all latent variables, are harder to interpret because spectral contributions of more substances are combined. This is more prominent because not all of the usual spectral interpretation rules apply: a PLS model may pick some bands of a substance while ignoring others. "Normal" spectra interpretation uses a lot of "this band could ... – cbeleites unhappy with SX Nov 03 '15 at 16:19
  • 2
    ... come from this or that substance. If it is this substance, there must be this other band." Since this latter way of verifying a substance is not possible with the latent variables/loadings/coefficients, interpreting things that vary together and therefore end up in the same latent variable is much easier than interpreting coefficients that already summarize all kinds of possible "hints" known to the model. – cbeleites unhappy with SX Nov 03 '15 at 16:21
  • @FrankHarrell: I seem to remember reading that the required (recommended) sample size for training PLS and PCA based classifiers compared to unregularized linear models improves from ca. 5 to 3 cases per class and variate. (Cannot find the source right now, possibly in the handbook of statistics vol. 2; I'm not sure about further assumptions and characteristics of the explored situations) - so some improvement, but nothing close to solving all kinds of small sample size problems. – cbeleites unhappy with SX Nov 03 '15 at 16:26
  • @amoeba, this is just a "nit", but in your write-up on PLS1, don't you have to specify that the magnitude of $\beta_1$ is 1? Otherwise you'd end up with large values of $\beta_1$, in order to maximize $\sqrt{var(X\beta_1)}$. – ClarPaul Nov 07 '15 at 19:40
  • @amoeba, I don't understand why *Frank and Friedman* say variance dominates bias where there is high collinearity. Do you, and can you explain this: `"The solutions and hence the performance of RR, PCR, and PLS tend to be quite similar in most situations, largely because they are applied to problems involving high collinearity in which variance tends to dominate the bias"` It's near the end of the first paragraph of section 1.1. – ClarPaul Nov 07 '15 at 21:43
  • Thanks, @clarpaul. I made various fixes/updates and a big addendum summarizing the points raised by cbeleites. Regarding this quote: no, it is not completely clear to me either. What is clear, is that high collinearity means a lot of variance (for the OLS estimator); that's what leads to overfitting. F&F probably mean that RR/PCR/PLS introduce bias, but it's beneficial because reduction in variance is larger (see http://stats.stackexchange.com/questions/179864/ and also *The Elements*). That's how I understand "dominates". Not sure though how this explains that RR/PCR/PLS tend to be similar. – amoeba Nov 07 '15 at 22:58
  • Thanks, @cbeleites. I made an update trying to summarize your points as I understood them. – amoeba Nov 07 '15 at 22:59
  • @amoeba, the reminder that OLS maximizes corr(Xβ,y) was really really helpful to me! Thanks for all your work on this. But, may I suggest, you define what you mean by z_i & z_j in your Note (3)? – ClarPaul Nov 30 '15 at 17:33
  • 1
    @amoeba, regarding *I don't see much sense in doing anything else than ridge regression (or perhaps an elastic net if sparsity is desired too)*. The Zou & Hastie paper that introduced elastic net has some examples where the elastic net beats ridge regression by design. So it is not only sparsity, but also forecast accuracy where elastic net may excel above ridge. (I am working on these issues from a theoretical perspective these days.) – Richard Hardy Mar 11 '17 at 15:07
  • @RichardHardy Yes, I agree. – amoeba Mar 11 '17 at 22:03
  • 1
    So for the first note (PLS1 is the same as OLS when all the latent variables are included), does it also hold when the number of predictors > the number of samples? Because in this case OLS does not have a unique solution, but PLS does, so I am a little confused. Or do you have the ref for that? Thanks! – Vickyyy Jan 22 '19 at 21:47
  • 1
    @Vickyyy Good question. What exactly do you mean by PLS1 with all components when p>n? What are "all" components? I guess the idea is that components equal to zero are not considered? In this case, there can be only n-1 components and the PLS1 regression solution will be the same as doing PCA of X, removing all zero PCs, and then doing OLS. Another way to say it is that it's a minimum-norm OLS solution. Anyway, I should edit to say this more explicitly -- good point. – amoeba Jan 22 '19 at 23:24
4

Yes. Herman Wold's book Theoretical Empiricism: A general rationale for scientific model-building is the single best exposition of PLS that I'm aware of, especially given that Wold is an originator of the approach. Not to mention that it's simply an interesting book to read and know about. In addition, based on a search on Amazon, the number of books on PLS written in German is astonishing, but it may be that the subtitle of Wold's book is part of the reason for that.

Mike Hunter
  • 1
    This http://www.amazon.com/Towards-Unified-Scientific-Models-Methods/dp/B007N3HZS2/ref=sr_1_1?s=books&ie=UTF8&qid=1446487159&sr=1-1&keywords=inge+helland is related but covers much more than PLS – kjetil b halvorsen Nov 02 '15 at 18:00
  • That is true but the primary focus of the book is Wold's development of the theory and application of PLS. – Mike Hunter Nov 02 '15 at 18:05