7

Principal component regression (PCR) is in fact regression on PC scores, not on the PCs themselves. Why, then, do so many books and tutorials say something like,

in statistics, principal component regression (PCR) is a regression analysis that uses principal component analysis when estimating regression coefficients

(wiki), and also in the famous book Principal Component Analysis (Jolliffe, 2002, page 169) it says

... which [PCR] has simply replaced the predictor variables by their PCs in the regression model

It makes me quite confused.

amoeba
mingzuheng
  • Probably just because it's simpler than using principal component scores all the time; I've not seen other people confused by this. I also don't see how it would be possible to do analysis on the components - I'm not even sure what you are suggesting. – Peter Flom Sep 29 '12 at 13:45
  • "Principal components" *are* principal component scores. It is the same thing. – amoeba Dec 12 '14 at 11:37
  • @amoeba you are mistaken. Given X, an m x n matrix, in PCA we find T, P such that T = PX where t_1,…,t_n are uncorrelated and arranged in order of decreasing variance. T is called the “scores” and P is called the “principal components.” Your statement implies P=T which is fundamentally misleading. http://users.cecs.anu.edu.au/~kee/pls.pdf – tbenst Apr 27 '20 at 18:19
  • @tbenst https://stats.stackexchange.com/questions/88118/what-exactly-is-called-principal-component-in-pca – amoeba Apr 27 '20 at 19:37
  • @amoeba thanks I wasn’t aware of what you call convention 1. Nonetheless, if you look at the book the OP listed, they use PC and PC score, so convention 2 is used for this question. Furthermore, the original PCA paper clearly defines a component as having unit variance (https://psycnet.apa.org/fulltext/1934-00645-001.pdf). You are entitled to your convention preference but not sure why you’d downvote my answer when I have cited literature whereas you cite your own opinion. – tbenst Apr 28 '20 at 05:26
  • OK - I commented under your answer. – amoeba Apr 28 '20 at 08:20

2 Answers

5

I think the Wikipedia article is being a little sloppy in saying "uses principal component analysis when estimating regression coefficients". A better wording might be "uses principal component analysis to create explanatory variables before estimating regression coefficients." There's nothing objectionable in the subsequent sentence: "In PCR instead of regressing the dependent variable on the independent variables directly, the principal components of the independent variables are used."

I also don't see anything wrong with your quote from Jolliffe's book (which I haven't read). It is correct that PCR uses principal components of variables as the predictor variables in a regression model.

I don't quite understand what you mean by "regression on PC scores but not PC". You first conduct principal component analysis to create the scores and then use those scores in the regression.
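The two steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a reference implementation: the data are synthetic, rows of `X` are taken to be samples and columns variables, and names such as `directions`, `scores`, `gamma` and `beta` are hypothetical labels chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples (rows), 5 predictors (columns).
n_samples, n_vars, k = 200, 5, 2
X = rng.normal(size=(n_samples, n_vars)) @ rng.normal(size=(n_vars, n_vars))
y = X @ rng.normal(size=n_vars) + 0.1 * rng.normal(size=n_samples)

# Step 1: PCA on the centered predictors via the SVD.
x_mean = X.mean(axis=0)
Xc = X - x_mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
directions = Vt[:k].T          # (n_vars, k): unit-norm eigenvectors
scores = Xc @ directions       # (n_samples, k): the PC scores

# Step 2: ordinary least squares of y on the scores (plus an intercept).
design = np.column_stack([np.ones(n_samples), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, gamma = coef[0], coef[1:]

# The same fit written as coefficients on the original (centered) variables.
beta = directions @ gamma
fitted_from_scores = design @ coef
fitted_from_X = intercept + Xc @ beta
```

The regression itself is run on `scores` (a 200 x 2 matrix here), while `directions` is what turns the fitted coefficients back into coefficients on the original variables; both objects come out of the same PCA step.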

Nick Cox
Peter Ellis
  • +1. I think OP was confused because he or she thought that the "principal component" is something different from the "principal component score". In fact, *it is exactly the same thing* (even though some people do use this term to refer to something else). – amoeba Dec 12 '14 at 11:33
2

The other answers use different terminology from what the question's author may be familiar with. Below, I use “scores” for the scores matrix and “principal components” for the unit-variance eigenvectors.

In the general case, which covers both in-sample and out-of-sample regression, knowing the principal components matrix is sufficient to perform PCR, but knowing the scores matrix is not.

Principal component analysis

Given $X$, an $m \times n$ matrix, in PCA we find $T$, $P$ such that $T = PX$ where $t_1,\dots ,t_n$ are uncorrelated and arranged in order of decreasing variance. $T$ is called the “scores” and $P$ is called the “principal components.”
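The two properties claimed for the scores (uncorrelated, ordered by decreasing variance) can be checked numerically. The sketch below is illustrative only and makes assumptions not stated in the answer: the data are synthetic, rows of `X` are samples, `X` is centered first, and the rows of `P` are the eigenvectors of the sample covariance matrix. With samples in rows the scores are computed as `Xc @ P.T`, which is the transpose of the $T = PX$ form written above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # rows = samples
Xc = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals = eigvals[order]
P = eigvecs[:, order].T                  # rows of P are unit-norm eigenvectors

T = Xc @ P.T                             # scores; equivalently T.T = P @ Xc.T
score_cov = np.cov(T, rowvar=False)      # should be diagonal
```

Here `score_cov` comes out diagonal (up to rounding), with the eigenvalues on its diagonal in decreasing order, which is what "uncorrelated and arranged in order of decreasing variance" means for the columns of `T`.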

Principal component regression

To regress the response vector $y$ on the design matrix $X$ using PCR, first find the principal components of $X$ using PCA. Then, using the first $k$ principal components $P_{k}$, perform ordinary least squares of $y$ on the projected data $P_{k}X$.

Algorithm overview

Using ordinary least squares, solve $Y=PXB$ for $B$, the matrix of coefficients. In the sense that regression operates on $X$, you need only the principal components (i.e. eigenvectors) and the design matrix ($X$); of course, $PX=T$ is the score matrix.

Now suppose you evaluate your $B$ on new data $X'$. You still need the principal component matrix $P$ (up to $k$ components), but do not need $T$. Thus, PCR uses the PC matrix but not the scores matrix in the general case.
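A minimal sketch of this point, under assumptions that go beyond the answer: the data are synthetic, rows are samples, both `X` and `y` are centered with training means, and `fit_pcr` / `predict_pcr` are hypothetical helper names invented for this example. The fitted model stores only the means, the first $k$ eigenvectors and the coefficients; the training scores are computed during fitting and then discarded.

```python
import numpy as np

def fit_pcr(X, y, k):
    """Fit PCR; return only what prediction needs (no score matrix)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P_k = Vt[:k]                           # (k, n_vars): first k eigenvectors
    T_k = Xc @ P_k.T                       # training scores, used only to fit
    B, *_ = np.linalg.lstsq(T_k, y - y_mean, rcond=None)
    return {"x_mean": x_mean, "y_mean": y_mean, "P_k": P_k, "B": B}

def predict_pcr(model, X_new):
    """Project new rows with the stored eigenvectors, then apply B."""
    new_scores = (X_new - model["x_mean"]) @ model["P_k"].T
    return model["y_mean"] + new_scores @ model["B"]

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 6))
y_train = X_train[:, 0] - 2.0 * X_train[:, 1] + 0.1 * rng.normal(size=100)
X_new = rng.normal(size=(10, 6))

model = fit_pcr(X_train, y_train, k=3)
y_pred = predict_pcr(model, X_new)
```

Note that scores for the new rows are still formed inside `predict_pcr`; what is not needed at prediction time is the training score matrix, which matches the distinction about the word "uses" discussed in the comments.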

Answer sourced from A Simple Explanation of Partial Least Squares, by Kee Siong Ng (2013).

Thanks to @amoeba for help clarifying this answer.

chl
tbenst
  • -1 for the first sentence: "The other answers bafflingly confuse the scores matrix and the principal components matrix." – amoeba Apr 28 '20 at 08:17
  • More importantly -- terminology aside, if my X matrix has 1000 samples and 10 variables, and I want to use two PCs for the regression, do you agree that this will be a matrix with 1000 samples and 2 variables? Do you agree this is called the "scores" matrix? I don't understand what you mean when you say that PCR uses "not the score matrix". – amoeba Apr 28 '20 at 08:19
  • @amoeba my understanding is that for PCR, first do PCA to find T,P such that T=PX. Using ordinary least squares, solve Y=PXB, where B is the matrix of coefficients. So in the sense that in regression we do operations on X, you just need the principal components (i.e. eigenvectors) and design matrix (X) but obviously PX=T is the score matrix. But I stand by my citations from the original PCA paper, the book referenced by OP, and the link in my answer, that all use "principal component" to refer to unit eigenvectors. – tbenst Apr 28 '20 at 21:08
  • I don't see how what you described in the last comment can be summarized as "PCR uses the principal components matrix but not the score matrix" (quote from your answer). As you said, we are solving Y=TB. This uses T which is the score matrix. – amoeba Apr 28 '20 at 22:12
  • Suppose you evaluate your B on new data X'. You still need the principal component matrix P (up to k components), but do not need T. Thus, PCR uses the PC matrix but not the scores matrix in the general case. – tbenst Apr 30 '20 at 21:59
  • I see what you are saying. I think we are in full agreement as far as the math is concerned, and the disagreement is only about the meaning of the word "uses" in this particular case. – amoeba Apr 30 '20 at 23:14
  • Thanks, agreed! sorry for any bikeshedding & thanks for pointing out the terminology confusion – tbenst Apr 30 '20 at 23:59
  • I'd be happy to remove the downvote, if you could make the 1st sentence of the answer sound less aggressive, and maybe clarify what exactly the 2nd one means. – amoeba May 01 '20 at 14:26
  • Thanks for your help clarifying this answer – tbenst May 08 '20 at 23:56
  • Thank you! Now I upvoted :) – amoeba May 09 '20 at 16:45