
I am interested in evaluating estimated principal components: the components are estimated from a sample, and I want to assess how good those estimates are. Are there any commonly used approaches for this? I have been trying to find literature on this issue, with little success.

I was thinking of simulating this using some sort of $L_2$-norm, maybe $L_2=\|\hat{y}-y\|^2=\|\hat{\Pi}x-\Pi x\|^2$, where $\hat{y}$ are my estimated PCs ($\hat{\Pi}$ is my matrix of estimated PC coefficients) and $y$ are the "true" PCs ($\Pi$ is the matrix of "true" PC coefficients). I thought I had better check whether there are standard ways of doing this first, though.

Syltherien

2 Answers


In the approach you're considering, you won't have the 'true' PCs unless you're using a synthetic dataset where ground truth is already known.

Reconstruction error is one way to measure performance. In fact, one way to think about PCA is that it minimizes this quantity on the training set. Use PCA to project the points into the low dimensional space. Then, reconstruct the original points by projecting the low dimensional representations back into the original, high dimensional space. The distance between the original points and their reconstructions is inversely related to how well the model captures the structure of the data. This is related to the view of PCA as lossy data compression. When the low dimensional representation retains more information, the original points can be reconstructed more accurately. Reconstruction error can also be used to compute the commonly used performance measure $R^2$ (fraction of variance accounted for).
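To make this concrete, here is a minimal sketch of the project-then-reconstruct idea using scikit-learn's PCA; the dataset and the choice of 3 components are placeholder assumptions, not anything from the question:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # placeholder data

pca = PCA(n_components=3).fit(X)    # fit the mean and weights
Y = pca.transform(X)                # project into the low-dimensional space
X_hat = pca.inverse_transform(Y)    # reconstruct in the original space

# Mean squared reconstruction error, computed here on the training data
# itself (which, as the next paragraph explains, is an optimistic estimate).
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse)
```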

Cross validation should be used, rather than measuring reconstruction error for the training data. This is because using the training data to both fit the model and measure performance would give an overoptimistic, biased estimate. Instead, cross validation tries to estimate performance on future, unseen data drawn from the same distribution as the training data. Using cross validation, you'd split the data into training and test sets, train the PCA model on each training set, then use it to compute the reconstruction error for the corresponding test set.

The purpose of this quantity is to give a performance metric. But, it's not appropriate to select the number of components by trying to minimize it, because the error decreases with the number of components. This is to be expected, because more information is retained. If the goal is to select the number of components, there are a number of methods to use instead. @amoeba's post here describes how to do this using cross validation.

Computing reconstruction error for the test set

For a particular cross validation fold, say the training set is $X = \{x_1, \dots, x_n\}$, the test set is $X' = \{x'_1, \dots, x'_m\}$, and each point is represented as a column vector. Use the training set to fit the PCA model, consisting of the mean $\mu$, the weights $W$, and the number $k$ of components to retain. If you're using an algorithm to choose $k$, it may differ across training sets (which is ok; what's being tested here is the entire model fitting procedure). If you're also using cross validation to choose $k$, you'd have to split the training set into further training/validation sets (i.e. perform nested cross validation).

To obtain the low dimensional representation $y'$ of a test point $x'$, you'd first subtract the mean, then multiply by the weights:

$$y' = W (x' - \mu)$$

To reconstruct the original test point from the low dimensional representation, you'd multiply the low dimensional representation by the transposed weights, then add the mean back again:

$$\hat{x}' = W^T y' + \mu$$

The low dimensional representation discards information, so the reconstruction won't be perfect. Its error can be measured by the squared Euclidean distance between the original test point and the reconstruction:

$$\|x' - \hat{x}'\|^2$$

Therefore, the mean squared reconstruction error for the entire test set is:

$$L(W, \mu, X') = \frac{1}{m} \sum_{i=1}^m \left \|x_i' - \left ( W^T W (x'_i - \mu) + \mu \right ) \right \|^2$$

Repeat this procedure for all cross validation folds. Then take the mean reconstruction error across test sets, weighted by the number of points in each test set.
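Putting the pieces together, here is one way the whole procedure might look in numpy, with scikit-learn's KFold supplying the splits. The data, the number of folds, and the fixed choice of $k$ are placeholder assumptions, and `fit_pca` / `reconstruction_error` are hypothetical helper names:

```python
import numpy as np
from sklearn.model_selection import KFold

def fit_pca(X_train, k):
    """Return the mean and the k x d weight matrix W (rows = components)."""
    mu = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k]

def reconstruction_error(W, mu, X_test):
    """Mean squared reconstruction error L(W, mu, X')."""
    Y = (X_test - mu) @ W.T          # y' = W (x' - mu), applied row-wise
    X_hat = Y @ W + mu               # x_hat' = W^T y' + mu, applied row-wise
    return np.mean(np.sum((X_test - X_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # placeholder data
k = 3                                # fixed here for simplicity

errors, sizes = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    mu, W = fit_pca(X[train_idx], k)
    errors.append(reconstruction_error(W, mu, X[test_idx]))
    sizes.append(len(test_idx))

# Mean across folds, weighted by the number of points in each test set.
cv_error = np.average(errors, weights=sizes)
print(cv_error)
```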

Example plot

For geometric intuition, here's an example with 2 dimensions and 1 principal component. The grey points are the training data. The blue points are test data, and the orange points are their reconstructions (in practice, there would be many more test points than this). The mean squared reconstruction error is the average squared length of the dashed lines.

[Figure: grey training points, blue test points, and orange reconstructions, joined by dashed lines]

Fraction of variance accounted for

$R^2$ (the fraction of variance accounted for) is another way to measure performance. Test set $R^2$ can be computed from the test set reconstruction error:

$$R^2(W, \mu, X') = 1 - \frac{ L(W, \mu, X') }{ \frac{1}{m} \sum_{i=1}^m \|x'_i - \langle X' \rangle \|^2 }$$

where $\langle X' \rangle$ is the mean of the test set, so the denominator measures the mean squared distance from the test points to their mean (which is equal to the sum of the variances along each dimension).
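Continuing the sketch above (reusing its `reconstruction_error` and a fitted `mu`, `W`), test set $R^2$ might be computed like this:

```python
def r_squared(W, mu, X_test):
    """Test-set R^2 from the test-set reconstruction error. The denominator
    is the mean squared distance from the test points to their own mean."""
    L = reconstruction_error(W, mu, X_test)
    total = np.mean(np.sum((X_test - X_test.mean(axis=0)) ** 2, axis=1))
    return 1.0 - L / total
```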

user20160
  • Just to note, it would be possible to use *all* principal components rather than just the first few. It depends on what the user is actually interested in. – Richard Hardy Apr 04 '17 at 16:35
  • Good point. Reconstruction error wouldn't be useful in that situation. – user20160 Apr 04 '17 at 19:36
  • Why not? I think it could. – Richard Hardy Apr 04 '17 at 19:37
  • If the number of PCs is equal to the number of dimensions, the weight vectors would span the full space. In that case, the projection step would just amount to a rotation, so no information would be lost and the reconstruction error would always be zero. Or am I missing something? – user20160 Apr 04 '17 at 20:12
  • In sample, yes, but not out of sample. So cross validation will work anyway (where the training part would be the "in sample" and the validation part the "out of sample"). – Richard Hardy Apr 05 '17 at 06:00
  • If #PCs = #dimensions, then $W$ is square and orthogonal, so $W^T W = I$. Then $x - (W^T W(x - \mu) + \mu) = x - (I(x - \mu) + \mu) = x - x = 0$ for all $x$ (even if drawn from a completely different distribution), no? – user20160 Apr 05 '17 at 20:08
  • user20160 is right that if you use all of the principal components, there is no reconstruction error. – Neil G Apr 06 '17 at 01:18
  • It turns out however that you can quantify how good a set of principal components is even if the components span the entire subspace. – Neil G Apr 06 '17 at 01:40
  • @user20160, once again, there is a difference between in sample and out of sample. If you estimate the loadings in sample and then apply them out of sample, you will see that the reconstruction is not perfect because the optimal loadings for out of sample will generally differ from those estimated in sample. – Richard Hardy Apr 06 '17 at 05:09
  • **The procedure described in this answer will not work correctly.** See my answer here http://stats.stackexchange.com/questions/93845 for an explanation why. CC to @RichardHardy. – amoeba Apr 06 '17 at 06:36
  • @amoeba, if the main problem is the use of in-sample data to assess in-sample fit, then I agree with you. If the main problem is more subtle, then I would have to dig into the long (although neat) post you linked. I am also aware of Neil G's answer that makes a good point. – Richard Hardy Apr 06 '17 at 07:10
  • @RichardHardy The main problem is more subtle. One *can* use cross-validation for PCA, but out-of-sample error has to be computed differently, not like user20160 describes. It's enough to read the beginning of my linked post to see why. – amoeba Apr 06 '17 at 07:37
  • @RichardHardy Any basis that spans the entire space will have zero reconstruction error on any sample. It doesn't matter at all whether it was part of your training data or not. – Neil G Apr 06 '17 at 08:36
  • @NeilG, OK, maybe I am not getting how PCA works, but isn't it true that if you take two samples from the same population and obtain PCs from both of them, the loadings will differ? And thus if you apply loadings from sample 1 to construct PCs from sample 2, you will not be able to perfectly reconstruct sample 2? – Richard Hardy Apr 06 '17 at 08:38
  • @RichardHardy Imagine a cloud of training data that is roughly ellipsoid. PCA returns, in order, the directions of greatest variation. So, it will return the major axis of the ellipsoid first, then the minor axes. The matrix $W$ rotates these directions so that they end up on the axes. Typically $W$ is rectangular because we omit the directions of lowest variation, but if we keep them all, then when you apply $W^T$, you end up back where you started. The same is true for any new data point (outside the training examples): $W$ rotates them to the new basis and $W^T$ rotates them back. – Neil G Apr 06 '17 at 08:46
  • @Neil's point here is related to my linked answer: "Any basis that spans the entire space will have zero reconstruction error on any sample", which already by itself shows that the cross-validation procedure described by user20160 is invalid. There are, however, other ways to set up cross-validation that *are* valid. See my linked answer. – amoeba Apr 06 '17 at 08:53
  • user20160, I was wrong, and @NeilG and amoeba were right. Thanks guys for educating me. – Richard Hardy Apr 06 '17 at 08:57
  • @amoeba That's a great post! I'm aware the reconstruction error I described decreases with the number of PCs (lossy compression *should* behave this way). My intention wasn't to suggest that one should choose the number of PCs by minimizing it (which, as you point out, won't work). Rather, it's to estimate the reconstruction error on future data, as a performance metric rather than a model selection criterion. It seemed to be one way to answer the question "how good are my PCs?". In retrospect, I think it would have been better to first ask for clarification (good for what?) – user20160 Apr 06 '17 at 09:41
  • With this caveat - okay, fine (but maybe you could edit your answer to clarify this behaviour?). But I am not sure what value a "performance metric" can have if it cannot be used as a model selection criterion... To put it another way, why would one want to compute R^2 on the test data using this method, as opposed to simply computing R^2 on the training data (which is a standard "explained variance" in PCA)? I don't see what one gains. – amoeba Apr 06 '17 at 09:44
  • The point of CV would be to give a less biased estimate of the error. Test set reconstruction error can be converted to test set $R^2$. Yes, edit needed to clarify – user20160 Apr 06 '17 at 10:05

The issue with the reconstruction error given in user20160's answer is that any set of principal components that spans the same subspace will have the same reconstruction error. If your goal is reconstruction, then that's fine.

But if you want any smaller subset of your principal components to be as good as possible for reconstruction, which is what PCA also promises, then that solution doesn't help you. In fact, you can quantify how good a set of principal components is even if the components span the entire space.

Here's how. Set everything up the same way as in user20160's answer. However, let's centre everything so that we can forget about the means. Let $S$ be the variance of the data:

\begin{align} S &\triangleq Var(x) \end{align}

I will also use the same notation as user20160 except my $W$ is going to match the cited paper below:

\begin{align} W &\triangleq \Gamma^{-\frac12}U^T \end{align}

My $U$ is user20160's $W$. It is the orthogonal matrix that rotates the directions of greatest variation onto the principal axes; these are the eigenvectors of the variance matrix $S$. $\Gamma$ is the diagonal matrix of variances in each direction; its diagonal entries are the eigenvalues of $S$. With this definition, $W$ is a whitening transformation, and we want to know how good $W$ is.

Let the variance matrix defined by the test data be \begin{align} S' \triangleq Var(x') \end{align}

Define the cross-covariance matrix \begin{align} \Phi &\triangleq Cov(y', x') \\ &= Cov(Wx', x') \\ &= WS' \end{align}

Define the sum of squared cross-covariances: \begin{align} \phi_i &\triangleq \sum_{j=1}^d \Phi_{ij}^2 \\ &= \sum_{j=1}^d Cov(y'_i, x'_j)^2. \end{align}

Then, it has been shown† that PCA is the unique whitening transformation that maximizes the $\phi_i$, subject to the constraint that they are in decreasing order. On the training data, these sums are maximized by construction. On the testing data, the sums of squared cross-covariances quantify how good your components are.

† A. Kessy, A. Lewin, and K. Strimmer, “Optimal whitening and decorrelation,” The American Statistician, 2017 (arXiv: https://arxiv.org/abs/1512.00809)
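For illustration, here is a sketch of this criterion in numpy, under the setup above: `pca_whitening` and `cross_covariance_scores` are hypothetical helper names, the data are placeholders, and note that $\Gamma^{-1/2}$ requires strictly positive eigenvalues, so degenerate directions would need separate handling:

```python
import numpy as np

def pca_whitening(X_train):
    """W = Gamma^{-1/2} U^T from the eigendecomposition of the training covariance."""
    S = np.cov(X_train, rowvar=False)
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals, U = eigvals[order], U[:, order]
    return np.diag(eigvals ** -0.5) @ U.T

def cross_covariance_scores(W, X_test):
    """phi_i = sum_j Phi_ij^2, with Phi = W S' and S' the test covariance."""
    S_prime = np.cov(X_test, rowvar=False)
    Phi = W @ S_prime
    return np.sum(Phi ** 2, axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))          # placeholder training data
X_test = rng.normal(size=(100, 5))           # placeholder test data

W = pca_whitening(X_train)
print(cross_covariance_scores(W, X_test))    # larger phi_i = better components
```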

Neil G
  • This is really interesting. I don't upvote yet, because I did not really get it. Will read the paper and come back here. Thanks for posting this answer. – amoeba Apr 06 '17 at 08:58
  • @amoeba I found the paper's really intuitive. I'm reading your answer now. I agree with your point about cross-validation using the target being inappropriate, but I haven't quite understood your answer yet. – Neil G Apr 06 '17 at 08:59
  • That answer of mine really needs some work, I need to explain it better. – amoeba Apr 06 '17 at 09:01
  • @amoeba If you feel like polishing it, that would be great for me since this is a topic I'm interested in learning more about. – Neil G Apr 06 '17 at 09:05
  • I have skimmed through the Kessy et al. paper; it looks like a good discussion of PCA/ZCA whitening (very relevant for my answer in http://stats.stackexchange.com/questions/117427 - I guess I should edit to reference it there). However, I don't see there any discussion of using the sum of cross-covariances as a model selection criterion. Are you sure it works? Have you tried it yourself, or seen implemented somewhere? E.g. if one wants to select the optimal number $k$ of PCs using cross-validation (as in my answer in other thread), can one use this approach? – amoeba Apr 06 '17 at 11:06
  • @amoeba The cross-covariances are shown to be maximized by PCA whitening in 6.3. I haven't tried it, but it seems easy to test out. What does it mean to select the optimal number of principal components? The question I tried to answer is "how good are my principal components at coinciding with directions of variation on test data?" – Neil G Apr 06 '17 at 12:10