15

In his book Pattern Recognition and Machine Learning, Christopher Bishop gives a proof that each successive principal component maximizes the variance of a one-dimensional projection, after the data have been projected onto the space orthogonal to the previously selected components. Others give similar proofs.

However, this only proves that each successive component is the best one-dimensional projection in terms of maximizing variance. Why does this imply that the variance of a projection onto, say, 5 dimensions is maximized by choosing the first 5 such components?

amoeba
michal
  • 1
    Could you please tell us exactly what would be meant by the "variance" of the five-dimensional dataset that results from a projection of a dataset into five dimensions? (In order for such a quantity to be subject to maximization it would have to be a *single* number.) – whuber Jun 09 '14 at 13:27
  • 4
    Very good point. Chris Bishop in his book refers to maximizing the variance of a projection, and it is not very clear what that would mean for more than 1 dimension. I would like to learn in what sense the variance is maximized and why such a procedure maximizes it jointly. – michal Jun 10 '14 at 10:11

2 Answers

14

What is understood by variance in several dimensions ("total variance") is simply the sum of the variances in each dimension. Mathematically, it is the trace of the covariance matrix: the trace is simply the sum of all diagonal elements. This definition has various nice properties; e.g., the trace is invariant under orthogonal linear transformations, which means that if you rotate your coordinate axes, the total variance stays the same.
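
As a quick numerical illustration of this invariance, here is a minimal Python/NumPy sketch of my own (not part of the original answer; the toy data and the QR-based random rotation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * np.array([3.0, 2.0, 1.0])  # toy data, 3 features

C = np.cov(X, rowvar=False)
print(np.trace(C))                      # total variance = sum of the 3 feature variances

# Rotate the coordinate axes with a random orthogonal matrix (QR of a random matrix)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(np.trace(np.cov(X @ Q, rowvar=False)))  # same trace, up to floating-point error
```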

What is proved in Bishop's book (section 12.1.1) is that the leading eigenvector of the covariance matrix gives the direction of maximal variance. The second eigenvector gives the direction of maximal variance under the additional constraint that it be orthogonal to the first eigenvector, etc. (I believe this constitutes Exercise 12.1.) If the goal is to maximize the total variance in a 2D subspace, then this procedure is a greedy maximization: first choose one axis that maximizes the variance, then another.

Your question is: why does this greedy procedure obtain a global maximum?

Here is a nice argument that @whuber suggested in the comments. Let us first align the coordinate system with the PCA axes. The covariance matrix becomes diagonal: $\boldsymbol{\Sigma} = \mathrm{diag}(\lambda_i)$. For simplicity we will consider the same 2D case, i.e. what is the plane with maximal total variance? We want to prove that it is the plane given by the first two basis vectors (with total variance $\lambda_1+\lambda_2$).

Consider a plane spanned by two orthonormal vectors $\mathbf{u}$ and $\mathbf{v}$. The total variance in this plane is $$\mathbf{u}^\top\boldsymbol{\Sigma}\mathbf{u} + \mathbf{v}^\top\boldsymbol{\Sigma}\mathbf{v} = \sum \lambda_i u_i^2 + \sum \lambda_i v_i^2 = \sum \lambda_i (u_i^2+v_i^2).$$ So it is a linear combination of the eigenvalues $\lambda_i$ with coefficients that are non-negative, do not exceed $1$ (see below), and sum to $2$. Under these constraints, the maximum is reached by putting as much weight as possible on the largest eigenvalues, i.e. by setting the coefficients of $\lambda_1$ and $\lambda_2$ to $1$ and all others to $0$, which yields $\lambda_1 + \lambda_2$.

It is only left to show that the coefficients cannot exceed $1$. Notice that $u_k^2+v_k^2 = (\mathbf{u}\cdot\mathbf{k})^2+(\mathbf{v}\cdot\mathbf{k})^2$, where $\mathbf{k}$ is the $k$-th basis vector. This quantity is the squared length of the projection of $\mathbf k$ onto the plane spanned by $\mathbf u$ and $\mathbf v$. Therefore it cannot exceed the squared length of $\mathbf k$ itself, which is $|\mathbf{k}|^2=1$, QED.
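
For readers who like to see this numerically, here is a small Monte Carlo sketch of my own (Python/NumPy, not from the original answer; the eigenvalues are an arbitrary example): it draws random orthonormal pairs $\mathbf u, \mathbf v$ and checks that the total variance in their plane never exceeds $\lambda_1+\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([5.0, 3.0, 2.0, 1.0])   # eigenvalues lambda_i, in decreasing order
Sigma = np.diag(lam)                   # covariance matrix in the PCA-aligned basis

plane_vars = []
for _ in range(20_000):
    # random orthonormal pair (u, v): QR decomposition of a random 4x2 matrix
    Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))
    u, v = Q[:, 0], Q[:, 1]
    plane_vars.append(u @ Sigma @ u + v @ Sigma @ v)

# The maximum over random planes approaches, but never exceeds, lambda_1 + lambda_2 = 8
print(max(plane_vars))
```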

See also @cardinal's answer to What is the objective function of PCA? (it follows the same logic).

amoeba
  • 1
    (+1) But is it not intuitively obvious that given a collection of wallets of various amounts of cash (modeling the non-negative eigenvalues), and a fixed number $k$ that you can pick, that selecting the $k$ richest wallets will maximize your total cash? The proof that this intuition is correct is almost trivial: if you haven't taken the $k$ largest, then you can improve your sum by exchanging the smallest one you took for a larger amount. – whuber Jun 11 '14 at 17:23
  • @amoeba: if the goal is to maximize the sum of the variances and not variance of the sum, there is no reason for the second projection to be orthogonal to the first. – Innuo Jun 11 '14 at 19:32
  • @whuber: I am not sure I follow your argument. Of course what you say about wallets *is* obvious, yes. But axes/variables are not wallets because we can take linear combinations of them. Let me reiterate: 1st eigenvector gives a direction of max variance; 2nd eigenvector gives a direction of max variance being orthogonal to the 1st; 1st and 2nd together define a plane with total variance equal to the sum of the first two eigenvalues; but maybe one can choose *another* plane with larger total variance? That is the OP's question, and I still do not think it is *obvious* that one cannot. – amoeba Jun 11 '14 at 22:22
  • 1
    I apologize--I had thought you had already developed the analysis to the point of recognizing that the total variance in a $k$-dimensional subspace is a non-negative linear combination of the eigenvalues, in which none of the coefficients can exceed $1$ and the total of the coefficients equals $k$. (That's a matter of a simple matrix multiplication--Lagrange multipliers aren't needed.) That then brings us to the wallets metaphor. I agree that some such analysis has to be done. – whuber Jun 11 '14 at 22:47
  • @Innuo: you are right, I was not precise enough. The goal is to maximize the total variance in the plane. The total variance is given by the trace of the covariance matrix (which is a sum of variances), but one needs an orthonormal basis in this plane for that. If the two chosen axes are not orthogonal, then the total variance in the plane spanned by these two axes is generally not given by the sum of their individual variances (as I am sure you are well aware). – amoeba Jun 11 '14 at 22:47
  • @whuber: Ah, now I see! Thank you, this is a nice argument. I updated my answer to include [an elaborated version of] it. – amoeba Jun 12 '14 at 13:13
  • @amoeba: Nice answer. I don't understand the last bit though. You are proving that the sum of coefficients from the 2 vectors is not larger than 1. You are proving this for the case when u and v are the first and second basis vectors. Why is this enough? Shouldn't you consider any pair of orthogonal vectors? – michal Jul 02 '14 at 08:37
  • 1
    @amoeba: I mean we are considering the problem in the basis consisting of eigenvectors (this is the basis of u and v if we calculate their variance by multiplying by the diagonal covariance matrix). u and v will turn out in the end to be them, but at the stage of this proof we shouldn't assume this, I think. Shouldn't the argument rather be that if at any point the sum was larger than 1, then the 2 vectors would not be orthogonal anymore, since the basis is orthogonal and each of the vectors brings at most 1? But then again, why do we restrict ourselves to orthogonal vectors u and v? – michal Jul 02 '14 at 08:53
  • @user123675: Yes, it is a bit of a tricky point in the proof. (1) U and V are *any* two orthogonal unit vectors, they can be chosen arbitrarily. (2) Then I use the fact that its coefficients are given by scalar products with basis vectors (see my answer for the formula). (3) Scalar products are independent of the choice of basis, so now I choose the basis such as to make U and V first two basis vectors; this can always be done. (4) After this change of basis, the expression with scalar products simplifies a lot, and becomes obviously less than 1. – amoeba Jul 02 '14 at 14:25
  • @user123675: (cont.) The argument you briefly sketched in your comment above is basically the same argument. Regarding your second question, why do we restrict ourselves to orthogonal U and V. This is because we are looking for a *plane* with maximal total variance. But any plane can be specified by two orthogonal vectors, so one can always choose such two vectors. – amoeba Jul 02 '14 at 14:27
  • Please forgive me asking silly questions, but: is the fact that we are restricting ourselves to orthogonal vectors v1 and v2 coming from assumptions or from the proof? We could span the plane not using the orthogonal vectors, and then we could get higher variance. For example, then we could take 1*v1 and a1*v1 + a2*v2. The same plane is spanned as when v1 and v2 are taken, but the joint variance is higher. This is wrong, but why? – michal Jul 10 '14 at 20:34
  • @user123675: we are looking for a plane with maximal *total variance*. Total variance is defined as a trace of covariance matrix, i.e. sum of variances along basis vectors (in this plane) U and V. U and V have to be orthogonal, it is just a definition of total variance. This definition makes sense because you can choose any pair of orthogonal vectors spanning the plane and the total variance will stay the same (because trace is invariant under rotations). If U and V are not orthogonal (as in your example), sum of variances along them can indeed be higher than the total variance of the plane. – amoeba Jul 10 '14 at 21:30
  • @amoeba Great answer, though I have the same confusion about why we can assume $\mathbf{u}$ and $\mathbf{v}$ to be the first and second basis in the last bit. On the contrary, $\mathbf{u}$ and $\mathbf{v}$ can be any two bases and the proof still works. Indeed, let $\mathbf{u}$ and $\mathbf{v}$ be the 2nd and 5th bases, then $u_k^2 + v_k^2 = k_2 + k_5 < |\mathbf{k}|^2 = 1$. – Heisenberg Mar 10 '15 at 01:10
  • @Heisenberg: Yes, they can be any two basis vectors and the proof still works; however, one certainly *can* let them be the first and the second one. Why not? It's just two orthogonal vectors. I can choose them as the first two basis vectors, and fill the rest of the basis arbitrarily. – amoeba Mar 10 '15 at 01:16
  • @amoeba When you say let them be the first two basis vectors, do you mean let them be the eigenvectors associated with the 1st and 2nd largest eigenvalues? If you are not, then this is how I and OP misunderstood you. If you are, then we're confused because the proof starts with any two orthogonal $\mathbf{u}$ and $\mathbf{v}$, then ends up with the conclusion that they are the first 2 eigenvectors. We shouldn't use the conclusion itself in the middle of the proof. – Heisenberg Mar 10 '15 at 01:26
  • 1
    @Heisenberg: Ah, I see! No, of course I did not mean that! But I see now why it was confusing. I rewrote this last bit of the proof to get rid of this "choosing a basis" step. Please see my edit. Thank you. – amoeba Mar 10 '15 at 10:16
  • @amoeba is there a way to personally contact you ? – ombk Apr 25 '21 at 09:29
2

If you had $N$ uncorrelated random variables sorted in descending order of their variance and were asked to choose $k$ of them such that the variance of their sum is maximized, would you agree that the greedy approach of picking the first $k$ would accomplish that?

The data projected onto the eigenvectors of its covariance matrix consists essentially of $N$ uncorrelated columns whose variances equal the respective eigenvalues.
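
To make this concrete, here is a short sketch of my own (Python/NumPy, not part of the original answer; the covariance matrix below is an arbitrary example): projecting centered data onto the eigenvectors of its covariance matrix yields uncorrelated columns whose variances are the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=np.zeros(3),
                            cov=[[4, 1, 0], [1, 3, 1], [0, 1, 2]],
                            size=50_000)

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)        # eigh is for symmetric matrices
Y = (X - X.mean(axis=0)) @ eigvecs          # project onto the eigenvectors

# The covariance of the projected data is diagonal (columns are uncorrelated),
# with the eigenvalues of C on the diagonal.
print(np.round(np.cov(Y, rowvar=False), 6))
print(np.round(eigvals, 6))
```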

For the intuition to be clearer, we need to relate variance maximization to computing the eigenvector of the covariance matrix with the largest eigenvalue, and relate orthogonal projection to removing correlations.

The second relation is clear to me because the correlation coefficient between two (zero mean) vectors is proportional to their inner product.

The relation between maximizing variance and the eigen-decomposition of the covariance matrix is as follows.

Assume that $D$ is the data matrix (with $n$ rows) after centering the columns. We need to find the direction of maximum variance. For any unit vector $v$, the variance after projecting along $v$ is

$\frac{1}{n}(Dv)^t (Dv) = v^t \left(\frac{1}{n}D^t D\right) v = v^t Cov(D) v,$

which is maximized when $v$ is the eigenvector of $Cov(D)$ corresponding to the largest eigenvalue.
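
As a sanity check of this last claim, here is a short sketch of my own (Python/NumPy, not from the original answer): compare $v^t Cov(D) v$ for the leading eigenvector against many random unit vectors $v$.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # toy data matrix
D = D - D.mean(axis=0)                                   # center the columns

C = np.cov(D, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
v_top = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue

def proj_var(v):
    """Variance of the data projected onto the unit vector v."""
    return v @ C @ v

dirs = rng.normal(size=(10_000, 4))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)      # random unit vectors

print(proj_var(v_top))                          # equals the largest eigenvalue
print(max(proj_var(v) for v in dirs))           # never exceeds it
```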

Innuo
  • The original question is rather: choose $k$ orthogonal linear combinations of them (as opposed to $k$ of them) such that the sum of their variances is maximized. Is it still obvious that the greedy approach of picking the first $k$ accomplishes that? – amoeba Jun 11 '14 at 16:17
  • Finding $N$ orthogonal linear combinations and then choosing the $k$ of them with the largest variance is what the process describes (loosely). My answer just claims that orthogonality is sufficient for the greedy process to achieve the goal of maximizing the total variance. – Innuo Jun 11 '14 at 16:26
  • I am not sure I follow the argument. How does the orthogonality matter? If you have $N$ variables and have to choose $k$ with highest total variance, you should pick $k$ with highest variance (irrespective of whether they are correlated or not). – amoeba Jun 11 '14 at 16:31
  • Ah, I understand the confusion. There was a typo in my answer. Fixed now. – Innuo Jun 11 '14 at 17:13
  • I think you might be on to something here, but the magical appearance of the *sum* needs explaining. What relevance does that have to PCA or even to spectral decompositions? – whuber Jun 11 '14 at 17:19