
The PCA algorithm can be formulated in terms of the correlation matrix (assume the data $X$ has already been normalized and we are only considering projection onto the first PC). The objective function can be written as:

$$ \max_w \; (Xw)^T(Xw) \quad \text{s.t.} \quad w^Tw = 1. $$

This is fine, and we can solve it using Lagrange multipliers, i.e., by rewriting it as:

$$ \max_w [(Xw)^T(Xw) - \lambda w^Tw], $$

which is equivalent to

$$ \max_w \frac{ (Xw)^T(Xw) }{w^Tw},$$

and hence (see here on Mathworld) seems to be equal to $$\max_w \sum_{i=1}^n \text{(distance from point $x_i$ to line $w$)}^2.$$

But this says to maximize the distances between the points and the line, and from what I have read here, this is incorrect -- it should be $\min$, not $\max$. Where is my error?

Or, can someone show me the link between maximizing the variance in the projected space and minimizing the distances between the points and the line?
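
For concreteness, here is a small numerical sketch of the two objectives I am trying to connect (it uses numpy and made-up Gaussian data, so it is only an illustration, not part of the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up centered 2-D data; any centered data set would do.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])
X = X - X.mean(axis=0)

# Scan unit vectors w = (cos t, sin t) and evaluate both objectives.
thetas = np.linspace(0.0, np.pi, 1000)
W = np.column_stack([np.cos(thetas), np.sin(thetas)])  # each row is a unit vector w

# Objective 1: variance of the projection, (Xw)^T (Xw).
proj_var = ((X @ W.T) ** 2).sum(axis=0)

# Objective 2: sum of squared distances from each point to the line spanned by w.
dist_sq = np.array([np.sum((X - np.outer(X @ w, w)) ** 2) for w in W])

w_max_var = W[np.argmax(proj_var)]   # direction maximizing the projected variance
w_min_dist = W[np.argmin(dist_sq)]   # direction minimizing the summed squared distances

# Numerically these two directions coincide (up to sign), which is the link I am after.
print(w_max_var, w_min_dist)
```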

Cam.Davidson.Pilon
  • I think the minimum distance is used to meet the criterion of orthogonality for the components. The points are projected onto PCs that are orthogonal to each other, but in each successive component the remaining variance is maximized. – Michael R. Chernick Jul 12 '12 at 15:42
  • Hint: What happens when you consider the *smallest* eigenvalue first, rather than the largest one? – whuber Jul 12 '12 at 15:45
  • @whuber The smallest eigenvalue probably corresponds to the PC that is the solution to the final objective function. But this PC does not maximize the original objective function. – Cam.Davidson.Pilon Jul 12 '12 at 17:06
  • I'm not sure what you mean by "final" and "original" objective function, Cam. PCA is not (conceptually) an optimization program. Its output is a set of principal directions, not just one. It is an (interesting) mathematical theorem that these directions can be found by solving a sequence of constrained quadratic programs, but that's not basic to the concepts or the practice of PCA. I am only suggesting that, by focusing on the smallest eigenvalue rather than on the largest one, you can reconcile the two ideas of (1) minimizing distances and (2) taking an optimization view of PCA. – whuber Jul 12 '12 at 18:32
  • @whuber I was referring to the first and last optimization objectives in my post (I shouldn't have written 'function'). Perhaps I should rephrase it outside of the PCA context (though that is my final application): can one connect the two optimization objectives, the first being to maximize variance and the second to minimize the distances to the line? – Cam.Davidson.Pilon Jul 12 '12 at 19:22
  • As you point out, these are not two different objectives: given the constraint, they are exactly the same thing. But that's not relevant to the points I am making. – whuber Jul 12 '12 at 22:21
  • I did not explicitly answer the question about where your mistake was. I think your last equation does not hold, and I am not sure where you saw it in the Mathworld page you referenced... – amoeba Feb 03 '15 at 10:14
  • That's okay - your answer was the non-mistake version of what I was trying to do. – Cam.Davidson.Pilon Feb 03 '15 at 13:22
  • Check this out, http://stats.stackexchange.com/a/140579/36601 – Thamme Gowda Sep 25 '16 at 19:37
  • @whuber your hint looks really interesting, but it doesn't get me there right now; it would be great if you turned it into a full answer! – Dahn Nov 26 '19 at 10:20

1 Answer


Let $\newcommand{\X}{\mathbf X}\X$ be a centered data matrix with $n$ observations in rows. Let $\newcommand{\S}{\boldsymbol \Sigma}\S=\X^\top\X/(n-1)$ be its covariance matrix. Let $\newcommand{\w}{\mathbf w}\w$ be a unit vector specifying an axis in the variable space. We want $\w$ to be the first principal axis.

According to the first approach, the first principal axis maximizes the variance of the projection $\X \w$ (the variance of the first principal component). This variance is given by $$\mathrm{Var}(\X\w)=\w^\top\X^\top \X \w/(n-1)=\w^\top\S\w.$$

According to the second approach, the first principal axis minimizes the reconstruction error between $\X$ and its reconstruction $\X\w\w^\top$, i.e. the sum of squared distances between the original points and their projections onto $\w$. The squared reconstruction error is given by \begin{align}\newcommand{\tr}{\mathrm{tr}} \|\X-\X\w\w^\top\|^2 &=\tr\left((\X-\X\w\w^\top)(\X-\X\w\w^\top)^\top\right) \\ &=\tr\left((\X-\X\w\w^\top)(\X^\top-\w\w^\top\X^\top)\right) \\ &=\tr(\X\X^\top)-2\tr(\X\w\w^\top\X^\top)+\tr(\X\w\w^\top\w\w^\top\X^\top) \\ &=\mathrm{const}-\tr(\X\w\w^\top\X^\top) \\ &=\mathrm{const}-\tr(\w^\top\X^\top\X\w) \\ &=\mathrm{const} - \mathrm{const} \cdot \w^\top \S \w. \end{align}

Notice the minus sign before the main term. Because of that, minimizing the reconstruction error amounts to maximizing $\w^\top \S \w$, which is the variance. So minimizing reconstruction error is equivalent to maximizing the variance; both formulations yield the same $\w$.
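
Here is a quick numerical check of this equivalence (a sketch using numpy with made-up data; the variable names are mine and not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up centered data matrix with n observations in rows.
n, p = 200, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

S = X.T @ X / (n - 1)                                  # covariance matrix

def variance(w):
    return w @ S @ w                                   # w^T S w

def recon_error(w):
    return np.sum((X - np.outer(X @ w, w)) ** 2)       # ||X - X w w^T||^2

# For any unit vector w: error = ||X||^2 - (n - 1) * w^T S w, i.e. the identity above.
w = rng.normal(size=p)
w /= np.linalg.norm(w)
print(np.isclose(recon_error(w), np.sum(X ** 2) - (n - 1) * variance(w)))  # True

# Hence the leading eigenvector of S maximizes the variance and minimizes the error.
eigvals, eigvecs = np.linalg.eigh(S)                   # eigenvalues in ascending order
w1 = eigvecs[:, -1]
print(np.isclose(variance(w1), eigvals[-1]))           # True
```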

amoeba
  • Something I noticed: isn't $w^T \Sigma w$ a convex function (with respect to $w$, as $\Sigma$ is PSD)? How come we try to maximize it? – Royi Dec 24 '16 at 10:15
  • @amoeba can you explain how you go from tr() to const in the last step? – alberto Jan 06 '17 at 11:35
  • @alberto What is inside the trace is a number (a 1x1 matrix); the trace of a number is the number itself, so the trace can be removed. The constant appears because $\Sigma$ is equal to $X^\top X/(n-1)$, so there is this $1/(n-1)$ factor. – amoeba Jan 06 '17 at 16:00
  • Does this calculation hold verbatim for general dimension by replacing the vector $w$ with a matrix $W$? The main point I'm not certain about is whether it is true that for a vector $x$, $xWW^T$ is the projection of $x$ seen in the original space? – Emolga Feb 18 '17 at 17:52
  • @Leullame The calculation will hold verbatim for $W$ if it is a matrix with orthonormal columns. You need $W^\top W = I$ to go from line #3 to line #4. If matrix $W$ has orthonormal columns, then indeed $xWW^\top$ will be a projection of $x$ onto the subspace spanned by the columns of $W$ (here $x$ is a row vector). See the short numerical sketch after these comments. – amoeba Feb 18 '17 at 18:51
  • Can you describe how this changes if the data is not centered? – elliotp Feb 13 '18 at 21:44
  • @elliotp If X is not centered then $w^\top X^\top X w /(n-1)$ is not the variance (because variance is by definition around the mean) and the whole thing breaks down. – amoeba Feb 13 '18 at 21:55
  • What about the constraint? Shouldn't you be minimizing the Lagrangian? – elliotp Feb 18 '18 at 15:49
  • @elliotp I am not deriving the solution in this answer. I am only proving that maximum variance direction coincides with minimum error direction. No need for Lagrangian here. – amoeba Feb 18 '18 at 16:37
  • @amoeba, I understand that to go from line #3 to line #4 you used $\mathbf{w}^\top \mathbf{w} = 1$. However, why is it safe to assume that the $\mathbf{w}$ which minimizes the reconstruction error satisfies $\mathbf{w}^\top \mathbf{w} = 1$? In the variance maximization derivation, you impose that constraint on the optimization problem after observing that otherwise the variance can be increased indefinitely by increasing the length of $\mathbf{w}$, but here you seem to take $\mathbf{w}^\top \mathbf{w} = 1$ for granted. – Daniel López Feb 22 '18 at 11:44
  • @DanielLópez Well, we are looking for a 1-dimensional subspace minimizing the reconstruction error. A 1-dimensional subspace can be defined by a unit-norm vector pointing in its direction, which is what $w$ is taken to be. It has unit norm by construction. – amoeba Feb 22 '18 at 11:54
  • @amoeba As you stated, X is a centered data matrix, so I assume that PCA applied to the original data matrix without centering would produce incorrect results. Is that true? If so, I would presumably be looking at faulty principal components? – Furqan Hashim Oct 17 '18 at 16:04
  • @Furqan PCA includes centering by definition. Without centering it is not PCA. – amoeba Oct 17 '18 at 16:15
  • I'm wondering why you use `n - 1` here? Is this because you are assuming the data is a sample and not the population? – CMCDragonkai Aug 27 '19 at 16:09
  • @amoeba Could you explain how you get $Var(\textbf{Xw}) = \textbf{w}^T\Sigma\textbf{w}$? As far as I can tell, on the left side of the equality you take variance (covariance?) of a vector, which should result in a matrix, but on the right side you have a scalar... – aaa Nov 11 '20 at 00:57
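
As a numerical illustration of the comment above about replacing the vector $w$ with a matrix $W$ that has orthonormal columns (a numpy sketch with made-up data; it is not part of the original thread):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up centered data with n observations in rows.
n, p, k = 200, 5, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)

# Take W as the top-k eigenvectors of S: a p x k matrix with orthonormal columns.
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, -k:]
print(np.allclose(W.T @ W, np.eye(k)))                 # W^T W = I

# For a row vector x, x W W^T is its orthogonal projection onto span(W):
x = X[0]
proj = x @ W @ W.T
print(np.allclose((x - proj) @ W, 0))                  # residual is orthogonal to span(W)

# The same trace calculation as in the answer then gives
# ||X - X W W^T||^2 = ||X||^2 - (n - 1) * tr(W^T S W).
err = np.sum((X - X @ W @ W.T) ** 2)
print(np.isclose(err, np.sum(X ** 2) - (n - 1) * np.trace(W.T @ S @ W)))  # True
```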