
Principal component analysis can use matrix decomposition, but that is just a tool to get there.

How would you find the principal components without the use of matrix algebra?

What is the objective function (goal), and what are the constraints?

amoeba
Neil McGuigan
  • Maybe I'm missing something, so please correct me if I'm wrong, but it should be possible (at least in principle) to construct what is done in PCA using matrices as a (complicated) linear programming problem, but I don't know how you'd state all of the constraints required. Also, I'm not sure that would be very simple to do compared to just using PCA. Why are you trying to avoid matrices? – Chris Simokat May 03 '11 at 20:45
  • @Chris I don't see how one can get to a linear programming problem. It wasn't my understanding either that matrices should be avoided in the *computation*. The question was what kind of problem is solved by PCA, and not the way it is done (by computing the SVD for instance). The solution by cardinal says that you find successive orthogonal directions of *maximal variance*. The solution I presented says that you find hyperplanes with minimal reconstruction error. – NRH May 03 '11 at 21:30
  • @chris I am hoping to find another way to view PCA, without the matrix algebra, in order to augment my understanding of it. – Neil McGuigan May 03 '11 at 21:52
  • @Chris, You have a quadratic objective function and an $\ell_2$ norm equality constraint. Alternatively, under the formulation in @NRH's answer, you have a matrix rank constraint. That's not going to beat itself down to a linear-programming problem. @NRH gives some good intuition, and, in fact, there is a very close connection between the two perspectives on PCA that have been given. Perhaps in collaboration with @NRH, we can add that to his/her post to make the full set of answers more complete. – cardinal May 03 '11 at 21:59
  • @cardinal Yes, either way leads to the same thing of course. I really think this is very well explained in The Elements, and since the book is freely available online for download, it is possible to study the details there. I just might add that the data vectors in @cardinal's solution are assumed centered. – NRH May 03 '11 at 22:25
  • @NRH, Actually, I like *ESL* a lot, but I think the treatment there of this topic is pretty superficial, as it is for many of the topics in the book. In particular, they don't prove (or even assign as an exercise) the important part of the solution for the optimization problem you give. – cardinal May 03 '11 at 22:32
  • @NRH, Also, whether or not you choose to center the features is a decision outside of the scope of PCA. There are statistical situations where doing one or the other makes the most sense. Though, I'd say, centering does win out usually. – cardinal May 03 '11 at 22:34
  • @cardinal, point taken on the possibility of choosing not to center, but without centering I would not call $\mathbf{S}$ the sample covariance. – NRH May 03 '11 at 22:50
  • @NRH: Yes, I agree. – cardinal May 04 '11 at 00:08
  • @NRH and @Cardinal I think both of your posts and comments here are very insightful, but based on the OP and the comment @Neil made in these posts it seems like he's looking for something more along the lines of a way to view this as a system of equations than in terms of linear algebra for this problem. Clearly it isn't a linear program, but I apparently had a brain fail and couldn't think of systems of equations yesterday. I think that's why he hasn't accepted either of your excellent answers yet. – Chris Simokat May 04 '11 at 16:05
  • +1. I took the liberty to change the title of your question. It was about finding PCs "without matrix algebra", but the accepted answer is all about matrix algebra! So I thought the title was very misleading. I changed it to focus on the objective function, which is what both (excellent) answers are specifically about. I hope you don't mind, @Neil. Let me know if you think it was an inappropriate edit. – amoeba Feb 07 '15 at 01:08

3 Answers


Without trying to give a full primer on PCA, from an optimization standpoint, the primary objective function is the Rayleigh quotient. The matrix that figures in the quotient is (some multiple of) the sample covariance matrix $$\newcommand{\m}[1]{\mathbf{#1}}\newcommand{\x}{\m{x}}\newcommand{\S}{\m{S}}\newcommand{\u}{\m{u}}\newcommand{\reals}{\mathbb{R}}\newcommand{\Q}{\m{Q}}\newcommand{\L}{\boldsymbol{\Lambda}} \S = \frac{1}{n} \sum_{i=1}^n \x_i \x_i^T = \m{X}^T \m{X} / n $$ where each $\x_i$ is a vector of $p$ features and $\m{X}$ is the matrix such that the $i$th row is $\x_i^T$.
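Here is a minimal numpy sketch of this setup (my own addition, not part of the original answer). It assumes the columns of `X` have already been centered, so that `S` really is the sample covariance matrix; all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5                       # n observations, p features
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)              # center the features (see the centering discussion in the comments)

# Sample covariance matrix S = X^T X / n, matching the formula above
S = X.T @ X / n
print(S.shape)                      # (p, p); S is symmetric positive semidefinite
```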

PCA seeks to solve a sequence of optimization problems. The first in the sequence is the unconstrained problem $$ \begin{array}{ll} \text{maximize} & \frac{\u^T \S \u}{\u^T\u} \;, \u \in \reals^p \> . \end{array} $$

Since the Rayleigh quotient is unchanged when $\u$ is rescaled (replacing $\u$ by $c\u$ with $c \neq 0$ leaves it invariant), we may fix the scale by requiring $\u^T \u = \|\u\|_2^2 = 1$. The unconstrained problem above is therefore equivalent to the constrained problem $$ \begin{array}{ll} \text{maximize} & \u^T \S \u \\ \text{subject to} & \u^T \u = 1 \>. \end{array} $$

Here is where the matrix algebra comes in. Since $\S$ is a symmetric positive semidefinite matrix (by construction!) it has an eigenvalue decomposition of the form $$ \S = \Q \L \Q^T \>, $$ where $\Q$ is an orthogonal matrix (so $\Q \Q^T = \m{I}$) and $\L$ is a diagonal matrix with nonnegative entries $\lambda_i$ such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$.

Hence, writing $\m{w} = \Q^T \u$, we have $\u^T \S \u = \u^T \Q \L \Q^T \u = \m{w}^T \L \m{w} = \sum_{i=1}^p \lambda_i w_i^2$. Since $\u$ is constrained in the problem to have unit norm, so is $\m{w}$, because $\|\m{w}\|_2 = \|\Q^T \u\|_2 = \|\u\|_2 = 1$ by virtue of $\Q$ being orthogonal.

But, if we want to maximize the quantity $\sum_{i=1}^p \lambda_i w_i^2$ under the constraints that $\sum_{i=1}^p w_i^2 = 1$, then the best we can do is to set $\m{w} = \m{e}_1$, that is, $w_1 = 1$ and $w_i = 0$ for $i > 1$.

Now, backing out the corresponding $\u$, which is what we sought in the first place, we get that $$ \u^\star = \Q \m{e}_1 = \m{q}_1 $$ where $\m{q}_1$ denotes the first column of $\Q$, i.e., the eigenvector corresponding to the largest eigenvalue of $\S$. The value of the objective function is then also easily seen to be $\lambda_1$.
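To make this concrete, the following hedged numerical check (my addition, with illustrative names) confirms that the Rayleigh quotient at random directions never exceeds $\lambda_1$ and that it equals $\lambda_1$ at the top eigenvector; note that `numpy.linalg.eigh` returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
S = X.T @ X / n

# Eigendecomposition S = Q Lambda Q^T; eigh returns eigenvalues in ascending order
eigvals, Q = np.linalg.eigh(S)
lam1, q1 = eigvals[-1], Q[:, -1]

def rayleigh(u):
    return (u @ S @ u) / (u @ u)

# No random direction beats the top eigenvector q1, which attains lambda_1
best_random = max(rayleigh(rng.normal(size=p)) for _ in range(10_000))
print(best_random <= lam1 + 1e-12, np.isclose(rayleigh(q1), lam1))
```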


The remaining principal component vectors are then found by solving the sequence (indexed by $i$) of optimization problems $$ \begin{array}{ll} \text{maximize} & \u_i^T \S \u_i \\ \text{subject to} & \u_i^T \u_i = 1 \\ & \u_i^T \u_j = 0 \quad \forall 1 \leq j < i\>. \end{array} $$ So, the problem is the same, except that we add the additional constraint that the solution must be orthogonal to all of the previous solutions in the sequence. It is not difficult to extend the argument above inductively to show that the solution of the $i$th problem is, indeed, $\m{q}_i$, the $i$th eigenvector of $\S$.
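One way to see this in code (my own sketch, assuming distinct eigenvalues so that the maximizers are unique up to sign): at step $i$, restrict the covariance matrix to the orthogonal complement of the directions already found, take the top eigenvector there, and compare the resulting directions with the eigenvectors computed directly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]
p = S.shape[0]

# Greedy sequence: at step i, maximize u^T S u over unit vectors orthogonal to u_1, ..., u_{i-1}.
# Restricting S to the orthogonal complement and taking its top eigenvector does exactly that.
U = []
for i in range(p):
    # Orthonormal basis of the complement of span(u_1, ..., u_{i-1})
    basis = np.eye(p) if not U else np.linalg.svd(np.column_stack(U))[0][:, len(U):]
    vals, vecs = np.linalg.eigh(basis.T @ S @ basis)
    U.append(basis @ vecs[:, -1])   # lift the maximizer back to R^p

U = np.column_stack(U)
eigvals, Q = np.linalg.eigh(S)
# The greedy directions agree with the eigenvectors of S (largest eigenvalue first), up to sign
print(np.allclose(np.abs(U.T @ Q[:, ::-1]), np.eye(p), atol=1e-8))
```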


The PCA solution is also often expressed in terms of the singular value decomposition of $\m{X}$. To see why, let $\m{X} = \m{U} \m{D} \m{V}^T$. Then $n \S = \m{X}^T \m{X} = \m{V} \m{D}^2 \m{V}^T$ and so $\m{V} = \m{Q}$ (strictly speaking, up to sign flips) and $\L = \m{D}^2 / n$.

The principal components are found by projecting $\m{X}$ onto the principal component vectors. From the SVD formulation just given, it's easy to see that $$ \m{X} \m{Q} = \m{X} \m{V} = \m{U} \m{D} \m{V}^T \m{V} = \m{U} \m{D} \> . $$
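These identities are easy to verify numerically; the following is my own sketch using the thin SVD from `numpy.linalg.svd`, allowing for the sign ambiguity noted above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
S = X.T @ X / n

# Thin SVD: X = U D V^T
U_svd, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

eigvals, Q = np.linalg.eigh(S)                                       # ascending order
print(np.allclose(np.sort(d**2 / n), eigvals))                       # Lambda = D^2 / n
print(np.allclose(np.abs(V.T @ Q[:, ::-1]), np.eye(p), atol=1e-8))   # V = Q up to sign flips
print(np.allclose(X @ V, U_svd * d))                                 # principal components: X V = U D
```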

The simplicity of representation of both the principal component vectors and the principal components themselves in terms of the SVD of the matrix of features is one reason the SVD features so prominently in some treatments of PCA.

NRH
cardinal
  • If only the first few singular values/vectors are needed, Nash and Shlien give [an algorithm](http://dx.doi.org/10.1093/comjnl/30.3.268) reminiscent of the usual power method for computing dominant eigenvalues. This may be of interest to the OP. – J. M. is not a statistician May 04 '11 at 03:44
  • @NRH, Thanks for catching (and correcting) my typos before I managed to see them! – cardinal May 06 '11 at 22:59
  • Hi @cardinal, thank you for your answer. But it seems that you didn't give the step proving why the sequential optimization leads to a global optimum. Could you please elaborate on that? Thanks! – Lifu Huang May 14 '17 at 15:52
  • Hi @NRH, in the second part of the proof, why do the remaining eigenvectors have to be orthogonal to all of the previous solutions? – Tengerye Mar 05 '20 at 02:01

The solution presented by cardinal focuses on the sample covariance matrix. Another starting point is the reconstruction error of the data by a $q$-dimensional hyperplane. If the $p$-dimensional data points are $x_1, \ldots, x_n$, the objective is to solve

$$\min_{\mu, \lambda_1,\ldots, \lambda_n, \mathbf{V}_q} \sum_{i=1}^n ||x_i - \mu - \mathbf{V}_q \lambda_i||^2$$

for a $p \times q$ matrix $\mathbf{V}_q$ with orthonormal columns and $\lambda_i \in \mathbb{R}^q$. This gives the best rank-$q$ reconstruction as measured by the Euclidean norm, and the columns of the optimal $\mathbf{V}_q$ are the first $q$ principal component vectors.

For fixed $\mathbf{V}_q$, the optimal $\mu$ and $\lambda_i$ (this is a regression problem) are $$\mu = \overline{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \lambda_i = \mathbf{V}_q^T(x_i - \overline{x})$$
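In code, this regression step amounts to centering and projecting. The snippet below is my own hedged sketch; `V_q` is just some matrix with orthonormal columns, and the names `mu` and `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))          # rows are the data points x_i
q = 2
# Any p x q matrix with orthonormal columns will do for V_q here
V_q, _ = np.linalg.qr(rng.normal(size=(5, q)))

mu = X.mean(axis=0)                    # optimal mu is the sample mean
lam = (X - mu) @ V_q                   # optimal lambda_i = V_q^T (x_i - mean), one row per observation
print(mu.shape, lam.shape)             # (5,), (200, 2)
```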

For ease of notation, assume that the $x_i$ have been centered in the following computations. We then have to minimize

$$\sum_{i=1}^n ||x_i - \mathbf{V}_q\mathbf{V}_q^T x_i||^2$$

over $\mathbf{V}_q$ with orthonormal columns. Note that $P = \mathbf{V}_q\mathbf{V}_q^T$ is the orthogonal projection onto the $q$-dimensional column space of $\mathbf{V}_q$. Hence the problem is equivalent to minimizing
$$\sum_{i=1}^n ||x_i - P x_i||^2 = \sum_{i=1}^n ||x_i||^2 - \sum_{i=1}^n||Px_i||^2$$ over rank-$q$ projections $P$. That is, we need to maximize $$\sum_{i=1}^n||Px_i||^2 = \sum_{i=1}^n x_i^TPx_i = \text{tr}(P \sum_{i=1}^n x_i x_i^T) = n \, \text{tr}(P \mathbf{S})$$ over rank-$q$ projections $P$, where $\mathbf{S}$ is the sample covariance matrix. Now $$\text{tr}(P\mathbf{S}) = \text{tr}(\mathbf{V}_q^T\mathbf{S}\mathbf{V}_q) = \sum_{i=1}^q u_i^T \mathbf{S} u_i$$ where $u_1, \ldots, u_q$ are the $q$ (orthonormal) columns of $\mathbf{V}_q$, and the arguments in @cardinal's answer show that the maximum is attained by taking the $u_i$'s to be eigenvectors of $\mathbf{S}$ corresponding to the $q$ largest eigenvalues.
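As a hedged numerical check (my addition, with illustrative names): for centered data, taking the columns of $\mathbf{V}_q$ to be eigenvectors of $\mathbf{S}$ for the $q$ largest eigenvalues gives a smaller reconstruction error than a random orthonormal $\mathbf{V}_q$, and that optimal error equals $n$ times the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 200, 5, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                    # centered, as assumed in the derivation
S = X.T @ X / n

def recon_error(Vq):
    P = Vq @ Vq.T                      # rank-q projection onto the column space of V_q
    return np.sum((X - X @ P) ** 2)

eigvals, Q = np.linalg.eigh(S)         # ascending eigenvalues
V_opt = Q[:, -q:]                      # eigenvectors for the q largest eigenvalues
V_rand, _ = np.linalg.qr(rng.normal(size=(p, q)))

print(recon_error(V_opt) <= recon_error(V_rand))               # optimal subspace beats a random one
print(np.isclose(recon_error(V_opt), n * eigvals[:-q].sum()))  # error = n * sum of discarded eigenvalues
```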

The reconstruction error suggests a number of useful generalizations, for instance sparse principal components or reconstructions by low-dimensional manifolds instead of hyperplanes. For details, see Section 14.5 in The Elements of Statistical Learning.

NRH
  • (+1) Good points. Some suggestions: It would be good to define $\lambda_i$, and it would be *really* nice to give a short proof of the result. Or, alternatively, it can be connected to the optimization problem involving Rayleigh quotients. I think that would make the answers to this question very complete! – cardinal May 03 '11 at 22:14
  • @cardinal, I believe I completed the missing steps in getting from the reconstruction formulation to the problem you solve. – NRH May 04 '11 at 06:33
  • Nice work. I believe the only remaining gap is in your last statement. It is not immediately apparent that optimizing the sum is the same as performing the sequence of optimizations in my answer. In fact, I don't think it follows directly, in general. But, it needn't be addressed here, either. – cardinal May 06 '11 at 23:01
  • @cardinal, it follows by induction. You provide the induction start, and in the induction step choose orthonormal vectors $w_1, \ldots, w_q$ that maximize the sum and arrange it so that $w_q$ is a unit vector orthogonal to $u_1, \ldots, u_{q-1}$. Then by your results $w_q^T \mathbf{S} w_q \leq u_q^T \mathbf{S} u_q$ and by the induction assumption $\sum_{i=1}^{q-1} w_i^T \mathbf{S} w_i \leq \sum_{i=1}^{q-1}u_i^T \mathbf{S} u_i$. Of course, the basis is not a unique basis for the $q$-dimensional space. You can also generalize the "convex combination argument" you use to give a direct proof. – NRH May 07 '11 at 07:10
  • @NRH, Sorry I've only had a chance to glance at this haphazardly. The statement "*...and arrange it so that $w_q$ is a unit vector orthogonal to $u_1,\ldots,u_{q−1}$...*" seems to me to be the same gap I was referring to. In the induction step, when going from $n=1$ to $n=2$, you are forcing a nesting of the linear subspaces whereas the problem itself doesn't directly make that requirement. (Of course, it still turns out to be the case.) But, it's quite likely I'm just missing a subtle point you've (already) made. – cardinal May 07 '11 at 13:01
  • @cardinal, I am not forcing a nesting, merely using a dimension consideration. If we have a $q$-dimensional subspace you can always choose $w_q$ in that space such that it is orthogonal to a $(q-1)$-dimensional subspace. Then you fill up the $w$-basis in any way you like. – NRH May 07 '11 at 16:21
  • @NRH, now I see where you were going, and, yes, I think that works. Nice. There are a couple of other ways to go about solving this that I know of, but they're fundamentally a bit different. – cardinal May 07 '11 at 23:42
  • @NRH - I have a [question](http://stats.stackexchange.com/questions/163776/svd-from-matrix-formulation-to-objective-function) about your answer, is there any chance you could please take a look? Thanks a lot! – Matteo Jul 29 '15 at 15:20
  • @NRH, I have a similar question to @cardinal's. You mentioned "...If we have a q-dimensional subspace you can always choose $w_q$ in that space such that it is orthogonal to a $(q−1)$-dimensional subspace. Then you fill up the $w$-basis in any way you like...". But I am wondering how you can prove that after your arrangement (which makes $w_q$ a unit vector orthogonal to $u_1, u_2, ..., u_{q-1}$), $w_1, w_2, ..., w_q$ can still maximize the sum? Thanks! – Lifu Huang May 13 '17 at 22:58
  • @cardinal, may I know what the "other ways" you mentioned are for proving the nesting of subspaces? – Lifu Huang May 14 '17 at 00:07

See NIPALS (wiki) for one algorithm that doesn't explicitly use a matrix decomposition. I suppose that's what you mean when you say you want to avoid matrix algebra, since you can't really avoid matrix algebra here :)
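For concreteness, here is a minimal sketch of the NIPALS iteration for the first component (my own illustration of the algorithm's usual description, not code from this answer); subsequent components are obtained by deflating `X`, i.e. subtracting the rank-one fit $t\,p^T$, and repeating.

```python
import numpy as np

def nipals_first_pc(X, n_iter=500, tol=1e-10):
    """First principal component of X via the NIPALS iteration."""
    X = X - X.mean(axis=0)              # work with centered data
    t = X[:, [0]]                       # scores, initialized with the first column
    for _ in range(n_iter):
        p = X.T @ t / (t.T @ t)         # loadings: regress the columns of X on the scores
        p /= np.linalg.norm(p)          # normalize the loading vector
        t_new = X @ p                   # updated scores
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p                         # scores t = X p and loading vector p

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
t, p = nipals_first_pc(X)
# The loading vector agrees (up to sign) with the top eigenvector of the sample covariance
top = np.linalg.eigh(np.cov(X.T, bias=True))[1][:, -1]
print(np.allclose(np.abs(p.ravel()), np.abs(top), atol=1e-6))
```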

JMS