PCA is simply a procedure that can be applied to any continuous data. The procedure isn't rendered invalid if these conditions are violated, so they aren't assumptions in that sense. But they do suggest cases where PCA will or won't be useful.
High-dimensional data concentrated near hyperplanes
We could say that, if $D$-dimensional data are concentrated near a $d$-dimensional hyperplane (where $d < D$), then they can be well approximated using $d$ principal components. This ignores complications like outliers, which can throw PCA off even when the remaining points are concentrated near the hyperplane.
This follows from two facts: 1) PCA reduces the dimensionality by projecting the points into a $d$-dimensional linear subspace (i.e. onto a $d$-dimensional hyperplane), and 2) among all such subspaces, PCA finds the one that minimizes the squared reconstruction error. This is an alternative way of formulating the optimization problem, and is equivalent to maximizing the variance. The reconstruction error for a point $x_i$ is the distance between $x_i$ and its projection onto the hyperplane. Therefore, if the points are concentrated near a $d$-dimensional hyperplane, PCA can find that hyperplane, and the reconstruction error using $d$ components will be small.
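As a sketch of why the two formulations coincide (assuming the data are centered, and writing $V$ for a $D \times d$ matrix whose orthonormal columns span the subspace; this notation is mine, not from the slides): the reconstruction of $x_i$ is $V V^\top x_i$, and by the Pythagorean theorem

$$\sum_i \left\| x_i - V V^\top x_i \right\|^2 = \sum_i \| x_i \|^2 - \sum_i \left\| V^\top x_i \right\|^2.$$

The first sum on the right doesn't depend on $V$, so minimizing the reconstruction error is the same as maximizing $\sum_i \| V^\top x_i \|^2$, the variance retained by the projection.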
If the data are not concentrated near a $d$-dimensional hyperplane, then no such hyperplane can provide a low reconstruction error, and $d$ principal components cannot provide a good approximation.
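A quick numerical sketch of both cases, assuming numpy and scikit-learn are available (the dimensions, noise level, and seed are arbitrary choices of mine, not anything prescribed by PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, D, d = 1000, 10, 2

# Points concentrated near a d-dimensional hyperplane in D dimensions:
# a random rank-d linear structure plus a little isotropic noise.
A = rng.normal(size=(d, D))
near_plane = rng.normal(size=(n, d)) @ A + 0.01 * rng.normal(size=(n, D))

# Points not concentrated near any low-dimensional hyperplane.
isotropic = rng.normal(size=(n, D))

def mean_sq_reconstruction_error(X, n_components):
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(mean_sq_reconstruction_error(near_plane, d))  # small: on the order of the added noise
print(mean_sq_reconstruction_error(isotropic, d))   # large: most of the variance is discarded
```

The first error is tiny because a $d$-dimensional hyperplane passes close to nearly all of the points; the second is large because an isotropic cloud has no such hyperplane.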
High-dimensional data concentrated near ellipsoids
This condition doesn't really determine how well PCA will work in general. Here are a couple of counterexamples.
1) Consider a set of points drawn from some arbitrarily shaped distribution on a $d$-dimensional plane, then mapped linearly into a higher-dimensional space. PCA will be able to perfectly reconstruct these points using $d$ components, despite the data being shaped nothing like an ellipsoid (see the sketch at the end of this section).
2) A sphere is an ellipsoid. PCA won't work well if the data have a spherical distribution, since the variance is the same in every direction and no small set of components captures most of it. Yes, that's reading things literally, and perhaps the slides meant something like the following:
If $D$-dimensional data have an ellipsoidal distribution that's elongated along $d$ dimensions and close to flat along the others, then the data can be well approximated using $d$ principal components. This follows from the fact that such a distribution is concentrated near a $d$-dimensional hyperplane.
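Here's the sketch of counterexample 1 mentioned above, again assuming numpy and scikit-learn (the uniform square is just one arbitrary, non-ellipsoidal shape):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, D = 1000, 2, 10

# An arbitrarily shaped distribution on a d-dimensional plane
# (uniform on a square, which is nothing like an ellipsoid)...
Z = rng.uniform(-1, 1, size=(n, d))

# ...mapped linearly into a higher-dimensional space.
A = rng.normal(size=(d, D))
X = Z @ A

# d principal components reconstruct the points essentially exactly,
# because X lies in a d-dimensional subspace of R^D.
pca = PCA(n_components=d).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))
print(np.max(np.abs(X - X_hat)))  # ~1e-15: perfect up to floating point
```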