
I have an image data matrix $X \in \Re^{N \times p}$, where $N=50000$ is the number of image examples and $p=3072$ is the number of pixel values per image: $p = 3072 = 32 \times 32 \times 3$, because each image is a 3-channel $32 \times 32$ image. Furthermore, each of the 50000 images belongs to 1 of 10 possible classes. That is, there are 5000 images of class 'car', 5000 images of class 'bird', etc., and there are 10 classes total. This is part of the CIFAR-10 dataset.

The ultimate goal here is to perform classification on this data set. To this end, the professor suggested trying PCA on it and then feeding those features into a classifier. As my classifier, I am using a fully connected neural network with one hidden layer and a softmax output.

My problem is that while I believe I have computed PCA correctly, I suspect I may have mis-applied it.

To compute the PCA of my data, here is what I did:

First, I compute the mean image $\mu \in \Re^{1 \times p}$. Let $x_n$ be the $n$th row of $X$. Then,

$$ \mu = \frac{1}{N} \sum_{n=1}^{N} x_n $$

Compute the covariance matrix $C \in \Re^{p \times p}$ of my image data:

$$ C = \frac{1}{N-1}(X - \mu)^{T}(X - \mu) $$

Perform an eigenvector decomposition of $C$, yielding $U$, $S$, and $V$, where the matrix $U$ encodes the principal directions (eigenvectors) as columns. (Also, assume the eigenvalues are already sorted in decreasing order.) Thus:

$$ [U, S, V] = \text{eig}(C) $$

Finally, perform PCA: i.e., compute a new data matrix $P \in \Re^{N \times k}$, where $k$ is the number of principal components we wish to keep. Let $U_k \in \Re^{p \times k}$ be the matrix consisting of only the first $k$ columns of $U$. Thus:

$$ P = XU_k $$
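For concreteness, here is a minimal NumPy sketch of the steps above. The random `X` is just a small placeholder for the real $50000 \times 3072$ matrix, and I project the centered data, which differs from the $XU_k$ written above only by a constant offset per component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3072))      # placeholder for the (50000, 3072) image matrix

mu = X.mean(axis=0, keepdims=True)    # (1, p) mean image
Xc = X - mu                           # centered data
C = Xc.T @ Xc / (X.shape[0] - 1)      # (p, p) covariance matrix

# eigh handles the symmetric C; it returns eigenvalues in ascending order,
# so re-sort the eigenpairs into decreasing order of eigenvalue.
evals, U = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

k = 100                               # number of principal components to keep
U_k = U[:, :k]                        # (p, k) top-k principal directions
P = Xc @ U_k                          # (N, k) matrix of principal-component scores
```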

The Question:

I think my method of performing PCA on this data is mis-applied, because the way I have done it, I basically end up decorrelating my pixels from each other. (Assume I had set $k=p$). That is, the resultant rows of $P$ look more or less like noise. That being the case, my questions are as follows:

  • Have I really de-correlated the pixels? That is, have I in fact removed any coupling between pixels that a prospective classifier might have hoped to use?
  • If the answer to the above is yes, then why would we ever do PCA this way?
  • Finally, and related to the last point, how should we do dimensionality reduction via PCA on images, if the method I have used is in fact wrong?

EDIT:

After further study and plenty of feedback, I have refined my question to: if one were to use PCA as a pre-processing step for image classification, is it better to:

  • Perform classification on the $k$ principal components of the images? (Matrix $X_{new} = X U_k$ in the above, so now each image is of length $k$ instead of the original $p$.)
  • OR is it better to perform classification on the images reconstructed from the $k$ eigenvectors (which will then be $X_{new} = X U_k U_k^T$, so even though each image is STILL of the original length $p$, it was in fact reconstructed from $k$ eigenvectors)? A sketch of both variants follows below.
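For clarity, here is a compact sketch of the two variants, with random data standing in for the real images. The SVD of the centered data gives the same principal directions as the eigendecomposition of $C$; I add the mean back after reconstruction, whereas the $X U_k U_k^T$ form in the bullet omits it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3072))              # placeholder for the real image matrix
mu = X.mean(axis=0, keepdims=True)

# Columns of U are the principal directions, already sorted by singular value.
U, _, _ = np.linalg.svd((X - mu).T, full_matrices=False)
k = 20
U_k = U[:, :k]                                 # (p, k)

# Variant 1: classify the k principal components themselves (N x k input).
X_pcs = (X - mu) @ U_k

# Variant 2: classify the reconstructions from those k components (N x p input).
X_rec = X_pcs @ U_k.T + mu
```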

Empirically, I have found that validation accuracy without PCA > validation accuracy with PCA reconstruction > validation accuracy with PCA PCs.

The plots below show this, in the same order: validation accuracies of 0.5 > 0.41 > 0.31.

Training on the raw pixel images of length $p$:

[plot: training/validation accuracy curves]

Training on images of length $p$ but reconstructed from $k=20$ eigenvectors:

[plot: training/validation accuracy curves]

And finally, training on the $k=20$ principal components themselves:

[plot: training/validation accuracy curves]

All this has been very illuminating. As I have found out, PCA makes no guarantee that the principal components make the demarcation between different classes any easier. This is because the principal axes are computed merely to maximize the energy of the projection across all images, agnostic to image class. In contrast, actual images, whether faithfully reconstructed or not, still retain some of the spatial differences that can (or should) go towards making classification possible.

  • I am not sure what software you are using, but in Matlab the results of `eig()` are not necessarily sorted. Your notation of `[U,S,V]` would normally be used for `svd()`, which *does* return sorted results. (In either case, you should have V=U, due to symmetric input; the full SVD could be used on X, but not on X'*X). I ask because if the output is not sorted, any patterns could be hidden. – GeoMatt22 Sep 08 '16 at 02:13
  • @GeoMatt22 Thanks for your comment, yes, you may assume that the eigen-vector/values are already sorted in descending order. (I am making that assumption here). You say "the full SVD could be used on X, but not on X'*X" - what do you mean here? You mean that eigenvectors computed via the SVD on $X^{T}X$ are wrong somehow? – Spacey Sep 08 '16 at 02:21
  • I just meant that all 3 outputs of `svd()` would be needed for $X$, as it is not a symmetric matrix. Note that if we assume for simplicity that $X$ is centered (0-mean), the covariance is $C=X^TX$. Then given `[U,S,V]=svd(X)`, i.e. $X=USV^T$, the covariance decomposition will be $C=V(S^2/N)V^T$. For grayscale images, commonly the $k$ roughly corresponds to "frequency" (more oscillatory basis vectors), and the energy decays with $k$. Not sure if this holds for you? The RGB stack could change things also. – GeoMatt22 Sep 08 '16 at 02:30
  • There is a [deep-learning] tag here. Why? Are you planning to use deep neural networks for classification? Probably convolutional ones? This is important to know. – amoeba Sep 12 '16 at 13:53
  • @amoeba Yes, I used a fully connected NN as the classifier. The input was either the original NxP data set, or the NxK data set. All things being the same, the former worked out great, the latter did not learn anything at all. – Spacey Sep 12 '16 at 17:29
  • So there were no convolutional layers in your NN? – amoeba Sep 12 '16 at 18:01
  • @amoeba No, not at all. This was a simple vanilla 1 hidden layer NN. The output used a softmax. So, input->hidden->output->softmax. There are 49,000 training samples, and 10 classes. – Spacey Sep 12 '16 at 18:06
  • And what was your $k$? In your question you wrote "assume I set $k=p$". Are you saying that even for $k=p$ your NN did not learn anything after PCA (but did work without PCA)? – amoeba Sep 12 '16 at 18:26
  • @amoeba Sorry, let me clarify: In the original experiment, I train on all the images in the original space. That is, X = NxP matrix, N is the number of images, and P is the number of pixels per image. In the second experiment, I train on X_new, which is an NxK matrix, where each row holds the K *principal components*, i.e., the k *basis coefficients*, of each image. (X_new = X * U[0:k]). In *this* case, no learning occurs. – Spacey Sep 12 '16 at 21:02
  • Yes, I understand. But what is the value of $k$ that you used? – amoeba Sep 12 '16 at 21:04
  • @amoeba So in my experiment, I used K=10, 50, and even 100. FYI, on the CIFAR-10 data set, most of the energy was contained in the top, say, 50 eigenvectors. (I even went all the way to k=3072=P, and still no learning was done). (Please bear in mind that this was done on the basis coefficients themselves. If I on the other hand *reconstructed* the images from K=50 eigenvectors, that is, X_new = X * U[0:k=50] * U[0:k=50].T, then I get very reasonable results). – Spacey Sep 12 '16 at 22:04
  • What you wrote in the last comment is very strange and I suspect some programming mistake. Your `X*U` and `X_new` contain *exactly the same information* up to a linear transform, which can be taken care of in the first layer mapping. You should be getting identical performance. (In any case, I think you should update your question with these important details.) – amoeba Sep 12 '16 at 22:15
  • @amoeba Yes, those seem to be on the right track. I think my question was for images specifically - in effect - my hypothesis is that when it comes to *classification* of *images* - not only *can* using the *principal components* be ... how shall we say... useless(?), it seems like by definition, they will probably always be useless. (Because, the top eigen-vectors do not "care" about what class the image belonged to - they just try to maximize energy/variance for all images). This is I think what my intuition was, and at this point I am trying to validate my intuition on this particular case – Spacey Sep 12 '16 at 22:17
  • No. If you get good performance with the reconstruction with 50 PCs then you *have to* get the same performance with the 50 PCs themselves. – amoeba Sep 12 '16 at 22:18
  • @amoeba "No. If you get good performance with the reconstruction with 50 PCs then you have to get the same performance with the 50 PCs themselves." Hmm, I do not believe this is true. In fact that is what sired this question. :) – Spacey Sep 12 '16 at 22:20
  • That is why I said that I suspect some programming mistake. – amoeba Sep 12 '16 at 22:23
  • How do you measure performance - do you have a test set? How do you process your test images? Are you sure you do the same transformation, i.e. project them on the same 50 PCs (computed on the training set)? – amoeba Sep 12 '16 at 22:24
  • @amoeba Yes, I have an untouched test set that is run at the end on the learned NN. The pre-processing of the images is a simple de-meaning of each feature (pixel). I made sure that I do the same pre-processing and have the same statistics before and after the PCs. Well, let me do this: I'll go back to the code, re-run, and then re-post my findings. I'll ping you when I do. – Spacey Sep 12 '16 at 22:27
  • @amoeba Hi amoeba, please look under "edit" in my post. I double-checked my code and I did in fact find an error (serves me right for coding late at night), so thanks for being my rubber ducky! :) That said, it does seem that using the *PCs*, versus using images (whether reconstructed or the originals), is inferior. This is further confirmed by the links you provided about the nature of PCA. Thanks a bunch - I have learned a lot! :) – Spacey Sep 13 '16 at 23:42
  • I insist that this is *not* confirmed by the links I have provided. The links say that 50 PCs do not necessarily separate the classes as well as all 3072 dimensions; that is correct. But 50 PCs and 3072 dimensions reconstructed from the 50 PCs separate the classes *to exactly the same extent*. It is a simple mathematical fact. You could take NN trained on reconstructions and manually transform it into NN working on PCs and it will work identically well. I cannot explain why your network performs better with the reconstructions. – amoeba Sep 14 '16 at 00:09
  • @amoeba I think we are actually in agreement: We agree that PCs used by themselves do not necessarily lend themselves to better class separability, and in fact might hurt it. And in regards to the reconstructions, yes I certainly do agree - and also verified this with my code - that when k is large enough ( ~ >= 100 in my case), I can get validation accuracy that is more or less the same as if I used the original images. In the edits above, I used k=20, which is why we saw the reconstruction validation accuracy be less than the raw one. But that is simply because k was small. – Spacey Sep 14 '16 at 02:19
  • Wait a second. I am saying that 20 PCs have exactly the same class separability as the reconstruction with 20 PCs. Do you agree with that? The reason you get different results with NN must have something to do with how it is trained, perhaps gets stuck in a local minimum or something, but in principle the separability must be exactly the same. Of course it's lower than with all 3072 dimensions, this is clear. – amoeba Sep 14 '16 at 10:32
  • @amoeba "Wait a second. I am saying that 20 PCs have exactly the same class separability as the reconstruction with 20 PCs. Do you agree with that? " - On this case, then no - we disagree. This goes back to how PCs in and of themselves need not make class separability better - and indeed might hurt it. I do not see why/how k PCs can or should give you performance that is nearly as comparable to reconstructions with k=20 eigen-vectors. What is your reasoning? – Spacey Sep 15 '16 at 01:26
  • @Tarantula You are confused about what reconstruction is and how it works. I suggest you read this http://stats.stackexchange.com/questions/229092 and maybe also this http://stats.stackexchange.com/questions/2691 (my answer). The 20 PCs might have lousy separability, this is because you throw out the remaining 3052 PCs and that is a lot of potentially important information. But the reconstruction based on 20 PCs contains exactly the same information as 20 PCs themselves, nothing is thrown out **and nothing is added**. It is simply a rotation, a linear operation. – amoeba Sep 15 '16 at 09:13
  • @amoeba Ok, let me take a look. – Spacey Sep 15 '16 at 16:44
  • @amoeba (1/2) I have read your posts - they are very nice. So, I am very certain that I understand how reconstruction is done, and what it is. Frankly, we are in agreement about all of what is said as far as the PCs, the reconstructions, etc. Let me recap: My claim is that if I perform classification on k PCs, I need not expect to get the same classification results (in fact they can be worse) as if I did classification on reconstructions (yes, I know it's just a rotation, just a linear operation) from k eigenvectors. This is my claim. – Spacey Sep 15 '16 at 17:04
  • @amoeba (2/2) So let us be specific: Are you claiming that if you use K PCs for classification, you expect to get the same classification results as what you would get via reconstructions from K-eigen-vectors? (For all Ks)? Let us localize the disagreement. – Spacey Sep 15 '16 at 17:06
  • Yes, this is what I claim! The performance should be identical for any "reasonable" classifier: the information contained in the $k$ PCs and in the reconstruction obtained with $k$ PCs is exactly the same. Imagine drawing some points of several classes on a piece of paper, and then rotating the piece of paper. I guess you will agree that any reasonable classifier should achieve the same performance on rotated, and on un-rotated data. They are precisely the same, up to rotation. (And yes, I know that you wrote that you get different results with your NN. I don't know why, I suspect an error.) – amoeba Sep 15 '16 at 19:19
  • (Actually there are some existing classifiers that are not invariant to data rotation; I think naive Bayes would be one example. But fully connected NN should be invariant.) – amoeba Sep 15 '16 at 19:21
  • @amoeba (1/4) So I think there are a couple of things going on. First, some numbers to bear in mind: In my experiments, the number of pixels per image = P = 3072 = 32x32x3, and the NN is: 3072 (input layer) -> 50 (hidden) -> 10 (output layer), in both the k=100 reconstruction experiment and the raw-pixel experiment. (Since the input size doesn't change, P = 3072). For the experiment with k = 100 PCs, *the net is the same, except that the input layer is k = 100 now*, obviously. Ok, with all that said, here are my thoughts: – Spacey Sep 16 '16 at 04:12
  • @amoeba (2/4) I) Even if we run with your hypothesis, we can see that while W2 in both experiments is equal to a 50 x 10 matrix, the W1's of the raw and reconstructed experiments are 3072 x 50, while the W1 of the PC experiment is only 100 x 50! (Since the input dimensionality changes). So automatically, (at least in this setup), the capacity of the net is implicitly changed by virtue of the input size changing. – Spacey Sep 16 '16 at 04:13
  • @amoeba (3/4) You are saying it will/should/could learn the rotation matrix $U^T \in R^{100 \ x \ 3072}$, but $W_1$ is only $\in R^{100 \ x \ 50}$. This is my explanation for the phenomenon observed. It therefore seems to me that I would need to add another layer, so that both experiments start on the same footing. – Spacey Sep 16 '16 at 04:13
  • @amoeba (4/4) Finally, in regards to the last point of whether or not a sufficiently large net can learn the rotation-transformation of the PCs that will give the same performance as if we had already rotated the data for it (via the reconstruction) - on this point, I am not sold yet. *Can* a net learn *something* that would give the same performance? Yes, I suppose it could and would, as it is a universal approximator. However, my point of hesitation lies in whether or not its generalization ability is hampered by the space the initial data lies in. So for point (II), I am not sure. – Spacey Sep 16 '16 at 04:19
  • Let $W_\mathrm{pc}$ of $100\times 50$ size be the first NN mapping in the PC-experiment. Let $W_\mathrm{rec}$ of $3072\times 50$ size be the first NN mapping in the reconstruction-experiment. We know that inputs are related as $X_\mathrm{rec}=X_\mathrm{pc} U^\top$, where $U$ is of $3072\times 100$. So if we let $W_\mathrm{pc} = U^\top W_\mathrm{rec}$ then the mappings (and the NN performance) would be exactly the identical. So: why would your NN fail to learn this $W_\mathrm{pc}$? I do not understand it and I do not understand your explanation in comment (3/4) above either. – amoeba Sep 16 '16 at 09:13
  • @amoeba (1/2) I think there is a deep disconnect somewhere here between our conversations. You say: "So if we let $W_\mathrm{pc} = U^\top W_\mathrm{rec}$ then ..." I don't know what that means. What do you mean when you say "let" $W_{pc}$ be equal to anything? $W_{pc}$ is just learned. I mentioned this before: The network in both experiments *does not change*, in the sense that I do not add/remove layers. I always have 1 hidden layer. – Spacey Sep 16 '16 at 19:05
  • @amoeba (2/2) That layer $W_1$ is either 3072x50 or 100x50. The output mapping is then $W_2$, and that is always 50x10. So that's it. I'm not sure what you mean. I literally just replace one input with another; that is, the input is either Nx3072 or Nx100. That's it! If you do that, the performance will degrade in the Nx100 case. Why do you suppose that is? – Spacey Sep 16 '16 at 19:06
  • Yes, I understand that the network stays the same. What I mean: take your network that works on reconstructions; take its $W_1$; transform it as I wrote above into $U^\top W_1$; then plug these weights manually into your network that works on PCs. Can you imagine doing it? This resulting network (working on PCs) should have identical performance to the one working on reconstructions. But you are telling me that when you train your network it does not reach this performance. I do not know why. Anyway, we are going in circles. If you want and have time, we can try to discuss it in chat. – amoeba Sep 16 '16 at 19:45
  • (My best guess, by the way, is that you do not transform your test set correctly. Are you maybe doing one PCA on the training set and a separate PCA on the test set? This would be wrong: you need to do PCA on the training set and then apply the resulting $U$ to the test set. Apologies for this suspicion if you are doing it correctly already :-) – amoeba Sep 16 '16 at 19:51

3 Answers


I agree with all of @amoeba's comments, but would add a couple of things.

The reason PCA is used on images is that it works as 'feature selection': objects span multiple pixels, so correlated changes over multiple pixels are indicative of objects. Throwing away uncorrelated pixel changes gets rid of 'noise' that could lead to bad generalisation.

In general, NNs trained with gradient descent do better with sphered inputs (i.e. uncorrelated, normalised inputs). This is because the error surface is then more symmetrical, so a single learning rate works better (you need a small learning rate in highly curved directions and a large one in shallow directions).
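As a minimal sketch of what sphering (PCA whitening) might look like, assuming `X` stands in for the N x p training inputs and `eps` guards against dividing by near-zero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))               # stand-in for the N x p training inputs

mu = X.mean(axis=0, keepdims=True)
Xc = X - mu                                    # center
C = Xc.T @ Xc / (Xc.shape[0] - 1)              # covariance
evals, U = np.linalg.eigh(C)                   # eigendecomposition of the symmetric C

eps = 1e-5                                     # guards against near-zero eigenvalues
X_white = (Xc @ U) / np.sqrt(evals + eps)      # decorrelated AND unit-variance features
```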

Did you normalise your inputs (good) as well or just decorrelate?

Did you use L2 regularisation? This has a similar effect to PCA regularisation: for correlated inputs it favours small, similar weights (i.e. averaging out the noise and penalising large weights so that single pixels cannot have a disproportionate effect on classification; see e.g. The Elements of Statistical Learning). So perhaps you saw no benefit from PCA because the L2 regularisation was already effective. Or perhaps the NN was too small to overfit.

Lastly, did you re-optimise the hyperparameters? I would expect you would need a different learning rate and other settings after changing to the first k principal components.


By definition, PCA removes correlation among variables. It is used before classification because not all classifiers can deal well with high-dimensional data (but neural nets can) and not all classifiers can deal well with correlated variables (but neural nets can). Especially with images, you typically do not want to remove correlation, and you also want multiple layers so that neighboring pixels can be aggregated into image features. Your experimental results reflect the poor fit of this technique for this particular task.


If your images (the images that you vectorized so that each image is now a single row with M columns, where M = 32 x 32 x 3 is the total number of pixel values) contain little to no correlation, then the number of principal components required to explain most of the variance (>95%, for example) in your data increases. To determine how "feasible" PCA is, checking the explained variance is a good idea. In your case, the data matrix has size NxM with N > M, so the maximum number of components is M = 3072. If the number of PCs required is close to 3072, then using PCA makes little sense. You can calculate the explained variance by:

$$ \text{explained variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{M} \lambda_i}, $$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M$ are the eigenvalues and $k$ is the number of PCs you use.

I would choose the number of principal components that explains at least 90% of the variance. Another method for choosing the number of PCs is to plot the number of PCs against the cumulative explained variance; in that plot you can choose the number beyond which the explained variance no longer increases significantly.
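As a small sketch of this criterion (random data as a stand-in; `evals` are the covariance eigenvalues, sorted in decreasing order):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))               # stand-in for the image data
Xc = X - X.mean(axis=0, keepdims=True)

# Covariance eigenvalues, sorted in decreasing order.
evals = np.linalg.eigvalsh(Xc.T @ Xc / (Xc.shape[0] - 1))[::-1]

explained = np.cumsum(evals) / evals.sum()     # cumulative explained variance
k = int(np.searchsorted(explained, 0.90)) + 1  # smallest k reaching the 90% threshold
print(f"{k} components explain {explained[k - 1]:.1%} of the variance")
```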

Thus, I think your problem might be choosing a good number of PCs.

Another problem might be the projection of the test samples. When you split your samples into a set for training the NN and a set for testing it, you need to project the test data set using the eigenvectors obtained from the training data set.
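A minimal sketch of this, assuming PCA is fit on the training set only and the same (training) mean and eigenvectors are then applied to the test set; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 100))          # stand-in for the training images
X_test = rng.normal(size=(200, 100))           # stand-in for the test images

mu_train = X_train.mean(axis=0, keepdims=True)
U, _, _ = np.linalg.svd((X_train - mu_train).T, full_matrices=False)
U_k = U[:, :20]                                # eigenvectors from the TRAINING set only

Z_train = (X_train - mu_train) @ U_k
Z_test = (X_test - mu_train) @ U_k             # same training mean and U_k applied to the test set
```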
