How to apply PCA on 3 dimensional image data in python

Question

I have a dataset containing colored images of cancerous and non-cancerous tissue cells. The image dimensions are 50x50x3, and I have a total of 280,000 images. I want to apply PCA to it in order to reduce the dimensions.

What steps should I take to apply PCA to this dataset? I currently have the image paths and the target variables (cancerous/non-cancerous) stored in a dataframe.

The way I thought of approaching it would be to read each image using io.imread() from skimage, then flatten that image so that it changes from a shape of (50,50,3) to (7500,1). I would then append it to a numpy array, so that my final numpy array would be 280,000 x 7500, where 280,000 is the total number of images I have.

After that I proceed to apply PCA.

My questions are:

Am I going about applying PCA in the correct way?
Does flattening the 3-dimensional color space and placing it in a single vector make sense?

If the above method is not optimal, then what are the steps that I need to take to apply PCA without changing my image to greyscale?

My aim is to apply a Support Vector Machine to classify these images, after reducing the number of dimensions that they have.

score 6 · Accepted Answer · answered Feb 18 '20 at 09:06

In general, your approach may work, and it might even give you something that works somewhat well. However, I would strongly advise against it, or only use something like this as a first step to just get a feel for the problem.
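As a first-pass baseline only, the flatten-then-PCA approach from the question could be sketched roughly as follows. This is a minimal sketch, not a recommended pipeline: the helper names `flatten_images` and `fit_pca_in_batches`, the batch size, and the choice of 50 components are assumptions made for illustration, and random pixels stand in for the real images. It uses scikit-learn's `IncrementalPCA` because the full 280,000 x 7500 matrix holds 2.1 billion values (about 2.1 GB even as uint8, and several times that as floats), so fitting in batches may be more practical than building one large array.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def flatten_images(images):
    """Reshape (n, 50, 50, 3) uint8 images into (n, 7500) float32 rows in [0, 1]."""
    images = np.asarray(images)
    return images.reshape(len(images), -1).astype(np.float32) / 255.0


def fit_pca_in_batches(batches, n_components=50):
    """Fit PCA one batch at a time so the full matrix never sits in memory."""
    pca = IncrementalPCA(n_components=n_components)
    for batch in batches:
        # Each batch must contain at least n_components images.
        pca.partial_fit(flatten_images(batch))
    return pca


# Stand-in data (assumption): random pixels instead of images read from disk.
rng = np.random.default_rng(0)
fake_batches = [
    rng.integers(0, 256, size=(200, 50, 50, 3), dtype=np.uint8) for _ in range(3)
]

pca = fit_pca_in_batches(fake_batches, n_components=50)
reduced = pca.transform(flatten_images(fake_batches[0]))
print(reduced.shape)  # (200, 50)
```

With real data, each batch would be built by reading a slice of the image paths from the dataframe, and `reduced` would be the feature matrix passed on to the SVM.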

Think about it this way: If you just shift one of the images one pixel to the left, how much would the vector representing that image change? How well could a PCA identify that these two images are in fact the same image, except for this 1-pixel shift?

It is better to use an approach that is somewhat shift-invariant (and, if possible, rotation-invariant). Here are some ideas:

You could use PCA to reduce the color space. Often the full 3D RGB space is not required. Instead of running the PCA on all pixels of an image at once, collect all pixels as individual 3D vectors and run the PCA on those. The resulting factors tell you which colors are actually representative of your images. However, at best this reduces the dataset to 1/3 of its original size (one component kept out of three). In that case, you are effectively reducing to a single grayscale-like channel, but one that retains as much information as possible.
Use a method similar to the one used by convolutional networks. Split each image into small (overlapping) patches of $K\times K$ pixels. Run the PCA on those patches. The resulting factors then represent typical features found in your images and are much more informative than those from running a PCA on the complete image. Experiment with the size of the patches and the amount of overlap to see what gives you good results. If you know, for example, how a cancerous region looks, you could look at the resulting factors to see if any of them represent something you might recognize. Or you can drop patches that you recognize to be meaningless (e.g. patches which contain mostly uniform areas).
You can test whether the patches work better if you run them on independent colors (separate patches for each color, with a different component structure), or if you combine the colors first.
Mix, combine and stack these methods. If you have found a good size and overlap of the patches, but you have not reduced your data enough, then reduce the data using those patches. Because these patches represent areas of your images, you can still interpret them as 2D (or 3D if you have separate patches for each color) data. Repeat the process and create patches of patches. At this point, you are essentially building some form of convolutional neural network.
Although it might seem counterintuitive, in many cases it is helpful to first blow up your dataset (i.e. generate artificial data based on the data you have). The images you have may be very clean, all from the same angle, centered around the possible cancerous region etc. This may or may not represent the actual situation where you later want to use your classifier. If it doesn't, then you will not train the SVM (or the PCA) well for the task at hand. Generate additional images by adding noise, shifting them, rotating them a little etc. Then run the PCA and the SVM on the enlarged dataset. This can greatly improve the final classifier.
If you want to go one step further, you should look at more powerful techniques of dimensionality reduction. A PCA always computes a linear reduction. A more flexible method is auto-encoder networks, which can be seen as a non-linear generalization of PCA. There are also convolutional versions of auto-encoder networks, which give you the shift-invariance that you usually need. Also have a look at denoising auto-encoders, because these perform much better than naive auto-encoders in many cases. You can directly feed the (encoded) output from an auto-encoder to an SVM for classification. Or you can use the auto-encoder in combination with a classical neural network, which essentially is a method for building deep neural networks.
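The first two ideas in the list above (PCA on the color space, and PCA on small patches) can be sketched as follows. This is only an illustration under stated assumptions: the patch size `K = 8`, the 100 randomly sampled patches per image, the 16 components, and the random stand-in images are all arbitrary choices, not values taken from the problem. Patches are sampled with scikit-learn's `extract_patches_2d`, which draws overlapping patches at random positions when `max_patches` is set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

# Stand-in data (assumption): 20 random 50x50 RGB images scaled to [0, 1].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(20, 50, 50, 3), dtype=np.uint8)
images = images.astype(np.float32) / 255.0

# Idea 1: PCA on the color space. Every pixel is one 3D sample.
pixels = images.reshape(-1, 3)                      # (20 * 50 * 50, 3)
color_pca = PCA(n_components=1).fit(pixels)
# Project each pixel onto the dominant color axis: one channel per image.
one_channel = color_pca.transform(pixels).reshape(20, 50, 50)

# Idea 2: PCA on small K x K patches sampled from every image.
K = 8                    # hypothetical patch size, to be tuned
patches_per_image = 100  # hypothetical sample count, to be tuned
patches = np.concatenate([
    extract_patches_2d(img, (K, K), max_patches=patches_per_image, random_state=0)
    for img in images
])                                                  # (20 * 100, K, K, 3)
patch_pca = PCA(n_components=16).fit(patches.reshape(len(patches), -1))

# Each principal component can be viewed as a K x K x 3 "filter".
filters = patch_pca.components_.reshape(16, K, K, 3)
print(one_channel.shape, patches.shape, filters.shape)
```

Plotting the entries of `filters` as small images is one way to inspect whether any component resembles a structure you recognize, as suggested above.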

Thank you so much for your comprehensive answer, you have given me a lot to think about. Is there a more efficient way to store the images rather than using a numpy array? — A Merii, Feb 18 '20 at 09:39

Haitao Du · Answer 2 · 2020-02-18T09:06:07.683

score 4

If your final goal is to use an SVM, the problem is the number of data points rather than the number of dimensions. See the following question.

Can support vector machine be used in large data?

In the real world, an SVM will not work very well once you have ~10K data points or more.

Your problem is a standard image classification problem, so using a convolutional neural network (CNN) may be better. There are many very mature algorithms and packages available for that.

Here is an example.

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

edited Feb 18 '20 at 09:06

answered Feb 18 '20 at 09:01

Haitao Du


Thanks for the reply, I am planning to train it on a randomly selected small batch of images. The reason I picked SVM is because it performs well when you have a large number of features. I wanted to experiment with the limitations of SVM and image classification, and wanted to see if a small number of samples would generalize well. – A Merii Feb 18 '20 at 09:07
@AMerii, if you are working with images, convolution definitely is very important, because although the images are high-dimensional, the pixels are not independent of each other. – Haitao Du Feb 18 '20 at 09:13
You are absolutely correct. I would assume that it would be possible to apply a CNN and then use an SVM as a classifier for the extracted features. Is there a way to apply convolution to them without using a CNN? I am trying to carry out this example as a comparison between traditional statistical machine learning and deep learning. – A Merii Feb 18 '20 at 09:38
