I have a dataset of color images of cancerous and non-cancerous tissue cells. Each image is 50x50x3, and I have 280,000 images in total. I want to apply PCA to reduce the number of dimensions.
What steps should I take to apply PCA to this dataset? I currently have the image paths and the target variable (cancerous/non-cancerous) stored in a dataframe.
My plan is to read each image with imread() from skimage, then flatten it from a shape of (50,50,3) to a 7500-element vector, and stack these vectors in a NumPy array so that the final array is 280,000 x 7500, where 280,000 is the total number of images I have.
After that, I would apply PCA to this array.
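To make the flattening step concrete, here is a minimal sketch of what I have in mind. It uses randomly generated arrays as a stand-in for the real images (in practice each one would come from skimage.io.imread on a path from the dataframe), and the function name `flatten_images` and the scaling to [0, 1] are my own assumptions, not something I have settled on.

```python
import numpy as np

def flatten_images(images):
    """Stack (50, 50, 3) images into one (n_images, 7500) float32 matrix."""
    images = list(images)
    # Preallocate the matrix; appending to a NumPy array inside a loop
    # copies the whole array on every call.
    X = np.empty((len(images), 50 * 50 * 3), dtype=np.float32)
    for i, img in enumerate(images):
        # reshape(-1) flattens in row-major order, so the three color
        # channels of each pixel end up next to each other in the vector.
        X[i] = np.asarray(img, dtype=np.float32).reshape(-1) / 255.0
    return X

# Stand-in data (assumption): 20 random uint8 images instead of real files.
rng = np.random.default_rng(0)
fake_images = [
    rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8) for _ in range(20)
]

X = flatten_images(fake_images)
print(X.shape)
```

At full size this matrix would hold 280,000 x 7500 float32 values, roughly 8.4 GB, so memory may also be a concern with this approach.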
My questions are:
- Am I going about applying PCA in the correct way?
- Does flattening the 3-dimensional color space and placing it in a single vector make sense?
If the above method is not optimal, then what are the steps that I need to take to apply PCA without changing my image to greyscale?
My aim is to apply a Support Vector Machine to classify these images, after reducing the number of dimensions that they have.
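For reference, this is roughly the PCA-then-SVM pipeline I am picturing, sketched with scikit-learn on random stand-in data. The number of components (50), the default RBF kernel, and the tiny sample size are placeholder assumptions for illustration only, not values I have tuned.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): 200 flattened "images" with random labels.
rng = np.random.default_rng(0)
X = rng.random((200, 7500), dtype=np.float32)
y = rng.integers(0, 2, size=200)

# PCA reduces each 7500-element vector to 50 components,
# then the SVM is trained on the reduced features.
model = make_pipeline(
    PCA(n_components=50, random_state=0),
    SVC(),
)
model.fit(X, y)

# Apply only the PCA step to inspect the reduced representation.
X_reduced = model[:-1].transform(X)
print(X_reduced.shape)
```

Wrapping both steps in one pipeline means the PCA projection would be fitted on the training data only and then reused unchanged on any test images.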